
Information Storage and Management (ISM) v4
Introduction
By the end of the course you should be able to:

• List the modern technologies driving digital transformation


• Explain a modern data center environment
• Explain FC SAN components, architecture, and topologies, plus virtualization in an
FC SAN environment
• Explain IP and FCoE SAN concepts
• List intelligent storage system components and types
• Explain block-based, file-based, and object-based storage systems
• Explain software defined storage and networking
• Explain information availability and various business continuity solutions
• Explain data protection solutions including replication, backup and recovery,
deduplication, archiving, and migration
• Explain storage security domains, threats, and various security controls in storage
infrastructure
• Explain storage infrastructure operations management

Growth of the Digital Universe
We live in a digital universe – a world created and defined by software. A massive amount of digital
data is continuously generated, collected, stored, and analyzed through software in the digital
universe. An IDC report predicts that worldwide data creation will grow to an enormous 163 zettabytes
(ZB) by 2025.
The data in the digital universe comes from diverse sources, including both individuals and
organizations. Individuals constantly generate and consume information through numerous
activities, such as web searches, emails, uploading and downloading content and sharing media
files. In organizations, the volume and importance of information for business operations continue
to grow at astounding rates. Technologies driving digital transformation, including the Internet of
Things (IoT), have significantly contributed to the growth of the digital universe.
In the past, individuals created most of the data in the world. Now IDC predicts organizations will
create 60 percent of the world’s data through applications relying on machine learning, automation,
machine-to-machine technologies, and embedded devices.

Why Information Storage and Management


Organizations have become increasingly information-dependent in the 21st century, and
information must be available whenever and wherever it is required. It is critical for users and
applications to have continuous, fast, reliable, and secure access to information for business
operations to run as required. Some examples of such organizations and processes include
banking and financial institutions, online retailers, airline reservations, social networks, stock
trading, scientific research, and healthcare.

Data is the lifeblood of a rapidly growing digital existence, opening up new opportunities
for businesses to gain a competitive edge. For example, an online retailer may need to
identify the preferred product types and brands of customers by analyzing their search,
browsing, and purchase patterns. This information helps the retailer to maintain a
sufficient inventory of popular products, and also advertise relevant products to the
existing and potential customers. It is essential for organizations to store, protect, process,
and manage information in an efficient and cost-effective manner. Legal, regulatory, and
contractual obligations regarding the availability, retention, and protection of data further
add to the challenges of storing and managing information.
To meet all these requirements and more, organizations are increasingly undertaking
digital transformation initiatives to implement intelligent storage solutions. These solutions
enable efficient and optimized storage and management of information. They also enable
extraction of value from information to derive new business opportunities, gain a
competitive advantage, and create sources of revenue.

Digital Data
Definition: Digital Data
A collection of facts that is transmitted and stored in electronic form, and processed
through software.

A generic definition of data is that it is a collection of facts, typically collected for analysis
or reference. Data can exist in various forms such as facts stored in a person's mind,
photographs and drawings, a bank ledger, and tabulated results of a scientific survey. Digital
data is a collection of facts that is transmitted and stored in electronic form, and processed
through software. Devices such as desktops, laptops, tablets, mobile phones, and
electronic sensors generate digital data.
Digital data is stored as strings of binary values on a storage medium. This storage
medium is either internal or external to the devices generating or accessing the data. The
storage devices may be of different types, such as magnetic, optical, or SSD. Examples
of digital data are electronic documents, text files, emails, ebooks, digital images, digital
audio, and digital video.
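For example, the short sketch below (standard Python, with a hypothetical string) shows how a piece of text is held in electronic form as bytes and, ultimately, as a string of binary values:

    # A minimal sketch: a short text string stored as bytes and binary values.
    text = "ISM"
    raw_bytes = text.encode("utf-8")                 # electronic form: a sequence of bytes
    bits = " ".join(f"{b:08b}" for b in raw_bytes)   # the same bytes as binary strings
    print(raw_bytes)   # b'ISM'
    print(bits)        # 01001001 01010011 01001101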

Types of Digital Data


• Unstructured data has no inherent structure and is usually stored as different types of files

Text documents, PDFs, images, and videos

• Quasi-structured data consists of textual data with erratic formats that can be formatted
with effort and software tools

Clickstream data

• Semi-structured data consists of textual data files with an apparent pattern, enabling
analysis

Spreadsheets and XML files

• Structured data has a defined data model, format, and structure

Database
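To make these four types concrete, the snippet below (Python, with hypothetical values) shows the same sales fact represented in each form:

    # Hypothetical examples of the four types of digital data.
    structured = {"order_id": 1001, "product": "laptop", "amount": 799.00}        # fits a database table
    semi_structured = "<order id='1001'><product>laptop</product></order>"        # XML with an apparent pattern
    quasi_structured = '198.51.100.7 - [12/Mar/2024:10:15:32] "GET /laptop" 200'  # clickstream-style log entry
    unstructured = "Customer emailed to say the laptop arrived quickly."          # free text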

What is Information?
Definition: Information
Processed data that is presented in a specific context to enable useful
interpretation and decision-making.
The terms “data” and “information” are closely related and often used interchangeably.
However, it is important to understand the difference between the two.
Data, by itself, is simply a collection of facts that requires processing for it to be useful.
For example, the annual sales figures of an organization are data. When data is processed and
presented in a specific context, it can be interpreted in a useful manner. This processed and
organized data is called information.
For example, when you process the annual sales data into a sales report, it provides
useful information, such as the average sales for a product (indicating product demand
and popularity), and a comparison of the actual sales to the projected sales.
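A minimal sketch, using hypothetical figures, of turning raw annual sales data into information such as the average sales for a product and a comparison of actual to projected sales:

    # Raw data: units sold per quarter (hypothetical figures).
    annual_sales = {
        "Product A": [120, 150, 130, 160],
        "Product B": [80, 95, 70, 90],
    }
    projected = {"Product A": 600, "Product B": 400}

    # Processing the data into information: averages and actual vs. projected sales.
    for product, quarters in annual_sales.items():
        total = sum(quarters)
        average = total / len(quarters)
        print(f"{product}: average per quarter = {average:.1f}, "
              f"actual vs. projected = {total - projected[product]:+d}")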
Information thus creates knowledge and enables decision-making. Processing and
analyzing data is vital to any organization. It enables organizations to derive value from
data, and create intelligence to enable decision-making and organizational effectiveness.
It is easier to process structured data due to its organized form. On the other hand,
processing non-structured data and extracting information from it using traditional

applications is difficult, time-consuming, and requires considerable resources. Emerging
architectures, technologies, and techniques enable storing, managing, analyzing, and
deriving value from unstructured data coming from numerous sources.

• Example: Annual sales data processed into a sales report


• Enables calculation of the average sales for a product and the comparison of actual sales
to projected sales
• Emerging architectures and technologies enable extracting information from non-
structured data

Information Storage
In a computing environment, storage devices (or storage) are devices consisting of
nonvolatile recording media on which digital data or information can be persistently stored.
Storage may be internal or external to a compute system. Based on the nature of the
storage media used, storage devices are classified as:

• Magnetic storage devices: For example, hard disk drive and magnetic tape drive
• Optical storage devices: For example, Blu-ray, and DVD
• Flash-based storage devices: For example, solid-state drive (SSD), memory card, and
USB thumb drive (or pen drive)
Storage is a core component in an organization’s IT infrastructure. Various factors such
as the media, architecture, capacity, addressing, reliability, and performance influence the
choice and use of storage devices in an enterprise environment. For example, disk drives
and SSDs are used for storing business-critical information that needs to be continuously
accessible to applications. Magnetic tapes and optical storage are typically used for
backing up and archiving data.
In enterprise environments, information is typically stored on storage systems/storage
arrays. A storage system is a hardware component that contains a group of
homogeneous/heterogeneous storage devices that are assembled within a cabinet.
These enterprise-class storage systems are designed for high capacity, scalability,
performance, reliability, and security to meet business requirements.
The compute systems that run business applications are provided storage capacity from
storage systems. Storage systems are covered in the module ‘Intelligent Storage Systems
(ISS)’. Organizations typically house their IT infrastructure, including compute systems,
storage systems, and network equipment within a data center.

Data Center
A data center is a dedicated facility where an organization houses, operates, and
maintains its IT infrastructure along with other supporting infrastructures. It centralizes an
organization’s IT equipment and data-processing operations. A data center may be
constructed in-house and located in an organization’s own facility. The data center may
also be outsourced, with equipment being at a third-party site. A data center typically
consists of the following:

• Facility: It is the building and floor space where organizations construct the data center. It
typically has a raised floor with ducts underneath holding power and network cables.
• IT equipment: It includes components such as compute systems, storage, and
connectivity elements along with cabinets for housing the IT equipment.
• Support infrastructure: It includes power supply, fire, heating, ventilation, and air
conditioning (HVAC) systems. It also includes security systems such as biometrics, badge
readers, and video surveillance systems.
• Digital transformation is disrupting every industry, and with the evolution of modern
technologies, organizations are facing many business challenges.
Organizations must operate in real time, develop smarter products, and deliver a
great user experience. They must be agile, operate efficiently, and make decisions
quickly to be successful.
• However, these disruptive technologies, along with agile methodologies, are not well
supported by traditional IT infrastructure and services. An organization’s IT department
also faces several challenges in supporting these business needs. So, organizations
are moving toward a modern data center to overcome these challenges and
be successful in their digital transformation journey.

Key Characteristics of a Data Center


Data centers are designed and built to fulfill several key characteristics. Although
the characteristics are applicable to almost all data center components, the details here primarily
focus on storage systems.

Availability: Availability of information as and when required should be ensured. Unavailability of
information can severely affect business operations, lead to substantial financial losses, and
damage the reputation of an organization.
Security: Policies and procedures should be established, and control measures should be
implemented to prevent unauthorized access to and alteration of information.
Capacity: Data center operations require adequate resources to efficiently store and process
large and increasing amounts of data. When capacity requirements increase, additional capacity
should be provided either without interrupting the availability or with minimal disruption. Capacity
may be managed by adding new resources or by reallocating existing resources.
Scalability: Organizations may need to deploy additional resources such as compute systems,
new applications, and databases to meet the growing requirements. Data center resources should
scale to meet the changing requirements, without interrupting business operations.
Performance: Data center components should provide optimal performance based on the
required service levels.
Data Integrity: Data integrity refers to mechanisms, such as error correction codes or parity bits,
which ensure that data is stored and retrieved exactly as it was received.

Manageability: A data center should provide easy, flexible, and integrated management of all its
components. Efficient manageability can be achieved through automation for reducing manual
intervention in common, repeatable tasks.

Digital Transformation

Digital transformation puts technology at the heart of an organization’s products, services,
and operations.
With people, customers, businesses, and things communicating, transacting, and
negotiating with each other, a new world comes into being. It is the world of the digital
business that uses data as a way to create value. According to Gartner, by 2020, more
than seven billion people and businesses, and at least 30 billion devices, will be connected
to the Internet. Organizations need to accelerate their digital transformation journeys to
avoid being left behind in an increasingly digital world.
Digital transformation is imperative for all businesses. Businesses of all shapes and sizes
are changing to a more digital mindset. This digital mindset is being driven by the need to
innovate more quickly. Digital transformation puts technology at the heart of an
organization’s products, services, and operations.
In general terms, digital transformation is defined as the integration of digital technology
into all areas of a business. This results in fundamental changes to how businesses
operate and how they deliver value to customers, improve efficiency, reduce business
risks, and uncover new opportunities.

Key Technologies Driving Digital
Transformation
In this digital world, organizations need to develop new applications using agile processes
and new tools to ensure rapid time-to-market. Simultaneously, organizations still expect IT to
operate and manage the traditional applications that provide much of their revenue. To survive in the
business, an organization has to transform and adopt modern technologies to support its digital
transformation. Some of the key technologies that drive digital transformation are Cloud, Big Data
Analytics, Internet of Things, Machine Learning, and Artificial Intelligence.

Question 1
Which data asset is an example of unstructured data?

Database table

XML data file

News article text

Webserver log

Question 2
Why are businesses undergoing digital transformation?

To meet regulatory requirements

To innovate more quickly

To avoid security risks

To eliminate management costs

Modern Technology Driving Digital Transformation

Cloud Computing
Definition: Cloud Computing
A model for enabling convenient, on-demand network access to a shared pool of
configurable computing resources (for example, networks, servers, storage,
applications, and services) that can be rapidly provisioned and released with
minimal management effort or service provider interaction.
Source: The National Institute of Standards and Technology (NIST)—a part of the
U.S. Department of Commerce—in its Special Publication (SP) 800-145

The term “cloud” originates from the cloud-like bubble that is commonly used in technical
architecture diagrams to represent a system. This system may be the Internet, a network,
or a compute cluster. In cloud computing, a cloud is a collection of IT resources, including
hardware and software resources. You can deploy these resources either in a single data
center, or across multiple geographically dispersed data centers that are connected over
a network.
A cloud service provider is responsible for building, operating, and managing cloud
infrastructure. The cloud computing model enables consumers to hire IT resources as a
service from a provider. A cloud service is a combination of hardware and software
resources that are offered for consumption by a provider. The cloud infrastructure contains
IT resource pools, from which you can provision resources to consumers as services over
a network, such as the Internet or an intranet. Resources are returned to the pool when
the consumer releases them.

Cloud Computing Example


Example: The cloud model is similar to utility services such as electricity, water, and telephone.
When consumers use these utilities, they are typically unaware of how the utilities are generated

or distributed. The consumers periodically pay for the utilities based on usage. Similarly, in cloud
computing, the cloud is an abstraction of an IT infrastructure. Consumers hire IT resources as
services from the cloud without the risks and costs that are associated with owning the resources.
Cloud services are accessed from different types of client devices over wired and wireless network
connections. Consumers pay only for the services that they use, either based on a subscription or
based on resource consumption.

Essential Cloud Characteristics


In SP 800-145, NIST specifies that a cloud infrastructure should have the five essential
characteristics described below.


Measured Service: “Cloud systems automatically control and optimize resource use by
leveraging a metering capability at some level of abstraction appropriate to the type of service (for
example, storage, processing, bandwidth, and active user accounts). Resource usage can be
monitored, controlled, and reported, providing transparency for both the provider and consumer
of the utilized service.” – NIST
Resource Pooling: “The provider’s computing resources are pooled to serve multiple consumers
using a multitenant model, with different physical and virtual resources that are dynamically
assigned and reassigned according to consumer demand. There is a sense of location
independence in that the customer generally has no control or knowledge over the exact location
of the provided resources but may be able to specify location at a higher level of abstraction (for
example, country, state, or datacenter). Examples of resources include storage, processing,
memory, and network bandwidth.” – NIST
Rapid Elasticity: “Capabilities can be rapidly and elastically provisioned, in some cases
automatically, to scale rapidly outward and inward commensurate with demand. To the consumer,

the capabilities available for provisioning often appear to be unlimited and can be appropriated in
any quantity at any time.” – NIST
On-demand Self-service: “A consumer can unilaterally provision computing capabilities, such as
server time or networked storage, as needed automatically without requiring human interaction
with each service provider.” – NIST
Broad Network Access: “Capabilities are available over the network and accessed through
standard mechanisms that promote use by heterogeneous thin or thick client platforms (for
example, mobile phones, tablets, laptops, and workstations).” – NIST

Cloud Service Models



A cloud service model specifies the services and the capabilities that are provided to
consumers. In SP 800-145, NIST classifies cloud service offerings into the three primary
models:

• Infrastructure as a Service (IaaS)


• Platform as a Service (PaaS)
• Software as a Service (SaaS)
Many alternate cloud service models based on IaaS, PaaS, and SaaS are defined in
various publications and by different industry groups. These service models are specific
to the cloud services and capabilities that are provided.
Examples of such service models include:

• Network as a Service (NaaS),


• Database as a Service (DBaaS)
• Big Data as a Service (BDaaS)
• Security as a Service (SECaaS)
• Disaster Recovery as a Service (DRaaS)
However, these models eventually belong to one of the three primary cloud service
models.

Cloud administrators or architects assess and identify potential cloud service offerings. The
assessment includes evaluating what services to create and upgrade, and the necessary feature
set for each service. It also includes the service level objectives (SLOs) of each service aligning
to consumer needs and market conditions. SLOs are specific measurable characteristics such as
availability, throughput, frequency, and response time. They provide a measurement of
performance of the service provider. SLOs are key elements of a service level agreement (SLA).
An SLA is a legal document that describes items such as what service level will be provided, how it
will be supported, service location, and the responsibilities of the consumer and the provider.

Infrastructure as a Service (IaaS)

Definition: Infrastructure as a Service


“The capability provided to the consumer is to provision processing, storage,
networks, and other fundamental computing resources where the consumer is able
to deploy and run arbitrary software, which can include operating systems and
applications. The consumer does not manage or control the underlying cloud
infrastructure but has control over operating systems, storage, and deployed
applications; and possibly limited control of select networking components (for
example, host firewalls).” – NIST
IaaS pricing may be subscription-based or based on resource usage. The provider pools the
underlying IT resources and multiple consumers share these resources through a multitenant
model.
Organizations can even implement IaaS internally, where internal IT manages the resources and
services.
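As a hedged illustration (not part of the NIST definition), the sketch below shows what on-demand IaaS provisioning might look like using the AWS SDK for Python (boto3); the region, machine image ID, and instance type are placeholder values:

    # A minimal sketch of self-service IaaS provisioning with boto3 (placeholder values).
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The consumer provisions a compute instance on demand; the provider manages the
    # underlying infrastructure, while the consumer controls the OS and applications.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical machine image
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])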

Platform as a Service (PaaS)

Definition: Platform as a Service


“The capability provided to the consumer is to deploy onto the cloud infrastructure
consumer-created or acquired applications created using programming languages,
libraries, services, and tools supported by the provider. The consumer does not
manage or control the underlying cloud infrastructure including network, servers,
operating systems, or storage, but has control over the deployed applications and
possibly configuration settings for the application-hosting environment.” – NIST

In the PaaS model, a cloud service includes compute, storage, and network resources
along with platform software. Platform software includes software such as operating
system, database, programming frameworks, middleware, and tools to develop, test,
deploy, and manage applications.
Most PaaS offerings support multiple operating systems and programming frameworks
for application development and deployment. Typically, PaaS usage fees are calculated
based on the following factors:

• The number of consumers
• The types of consumers (developer, tester, and so on)
• The time for which the platform is in use
• The compute, storage, or network resources that the platform consumes
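A minimal sketch, with entirely hypothetical rates, of how a provider might combine these factors into a monthly usage-based PaaS fee:

    # Hypothetical rates; real providers define their own pricing dimensions.
    RATE_PER_CONSUMER = 25.0        # per consumer per month
    RATE_PER_PLATFORM_HOUR = 0.10   # per hour the platform is in use
    RATE_PER_GB_STORAGE = 0.05      # per GB of storage consumed per month

    def monthly_paas_fee(consumers, platform_hours, storage_gb):
        return (consumers * RATE_PER_CONSUMER
                + platform_hours * RATE_PER_PLATFORM_HOUR
                + storage_gb * RATE_PER_GB_STORAGE)

    print(monthly_paas_fee(consumers=10, platform_hours=720, storage_gb=500))  # 347.0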

Software as a Service (SaaS)

Definition: Software as a Service


“The capability provided to the consumer is to use the provider’s applications
running on a cloud infrastructure. The applications are accessible from various
client devices through either a thin client interface, such as a web browser (for
example, web-based email), or a program interface. The consumer does not
manage or control the underlying cloud infrastructure including network, servers,

operating systems, storage, or even individual application capabilities, except
limited user-specific application configuration settings.” – NIST
In the SaaS model, a provider offers a cloud-hosted application to multiple consumers as a
service. The consumers do not own or manage any aspect of the cloud infrastructure.
In SaaS, a given version of an application, with a specific configuration (hardware and software)
typically provides service to multiple consumers by partitioning their individual sessions and data.
SaaS applications execute in the cloud and usually do not need installation on end-point devices.
This feature enables a consumer to access the application on demand from any location and use
it through a web browser on various end-point devices.
Some SaaS applications may require a client interface to be installed locally on an end-point device.
Customer Relationship Management (CRM), email, Enterprise Resource Planning (ERP), and
office suites are examples of applications that are delivered through SaaS.

Cloud Deployment Models


A cloud deployment model provides a basis for how cloud infrastructure is built, managed, and
accessed. In SP 800-145, NIST specifies four primary cloud deployment models: public, private,
community, and hybrid.
Each cloud deployment model may be used for any of the cloud service models: IaaS, PaaS, and
SaaS. The different deployment models present several tradeoffs in terms of control, scale, cost,
and availability of resources.

Public Cloud

Definition: Public Cloud


“The cloud infrastructure is provisioned for open use by the general public. It may
be owned, managed, and operated by a business, academic, or government
organization, or some combination of them. It exists on the premises of the cloud
provider.” – NIST
Public cloud services may be free, subscription-based, or provided on a pay-per-use model. A
public cloud provides the benefits of low upfront expenditure on IT resources and enormous
scalability.
However, some concerns for consumers include network availability, risks associated with
multitenancy, limited visibility into and control over the cloud resources and data, and restrictive
default service levels.

Private Cloud

Definition: Private Cloud


“The cloud infrastructure is provisioned for exclusive use by a single organization
comprising multiple consumers (for example, business units). It may be owned,
managed, and operated by the organization, a third party, or some combination of
them, and it may exist on or off premises.” – NIST
Many organizations may not want to adopt public clouds due to concerns related to privacy,
external threats, and lack of control over the IT resources and data. When compared to a public
cloud, a private cloud offers organizations a greater degree of privacy and control over the cloud
infrastructure, applications, and data.
There are two variants of private cloud: on-premise and externally hosted. An organization
deploys an on-premise private cloud in its data center within its own premises.
In the externally hosted private cloud (or off-premise private cloud) model, an organization
outsources the implementation of the private cloud to an external cloud service provider. The cloud
infrastructure is hosted on the premises of the provider and may be shared by multiple tenants.
However, the organization’s private cloud resources are securely separated from those of other
cloud tenants by access policies implemented by the provider.

Community Cloud

Definition: Community Cloud


“The cloud infrastructure is provisioned for exclusive use by a specific community
of consumers from organizations that have shared concerns (for example, mission,
security requirements, policy, and compliance considerations). It may be owned,
managed, and operated by one or more of the organizations in the community, a
third party, or some combination of them, and it may exist on or off premises.” –
NIST
The organizations participating in the community cloud typically share the cost of deploying the
cloud and offering cloud services. This enables them to lower their individual investments. Since
the costs are shared by fewer consumers than in a public cloud, this option may be more
expensive. However, a community cloud may offer a higher level of control and protection than a
public cloud. As with the private cloud, there are two variants of a community cloud: on-premise
and externally hosted.

Hybrid Cloud

Definition: Hybrid Cloud


“The cloud infrastructure is a composition of two or more distinct cloud
infrastructures (private, community, or public) that remain unique entities, but are
bound by standardized or proprietary technology that enables data and application
portability (for example, cloud bursting for load balancing between clouds).” – NIST
For example, a hybrid cloud may consist of an on-premise private cloud deployed by
enterprise P, and a public cloud serving enterprise and individual consumers in addition to
enterprise P.

Evolution of Hybrid Cloud: Multicloud
To create the best possible solution for their businesses, organizations today want to choose
among different public cloud service providers. To achieve this goal, some organizations have started
adopting a multicloud approach.
The drivers for adopting this approach include avoiding vendor lock-in, data control, cost savings,
and performance optimization. This approach helps to meet business demands since
sometimes no single cloud model can suit the varied requirements and workloads across an
organization. Some application workloads run better on one cloud platform while other workloads
achieve higher performance and lower cost on another platform.
Also, certain compliance, regulation, and governance policies require an organization’s data to
reside in particular locations. A multicloud strategy can help organizations meet those
requirements because different cloud models from various cloud service providers can be
selected. Each cloud vendor offers different service options at different prices.
Organizations can also analyze the performance of their various application workloads and
compare them to what is available from other vendors. This method helps to analyze both
workload performance and cost for various services in each cloud. Options can then be identified
that meet the workload performance and cost requirements of the organization.

Cloud Computing Use Cases
• Cloud bursting: Provisioning resources for a limited time from a public cloud to handle peak
workloads
• Web application hosting: Hosting less critical applications on the public cloud
• Migrating packaged applications: Migrating standard packaged applications, such as e-mail, to
the public cloud
• Application development and testing: Developing and testing applications in the public cloud
before launching them
• Big Data analytics: Using the cloud to analyze voluminous data to gain insights and derive
business value
• Disaster recovery: Adopting the cloud for a DR solution can provide cost benefits, scalability, and
faster recovery of data
• Internet of Things: Using IoT in the cloud provides infrastructure for enhanced network
connectivity, storage space, and tools for data analysis

Big Data Analytics

Big Data: An Overview

Definition: Big Data


Information assets whose high volume, high velocity, and high variety require the
use of new technical architectures and analytical methods to gain insights and for
deriving business value.

Big Data represents the information assets whose high volume, high velocity, and high
variety require the use of new technical architectures and analytical methods to gain
insights and for deriving business value.
Many organizations such as government departments, retail, telecommunications,
healthcare, social networks, banks, and insurance companies employ data science
techniques to benefit from Big Data analytics.
The definition of Big Data has three principal aspects, which are:

• Characteristics of Data
• Data Processing Needs
• Business Value

1. Big Data includes data sets of considerable sizes containing both structured and
non-structured digital data. Apart from its size, the data gets generated and
changes rapidly, and also comes from diverse sources. These and other
characteristics are covered next.
2. Big Data also exceeds the storage and processing capability of conventional IT
infrastructure and software systems. It not only needs a highly-scalable
architecture for efficient storage, but also requires new and innovative technologies
and methods for processing.
These technologies typically make use of platforms such as distributed processing,
massively-parallel processing, and machine learning. The emerging discipline of Data

Science represents the synthesis of several existing disciplines, such as statistics,
mathematics, data visualization, and computer science for Big Data analytics.
3. Big Data analytics has tremendous business importance to organizations. Searching,
aggregating, and cross-referencing large data sets in real-time or near-real time enables
gaining valuable insights from the data. This enables better data-driven decision making.

Characteristics of Big Data


Apart from the characteristics of volume, velocity, and variety—popularly known as “the 3Vs”—the
three other characteristics of Big Data include variability, veracity, and value.


Volume: The word “Big” in Big Data refers to the massive volumes of data. Organizations are
witnessing an ever-increasing growth in data of all types, such as transaction-based data stored
over the years, sensor data, and unstructured data streaming in from social media. This growth in
data is reaching Petabyte—and even Exabyte—scales. The excessive volume not only requires
substantial cost-effective storage, but also gives rise to challenges in data analysis.
Velocity: Velocity refers to the rate at which data is produced and changes, and also how fast
the data must be processed to meet business requirements. Today, data is generated at an
exceptional speed, and real-time or near-real time analysis of the data is a challenge for many
organizations. It is essential for the data to be processed and analyzed, and the results to be
delivered in a timely manner. An example of such a requirement is real-time face recognition for
screening passengers at airports.
Variety: Variety refers to the diversity in the formats and types of data. Data is generated by
numerous sources in various structured and non-structured forms. Organizations face the
challenge of managing, merging, and analyzing the different varieties of data in a cost-effective
manner. The combination of data from a variety of data sources and in a variety of formats is a
key requirement in Big Data analytics. An example of such a requirement is combining a large
number of changing records of a particular patient with various published medical research to find
the best treatment.

Variability: Variability refers to the constantly changing meaning of data. For example, analysis
of natural language search and social media posts requires interpretation of complex and highly-
variable grammar. The inconsistency in the meaning of data gives rise to challenges related to
gathering the data and in interpreting its context.
Veracity: Veracity refers to the varying quality and reliability of data. The quality of the data being
gathered can differ greatly, and the accuracy of analysis depends on the veracity of the source
data. Establishing trust in Big Data presents a major challenge because as the variety and number
of sources grows, the likelihood of noise and errors in the data increases. Therefore, a significant
effort may go into cleaning data to remove noise and errors, and to produce accurate data sets
before analysis can begin. For example, a retail organization may have gathered customer
behavior data from across systems to analyze product purchase patterns and to predict purchase
intent. The organization would have to clean and transform the data to make it consistent and
reliable.
Value: Value refers to the cost-effectiveness of the Big Data analytics technology used and the
business value derived from it. Many large enterprise scale organizations have maintained large
data repositories, such as data warehouses, managed non-structured data, and carried out real-
time data analytics for many years. With hardware and software becoming more affordable and
the emergence of more providers, Big Data analytics technologies are now available to a much
broader market. Organizations are also gaining the benefits of business process enhancements,
increased revenues, and better decision making.
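As a small illustration of the cleaning effort described under Veracity, the sketch below (pandas, hypothetical customer data) removes duplicate and incomplete records before analysis begins:

    # A minimal data-cleaning sketch: drop duplicates and records with missing fields.
    import pandas as pd

    raw = pd.DataFrame({
        "customer_id": [1, 1, 2, 3, None],
        "product":     ["laptop", "laptop", "phone", None, "tablet"],
        "amount":      [799.0, 799.0, 499.0, 299.0, 149.0],
    })

    clean = (raw.drop_duplicates()   # remove repeated records
                .dropna()            # remove records with missing values
                .reset_index(drop=True))
    print(clean)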

Data Repositories
Data for analytics typically comes from repositories such as enterprise data warehouses
and data lakes.
A data warehouse is a central repository of integrated data gathered from multiple different
sources. It stores current and historical data in a structured format. It is designed for query
and analysis to support an organization’s decision making process. For example, a data
warehouse may contain current and historical sales data that is used for generating trend
reports for sales comparisons.
A data lake is a collection of structured and non-structured data assets that are stored as
exact or near-exact copies of the source formats. The data lake architecture is a “store-
everything” approach to Big Data. Unlike conventional data warehouses, data is not
classified when it is stored in the repository, as the value of the data may not be clear at
the outset. The data is also not arranged as per a specific schema and is stored using an
object-based storage architecture.
As a result, data preparation is eliminated and a data lake is less structured compared to
a data warehouse. Data is classified, organized, or analyzed only when it is accessed.
When a business need arises, the data lake is queried, and the resultant subset of data
is then analyzed to provide a solution. The purpose of a data lake is to present an
unrefined view of data to highly-skilled analysts, and to enable them to implement their
own data refinement and analysis techniques.
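A minimal sketch, with hypothetical records, of the data lake’s “schema-on-read” behavior: raw data is stored as-is and is parsed and shaped only when a question is asked of it:

    # Raw records stored as-is in the lake (hypothetical JSON strings).
    import json
    import pandas as pd

    raw_records = [
        '{"user": "a01", "event": "search", "term": "ssd", "ts": 1710200000}',
        '{"user": "a02", "event": "purchase", "sku": "X-100", "ts": 1710200100}',
    ]

    # Schema on read: parse and organize the data only when it is accessed.
    events = pd.DataFrame([json.loads(r) for r in raw_records])
    print(events[events["event"] == "purchase"])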

Components of a Big Data Analytics Solution

The technology layers in a Big Data analytics solution include storage, MapReduce technologies,
and query technologies. These components are collectively called the ‘SMAQ stack’. SMAQ
solutions may be implemented as a combination of multi-component systems. They may also be
offered as a product with a self-contained system comprising storage, MapReduce, and query – all
in one.

These ‘SMAQ stack’ components are:

• Storage: The foundation layer of the stack. It has a distributed architecture and holds
primarily unstructured content in non-relational form.
• MapReduce: An intermediate layer in the stack. It enables the distribution of computation
across multiple generic compute systems for parallel processing to gain speed and cost
advantages. It also supports a batch-oriented processing model of data retrieval and
computation, as opposed to the record-set orientation of most SQL-based databases.
• Query: This layer typically implements a NoSQL database for storing, retrieving, and
processing data. It also provides a user-friendly platform for analytics and reporting.

Storage
A storage system in the SMAQ stack is based on either a proprietary or an open-source
distributed file system, such as Hadoop Distributed File System (HDFS). The storage
system may also support multiple file systems for client access. The storage system
consists of multiple nodes—collectively called a “cluster”—and the file system is
distributed across all the nodes in the cluster.
Each node in the cluster has processing capability and storage capacity. The system has
a highly scalable architecture, and you can add extra nodes dynamically to meet the
workload and the capacity needs.
A distributed file system like HDFS typically provides only an interface similar to that of a
regular file system. Unlike a database, it can only store and retrieve data, not
index it, which is essential for fast data retrieval. To mitigate this challenge and gain the

advantages of a database system, SMAQ solutions may implement a NoSQL database
on top of the distributed file system. NoSQL databases may have built-in MapReduce
features that enable processing to be parallelized over their data stores.
In many applications, the primary source of data is in a relational database. Therefore,
SMAQ solutions may also support the interfacing of MapReduce with relational database
systems. MapReduce fetches datasets and stores the results of the computation in
storage. The data must be available in a distributed fashion, to serve each processing
node.
The design and features of the storage layer are important not just because of the
interface with MapReduce, but also because they affect the ease with which data can be
loaded and the results of computation extracted and searched.

MapReduce
MapReduce is the driving force behind most Big Data processing solutions. It is a parallel
programming framework for processing large datasets on a compute cluster. The key innovation
of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over
multiple compute systems or nodes. This distribution solves the issue of processing data that is
too large for a single machine to process.
MapReduce works in two phases, namely ‘Map’ and ‘Reduce’, as the name suggests. An input
dataset is split into independent chunks which are distributed to multiple compute systems.

• The Map function processes the chunks in a parallel manner, and transforms them into
multiple smaller intermediate datasets
• The Reduce function condenses the intermediate results and reduces them to a
summarized dataset, which is the desired end result
Typically both the input and the output datasets are stored on a file-system. The MapReduce
framework is highly scalable and supports the addition of processing nodes to process chunks.
Apache’s Hadoop MapReduce is the predominant open source Java-based implementation of
MapReduce.
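A minimal, single-process sketch of the Map and Reduce phases using the classic word-count example; a framework such as Hadoop MapReduce would run these phases in parallel across cluster nodes:

    from collections import defaultdict

    chunks = ["big data needs big storage", "data drives value"]   # hypothetical input chunks

    # Map: transform each chunk into intermediate (key, value) pairs.
    intermediate = []
    for chunk in chunks:                     # in practice, chunks are processed in parallel
        for word in chunk.split():
            intermediate.append((word, 1))

    # Reduce: condense the intermediate pairs into a summarized dataset.
    counts = defaultdict(int)
    for word, count in intermediate:
        counts[word] += count

    print(dict(counts))   # {'big': 2, 'data': 2, 'needs': 1, 'storage': 1, 'drives': 1, 'value': 1}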

Query
Query simplifies the specification of MapReduce operations, and the retrieval and analysis
of the results.
It is non-intuitive and inconvenient to specify MapReduce jobs in terms of distinct Map and
Reduce functions in a programming language. To mitigate this challenge, SMAQ systems
incorporate a higher-level query layer to simplify both the specification of the MapReduce
operations, and the analysis of the results.
The query layer implements high-level languages that enable users to describe, run, and
monitor MapReduce jobs. The languages are designed to handle not only the processing,
but also the loading and saving of data from and to the MapReduce cluster. The languages
typically support integration with NoSQL databases that you implement on the
MapReduce cluster.

Big Data Use Cases
• Healthcare: Provides consolidated diagnostic information and enables healthcare providers to
analyze patient data; improve patient care and outcomes; minimize errors; increase patient
engagement; and improve operations and services. These solutions also enable healthcare
providers to monitor patients and analyze their experiences in real time.
• Finance: Supports activities such as correlating purchase history, profiling customers, and
analyzing behavior on social networks. This also enables controlling customer acquisition costs
and targeting sales promotions more effectively. Big Data analytics is also used extensively in
detecting credit card fraud.
• Retail and eCommerce: Provides valuable insights for competitive pricing, anticipating future
demand, effective marketing campaigns, optimized inventory assortment, and improved
distribution. This enables retailers to provide optimal prices and services to customers, and also
improve operations and revenue.
• Government: Facilitates improved efficiency and effectiveness across a variety of domains such
as social services, education, defense, national security, crime prevention, transportation, tax
compliance, and revenue management.
• Social Network Analysis: Enables valuable insights from the data that is generated through
social networking. This analysis enables the discovery and analysis of communities,
personalization for solitary activities (for example, search) and social activities (for example,
discovery of potential friends). It also involves the analysis of user behavior in open forums (for
example, conventional sites, blogs, and communities) and in commercial platforms (for example,
eCommerce).
• Gaming: Supports gathering data from online and offline games and players worldwide. The data
can be used for advertising and to improve the gaming experience.
• Geolocation Services: Enables businesses to use geolocation services in their applications to
locate customers. This data can be used to improve service and the overall customer experience.

Internet of Things

Internet of Things: An Overview

In this rapidly transforming digital landscape, the speed of communication has become a
critical factor for every organization in accessing its information. The evolution of the
Internet and the rise of Internet-connected devices provide new opportunities
for smarter decision making, gaining a competitive edge, and improving the lives of
customers. These devices range from laptops and mobile phones to irrigation systems and
cars, all generating digital data.
The Internet of Things (IoT) is the concept of networking things such as objects and people
to collect and exchange data. The idea is that real-life objects can independently share
and process information - without humans having anything to do with the data input stage.
IoT supports Machine to Machine (M2M) communication enabling devices to
communicate with each other to provide faster, accurate, and timely data-driven results.
The use of IoT requires organizations to store large volumes of data, and to process and
analyze data in real time. It also requires a transformation of the data center to meet the
network, security, and data storage and management requirements.

Components of Internet of Things
An IoT implementation requires a proper understanding of its components; IoT
devices and applications communicate using various standards.
The three main components of IoT are:

• Sensors: Smart devices that detect changes in their surrounding environment, produce,
and transmit digital data. Sensors should be able to detect a wide range of physical
phenomena, ranging from temperature and pressure to motion and magnetic fields.
Examples of sensors include thermostats, moisture sensors, accelerometers, and gas/smoke
sensors. In IoT, different sensors are used for different IoT applications to
produce and transfer the data for processing.
• Actuators: Devices that collect data from sensors and perform the required action.
Actuators consume energy to produce physical action like creating a motion or controlling
a system. Examples of actuators include electric motors, which use electric power to
generate motion, and hydraulic actuators, which use fluid pressure to generate motion. In IoT,
actuators help to automate operations by applying a force based on the dynamics of the
data generated by sensors.
• Gateways: IoT involves billions of devices that are on various networks getting connected
for data communication. Gateways are devices that manage data traffic between networks
by translating their network protocols. This process ensures that devices operating in
various networks are interoperable. In IoT, these gateway devices can also be designed
to analyze and secure the data that is collected from sensors before transmitting it to the
next phase.

Example: In a modern irrigation system, IoT devices are used to monitor the crop field and
automate the irrigation system to increase the efficiency and productivity of the overall agricultural
processes. Soil moisture sensors detect the moisture levels in the soil and send the appropriate
data to the actuator. Based on the data, the actuator device will control the flow of water through
the valves. Since, these devices generate a lot of data, gateways help to transfer this data to the

cloud for storage. Gateways communicate with sensors using various protocols and translate the
data that is appropriate for cloud transmission.
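A minimal sketch of the irrigation example, with hypothetical readings and threshold: a sensor reading drives an actuator decision, and a gateway forwards the data onward for storage and analysis:

    MOISTURE_THRESHOLD = 30.0       # percent; hypothetical

    def read_soil_moisture_sensor():
        return 22.5                 # stand-in for a real sensor reading

    def actuate_valve(open_valve):
        print("Valve opened" if open_valve else "Valve closed")

    def gateway_forward(reading):
        # A real gateway would translate protocols and send the reading to a cloud
        # endpoint; here it is simply logged.
        print(f"Forwarding reading {reading}% to cloud storage")

    moisture = read_soil_moisture_sensor()
    actuate_valve(moisture < MOISTURE_THRESHOLD)   # open the valve when the soil is dry
    gateway_forward(moisture)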

Internet of Things Use Cases

• Home Automation: The use of IoT has entered the residential environment with the
introduction of smart home technology. Various electronic objects at home, such as air
conditioners, lights, refrigerators, security cameras, and kitchen stoves, can be connected to the
Internet with the help of sensors. This allows home owners to efficiently monitor
and control these objects anytime, irrespective of their location.
• Smart Cities: The smart cities concept highlights the need to enhance the quality of life of
the citizens using smart public infrastructure. Using IoT sensors, this enables optimized power
usage, efficient water supply, managed waste collection, and reliable public transportation.
All this data is collected and sent to a control center, which directs
the necessary actions. This application of IoT can also be extended to build smarter
environments through early detection of earthquakes, air pollution, and forest fires.
• Wearables: With the use of wearables and embedded devices on people, IoT sensors can
collect data about the users regarding their health, heartbeat, and exercise patterns. For
example, embedded chips enable doctors to monitor patients who are in critical care, by
tracking and charting all their vital signs constantly. Wearables also have their application
in detecting and reporting crimes in the city.
• Manufacturing Industries: Using IoT in manufacturing industries is helping them to identify
optimization possibilities in their day to day operations. By applying IoT, they are not just
able to monitor but they are also able to automate the complex tasks involved.
• Home Automation: Allows home owners to monitor and control home appliances anytime,
irrespective of their location
• Smart Cities: Highlights the need to enhance the quality of life of citizens using IoT
• Wearables: Helps to collect data about users’ health; also helps to detect and report crimes
• Manufacturing Industries: Helps industries to identify optimization possibilities in their
day-to-day operations

Machine Learning
Machine Learning (ML) Overview
Data is growing at an astronomical rate, and it is impossible to take full advantage of it
manually to get insights. Automation can provide faster, better, and deeper data insights.
With the advancement of computer systems and modern technologies, intelligent
machines are being built to automatically learn from data and to make decisions. As cloud
computing and Big Data technologies generate voluminous data, these intelligent
machines help to process the data in real time.
Artificial intelligence, machine learning, and deep learning are three intertwined concepts
that help to build this human-like ability into computer systems. Artificial Intelligence (AI)
is an umbrella term, while machine and deep learning are the techniques that make AI
possible. AI is a technology of creating intelligent systems that work and think like humans.
Machine learning refers to the process of ‘training’ the machine, feeding large amounts of
data into algorithms that give it the ability to learn how to perform the task without being
explicitly programmed. Instead of writing a program, a machine is provided with data. With
the help of algorithms, machines learn from the data and complete a specific task. When
the machine is provided with a new dataset, it adapts to it by learning from previous
experiences to produce reliable outputs.
Deep learning is a machine learning technique that uses neural networks as the
underlying architecture for training models. Fast compute and storage with ample memory
and high-bandwidth networking enable machines to learn faster and provide accurate
results. A neural network is a set of algorithms used to establish relationships in a
dataset by imitating the human brain. A training model is an object that is provided with
an algorithm along with a set of data from which it can learn.
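A minimal sketch of this training process, using scikit-learn and toy, hypothetical data: the machine is given data and an algorithm, learns from it, and then predicts on a new sample:

    from sklearn.linear_model import LogisticRegression

    # Hypothetical training data: [hours_online, purchases] -> will_churn (0 or 1).
    X_train = [[1, 5], [2, 3], [8, 0], [9, 1], [7, 0], [1, 4]]
    y_train = [0, 0, 1, 1, 1, 0]

    model = LogisticRegression()
    model.fit(X_train, y_train)        # the machine learns from the data

    print(model.predict([[6, 1]]))     # prediction for a new, unseen sample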

Algorithm Types
A machine learning process involves creating mathematical and statistical algorithms that
can accept input data and use some sort of analysis to predict the output. In this process,
the first step is to collect the datasets for analysis. Once the data is collected, select the
type of algorithm to be used, then build a model. Train the model with test data sets, and
improve the model accordingly for future decision making. Most machine learning
algorithms can be classified into three types: supervised learning, unsupervised learning, and
reinforcement learning.

Impact on the Data Center
Artificial Intelligence and Machine Learning are providing new opportunities for the data
center as well as creating challenges if organizations are not prepared to support these
technologies from their infrastructure aspect.
Machine learning helps in making the data center and its management efficient by
reducing energy usage, maximizing usage and operation of resources, automating
operations, and preventing downtime.
Machine learning algorithms can be applied to data logs collected from infrastructure
resources to identify any problems or security issues that would otherwise become
challenging using manual operations. As this operational log data becomes a larger
dataset for machine learning systems, it requires sufficient storage capacity to be
managed and stored efficiently.
Machine learning applications require high-end microprocessors for faster processing of
data and modern storage solutions to keep up with the processing speed. Organizations
can consider using hybrid cloud storage options for reducing the data center footprint,
load-balancing, and cost-effectiveness.

Machine Learning Use Cases


• Energy: A large amount of data generated by this industry is processed using machine learning
solutions to increase productivity. It helps to use energy storage efficiently by tracking usage,
handles different types of energy sources using autonomous grids, and helps to predict
component failures and consumption demand.
• Media and Entertainment: Content from the media and entertainment industry can be
automatically tagged using metadata by applying machine learning solutions. This method
enhances content-based search by finding the right content quickly and helps content developers
optimize content for specific audiences based on their search data. It also plays an important role
in creating video subtitles using natural language processing.
• Sports: Machine learning can be applied to sports for predicting the results of games, helping
coaches get insights into player performance, and better organizing games with appropriate
strategy by analyzing performance and game data.
• Financial Services: Banks and other businesses use machine learning to detect and prevent
fraudulent activities for credit cards and bank accounts. It also helps traders identify investment
opportunities by monitoring market changes. It is used to provide risk management solutions,
such as predicting financial crises, predicting loan repayment capabilities of customers, and
securing financial data.

Concepts In Practice
Concepts in Practice
• Dell EMC Cloud for Microsoft Azure
• Pivotal Cloud Foundry
• Dell Edge Gateway
• Dell EMC Ready Solution for Artificial Intelligence
• VMware Cloud on AWS

Dell EMC Cloud for Microsoft Azure


Delivers Infrastructure and Platform as a Service with a consistent Azure experience on-
premises and in the public cloud. This platform is built on VxRack AS hyper-converged
architecture that has modular building blocks that are called nodes and powered by
Microsoft Windows software-defined storage and networking capabilities. It is managed
using the Microsoft Azure Stack interface. Cloud for Microsoft Azure Stack provides a simple,
cost-effective solution that delivers multiple performance and capacity options to match
any use case and covers a wide variety of cloud-native applications and workloads.

• Delivers Infrastructure and Platform as a service


• Provides consistent Azure experience on-premise and in public cloud
• Uses Dell EMC VxRack AS hyper-converged architecture
• Powered by Microsoft Windows software-defined storage and networking

Pivotal Cloud Foundry


An enterprise Platform as a Service solution, which is built on the foundation of the Cloud
Foundry open-source PaaS project. Pivotal CF, powered by Cloud Foundry, enables
streamlined application development, deployment, and operations in both private and
public clouds. It supports multiple programming languages and frameworks. It helps
developers to deploy their applications without being concerned about configuring and
managing the underlying cloud infrastructure. It provides zero downtime stack updates
while migrating the applications to the new stack. Developers can use the security controls
offered by PCF.

• An enterprise Platform as a Service solution
• Application development, deployment, and operations in both public and private clouds
• Supports multiple programming languages and frameworks
• Offloads configuring and managing infrastructure tasks from developers

Dell Edge Gateway


An intelligent device that is designed to aggregate, secure, analyze, and relay data from
diverse sensors and equipment at the edge of the network. These gateways bridge both
legacy systems and modern sensors to the internet, helping to get business insights from
the real-time, pervasive data in your machines and equipment. It is compact, consumes less power, and is suitable for challenging field and mobile use cases. It is designed for flexible manageability using Dell Edge Device Manager or a third-party on-premises
console.

• Aggregates, secures, analyzes, and relays data


• Operates at the edge of the network
• Compact, consumes less power
• Bridges both legacy systems and modern sensors to the Internet

Dell EMC Ready Solution for Artificial


Intelligence

These solutions shorten the deployment time from months to days. They include software that
streamlines the set-up of data science environments to just a few clicks, boosting data scientist
productivity. These solutions are optimized with software, servers, networking, storage, and
services to help organizations to get faster and deeper insights. These solutions include:

• Dell EMC Machine Learning with Hadoop: Builds on the power of tested and proven
Dell EMC Ready Bundles for Hadoop, created in partnership with Cloudera®. This
solution includes an optimized solution stack along with data science and framework
optimization. It consists of Cloudera Data Science Workbench with the added ease of a
Dell EMC Data Science Engine
• Dell EMC Deep Learning with Intel: Simplifies and accelerates the adoption of deep
learning technology with an optimized solution stack that simplifies the entire workflow
from model building to training to inferencing. It consists of PowerEdge C servers and Dell
EMC H-series networking based on Intel Omni-Path networking.

• Dell EMC Deep Learning with NVIDIA: Provides a GPU-optimized solution stack that
can shave valuable time from deep learning projects. It consists of PowerEdge servers
with NVIDIA GPUs and Isilon Scale-out NAS storage.

• Shortens deployment time


• Optimized with software, servers, networking, storage, and services
Includes the following:

• Dell EMC Machine Learning with Hadoop


• Dell EMC Deep Learning with Intel
• Dell EMC Deep Learning with NVIDIA

VMware Cloud on AWS


Extends the VMware Software Defined Data Center (SDDC) software onto the AWS
cloud. This SDDC software consists of several other products including vCenter Server
for data center management, vSAN for software-defined storage, and NSX for software-
defined networking. It enables customers to run their VMware vSphere based applications
across private, public, and hybrid cloud environments with optimized access to AWS
services. It helps virtual machines in SDDC to access AWS EC2 and S3 services. This
solution provides workload migration, allows customers to use the global presence of AWS data centers, and offers flexibility of management.

• Extends SDDC software to AWS cloud


• Consists of vCenter Server, VMware vSphere, vSAN, and NSX
• Supports private, public, and hybrid cloud environments.
• Helps virtual machines in SDDC to access AWS EC2 and S3 services

Question 1
Which machine learning technique uses neural networks as the underlying architecture for training models?

• Internet of things
• Big data analytics
• Edge computing
• Deep learning

Question 2
Identify the cloud computing characteristic that controls and optimizes resource use by leveraging a metering capability.

• Measured service (correct)
• On-demand self-service
• Rapid elasticity
• Resource pooling

Modern Data Center
Environment
Compute System
What is a Compute System
A compute system is a computing device (combination of hardware, firmware, and system
software) that runs business applications.
Examples of compute systems include physical servers, desktops, laptops, and mobile
devices. The term compute system refers to physical servers and hosts on which platform
software, management software, and business applications of an organization are
deployed.
A compute system’s hardware consists of processor(s), memory, internal storage, and I/O
devices. The logical components of a compute system include the operating system (OS),
file system, logical volume manager, and device drivers. The OS may include these other software components, or they may be installed individually.
In an enterprise data center, applications are typically deployed on compute clusters for
high availability and for balancing computing workloads. A compute cluster is a group of
two or more compute systems that function together, sharing certain network and storage
resources, and logically viewed as a single system. Compute clustering is covered in detail in the module ‘Introduction to Business Continuity’.

Types of Compute Systems


The compute systems used in building data centers are typically classified into three
categories: tower compute system, rack-mounted compute system, and blade compute
system.
• Tower
• Rack-mounted
• Blade

A tower compute system, also known as a tower server, is a compute system built in an
upright stand-alone enclosure called a “tower”, which looks similar to a desktop cabinet.
Tower servers have a robust build, and have integrated power supply and cooling. They
typically have individual monitors, keyboards, and mice.
Tower servers occupy significant floor space and require complex cabling when deployed
in a data center. They are also bulky, and a group of tower servers generate considerable
noise from their cooling units. Tower servers are typically used in smaller environments.
Deploying many tower servers in large environments may involve substantial expenditure.
A rack-mounted compute system, also known as a rack server, is a compute system
designed to be fixed inside a frame called a “rack”. A rack is a standardized enclosure
containing multiple mounting slots called “bays”, each of which holds a server in place
with the help of screws. A single rack contains multiple servers stacked vertically in bays,
thereby simplifying network cabling, consolidating network equipment, and reducing the
floor space use. Each rack server has its own power supply and cooling unit. Typically, a
console is mounted on a rack to enable administrators to manage all the servers in the
rack.
Some concerns with rack servers are that they are cumbersome to work with, and that they generate considerable heat, which requires more cooling and in turn increases
power costs. A “rack unit” (denoted by U or RU) is a unit of measure of the height of a
server designed to be mounted on a rack. One rack unit is 1.75 inches (44.45 mm). A 1 U
rack server is typically 19 inches (482.6 mm) wide.

The standard rack cabinets are 19 inches wide and the common rack cabinet sizes are
42U, 37U, and 27U. The rack cabinets are also used to house network, storage,
telecommunication, and other equipment modules. A rack cabinet may also contain a
combination of different types of equipment modules.
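As a quick arithmetic illustration (not from the course text), the snippet below converts the rack sizes mentioned above into physical height using the 1.75-inch rack unit.

```python
# One rack unit (U) is 1.75 inches (44.45 mm); a cabinet's mounting height
# is simply its size in U multiplied by that constant.
RACK_UNIT_INCHES = 1.75

for size_u in (42, 37, 27):          # common rack cabinet sizes
    height_in = size_u * RACK_UNIT_INCHES
    print(f"{size_u}U rack: {height_in:.1f} in ({height_in * 25.4:.0f} mm)")
# 42U rack: 73.5 in (1867 mm), for example
```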
A blade compute system, also known as a blade server, is an electronic circuit board
containing only core processing components, such as processor(s), memory, integrated
network controllers, storage drive, and essential I/O cards and ports. Each blade server
is a self-contained compute system and is typically dedicated to a single application.
A blade server is housed in a slot inside a blade enclosure (or chassis), which holds
multiple blades and provides integrated power supply, cooling, networking, and
management functions. The blade enclosure enables interconnection of the blades
through a high-speed bus and also provides connectivity to external storage systems.
The modular design of the blade servers makes them smaller, which minimizes the floor
space requirements, increases the compute system density and scalability, and provides
better energy efficiency as compared to the tower and the rack servers. It also reduces
the complexity of the compute infrastructure and simplifies compute infrastructure
management. It provides these benefits without compromising on any capability that a
non-blade compute system provides.
Some concerns with blade servers include the high cost of a blade system (blade servers
and chassis), and the proprietary architecture of most blade systems due to which a blade
server can typically be plugged only into a chassis from the same vendor.

Physical Components of a Compute System


A compute system comprises multiple physical hardware components assembled inside a metal
enclosure. Some key components are described here.
Processor: A processor, also known as a Central Processing Unit (CPU), is an integrated circuit
(IC). This processor executes the instructions of a software program by performing fundamental
arithmetical, logical, and input/output operations. A common processor/instruction set architecture
is the x86 architecture with 32-bit and 64-bit processing capabilities. Modern processors have
multiple cores (independent processing units), each capable of functioning as an individual
processor. A socket is a single package that can have one or more processor cores, with one or more logical processors in each core. A dual-core processor, for example, can provide almost double the performance of a single-core processor by allowing two virtual CPUs to execute at the same time.
Random-Access Memory (RAM): The RAM or main memory is an IC that serves as a volatile
data storage internal to a compute system. The RAM is directly accessible by the processor, and holds the software programs and data that the processor uses during execution.
Read-Only Memory (ROM): A ROM is a type of non-volatile semiconductor memory from which
data can only be read but not written to. It contains the boot firmware (that enables a compute
system to start), power management firmware, and other device-specific firmware.
Motherboard: A motherboard is a printed circuit board (PCB) to which all compute system
components connect. It has sockets to hold components such as the microprocessor chip, RAM, and ROM. It also has network ports, I/O ports to connect devices such as keyboard, mouse, and
printers, and essential circuitry to carry out computing operations. A motherboard may also have
integrated components, such as a graphics processing unit (GPU), a network interface card (NIC),
and adapters to connect to external storage devices.
Chipset: A chipset is a collection of microchips on a motherboard, and it is designed to perform
specific functions. The two key chipset types are Northbridge and Southbridge. Northbridge
manages processor access to the RAM and the GPU, while Southbridge connects the processor
to different peripheral ports, such as USB ports.
Secondary Storage: Secondary storage is a persistent storage device, such as a hard disk drive
or a solid-state drive. The OS and the application software are installed on this storage. The
processor cannot directly access secondary storage. The desired applications and data are loaded
from the secondary storage on to the RAM to enable the processor to access them.

Logical Components of a Compute System


The key logical components of a compute system are:

• Operating system
• Virtual memory
• Logical volume manager
• File system

Logical Components: Operating System
The operating system (OS) is a software that acts as an intermediary between a user of
a compute system and the compute system hardware. It controls and manages the
hardware and software on a compute system.
The OS manages hardware functions and application execution, and provides a user interface (UI) for users to operate and use the compute system.
The image depicts a generic architecture of an OS. Some functions (or services) of an OS
include program execution, memory management, resources management and
allocation, and input/output management. An OS also provides networking and basic
security for the access and usage of all managed resources. It also performs basic storage
management tasks while managing other underlying components, such as the device
drivers, logical volume manager, and file system. An OS also contains high-level
Application Programming Interfaces (APIs) to enable programs to request services.

Logical Components: Virtual Memory


The amount of physical memory (RAM) in a compute system determines both the size
and the number of applications that can run on the compute system. Memory virtualization
presents physical memory to applications as a single logical collection of contiguous
memory locations called virtual memory. While executing applications, the processor
generates logical addresses (virtual addresses) that map into the virtual memory. The memory management unit of the processor then maps the virtual address to a physical address. The OS utility known as the virtual memory manager (VMM) manages the virtual memory.
An additional memory virtualization feature of an OS enables the capacity of secondary storage devices to be allocated to the virtual memory. This creates a virtual memory with an address space that is larger than the physical memory space present in the compute system. It enables multiple applications and processes, whose aggregate memory requirement is greater than the available physical memory, to run on a compute system without impacting each other.

The VMM manages the virtual-to-physical memory mapping. This VMM fetches data from
the secondary storage when a process references a virtual address that points to data at
the secondary storage. The space used by the VMM on the secondary storage is known
as a swap space. A swap space (also known as page file or swap file) is a portion of the
storage drive that is used as physical memory.
In a virtual memory implementation, the memory of a system is divided into contiguous
blocks of fixed-size pages. A process known as paging moves inactive physical memory
pages onto the swap file and brings them back to the physical memory when required.
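The following Python sketch (an illustration only; all names and sizes are hypothetical, and a real VMM is far more involved) models the paging behavior described above: virtual pages are mapped to a small set of physical frames, and the least recently used page is moved to a stand-in swap space when no frame is free.

```python
# Minimal illustration of demand paging: virtual pages map to a few physical
# frames; on a page fault the least recently used page is swapped out to a
# "swap space" dictionary standing in for secondary storage.
from collections import OrderedDict

PAGE_SIZE = 4096            # bytes per page
NUM_FRAMES = 4              # physical frames available (deliberately small)

page_table = OrderedDict()  # virtual page number -> physical frame number
swap_space = {}             # virtual page number -> page contents on "disk"
free_frames = list(range(NUM_FRAMES))

def access(virtual_address):
    """Translate a virtual address, paging in from swap space if needed."""
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    if vpn not in page_table:                 # page fault
        if not free_frames:                   # no free frame: evict LRU page
            victim_vpn, victim_frame = page_table.popitem(last=False)
            swap_space[victim_vpn] = f"contents of page {victim_vpn}"
            free_frames.append(victim_frame)
        frame = free_frames.pop()
        swap_space.pop(vpn, None)             # page in, if previously swapped
        page_table[vpn] = frame
    page_table.move_to_end(vpn)               # mark as most recently used
    return page_table[vpn] * PAGE_SIZE + offset   # physical address

print(hex(access(0x12345)))  # first access faults, then resolves
```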

Logical Components: Logical Volume Manager (LVM)
Creates and controls compute-level logical storage:

• Provides a logical view of physical storage
• Logical data blocks are mapped to physical data blocks

Physical volumes form a volume group:

• LVM manages volume groups as a single entity

Logical volumes are created from a volume group.

Logical Volume Manager (LVM) is software that runs on a compute system and manages
logical and physical storage. LVM is an intermediate layer between the file system and
the physical drives. It can partition a larger-capacity disk into virtual, smaller-capacity volumes (partitioning) or aggregate several smaller disks to form a larger virtual volume
(concatenation). LVMs are mostly offered as part of the OS. The evolution of LVMs
enabled dynamic extension of file system capacity and efficient storage management. The
LVM provides optimized storage access and simplifies storage resource management. It
hides details about the physical disk and the location of data on the disk. It enables
administrators to change the storage allocation even when the application is running.
The basic LVM components are physical volumes, logical volume groups, and logical
volumes. In LVM terminology, each physical disk that is connected to the compute system
is a physical volume (PV). A volume group is created by grouping one or more PVs. A
unique physical volume identifier (PVID) is assigned to each PV when it is initialized for
use by the LVM. Physical volumes can be added or removed from a volume group
dynamically. Each PV is divided into equal-sized data blocks called physical extents when
the volume group is created.
Logical volumes (LVs) are created within a given volume group. An LV can be thought of as a disk partition, whereas the volume group itself can be thought of as a disk. The size of an LV is a multiple of the physical extent size. The LV appears as a physical device to the OS. An LV is made up of physical extents, which may be noncontiguous and may span multiple physical volumes. A file system is created on a logical volume.
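The Python sketch below is offered only as a conceptual illustration (names and sizes are hypothetical, and this is not how an actual LVM is implemented). It models physical volumes divided into fixed-size physical extents, pooled into a volume group, from which a logical volume is allocated and may span multiple physical volumes.

```python
# Toy model of LVM concepts: PVs are split into physical extents (PEs),
# a volume group (VG) pools the PEs, and logical volumes (LVs) are built
# from PEs that may come from different PVs (noncontiguous, spanning).
PE_SIZE_MB = 4  # physical extent size

class VolumeGroup:
    def __init__(self, name):
        self.name = name
        self.free_extents = []          # list of (pv_name, extent_index)
        self.logical_volumes = {}       # lv_name -> list of extents

    def add_physical_volume(self, pv_name, size_mb):
        extents = size_mb // PE_SIZE_MB
        self.free_extents += [(pv_name, i) for i in range(extents)]

    def create_logical_volume(self, lv_name, size_mb):
        needed = -(-size_mb // PE_SIZE_MB)   # round up to whole extents
        if needed > len(self.free_extents):
            raise ValueError("not enough free extents in volume group")
        self.logical_volumes[lv_name] = [self.free_extents.pop(0)
                                         for _ in range(needed)]

vg = VolumeGroup("vg01")
vg.add_physical_volume("pv_disk1", 100)   # hypothetical 100 MB disk
vg.add_physical_volume("pv_disk2", 100)
vg.create_logical_volume("lv_data", 150)  # LV spans both physical volumes
print(len(vg.logical_volumes["lv_data"]), "extents allocated")
```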

Logical Components: LVM Example
Disk partitioning was introduced to improve the flexibility and utilization of disk drives. In
partitioning, a disk drive is divided into logical containers called logical volumes.
For example, a large physical drive can be partitioned into multiple LVs to maintain data
according to the file system and application requirements. The partitions are created from
groups of contiguous cylinders when the hard disk is initially set up on the host. The host’s
file system accesses the logical volumes without any knowledge of partitioning and
physical structure of the disk. Concatenation is the process of grouping several physical
drives and presenting them to the host as one large logical volume.

Logical Components: File System


A file is a collection of related records or data stored as a single named unit in contiguous
logical address space. Files are of different types, such as text, executable, image,
audio/video, binary, library, and archive. Files have various attributes, such as name,
unique identifier, type, size, location, owner, and protection.
A file system is an OS component that controls and manages the storage and retrieval of
files in a compute system. A file system enables easy access to the files residing on a
storage drive, a partition, or a logical volume. It consists of logical structures and software
routines that control access to files. It enables users to perform various operations on files,
such as create, access (sequential/random), write, search, edit, and delete.
A file system typically groups and organizes files in a tree-like hierarchical structure. It
enables users to group files within a logical collection called a directory, which is a container for storing pointers to multiple files. A file system maintains a pointer map to
the directories, subdirectories (if any), and files that are part of the file system. It also
stores all the metadata (file attributes) associated with the files.
A file system block is the smallest unit allocated for storing data. Each file system block is
a contiguous area on the physical disk. The block size of a file system is fixed at the time
of its creation. The file system size depends on the block size and the total number of file system blocks.
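As a small worked example (an illustration, not from the course text), the snippet below shows how file system capacity follows from the block size and block count, and how even a small file consumes whole blocks.

```python
# File system capacity = block size x number of blocks; files are allocated
# in whole blocks, so a 5 KB file on a 4 KB-block file system uses 2 blocks.
import math

block_size = 4 * 1024            # 4 KB blocks, fixed at file system creation
total_blocks = 262_144           # total file system blocks
capacity_bytes = block_size * total_blocks
print(capacity_bytes // (1024**3), "GB capacity")    # 1 GB

file_size = 5 * 1024
print(math.ceil(file_size / block_size), "blocks used for a 5 KB file")  # 2
```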
File systems may be broadly classified as follows:

• Disk-based
• Network-based
• Virtual

A disk-based file system manages the files stored on storage devices such as solid-state
drives, disk drives, and optical drives. Examples of disk-based file systems are Microsoft
NT File System (NTFS), Apple Hierarchical File System (HFS) Plus, Extended File System
family for Linux, Oracle ZFS, and Universal Disk Format (UDF).
A network-based file system uses networking to enable file system access between
compute systems. Network-based file systems may use either the client/server model, or
may be distributed/clustered. In the client/server model, the file system resides on a
server, and is accessed by clients over the network. The client/server model enables
clients to mount the remote file systems from the server.
NFS for the UNIX environment and CIFS for the Windows environment (both covered in the module ‘File-based Storage System (NAS)’) are two standard client/server file sharing protocols.
Examples of network-based file systems are: Microsoft Distributed File System (DFS),
Hadoop Distributed File System (HDFS), VMware Virtual Machine File System (VMFS),
Red Hat GlusterFS, and Red Hat CephFS.
A virtual file system is a memory-based file system. It enables compute systems to transparently access different types of file systems on local and network storage devices. It provides an abstraction layer that enables applications to access different types of file systems in a uniform way. It bridges the differences between file systems for different operating systems, without the application needing to know the type of file system it is accessing. Examples of virtual file systems are Linux Virtual File System (VFS) and Oracle CacheFS.

Compute and Desktop
Virtualization
What is Compute Virtualization?
Definition: Compute Virtualization
The technique of abstracting the physical compute hardware from the operating
system and applications enabling multiple operating systems to run concurrently
on a single or clustered physical compute system(s).

Compute virtualization is a technique of abstracting the physical hardware of a compute


system from the operating system (OS) and applications. The decoupling of the physical
hardware from the OS and applications enables multiple operating systems to run
concurrently on a single or clustered physical compute system(s).
Compute virtualization enables the creation of virtual compute systems called virtual
machines (VMs). Each VM runs an OS and applications, and is isolated from the other
VMs on the same compute system. Compute virtualization is achieved by a hypervisor,
which is virtualization software that is installed on a physical compute system. The
hypervisor provides virtual hardware resources, such as CPU, memory, storage, and
network resources to all the VMs. Depending on the hardware capabilities, many VMs can
be created on a single physical compute system.
A VM is a logical entity; but to the OS running on the VM, it appears as a physical compute
system, with its own processor, memory, network controller, and disks. However, all VMs
share the same underlying physical hardware of the compute system. The hypervisor
allocates the compute system’s hardware resources dynamically to each VM.
From a hypervisor’s perspective, each VM is a discrete set of files.

Need for Compute Virtualization


Need for Compute Virtualization
Before Virtualization

In an x86-based physical compute system, the software and hardware are tightly coupled, and the system can run only one OS at a time. A physical compute system often faces resource
conflicts when multiple applications running on the compute have conflicting
requirements. Moreover, many applications do not take full advantage of the hardware
capabilities available to them.
Resources such as processors, memory, and storage frequently remain underutilized. Many compute systems also require complex network cabling and have considerable floor space and power requirements. Hardware configuration, provisioning, and management become complex and require more time. A physical compute system is a single point of failure because its failure leads to application unavailability.
Compute virtualization enables overcoming these challenges by allowing multiple
operating systems and applications to run on a single compute system. It converts
physical machines to virtual machines and consolidates the converted machines onto a
single compute system. Server consolidation improves resource utilization and enables organizations to run their data center with fewer machines. This server consolidation, in turn, reduces hardware acquisition and operational costs, and saves data center space and energy.
Compute virtualization increases the management efficiency and reduces the
maintenance time. The creation of VMs takes less time compared to a physical compute
setup. Organizations can provision compute resources faster, and with greater ease, to meet growing resource requirements. Individual VMs can be restarted, upgraded, or even fail, without affecting the other VMs on the same physical compute system. Moreover,
VMs are portable and can be copied or moved from one physical compute to another
without causing application unavailability.
Drawbacks:

• IT silos and underutilized resources


• Inflexible and expensive
• Management inefficiencies
• Risk of downtime

After Virtualization

Benefits:

• Server consolidation and improved resource utilization


• Flexible infrastructure at lower costs
• Increased management efficiency
• Increased availability and improved business continuity

What is a Hypervisor?

Definition: Hypervisor
Software that provides a virtualization layer for abstracting compute system
hardware, and enables the creation of multiple virtual machines.
A hypervisor is compute virtualization software that is installed on a compute system. It
provides a virtualization layer that abstracts the processor, memory, network, and storage
of the compute system and enables the creation of multiple virtual machines. Each VM
runs its own OS, which essentially enables multiple operating systems to run concurrently
on the same physical compute system. The hypervisor provides standardized hardware
resources to all the VMs.
A hypervisor has two key components: kernel and virtual machine manager (VMM). A
hypervisor kernel provides the same functionality as the kernel of any OS, including
process management, file system management, and memory management. It is designed
and optimized to run multiple VMs concurrently. It receives requests for resources through the VMM, and presents the requests to the physical hardware. Each virtual machine is
assigned a VMM that gets a share of the processor, memory, I/O devices, and storage
from the physical compute system to successfully run the VM.
Hypervisors are categorized into two types: bare-metal (Type I) and hosted (Type II). A
bare-metal hypervisor is directly installed on the physical compute hardware in the same
way as an OS. It has direct access to the hardware resources of the compute system and
is therefore more efficient than a hosted hypervisor. A bare-metal hypervisor is designed
for enterprise data centers and third-platform infrastructure. It also supports advanced capabilities such as resource management, high availability, and security. The image
represents a bare-metal hypervisor. A hosted hypervisor is installed as an application on
an operating system. The hosted hypervisor does not have direct access to the hardware,
and all requests pass through the OS running on the physical compute system.
There are two key components to a hypervisor:
Hypervisor Kernel

• Provides functionality similar to an OS kernel.


• Presents resource requests to physical hardware.

Virtual machine manager (VMM)

• Each VM is assigned a VMM.


• Abstracts physical hardware and presents to VM.
There are also two types of hypervisor: bare-metal and hosted

What is a Virtual Machine?

Definition: Virtual Machine (VM)


A logical compute system with virtual hardware on which a supported guest OS and its applications run.
A virtual machine (VM) is a logical compute system with virtual hardware on which a
supported guest OS and its applications run. A VM is created by a hosted or a bare-metal
hypervisor installed on a physical compute system. An OS, called a “guest OS”, is installed
on the VM in the same way it is installed on a physical compute system. From the
perspective of the guest OS, the VM appears as a physical compute system.
A VM has a self-contained operating environment, comprising OS, applications, and
virtual hardware, such as a virtual processor, virtual memory, virtual storage, and virtual
network resources. As discussed previously, a dedicated virtual machine manager (VMM)
is responsible for the execution of a VM. Each VM has its own configuration for hardware,
software, network, and security. The hypervisor translates the VM’s resource requests
and maps the virtual hardware of the VM to the hardware of the physical compute system.
For example, a VM’s I/O requests to a virtual disk drive are translated by the hypervisor and mapped to a file on the physical compute system’s disk drive.
Compute virtualization software enables creating and managing several VMs, each with its own OS, on a physical compute system or on a compute cluster.
VMs are created on a compute system, and provisioned to different users to deploy their
applications. The VM hardware and software are configured to meet the application’s
requirements. The different VMs are isolated from each other, so that the applications and
the services running on one VM do not interfere with those running on other VMs. The
isolation also provides fault tolerance so that if one VM crashes, the other VMs remain
unaffected.
Important points about a VM:

• Created by a hypervisor installed on a physical compute system


• Comprises virtual hardware, such as virtual processor, virtual storage, and virtual network
resources
• Appears as a physical compute system to the guest OS
• Hypervisor maps the virtual hardware to the physical hardware
• VMs on a compute system are isolated from each other

VM Hardware

When a VM is created, it is presented with virtual hardware components that appear as


physical hardware components to the guest OS. Within a given vendor’s environment,
each VM has standardized hardware components that make them portable across
physical compute systems.
Based on the requirements, the virtual components can be added or removed from a VM.
However, not all components are available for addition and configuration. Some hardware
devices are part of the virtual motherboard and cannot be modified or removed. For
example, the video card and the PCI controllers are available by default and cannot be
removed.
The image shows the typical hardware components of a VM. These components include:
virtual processor(s), virtual motherboard, virtual RAM, virtual disk, virtual network adapter,
optical drives, serial and parallel ports, and peripheral devices.

A VM can be configured with one or more virtual processors. Each VM is assigned a virtual
motherboard with the standardized devices essential for a compute system to function.
Virtual RAM is the amount of physical memory allocated to a VM, and it can be configured
based on the requirements. The virtual disk is a large physical file, or a set of files, that stores the VM’s OS, program files, application data, and other data associated with the
VM. A virtual network adapter functions like a physical network adapter. It provides
connectivity between VMs running on the same or different compute systems, and
between a VM and physical compute systems.
Virtual optical drives and floppy drives can be configured to connect to either physical
devices or to image files, such as ISO on the storage. SCSI/IDE virtual controllers provide
a way for the VMs to connect to the storage devices. The virtual USB controller is used to
connect to a physical USB controller and to access the connected USB devices. Serial
and parallel ports provide an interface for connecting peripherals to the VM.

VM Files
From a hypervisor’s perspective, a VM is a discrete set of files on a storage device. Some
of the key files that make up a VM are the configuration file, the virtual disk file, the memory state file, the snapshot file, and the log file. The configuration file stores the VM’s configuration information,
including VM name, location, BIOS information, guest OS type, virtual disk parameters,
number of processors, memory size, number of adapters and associated MAC addresses,
SCSI controller type, and disk drive type. The virtual disk file stores the contents of a VM’s
disk drive. A VM can have multiple virtual disk files, each of which appears as a separate
disk drive to the VM.
The memory state file stores the memory contents of a VM and is used to resume a VM
that is in a suspended state. The snapshot file stores the running state of the VM including
its settings and the virtual disk, and may optionally include the memory state of the VM. It
is typically used to revert the VM to a previous state. Log files are used to keep a record
about the VM’s activity and are often used for troubleshooting purposes.

From a hypervisor’s perspective, a VM is a discrete set of files on a storage device. These


files are:
Configuration file: Stores information such as VM name, BIOS information, guest OS type, and memory size
Virtual disk file: Stores the contents of the VM's disk drive
Memory state file: Stores the memory contents of a VM in a suspended state
Snapshot file: Stores the VM settings and virtual disk of a VM
Log file: Keeps a log of the VM’s activity and is used in troubleshooting
For managing VM files, a hypervisor may use a native clustered file system, or the
Network File System (NFS). A hypervisor’s native clustered file system is optimized to
store VM files. It may be deployed on Fibre Channel and iSCSI storage, apart from the
local storage. The virtual disks are stored as files on the native clustered file system.
Network File System enables storing of VM files on remote file servers (NAS device)
accessed over an IP network. The NFS client built into the hypervisor uses the NFS
protocol to communicate with the NAS device.

What is Desktop Virtualization?
Definition: Desktop Virtualization
Technology that decouples the OS, applications, and user state from a physical
compute system to create a virtual desktop environment that can be accessed from
any client device.
With the traditional desktop machine, the OS, applications, and user profiles are all tied to
a specific piece of hardware. With legacy desktops, business productivity is impacted
greatly when a client device is broken or lost. Managing a vast desktop environment is
also a challenging task.
Desktop virtualization decouples the OS, applications, and user state (profiles, data, and
settings) from a physical compute system. These components, collectively called a virtual
desktop, are hosted on a remote compute system. It can be accessed by a user from any
client device, such as laptops, desktops, thin clients, or mobile devices. A user accesses
the virtual desktop environment over a network on a client through a web browser or a
client application.
The OS and applications of the virtual desktop execute on the remote compute system,
while a view of the virtual desktop’s user interface (UI) is presented to the end-point
device. Desktop virtualization uses a remote display protocol to transmit the virtual
desktop’s UI to the end-point devices. The remote display protocol also sends back
keystrokes and graphical input information from the end-point device, enabling the user
to interact with the virtual desktop.
Some key benefits of desktop virtualization are:

• Simplified desktop infrastructure management: Desktop virtualization simplifies


desktop infrastructure management, and creates an opportunity to reduce the
maintenance costs. New virtual desktops can be configured and deployed faster than
physical machines. The patches, updates, and upgrades can be centrally applied to the
OS and applications. This process simplifies or eliminates many redundant, manual, and
time-consuming tasks.
• Improved data protection and compliance: Applications and data are located centrally,
which ensures that business-critical data is not at risk if there is loss or theft of the device.
Virtual desktops are also easier to back up compared to deploying backup solutions on
end-point devices.
• Flexibility of access: Desktop virtualization enables users to access their desktops and
applications without being bound to a specific end-point device. The virtual desktops can
be accessed remotely from different end-point devices. These benefits create a flexible
work scenario and enable user productivity from remote locations. Desktop virtualization
also enables Bring Your Own Device (BYOD), which creates an opportunity to reduce
acquisition and operational costs.

Use Cases for Compute and Desktop
Virtualization

Cloud application streaming: Cloud application streaming employs application virtualization to


stream applications from the cloud to client devices. Streaming applications from the cloud enable
organizations to reach more users on multiple devices, without modifying the application code.
The application is deployed on a cloud infrastructure, and the output is streamed to client devices,
such as desktops, tablets, and mobile phones. Because the application runs in the cloud, it can
flexibly scale to meet the massive growth in processing and storage needs, regardless of the client
devices the end users are using. The cloud service can stream either all or portions of the
application from the cloud. Cloud application streaming enables an application to be delivered to
client devices on which it may not be possible to run the application natively.
Desktop as a Service: Desktop as a Service (DaaS) is a cloud service in which a virtual desktop
infrastructure (VDI) is hosted by a cloud service provider. The provider offers a complete,
business-ready VDI solution, delivered as a cloud service with either subscription-based, or pay-
as-you-go billing. The service provider (internal IT or public) manages the deployment of the virtual
desktops, data storage, backup, security, and OS updates/upgrades. The virtual desktops are
securely hosted in the cloud and managed by the provider. DaaS has a multitenant architecture,
wherein virtual desktops of multiple users share the same underlying infrastructure. However,
individual virtual desktops are isolated from each other and protected against unauthorized access
and crashes on other virtual desktops. The virtual desktops can be easily provisioned by
consumers, and they are delivered over the Internet to any client device. DaaS provides
organizations with a simple, flexible, and efficient approach to IT. It enables organizations to lower CAPEX and OPEX for acquiring and managing end-user computing infrastructure.

Compute and desktop virtualization provide several benefits to organizations and facilitate
the transformation to the modern data center. Two use cases are described below.
Cloud Application Streaming:
• Streaming applications from the cloud to diverse client devices
• Applications flexibly scale to meet growth in processing and storage needs
• Applications can be delivered to devices on which they may not run natively

Desktop as a Service (DaaS):
• Cloud service in which a VDI is hosted by a cloud service provider
• Provider manages VDI and OS updates
• Facilitates CAPEX and OPEX savings

Storage and Network
Evolution of Server-centric Storage
Architecture (Internal DAS)
In a traditional environment, business units/departments in an organization have their own servers
running the business applications of the respective business unit/department. Storage devices are
connected directly to the servers and are typically internal to the server. Because these storage devices cannot be shared with any other server, this is called server-centric storage architecture (internal DAS).
In this architecture, each server has a limited number of storage devices. The storage device
exists only in relation to the server to which it is connected.
The figure depicts an example of server-centric architecture. In the image, the servers of different
departments in an organization have directly connected storage, and clients connect to the servers
over a local area network (LAN) or a wide area network (WAN).

Evolution of Information-centric Storage


Architecture (SAN)
To overcome the challenges of the server-centric architecture, storage evolved to the information-
centric architecture. In information-centric architecture (SAN), storage devices exist independently
of servers, and are managed centrally and shared between multiple compute systems.
Storage devices assembled within storage systems form a storage pool, and several compute
systems access the same storage pool over a specialized, high-speed storage area network
(SAN). A SAN is used for information exchange between compute systems and storage systems,
and for connecting storage systems. It enables compute systems to share storage resources,
improve the utilization of storage systems, and facilitate centralized storage management.

SANs are classified based on protocols they support. Common SAN deployment types are Fibre
Channel SAN (FC SAN), Internet Protocol SAN (IP SAN), and Fibre Channel over Ethernet SAN
(FCoE SAN). These topics are covered later in the course.
The figure depicts an example of information-centric architecture. In the image, the servers of
different departments in an organization are connected to the shared storage over a SAN. The
clients connect to the servers over a LAN or a WAN. When a new server is deployed in the
environment, storage is assigned to the server from the same shared pool of storage devices. The
storage capacity can be increased dynamically and without impacting information availability by
adding storage devices to the pool.
This architecture improves the overall storage capacity utilization, while making management of
information and storage more flexible and cost-effective.

Types of Storage Devices


Magnetic disk drive:
• Stores data on a circular disk with a ferromagnetic coating
• Provides random read/write access
• Most popular storage device with large storage capacity

Solid-state (flash) drive:
• Stores data on semiconductor-based memory
• Very low latency per I/O, low power requirements, and very high throughput

Magnetic tape drive:
• Stores data on a thin plastic film with a magnetic coating
• Provides only sequential data access
• Low-cost solution for long-term data storage

Optical disc drive:
• Stores data on a polycarbonate disc with a reflective coating
• Write Once, Read Many capability: CD, DVD, BD
• Low-cost solution for long-term data storage

Overview of Storage Virtualization
Abstracts physical storage resources to create virtual storage resources:

• Virtual volumes
• Virtual disk files
• Virtual storage systems
Storage virtualization software can be:

• Built into the operating environment of a storage system


• Installed on an independent compute system
• Built into a hypervisor

Introduction to Connectivity
Communication paths between IT infrastructure components for information exchange
and resource sharing.
Types of connectivity:

• Compute-to-compute connectivity
• Compute-to-storage connectivity

Compute-to-Compute Connectivity
Compute-to-compute connectivity typically uses protocols based on the Internet Protocol (IP).
Each physical compute system is connected to a network through one or more host interface
devices, called a network interface controller (NIC). Physical switches and routers are the
commonly used interconnecting devices. A switch enables different compute systems in the
network to communicate with each other.
A router is an OSI Layer-3 device that enables different networks to communicate with each other.
The commonly used network cables are copper cables and optical fiber cables.
The figure shows a network (LAN or WAN) that provides interconnections among the physical
compute systems. It is necessary to ensure that appropriate switches and routers, with adequate
bandwidth and ports, are available to provide the required network performance.

Compute-to-Storage Connectivity
Storage may be connected directly to a compute system or over a SAN as discussed previously
in this lesson. Connectivity and communication between compute and storage are enabled
through physical components and interface protocols. The physical components that connect
compute to storage are host interface device, port, and cable.
Host bus adapter: A host bus adapter (HBA) is a host interface device that connects a compute
system to storage or to a SAN. It is an application-specific integrated circuit (ASIC) board. It
performs I/O interface functions between a compute system and storage, relieving the processor from additional I/O processing workload. A compute system typically contains multiple HBAs.
Port: A port is a specialized outlet that enables connectivity between the compute system and
storage. An HBA may contain one or more ports to connect the compute system to the storage.
Cables connect compute systems to internal or external devices using copper or fiber optic media.

What is a Protocol?

Definition: Protocols
Define formats for communication between devices. Protocols are implemented
using interface devices (or controllers) at both the source and the destination
devices.
Fibre Channel (FC):
• Widely used protocol for high-speed compute-to-storage communication
• Provides serial data transmission that operates over copper wire and/or optical fiber

Internet Protocol (IP):
• Existing IP-based network leveraged for storage communication
• Examples: iSCSI and FCIP protocols

Overview of Network Virtualization
Abstracts physical network resources to create virtual network resources:

• Virtual switch
• Virtual LAN
• Virtual SAN
Network virtualization software can be:

• Built into the operating environment of a network device


• Installed on an independent compute system
• Built into a hypervisor

Applications
Application Overview

Definition: Application
A software program or set of programs that is designed to perform a group of
coordinated tasks.

Anyone who uses computers or smartphones uses applications every day. From reading your email to posting pictures on Facebook or writing a tweet on Twitter, you are using an application.
For the business, applications unlock value from the digital world. A great application reshapes user experiences and creates touch points for getting the information you want. Applications are crucial in how businesses provide value to their
customers, which drives fundamental business objectives. Applications manage the
information and provide it in a form that is useful to the business to meet specific
requirements.
Examples:

• Customer relationship management (CRM)


• Enterprise Resource Planning (ERP)
• Email such as Microsoft Outlook


Modern Applications
Modern applications consist of a set of business-related functional parts, called microservices,
that are assembled with specific rules and best practices.

• Modern Applications
• Microservices

A modern application requires a dynamic modern infrastructure platform. This platform is


programmable, in line with attributes such as on-demand self-service, pooling of resources,
virtualization, accessibility, and scalability.
Modern applications deliver services in hours, not weeks or months, which is essential in the new world of digital business. Long-term technology commitments are reduced because it is easier to replace particular modules in a modern application.
Examples: Facebook, Uber, and Netflix

Microservice architecture, or microservices, is a distinctive method of developing software


systems that has grown in popularity in recent years. In this architecture, the application
is decomposed into small, loosely coupled, and independently operating services. A
microservice runs in its own process and communicates to other services through REST
APIs.
Every microservice can be deployed, upgraded, scaled, and restarted independent of
other services in the application. When managed by an automated system, teams can
frequently update live applications without negatively impacting users.
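As an illustration only (assuming Python with the Flask web framework, which the course does not prescribe; service, route, and data names are invented), the sketch below shows a minimal microservice that runs in its own process and exposes its function through a REST API.

```python
# Minimal "inventory" microservice: it owns its own data and exposes it only
# through a REST API, so other services can call it over HTTP and it can be
# deployed, scaled, and restarted independently of them.
from flask import Flask, jsonify

app = Flask(__name__)

# In-memory stand-in for the service's own datastore.
inventory = {"disk-10TB": 42, "ssd-2TB": 17}

@app.route("/inventory/<item>", methods=["GET"])
def get_item(item):
    if item not in inventory:
        return jsonify({"error": "unknown item"}), 404
    return jsonify({"item": item, "quantity": inventory[item]})

if __name__ == "__main__":
    # Another service could call: GET http://localhost:5000/inventory/ssd-2TB
    app.run(port=5000)
```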

Traditional vs. Modern Applications

Traditional applications are monolithic, meaning the modules are interdependent.


Changing one affects the others. Modern applications are designed to run independently.
These independent and distributed runtime modules that make up an application are
termed microservices.
Traditional applications are generally built using a single programming language and framework. Because modern application modules are decomposed, multiple programming languages can be used to develop these applications.
The source code for traditional applications is commonly commercial off-the-shelf, or custom developed in-house, such as Oracle Financials. Modern applications often use open-source software or a freemium model, where the code is available as open source but support and enhancements can be purchased.

In a traditional application environment, the infrastructure manages resiliency against hardware failure and the scalability of the application. A modern application handles component failure and scalability itself, using distributed system architectures to drive high availability.
Traditional Application Characteristics vs. Modern Application Characteristics:

• Monolithic vs. Distributed
• Common programming language vs. Multiple programming languages
• Resiliency and scale are infrastructure managed vs. Resiliency and scale are application managed
• Infrastructure is application-specific vs. Infrastructure is application-agnostic
• PC-based devices vs. Large variety of devices (BYOD)
• Separate Build/Test/Run vs. DevOps, continuous development and deployment
• Examples: CRM, ERP, and email (Microsoft Outlook) vs. Examples: Facebook, Uber, and Netflix

What is Application Virtualization?


Definition: Application Virtualization
The technique of decoupling an application from the underlying computing platform
(operating system and hardware) to enable the application to be used on a compute
system without installation.
Some key benefits of application virtualization are described below.

• Simplified application management: Application virtualization provides a solution to


meet an organization’s need for simplified and improved application deployment, delivery,
and manageability.
• Eliminate OS modifications: Since application virtualization decouples an application
from the OS, it leaves the underlying OS unaltered. This process provides additional
security, and protects the OS from potential corruptions and problems that may arise due
to changes to the file system and registry.
• Resolve application conflicts and compatibility issues: Application virtualization
enables the use of conflicting applications on the same end-point device. It also enables
the use of applications that otherwise do not execute on an end-point device due to
incompatibility with the underlying computing platform.
• Simplified OS image management: Application virtualization simplifies OS image
management. Since application delivery is separated from the OS, there is no need to
include "standard" applications in end-point images. As a result, managing images is
simpler, especially in the context of OS patches and upgrades.

• Flexibility of access: Application virtualization enables an organization’s workforce and
customers to access applications hosted on a remote compute system from any location,
and through diverse end-point devices types.

Application Virtualization Techniques


Listed are three techniques for application virtualization.

• Application Encapsulation
• Application Presentation
• Application Streaming

In application encapsulation, an application is aggregated within a virtualized container, along with


the assets, such as files, virtual registry, and class libraries that it requires for execution. This
process, known as packaging or sequencing, converts an application into a standalone, self-
contained executable package that can directly run on a compute system. The assets required for
execution are included within the virtual container. Therefore, the application does not have any
dependency on the underlying OS, and does not require a traditional installation on the compute
system.
The application’s virtual container isolates it from the underlying OS and other applications,
thereby minimizing application conflicts. During application execution, all function calls made by
the application to the OS for assets get redirected to the assets within the virtual container. The
application is thus restricted from writing to the OS file system or registry, or modifying the OS in
any other way.

In application presentation, an application’s user interface (UI) is separated from its execution. The
application executes on a remote compute system, while its UI is presented to an end-point client
device over a network. When a user accesses the application, the screen pixel information and the
optional sound for the application are transmitted to the client. A software agent installed on the
client receives this information and updates the client’s display. The agent also transmits the
keystrokes and graphical input information back from the client, allowing the user to control the
application.
This process makes it appear as if the application is running on the client when, in fact, it is running
on the remote compute system. Application presentation enables the delivery of an application on
devices that have less computing power than what is normally required to execute the application.
In application presentation, application sessions are created in the remote compute system and a
user connects to an individual session from a client by means of the software agent. Individual
sessions are isolated from each other, which secures each user's data and also protects sessions from application crashes in other sessions.
In application streaming, an application is deployed on a remote compute system, and is
downloaded in portions to an end-point client device for local execution. A user typically launches
the application from a shortcut, which causes the client to connect to the remote compute system
to start the streaming process. Initially, only a limited portion of the application is downloaded into
memory. This portion is sufficient to start the execution of the application on the client.
Since a limited portion of the application is delivered to the client before the application starts, the
user experiences rapid application launch. The streaming approach also reduces network traffic.
As the user accesses different application functions, more of the application is downloaded to the client. The additional portions of the application may also be downloaded in the background
without user intervention. Application streaming requires an agent or client software on clients.
Alternatively, the application may be streamed to a web browser by using a plug-in installed on the
client. In some cases, application streaming enables offline access to the application by caching
them locally on the client.

Software-Defined Data Center (SDDC)
What is a Software-Defined Data Center?

Definition: Software-Defined Data Center (SDDC)

An architectural approach to IT infrastructure that extends virtualization concepts such as abstraction,


pooling, and automation to all of the data center’s resources and services to achieve IT as a service.

In an SDDC, compute, storage, networking, security, and availability services are pooled, aggregated, and
delivered as a service. SDDC services are managed by intelligent, policy-driven software. SDDC is a vision
that can be interpreted in many ways and can be implemented by numerous concrete architectures.

Typically, an SDDC is viewed as a conglomeration of virtual infrastructure components, among which are
software-defined compute (compute virtualization), software-defined network (SDN), and software-
defined storage (SDS).

SDDC is viewed as an important step in the progress towards a complete virtualized data center (VDC),
and is regarded as the necessary foundational infrastructure for the modern data center.

SDDC Architecture
The software-defined approach separates the control or management functions from the underlying
components and provides them to external software. The external software takes over the control operations
and enables centralized management of multi-vendor infrastructure components.

Principally, a physical infrastructure component (compute, network, and storage) has a control path and
a data path. The control path sets and manages the policies for the resources, and the data path performs
the transmission of data. The software-defined approach decouples the control path from the data path.
By abstracting the control path, the resource management function operates at the control layer. This layer
provides the ability to partition the resource pools and manage them uniquely by policy.

This decoupling of the control path and data path enables the centralization of data provisioning and
management tasks through software that is external to the infrastructure components. The software runs
on a centralized compute system or a stand-alone device, called the software-defined controller. The
figure illustrates the software-defined architecture, where the management function is abstracted from
the underlying infrastructure components by using a controller.

76
Software-Defined Controller
A software-defined controller is software with built-in intelligence that automates provisioning and
configuration based on the defined policies. It enables organizations to dynamically, uniformly, and easily
modify and manage their infrastructure.

The controller discovers the available underlying resources and provides an aggregated view of resources.
It abstracts the underlying hardware resources (compute, storage, and network) and pools them. This
enables the rapid provisioning of resources from the pool based on predefined policies that align to the
service level agreements for different consumers.

The controller provides a single control point to the entire infrastructure enabling policy-based
infrastructure management. The controller enables an administrator to use a software interface to
manage the resources, node connectivity, and traffic flow; control behavior of underlying components;
apply policies uniformly across the infrastructure components; and enforce security.

The controller also provides interfaces that enable applications, external to the controller, to request
resources and access these resources as services.
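The following is a minimal, hypothetical Python sketch of this controller idea: resources are discovered, pooled, and then provisioned from the pool against a policy. The class names and the "gold-storage" policy are illustrative assumptions only and do not correspond to any real product API.

# Hypothetical sketch of a software-defined controller: discover resources,
# pool them, and provision from the pool according to a defined policy.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    kind: str        # "compute", "storage", or "network"
    capacity: int    # e.g., GB for storage, vCPUs for compute

class SoftwareDefinedController:
    def __init__(self):
        self.pool = []          # aggregated view of discovered resources
        self.policies = {}      # policy name -> (kind, minimum capacity)

    def discover(self, resources):
        """Abstract the underlying hardware and add it to the resource pool."""
        self.pool.extend(resources)

    def define_policy(self, name, kind, min_capacity):
        self.policies[name] = (kind, min_capacity)

    def provision(self, policy_name):
        """Allocate the first pooled resource that satisfies the policy."""
        kind, min_capacity = self.policies[policy_name]
        for r in self.pool:
            if r.kind == kind and r.capacity >= min_capacity:
                self.pool.remove(r)
                return r
        raise RuntimeError("No resource satisfies policy " + policy_name)

# Usage: pool two storage systems, then provision per an assumed "gold" policy.
controller = SoftwareDefinedController()
controller.discover([Resource("array-A", "storage", 500),
                     Resource("array-B", "storage", 2000)])
controller.define_policy("gold-storage", "storage", 1000)
print(controller.provision("gold-storage").name)   # array-B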

Benefits of Software-Defined Architecture


By extending virtualization throughout the data center, SDDC provides several benefits to the
organizations. Some key benefits are described here:
Benefit                  Description
Agility                  • On-demand self-service
                         • Faster resource provisioning
Cost efficiency          • Use of the existing infrastructure and commodity hardware
                           lowers CAPEX
Improved control         • Policy-based governance
                         • Automated Business Continuity (BC) / Disaster Recovery (DR)
                         • Support for operational analytics
Centralized management   • Unified management platform for centralized monitoring and
                           administration
Flexibility              • Use of commodity and advanced hardware technologies
                         • Hybrid cloud support

77

78
Modern Data Center Infrastructure
Architecture

79
Modern Data Center Infrastructure
The image is a block diagram depicting the core IT infrastructure building blocks that make up a data
center.

The IT infrastructure is arranged in five logical layers and three cross-layer functions. The five layers are
physical infrastructure, virtual infrastructure, software-defined infrastructure, orchestration, and services.
Each of these layers has various types of hardware and/or software components as shown in the figure.

The three cross-layer functions are business continuity, security, and management. Business continuity
and security functions include mechanisms and processes that are required to provide reliable and secure
access to applications, information, and services. The management function includes various processes
that enable the efficient administration of the data center and the services for meeting business
requirements.

Applications that are deployed in the data center may be a combination of internal applications, business
applications, and modern applications that are either custom-built or off-the-shelf.

The fulfillment of the five essential cloud characteristics ensures the infrastructure can be transformed
into a cloud infrastructure that could be either private or public. Further, by integrating cloud extensibility,
the infrastructure can be connected to an external cloud to leverage the hybrid cloud model.

80
81
Physical Infrastructure

The physical infrastructure forms the foundation layer of a data center. It includes equipment such as
compute systems, storage systems, and networking devices.

The layer also includes the operating systems, system software, protocols, and tools that enable the
physical equipment to perform its functions.

A key function of physical infrastructure is to execute the requests generated by the virtual and software-
defined infrastructure.

Additional functions are: storing data on the storage devices, performing compute-to-compute
communication, executing programs on compute systems, and creating backup copies of data.

Foundation layer of the data center infrastructure


Physical components are: compute systems, storage, and network devices; they require operating
systems, system software, and protocols for their functions.
Executes the requests generated by the virtual and software-defined layers

Virtual Infrastructure

Virtualization is the process of abstracting physical resources, such as compute, storage, and network, and
creating virtual resources from them. Virtualization is achieved by using virtualization software that is
deployed on compute systems, storage systems, and network devices.

Virtualization software aggregates physical resources into resource pools from which it creates virtual
resources. A resource pool is an aggregation of computing resources, such as processing power, memory,
storage, and network bandwidth.

For example, storage virtualization software pools the capacity of multiple storage devices to create a
single large storage capacity. Similarly, compute virtualization software pools the processing power and
memory capacity of physical compute systems, creating an aggregation of the power of all processors (in
megahertz) and all memory (in megabytes). Examples of virtual resources
include virtual compute (virtual machines), virtual storage (LUNs), and virtual networks.

Virtualization enables a single hardware resource to support multiple concurrent instances of systems, or
multiple hardware resources to support a single instance of system. For example, a single disk drive can
be partitioned and presented as multiple disk drives to a compute system. Similarly, multiple disk drives
can be concatenated and presented as a single disk drive to a compute system.

82
Note: While deploying a data center, an organization may choose not to deploy virtualization. In such an
environment, the software-defined layer is deployed directly over the physical infrastructure. Further, it is
also possible that part of the infrastructure is virtualized and the rest is not.

Virtualization abstracts physical resources and creates virtual resources.


Virtual components:

• Virtual compute, virtual storage, and virtual network.


• Created from physical resource pools using virtualization software

Benefits of virtualization:

• Resource consolidation and multitenant environment


• Improved resource utilization and increased ROI
• Flexible resource provisioning and rapid elasticity

Software-Defined Infrastructure
Deployed either on virtual layer or on physical layer

All infrastructure components are virtualized and aggregated into pools.

• Underlying resources are abstracted from applications


• Enables ITaaS

Centralized, automated, and policy-driven management and delivery of heterogeneous resources

Components:

• Software-defined compute
• Software-defined storage
• Software-defined network

The software-defined infrastructure layer is deployed either on the virtual layer or on the
physical layer. In the software-defined approach, all infrastructure components are virtualized
and aggregated into pools. This approach abstracts all underlying resources from
applications.
The software-defined approach enables ITaaS, in which consumers provision all infrastructure
components as services. It centralizes and automates the management and delivery of
heterogeneous resources based on policies. The key architectural components in the software-
defined approach include software-defined compute (equivalent to compute virtualization),
software-defined storage (SDS), and software-defined network (SDN).

83
Orchestration
Component: orchestration software, which provides:

• Workflows for executing automated tasks


• Interaction with various components across layers and functions to invoke provisioning
tasks

The orchestration layer includes the orchestration software. The key function of this layer is to
provide workflows for executing automated tasks to accomplish a desired outcome. Workflow
refers to a series of interrelated tasks that perform a business operation. The orchestration
software enables this automated arrangement, coordination, and management of the tasks. This
function helps to group and sequence tasks with dependencies among them into a single,
automated workflow.
An orchestration workflow is defined for each service listed in the service catalog. When a service is
selected from the service catalog, the associated workflow in the orchestration layer is triggered. Based
on this workflow, the orchestration software interacts with the components across the software-defined
layer and the BC, security, and management functions to execute the provisioning tasks.
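As a rough illustration of a workflow as a sequence of interrelated tasks, here is a hedged Python sketch. The task names and the "application hosting" service are assumed examples for this illustration, not part of any real orchestrator.

# Hypothetical sketch of an orchestration workflow: interrelated tasks are
# grouped and sequenced, then executed in order as one automated workflow.

def provision_vm():
    print("Provision virtual machine")

def attach_storage():
    print("Provision and attach storage capacity")

def configure_network():
    print("Configure the virtual network")

def apply_security_policy():
    print("Apply security policies")

# Workflow associated with an assumed "application hosting" catalog entry.
application_hosting_workflow = [
    provision_vm,
    attach_storage,
    configure_network,
    apply_security_policy,
]

def run_workflow(workflow):
    """Execute the tasks in sequence; a real orchestrator would also handle
    dependencies, error handling, and rollback."""
    for task in workflow:
        task()

run_workflow(application_hosting_workflow)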

Services
Delivers IT resources as services to users:

• Enables users to achieve desired business results


• Users have no liabilities associated with owning the resources

Components:

• Service catalog
• Self-service portal

Functions of service layer:

• Stores service information in the service catalog and presents it to the users
• Enables users to access services using a self-service portal

Similar to a cloud service, an IT service is a means of delivering IT resources to the end users
to enable them to achieve the desired business results and outcomes without having any
liabilities such as risks and costs associated with owning the resources. Examples of services
are application hosting, storage capacity, file services, and email. The service layer is accessible
to applications and end users.

84
This layer includes a service catalog that presents the information about all the IT resources
being offered as services. The service catalog is a database of information about the services,
including their description, types, cost, supported SLAs, and security mechanisms.
The provisioning and management requests are passed on to the orchestration layer, where the
orchestration workflows—to fulfill the requests—are defined.

Business Continuity
Enables ensuring the availability of services in line with SLA

Supports all the layers to provide uninterrupted services

Includes adoption of measures to mitigate the impact of downtime

Measure      Description
Proactive    • Business impact analysis
             • Risk assessment
             • Technology solutions deployment (backup and replication)
Reactive     • Disaster recovery
             • Disaster restart

The business continuity (BC) cross-layer function specifies the adoption of proactive and reactive
measures that enable an organization to mitigate the impact of downtime due to planned and
unplanned outages.
The proactive measures include activities and processes such as business impact analysis, risk
assessment, and technology solutions such as backup, archiving, and replication.
The reactive measures include activities and processes such as disaster recovery and disaster restart
to be invoked in the event of a service failure.
This function supports all the layers—physical, virtual, software-defined, orchestration, and
services—to provide uninterrupted services to the consumers.
The BC cross-layer function of a cloud infrastructure enables a business to ensure the availability
of services in line with the service level agreement (SLA).

Security
Supports all the layers to provide secure services

Specifies the adoption of administrative mechanisms

• Security and personnel policies


• Standard procedures to direct safe execution of operations

Specifies the adoption of technical mechanisms

85
• Firewall
• Intrusion detection and prevention systems
• Anti-virus

Security mechanisms enable an organization to meet governance, risk, and compliance (GRC) requirements

The security cross-layer function supports all the infrastructure layers—physical, virtual, software-
defined, orchestration, and service—to provide secure services to the consumers. Security specifies
the adoption of administrative and technical mechanisms that mitigate or minimize the security
threats and provide a secure data center environment.
Administrative mechanisms include security and personnel policies or standard procedures to
direct the safe execution of various operations. Technical mechanisms are usually implemented
through tools or devices deployed on the IT infrastructure. Examples of technical mechanisms
include firewall, intrusion detection and prevention systems, and anti-virus software.
Governance, risk, and compliance (GRC) specify processes that help an organization in ensuring
that their acts are ethically correct and in accordance with their risk appetite (the risk level an
organization chooses to accept), internal policies, and external regulations.
Security mechanisms should be deployed to meet the GRC requirements. Security and GRC are
covered in the module ‘Storage Infrastructure Security’.

Management
• Storage infrastructure configuration and capacity provisioning
• Problem resolution
• Capacity and availability management
• Compliance conformance
• Monitoring services

The management cross-layer function specifies the adoption of activities related to data center
operations management. Adoption of these activities enables an organization to align the
creation and delivery of IT services to meet their business objectives. This course focuses on
the aspect of storage infrastructure management.
Storage operation management enables IT administrators to manage the data center
infrastructure and services. Storage operation management tasks include handling of
infrastructure configuration, resource provisioning, problem resolution, capacity, availability,
and compliance conformance.
This function supports all the layers to perform monitoring, management, and reporting for the
entities of the infrastructure.

86
Do-It-Yourself Infrastructure
In the Do-It-Yourself (DIY) approach, organizations integrate the best in class infrastructure
components including hardware and software that is purchased from different vendors. This
approach enables the organizations to use the advantages of high-quality products and services
from the respective leading vendors and provides specific functions with more options and
configurations for organizations to build their cloud infrastructure.
When multivendor infrastructure components are involved, they require integration and testing for
compatibility with one another and with the existing infrastructure.
Do-it-yourself infrastructure enables organizations to select vendors of their choice for
infrastructure components. This solution provides an option for organizations to switch vendors if
they are unable to provide the committed support or unable to meet the Service Level Agreement
(SLA) requirements.
Using the do-it-yourself approach, the cloud infrastructure can be built in two ways. The
two methods are as follows:
Greenfield
Greenfield Method: Greenfield environments enable architects to design exactly what is required to meet
the business needs using new infrastructure that is built specifically for a purpose. Greenfield
environments can avoid some of the older and less efficient processes, rules, methods, misconfigurations,
constraints, and bottlenecks that exist in the current environment. Greenfield environments also have the
added benefit of enabling a business to migrate infrastructure to a different technology or vendor and to
build in technologies that help avoid future lock-in. But greenfield environments also have some
downsides, such as higher cost, lack of staff expertise, and possibly increased implementation time.
Brownfield
Brownfield Method: This method involves upgrading or adding new cloud infrastructure elements to the
already existing infrastructure. This method allows organizations to repurpose the existing infrastructure
components, providing a cost benefit. Simultaneously the organization may face integration issues, which
can compromise the stability of the overall system. Existing infrastructure or processes such as resource
type, available capacity, provisioning processes and managing the resources may place extra constraints
on the architect’s design. These constraints may negatively affect performance or functionality.

87
Converged and Hyper-Converged
Infrastructure
There are two types of converged systems:
• Converged Infrastructure (CI)
CI brings together distinct infrastructure components into a single
package, including compute, network, storage, virtualization, and
management. They are hardware-focused systems where the compute
systems access storage over a SAN.
The infrastructure components are integrated, tested, optimized, and
delivered to the customers as a single block. This solution offers single
management software capable of managing all of the components
within the package.

• Hyper-converged Infrastructure (HCI)

HCI offers efficiency using modular building blocks that are known
as nodes. A node consists of a server with Direct Attached Storage. They
are software-defined systems that decouple the compute, storage, and
networking functions and run these functions on a common set of physical
resources. They do not have a physical Storage Area Network (SAN), or
a distinct physical storage controller like converged infrastructure.
The storage controller function runs as a software-based service on each
compute system.

88
Concepts In Practice
• Dell EMC VxBlock

• Converged infrastructure that simplifies all aspects of IT operations


• Integrates with compute, network, storage, and virtualization technologies
• Supports large-scale consolidation, peak performance, and high availability for traditional
and cloud-based workloads

VxBlock simplifies all aspects of IT and enables customers to modernize their infrastructure and achieve
better business outcomes faster. By seamlessly integrating enterprise-class compute, network,
storage, and virtualization technologies, it delivers an advanced converged infrastructure. It
is designed to support large-scale consolidation, peak performance, and high availability for
traditional and cloud-based workloads. It is a converged system optimized for data reduction
and copy data management. Customers can quickly deploy, easily scale, and manage their
systems simply and effectively. It delivers on both midrange and enterprise requirements with the
all-flash design, enterprise features, and support for a broad spectrum of general-purpose
workloads.

• Dell EMC VxRail

Consists of:

Software:
o VMware vSphere
  ▪ ESXi
  ▪ vCenter
o VxRail Manager
o VMware vSAN

Hardware:
o Nodes based on industry-leading PowerEdge servers
o High-density general-purpose nodes

• Designed, purchased, and supported as one product


• Fastest growing hyper-converged system
• Transforms VMware infrastructures by simplifying IT operations
• Accelerates transformation
• Drives operational efficiency
• Lowers capital and operational costs

Dell EMC VxRail Appliances are the fastest growing hyper-converged systems worldwide.
They are the standard for transforming VMware infrastructures, dramatically simplifying IT
operations while lowering overall capital and operational costs.
It is important to remember that while VxRail is composed of many industry standard
components it is treated as a single entity. You don’t need to worry about updating VMware or
the PowerEdge microcode. That is all handled by VxRail. This makes VxRail the simplest way
to stand up VMware clusters. The details can make VxRail seem more complex than it is.

89
VxRail gives you VMware clusters. You can run whatever runs on a normal VMware cluster on
a VxRail.
VxRail Appliances accelerate transformation and reduce risk with automated lifecycle
management. For example, after deployment, software and firmware updates require only a single
click from the user.
VxRail drives operational efficiency for a 30% TCO advantage versus HCI systems built using VSAN
Ready Nodes. It unifies support for all VxRail hardware and software, delivering 42% lower total
cost of serviceability. It is engineered, manufactured, managed, supported, and sustained as ONE
for single end-to-end lifecycle support, and is fully loaded with enterprise data services for built-in
data protection, cloud storage, and disaster recovery.

Dell EMC VxRack FLEX

• Rack-scale hyper-converged system


• Deliver flexible, scalable performance, and capacity on demand
• Create a virtual pool of block storage
• Scalability, flexibility, performance, and time-to-value

A Dell EMC engineered and manufactured rack-scale hyper-converged system that
delivers an unmatched combination of performance, resiliency, and flexibility to address
enterprise data center needs. VxRack FLEX creates a server-based SAN by combining
virtualization software, known as VxFlex OS, with Dell EMC PowerEdge servers to
deliver flexible, scalable performance, and capacity on demand. Local storage resources
are combined to create a virtual pool of block storage with varying performance tiers.
The architecture enables you to scale from as few as four nodes to over a thousand nodes.
In addition, it provides enterprise-grade data protection, multitenant capabilities, and
add-on enterprise features such as QoS, thin provisioning, and snapshots. VxRack FLEX
delivers the scalability, flexibility, performance, and time-to-value required to meet the
demands of the modern enterprise data center.

• Dell EMC VxRack SDDC


• The infrastructure foundation for realizing a multi-cloud vision
• Optimized for predictable performance, scalability, optimal user experience and cost
savings
• Stand up a complete VMware based cloud environment.

The ultimate infrastructure foundation for realizing a multi-cloud vision. VxRack SDDC creates
IT certainty, improves service outcomes and reduces operational risk by leveraging known, trusted
technologies and operational processes. Optimized for predictable performance, scalability,
optimal user experience and cost savings, VxRack SDDC delivers the simplest path to hybrid cloud
with an automated elastic cloud infrastructure at rack scale. The industry’s most advanced
integrated system for VMware Cloud Foundation, VxRack SDDC is a hyper-converged rack-scale
system engineered with automation and serviceability extensions offering integrated end to end
lifecycle management and 24x7 single vendor support.

• Easily creates a foundation for a complete VMware private cloud


• Fully integrated with VMware vSphere, vSAN, and NSX
90
• Includes physical and virtual network infrastructure for multi-rack scaling and growth
• Automated management and serviceability extensions integrated with VMware Cloud
Foundation for single pane of glass management
• Full lifecycle management and support for the entire engineered system

• Dell EMC PowerEdge Server

Servers that deliver operational efficiency and top performance at any scale

Benefits include:

o Scalable business architecture


o Intelligent automation
o Integrated security

As the foundation for a complete, adaptive and scalable solution, the 13th generation of Dell EMC
PowerEdge servers delivers outstanding operational efficiency and top performance at any scale.
It increases productivity with processing power, exceptional memory capacity, and highly scalable
internal storage. PowerEdge servers provide insight from data, enable environment virtualization, and
support a mobile workforce. Major benefits of PowerEdge Servers are:

• Scalable Business Architecture: maximizes performance across the widest range of
applications with highly scalable architectures and flexible internal storage.
• Intelligent Automation: Automates the entire server lifecycle from deployment to
retirement with embedded intelligence that dramatically increases productivity.
• Integrated Security: Protects customers and business with a deep layer of defense built
into the hardware and firmware of every server.

• Dell EMC XC Series Appliance


• Hyper-converged appliance
• Integrates with the Dell EMC PowerEdge server and Nutanix software
• Managed without any specialized IT resources
• Uses HTML5-based management interface

A hyper-converged appliance. It integrates with the Dell EMC PowerEdge servers, the Nutanix
software, and a choice of hypervisors to run any virtualized workload. It is ideal for enterprise
business applications, server virtualization, hybrid or private cloud projects, and virtual desktop
infrastructure (VDI). Users can deploy an XC Series cluster in 30 minutes and manage it without
specialized IT resources. The XC Series makes managing infrastructure efficient with a unified
HTML5-based management interface, enterprise-class data management capabilities, cloud
integration, and comprehensive diagnostics and analytics.
The features of Dell EMC XC Series are:

• Available in flexible combinations of CPU, memory, and SSD/HDD

• Includes thin provisioning and cloning, replication, and tiering

91
• Dell EMC validates, tests, and supports globally

• Able to grow one node at a time with nondisruptive, scale-out expansion

• Dell Wyse Thin Clients


• Dell offers secure, reliable, cost-effective thin clients
• Easy integration into VDI or web-based environment
• Simplify security and scalability

Dell offers a wide selection of secure, reliable, cost-effective Wyse thin clients designed to integrate
into any virtualized or web-based infrastructure, while meeting the budget and performance
requirements for any application. Wyse thin and zero clients are built for easy integration into VDI
or web-based environment with instant, hands-free operation and performance that meets
demands. Simplify security and scalability with simple deployment and remote management in an
elegant, space-saving design. Malware-resistant and tailored for Citrix, Microsoft and VMware.

• VMware Horizon
• VDI Solution:
o Delivers virtualized or hosted desktops and applications through a single platform
• Supports both Windows and Linux-based desktops

VMware Horizon is a VDI solution for delivering virtualized or hosted desktops and applications through
a single platform to the end users. These desktop and application services—including RDS, hosted apps,
packaged apps with VMware ThinApp, and SaaS apps—can all be accessed from one unified workspace
across devices and locations. Horizon provides IT with a streamlined approach to deliver, protect, and
manage desktops and applications while containing costs and ensuring that end users can work anytime,
anywhere, on any device. Horizon supports both Windows and Linux-based desktops.

• VMware ESXi
• Bare-metal hypervisor
• Comprises underlying VMkernel OS that supports running multiple VMs
o VMkernel controls and manages compute resources

VMware ESXi is a bare-metal hypervisor. ESXi has a compact architecture that is designed for integration
directly into virtualization-optimized compute system hardware, enabling rapid installation,
configuration, and deployment. ESXi abstracts processor, memory, storage, and network resources into
multiple VMs that run unmodified operating systems and applications. The ESXi architecture comprises
an underlying operating system called VMkernel, which provides a means to run management applications
and VMs. VMkernel controls all hardware resources on the compute system and manages resources for
the applications. It provides core OS functionality, such as process management, file system, resource
scheduling, and device drivers.

• VMware Cloud Foundation

• Natively integrated software-defined stack


• Storage elasticity and high performance
• End-to-end security

92
• Self-driving operations
• Automated infrastructure provisioning

VMware Cloud Foundation makes it easy to deploy and run a hybrid cloud. It provides
integrated cloud infrastructure (compute, storage, networking, and security) and cloud
management services to run enterprise applications in both private and public
environments.
Cloud Foundation provides a complete set of software-defined services for compute,
storage, networking and security, and cloud management to run enterprise apps -
traditional or containerized - in private or public environments. Cloud Foundation
simplifies the path to the hybrid cloud by delivering a single integrated solution that is
easy to operate with integrated automated life cycle management. Cloud Foundation is
built on VMware’s leading hyperconverged architecture (vSAN) with all-flash
performance and enterprise-class storage services including deduplication, compression,
and erasure coding. vSAN implements a hyperconverged storage architecture that delivers
elastic storage and drastically simplifies storage management.
Cloud Foundation delivers end to end security for all applications by delivering
microsegmentation, distributed firewalls, and VPN (NSX), VM, hypervisor, and vMotion
encryption (vSphere), and data at rest, cluster, and storage encryption (vSAN).
Cloud Foundation delivers self-driving operations (vRealize Operations, vRealize Log
Insight) from applications to infrastructure to help organizations plan, manage, and scale
their SDDC. Users can perform application-aware monitoring and troubleshooting along
with automated and proactive workload management, balancing, and remediation. It
automatically deploys all of the building blocks of the Software-Defined Data Center:
compute, storage, networking, and cloud management.

Question 1
Which cross-layer function enables an organization to mitigate the impact of downtime?

Security

Service

Management

93
Business continuity

Question 2
Which layer function provides workflows for executing automated tasks to accomplish a desired
outcome?

Security

Management

Services

Orchestration


94
95
Intelligent Storage Systems (ISS)
Storage Requirements for Modern Data
Center
Listed are key requirements for an effective storage infrastructure:

• Process massive amounts of IOPS


• Elastic and nondisruptive horizontal scaling of resources
• Intelligent resource management
• Automated and policy driven configuration
• Support for multiple protocols for data access
• Supports APIs for software-defined and cloud integration
• Centralized management and chargeback in a multi-tenancy environment

Technology Solutions
Listed are technology solutions that can meet the modern data center requirements for the storage
infrastructure:

• Intelligent storage system


o Block-based
o File-based
o Object-based
o Unified
• Storage virtualization
• Software-defined storage

Components of Intelligent Storage Systems


Video: Components of Intelligent Storage
System
Definition: Intelligent Storage System
A feature-rich storage array that provides highly optimized I/O processing capabilities.

• Has a purpose-built operating environment that provides intelligent resource management
capability
• Provides large amount of cache

96
• Provides multiple I/O paths

Intelligent storage systems are feature-rich storage arrays that provide highly optimized I/O
processing capabilities. These intelligent storage systems have the capability to meet the
requirements of today’s I/O intensive modern applications. These applications require high
levels of performance, availability, security, and scalability.
Therefore, to meet the requirements of the applications, many vendors of intelligent storage
systems now support SSDs, hybrid drives, encryption, compression, deduplication, and scale-
out architecture.
The storage systems have an operating environment that intelligently and optimally handles
the management, provisioning, and utilization of storage resources. The storage systems are
configured with a large amount of memory (called cache) and multiple I/O paths and use
sophisticated algorithms to meet the requirements of performance-sensitive applications

ISS Features
Listed are some common features of an ISS:

• Supports a combination of HDDs and SSDs
• Services massive amounts of IOPS
• Scale-out architecture
• Deduplication, compression, and encryption
• Automated storage tiering
• Virtual storage provisioning
• Supports APIs to integrate with SDDC and cloud
• Data Protection

ISS Components
Two key components of an ISS:

Controller            Storage
• Block-based         • All HDDs
• File-based          • All SSDs
• Object-based        • Combination of both
• Unified

An intelligent storage system has two key components, controller and storage. A controller is
a compute system that runs a purpose-built operating system that is responsible for
performing several key functions for the storage system.

• Examples of such functions are serving I/Os from the application servers, storage
management, RAID protection, local and remote replication, provisioning storage,

97
automated tiering, data compression, data encryption, and intelligent cache
management.

An intelligent storage system typically has more than one controller for redundancy. Each
controller consists of one or more processors and a certain amount of cache memory to
process a large number of I/O requests. These controllers are connected to the compute
system either directly or via a storage network. The controllers receive I/O requests from the
compute systems and serve them by reading data from or writing data to the storage. Depending
on the type of data access method used for a storage system, the controller can be
classified as block-based, file-based, object-based, or unified. A storage system can have all
hard disk drives, all solid state drives, or a combination of both.

Hard Disk Drive Components


A hard disk drive is a persistent storage device that stores and retrieves data using rapidly rotating
disks (platters) coated with magnetic material.
The key components of a hard disk drive (HDD) are platter, spindle, read/write head, actuator arm
assembly, and controller board. I/O operations in hard drives are performed by rapidly moving the
arm across the rotating flat platters that are coated with magnetic material.
Data is transferred between the disk controller and magnetic platters through the read/write (R/W)
head which is attached to the arm. Data can be recorded and erased on magnetic platters any number
of times.

Platter
A typical hard disk drive consists of one or more flat circular disks called platters. The data is
recorded on these platters in binary codes (0s and 1s). The set of rotating platters is sealed in a case,
called Head Disk Assembly (HDA). A platter is a rigid, round disk coated with magnetic material
on both surfaces (top and bottom).

98
The data is encoded by polarizing the magnetic area or domains of the disk surface. Data can be
written to or read from both surfaces of the platter. The number of platters and the storage capacity
of each platter determine the total capacity of the drive.
Spindle
A spindle connects all the platters and is connected to a motor. The motor of the spindle rotates at a
constant speed. The disk platter spins at a speed of several thousands of revolutions per minute (rpm).
Read/Write head
Read/write (R/W) heads read and write data from or to the platters. Drives have two R/W heads
per platter, one for each surface of the platter. The R/W head changes the magnetic polarization on
the surface of the platter when writing data. While reading data, the head detects the magnetic
polarization on the surface of the platter.
During reads and writes, the R/W head senses the magnetic polarization and never touches the
surface of the platter. When the spindle rotates, a microscopic air gap is maintained between the
R/W heads and the platters, known as the head flying height. This air gap is removed when the
spindle stops rotating and the R/W head rests on a special area on the platter near the spindle. This
area is called the landing zone.
Actuator Arm Assembly
R/W heads are mounted on the actuator arm assembly, which positions the R/W head at the location on
the platter where the data needs to be written or read. The R/W heads for all platters on a drive are
attached to one actuator arm assembly and move across the platters simultaneously.
Drive Controller Board

The controller is a printed circuit board, mounted at the bottom of a disk drive. It consists of a
microprocessor, internal memory, circuitry, and firmware.
The firmware controls the power supplied to the spindle motor as well as controls the speed of the
motor. It also manages the communication between the drive and the compute system.
In addition, it controls the R/W operations by moving the actuator arm and switching between
different R/W heads, and performs the optimization of data access.

Physical Disk Structure and Logical Block Addressing
Data on the disk is recorded on tracks, which are concentric rings on the platter around the spindle.
The tracks are numbered, starting from zero, from the outer edge of the platter. The number of
tracks per inch (TPI) on the platter (or the track density) measures how tightly the tracks are packed
on a platter.
Each track is divided into smaller units called sectors. A sector is the smallest, individually
addressable unit of storage. The track and sector structure is written on the platter by the drive
manufacturer using a low-level formatting operation. The number of sectors per track varies
according to the drive type. Typically, a sector holds 512 bytes of user data. Besides user data, a
sector also stores other information, such as the sector number, head number or platter number, and
track number. This information helps the controller to locate the data on the drive.
A cylinder is a set of identical tracks on both surfaces of each drive platter. The location of R/W
heads is referred to by the cylinder number, not by the track number. Earlier drives used physical

99
addresses consisting of cylinder, head, and sector (CHS) number. These addresses referred to
specific locations on the disk, and the OS had to be aware of the geometry of each disk used.
Logical block addressing (LBA) has simplified the addressing by using a linear address to access
physical blocks of data. The disk controller translates LBA to a CHS address; the compute system
needs to know only the size of the disk drive in terms of the number of blocks. The logical blocks
are mapped to physical sectors on a 1:1 basis.

In the illustration, the drive shows eight sectors per track, six heads, and four cylinders. This means a total
of 8 × 6 × 4 = 192 blocks. The block number ranges from 0 to 191. Each block has its own unique address.
Assuming that the sector holds 512 bytes, a 500 GB drive with a formatted capacity of 465.7 GB has in
excess of 976,000,000 blocks.
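As an illustration of the arithmetic above, the following Python sketch converts a CHS address to an LBA for the example geometry in the illustration (four cylinders, six heads, eight sectors per track) and counts the 512-byte blocks of a 500 GB drive; the helper name chs_to_lba is only for this example.

# Illustrative CHS-to-LBA arithmetic for the example geometry above:
# 4 cylinders, 6 heads, 8 sectors per track, 512-byte sectors.
HEADS_PER_CYLINDER = 6
SECTORS_PER_TRACK = 8
BYTES_PER_SECTOR = 512

def chs_to_lba(cylinder, head, sector):
    """Classic translation: sectors are numbered from 1, LBA from 0."""
    return ((cylinder * HEADS_PER_CYLINDER) + head) * SECTORS_PER_TRACK + (sector - 1)

# Total blocks in the illustrated drive: 8 x 6 x 4 = 192.
total_blocks = 4 * HEADS_PER_CYLINDER * SECTORS_PER_TRACK
print(total_blocks)                       # 192
print(chs_to_lba(0, 0, 1))                # 0   (first block)
print(chs_to_lba(3, 5, 8))                # 191 (last block)

# A 500 GB drive addressed in 512-byte blocks:
print(500 * 10**9 // BYTES_PER_SECTOR)    # 976,562,500 blocks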

HDD Performance
A disk drive is an electromechanical device that governs the overall performance of the storage system
environment.

The various factors that affect the performance of disk drives are:

• Seek time
• Rotational latency

100
• Disk transfer rate

Disk service time = seek time + rotational latency + data transfer time

Seek Time
Seek time is described as:

• The time to position the read/write head


• The lower the seek time, the faster the I/O
operation
• Seek time specifications include:
o Full stroke
o Average
o Track-to-track
• The drive manufacturer specifies seek time
of a disk.

The seek time (also called access time) describes
the time taken to position the R/W heads across
the platter with a radial movement (moving along
the radius of the platter). In other words, it is the
time taken to position and settle the arm and the
head over the correct track. Therefore, the lower
the seek time, the faster the I/O operation.
Disk vendors publish the following seek time
specifications:

• Full Stroke: It is the time taken by the R/W head to move across the entire width of
the disk, from the innermost track to the outermost track.
• Average: It is the average time taken by the R/W head to move from one random track
to another, normally listed as the time for one-third of a full stroke.
• Track-to-Track: It is the time taken by the R/W head to move between adjacent
tracks.

Each of these specifications is measured in milliseconds (ms). The seek time of a disk is
typically specified by the drive manufacturer. The average seek time on a modern disk is
typically in the range of 3 to 15 ms. Seek time has more impact on the I/O operation of random
tracks rather than the adjacent tracks.
To minimize the seek time, data can be written to only a subset of the available cylinders.
This results in lower usable capacity than the actual capacity of the drive. For example, a 500
GB disk drive is set up to use only the first 40 percent of the cylinders and is effectively treated
as a 200 GB drive. This is known as short-stroking the drive.

101
Rotational Latency
• The time the platter takes to rotate and
position the data under the R/W head
• Depends on the rotation speed of the
spindle
• Average rotational latency: One-half
of the time taken for a full rotation

To access data, the actuator arm moves the
R/W head over the platter to a particular
track while the platter spins to position the
requested sector under the R/W head. The
time taken by the platter to rotate and
position the data under the R/W head is
called rotational latency.
This latency depends on the rotation speed
of the spindle and is measured in milliseconds. The average rotational latency is one-half of
the time taken for a full rotation. Similar to the seek time, rotational latency has more impact
on the reading/writing of random sectors on the disk than on the same operations on adjacent
sectors.
Average rotational latency is approximately 5.5 ms for a 5,400-rpm drive, and around 2 ms
for a 15,000-rpm drive.

Data Transfer Rate


The average amount of data per unit time that the drive can deliver to the HBA:

• Internal transfer rate: Speed at which data moves from the surface of a platter to the internal
buffer of the disk
• External transfer rate: Rate at which data moves through the interface to the HBA

102
The data transfer rate (also called transfer rate) refers to the average amount of data per unit
time that the drive can deliver to the HBA. In a read operation, the data first moves from
disk platters to R/W heads; then it moves to the drive’s internal buffer. Finally, data moves
from the buffer through the interface to the compute system’s HBA.
In a write operation, the data moves from the HBA to the internal buffer of the disk drive
through the drive’s interface. The data then moves from the buffer to the R/W heads. Finally,
it moves from the R/W heads to the platters. The data transfer rates during the R/W
operations are measured in terms of internal and external transfer rates.
Internal transfer rate is the speed at which data moves from a platter’s surface to the internal
buffer (cache) of the disk. The internal transfer rate takes into account factors such as the
seek time and rotational latency. External transfer rate is the rate at which data can move
through the interface to the HBA.
The external transfer rate is generally the advertised speed of the interface, such as 133 MB/s
for ATA.
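The following hedged Python calculation puts the three components of disk service time together for an assumed 15,000-rpm drive servicing 4 KB I/Os; the seek time and transfer rate figures are illustrative values, not taken from any specific product.

# Illustrative disk service time and IOPS calculation for an assumed
# 15,000-rpm drive servicing 4 KB random I/Os.
avg_seek_ms = 5.0                                # assumed average seek time
rpm = 15000
avg_rotational_latency_ms = 0.5 * (60000 / rpm)  # half a rotation = 2 ms
transfer_rate_mb_s = 200.0                       # assumed internal transfer rate
io_size_kb = 4

transfer_time_ms = (io_size_kb / 1024) / transfer_rate_mb_s * 1000

# Disk service time = seek time + rotational latency + data transfer time
service_time_ms = avg_seek_ms + avg_rotational_latency_ms + transfer_time_ms
max_iops = 1000 / service_time_ms

print(f"Service time: {service_time_ms:.2f} ms")   # ~7.02 ms
print(f"Maximum IOPS: {max_iops:.0f}")             # ~142 IOPS per drive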

I/O Controller Utilization vs. Response Time


• Based on fundamental laws of disk drive performance
• For performance-sensitive applications disks are commonly utilized below 70% of their I/O
serving capability

The utilization of a disk I/O controller has a significant impact on the I/O response time.
Consider that a disk is viewed as a black box consisting of two elements: the queue and
the disk I/O controller. Queue is the location where an I/O request waits before it is
processed by the I/O controller and disk I/O controller processes I/Os waiting in the queue
one by one.
The I/O requests arrive at the controller at the rate generated by the application. The I/O
arrival rate, the queue length, and the time taken by the I/O controller to process each
request determines the I/O response time. If the controller is busy or heavily utilized, the
queue size will be large and the response time will be high.
As the utilization reaches 100 percent, that is, as the I/O controller saturates, the response
time moves closer to infinity. In essence, the saturated component or the bottleneck forces
the serialization of I/O requests; meaning, each I/O request must wait for the completion
of the I/O requests that preceded it

103
The figure shows a graph plotted between utilization and response time. The graph indicates that as the
utilization increases, the response time changes are nonlinear. When the average queue sizes are low, the
response time remains low. The response time increases slowly with added load on the queue and
increases exponentially when the utilization exceeds 70 percent. Therefore, for performance-sensitive
applications, it is common to utilize disks below 70 percent of their I/O serving capability.
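The relationship described above is often approximated with a simple single-queue model, where response time = service time / (1 − utilization). The sketch below, with an assumed 5 ms service time, shows how the curve stays flat at low utilization and rises sharply past roughly 70 percent.

# Simple single-server queue model: response time grows nonlinearly
# with utilization. The 5 ms service time is an assumed value.
service_time_ms = 5.0

def response_time_ms(utilization):
    """Average response time = service time / (1 - utilization)."""
    return service_time_ms / (1.0 - utilization)

for u in (0.1, 0.3, 0.5, 0.7, 0.9, 0.95):
    print(f"Utilization {u:.0%}: response time {response_time_ms(u):.1f} ms")

# Utilization 10%:  5.6 ms
# Utilization 70%: 16.7 ms
# Utilization 95%: 100.0 ms  -- the curve rises sharply past ~70%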

Solid State Drive Components


Solid state drives (SSDs) are storage devices that contain non-volatile flash memory. Solid state drives are
superior to mechanical hard disk drives in terms of performance, power use, and availability. These drives
are especially well suited for low-latency applications that require consistent, low (less than 1 millisecond)
read/write response times.

An HDD servicing small-block, highly-concurrent, and random workloads involves considerable rotational
and seek latency, which significantly reduces throughput. Externally, solid state drives have the same
physical format and connectors as mechanical hard disk drive. This uniformity maintains the compatibility
in both form and format with mechanical hard disk drives. It also
allows for easy replacement of a mechanical drive with a solid
state drive.

Internally, a solid state drive’s hardware architecture consists of
the following components: I/O interface, controller, and mass
storage.

The I/O interface enables connecting the power and data
connectors to the solid state drives. Solid state drives typically
support standard connectors such as SATA, SAS, or FC.

The controller includes a drive controller, RAM, and non-volatile memory (NVRAM). The
drive controller manages all drive functions.
The non-volatile RAM (NVRAM) is used to store the SSD’s operational software and data.
Not all SSDs have separate NVRAM. Some models store their programs and data to the
drive’s mass storage.

104
The RAM is used in the management of data being read and written from the SSD as a cache,
and for the SSD’s operational programs and data. SSDs include many features such as
encryption and write coalescing.
The mass storage is an array of non-volatile memory chips. They retain their contents when
powered off. These chips are commonly called Flash memory. The number and capacity of
the individual chips vary directly in relationship to the SSD’s capacity. The larger the
capacity of the SSD, the larger is the capacity and the greater is the number of the Flash
memory chips.
SSDs consume less power compared to hard disk drives. Because SSDs do not have moving
parts, they generate less heat compared to HDDs. Therefore, it further reduces the need for
cooling in storage enclosure, which further reduces the overall system power consumption.
SSDs have multiple parallel I/O channels from its drive controller to the flash memory
storage chips. Generally, the larger the number of flash memory chips in the drive, the larger
is the number of channels.

SSD Addressing
Solid state memory chips have different capacities, for example a solid state memory chip can be
32 GB or 4 GB per chip. However, all memory chips share the same logical organization, that is,
pages and blocks.
At the lowest level, a solid state drive stores bits. Eight bits make up a byte, and while on the typical
mechanical hard drive 512 bytes would make up a sector, solid state drives do not have sectors.
Solid state drives have a similar physical data object called a page.
Like a mechanical hard drive sector, the page is the smallest object that can be read or written on a
solid state drive. Unlike mechanical hard drives, pages do not have a standard capacity. A page’s
capacity depends on the architecture of the solid state memory chip. Typical page capacities are 4
KB, 8 KB, and 16 KB.
A solid state drive block is made up of pages. A block may have 32, 64, or 128 pages; 32 pages is a
common block size. The total capacity of a block depends on the solid state chip’s page size. Only
entire blocks may be written or erased on a solid state memory chip.
Individual pages may be read or invalidated (a logical function). For a block to be written, pages
are assembled into full blocks in the solid state drive’s cache RAM and then written to the block
storage object.
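To make the page and block arithmetic concrete, here is a small illustrative calculation; the 8 KB page size, 128 pages per block, and 32 GB chip are example values consistent with the ranges mentioned above, not a description of any particular device.

# Illustrative SSD geometry arithmetic using example values:
# 8 KB pages, 128 pages per block.
page_size_kb = 8
pages_per_block = 128

block_size_kb = page_size_kb * pages_per_block
print(f"Block size: {block_size_kb} KB")          # 1024 KB (1 MB)

# A hypothetical 32 GB flash chip organized with this geometry:
chip_capacity_gb = 32
blocks_per_chip = (chip_capacity_gb * 1024 * 1024) // block_size_kb
print(f"Blocks per chip: {blocks_per_chip}")      # 32,768 blocks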

105
Page States
A page has three possible states, erased (empty), valid, and invalid.
In order to write any data to a page, its owning block location on the flash memory chip must be
electrically erased. This function is performed by the SSD’s hardware. Once a page has been
erased, new data can be written to it.
For example: when a 4 KB of data is written to a 4 KB capacity page, the state of that page is
changed to valid, as it is holding valid data. A
valid page’s data can be read any number of
times. If the drive receives a write request to
the valid page, the page is marked invalid and
that write goes to another page. This is
because erasing blocks is time consuming and
may increase the response time.
Once a page is marked invalid, its data can no
longer be read. An invalid page needs to be
erased before it can once again be written with
new data. Garbage collection handles this
process. Garbage collection is the process of
providing new erased blocks.
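The following is a minimal, conceptual Python sketch of the page state transitions described above (erased, valid, invalid) and of out-of-place writes; it is a teaching model only and does not reflect how any particular SSD firmware or garbage collector is implemented.

# Conceptual model of SSD page states: erased -> valid -> invalid -> erased.
# Writes to an already-valid page are redirected to a fresh erased page.
ERASED, VALID, INVALID = "erased", "valid", "invalid"

class Block:
    def __init__(self, pages_per_block=32):
        self.pages = [ERASED] * pages_per_block

    def write(self, page_index):
        if self.pages[page_index] == ERASED:
            self.pages[page_index] = VALID          # first write to an erased page
            return page_index
        # Overwrite: invalidate the old page and write to another erased page.
        self.pages[page_index] = INVALID
        new_index = self.pages.index(ERASED)        # assumes a free page exists
        self.pages[new_index] = VALID
        return new_index

    def garbage_collect(self):
        """Reclaim invalid pages (a real drive must erase an entire block at a
        time, relocating any remaining valid pages first)."""
        self.pages = [ERASED if s == INVALID else s for s in self.pages]

block = Block()
first = block.write(0)        # page 0 becomes valid
second = block.write(0)       # page 0 invalidated, data lands on page 1
block.garbage_collect()       # invalid pages returned to the erased state
print(first, second, block.pages[:4])   # 0 1 ['erased', 'valid', 'erased', 'erased']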

106
SSD Performance
Access type

• SSD performs random reads the best.


• SSDs use all internal I/O channels in parallel for multithreaded large block I/Os.

Drive state

• New SSD or SSD with substantial unused capacity offers best performance

Workload duration

SSDs are ideal for most workloads

Solid state drives are semiconductor, random-access devices; these result in very low
response times compared to hard disk drives. This, combined with the multiple parallel
I/O channels on the back end, gives SSDs performance characteristics that are better than
hard drives. SSD performance is dependent on access type, drive state, and workload
duration. SSD performs random reads the best.
In carefully tuned multi-threaded, small-block random I/O workload storage
environments, SSDs can deliver much lower response times and higher throughput than
hard drives. Because they are random access devices, SSDs pay no penalty for retrieving
I/O that is stored in more than one area; as a result their response time is in an order of
magnitude faster than the response time of hard drives.
A new SSD or an SSD with substantial unused capacity has the best performance. Drives
with substantial amounts of their capacity consumed will take longer to complete the
read-modify-write cycle. SSDs are best for workloads with short bursts of activity.

Solid State Hybrid Drive

Definition: Solid-State Hybrid Drive


Hybrid storage technologies combine NAND flash memory or SSDs, with the HDD technology.

In SSHDs the data elements that are associated with
performance, such as most frequently accessed data items,
are stored in the NAND flash memory. This method
provides a significant performance improvement over
traditional hard drives.
In hybrid storage technology, the objective is to achieve a
balance of improved performance and high-capacity
storage availability by combining hard drives and SSD.
107
Optimized performance is ensured by placing "hot data", or data that is most directly associated
with improved performance, on the "faster" part of the storage architecture.

Non-Volatile Memory Express (NVMe)


Definition: NVMe

NVMe (Non-Volatile Memory Express) is a new device interface for Non-Volatile Memory (NVM) storage
technologies using PCIe connectivity.

• A standard developed by an open industry consortium, directed by a 13-company promoter
group which includes Dell.
• Core design objective is to achieve high levels of parallelism, concurrency, and scalability
and realize the performance benefits of NAND flash and emerging Storage Class Memory
(SCM).

NVM stands for non-volatile memory such as NAND flash memory. NVMe has been
designed to capitalize on the low latency and internal parallelism of solid-state storage
devices.
The previous interface protocols like SCSI were developed for use with far slower hard
disk drives where a very lengthy delay exists between a request and data transfer, where
data speeds are much slower than RAM speeds, and where disk rotation and seek time
give rise to further optimization requirements.
NVMe is a command set and associated storage interface standards that specify efficient
access to storage devices and systems based on Non-Volatile Memory (NVM) media.
NVMe is broadly applicable to NVM storage technology, including current NAND-based
flash and higher-performance, Storage Class Memory (SCM).

Storage Class Memory (SCM)


Definition: Storage Class Memory

A solid-state memory that blurs the boundaries between storage and memory by being low-cost, fast, and
nonvolatile.

Features:

• Non-volatile
• Short access time like DRAM
• Low cost per bit like disk
• Solid-state, no moving parts

Despite the emergence of flash storage and more recently, the NVMe stack, external
storage systems are still orders of magnitude slower than server memory technologies

108
(RAM). They can also be a barrier to achieving the highest end-to-end system
performance.
The memory industry has been aiming towards something that has the speed of DRAM
but the capacity, cost, and persistence of NAND flash memory. The shift from SATA to
faster interfaces such as SAS and PCI-Express using the NVMe protocol has made SSDs
much faster, but nowhere near the speed of DRAM.
Now, a new frontier in storage media bridges the latency gap between server storage and
external storage: storage-class memory (SCM). This new class of memory technology has
performance characteristics that fall between DRAM and flash characteristics. The figure
highlights where SCM fits into the storage media hierarchy.
SCM is slower than DRAM but read and write speeds are over 10 times faster than flash
and can support higher IOPS while offering comparable throughput. Furthermore, data
access in flash is at the block and page levels, but SCM can be addressed at the bit or
word level. This granularity eliminates the need to erase an entire block to program it,
and it also simplifies random access.
However, because the price per gigabyte is expected to be substantially higher, SCM is
unlikely to be a replacement for flash in enterprise storage. With new storage media, price
per gigabyte is a key contributor to adoption. For example, in spite of the clear advantages
of flash over HDDs, the industry hasn’t yet completely converted from HDDs to flash.
Other persistent memory technologies are also in development, some with the potential
for broad adoption in enterprise and embedded applications, such as nanotube RAM
(NRAM) and resistive RAM (ReRAM).

109
RAID Techniques
Why RAID?
Definition: RAID (Redundant Array of Independent Disks)

A technique that combines multiple disk drives into a logical unit (RAID set) and provides protection,
performance, or both.

• Provides data protection against drive failures


• Improves storage system performance by serving I/Os from multiple drives simultaneously
• Two implementation methods
o Software RAID
o Hardware RAID

RAID is a technique in which multiple disk drives are combined into a logical unit called a RAID set and
data is written in blocks across the disks in the RAID set. RAID protects against data loss when a drive
fails, by using redundant drives and parity. RAID also helps in improving the storage system performance
as read and write operations are served simultaneously from multiple disk drives.

RAID is typically implemented by using a specialized hardware controller present either on the compute
system or on the storage system. The key functions of a RAID controller are: management and control
of drive aggregations, translation of I/O requests between logical and physical drives, and data
regeneration in the event of drive failures.

Software RAID uses compute system-based software to provide RAID functions and is
implemented at the operating-system level. Software RAID implementations offer cost and
simplicity benefits when compared with hardware RAID.
However, they have the following limitations:

• Performance: Software RAID affects the overall system performance. This is due to
additional CPU cycles required to perform RAID calculations.
• Supported features: Software RAID does not support all RAID levels.
• Operating system compatibility: Software RAID is tied to the operating system;
hence, upgrades to software RAID or to the operating system should be validated for
compatibility. This leads to inflexibility in the data-processing environment.

RAID Array Components
A RAID array is an enclosure that contains various disk drives and supporting hardware to
implement RAID. A subset of disks within a RAID array can be grouped to form logical
associations called logical arrays, also known as RAID sets or RAID groups.

RAID Techniques
Three different RAID techniques form the basis for defining various RAID levels; they are:

Striping

Striping is a technique of spreading data across multiple drives (more than one) in order to use the drives in parallel. All the read/write heads work simultaneously, allowing more data to be processed in a shorter time and increasing performance, compared to reading and writing from a single disk.
Within each disk in a RAID set, a predefined number of contiguously addressable disk blocks are defined as a strip. The set of aligned strips that spans across all the disks within the RAID set is called a stripe. The illustration shows representations of a striped RAID set.
Strip size (also called stripe depth) describes the number of blocks in a strip (represented as “A1, A2, A3, and A4”). It is the maximum amount of data that can be written to or read from a single disk in the set in one operation, assuming that the accessed data starts at the beginning of the strip. All strips in a stripe have the same number of blocks. A smaller strip size means that the data is broken into smaller pieces as it is spread across the disks.
Stripe size (represented as A) is the strip size multiplied by the number of data disks in the RAID set. For example, in a four-disk striped RAID set with a strip size of 64 KB, the stripe size is 256 KB (64 KB x 4). In other words, A = A1 + A2 + A3 + A4. Stripe width refers to the number of data strips in a stripe. Striped RAID does not provide any data protection unless parity or mirroring is used.
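As a quick restatement of these definitions, the sketch below (Python, with hypothetical values) computes the stripe size from a given strip size and number of data disks, and shows how a logical offset maps to a disk and a stripe; it is only a worked illustration of the formulas above.

```python
# Minimal sketch of the striping arithmetic described above (hypothetical values).

STRIP_SIZE_KB = 64      # strip size (stripe depth) per disk
DATA_DISKS = 4          # number of data disks in the RAID set

stripe_size_kb = STRIP_SIZE_KB * DATA_DISKS   # A = A1 + A2 + A3 + A4
print(f"Stripe size: {stripe_size_kb} KB")    # 256 KB for a 4-disk set

def locate(offset_kb):
    """Map a logical offset (in KB) to (disk index, stripe number) for a striped set."""
    strip_number = offset_kb // STRIP_SIZE_KB
    disk_index = strip_number % DATA_DISKS
    stripe_number = strip_number // DATA_DISKS
    return disk_index, stripe_number

print(locate(200))  # offset 200 KB falls on disk 3, stripe 0
```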

Mirroring

Mirroring is a technique whereby the same data is stored on two different disk drives, yielding two
copies of the data. If one disk drive fails, the data remains intact on the surviving disk drive, and the controller continues to service the compute system’s data requests from the surviving disk of the mirrored pair.
When the failed disk is replaced with a new disk, the controller copies the data from the surviving
disk of the mirrored pair. This activity is transparent to the compute system. In addition to providing
complete data redundancy, mirroring enables fast recovery from disk failure. However, disk
mirroring provides only data protection and is not a substitute for data backup.
Mirroring constantly captures changes in the data, whereas a backup captures point-in-time images
of the data. Mirroring involves duplication of data – the amount of storage capacity needed is twice
the amount of data being stored. Therefore, mirroring is considered expensive and is preferred for
mission-critical applications that cannot afford the risk of any data loss. Mirroring improves read
performance because read requests can be serviced by both disks.
However, write performance is slightly lower than that of a single disk because each write request manifests as two writes, one to each disk drive. Mirroring does not deliver the same levels of write
performance as a striped RAID.

Parity

Parity is a method to protect striped data from disk drive failure without the cost of mirroring. An
additional disk drive is added to hold parity, a mathematical construct that allows re-creation of the
missing data. Parity is a redundancy technique that ensures protection of data without maintaining
a full set of duplicate data.
Calculation of parity is a function of the RAID controller. Parity information can be stored on
separate, dedicated disk drives, or distributed across all the drives in a RAID set. The first three
disks in the figure, labeled D1 to D3, contain the data. The fourth disk, labeled P, stores the parity
information, which, in this case, is the sum of the elements in each row. Now, if one of the data
disks fails, the missing value can be calculated by subtracting the sum of the rest of the elements
from the parity value. In the diagram, for simplicity, the computation of parity is represented as an
arithmetic sum of the data. However, parity calculation is a bitwise XOR operation.
Compared to mirroring, parity implementation considerably reduces the cost associated with data
protection. Consider an example of a parity RAID configuration with four disks where three disks
hold data, and the fourth holds the parity information. In this example, parity requires only 33
percent extra disk space compared to mirroring, which requires 100 percent extra disk space.
However, there are some disadvantages of using parity. Parity information is generated from data
on the data disk. Therefore, parity is recalculated every time there is a change in data. This
recalculation is time-consuming and affects the performance of the RAID array.
For parity RAID, the stripe size calculation does not include the parity strip.
For example: in a four (3 + 1) disk parity RAID set with a strip size of 64 KB, the stripe size will
be 192 KB (64KB x 3).
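The bitwise-XOR behavior of parity can be demonstrated with a short sketch. The example below (Python, hypothetical data values) computes a parity strip for a 3 + 1 set and rebuilds a lost data strip from the survivors; it illustrates only the principle, not any particular controller implementation.

```python
# Parity with bitwise XOR for a 3 data disks + 1 parity disk RAID set (illustrative only).

d1, d2, d3 = 0b1010, 0b0110, 0b1100          # hypothetical data strips
parity = d1 ^ d2 ^ d3                        # parity strip computed by the controller

# Simulate losing the disk that held d2 and rebuilding it from the survivors + parity.
rebuilt_d2 = d1 ^ d3 ^ parity
assert rebuilt_d2 == d2
print(f"parity={parity:04b}, rebuilt d2={rebuilt_d2:04b}")
```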

RAID Levels
Commonly used RAID levels are:

• RAID 0 – Striped set with no fault tolerance


• RAID 1 – Disk mirroring
• RAID 1 + 0 – Mirroring and Striping RAID
• RAID 3 - Striped set with parallel access and dedicated parity
• RAID 5 – Striped set with independent disk access and a distributed parity
• RAID 6 – Striped set with independent disk access and dual distributed parity

The RAID level selection depends on the parameters such as application performance, data availability
requirements, and cost. These RAID levels are defined based on striping, mirroring, and parity techniques.
Some RAID levels use a single technique, whereas others use a combination of techniques. The commonly used RAID levels are RAID 0, 1, 5, 6, and 1+0.

RAID 0

RAID 0 configuration uses data striping techniques, where data is striped across all the disks within
a RAID set. It utilizes the full storage capacity of a RAID set.
To read data, all the strips are gathered by the controller. When the number of drives in the RAID
set increases, the performance improves because more data can be read or written simultaneously.
RAID 0 is a good option for applications that need high I/O throughput. However, if these
applications require high availability during drive failures, RAID 0 does not provide data protection
and availability.

RAID 1

RAID 1 is based on the mirroring technique. In this RAID configuration, data is mirrored to provide fault tolerance. A RAID 1 set consists of two disk drives, and every write is written to both disks. The mirroring is transparent to the compute system.
During disk failure, the impact on data recovery in RAID 1 is the least among all RAID implementations, because the RAID controller uses the mirror drive for data recovery.
RAID 1 is suitable for applications that require high availability and where cost is not a constraint.

RAID 1+0 (Mirroring and Striping)

Most data centers require data redundancy and performance from their RAID arrays. RAID 1+0
combines the performance benefits of RAID 0 with the redundancy benefits of RAID 1.
It uses mirroring and striping techniques and combines their benefits. This RAID type requires an
even number of disks, the minimum being four.
RAID 1+0 is also known as RAID 10 (Ten), RAID 1/0, or a striped mirror.
The basic element of RAID 1+0 is a mirrored pair. This means that data is first mirrored and then
both copies of the data are striped across multiple disk drive pairs in a RAID set.
When replacing a failed drive, only the mirror is rebuilt. In other words, the storage system
controller uses the surviving drive in the mirrored pair for data recovery and continuous operation.
Data from the surviving disk is copied to the replacement disk.

RAID 3

RAID 3 stripes data for performance and uses parity for fault tolerance.
Parity information is stored on a dedicated drive so that the data can be reconstructed if a drive
fails in a RAID set. For example, in a set of five disks, four are used for data and one for parity.
Therefore, the total disk space that is required is 1.25 times the size of the data disks. RAID 3
always reads and writes complete stripes of data across all disks because the drives operate in
parallel. There are no partial writes that update one out of many strips in a stripe.
Note: RAID 3 is not typically used in practice.

RAID 5
RAID 5 is a versatile RAID implementation. It is similar to RAID 4 because it uses striping. The
drives (strips) are also independently accessible.
The difference between RAID 4 and RAID 5 is the parity location. In RAID 4, parity is written to
a dedicated drive, creating a write bottleneck for the parity disk.
In RAID 5, parity is distributed across all disks to overcome the write bottleneck of a dedicated
parity disk.

RAID 6
RAID 6 works the same way as RAID 5, except that RAID 6 includes a second parity element to
enable survival if two disk failures occur in a RAID set. Therefore, a RAID 6 implementation
requires at least four disks.
RAID 6 distributes the parity across all the disks. The write penalty (explained later in this module)
in RAID 6 is more than that in RAID 5; therefore, RAID 5 writes perform better than RAID 6.
The rebuild operation in RAID 6 may take longer than that in RAID 5 due to the presence of two
parity sets.

RAID Impacts on Performance


When choosing a RAID type, it is imperative to consider its impact on disk performance and application
IOPS. In both mirrored and parity RAID configurations, every write operation translates into more I/O
overhead for the disks, which is referred to as a write penalty.

• In RAID 5, every write (update) to a disk manifests as four I/O operations (2 disk reads and
2 disk writes)
• In RAID 6, every write (update) to a disk manifests as six I/O operations (3 disk reads and
3 disk writes)
• In RAID 1, every write manifests as two I/O operations (2 disk writes)
The figure illustrates a single write operation on RAID 5 that contains a group of five disks. The parity (P) at the controller is calculated as follows:
Cp = C1 + C2 + C3 + C4 (XOR operations)
Whenever the controller performs a write I/O, parity must be computed by reading the old parity
(Cp old) and the old data (C4 old) from the disk, which means two read I/Os. Then, the new parity
(Cp new) is computed as follows:
Cp new = Cp old – C4 old + C4 new (XOR operations)
After computing the new parity, the controller completes the write I/O by writing the new data and
the new parity onto the disks, amounting to two write I/Os. Therefore, the controller performs two
disk reads and two disk writes for every write operation, and the write penalty is 4.
In RAID 6, which maintains dual parity, a disk write requires three read operations: two parity and one data. After calculating both new parities, the controller performs three write operations: two parity and one data. Therefore, in a RAID 6 implementation, the controller performs six I/O operations for each write I/O, and the write penalty is 6.
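A minimal sketch of the RAID 5 read-modify-write sequence described above is shown below (Python, hypothetical values). It counts the back-end I/Os that a single small write generates, which is where the write penalty of 4 comes from.

```python
# RAID 5 small-write (read-modify-write) sketch; values are hypothetical.

def raid5_small_write(old_data, old_parity, new_data):
    ios = []
    ios.append("read old data")      # 1st I/O
    ios.append("read old parity")    # 2nd I/O
    # New parity: Cp new = Cp old XOR C old XOR C new
    new_parity = old_parity ^ old_data ^ new_data
    ios.append("write new data")     # 3rd I/O
    ios.append("write new parity")   # 4th I/O
    return new_parity, ios

new_parity, ios = raid5_small_write(old_data=0b0101, old_parity=0b1111, new_data=0b0011)
print(f"new parity = {new_parity:04b}, write penalty = {len(ios)}")  # write penalty = 4
```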

RAID Comparison
RAID Level | Minimum Number of Disks | Available Storage Capacity (%) | Write Penalty | Protection
1          | 2                       | 50                             | 2             | Mirror
1+0        | 4                       | 50                             | 2             | Mirror
3          | 3                       | [(n-1)/n] x 100                | 4             | Parity (supports single disk failure)
5          | 3                       | [(n-1)/n] x 100                | 4             | Parity (supports single disk failure)
6          | 4                       | [(n-2)/n] x 100                | 6             | Parity (supports two disk failures)

Dynamic Disk Sparing (Hot Sparing)


Hot sparing refers to a process in which a spare drive temporarily replaces a failed disk drive in a RAID array by taking over the identity of the failed disk drive. With the hot spare, one of the following methods of data recovery is performed, depending on the RAID implementation:

• If parity RAID is used, the data is rebuilt onto the hot spare from the parity and the data on the surviving disk drives in the RAID set.
• If mirroring is used, the data from the surviving mirror is used to copy the data onto the hot spare.

When a new disk drive is added to the system, data from the hot spare is copied to it. The hot spare returns to its idle state, ready to replace the next failed drive. Alternatively, the hot spare replaces the failed disk drive permanently. This means that it is no longer a hot spare, and a new hot spare must be configured on the storage system.
A hot spare should be large enough to accommodate data from a failed drive. Some systems implement multiple hot spares to improve data availability. A hot spare can be configured as automatic or user initiated, which specifies how it will be used in the event of disk failure.
In an automatic configuration, when the recoverable error rates for a disk exceed a
predetermined threshold, the disk subsystem tries to copy data from the failing disk to the
hot spare automatically. If this task is completed before the damaged disk fails, the subsystem
switches to the hot spare and marks the failing disk as unusable.
Otherwise, it uses parity or the mirrored disk to recover the data. In the case of a user-
initiated configuration, the administrator has control of the rebuild process. For example, the
rebuild could occur overnight to prevent any degradation of system performance. However,
the system is at risk of data loss if another disk failure occurs.
Exercise: RAID
Scenario

A customer has an HCI appliance with 4 nodes. The customer requires a data protection solution.
For that, the customer has two options.

• Option 1 is RAID-6 erasure coding


• Option 2 is RAID-1 mirroring

Challenges/Requirements

The customer is concerned about storage space utilization and needs a low cost solution.

Deliverables

Calculate the number of nodes required for data protection and propose the best option.

Debrief

Solution:

• RAID-6 erasure coding provides protection against two node failures by configuring the fault tolerance value to 2. If an appliance consists of four data nodes, then the minimum number of nodes required is 6 (4 + 2).
• RAID-1 mirroring also requires a minimum of 6 nodes.
• However, mirroring occupies more space than erasure coding, which leads to an increase in cost (see the capacity comparison sketched below).
• Since the minimum number of nodes required is 6 in either case, the best solution for data protection is RAID-6 erasure coding.
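To make the space-efficiency argument concrete, the sketch below (Python) compares the usable fraction of raw capacity for a 4 + 2 erasure-coded layout with that of keeping mirrored copies that can tolerate two failures. The layouts (4 + 2 segments, three mirror copies) are assumptions used only for illustration.

```python
# Space efficiency comparison for the exercise above (illustrative assumptions).

# RAID-6 style erasure coding: 4 data + 2 parity segments.
ec_usable_fraction = 4 / (4 + 2)                 # ~67% of raw capacity is usable

# RAID-1 style mirroring that tolerates two failures keeps three copies of the data.
mirror_copies = 3
mirror_usable_fraction = 1 / mirror_copies       # ~33% of raw capacity is usable

print(f"Erasure coding usable fraction: {ec_usable_fraction:.0%}")
print(f"Mirroring usable fraction:      {mirror_usable_fraction:.0%}")
```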

Exercise: Storage Design


Scenario
An organization plans to deploy a new business application:

• Required storage capacity = 1.5 TB


• Peak I/O workload = 5200 IOPS
• Typical I/O size = 4 KB

Challenges/Requirements

The application is business critical and must have an acceptable response time
Specifications of the available disk drive option:

• RPM = 15,000
• Storage capacity = 250 GB
• Average seek time = 4.2 ms
• Data transfer rate = 80 MB/s

Deliverables

Calculate the number of disk drives required for the application

Debrief

• Step 1: Calculate the time required to perform one I/O (disk service time)
o Disk service time = Average seek time + Rotational latency + Data transfer time
▪ Average seek time = 4.2 ms (given)
▪ Rotational latency = 0.5 x (60 / 15000) = 2 ms
▪ Data transfer time = 4 KB / 80 MB/s = 0.05 ms
o Disk service time = 4.2 ms + 2 ms + 0.05 ms = 6.25 ms
• Step 2: Calculate the maximum number of IOPS the drive can perform
o Maximum number of IOPS = 1 / 6.25 ms = 160 IOPS
o Maximum number of IOPS at 70% utilization = 160 x 0.7= 112 IOPS
• Step 3: Calculate the number of drives for the application
o Drives required to meet performance requirement = 5200 / 112 = 47
o Drives required to meet capacity requirement = 1.5 TB / 250 GB = 6
• Number of drives required = Maximum (Capacity, Performance)
o Maximum (6, 47) = 47 disk drives (the same arithmetic is sketched below)
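The debrief arithmetic can be written out as a short script. The sketch below (Python) repeats the same steps with the given values; the 70 percent utilization ceiling is the assumption stated in the exercise.

```python
import math

# Given drive and workload characteristics (from the exercise).
avg_seek_ms = 4.2
rpm = 15000
io_size_kb = 4
transfer_rate_mb_s = 80
peak_iops = 5200
required_capacity_gb = 1.5 * 1000                 # 1.5 TB expressed in GB
drive_capacity_gb = 250
utilization = 0.7                                 # acceptable response time at 70% utilization

rotational_latency_ms = 0.5 * (60 / rpm) * 1000                        # 2 ms
transfer_time_ms = io_size_kb / (transfer_rate_mb_s * 1000) * 1000     # 0.05 ms
service_time_ms = avg_seek_ms + rotational_latency_ms + transfer_time_ms  # 6.25 ms

iops_per_drive = (1000 / service_time_ms) * utilization                # 112 IOPS at 70% utilization
drives_for_performance = math.ceil(peak_iops / iops_per_drive)         # 47
drives_for_capacity = math.ceil(required_capacity_gb / drive_capacity_gb)  # 6

print(max(drives_for_performance, drives_for_capacity))                # 47 drives
```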

Types of Intelligent Storage Systems
Video: Types of Intelligent Storage Systems
Types of Intelligent Storage Systems
Based on the type of data access, a storage system can be classified as:

• Block-based
• File-based
• Object-based
• Unified

A unified storage system provides block-based, file-based, and object-based data access in a single
system.

Scale-up Vs. Scale-out Architecture


An intelligent storage system may be built either based on scale-up or scale-out architecture.

A scale-up storage architecture provides the capability to scale the capacity and performance of a single
storage system based on requirements. Scaling up a storage system involves upgrading or adding controllers and storage. These systems have a fixed capacity ceiling, which limits their scalability, and their performance also starts degrading as they approach the capacity limit.

A scale-out storage architecture provides the capability to increase capacity and performance by simply adding nodes to the cluster. Nodes can be added to the cluster quickly when more performance and capacity are needed,
without causing any downtime. This provides the flexibility to use many nodes of moderate performance
and availability characteristics to produce a total system that has better aggregate performance and
availability. Scale-out architecture pools the resources in the cluster and distributes the workload across
all the nodes. This results in linear performance improvements as more nodes are added to the cluster.

Question 1
Which one of the following is a characteristic of RAID 5?

• Distributed parity

Correct!


Double parity


No parity


All

Question 2
Which one of the following is not an SSD page state?

Invalid

Erased

Start

Valid

Question 3
What is the stripe size of a five disk parity RAID 5 set that has a strip size of 64 KB?

128 KB

64 KB

320 KB

256 KB

Correct!

Block-Based Storage System
Components of a Block-Based Storage System
What Is a Block-Based Storage System?
A block-based storage system provides compute systems with block-level access to the storage volumes.
In this environment, the file system is created on the compute systems and data is accessed on a network
at the block level.

These block-based storage systems can either be based on scale-up or scale-out architecture. The block-
based storage system consists of one or more controllers and storage. Controllers and storage are
discussed next.

Components of a Controller
A controller of a block-based storage system consists of three key components: front end, cache, and back
end. An I/O request that is received from the compute system at the front-end port is processed through
cache and back end, to enable storage and retrieval of data from the storage. A read request can be
serviced directly from cache if the requested data is found in the cache.

In modern intelligent storage systems, front end, cache, and back end are typically integrated on a single board (referred to as a storage processor or storage controller).

Component: Cache
Cache is semiconductor memory where data is placed temporarily to reduce the time that is required to service I/O requests from the compute system. Cache improves storage system performance by isolating compute systems from the storage (HDDs and SSDs); in particular, it isolates compute systems from the mechanical delays that are associated with rotating disks (HDDs).

Rotating disks are the slowest component of an intelligent storage system. Data access on rotating disks
usually takes several milliseconds because of seek time and rotational latency. Accessing data from cache
is fast and typically takes less than a millisecond. On intelligent storage systems, write data is first placed
in cache and then written to the storage.

Read Operation with Cache


When a compute system issues a read request, the storage controller reads the tag RAM to determine
whether the required data is available in cache. If the requested data is found in the cache, it is called a
read cache hit or read hit and data is sent directly to the compute system, without any back-end storage
operation. This provides a fast response time to the compute system (about a millisecond).

If the requested data is not found in cache, it is called a cache miss and the data must be read from the
storage. The back end accesses the appropriate storage device and retrieves the requested data. Data is
then placed in cache and finally sent to the compute system through the front end. Cache misses increase
the I/O response time.

Read performance is measured in terms of the read hit ratio, or the hit rate, expressed as a percentage.
This ratio is the number of read hits with respect to the total number of read requests. A higher read hit
ratio improves the read performance.
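One way to see why the hit ratio matters is to estimate the average read response time. The sketch below (Python, hypothetical latency figures) weights the cache and disk service times by the hit ratio; the numbers are assumptions, not figures from the course.

```python
# Effective read response time as a function of the read hit ratio (hypothetical latencies).

cache_time_ms = 0.5    # assumed cache service time
disk_time_ms = 6.0     # assumed back-end disk service time

def avg_response_ms(hit_ratio):
    return hit_ratio * cache_time_ms + (1 - hit_ratio) * disk_time_ms

for hit_ratio in (0.5, 0.8, 0.95):
    print(f"hit ratio {hit_ratio:.0%}: ~{avg_response_ms(hit_ratio):.2f} ms")
```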

Write Operation with Cache

Write operations with cache provide performance advantages over writing directly to storage.
When an I/O is written to cache and acknowledged, it is completed in less time (from the compute
system’s perspective) than it would take to write directly to storage. Sequential writes also offer

opportunities for optimization because many smaller writes can be coalesced for larger transfers to
storage with the use of cache.
A write operation with cache is implemented in the following ways:

Write-through cache

Data is placed in the cache and immediately written to the storage, and an acknowledgment is sent to the
compute system. Because data is committed to storage as it arrives, the risks of data loss are low, but the
write-response time is longer because of the storage operations.

Write-back cache

Data is placed in the cache and an acknowledgment is sent to the compute system immediately. Later, the data from cache is committed (de-staged) to the storage. Write-response times are much faster because the write operations are isolated from the storage devices. However, uncommitted data is exposed to the risk of loss if a cache failure occurs before it is de-staged to the storage.

Write Operation Details


Cache can be bypassed under certain conditions, such as large size write I/O. In this implementation, if the
size of an I/O request exceeds the predefined size, called write aside size, writes are sent directly to
storage. This reduces the impact of large writes consuming a large cache space. This is particularly useful
in an environment where cache resources are constrained and cache is required for small random I/Os.

Cache can be implemented as either dedicated cache or global cache. With dedicated cache, separate sets
of memory locations are reserved for reads and writes. In global cache, both reads and writes can use any
of the available memory addresses. Cache management is more efficient in a global cache implementation
because only one global set of addresses has to be managed.

Global cache enables users to specify the percentages of cache available for reads and writes for cache
management. Typically, the read cache is small, but it should be increased if the application being used is
read-intensive. In other global cache implementations, the ratio of cache available for reads versus writes
is dynamically adjusted based on the workloads.

Cache Management: Algorithms

Cache is an expensive resource that needs proper management.

Even though modern intelligent storage systems come with a large amount of cache, when all cache pages
are filled, some pages have to be freed up to accommodate new data and avoid performance degradation.

Various cache management algorithms are implemented in intelligent storage systems to proactively
maintain a set of free pages. A list of pages that can be potentially freed up whenever required may also
be maintained.

Least Recently Used (LRU): An algorithm that
continuously monitors data access in cache and
identifies the cache pages that have not been
accessed for a long time. LRU either frees up these
pages or marks them for reuse. This algorithm is
based on the assumption that data that has not
been accessed for a while will not be requested by
the compute system. However, if a page contains
write data that has not yet been committed to
storage, the data is first written to the storage before the page is reused.

Prefetch: A prefetch or read-ahead algorithm is used when read requests are sequential. In a sequential
read request, a contiguous set of associated blocks is retrieved. Several other blocks that have not yet
been requested by the compute system can be read from the storage and placed into cache in advance.
When the compute system subsequently requests these blocks, the read operations will be read hits. This
process significantly improves the response time experienced by the compute system.
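A minimal LRU sketch is shown below (Python, using an OrderedDict); it is meant only to illustrate the "least recently used page is freed first" behavior, not the internals of any storage system's cache.

```python
from collections import OrderedDict

# Minimal LRU page cache sketch: the least recently accessed page is freed first.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()          # key -> data, ordered by recency of access

    def access(self, key, data=None):
        if key in self.pages:               # cache hit: move page to most-recently-used end
            self.pages.move_to_end(key)
            return self.pages[key]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict the least recently used page
        self.pages[key] = data              # cache miss: bring the page into cache
        return data

cache = LRUCache(capacity=2)
cache.access("A", "data-A")
cache.access("B", "data-B")
cache.access("A")                           # touch A so B becomes least recently used
cache.access("C", "data-C")                 # evicts B
print(list(cache.pages))                    # ['A', 'C']
```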

Cache Data Protection


Cache is volatile memory; so a power failure or any kind of cache failure will cause loss of the data that is
not yet committed to the storage drive. This risk of losing uncommitted data that is held in cache can be
mitigated using cache mirroring and cache vaulting:

Cache mirroring

Each write to cache is held in two different memory locations on two independent memory cards.
If a cache failure occurs, the write data will still be safe in the mirrored location and can be
committed to the storage drive. Reads are staged from the storage drive to the cache; therefore, if
a cache failure occurs, the data can still be accessed from the storage drives. Because only writes
are mirrored, this method results in better utilization of the available cache.
In cache mirroring approaches, the problem of maintaining cache coherency is introduced. Cache
coherency means that data in two different cache locations must be identical at all times. It is the
responsibility of the storage system's operating environment to ensure coherency.

Cache vaulting

The risk of data loss due to power failure can be addressed in various ways: powering the memory
with a battery until the AC power is restored or using battery power to write the cache content to
the storage drives. If an extended power failure occurs, using batteries is not a viable option. This
is because in intelligent storage systems, large amounts of data might need to be committed to
numerous storage drives, and batteries might not provide power for sufficient time to write each
piece of data to its intended storage drive.
Therefore, storage vendors use a set of physical storage drives to dump the contents of cache during
power failure. This is called cache vaulting and the storage drives are called vault drives. When power is restored, data from these storage drives is written back to write cache and then written to
the intended drives.

Component: Back End


The back end provides an interface between cache and the physical storage drives. It consists of two
components: back-end ports and back-end controllers. The back-end controls data transfers between
cache and the physical drives. From cache, data is sent to the back end and then routed to the destination
storage drives.

Physical drives are connected to ports on the back end. The back-end controller communicates with the
storage drives when performing reads and writes and also provides additional, but limited, temporary data
storage. The algorithms that are implemented on back-end controllers provide error detection and
correction, along with RAID functionality.

For high data protection and high availability, storage systems are configured with dual controllers with
multiple ports. Such configurations provide an alternative path to physical storage drives if a controller or
port failure occurs. This reliability is further enhanced if the storage drives are also dual-ported. In that
case, each drive port can connect to a separate controller. Multiple controllers also facilitate load
balancing.

Storage
Physical storage drives are connected to the back-end storage controller and provide persistent data
storage. Modern intelligent storage systems provide support to a variety of storage drives with different
speeds and types, such as FC, SATA, SAS, and solid state drives. They also support the use of a mix of SSD,
FC, or SATA within the same storage system.

Workloads that have predictable access patterns typically work well with a combination of HDDs and SSDs. If the workload changes, or constant high performance is required for all the storage being presented, using SSDs can meet the desired performance requirements.

Storage Provisioning
Overview of Storage Provisioning

Definition: Storage Provisioning

The process of assigning storage resources to compute systems based on capacity, availability, and
performance requirements.

Storage provisioning can be performed in two ways: traditional and virtual.

Virtual provisioning leverages virtualization technology for provisioning storage for applications.

Logical Unit Number (LUN)


Definition: LUN

Each logical unit created from the RAID set is assigned a unique ID, called a LUN. A LUN is also referred to
as a volume, partition, or device.

LUNs hide the organization and composition of the RAID set from the compute systems

LUNs created by traditional storage provisioning methods are also referred to as thick

Once allocated, a LUN appears to a host as an internal physical disk

RAID sets usually have a large capacity because they combine the total capacity of individual
drives in the set. Logical units are created from the RAID sets by partitioning (seen as slices
of the RAID set) the available capacity into smaller units. These units are then assigned to
the compute system based on their storage requirements. Logical units are spread across all
the physical drives that belong to that set.
Each logical unit created from the RAID set is assigned a unique ID, called a logical unit
number (LUN). LUNs hide the organization and composition of the RAID set from the
compute systems. LUNs created by traditional storage provisioning methods are also
referred to as thick LUNs to distinguish them from the LUNs created by virtual provisioning
methods.
When a LUN is configured and assigned to a non-virtualized compute system, a bus scan is
required to identify the LUN. This LUN appears as a raw storage drive to the operating
system. To make this drive usable, it is formatted with a file system and then the file system
is mounted. In a virtualized compute system environment, the LUN is assigned to the
hypervisor, which recognizes it as a raw storage drive. This drive is configured with the
hypervisor file system, and then virtual storage drives are created on it.
Virtual storage drives are files on the hypervisor file system. The virtual storage drives are
then assigned to virtual machines and appear as raw storage drives to them. To make the virtual storage drive usable to the virtual machine, similar steps are followed as in a non-
virtualized environment. Here, the LUN space may be shared and accessed simultaneously
by multiple virtual machines.
Virtual machines can also access a LUN directly on the storage system. In this method the
entire LUN is allocated to a single virtual machine. Storing data in this way is recommended
when the applications running on the virtual machine are response-time sensitive, and
sharing storage with other virtual machines may impact their response time. The direct
access method is also used when a virtual machine is clustered with a physical machine. In
this case, the virtual machine is required to access the LUN that is being accessed by the
physical machine.

Traditional Provisioning
In traditional storage provisioning, physical storage drives are logically grouped together on which a
required RAID level is applied to form a set, called RAID set. The number of drives in the RAID set and the
RAID level determine the availability, capacity, and performance of the RAID set. It is highly recommended
to create the RAID set from drives of the same type, speed, and capacity to ensure maximum usable
capacity, reliability, and consistency in performance.

For example, if drives of different capacities are mixed in a RAID set, the capacity of the smallest drive is
used from each drive in the set to make up the RAID set’s overall capacity. The remaining capacity of the
larger drives remains unused. Likewise, mixing higher speed drives with lower speed drives lowers the
overall performance of the RAID set.

The illustration shows a RAID set consisting of five storage drives that have been sliced or partitioned into
two LUNs: LUN 0 and LUN 1. These LUNs are then assigned to Compute 1 and Compute 2 for their storage
requirements.

Virtual Provisioning
Virtual provisioning enables creating and presenting a LUN with more capacity than is physically allocated
to it on the storage system. The LUN created using virtual provisioning is called a thin LUN to distinguish it
from the traditional LUN. Thin LUNs do not require physical storage to be completely allocated to them at
the time they are created and presented to a compute system.

Physical storage is allocated to the compute system “on-demand” from a shared pool of physical capacity.
A shared pool consists of physical storage drives. A shared pool in virtual provisioning is analogous to a
RAID set, which is a collection of drives on which LUNs are created. Similar to a RAID set, a shared pool
supports a single RAID protection level. However, unlike a RAID set, a shared pool might contain large
numbers of drives. Shared pools can be homogeneous (containing a single drive type) or heterogeneous
(containing mixed drive types, such as SSD, FC, SAS, and SATA drives).

Virtual provisioning enables more efficient allocation of storage to compute systems. Virtual provisioning
also enables oversubscription, where more capacity is presented to the compute systems than is actually
available on the storage system. Both the shared pool and the thin LUN can be expanded non-disruptively
as the storage requirements of the compute systems grow. Multiple shared pools can be created within a
storage system, and a shared pool may be shared by multiple thin LUNs.

Expand Thin LUNs and Storage Pool


A storage pool comprises physical drives that provide the physical storage that is used by Thin LUNs. A
storage pool is created by specifying a set of drives and a RAID type for that pool.

Thin LUNs are then created out of that pool (similar to traditional LUN created on a RAID set). All the Thin
LUNs created from a pool share the storage resources of that pool. Adding drives to a storage pool
increases the available shared capacity for all the Thin LUNs in the pool.

Drives can be added to a storage pool while the pool is used in production. The allocated capacity is
reclaimed by the pool when Thin LUNs are destroyed.

When a storage pool is expanded, the sudden introduction of new empty drives combined with relatively full existing drives causes a data imbalance. This imbalance is resolved by automating a one-time data relocation, referred to as rebalancing. Storage pool rebalancing is a technique
that provides the ability to automatically relocate extents (minimum amount of physical
storage capacity that is allocated to the thin LUN from the pool) on physical storage drives
over the entire pool when new drives are added to the pool.
Storage pool rebalancing restripes data across all the drives (both existing and new drives)
in the storage pool. This enables spreading out the data equally on all the physical drives
within the storage pool, ensuring that the used capacity of each drive is uniform across the
pool. After the storage pool capacity is increased, the capacity of the existing LUNs can be
expanded.

Traditional Provisioning vs. Virtual


Provisioning
Administrators typically allocate storage capacity based on anticipated storage requirements. This
generally results in the over provisioning of storage capacity, which then leads to higher costs and lower
capacity utilization.

Administrators often over-provision storage to an application for various reasons such as, to avoid
frequent provisioning of storage if the LUN capacity is exhausted, and to reduce disruption to application
availability.

Virtual provisioning addresses these challenges. Virtual provisioning improves storage capacity utilization
and simplifies storage management.

The illustration shows an example comparing virtual provisioning with traditional storage provisioning.

With traditional provisioning, three LUNs are created and presented to one or more compute
systems. The total storage capacity of the storage system is 2 TB. The allocated capacity of
LUN 1 is 500 GB, of which only 100 GB is consumed, and the remaining 400 GB is unused.
The size of LUN 2 is 550 GB, of which 50 GB is consumed, and 500 GB is unused. The size of
LUN 3 is 800 GB, of which 200 GB is consumed, and 600 GB is unused.
In total, the storage system has 350 GB of data, 1.5 TB of allocated but unused capacity, and
only 150 GB of remaining capacity available for other applications. Now consider the same 2
TB storage system with virtual provisioning. Here, three thin LUNs of the same sizes are
created. However, there is no allocated unused capacity. In total, the storage system with
virtual provisioning has the same 350 GB of data, but 1.65 TB of capacity is available for
other applications, whereas only 150 GB is available in traditional storage provisioning.
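The capacity comparison in this example can be reproduced with a few lines of arithmetic. The sketch below (Python) uses the figures quoted above and assumes 1 TB = 1000 GB for simplicity.

```python
# Traditional vs. virtual provisioning capacity comparison (figures from the example above).

system_capacity_gb = 2000                       # 2 TB system, using 1 TB = 1000 GB
allocated_gb = [500, 550, 800]                  # LUN 1, LUN 2, LUN 3
consumed_gb = [100, 50, 200]

# Traditional (thick) provisioning: allocated capacity is reserved up front.
traditional_free_gb = system_capacity_gb - sum(allocated_gb)          # 150 GB
allocated_unused_gb = sum(allocated_gb) - sum(consumed_gb)            # 1500 GB

# Virtual (thin) provisioning: only consumed capacity is drawn from the shared pool.
virtual_free_gb = system_capacity_gb - sum(consumed_gb)               # 1650 GB

print(traditional_free_gb, allocated_unused_gb, virtual_free_gb)
```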
Virtual provisioning and thin LUN offer many benefits, although in some cases traditional
LUN is better suited for an application. Thin LUNs are appropriate for applications that can
tolerate performance variations. In some cases, performance improvement is perceived when
using a thin LUN, due to striping across a large number of drives in the pool. However, when
multiple thin LUNs contend for shared storage resources in a given pool, and when utilization
reaches higher levels, the performance can degrade. Thin LUNs provide the best storage
space efficiency and are suitable for applications where space consumption is difficult to
forecast. Using thin LUNs benefits organizations in reducing power and acquisition costs and
in simplifying their storage management.
Traditional LUNs are suited for applications that require predictable performance.
Traditional LUNs provide full control for precise data placement and allow an administrator
to create LUNs on different RAID groups if there is any workload contention. Organizations
that are not highly concerned about storage space efficiency may still use traditional LUNs. Both traditional and thin LUNs can coexist in the same storage system. Based on the requirement, an administrator may migrate data between thin and traditional LUNs.

LUN Masking
Definition: LUN Masking

A process that provides data access control by defining which LUNs a compute system can access.

The LUN masking function is implemented on the storage system. This ensures that volume access by a
compute system is controlled appropriately, preventing unauthorized, or accidental use in a shared
environment.

For example, consider a storage system with two LUNs that store data of the sales and finance
departments. Without LUN masking, both departments can easily see and modify each other’s data, posing
a high risk to data integrity and security. With LUN masking, LUNs are accessible only to the designated
compute systems.
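Conceptually, LUN masking is an access-control lookup performed by the storage system. The sketch below (Python, with hypothetical host and LUN names) shows the idea of checking a masking table before serving an I/O; real systems key such a table on initiator identifiers and implement it in the array software.

```python
# Conceptual LUN masking check (hypothetical host names and LUN IDs).

masking_table = {
    "sales_host":   {0},        # sales compute system may access LUN 0 only
    "finance_host": {1},        # finance compute system may access LUN 1 only
}

def can_access(host, lun_id):
    return lun_id in masking_table.get(host, set())

print(can_access("sales_host", 0))    # True  - permitted
print(can_access("sales_host", 1))    # False - masked, request is rejected
```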

Storage Tiering
Storage Tiering Overview
Definition: Storage Tiering

A technique of establishing a hierarchy of storage types and identifying the candidate data to relocate to
the appropriate storage type to meet service level requirements at a minimal cost.

• Each tier has different levels of protection, performance, and cost


• Efficient storage tiering requires defining tiering policies
• Tiering options in block-based storage systems are: FAST VP and Cache tiering

Storage tiering is a technique of establishing a hierarchy of different storage types (tiers).


This enables storing the right data to the right tier, based on service level requirements, at a
minimal cost. Each tier has different levels of protection, performance, and cost. For example,
high performance solid-state drives (SSDs) or FC drives can be configured as tier 1 storage
to keep frequently accessed data and low cost SATA drives as tier 2 storage to keep the less
frequently accessed data.
Keeping frequently used data in SSD or FC improves application performance. Moving less-
frequently accessed data to SATA can free up storage capacity in high performance drives
and reduce the cost of storage. This movement of data happens based on defined tiering
policies. The tiering policy might be based on parameters, such as frequency of access.
For example, if a policy states “move the data that has not been accessed for the last 30 minutes to the lower tier,” then all the data matching this condition is moved to the lower tier.
The process of moving the data from one type of tier to another is typically automated. In
automated storage tiering, the application workload is proactively monitored; the active data
is automatically moved to a higher performance tier and the inactive data is moved to higher
capacity, lower performance tier. The data movement between the tiers is performed non-
disruptively.
The techniques of storage tiering implemented in a block-based storage system are: FAST
VP and cache tiering.
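A tiering policy like the "30 minutes" example above can be pictured as a periodic scan that relocates data based on its last access time. The sketch below (Python, hypothetical tier names, thresholds, and data) illustrates only the policy logic, not FAST VP or any specific product.

```python
import time

# Illustrative tiering policy: demote data not accessed within the threshold (hypothetical).

DEMOTE_AFTER_SECONDS = 30 * 60          # "not accessed for the last 30 minutes"

extents = {                              # extent id -> (current tier, last access epoch)
    "ext-1": ("tier1_ssd", time.time()),             # recently accessed, stays on SSD
    "ext-2": ("tier1_ssd", time.time() - 3600),      # idle for an hour, candidate to demote
}

def apply_policy(extents, now=None):
    now = now or time.time()
    moves = {}
    for ext, (tier, last_access) in extents.items():
        if tier == "tier1_ssd" and now - last_access > DEMOTE_AFTER_SECONDS:
            moves[ext] = "tier2_sata"    # relocate inactive extent to the capacity tier
    return moves

print(apply_policy(extents))             # {'ext-2': 'tier2_sata'}
```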

LUN and Sub-LUN (FAST VP) Tiering


The process of storage tiering within a storage system is called intra-array storage tiering. It enables the
efficient use of SSD, FC, and SATA drives within a system and provides performance and cost optimization.

The goal is to keep the SSDs busy by storing the most frequently accessed data on them, while moving out
the less frequently accessed data to the SATA drives. Data movements that are executed between tiers
can be performed at the LUN level or at the sub-LUN level. The performance can be further improved by
implementing tiered cache.

LUN Tiering:
• Moves an entire LUN from one tier to another.
• Does not give effective cost and performance benefits.

Sub-LUN Tiering:
• A LUN is broken down into smaller segments and tiered at that level.
• Provides effective cost and performance benefits.

Traditionally, storage tiering is operated at the LUN level that moves an entire LUN from
one tier of storage to another. This movement includes both active and inactive data in that
LUN. This method does not give effective cost and performance benefits.
Today, storage tiering can be implemented at the sub-LUN level. In sub-LUN level tiering, a
LUN is broken down into smaller segments and tiered at that level. Movement of data with
much finer granularity, for example 8 MB, greatly enhances the value proposition of
automated storage tiering. Tiering at the sub-LUN level effectively moves active data to faster
drives and less active data to slower drives.

Cache Tiering
• Enables creation of a large capacity
secondary cache using SSDs
• Enables tiering between DRAM cache
and SSDs (secondary cache)
• Most reads are served directly from
high performance tiered cache
• Enhances performance during peak
workloads
• Non-disruptive and transparent to
applications

Tiering is also implemented at the cache level. A large cache in a storage system improves
performance by retaining large amount of frequently accessed data in a cache; so most reads
are served directly from the cache. However, configuring a large cache in the storage system
involves more cost.
An alternative way to increase the size of the cache is by utilizing the SSDs on the storage
system. In cache tiering, SSDs are used to create a large capacity secondary cache and to
enable tiering between DRAM (primary cache) and SSDs (secondary cache).
Server flash-caching is another tier of cache in which flash-cache card is installed in the
server to further enhance the application performance.

Use Case - Block-Based Storage in a Cloud
Storage as a Service

To gain cost advantage, organizations may move their application to a cloud. To ensure proper functioning
of the application and provide acceptable performance, service providers offer block-based storage in
cloud.

The service providers enable the consumers to create block-based storage volumes and attach them to
the virtual machine instances. After the volumes are attached, consumers can create the file system on
these volumes and run applications the way they would on an on-premise data center.

Concepts In Practice
Dell EMC XtremIO

• All-flash, block-based, scale out enterprise storage system


• Uses a clustered design to grow capacity and performance as required
• A powerful OS (XIOS) manages the storage cluster
• Simplified and efficient provisioning and management

Dell EMC XtremIO is an all-flash, block-based, scale-out enterprise storage system that provides substantial improvements to I/O performance. It is purpose-built to leverage flash
media and delivers new levels of real-world performance, administrative ease, and advanced
data services for applications. It uses a scale-out clustered design that grows capacity and
performance linearly to meet any requirement.
XtremIO storage systems are created from building blocks called "X-Bricks" that are each
a high-availability, high-performance, fully active/active storage system with no single point
of failure. XtremIO's powerful operating system, XIOS, manages the XtremIO storage
cluster. XIOS ensures that the system remains balanced and always delivers the highest levels
of performance with no administrator intervention.
XtremIO helps the administrators to become more efficient by enabling system configuration
in a few clicks, provisioning storage in seconds, and monitoring the environment with real-
time metrics.

Dell EMC FAST VP

• Performs storage tiering at sub-LUN level


• Data movement between tiers are based on user-defined policies
• Optimizes performance and cost
• Increases storage efficiency

FAST VP performs storage tiering at a sub-LUN level in a virtually provisioned environment. It automatically moves more active data (data that is more frequently accessed) to the best performing storage tier, and it moves less active data to a lower performance and less expensive tier.
Data movement between the tiers is based on user-defined policies, and is executed
automatically and non-disruptively by FAST VP.

Dell EMC PowerMax

• Built with end-to-end NVMe


• High-end block and file consolidation

• Real-time machine learning to optimize performance and cost
• Up to 10 million IOPS
• A multi-controller, active/active scale-out architecture
• Configuration management is simple with Unisphere
• CloudIQ is the fitness tracker for Dell EMC storage

Dell EMC PowerMax is a fast storage array delivering unprecedented levels of performance, with up to 10 million IOPS and 150 GB per second of sustained bandwidth. The key to unlocking the next level of performance is NVMe, which removes the storage interface (SAS) bottleneck, maximizes the power of flash drives, and, most importantly, opens the door to the next media disruption with storage class memory (SCM).
PowerMax will deliver up to 25% better response times with NVMe Flash drives. The
combination of NVMe and SCM will unlock even greater performance reaching up to 50%
better response times.
The array offers flexible scale-up and scale-out architecture. Configuration management is
simple with Unisphere for PowerMax. The intuitive HTML5 GUI provides a simple and
feature-rich user experience.
The easiest way to describe CloudIQ is that it is like a fitness tracker for your storage
environment, providing a single, simple, display to monitor and predict the health of your
storage environment. CloudIQ makes it simple to track storage health, report on historical
trends, plan for future growth, and proactively discover and remediate issues from any
browser or mobile device.

Dell EMC SC Series

• SC Series offers two categories of arrays: SC Hybrid (SSD and HDD) and SC All-Flash


• Intelligent Deduplication and Compression
• Dynamic Capacity and RAID tiering
• Thin RAID and Block-based storage
• Intelligent data reduction – store more data on fewer drives
• Federation – Up to 10 SC Series arrays can be linked together in federated configurations
under unified management with seamless data mobility between arrays.

SC Series offers two categories of arrays: SC Hybrid (SSD and HDD) and SC All-Flash. SC Series was one of the original pioneers of auto-tiering and has a full-featured, powerful implementation, helping you get great flash performance with less hardware and a less expensive mix of hardware. SC arrays also provision RAID dynamically to help cut costs and increase performance.
In addition to leading platform efficiency (auto-tiering, RAID tiering, thin methods), SC
arrays also offer the most comprehensive data reduction with Intelligent Deduplication and
Compression on:

• SSDs in all-flash configurations


• SSDs and HDDs in hybrid configurations

SC Series provides users with advanced thin provisioning technologies that optimize storage
utilization within their environments. Unlike traditional SANs, Storage Center does not
require users to pre-allocate space. Storage is pooled, ensuring space is available when and
where it is needed. You can even reclaim capacity that is no longer in use by applications,
automatically reduce the space needed for virtual OS volumes and thin import volumes on
legacy storage to improve capacity utilization.
SC Series Remote Instant Replay software efficiently replicates periodic snapshots between
local and remote sites, helping to ensure business continuity at a fraction of the cost of other
replication solutions.

Question 1
Which is a logical unit that serves as the target for storage operations, such as the SCSI protocol
READs and WRITEs?

• LUN

Correct!

Storage pool

All

RAID

Question 2
The process of storage tiering within a storage system is called __________?

Inter-array storage tiering


• Intra-array storage tiering

Correct!


LUN tiering


Sub-LUN tiering

Question 3
Which is a process that provides data access control by defining which LUNs a compute system can
access?

LUN masking

Virtual provisioning

Tiering

Caching

Fibre Channel SAN
Introduction to SAN
Storage Area Network (SAN) Overview

Definition: SAN
A network whose primary purpose is the transfer of data between computer systems and storage
devices and among storage devices.
Source: Storage Networking Industry Association
Storage Area Network (SAN) is a network that primarily connects the storage systems with the
compute systems and also connects the storage systems with each other. It enables multiple
compute systems to access and share storage resources. It also enables the transfer of data between storage systems. With long-distance SAN, data transfer can be extended across geographic locations. A SAN usually provides access to block-based storage systems.

Benefits of SAN


• Enables both consolidation and sharing of storage resources across multiple compute
systems

o Improves utilization of storage resources
o Centralizes management
• Enables connectivity across geographically dispersed locations
o Enables compute systems across locations to access shared data
o Enables replication of data between storage systems that reside in separate locations
o Facilitates remote backup of application data

SAN addresses the limitations of Direct-Attached Storage (DAS) environment. Unlike a DAS
environment, where the compute systems own the storage, SANs enable both consolidation
and sharing of storage resources across multiple compute systems. This process improves the
utilization of storage resources compared to a DAS environment. It also reduces the total
amount of storage that an organization needs to purchase and manage. With consolidation,
storage management becomes centralized and less complex, which further reduces the cost
of managing information.
A SAN may span over wide locations. This flexibility enables organizations to connect
geographically dispersed compute systems and storage systems. The long-distance SAN
connectivity enables the compute systems across locations to access shared data. The long-
distance connectivity also enables the replication of data between storage systems that reside
in separate locations. The replication over long-distances helps in protecting data against
local and regional disaster.
Further, the long-distance SAN connectivity facilitates remote backup of application data.
Backup data can be transferred through a SAN to a backup device that may reside at a
remote location. This feature avoids having to ship tapes (backup media) from the primary site to the remote site. It also avoids associated pitfalls such as packing and shipping expenses and lost tapes in transit.

Requirements for a SAN


An effective SAN infrastructure must provide:

• High throughput to support high-performance computing


• Interconnectivity among many devices over wide locations to transfer massively distributed, high volumes of data
• Elastic and non-disruptive scaling to support applications that are horizontally scaled
• Automated and policy-driven infrastructure configuration
• Simplified, flexible, and agile management operations

The IT industry is in the middle of a massive technological and structural shift towards
modern technologies. These technologies include cloud services, big data analytics, IoT, and
AI. Applications that support these technologies require significantly higher performance,
scalability, and availability compared to the traditional applications.
Similar to the compute and storage infrastructure, the SAN infrastructure must also be ready
to support the requirements of modern applications. Therefore, it is necessary to establish

how the modern application requirements are translated into the SAN requirements. This
slide provides a list of key requirements for an effective SAN infrastructure.

FC SAN Overview
• A SAN that uses the FC protocol for communication
• A high-speed network that runs on high-speed optical fiber cables and serial copper cables
• FC speeds commonly run at 1, 2, 4, 8, 16, 32, and 128 Gb/s
• Provides high scalability

Fibre Channel SAN (FC SAN) uses the Fibre Channel (FC) protocol for communication. FC protocol (FCP) is used to transport data, commands, and status information between the compute systems and the storage systems. It is also used to transfer data between the storage systems.
FC is a high-speed network technology that runs on high-speed optical fiber cables and serial copper cables. The FC technology was developed to meet the demand for the increased speed of data transfer between compute systems and mass storage systems. In comparison with Ultra-Small Computer System Interface (Ultra-SCSI), which is commonly used in DAS environments, FC is a significant leap in storage networking technology.
Note: "Fibre" refers to the protocol, whereas "fiber" refers to the media.

Components of FC SAN
The key FC SAN components are network adapters, cables, and
interconnecting devices. These components are described in the
following:

• Network adapters
o FC HBAs in compute system
o Front-end adapters in storage system
• Cables
o Copper cables for short distance
o Optical fiber cables for long distance

o Two types:
▪ Multimode
▪ Single-mode
• Interconnecting devices
o FC hubs, FC switches, and FC directors

Network Adapters: In an FC SAN, the end devices, such as compute systems and storage systems, are all referred to as nodes. Each node is a source or destination of information. Each node requires one or more network adapters to provide a physical interface for communicating with other nodes. Examples of network adapters are FC host bus adapters (HBAs) and storage system front-end adapters. An FC HBA has SCSI-to-FC processing capability. It encapsulates operating system or hypervisor storage I/Os (usually SCSI I/Os) into FC frames before sending the frames to the FC storage systems over an FC SAN.

Cables: FC SAN implementations primarily use optical fiber cabling. Copper cables may be used for shorter distances because copper provides an acceptable signal-to-noise ratio for distances up to 30 meters. Optical fiber cables carry data in the form of light. There are two types of optical cables: multimode and single-mode. Multimode fiber (MMF) cable carries multiple beams of light projected at different angles simultaneously onto the core of the cable. In an MMF transmission, multiple light beams traveling inside the cable tend to disperse and collide. These collisions weaken the signal strength after it travels a certain distance – a phenomenon known as modal dispersion. Due to modal dispersion, an MMF cable is typically used for short distances, commonly within a data center.
Single-mode fiber (SMF) carries a single ray of light that is projected at the center of the core.
The small core and the single light wave help to limit modal dispersion. Single-mode provides
minimum signal attenuation over maximum distance (up to 10 km). A single-mode cable is
used for long-distance cable runs, and the distance usually depends on the power of the laser
at the transmitter and the sensitivity of the receiver. A connector is attached at the end of a
cable to enable swift connection and disconnection of the cable to and from a port. A standard
connector (SC) and a lucent connector (LC) are two commonly used connectors for fiber
optic cables.
Interconnecting Devices: The commonly used interconnecting devices in FC SANs are FC hubs, FC switches, and FC directors.

FC Interconnecting Devices

• FC Hub
o Nodes are connected in a logical loop
o Nodes share the loop
o Provides limited connectivity and scalability
• FC Switch
o Each node has a dedicated communication path
o Provides a fixed port count ─ active or unused
o Active ports can be scaled-up non-disruptively
o Some components are redundant and hot-swappable
• FC Director
o High-end switch with a higher port count
o Has a modular architecture
o Port count is scaled-up by inserting line cards/blades
o All key components are redundant and hot-swappable

FC hubs are used as communication devices in Fibre Channel Arbitrated Loop (FC-AL)
implementations (discussed later). Hubs physically connect nodes in a logical loop or a physical
star topology. All the nodes must share the loop because data travels through all the connection
points. Because of the availability of low-cost and high-performance switches, FC switches are preferred over FC hubs in FC SAN deployments.
FC switches are more intelligent than FC hubs and directly route data from one physical port to
another. Therefore, the nodes do not share the data path. Instead, each node has a dedicated
communication path. The FC switches are commonly available with a fixed port count. Some of
the ports can be active for operational purpose and the rest remain unused. The number of active
ports can be scaled-up non-disruptively. Some of the components of a switch such as power
supplies and fans are redundant and hot-swappable. Hot-swappable means components can be
replaced while a device is powered-on and remains in operation.
FC directors are high-end switches with a higher port count. A director has a modular architecture, and its port count is scaled-up by inserting extra line cards or blades into the director's chassis. Directors contain redundant components with automated failover capability. Their key components, such as switch controllers, blades, power supplies, and fan modules, are all hot-swappable. These features ensure high availability for business-critical applications.

FC Interconnecting Options
The FC architecture supports three basic interconnectivity options: point-to-point, fibre channel
arbitrated loop (FC-AL), and fibre channel switched fabric (FC-SW). These interconnectivity
options are described in the following:

Point-to-Point: In this configuration, two nodes are connected directly to each other. This
configuration provides a dedicated connection for data transmission between nodes. However, the
point-to-point configuration offers limited connectivity and scalability and is used in a DAS
environment.

FC Arbitrated Loop (FC-AL): In this configuration, the devices are attached to a shared loop. Each
device contends with other devices to perform I/O operations. The devices on the loop must
“arbitrate” to gain control of the loop. At any given time, only one device can perform I/O
operations on the loop. Because each device in a loop must wait for its turn to process an I/O
request, the overall performance in FC-AL environments is low.
Further, adding or removing a device results in loop re-initialization, which can cause a momentary
pause in loop traffic. As a loop configuration, FC-AL can be implemented without any
interconnecting devices by directly connecting one device to another two devices in a ring through
cables. However, FC-AL implementations may also use FC hubs through which the arbitrated loop
is physically connected in a star topology.

FC Switched Fabric (FC-SW): It includes a single FC switch or a network of FC switches (including FC directors) to interconnect the nodes. It is also referred to as fabric connect. A fabric is a logical space in which all nodes communicate with one another in a network. In a fabric, the link between any two switches is called an interswitch link (ISL). ISLs enable switches to be connected together to form a single, larger fabric. They enable the transfer of both storage traffic and fabric management traffic from one switch to another.
In FC-SW, nodes do not share a loop. Instead, data is transferred through a dedicated path between the nodes. Unlike a loop configuration, an FC-SW configuration provides high scalability. The addition or removal of a node in a switched fabric is minimally disruptive; it does not affect the ongoing traffic between other nodes.

Port Types in Switched Fabric

• N_Port: An end point in the fabric, also known as the node port. Typically, it is a compute system port (FC HBA port) or a storage system port that is connected to a switch in a switched fabric.
• E_Port: A port that forms the connection between two FC switches, also known as the expansion port. The E_Port on an FC switch connects to the E_Port of another FC switch in the fabric through ISLs.
• F_Port: A port on a switch that connects to an N_Port, also known as a fabric port.
• G_Port: A generic port on a switch that can operate as an E_Port or an F_Port and determines its functionality automatically during initialization.

NVMe over Fibre Channel

• Organizations are adopting the NVMe protocol to access SSDs over the PCIe bus
• NVMe over FC is designed to transfer NVMe-based data over an FC network
• Reduces latency and improves the performance of SSDs
• FC protocol maps NVMe (upper layer protocol) to the lower layers for the data transfer
FC Architecture
FC Architecture Overview
• Provides benefits of both channel and network technologies
o Provides high performance with low protocol overheads
o Provides high scalability with long-distance capability
• Implements SCSI over FC network
o Transports SCSI data through FC network
• Storage devices, attached to FC SAN, appear as locally attached to the operating system or
hypervisor

Traditionally, compute operating systems have communicated with peripheral devices over
channel connections, such as Enterprise Systems Connection (ESCON) and SCSI. Channel
technologies provide high levels of performance with low protocol overheads. Such
performance is achievable due to the static nature of channels and the high level of hardware
and software integration that is provided by the channel technologies. However, these
technologies suffer from inherent limitations in terms of the number of devices that can be
connected and the distance between these devices.
In contrast to channel technology, network technologies are more flexible and provide
greater distance capabilities. Network connectivity provides greater scalability and uses
shared bandwidth for communication. This flexibility results in greater protocol overhead
and reduced performance.
The FC architecture represents true channel and network integration and captures some of
the benefits of both channel and network technology. FC protocol provides both the channel
speed for data transfer with low protocol overhead and the scalability of network technology.
FC provides a serial data transfer interface that operates over copper wire and optical fiber.
FC protocol forms the fundamental construct of the FC SAN infrastructure. FCP is predominantly an implementation of SCSI over an FC network. SCSI data is encapsulated
and transported within FC frames. SCSI over FC overcomes the distance and the scalability
limitations that are associated with traditional direct-attached storage. Storage devices
attached to the FC SAN appear as locally attached devices to the operating system (OS) or
hypervisor running on the compute system.

FC Protocol Stack
It is easier to understand a communication protocol by viewing it as a structure of independent layers. FCP
defines the communication protocol in five layers: FC-0 through FC-4 (except FC-3 layer, which is not
implemented).

Listed is a breakdown of each layer with its function and features.

• FC-4 (Mapping interface): Mapping upper layer protocol (for example, SCSI) to lower FC layers
• FC-3 (Common services): Not implemented
• FC-2 (Routing, flow control): Frame structure, FC addressing, flow control
• FC-1 (Encode/decode): 8b/10b or 64b/66b encoding, bit and frame synchronization
• FC-0 (Physical layer): Media, cables, connectors

FC-4 Layer: It is the uppermost layer in the FCP stack. This layer defines the application interfaces
and the way Upper Layer Protocols (ULPs) are mapped to the lower FC layers. The FC standard
defines several protocols that can operate on the FC-4 layer. Some of the protocols include SCSI,
High Performance Parallel Interface (HIPPI) Framing Protocol, ESCON, Asynchronous Transfer
Mode (ATM), and IP.
FC-2 Layer: It provides FC addressing, structure, and organization of data (frames, sequences,
and exchanges). It also defines fabric services, classes of service, flow control, and routing.
FC-1 Layer: It defines how data is encoded prior to transmission and decoded upon receipt. At the
transmitter node, an 8-bit character is encoded into a 10-bit transmission character. This character
is then transmitted to the receiver node. At the receiver node, the 10-bit character is passed to the
FC-1 layer, which decodes the 10-bit character into the original 8-bit character. FC links, with a
speed of 10 Gbps and above, use 64-bit to 66-bit encoding algorithm. This layer also defines the
transmission words such as FC frame delimiters, which identify the start and the end of a frame
and the primitive signals that indicate events at a transmitting port. In addition to these, the FC-1
layer performs link initialization and error recovery.
FC-0 Layer: It is the lowest layer in the FCP stack. This layer defines the physical interface, media,
and transmission of bits. The FC-0 specification includes cables, connectors, and optical and
electrical parameters for various data rates. The FC transmission can use both electrical and optical
media.

FC Addressing in Switched Fabric


• FC address is assigned to node ports during fabric login
o Used for communication between nodes in an FC SAN
• FC address size is 24 bits
• Main purpose of an FC address is routing data through the fabric

An FC address is dynamically assigned when a node port logs on to the fabric. The FC
address has a distinct format, as shown on the image. The first field of the FC address
contains the domain ID of the switch. A domain ID is a unique number that is provided to
each switch in the fabric. The area ID is used to identify a group of switch ports that are used
for connecting nodes.
An example of a group of ports with common area ID is a port card on the switch. The last
field, the port ID, identifies the port within the group. The FC address size is 24 bits. The
primary purpose of an FC address is routing data through the fabric.
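As a rough illustration of the address format described above, the short sketch below splits a 24-bit FC address into the three fields named in the text. This is only a minimal model, assuming each field is one byte; the function and field names are hypothetical and not part of any FC library.

```python
def parse_fc_address(addr: int) -> dict:
    """Split a 24-bit FC address into domain ID, area ID, and port ID fields.

    Assumes the common one-byte-per-field layout described in the text.
    """
    if not 0 <= addr <= 0xFFFFFF:
        raise ValueError("an FC address must fit in 24 bits")
    return {
        "domain_id": (addr >> 16) & 0xFF,  # identifies the switch in the fabric
        "area_id": (addr >> 8) & 0xFF,     # identifies a group of ports on the switch
        "port_id": addr & 0xFF,            # identifies the port within the group
    }

# A hypothetical address 0x0A1F02 -> domain 0x0A, area 0x1F, port 0x02.
print(parse_fc_address(0x0A1F02))   # {'domain_id': 10, 'area_id': 31, 'port_id': 2}
```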

World Wide Name


• Unique 64-bit identifier
• Static to node ports on an FC network
o Similar to MAC address of NIC
o WWNN and WWPN are used to physically identify FC network adapters and node
ports respectively.

Each device in the FC environment is assigned a 64-bit unique identifier that is called the
World Wide Name (WWN). The FC environment uses two types of WWNs: World Wide
Node Name (WWNN) and World Wide Port Name (WWPN). WWNN is used to physically
identify FC network adapters, and WWPN is used to physically identify FC adapter ports or
node ports. For example, a dual-port FC HBA has one WWNN and two WWPNs.
Unlike an FC address, which is assigned dynamically, a WWN is a static name for each device
on an FC network. WWNs are similar to the Media Access Control (MAC) addresses used in
IP networking. WWNs are burned into the hardware or assigned through software. Several
configuration definitions in an FC SAN use WWN for identifying storage systems and FC
HBAs. WWNs are critical for FC SAN configuration as each node port has to be registered
by its WWN before the FC SAN recognizes it.
The name server in an FC SAN environment keeps the association of WWNs to the
dynamically created FC addresses for node ports. The illustration on the slide shows WWN structure examples for a storage system and an HBA.
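For illustration only (the exact WWN layout varies by vendor and addressing scheme), the sketch below formats a 64-bit WWN and a 48-bit MAC address the same way, to emphasize that a WWN is simply a longer, static identifier playing a role similar to a MAC address; the example values are hypothetical.

```python
def format_id(value: int, length_bytes: int) -> str:
    """Render an integer identifier as colon-separated hexadecimal bytes."""
    return ":".join(f"{b:02x}" for b in value.to_bytes(length_bytes, "big"))

wwpn = 0x500601601234ABCD   # hypothetical 64-bit World Wide Port Name
mac = 0x0050569ABCDE        # hypothetical 48-bit MAC address

print("WWPN:", format_id(wwpn, 8))   # 50:06:01:60:12:34:ab:cd
print("MAC :", format_id(mac, 6))    # 00:50:56:9a:bc:de
```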

Structure and Organization of FC Data

• Exchange
o Enables two N_Ports to identify and manage a set of information units
o Information unit: upper layer protocol-specific information that is sent to another port to perform a certain operation
o Each information unit maps to a sequence
o An exchange includes one or more sequences
• Sequence
o A contiguous set of frames that correspond to an information unit
• Frame
o The fundamental unit of data transfer
o Each frame consists of five parts: SOF, frame header, data field, CRC, and EOF

Exchange: An exchange operation enables two node ports to identify and manage a set of
information units. Each upper layer protocol (ULP) has its protocol-specific information that must
be sent to another port to perform certain operations. This protocol-specific information is called
an information unit. The structure of these information units is defined in the FC-4 layer. This unit
maps to a sequence. An exchange is composed of one or more sequences.
Sequence: A sequence refers to a contiguous set of frames that are sent from one port to another.
A sequence corresponds to an information unit, as defined by the ULP.
Frame: A frame is the fundamental unit of data transfer at FC-2 layer. An FC frame consists of
five parts: start of frame (SOF), frame header, data field, cyclic redundancy check (CRC), and end
of frame (EOF). The SOF and EOF act as delimiters. The frame header is 24 bytes long and
contains addressing information for the frame. The data field in an FC frame contains the data
payload, up to 2,112 bytes of actual data – usually the SCSI data. The CRC checksum facilitates
error detection for the content of the frame. This checksum verifies data integrity by checking
whether the content of the frames is received correctly. The CRC checksum is calculated by the
sender before encoding at the FC-1 layer. Similarly, it is calculated by the receiver after decoding
at the FC-1 layer.
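A simplified way to picture the five frame parts is sketched below. This is not the real wire format: the delimiters and CRC are placeholders (Python's zlib.crc32 stands in for the FC CRC), and FC-1 encoding is ignored; only the 24-byte header length and the 2,112-byte payload limit follow the text.

```python
import zlib

HEADER_LEN = 24      # an FC frame header is 24 bytes long
MAX_PAYLOAD = 2112   # the data field carries up to 2,112 bytes

def build_frame(header: bytes, payload: bytes) -> bytes:
    """Assemble a schematic frame: SOF | header | data | CRC | EOF."""
    if len(header) != HEADER_LEN:
        raise ValueError("frame header must be 24 bytes")
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("data field exceeds 2,112 bytes")
    crc = zlib.crc32(header + payload).to_bytes(4, "big")  # stand-in checksum
    return b"SOF_" + header + payload + crc + b"EOF_"      # placeholder delimiters

frame = build_frame(b"\x00" * HEADER_LEN, b"SCSI WRITE payload")
print(len(frame))   # 4 + 24 + 18 + 4 + 4 = 54 bytes
```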

Fabric Login Types


Fabric services define three login types:

• Fabric login (FLOGI)


o Occurs between an N_Port and an F_Port
o Node sends a FLOGI frame with WWN to Fabric Login Server on switch
o Node obtains FC address from switch
o Immediately after FLOGI, N_Port registers with Name Server on switch
o N_Port queries name server about all other logged in ports
• Port login (PLOGI)
o Occurs between two N_Ports to establish a session
o Exchange service parameters relevant to the session

• Process login (PRLI)
o Occurs between two N_Ports to exchange ULP related parameters

Fabric Login (FLOGI): It is performed between an N_Port and an F_Port. To log on to the fabric,
a node sends a FLOGI frame with the WWNN and WWPN parameters to the login service at the
predefined FC address FFFFFE (Fabric Login Server). In turn, the switch accepts the login and
returns an Accept (ACC) frame with the assigned FC address for the node. Immediately after the
FLOGI, the N_Port registers itself with the local Name Server on the switch, indicating its WWNN,
WWPN, port type, class of service, assigned FC address, and so on. After the N_Port has logged
in, it can query the name server database for information about all other logged in ports.
Port Login (PLOGI): It is performed between two N_Ports to establish a session. The initiator
N_Port sends a PLOGI request frame to the target N_Port, which accepts it. The target N_Port
returns an ACC to the initiator N_Port. Next, the N_Ports exchange service parameters relevant to
the session.
Process Login (PRLI): It is also performed between two N_Ports. This login relates to the FC-4
ULPs, such as SCSI. If the ULP is SCSI, N_Ports exchange SCSI-related service parameters.

Topologies, Link Aggregation and Zoning
Single-switch Topology
• Fabric consists of only a single switch
• Both compute systems and storage systems connect to the same switch
• No ISLs are required for compute-to-storage traffic
• Every switch port is usable for node connectivity

FC switches (including FC directors) may be connected in various ways to form different fabric topologies. Each topology provides certain benefits.
In a single-switch topology, the fabric consists of only a single switch. Both the compute systems and the storage systems are connected to the same switch. A key advantage of a single-switch fabric is that it does not need to use any switch port for ISLs. Therefore, every switch port is usable for compute system or storage system connectivity. Further, this topology helps eliminate FC frames traveling over the ISLs and therefore eliminates the ISL delays.
A typical implementation of a single-switch fabric would involve the deployment of an FC director. FC directors are high-end switches with a high port count. When extra switch ports are needed over time, new ports can be added through add-on line cards (blades) in spare slots available on the director chassis. To some extent, a bladed solution alleviates the port count scalability problem inherent in a single-switch topology.

Mesh Topology
A mesh topology may be one of the two types: full mesh or partial mesh.
Full Mesh Topology

• Each switch is connected to every other switch


• Maximum of one ISL is required

In a full mesh, every switch is connected to every other switch in the topology.
A full mesh topology may be appropriate when the number of switches that are involved is
small. A typical deployment would involve up to four switches or directors, with each of them
servicing highly localized compute-to-storage traffic. In a full mesh topology, a maximum of
one ISL or hop is required for compute-to-storage traffic.
However, with the increase in the number of switches, the number of switch ports that are
used for ISL also increases. This process reduces the available switch ports for node
connectivity.

Partial Mesh Topology

• Not all the switches are connected to every other switch


• Several ISLs may be required

In a partial mesh topology, not all the switches are connected to every other switch. In this
topology, several hops or ISLs may be required for the traffic to reach its destination.
Partial mesh offers more scalability than full mesh topology. However, without proper placement of compute and storage systems, traffic management in a partial mesh fabric might be complicated. Also, ISLs could become overloaded due to excessive traffic aggregation.

Core-Edge Topology
• Consists of edge and core switch tiers
• Storage systems are usually connected to the core tier
• Maximum of one ISL is required for compute-to-storage traffic

The edge tier is composed of switches and offers an inexpensive approach to adding more
compute systems in a fabric. The edge-tier switches are not connected to each other. Each
switch at the edge tier is attached to a switch at the core tier through ISLs.
The core tier is composed of directors that ensure high fabric availability. Also, typically all
traffic must either traverse this tier or terminate at this tier. In this configuration, all storage
systems are connected to the core tier, enabling compute-to-storage traffic to traverse only
one ISL. Compute systems that require high performance may be connected directly to the
core tier and therefore avoid ISL delays. The core-edge topology increases connectivity within
the FC SAN while conserving the overall port utilization. It eliminates the need to connect
edge switches to other edge switches over ISLs.
Reduction of ISLs can greatly increase the number of node ports that can be connected to the
fabric. If fabric expansion is required, then administrators would need to connect extra edge
switches to the core. The core of the fabric is also extended by adding more switches or
directors at the core tier. Based on the number of core-tier switches, this topology has
different variations, such as single-core topology and dual-core topology. To transform a
single-core topology to dual-core, new ISLs are created to connect each edge switch to the
new core switch in the fabric.

Link Aggregation
Combines multiple ISLs into a single logical ISL (port-channel)

• Provides higher throughput than a single ISL could provide


• Distributes network traffic over ISLs, ensuring even ISL utilization

Link aggregation combines two or more parallel ISLs into a single logical ISL, called a port-
channel, yielding higher throughput than a single ISL could provide.

• For example, the aggregation of 10 ISLs into a single port-channel provides up to 160 Gb/s
throughput assuming the bandwidth of an ISL is 16 Gb/s. Link aggregation optimizes fabric
performance by distributing network traffic across the shared bandwidth of all the ISLs in
a port-channel. This allows the network traffic for a pair of node ports to flow through all
the available ISLs in the port-channel rather than restricting the traffic to a specific,
potentially congested ISL. The number of ISLs in a port channel can be scaled depending
on application’s performance requirement.

This image illustrates two examples.


The example on the left is based on an FC SAN infrastructure with no link aggregation enabled.

• Four HBA ports H1, H2, H3, and H4 have been configured to generate I/O activity to four
storage system ports S1, S2, S3, and S4 respectively.
• The HBAs and the storage systems are connected to two separate FC switches with three
ISLs between the switches.
• Let us assume that the bandwidth of each ISL is 8 Gb/s and the data transmission rate for
the port-pairs {H1,S1}, {H2,S2}, {H3,S3}, and {H4,S4} are 5 Gb/s, 1.5 Gb/s, 2 Gb/s, and
4.5 Gb/s.

Without link aggregation, the fabric typically assigns a particular ISL for each of the port-pairs in
a round-robin fashion. It is possible that port-pairs {H1,S1} and {H4,S4} are assigned to the same
ISL in their respective routes. The other two ISLs are assigned to the port-pairs {H2,S2} and
{H3,S3}. Two of the three ISLs are under-utilized, whereas the third ISL is saturated and becomes
a performance bottleneck for the port-pairs assigned to it.
The example on the right has aggregated the three ISLs into a port-channel that provides throughput
up to 24 Gb/s. Network traffic for all the port-pairs are distributed over the ISLs in the port-channel,
which ensures even ISL utilization.
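The difference between the two cases can be checked numerically. The snippet below is a rough model using the figures from the example; the worst-case round-robin assignment is assumed, and it is not how a switch actually schedules frames.

```python
ISL_BW = 8.0   # Gb/s per ISL, as in the example
flows = {"H1-S1": 5.0, "H2-S2": 1.5, "H3-S3": 2.0, "H4-S4": 4.5}   # Gb/s per port-pair

# Without aggregation: assume H1-S1 and H4-S4 land on the same ISL.
per_isl_load = [flows["H1-S1"] + flows["H4-S4"], flows["H2-S2"], flows["H3-S3"]]
print("Per-ISL load (Gb/s):", per_isl_load)                         # [9.5, 1.5, 2.0]
print("Saturated ISLs:", [x for x in per_isl_load if x > ISL_BW])   # [9.5]

# With a 3-ISL port-channel: traffic shares the aggregate bandwidth.
print("Port-channel:", sum(flows.values()), "Gb/s over", 3 * ISL_BW, "Gb/s")  # 13.0 over 24.0
```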

Zoning
Definition: Zoning

A logical private path between node ports in a fabric.

• Each zone contains members (FC HBA and storage system ports)


• Benefits:
o Security
o Restricts RSCN traffic

Zoning is a logical private path between node ports in a fabric. Whenever a change takes place in
the name server database, the fabric controller sends a Registered State Change Notification
(RSCN) to all the nodes impacted by the change. If zoning is not configured, the fabric controller
sends the RSCN to all the nodes in the fabric. Involving the nodes that are not impacted by the
change increases the amount of fabric-management traffic.
For a large fabric, the amount of FC traffic generated due to this process can be significant and
might impact the compute-to-storage data traffic. Zoning helps to limit the number of RSCNs in a
fabric. In the presence of zoning, a fabric sends the RSCN to only those nodes in a zone where the
change has occurred.
Zoning also provides access control, along with other access control mechanisms, such as LUN
masking. Zoning provides control by enabling only the members in the same zone to establish
communication with each other.
Zone members, zones, and zone sets form the hierarchy that is defined in the zoning process. A
zone set is composed of a group of zones that can be activated or deactivated as a single entity in a
fabric. Multiple zone sets may be defined in a fabric, but only one zone set can be active at a time.

Members are the nodes within the FC SAN that can be included in a zone. FC switch ports, FC
HBA ports, and storage system ports can be members of a zone. A port or node can be a member
of multiple zones. Nodes that are distributed across multiple switches in a switched fabric may also
be grouped into the same zone. Zone sets are also referred to as zone configurations.
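One minimal way to picture this hierarchy is sketched below with plain data structures; the WWPNs and zone names are hypothetical, and this is not any switch's configuration syntax.

```python
# Hypothetical WWPNs grouped into zones; a port may appear in more than one zone.
zones = {
    "zone_app1": {"10:00:00:90:fa:11:11:11",    # FC HBA port of compute system 1
                  "50:06:01:60:aa:bb:cc:01"},   # storage system port
    "zone_app2": {"10:00:00:90:fa:22:22:22",
                  "50:06:01:60:aa:bb:cc:01"},
}
zone_sets = {"production": ["zone_app1", "zone_app2"]}
active_zone_set = "production"   # only one zone set can be active at a time

def can_communicate(wwpn_a: str, wwpn_b: str) -> bool:
    """Two ports may communicate only if a zone in the active set contains both."""
    return any(wwpn_a in zones[z] and wwpn_b in zones[z]
               for z in zone_sets[active_zone_set])

print(can_communicate("10:00:00:90:fa:11:11:11", "50:06:01:60:aa:bb:cc:01"))   # True
print(can_communicate("10:00:00:90:fa:11:11:11", "10:00:00:90:fa:22:22:22"))   # False
```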

Types of Zoning
The illustration shows three types of zoning on an FC network.

The three types of zoning are:

• WWN Zoning

Uses World Wide Names to define zones. The zone members are the unique WWN addresses of the FC
HBA and its targets (storage systems). A major advantage of WWN zoning is its flexibility. If an
administrator moves a node to another switch port in the fabric, the node maintains connectivity to its
zone partners without having to modify the zone configuration. This functionality is possible because the
WWN is static to the node port.

• Port Zoning

Uses the switch port ID to define zones. In port zoning, access to node is determined by the physical switch
port to which a node is connected. The zone members are the port identifiers (switch domain ID and port
number) to which FC HBA and its targets (storage systems) are connected. If a node is moved to another
switch port in the fabric, port zoning must be modified to enable the node, in its new port, to participate
in its original zone. However, if an FC HBA or storage system port fails, an administrator only has to replace the failed device; the zoning configuration does not need to be changed.

• Mixed Zoning

Combines the qualities of both WWN zoning and port zoning. Using mixed zoning enables a specific node
port to be tied to the WWN of another node.

Exercise: FC SAN Topologies


Scenario

• An organization’s storage infrastructure includes three block-based storage systems


• Storage systems are direct-attached to 45 compute systems
• Compute systems are dual-attached to the storage systems
• Each storage system has 32 front-end ports, which could support a maximum of 16 compute
systems
• Each storage system has the storage drive capacity to support a maximum of 32 compute
systems

Challenges/Requirements

• Organization requires an additional 45 compute systems to meet its growth requirements


• Existing storage systems are poorly utilized, and the addition of new compute systems requires the addition of new storage systems
• Organization wants to implement FC SAN to overcome the scalability and utilization
challenges
• Number of ISLs required for compute-to-storage traffic must be minimized to meet the
performance requirement of applications

Deliverables

Given that 72-port FC switches are available for interconnectivity:

• Propose a fabric topology to address organization’s challenges/requirements and justify


your choice
• Determine the minimum number of switches required in the fabric

Debrief

The recommended solution is core-edge topology:

• Provides higher scalability than mesh topology


• Provides a maximum of one-hop/one-ISL storage access to all compute systems
• Increases connectivity by conserving the overall switch port utilization

The recommended configuration:

• Total number of compute system ports = 90 compute systems × 2 ports = 180 ports

• Total number of storage system ports = 3 storage systems × 32 ports = 96 ports
• Number of switches at the core = 96 storage system ports / 72 ports per switch ≈ 2 switches
• Core switches provide 144 ports of which 96 ports will be used for storage system
connectivity
o Remaining 48 ports can be used for ISLs and future growth
• Number of switches at the edge = 180 compute system ports / 72 ports per switch ≈ 3
switches
• Edge switches provide 216 ports of which 180 ports will be used for compute system
connectivity
o Remaining 36 ports can be used for ISLs and future growth
• At a minimum, two core switches and three edge switches are required to implement the
core-edge fabric
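The port arithmetic behind this debrief can be reproduced with a few lines of Python; this is just a back-of-the-envelope check using the figures from the scenario.

```python
import math

switch_ports = 72
compute_ports = 90 * 2    # 90 compute systems, each dual-attached
storage_ports = 3 * 32    # 3 storage systems, 32 front-end ports each

core_switches = math.ceil(storage_ports / switch_ports)    # 2
edge_switches = math.ceil(compute_ports / switch_ports)    # 3

spare_core = core_switches * switch_ports - storage_ports  # 48 ports left for ISLs/growth
spare_edge = edge_switches * switch_ports - compute_ports  # 36 ports left for ISLs/growth

print(core_switches, edge_switches, spare_core, spare_edge)   # 2 3 48 36
```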

SAN Virtualization
Block-level Storage Virtualization
• Provides a virtualization layer in SAN
o Abstracts block-based storage systems
o Aggregates LUNs to create storage pool
• Virtual volumes from storage pool are assigned to compute systems
o Virtualization layer maps virtual volumes to LUNs
• Benefits:
o Online expansion of virtual volumes
o Non-disruptive data migration

The figure on the slide shows two compute systems, each of which has one virtual volume assigned. These
virtual volumes are mapped to the LUNs in the storage systems. When an I/O is sent to a virtual volume,
it is redirected to the mapped LUNs through the virtualization layer at the FC SAN. Depending on the
capabilities of the virtualization appliance, the architecture may allow for more complex mapping between
the LUNs and the virtual volumes.

Block-level storage virtualization aggregates block storage devices (LUNs) and enables
provisioning of virtual storage volumes, independent of the underlying physical storage. A
virtualization layer, which exists at the SAN, abstracts the identity of block-based storage
systems and creates a storage pool by aggregating LUNs from the storage systems.
Virtual volumes are created from the storage pool and assigned to the compute systems.
Instead of being directed to the LUNs on the individual storage systems, the compute systems
are directed to the virtual volumes provided by the virtualization layer. The virtualization
layer maps the virtual volumes to the LUNs on the individual storage systems.
The compute systems remain unaware of the mapping operation and access the virtual
volumes as if they were accessing the physical storage attached to them. Typically, the
virtualization layer is managed via a dedicated virtualization appliance to which the compute
systems and the storage systems are connected.
Block-level storage virtualization enables extending the virtual volumes non-disruptively to
meet application’s capacity scaling requirements. It also provides the advantage of non-
disruptive data migration. In a traditional SAN environment, LUN migration from one
storage system to another is an offline event.
After migration, the compute systems are updated to reflect the new storage system
configuration. In other instances, processor cycles at the compute system were required to
migrate data from one storage system to the other, especially in a multivendor environment.
With a block-level storage virtualization solution in place, the virtualization layer handles the
migration of data, which enables LUNs to remain online and accessible while data is
migrating. No physical changes are required because the compute system still points to the
same virtual volume on the virtualization layer. However, the mapping information on the
virtualization layer should be changed. These changes can be executed dynamically and are
transparent to the end user.
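A rough sketch of the mapping idea follows; the volume and LUN names are hypothetical, and a real virtualization appliance keeps far richer metadata. The point is that I/O is addressed to a virtual volume and redirected to whatever backing LUN the mapping currently points to, so the mapping can change during a migration without the compute system noticing.

```python
# Virtual volume -> (storage system, LUN) mapping held by the virtualization layer.
volume_map = {
    "vvol_01": ("storage_system_A", "LUN_5"),
    "vvol_02": ("storage_system_B", "LUN_2"),
}

def route_io(virtual_volume: str):
    """Return the physical target for an I/O issued against a virtual volume."""
    return volume_map[virtual_volume]

print(route_io("vvol_01"))   # ('storage_system_A', 'LUN_5')

# Non-disruptive migration: only the mapping changes; the compute system keeps
# addressing the same virtual volume name.
volume_map["vvol_01"] = ("storage_system_B", "LUN_7")
print(route_io("vvol_01"))   # ('storage_system_B', 'LUN_7')
```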

Virtual SAN/Virtual Fabric
Definition: VSAN

A logical fabric on an FC SAN, enabling communication among a group of nodes, regardless of their physical
location in the fabric.

• Each VSAN has its own fabric services, configuration, and set of FC addresses
• VSANs improve SAN security, scalability, availability, and manageability

In a VSAN, a group of node ports communicate with each other using a virtual topology that
is defined on the physical SAN. Multiple VSANs may be created on a single physical SAN.
Each VSAN behaves and is managed as an independent fabric. Each VSAN has its own fabric
services, configuration, and set of FC addresses. Fabric-related configurations in one VSAN
do not affect the traffic in another VSAN. A VSAN may be extended across sites, enabling communication among a group of nodes in either site that share a common set of requirements.
VSANs improve SAN security, scalability, availability, and manageability. VSANs provide
enhanced security by isolating the sensitive data in a VSAN and by restricting the access to
the resources located within that VSAN. For example, a cloud provider typically isolates the
storage pools for multiple cloud services by creating multiple VSANs on an FC SAN.
Further, the same FC address can be assigned to nodes in different VSANs, thus increasing
the fabric scalability. The events causing traffic disruptions in one VSAN are contained
within that VSAN and are not propagated to other VSANs. VSANs facilitate an easy, flexible,
and less expensive way to manage networks.

Configuring VSANs is easier and quicker compared to building separate physical FC SANs
for various node groups. To regroup nodes, an administrator changes the VSAN
configurations without moving nodes and recabling.

VSAN Configuration
• Define VSANs on fabric switch with specific VSAN IDs
• Assign VSAN IDs to F_Ports to include them in the VSANs
• An N_Port connecting to an F_Port in a VSAN becomes a member of that VSAN
• Switch forwards FC frames between F_Ports that belong to the same VSAN

To configure VSANs on a fabric, an administrator first needs to define VSANs on fabric switches.
Each VSAN is identified with a specific number called VSAN ID. The next step is to assign a
VSAN ID to the F_Ports on the switch. By assigning a VSAN ID to an F_Port, the port is included
in the VSAN. In this manner, multiple F_Ports can be grouped into a VSAN. For example, an
administrator may group switch ports (F_Ports) 1 and 2 into VSAN 10 (ID) and ports 6–12 into
VSAN 20 (ID). If an N_Port connects to an F_Port that belongs to a VSAN, it becomes a member
of that VSAN. The switch transfers FC frames between switch ports that belong to the same VSAN.
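The grouping described above can be sketched as data (illustrative only, not a switch CLI): each F_Port carries a VSAN ID, and frames are forwarded only between ports with the same ID. The port-to-VSAN assignment follows the example in the text.

```python
# F_Port number -> VSAN ID: ports 1-2 in VSAN 10, ports 6-12 in VSAN 20.
port_vsan = {1: 10, 2: 10, **{p: 20 for p in range(6, 13)}}

def same_vsan(ingress_port: int, egress_port: int) -> bool:
    """The switch forwards FC frames only between F_Ports in the same VSAN."""
    return port_vsan.get(ingress_port) == port_vsan.get(egress_port)

print(same_vsan(1, 2))   # True  -> both ports are in VSAN 10
print(same_vsan(1, 6))   # False -> VSAN 10 versus VSAN 20
```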
VSAN versus Zone:

• Both VSANs and zones enable node ports within a fabric to be logically segmented into groups. But they are not the same, and their purposes are different. There is a hierarchical relationship between them. An administrator first assigns physical ports to VSANs and then configures independent zones for each VSAN. A VSAN has its own independent fabric services, but the fabric services are not available on a per-zone basis.

VSAN Trunking


• Allows network traffic from multiple VSANs to traverse a single ISL (trunk link)
• Enables an E_Port (trunk port) to send or receive multiple VSAN traffic over a trunk link
• Reduces the number of ISLs between switches that are configured with multiple VSANs

The illustration shows a VSAN trunking configuration that is contrasted with a network configuration
without VSAN trunking. In both the cases, the switches have VSAN 10, VSAN 20, and VSAN 30 configured.
If VSAN trunking is not used, three ISLs are required to transfer traffic between the three distinct VSANs.
When trunking is configured, a single ISL is used to transfer all VSAN traffic.

VSAN trunking allows network traffic from multiple VSANs to traverse a single ISL. It
supports a single ISL to permit traffic from multiple VSANs along the same path. The ISL
through which multiple VSANs traffic travels is called a trunk link. VSAN trunking enables
a single E_Port to be used for sending or receiving traffic from multiple VSANs over a trunk
link. The E_Port capable of transferring multiple VSANs traffic is called a trunk port. The
sending and receiving switches must have at least one trunk E_Port configured for all or a
subset of the VSANs defined on the switches.
VSAN trunking eliminates the need to create dedicated ISL(s) for each VSAN. It reduces the
number of ISLs when the switches are configured with multiple VSANs. As the number of
ISLs between the switches decreases, the number of E_Ports used for the ISLs also reduces.
By eliminating needless ISLs, the utilization of the remaining ISLs increases. The complexity
of managing the FC SAN is also minimized with a reduced number of ISLs.

VSAN Tagging
Definition: VSAN Tagging
A process of adding or removing a tag to the FC frames that contains VSAN-specific information.
Associated with VSAN trunking, it helps isolate FC frames from multiple VSANs that travel
through and share a trunk link.
Whenever an FC frame enters an FC switch, it is tagged with a VSAN header indicating the VSAN
ID of the switch port (F_Port) before sending the frame down to a trunk link. The receiving FC
switch reads the tag and forwards the frame to the destination port that corresponds to that VSAN
ID. The tag is removed once the frame leaves a trunk link to reach an N_Port.

Concepts In Practice
Dell EMC Connectrix

• Group of networked storage connectivity products that support NVMe over FC technology
• Products under Connectrix brand:
o Directors: Ideal for largest mission-critical storage area network environments
o Switches: Ideal for departmental or edge storage area networks

Connectrix: A group of networked storage connectivity products. Dell EMC offers the
following connectivity products under the Connectrix brand:

• Directors: Ideal for largest mission-critical storage area network environments. They
offer high port density and high component redundancy. They allow physical and
virtual servers to share storage resources securely. They provide up to 32 Gbps Fibre
Channel connectivity. They provide high-availability, maximum scalability, and
deliver high performance to keep pace with all-flash storage environments.
• Switches: Ideal for departmental or edge storage area networks. They provide a foundation for growth, from smaller environments to deployment in large data centers. They support up to 32 Gbps Fibre Channel connectivity. They provide high availability through redundant connections and scale with 1U and 2U models.

Dell EMC VPLEX

• Provides solution for block-level storage virtualization and data migration both within and
across data centers
• Provides the capability to mirror data of a virtual volume both within and across locations
• VS6 engine with VPLEX for all-flash model provides the fastest and most scalable VPLEX
solution for all-flash systems
• Enables organizations to move cold data to inexpensive cloud storage

VPLEX provides a solution for block-level storage virtualization and data mobility both within and across data centers. It forms a pool of distributed block storage resources and enables creating virtual storage volumes from the pool. These virtual volumes are then allocated to the compute systems.
VPLEX provides nondisruptive data mobility among storage systems to balance the
application workload and to enable both local and remote data access. It uses a unique
clustering architecture and advanced data caching techniques. These enable multiple compute systems that are located across two sites to access a single copy of data. Data migration
with VPLEX can be done without any downtime, saving countless weekends of maintenance
downtime and IT resources. VPLEX enables IT organizations to build modern data center
infrastructure that is:

• Always available even in the face of disasters


• Agile in responding to business requirements
• Non-disruptive when adopting latest storage technology

The new VS6 engine with VPLEX for all-flash model provides the fastest and most scalable
VPLEX solution for all-flash systems. VPLEX also enables organizations to move cold data
to inexpensive cloud storage.

Question 1
Which layer of the FC protocol stack provides FC addressing, structure, and organization of data?

• FC-1 layer
• FC-4 layer
• FC-0 layer
• FC-2 layer (Correct!)

Question 2
Identify the topology that requires a maximum of one ISL for compute-to-storage communication.
Select all that apply.

• Partial mesh topology
• Full mesh topology
• Single-switch topology
• Core-edge topology

IP and FCoE SAN
Overview of TCP/IP
OSI Reference Model
The OSI reference model is a logical structure for network operations standardized by the International Organization for Standardization (ISO). Each layer in the OSI reference model only interacts directly with the layer
immediately beneath it, and provides facilities for use by the layer above it. The following layers make up
the OSI model:

• A logical structure for network operations


• The OSI model organizes the communications process into seven different layers
• Protocols are within the layers
• Layers 4-7 provide end to end communication
• Layers 1-3 are used for network access providing packet, frame and bit level
communication

Each layer is described as follows:

• Physical Layer - Defines the electrical and physical specifications for devices.
• Data Link Layer - Provides the functional and procedural means to transfer data between network entities. It also detects and possibly corrects errors that may occur in the Physical Layer.
• Network Layer - Transfers variable length data sequences from a source to destination
through one or more networks while also maintaining a quality of service requested by the
Transport Layer.
• Transport Layer - Provides transparent transfer of data between end users, providing
reliable data transfer services to the upper layers.
• Session Layer - Controls the connections between computers. It establishes, manages, and
terminates the connections between the local and remote application.
• Presentation Layer - Establishes a context between the Application layer entities in which
the high-layer entities can use different syntax and semantics.
• Application Layer - Provides a user interface that enables users to access the network and applications.

TCP/IP Reference Model


TCP/IP is a hierarchical protocol suite that is named after its two primary protocols: Transmission Control Protocol (TCP) and Internet Protocol (IP). It is made up of four layers as specified in the image.

• TCP/IP is a 4-layer hierarchical model
• An example of an implementation of the OSI reference model
• Also known as the Internet Protocol Suite

The four layers are described as follows:

• The link layer is used to describe the local network topology and the interfaces needed to effect transmission of Internet layer datagrams to next-neighbor hosts.
• The network layer is responsible for end-to-end communications and delivery of packets across multiple network links.
• The transport layer provides process-to-process delivery of the entire message.
• The application layer enables users to access the network.

Comparing Reference Models
The purpose of the reference models is to show how to facilitate communication between different
systems without requiring changes to the logic of the underlying architecture.

• Facilitates communication between different systems


• Layered architecture
• Standard protocols and interfaces
• Example: OSI and TCP/IP

To understand the complex system and for simplification, the reference models are implemented as a layered structure.
The OSI and the TCP/IP reference models have much in common. The architectural layers form a hierarchy, and items are listed in order by rank. Higher layers depend upon services from lower layers, and lower layers provide services for upper layers. Also, the functionality of the layers is roughly similar, with a few exceptions. The presentation and the session layers of the OSI reference model were combined with the application layer and represented as the application layer in the TCP/IP model. The TCP/IP model also does not distinguish between the physical and the data link layers.
The OSI and TCP/IP reference models are widely adopted and are important network architectures. Both of them define the essential features of network services and enhanced functionality. The OSI model is a logical structure for network operations standardized by the International Organization for Standardization (ISO).
• The OSI model is a layered framework for the design of a network system that enables
communication between all types of systems.
• TCP/IP is a hierarchical protocol suite that is made up of interactive modules, providing
specific functionality.

Network Layer and IP


IP is one of the major protocols in the Transmission Control Protocol (TCP)/Internet Protocol (IP) protocol
suite. This protocol works at layer 3, the network layer of the OSI model and at the Internet layer of the
TCP/IP model. Thus, this protocol is responsible for end-to-end communication and delivery of packets
across multiple network links based on their logical addresses.

The current versions are:

• Internet Protocol version 4 (IPv4)


o 32-bit address (example: 192.168.1.12)
• Internet Protocol version 6 (IPv6)
o 128-bit address (example: 2002:ac18:af02:00f4:020e:cff:fe6e:d527)
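Python's standard ipaddress module can be used to confirm the two address sizes. The addresses below are the examples from the text; this is only a quick check, not part of any storage configuration.

```python
import ipaddress

v4 = ipaddress.ip_address("192.168.1.12")
v6 = ipaddress.ip_address("2002:ac18:af02:00f4:020e:cff:fe6e:d527")

print(v4.version, v4.max_prefixlen)   # 4 32  -> IPv4 addresses are 32 bits
print(v6.version, v6.max_prefixlen)   # 6 128 -> IPv6 addresses are 128 bits
```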

Connection Establishment: Three-way Handshake
The transport layer is the heart of the TCP/IP protocol suite. Due to the use of connection-oriented
protocol TCP, the layer provides reliable, process-to-process, and full-duplex service.

Transmission Control Protocol (TCP) explicitly defines the connection establishment process. The
connection establishment in TCP is called three-way handshaking. Three-way handshaking is a process to
negotiate the sequence and acknowledgment fields and start the session. The process consists of the
following steps:

• The client initiates the connection by sending the TCP SYN packet to the destination host.

In the illustration,

o SYN refers to synchronous and ACK refers to acknowledgement


o The packet contains the random sequence number, which marks the beginning of
the sequence numbers of data that the client will transmit
o This sequence number is called the initial sequence number
• The server, which is the destination host, receives the packet and responds with its own sequence number. The response also includes the acknowledgment number, which is the client's sequence number incremented by 1. That is, a SYN+ACK segment is sent.
• The client acknowledges the response of the server by sending the ACK segment. It acknowledges the receipt of the second segment with the ACK flag.
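The handshake itself is carried out by the operating system's TCP stack; an application only triggers it. The sketch below sets up a minimal client and server on the loopback interface (the port number is arbitrary) to show that the SYN, SYN+ACK, ACK exchange completes inside connect() and accept() before any application data is sent.

```python
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 50007))   # arbitrary loopback port for the example
srv.listen(1)

def accept_one():
    conn, addr = srv.accept()    # handshake: SYN received, SYN+ACK sent, ACK received
    print("server: session established with", addr)
    conn.close()

t = threading.Thread(target=accept_one)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", 50007))   # returns once SYN, SYN+ACK, ACK have completed
print("client: session established")

t.join()
cli.close()
srv.close()
```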

Overview of IP SAN
IP SAN uses the Internet Protocol (IP) for the transport of storage traffic. It transports block I/O over an IP-based network and provides an efficient and dedicated point-to-point storage solution.
IP SAN typically runs over a standard IP-based network and uses TCP/IP for communication, commonly through:

• Internet SCSI (iSCSI)


• Fibre Channel over IP (FCIP)

Drivers of IP SAN


The following drivers have led to the adoption of IP SAN:

• Existing IP-based network infrastructure can be leveraged


o Reduced cost compared to deploying new FC SAN infrastructure
• IP network makes it possible to extend or connect SANs over long distances
• Many long-distance disaster recovery solutions already leverage IP-based network.

• Many robust and mature security options are available for IP networks

The advantages of FC SAN, such as scalability and high performance, come with the additional cost of buying FC components, such as FC HBAs and FC switches. On the other hand, IP is a mature technology, and using IP as a storage networking option provides several advantages. These are listed below:

• Most organizations have an existing IP-based network infrastructure, which could be used
for storage networking. The use of existing network may be a more economical option than
deploying a new FC SAN infrastructure.
• IP network has no distance limitation, which makes it possible to extend or connect SANs
over long distances. With IP SAN, organizations can extend the geographical reach of their
storage infrastructure and transfer data that are distributed over wide locations.
• Many long-distance disaster recovery (DR) solutions are already leveraging IP-based
networks. In addition, many robust and mature security options are available for IP
networks.

Role of TCP/IP in IP SAN


As we know, the IP SAN protocols typically run over a standard Ethernet network and uses the
Transmission Control Protocol/Internet Protocol (TCP/IP) for communication along with transport
of storage traffic.
The entire process of communication is carried out by the encapsulation of SCSCI commands into
the TCP segments. As depicted in the image, iSCSI fits into the network protocol stack and sits on
top of the TCP/IP protocol stack. It takes SCSI commands, data, and responses and encapsulates
them into TCP segments for transportation. Upon receiving iSCSI TCP segments, the iSCSI layer
pulls out the SCSI information and passes it to the SCSI driver software.

IP SAN Protocols
Two primary protocols that leverage IP as the transport mechanism for block-level data transmission are
Internet SCSI (iSCSI) and Fibre Channel over IP (FCIP).

iSCSI

• IP-based protocol that enables transporting SCSI data over an IP network


• Encapsulates SCSI I/O into IP packets and transports them using TCP/IP

iSCSI is widely adopted for transferring SCSI data over IP between compute systems and
storage systems and among the storage systems. It is relatively inexpensive and easy to implement, especially in environments in which an FC SAN does not exist.

FCIP

• IP-based protocol that is used to interconnect distributed FC SAN islands over an IP


network
• Encapsulates FC frames onto IP packet and transports over existing IP network
• Enables transmission by tunneling data between FC SAN islands

FCIP: Organizations are looking for ways to transport data over a long distance between
their disparate FC SANs at multiple geographic locations. One of the best ways to achieve
this goal is to interconnect geographically dispersed FC SANs through reliable, high-speed
links. This approach involves transporting the FC block data over the IP infrastructure.
The FCIP standard has rapidly gained acceptance as a manageable, cost-effective way to
blend the best of the two worlds: FC SAN and the proven, widely deployed IP infrastructure.

iSCSI
iSCSI Overview
iSCSI is an IP-based protocol that establishes and manages connections between compute systems
and storage systems over IP.
It is an encapsulation of SCSI I/O over IP, where it encapsulates SCSI commands and data into IP
packets and transports them using TCP/IP.
It is widely adopted for transferring SCSI data over IP between compute systems and storage
systems and among the storage systems. iSCSI is relatively inexpensive and easy to implement, especially in environments in which an FC SAN does not exist.

Components of iSCSI Network


Key components for iSCSI communication are:

• iSCSI initiators
o Example: iSCSI HBA
• iSCSI targets
o Example: Storage system with iSCSI port
• IP-based network
o Example: Gigabit Ethernet LAN

Types of iSCSI Initiator
Hardware and software initiators are types of iSCSI initiators that are used by the host to access iSCSI
targets.

• Standard NIC with software iSCSI adapter


o NIC provides network interface
o Software adapters provide iSCSI functionality
o Both iSCSI and TCP/IP processing require CPU cycles of compute system
• TCP Offload Engine (TOE) NIC with software iSCSI adapter
o TOE NIC performs TCP/IP processing
o Software adapter provides iSCSI functionality
o iSCSI processing requires CPU cycles of compute system
• iSCSI HBA
o Performs both iSCSI and TCP/IP processing
o Frees-up CPU cycles of compute system for business applications

The computing operations of the software iSCSI initiator are performed by the server's operating system, whereas a hardware iSCSI initiator is a dedicated, host-based network interface card (NIC) with integrated resources to handle the iSCSI processing functions. The following are common
examples of iSCSI initiators:

• Standard NIC with software iSCSI adapter: The software iSCSI adapter is an operating
system or hypervisor kernel-resident software. It uses an existing NIC of the compute
system to emulate an iSCSI initiator. It is least expensive and easy to implement because
most compute systems come with at least one, and often with two embedded NICs. It
requires only a software initiator for iSCSI functionality. Because NICs provide standard
networking function, both the TCP/IP processing and the encapsulation of SCSI data into
IP packets are carried out by the CPU of the compute system. This functionality places
more overhead on the CPU. If a standard NIC is used in heavy I/O load situations, the CPU
of the compute system might become a bottleneck.

• TOE NIC with software iSCSI adapter: A TOE NIC offloads the TCP/IP processing from
the CPU of a compute system and leaves only the iSCSI functionality to the CPU. The
compute system passes the iSCSI information to the TOE NIC and then the TOE NIC sends
the information to the destination using TCP/IP. Although this solution improves
performance, the iSCSI functionality is still handled by a software adapter that requires
CPU cycles of the compute system.
• iSCSI HBA: An iSCSI HBA is a hardware adapter with built-in iSCSI functionality. It is
capable of providing performance benefits over software iSCSI adapters by offloading the
entire iSCSI and TCP/IP processing from the CPU of a compute system.

iSCSI Connectivity
iSCSI implementations support two types of connectivity: native and bridged. The connectivities are
described here:

• Native
o iSCSI initiators connect to iSCSI targets directly or through an IP network
o No FC component

Native iSCSI: In this type of connectivity, the compute systems with iSCSI initiators may be either directly attached to the iSCSI
targets or connected through an IP-based network. FC components are not required for native iSCSI
connectivity. The figure on the left shows a native iSCSI implementation that includes a storage system
with an iSCSI port. The storage system is connected to an IP network. After an iSCSI initiator is logged
on to the network, it can access the available LUNs on the storage system.

• Bridged
o iSCSI initiators are attached to IP network
o Storage systems are attached to FC SAN
o iSCSI gateway provides bridging functionality

Bridged iSCSI: This type of connectivity enables the initiators to exist in an IP environment while the
storage systems remain in an FC SAN environment. It enables the coexistence of FC with IP by providing
iSCSI-to-FC bridging functionality. The figure on the right illustrates a bridged iSCSI implementation. It

shows connectivity between a compute system with an iSCSI initiator and a storage system with an FC
port. As the storage system does not have any iSCSI port, a gateway or a multiprotocol router is used.
The gateway facilitates the communication between the compute system with iSCSI ports and the
storage system with only FC ports. The gateway converts IP packets to FC frames and conversely, thus
bridging the connectivity between the IP and FC environments. The gateway contains both FC and
Ethernet ports to facilitate the communication between the FC and the IP environments. The iSCSI
initiator is configured with the gateway’s IP address as its target destination. On the other side, the
gateway is configured as an FC initiator to the storage system.

Combining FC and Native iSCSI Connectivity


Typically, a storage system comes with both FC and iSCSI ports. The combination
enables both the native iSCSI connectivity and the FC connectivity in the same environment and
no bridge device is needed.

iSCSI Protocol Stack


The image displays a model of iSCSI protocol layers and depicts the encapsulation order of the SCSI
commands for their delivery through a physical carrier.

• SCSI is the command protocol that works at the application layer of the Open System
Interconnection (OSI) model
• The initiators and the targets use SCSI commands and responses to talk to each other
• The SCSI commands, data, and status messages are encapsulated into TCP/IP and
transmitted across the network between the initiators and the targets

iSCSI Address and Name
An iSCSI address is the path to iSCSI initiator/target, which is comprised of:

• Location of iSCSI initiator/target


o Combination of IP address and TCP port number
• iSCSI name
o Unique identifier for initiator/target in an iSCSI network

An iSCSI address is comprised of the location of an iSCSI initiator or target on the network and
the iSCSI name. The location is a combination of the host name or IP address and the TCP port
number. For iSCSI initiators, the TCP port number is omitted from the address.

iSCSI name is a unique worldwide iSCSI identifier that is used to identify the initiators and targets
within an iSCSI network to facilitate communication. The unique identifier can be a combination
of the names of the department, application, manufacturer, serial number, asset number, or any tag
that can be used to recognize and manage the iSCSI nodes. The following are three types of iSCSI
names commonly used:

• iSCSI Qualified Name (IQN): An organization must own a registered domain name to
generate iSCSI Qualified Names. This domain name does not need to be active or resolve
to an address. It needs to be reserved to prevent other organizations from using the same
domain name to generate iSCSI names. A date is included in the name to avoid potential
conflicts caused by the transfer of domain names. An example of an IQN is
iqn.2015-04.com.example:optional_string. The optional string provides a serial number, an asset
04.com.example:optional_string. The optional string provides a serial number, an asset
number, or any other device identifiers. IQN enables storage administrators to assign
meaningful names to the iSCSI initiators and the iSCSI targets, and therefore, manage
those devices more easily.
• Extended Unique Identifier (EUI): An EUI is a globally unique identifier based on the IEEE
EUI-64 naming standard. An EUI is composed of the eui prefix followed by a 16-character
hexadecimal name, such as eui.0300732A32598D26.
• Network Address Authority (NAA): NAA is another worldwide unique naming format as
defined by the International Committee for Information Technology Standards (INCITS)
T11 – Fibre Channel (FC) protocols and is used by Serial Attached SCSI (SAS). This format
enables the SCSI storage devices that contain both iSCSI ports and SAS ports to use the
same NAA-based SCSI device name. An NAA is composed of the naa prefix followed by
a hexadecimal name, such as naa.52004567BA64678D. The hexadecimal representation
has a maximum size of 32 characters (128 bit identifier).
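
The naming formats above can be illustrated with a short Python sketch that recognizes the three styles; the regular expressions are simplified approximations written for illustration and are not a complete validator of the standards.

import re

# Simplified, illustrative patterns for the three iSCSI name formats.
IQN_PATTERN = re.compile(r"^iqn\.\d{4}-\d{2}\.[A-Za-z0-9.\-]+(?::.+)?$")
EUI_PATTERN = re.compile(r"^eui\.[0-9A-Fa-f]{16}$")       # eui. + 16 hex characters
NAA_PATTERN = re.compile(r"^naa\.[0-9A-Fa-f]{16,32}$")    # naa. + up to 32 hex characters

def name_type(iscsi_name: str) -> str:
    """Return the naming format of an iSCSI node name, if recognized."""
    if IQN_PATTERN.match(iscsi_name):
        return "IQN"
    if EUI_PATTERN.match(iscsi_name):
        return "EUI"
    if NAA_PATTERN.match(iscsi_name):
        return "NAA"
    return "unknown"

# Examples taken from the text above
print(name_type("iqn.2015-04.com.example:optional_string"))  # IQN
print(name_type("eui.0300732A32598D26"))                      # EUI
print(name_type("naa.52004567BA64678D"))                      # NAA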

iSCSI Discovery
For iSCSI communication, an initiator must discover the location and name of the targets on the network.

iSCSI discovery commonly takes place in two ways:

• SendTargets discovery
o Initiator is manually configured with the target’s network portal
o Initiator issues SendTargets command; target responds with required parameters
• Internet Storage Name Service (iSNS)
o iSNS in the iSCSI SAN is equivalent in function to the name server in an FC SAN
o Initiators and targets register themselves with iSNS server
o Initiator may query iSNS server for a list of available targets

• iSNS Discovery Domain


• iSNS discovery domains function in the same way as FC zones. Discovery domains provide
functional groupings of devices (including iSCSI initiators and targets) in an IP SAN. The
iSNS server is configured with discovery domains.
• For devices to communicate with one another, they must be configured in the same
discovery domain. The iSNS server may send state change notifications (SCNs) to the

registered devices. State change notifications inform the registered devices about network
events. These events affect the operational state of devices such as the addition or removal
of devices from a discovery domain.
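
To make the iSNS idea concrete, here is a minimal Python sketch, assuming a toy registry with hypothetical class and method names, in which initiators and targets register with discovery domains and an initiator only "sees" targets that share a domain with it; a real iSNS server also tracks portals and sends state change notifications.

class SimpleNameService:
    """Toy model of an iSNS-like registry with discovery domains (illustrative only)."""

    def __init__(self):
        self.targets = {}        # target name -> set of discovery domains
        self.initiators = {}     # initiator name -> set of discovery domains

    def register_target(self, name, domains):
        self.targets[name] = set(domains)

    def register_initiator(self, name, domains):
        self.initiators[name] = set(domains)

    def query_targets(self, initiator_name):
        """Return targets that share at least one discovery domain with the initiator."""
        domains = self.initiators.get(initiator_name, set())
        return [t for t, d in self.targets.items() if d & domains]

isns = SimpleNameService()
isns.register_target("iqn.2015-04.com.example:storage1", ["dd-finance"])
isns.register_target("iqn.2015-04.com.example:storage2", ["dd-hr"])
isns.register_initiator("iqn.2015-04.com.example:host1", ["dd-finance"])

print(isns.query_targets("iqn.2015-04.com.example:host1"))
# ['iqn.2015-04.com.example:storage1'] -- only targets in a shared domain are discovered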

Virtual LAN (VLAN)


Definition: VLAN

A logical network created on a LAN enabling communication between a group of nodes with a common
set of functional requirements, independent of their physical location in the network.

VLANs are well-suited for iSCSI deployments because they isolate the iSCSI traffic from other network
traffic (for example, compute-to-compute traffic) within a physical Ethernet network.

Configuring a VLAN:

• Define VLANs on switches with specific VLAN IDs


• Configure VLAN membership based on a supported technique
o Port-based
o MAC-based

o Protocol-based
o IP subnet address-based
o Application-based

A VLAN conceptually functions in the same way as a VSAN. Each VLAN behaves and is
managed as an independent LAN. Two nodes connected to a VLAN can communicate
between themselves without routing of frames – even if they are in different physical
locations. VLAN traffic must be forwarded through a router or OSI Layer-3 switching device
when two nodes in different VLANs are communicating – even if they are connected to the
same physical LAN. Network broadcasts within a VLAN generally do not propagate to nodes
that belong to a different VLAN, unless configured to cross a VLAN boundary.
To configure VLANs, an administrator first defines the VLANs on the switches. Each VLAN
is identified by a unique 12-bit VLAN ID (as per IEEE 802.1Q standard). The next step is to
configure the VLAN membership based on an appropriate technique supported by the
switches. The membership techniques can be port-based, MAC-based, protocol-based, IP subnet address-
based, and application-based. In the port-based technique, membership in a VLAN is defined
by assigning a VLAN ID to a switch port. When a node connects to a switch port that belongs
to a VLAN, the node becomes a member of that VLAN.
In the MAC-based technique, the membership in a VLAN is defined by the MAC address of
the node. In the protocol-based technique, different VLANs are assigned to different
protocols based on the protocol type field found in the OSI Layer 2 header. In the IP subnet
address-based technique, the VLAN membership is based on the IP subnet address. All the
nodes in an IP subnet are members of the same VLAN. In the application-based technique, a
specific application, for example, a file transfer protocol (FTP) application can be configured
to execute on one VLAN. A detailed discussion on these VLAN configuration techniques is
beyond the scope of this course.

VLAN Trunking and Tagging


• VLAN trunking allows a single network link (trunk link) to carry multiple VLAN traffic
• To enable trunking, trunk ports must be configured on both sending and receiving network
components
• Sending network component inserts a tag field containing VLAN ID into an Ethernet frame
before sending through a trunk link
• Receiving network component reads the tag and forwards the frame to destination port(s)
o Tag is removed once a frame leaves trunk link to reach a node port

Similar to the VSAN trunking, network traffic from multiple VLANs may traverse a trunk
link. A single network port, called trunk port, is used for sending or receiving traffic from
multiple VLANs over a trunk link. Both the sending and the receiving network components
must have at least one trunk port configured for all or a subset of the VLANs defined on the
network component.
As with VSAN tagging, VLAN has its own tagging mechanism. The tagging is performed by
inserting a 4-byte tag field containing 12-bit VLAN ID into the Ethernet frame (as per IEEE
802.1Q standard) before it is transmitted through a trunk link. The receiving network

component reads the tag and forwards the frame to the destination port(s) that corresponds
to that VLAN ID. The tag is removed once the frame leaves a trunk link to reach a node port.
IEEE 802.1ad Multi-tagging: IEEE 802.1ad is an amendment to IEEE 802.1Q and enables
inserting multiple VLAN tags to an Ethernet frame. IEEE 802.1Q mandates a single tag with
a 12-bit VLAN ID field, which limits the number of VLANs in an environment theoretically
up to 4096. In a large environment such as a cloud infrastructure, this limitation may restrict
VLAN scalability. IEEE 802.1ad provides the flexibility to accommodate a larger number of
VLANs. For example, by using a double tag, theoretically 16,777,216 (4096 × 4096) VLANs
may be configured.
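
As a simplified illustration of the tagging format (not a full frame builder), the sketch below packs the 4-byte IEEE 802.1Q tag: the TPID value 0x8100 followed by the priority, DEI, and 12-bit VLAN ID fields.

import struct

def dot1q_tag(vlan_id: int, priority: int = 0, dei: int = 0) -> bytes:
    """Build the 4-byte IEEE 802.1Q tag: TPID 0x8100 + PCP/DEI/VLAN ID."""
    if not 0 <= vlan_id < 4096:
        raise ValueError("VLAN ID must fit in 12 bits")
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", 0x8100, tci)

tag = dot1q_tag(vlan_id=100, priority=3)
print(tag.hex())   # '81006064' -> TPID 0x8100, then TCI 0x6064 (PCP=3, VID=100)

With IEEE 802.1ad, an additional outer (service) tag is inserted ahead of this inner tag, which is what yields the 4096 × 4096 combinations mentioned above.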

Stretched VLAN
Definition: Stretched VLAN

A VLAN that spans multiple sites and enables OSI Layer 2 communication between a group of nodes over
an OSI Layer 3 WAN infrastructure, independent of their physical location.

In a typical multisite environment, network traffic between sites is routed through an OSI
Layer 3 WAN connection. Because of the routing, it is not possible to transmit OSI Layer 2
traffic between the nodes in two sites. A stretched VLAN extends a VLAN across the sites. It
also enables nodes in two different sites to communicate over a WAN as if connected to the
same network.
Stretched VLANs also enable the movement of virtual machines (VMs) between sites without
the need to change their network configurations. This simplifies the creation of high-
availability clusters, VM migration, and application and workload mobility across sites. The
clustering across sites, for example, enables moving VMs to an alternate site in the event of a
disaster or during the maintenance of one site. Without a stretched VLAN, the IP addresses
of the VMs must be changed to match the addressing scheme at the other site.

Advantages of IP SAN in Modern Data Center
Advances in IP-based networked storage technology such as IP SAN have created an opportunity
for organizations of all sizes to cost-effectively build, manage, and maintain their data center. In
comparison to internal server storage or DAS, it efficiently handles the complexity of the modern
data center by using existing IP networks and components.
In a data center IP SAN offers multiple advantages which are common to midsize businesses,
including the following:
• Increased utilization: Consolidated IP-based storage enables servers to access and share
storage, helping maximize utilization of these resources
• Reduced management costs: Consolidated storage enables centralized management, helping
simplify administrative tasks and reduce management costs
• Increased reliability: A shared set of dedicated IP-based storage systems can help
significantly increase the reliability and availability of application data
• Simplified backup and recovery: IP SAN enables administrators to easily implement consistent,
common, and simple backup and recovery processes

FCIP
Video: FCIP
FCIP Overview
FC SAN provides a high-performance infrastructure for localized data movement, while FCIP extends FC SAN connectivity over an IP network. FCIP:

• Provides an IP-based protocol that is used to interconnect distributed FC SAN islands over an
IP network
• Encapsulates FC frames into IP packets and transports them over an existing IP network
• Enables transmission by tunneling data between FC SAN islands
• Provides a disaster recovery solution by enabling replication of FC data across an IP network
• Facilitates data sharing and data collaboration from worldwide locations

FCIP Connectivity
• An FCIP tunnel consists of one or more independent connections between two FCIP ports
o Transports encapsulated FC frames over TCP/IP
• FCIP entity such as FCIP gateway is connected to each fabric to enable tunneling through
an IP network

In an FCIP environment, an FCIP entity such as an FCIP gateway is connected to each fabric
through a standard FC connection. The FCIP gateway at one end of the IP network
encapsulates the FC frames into IP packets. The gateway at the other end removes the IP
wrapper and sends the FC data to the adjoined fabric. The fabric treats these gateways as
fabric switches. An IP address is assigned to the port on the gateway, which is connected to
an IP network. After the IP connectivity is established, the nodes in the two independent
fabrics can communicate with each other.
An FCIP tunnel consists of one or more independent connections between two FCIP ports on
gateways (tunnel endpoints). Each tunnel transports encapsulated FC frames over a TCP/IP
network. The nodes in either fabric are unaware of the existence of the tunnel. Multiple
tunnels may be configured between the fabrics based on connectivity requirement. Some
implementations enable aggregating FCIP links (tunnels) to increase throughput and to
provide link redundancy and load balancing.

FCIP Tunnel Configuration – Separate Fabric


Only a small subset of nodes in either fabric requires connectivity across an FCIP tunnel. Thus, an FCIP
tunnel may also use vendor-specific features to route network traffic between specific nodes without
merging the fabrics.

The image illustrates a solution for FC-FC routing but the FCIP tunnel is configured in a way that does not
merge the fabrics. In this deployment:

• EX_Port and VE_Port are configured on each FCIP gateway


• The EX_Port on the FCIP gateway connects to an E_Port on an FC switch in the adjoined
fabric
• The EX_Port functions similarly to an E_Port, but does not propagate fabric services from
one fabric to another
• The EX_Port enables FC-FC routing through the FCIP tunnel, but the fabrics remain
separate

FCIP Protocol Stack
Protocol Stack

The FCIP protocol stack is shown on the image.

• Applications generate SCSI commands and data, which are processed by various layers of
the protocol stack
• The upper layer protocol SCSI includes the SCSI driver program that executes the read-
and-write commands
• Below the SCSI layer is the FC protocol (FCP) layer, which is simply an FC frame whose
payload is SCSI
• The FC frames can be encapsulated into IP packets and sent to a remote FC SAN over the IP
network
• The FCIP layer encapsulates the FC frames into the IP payload and passes them to the TCP
layer
• TCP and IP are used for transporting the encapsulated information across Ethernet,
wireless, or other media that support the TCP/IP traffic
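
The layering can be sketched conceptually in Python; the bracketed "headers" below are placeholders rather than real protocol headers, and the sketch only shows the order in which the SCSI payload is wrapped.

def wrap(layer_name: bytes, payload: bytes) -> bytes:
    """Prepend a placeholder header for one protocol layer (illustrative only)."""
    return b"[" + layer_name + b"]" + payload

scsi_payload = b"SCSI WRITE + data"           # generated by the application/SCSI driver
fc_frame     = wrap(b"FCP", scsi_payload)     # FC frame whose payload is SCSI
fcip_pdu     = wrap(b"FCIP", fc_frame)        # FCIP encapsulates the FC frame
tcp_segment  = wrap(b"TCP", fcip_pdu)         # TCP provides reliable delivery
ip_packet    = wrap(b"IP", tcp_segment)       # IP carries the packet across the WAN

print(ip_packet)
# b'[IP][TCP][FCIP][FCP]SCSI WRITE + data'  -- outermost header first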

Encapsulation

Encapsulation of an FC frame into an IP packet could cause the IP packet to be fragmented. The
fragmentation occurs when the data link cannot support the maximum transmission unit (MTU)
size of an IP packet.

• When an IP packet is fragmented, the required parts of the header must be copied by all
fragments

• When a TCP packet is segmented, normal TCP operations are responsible for receiving and
resequencing the data
• The receiving and resequencing is performed prior to passing it on to the FC processing
portion of the device
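
As a rough illustration of why fragmentation occurs, the arithmetic below estimates how many IP fragments a large encapsulated frame would need on a link with a 1500-byte MTU; the 2200-byte frame size is an arbitrary example and only the minimum 20-byte IPv4 header is considered.

import math

MTU = 1500            # typical Ethernet MTU in bytes
IP_HEADER = 20        # minimum IPv4 header size in bytes

encapsulated_frame = 2200   # example size of an FC frame after FCIP/TCP encapsulation

payload_per_packet = MTU - IP_HEADER
fragments = math.ceil(encapsulated_frame / payload_per_packet)
print(fragments)      # 2 -- the frame does not fit in a single 1500-byte packet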

FCoE
Video: FCoE
FCoE Overview
Fibre Channel over Ethernet (FCoE):

• A protocol that transports FC data along with regular Ethernet traffic over a Converged
Enhanced Ethernet (CEE) network
• Uses FCoE protocol, defined by the T11 standards committee, that encapsulates FC frames
into Ethernet frames.
• Based on an enhanced Ethernet standard that supports Data Center Bridging (DCB)
functionalities (also called CEE functionalities)

Drivers for FCoE


Multi-function network components are used to transfer both compute-to-compute and FC storage traffic
to:

• Reduce the complexity of managing multiple discrete networks


• Reduce the number of network adapters, cables, and switches required in a data center
• Reduce power and space consumption in a data center

Data centers typically have multiple networks to handle various types of network traffic –
such as, an Ethernet LAN for TCP/IP communication and an FC SAN for FC
communication. TCP/IP is typically used for compute-to-compute communication, data
backup, infrastructure management communication, and so on. FC is typically used for
moving block-level data between storage systems and compute systems.
To support multiple networks, compute systems in a data center are equipped with multiple
redundant physical network interfaces – for example, multiple Ethernet and FC network
adapters. In addition, to enable the communication, different types of networking switches and
physical cabling infrastructure are implemented in data centers. The need for two different
kinds of physical network infrastructure increases the overall cost and complexity of data
center operation.
FCoE provides the flexibility to deploy the same network components for transferring both
compute-to-compute traffic and FC storage traffic. This helps to mitigate the complexity of
managing multiple discrete network infrastructures. FCoE uses multi-functional network

adapters and switches. Therefore, FCoE reduces the number of network adapters, cables,
and switches, along with power and space consumption required in a data center.

Components of FCoE
The key FCoE components are:

• Network adapters
o Example: Converged Network Adapter (CNA) and software FCoE adapter
• Cables
o Example: Copper cables and fiber optical cables
• FCoE switch

What is CNA?
A physical adapter that provides functionality of both NIC and FC HBA, plus:

• Encapsulates FC frames into Ethernet frames and forwards them over CEE links
• Contains separate modules for 10 GE, FC, and FCoE ASICs

CNA consolidates both FC traffic and regular Ethernet traffic on a common Ethernet
infrastructure. CNAs connect compute systems to the FCoE switches. They are responsible
for encapsulating FC traffic onto Ethernet frames and forwarding them to FCoE switches
over CEE links. They eliminate the need to deploy separate adapters and cables for FC and
Ethernet communications, thus reducing the required number of network adapters and
switch ports.
A CNA offloads the FCoE protocol processing task from the compute system, freeing the
CPU resources of the compute system for application processing. It contains separate
modules for 10-Gigabit Ethernet (GE), FC, and FCoE Application Specific Integrated
Circuits (ASICs). The FCoE ASIC encapsulates FC frames into Ethernet frames. One end of
this ASIC is connected to 10 GE and FC ASICs for compute system connectivity, while the
other end provides a 10 GE interface to connect to an FCoE switch.

FCoE Switch
An FCoE switch has both Ethernet switch and FC switch functionalities. It has a Fibre Channel Forwarder
(FCF), an Ethernet Bridge, and a set of ports that can be used for FC and Ethernet connectivity:

• FCF functions as the communication bridge between CEE and FC networks


o Handles FCoE login requests, applies zoning, and provides fabric services
o Encapsulates and decapsulates FC frames
• Upon receiving the incoming Ethernet traffic, the FCoE switch inspects the Ethertype and
forwards the frame to the appropriate destination.
o FCoE frames contain an FC payload and are forwarded to the FCF where the frame
is extracted and sent to the FC SAN over the FC ports
o Non-FCoE frames are handled as typical Ethernet traffic and forwarded over the
Ethernet ports
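
The forwarding decision can be modeled as a simple dispatch on the Ethertype; 0x8906 is the Ethertype assigned to FCoE, while the rest of the logic is a deliberately simplified, hypothetical sketch of the switch behavior described above.

FCOE_ETHERTYPE = 0x8906   # Ethertype assigned to FCoE frames

def forward(ethertype: int, frame: bytes) -> str:
    """Simplified model of the FCoE switch forwarding decision."""
    if ethertype == FCOE_ETHERTYPE:
        # FCoE frame: hand the encapsulated FC frame to the Fibre Channel
        # Forwarder (FCF), which de-encapsulates it and sends it to the FC SAN.
        return "to FCF -> FC SAN (FC ports)"
    # Any other Ethertype (IPv4, ARP, and so on) is ordinary Ethernet traffic.
    return "to Ethernet bridge (Ethernet ports)"

print(forward(0x8906, b"...encapsulated FC frame..."))  # to FCF -> FC SAN (FC ports)
print(forward(0x0800, b"...IPv4 packet..."))            # to Ethernet bridge (Ethernet ports)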

FCoE SAN Connectivity


The most common FCoE connectivity uses FCoE switches to interconnect a CEE network containing
compute systems with an FC SAN containing storage systems:

• The compute systems have FCoE ports that provide connectivity to the FCoE switches
• The FCoE switches enable the consolidation of FC traffic and Ethernet traffic onto CEE
links
This type of FCoE connectivity is suitable when an organization has an existing FC SAN environment.
Connecting FCoE compute systems to the FC storage systems through FCoE switches does not require any
change in the FC environment.

Concepts In Practice
Dell PowerConnect B-8000 Network Switch

• Provides a unified FCoE Solution


• Supports 10-GbE and FC ports
• Supports comprehensive Layer 2 LAN capabilities with high performance and availability
• Provides a versatile solution for Server I/O Consolidation

A top-of-rack link layer CEE/DCB and FCoE switch. It comprises 24 10-Gigabit Ethernet ports for LAN
connections and 8 Fibre Channel ports with up to 8-Gigabit speed for Fibre Channel SAN connections.
The network switch supports comprehensive Layer 2 LAN capabilities and provides high performance
and availability. The PowerConnect B-8000 network switch also supports server I/O consolidation.

Dell EMC Networking S-Series 10GbE switches

• Provides high performance open networking top-of-rack switches


• Provides support for iSCSI storage area networks
• Provides flexibility and is cost effective
• Flexible, powerful 10-GbE ToR switches for data centers of all sizes

High-performance open networking top-of-rack switches with multirate Gigabit Ethernet and unified
ports. They offer flexibility and cost-effectiveness for enterprises and Tier 2 cloud service providers with
demanding compute and storage traffic environments. The switches support iSCSI and FC storage
deployment, including DCB converged lossless transactions. The family comprises the 10GbE S4000-ON Series
switches and the 1/10G BASE-T S4048T-ON, S4128T-ON, and S4148T-ON switches. Dell EMC Networking S-Series
10GbE switches offer active fabric designs using S- or Z-Series core switches to create a two-tier
1/20/40/100-GbE data center network.

Dell Networking Z-Series core/aggregation switches

• Provides optimal flexibility, performance, density, and power efficiency


• Includes 10/25/40/50/100GbE options

Open networking and SDN-ready fixed form factor switches. They are purpose-built for applications in
modern computing environments. They not only simplify manageability but also provide optimal flexibility,
performance, density, and power efficiency for the data center. The series supports both VLAN tagging and
double VLAN tagging and comprises 10/25/40/50/100GbE options.

Dell EMC S4148U

• Offers various port speed choices for Fibre Channel and Ethernet connectivity
• Provides flexibility and high performance for modern workloads
• Can be used in the following use cases:
o Provide end-to-end FC switch connectivity

o NPIV Gateway Edge switch in large multi-vendor SAN environments
• Supports up to 32 Gbps FC and 100 GbE Ethernet connectivity

A feature-rich, multi-functional switch offering various port speed choices for Fibre Channel and Ethernet
connectivity. It is designed for flexibility and high performance for today’s demanding modern
workloads. It can be used as an end-to-end FC switch and as an NPIV Gateway Edge
switch in a large multi-vendor SAN environment. It supports up to 32 Gbps FC and 100 GbE Ethernet
connectivity.

Question 1
Which protocol is used by IP SAN for the transport of block-level data?

• ICMP
• ARP
• Internet Protocol (correct)

Question 2
Which of the following is not a key component for iSCSI communication?

• IP-based network
• Initiator
• Target
• Buffer (correct)

Question 3
Which function is supported by FCIP?

• Transferring SCSI data over Ethernet
• Transporting the FC block data over the Ethernet
• Transferring SCSI data over IP
• Transporting the FC block data over the IP infrastructure (correct)

File-Based and Object-Based Storage System
NAS Components and Architecture
Video: NAS Components and Architecture
File Sharing Environment
• File sharing enables users to share files with other users
• Creator or owner of a file determines the type of access to be given to other users
• File sharing environment ensures data integrity when multiple users access a shared file
simultaneously
• Examples of file sharing methods:
o File Transfer Protocol (FTP)
o Peer-to-Peer (P2P)
o Network File System (NFS) and Common Internet File System (CIFS)
o Distributed File System (DFS)

In a file-sharing environment, a user who creates the file (the creator or owner of a file)
determines the type of access (such as read, write, execute, append, delete) to be given to other
users. When multiple users try to access a shared file simultaneously, a locking scheme is
required to maintain data integrity and simultaneously make this sharing possible.
Some examples of file-sharing methods are the peer-to-peer (P2P) model, File Transfer
Protocol (FTP), client/server models that use file-sharing protocols such as NFS and CIFS,
and Distributed File System (DFS). FTP is a client/server protocol that enables data transfer
over a network. An FTP server and an FTP client communicate with each other using TCP
as the transport protocol.
A peer-to-peer (P2P) file sharing model uses a peer-to-peer network. P2P enables client
machines to directly share files with each other over a network. Clients use file sharing
software that searches for other peer clients. This model differs from the client/server model,
which uses file servers to store files for sharing.
The standard client/server file-sharing protocols are NFS and CIFS. These protocols enable
the owner of a file to set the required type of access, such as read-only or read/write, for a
particular user or group of users. Using these protocols, the clients mount remote file systems
that are available on dedicated file servers.
A distributed file system (DFS) is a file system that is distributed across several compute
systems. A DFS can provide compute systems with direct access to the entire file system, while
ensuring efficient management and data security. Hadoop Distributed File System (HDFS) is
an example of a distributed file system that is discussed later in this module. Vendors now
support HDFS on their NAS systems to support the scale-out architecture. The scale-out
architecture helps to meet the big data analytics requirements.

What Is NAS?
Definition: NAS

An IP-based, dedicated, high-performance file sharing and storage device.

• Enables NAS clients to share files over IP network


• Uses specialized operating system that is optimized for file I/O
• Enables both UNIX and Windows users to share data

NAS provides the advantages of server consolidation by eliminating the need for multiple file
servers. It also consolidates the storage used by the clients onto a single system, making it
easier to manage the storage. NAS uses network and file-sharing protocols to provide access
to the file data. These protocols include TCP/IP for data transfer and Common Internet File
System (CIFS) and Network File System (NFS) for network file service. Apart from these
protocols, the NAS systems may also use HDFS and its associated protocols (discussed later
in the module) over TCP/IP to access files. NAS enables both UNIX and Microsoft Windows
users to share the same data seamlessly.
A NAS device uses its own operating system and integrated hardware and software
components to meet specific file-service needs. Its operating system is optimized for file I/O
and, therefore, performs file I/O better than a general-purpose server. As a result, a NAS
device can serve more clients than general-purpose servers and provide the benefit of server
consolidation.

General Purpose Servers Vs. NAS Devices


A NAS device is optimized for file-serving functions such as storing, retrieving, and accessing files for
applications and clients; as shown on the image:

• A general-purpose server can be used to host any application because it runs a general-
purpose operating system
• Unlike a general-purpose server, a NAS device is dedicated to file-serving

• It has a specialized operating system dedicated for file serving by using industry standard
protocols. NAS vendors also support features, such as clustering for high availability,
scalability, and performance
• The clustering feature enables multiple NAS controllers/heads/nodes to function as a single
entity. The workload can be distributed across all the available nodes. Therefore, NAS
devices support massive workloads

Components of NAS System


• Controller/NAS head consists of:
o CPU, memory, network adapter, and so on
o Specialized operating systems installed
• Storage
o Supports different types of storage devices
• Scalability of the components depends on NAS architecture
o Scale-up NAS
o Scale-out NAS

A NAS system consists of two components, controller and storage. A controller is a compute
system that contains components such as network, memory, and CPU resources. A
specialized operating system optimized for file serving is installed on the controller. Each
controller may connect to all storage in the system. The controllers can be active/active, with
all controllers accessing the storage, or active/passive with some controllers performing all
the I/O processing while others act as spares. A spare is used for I/O processing if an active
controller fails. The controller is responsible for configuration of RAID set, creating LUNs,
installing file system, and exporting the file share on the network.

Storage is used to persistently store data. The NAS system may have different types of storage
devices to support different requirements. The NAS system may support SSD, SAS, and
SATA in a single system.
The extent to which the components, such as CPU, memory, network adapters, and storage,
can be scaled depends upon the type of NAS architecture used. There are two types of NAS
architectures: scale-up and scale-out. Both of these architectures are detailed in the next few
slides.

Scale-Up NAS
A scale-up NAS architecture provides the capability to
scale the capacity and performance of a single NAS
system based on requirements. Scaling up a NAS
system involves upgrading or adding NAS heads and
storage.
These NAS systems have a fixed capacity ceiling,
which limits their scalability. The performance of these
systems starts degrading when reaching the capacity
limit.

Scale-Up NAS Implementations


There are two types of scale-up NAS implementations:
Unified NAS A unified NAS system contains one or more NAS heads and storage in a single system.
NAS heads are connected to the storage. The storage may consist of different drive types, such as
SAS, ATA, FC, and solid-state drives, to meet different workload requirements.
Each NAS head in a unified NAS has front-end Ethernet ports, which connect to the IP network.
The front-end ports provide connectivity to the clients. Each NAS head has back-end ports to
provide connectivity to the attached storage. Unified NAS systems have NAS management
software that can be used to perform all the administrative tasks for the NAS head and storage.

Gateway NAS
A gateway NAS system consists of one or more NAS heads and uses external and independently
managed storage. In gateway NAS implementation, the NAS gateway shares the storage from a
block-based storage system. The management functions in this type of solution are more complex
than those in a unified NAS environment. This is because there are separate
administrative tasks for the NAS head and the storage.
The administrative tasks of the NAS gateway are performed by the NAS management software.
The storage system is managed with the management software of the block-based storage system.
A gateway solution can use the FC infrastructure, such as switches and directors for accessing
SAN-attached storage arrays or direct-attached storage arrays.

Scale-Out NAS
• Pools multiple nodes in a cluster to work as a single NAS device
• Scales performance and/or capacity non-disruptively
• Creates a single file system that runs on all nodes in the cluster
o Clients connected to any node can access the entire file system
o File system grows dynamically as nodes are added
• Stripes data across nodes with mirror or parity protection

The scale-out NAS implementation pools multiple NAS nodes together in a cluster. A node
may consist of either the NAS head or the storage or both. The cluster performs the NAS
operation as a single entity. A scale-out NAS provides the capability to scale its resources by
simply adding nodes to a clustered NAS architecture. The cluster works as a single NAS
device and is managed centrally. Nodes can be added to the cluster, when more performance
or more capacity is needed, without causing any downtime. Scale-out NAS provides the
flexibility to use many nodes of moderate performance and the availability characteristics.
This scale-out NAS produce a total system that has better aggregate performance and
availability. It also provides ease of use, low cost, and theoretically unlimited scalability.

Scale-out NAS uses a distributed clustered file system that runs on all nodes in the cluster.
All information is shared among nodes, so the entire file system is accessible by clients
connecting to any node in the cluster. Scale-out NAS stripes data across all nodes in a cluster
along with mirror or parity protection. As data is sent from clients to the cluster, the data is
divided and allocated to different nodes in parallel. When a client sends a request to read a
file, the scale-out NAS retrieves the appropriate blocks from multiple nodes. It recombines
the blocks into a file, and presents the file to the client. As nodes are added, the file system
grows dynamically and data is evenly distributed to every node. Each node added to the
cluster increases the aggregate storage, memory, CPU, and network capacity. Hence, cluster
performance is also increased.
Scale-out NAS clusters use separate internal and external networks for back-end and front-
end connectivity respectively. An internal network provides connections for intra-cluster
communication, and an external network connection enables clients to access and share file
data. Each node in the cluster connects to the internal network. The internal network offers
high throughput and low latency and uses high-speed networking technology, such as
InfiniBand or Gigabit Ethernet. To enable clients to access a node, the node must be
connected to the external Ethernet network. Redundant internal or external networks may
be used for high availability.

Tip: InfiniBand is a networking technology that provides a low-latency, high-bandwidth


communication link between hosts and peripherals. It provides serial connection and is often
used for inter-server communications in high-performance computing environments.
InfiniBand enables remote direct memory access (RDMA) that enables a device (host or
peripheral) to access data directly from the memory of a remote device. InfiniBand also
enables a single physical link to carry multiple channels of data simultaneously by using a
multiplexing technique.

NAS File Access Methods


Different methods can be used to access files on a NAS system. The most common methods are:

• Common Internet File System / Server Message Block (CIFS/SMB)


• Network File System (NFS)
• Hadoop Distributed File System (HDFS)

CIFS/SMB

• Client-server application protocol


o An open variation of the Server Message Block (SMB) protocol which is used for
Windows file sharing
• Enables clients to access files that are on a server over TCP/IP
• Stateful Protocol
o Maintains connection information regarding every connected client
o Can automatically restore connections and reopen files that were open prior to
interruption

Common Internet File System (CIFS) is a client/server application protocol that enables
client programs to make requests for files and services on remote computers over TCP/IP. It
is a public or open variation of Server Message Block (SMB) protocol.
The CIFS protocol enables remote clients to gain access to files on a server. CIFS enables file
sharing with other clients by using special locks. Filenames in CIFS are encoded using
Unicode characters. CIFS provides the following features to ensure data integrity:

• It uses file and record locking to prevent users from overwriting the work of another
user on a file or a record.
• It supports fault tolerance and can automatically restore connections and reopen files
that were open prior to an interruption.

The fault tolerance features of CIFS depend on whether an application is written to take
advantage of these features. Moreover, CIFS is a stateful protocol because the CIFS server
maintains connection information regarding every connected client. If a network failure or
CIFS server failure occurs, the client receives a disconnection notification. User disruption is
minimized if the application has the embedded intelligence to restore the connection.
However, if the embedded intelligence is missing, the user must take steps to reestablish the
CIFS connection.
Users refer to remote file systems with an easy-to-use file-naming scheme: \\server\share or
\\servername.domain.suffix\share.

NFS

• Client-server application protocol


• Enables clients to access files that are on a server
• Uses Remote Procedure Call (RPC) mechanism to provide access to remote file system

Network File System (NFS) is a client/server protocol for file sharing that is commonly used
on UNIX systems. NFS was originally based on the connectionless User Datagram Protocol
(UDP). It uses a machine-independent model to represent user data. It also uses Remote
Procedure Call (RPC) for interprocess communication between two computers.
The NFS protocol provides a set of RPCs to access a remote file system for the following
operations:

• Searching files and directories


• Opening, reading, writing to, and closing a file
• Changing file attributes
• Modifying file links and directories

NFS creates a connection between the client and the remote system to transfer data.

HDFS

• A file system that spans multiple nodes in a cluster and enables user data to be stored in
files.
• Presents a traditional hierarchical file organization so that users or applications can
manipulate (create, rename, move, or remove) files and directories
• Presents a streaming interface to run any application of choice using the MapReduce
framework

HDFS is supported by many of the scale-out NAS vendors. HDFS requires programmatic access because
the file system cannot be mounted. All HDFS communication is layered on top of the TCP/IP protocol.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode that acts as a
master server. The NameNode holds in-memory maps of every file and its location, as well as all the blocks within
each file and the DataNodes they reside on. The NameNode is responsible for managing the file system
namespace and controlling the access to the files by clients. DataNodes act as slaves that serve
read/write requests and perform block creation, deletion, and replication as directed by the NameNode.
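
A highly simplified Python model of this master/slave split is sketched below; the class names, fixed block size, and round-robin placement are illustrative assumptions, not HDFS internals.

class NameNode:
    """Toy master: maps file names to (block_id, datanode) pairs."""
    def __init__(self):
        self.namespace = {}                     # file name -> list of (block_id, datanode)

    def add_file(self, name, blocks):
        self.namespace[name] = blocks

    def locate(self, name):
        return self.namespace.get(name, [])

class DataNode:
    """Toy slave: stores block contents and serves reads."""
    def __init__(self):
        self.blocks = {}                        # block_id -> bytes

def write_file(namenode, datanodes, name, data, block_size=4):
    """Split data into fixed-size blocks and spread them across DataNodes."""
    placement = []
    for i in range(0, len(data), block_size):
        block_id = f"{name}-blk{i // block_size}"
        node = datanodes[(i // block_size) % len(datanodes)]
        node.blocks[block_id] = data[i:i + block_size]
        placement.append((block_id, node))
    namenode.add_file(name, placement)

def read_file(namenode, name):
    """Ask the NameNode for block locations, then read from the DataNodes."""
    return b"".join(node.blocks[blk] for blk, node in namenode.locate(name))

nn, dns = NameNode(), [DataNode(), DataNode(), DataNode()]
write_file(nn, dns, "scan.img", b"0123456789AB")
print(read_file(nn, "scan.img"))   # b'0123456789AB'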

Scale-Up NAS I/O Operation

The figure illustrates an I/O operation in a scale-up NAS system. The process of handling I/Os in
a scale-up NAS environment is as follows:

• The requestor (client) packages an I/O request into TCP/IP and forwards it through the
network stack. The NAS system receives this request from the network.
• The NAS system converts the I/O request into an appropriate physical storage request,
which is a block-level I/O. This system then performs the operation on the physical storage.
• When the NAS system receives data from the storage, it processes and repackages the data
into an appropriate file protocol response.
• The NAS system packages this response into TCP/IP again and forwards it to the client
through the network.

Scale-Out NAS I/O Operation

The figure illustrates I/O operation in a scale-out NAS system. A scale-out NAS consists of multiple NAS
nodes and each of these nodes has the functionality similar to a NameNode or a DataNode. In some
proprietary scale-out NAS implementations, each node may function as both a NameNode and DataNode,
typically to provide Hadoop integration. All the NAS nodes in scale-out NAS are clustered.

Write Operation
• Client sends a file to the NAS node
• Node to which the client is connected receives the file
• File is striped across the nodes

Read Operation
• Client requests a file
• Node to which the client is connected receives the request
• The node retrieves and rebuilds the file and gives it to the client

New nodes can be added as required. As new nodes are added, the file system grows dynamically and
is evenly distributed to each node. As the client sends a file to store to the NAS system, the file is evenly
striped across the nodes. When a client writes data, even though that client is connected to only one
node, the write operation occurs in multiple nodes in the cluster. The same is true for read
operations. A client is connected to only one node at a time. However, when that client requests a file
from the cluster, the node to which the client is connected may not have the entire file locally on its drives.
The node retrieves and rebuilds the file using the back-end InfiniBand
network.
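
The striping and rebuilding behavior can be illustrated with a small Python sketch; the stripe size and node count are arbitrary example values, and real scale-out NAS systems add mirror or parity protection and parallel I/O on top of this.

def stripe(data: bytes, nodes: int, stripe_size: int = 4):
    """Divide data into stripe units and distribute them round-robin across nodes."""
    layout = [[] for _ in range(nodes)]
    for i in range(0, len(data), stripe_size):
        layout[(i // stripe_size) % nodes].append(data[i:i + stripe_size])
    return layout

def rebuild(layout):
    """Rebuild the original file by reading stripe units back in round-robin order."""
    out, index = bytearray(), 0
    positions = [0] * len(layout)
    while any(positions[n] < len(layout[n]) for n in range(len(layout))):
        node = index % len(layout)
        if positions[node] < len(layout[node]):
            out += layout[node][positions[node]]
            positions[node] += 1
        index += 1
    return bytes(out)

layout = stripe(b"ABCDEFGHIJKLMNOP", nodes=3)
print(layout)             # stripe units spread across the three nodes
print(rebuild(layout))    # b'ABCDEFGHIJKLMNOP'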

File-Level Virtualization and Tiering
Video: File-level Virtualization and Tiering
What Is File-Level Virtualization?
A network-based file sharing environment is composed of multiple file servers or NAS devices. It might be
required to move the files from one device to another due to reasons such as cost or performance. File-
level virtualization, which is implemented in NAS or the file server environment, provides a simple, non-
disruptive file-mobility solution. It also:

• Eliminates dependency between data accessed at the file-level and the location where the
files are physically stored
• Enables users to use a logical path, rather than a physical path, to access files
• Uses global namespace that maps logical path of file resources to their physical path
• Provides non-disruptive file mobility across file servers or NAS devices

Before and After File-Level Virtualization


Before virtualization, each client knows exactly where its file resources are located. This
environment leads to underutilized storage resources and capacity problems because files are bound
to a specific NAS device or file server. It may be required to move the files from one server to
another because of performance reasons or when the file server fills up. Moving files across the
environment is not easy and may make files inaccessible during file movement. Moreover, hosts
and applications need to be reconfigured to access the file at the new location. This operation makes
it difficult for storage administrators to improve storage efficiency while maintaining the required
service level.
File-level virtualization simplifies file mobility. It provides user or application independence from
the location where the files are stored. File-level virtualization facilitates the movement of files
across online file servers or NAS devices. It means that while the files are being moved, clients can
access their files non-disruptively. Clients can also read their files from the old location and write
them back to the new location without realizing that the physical location has changed.

File-Level Storage Tiering
• Moves files from higher tier to lower tier
• Storage tiers are defined based on cost, performance, and availability parameters
• Uses policy engine to determine the files that are required to move to the lower tier
• Predominant use of file tiering is archival

As the unstructured data in the NAS environment grows, organizations deploy a tiered
storage environment. This environment optimizes the primary storage for performance and
the secondary storage for capacity and cost.
Storage tiering works on the principle of Hierarchical Storage Management (HSM). HSM is
a file mobility concept in which a policy engine, which can be software or hardware, moves
files from the primary storage tier to the secondary storage tier according to predefined
policies. In HSM, a hierarchy of storage tiers is defined based on parameters such as cost,
performance, and/or availability of storage. Some prevalent reasons to tier data across storage
systems, or between a storage system and the cloud, are archival and meeting compliance
requirements.
As an example, the policy engine might be configured to relocate all the files in the primary
storage tier that have not been accessed in one month and archive those files to the secondary
storage. For each archived file, the policy engine creates a small space-saving stub file in the
primary storage that points to the data on the secondary storage. When a user tries to access
the file from its original location on the primary storage, the user is transparently provided
with the actual file from the secondary storage.
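
A minimal sketch of such a policy engine is shown below, assuming a simple "not accessed in 30 days" rule and using a small text file as the stub; real implementations rely on filesystem-specific stub or link mechanisms.

import os
import shutil
import time

STUB_SUFFIX = ".stub"          # hypothetical marker for a relocated file
AGE_LIMIT = 30 * 24 * 3600     # example policy: not accessed in 30 days

def tier_down(primary_dir: str, secondary_dir: str) -> None:
    """Move cold files to secondary storage, leaving a small stub behind."""
    now = time.time()
    for name in os.listdir(primary_dir):
        src = os.path.join(primary_dir, name)
        if not os.path.isfile(src) or name.endswith(STUB_SUFFIX):
            continue
        if now - os.path.getatime(src) > AGE_LIMIT:      # policy check: last access time
            dst = os.path.join(secondary_dir, name)
            shutil.move(src, dst)                        # relocate the data file
            with open(src + STUB_SUFFIX, "w") as stub:   # leave a space-saving pointer
                stub.write(dst)

When a client later opens a stubbed file, the NAS or archiving software transparently recalls the data from the secondary tier, as described above.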

Inter-array Tiering and Cloud Tiering
The figure illustrates the file-level storage tiering. In a file-level storage tiering environment, a file
can be moved to a secondary storage tier or to the cloud. Before moving a file from primary NAS
to secondary NAS or from primary NAS to cloud, the policy engine scans the primary NAS to
identify files that meet the predefined policies. After identifying the data files, the stub files are
created and the data files are moved to the destination storage tier.

Use-Case for Scale-Out NAS: Data Lake


The data lake represents a paradigm shift from the linear data flow model. As data and the insights
gathered from it increase in value, the enterprise-wide consolidated storage is transformed into a hub
around which the ingestion and consumption systems work (see figure). This enables enterprises to bring
analytics to data and avoid expensive cost of multiple systems, storage, and time for ingestion and analysis.

The key characteristics of a scale-out data lake are that it:

• Accepts data from various sources like file shares, archives, web applications, devices, and
the cloud, in both streaming and batch processes
• Enables access to this data for a variety of uses from conventional purpose to mobile,
analytics, and cloud applications
• Scales to meet the demands of future consolidation and growth as technology evolves and new
possibilities emerge for applying data to gain competitive advantage in the marketplace
• Provides a tiering ability that enables organizations to manage their costs without setting
up specialized infrastructures for cost optimization

By eliminating a number of parallel linear data flows, enterprises can consolidate vast
amounts of their data into a single store, a data lake, through a native and simple ingestion
process. Analytics can be performed on this data which provides insight. Actions can be taken
based on this insight in an iterative manner, as the organization and technology matures.
Enterprises can thus eliminate the cost of having silos or islands of information spread across
their enterprises.
Scale-out NAS has the ability to provide the storage platform for this data lake. The scale-out
NAS enhances this paradigm by providing scaling capabilities in terms of capacity,
performance, security, and protection.

Object-Based and Unified Storage Overview
Drivers for Object-Based Storage
• Amount of data created annually is growing exponentially and more than 90% of data generated is
unstructured
o Rapid adoption of third platform technologies leads to significant growth of data
o Longer data retention due to regulatory compliance also leads to data growth
• Data must be instantly accessible through a variety of devices from anywhere in the world
• Traditional storage solutions are inefficient in managing this data and in handling the
growth

The amount of data created each year is growing exponentially and the recent studies have
shown that more than 90 percent of data generated is unstructured (e-mail, instant messages,
graphics, images, and videos). Today, organizations not only have to store and protect
petabytes of data, but they also have to retain the data over longer periods of time, for
regulation and compliance reasons. They have also recognized that data can help gain
competitive advantages and even support new revenue streams. In addition to increasing
amounts of data, there has also been a significant shift in how people want and expect to
access their data. The rising adoption rate of smartphones, tablets, and other mobile devices
by consumers, combined with increasing acceptance of these devices in enterprise
workplaces, has resulted in an expectation for on-demand access to data from anywhere on
any device.
Traditional storage solutions like NAS, which is a dominant solution for storing unstructured
data, cannot scale to the capacities required or provide universal access across geographically
dispersed locations. Data growth adds high overhead to the NAS in terms of managing large
number of permission and nested directories. File systems require more management as they
scale and are limited in size. Their performance degrades as file system size increases, and they do
not accommodate metadata beyond file properties, which is a requirement of many new
applications. These challenges demand a smarter approach (object storage) that allows organizations to
manage data growth at low cost, provides extensive metadata capabilities, and also provides
massive scalability to keep up with the rapidly growing data storage and access demands.

Object-Based Storage Device (OSD)


Definition: Object-based Storage Device (OSD)

Stores data in the form of objects in a flat address space, based on content and attributes rather than
name and location.

• Object contains user data, related metadata, and user-defined attributes
o Objects are uniquely identified using object ID
• OSD provides APIs to integrate with software-defined data center and cloud

An object is the fundamental unit of object-based storage that contains user data, related
metadata (size, date, ownership, etc.), and user defined attributes of data (retention, access
pattern, and other business-relevant attributes). The additional metadata or attributes
enable optimized search, retention and deletion of objects.
For example, when an MRI scan of a patient is stored as a file in a NAS system, the metadata
is basic and may include information such as file name, date of creation, owner, and file type.
When stored as an object, the metadata component of the object may include additional
information such as patient name, ID, and attending physician’s name, apart from the basic
metadata.
Each object stored in the object-based storage system is identified by a unique identifier
called the object ID. The object ID allows easy access to objects without the need to specify
the storage location. The object ID is generated using specialized algorithms (such as a hash
function) on the data and guarantees that every object is uniquely identified. Any changes in
the object, like user-based edits to the file, results in a new object ID. Most object
storage systems support APIs to integrate with software-defined data center and cloud
environments.
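
The content-derived object ID can be illustrated with a hash function; this is a conceptual sketch using SHA-256 purely as an example, whereas actual OSD implementations use their own specialized algorithms and metadata.

import hashlib

def object_id(data: bytes) -> str:
    """Derive an object ID from the object's content using a hash function."""
    return hashlib.sha256(data).hexdigest()

original = b"MRI scan, patient 1234"
edited   = b"MRI scan, patient 1234 (updated)"

print(object_id(original))
print(object_id(edited))
# Any change to the content produces a different object ID,
# so the edited data is treated as a new object.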

Hierarchical File System Vs. Flat Address


Space
• Hierarchical file system organizes data in the form of files/directories
o Limits the number of files that can be stored
• OSD uses flat address space that enables storing large number of objects
o Enables the OSD to meet the scale-out storage requirement of third platform

File-based storage systems (NAS) are based on file hierarchies that are complex in structure.
Most file systems have restrictions on the number of files, directories and levels of hierarchy
that can be supported, which limits the amount of data that can be stored.
OSD stores data using flat address space where the objects exist at the same level and one
object cannot be placed inside another object. Therefore, there is no hierarchy of directories
and files, and as a result, billions of objects can be stored in a single namespace. This
enables the OSD to meet scale-out storage requirements.

Components of Object-Based Storage Device


OSD system typically comprises three key components:

• OSD nodes (controllers)


• Internal network
• Storage

The OSD system is composed of one or more nodes. A node is a server that runs the OSD
operating environment and provides services to store, retrieve, and manage data in the
system. Typically, OSD systems are architected to work with inexpensive x86-based nodes;
each node provides both compute and storage resources, and the system scales linearly in capacity and
performance by simply adding nodes.
The OSD node has two key services: metadata service and storage service. The metadata
service is responsible for generating the object ID from the contents (may also include other
attributes of data) of a file. It also maintains the mapping of the object IDs and the file system
namespace. In some implementations, the metadata service runs inside an application server.
The storage service manages a set of disks on which the user data is stored.
The OSD nodes connect to the storage via an internal network. The internal network provides
node-to-node connectivity and node-to-storage connectivity. The application server accesses
the node to store and retrieve data over an external network. OSD typically uses low-cost and
high-density disk drives to store the objects. As more capacity is required, more disk drives
can be added to the system.

Key Features of OSD


Typically, the object-based storage device has the following features:

• Scale-out architecture: Provides linear scalability where nodes are independently added to the
cluster to scale massively
• Multitenancy: Enables multiple applications/clients to be served from the same infrastructure
• Metadata-driven policy: Intelligently drives data placement, protection, and data services based on the
service requirements
• Global namespace: Abstracts storage from the application and provides a common view that is
independent of location, making scaling seamless
• Flexible data access method: Supports REST/SOAP APIs for web/mobile access, and file sharing protocols
(CIFS and NFS) for file service access
• Automated system management: Provides auto-configuring, auto-healing capabilities to reduce administrative
complexity and downtime
• Data protection (geo distribution): Objects are protected using either replication or erasure coding, and
the copies are distributed across different locations
Addition details for each OSD feature are:

• Scale-out architecture: Scalability has always been the most important characteristic of
enterprise storage systems, since the rationale of consolidating storage assumes that the
system can easily grow with aggregate demand. OSD is based on distributed scale-out
architecture where each node in the cluster contributes with its resources to the total amount
of space and performance. Nodes are independently added to the cluster that provides
massive scaling to support petabytes and even exabytes of capacity with billions of objects
that make it suitable for cloud environment.
• Multi-tenancy: Enables multiple applications to be securely served from the same
infrastructure. Each application is securely partitioned and data is neither co-mingled nor

accessible by other tenants. This feature is ideal for businesses providing cloud services for
multiple customers or departments within an enterprise.
• Metadata-driven policy: Metadata and policy-based information management capabilities combine to intelligently automate data placement, data protection, and other data services (compression, deduplication, retention, and deletion) based on the service requirements. For example, when an object is created, it is created on one node and subsequently copied to one or more additional nodes, depending on the policies in place. The nodes can be within the same data center or geographically dispersed. A hypothetical policy of this kind is sketched after this list.
• Global namespace: Another significant value of object storage is that it presents a single
global namespace to the clients. A global namespace abstracts storage from the application
and provides a common view, independent of location and making scaling seamless. This
unburdens client applications from the need to keep track of where data is stored. The global
namespace provides the ability to transparently spread data across storage systems for
greater performance, load balancing, and non-disruptive operation. The global namespace
is especially important when the infrastructure spans multiple sites and geographies.
• Flexible data access method: OSD supports REST/SOAP APIs for web/mobile access,
and file sharing protocols (CIFS and NFS) for file service access. Some OSD storage
systems support HDFS interface for big data analytics.
• Automated system management: OSD provides self-configuring and auto-healing capabilities to reduce administrative complexity and downtime. With respect to the services or processes running in the OSD, there is no single point of failure. If one of the services goes down, a node becomes unavailable, or a site becomes unavailable, redundant components and services facilitate normal operations.
• Data protection: The objects stored in an OSD are protected using two methods: replication and erasure coding. Replication provides data redundancy by creating an exact copy of an object. The replica requires the same storage space as the source object. Based on the policy configured for the object, one or more replicas are created and distributed across different locations. The erasure coding technique is discussed next.
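As an illustration only, a metadata-driven protection policy might be expressed along the lines of the Python dictionary below; the field names are hypothetical and do not correspond to any specific OSD product.

# Hypothetical policy definition, for illustration only; real OSD products
# use their own policy schemas and enforcement engines.
object_policy = {
    "applies_to": {"bucket": "medical-images"},    # objects matched by metadata
    "protection": "replication",                   # or "erasure_coding"
    "replica_count": 3,                            # copies kept for each object
    "placement": ["dc-east", "dc-west", "dc-eu"],  # sites holding the copies
    "retention_days": 2555,                        # keep for roughly seven years
    "compression": True,
}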

Object Protection: Erasure Coding


Provides space-optimal data redundancy to protect data loss against multiple drive failures

• A set of n disks is divided into m disks to hold data and k disks to hold coding information
• Coding information is calculated from data

The figure illustrates an example of dividing data into nine data segments (m = 9) and three coding fragments (k = 3). The maximum number of drive failures supported in this example is three.

Object storage systems support the erasure coding technique, which provides space-optimal data redundancy to protect against data loss from multiple drive failures.
In storage systems, erasure coding can also ensure data integrity without using RAID. This
avoids the capacity overhead of keeping multiple copies and the processing overhead of
running RAID calculations on very large data sets. The result is data protection for very large
storage systems without the risk of very long RAID rebuild cycles.
In general, the erasure coding technique breaks data into fragments, encodes them with redundant data, and stores them across a set of different locations, such as disks, storage nodes, or geographic locations. In a typical erasure-coded storage system, a set of n disks is divided into m disks to
hold data and k disks to hold coding information, where n, m, and k are integers. The coding
information is calculated from the data. If up to k of the n disks fail, their contents can be
recomputed from the surviving disks.
Erasure coding offers higher fault tolerance (tolerates k faults) than replication with less
storage cost. The additional storage requirement for storing coding segments increases as the
value of k/m increases.
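To make the comparison concrete, the short sketch below (assuming the 9 + 3 scheme from the figure and triple replication as the alternative) computes the capacity overhead and fault tolerance of each approach.

# Compare erasure coding (m data + k coding fragments) with full replication.
m, k = 9, 3                      # data and coding fragments (n = m + k = 12)
ec_overhead = k / m              # extra capacity relative to the raw data
print(f"Erasure coding: {ec_overhead:.0%} overhead, tolerates {k} drive failures")

copies = 3                       # full replication with three copies
rep_overhead = copies - 1        # two extra copies of every object
print(f"Replication: {rep_overhead:.0%} overhead, tolerates {copies - 1} drive failures")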

Use Case: Cloud-Based Storage
The capabilities or features of OSD such as multi-tenancy, scalability, geographical distribution of
data, and data sharing across heterogeneous platforms or tenants while ensuring integrity of data,
make it a strong option for cloud-based storage. Enterprise end-users and cloud subscribers are also interested in cloud storage offerings because they provide better agility, on-demand scalability, lower cost, and operational efficiency compared to traditional storage solutions.
Cloud storage provides unified and universal access, policy-based data placement, and massive scalability. It also enables data access through web service or file access protocols and provides automated data protection and efficiency to manage large amounts of data. With the growing adoption of cloud computing, cloud service providers can leverage OSD to offer storage-as-a-service, backup-as-a-service, and archive-as-a-service to their consumers.

Gateways provide a translation layer between the standard interfaces (iSCSI, FC, NFS, CIFS) and
cloud provider’s REST API

• Sits in a data center and presents file and block-based storage interfaces to applications
• Performs protocol conversion to send data directly to cloud storage
• Encrypts the data before it transmits to the cloud storage
• Supports deduplication and compression
• Provides a local cache to reduce latency

The lack of standardized cloud storage APIs has made the gateway appliance a crucial component for cloud adoption. Typically, service providers offer cloud-based object storage with interfaces such as REST or SOAP, but most business applications expect storage resources with block-based iSCSI or FC interfaces or file-based interfaces, such as NFS or CIFS. The cloud-based object storage gateways provide a translation layer between these standard interfaces and the service provider's REST API.
The gateway device is a physical or virtual appliance that sits in a data center and presents
file and block-based storage interfaces to the applications. It performs protocol conversion
so that data can be sent directly to cloud storage. To provide security for the data sent to the
cloud, most gateways automatically encrypt the data before it is sent. To speed up data
transmission times (as well as to minimize cloud storage costs), most gateways support data
deduplication and compression.
Cloud-based object storage gateway provides a local cache to reduce latency associated with
having the storage capacity far away from the data center. The gateway appliances offer not
only an interface to the cloud, but also provide a layer of management that can even help to
determine what data should be sent to the cloud and what data should be held locally.

Unified Storage Overview


Definition: Unified Storage

A single integrated (converged) storage infrastructure that consolidates block (iSCSI, FC, FCoE), file
(CIFS/SMB, NFS), and object (REST, SOAP) access.

• Deploying unified storage provides the following benefits:


o Reduces capital and operational expenses
o Managed through single management interface
o Increases storage utilization
• Integration with software-defined environment provides storage for mobile, cloud, big data,
and social applications

In an enterprise data center, typically different storage systems (block-based, file-based, and
object-based storage) are deployed to meet the needs of different applications. In many cases,
this situation has been complicated by mergers and acquisitions that bring together disparate
storage infrastructures. The resulting silos of storage have increased the overall cost because
of complex management, low storage utilization, and direct data center costs for power,
space, and cooling.

An ideal solution would be to have an integrated storage solution that supports block, file,
and object access.
There are numerous benefits associated with deploying unified storage systems:

• Creates a single pool of storage resources that can be managed with a single
management interface.
• Sharing of pooled storage capacity for multiple business workloads should lead to a
lower overall system cost and administrative time, thus reducing the total cost of
ownership (TCO).
• Provides the capability to plan the overall storage capacity consumption. Deploying a
unified storage system takes away the guesswork associated with planning for file and
block storage capacity separately.
• Increased utilization, with no stranded capacity. Unified storage eliminates the
capacity utilization penalty associated with planning for block and file storage support
separately.
• Provides the capability to integrate with software-defined storage environment to
provide next generation storage solutions for mobile, cloud, big data, and social
computing needs.

Unified Storage Architecture


A unified storage architecture enables the creation of a common storage pool that can be shared across a
diverse set of applications with a common set of management processes. The key component of a unified storage architecture is the unified controller. The unified controller provides the functionalities of block
storage, file storage, and object storage. It contains iSCSI, FC, FCoE, and IP front-end ports for direct block
access to application servers and file access to NAS clients.

For block-level access, the controller configures LUNs and presents them to application servers; the LUNs presented to the application servers appear as local physical disks. A file system is configured on these LUNs at the server and is made available to applications for storing data.
For NAS clients, the controller configures LUNs, creates a file system on these LUNs, creates an NFS, CIFS, or mixed share, and exports the share to the clients. Some storage vendors offer a REST API to enable object-level access for storing data from web/cloud applications.
In some implementations, there are dedicated or separate controllers for block functionality, NAS functionality, and object functionality.

Concepts In Practice
Dell EMC Isilon

• Scale-out NAS product


• Enables pooling of nodes to construct a clustered NAS system
• OneFS operating environment creates single file system across the cluster

Dell EMC ECS

• Hyper-scale storage infrastructure


• Provides universal accessibility with support for object and HDFS
• Provides a single platform for all web, mobile, Big Data, and social media applications

Provides a hyper-scale storage infrastructure that is specifically designed to support modern applications with unparalleled availability, protection, simplicity, and scale. It provides universal accessibility with support for object and HDFS. ECS Appliance enables cloud service providers to deliver competitive cloud storage services at scale. ECS provides a single platform for all web, mobile, Big Data, and social media applications.

Dell EMC Unity

• Belongs to family of unified storage platforms


• Unified management for file, block, and virtual volume objects
• Available as an all-flash system, or with a mix of different drive types and capacities for lowest cost
• Cloud tiering and archiving, with file and block data movement to and from the cloud

Delivers a full block and file unified environment in a single enclosure. The purpose-built Dell EMC Unity system can be configured as an All Flash system with only solid state drives, or as a Hybrid system with a mix of solid state and spinning media to deliver the best of both performance and economics. The
Unisphere management interface offers a consistent look and feel whether you are managing block
resources, file resources, or both. Dell EMC Unity offers multiple solutions to address security and
availability. Unified Snapshots provide point-in-time copies of block and file data that can be used for
backup and restoration purposes. Asynchronous Replication offers an IP-based replication strategy
within a system or between two systems. Synchronous Block Replication benefits FC environments that
are close together and require a zero data loss schema. Data at Rest Encryption ensures user data on the system is protected from physical theft and can stand in the place of drive disposal processes, such as shredding.

Question 1
Which file access method provides file sharing that is commonly used on UNIX systems?

• NFS (correct answer)
• CIFS
• HDFS
• NTFS

Question 2
Which system provides the capability to scale capacity and performance of a single NAS system?

• Scale-Up NAS (correct answer)
• Scale-Out NAS
• CIFS
• NFS

Question 3
What is an advantage of a flat address space over a hierarchical address space?

• Provides access to block, file, and object with same interface
• Provides access to data, based on retention policies
• Highly scalable with minimal impact on performance (correct answer)
• Consumes less bandwidth on network while accessing data

Question 4
What accurately describes unified storage?

• Provides block and file storage access using objects
• Provides block, file, and object-based access within one platform (correct answer)
• Specialized storage device purposely built for archiving
• Supports block and file access using flat address space

Software-Defined Storage and Networking
Software-Defined Storage (SDS)
Video: Introduction to Software-Defined Storage
Drivers for Software-Defined Storage
• In traditional environments, the creation of complex IT silos in data centers leads to:
o Management overhead, increased costs, and poor resource utilization
• In data centers, critical functionality and management tied to storage system limits:
o Resource sharing, automation, and standardization
• Traditional architecture makes it difficult to provide for:
o Data growth, scaling and self-service

In a traditional data center, there are several challenges in provisioning and managing
storage in an efficient and cost-effective manner. Some key challenges are described here.
In a traditional environment, each application type normally has its own vertical stack of
compute, networking, storage, and security. This leads to the creation of a loose collection of
IT silos, which increases the infrastructure's complexity. This challenge creates management overhead and increases operating expenses. It also leads to poor resource utilization because capacity cannot be shared across stacks.
Data centers have multi-vendor, heterogeneous storage systems, and each type of storage
system (block-based, file-based, and object-based) has its own unique value. However, critical
functionality is often tied to specific storage types, and each storage system commonly has its
own monitoring and management tools. There is limited resource sharing, no centralized management, little automation, and a lack of standards in this environment.
Application workload complexities and higher SLA demands pose a further challenge to IT.
IT finds it difficult to allocate storage to satisfy the capacity requirements of applications in
real time. There are also new requirements and expectations for continuous access and
delivery of resources as in a cloud environment.
Traditional environments are not architected for technologies such as cloud computing, Big
Data analytics, and mobile applications. Therefore, there are several challenges in managing
massive data growth, cost-effective scaling, and providing self-service access to storage. These
challenges have led to the advent of the software-defined storage model.

What Is Software-Defined Storage?
Definition: Software-Defined Storage (SDS)

Storage infrastructure managed and automated by software, which pools heterogeneous storage
resources, and dynamically allocates them based on policy to match application needs.

• Abstracts the physical details of storage and delivers storage as software


• Supports multiple types of storage systems and access methods
o Enables storing data on both storage systems and commodity disks
o Provides a unified external view of storage infrastructure
• Enables building cost-effective hyperscale storage infrastructure

SDS abstracts heterogeneous storage systems and their underlying capabilities, and pools the
storage resources. Storage capacity is dynamically and automatically allocated from the
storage pools based on policies to match the needs of applications.
In general, SDS software abstracts the physical details of storage (media, formats, location,
low-level hardware configuration), and delivers storage as software. A storage system is a
combination of hardware and software. The software stack exposes the data access method
such as block, file, or object. This software stack also uses persistent media such as HDD or
SSD to store the data. SDS software separates the software layer of a storage system from the
hardware.
It supports combinations of multiple storage types and access methods, such as block, file,
and object. It enables storing data on both storage systems and commodity disks, while
providing a unified external view of storage. This functionality allows organizations to reuse
existing storage assets, and mix and match them with commodity resources. Thus, SDS serves data through a single namespace and storage system spread across these different assets.
For example, in a data center that contains several distinct file servers, SDS can provide a
global file system, spanning the file servers and allowing location-independent file access.
SDS enables organizations to build modern, hyperscale storage infrastructure in a cost-
effective manner using standardized, commercial off-the-shelf components. The components
individually provide lower performance. However, at sufficient scale and with the use of SDS
software, the pool of components provides greater capacity and performance characteristics.

Key Attributes of Software-Defined Storage
SDS transforms existing heterogeneous physical storage into a simple, extensible, and open virtual storage
platform. The key attributes of software-defined storage are as follows:

• Storage abstraction and pooling: Single large storage pool spanning across the underlying storage infrastructure
• Automated, policy-driven storage provisioning: Dynamic composition of storage services based on application policies
• Unified management: Single control point for the entire infrastructure
• Self-service: Users self-provision storage services from a service catalog
• Open and extensible: Integration of external interfaces and applications through the use of APIs
Additional details on the key attributes of software-defined storage are as follows:

• Storage abstraction and pooling: SDS abstracts and pools storage resources across
heterogeneous storage infrastructure. SDS software creates a single large storage pool with
the underlying storage resources, from which several virtual storage pools are created. SDS
decouples the storage control path from the data path. Applications connect to storage
through the data path.
• Automated, policy-driven storage provisioning: A “storage service” is some combination
of capacity, performance, protection, encryption, and replication. In the SDS model, storage
services are dynamically composed from available resources. SDS uses application policies
to create a “just-in-time” model for storage service delivery. Storage assets and capabilities
are configured and assigned to specific applications only when they are needed. If the policy
changes, the storage environment dynamically and automatically responds with the new
requested service level.
• Unified management: SDS provides a unified storage management interface that provides
an abstract view of the storage infrastructure. Unified management provides a single control
point for the entire infrastructure across all physical and virtual resources.
• Self-service: Resource pooling enables multi-tenancy, and automated storage provisioning
enables self-service access to storage resources. Users select storage services from a self-
service catalog and self-provision them.
• Open and extensible: An SDS environment is open and easy to extend enabling new
capabilities to be added. An extensible architecture enables integrating multi-vendor
storage, and external management interfaces and applications into the SDS environment
through the use of application programming interfaces (APIs).

Software-Defined Storage Architecture

The image depicts the generic architecture of a software-defined storage environment. Although the
physical storage devices themselves are central to SDS, they are not a part of the SDS environment.
Physical storage may be block-based, file-based, or object-based storage systems or commodity hardware.
The fundamental component of the SDS environment is the policy-driven control plane,
which manages and provisions storage. The control plane is implemented through software
called “SDS controller”, which is also termed as a “storage engine” in some SDS products.
The SDS controller is software that manages, abstracts, pools, and automates the physical
storage systems into policy-based virtual storage pools. By using automation and
orchestration, the controller enables self-service access to a catalog of storage resources.
Users provision storage using data services, which may be block, file, or object services.
An SDS controller may provide either all or a part of the features and services that are shown
in the architecture. For example, an SDS controller may only support file and block data
services. Some controllers may also support the Hadoop Distributed File System (HDFS).
Some SDS products provide the feature of creating a block-based storage pool from the local
direct-attached storage (DAS) of x86-based commodity servers in a compute cluster. The
storage pool is then shared among the servers in the cluster.
The REST API is the core interface to the SDS controller. All underlying resources managed
by the controller are accessible through the API. The REST API makes the SDS environment
open and extensible, which enables integration of multi-vendor storage, external management tools, and custom-written applications.
reporting tools. Further, the API provides access to external cloud/object storage.

Compute-Based Storage Area Network
• A software-defined SAN created from direct-attached storage
o Creates a large block-based storage pool
• A client program on compute systems exposes shared block volumes
• Compute systems that contribute storage run a server program
o Server program performs I/O requested by client
• Metadata manager configures and monitors the compute-based SAN

A compute-based storage area network is a software-defined virtual SAN created from the
direct-attached storage located locally on the compute systems in a cluster. A compute-based
SAN software creates a large pool of block-based storage that can be shared among the
compute systems (or nodes) in the cluster. This software creates a large-scale SAN without
storage systems, and enables using the local storage of existing compute systems. The
convergence of storage and compute ensures that the local storage on compute systems, which often goes unused, is not wasted.
A compute system that requires access to the block storage volumes, runs a client program.
The client program is a block device driver that exposes shared block volumes to an
application on the compute system. The blocks that the client exposes can be blocks from
anywhere within the compute-based SAN. This process enables the application to issue an
I/O request, and the client fulfills it regardless of where the particular blocks reside. The
client communicates with other compute systems either over Ethernet (ETH) or Infiniband
(IB) – a high-speed, low latency communication standard for compute networking. The
compute systems that contribute their local storage to the shared storage pool within the
virtual SAN, run an instance of a server program. The server program owns the local storage
and performs I/O operations as requested by a client from a compute system within the
cluster.

A compute-based SAN’s control component, which is known as the metadata manager, serves
as the monitoring and configuration agent. It holds cluster-wide mapping information and
monitors capacity, performance, and load balancing. It is also responsible for decisions
regarding migration, rebuilds, and all system-related functions. The metadata manager is
not on the virtual SAN data path, and reads and writes do not traverse the metadata
manager. The metadata manager may communicate with other compute-based SAN
components within the cluster to perform system maintenance and management operations
but not data operations. The metadata manager may run on a compute system within the
compute-based SAN, or on an external compute system.
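The toy Python sketch below (purely conceptual and invented for illustration, not any vendor's implementation) mimics the roles described above: server programs own local blocks, a metadata manager holds the cluster-wide mapping, and a client caches that mapping and routes each I/O to the node that owns the requested block, so reads and writes never traverse the metadata manager.

# Conceptual sketch of a compute-based SAN; networking, failures, rebuilds,
# and rebalancing are omitted for brevity.

class ServerProgram:
    # Owns the local storage contributed by one compute system.
    def __init__(self, node_id):
        self.node_id = node_id
        self.local_blocks = {}                    # block_id -> data

    def write(self, block_id, data):
        self.local_blocks[block_id] = data

    def read(self, block_id):
        return self.local_blocks[block_id]

class MetadataManager:
    # Holds cluster-wide mapping information; not on the data path.
    def __init__(self, nodes):
        self.nodes = nodes

    def current_mapping(self):
        return list(self.nodes)

class ClientProgram:
    # Exposes shared block volumes; caches the mapping so that I/O does not
    # traverse the metadata manager.
    def __init__(self, metadata_manager):
        self.mapping = metadata_manager.current_mapping()

    def _node_for(self, block_id):
        return self.mapping[block_id % len(self.mapping)]   # trivial placement rule

    def write(self, block_id, data):
        self._node_for(block_id).write(block_id, data)

    def read(self, block_id):
        return self._node_for(block_id).read(block_id)

nodes = [ServerProgram(i) for i in range(3)]
client = ClientProgram(MetadataManager(nodes))
client.write(7, b"application data")
print(client.read(7))    # data is served by whichever node owns block 7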

Benefits of Software-Defined Storage


The key benefits of software-defined storage are described below:

• Simplified storage environment:
o Breaks down storage silos and their associated complexity
o Provides centralized management across all physical and virtual storage environments
o Simplifies management by enabling administrators to centralize storage management and provisioning tasks
• Operational efficiency:
o Automated, policy-driven storage provisioning improves quality of services, reduces errors, and lowers operational cost
o Provides faster, streamlined storage provisioning, which enables new requirements to be satisfied more rapidly
• Agility:
o Ability to deliver self-service access to storage through a service catalog provides agility and reduces time-to-market
• Reusing existing infrastructure:
o Supports multi-vendor storage systems and commodity hardware, which enables organizations to work with their existing infrastructure and protects the current investments of organizations
• Cloud support:
o Enables an enterprise data center to connect to external cloud storage services for consuming services such as cloud-based backup and disaster recovery
o Facilitates extending object storage to existing file and block-based storage, which enables organizations to deploy mobile and cloud applications on their existing infrastructure
Control Plane Functions and User Interfaces
Key control plane functions are:

• Asset discovery
• Resource abstraction and pooling
• Provisioning resources for services

SDS controller provides two native user interfaces:

• Command-line interface (CLI)


• Graphical user interface (GUI)
o Has an administrator view and a user view

The control plane in software-defined storage is implemented by SDS controller software, which enables storage management and provisioning. An SDS controller commonly provides two native user interfaces: a command-line interface (CLI) and a graphical user interface (GUI). Both interfaces may either be integrated into the controller or be external to it. If the native user interfaces are external, they use the REST API to interact with the controller.
The CLI provides granular access to the controller’s functions and more control over
controller operations as compared to the GUI.
The GUI is a browser-based interface that can be used with a supported web browser. The
GUI may be used by both storage administrators and by end users. For this option, the GUI
has two views: an administrator view and a user view. The administrator view enables an
administrator to carry out tasks such as managing the infrastructure, creating service
catalogs, and defining storage services. The user view enables an end user to access the service
catalog and self-provision storage services.

Asset Discovery
• Controller automatically detects assets when they are added to the SDS environment
o Controller obtains or confirms asset configuration information
• Examples of asset categories that can be discovered are:
o Storage systems
o Storage networks
o Compute systems and clusters
o Data protection solutions

An SDS controller automatically detects an asset when it is added to the SDS environment.
The controller uses the asset’s credentials to connect to it over the network, and either obtains
or confirms its configuration information. This process is called “discovery”. Discovery can

also be initiated manually to verify the status of an asset. Examples of assets are storage
systems, storage networks, compute systems and clusters, and data protection solutions.
If the asset is a storage system, the controller collects information about the storage ports and
the pools that it provides. If the asset is a compute system, the controller discovers its initiator
ports. Clusters can also be discovered, enabling volumes to be provisioned to the compute
systems in the cluster. The controller can also discover the storage area networks within a
data center.

Resource Abstraction and Pooling


Data centers commonly contain many physical storage systems of different types and often from
multiple manufacturers. Each physical storage system must also be individually managed, which
is time consuming and error prone.
An SDS controller exposes the storage infrastructure through a simplified model, hiding and
handling details such as storage system and disk selection, LUN creation, LUN masking, and the
differences between the storage systems.
The SDS controller leverages the intelligence of individual storage systems. It abstracts storage
across the physical storage systems and manages individual components. This functionality enables
administrators and users to treat storage as a large resource. It enables focusing just on the amount
of storage needed, and the performance and protection characteristics required.

Resource Provisioning
Service Catalog and Self-Service

• Administrator creates storage services and organizes them into categories in a service
catalog
o Services are block, file, and object data services
o Administrator can restrict services to specific users
• Service catalog provides users with self-service access to predefined storage services

o Users place service requests through the GUI or a client software
• SDS controller automates the provisioning of resources
• Administrators can view details of requests in real time

After configuring the storage abstractions, an administrator customizes and exposes storage
services by creating service catalogs for tenants. The administrator uses the GUI’s
administrator view to create storage services and organize them into categories in a service
catalog. The service catalog provides the tenant users with access to the set of predefined
storage services. An administrator can create different categories of services such as block
service, file service, and object service. The administrator can configure the different services
within each category, and also restrict them to specific users or user groups.
The user view of the GUI provides users within a tenant with access to their service catalog.
The user view presents all the services and categories that are available for provisioning for
a specific user. Users can request a service by simply clicking the service and placing a request
to run it. Some SDS platforms may not provide an interface for users to request services, and
require the use of external client software.
An SDS controller automates the provisioning of resources when a user requests a service.
It employs a policy-based placement algorithm to find the best fit in the infrastructure to
fulfill user requests for data services. The SDS controller uses orchestration for automating
the provisioning process. Orchestration uses workflows to automate the arrangement,
coordination, and management of various functions required to provision resources. As a
result, provisioning does not require administrator or user interaction.
The administrator can view the details and the progress of placed requests in real time. The
details include which service was requested, which parameters were specified in the service
request, who requested it, the outcome of the request submission, and the affected resources
and volumes.

Block Data Service

• Provides a block volume of required size, and performance and protection levels
• Examples of block services:
o Create a block volume
o Delete a block volume
o Bind a block volume to compute
o Unbind a block volume from compute
o Mount a block volume
o Unmount a block volume
o Expand a block volume

The block data service provides a block volume of required size, performance level, and protection level to a user. Examples of the services that an administrator defines in this service category are as follows (an illustrative API request for the first of these appears after the list):

• Create a block volume: A user can create a block storage volume by selecting a virtual
storage system and virtual pool. On receiving the request, the SDS controller chooses

the physical pool from the selected virtual pool and storage system. It creates a block
volume, which corresponds to a LUN on the storage system.
• Delete a block volume: A user can delete an existing volume. On receiving the request,
the SDS controller destroys the volume from the physical storage pool.
• Bind a block volume to compute: A user can assign a block volume to a selected
compute system/cluster. On receiving this request, the SDS controller binds the block
volume to the specified compute system/cluster. However, the volume cannot be
written to or read from unless it is mounted.
• Unbind block volume from compute: A user can unbind a volume from a compute
system/cluster. This block service simply makes the block volume invisible to the
compute.
• Mount a block volume: A user can mount a block volume on a compute system/cluster.
The SDS controller sends commands to the OS to mount the volume. This operation
is specific to the type of OS on the compute system such as Windows, Linux, and ESXi.
• Unmount block volume: A user can unmount a block volume from a compute
system/cluster. On receiving the request, the SDS controller sends commands to the
compute to unmount the volume.
• Expand block volume: A user can expand/extend a block volume by combining it
either with a newly created volume or with an existing volume. On receiving the
request to expand a volume, the SDS controller commands the storage system to
expand the LUN.
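As an illustration of how the first of these services might be invoked programmatically, the sketch below issues a hypothetical REST request to an SDS controller to create a block volume. The endpoint path, payload fields, and authentication header are invented for the example; an actual controller defines its own API.

import requests

controller = "https://sds-controller.example.com"   # hypothetical endpoint
headers = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

# Request a 100 GB volume from a named virtual storage system and virtual pool.
payload = {
    "name": "app01_data",
    "size_gb": 100,
    "virtual_array": "varray_prod",
    "virtual_pool": "vpool_gold",
}

response = requests.post(f"{controller}/block/volumes", json=payload, headers=headers)
response.raise_for_status()
print(response.json())    # details of the created volume, such as its ID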

Software-Defined Storage Extensibility
Definition: Application Programming Interface (API)

A set of programmatic instructions and specifications that provides an interface for software components
to communicate with each other. It specifies a set of routines that can be called from a software
component enabling interaction with the software providing the API.

• APIs may either be pre-compiled code leveraged in programming, or web-based


• Web-based APIs may be implemented as:
o Simple Object Access Protocol (SOAP) based web services
o Representational state transfer (REST) APIs

An API specifies a set of routines (operations), input parameters, outputs/responses, data types, and errors. The routines can be called from a software component enabling it to
interact with the software providing the API. Thus, an API provides a programmable
interface, which is a means for communicating with an application without understanding its
underlying architecture. This functionality enables programmers to use the component-
based approach to build software systems. APIs may be pre-compiled code that is applied in
programming languages, and can also be web-based.
A web-based API is typically defined as a set of Hypertext Transfer Protocol (HTTP) request
messages and the structure of response messages. The response messages are usually in an
Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. Web-
based APIs may be implemented as Simple Object Access Protocol (SOAP) based web
services or as representational state transfer (REST) APIs. A detailed description of APIs is
beyond the scope of this course. This lesson provides an overview of APIs, and focuses
primarily on REST APIs.

Need for APIs


• APIs enable integrating third-party data services and capabilities into existing architecture
• In SDDC, APIs enable orchestration and provisioning resources from pools
o Ensures meeting the SLAs that organizations require
• In SDS, the REST API provides the interface to all underlying resources
o Enables storage provisioning, management, and metering
o Enables extension of functionality, and integration with external platforms and
applications

As modern technologies become more prevalent, the ability to dynamically adapt to variations in application workloads and storage requirements is becoming increasingly important. The next-generation software-defined data centers and cloud stacks are powered by APIs. With advancements in technology, APIs are improving communication and connectivity between IT systems, and increasing agility through automation.

APIs provide a flexible, easy-to-use means for integrating third-party applications and
capabilities into existing infrastructure. This integration also provides a layer of security
between public (external) and private (internal) business capabilities. This further enables organizations to provide services in the way they see fit while offering end users various services.

• For example, a public cloud storage provider may provide an API that allows a
consumer-written application to access and use cloud storage as regular storage.

Similarly, online social networks may provide APIs that enable developers to access the feeds of their users. Further, with the advent of the Internet of Things, devices enabled with
web-based APIs are becoming common. APIs enable the smart devices to communicate with
each other and with applications.
In a software-defined data center, APIs enable automated provisioning of resources from
compute, storage, and networking pools to ensure that SLAs are met. The use of APIs is
enabling software-defined storage to be easily managed and provisioned.
In SDS, the REST API provides the interface to all underlying resources. Management
interfaces use the API to provision, manage, monitor, and meter logical storage resources.
The API also provides a means to integrate with multi-vendor storage systems and external
storage platforms. It also offers a programmable environment enabling developers and users
to extend SDS functionality.

Representational State Transfer (REST)


• REST is a client/server software architecture style
o Leverages HTTP methods for client/server interaction
o Used for developing “RESTful” APIs
• Provides an easy means to consume services, and combine multiple web resources into
applications

Representational State Transfer (REST) is a client/server software architecture approach that was originally introduced for building large-scale, distributed hypermedia (for example,
hypertext, audio, video, image, and text) systems. REST is not a standard but rather an
architectural style that has become a choice for developing HTTP-based APIs called
“RESTful” APIs. It leverages HTTP methods such as GET, POST, PUT, DELETE for
client/server interaction. It supports the resource-oriented architecture for the development
of scalable and lightweight web applications while adhering to a set of constraints.
REST-based communication provides simple, human-readable data access methods.
RESTful APIs do not require XML-based web service protocols such as SOAP to support
their light-weight interfaces. However, they still support XML-based and JSON data
formats. These services provide an easy means to consume services, and support the
combination of multiple web resources into new applications. Recent trends reveal increasing
adoption of REST for developing APIs to provide simple and cost-effective request-based
services, and support the demand for real-time data.
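A short sketch of what a RESTful interaction can look like is shown below. The URL and resource path are hypothetical; the example only assumes that the service maps CRUD operations onto HTTP methods and returns JSON, as described above.

import requests

# GET reads a resource; POST, PUT, and DELETE would create, update, and delete it.
url = "https://api.example.com/storage/v1/pools"     # hypothetical resource
resp = requests.get(url, headers={"Accept": "application/json"})
resp.raise_for_status()

pools = resp.json()                  # response body parsed from JSON
for pool in pools:                   # assumes the body is a list of pool objects
    print(pool.get("name"), pool.get("free_capacity_gb"))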
Integrating External Management Tools and
Applications
The figure shows external management interfaces and applications, external cloud/object storage services, and monitoring and reporting tools interacting through the REST API with the SDS controller, which manages the underlying storage systems.

The REST API enables the extensibility of the SDS functionality through integration with custom-written applications, external management tools, and cloud stacks such as VMware, Microsoft, and OpenStack. This provides an alternative to provisioning storage from the native management
interface. The open platform enables users and developers to write new data services. This enables
building an open development community around the platform.
The API also integrates with tools for monitoring and reporting system utilization, performance,
and health. This also enables generating chargeback/showback reports. The API may also support cloud/object storage platforms such as Amazon S3 and OpenStack Swift. Further, the API may
also support integration with HDFS for running Hadoop applications.
The REST API:

• Describes the programmatic interfaces that allow users to create, read, update, and delete resources through the HTTP methods PUT, GET, POST, and DELETE
• Is accessible using any web browser or programming platform that can issue HTTP requests

The browser may require a special plugin such as httpAnalyzer for Internet Explorer, Poster for
Firefox, and PostMan for Chrome. The REST API may also be accessed using scripting platforms
such as Perl. Vendors may also provide class libraries that enable developers to write applications
that access the SDS data services.

Software-Defined Networking (SDN)
Software-Defined Networking Overview
Definition: Software-Defined Networking
(SDN)

An approach to abstract and separate the control plane functions from the data plane functions. Instead of the integrated control functions at the network components level, the software external to the components takes over the control functions. The software runs on a compute system or a stand-alone device and is called the network controller.

• Controller gathers configuration information from network components
• Controller provides instructions to the data plane

Traditionally, a network component such as a switch or a router consists of a data plane and
a control plane. These planes are bundled together and implemented in the firmware of the
network components. The function of the data plane is to transfer the network traffic from
one physical port to another port by following rules that are programmed into the
component. The function of the control plane is to provide the programming logic that the
data plane follows for switching or routing of the network traffic.
Software-defined networking is an approach to abstract and separate the control plane
functions from the data plane functions. Instead of the integrated control functions at the
network components level, the software external to the components takes over the control
functions. The software runs on a compute system or a stand-alone device and is called
network controller. The network controller interacts with the network components to gather
configuration information and to provide instructions for data plane to handle the network
traffic.
Software-defined networking versus network virtualization: Network virtualization is a
process of abstracting all the network components and their functions into software. SDN, in contrast, does not virtualize all the network components, but moves the decision making to a centralized control plane. Based on the decisions, the hardware components execute the actions. Though
they both allow for flexible network operations, they perform different roles and functions.

Software-Defined Networking Architecture

The architecture of SDN consists of three layers along with APIs in between to define the
communication.

• Infrastructure Layer: This layer consists of networking devices such as switches and
routers. It is responsible for handling data packets such as forwarding or dropping of
packets and handling the devices. This layer forms the data plane and performs actions
based on the instructions received.
• Control Layer: This layer consists of controllers and acts as the brain of the SDN
architecture. It is responsible for making decisions such as how the packets should be
forwarded based on the requirements, and relays the decisions to the networking devices
(data plane) for execution. It also extracts the information about the network from the data
plane and communicates it to the application layer. This layer forms the control plane.
• Application Layer: This layer consists of applications and services such as business
applications, and analytics that define the network behavior through policies and also define
the requirements. It communicates the requirements through the APIs to the control layer.
This layer forms the application plane of the SDN architecture.
• APIs: In SDN architecture, APIs are referred to as northbound interfaces and southbound interfaces. Northbound interfaces define the communications between the controller and the application layer. Southbound interfaces define the communications between the control and infrastructure layers. A hypothetical northbound request is sketched after this list.
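The sketch below illustrates what a northbound API call from an application to an SDN controller might look like. The controller address, resource path, and policy fields are entirely hypothetical; real controllers define their own northbound interfaces, and standards such as OpenFlow operate on the southbound side.

import requests

controller = "https://sdn-controller.example.com"    # hypothetical controller
headers = {"Content-Type": "application/json"}

# Hypothetical micro-segmentation policy: deny traffic between two workload segments.
policy = {
    "name": "isolate-db-from-web",
    "source_segment": "web-tier",
    "destination_segment": "db-tier",
    "action": "deny",
}

resp = requests.post(f"{controller}/northbound/v1/policies", json=policy, headers=headers)
resp.raise_for_status()
print("Policy accepted:", resp.json())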

Software-Defined Networking Benefits
Software-defined networking in a SAN provides several benefits. These benefits are:

• Centralized control:
o Provides a single point of control for the entire network infrastructure that may span across data centers
o Centralized control plane provides the programming logic for transferring the network traffic, which can be uniformly and quickly applied across the network infrastructure
o Programming logic can be upgraded centrally to add new features based on application requirements
• Policy-based automation:
o Many hardware-based network management operations such as zoning can be automated
o Management operations may be programmed in the network controller based on business policies and best practices
o Reduces the need for manual operations that are repetitive, error-prone, and time-consuming
o Helps to standardize the management operations
• Simplified, agile management:
o Network controller usually provides a management interface that includes a limited and standardized set of management functions
o Management functions are available in a simplified form, abstracting the underlying operational complexity
o Makes it easier to configure a network infrastructure and to modify the network configuration to respond to changing application requirements

• Centralized Control: The software-defined approach provides a single point of control for
the entire network infrastructure that may span across data centers. The centralized control
plane provides the programming logic for transferring the network traffic, which can be
uniformly and quickly applied across the network infrastructure. The programming logic
can be upgraded centrally to add new features based on application requirements.
• Policy-based Automation: With a software-defined approach, many hardware-based
network management operations such as zoning can be automated. Management operations
may be programmed in the network controller based on business policies and best practices.
This process reduces the need for manual operations that are repetitive, error-prone, and
time-consuming. Policy-based automation also helps to standardize the management
operations.
• Simplified, Agile Management: The network controller usually provides a management
interface that includes a limited and standardized set of management functions. With
policy-based automation in place, these management functions are available in a simplified
form, abstracting the underlying operational complexity. This process makes it easy to

configure a network infrastructure and to modify the network configuration to respond to
changing application requirements.

Software-Defined Networking Use Case


Listed below are some common use cases where SDN is used to strengthen security, automate processes for faster provisioning of network resources, and enable business continuity.

• Data Center Security:
o Security against lateral movements
o Visibility of trends using analytics such as switch data
o Security policies and control for each workload
• Automation:
o Automated network provisioning
o Programmatically control the entire network environment
• Business Continuity:
o Hybrid cloud initiatives
o Disaster recovery

Note: Micro-segmentation is a method of isolating and securing the workloads by defining various security
policies and controls for each workload

• Data Center Security: Protecting information is a strategic necessity for organizations. With
SDN, organizations protect data through embedded security, to prevent credential stealing
and computer infiltration for both the physical and virtual layers. It enables visibility of
trends using analytics available that offer insight into switch traffic. Micro-segmentation
feature of SDN lets organizations define security policies and controls for each workload
based on dynamic security groups. This process helps to ensure immediate responses to
threats inside the data center.
• Automation: Many organizations cannot change their networks fast enough to keep up with
new applications and workloads. With SDN, organizations can bring up workloads in
seconds or minutes using automated network provisioning. There is no need to make major
revisions to the physical network every time the organization introduces an application or
service. Changes can be quickly made through software and require few, if any, cabling
updates. IT can programmatically create, snapshot, store, move, delete, and restore entire
networking environments with simplicity and speed. This automation of networking tasks
benefits both new application deployments as well as changes to existing applications in
the IT infrastructure.
• Business Continuity: SDN also simplifies and accelerates private and hybrid cloud
initiatives. Organizations can rapidly develop, automatically deliver, and manage all their
enterprise applications, whether they reside on-premises or off-premises, from a single
unified platform. IT can easily replicate entire application environments to remote data
centers for disaster recovery. It can also move them from one corporate data center to

another or deploy them into a hybrid cloud environment, without disrupting the applications
or touching the physical network.

Concepts in Practice
Dell EMC ViPR Controller

• Software-defined storage platform that supports block and file data services
• Supports data protection across data centers
• Extensible through a REST API
• Driven by open-source community

A software-defined storage platform that abstracts, pools, and automates a data center’s
physical storage infrastructure. It delivers block and file storage services on demand through
a self-service catalog. It supports data protection across geographically dispersed data
centers. It provides a single control plane to manage heterogeneous storage environments,
including Dell EMC and non-Dell EMC block and file storage.
ViPR Controller also provides a REST-based API making the storage architecture
extensible. It supports multiple vendors enabling organizations to choose storage platforms
from either Dell EMC or third-party. It also supports different cloud stacks such as VMware,
Microsoft, and OpenStack. ViPR Controller development is driven by the open-source
community, which enables expanding its features and functionalities.
Dell EMC VxFlex OS

• Software for creating compute-based SAN from local storage


o Leverages HDDs, SSDs, and flash cards
• Supports physical and virtual servers
• Scale-out elastic architecture, with massively parallel processing

Software-defined storage that creates a compute-based (server-based) SAN from the direct-attached storage of the compute systems in a cluster. It pools local HDDs, SSDs, and flash cards into shared block storage and supports both physical and virtual servers. Its scale-out, elastic architecture with massively parallel processing allows nodes to be added or removed on demand, with capacity and performance scaling as the cluster grows.
VMware NSX

• Network virtualization platform for SDDC architecture


• Virtual networks are programmatically provisioned and managed, independent of
underlying hardware
• Enables a library of logical networking elements, such as logical switches, routers, firewalls, and load balancers
A network virtualization platform for SDDC architecture. It is a reproduction of the network
and its services, in a virtualized environment. NSX provides software that represents logical
network components such as switches, routers, distributed services for firewalls, load
balancers, and VPN. It reproduces Layer 2 to Layer 7 networking services that include
switching, routing, firewalling, and load balancing in software.
VMware NSX lets you create, delete, save, and restore networks without changing the
physical network. This process reduces the time to provision by simplifying overall network
operations. NSX Manager is integrated with vCenter for single pane management and all
these network resources can be deployed whether in a cloud or a self-service portal
environment.

Question 1
Which is the fundamental component of the SDS environment that manages and provisions storage?

• API
• Control plane (correct answer)
• Storage plane
• REST API

Question 2
Which product decouples compute and storage to scale each resource together or independently
to drive maximum efficiency?

• ViPR Controller
• Dell EMC VxRack Flex
• Dell EMC VxRack SDDC
• VxFlex OS (correct answer)

Question 3
Which layer represents the ‘brain’ of SDN architecture?

• Application layer
• API layer
• Infrastructure layer
• Control layer (correct answer)

Introduction to Business Continuity
Business Continuity Overview
Video: Business Continuity Overview
Business Continuity
Definition: Business Continuity (BC)

Process that prepares for, responds to, and recovers from a system outage that can adversely affect
business operations.

• BC process enables continuous availability of information and services in the event of a failure, to meet the required SLA
• BC involves various proactive and reactive countermeasures
• It is important to automate BC process to reduce the manual intervention
• Goal of BC solution is to ensure information availability

Business continuity (BC) is a set of processes that includes all activities that a business must
perform to mitigate the impact of planned and unplanned downtime. BC entails preparing
for, responding to, and recovering from a system outage that adversely affects business
operations. It describes the processes and procedures an organization establishes to ensure
that essential functions can continue during and after a disaster.
Business continuity prevents interruption of mission-critical services, and reestablishes the
impacted services as swiftly and smoothly as possible by using an automated process. BC
involves proactive measures such as business impact analysis, risk assessment, building
resilient IT infrastructure, deploying data protection solutions (backup and replication). It
also involves reactive countermeasures such as disaster recovery.
In a modern data center, policy-based services can be created that include data protection
through the self-service portal. Consumers can select the class of service that best meets their
performance, cost, and protection requirements on demand. Once the service is activated,
the underlying data protection solutions that are required to support the service are
automatically invoked to meet the required data protection.
For example, if a service requires a VM backup every six hours, then the VM backup is
scheduled automatically every six hours. The goal of a BC solution is to ensure “information
availability” required to conduct vital business operations.

Importance of Business Continuity

255
Today, businesses rely on information more than ever. Continuous access to information is a must for the
smooth functioning of business operations for any organization.

Listed are some important factors:

Application Dependency: Business applications rely on data protection techniques for
uninterrupted and reliable access to data.
High-risk Data: Organizations seek to protect their sensitive data to reduce the risk of financial,
legal, and business loss.
Data Protection Laws: Legal requirements mandate protection against unauthorized modification,
loss, and unlawful processing of personal data.
For business applications, it is essential to have uninterrupted, fast, reliable, and secure
access to data for enabling these applications to provide services. This access, in turn, relies
on how well the infrastructure and data is protected and managed.
Data is the most valuable asset for an organization. An organization can use its data to
efficiently bill customers and advertise relevant products to existing and potential customers.
It also enables organizations to launch new products and services, and perform trend analysis
to devise targeted marketing plans. This sensitive data, if lost, may lead to significant
financial, legal, and business loss, apart from serious damage to the reputation of the
organization. An organization seeks to reduce the risk of sensitive data loss to operate its
business successfully. It should focus its protection efforts where the need exists: its high-risk
data.
Many government laws mandate that an organization must be responsible for protecting its
employee’s and customer’s personal data. The data should be safe from unauthorized
modification, loss, and unlawful processing. Examples of such laws are U.S. Health Insurance
Portability and Accountability Act (HIPAA), U.S. Gramm-Leach-Bliley Act (GLBA), and
U.K. Data Protection Act. An organization must be proficient at protecting and managing
personal data in compliance with legal requirements.

256
Information Availability
Definition: Information Availability (IA)
The ability of an IT infrastructure to function according to business requirements and customer
expectations, during its specified time of operation.
The operating time is the specified or agreed time of operation when a component or service is
supposed to be available. IA ensures that people (employees, customers, suppliers, and partners)
can access information whenever they need it. IT organizations need to design and build their
infrastructure to maximize the availability of the information, while minimizing the impact of an
outage on consumers.
Information Availability can be defined in terms of:
Accessibility: Information should be accessible to the right user when required.
Reliability: Information should be reliable and correct in all aspects. It is “the same” as what
was stored, and there is no alteration or corruption to the information.
Timeliness: Defines the time window (a particular time of the day, week, month, and year as
specified) during which information must be accessible.
For example, if online access to an application is required between 8:00 am and
10:00 pm each day, any disruption to data availability outside of this time slot is
not considered to affect timeliness.

Causes of Information Unavailability


• Application failure (for example: due to catastrophic exceptions caused by bad logic)
• Data loss
• Infrastructure component failure (for example: due to power failure or disaster)
• Data center or site down
o For example: due to power failure or disaster
• Refreshing IT infrastructure

Data center failure due to disaster (natural or man-made disasters such as flood, fire,
earthquake, and so on) is not the only cause of information unavailability. Poor application
design or resource configuration errors can lead to information unavailability. For example,
if the database server is down for some reason, then the data is inaccessible to the consumers,
which leads to IT service outage.
Unavailability of data due to other factors, such as data corruption and human error, also
leads to outages. The IT department is routinely required to take on activities such as
refreshing the data center infrastructure, migration, running routine maintenance, or even
relocating to a new data center. Any of these activities can have its own significant and
negative impact on information availability.
Note: In general, the outages can be broadly categorized into planned and unplanned outages.

257
• Planned outages may include installation and maintenance of new hardware, software
upgrades or patches, performing application and data restores, facility operations
(renovation and construction), and migration.
• Unplanned outages include failure caused by human errors, database corruption,
failure of physical and virtual components, and natural or human-made disasters.

Impact of Information Unavailability


An IT service outage, due to information unavailability, results in loss of productivity, loss of
revenue, poor financial performance, and damages to reputation. The loss of revenue includes
direct loss, compensatory payments, future revenue loss, billing loss, and investment loss. The
damages to reputations may result in a loss of confidence or credibility with customers, suppliers,
financial markets, banks, and business partners. The other possible consequences of outage include
the cost of extra rented equipment, overtime, and extra shipping.

Measurement of Information Availability


Information availability relies on the availability of both physical and virtual components of a data
center. The failure of these components might disrupt information availability. A failure is the
termination of a component’s ability to perform a required function.
The component’s ability can be restored by performing various external corrective actions, such as
a manual reboot, a repair, or replacement of the failed component(s). Proactive risk analysis,
performed as part of the BC planning process, considers the component failure rate and average
repair time, which are measured by MTBF and MTTR.

258
MTBF: Average time available for a system or component to perform its normal operations
between failures

• MTBF = Total uptime / Number of failures

MTTR: Average time required to repair a failed component

• MTTR = Total downtime / Number of failures

IA = MTBF / (MTBF + MTTR) or IA = uptime / (uptime + downtime)
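As a rough illustration of how these formulas relate, the following Python sketch computes MTBF, MTTR, and IA from a list of observed outages. The operating time and outage durations are hypothetical values, not taken from any particular system.

# Hypothetical sketch: computing MTBF, MTTR, and IA from observed outages.
def availability(total_operating_hours, outage_hours):
    downtime = sum(outage_hours)          # total downtime
    uptime = total_operating_hours - downtime
    failures = len(outage_hours)
    mtbf = uptime / failures              # MTBF = total uptime / number of failures
    mttr = downtime / failures            # MTTR = total downtime / number of failures
    ia = mtbf / (mtbf + mttr)             # equivalently, uptime / (uptime + downtime)
    return mtbf, mttr, ia

mtbf, mttr, ia = availability(120, [2, 4, 3])   # assumed 120 operating hours, 3 failures
print(f"MTBF={mtbf:.1f} h, MTTR={mttr:.1f} h, IA={ia:.2%}")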

Key BC Concepts: RPO and RTO


Recovery Point Objective (RPO): Point-in-time to which data must be recovered.
Recovery Time Objective (RTO): Time within which systems and applications must be recovered.

When designing an information availability strategy for an application or a service,


organizations must consider two important parameters that are closely associated with
recovery.

• Recovery Point Objective: RPO is the point-in-time to which data must be recovered
after an outage. It defines the amount of data loss that a business can endure. Based

259
on the RPO, organizations plan for the frequency with which a backup or replica must
be made. For example, if the RPO of a particular business application is 24 hours,
then backups are created every midnight. The corresponding recovery strategy is to
restore data from the set of last backups. An organization can plan for an appropriate
BC solution on the basis of the RPO it sets.
• Recovery Time Objective: RTO is the time within which systems and applications
must be recovered after an outage. It defines the amount of downtime that a business
can endure and survive. Based on the RTO, an organization can decide which BC
technology is best suited. The more critical the application, the lower the RTO should
be.

Both RPO and RTO are expressed in minutes, hours, or days and are directly related to the
criticality of the IT service and data. Usually, the lower the RTO and RPO, the higher the
cost of a BC solution or technology.
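A minimal sketch of how RPO ties to backup frequency follows; the backup interval, timestamps, and outage time are illustrative assumptions only.

# Hypothetical sketch: worst-case data loss for a periodic (PIT) backup schedule.
from datetime import datetime, timedelta

rpo = timedelta(hours=24)                      # business tolerates at most 24 hours of data loss
backup_interval = timedelta(hours=24)          # backups taken every midnight

last_backup = datetime(2024, 1, 1, 0, 0)       # assumed timestamp of the last good backup
outage = datetime(2024, 1, 1, 20, 0)           # assumed time of failure

data_loss = outage - last_backup               # data created since the last PIT is lost
print("Worst-case data loss:", data_loss,
      "- within RPO" if data_loss <= rpo else "- RPO violated")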

BC Planning Lifecycle
BC planning must follow a disciplined approach like any other planning process. Organizations today
dedicate specialized resources to develop and maintain BC plans. From the conceptualization to the
realization of the BC plan, a lifecycle of activities can be defined for the BC process. The BC planning
lifecycle includes five stages:

Establish Objectives

• Determine BC requirements
• Estimate the scope and budget to achieve requirements
• Select a BC team that includes subject matter experts from all areas of business, whether
internal or external
• Create BC policies

Analyze

• Collect information on data profiles, business processes, infrastructure support,


dependencies, and frequency of using business infrastructure
• Conduct a business impact analysis
• Identify critical business processes and assign recovery priorities
• Perform risk analysis for critical functions and create mitigation strategies
• Perform cost benefit analysis for available solutions based on the mitigation strategy
• Evaluate options

Design and Develop

• Define the team structure and assign individual roles and responsibilities; for example,
different teams are formed for activities such as emergency response and infrastructure and
application recovery

260
• Design data protection strategies and develop infrastructure
• Develop contingency solution and emergency response procedures
• Detail recovery and restart procedures

Implement

• Implement risk management and mitigation procedures that include backup, replication,
and management of resources
• Prepare the DR sites that can be utilized if a disaster affects the primary data center. The
DR site could be one of the organization’s own data centers or could be in the cloud
• Implement redundancy for every resource in a data center to avoid single points of failure

Train, Test, Assess, and Maintain

• Train the employees who are responsible for backup and replication of business-critical
data on a regular basis or whenever there is a modification in the BC plan
• Train employees on emergency response procedures when disasters are declared
• Train the recovery team on recovery procedures based on contingency scenarios
• Perform damage-assessment processes and review recovery plans
• Test the BC plan regularly to evaluate its performance and identify its limitations
• Assess the performance reports and identify limitations
• Update the BC plans and recovery/restart procedures to reflect regular changes within the
data center

Key BC Concepts: Disaster Recovery


Definition: Disaster Recovery (DR)

A part of BC process, which involves a set of policies and procedures for restoring IT infrastructure,
including data that is required to support ongoing IT services, after a natural or human-induced disaster
occurs.

A disaster may impact the ability of a data center to remain up and provide services to users. This disaster
may cause information unavailability. Disaster recovery (DR) mitigates the risk of information
unavailability due to a disaster. It involves a set of policies and procedures for restoring IT infrastructure
including data. This infrastructure and data are required to support the ongoing IT services after a disaster
occurs.

261
The fundamental principle of DR is to maintain a secondary data center or site, called a DR
site. The primary data center and the DR data center should be located in different
geographical regions to avoid the impact of a regional disaster. The DR site must house a
complete copy of the production data. Commonly, all production data is replicated from the
primary site to the DR site either continuously or periodically. A backup copy can also be
maintained at the DR site. Usually, the IT infrastructure at the primary site is unlikely to be
restored within a short time after a catastrophic event.
Organizations often keep their DR site ready to restart business operations if there is an
outage at the primary data center. This may require the maintenance of a complete set of IT
resources at the DR site that matches the IT resources at the primary site. Organizations can
either build their own DR site or use the cloud to build one.

Business Continuity Technology Solutions


With the aim of meeting the required information and service availability, organizations should
build a resilient IT infrastructure. Building a resilient IT infrastructure requires the following high
availability and data protection solutions:

• Deploying redundancy at both the IT infrastructure component level and the site level to
avoid single point of failure
• Deploying data protection solutions such as backup, replication, migration, and archiving
• Deploying an automatic failover mechanism, which is an efficient and cost-effective way to
ensure HA. For example, scripts can be defined to bring up a new VM automatically when the
current VM stops responding or goes down.
• Architecting resilient modern applications

For example, when a disaster occurs at one of an organization's data centers, BC triggers
the DR process. This process typically involves both manual and automated procedures to reactivate
the service (application) at a functioning data center. This reactivation of service requires the

262
transfer of application users, VMs, data, and services to the new data center. This process involves
the use of redundant infrastructure across different geographic locations, live migration, backup,
and replication solutions.

Video: Business Continuity Solutions


Exercise: Information Availability
Scenario
A system has three components and requires all three to be operational from 8 am to 8 pm, Monday
to Friday. Failure of component 2 occurs as follows:

• Monday = 9 am to 12 pm
• Tuesday = No failure
• Wednesday = 5 pm to 8 pm
• Thursday = 4 pm to 7 pm
• Friday = 5 pm to 6 pm
• Saturday = 8 am to 1 pm

Deliverables
Calculate the availability of component 2

Debrief

A system has three components and requires all three to be operational from 8 am to 8 pm, Monday to
Friday. Failure of component 2 occurs as follows:

• Monday = 9 am to 12 pm
• Tuesday = No failure

263
• Wednesday = 5 pm to 8 pm
• Thursday = 4 pm to 7 pm
• Friday = 5 pm to 6 pm
• Saturday = 8 am to 1 pm

Calculate the availability of component 2

• Availability is calculated as: system uptime / (system uptime + system downtime)


• System downtime = 3 hours on Monday + 3 hours on Wednesday + 3 hours on Thursday +
1 hour on Friday = 10 hours
o We do not need to consider downtime on Saturday because component 2 is not
required to be operational on weekends
• System uptime = total operational time – system downtime, which is:
o 60 hours – 10 hours, which is 50 hours
• Availability (%) = (50 / 60) × 100 = 83.3%

Exercise: MTBF and MTTR


Scenario
A system has three components and requires all three to be operational for 24 hours from Monday
to Friday. Failure of component 1 occurs as follows:

• Monday = No failure
• Tuesday = 5 am to 7 am
• Wednesday = No failure
• Thursday = 4 pm to 8 pm
• Friday = 8 am to 11 am

Deliverables
Calculate the MTBF and MTTR of component 1
Debrief
MTBF is calculated as: total uptime / number of failures

• Total downtime = 2 hours on Tuesday + 4 hours on Thursday + 3 hours on Friday = 9 hours


• Total uptime = (5 × 24) – 9 = 111 hours
• MTBF = 111 / 3 = 37 hours

MTTR is calculated as: total downtime / number of failures

• Total downtime = 2 hours on Tuesday + 4 hours on Thursday + 3 hours on Friday = 9 hours


• MTTR = 9 hours / 3 = 3 hours
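The results of both exercises can be checked with a short calculation, shown below as a sketch; the operating windows and failure times come from the exercise scenarios above.

# Availability of component 2: operational 8 am - 8 pm, Monday to Friday (60 hours/week).
downtime_c2 = 3 + 3 + 3 + 1                 # Mon, Wed, Thu, Fri failures within the operating window
operating_hours = 5 * 12
print("Availability: %.1f%%" % (100 * (operating_hours - downtime_c2) / operating_hours))  # 83.3%

# MTBF and MTTR of component 1: operational 24 hours, Monday to Friday (120 hours/week).
failures_c1 = [2, 4, 3]                     # Tue, Thu, Fri outage durations in hours
downtime_c1 = sum(failures_c1)              # 9 hours
uptime_c1 = 5 * 24 - downtime_c1            # 111 hours
print("MTBF = %.0f hours" % (uptime_c1 / len(failures_c1)))   # 37 hours
print("MTTR = %.0f hours" % (downtime_c1 / len(failures_c1))) # 3 hours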

264
Fault Tolerance IT Infrastructure
Fault Tolerance IT Infrastructure Overview
Definition: Fault Tolerance

Ability of an IT system to continue functioning in the event of a failure.

A fault may cause a complete outage of a component or cause a faulty component to run but only to
produce incorrect or degraded output. The common reasons for a fault or a failure are: hardware failure,
software issue, and administrator/user errors.

Fault tolerance ensures that a single fault or failure does not make an entire system or a service
unavailable. It protects an IT system or a service against various types of unavailability.

Fault tolerance may be provided by software, hardware, or a combination of both. The closer an
organization reaches 100 percent fault tolerance, the more costly the infrastructure.

A fault may cause a complete outage of a component or cause a faulty component to run but
only to produce incorrect or degraded output. The common reasons for a fault or a failure
are: hardware failure, software issue, and administrator/user errors. Fault tolerance ensures
that a single fault or failure does not make an entire system or a service unavailable.
Fault tolerance protects an IT system or a service against the following types of
unavailability:

• Transient unavailability: It occurs once for short time and then disappears. For
example, an online transaction times out but works fine when a user retries the
operation.
• Intermittent unavailability: It is a recurring unavailability that is characterized by an
outage and then availability again and then another outage, and so on.
• Permanent unavailability: It exists until the faulty component is repaired or replaced.
Examples of permanent unavailability are network link outage, application issues,
and manufacturing defects.

Fault tolerance may be provided by software, hardware, or a combination of both. The closer
an organization gets to 100 percent fault tolerance, the more costly the infrastructure becomes.

265
Key Requirements for Fault Tolerance
A fault-tolerant IT infrastructure should meet two key requirements: fault isolation and
elimination of single points of failure (SPOF).

Fault Isolation
Fault isolation limits the scope of a fault to a local area so that the other areas of a system are not
impacted by the fault. It does not prevent failure of a component but ensures that the failure does
not impact the overall system.
Fault isolation requires a fault detection mechanism that identifies the location of a fault and a
contained system design (like sandbox) that prevents a faulty system component from impacting
other components.

The example represents two I/O paths between a compute system and a storage system. The
compute system uses both the paths to send I/O requests to the storage system. If an error or fault
occurs on a path causing a path failure, the fault isolation mechanism present in the environment
automatically detects the failed path. It isolates the failed path from the set of available paths and
marks it as a dead path to avoid sending the pending I/Os through it. All pending I/Os are redirected
to the live path. This helps avoid time-out and retry delays.

Single Point of Failure

266
Definition: Single Point of Failure
Refers to any individual component or aspect of an infrastructure whose failure can make the entire
system or service unavailable.
Single point of failure may occur at infrastructure component-level and site-level (data center).

The illustration provides an example where various IT infrastructure components, including the
compute system, VM instance, network devices, storage, and site itself, become a single point of
failure. Assume that a web application runs on a VM instance and it uses a database server which
runs on another VM to store and retrieve application data. If the database server is down, then the
application would not be able to access the data, which in turn would impact the availability of the
service.
Consider another example where a group of compute systems is networked through a single FC
switch. The switch would present a single point of failure. If the switch failed, all of the compute
systems connected to that switch would become inaccessible and result in service unavailability. It
is important for organizations to build a fault-tolerant IT infrastructure that eliminates single
points of failure in the environment.

Eliminating Single Points of Failure


Single points of failure can be avoided by implementing fault tolerance mechanisms such as redundancy

• Implement redundancy at component level


o Compute
o Network
o Storage
• Implement multiple availability zones
o Avoid single points of failure at data center (site) level

It is important to have high availability mechanisms that enable automated application/service failover

Highly available infrastructures are typically configured without single points of failure to
ensure that individual component failures do not result in service outages. The general
method to avoid single points of failure is to provide redundant components for each
necessary resource, so that a service can continue with the available resource even if a
component fails.

267
Organizations may also create multiple availability zones to avoid single points of failure at
data center level. Usually, each zone is isolated from others, so that the failure of one zone
would not impact the other zones. It is important to have high availability mechanisms that
enable automated application/service failover within and across the zones if there is a
component failure or disaster.
Note:
N+1 redundancy is a common form of fault tolerance mechanism that ensures service
availability if there is a component failure. A set of N components has at least one standby
component. This approach is typically implemented as an active/passive arrangement, as the
additional component does not actively participate in the service operations. The standby
component is active only if any one of the active components fails.
N+1 redundancy with active/active component configuration is also available. In such cases
all the components remain active. For example, if an active/active configuration is
implemented at the site level, then a service is fully deployed at both sites. The load
for this service is balanced between the sites. If one of the sites is down, the available site
would manage the service operations and the workload.

Implementing Redundancy at Component-Level
Organizations should follow stringent guidelines to implement fault tolerance in their data centers for
uninterrupted services. The underlying IT infrastructure components (compute, storage, and network)
should be highly available, and the single points of failure at the component level should be eliminated.

The example represents an infrastructure that is designed to mitigate the single points of
failure at component level. The single points of failure at the compute level can be avoided by
implementing redundant compute systems in a clustered configuration. Single points of
failure at the network level can be avoided through path and node redundancy and various
fault tolerance protocols.

268
Multiple independent paths can be configured between nodes so that if a component along
the main path fails, traffic is rerouted along another path. The key techniques for protecting
storage from single points of failure are RAID, erasure coding techniques, dynamic disk
sparing, and configuring redundant storage system components. Many storage systems also
support a redundant array of independent nodes (RAIN) architecture to improve fault
tolerance.

Compute Clustering
• Two or more compute systems/hypervisors are clustered to provide high availability and
load balancing
• Service running on a failed compute system moves to another compute system
o Heartbeat mechanism determines the health of compute systems in a cluster
• Two common clustering implementations are:
o Active/active
o Active/passive

Compute clustering is one of the key fault tolerance mechanisms. It provides continuous
availability of a service even when a VM instance, physical compute system, operating system,
or hypervisor fails.
Clustering is a technique where at least two compute systems (or nodes) work together and
are viewed as a single compute system to provide high availability and load balancing. If one
of the compute systems fails, the service running in the compute system can failover to
another compute system in the cluster. This method minimizes or avoids any outage.
The two common cluster implementations are active/active and active/passive.

• In active/active clustering, the nodes in a cluster are all active participants and run
the same service for their clients. The active/active cluster balances requests for service
among the nodes. If one of the nodes fails, the surviving nodes take the load of the
failed one. This method enhances both the performance and the availability of a
service. The nodes in the cluster have access to shared storage volumes. In
active/active clustering only one node can write or update the data in a shared file
system or database at a given time.
• In active/passive clustering, the service runs on one or more nodes and the passive
node waits for a failover. If the active node fails, the service that had been running on
the active node is failed over to the passive node. Active/passive clustering does not
provide performance improvement like active/active clustering.

Clustering uses a heartbeat mechanism to determine the health of each node in the cluster.
The exchange of heartbeat signals, which usually happens over a private network, enables
participating cluster members to monitor one another’s status.
269
Clustering can be implemented between multiple physical compute systems, or between
multiple VMs, or between VM and physical compute system, or between multiple
hypervisors.
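The heartbeat-and-failover idea can be sketched in a few lines of Python. The node names, timeout value, and restart call below are purely illustrative assumptions, not any specific clustering product.

# Hypothetical sketch of an active/passive cluster monitor.
import time

HEARTBEAT_TIMEOUT = 15          # assumed: seconds without a heartbeat before a node is declared failed
last_heartbeat = {"node-a": time.time(), "node-b": time.time()}   # updated by each node over a private network

def record_heartbeat(node):
    last_heartbeat[node] = time.time()

def check_cluster(active_node, passive_node, restart_service):
    # If the active node misses its heartbeat window, fail the service over to the passive node.
    if time.time() - last_heartbeat[active_node] > HEARTBEAT_TIMEOUT:
        restart_service(passive_node)
        return passive_node     # the passive node becomes the new active node
    return active_node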

Compute Cluster Example


Multiple hypervisors running on different systems are clustered

• Provides continuous availability of services running on VMs even if the compute system
or a hypervisor fails
o Typically a live instance (a secondary VM) of a primary VM is created on another
compute system

The illustration shows an example of clustering where multiple hypervisors running on
different compute systems are clustered. They access the hypervisor's native file system,
which is a clustered file system that enables multiple hypervisors to access the same shared
storage resources concurrently. This method provides high availability for services running
on VMs by pooling the VMs and the compute systems they reside on into a cluster.
If a physical compute system running a VM fails, the VM is restarted on another compute
system in the cluster. This method provides rapid recovery of services running on VMs if
there is a compute system failure. In some hypervisor cluster implementations, the hypervisor
uses its native technique to provide continuous availability of services running on VMs even
if a physical compute system or a hypervisor fails.
In this implementation, a live instance (a secondary VM) of a primary VM is created on
another compute system. The primary and secondary VMs exchange heartbeats. If the
primary VM fails due to hardware failure, the clustering enables failover to the secondary
VM immediately. After a transparent failover occurs, a new secondary VM is created and
redundancy is reestablished.
The hypervisor running the primary VM as shown in the illustration captures the sequence
of events for the primary VM. This includes instructions from the virtual I/O devices, virtual
NICs, and so on. Then it transfers these sequences to the hypervisor running on another
compute system. The hypervisor running the secondary VM receives these event sequences
and sends them to the secondary VM for execution.

270
The primary and the secondary VMs share the same storage, but all output operations are
performed only by the primary VM. A locking mechanism ensures that the secondary VM
does not perform write operations on the shared storage. The hypervisor posts all events to
the secondary VM at the same execution point as they occurred on the primary VM. This
way, these VMs “play” the same set of events and their states are synchronized with each
other.

Network Fault Tolerance Mechanisms


Even a brief network interruption could impact many services running in a data center
environment. So, the network infrastructure must be fully redundant and highly available with no
single points of failure. The following techniques provide fault tolerance mechanisms against link
failure:
Link Aggregation
• Combines links between two switches and also between a switch and a node
• Enables network traffic failover in the event of a link failure in the aggregation

Link aggregation combines two or more network links into a single logical link, called port
channel, yielding higher bandwidth than a single link could provide. Link aggregation
enables distribution of network traffic across the links and traffic failover if there is a link
failure.
If a link in the aggregation is lost, all network traffic on that link is redistributed across the
remaining links.
NIC Teaming

• Groups NICs so that they appear as a single, logical NIC to the operating system or hypervisor
• Provides network traffic failover in the event of a NIC/link failure
• Distributes network traffic across NICs

NIC teaming groups NICs so that they appear as a single, logical NIC to the OS or hypervisor.
NIC teaming provides network traffic failover to prevent connectivity loss if there is a NIC
failure or a network link outage. Sometimes, NIC teaming enables aggregation of the network
bandwidth of individual NICs. The bandwidth aggregation facilitates distribution of network
traffic across NICs in the team.
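A simplified way to picture traffic distribution and failover across the links of an aggregation (or the NICs of a team) is a hash over the healthy links only. This is a conceptual sketch, not the actual LACP or teaming algorithm; the link names are assumptions.

# Hypothetical sketch: pick an outgoing link for a flow, skipping failed links.
def select_link(flow_id, links, link_is_up):
    healthy = [l for l in links if link_is_up(l)]
    if not healthy:
        raise RuntimeError("no healthy links in the aggregation/team")
    # Hashing the flow keeps packets of one flow on one link while spreading flows across links.
    return healthy[hash(flow_id) % len(healthy)]

# Example: if "nic1" fails, its traffic is redistributed across the remaining links.
links = ["nic0", "nic1"]
print(select_link(("10.0.0.5", 443), links, lambda l: l != "nic1"))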
Multipathing

271
• Enables a compute system to use multiple paths for transferring data to a LUN
• Enables failover by redirecting I/O from a failed path to another active path
• Performs load balancing by distributing I/O across active paths

Multipathing enables organizations to meet aggressive availability and performance service


levels. It enables a compute system to use multiple
paths for transferring data to a LUN on a storage
system. Multipathing enables automated path
failover. It eliminates the possibility of disrupting
an application or service due to the failure of an
adapter, cable, port, and so on. When path
failover happens all outstanding and subsequent
I/O requests are automatically directed to
alternative paths.
To use multipathing, multiple paths must exist
between the compute and the storage systems.
Each path can be configured as either active or
standby. If one or more active paths fail then
standby paths become active. If an active path
fails, the multipathing process detects the failed
path and then redirects I/Os of the failed path to
another active path.
Multipathing can be an integrated operating
system and hypervisor function. It can also be a
third party software module that can be installed
to the operating system or hypervisor. The
illustration shows a configuration where four
paths between the physical compute system (with
dual-port HBAs) and the LUN enable
multipathing. Multipathing can perform load
balancing by distributing I/O across all active paths.
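The following sketch illustrates the path failover and load-balancing behavior described above in a simplified form; the path names and the round-robin policy are assumptions, not the behavior of any specific multipathing product.

# Hypothetical sketch of multipath I/O: round-robin over active paths with failover.
class MultipathDevice:
    def __init__(self, active_paths, standby_paths):
        self.active = list(active_paths)
        self.standby = list(standby_paths)

    def mark_failed(self, path):
        # Isolate the dead path; promote a standby path if no active path remains.
        if path in self.active:
            self.active.remove(path)
        if not self.active and self.standby:
            self.active.append(self.standby.pop(0))

    def next_path(self):
        if not self.active:
            raise IOError("no path available to the LUN")
        # Round-robin load balancing across the remaining active paths.
        path = self.active[0]
        self.active.append(self.active.pop(0))
        return path

dev = MultipathDevice(["hba0:portA", "hba0:portB"], ["hba1:portA", "hba1:portB"])
dev.mark_failed("hba0:portA")           # I/O is redirected to the surviving active path
print(dev.next_path())                  # hba0:portB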
Elastic Load Balancing

• Enables dynamic distribution of application and client I/O traffic


• Dynamically scales resources (VM instances) to meet traffic demands
• Provides fault tolerance capability by detecting the unhealthy VM instances and
automatically redirects the I/Os to other healthy VM instances

Elastic load balancing enables dynamic distribution of application and client I/O traffic among VM
instances. It dynamically scales resources (VM instances) to meet traffic demands. The load balancer
provides fault tolerance capability by detecting unhealthy VM instances and automatically redirecting
the I/Os to other healthy VM instances.
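A minimal sketch of the health-check behavior of an elastic load balancer follows; the instance names and the health probe are illustrative assumptions.

# Hypothetical sketch: route a request only to VM instances that pass a health check.
import random

instances = ["vm-1", "vm-2", "vm-3"]

def healthy(instance):
    # In a real balancer this would be a periodic probe (for example, an HTTP ping).
    return instance != "vm-2"            # assume vm-2 is currently unhealthy

def route_request():
    candidates = [i for i in instances if healthy(i)]
    if not candidates:
        raise RuntimeError("no healthy instances; trigger scale-out")
    return random.choice(candidates)     # traffic is spread across healthy instances only

print(route_request())                   # never returns vm-2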

Storage Fault Tolerance Mechanisms

272
Data centers comprise storage systems with a large number of disk drives and solid state drives.
These storage systems support various applications and services running in the environment. The
failure of these drives could result in data loss and information unavailability. The greater the
number of drives in use, the greater the probability of a drive failure.
The following techniques provide data protection in the event of drive failure:
RAID

Provides data protection against one or two drive failures

• RAID is a technique that combines multiple drives into a logical unit that is called a RAID
set. Nearly all RAID implementation models provide data protection against drive failures.
• The illustration provides an example of RAID 6 (dual distributed parity), where data is
protected against two disk failures.
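The basic idea behind parity protection can be shown with a toy XOR example. This is a conceptual sketch of single-parity reconstruction, much simplified compared to RAID 6, which keeps two independent parity values; the strip values are arbitrary.

# Hypothetical sketch: XOR parity lets a RAID set rebuild one lost drive's data.
from functools import reduce

data_strips = [0b10110010, 0b01101100, 0b11100001]      # strips on three data drives
parity = reduce(lambda a, b: a ^ b, data_strips)        # parity strip stored on a fourth drive

lost = data_strips[1]                                    # drive 1 fails
survivors = [data_strips[0], data_strips[2], parity]
rebuilt = reduce(lambda a, b: a ^ b, survivors)          # XOR of survivors recreates the lost strip

assert rebuilt == lost
print(f"rebuilt strip: {rebuilt:08b}")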

Erasure Coding

Provides space-optimal data redundancy to protect data loss against multiple drive failure

Dynamic Disk Sparing

• Automatically replaces a failed drive with a spare drive to protect against data loss
• Multiple spare drives can be configured to improve availability

Dynamic disk sparing is a fault tolerance mechanism that refers to a spare drive which
automatically replaces a failed disk drive by taking on its identity. A spare drive should be large
enough to accommodate data from a failed drive. Some systems implement multiple spare drives
to improve data availability.
In dynamic disk sparing, when the recoverable error rates for a disk exceed a predetermined
threshold, the disk subsystem tries to copy data from the failing disk to the spare drive
automatically. If this task is completed before the damaged disk fails, the subsystem switches to
the spare disk and marks the failing disk as unusable. Otherwise, it uses parity or the mirrored disk
to recover the data.

Storage Virtualization

Storage Resiliency using Virtualization

• Virtual volume is created using virtualization appliance


• Each I/O to the volume is mirrored to the LUNs on the storage systems
• Virtual volume is continuously available to the compute system, even if one of the storage
systems is unavailable due to failure

The illustration provides an example of a virtual volume that is mirrored between LUNs of two
different storage systems. Each I/O to the virtual volume is mirrored to the underlying LUNs on
the storage systems. If one of the storage systems incurs an outage due to failure or maintenance,

273
the virtualization appliance will be able to continue processing I/O on the surviving mirror leg.
Upon restoration of the failed storage system, the data from the surviving LUN is resynchronized
to the recovered leg. This method provides protection and high availability for critical services if
there is a storage system failure.

Fault Tolerance at Site-Level – Availability Zones
An availability zone is a location with its own set of resources and isolated from other zones.

A zone can be an entire data center or a part of the data center

• Enables running multiple service instances within and across zones to survive data center
or site failure
• If there is an outage, the service should seamlessly failover across the zones

Zones within a particular region are typically connected through low-latency network for enabling faster
service failover.

An important high availability design best practice is to create availability zones. An
availability zone is a location with its own set of resources that is isolated from other zones.
Therefore, a failure in one zone will not impact other zones. A zone can be a part of a data
center or may even be an entire data center.
This method provides redundant computing facilities on which applications or services can
be deployed. Organizations can deploy multiple zones within a data center (to run multiple

274
instances of a service), so that if one of the zones incurs an outage due to some reason, the
service can be failed over to the other zone.
For example, if two compute systems are deployed, one in zone A and the other in zone B,
then the probability that both go down simultaneously due to an external event is low.
This simple strategy enables the organization to construct highly reliable web services by
placing compute systems into multiple zones, so that the failure of one zone does not disrupt the
service, or at the least, enables the organization to rapidly reconstruct the service in the second zone.
Organizations also deploy multiple zones across geographically dispersed data centers (to
run multiple instances of a service). This method helps the services to survive even if the
failure is at the data center level.
It is also important that there should be a mechanism that enables seamless (automated)
failover of services running in one zone to another. Automated failover provides a reduced
RTO when compared to the manual process. A failover process also depends upon other
capabilities, including replication and live migration capabilities, and reliable network
infrastructure between the zones.

Fault Tolerance at Site-Level – Example


High availability can be achieved by moving services across zones that are located in different locations,
without user interruption. The services can be moved across zones by implementing a stretched cluster.

A stretched cluster is a cluster whose compute systems are in different remote locations, providing DR
capability if there is a disaster in one of the data centers. Stretched clusters are typically built as a way to
create active/active zones to provide high availability and enable dynamic workload balancing across zones.

The illustration also shows that a virtual volume is created from the federated storage
resources across zones. The virtualization appliance has the ability to mirror the data of a
virtual volume between the LUNs located in two different storage systems at different
locations.
Each I/O from a host to the virtual volume is mirrored to the underlying LUNs on the storage
systems. If an outage occurs at one of the data centers, for example at zone A, then the
running VMs at zone A can be restarted at Zone B without impacting the service availability.
This setup also enables accessing the storage even if one of the storage systems is unavailable.
If storage system at zone A is unavailable, then the hypervisor running there still accesses the
virtual volume. The hypervisor can access the data from the available storage system at zone
B.

275
Resilient Application Overview
Applications have to be designed to deal with IT resource failures to guarantee the required availability

• Fault resilient applications have logic to detect and handle transient fault conditions to avoid
application downtime
• Examples of key application design strategies for improving availability:
o Graceful degradation of application functionality
o Retry logic in application code
o Persistent application state model

Today, organizations typically build their IT infrastructure using commodity systems to


achieve scalability and keep hardware costs down. In this environment, it is assumed that
some components will fail. Therefore, in the design of an application the failure of individual
resources often has to be anticipated to ensure an acceptable availability of the application.
A reliable application properly manages the failure of one or more modules and continues
operating properly. If a failed operation is retried a few milliseconds later, the operation may
succeed. These types of error conditions are called transient faults. Fault-resilient
applications have logic to detect and handle transient fault conditions in order to avoid
application downtime.

276
Key Application Design Strategies for
Improving Availability
Graceful Degradation

• Application maintains limited functionality even when some of the modules or supporting
services are not available
• Unavailability of certain application components or modules should not bring down the
entire application

Refers to the ability of an application to maintain limited functionality even when some of
the components, modules, or supporting services are not available. The purpose of graceful
degradation of application functionality is to prevent the complete failure of a business
application.
For example, consider an eCommerce application that consists of modules such as product
catalog, shopping cart, order status, order submission, and order processing. Assume that
due to some problem the payment gateway is unavailable. It is impossible for the order
processing module of the application to continue. If the application is not designed to handle
this scenario, the entire application might go offline.
However, in this same scenario, it is still possible to make the product catalog module
available to consumers, to view the product catalog. The application could also enable consumers to
add products to the shopping cart and place the order. This method provides the ability to process
the orders when the payment gateway is available again or after failing over to a secondary
payment gateway.
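A minimal sketch of this pattern, in which an unavailable payment gateway degrades the checkout flow while the catalog keeps working, is shown below; the module and function names are hypothetical.

# Hypothetical sketch of graceful degradation in an eCommerce application.
def payment_gateway_available():
    return False          # assume the gateway is currently down

def checkout(cart):
    if not payment_gateway_available():
        # Degrade instead of failing the whole application: queue the order for later processing.
        return {"status": "accepted", "note": "payment pending - gateway unavailable, order queued"}
    return {"status": "paid"}

def browse_catalog():
    return ["laptop", "phone"]       # catalog module keeps working regardless of the gateway

print(browse_catalog())
print(checkout(cart=["laptop"]))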
Fault Detection and Retry Logic

• Refers to a mechanism that implements logic in the code of an application to improve
availability
• Detects and retries a service that is temporarily down, which may result in successful
restoration of the service

A key mechanism in an application design is to implement retry logic within a code to handle
a service that is temporarily down. When applications use other services, errors can occur
because of temporary conditions such as intermittent service, infrastructure-level faults, or
network issues. Often, this form of problem can be solved by retrying the operation a few
milliseconds later, and the operation may succeed.
To implement the retry logic in an application, it is important to detect and identify that
particular exception which is likely to be caused by a transient fault condition. A retry
strategy must be defined to state how many retries can be attempted before deciding that the
fault is not transient.
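A simple retry-with-delay sketch of this strategy follows; the retry count, delay, and the transient-error type are assumptions, and production code would typically add exponential backoff.

# Hypothetical sketch: retry an operation that may fail with a transient fault.
import time

class TransientError(Exception):
    """Raised when a dependent service is temporarily unavailable."""

def call_with_retry(operation, max_retries=3, delay_seconds=0.2):
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise                    # fault is not transient; surface the error
            time.sleep(delay_seconds)    # wait briefly before retrying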
Persistent Application State Model

• Application state information is stored out of the memory


• Stored in a data repository
• If an instance fails, the state information is still available in the repository
277
In a stateful application model, the session state information of an application (for example user
ID, selected products in a shopping cart, and so on) is stored in compute system memory. However,
the information that is stored in the memory can be lost if there is an outage with the compute
system where the application runs.
In a persistent application state model, the state information is stored out of the memory and is
stored in a repository (database). If a VM running the application instance fails, the state
information is still available in the repository. A new application instance is created on another VM
which can access the state information from the database and resume the processing.
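A sketch of the persistent state model is shown below, using an in-memory dictionary to stand in for the external repository; in practice this would be a database or a distributed cache, and the session fields are hypothetical.

# Hypothetical sketch: session state kept outside the application instance's memory.
state_repository = {}        # stand-in for an external database or cache

def save_session(session_id, state):
    state_repository[session_id] = state          # state survives even if the VM instance fails

def resume_session(session_id):
    # A new application instance on another VM can pick up where the failed one left off.
    return state_repository.get(session_id, {"cart": []})

save_session("user-42", {"cart": ["laptop"], "step": "shipping"})
print(resume_session("user-42"))       # the new instance resumes processing with the saved state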

Concepts In Practice
Dell EMC PowerPath

• Host-based multipathing software


• Provides path failover and load-balancing functionality
• Automatic detection and recovery from host-to-array path failures
• PowerPath/VE software enables optimizing virtual environments with PowerPath
multipathing features

A family of software products that ensures consistent application availability and
performance across I/O paths on physical and virtual platforms. It provides automated path
management and tools that enable organizations to satisfy aggressive SLAs without investing in more
infrastructure.
Dell EMC PowerPath/VE is compatible with VMware vSphere and Microsoft Hyper-V-
based virtual environments. It can be used together with Dell EMC PowerPath to perform
the following functions in both physical and virtual environments:

• Standardize Path Management: Optimize I/O paths in physical and virtual


environments (PowerPath/VE) and cloud deployments
• Optimize Load Balancing: Adjust I/O paths to dynamically rebalance your
application environment for peak performance
• Automate Failover/Recovery: Define failover and recovery rules that route
application requests to alternative resources in the event of component failures or user
errors

Concepts in Practice
VMware HA

• Provides high availability for applications running in virtual machines


• If there is a fault in a physical compute system, then the affected VMs are automatically
restarted on other compute systems

Provides high availability for applications running in VMs. If there is a fault in a physical compute
system, then the affected VMs are automatically restarted on other compute systems.
278
VMware HA minimizes unplanned downtime and IT service disruption while eliminating the need
for dedicated standby hardware and installation of additional software.
VMware FT

• Provides continuous availability for applications in the event of server failures


• Creates a live shadow instance of a VM that is in virtual lockstep with the primary instance
• FT eliminates even the smallest chance of data loss or disruption

Provides continuous availability for applications in the event of server failures. It creates a
live shadow instance of a VM that is in virtual lockstep with the primary VM instance.
VMware FT is used to prevent application disruption due to hardware failures. The
downtime that is associated with mission-critical applications can be expensive and disruptive
to businesses. By enabling instantaneous failover between the two instances in the event of
hardware failure, FT eliminates even the smallest chance of data loss or disruption.

Question 1
Which defines the amount of data loss that a business can endure?

• RPO (correct answer)
• MTBF
• RTO
• MTTR

Question 2
Which refers to the ability of an application to maintain limited functionality even when some of
the components, modules, or supporting services are not available?

• Persistent state model

279

• Graceful degradation (correct answer)
• Availability zone
• Retry logic

280
Data Protection Solutions
Replication
Video: Replication Overview
Introduction to Data Replication
Definition: Data Replication

A process of creating an exact copy (replica) of the data to ensure business continuity in the event of a
local outage or disaster.

• Replicas are used to restore and restart operations if data loss occurs
• Data can be replicated to one or more locations based on the business requirements

Data is one of the most valuable assets of any organization. It is being stored, mined,
transformed, and used continuously. It is a critical component in the operation and function
of organizations. Outages, whatever may be the cause, are costly, and customers are always
concerned about data availability. Safeguarding and keeping the data highly available are
some of the top priorities of any organization.
To avoid disruptions in business operations, it is necessary to implement data protection
technologies in a data center. A data replication solution is one of the key data protection
solutions that enables organizations to achieve business continuity, high availability, and data
protection.
Data replication is the process of creating an exact copy (replica) of data. If a data loss occurs,
then the replicas are used to restore and restart operations. For example, if a production VM
goes down, then the replica VM can be used to restart the production operations with

281
minimum disruption. Based on business requirements, data can be replicated to one or more
locations.
For example, data can be replicated within a data center, between data centers, from a data
center to a cloud, or between clouds. In a replication environment, a compute system accessing
the production data from one or more LUNs on storage system is called a production compute
system. These LUNs are known as source LUNs, production LUNs, or the source. A LUN on
which the production data is replicated to is called the target LUN or the target or replica.

Primary Uses of Replicas


Replicas are created for various purposes which include the following:

Alternative Source for Backup


Under normal backup operations, data is read from the production LUNs and written to the backup
device. This places an extra burden on the production infrastructure because production LUNs are
simultaneously involved in production operations and servicing data for backup operations.
To avoid this situation, a replica can be created from production LUN and it can be used as a source
to perform backup operations. This method alleviates the backup I/O workload on the production
LUNs.
Fast Recovery and Restart
For critical applications, replicas can be taken at short, regular intervals. This enables fast recovery from
data loss. If a complete failure of the source LUN occurs, the replication solution enables to restart the
production operation on the replica. This approach reduces the RTO.
Decision-Support Activities
Running reports using the data on the replicas greatly reduces the I/O burden on the production device.
Testing Platform
Replicas are also used for testing new applications or upgrades.
For example, an organization may use the replica to test the production application upgrade. If the
test is successful, the upgrade may be implemented on the production environment.

282
Data Migration
Another use for a replica is data migration. Data migrations are performed for various reasons such as
migrating from a smaller capacity LUN to one of a larger capacity.

Replica Characteristics and Types


Replica Characteristics:
• Recoverability/Restartability – Replica can be used to restore data to the source device, or
business operations can be restarted from the replica
• Consistency – Ensures the usability of a replica; the replica must be consistent with the source

Replica Types:
• Point-in-Time (PIT) – Nonzero RPO
• Continuous – Near-zero RPO

A replica should have the following characteristics:


Recoverability: Enables restoration of data from the replicas to the source if data loss occurs.
Restartability: Enables restarting business operations using the replicas.
Consistency: Replica must be consistent with the source so that it is usable for both recovery and
restart operations. For example, if a service running in a primary data center is to fail over to a
remote site due to a disaster, there must be a consistent replica available at that site. So, ensuring
consistency is the primary requirement for all replication technologies. Replicas can either be
point-in-time (PIT) or continuous, and the choice of replica ties back into RPO.
PIT replica: The data on the replica is an identical image of the production at some specific
timestamp. For example, a replica of a file system is created at 4:00 PM on Monday. This replica
would then be referred to as the Monday 4:00 PM PIT copy. The RPO maps from the time when the
PIT was created to the time when any kind of failure on the production occurred. If there is a failure
on the production at 8:00 PM and there is a 4:00 PM PIT available, the RPO would be 4 hours
(8 - 4 = 4). To minimize RPO, take periodic PITs.
Continuous replica: The data on the replica is always in sync with the production data. The objective
with any continuous replication is to reduce the RPO to zero or near-zero.

283
Types of Replication
Replication can be classified into two major categories:
Local Replication

• Refers to replicating data within the same location
o Within a data center in compute-based replication
o Within a storage system in storage system-based replication
• Typically used for operational restore of data if there is a data loss

Local replication is the process of replicating data within the same storage system or the same
data center.
Local replicas help to restore the data if there is a data loss or enable restarting the
application immediately to ensure business continuity.
Remote Replication

• Refers to replicating data to remote locations (locations can be geographically dispersed)


• Data can be synchronously or asynchronously replicated
• Helps to mitigate the risks associated with regional outages
• Enables organizations to replicate the data to cloud for DR purpose

Remote replication is the process of replicating data to remote locations (locations can be
geographically dispersed).

284
Remote replication helps organizations to mitigate the risks that are associated with regional
outages resulting from natural or human-made disasters. During disasters, the services can be
moved to a remote location to ensure continuous business operation.
Remote replication also enables organizations to replicate their data to the cloud for DR purpose.
In a remote replication, data can be synchronously or asynchronously replicated.

Video: Storage-Based Replication


Local Replication: VM Snapshot
A VM snapshot preserves the state and data of a VM at a specific PIT. The state includes the VM’s power
state (for example, powered-on, powered-off, or suspended). The data includes all the files that make up
the VM. This includes disks, memory, and other devices, such as virtual network interface cards. This VM
snapshot is useful for quick restore of a VM.

• For example, an administrator can create a snapshot of a VM, make changes such as
applying patches and software upgrades to the VM. If anything goes wrong, the
administrator can restore the VM to its previous state using the VM snapshot. The
hypervisor provides an option to create and manage multiple snapshots. Taking multiple
snapshots provide several restore points for a VM. While more snapshots improve the
resiliency of the infrastructure, it is important to consider the storage space they consume.

When a snapshot is created for a VM, a child virtual disk (delta disk file) is created from the base image or
parent virtual disk. The snapshot mechanism prevents the guest operating system from writing to the base
image or parent virtual disk. Instead it directs all writes to the delta disk file. Successive snapshots generate
a new child virtual disk from the last child virtual disk in the chain. Snapshots hold only changed blocks.

Sometimes it may be required to retain a snapshot for a longer period. It must be noted that larger
snapshots take a longer time to commit and may impact performance. The source (parent VM) must be
healthy in order to use a snapshot for rollback.

285
Local Replication: VM Snapshot Example
In this example, child virtual disk 1 stores all the changes that are made to the parent VM after
snapshot 1 is created. Similarly, child virtual disk 2 and child virtual disk 3 store all the changes
after snapshot 2 and snapshot 3 are created respectively. When committing snapshot 3 for the VM,
the data on child virtual disk file 1 and 2 are committed prior to committing data on child virtual
disk 3 to the parent virtual disk file. After committing the data, the child virtual disk 1, child virtual
disk 2, and child virtual disk 3 are deleted. However, while rolling back to snapshot 1 (PIT),
child disk file 1 is retained and the snapshots 2 and 3 are discarded.
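The chain-of-delta-disks behavior can be modeled in a few lines of Python. This is only a conceptual sketch of commit and rollback, not how any particular hypervisor stores its snapshot files; the block names and values are illustrative.

# Hypothetical sketch of a VM snapshot chain: parent disk plus child (delta) disks.
parent_disk = {"blockA": "v0", "blockB": "v0"}
snapshots = []                                # each entry is a delta disk holding changed blocks only

def take_snapshot():
    snapshots.append({})                      # new writes now go to this delta disk

def write_block(block, value):
    target = snapshots[-1] if snapshots else parent_disk
    target[block] = value

def commit_all():
    # Committing merges every delta disk, oldest first, into the parent and deletes the deltas.
    for delta in snapshots:
        parent_disk.update(delta)
    snapshots.clear()

def rollback_to(index):
    # Rolling back to snapshot N retains the deltas up to N and discards the later ones.
    del snapshots[index + 1:]

take_snapshot(); write_block("blockA", "v1")   # changes after snapshot 1
take_snapshot(); write_block("blockB", "v2")   # changes after snapshot 2
rollback_to(0)                                  # discard snapshot 2's changes
print(parent_disk, snapshots)                   # blockB change is gone; blockA change kept in delta 1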

Local Replication: Storage System-Based Snapshot


• Redirects new writes that are destined for the source LUN to a reserved LUN in the storage
pool
• Replica (snapshot) still points to the source LUN

286
o All reads from replica are served from the source LUN
• A redirect on write (RoW) mechanism may be used.
• With virtual provisioning, all data is stored in a shared storage pool
• The source thin LUN is a set of pointers to the data in the storage pool
• When the PIT is activated, the snapshot LUN must look identical to the source LUN at that
specified point in time
• The Snapshot LUN is also a set of pointers to the data in the storage pool
• After the PIT has been activated, new writes that are destined for the source LUN are
redirected to the shared storage pool and the pointer is updated (RoW)
• The Snapshot LUN still points to data that reflects the original PIT
o All reads from replica are served from the shared storage pool

287
Storage system-based snapshot is a space-optimal, pointer-based virtual replication. At the
time of replication session activation, the target (snapshot) contains pointers to the location
of the data on the source. The snapshot does not hold the data itself at any time and is
therefore known as a virtual replica.
The snapshot is immediately accessible after the replication session is activated. A snapshot is
typically recommended when the changes to the source are less than 30 percent. Multiple
snapshots can be created from the same source LUN for various business requirements.
Some snapshot software provides the capability to automatically terminate a snapshot upon
reaching its expiration date. This approach is useful where a rolling snapshot might be taken
and then automatically removed after its period of usefulness has passed. The unavailability of
the source device invalidates the data on the target.
Storage system-based snapshots use a redirect on write (RoW) mechanism. In RoW, a new
write from the compute system that is destined for the source LUN is redirected to a new
location in a reserved area of the shared storage pool, and the pointer for that block is updated.
The original data remains where it is, untouched by the RoW process, so reads of the
point-in-time copy are served from the original locations on the source LUN.
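A minimal sketch of the redirect-on-write idea, assuming that a thin LUN and its snapshot are both just pointer tables into a shared pool; the class names and structures are illustrative, not a product design.

```python
# Illustrative redirect-on-write (RoW) snapshot, assuming pointer-table LUNs over a shared pool.

class SharedPool:
    def __init__(self):
        self.blocks = []                      # physical block store

    def allocate(self, data):
        self.blocks.append(data)
        return len(self.blocks) - 1           # return the new physical location


class ThinLUN:
    def __init__(self, pool):
        self.pool = pool
        self.pointers = {}                    # logical block -> physical location in the pool

    def write(self, block, data):
        # RoW: a new write always lands in a new location; only the pointer is updated.
        self.pointers[block] = self.pool.allocate(data)

    def read(self, block):
        return self.pool.blocks[self.pointers[block]]

    def snapshot(self):
        # Activating the PIT copies only the pointer table, not the data.
        snap = ThinLUN(self.pool)
        snap.pointers = dict(self.pointers)
        return snap


pool = SharedPool()
source = ThinLUN(pool)
source.write(0, "v1")
snap = source.snapshot()                      # snapshot LUN looks identical to the source at the PIT
source.write(0, "v2")                         # new write is redirected; the original block is untouched
print(source.read(0), snap.read(0))           # -> v2 v1
```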

Local Replication: Clone


Cloning provides the ability to create fully populated point-in-time copies of LUNs within a storage system
or create a copy of an existing VM

• Clone of a storage volume


o Initial synchronization is performed between the source LUN and the replica (clone)
o Changes made to both the source and the replica can be tracked at some predefined
granularity
• VM clone
o Clone is a copy of an existing virtual machine (parent VM)
▪ The clone VM’s MAC address is different from the parent VM
o Typically clones are deployed when many identical VMs are required
▪ Reduces the time that is required to deploy a new VM

Clone of a storage volume:

• When the replication session is started, an initial synchronization is performed between the
source LUN and the replica (clone). Synchronization is the process of copying data from
the source LUN to the clone. During the synchronization process, the replica is not available
for any compute system access. Once the synchronization is completed, the replica is
exactly the same as the source LUN. The replica can be detached from the source LUN. It can be
made available to another compute system for business operations. Subsequent

synchronizations involve only a copy of any data that has changed on the source LUN since
the previous synchronization.
• Typically after detachment, changes made to both the source and replica can be tracked at
some predefined granularity. This approach enables incremental resynchronization (source
to target) or incremental restore (target to source). The clone must be the same size as the
source LUN.

VM Clone:

• A VM clone is a copy of an existing VM. The existing VM is called the parent of the clone.
When the cloning operation completes, the clone becomes a separate VM. The changes
made to a clone do not affect the parent VM. Changes made to the parent VM do not appear
in a clone. A clone's MAC address is different from that of the parent VM.
• In general, installing a guest operating system and applications on a VM is a time
consuming task. With clones, administrators can make many copies of a virtual machine
from a single installation and configuration process. For example, in an organization, the
administrator can clone a VM for each new employee, with a suite of preconfigured
software applications.

Remote Replication: Synchronous


• Write is committed to both the source and the remote replica before it is acknowledged to
the compute system
• Enables business operations to be restarted at a remote site with zero data loss; provides
near-zero RPO

The illustration provides an example of synchronous remote replication. If the source site is unavailable
due to a disaster, the service can be restarted immediately at the remote site to meet the required SLA.
A storage-based remote replication solution can avoid downtime by enabling business
operations at remote sites. Storage-based synchronous remote replication provides near-zero
RPO, where the target is always identical to the source.
In synchronous replication, writes must be committed to the source and the remote target
prior to acknowledging “write complete” to the production compute system. Additional
writes on the source cannot occur until each preceding write has been completed and
acknowledged.

This approach ensures that data is identical on the source and the target at all times. Further,
writes are transmitted to the remote site exactly in the order in which they are received at the
source. Write ordering is maintained and it ensures transactional consistency when the
applications are restarted at the remote location. As a result, the remote images are always
restartable copies.

• Note: Application response time is increased with synchronous remote replication.


This is because writes must be committed on both the source and the target before the
"write complete" acknowledgment is sent to the compute system. The degree of impact on
response time depends primarily on the distance and the network bandwidth between
sites. If the bandwidth provided for synchronous remote replication is less than the
maximum write workload, there will be times during the day when the response time
might be excessively elongated, causing applications to time out. The distances over
which synchronous replication can be deployed depend on the application’s capability
to tolerate the increase in response time. Typically, synchronous remote replication
is deployed for distances less than 200 kilometers (125 miles) between the two sites.
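The ordering guarantee of synchronous replication can be illustrated with a short sketch: the write is applied to the source and to the remote target before the compute system receives its acknowledgment. The classes and the in-memory "link latency" are illustrative assumptions, not a storage system's implementation.

```python
# Illustrative synchronous remote replication: acknowledge only after both copies commit.
import time

class Volume:
    def __init__(self, name, latency_s=0.0):
        self.name = name
        self.latency_s = latency_s            # models link/commit latency to this volume
        self.blocks = {}

    def commit(self, block, data):
        time.sleep(self.latency_s)
        self.blocks[block] = data


def synchronous_write(source, target, block, data):
    """Return only after the write is committed on both the source and the remote target."""
    start = time.time()
    source.commit(block, data)
    target.commit(block, data)                # remote commit happens before the ack
    return time.time() - start                # response time includes the remote round trip


source = Volume("source")
target = Volume("remote", latency_s=0.002)    # roughly 2 ms of link latency per write (illustrative)
elapsed = synchronous_write(source, target, 0, "order-1234")
assert source.blocks == target.blocks         # source and target are identical at all times
print(f"write acknowledged after {elapsed * 1000:.1f} ms")
```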

Remote Replication: Asynchronous


It is important for an organization to replicate data across geographical locations to mitigate the risk
involved during a disaster. If data is replicated synchronously between two nearby sites and a regional
disaster strikes, there is a chance that both sites may be impacted. This leads to data loss and a service
outage.

Replicating data across sites that are thousands of kilometers apart helps an organization withstand such
a disaster. If a disaster strikes one of the regions, the data would still be available in another region and
the service could be moved to that location. Asynchronous replication enables data to be replicated across
sites that are thousands of kilometers apart.

A write is committed to the source and immediately acknowledged to the compute system:

• Data is buffered at the source and sent to the remote site periodically
• Application write response time is not dependent on the latency of the link
• Replica is behind the source by a finite amount (finite RPO)

In asynchronous remote replication, a write from a production compute system is committed
to the source and immediately acknowledged to the compute system. Asynchronous
replication also mitigates the impact to the application’s response time because the writes are
acknowledged immediately to the compute system.
This method enables replicating data over distances of up to several thousand kilometers
between the source site and the secondary site (remote locations). In this replication, the
required bandwidth can be provisioned equal to or greater than the average write workload.
In asynchronous replication, compute system writes are collected into buffer (delta set) at the
source. This delta set is transferred to the remote site in regular intervals. Adequate buffer
capacity should be provisioned to perform asynchronous replication. Some storage vendors
offer a feature called delta set extension, which enables the delta set to be offloaded from the
buffer (cache) to specially configured drives. This feature makes asynchronous replication
resilient to a temporary increase in write workload or a loss of the network link.
In asynchronous replication, RPO depends on the size of the buffer, the available network
bandwidth, and the write workload to the source. This replication can take advantage of
locality of reference (repeated writes to the same location). If the same location is written
multiple times in the buffer prior to transmission to the remote site, only the final version of
the data is transmitted. This feature conserves link bandwidth.
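A small sketch of the delta-set idea, under the assumption that the buffer is a simple dict keyed by block address so that repeated writes to the same location are coalesced before transmission; the names are illustrative.

```python
# Illustrative asynchronous replication buffer (delta set) with write coalescing.

class AsyncReplicator:
    def __init__(self):
        self.delta_set = {}                   # block -> latest data since the last transfer cycle
        self.remote = {}                      # remote replica (behind the source by a finite RPO)

    def write(self, block, data):
        """Acknowledge immediately; buffer the write for later transmission."""
        self.delta_set[block] = data          # locality of reference: only the final version is kept
        return "ack"                          # response time is independent of link latency

    def transfer_cycle(self):
        """Periodically ship the delta set to the remote site."""
        transmitted = len(self.delta_set)
        self.remote.update(self.delta_set)
        self.delta_set = {}
        return transmitted


rep = AsyncReplicator()
for value in ("v1", "v2", "v3"):
    rep.write(0, value)                       # three writes to the same block
rep.write(1, "x")
print(rep.transfer_cycle())                   # -> 2 blocks sent, not 4: link bandwidth is conserved
```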

Remote Replication: Multisite

Select here for details.

In a two-site synchronous replication, the source and target sites are usually within a short distance. If a
regional disaster occurs, both the source and the target sites might become unavailable. This can lead to
extended RPO and RTO, since the last known good copy of data would need to come from another
source, such as an offsite tape.

A regional disaster will not affect the target site in a two-site asynchronous replication since the sites are
typically several hundred or several thousand kilometers apart.

If the source site fails, production can be shifted to the target site. However, there is no further remote
protection of data until the failure is resolved.

• Data from the source site is replicated to multiple remote sites for DR purposes
o Disaster recovery protection is always available if any one-site failure occurs
• Mitigates the risk in two-site replication
o No DR protection after a source or remote site failure

Multisite replication mitigates the risks that are identified in two-site replication. In a
multisite replication, data from the source site is replicated to two or more remote sites. The
illustration provides an example of a three-site remote replication solution. In this approach,
data at the source is replicated to two different storage systems at two different sites. The
source-to-bunker site (target 1) replication is synchronous with a near-zero RPO. The source-
to-remote site (target 2) replication is asynchronous with an RPO in the order of minutes.
The key benefit of this replication is the ability to fail over to either of the two remote sites in
the case of source-site failure.
Disaster recovery protection is always available if any one-site failure occurs. During normal
operations, all three sites are available and the production workload is at the source site. At
any given instance, the data at the bunker and the source is identical. The data at the remote
site is behind the data at the source and the bunker. The replication network links between
the bunker and the remote site are in place but are not in use during normal operations. The difference in the data
between the bunker and the remote sites is tracked. If a source site disaster occurs, operations
can be resumed at the bunker or the remote sites with incremental resynchronization
between these two sites.

Video: Network-Based Replication


Continuous Data Protection (CDP)
• Network-based replication solution
• CDP provides the ability to restore data and VMs to any previous PIT
• Supports heterogeneous compute and storage platforms
• Supports both local and remote replication
o Data can also be replicated to more than two sites (multisite)
• Supports WAN optimization techniques to reduce bandwidth requirements

Continuous data protection (CDP) is a network-based replication solution that provides the
capability to restore data and VMs to any previous PIT.
Traditional data protection technologies offer a limited number of recovery points. If a data
loss occurs, the system can be rolled back only to the last available recovery point. CDP tracks
all the changes to the production volumes and maintains consistent point-in-time images. This
enables CDP to restore data to any previous PIT.
CDP supports both local and remote replication of data and VMs to meet operational and
disaster recovery requirements, respectively. In a CDP implementation, data can be replicated to more than
two sites using synchronous and asynchronous replication. CDP supports various WAN
optimization techniques (deduplication, compression). These techniques reduce bandwidth
requirements, and also optimally use the available bandwidth.

Key CDP Components


The following are key CDP components:

Journal Volume

Contains all the data that has changed on the production volume from the time the replication
session started
CDP uses a journal volume to store all the data that has changed on the production volume from the
time the replication session started. The journal contains the metadata and data that enable roll back
to any recovery points. The amount of space that is configured for the journal determines how far back
the recovery points can go.
CDP Appliance

• Intelligent hardware platform that runs the CDP software


• Manages both the local and the remote replications
• Appliance could also be virtual, where CDP software is running inside VMs

CDP also uses an appliance and a write splitter. A CDP appliance is an intelligent hardware platform that
runs the CDP software and manages local and remote data replications. Some vendors offer virtual
appliance where the CDP software is running inside VMs.
Write Splitter

• Intercepts writes to the production volume from the compute system and splits each write
into two copies
• Can be implemented at the compute, fabric, or storage system

Write splitters intercept writes to the production volume from the compute system and split each write
into two copies. Write splitting can be performed at the compute, fabric, or storage system.

CDP Operations: Local and Remote Replication
The illustration provides an example of a CDP local and remote replication operations where the write
splitter is deployed at the compute system.

Typically the replica is synchronized with the source, and then the replication process starts.
After the replication starts, all the writes from the compute system to the source (production
volume) are split into two copies. One copy is sent to the local CDP appliance at the source
site, and the other copy is sent to the production volume. Then the local appliance writes the
data to the journal at the source site and the data in turn is written to the local replica. If a
file is accidentally deleted, or the file is corrupted, the local journal enables organizations to
recover the application data to any PIT.
In remote replication, the local appliance at the source site sends the received write I/O to the
appliance at the remote (DR) site. Then, the write is applied to the journal volume at the
remote site. As a next step, data from the journal volume is sent to the remote replica at
predefined intervals. CDP operates in either synchronous or asynchronous mode.
In the synchronous replication mode, the application waits for an acknowledgment from the remote site before the write is confirmed to the compute system; in the asynchronous mode, the write is acknowledged locally and transferred to the remote site later.
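The following compact sketch captures the write-splitter and journal flow described above; the class names, the in-memory journal, and the timestamps are illustrative assumptions rather than any vendor's implementation.

```python
# Illustrative CDP flow: a write splitter sends each write to the production volume
# and to a CDP appliance, which journals it so any previous PIT can be recovered.
import time

class CDPAppliance:
    def __init__(self):
        self.journal = []                     # (timestamp, block, data) entries
        self.replica = {}

    def receive(self, block, data):
        self.journal.append((time.time(), block, data))
        self.replica[block] = data            # journal data is applied to the local replica

    def restore_to(self, point_in_time):
        """Roll forward through the journal up to the requested PIT."""
        image = {}
        for ts, block, data in self.journal:
            if ts > point_in_time:
                break
            image[block] = data
        return image


def split_write(production_volume, appliance, block, data):
    """Write splitter: every compute-system write is split into two copies."""
    production_volume[block] = data           # copy 1: production volume
    appliance.receive(block, data)            # copy 2: CDP appliance -> journal -> replica


prod, cdp = {}, CDPAppliance()
split_write(prod, cdp, 0, "v1")
pit = time.time()                             # the recovery point we want later
split_write(prod, cdp, 0, "v2-corrupted")
print(cdp.restore_to(pit))                    # -> {0: 'v1'}: data recovered to the chosen PIT
```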

Hypervisor-based CDP
The illustration shows a CDP local replication implementation.

• Protects a single VM or multiple VMs, locally or remotely
• Enables restoring a VM to any PIT
• Virtual appliance runs on a hypervisor
• Write splitter is embedded in the hypervisor

Some vendors offer continuous data protection for VMs through a hypervisor-based CDP
implementation. In this deployment, the specialized hardware-based appliance is replaced with a
virtual appliance running on a hypervisor. The write splitter is embedded in the hypervisor. This
option protects a single VM or multiple VMs, locally or remotely, and enables restoring VMs to any
PIT. The local and remote replication operations are similar to network-based CDP replication.

Backup and Recovery

Backup and Recovery Overview


Definition: Backup

An additional copy of production data, which is created and retained for the sole purpose of recovering
lost or corrupted data.

• Typically both application data and server configurations are backed up to restore data and
servers if there is an outage.
• Businesses also implement backup solutions to comply with regulatory requirements.
• To implement a successful backup and recovery solution
o IT needs to evaluate the backup methods along with their recovery considerations
and retention requirements

Like protecting the IT infrastructure components (compute, storage, and network), it is also
critical for organizations to protect their data. Typically, organizations implement data
protection solutions to protect data against accidental file deletion, application crashes,
data corruption, and disasters. Data should be protected at local and remote locations to
ensure the availability of service.
For example, when a service is failed over to another zone (data center), the data should be
available at the destination. This approach helps to fail over the service successfully and
minimize the outage. One of the key data protection solutions that is widely implemented is
backup.
A backup is an additional copy of production data, which is created and retained for the sole
purpose of recovering the lost or corrupted data. With the growing business and the
regulatory demands for data storage, retention, and availability, organizations face the task

of backing up an ever-increasing amount of data. This task becomes more challenging with
the growth of data, reduced IT budgets, and less time available for taking backups.
Moreover, organizations need fast backup and recovery of data to meet their service level
agreements. Most organizations spend a considerable amount of time and money protecting
their application data but give less attention to protecting their server configurations. During
disaster recovery, server configurations must be re-created before the application and data
are accessible to the user.
The process of system recovery involves reinstalling the operating system, applications, and
server settings and then recovering the data. So it is important to back up both application
data and server configurations. Evaluating backup technologies, recovery, and retention
requirements for data and applications is an essential step to ensure successful
implementation of a backup and recovery solution.

Video: Backup and Recovery Overview


Backup Targets
Tape Library
• Tapes are portable and can be used for long-term offsite storage.
• Must be stored in locations with a controlled environment
• Not optimized to recognize duplicate content
• Data integrity and recoverability are major issues with tape-based backup media.

Disk Library
• Enhanced backup and recovery performance
• No inherent offsite capability
• Disk-based backup appliances include features such as deduplication, compression, encryption, and replication to support business objectives

Virtual Tape Library
• Disks are emulated and presented as tapes to the backup software.
• Does not require any additional modules or changes in the legacy backup software
• Provides better performance and reliability over physical tape
• Does not require the usual maintenance tasks that are associated with a physical tape drive, such as periodic cleaning and drive calibration

A tape library contains one or more tape drives that record and retrieve data on magnetic
tape. Tape is portable, and one of the primary reasons for the use of tape is long-term, offsite
storage. Backups that are implemented using tape devices involve several hidden costs. Tapes
must be stored in locations with a controlled environment to ensure preservation of the media
and to prevent data corruption. Physical transportation of the tapes to offsite locations also
adds management overhead and increases the possibility of loss of tapes during offsite
shipment.

The traditional backup process, using tapes, is not optimized to recognize duplicate content.
Due to its sequential data access, both backing up of data and restoring it take more time
with tape. This data access may impact the backup window and RTO. A backup window is a
period during which a production volume is available to perform backup. Data integrity and
recoverability are also major issues with tape-based backup media.
Disk density has increased dramatically over the past few years, lowering the cost per GB and
making disk a viable backup target for organizations. When used in a highly available
configuration in a storage array, disks offer a reliable and fast backup target medium. One
way to implement a backup to disk system is by using it as a staging area. This approach
offloads backup data to a secondary backup target such as tape after a period of time.
Some vendors offer purpose-built, disk-based backup appliances that have emerged as an
optimal backup target solution. These systems are optimized for backup and recovery
operations, offering extensive integration with popular backup management applications.
The integrated features such as replication, compression, encryption, and data deduplication
increase the value of purpose-built backup appliances.
Virtual tape libraries use disks as backup media. Virtual tapes are disk drives that are
emulated and presented as tapes to the backup software. Compared to physical tapes, virtual
tapes offer better performance, better reliability, and random disk access. A virtual tape
drive does not require the usual maintenance tasks that are associated with a physical tape
drive, such as periodic cleaning and drive calibration. Compared to the disk library, a virtual
tape library offers easy installation and administration because it is preconfigured by the
manufacturer. A key feature that is available on virtual tape library appliances is replication.

Backup Operation

(1) Backup server initiates scheduled backup process.

(2) Backup server retrieves backup-related information from the backup catalog.

(3a) Backup server instructs storage node to load backup media in the backup device.

(3b) Backup server instructs backup clients to send data to be backed up to the storage node.

(4) Backup clients send data to storage node and update the backup catalog on the backup server.

(5) Storage node sends data to the backup device

(6) Storage node sends metadata and media information to the backup server

(7) Backup server updates the backup catalog

The backup operation is
typically initiated by a server, but it can also be initiated by a client. The backup server
initiates the backup process for different clients based on the backup schedule
configured for them.
For example: the backup for a group of clients may be scheduled to start at 3:00 a.m. every
day. The backup server coordinates the backup process with all the components in a backup
environment. The backup server maintains the information about backup clients to be
backed up and storage nodes to be used in a backup operation. The backup server retrieves
the backup related information from the backup catalog. Based on this information, the
backup server instructs the storage node to load the appropriate backup media into the
backup devices.
Simultaneously, it instructs the backup clients to gather the data to be backed up and send
it over the network to the assigned storage node. After the backup data is sent to the storage
node, the client sends some backup metadata (the number of files, name of the files, storage
node details, and so on) to the backup server. The storage node receives the client data,
organizes it, and sends it to the backup device. The storage node sends extra backup metadata
(location of the data on the backup device, time of backup, and so on) to the backup server.
The backup server updates the backup catalog with this information. The backup data from
the client can be sent to the backup device over a LAN or SAN network.
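To make the coordination among the backup server, the clients, the storage node, and the catalog easier to follow, here is a toy Python sketch of the workflow; the class and method names and the catalog layout are illustrative assumptions, not any backup product's API.

```python
# Toy model of the backup workflow: the server coordinates, the client ships data,
# the storage node writes to the backup device, and the catalog records metadata.

class BackupServer:
    def __init__(self):
        self.catalog = []                         # backup catalog: metadata about every backup

    def run_scheduled_backup(self, client, storage_node):
        storage_node.load_media()                 # step 3a: instruct storage node to load media
        files = client.gather_data()              # steps 3b/4: client gathers and sends data
        media_info = storage_node.write(files)    # step 5: storage node writes to backup device
        self.catalog.append({                     # steps 6-7: metadata updates the catalog
            "client": client.name,
            "files": [name for name, _ in files],
            "media": media_info,
        })


class BackupClient:
    def __init__(self, name, files):
        self.name, self.files = name, files

    def gather_data(self):
        return list(self.files.items())


class StorageNode:
    def __init__(self):
        self.device = []

    def load_media(self):
        pass                                      # placeholder for mounting a tape/disk volume

    def write(self, files):
        self.device.extend(files)
        return {"location": len(self.device), "device": "disk-library-1"}


server, node = BackupServer(), StorageNode()
client = BackupClient("app-server-01", {"/etc/app.conf": b"cfg", "/data/db.dmp": b"rows"})
server.run_scheduled_backup(client, node)
print(server.catalog[0]["files"])                 # the catalog now records what was backed up and where
```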
Hot backup and cold backup are the two methods that are deployed for backup. They are
based on the state of the application when the backup is performed. In a hot backup, the
application is up-and-running, with users accessing their data during the backup process.
This method of backup is also referred to as online backup. The hot backup of online
production data is challenging because data is actively being used and changed. If a file is
open, it is normally not backed up during the backup process.
In such situations, an open file agent is required to back up the open file. These agents interact
directly with the operating system or application and enable the creation of consistent copies
of open files. The disadvantage that is associated with a hot backup is that the agents usually
affect the overall application performance. A cold backup requires the application to be shut
down during the backup process. Hence, this method is also referred to as offline backup.
Consistent backups of databases can also be done by using a cold backup. The disadvantage
of a cold backup is that the database is inaccessible to users during the backup process.

Recovery Operation

(1) Backup client requests backup server for data restore

(2) Backup server scans backup catalog to identify data to be restored and the client that will receive data

(3) Backup server instructs storage node to load backup media in the backup device

(4) Data is then read and sent to the backup client

(5) Storage node sends restore metadata to the backup server

(6) Backup server updates the backup catalog

After the data is backed up, it can be restored when required. A restore process can be
manually initiated from the client. A recovery operation restores data to its original state at
a specific PIT. Typically backup applications support restoring one or more individual files,
directories, or VMs. The illustration depicts a restore operation.
Upon receiving a restore request, an administrator opens the restore application to view the
list of clients that have been backed up. While selecting the client for which a restore request
has been made, the administrator also needs to identify the client that receives the restored
data. Data can be restored on the same client for whom the restore request has been made or
on any other client.
The administrator then selects the data to be restored and the specified point in time to which
the data has to be restored based on the RPO. Because all this information comes from the
backup catalog, the restore application needs to communicate with the backup server. The
backup server instructs the appropriate storage node to mount the specific backup media
onto the backup device. Data is then read and sent to the client that has been identified to

receive the restored data. Some restorations are successfully accomplished by recovering only
the requested production data. For example, the recovery process of a spreadsheet is
completed when the specific file is restored. In database restorations, additional data, such
as log files, must be restored along with the production data. This approach ensures
consistency of the restored data. In these cases, the RTO is extended due to the additional
steps in the restore operation. It is also important for the backup and recovery applications
to have security mechanisms to avoid recovery of data by nonauthorized users.

Backup Granularity

Backup granularity depends on business needs and the required RTO/RPO. Based on the granularity,
backups can be categorized as full, incremental, and cumulative (or differential).

Most organizations use a combination of these backup types to meet their backup and recovery
requirements.

The illustration depicts the different backup granularity levels.


Full Backup
It is a full copy of the entire data set. Organizations typically perform a full backup only on a periodic basis because it
requires more storage space and also takes more time to complete. However, a full backup provides faster data
recovery.

Incremental Backup
It copies the data that has changed since the last backup. For example, a full backup is created on Monday,
and incremental backups are created for the rest of the week. Tuesday's backup would only contain the
data that has changed since Monday. Wednesday's backup would only contain the data that has changed
since Tuesday.

The primary disadvantage to incremental backups is that they can be time-consuming to restore. Suppose
an administrator wants to restore the backup from Wednesday. To do so, the administrator has to first
restore Monday's full backup. After that, the administrator has to restore Tuesday's copy, followed by
Wednesday's.

Cumulative Backup
It copies the data that has changed since the last full backup. Suppose, for example, the administrator
wants to create a full backup on Monday and differential backups for the rest of the week. Tuesday's
backup would contain all of the data that has changed since Monday. It would therefore be identical to an
incremental backup at this point.

On Wednesday, however, the differential backup would back up any data that had changed since Monday
(the full backup). The advantage that differential backups have over incremental backups is shorter restore
times. Restoring a differential backup never requires more than two copies. The tradeoff is that as time
progresses, a differential backup can grow to contain more data than an incremental backup.
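A short sketch of the restore-chain logic implied by these granularity levels, assuming each backup record simply notes its day number and type; the data structures are illustrative.

```python
# Which backup copies must be restored for a given day? (illustrative model)

def restore_chain(backups, target_day):
    """backups: list of (day, type) ordered oldest to newest,
    with type in {"full", "incremental", "cumulative"}."""
    chain = []
    for day, kind in backups:
        if day > target_day:
            break
        if kind == "full":
            chain = [(day, kind)]              # a full backup resets the chain
        elif kind == "incremental":
            chain.append((day, kind))          # every incremental since the full is needed
        elif kind == "cumulative":
            chain = [chain[0], (day, kind)]    # only the full plus the latest cumulative
    return chain


# Day 1 = Monday full backup, then either incrementals or cumulatives on days 2 and 3.
incremental_week = [(1, "full"), (2, "incremental"), (3, "incremental")]
cumulative_week = [(1, "full"), (2, "cumulative"), (3, "cumulative")]

print(restore_chain(incremental_week, 3))      # -> three copies: full plus both incrementals
print(restore_chain(cumulative_week, 3))       # -> never more than two copies: full plus the latest cumulative
```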

Agent-Based Backup
In this approach, an agent or client is installed on a virtual machine or a physical compute system. The
agent streams the backup data to the backup device as shown in the illustration.

• Agent is running inside the application servers (physical/virtual)


o Performs file-level backup
• Impacts performance of applications running on compute systems
o Performing backup on multiple VMs on a compute system may consume more
resources and lead to resource contention

This backup does not capture virtual machine configuration files. The agent running on the
compute system consumes CPU cycles and memory resources. If multiple VMs on a compute
system are backed up simultaneously, then the combined I/O and bandwidth demands that
are placed on the compute system by the various backup operations can deplete the compute
system resources.
This approach may impact the performance of the services or applications running on the
VMs. To overcome these challenges, the backup process can be offloaded from the VMs to a
proxy server. This can be achieved by using the image-based backup approach.

Image-Based Backup

Image-based backup makes a copy of the virtual drive and configuration that are associated with a
particular VM.

• Backup is saved as a single entity called a VM image

o Enables quick restoration of a VM
• Supports recovery at VM-level and file-level
• No agent is required inside the VM to perform backup
• Backup processing is offloaded from VMs to a proxy server

Image-based backup makes a copy of the virtual drive and configuration that are associated
with a particular VM. The backup is saved as a single entity called a VM image. This type
of backup is suitable for restoring an entire VM if there is a hardware failure or human error
such as the accidental deletion of the VM. Image-based backup also supports file-level
recovery.
In an image-level backup, the backup software can back up VMs without installing backup
agents inside the VMs or at the hypervisor-level. The backup processing is performed by a
proxy server that acts as the backup client, thereby offloading the backup processing from
the VMs. The proxy server communicates to the management server responsible for
managing the virtualized compute environment. It sends commands to create a snapshot of
the VM to be backed up and to mount the snapshot to the proxy server. A snapshot captures
the configuration and virtual drive data of the target VM and provides a point-in-time view
of the VM. The proxy server then performs the backup by using the snapshot. Some vendors
support incremental backup through changed block tracking. This feature identifies and
tags any blocks that have changed since the last VM snapshot. This approach enables the
backup application to back up only the blocks that have changed, rather than backing up
every block. This considerably reduces the amount of data to be backed up and allows more
VMs to be backed up within a backup window.
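Here is a minimal illustration of the changed-block-tracking idea: compare block states between snapshots and ship only the blocks that differ. The block IDs and the hashing granularity are illustrative assumptions.

```python
# Illustrative changed block tracking (CBT) for image-based incremental backup.
import hashlib

def fingerprint(block_data: bytes) -> str:
    return hashlib.sha256(block_data).hexdigest()

def changed_blocks(previous_snapshot: dict, current_snapshot: dict) -> dict:
    """Return only the blocks whose content changed since the last snapshot."""
    return {
        block: data
        for block, data in current_snapshot.items()
        if fingerprint(data) != fingerprint(previous_snapshot.get(block, b""))
    }


last = {0: b"boot", 1: b"app-v1", 2: b"logs-mon"}
now = {0: b"boot", 1: b"app-v2", 2: b"logs-tue"}     # two of the three blocks changed
delta = changed_blocks(last, now)
print(sorted(delta))                                  # -> [1, 2]: only these blocks are backed up
```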

Image-Based Backup: Recovery-In-Place


Definition: Recovery-in-place

A term that refers to running a VM directly from the backup device, using a backed up copy of the VM
image instead of restoring that image file.

• Eliminates the need to transfer the image from the backup device to the primary storage
before it is restarted
o Provides an almost instant recovery of a failed VM
• Requires a random access device to work efficiently
o Disk-based backup target
• Reduces the RTO and network bandwidth to restore VM files

One of the primary benefits of recovery in place is that it eliminates the need to transfer the image from
the backup area to the primary storage area before it is restarted. So, the application that is running on
those VMs can be accessed more quickly. This method not only saves time for recovery, but also reduces
network bandwidth to restore files.

NDMP-Based Backup
Definition: NDMP

An open standard TCP/IP-based protocol that is designed for a backup in a NAS environment.

• Data can be backed up using NDMP regardless of the operating system or platform
• Backup data is sent directly from NAS to the backup device
o No longer necessary to transport data through application servers
• Backs up and restores data while preserving security attributes of file system (NFS and
CIFS) and maintains data integrity

As the amount of unstructured data continues to grow exponentially, organizations face the
daunting task of ensuring that critical data on NAS systems is protected. Most NAS heads
run on proprietary operating systems that are designed for serving files.
To maintain operational efficiency, they generally do not support hosting third-party
applications such as backup clients. This forced backup administrators to back up data from
the application server or to mount each NAS volume through CIFS or NFS from another server
across the network that hosted a backup agent. These approaches may lead to performance
degradation of the application server and the production network during backup operations,
due to the added overhead.
Further, security structures differ on the two network file systems, NFS and CIFS. Backups
that are implemented through one of the file systems would not effectively backup any data
security attributes on the NAS head that was accessed through a different file system. For
example, CIFS backup, when restored, would not be able to restore NFS file attributes and
vice versa. These backup challenges of the NAS environment can be addressed with the use
of Network Data Management Protocol (NDMP).
NDMP is an industry-standard TCP/IP-based protocol that is designed for a backup in a
NAS environment. It communicates with several elements in the backup environment (NAS
head, backup devices, backup server, and so on) for data transfer and enables vendors to use
a common protocol for the backup architecture. Data can be backed up using NDMP
regardless of the operating system or platform. NDMP backs up and restores data without
losing the data integrity and file system structure (regarding different rights and permission
in different file systems).
Due to its flexibility, it is no longer necessary to transport data through the application server,
which reduces the load on the application server and improves the backup speed. NDMP
optimizes backup and restore by using the high-speed connection between the backup devices
and the NAS head. In NDMP, backup data is sent directly from the NAS head to the backup
device, whereas metadata is sent to the backup server.

Primary Storage-Based Backup
This backup approach backs up data directly from primary storage system to backup target without
requiring additional backup software.

• Eliminates the backup impact on application servers


• Improves the backup and recovery performance to meet SLAs

Typically, an agent runs on the application servers that control the backup process. This
agent stores configuration data for mapping the LUNs on the primary storage system to the
backup device to orchestrate backup (the transfer of changed blocks and creation of backup
images) and recovery operations. This backup information (metadata) is stored in a catalog
which is local to the application server.
When a backup is triggered through the agent running on application server, the application
momentarily pauses simply to mark the point in time for that backup. The data blocks that
have changed since the last backup are sent across the network to the backup device. The direct
movement from primary storage to backup device eliminates the LAN impact by isolating all
backup traffic to the SAN. This approach eliminates backup impact on application servers
and provides faster backup and recovery to meet the application protection SLAs.
For data recovery, the backup administrator triggers recovery operation and then the
primary storage reads the backup image from the backup device. The primary storage
replaces production LUN with the recovered copy.

Cloud-Based Backup: Backup as a Service


• Enables consumers to procure backup services on demand through a
self-service portal
o Provides the capability to perform backup and recovery at any
time, from anywhere
• Reduces the backup management overhead
o Transforms from CAPEX to OPEX
o Pay-per-use/subscription-based pricing
o Enables organizations to meet long-term retention
requirements

• Backing up to cloud ensures regular and automated backup of data
• Gives consumers the flexibility to select a backup technology based on their current
requirements

Data is important for businesses of all sizes. Organizations need to regularly back up data to
avoid losses, stay compliant, and preserve data integrity. IT organizations today are dealing
with the explosion of data, particularly with the development of third platform technologies.
Data explosion poses the challenge of data backup and quick data restore. It strains the
backup windows, IT budget, and IT management. The growth and complexity of the data
environment, combined with the proliferation of virtual machines and mobile devices, constantly
outpaces existing data backup plans.
Deployment of a new backup solution takes weeks of planning, justification, procurement,
and setup. However, technology and data protection requirements change quickly.
Enterprises must also comply with regulatory and litigation requirements. These challenges
can be addressed with the emergence of cloud-based backup (backup as a service).
Backup as a service enables organizations to procure backup services on-demand in the
cloud. The backup service is offered by a service provider to consumers. Organizations can
build their own cloud infrastructure and provide backup services on demand to their
employees/users. Some organizations prefer a hybrid cloud option for their backup strategy.
They keep a local backup copy in their private cloud and use a public cloud for keeping their
remote copy for DR purpose. For providing backup as a service, organizations and service
providers should have necessary backup technologies in place to meet the required service
levels.
Backup as a service enables individual consumers or organizations to reduce their backup
management overhead. It also enables the individual consumer/user to perform backup and
recovery anytime, from anywhere, using a network connection. Consumers do not need to
invest in capital equipment to implement and manage their backup infrastructure. These
infrastructure resources are rented without obtaining ownership of the resources. Based on
the consumer demand, backups can be scheduled and infrastructure resources can be
allocated with a metering service. This will help to monitor and report resource consumption.
Many organizations’ remote and branch offices have limited or no backup in place. Mobile
workers represent a particular risk because of the increased possibility of lost or stolen
devices. Backing up to cloud ensures regular and automated backup of data. Cloud
computing gives consumers the flexibility to select a backup technology, based on their
requirement. It also enables to quickly move to a different technology when their backup
requirement changes.
Data can be restored from the cloud using two methods, namely web-based restore and media-
based restore. In web-based restore, the requested data is gathered and sent to the server
running the cloud backup agent. The agent software restores data on the server. This method is
considered if sufficient bandwidth is available. If a large amount of data needs to be restored
and sufficient bandwidth is not available, then the consumer may request data restoration
using backup media such as DVD or disk drives. In this option, the service provider gathers
the data to restore, stores data to a set of backup media, and ships it to the consumer.

Data Deduplication

Video: Data Deduplication


What is Data Deduplication?
Definition: Data Deduplication

The process of detecting and identifying the unique data segments within a given set of data to eliminate
redundancy.

• Deduplication process:
o Chunk the dataset
o Identify duplicate chunks
o Eliminate the redundant chunks
• Deduplication can be performed in
backup and production environments
• Effectiveness of deduplication is
expressed as a deduplication ratio

The use of deduplication techniques reduces the amount of data to be backed up. Data
deduplication operates by segmenting a dataset into blocks, identifying redundant data, and
writing only the unique blocks to a backup target.
To identify redundant blocks, the data deduplication system creates a hash value or digital
signature, like a fingerprint, for each data block. It also creates an index of the signatures for
a given repository. The index provides the reference list to determine whether blocks exist in
a repository.
When the data deduplication system sees a block it has processed before, instead of storing
the block again, it inserts a pointer to the original block in the repository. It is important to
note that data deduplication can be performed in backup as well as in production
environments. In a production environment, deduplication is implemented at primary
storage systems to eliminate redundant data in the production volume. The effectiveness of
data deduplication is expressed as a deduplication ratio. It is the ratio of data before
deduplication to the amount of data after deduplication. This ratio is typically depicted as
“ratio:1” or “ratio X” (10:1 or 10 X). For example, if 200 GB of data consumes 20 GB of
storage capacity after data deduplication, the space reduction ratio is 10:1.
Every data deduplication vendor claims that their product offers a certain ratio of data
reduction. However, the actual data deduplication ratio varies, based on many factors.
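The hashing-and-index mechanism and the deduplication ratio can be sketched in a few lines of Python; the fixed 4 KB chunk size and the SHA-256 fingerprint are illustrative choices, not a specific product's design.

```python
# Illustrative fixed-length chunk deduplication with a fingerprint index.
import hashlib

CHUNK_SIZE = 4096                              # illustrative 4 KB chunks

def deduplicate(data: bytes):
    index = {}                                 # fingerprint -> chunk (the unique-chunk repository)
    recipe = []                                # ordered fingerprints needed to rebuild the data
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:           # new chunk: store it
            index[fingerprint] = chunk
        recipe.append(fingerprint)             # duplicate chunk: store only a pointer
    return index, recipe


data = b"A" * CHUNK_SIZE * 9 + b"B" * CHUNK_SIZE    # 10 chunks, only 2 of them unique
index, recipe = deduplicate(data)
stored = sum(len(chunk) for chunk in index.values())
ratio = len(data) / stored
print(f"before: {len(data)} bytes, after: {stored} bytes, ratio {ratio:.0f}:1")   # -> 5:1
```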

Drivers for Data Deduplication
With the growth of data and 24x7 service availability requirements, organizations are facing challenges in
protecting their data. Typically, a large amount of redundant data is backed up. This increases the backup window size
and also results in unnecessary consumption of resources, such as backup storage space and network
bandwidth.

There are also requirements to preserve data for longer periods – whether driven by the need of
consumers or legal and regulatory concerns. Backing up a large amount of duplicate data at the remote site
or in the cloud for DR purposes is also cumbersome and requires a lot of bandwidth.

Data deduplication provides the solution for organizations to overcome these challenges in a backup
environment.

Factors Affecting Deduplication Ratio
Factor | Description
Retention period | The longer the data retention period, the greater is the chance of identical data existing in the backup
Frequency of full backup | The more frequently full backups are conducted, the greater is the advantage of deduplication
Change rate | The fewer the changes to the content between backups, the greater is the efficiency of deduplication
Data type | The more unique the data, the less intrinsic duplication exists
Deduplication method | The highest amount of deduplication across an organization is discovered using variable-length, sub-file deduplication
Data deduplication performance (or ratio) is tied to the following factors:
Retention period: This is the period of time that defines how long the backup copies are retained.
The longer the retention, the greater is the chance of identical data existing in the backup set,
which increases the deduplication ratio and storage space savings.
Frequency of full backup: As more full backups are performed, the amount of identical data that is
repeatedly backed up increases, resulting in a higher deduplication ratio.
Change rate: This is the rate at which the data received from the backup application changes from
backup to backup. Client data with a few changes between backups produces higher deduplication
ratios.
Data type: Backups of user data such as text documents, PowerPoint presentations, spreadsheets,
and emails are known to contain redundant data and are good deduplication candidates. Other data
such as audio, video, and scanned images are highly unique and typically do not yield good
deduplication ratio.
Deduplication method: Deduplication method also determines the effective deduplication ratio.
Variable-length, subfile deduplication discovers the highest amount of deduplication of data.

Deduplication Granularity Level


The level at which data is identified as duplicate affects the amount of redundancy or commonality. The
operational levels of deduplication include file-level deduplication and sub-file deduplication.

File-level Deduplication

• Detects and removes redundant copies of identical files


• Only one copy of the file is stored; the subsequent copies are replaced with a pointer to the
original file
• Does not address the problem of duplicate content inside the files

File-level deduplication (also called single instance storage) detects and removes redundant
copies of identical files in a backup environment. Only one copy of the file is stored; the
subsequent copies are replaced with a pointer to the original file. By removing all of the
subsequent copies of a file, a significant amount of space savings can be achieved.

File-level deduplication is simple but does not address the problem of duplicate content inside
the files. A change in any part of a file also results in classifying that as a new file and saving
it as a separate copy. For example, two 10-MB presentations with a difference in just the title
page are not considered as duplicate files, and each file is stored separately.

Sub-file Level Deduplication


Breaks down files to smaller segments

• Detects redundant data within and across files

Two methods:

• Fixed-length block
• Variable-length block

Sub-file deduplication breaks the file into smaller blocks and then uses a standard hash
algorithm to detect redundant data within and across files. As a result, sub-file
deduplication eliminates duplicate data across files.
There are two forms of sub-file deduplication, fixed-length and variable-length. The fixed-
length block deduplication divides the files into fixed-length blocks and uses a hash algorithm
to find duplicate data. Although simple in design, the fixed-length block may miss
opportunities to discover redundant data because the block boundaries of similar data may
be different.
For example, the addition of a person's name to a document's title page may shift the whole
document and make all blocks appear to have changed, causing the deduplication method to
fail to detect equivalencies. In variable-length block deduplication, if there is a change in a
block, then the boundary for only that block is adjusted, leaving the remaining blocks
unchanged.
More data is identified as common data, and there is less backup data to store because only the
unique data is backed up. Variable-length block deduplication yields greater granularity
in identifying duplicate data, improving upon the limitations of file-level and fixed-length
block deduplication.
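The boundary-shift problem and the content-defined fix can be shown with a tiny sketch; the "break after a chosen byte value" rule below is a simplified stand-in for real content-defined chunking algorithms, used here only for illustration.

```python
# Fixed-length vs. (simplified) variable-length chunking: effect of a small insertion.
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, boundary: int = ord(".")):
    """Cut a chunk after every 'boundary' byte (a toy content-defined rule)."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == boundary:
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return chunks

def unique(chunks):
    return {hashlib.sha256(chunk).hexdigest() for chunk in chunks if chunk}


original = b"alpha.bravo.charlie.delta."
shifted = b"X" + original                      # one byte inserted at the front shifts everything

# Fixed-length: every block boundary shifts, so no chunks match the original.
print(len(unique(fixed_chunks(original)) & unique(fixed_chunks(shifted))))        # -> 0
# Content-defined: boundaries realign at the '.' markers, so most chunks still match.
print(len(unique(variable_chunks(original)) & unique(variable_chunks(shifted))))  # -> 3
```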

Source-Based Deduplication Method


• Data is deduplicated at the source (backup
client)
o Backup client sends only new,
unique segments across the network
• Reduced storage capacity and network
bandwidth requirements
• Recommended for remote office/branch office (ROBO)
environments for centralized backup
• Cloud service providers can also implement
this method when performing backup from
consumer’s location to their location
Source-based data deduplication eliminates redundant data at the source (backup client)
before transmission to the backup device. The deduplication software or agent on the clients
checks each file or block for duplicate content. Source-based deduplication reduces the
amount of data that is transmitted over a network from the source to the backup device, thus
requiring less network bandwidth. There is also a substantial reduction in the capacity that
is required to store the backup data.
However, a deduplication agent running on the client may impact the backup performance,
especially when a large amount of data needs to be backed up. When image-level backup is
implemented, the backup workload is moved to a proxy server. The deduplication agent is
installed on the proxy server to perform deduplication without impacting the VMs running
applications. Organizations can implement source-based deduplication when performing
backup (backup as a service) from their location to the provider's location.

Target-Based Deduplication Method


• Data is deduplicated at the target
o Inline
o Post-process
• Offloads the backup client from the deduplication process
• Requires sufficient network bandwidth
• In some implementations, part of the deduplication load is moved to the backup server
o Reduces the burden on the target
o Improves the overall backup performance

Target-based data deduplication occurs at the backup device, which offloads the
deduplication process and its performance impact from the backup client. In target-based
deduplication, the backup application sends data to the target backup device where the data is deduplicated, either
immediately (inline) or at a scheduled time (post-process).
With inline data deduplication, the incoming backup stream is divided into small chunks,
and then compared to data that has already been deduplicated. The inline deduplication
method requires less storage space than the post process approach. However, inline
deduplication may slow down the overall data backup process.
Some vendors' inline deduplication systems take advantage of continued advances in CPU
technology. This increases the performance of inline deduplication by minimizing the disk
accesses required to deduplicate data. Such inline deduplication systems identify duplicate
data segments in memory, which minimizes the disk usage.
In post-process deduplication, the backup data is first stored to the disk in its native backup
format and deduplicated after the backup is complete. In this approach, the deduplication
process is separated from the backup process and the deduplication happens outside the
backup window. However, the full backup dataset is transmitted across the network to the
storage target before the redundancies are eliminated. So, this approach requires adequate
storage capacity to accommodate the full backup dataset.
Organizations can consider implementing target-based deduplication when their backup
application does not have built-in deduplication capabilities. It supports the current backup
environment without any operational changes. Target-based deduplication reduces the
amount of storage that is required but, unlike source-based deduplication, it does not
reduce the amount of data that is sent across a network during the backup. In some
implementations, part of the deduplication functionality is moved to the backup client or
backup server. This reduces the burden on the target backup device for performing
deduplication and improves the overall backup performance.

Data Archiving
Data Archiving Video
Data Archiving Overview
Definition: Data Archiving

The process of identifying and moving inactive data out of current production systems into low-cost
storage tier for long-term retention and future reference.

• Data archive is a repository where fixed content is stored
• Organizations set their own policies for qualifying data to archive.
• Archiving enables organizations to:
o Reduce on-going primary storage acquisition costs
o Meet regulatory compliance
o Reduce backup challenges, including the backup window, by moving static data out of the recurring backup stream
o Use this data for generating new revenue strategies

In the information life cycle, data is actively created, accessed, and changed. As data ages, it is
less likely to be changed and eventually becomes "fixed" but remains accessed by applications
and users. This data is called fixed content. Assets such as X-rays, MRIs, CAD/CAM designs,
surveillance video, MP3s, and financial documents are examples of fixed data. This data is
growing at over 90 percent annually.
Data archiving is the process of moving data (fixed content) that is no longer actively accessed
to a separate low-cost archival storage tier for long-term retention and future reference. Data
archive is a storage repository that is used to store these data. Organizations set their own
policies for qualifying data to move into archives. These policy settings are used to automate
the process of identifying and moving the appropriate data into the archive system.
Organizations implement archiving processes and technologies to reduce primary storage
cost. With archiving, the capacity on expensive primary storage can be reclaimed by moving
infrequently accessed data to lower-cost archive tier. Archiving fixed content before taking
backup helps to reduce the backup window and backup storage acquisition costs.

Government regulations and legal/contractual obligations mandate organizations to retain
their data for an extended period. The key to determining how long an organization should
retain archives is to understand which regulations apply to its particular industry and which
retention rules apply under those regulations.

• For instance, all publicly traded companies are subject to the Sarbanes-Oxley (SOX)
Act. This act defines email retention requirements, among other things related to data
storage and security.

Archiving helps organizations to adhere to compliance requirements. Archiving can also help
organizations use growing volumes of information in potentially new and unanticipated ways.

• For example, new product innovation can be fostered if engineers can access archived
project materials such as designs, test results, and requirement documents. Besides
meeting governance and compliance requirements, organizations retain data for
business intelligence and competitive advantage. Both active and archived
information can help data scientists drive innovations or help to improve current
business processes.

Backup vs. Archiving


Data archiving is often confused with data backup. Backups are used to restore data in case it is lost,
corrupted, or destroyed. In contrast, data archives protect older data that is not required for everyday
business operations but may occasionally need to be accessed. The table compares some of the significant
differences between backup and archiving.

Data Backup | Data Archiving
Secondary copy of data | Primary copy of data
Used for data recovery operations | Available for data retrieval
Primary objective – operational recovery and disaster recovery | Primary objective – compliance adherence and lower cost
Typically short-term (weeks or months) retention | Long-term (months, years, or decades) retention

Data Archiving Operations


• Archiving agent scans primary storage to find files that meet the archiving policy. The
archive server indexes the files.
• Once the files have been indexed, they are moved to archive storage and small stub files
are left on the primary storage.

The data archiving operation has an archiving agent, archive server/policy engine, and
archive storage. The archiving agent scans the primary storage to find files that meet the
archiving policy. This policy is defined on the archive server (policy engine).
After the files are identified for archiving, the archive server creates the index for the files.
Once the files have been indexed, they are moved to the archive storage and small stub files
are left on the primary storage. In other words, each archived file on primary storage is
replaced with a stub file. The stub file contains the address of the archived file. As the size of
the stub file is small, it saves space on primary storage.
From the perspective of a client, the data movement from primary storage to secondary
storage is transparent.
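The policy-driven flow described above can be illustrated with a minimal sketch in Python. It assumes a hypothetical 180-day last-access policy and illustrative paths; real archiving products implement this with their own policy engines, stub formats, and APIs.

import os
import shutil
import time

ARCHIVE_POLICY_DAYS = 180      # assumed policy: archive files not accessed for 180 days
PRIMARY = "/primary/share"     # hypothetical primary storage path
ARCHIVE = "/archive/tier"      # hypothetical low-cost archive tier

def archive_eligible_files():
    cutoff = time.time() - ARCHIVE_POLICY_DAYS * 86400
    for root, _, files in os.walk(PRIMARY):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_atime < cutoff:        # file meets the archiving policy
                target = os.path.join(ARCHIVE, name)
                shutil.move(path, target)               # move the fixed content to archive storage
                with open(path, "w") as stub:           # leave a small stub file on primary storage
                    stub.write(target)                  # the stub holds the address of the archived file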

Use Case: Email Archiving


Emails are a part of business processes. They represent a correspondence between two or more parties
and are immutable after generation. Email archiving is the process of archiving emails from the mail server
to archive storage. After an email is archived, it is retained for years, based on the retention policy.

Legal Dispute

Email archiving helps an organization to address legal disputes.


For example, an organization involved in a legal dispute may need to produce all emails sent to or from certain people within a specified time period that contain specific keywords.
Government Compliance

Email archiving helps to meet government compliance requirements such as Sarbanes-Oxley and
SEC regulations.
For example, an organization may need to produce all emails from all individuals that are involved
in stock sales or transfers. Failure to comply with these requirements could cause an organization
to incur penalties.
Mailbox Space Savings

Email archiving provides more mailbox space by moving old emails to archive storage.
For example, an organization may configure a quota on each mailbox to limit its size. A fixed quota
for a mailbox forces users to delete emails as they approach the quota size. However, users often
need to access emails that are weeks, months, or even years old. With email archiving,
organizations can free up space in user mailboxes and still provide user access to older emails.

Purpose-Built Archive Storage – CAS


Content addressed storage (CAS) is an object-based storage device that is purpose-built for storing and managing fixed content.

• Each object that is stored in CAS is assigned a globally unique content address (digital
fingerprint of the content).
• Application server accesses the CAS device through the CAS API.

CAS stores user data and its attributes as an object. The stored object is assigned a globally
unique address, which is known as a content address (CA). This address is derived from the
binary representation of an object. Content addressing eliminates the need for application
servers to understand and manage the physical location of objects on a storage system.
The content address (a digital fingerprint of the content) not only simplifies the task of managing a huge number of objects, but also ensures content authenticity. The application server can
access the CAS device only through the CAS API.
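As a rough illustration of content addressing, the sketch below derives an address from a cryptographic hash of the object's binary content. The hash algorithm and in-memory store are assumptions for illustration; an actual CAS product exposes this behavior only through its own API.

import hashlib

object_store = {}                           # stands in for the CAS device

def store_object(data: bytes) -> str:
    ca = hashlib.sha256(data).hexdigest()   # content address: digital fingerprint of the content
    object_store[ca] = data                 # the object is located by its address, not a physical path
    return ca

def retrieve_object(ca: str) -> bytes:
    data = object_store[ca]
    assert hashlib.sha256(data).hexdigest() == ca   # recomputing the hash verifies content authenticity
    return data

address = store_object(b"X-ray image bytes ...")
print(retrieve_object(address) == b"X-ray image bytes ...")   # True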

Cloud-Based Archiving
Organizations prefer hybrid cloud options. Archived data that may require high-speed access is retained
internally (private cloud) while lower-priority archive data is moved to low-cost, public cloud-based archive
storage.

• No CAPEX, pay-as-you-go, faster deployment


• Reduced management overhead of IT
• Supports massive data growth and retention requirements

In a traditional in-house data archiving model, archiving systems and underlying


infrastructure are deployed and managed within an organization’s data center. Due to
exponential data growth, organizations are facing challenges with increased cost and
complexity in their archiving environment. Often an existing infrastructure is siloed by
architecture and policy. Organizations are looking for new ways to improve the agility and
the scalability of their archiving environments.
Cloud computing provides highly scalable and flexible computing that is available on
demand. It empowers self-service requesting through a fully automated request-fulfillment
process in the background. It provides capital cost savings and agility to organizations. With
cloud-based archiving, organizations pay only for what they use and can scale the usage as needed. It also enables organizations to access their data from any device and any
location.
Typically a cloud-based archiving service is designed to classify, index, search, and retrieve
data in a security-rich manner. It automates regulatory monitoring and reporting. It also
enables organizations to consistently enforce the policies for the centralized cloud archive
repository. Hybrid cloud archiving is one step toward the cloud from the traditional in-house
approach. Archived data that may require high-speed access is retained internally, while
lower-priority archive data is moved to low-cost, public cloud-based archive storage.

Migration
Data Migration
Definition: Data Migration

Involves the transfer of data between hosts (physical or virtual), storage devices, or formats.

• In today’s competitive business environment, IT organizations should have non-disruptive live migration solutions in place to meet the required SLAs
• Organization deploys data migration solutions for the following reasons:
o Data center maintenance without downtime
o Disaster avoidance
o Technology refresh
o Data center migration or consolidation
o Workload balancing across data centers (multiple sites)

Traditionally, migrating data and applications within or between data centers involved a series of
manual tasks and activities. IT would either make physical backups or use data replication services to
transfer applications and data to an alternate location. Applications had to be stopped and could not be
restarted until testing and verification were complete. In today’s competitive business environment, IT
organizations should have non-disruptive live migration solutions in place to meet the required SLAs.

Storage System-Based Migration


• Moves data between heterogeneous storage systems
o The storage system that performs the migration is called the control storage system
• Push: Data is pushed from control system to remote system
• Pull: Data is pulled to the control system from remote system

Storage system-based migration moves data between heterogeneous storage systems. This technology is application and server-operating-system independent because the migration operations are performed by one of the storage systems. The storage system that performs the migration operations is called the control storage system. Data can be moved from/to the devices in the control storage system to/from a remote storage system.
Data migration solutions perform push and pull operations for data movement. These terms are defined from the perspective of the control storage system. In the push operation, data is moved from the control storage system to the remote storage system. In the pull operation, data is moved from the remote storage system to the control storage system.
During the push and pull operations, the compute system's access to the remote device is not enabled. This is because the control storage system has no control over the remote storage system and cannot track any change on the remote device. Data integrity cannot be guaranteed if changes are made to the remote device during the push and pull operations. The push/pull operations can be either hot or cold. These terms apply to the control devices only.
In a cold operation, the control device is inaccessible to the compute system during migration. Cold operations guarantee data consistency because both the control and the remote devices are offline. In a hot operation, the control device is online for compute system operations. During hot push/pull operations, changes can be made to the control device. Data integrity is maintained because the control storage system keeps track of all these changes.
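A minimal sketch of the push and pull semantics, seen from the control storage system. Device contents are modeled as simple block lists, and the dirty-block tracking during a hot pull is an assumption about how integrity might be preserved, not a specific product's algorithm.

def cold_push(control_dev, remote_dev):
    # Cold push: both devices are offline to hosts; data is copied from the control device to the remote device
    remote_dev[:] = control_dev

def hot_pull(control_dev, remote_dev, dirty_blocks):
    # Hot pull: the control device stays online, so blocks already rewritten by hosts on the
    # control device (tracked in dirty_blocks) are not overwritten by older remote data
    for i, block in enumerate(remote_dev):
        if i not in dirty_blocks:
            control_dev[i] = block

control = [None, None, None, None]
remote = ["a", "b", "c", "d"]
hot_pull(control, remote, dirty_blocks={2})   # block 2 was written by a host during the migration
print(control)                                # ['a', 'b', None, 'd']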

Virtualization Appliance-Based Migration


• Virtualization layer handles the migration of data
o Enables LUNs to remain online and accessible by compute system while data is
migrating
• Supports data migration between multivendor heterogeneous storage systems
• A service provider could implement it to migrate customer data from the customer's storage system to cloud-based storage
• Example:
o An administrator wants to perform a data migration from storage system A to
system B as shown in the illustration.
o The virtualization layer handles the migration of data, which enables LUNs to
remain online and accessible while data is migrating.
o In this case, physical changes are not required because the compute system still points to the same virtual volume on the virtualization layer. However, the mapping information that resides on the appliance must be changed. These changes can be made dynamically and are transparent to the user.

Data migration can also be implemented using a virtualization appliance at the SAN.
Virtualization appliance provides a translation layer in the SAN, between the compute
systems and the storage systems. The LUNs created at the storage systems are assigned to the
appliance. The appliance abstracts the identity of these LUNs and creates a storage pool by
aggregating LUNs from the storage systems.
A virtual volume is created from the storage pool and assigned to the compute system. When
an I/O is sent to a virtual volume, it is redirected through the virtualization layer to the
mapped LUNs. The key advantage of using virtualization appliance is to support data
migration between multivendor heterogeneous storage systems.
In a cloud environment, the service provider could also implement virtualization-based data
migration. They migrate the customer data from the customer's storage system to shared storage used by the service provider. This approach enables the customer to migrate without causing downtime to their applications and users during the migration process. The providers themselves perform this data migration without the need to engage a third-party data migration specialist.
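The remapping step can be sketched as follows. The LUN and VirtualVolume classes are hypothetical, meant only to show that the compute system keeps addressing the same virtual volume while the appliance changes which back-end LUN the I/O is redirected to.

class LUN:
    def __init__(self, blocks):
        self.blocks = list(blocks)

class VirtualVolume:
    # Hypothetical appliance-side object: hosts address the volume, never the back-end LUN
    def __init__(self, backend_lun):
        self.backend_lun = backend_lun            # mapping maintained on the virtualization appliance
    def read(self, i):
        return self.backend_lun.blocks[i]         # I/O is redirected through the virtualization layer

def migrate(volume, target_lun):
    target_lun.blocks = list(volume.backend_lun.blocks)   # copy the data while the volume stays online
    volume.backend_lun = target_lun                       # swap only the appliance-side mapping

vol = VirtualVolume(LUN(["a", "b"]))
migrate(vol, LUN([None, None]))
print(vol.read(0))    # still "a", now served from the new storage system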

Hypervisor-Based Migration: VM Migration


Running VMs, along with the services on them, are moved from one physical compute system to another without any downtime:

• Enables scheduled maintenance without any downtime


• Facilitates VM load balancing

Organizations using a virtualized infrastructure have many reasons to move running VMs
from one physical compute system to another. The compute systems can be located within a
data center or across data centers. The migration can be used for routine maintenance, and
VM distribution across sites to balance system load.
The migration can also be used for disaster recovery, or consolidating VMs onto fewer
physical compute systems. The ideal virtual infrastructure platform should enable
organizations to move the running VMs as quickly as possible and with minimal impact on
the users. This can be achieved with the help of implementing VM live migrations.
In a VM live migration, the entire active state of a VM is moved from one hypervisor to another. The state information includes memory contents and all other information that identifies the VM. This method involves copying the contents of VM memory from the source hypervisor to the target and then transferring control of the VM's disk files to the target hypervisor. Next, the VM is suspended on the source hypervisor and resumed on the target hypervisor.
Performing VM live migration requires a high-speed network connection. It is important to ensure that even after the migration, the VM's network identity and network connections are preserved. VM live migration with a stretched cluster provides the ability to move VMs across data centers. This solution is suitable for a cloud environment, where consumers of a given application are spread across the globe and work in different time zones. If an application is closer to its consumers, productivity is enhanced to a great extent.
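The copy-then-suspend sequence can be expressed as a small pre-copy loop. Real hypervisors use far more elaborate logic, so the dirty-page tracking, round limit, and threshold below are illustrative assumptions only.

def live_migrate(source_mem, dirty_pages_fn, max_rounds=5, threshold=8):
    # source_mem maps page number -> page contents; dirty_pages_fn() returns the
    # pages the running VM has written since the previous copy round
    target_mem = {}
    to_copy = set(source_mem)                 # first round copies every memory page
    for _ in range(max_rounds):
        for page in to_copy:
            target_mem[page] = source_mem[page]
        to_copy = dirty_pages_fn()            # pages dirtied while the VM kept running on the source
        if len(to_copy) < threshold:          # remaining dirty set is small enough to stop iterating
            break
    # The VM is now briefly suspended on the source: copy the last dirty pages,
    # hand over control of the disk files, and resume the VM on the target hypervisor
    for page in to_copy:
        target_mem[page] = source_mem[page]
    return target_mem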

Hypervisor-Based Migration: VM Storage Migration

Migrates VM files from one storage system to another without any service disruption

Key Benefits:

• Simplify array migration and storage upgrades
• Dynamically optimize storage I/O performance
• Efficiently manage storage capacity

In a VM storage migration, VM files are moved from one storage system to another system without any downtime. This
approach enables the administrator to move VM files across dissimilar storage systems. VM
storage migration starts by copying the metadata about the VM from the source system to
the target storage system. The metadata essentially consists of configuration, swap, and log
files. After the metadata is copied, the VM disk file is moved to the new location. During
migration, there might be a chance that the source is updated. It is necessary to track the
changes on the source to maintain data integrity. After the migration is completed, the blocks
that have changed since the migration started are transferred to the new location.
The key benefits of VM storage migration are:

• Simplify array migration and storage upgrades: The traditional process of moving
data to new storage is cumbersome, time-consuming, and disruptive. With VM
storage migration, organizations can make it easier and faster to embrace new storage platforms, adopt flexible leasing models, retire older systems, and conduct storage upgrades.

• Dynamically optimize storage I/O performance: With storage migration, IT


administrators can move VM disk files to alternative LUNs that are properly
configured to deliver optimal performance. This migration avoids scheduled
downtime, eliminating the time and cost associated with traditional methods.

• Efficiently manage storage capacity: Nondisruptive VM disk file migration to


different classes of storage enables cost-effective management of VM disks as part of
a tiered storage strategy.

Disaster Recovery as a Service (DRaaS)
• Enables organizations to have a DR site in the cloud
o Service provider offers resources to run consumer’s IT services in the cloud during
disaster
o Pay-as-you-go pricing model
• Resources at the service provider location may be dedicated to the consumer, or they can
be shared
• During normal production operations, IT services run at the consumer’s production data
center
• If there is a disaster, the business operations failover to the provider’s infrastructure

Organizations need to rely on business continuity processes to mitigate the impact of service
disruptions due to disaster. Traditional disaster recovery methods often require buying and
maintaining a complete set of IT resources at secondary data centers. These IT resources should match the business-critical systems at the primary data center, including sufficient storage to house a complete copy of all business data at the secondary site. This can be a complex and expensive solution for organizations.
Disaster Recovery-as-a-Service (DRaaS) has emerged as a solution that offers a viable DR
solution to organizations. DRaaS enables organizations to have a DR site in the cloud. The
cloud service provider assumes the responsibility for providing IT resources to enable
organizations to continue running their IT services if there is a disaster. Resources at the
service provider’s location may either be dedicated to the consumer or they can be shared.
From the organization's (consumer's) perspective, having a DR site in the cloud reduces the need for data center space and IT infrastructure. This approach leads to significant cost reductions and eliminates the need for upfront capital expenditure. DRaaS is gaining popularity among organizations due to its pay-as-you-go pricing model and the use of automated virtual platforms. This can lower costs and minimize the recovery time after a
failure. During normal production operations, IT services run at the organization’s
production data center. Replication of data occurs from the organization’s production
environment to the cloud over the network.
Typically during normal operating conditions, a DRaaS implementation may only need a
small share of resources. This helps to synchronize the application data and VM
configurations from the consumer’s site to the cloud. The full set of resources required to run
the application in the cloud is consumed only if a disaster occurs. If there is a business
disruption or disaster, the business operations fail over to the provider's infrastructure.

Exercise: Backup, Replication, and Archiving
Scenario
A major multinational bank runs business-critical applications in a data center:

• Has multiple remote/branch offices (ROBO) across different geographic locations


• Currently uses tape as its primary backup storage media for backing up virtual machines
(VMs) and application data
• Uses an agent-based backup solution for backing up data
• Has a file-sharing environment in which multiple NAS systems serve all the users
o The data is backed up from application servers to backup device
• Approximately 25% of data in the production environment is inactive data (fixed content)
• Has two data centers which are 1000 miles apart

A major multinational bank runs business-critical applications in a data center. It has over a million
customers and multiple remote/branch offices (ROBO) across different geographic locations. The bank
currently uses tape as their primary backup storage media for backing up virtual machines (VMs) and
application data. It uses an agent-based backup solution for backing up data. It currently performs a full
backup every Sunday, and an incremental backup on other days. It also has a file-sharing environment
in which multiple NAS systems serve all the users. During NAS backup, the data is backed up from
application servers to backup device. Approximately 25% of data in the production environment is
inactive data (fixed content). The organization has two data centers which are 1000 miles apart.
Challenges

• Backup operations consume resources on the compute systems that are running multiple
VMs
o Significantly impacting the applications deployed on the VMs
• During NAS backup, the application servers are impacted
o Data is backed up from these servers to the backup device
• Backup environment has a huge amount of redundant data
o Increases the infrastructure cost and impacts the backup window
• Recovering data or VMs also takes more time
• Branch offices also have limited IT resources for managing backup
o Backing up data from branch offices to a centralized data center was restricted due
to the time and cost involved in sending huge volumes of data over the WAN
• Organization incurs a huge investment and operational expense in managing an offsite
backup infrastructure at remote site

Requirements

• Need faster backup and restore to meet the SLAs


• Need to eliminate redundant copies of data
• Need an effective solution to address the backup and recovery challenges of remote and
branch offices
• Need to offload the backup workload from the compute system to avoid performance
impact to applications

• Requires a solution to overcome the backup challenges in a NAS environment
• Requires a strategy to eliminate backing up fixed content from the production environment
• Requires a solution to reduce the management overhead and the investment cost in
managing the offsite backup copy
• Requires a remote replication solution for DR that should not impact the response time of
the application

Deliverables
Recommend solutions that will meet the organization's requirements
Debrief

• Implement disk-based backup solution to improve the backup and recovery performance
for meeting SLAs
• Implement deduplication solution to eliminate the redundant copies of data
• Disk-based backup solutions along with source-based deduplication
o Eliminate the challenges associated with centrally backing up remote office data
o Deduplication considerably reduces the required network bandwidth
• Implement image-based backup that helps to offload backup operation from VMs to a proxy
server
o No backup agent is required inside the VM to backup
• Deploy NDMP-based backup solution for NAS environment
o In NDMP-based backup, data is sent directly from the NAS head to the backup
device without impacting application servers
• Organization can implement data archiving solutions that archive fixed content from the
production environment
o Reduce the amount of data to be backed up
• Organization can choose backup as a service to replicate the offsite backup copy to the
cloud
o Saves CAPEX and reduces the management overhead to the organization
• To meet the DR requirement, the organization can implement asynchronous remote
replication:
o Provides finite RPO and does not impact response time

Concepts In Practice
Dell EMC NetWorker

• Software that centralizes, automates, and accelerates data backup and recovery
• Delivers enterprise-class performance and security to meet even the most demanding
service level requirements
• Supports source-based and target-based deduplication capabilities by integrating with
DELL EMC Avamar and DELL EMC Data Domain respectively

Backup and recovery software which centralizes, automates, and accelerates data backup
and recovery operations. The following are key features of NetWorker:

• Supports heterogeneous platforms such as Windows, UNIX, Linux, and also virtual
environments
• Supports different backup targets – tapes, disks, Data Domain purpose-built backup
appliance, and virtual tapes
• Supports multiplexing (or multi-streaming) of data
• Delivers enterprise-class performance and security to meet even the most demanding
service level requirements
• Provides both source-based and target-based deduplication capabilities by integrating
with DELL EMC Avamar and DELL EMC Data Domain respectively
• The cloud-backup option in NetWorker enables backing up data to public cloud
configurations

Dell EMC Avamar

• Disk-based backup and recovery solution that provides inherent source-based deduplication
• Uses variable-length deduplication, which significantly reduces backup time by only
storing unique daily changes
• Provides various options for backup, including guest OS-level backup and image-level
backup
• Data is encrypted and deduplicated to secure and minimize the network bandwidth
consumption

A disk-based backup and recovery solution that provides inherent source-based data
deduplication. With its unique global data deduplication feature, Avamar differs from
traditional backup and recovery solutions by identifying and storing only unique sub-file
data. Avamar employs variable-length deduplication, which significantly reduces backup
time by only storing unique daily changes while maintaining daily full backups for
immediate, single-step restore.
DELL EMC Avamar provides a variety of options for backup, including guest OS-level
backup and image-level backup. The three major components of an Avamar system include
Avamar server, Avamar backup clients, and Avamar administrator. Avamar server provides
the essential processes and services required for client access and remote system
administration. The Avamar client software runs on each compute system that is being
backed up. Avamar administrator is a user management console application that is used to
remotely administer an Avamar system.
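As a rough illustration of the sub-file deduplication idea (not Avamar's actual algorithm, whose variable-length chunking and client/server protocol are proprietary), the sketch below splits data into fixed-size chunks, hashes each chunk, and stores only the unique ones; the chunk size and hash choice are assumptions.

import hashlib

CHUNK_SIZE = 4096          # assumed fixed chunk size; variable-length chunking dedupes better in practice
chunk_store = {}           # chunk hash -> chunk data, shared across all backups

def backup(data: bytes):
    recipe = []                                   # ordered chunk hashes; enough to rebuild the data
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        if digest not in chunk_store:             # only unique chunks are stored (or sent over the network)
            chunk_store[digest] = chunk
        recipe.append(digest)
    return recipe

def restore(recipe):
    return b"".join(chunk_store[d] for d in recipe)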

Dell EMC Data Domain

• A target-based data deduplication solution


• Data Domain Boost software increases the backup performance by distributing parts of
deduplication process to the backup server
• Provides secure multitenancy
• Supports backup and archive in a single system
• Supports low-cost disaster recovery to the cloud

DELL EMC Data Domain deduplication storage systems continue to revolutionize disk
backup, archiving, and disaster recovery with high-speed, inline deduplication. DELL EMC
Data Domain deduplication storage system is a target-based data deduplication solution.
Using high-speed, inline deduplication technology, the Data Domain system provides a
storage footprint that is significantly smaller on average than that of the original data set.
DELL EMC Data Domain Boost software significantly increases backup performance by
distributing the parts of the deduplication process to the backup server. With Data Domain
Boost, only unique, compressed data segments are sent to a Data Domain system. For
archiving and compliance solutions, Data Domain systems allow customers to cost-effectively
archive non-changing data while keeping it online for fast, reliable access and recovery.
DELL EMC Data Domain Extended Retention is a solution for long-term retention of backup
data. It is designed with an internal tiering approach to enable cost-effective, long-term
retention of data on disk by implementing deduplication technology. Data Domain provides
secure multi-tenancy that enables data protection-as-a-service for large enterprises and
service providers who are looking to offer services based on Data Domain in a private or
public cloud. With secure multi-tenancy, a Data Domain system will logically isolate tenant
data, ensuring that each tenant’s data is only visible and accessible to them.
DELL EMC Data Domain Replicator software transfers only the deduplicated and
compressed unique changes across any IP network, requiring a fraction of the bandwidth,
time, and cost, compared to traditional replication methods. Data Domain Cloud DR (DD
CDR) allows enterprises to copy backed-up VMs from their on-premise Data Domain and
Avamar environments to the public cloud.

Dell EMC Integrated Data Protection Appliance

• Pre-integrated protection storage and software for comprehensive, modern protection and
faster time to value
• Extends data protection seamlessly to private and public clouds
• Flash-enabled for faster performance and instant recoverability
• Protection for modern applications and optimized for VMware virtual environments

A pre-integrated, turnkey solution that is simple to deploy and scale, provides comprehensive
protection for a diverse application ecosystem, and comes with native cloud tiering for long-
term retention. IDPA combines protection storage, protection software, search, and analytics
to reduce the complexity of managing multiple data silos, point solutions, and vendor
relationships.
IDPA is an innovative solution that provides support for modern applications like MongoDB
and MySQL, and is optimized for VMware. It is also built on industry proven data
invulnerability architecture, delivering encryption, fault detection, and healing.

Dell EMC SRDF

• Remote replication solution that provides DR and data mobility solutions for PowerMax
(VMAX) storage system
• Provides the ability to maintain multiple, host-independent, remotely mirrored copies of
data

SRDF family includes:

• SRDF/S and SRDF/A


• SRDF/DM
• SRDF/AR
• Concurrent and Cascaded SRDF

SRDF (Symmetrix Remote Data Facility) is a family of software that is the industry standard for remote replication in mission-critical environments. Built for the industry-leading high-end PowerMax (VMAX) hardware architecture, the SRDF family of
solutions is trusted globally for disaster recovery and business continuity.
The SRDF family offers unmatched deployment flexibility and massive scalability to deliver
a wide range of distance replication capabilities.
SRDF consists of the following options:

• SRDF/S (synchronous option for zero data loss exposure)


• SRDF/A (asynchronous option for extended distances)
• SRDF/Star (multi-site replication option)
• SRDF/CG (consistency groups for federated data sets across arrays)
• SRDF/Metro (for active/active data center protection)

Dell EMC TimeFinder SnapVX

• Creates a PIT copy of a source LUN


• Uses redirect on first write technology
• Provides a new option to secure snaps against accidental or internal deletion
• Provides instant restore which means when a LUN level restore is initiated, the restored
view is available immediately

Enables zero-impact snapshots, simple user-defined names, faster and secure snapshot
creation/expiration, cascading, compatibility with SRDF, and support for legacy VMAX
replication modes. SnapVX reduces replication storage costs by up to 10x and is optimized
for cloud scale with its highly efficient snaps. Customers can take up to 256 snapshots and
establish up to 1024 target volumes per source device, providing read/write access as pointer
(snap) or full (clone) copies.
SnapVX also provides a new option to secure snaps against accidental or internal deletion. It
provides instant restore which means when a LUN level restore is initiated, the restored view
is available immediately. Snapshot provides point-in-time data copies for backups, testing,
decision support, and data recovery.

Dell EMC RecoverPoint

• Enable continuous data protection for any PIT recovery to optimize RPO and RTO
• Ensure recovery consistency for interdependent applications
• Provide synchronous or asynchronous replication policies
• Reduce WAN bandwidth consumption and utilize available bandwidth optimally
• Offer multisite support

Provides continuous data protection for comprehensive operational and disaster recovery. It
supports major 3rd party arrays via VPLEX.
RecoverPoint delivers benefits including the ability to:

• Enable continuous data protection for any PIT recovery to optimize RPO and RTO
• Ensure recovery consistency for interdependent applications
• Provide synchronous or asynchronous replication policies
• Reduce WAN bandwidth consumption and utilize available bandwidth optimally
• Offer multisite support

Dell EMC PowerVault

• Simplifies data backup and archive by easily integrating the LTO family of tape drives into
your data center
• Its lower power consumption makes it an ideal part of a cloud physical infrastructure build-out
• Linear Tape File System (LTFS) support removes software incompatibilities, creating
portability between different vendors and operating systems

Simplifies data backup and archive by easily integrating the LTO family of tape drives into
your data center. Supporting TBs of native capacity on a single cartridge, LTO drives provide
decades of shelf life for industries and tasks that need reliable, long-term, large-capacity data
retention, such as:

• Healthcare imaging
• Media and entertainment
• Video surveillance
• Geophysical (oil and gas) data
• Computational analysis, such as genome mapping and event simulations

Its lower power consumption makes it an ideal part of a cloud physical infrastructure build-
out. Linear Tape File System (LTFS) support removes software incompatibilities, creating
portability between different vendors and operating systems to extend the life of your
infrastructure investments.

Dell EMC SourceOne

• Archiving software that helps organizations to archive aging emails, files, and the Microsoft
SharePoint content to the appropriate storage tiers

SourceOne family of products includes:

• DELL EMC SourceOne Email Management


• DELL EMC SourceOne for Microsoft SharePoint
• DELL EMC SourceOne for File Systems
• DELL EMC SourceOne Email Supervisor

A family of archiving software. It helps organizations to reduce the burden of aging emails,
files, and Microsoft SharePoint content by archiving them to the appropriate storage tier.
SourceOne helps in meeting the compliance requirements by managing emails, files, and
SharePoint content as business records and enforcing retention/disposition policies.
The SourceOne family of products includes:

• DELL EMC SourceOne Email Management for archiving email messages and other
items
• DELL EMC SourceOne for Microsoft SharePoint for archiving SharePoint content
• DELL EMC SourceOne for File Systems for archiving files from file servers
• DELL EMC SourceOne Email Supervisor for monitoring corporate email policy
compliance

VMware vCloud Air Disaster Recovery

Recovery-as-a-service offering which:

• Provides simple, affordable protection in the cloud for your vSphere environment
• Offers enhanced recovery times for business and mission-critical applications running on
vSphere
• Offers scalable disaster recovery protection capacity in the cloud to address the changing
business requirements

A DRaaS offering owned and operated by VMware, built on vSphere Replication and vCloud Air – a
hybrid cloud platform for infrastructure-as-a-service (IaaS). Disaster Recovery leverages vSphere
Replication to provide robust, asynchronous replication capabilities at the hypervisor layer. This
approach towards replication helps in easy configuration of virtual machines in vSphere for disaster
recovery, without depending on underlying infrastructure hardware or data center mirroring. Per-
virtual-machine replication and restore granularity further provide the ability to meet dynamic recovery
objectives without overshooting the actual business requirements for disaster recovery as they change.

VMware vMotion

• Performs live migration of a running VM from one physical server to another, without any
downtime
• VM retains its network identity and connections, ensuring a seamless migration process
• Enables to perform maintenance without disrupting business operations

Performs live migration of a running virtual machine from one physical server to another,
without downtime. The virtual machine retains its network identity and connections,
ensuring a seamless migration process. Transferring the virtual machine's active memory and precise execution state over a high-speed network allows the virtual machine to move from one host to another. This entire process takes less than two seconds on a gigabit Ethernet network.
vMotion provides the following benefits:

• Perform hardware maintenance without scheduling downtime or disrupting business


operations
• Move virtual machines away from failing or underperforming servers
• Allows vSphere DRS to balance VMs across hosts

VMware Storage vMotion

• Enables live migration of VM disk files within and across storage systems without any
downtime
• Performs zero-downtime storage migrations with complete transaction integrity
• Migrates the disk files of VMs running any supported operating system on any supported
server hardware

Enables live migration of virtual machine disk files within and across storage systems without service
disruptions. Storage vMotion performs zero-downtime storage migrations with complete transaction
integrity. It migrates the disk files of virtual machines running any supported operating system on any
supported server hardware. It performs live migration of virtual machine disk files across any Fibre
Channel, iSCSI, FCoE, and NFS storage system supported by VMware vSphere. It allows administrators to redistribute
VMs or virtual disks to different storage systems or volumes to balance capacity or improve
performance.

Question 1
Which is the period during which a production volume is available to perform a backup?

• Backup media
• RPO
• Backup window (Correct)
• RTO

Question 2
Which provides the ability to create fully populated point-in-time copies of LUNs within a storage system or create a copy of an existing VM?

• Snapshot
• Clone (Correct)
• LUN Masking
• Pointer-based virtual replica

Question 3
Which factor impacts the deduplication ratio in a backup environment?

• Type of backup server
• Type of backup media
• Value of data
• Retention Period (Correct)
Storage Infrastructure Security
Introduction to Information Security
Definition: Information Security

It includes a set of practices that protect information and information systems from unauthorized access,
use, destruction, deletion, modification, and disruption.

Source: US Federal law (Title 38 Part IV, Chapter 57, Subchapter III USC 5727)

• Information is an organization’s most valuable asset


• Organizations are transforming to modern technologies infrastructure
o Cloud is one of the core elements of the modern technologies
o Trust is one of the key concerns for consumers using modern technologies
▪ Trust = Visibility + Control
• Securing the infrastructure is important because it forms the platform for most modern technological environments

Information is an organization’s most valuable asset. This information, including intellectual


property, personal identities, and financial transactions, is routinely processed and stored in
storage systems, which are accessed through the network. As a result, storage is now more
exposed to various security threats that can potentially damage business-critical data and
disrupt critical services. Organizations deploy various tools within their infrastructure to
protect the asset. These tools must be deployed on various infrastructure assets, such as
compute (processes information), storage (stores information), and network (carries
information) to protect the information.
As organizations are adopting modern technologies, in which cloud is a core element, one of
the key concerns they have is ‘trust’. Trust depends on the degree of control and visibility
available to the information’s owner. Therefore, securing storage infrastructure has become
an integral component of the storage management process in modern technological
environment. It is an intensive and necessary task, essential to manage, and protect vital
information.
Information security includes a set of practices that protect information and information
systems from unauthorized disclosure, access, use, destruction, deletion, modification, and
disruption.
Information security involves implementing various kinds of safeguards or controls to lessen
the risk of exploitation of a vulnerability in the information system. The risks and vulnerabilities could otherwise cause a significant impact to the organization's business. From
this perspective, security is an ongoing process, not static, and requires continuous re-
validation and modification. Securing the storage infrastructure begins with understanding

the goals of information security. Information security is vital for every business
organization.

Goals of Information Security


The goals of information security
are:

• CIA
o Confidentiality
o Integrity
o Availability
• Accountability

The goal of information security


is to provide Confidentiality,
Integrity, and Availability,
commonly referred to as the security triad, or CIA:

• Confidentiality provides the required secrecy of information to ensure that only


authorized users have access to data.
• Integrity ensures that unauthorized changes to information are not allowed. The
objective of ensuring integrity is to detect and protect against unauthorized alteration
or deletion of information.
• Availability ensures that authorized users have reliable and timely access to compute,
storage, network, application, and data resources.

Ensuring confidentiality, integrity, and availability is the primary objective of any IT
security implementation. These goals are supported by using authentication, authorization,
and auditing processes.
Accountability is another important principle of information security. It refers to the process
where the users or applications are responsible for the actions or events that are executed on
the systems. Accountability can be achieved by auditing logs.

Authentication, Authorization, and Auditing

Authentication, authorization, and auditing, also referred to as AAA, play an important role in protecting customer data in a multitenant cloud environment:

• Authentication is a process to ensure that ‘users’ or ‘assets’ are who they claim to be by
verifying their identity credentials. The user has to prove identity to the provider to access
the data stored. A user may be authenticated using a single-factor or multifactor method.
Single-factor authentication involves the use of only one factor, such as a password.
Multifactor authentication uses more than one factor to authenticate a user.
• Authorization is a process of determining the privileges that a user/device/application has,
to access a particular service or a resource. For example, a user with administrator’s
privileges is authorized to access more services or resources compared to a user with non-
administrator privileges. For example, the administrator can have ‘read/write’ access and a
normal user can have ‘read-only’ access. Authorization should be performed only if the
authentication is successful. The most common authentication and authorization controls,
used in a data center environment are Windows Access Control List (ACL), UNIX
permissions, Kerberos, and Challenge-Handshake Authentication Protocol (CHAP). It is
essential to verify the effectiveness of security controls that are deployed with the help of
auditing.
• Auditing refers to the logging of all transactions for the purpose of assessing the
effectiveness of security controls. It helps to validate the behavior of the infrastructure
components, and to perform forensics, debugging, and monitoring activities.

For example: In cloud computing, a customer can access the cloud service catalog using their credentials. Once the customer is authenticated, a view of the catalog is provided along with different options, based on the privileges assigned. An administrator has a different view of the catalog than a normal user. The number of times a customer has logged in to the catalog is audited for monitoring purposes.
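A minimal sketch of how authorization can follow authentication in such a catalog; the roles, privileges, and data structures below are illustrative assumptions, not a real product's access-control API.

PERMISSIONS = {"administrator": {"read", "write"}, "user": {"read"}}   # assumed role-based privileges
CREDENTIALS = {"alice": "s3cret", "bob": "pa55word"}   # toy store; real systems never keep plain-text passwords
ROLES = {"alice": "administrator", "bob": "user"}
audit_log = []                                          # auditing: every access decision is logged

def access_catalog(username, password, action):
    authenticated = CREDENTIALS.get(username) == password                    # authentication: verify identity
    authorized = authenticated and action in PERMISSIONS[ROLES[username]]    # authorization: check privileges
    audit_log.append((username, action, "granted" if authorized else "denied"))
    return authorized

print(access_catalog("bob", "pa55word", "write"))    # False: a normal user has read-only access
print(access_catalog("alice", "s3cret", "write"))    # True: the administrator has read/write access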

Security Concepts and Relationships
The figure shows relationship among various security concepts in a data center environment. An
organization (owner of the asset) wants to safeguard the asset from threat agents (attackers) who
seek to abuse the assets. Risk arises when there is a likelihood that a threat agent (an attacker) will exploit a vulnerability. Therefore, organizations deploy various countermeasures to minimize risk by reducing the vulnerabilities.
Risk assessment is the first step to determine the extent of potential threats and risks in an
infrastructure. The process assesses risk and helps to identify appropriate controls to mitigate or
eliminate risks. Organizations must apply their basic information security and risk-management
policies and standards to their infrastructure.
Some of the key security areas that an organization must focus on while building the infrastructure
are: authentication, identity and access management, data loss prevention and data breach
notification, governance, risk, and compliance (GRC), privacy, network monitoring and analysis,
security information and event logging, incident management, and security management.

Security Concepts
The following are important security concepts:
Security Assets

• Information, hardware, and software


• Security considerations:
o Must provide easy access to authorized users
o Must be difficult for potential attackers to compromise
o Cost of securing the assets should be a fraction of the value of the assets

Information is one of the most important assets for any organization. Other assets include
hardware, software, and other infrastructure components required to access the information.
To protect these assets, organizations deploy security controls. These security controls have
two objectives.

• The first objective is to ensure that the resources are easily accessible to authorized
users.
• The second objective is to make it difficult for potential attackers to access and
compromise the system.

The effectiveness of a security control can be measured by two key criteria. One, the cost of
implementing the system should be a fraction of the value of the protected data. Two, it
should cost a potential attacker heavily, in terms of money, effort, and time, to compromise
and access the assets.

Security Threats

• Potential attacks that can be carried out


• Attacks can be classified as:
o Passive attacks attempt to gain unauthorized access into the system
o Active attacks attempt data modification, Denial of Service (DoS), and repudiation
attacks

Threats are the potential attacks that can be carried out on an IT infrastructure. These attacks can be
classified as active or passive. Passive attacks are attempts to gain unauthorized access into the system.
Passive attacks pose threats to confidentiality of information. Active attacks include data modification,
denial of service (DoS), and repudiation attacks. Active attacks pose threats to data integrity,
availability, and accountability.

Security Vulnerabilities

• A weakness that an attacker exploits to carry out attacks


• Security considerations:
o Attack surface
o Attack vectors
o Work factor
• Managing vulnerabilities:
o Minimize the attack surface
o Maximize the work factor
o Install security controls

Vulnerability is a weakness of any information system that an attacker exploits to carry out
an attack. The components that provide a path enabling access to information are vulnerable
to potential attacks. It is important to implement adequate security controls at all the access
points on these components.
Attack surface, attack vector, and work factor are the three factors to consider when
assessing the extent to which an environment is vulnerable to security threats. Attack surface refers to the various entry points that an attacker can use to launch an attack, which include people, process, and technology. For example, each component of a storage infrastructure is a source of potential vulnerability. An attack vector is a step or a series of steps necessary to complete an attack. For example, an attacker might exploit a bug in the management interface to execute a snoop attack. Work factor refers to the amount of time and effort required to exploit an attack vector.
Having assessed the vulnerability of the environment, organizations can deploy specific control measures. Any control measure should account for three aspects: people, process, and technology, as well as the relationships among them.

Security Controls

• Reduce the impact of vulnerabilities


• Controls can be:
o Technical: antivirus, firewalls, and IDPS
o Non-technical: administrative policies and physical controls
• Controls are categorized as:
o Preventive
o Detective
o Corrective

The security controls are directed at reducing vulnerability by minimizing the attack surfaces
and maximizing the work factors. These controls can be technical or non-technical. Controls
are categorized as preventive, detective, and corrective.

• Preventive: Avoid problems before they occur


• Detective: Detect a problem that has occurred
• Corrective: Correct the problem that has occurred

Organizations should deploy defense-in-depth strategy when implementing the controls

Defense-in-depth
Definition: Defense-in-depth

A strategy in which multiple layers of defense are deployed throughout the infrastructure to help mitigate
the risk of security threats in case one layer of the defense is compromised.

• Also known as a “layered approach” to security
• Provides organizations additional time to detect and respond to an attack
o Reduces the scope of a security breach

An organization should deploy multiple layers of defense throughout the infrastructure to


mitigate the risk of security threats, in case one layer of the defense is compromised. This
strategy is referred to as defense-in-depth. This strategy may also be thought of as a “layered
approach to security” because there are multiple measures for security at different levels.
Defense-in-depth increases the barrier to exploitation—an attacker must breach each layer
of defenses to be successful—and thereby provides additional time to detect and respond to
an attack.
This potentially reduces the scope of a security breach. However, the overall cost of deploying
defense-in-depth is often higher compared to single-layered security controls. An example of
defense-in-depth could be a virtual firewall installed on a hypervisor when there is already a
network-based firewall deployed within the same environment. This provides an additional layer of security, reducing the chance of compromising the hypervisor's security if the network-level firewall is compromised.

Governance, Risk, and Compliance
Definition: GRC

A term encompassing processes that help an organization to ensure that their acts are ethically correct
and in accordance with their risk appetite (the risk level an organization chooses to accept), internal
policies, and external regulations.

GRC work together to enforce policies and minimize risks

GRC should be integrated, holistic, and organization-wide. All operations of an organization


should be managed and supported through GRC. Governance, risk management, and
compliance management work together to enforce policies and minimize potential risks. To
better understand how these three components work together, consider an example of how
GRC is implemented in an IT organization. Governance is the authority for making policies
such as defining access rights to users based on their roles and privileges. Risk management
involves identifying resources that should not be accessed by certain users in order to
preserve confidentiality, integrity, and availability. In this example, compliance management
assures that the policies are being enforced by implementing controls such as firewalls and
identity management systems.
GRC is an important component of data center infrastructure. Therefore, while using
a modern technology infrastructure, organizations must ensure that all aspects of GRC are addressed, including cloud-related aspects such as secure multi-tenancy, the jurisdictions where data should be stored, data privacy, and data ownership.

Storage Security Domains and Threats
Storage Security Domains
The information made available on a network is exposed to security threats from a variety of sources.
Therefore, specific controls must be implemented to secure this information that is stored on an
organization’s storage infrastructure.

• To deploy controls, it is important to have a clear understanding of the access paths leading
to storage resources. If each component within the infrastructure is considered a potential
access point, the attack surface of all these access points must be analyzed to identify the
associated vulnerabilities.
• To identify the threats that apply to a storage infrastructure, access paths to data storage can
be categorized into three security domains: application access, management access, and
backup, replication, and archive.
• To secure the storage environment, identify the attack surface and existing threats within
each of the security domains and classify the threats based on the security goals—
availability, confidentiality, and integrity.

The illustration depicts the three security domains of a storage environment.

In the illustration:

• The first security domain involves application access to the stored data through the storage
network. Application access domain may include only those applications that access the
data through the file system or a database interface.
• The second security domain includes management access to storage and interconnecting
devices and to the data residing on those devices. Management access, whether monitoring,
provisioning, or managing storage resources, is associated with every device within the
storage environment. Most management software supports some form of CLI, system
management console, or a web-based interface. Implementing appropriate controls for
securing management applications is important because the damage that can be caused by
using these applications can be far more extensive.
• The third domain consists of backup, replication, and archive access. This domain is
primarily accessed by storage administrators who configure and manage the environment.
Along with the access points in this domain, the backup and replication media also needs
to be secured.

Key Security Threats Across Domains


Some of the key security threats across domains are:

• Denial of service (DoS)


• Distributed denial of service attack (DDoS)
• Loss of data
• Malicious insiders
• Account hacking
• Shared technology vulnerabilities

Denial of Service (DoS)
• Prevents legitimate users from accessing resources or services
o Example: Exhausting network bandwidth or CPU cycles
o Could be targeted against compute systems, networks, and storage resources
• DDoS is a variant of DoS attack
o Several systems launch a coordinated DoS attack on target(s)
o DDoS master program is installed on a compute system
o Master program communicates to agents at designated time
o Agents initiate the attack on receiving the command
• Control measure
o Impose restrictions and limits on resource consumption

Prevents legitimate users from accessing resources or services. DoS attacks can be targeted
against compute systems, networks, or storage resources in a storage environment. The intent of DoS is always to exhaust key resources, such as network bandwidth or CPU cycles, thus
impacting production use. For example, an attacker may send massive quantities of data over
the network to the storage system with the intention of consuming bandwidth. This prevents
legitimate users from using the bandwidth and the user may not be able to access the storage
system over the network. Such an attack may be carried out by exploiting weaknesses of a
communication protocol. For example, an attacker may cause DoS to a legitimate user by
resetting TCP sessions. Apart from DoS attack, an attacker may also carry out Distributed
DoS attack.
A Distributed DoS (DDoS) attack is a variant of DoS attack in which several systems launch
a coordinated, simultaneous DoS attack on their target(s). It results into denial of service to
the users of the targeted system(s). In a DDoS attack, the attacker can multiply the
effectiveness of the DoS attack by harnessing the resources of multiple collaborating systems
which serve as attack platforms. Typically, a DDoS master program is installed on one
compute system. Then, at a designated time, the master program communicates to a number
of "agent" programs installed on compute systems. When the agents receive the command,
they initiate the attack.
The principal control that can minimize the impact of DoS and DDoS attack is to impose
restrictions and limits on the network resource consumption. For example, when it is
identified that the amount of data being sent from a given IP address exceeds the configured
limits, the traffic from that IP address may be blocked. This provides a first line of defense.
Further, restrictions and limits may be imposed on resources consumed by each compute
system, providing an additional line of defense.
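A minimal sketch of that first line of defense, per-source traffic limits. The threshold, time window, and data structures are assumptions for illustration, not a specific firewall's implementation.

import time
from collections import defaultdict

LIMIT_BYTES = 10 * 1024 * 1024            # assumed limit: 10 MB per source IP per window
WINDOW_SECONDS = 60

traffic = defaultdict(lambda: [0.0, 0])   # source IP -> [window start time, bytes seen in this window]

def allow_packet(src_ip, size):
    window_start, seen = traffic[src_ip]
    now = time.time()
    if now - window_start > WINDOW_SECONDS:   # start a new accounting window for this source
        traffic[src_ip] = [now, size]
        return True
    traffic[src_ip][1] = seen + size
    return traffic[src_ip][1] <= LIMIT_BYTES  # block traffic that exceeds the configured limit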

Loss of Data
• Occurs due to various reasons other than malicious attacks
• Causes of data loss include:
o Accidental deletion by an administrator
o Destruction resulting from natural disasters
• If the organization is a service provider, then it should publish:
o Protection controls deployed for data protection
o Appropriate terms/conditions and penalties related to data loss
• Control measure
o Data backup and replication

Data loss can occur in a storage environment due to various reasons other than malicious
attacks. Some of the causes of data loss may include accidental deletion by an administrator
or destruction resulting from natural disasters. Deploying appropriate measures such as data backup or replication can reduce the impact of such
events. Organizations need to develop strategies that can avoid or at least minimize the data
loss due to such events. Examples of such strategies include choice of backup media,
frequency of backup, synchronous/asynchronous replication, and number of copies.
Further, if the organization is a cloud service provider then they must publish the protection
controls deployed to protect the data stored in cloud. The providers must also ensure
appropriate terms and conditions related to data loss and the associated penalties as part of
the service contract. The service contract should also include various BC/DR options, such as
backup and replication, offered to the consumers.

Malicious Insiders
Definition: Malicious Insiders

An organization’s current or former employee, contractor, or other business partner who has or had
authorized access to an organization's compute systems, network, or storage.

Source: Computer Emergency Response Team (CERT)

• Intentional misuse of access to negatively impact CIA


• Control measures:
o Strict access control policies
o Security audit and data encryption
o Disable employee accounts immediately after separation
o Segregation of duties (role-based access control)
o Background investigation of candidates before hiring

Today, most organizations are aware of the security threats posed by outsiders.
Countermeasures such as firewalls, malware protection software, and intrusion detection
systems can minimize the risk of attacks from outsiders. However, these measures do not
reduce the risk of attacks from malicious insiders.
According to Computer Emergency Response Team (CERT), a malicious insider could be an
organization’s current or former employee, contractor, or other business partner who has or
had authorized access to an organization’s compute systems, network, or storage. These
malicious insiders may intentionally misuse that access in ways that negatively impact the
confidentiality, integrity, or availability of the organization’s information or resources.
For example, consider a former employee of an organization who had access to the
organization’s storage resources. This malicious insider may be aware of security weaknesses
in that storage environment. This is a serious threat because the malicious insider may exploit
the security weakness. Control measures that can minimize the risk due to malicious insiders
include strict access control policies, disabling employee accounts immediately after
separation from the company, security audit, encryption, and segregation of duties (role-
based access control, which is discussed later in this module). A background investigation of
a candidate before hiring is another key measure that can reduce the risk due to malicious
insiders.

Account Hacking
• Occurs when an attacker gains access to administrator’s/user’s accounts
• Control measures: multi-factor authentication, IPSec, IDPS, and firewall

Type of attack                        Description

Phishing                              • Social engineering attack used to deceive users
                                      • Carried out by spoofing email containing a link to a fake website
                                      • User credentials entered on the fake site are captured

Installing keystroke-logging malware  • Attacker installs malware on an administrator’s or user’s compute system
                                      • Malware captures user credentials and sends them to the attacker

Man-in-the-middle                     • Attacker eavesdrops on the network to capture credentials

Account hijacking refers to a scenario in which an attacker gains access to an administrator’s
or user’s account(s) using methods such as phishing or installing keystroke-logging malware
on the administrator’s or user’s compute systems.
Phishing is an example of a social engineering attack that is used to deceive users. Phishing
attacks are typically carried out by spoofing email – an email with a fake but genuine-
appearing address, which provides a link to a website that masquerades as a legitimate
website. After opening the website, users are asked to enter details such as their login
credentials. These details are then captured by the attacker to take over the user’s account.
For example, an employee of an organization may receive an email that is designed to appear
as if the IT department of that organization has sent it. This email may ask the user to click
the link provided in the email and update their details. After clicking the link, the user is
directed to a malicious website where their details are captured.
Another way to gain access to a user’s credentials is by installing keystroke-logging malware.
In this attack, the attacker installs malware in the storage administrator’s compute system
which captures user credentials and sends them to the attacker. After capturing the
credentials, an attacker can use them to gain access to the storage environment. The attacker
may then eavesdrop on the administrator’s activities and may also change the configuration
of the storage environment to negatively impact the environment.

348
A “man-in-the-middle” attack is another way to hack user’s credentials. In this attack, the
attacker eavesdrops—overhears the conversation—on the network channel between two sites
when replication is occurring over the network. Use of multi-factor authentication and IPSec
(a suite of algorithms, protocols, and procedures used for securing IP communications by
authenticating and/or encrypting each packet in a data stream) can prevent this type of
attack.
Intrusion detection and prevention systems and firewalls are additional controls that may
reduce the risk of such attacks.

Shared Technology Vulnerabilities


• An attacker may exploit the vulnerabilities of tools used to enable multi-tenant
environments
• Examples of threats:
o Failure of controls that provide separation of memory and storage
o Hyperjacking attack involves installing a rogue hypervisor that takes control of
compute system
• Control measure:
o Examining program memory and processor registers for anomalies

Technologies that are used to build today’s storage infrastructure provide a multi-tenant
environment enabling the sharing of resources. Multi-tenancy is achieved by using controls
that provide separation of resources such as memory and storage for each application.
Failure of these controls may expose the confidential data of one business unit to users of
other business units, raising security risks.
Compromising a hypervisor is a serious event because it exposes the entire environment to
potential attacks. Hyperjacking is an example of this type of attack in which the attacker
installs a rogue hypervisor that takes control of the compute system. The attacker now can
use this hypervisor to run unauthorized virtual machines in the environment and carry out
further attacks. Detecting this attack is difficult and involves examining components such as
program memory and the processor core registers for anomalies.

349
Security Controls
Introduction to Security Controls
Any security control should account for three aspects: people, process, and technology, and the
relationships among them.

Security controls can be classified as:

• Administrative
o Include security and personnel policies or standard procedures to direct the safe
execution of various operations
• Technical
o Usually implemented through tools or devices deployed on the IT infrastructure

Technical security controls must be deployed at:

• Compute level
• Network level
• Storage level

Key Security Controls


Important security controls include:

• Physical security
• Identity and access management
• Role-based access control
• Firewall
• Intrusion detection and prevention system
• Virtual private network
• Malware protection software
• Data encryption
• Data shredding

At the compute system level, security controls are deployed to secure hypervisors and hypervisor
management systems, virtual machines, guest operating systems, and applications. Security at the
network level commonly includes firewalls, demilitarized zones, intrusion detection and prevention
systems, virtual private networks, and VLAN. At the storage level, security controls include data
shredding, and data encryption. Apart from these security controls, the storage infrastructure also
requires identity and access management, role-based access control, and physical security
arrangements.

350
Physical Security
Physical security is the foundation of any overall IT security strategy. Strict enforcement of policies,
processes, and procedures by an organization is critical element of successful physical security.

The physical security measures that are deployed to secure the organization’s storage infrastructure are:

• Disabling all unused devices and ports


• 24/7/365 onsite security
• Biometric or security badge-based authentication to grant access to the facilities
• Surveillance cameras to monitor activity throughout the facility
• Sensors and alarms to detect motion and fire

Identity and Access Management


Definition: Identity and Access Management (IAM)

A process of managing users identifiers, and their authentication and authorization to access storage
infrastructure resources.

IAM controls access to resources by placing restrictions based on user identities. In today’s environment,
an organization may collaborate with one or more cloud service providers to access various cloud-based
storage services. This requires deploying multiple authentication systems to enable the organization to
authenticate employees and provide access to cloud-based storage services.

Organizations may deploy the following authorization and authentication controls:

Control          Description                                      Examples

Authorization    Restricts accessibility and sharing of           Windows ACLs, UNIX permissions,
                 files and folders                                and OAuth

Authentication   Enables authentication between client            Multi-factor authentication, Kerberos,
                 and server                                       CHAP, and OpenID
The key traditional authentication and authorization controls that are deployed in a storage
environment are Windows ACLs, UNIX permissions, Kerberos, and Challenge-Handshake
Authentication Protocol (CHAP). Alternatively, the organization can use Federated Identity
Management (FIM) for authentication. A federation is an association of organizations
(referred to as trusted parties) that come together to exchange information about their users
and resources to enable collaboration.
Federation includes the process of managing the trust relationships among the trusted parties
beyond internal networks or administrative boundaries. FIM enables the organizations
(especially cloud service providers) to offer services without implementing their own
authentication system. The organization can choose an identity provider to authenticate their
users. This involves exchanging identity attributes between the organizations and the identity
provider in a secure way. The identity and access management controls used by organizations
include OpenID and OAuth.

351
OAuth
Definition: OAuth

An open authorization control that enables a client to access protected resources from a resource server on
behalf of a resource owner.

• Can be used to secure application access domain


• There are four entities that are involved in the authorization control:
o Resource owner
o Resource server
o Client
o Authorization Server
• Example: Giving LinkedIn permission to access your Facebook contacts

The illustration shows the steps involved in the OAuth process as described in Request for
Comments (RFC) 6749 published by the Internet Engineering Task Force (IETF); a minimal code sketch follows the list:

• The client requests authorization from the resource owner. The authorization request
can be made directly to the resource owner, or indirectly through the authorization
server.
• The client receives an authorization grant, which is a credential representing the
resource owner's authorization to access its protected resources. It is used by the client
to obtain an access token. Access tokens are credentials that are used to access
protected resources. An access token is a string representing an authorization issued
to the client. The string is usually opaque to the client. Tokens represent specific scopes
and durations of access, granted by the resource owner, and enforced by the resource
server and authorization server.

352
• The client requests an access token by authenticating with the authorization server
and presenting the authorization grant.
• The authorization server authenticates the client and validates the authorization
grant, and if valid, issues an access token.
• The client requests the protected resource from the resource server and authenticates
by presenting the access token.
• The resource server validates the access token, and if valid, serves the request.
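As referenced above, the following Python sketch illustrates the token-exchange and resource-access steps of the flow. It assumes the third-party requests library and uses entirely hypothetical endpoint URLs and client credentials; a real OAuth deployment would follow the provider's documented API.

```python
import requests  # third-party HTTP library, assumed available

# Illustrative endpoints and credentials (assumptions, not from the course).
TOKEN_URL = "https://authorization-server.example.com/oauth/token"
RESOURCE_URL = "https://resource-server.example.com/api/contacts"
CLIENT_ID = "example-client"
CLIENT_SECRET = "example-secret"

def exchange_code_for_token(authorization_code):
    """Authenticate with the authorization server and present the grant to obtain an access token."""
    response = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "authorization_code",
            "code": authorization_code,
            "redirect_uri": "https://client.example.com/callback",
        },
        auth=(CLIENT_ID, CLIENT_SECRET),   # client authentication
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]

def fetch_protected_resource(access_token):
    """Present the access token to the resource server, which validates it and serves the request."""
    response = requests.get(
        RESOURCE_URL,
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```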

OpenID

Definition: OpenID

An open standard for authentication in which an organization uses authentication services from an OpenID provider.

The organization is known as the relying party and the OpenID provider is known as the identity provider. An OpenID provider maintains users’ credentials on its authentication system and enables relying parties to authenticate users requesting the use of the relying party’s services. This eliminates the need for the relying party to deploy its own authentication systems.

In the OpenID control, a user creates an ID with one of the OpenID providers. This OpenID can then be used to sign on to any organization (relying party) that accepts OpenID authentication. This control can be used in the modern environment to secure the application access domain.

The illustration shows the OpenID concept by considering a user who requires services from the relying
party. For the user to use the services provided by the relying party an identity (user ID and password) is
required. The relying party does not provide its own authentication control; however, it supports
OpenID from one or more OpenID providers. The user can create an ID with the identity provider and then
use this ID with the relying party. The relying party, after receiving the login request, authenticates it with
the help of the identity provider and then grants access to the services.

Multifactor Authentication
• Multiple factors for authentication:
o First factor: What a user knows?
▪ For example, a password
o Second factor: What the user has?
▪ For example, a token
o Third factor: Who is the user?
▪ For example, biometric identity
• Access is granted only when all the factors are validated

Multifactor authentication uses more than one factor to authenticate a user. A commonly implemented
two-factor authentication process requires the user to supply both something he or she knows (such as
a password) and also something he or she has (such as a device). The second factor can be a password
that is generated by a physical device (known as token), which is in the user’s possession. The password
that is generated by the token is valid for a predefined time. The token generates another password
after the predefined time is over. To further enhance the authentication process, more factors may also
be considered. Examples of more factors that may be used include biometric identity. A multifactor
authentication technique may be deployed using any combination of these factors. A user’s access to
the environment is granted only when all the required factors are validated.
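A minimal sketch of the "what the user has" factor follows, assuming a time-based one-time password generated in the style of RFC 6238 from a secret shared by the token and the authentication server. The secret value, interval, and function names are illustrative.

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(shared_secret_b32, interval=30, digits=6):
    """Generate a time-based one-time token (RFC 6238 style) from a shared secret."""
    key = base64.b32decode(shared_secret_b32)
    counter = int(time.time()) // interval               # the token changes every `interval` seconds
    msg = struct.pack(">Q", counter)
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                            # dynamic truncation
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)

def authenticate(password, token, stored_password, secret_b32):
    """Grant access only when both factors validate: what the user knows and what the user has."""
    knows = hmac.compare_digest(password, stored_password)
    has = hmac.compare_digest(token, totp(secret_b32))
    return knows and has

# Illustrative use: the token device and the server share the same secret.
SECRET = "JBSWY3DPEHPK3PXP"
print(authenticate("s3cret-pass", totp(SECRET), "s3cret-pass", SECRET))  # True
```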

354
Challenge Handshake Authentication
Protocol
CHAP is a basic authentication control that has been widely adopted by network devices and compute
systems. It provides a method for initiators and targets to authenticate each other by using a secret code
or password.

The figure illustrates the handshake steps that occur between an initiator and a target:

CHAP secrets are random secrets of 12 to 128 characters. The secret is never exchanged
directly over the communication channel. Rather, a one-way hash function converts the secret
into a hash value, which is then exchanged.
A hash function, using the MD5 algorithm, transforms data in such a way that the result is
unique and cannot be changed back to its original form. If the initiator requires reverse
CHAP authentication, the initiator authenticates the target by using the same procedure. The
CHAP secret must be configured on the initiator and the target. A CHAP entry, which is
composed of the name of a node and the secret associated with the node, is maintained by the
target and the initiator.
The same steps are executed in a two-way CHAP authentication scenario. After these steps
are completed, the initiator authenticates the target. If both authentication steps succeed,
then data access is enabled. CHAP is often used because it is a simple protocol to implement
and can be implemented across various disparate systems.
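The challenge/response exchange can be sketched in a few lines of Python. The identifier, secret, and challenge length below are illustrative; in a real iSCSI or FC environment the initiator and target implement this inside the protocol stack.

```python
import hashlib
import os

# The CHAP secret is configured on both the initiator and the target; it is never sent on the wire.
SHARED_SECRET = b"example-chap-secret"   # illustrative 12-128 character secret

def chap_response(identifier, secret, challenge):
    """One-way MD5 hash over (identifier || secret || challenge), per the CHAP scheme."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# Target side: issue a random challenge.
identifier = 1
challenge = os.urandom(16)

# Initiator side: compute the hash value and return it (the secret itself is not exchanged).
response = chap_response(identifier, SHARED_SECRET, challenge)

# Target side: recompute the hash with its own copy of the secret and compare.
expected = chap_response(identifier, SHARED_SECRET, challenge)
print("Initiator authenticated:", response == expected)   # True when the secrets match
```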

355
Role-based Access Control
Role-based access control (RBAC) is an approach to restricting access to authorized users based
on their respective roles. A role may represent a job function, for example, a storage administrator.
A role is assigned the minimum privileges required to perform the tasks associated with that
role.
It is advisable to consider administrative controls, such as separation of duties, when defining data
center security procedures. Clear separation of duties ensures that no single individual can both
specify an action and carry it out. For example, the person who authorizes the creation of
administrative accounts should not be the person who uses those accounts.
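A minimal sketch of role-based access control with separation of duties follows; the role names and privileges are hypothetical and not taken from any product.

```python
# Illustrative role-to-privilege mapping (assumptions, not from the course).
ROLE_PRIVILEGES = {
    "storage_administrator": {"create_lun", "expand_pool", "view_reports"},
    "security_auditor": {"view_reports", "view_audit_logs"},
    "account_approver": {"approve_account_creation"},   # separation of duties: cannot create accounts
    "account_creator": {"create_admin_account"},        # cannot approve the accounts it creates
}

USER_ROLES = {
    "alice": {"storage_administrator"},
    "bob": {"security_auditor"},
}

def is_authorized(user, privilege):
    """Grant an action only if one of the user's roles carries the required privilege."""
    return any(privilege in ROLE_PRIVILEGES.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_authorized("alice", "create_lun"))   # True
print(is_authorized("bob", "create_lun"))     # False: the auditor role has minimum privileges only
```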

Firewall and Demilitarized Zone


Definition: Firewall

A security control designed to monitor incoming and outgoing network traffic and compare it against a set
of filtering rules.

• Firewall security rules may use various filtering parameters such as source address,
destination address, port numbers, and protocols. The effectiveness of a firewall depends
on how robustly and extensively the security rules are defined.
• Firewalls can be deployed at:
o Network level
o Compute level
o Hypervisor level
• Uses various parameters for traffic filtering

Definition: Demilitarized Zone

A control to secure internal assets while enabling Internet-based access

A network-level firewall is typically used as the first line of defense for restricting certain types of
traffic from entering or leaving a network. This type of firewall is typically
deployed at the entry point of an organization’s network.
At the compute system level, a firewall application is installed as a second line of defense in a
defense-in-depth strategy. This type of firewall provides protection only to the compute
system on which it is installed.
In a virtualized environment, there is an added complexity of virtual machines running on a
smaller number of compute systems. When virtual machines on the same hypervisor
communicate with each other over a virtual switch, a network-level firewall cannot filter this
traffic. In such situations, a virtual firewall can be used to filter virtual machine traffic.
To reduce the vulnerability and protect the internal resources and applications, the compute
systems or virtual machines that require the Internet access are placed in a demilitarized
zone.
In a demilitarized zone environment, servers that need Internet access are placed between
two sets of firewalls, as shown in the figure. The servers in the demilitarized zone
may or may not be allowed to communicate with internal resources. Application-specific
ports such as those designated for HTTP or FTP traffic are allowed through the firewall to
the demilitarized zone servers. However, no Internet-based traffic is allowed to go through
the second set of firewalls and gain access to the internal network.
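A minimal sketch of the rule evaluation a firewall performs is shown below. The rule set mirrors the demilitarized zone idea described above (only HTTP and FTP from the Internet to the DMZ servers, nothing from the Internet to the internal network); the addresses, networks, and rule format are illustrative assumptions.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional

@dataclass
class Rule:
    action: str                    # "allow" or "deny"
    src: str                       # source network (CIDR)
    dst: str                       # destination network (CIDR)
    port: Optional[int] = None     # destination port; None matches any
    protocol: Optional[str] = None # "tcp", "udp"; None matches any

# Illustrative DMZ-style rule set: 192.0.2.0/24 is the DMZ, 10.0.0.0/8 the internal network.
RULES = [
    Rule("allow", "0.0.0.0/0", "192.0.2.0/24", 80, "tcp"),   # HTTP to DMZ servers
    Rule("allow", "0.0.0.0/0", "192.0.2.0/24", 21, "tcp"),   # FTP to DMZ servers
    Rule("deny",  "0.0.0.0/0", "10.0.0.0/8"),                # no Internet traffic to the internal network
]

def evaluate(src, dst, port, protocol):
    """Return the action of the first matching rule; deny by default."""
    for rule in RULES:
        if (ip_address(src) in ip_network(rule.src)
                and ip_address(dst) in ip_network(rule.dst)
                and (rule.port is None or rule.port == port)
                and (rule.protocol is None or rule.protocol == protocol)):
            return rule.action
    return "deny"

print(evaluate("198.51.100.7", "192.0.2.10", 80, "tcp"))   # allow: HTTP to the DMZ
print(evaluate("198.51.100.7", "10.1.2.3", 443, "tcp"))    # deny: Internet to the internal network
```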

Intrusion Detection and Prevention System


Definition: Intrusion Detection and Prevention System (IDPS)

A security tool that automates the process of detecting and preventing events that can compromise the
confidentiality, integrity, or availability of IT resources.

• Intrusion detection is the process of detecting events that can compromise the
confidentiality, integrity, or availability of IT resources.
• An intrusion detection system (IDS) is a security tool that automates the detection process.
An IDS generates alerts, in case anomalous activity is detected. An intrusion prevention
system (IPS) is a tool that has the capability to stop the events after they have been detected
by the IDS. These two controls usually work together and are generally referred to as
intrusion detection and prevention system (IDPS). The key techniques used by an IDPS to
identify intrusion in the environment are signature-based and anomaly-based detection.

In the signature-based detection technique, the IDPS relies on a database that contains known
attack patterns or signatures, and scans events against it. A signature can be an email with a
specific subject or an email attachment with a specific file name that is known to contain a
virus. This type of detection is effective only for known threats and is potentially
circumvented if an attacker changes the signature (the email subject or the file name in the
attachment, in this example).
In the anomaly-based detection technique, the IDPS scans and analyzes events to determine
whether they are statistically different from events normally occurring in the system. This
technique can detect various events such as multiple login failures, excessive process failure,
excessive network bandwidth consumed by an activity, or an unusual number of emails sent
by a user, which could signify an attack is taking place.
The IDPS can be deployed at the compute system, network, or hypervisor levels.
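A minimal sketch contrasting the two detection techniques follows; the signature list and the anomaly threshold are illustrative assumptions, not content from a real IDPS.

```python
# Illustrative signatures and thresholds (assumptions, not from the course).
KNOWN_BAD_SUBJECTS = {"Invoice overdue - open attachment", "Password reset required!!"}
MAX_LOGIN_FAILURES_PER_MINUTE = 5   # baseline of normal activity

def signature_based_check(email_subject):
    """Flag events that match a known attack pattern (signature)."""
    return email_subject in KNOWN_BAD_SUBJECTS

def anomaly_based_check(login_failures_last_minute):
    """Flag events that deviate statistically from normal activity."""
    return login_failures_last_minute > MAX_LOGIN_FAILURES_PER_MINUTE

events = [
    {"type": "email", "subject": "Invoice overdue - open attachment"},
    {"type": "auth", "failures_per_minute": 42},
]

for event in events:
    if event["type"] == "email" and signature_based_check(event["subject"]):
        print("ALERT (signature): known malicious email subject")
    if event["type"] == "auth" and anomaly_based_check(event["failures_per_minute"]):
        print("ALERT (anomaly): excessive login failures")
```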

Virtual Private Network


• Extends a user’s private network across a public network
o Enables applying the internal network’s security and management policies over the VPN
connection
• Two methods to establish a VPN connection:
o Remote access VPN connection
▪ Remote client initiates a remote VPN connection request
▪ VPN server authenticates and grants access to organization’s network
o Site-to-site VPN connection
▪ Remote site initiates a site-to-site VPN connection

▪ VPN server authenticates and grants access to organization’s network

In the storage environment, a virtual private network (VPN) can be used to provide a user a
secure connection to the storage resources. A VPN is also used to provide a secure site-to-site
connection between a primary site and a DR site when performing remote replication. A VPN
can also be used to provide a secure site-to-site connection between an organization’s data
center and the cloud.
A virtual private network extends an organization’s private network across a public network
such as the Internet. VPN establishes a point-to-point connection between two networks over
which encrypted data is transferred. VPN enables organizations to apply the same security
and management policies to the data transferred over the VPN connection as are applied to the
data transferred over the organization’s internal network. When establishing a VPN
connection, a user is authenticated before the security and management policies are applied.
There are two methods in which a VPN connection can be established:

• Remote access VPN connection


• Site-to-site VPN connection

In a remote access VPN connection, a remote client (typically client software installed on the
user’s compute system) initiates a remote VPN connection request. A VPN server
authenticates and provides the user access to the network. This method can be used by
administrators to establish a secure connection to data center and carry out management
operations.
In a site-to-site VPN connection, the remote site initiates a site-to-site VPN connection. The
VPN server authenticates and provides access to the internal network. Typical usage scenarios
for this method are deploying remote replication or connecting to the cloud.

Malware Protection Software


• Detects, prevents, and removes malware programs
• Common malware detection techniques:
o Signature-based detection
o Heuristics detection
• Protects OS against attacks that modify sensitive areas
o Disallows unauthorized modification of sensitive areas

Malware protection software is typically installed on a compute system or on a mobile device


to provide protection for the operating system and applications. The malware protection
software detects, prevents, and removes malware and malicious programs such as viruses,
worms, Trojan horses, key loggers, and spyware. Malware protection software uses various
techniques to detect malware.
One of the most common techniques that is used is signature-based detection. In this
technique, the malware protection software scans the files to identify a malware signature. A
signature is a specific bit pattern in a file. These signatures are cataloged by malware
protection software vendors and are made available to users as updates. The malware
protection software must be configured to regularly update these signatures to provide
protection against new malware programs.
Another technique, called heuristics, can be used to detect malware by examining suspicious
characteristics of files. For example, malware protection software may scan a file to
determine the presence of rare instructions or code. Malware protection software may also
identify malware by examining the behavior of programs. For example, malware protection
software may observe program execution to identify inappropriate behavior such as
keystroke capture.
Malware protection software can also be used to protect the operating system against attacks. A
common type of attack on operating systems involves modifying sensitive areas, such as registry
keys or configuration files, with the intention of causing applications to function incorrectly or
to fail. This can be prevented by disallowing unauthorized modification of sensitive areas, either
by adjusting operating system configuration settings or
through malware protection software. In this case, when a modification is attempted, the
operating system or the malware protection software challenges the administrator for
authorization.

Data Encryption
Definition: Data Encryption

A cryptographic technique in which data is encoded and made indecipherable to eavesdroppers or hackers.

• Enables securing data in-flight and at-rest


• Provides protection from threats, such as data tampering, media theft, and sniffing attacks
• Data encryption control can be deployed at compute, network, and storage
• Data should be encrypted as close to its origin as possible

Data encryption is one of the most important controls for securing data in-flight and at-rest. Data
in-flight refers to data that is being transferred over a network, and data at-rest refers to data that is
stored on a storage medium. Data encryption provides protection from threats such as data tampering,
which violates data integrity; media theft, which compromises data availability and confidentiality;
and sniffing attacks, which compromise confidentiality.
Data should be encrypted as close to its origin as possible. If it is not possible to perform encryption
on the compute system, an encryption appliance can be used for encrypting data at the point of
entry into the storage network. Encryption devices can be implemented on the fabric to encrypt
data between the compute system and the storage media. These controls can protect both the data
at-rest on the destination device and data in-transit. Encryption can also be deployed at the storage-
level, which can encrypt data-at-rest.
Another way to encrypt network traffic is to use cryptographic protocols such as Transport Layer
Security (TLS) which is a successor to Secure Socket Layer (SSL). These are application layer
protocols and provide an encrypted connection for client-server communication. These protocols
are designed to prevent eavesdropping and tampering of data on the connection over which it is
being transmitted.
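A minimal sketch of encrypting data at-rest before it is written to a storage medium is shown below. It assumes the third-party cryptography package and a key that is simply generated in memory; a real deployment would rely on managed keys and may instead encrypt on a fabric appliance or at the storage level, as described above.

```python
from cryptography.fernet import Fernet   # third-party package, assumed installed

# Key management is the hard part in practice; here the key is simply generated in memory.
key = Fernet.generate_key()
cipher = Fernet(key)

plaintext = b"customer purchase history - confidential"

# Encrypt as close to the origin as possible, then write only ciphertext to the medium.
ciphertext = cipher.encrypt(plaintext)
with open("record.enc", "wb") as f:
    f.write(ciphertext)

# An eavesdropper or media thief sees only indecipherable bytes; the key holder can recover the data.
with open("record.enc", "rb") as f:
    recovered = cipher.decrypt(f.read())
assert recovered == plaintext
```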

359
Data Shredding
Definition: Data Shredding

A process of deleting data or residual representation (sometimes called remanence) of data and making it
unrecoverable.

• Techniques for shredding data stored on tapes:


o Overwriting tapes with invalid data
o Degaussing media
o Destroying media
• Techniques for shredding data stored on disks and flash drives:
o Shredding algorithms
• Shred all copies of data including backup and replicas

Typically, when data is deleted, it is not actually erased from the storage media, and an attacker
may use specialized tools to recover it. The threat of unauthorized data recovery is greater
when an organization discards failed storage media such as disk drives, solid state drives,
or tapes. After the organization discards the media, an attacker may gain access to these media
and may recover the data by using specialized tools.
Organizations can deploy data shredding controls in their storage infrastructure to protect
from loss of confidentiality of their data. Data may be stored on disks or on tapes. Techniques
to shred data stored on tape include overwriting it with invalid data, degaussing the media (a
process of decreasing or eliminating the magnetic field), and physically destroying the media.
Data stored on disk or flash drives can be shredded by using algorithms that overwrite the
disks several times with invalid data.
Organizations may create multiple copies (backups and replicas) of their data and store them at
multiple locations as part of a business continuity and disaster recovery strategy. Therefore,
organizations must deploy data shredding controls at all locations to ensure that all the copies
are shredded.
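A minimal sketch of a shredding algorithm for disk- or flash-resident files follows: the contents are overwritten several times with invalid (random) data before deletion. The pass count is illustrative, and software overwrites may be less effective on media with wear leveling or remapped blocks; degaussing or physical destruction remains the stronger control for discarded media.

```python
import os

def shred_file(path, passes=3):
    """Overwrite a file's contents several times with invalid (random) data, then delete it."""
    length = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(length))   # overwrite with invalid data
            f.flush()
            os.fsync(f.fileno())          # force the overwrite onto the medium
    os.remove(path)

# Illustrative use: shred the primary copy; backups and replicas must be shredded separately.
with open("customer_list.txt", "w") as f:
    f.write("confidential data")
shred_file("customer_list.txt")
```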

360
Concepts In Practice
RSA SecurID

• Provides two-factor authentication


• To access a resource, a user must combine their secret PIN with token code
• New token code is generated at pre-defined intervals

A two-factor authentication provides an added layer of security to ensure that only valid users have
access to systems and data. RSA SecurID is based on something a user knows (a password or PIN) and
something a user has (an authenticator device). It provides a much more reliable level of user
authentication than reusable passwords. It generates a new, one-time token code at pre-defined
intervals, making it difficult for anyone other than the genuine user to input the correct token code at
any given time. To access their resources, users combine their secret Personal Identification Number
(PIN) with the token code that is displayed on their SecurID authenticator device display at that given
time. The result is a unique, one-time password used to assure a user’s identity.

RSA Security Analytics

• Enables to detect and investigate threats often missed by other security tools
• Single platform captures and analyzes large amounts of network, logs, and other data
• Enables analysis of terabytes of metadata, log data, and recreated network sessions

Helps security analysts detect and investigate threats often missed by other security tools. Security
Analytics provides converged network security monitoring and centralized security information and
event management (SIEM). Security Analytics combines big data security collection, management, and
analytics; full network and log-based visibility; and automated threat intelligence – enabling security
analysts to better detect, investigate, and understand threats they often could not easily see or
understand before. It provides a single platform for capturing and analyzing large amounts of network,
log, and other data. It also accelerates security investigations by enabling analysts to pivot through
terabytes of metadata, log data, and recreated network sessions. It archives and analyzes long-term
security data through a distributed computing architecture and provides built-in compliance reports
covering a multitude of regulatory regimes.

RSA Adaptive Authentication

• Provides an authentication and fraud detection platform


• Measures login and post-login activities
• Provides authentication when protecting:
o Websites, online portals
o Mobile applications

A comprehensive authentication and fraud detection platform. Adaptive Authentication is designed to


measure the risk associated with a user’s login and post-login activities by evaluating a variety of risk
indicators. Using a risk and rules-based approach, the system then requires additional identity
assurance, such as out-of-band authentication, for scenarios that are at high risk and violate a policy.

361
This methodology provides transparent authentication for organizations that want to protect users
accessing websites and online portals, mobile applications and browsers, Automated Teller Machines
(ATMs), Secure Sockets Layer (SSL), virtual private network (VPN) applications, web access management
(WAM) applications, and application delivery solutions.

RSA Archer Suite

• Enables organization to:


o Manage risks
o Demonstrate compliance
o Automate business processes
o Gain visibility to corporate risk and security controls
• Provides a single point of visibility and coordination for physical, virtual, and cloud assets

Allows an organization to build an efficient, collaborative enterprise governance, risk and


compliance program across IT, finance, operations and legal domains. With RSA Archer
Suite, an organization can manage risks, demonstrate compliance, automate business
processes, and gain visibility into corporate risk and security controls. RSA delivers several
core enterprise governance, risk, and compliance solutions, with the integrated risk
management feature of RSA Archer Platform. Business users can quickly implement risk
management processes leading to improved risk management maturity, more informed
decision-making, and enhanced business performance. It also supports users with the
freedom to tailor the solutions and integrate with multiple data sources through code-free
configuration.
RSA Archer platform is an advanced security management system that provides a single
point of visibility and coordination for physical, virtual, and cloud assets. Its three layers—
controls enforcement, controls management, and security management—work together to
provide a single view of information, infrastructure, and identities across physical and virtual
environments.

Dell Change Auditor

• Helps customers audit, alert on, protect, and report on user activity and configuration changes
• The software has role-based access
• Enables customers to see how data is being handled

Helps customers audit, alert on, protect, and report on user activity as well as configuration and application
changes in Active Directory and Windows applications. The software has role-based access,
enabling auditors to have access to only the information they need to quickly perform their job. Change
Auditor provides visibility into enterprise-wide activities from one central console, enabling customers
to see how data is being handled.

Dell InTrust

• Provides the organizations the power to search and analyze vast amounts of data in one
place

362
• Provides information on who accessed the data, how it was obtained, and how the data was
used

An IT data analytics solution that provides the organizations the power to search and analyze vast
amounts of data in one place. It provides real-time insights into user activity across security, compliance,
and operational teams. It helps the administrators to troubleshoot the issues by conducting security
investigations regardless of how and where the data is stored. It helps the compliance officers to
produce reports validating compliance across multiple systems. Its web interface quickly provides
information on who accessed the data, how it was obtained, and how the data was used. This helps
administrators and security teams to discover suspicious event trends.

VMware Airwatch

• Enables secure access to corporate resources


• Configures and updates device settings over-the-air, and secures mobile devices
• Manages different types of devices from a single console

Enables organizations to address the challenges associated with mobility by providing a


simplified, efficient way to view and manage all devices from the central administration
console. This solution enables to enroll devices in an enterprise environment, configure and
update device settings over-the-air, and secure mobile devices. AirWatch enables to manage
devices including Android™, Apple® iOS, BlackBerry®, Mac® OS, Symbian® and
Windows® devices from a single administration console. AirWatch enables to gain visibility
into the devices connecting to your enterprise network, content and resources.
Benefits offered by the VMware AirWatch are:

• Manage different types of devices from a single console


• Allow employees to easily enroll their devices
• Enable secure access to corporate resources
• Integrate with existing enterprise infrastructure
• Support employee, corporate-owned and shared devices
• Gain visibility across mobile device deployment

VMware AppDefense

• Provides data center endpoint security


• Supports integration with third parties
• Provides automatic response
• Secures modern application

It has an authoritative understanding of how data center endpoints are meant to behave and
provides endpoint security to protect applications running in virtualized environments.
AppDefense understands an application’s intended state and behavior. It monitors changes
from the intended state that indicate a probable threat.
AppDefense ensures security in a data center environment in the following ways:

363
• Supports integration with third parties: Platforms such as RSA NetWitness Suite
leverage it for deeper application context within an enterprise’s virtual data center,
response automation/orchestration, and visibility into application attacks.
• Secures modern applications: AppDefense secures modern applications by protecting
the network and data center endpoints and also by encrypting enterprise data at rest.
• Provides automatic response: Uses vSphere and VMware NSX Data Center to
automate the correct response. It can automatically block process communication,
snapshot an endpoint for forensic analysis, and suspend or shut down the endpoint.

364
Question 1
What are the different levels to deploy security controls?

• Interface
• Storage (Correct)
• Compute (Correct)
• Network (Correct)

Question 2
How can you manage vulnerabilities in a modern data center?

• Maximize network attack
• Maximize the attack surface
• Minimize work factor
• Install security controls (Correct)

365
Question 3
Which of the following techniques is used for data shredding?

• Backups and replicas
• Masking
• Degaussing media (Correct)
• Hardening

366
Storage Infrastructure Management
Introduction to Storage Infrastructure
Management
What is Storage Infrastructure Management?
Definition: Storage Infrastructure Management

All the storage infrastructure-related functions that are necessary for the management of the
infrastructure components and services, and for the maintenance of data throughout its lifecycle.

• Aligns storage operations and services to an organization’s strategic business goal and
service level requirements
• Ensures that the storage infrastructure is operated optimally by using as few resources as
needed
• Ensures better utilization of existing infrastructure components

The key storage infrastructure components are compute systems, storage systems, and
storage area networks (SANs). These components could be physical or virtual and are used
to provide services to the users. The storage infrastructure management includes all the
storage infrastructure-related functions that are necessary for the management of the
infrastructure components and services, and for the maintenance of data throughout its
lifecycle. These functions help IT organizations to align their storage operations and services
to their strategic business goal and service level requirements. They ensure that the storage
infrastructure is operated optimally by using as few resources as needed. They also ensure
better utilization of existing components, thereby limiting the need for excessive ongoing
investment on infrastructure.
As organizations are driving their IT infrastructure to support modern data center
applications, the storage infrastructure management is also transformed to meet the
application requirements. Management functions are optimized to help an organization to
become a social networking, mobility, big data, or cloud service provider. This module
describes the storage infrastructure management from a service provider’s perspective.

Key Characteristics of Platform-centric Management
Modern data center management functions are different in many ways from the traditional management
and have the following set of distinctive characteristics:

• Service-focused approach

367
• Software-defined infrastructure-aware
• End-to-end visibility
• Orchestrated operations

Traditionally, storage infrastructure management is component specific. The management
tools only enable monitoring and management of specific component(s). This may cause
management complexity and system interoperability issues in a large environment that
includes many multi-vendor components residing in world-wide locations. In addition,
traditional management operations such as provisioning LUNs and zoning are mostly
manual. The provisioning tasks often take days to weeks to complete, due to rigid resource
acquisition process and long approval cycle.
Further, the traditional management processes and tools may not support a service oriented
infrastructure, especially if the requirement is to provide cloud services. They usually lack
the ability to execute management operations in an agile manner, respond to adverse events
quickly, coordinate the functions of distributed infrastructure components, and meet
sustained service levels. This component specific, extremely manual, time consuming, and
overly complex management is simply not appropriate for modern data center
infrastructure.

Service-focused Approach
The storage infrastructure management in a modern data center has a service-based focus. It is linked to
the service requirements and service level agreement (SLA). Service requirements cover the services to be
created/upgraded, service features, service levels, and infrastructure components that constitute a
service. An SLA is a formalized contract document that describes service level targets, service support
guarantee, service location, and the responsibilities of the service provider and the user. These parameters
of a service determine how the storage infrastructure will be managed.

Management functions linked to service requirements and the SLA:

• Determine optimal amount of storage space needed in a storage pool to meet the capacity
requirements of services
• Create a disaster recovery plan to meet the recovery time objective (RTO) of services
• Ensure that the management processes, management software, and staffing are appropriate
to provide services
• Return services to the users within agreed time period in the event of a service failure
• Validate changes to the storage infrastructure for creating or modifying a service

Software-Defined Infrastructure-aware
In a platform-centric environment, more value is given to the software-defined infrastructure
management over the traditional physical component-specific management, including:

368
• Software-defined infrastructure management is more valued over hardware-specific
management
• Management functions move to external software controller
• Many common, repeatable, hardware-specific management tasks are automated
• Management is focused on strategic, value-driven activities
• Management operations become independent of underlying hardware

Management functions are increasingly becoming decoupled from the physical


infrastructure and moving to external software controller. As a result of this shift, the
infrastructure components are managed through the software controller. The controller
usually has a native management tool for configuring components and creating services.
Administrators may also use independent management tools for managing the storage
infrastructure. Management tools interact with the controller commonly through the
application programming interfaces (APIs).
Management through a software controller has changed the way a traditional storage
infrastructure is operated. The software controller automates and abstracts many common,
repeatable, and physical component-specific tasks, thereby reducing the operational
complexity. This allows the administrators to focus on strategic, value-driven activities such
as aligning services with the business goal, improving resource utilization, and ensuring SLA
compliance.
Further, the software controller helps in centralizing the management operations. For
example, an administrator may set configuration settings related to automated storage
tiering, thin provisioning, backup, or replication from the management console. Thereafter,
these settings are automatically and uniformly applied across all the managed components
that may be distributed across wide locations. These components may also be proprietary or
commodity hardware manufactured by different vendors. But, the software controller
ensures that the management operations are independent of the underlying hardware.

End-to-end Visibility
• Management in modern data center environments provides end-to-end visibility into the
storage infrastructure components and deployed services.
o Provides information on the configuration, connectivity, capacity, performance, and
interrelationships of all components centrally
o Helps in consolidating reports, correlating issues, and tracking movement of data
and services across infrastructure
• End-to-end visibility of a storage infrastructure is provided by specialized monitoring tools

The end-to-end visibility of the storage infrastructure enables comprehensive and centralized
management. The administrators can view the configuration, connectivity, capacity,
performance, and interrelationships of all infrastructure components centrally. Further, it
helps in consolidating reports of capacity utilization, correlating issues in multiple
components, and tracking the movement of data and services across the infrastructure.
Depending on the size of the storage infrastructure and the number of services involved, the
administrators may have to monitor information about hundreds or thousands of
components located in multiple data centers. In addition, the configuration, connectivity, and
interrelationships of components change as the storage infrastructure grows, applications
scale, and services are updated. Organizations typically deploy specialized monitoring tools
that provide end-to-end visibility of a storage infrastructure on a digital dashboard. In
addition, they are capable of reporting relevant information in a rapidly changing and
varying workload environment.

Orchestrated Operations
Definition: Orchestration

Automated arrangement, coordination, and management of various system or component functions in a


storage infrastructure.

• Management operations are orchestrated as much as possible to provide business agility


o Reduces time to provide and manage a service
o Reduces risk of manual errors and administration cost
• An orchestrator programmatically integrates and sequences inter-related component
functions into workflows
o Triggers an appropriate workflow upon receiving a request

Orchestration refers to the automated arrangement, coordination, and management of


various system or component functions in a storage infrastructure. Orchestration, unlike an
automated activity, is not associated with a specific infrastructure component. Instead, it may
span multiple components, located in different locations depending on the size of a storage
infrastructure. In order to sustain in a modern data center environment, the storage
infrastructure management must rely on orchestration.
Management operations should be orchestrated as much as possible to provide business
agility. Orchestration reduces the time to configure, update, and integrate a group of
infrastructure components that are required to provide and manage a service. By automating
the coordination of component functions, it also reduces the risk of manual errors and the
administration cost.
A purpose-built software, called orchestrator, is commonly used for orchestrating component
functions in a storage infrastructure. The orchestrator provides a library of predefined
workflows for executing various management operations. Workflow refers to a series of
inter-related component functions that are programmatically integrated and sequenced to
accomplish a desired outcome. The orchestrator also provides an interface for administrators
or architects to define and customize workflows. It triggers an appropriate workflow upon
receiving a service provisioning or management request. Thereafter, it interacts with the
components as per the workflow to coordinate and sequence the execution of functions by
these components.

370
Orchestration Example
The example illustrates an orchestrated operation that creates a block volume for a compute system.
In this example, an administrator logs on to the management portal and initiates the volume creation
operation from the portal. The operation request is routed to the orchestrator which triggers a
workflow, as shown on the slide, to fulfill this request. The workflow programmatically integrates
and sequences the required compute, storage, and network component functions to create the block
volume.
The orchestrator interacts with the software-defined storage (SDS) controller to let the controller
carry out the operation according to the workflow. The SDS controller interacts with the
infrastructure components to enable the execution of component functions such as zoning, LUN
creation, and bus rescan. Through the workflow, the management portal receives the response on
the outcome of the operation.
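The workflow idea can be sketched as a simple sequence of component functions that the orchestrator integrates; the function names, WWPNs, and parameters below are illustrative placeholders rather than any vendor's API.

```python
# Illustrative workflow: each step stands in for a component function the SDS controller would execute.
def create_zone(switch, host_wwpn, storage_wwpn):
    print(f"[{switch}] zoning {host_wwpn} with {storage_wwpn}")

def create_lun(storage_system, pool, size_gb):
    print(f"[{storage_system}] creating {size_gb} GB LUN in pool {pool}")
    return "LUN_042"

def rescan_bus(compute_system):
    print(f"[{compute_system}] rescanning the storage bus to discover the new volume")

def block_volume_workflow(request):
    """Programmatically integrate and sequence the inter-related component functions."""
    create_zone(request["switch"], request["host_wwpn"], request["storage_wwpn"])
    lun = create_lun(request["storage_system"], request["pool"], request["size_gb"])
    rescan_bus(request["compute_system"])
    return {"status": "completed", "volume": lun}

# The orchestrator triggers the workflow when the portal routes a provisioning request to it.
result = block_volume_workflow({
    "switch": "SW1",
    "host_wwpn": "10:00:00:90:fa:18:0d:cf",
    "storage_wwpn": "50:06:01:6f:08:60:1e:bd",
    "storage_system": "StorageA",
    "pool": "Pool_Gold",
    "size_gb": 100,
    "compute_system": "esx161",
})
print(result)
```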

Storage Infrastructure Management


Functions
Storage infrastructure management performs two key functions: infrastructure discovery and
operations management. These functions are described next.

Definition: Discovery
A management function that creates an inventory of infrastructure components and provides
information about the components including their configuration, connectivity, functions,
performance, capacity, availability, utilization, and physical-to-virtual dependencies.

371
Infrastructure Discovery

• Discovery provides visibility into each infrastructure component


o Discovered information helps in monitoring and management
• Discovery tool interacts and collects information from components
• Discovery is typically scheduled to occur periodically
o May also be initiated by an administrator or triggered by an orchestrator

Infrastructure discovery provides the visibility needed to monitor and manage the
infrastructure components. Discovery is performed using a specialized tool that commonly
interacts with infrastructure components through the native APIs of these
components. Through this interaction, it collects information from the infrastructure
components.
A discovery tool may be integrated with the software-defined infrastructure controller,
bundled with a management software, or an independent software that passes discovered
information to a management software. Discovery is typically scheduled by setting an interval
for its periodic occurrence. Discovery may also be initiated by an administrator or be
triggered by an orchestrator when a change occurs in the storage infrastructure.
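A minimal sketch of scheduled discovery follows: poll each component periodically and build an inventory. The component names and returned attributes are illustrative placeholders; a real tool would call each component's native API.

```python
import time

# Illustrative component list (assumptions, not from the course).
COMPONENTS = ["storage-array-01", "fc-switch-01", "esx-host-161"]

def query_component(name):
    """Placeholder for a native API call returning configuration and capacity details."""
    return {"name": name, "status": "ok", "capacity_used_pct": 57, "discovered_at": time.time()}

def run_discovery():
    """Build an inventory of all infrastructure components."""
    return {name: query_component(name) for name in COMPONENTS}

def discovery_loop(interval_seconds=3600, cycles=2):
    """Scheduled, periodic discovery; it could also be invoked on demand or by an orchestrator."""
    for _ in range(cycles):
        inventory = run_discovery()
        print(f"Discovered {len(inventory)} components")
        time.sleep(interval_seconds)

# discovery_loop(interval_seconds=3600)   # in a real deployment, run on a schedule
print(run_discovery()["fc-switch-01"]["status"])
```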

Operations Management

• Involves on-going management activities to maintain storage infrastructure and deployed


services
• Key processes that support operations management activities are:
o Monitoring
o Configuration management
o Change management
o Capacity management
o Performance management
o Availability management
o Incident management
o Problem management
o Security management

Operations management involves several management processes. The slide lists the key processes that
support operations management activities. The subsequent lessons will describe these processes.
Ideally, operations management should be automated to ensure the operational agility. Management
tools are usually capable of automating many management operations. These automated operations
are described along with the management processes. Further, the automated operations of
management tools can also be logically integrated and sequenced through orchestration.

372
Operations Management
Introduction to Monitoring


Monitoring provides visibility into the storage infrastructure and forms the basis for performing
management operations. It helps:

• Track the performance and availability status of components and services


• Measure the utilization and consumption of resources by services
• Track events impacting availability and performance of components and services
• Generate reports and triggering alerts
• Track environment parameters (HVAC)

Monitoring forms the basis for performing management operations. Monitoring provides the
performance and availability status of various infrastructure components and services. It
also helps to measure the utilization and consumption of various storage infrastructure
resources by the services. This measurement facilitates the metering of services, capacity
planning, forecasting, and optimal use of these resources. Monitoring events in the storage
infrastructure, such as a change in the performance or availability state of a component or a
service, may be used to trigger automated routines or recovery procedures.
Such procedures can reduce downtime due to known infrastructure errors and the level of
manual intervention needed to recover from them. Further, monitoring helps in generating
reports for service usage and trends. It also helps to trigger alerts when thresholds are
reached, security policies are violated, and service performance deviates from SLA. Alerting
and reporting are detailed later in this module. Additionally, monitoring of the data center
environment parameters such as heating, ventilating, and air-conditioning (HVAC) helps in
tracking any anomaly from their normal status.

Monitoring Parameters
Storage infrastructure is primarily monitored for:

• Configuration
• Availability
• Capacity
• Performance
• Security

373
Monitoring Configuration
Monitoring configuration involves tracking configuration changes and deployment of storage
infrastructure components and services. It also detects configuration errors, non-compliance with
configuration policies, and unauthorized configuration changes.

The table lists configuration changes in the storage infrastructure shown in the image. These
configuration changes are captured and reported by a monitoring tool in real-time. In this
environment, a new zone was created to enable a compute system to access LUNs from one of the
storage systems. The changes were made on the FC switch (device).
Changed At              Description                                      Device              Compliance Breach

2019/01/07 @ 13:34:23   The member 10000090FA180DCF has been added       100000051E023364    No
                        to the zone esx161_vnx_152_1

2019/01/07 @ 13:34:23   The member 5006016F08601EBD has been added       100000051E023364    No
                        to the zone esx161_vnx_152_1

2019/01/07 @ 13:34:23   A new zone esx161_vnx_152_1 has been added       100000051E023364    No
                        to the fabric 100000051E023364

374
Monitoring Availability
Identifies the failure of any component or process that may lead to service unavailability or degraded
performance.

The figure illustrates an example of monitoring the availability of storage infrastructure components,
including:

• A storage infrastructure includes three compute systems (H1, H2, and H3) that are running
hypervisors
• All the compute systems are configured with two FC HBAs, each connected to the
production storage system through two FC switches, SW1 and SW2. All the compute
systems share two storage ports on the storage system.
• Multipathing software has also been installed on hypervisor running on all the three
compute systems. If one of the switches, SW1 fails, the multipathing software initiates a
path failover, and all the compute systems continue to access data through the other switch,
SW2.
• Due to the absence of a redundant switch, a second switch failure could result in unavailability
of the storage system. Monitoring for availability enables detecting the switch failure and
helps administrator to take corrective action before another failure occurs. In most cases,
the administrator receives symptom alerts for a failing component and can initiate actions
before the component fails.

Availability refers to the ability of a component or a service to perform its desired function
during its specified time of operation. Monitoring availability of hardware components (for
example, a port, an HBA, or a storage controller) or software component (for example, a
database instance or an orchestration software) involves checking their availability status by
reviewing the alerts generated from the system. For example, a port failure might result in a
chain of availability alerts.
A storage infrastructure commonly uses redundant components to avoid a single point of
failure. Failure of a component might cause an outage that affects service availability, or it
might cause performance degradation even though availability is not compromised.
Continuous monitoring for expected availability of each component and reporting any
deviation help the administrator to identify failing services and plan corrective action to
maintain SLA requirements.
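
To make the switch-failure example above concrete, the following minimal Python sketch checks whether each compute system still has a redundant path and raises a warning when it is running on a single path. The component names and state values are hypothetical.

```python
# Minimal sketch (hypothetical component names/states): detect when a switch failure
# leaves compute systems with no redundant path to the storage system.
paths = {
    "H1": {"SW1": "failed", "SW2": "ok"},
    "H2": {"SW1": "failed", "SW2": "ok"},
    "H3": {"SW1": "failed", "SW2": "ok"},
}

for host, switches in paths.items():
    healthy = [sw for sw, state in switches.items() if state == "ok"]
    if not healthy:
        print(f"FATAL: {host} has lost all paths to the storage system")
    elif len(healthy) == 1:
        print(f"WARNING: {host} is running on a single path via {healthy[0]}; "
              "another failure causes storage unavailability")
```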

Monitoring Capacity
Tracks the amount of storage infrastructure resources used and free.

The figure provides an example that illustrates the importance of monitoring NAS file system capacity:

• If the file system is full and no space is available for applications to perform write I/O, it
may result in an application or service outage
• Monitoring tools can be configured to issue a notification when thresholds are reached on
the file system capacity; for example:
o When the file system reaches 66 percent of its capacity, a warning message is issued,
and a critical message is issued when the file system reaches 80 percent of its
capacity
o This enables the administrator to take actions to provision additional LUNs to the
NAS and extend the NAS file system before it runs out of capacity
• Proactively monitoring the file system can prevent service outages caused by a lack of
file system space

Capacity refers to the total amount of storage infrastructure resources available. Inadequate
capacity leads to degraded performance or even service unavailability. Monitoring capacity
involves examining the amount of storage infrastructure resources used and usable, such as
the free space available on a file system or a storage pool, the number of ports available on
a switch, or the utilization of storage space allocated to a service.
Monitoring capacity helps an administrator to ensure uninterrupted data availability and
scalability by averting outages before they occur. For example, if 90 percent of the ports are
utilized in a particular SAN fabric, this could indicate that a new switch might be required if
more compute and storage systems need to be attached to the same fabric. Monitoring usually
leverages analytical tools to perform capacity trend analysis. These trends help to understand
future resource requirements and provide an estimation of the time required to deploy them.
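
A minimal Python sketch of the threshold-based notification described in the NAS file-system example is shown below; it uses the 66 percent warning and 80 percent critical thresholds from the example, and the function name and sample values are assumptions for illustration only.

```python
# Minimal sketch (hypothetical names): classify NAS file-system utilization against
# the warning (66%) and critical (80%) thresholds described in the example.
WARNING_THRESHOLD = 66   # percent
CRITICAL_THRESHOLD = 80  # percent

def capacity_alert(used_gb: float, total_gb: float) -> str:
    utilization = used_gb / total_gb * 100
    if utilization >= CRITICAL_THRESHOLD:
        return f"CRITICAL: file system {utilization:.0f}% full - extend it now"
    if utilization >= WARNING_THRESHOLD:
        return f"WARNING: file system {utilization:.0f}% full - plan additional capacity"
    return f"OK: file system {utilization:.0f}% full"

print(capacity_alert(used_gb=720, total_gb=1000))   # WARNING at 72%
```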

Monitoring Performance
Evaluates how efficiently the infrastructure components and services are performing.

The figure provides an example that illustrates the importance of monitoring performance on iSCSI storage
systems; in this example:

• Compute systems H1, H2, and H3 (with two iSCSI HBAs each) are connected to the storage
system through Ethernet switches SW1 and SW2
• The three compute systems share the same storage ports on the storage system to access
LUNs
• A new compute system running an application with a high workload must be deployed to
share the same storage port as H1, H2, and H3
• Monitoring storage port utilization ensures that the new compute system does not adversely
affect the performance of the other compute systems

Utilization of the shared storage port is shown by the solid and dotted lines in the graph. If the port
utilization prior to deploying the new compute system is close to 100 percent, then deploying the new
compute system is not recommended because it might impact the performance of the other compute
systems. However, if the utilization of the port prior to deploying the new compute system is closer to the
dotted line, then there is room to add a new compute system.

Performance monitoring evaluates how efficiently different storage infrastructure
components and services are performing and helps to identify bottlenecks. Performance
monitoring measures and analyzes behavior in terms of response
time, throughput, and I/O wait time. It identifies whether the behavior of infrastructure
components and services meets the acceptable and agreed performance level. This helps to
identify performance bottlenecks. It also deals with the utilization of resources, which affects
the way resources behave and respond.
For example, if a VM continuously experiences 80 percent processor utilization, it
suggests that the VM may be running out of processing power, which can lead to degraded
performance and slower response time. Similarly, if the cache and controllers of a storage
system are consistently overutilized, it may lead to performance degradation.
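
The port-utilization decision from the iSCSI example can be expressed as a simple headroom check. The sketch below is a minimal Python illustration; the utilization numbers and the 80 percent ceiling are assumptions, not values from the course.

```python
# Minimal sketch (illustrative numbers): decide whether a new compute system can share
# an already-used storage port without pushing utilization beyond a safe ceiling.
def can_add_workload(current_util_pct: float, expected_extra_pct: float,
                     ceiling_pct: float = 80.0) -> bool:
    """Return True if the port still has headroom for the new workload."""
    return current_util_pct + expected_extra_pct <= ceiling_pct

print(can_add_workload(current_util_pct=45.0, expected_extra_pct=25.0))  # True  -> room to add
print(can_add_workload(current_util_pct=70.0, expected_extra_pct=25.0))  # False -> do not deploy
```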

Monitoring Security
Tracks unauthorized access and configuration changes to the storage infrastructure and services.

This figure illustrates the importance of monitoring security in a storage system. In this example:

• The storage system is shared between two workgroups, WG1 and WG2
• The data of WG1 should not be accessible by WG2 and vice versa
• A user from WG1 might try to make a local replica of the data that belongs to WG2
• If this action is not monitored or recorded, it is difficult to track such a violation of security
protocols
• Conversely, if this action is monitored, a warning message can be sent to prompt a
corrective action or at least enable discovery as part of regular auditing operations

Monitoring a storage infrastructure for security includes tracking unauthorized access,
whether accidental or malicious, and unauthorized configuration changes. For example,
monitoring tracks and reports the initial zoning configuration performed and all the
subsequent changes. Another example of monitoring security is to track login failures and
unauthorized access to switches for performing administrative changes.
IT organizations typically comply with various information security policies that may be
specific to government regulations, organizational rules, or deployed services. Monitoring
detects all operations and data movement that deviate from predefined security policies.
Monitoring also detects unavailability of information and services to authorized users due to
a security breach. Further, physical security of a storage infrastructure can also be
continuously monitored using badge readers, biometric scans, or video cameras.
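
The following minimal Python sketch illustrates the two situations described above: a cross-workgroup replica attempt (the WG1/WG2 example) and repeated login failures on a switch. The audit-event format, user names, and three-failure threshold are hypothetical.

```python
# Minimal sketch (hypothetical event format): scan audit events for cross-workgroup
# access and repeated login failures, the two situations described above.
from collections import Counter

events = [
    {"user": "wg1_user", "workgroup": "WG1", "action": "create_replica", "target_owner": "WG2"},
    {"user": "admin2", "action": "login_failure"},
    {"user": "admin2", "action": "login_failure"},
    {"user": "admin2", "action": "login_failure"},
]

failures = Counter(e["user"] for e in events if e["action"] == "login_failure")

for e in events:
    if e["action"] == "create_replica" and e.get("target_owner") != e.get("workgroup"):
        print(f"SECURITY WARNING: {e['user']} ({e['workgroup']}) replicated data owned by {e['target_owner']}")

for user, count in failures.items():
    if count >= 3:   # assumed threshold for repeated login failures
        print(f"SECURITY WARNING: {count} consecutive login failures for {user}")
```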

Alerting
Alerts are system-to-user notifications

• Provide information about events or impending threats or issues


• Keep administrators informed on the status of components, processes, and services
• Trigger when specific situations or conditions are reached
o Conditions may be defined through monitoring tool

Type of Alert: Information
Description: Provides useful information; does not require administrator intervention
Example: Creation of a zone or LUN; creation of a new storage pool

Type of Alert: Warning
Description: Requires administrative attention
Example: Storage pool is becoming full; soft media errors

Type of Alert: Fatal
Description: Requires immediate attention
Example: Storage pool is full; multiple disk failures in a RAID set

An alert is a system-to-user notification that provides information about events or impending
threats or issues. Alerting of events is an integral part of monitoring. Alerting keeps
administrators informed about the status of various components and processes – for example,
conditions such as failure of power, storage drives, memory, switches, or an availability zone,
which can impact the availability of services and require immediate administrative attention.
Other conditions, such as a file system reaching a capacity threshold, an operation breaching
a configuration policy, or a soft media error on storage drives, are considered warning signs
and may also require administrative attention.
Monitoring tools enable administrators to define various alerted conditions and assign
different severity levels for these conditions based on the impact of the conditions. Whenever
a condition with a particular severity level occurs, an alert is sent to the administrator, an
orchestrated operation is triggered, or an incident ticket is opened to initiate a corrective
action. Alert classifications can range from information alerts to fatal alerts. Information
alerts provide useful information but do not require any intervention by the administrator.
The creation of a zone or LUN is an example of an information alert. Warning alerts require
administrative attention so that the alerted condition is contained and does not affect service
availability. For example, if an alert indicates that a storage pool is approaching a predefined
threshold value, the administrator can decide whether additional storage drives need to be
added to the pool. Fatal alerts require immediate attention because the condition might affect
the overall performance or availability. For example, if multiple disks fail in a RAID set, the
administrator must attend to it immediately.
As every IT environment is unique, most monitoring systems require initial set-up and
configuration, including defining what types of alerts should be classified as informational,
warning, and fatal. Whenever possible, an organization should limit the number of truly
critical alerts so that important events are not lost amidst informational messages.
Continuous monitoring, with automated alerting, enables administrators to respond to
failures quickly and proactively. Alerting provides information that helps administrators
prioritize their response to events.
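
A minimal Python sketch of this condition-to-severity classification is shown below. The mapping of conditions to severities mirrors the information/warning/fatal scheme above, but the condition names and follow-up actions are assumptions for illustration.

```python
# Minimal sketch (assumed condition-to-severity mapping): classify monitored conditions
# and decide the follow-up action, mirroring the information/warning/fatal scheme above.
SEVERITY_MAP = {
    "zone_created":        "information",
    "lun_created":         "information",
    "pool_nearing_full":   "warning",
    "soft_media_error":    "warning",
    "pool_full":           "fatal",
    "multiple_disk_fail":  "fatal",
}

def handle_condition(condition: str) -> str:
    severity = SEVERITY_MAP.get(condition, "information")
    if severity == "fatal":
        return f"{condition}: FATAL - open an incident ticket and notify the administrator immediately"
    if severity == "warning":
        return f"{condition}: WARNING - notify the administrator for follow-up"
    return f"{condition}: INFO - log only, no intervention required"

print(handle_condition("pool_nearing_full"))
print(handle_condition("multiple_disk_fail"))
```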

Reporting
• Involves gathering information from various components or processes and generating
reports
• Reports are displayed on a digital dashboard
o Provides real-time tabular or graphical views of monitored information
• Commonly used reports are:
o Capacity planning report
o Configuration and asset management reports
o Chargeback report
o Performance report
o Security breach report

Like alerting, reporting is also associated with monitoring. Reporting on a storage
infrastructure involves keeping track of and gathering information from the various components
and processes that are monitored. The gathered information is compiled to generate reports
for trend analysis, capacity planning, chargeback, performance, and security breaches.
Capacity planning reports contain current and historic information about the utilization of
storage, file systems, database tablespace, ports, etc.
Configuration and asset management reports include details about device allocation, local or
remote replicas, and fabric configuration. This report also lists all the equipment, with
details, such as their purchase date, lease status, and maintenance records. Chargeback
reports contain information about the allocation or utilization of storage infrastructure
resources by various users or user groups. Performance reports provide current and
historical information about the performance of various storage infrastructure components
and services as well as their compliance with agreed service levels. Security breach reports
provide details on the security violations, the duration of the breach, and its impact.
Reports are commonly displayed on a digital dashboard, which provides real-time tabular or
graphical views of the gathered information. Dashboard reporting helps administrators to make
instantaneous and informed decisions on resource procurement, plans for modifications in
the existing infrastructure, policy enforcement, and improvements in management processes.
Example – Chargeback Report
The ability to measure storage resource consumption per business unit or user group and charge them
back accordingly.

To perform chargeback, storage usage data is collected by a billing system that generates a chargeback
report for each business unit or user group. The billing system is responsible for accurate measurement of
the number of units of storage used and reports the cost/charge for the consumed units.

The figure shows the assignment of storage resource as services to two business units, Payroll_1 and
Engineering_1, and presents a sample chargeback report.

In this example, each business unit is using a set of compute systems that are running
hypervisor. The VMs hosted on these compute systems are used by the business units. LUNs
are assigned to the hypervisor from the production storage system. Storage system-based
replication technology is used to create both local and remote replicas. A chargeback report
documenting the exact amount of storage resources used by each business unit is created by
a billing system. If the unit for billing is GB of raw storage, the exact amount of raw space
(usable capacity plus protection provided) configured for each business unit must be
reported.
Consider that the Payroll_1 unit has consumed two production LUNs, each 50 GB in size.
Therefore, the storage allocated to the hypervisor is 100 GB (50 + 50). The allocated storage
for local replication is 100 GB and for remote replication is also 100 GB. From the allocated
storage, the raw storage configured for the hypervisor is determined based on the RAID
protection that is used for various storage pools. If the Payroll_1 production LUNs are RAID
1-protected, the raw space used by the production volumes is 200 GB.
Assume that the local replicas are on unprotected LUNs, and the remote replicas are
protected with a RAID 5 configuration, then 100 GB of raw space is used by the local replica
and 125 GB by the remote replica. Therefore, the total raw capacity used by the Payroll_1
unit is 425 GB. The total cost of storage provisioned for Payroll_1 unit will be $2,125 (assume
cost per GB of raw storage is $5). The Engineering_1 unit also uses two LUNs, but each 100
GB in size. Considering the same RAID protection and per unit cost, the chargeback for the
Engineering_1 unit will be $3,500.
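
The Payroll_1 arithmetic above can be reproduced with a short worked calculation. The sketch below uses the overhead factors stated in the example (RAID 1 doubles the raw space, the unprotected local replica adds none, RAID 5 with a 4+1 layout adds 25 percent) and the assumed $5 per GB of raw storage.

```python
# Minimal sketch reproducing the Payroll_1 chargeback arithmetic described above
# (RAID 1 = 2x raw, unprotected local replica = 1x, RAID 5 (4+1) = 1.25x, $5 per GB raw).
COST_PER_GB = 5.0

def raw_capacity(allocated_gb: float, overhead_factor: float) -> float:
    return allocated_gb * overhead_factor

payroll_raw = (
    raw_capacity(100, 2.0)     # two 50 GB production LUNs, RAID 1 protected  -> 200 GB
    + raw_capacity(100, 1.0)   # local replica on unprotected LUNs            -> 100 GB
    + raw_capacity(100, 1.25)  # remote replica, RAID 5 (4+1) protected       -> 125 GB
)

print(f"Payroll_1 raw capacity: {payroll_raw:.0f} GB")               # 425 GB
print(f"Payroll_1 chargeback:   ${payroll_raw * COST_PER_GB:,.0f}")  # $2,125
```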

Operations Management Processes
Some of the main processes of operation management include:

• Configuration management
• Change management
• Capacity management
• Performance management
• Availability management
• Incident management
• Problem management
• Security management

Configuration Management
Goal: Configuration Management

Maintains information about “configuration items (CIs)” that are required to deliver services.

Key functions:

• Discovers and maintains information on CIs in a configuration management system (CMS)


• Updates CMS when new CIs are deployed or CI attributes change

Examples of CI information:

• Attributes of CIs such as CI’s name, manufacturer name, serial number, license status,
version, location, and inventory status
• Used and available capacity of CIs
• Issues linked to CIs
• Inter-relationships among CIs such as service-to-user, storage pool-to-service, storage
system-to-storage pool, and storage system-to-SAN switch

Configuration management is responsible for maintaining information about configuration
items (CIs). CIs are components such as services, process documents, infrastructure
components including hardware and software, people, and SLAs that need to be managed in
order to deliver services. The information about CIs include their attributes, used and
available capacity, history of issues, and inter-relationships. Examples of CI attributes are the
CI’s name, manufacturer name, serial number, license status, version, description of
modification, location, and inventory status (for example, on order, available, allocated, or
retired). The inter-relationships among CIs in a storage infrastructure commonly include
service-to-user, storage pool-to-service, storage volume-to-storage pool, storage system-to-
storage pool, storage system-to-SAN switch, and data center-to-geographic location.
All information about CIs is usually collected and stored by the discovery tools in a single
database or in multiple autonomous databases mapped into a federated database called a
configuration management system (CMS). Discovery tools also update the CMS when new
CIs are deployed or when attributes of CIs change. CMS provides a consolidated view of CI
attributes and relationships, which is used by other management processes for their
operations. For example, CMS helps the security management process to examine the
deployment of a security patch on VMs, the problem management to resolve a connectivity
issue, or the capacity management to identify the CIs affected on expansion of a storage pool.
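
The sketch below illustrates how a CI with its attributes and inter-relationships might be represented in a CMS-like in-memory structure and queried for impact analysis. It is a minimal Python illustration; the field names, CI names, and relationship model are assumptions, not the schema of any particular CMS product.

```python
# Minimal sketch (assumed field names): represent configuration items (CIs) and their
# inter-relationships in a CMS-like in-memory structure, then query affected CIs.
from dataclasses import dataclass, field

@dataclass
class ConfigurationItem:
    name: str
    ci_type: str                                      # e.g., "storage pool", "service", "SAN switch"
    attributes: dict = field(default_factory=dict)
    related_to: list = field(default_factory=list)    # names of related CIs

cms = {
    "Pool_A": ConfigurationItem("Pool_A", "storage pool",
                                {"capacity_gb": 2048, "used_gb": 1500},
                                related_to=["Service_CRM", "Array_1"]),
    "Service_CRM": ConfigurationItem("Service_CRM", "service", {"sla": "Gold"}),
    "Array_1": ConfigurationItem("Array_1", "storage system", {"serial": "XYZ123"}),
}

def impacted_by(ci_name: str):
    """Return CIs that reference the given CI, e.g., to assess the impact of a pool expansion."""
    return [ci.name for ci in cms.values() if ci_name in ci.related_to]

print(impacted_by("Service_CRM"))   # ['Pool_A']
```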

Capacity Management
Goal: Capacity Management

Ensures that a storage infrastructure is able to meet the required capacity demands for services in a cost
effective and timely manner.

Key functions:

• Determines optimal amount of storage needed to meet SLA


• Maximizes capacity utilization without impacting service levels
• Establishes capacity consumption trends and plans for additional capacity

Examples of capacity management activities:

• Adding new nodes to a scale-out NAS cluster or an object-based storage system


• Enforcing capacity quotas for users
• Expanding a storage pool and setting a threshold for maximum utilization
• Forecasting usage of file system, LUN, and storage pool
• Removing unused resources from a service and reassigning those to another

Capacity management ensures adequate availability of storage infrastructure resources to
provide services and meet SLA requirements. It determines the optimal amount of storage
required to meet the needs of a service regardless of dynamic resource consumption and
seasonal spikes in storage demand. It also maximizes the utilization of available capacity and
minimizes spare and stranded capacity without compromising the service levels.
Capacity management tools are usually capable of gathering historical information on
storage usage over a specified period of time, establishing trends on capacity consumption,
and performing predictive analysis of future demand. This analysis serves as input to the
capacity planning activities and enables the procurement and provisioning of additional
capacity in the most cost effective and least disruptive manner.
Adding new nodes to a scale-out NAS cluster or an object-based storage system is an example
of capacity management. Addition of nodes increases the overall processing power, memory,
or storage capacity. Enforcing capacity quotas for users is another example of capacity
management. Provisioning a fixed amount of space for their files restricts users from
exceeding the allocated capacity. Other examples include creating and expanding a storage
pool, setting a threshold for the maximum utilization and amount of oversubscription allowed
for each storage pool, forecasting the usage of file systems, LUNs, and storage pools, and
removing unused resources from a service for their reassignment to another resource-
constrained service.
Capacity management team uses several methods to maximize the utilization of capacity.
Some of the common methods are over-commitment of processing power and memory, data
deduplication and compression, automated storage tiering, and use of converged network
such as an FCoE SAN.

Capacity Management Example
This example illustrates the expansion of a NAS file system using an orchestrated workflow. The
file system is expanded to meet the capacity requirement of a compute cluster that accesses the file
system.
In this example, an administrator initiates a file system expansion operation from the management
portal. The operation request is transferred to the orchestrator that triggers a change approval and
execution workflow. The orchestrator determines whether the request for change needs to be
reviewed by change management team. If the request is preapproved, it is exempted from change
management review. If not, the orchestrated workflow ensures that the change management team
reviews and approves/rejects the request.
If the file system expansion request is approved, the orchestrator interacts with the SDS controller
to invoke the expansion. Thereafter, the SDS controller interacts with the storage infrastructure
components to add the required capacity to the file system. The orchestrated workflow also invokes
the discovery operation which updates the CMS with information on the modified file system size.
The orchestrator responds by sending updates to the management portal appropriately following
completion or rejection of the expansion operation.
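
The approval-and-execution flow described above can be sketched as a small orchestrated workflow. The Python below is only an illustration of that sequence; the function names, the pre-approval rule, and the SDS/CMS calls are hypothetical placeholders rather than a real orchestrator API.

```python
# Minimal sketch (hypothetical functions and pre-approval rule): orchestrated workflow
# for the file-system expansion described above.
def expand_file_system(request):
    if not request.get("preapproved") and not change_management_review(request):
        return "Rejected by change management"
    sds_controller_expand(request["file_system"], request["additional_gb"])  # add capacity
    update_cms(request["file_system"])                                       # rediscover new size
    return "Expansion completed"

def change_management_review(request):
    # Placeholder: in practice the change management team approves or rejects the request.
    return request["additional_gb"] <= 500

def sds_controller_expand(fs, gb):
    print(f"SDS controller: expanding {fs} by {gb} GB")

def update_cms(fs):
    print(f"Discovery: CMS updated with new size of {fs}")

print(expand_file_system({"file_system": "nas_fs01", "additional_gb": 200, "preapproved": False}))
```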

Performance Management
Goal: Performance Management

Monitors, measures, analyzes, and improves the performance of storage infrastructure and services.

Key functions:

• Measures and analyzes the response time and throughput of components


• Identifies components that are performing below the expected level
• Makes configuration changes to optimize performance and address issues

Examples of performance management activities:

• Tuning database design, resource allocation to VMs, and multipathing


• Adding new ISLs and aggregating links to eliminate bottleneck
• Separating sequential and random I/Os to different spindles
• Changing storage tiering policy and cache configuration

Performance management ensures the optimal operational efficiency of all infrastructure
components so that storage services can meet or exceed the required performance level.
Performance-related data such as response time and throughput of components are collected,
analyzed, and reported by specialized management tools. The performance analysis provides
information on whether a component meets the expected performance levels. These tools also
proactively alert administrators about potential performance issues and may prescribe a
course of action to improve a situation.
Performance management team carries out several activities to address performance-related
issues and improve the performance of the storage infrastructure components. For example,
to optimize the performance levels, activities on the compute system include fine-tuning the
volume configuration, database design or application layout, resource allocation to VMs,
workload balancing, and multipathing configuration. The performance management tasks
on a SAN include implementing new ISLs and aggregating links in a multiswitch fabric to
eliminate a performance bottleneck. The storage system-related tasks include separating
sequential and random I/Os to different spindles, selecting an appropriate RAID type for a
storage pool, and changing the storage tiering policy and cache configuration.

Availability Management
Goal: Availability Management

Ensures that the availability requirements of all the components and services are consistently met.

Key functions:

• Establishes a guideline to meet stated availability levels at a justifiable cost

• Identifies availability-related issues and areas for improvement
• Proposes changes in existing BC solutions or architects new BC solutions

Examples of availability management activities

• Deploying redundant, fault tolerant, and hot-swappable components


• Deploying compute cluster, fault resilient applications, and multipathing software
• Designing multiple availability zones for automated service failover
• Planning and architecting data backup and replication solutions

Availability management is responsible for establishing a proper guideline based on the
defined availability levels of services. The guideline includes the procedures and technical
features required to meet or exceed both current and future service availability needs at a
justifiable cost. Availability management also identifies all availability-related issues in a
storage infrastructure and areas where availability must be improved. The availability
management team proactively monitors whether the availability of existing services and
components is maintained within acceptable and agreed levels. The monitoring tools also help
administrators to identify the gap between the required availability and the achieved
availability. With this information, the administrators can quickly identify errors or faults in
the infrastructure components that may cause future service unavailability.
Based on the service availability requirements and areas found for improvement, the
availability management team may propose new business continuity (BC) solutions or
changes in the existing BC solutions. For example, when a set of compute systems is deployed
to support a service or any critical business function, it requires high availability. The
availability management team proposes redundancy at all levels, including components, data,
or even site levels. This is generally accomplished by deploying two or more HBAs per system,
multipathing software, and compute clustering.
The compute systems must be connected to the storage systems using at least two independent
fabrics and switches that have built-in redundancy and hot-swappable components. The VMs
running on these compute systems must be protected from hardware failure/unavailability
through VM failover mechanisms. Deployed applications should have built-in fault resiliency
features. The storage systems should also have built-in redundancy for various components
and should support local and remote replication. RAID-protected LUNs should be
provisioned to the compute systems using at least two front-end ports. In addition, multiple
availability zones may be created to support fault tolerance at the site level.
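
A small worked calculation helps show why redundancy at every level is proposed. The sketch below applies the standard availability arithmetic (availability = MTBF / (MTBF + MTTR), and the combined availability of two independent redundant components = 1 - (1 - A)^2) to an FC switch pair; the MTBF and MTTR figures are illustrative assumptions.

```python
# Minimal sketch (illustrative numbers): standard availability arithmetic showing why
# redundant components raise the availability of the path as a whole.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

switch = availability(mtbf_hours=50_000, mttr_hours=4)   # single FC switch
redundant_pair = 1 - (1 - switch) ** 2                   # two independent switches in parallel

print(f"Single switch availability:        {switch:.6f}")
print(f"Redundant switch pair availability: {redundant_pair:.8f}")
```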

Incident Management
Goal: Incident Management

Returns services to users as quickly as possible when unplanned events, called ‘incidents’, interrupt
services or degrade service quality.

Key functions:

• Detects and records all incidents in a storage infrastructure


• Investigates incidents and provides solutions to resolve the incidents
• Documents incident history

The table provides a sample list of incidents that are captured by an incident management tool.

Severity: Fatal
Summary: Pool A usage is 95%
Event Type: Incident
Device: Storage system 1
Priority: None
Status: New
Last Updated: 2019/01/07 12:38:34
Owner: -
Escalation: No

Severity: Fatal
Summary: Database 1 is down
Event Type: Incident
Device: DB server 1
Priority: High
Status: WIP
Last Updated: 2019/01/07 10:11:03
Owner: L. John
Escalation: Support Group 2

Severity: Warning
Summary: Port 3 utilization is 85%
Event Type: Incident
Device: Switch A
Priority: Medium
Status: WIP
Last Updated: 2019/01/07 09:48:14
Owner: P. Kim
Escalation: Support Group 1

An incident is an unplanned event such as an HBA failure or an application error that may cause
an interruption to services or degrade the service quality. Incident management is responsible for
detecting and recording all incidents in a storage infrastructure. It investigates the incidents and
provides appropriate solutions to resolve the incidents. It also documents the incident history with
details of the incident symptoms, affected services, components and users, time to resolve the
incident, severity of the incident, description of the error, and the incident resolution data. The
incident history is used as an input for problem management (described next).
Incidents are commonly detected and logged by incident management tools. They also help
administrators to track, escalate, and respond to the incidents from their initiation to closure.
Incidents may also be registered by the users through a self-service portal, emails, or a service desk.
The service desk may consist of a call center to handle a large volume of telephone calls and a help
desk as the first line of service support. If the service desk is unsuccessful in providing solutions
against the incidents, they are escalated to other incident management support groups or to problem
management.
The incident management support groups investigate the incidents escalated by the incident
management tools or service desk. They provide solutions to bring back the services within an
agreed timeframe specified in the SLA. If the support groups are unable to determine and correct
the root cause of an incident, error-correction activity is transferred to problem management. In
this case, the incident management team provides a temporary solution (workaround) to the
incident; for example, migration of a storage service to a different storage pool in the same data
center or in a different data center. During the incident resolution process, the affected users are
kept apprised of the incident status.
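
The escalation path described above, from the service desk through the support groups toward problem management, can be sketched as a simple state change on an incident record. The Python below is only an illustration; the field names, escalation levels, and sample incident are hypothetical.

```python
# Minimal sketch (hypothetical fields): record an incident and escalate it along the
# path described above when the current level cannot resolve it within the SLA.
ESCALATION_PATH = ["service desk", "support group 1", "support group 2", "problem management"]

incident = {
    "summary": "Database 1 is down",
    "severity": "fatal",
    "owner": ESCALATION_PATH[0],
    "status": "new",
}

def escalate(inc):
    """Move the incident to the next level of support; document the step in its history."""
    current = ESCALATION_PATH.index(inc["owner"])
    if current + 1 < len(ESCALATION_PATH):
        inc["owner"] = ESCALATION_PATH[current + 1]
        inc.setdefault("history", []).append(f"escalated to {inc['owner']}")
    return inc

escalate(incident)          # service desk could not resolve the incident
print(incident["owner"])    # support group 1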

Problem Management
Goal: Problem Management

Prevents incidents that share common symptoms or root causes from reoccurring, and minimizes the
adverse impact of incidents that cannot be prevented.

Key functions:

• Reviews incident history to detect problems in a storage infrastructure


• Identifies the underlying root cause that creates a problem
o Integrated incident and problem management tools may mark specific incidents as
a problem and perform root cause analysis
• Provides most appropriate solution/preventive remediation for problems
• Analyzes and solves errors proactively before they become an incident/problem

A problem is recognized when multiple incidents exhibit one or more common symptoms.
Problems may also be identified from a single significant incident that is indicative of a single
error for which the cause is unknown, but the impact is high. Problem management reviews
all incidents and their history to detect problems in a storage infrastructure. It identifies the
underlying root cause that creates a problem and provides the most appropriate solution
and/or preventive remediation for the problem. If complete resolution is not available,
problem management provides solutions to reduce or eliminate the impact of a problem. In
addition, the problem management proactively analyzes errors and alerts in the storage
infrastructure to identify impending service failures or quality degradation. It solves errors
before they turn out to be an incident or a problem.
Incident and problem management, although separate management processes, require
automated interaction between them and use integrated incident and problem management
tools. These tools may help an administrator to track and mark specific incident(s) as a
problem and transfer the matter to problem management for further investigation.
Alternatively, these tools may automatically identify incidents that are most likely to require
root cause analysis. Further, these tools may have analytical ability to perform root cause
analysis based on various alerts. They search alerts that are indicative of problems and
correlate these alerts to find the root cause. This helps to resolve problems more quickly.
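
The correlation idea above, recognizing a problem when several incidents share a symptom, can be sketched in a few lines of Python. The grouping rule and the three-incident threshold are assumptions for illustration, not part of any specific tool.

```python
# Minimal sketch (assumed grouping rule): mark a problem candidate when several
# incidents share the same symptom, so it can be handed to root cause analysis.
from collections import defaultdict

incidents = [
    {"id": 101, "symptom": "slow response on LUN 5"},
    {"id": 102, "symptom": "slow response on LUN 5"},
    {"id": 103, "symptom": "slow response on LUN 5"},
    {"id": 104, "symptom": "login failure on switch A"},
]

by_symptom = defaultdict(list)
for inc in incidents:
    by_symptom[inc["symptom"]].append(inc["id"])

PROBLEM_THRESHOLD = 3   # assumed: three or more similar incidents indicate a problem
for symptom, ids in by_symptom.items():
    if len(ids) >= PROBLEM_THRESHOLD:
        print(f"Problem candidate for root cause analysis: '{symptom}' (incidents {ids})")
```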

Security Management
Goal: Security Management

Prevents occurrence of incidents/activities adversely affecting confidentiality, integrity, and availability of
information and meets regulatory/compliance requirements for protecting information at
reasonable/acceptable costs.

Key functions:

• Develops information security policies

• Deploys required security architecture, processes, mechanisms, and tools

Examples of security management activities:

• Managing user accounts and access policies that authorize users to use a service
• Deploying controls at multiple levels (defense in depth) to access data and services
• Scanning applications and databases to identify vulnerabilities
• Configuring zoning, LUN masking, and data encryption services

Security management ensures the confidentiality, integrity, and availability of information
in a storage infrastructure. It prevents the occurrence of security-related incidents or
activities that adversely affect the infrastructure components, management processes,
information, and services. It also meets regulatory or compliance requirements (both internal
and external) for protecting information at reasonable/acceptable costs. External compliance
requirements include adherence to the legal frameworks such as U.K. Data Protection Act
1998, U.K. Freedom of Information Act 2000, U.S. Health Insurance Portability and
Accountability Act 1996, and EU Data Protection Regulation. Internal regulations are
imposed based on an organization’s information security policies such as access control
policy, bring-your-own-device (BYOD) policy, and policy on the usage of cloud storage.
Security management is responsible for developing information security policies that govern
the organization’s approach towards information security management. It establishes the
security architecture, processes, mechanisms, tools, user responsibilities, and standards
needed to meet the information security policies in a cost-effective manner. It also ensures
that the required security processes and mechanisms are properly implemented.
Security management team performs various activities to prevent unauthorized access and
security breaches in a storage infrastructure. For example, the security management team
manages the user accounts and access policies that authorize users to use a service. Further,
the access to data and services is controlled at multiple levels (defense in depth) reducing the
risk of a security breach if a protection mechanism at one level gets compromised.
Applications and databases are also scanned periodically to identify vulnerabilities and
provide protection against any threats. The security management activities in a SAN include
configuration of zoning to restrict an unauthorized HBA from accessing specific storage
system ports and providing mechanisms to transport encrypted data. Similarly, the security
management task on a storage system includes LUN masking that restricts a compute system
from accessing a defined set of LUNs.

Concepts In Practice
Dell EMC SRM

• Shows relationships and topology of components


• Shows capacity utilization and configuration compliance
• Helps in capacity planning and chargeback reporting

Management software for automated monitoring and reporting of both traditional and software-
defined storage infrastructure. It provides visibility to the relationships and topology from applications
hosted on virtual or physical machines down to the LUNs. It also enables administrators to analyze
performance trends, capacity utilization, and configuration compliance. With this insight, it helps
administrators to optimize storage capacity through the alignment of application workload to the right
storage tier, capacity planning, and chargeback reporting.

Dell EMC Service Assurance Suite

• Discovers infrastructure components


• Detects and correlates events to find problems
• Identifies root causes and risk conditions

Offers a combination of management tools, including Smarts and M&R (formerly known as Watch4net),
to perform IT operations in a software-defined data center. It discovers infrastructure components and
collects detailed information about each one, including configuration and inter-relationships among components.
It detects and correlates events related to availability, performance, and configuration status of
infrastructure components that may occur due to problems. It also identifies the root causes of the
problems and risk conditions. By quickly finding the root causes and risks, it helps administrators to
proactively resolve issues before they impact the service levels.

Dell EMC CloudIQ

• Easily monitor storage health from anywhere


• CloudIQ is a no-cost cloud-native application for storage fitness tracking
• Dell EMC Unity, SC Series, XtremIO, and PowerMax/VMAX are all supported by CloudIQ

A no-cost cloud-native application that leverages machine learning to proactively monitor and measure
the overall health of storage systems through intelligent, comprehensive, and predictive analytics. The
easiest way to describe CloudIQ is that it is like a fitness tracker for your storage environment, providing
a single, simple, display to monitor and predict the health of your storage environment. CloudIQ makes
it simple to track storage health, report on historical trends, plan for future growth, and proactively
discover and remediate issues from any browser or mobile device.

• Identifies performance, capacity, and configuration issues, and helps remediate them
• Optimizes the usage of capacity and performs capacity trend analysis
• Verifies configuration compliance and recommends/triggers actions

• Provides end-to-end visibility in a single console

A management tool that automates some of the key management operations in a storage infrastructure.
It identifies potential performance, capacity, and configuration issues and helps remediate those issues
before they become problems. It optimizes the usage of capacity and performs capacity trend analysis.
It also collects configuration data, verifies configuration compliance with predefined policies, and
recommends/triggers necessary actions to remediate policy breaches. This enables organizations to
enforce and maintain the conformance with configuration standards, regulatory requirements, and
security hardening guidelines. Further, it provides end-to-end visibility across storage infrastructure
components including application-to-component mapping in a single console.

vRealize Orchestrator

• Orchestrates service delivery and operational functions


• Enables administrators to:
o Use pre-defined workflows from library
o Create customized workflows
• Can execute hundreds or thousands of workflows concurrently

Orchestration software that helps to automate and coordinate the service delivery and operational
functions in a storage infrastructure. It comes with a built-in library of pre-defined workflows as well as
a drag-and-drop feature for linking actions together to create customized workflows. These workflows
can be launched from the VMware vSphere client, from various components of VMware vCloud Suite,
or through various triggering mechanisms. vRealize Orchestrator can execute hundreds or thousands of
workflows concurrently.

Question 1
Which type of alert is generated if soft media errors on a disk drive approach the pre-defined
threshold value?

• Fatal
• Warning (Correct)
• Information
• Watermark

Question 2
What is the purpose of a chargeback report?

• Reports charges for SLA breach
• Reports utilization of infrastructure components by various users (Correct)
• Reports investment in managing infrastructure
• Reports cost of decommissioning infrastructure components

Question 3
Which monitoring parameter helps ensure the availability of an adequate amount of resources and
prevents service unavailability?

• Security
• Performance
• Capacity (Correct)
• Availability
