InfiniScaleStorage TAR
June 2013
Abstract
This decade is seeing a tremendous proliferation of cloud- and mobility-enabled services, not only in social interaction but also in almost every commercial activity of consuming goods or services. This report analyzes the impact of this techno-economic trend on the IT consumption of large enterprises, which are vigorously re-architecting their infrastructures to enable modern consumption paradigms for their end users. Beyond social media, these trends have significantly impacted business operations in e-commerce, financial services, healthcare, media and application development/deployment. We are all familiar with the consumer side of this impact as end users of enterprises in these verticals. However, there is a greater disruption underway in the business and IT architectures of these enterprises, which this report characterizes and analyzes as InfiniScale Storage Architectures. The report also studies the response of NetApp’s competitors and partners to this trend, and concludes with recommendations for NetApp.
Executive Summary
What? NetApp’s large enterprise customers in the e-commerce, retail, financial services, public sector
and telco/SP verticals are rolling out new analytics-driven, cloud-scale business operations.
These operations are characterized by (a) lean supply-chain management through
application of the Internet of Things and (b) deeper real-time consumer insights through
analysis of social media traces. Both are leading to a new generation of rapidly growing low-
latency, high-throughput data stores optimized for analytics. An emerging API-driven
storage stack, which has almost become a de-facto standard through its OSI-like 7-layer
model, is driving the migration of data management value to layers stacked above data
storage.
Disruptions driven by the cloud business model are leading applications to demand newer
developer-friendly data services and logical data models (such as map-reduce, graph and
columnar stores, which are more sophisticated than the basic file/volume/block models
used traditionally), while changing technology curves, which have produced an abundance
of CPU, memory and networking, are impacting the physical data abstractions and data
distribution of those logical models. NetApp’s current product portfolio targets only the
data storage beneath the physical abstractions in this stack, and hence NetApp needs to
follow the value that has migrated up the stack to the data distribution and data
abstraction layers.
We see the emergence of three new workloads: real-time analytics, session stores and active
blob stores. Traditional storage architectures are stretched to address these emerging
workloads along one or more of the following vectors: cost, scale, latency, throughput and
the need to support non-POSIX, application-driven data models. While these environments
currently pose a large business threat through open source software and commodity
infrastructure, NetApp has the potential to protect its challenged market share by
differentiating in this space with engineered solutions.
Why? Some of the key drivers for this area of work are:
Application middleware and in-memory databases are driving a trend towards doing
fine-grained data management higher in the stack. Most of cDOT’s data management
value moves into that space. Further, new kinds of data management emerge close to
the application that are difficult for cDOT to provide. Not heeding this shift would
mean that storage is relegated to being used as JBOD. This trend presents itself as the
emergence of custom, narrow-focused databases, called data stores.
Enterprises are demanding real-time analytics on most types of data. In many cases
these applications cannot tolerate disk latencies, have very high transaction rates
(millions of transactions per second) and large working sets. Thus, these applications
want a lot of memory on each node and utilize scale-out architectures. SCMs have a
number of end-to-end issues to resolve before they become real in a data center,
but DRAM-based InfiniScale solutions like SAP HANA and Microsoft’s Hekaton are not
waiting. Further, open-source solutions like Cassandra are being used by most of
How? The recommendations involve engineering application-guided agile data layouts that can
accommodate application-defined granularity of data management. We propose the
following broad set of investigations in ATG targeted towards accomplishing this:
Most object stores are good at storing large objects. They can work neither with tiny
objects nor with the volume and velocity of tiny key-value data elements. A KV store,
from that perspective, is an IOPS tier of the object store. Being able to deal with tiny
key-value pairs is a challenging storage problem, because even as relatively colder
data is written to more stable storage, it will be accessed using the same access
mechanisms as when it was in memory. How can a tiny key-value-pair data store
organize data into large back-end IOs to stable storage, for efficient subsequent
retrieval and processing? We speculate that such a store has the potential to become
the unified storage for NoSQL databases.
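One plausible shape for such a store is a log-structured design: buffer tiny key-value writes in memory, then flush them as sorted batches in large sequential back-end IOs, while an index preserves the same get/put access mechanism for hot and cold data alike. The sketch below is purely illustrative; the class, threshold and structures are invented, not a NetApp design.

```python
class TinyKVStore:
    """Toy log-structured store: buffer tiny puts in RAM, then flush
    them as one large sequential segment write (hypothetical sketch)."""

    def __init__(self, flush_threshold=4):
        self.memtable = {}            # recent writes, served at memory speed
        self.segments = []            # stand-in for stable storage: flushed batches
        self.index = {}               # key -> (segment_no, offset) for cold reads
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Sort by key so later range reads enjoy spatial locality on disk.
        batch = sorted(self.memtable.items())
        seg_no = len(self.segments)
        self.segments.append(batch)   # one large back-end IO, not N tiny ones
        for offset, (key, _) in enumerate(batch):
            self.index[key] = (seg_no, offset)
        self.memtable.clear()

    def get(self, key):
        # Same access mechanism whether the datum is hot or already flushed.
        if key in self.memtable:
            return self.memtable[key]
        seg_no, offset = self.index[key]
        return self.segments[seg_no][offset][1]
```

The point of the sketch is the amortization: many tiny writes are absorbed in memory and reach stable storage as a few large, sorted IOs, which is what makes subsequent retrieval and scan-style processing efficient.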
CPU architectures are becoming very potent and very complex. Treating DRAM as
purely random-access has a serious effect on cache effectiveness, so spatial locality of
data access from DRAM is very important for high-transaction, low-latency
workloads. Memory bandwidth also continues to be highly constrained, which further
aggravates the need for its effective utilization, achieved through spatial data locality.
Given these aspects, memory layouts tailored to the nature of the queries being
processed are needed. Thus, application-guided data layouts need to be explored.
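As an illustration of why the layout should follow the query, the toy sketch below contrasts a record-oriented layout with a columnar one for an aggregate query; the schema and field names are hypothetical, and a real system would see the difference as cache-line and bandwidth savings rather than in Python semantics.

```python
# Array-of-records layout: good for point lookups of whole records.
records = [{"ts": i, "price": 100 + i, "qty": i % 5} for i in range(8)]

# Column-major layout: an aggregate over one field touches one
# contiguous array, maximizing spatial locality and cache-line reuse.
columns = {
    "ts":    [r["ts"] for r in records],
    "price": [r["price"] for r in records],
    "qty":   [r["qty"] for r in records],
}

def avg_price_rowwise(recs):
    # Strides over entire records; drags unused fields through the cache.
    return sum(r["price"] for r in recs) / len(recs)

def avg_price_columnar(cols):
    # Scans one dense column; the layout matches the query's access pattern.
    prices = cols["price"]
    return sum(prices) / len(prices)
```

An application-guided layout engine would, in effect, pick between these two organizations (or a hybrid) based on the query mix the application declares.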
Explore storage efficiency in an IOPS-sensitive world by enabling reads over compressed
and encrypted data, with highly selective decompression.
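A minimal sketch of such selective decompression, assuming a simple fixed-size chunking scheme in which each chunk is compressed independently so a read touches only the chunks covering the requested byte range (the chunk size and helper names are invented):

```python
import zlib

CHUNK = 64  # bytes per independently compressed chunk (toy value)

def compress_chunked(data: bytes):
    """Compress fixed-size chunks independently so a later read can
    decompress only the chunks that cover the requested range."""
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def read_range(chunks, offset, length):
    """Selective decompress: touch only chunks overlapping [offset, offset+length)."""
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    buf = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
    start = offset - first * CHUNK
    return buf[start:start + length]
```

The same chunk boundary could serve as the encryption unit, so a read decrypts and decompresses only a small, bounded amount of data per IO.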
Explore resiliency options through geo-distributed coding techniques that provide
storage-efficient resiliency.
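As a simplified illustration of storage-efficient geo-coding, the sketch below uses a single XOR parity fragment across k data fragments; a production system would use Reed-Solomon or local-reconstruction codes, so treat this purely as a toy model of the overhead argument.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int = 3):
    """Split data into k equal fragments plus one XOR parity fragment.
    Placing the k+1 fragments in different sites tolerates any single
    site failure at (k+1)/k storage overhead, versus 2x-3x for full
    geo-replication (single-parity toy sketch, not a real code)."""
    assert len(data) % k == 0
    size = len(data) // k
    frags = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = frags[0]
    for f in frags[1:]:
        parity = xor_bytes(parity, f)
    return frags + [parity]

def recover(frags, lost: int) -> bytes:
    """Rebuild the single lost fragment by XOR-ing the survivors."""
    survivors = [f for i, f in enumerate(frags) if i != lost and f is not None]
    out = survivors[0]
    for f in survivors[1:]:
        out = xor_bytes(out, f)
    return out
```

The efficiency claim is visible in the arithmetic: for k = 3 the coded layout stores 4/3 of the data, while two-way geo-replication stores 2x for the same single-failure tolerance.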
We also call out potential inorganic investments, in the form of three technology
startups in this space.
3 RECOMMENDATIONS ---------------------------------------------------------------------------------------------- 41
3.1 ABSTRACT ARCHITECTURES--------------------------------------------------------------------------------------- 41
3.1.1 InfiniScale Architecture ------------------------------------------------------------------------------------------------------------- 41
3.1.2 Real-time Data Stores --------------------------------------------------------------------------------------------------------------- 41
3.1.3 Capacity-based Data Stores ------------------------------------------------------------------------------------------------------- 42
3.1.4 Summary ------------------------------------------------------------------------------------------------------------------------------- 42
3.2 POTENTIAL ATG INVESTIGATIONS ------------------------------------------------------------------------------- 43
3.2.1 In-memory Data Layout ------------------------------------------------------------------------------------------------------------ 43
3.2.2 On Storage Data Layout ------------------------------------------------------------------------------------------------------------ 45
3.2.3 Storage Efficiency -------------------------------------------------------------------------------------------------------------------- 46
3.2.4 Data Distribution --------------------------------------------------------------------------------------------------------------------- 47
3.2.5 Coding for Resiliency ---------------------------------------------------------------------------------------------------------------- 47
3.2.6 Others ----------------------------------------------------------------------------------------------------------------------------------- 49
3.3 POTENTIAL TECHNOLOGY TARGETS ------------------------------------------------------------------------------ 49
3.3.1 Acunu: Real-time Monitoring and Analytics for High-velocity Data ----------------------------------------------------- 49
3.3.2 FoundationDB: A NoSQL Database with ACID Transactions --------------------------------------------------------------- 50
3.3.3 BangDB: A NoSQL for Real Time Performance -------------------------------------------------------------------------------- 52
4 CONCLUSION --------------------------------------------------------------------------------------------------------- 54
4.1 KEY INSIGHTS ------------------------------------------------------------------------------------------------------ 54
5 REFERENCES----------------------------------------------------------------------------------------------------------- 56
LIST OF TABLES
TABLE 1: FACTORS THAT DIFFERENTIATE INFINISCALE FROM OTHER STORAGES ------------------------------------------------------------- 6
TABLE 2: WORKLOAD CHARACTERISTICS FOR EMERGING DATA STORES-------------------------------------------------------------------- 12
TABLE 3: NETAPP CUSTOMER REFERENCES ------------------------------------------------------------------------------------------------- 13
TABLE 4: PORTFOLIO OF EMC FOR EMERGING DATA STORES ----------------------------------------------------------------------------- 17
TABLE 5: TIPPING OVER TO SHARED-NOTHING DATA STORES ------------------------------------------------------------------------------ 30
The scope of this report does not include cheap, capacity-optimized [$/GB] storage architectures or
cold-data/archival stores; separate CTO office initiatives are in progress around those. The
scope of this report is IT architectures driving active business operations at cloud scale, with the
relevant performance, data management and availability requirements. As datasets pass through
their lifecycle, they would find themselves in lower-SLA, coarsely managed active archives and
eventually in deep/cold archives [Govil08].
1. Cloud-based Services: There has been a flood of cloud-based web services offered in the
aforementioned industry verticals that are leading much of the enterprises’ business growth.
These have led to new requirements for concurrency, security, scale and resiliency from the IT
infrastructure that supports the business operations. As an example, in May 2013 MetLife
launched1 a 360° consolidated customer-view service called “The Wall”, built on the NoSQL
document store MongoDB. This Facebook-like internal cloud service handled 45 million
agreements across 140 million transactions in a short span of 90 days. In a public statement,
MetLife’s CIO has committed investments to transform customer experience using state-of-
the-art InfiniScale technologies.
1 http://www.10gen.com/press/metlife-leapfrogs-insurance-industry-mongodb-powered-big-data-application
2 http://basho.com/assets/basho-casestudy-comcast.pdf
3 http://en.wikipedia.org/wiki/Internet_of_Things
4 http://www.mckinsey.com/insights/business_technology/the_internet_of_things_and_the_future_of_manufacturing
5 http://www.westfieldlabs.com/blog/a-new-global-approach-to-social-media/
6 http://www.zdnet.com/westfield-hires-digital-guru-for-tech-push-1339336946/
1. Multi Data-Model Support: Web-storage requirements have spurred open-community-defined
data models/services that have been widely adopted by application developers. A
differentiated web-storage solution must support the popular data models (such as
tabular/columnar, document and key-value) in order to cater to the diverse needs of web
applications.
7 http://www.openstack.org/
1. Timeliness: Both the data and the analysis on it should be available almost instantaneously.
2. Comprehensiveness: Real-time analysis doesn't involve sampling but complete datasets, like
the analysis needed for the last quarter of a business.
3. Accuracy: Data should be accurate, as much of the data involved is used not for statistical
analysis but for guaranteeing compliance to a generated model.
4. Accessibility: The raw data should be accessible for a few days (to a few weeks) while the
result of analysis should be accessible forever.
5. Performance: Most reports/dashboards of the analytics framework should render in less
than 5 seconds. Most of these operations involve an interactive session; anything more
than 5 seconds is considered unacceptable, while anything less than 2 seconds is considered
“very responsive”.
Properties that are required for a storage system that is to be used for real-time analytics are:
1. Highly available and distributed: The system should have high tolerance to individual node
failures and should make it easy to add multi-data-center support if data affinity or
sovereignty is an issue. It should also be easy to expand the cluster with new nodes
when necessary.
2. Extremely good write performance: Individual writes are expected to be tiny, and there are
a large number of sessions that need to be handled. Both low latency and high throughput
are required from storage.
3. Low latency reads: This is needed for drill-down and interactive analytics. Most of the
analysis is around a range of data elements, with high degree of time or space locality. Thus
storage must be organized to cater to this low-latency on reads.
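One illustrative way to organize storage for this locality is to key events by a coarse time bucket, so a drill-down over a time range becomes a single contiguous scan rather than many random reads. In the sketch below the bucket size, the in-memory sorted list standing in for on-disk order, and all names are assumptions:

```python
from bisect import bisect_left

BUCKET = 60  # seconds of events per bucket (illustrative)

rows = []    # kept sorted by ((bucket, ts), value); stands in for on-disk order

def insert(ts, value):
    key = (ts // BUCKET, ts)
    rows.insert(bisect_left(rows, (key, value)), (key, value))

def range_scan(t0, t1):
    # Locality: every event in [t0, t1) sits in one contiguous slice,
    # so the read path issues one sequential scan, not scattered IOs.
    lo = bisect_left(rows, ((t0 // BUCKET, t0),))
    hi = bisect_left(rows, ((t1 // BUCKET, t1),))
    return [v for _, v in rows[lo:hi]]
```

Log-structured column families in Cassandra-style stores achieve the same effect by making the clustering key a time component.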
Session state has the following properties:
It is not shared.
It is semi-persistent.
It is keyed to a particular user.
It is updated on every interaction.
It needs only limited-scope ACID semantics.
Given these properties, the functionality necessary for a session state store can be greatly simplified
as follows:
Currently, session state stores are built using relational databases, file systems, single-copy in-
memory stores and replicated in-memory stores.
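To make the simplification concrete, here is a hypothetical minimal interface such a store could expose, reflecting the properties above (per-user keying, TTL-based semi-persistence, atomicity scoped to a single key); all class and parameter names are invented:

```python
import time

class SessionStore:
    """Minimal session-state store sketch: unshared, keyed per user,
    semi-persistent via TTL, refreshed on every interaction, with
    atomicity limited to a single key (hypothetical interface)."""

    def __init__(self, ttl_seconds=1800):
        self.ttl = ttl_seconds
        self.data = {}   # user_id -> (state_dict, expiry_time)

    def update(self, user_id, **changes):
        # Limited-scope ACID: the read-modify-write touches one key only,
        # so no cross-key transactions or distributed locks are needed.
        state, _ = self.data.get(user_id, ({}, 0))
        state.update(changes)
        self.data[user_id] = (state, time.time() + self.ttl)  # refresh TTL
        return state

    def get(self, user_id):
        entry = self.data.get(user_id)
        if entry is None or entry[1] < time.time():
            return None  # expired: semi-persistence means state may vanish
        return entry[0]
```

Because every operation is confined to one user's key, the store shards trivially by user ID, which is exactly what makes session stores a natural scale-out workload.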
An example of a session state store is eBay’s metadata service platform9 for all of its web apps – this
service acts as the single source of truth for all its apps for media placement, ad placement, analytics
(cross-sell) and so on. It consists of hundreds of billions of tiny metadata objects constantly updated
by users’ interactions with eBay’s apps. This is a multi-datacenter platform service with a read:write
ratio of 10:1 and a latency SLA of ~500μs (as disclosed to our account team). They currently deploy a
400-node MongoDB cluster with replication across two datacenters, accelerated by PCIe flash hardware.
8 http://www.youtube.com/watch?v=8SP9klEv-Ho
9 http://www.slideshare.net/mongodb/storing-ebays-media-metadata-on-mongodb-by-yuri-finkelstein-architect-ebay
Petascale in capacity with hundreds of billions of objects with variable sizes (~10kB and
larger)
Latency SLA on accessing & ingesting the first bytes of the objects (typically ~10ms)
Emphasis on predictability of performance
Storage & transmission errors need to be detected & corrected
Multi-geo availability of content with disaster recovery built-in
BlobStores are aimed at storing and managing data objects, called blobs, that are much larger than
the objects allowed in the real-time analytics store and the session state store. Blobs are useful
for serving large files, such as video or image files, and for allowing users to store binary large
objects. The most commonly known BlobStores are Microsoft Azure Blob Service and Amazon S3. Here
are some key points about BlobStores:
Globally addressable
Key, value with metadata
Accessed via HTTP
Containers are provisioned on demand through API calls
Unlimited scaling
Commonly available BlobStores, such as the Google App Engine Blobstore and Amazon S3,
consist of three concepts: service, container, and blob.
A BlobStore is a key-value store, such as Amazon S3, in which a user can create containers.
A container is a namespace for the data, and a user can have many of them.
Inside a container, a user stores data as a blob referenced by a name. Commonly, in existing
BlobStores, the combination of a user's account, container, and blob maps directly to an
HTTP URL.
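The account/container/blob naming model can be sketched as follows; the endpoint, class and method names are made up for illustration and do not correspond to any particular vendor's API:

```python
# Hypothetical illustration of the account/container/blob model: the
# triple maps directly to an HTTP URL, making every blob globally
# addressable through plain GET/PUT (endpoint and names are invented).

BASE = "https://blobs.example.com"

def blob_url(account: str, container: str, blob: str) -> str:
    return f"{BASE}/{account}/{container}/{blob}"

class Container:
    """A container is a flat namespace of named blobs under one account."""

    def __init__(self, account: str, name: str):
        self.account, self.name, self.blobs = account, name, {}

    def put(self, blob_name: str, payload: bytes) -> str:
        self.blobs[blob_name] = payload        # value, keyed by blob name
        return blob_url(self.account, self.name, blob_name)

    def get(self, blob_name: str) -> bytes:
        return self.blobs[blob_name]
```

Because the address is just a URL, any HTTP client can fetch a blob, which is what makes these stores convenient for serving large media files directly to end users.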
10 http://engineering.twitter.com/2012/12/blobstore-twitters-in-house-photo.html
11 https://github.com/Netflix/astyanax/wiki/Features
12 http://techblog.netflix.com/2012/01/announcing-astyanax.html
[Table 2: Workload characteristics for emerging data stores13 – columns compare the Traditional Store, Real-time Analytics Store, Session-state Store and Active Blob Store; the table body was not recovered]
13 Based on data collected from existing NetApp customers that include Thomson Reuters, eBay & Apple
12 InfiniScale Storage Architectures NetApp Confidential – Limited Use
1.2.5 Customer References
Below is a compilation of the NoSQL data stores in use at our existing customers. These data stores are not leveraging NetApp storage for (one or
more) reasons of scale, cost, latency or throughput. ONTAP data management is not considered a differentiating factor with these data stores.
Customers prefer on-premise model of InfiniScale storage consumption due to one or more
of the following reasons [Khai11]:
2. Cloud PaaS/IaaS: Deployments built ground-up on the cloud are inherently InfiniScale
in nature, as they exploit the platform’s elasticity and seamless support for client mobility.
Large SPs offer InfiniScale storage as the primary service for persisting unstructured, semi-
structured & structured datasets. Some examples of popular InfiniScale persisted data stores
offered by SPs are:
a. AWS offers DynamoDB, S3 for semi-structured & unstructured data respectively
b. OpenStack-based HP Cloud offers Apache Cassandra as its primary InfiniScale
persisted datastore
c. Azure offers MongoDB as the primary data store for all its .NET based applications
Customers are taking a cloud-first approach while architecting new InfiniScale applications
ground-up. Recent examples of cloud-first architectures from NetApp’s customers include:
[Figure: InfiniScale market size (in $ millions), CY2012–CY2016 – two series growing from $348M and $736M in CY2012 to $1,051M and $2,752M in CY2016, a combined market of ~$3.8bn]
SpringSource [Aug 10, 2009]14: VMware’s acquisition of SpringSource for $420M in Aug 2009
heralded EMC’s foray into InfiniScale application & IT frameworks. SpringSource was the innovator
and driving force behind some of the most popular and fastest growing open-source developer
communities, application frameworks, runtimes, and management tools (including Apache Tomcat,
Groovy & Grails). In just five years, SpringSource established a presence in a majority of the
Global 2000 companies, rapidly delivering a new generation of commercial products and
services. VMware continues to support the principles that have made SpringSource solutions
popular: the interoperability of SpringSource software with a wide variety of middleware software,
and the open-source model that is important to the developer community. Just prior to this
14 http://www.vmware.com/company/news/releases/springsource.html
Cloud Foundry [August 19, 2009]15: SpringSource had planned the acquisition of Cloud Foundry, an
Oakland-based open-source PaaS provider, prior to its own acquisition by VMware. VMware
endorsed this decision and the deal closed as planned right after the SpringSource acquisition. Cloud
Foundry complements SpringSource by allowing the applications developed on it to take full
advantage of elastic cloud computing. Over the years since, VMware has invested in integrating
Cloud Foundry with all popular IaaS platforms – vCloud, AWS and OpenStack. VMware has also
provided an open-source cloud provider interface (CPI) called BOSH for integration into any
infrastructure (IaaS) platform. Cloud Foundry has found great traction with session-state-heavy
InfiniScale applications (such as e-commerce) through its seamless integration with MongoDB. eBay
has developed an e-commerce-as-a-service platform called X.com on top of Cloud Foundry, which it
also uses internally.
GemStone [May 6, 2010]16: GemStone Systems, Inc. was a privately held provider of enterprise data
management solutions based in Beaverton, Oregon. The acquisition advanced
SpringSource/VMware/EMC’s vision of providing the infrastructure necessary for emerging cloud-
centric applications, with built-in availability, scalability, security and performance guarantees for an
elastic session state store. These modern applications require new approaches to data management,
given they will be deployed across elastic, highly scalable, geographically distributed architectures.
With the addition of GemStone’s data management solutions, customers will be able to make the
right data available to the right applications at the right time within a distributed cloud environment.
Greenplum [July 6, 2010]17: EMC acquired the privately held Greenplum Inc. in 2010 and added a
data warehousing technology to enable big data clouds and self-service analytics. Greenplum utilizes
a shared-nothing massively parallel processing (MPP) architecture that has been designed from the
ground up for real-time analytical processing using virtualized x86 infrastructure. Greenplum is
capable of delivering 10 to 100 times the performance of traditional database software at a
dramatically lower cost. Post-acquisition, EMC invested in adding map-reduce Hadoop
capabilities to Greenplum and built a proprietary version of Hadoop called Greenplum HD.
Pivotal Labs [March 20, 2012]18: EMC acquired a boutique mobile / cloud application development
consulting and project management SaaS firm, Pivotal Labs, in March 2012. This is an important
acquisition that added much required talent force to enable EMC’s internal cloud service ambitions
as well as offer this as a professional service to its customers.
Cetas [April 24, 2012]19: VMWare acquired an early stage 18-month old startup Cetas that developed
an elastic cloud friendly query platform on top of Hadoop. It virtualized Hadoop’s architecture into a
cloud friendly stack that could be deployed on AWS or vCloud.
EMC Pivotal Initiative [April 2013]20: After the mixed success of the EMC Unified Analytics Platform
product based on Greenplum & Greenplum HD, EMC & VMware are on the cusp of rolling out a
federated platform-as-a-service called Pivotal. In this joint venture, EMC holds a 69% stake,
contributing the Greenplum & Pivotal Labs technologies. VMware holds the remaining 31% stake with Cloud Foundry,
15 http://classic.cloudfoundry.com/news.html
16 http://www.vmware.com/company/news/releases/spring-gemstone.html
17 http://www.emc.com/about/news/press/2010/20100706-01.htm
18 http://www.emc.com/about/news/press/2012/20120320-02.htm
19 http://gigaom.com/2012/04/24/vmware-buys-big-data-startup-cetas/
20 http://gigaom.com/2013/03/13/the-pivotal-initiative-in-case-you-were-wondering-is-now-official/
Note that Pivotal marks EMC’s entry into a service-based business model competing head-on
with AWS & Azure, while also being able to interoperate with them as pure IaaS platforms (owing to
Cloud Foundry’s BOSH CPIs). This emphasizes the close relationship between InfiniScale
applications and the cloud; EMC’s offering allows its customers to consume its technology portfolio
as a wide catalogue of services. Here is a summary of how EMC Pivotal maps onto InfiniScale
architectures:
Here are the InfiniScale services offered by AWS segmented by workload categories:
(Realtime) Analytics: Apart from seamless support for Apache Cassandra & SAP HANA21, AWS also
recently launched a massively parallel cloud data warehouse called Amazon Redshift22 (currently in
beta). Redshift simplifies integrating datasets in AWS S3 (active blob store) and AWS DynamoDB
(session state store) into a queryable interface for analytics. Redshift guarantees a < $1/GB/year price
21 https://aws.amazon.com/marketplace/b/6153421011/ref=mkt_ste_L3_MP
22 http://aws.amazon.com/redshift/
Session State Store: AWS offers a wide range of low latency data store functionalities for InfiniScale
applications to persist session state data from transactions / interactions. The key service,
DynamoDB23, is based on Amazon’s well-known Dynamo storage engine and offers flexible key-value
interfaces for applications to persist semi-structured data with flexible schemas. An associated
service, ElastiCache24, offers in-memory caching in front of DynamoDB for very low-latency
performance, while DynamoDB itself is backed by SSDs. DynamoDB provides a scale-out architecture
that can rapidly and seamlessly scale from a few thousand users (~100 reads-writes/sec) to many
millions of concurrent users (~100k reads-writes/sec) without requiring the customer to alter the
architecture or the application. This has made DynamoDB very popular25 amongst gaming, social-app,
advertising & e-commerce customers who see volatile surges in demand.
Active Blob Store: AWS offers both map-reduce and RESTful interfaces to blob store data through its
EMR26 and S327 services. It is noteworthy that both these are integrated with AWS RedShift and AWS
DynamoDB, thus letting customers build architectures with seamless integration across the
InfiniScale workloads. Another important data service that lets customers build customized
architectures is AWS Data Pipeline28 that reliably moves data between AWS services.
The most significant benefits of cloud-based InfiniScale architectures are elastically high utilization of
hardware & software resources and extremely simplified manageability, which together bring great
agility and economics to the business.
23 http://aws.amazon.com/dynamodb/
24 http://aws.amazon.com/elasticache/
25 http://www.allthingsdistributed.com/2012/06/amazon-dynamodb-growth.html
26 http://aws.amazon.com/elasticmapreduce/
27 http://aws.amazon.com/s3/
28 http://aws.amazon.com/datapipeline/
MongoDB: MongoDB was developed in 2009 by 10gen as a general-purpose transaction store with
web-friendly document/JSON data model. 10gen built extensive library extensions across all popular
programming languages to let developers seamlessly persist data structures as documents on to
MongoDB. This led to a very high level of uptake of MongoDB amongst InfiniScale app developers
using Java, .NET, SpringSource, Python or Ruby/Rails frameworks. 10gen also built an SQL-like query
interface that helped app developers move over from SQL Server to MongoDB easily. Today,
MongoDB is the de-facto choice for session state stores in web-based InfiniScale apps in e-
commerce, gaming, SaaS & web-based transactions. Documents give a schema-free architecture
(with support for indexes) that brings agility for accommodating changes in schema dynamically (as
data structures are ingested) without any downtime. It also requires minimal admin work, as schema
management is built in. It shards automatically and handles failures through replica sets (which also
help read performance). Being an in-memory database that uses mmap() to persist memory images
to disk, it has shown cache-coherency issues with NFS (especially with journaling on), leading to
reduced write performance over NFS. This has led customers to choose internal HDDs as the preferred
storage architecture (barring a handful of iSCSI/FC deployments). Thus the choice of MongoDB
displaces the NetApp install base due to architectural issues with NAS storage. As with Cassandra, wide
customer adoption of MongoDB has led to revenue losses for NetApp. Some of the large NetApp
customers with MongoDB deployments include eBay, Disney, News Corp., Intuit and Apple.
Built-in manageability that leverages linear scaling of commodity infrastructure & failure
management and ability to sustain very rapid growth in demand with minimal admin
overhead
High performance with extreme flexibility of changing schemas / data models owing to the
non-relational architecture of the DBs themselves
Abundance of IT resources due to high end CPU, memory and network bandwidth available
at commodity prices, leading to shifting of the financial bottleneck to operational efficiencies
Economics around the business value of data that has fine-grained analytics into all modern
business operations percolating deep down into infrastructure architectures
The InfiniScale apps exploit the abundance of commodity IT resources and deliver the fine-grained
business value of data to the operations of enterprises. This is a fundamental value-domain
migration from previous decades, which were about infrastructural efficiency (conserving memory,
disk and CPU resources at high operational cost) and application-agnostic aggregation of IT
resources (such as shared storage), without any architectural underpinning of the business value of
data. As NetApp customers migrate to modern business operations (following the trends above),
they are demanding new product values from vendors like NetApp. It is very important for NetApp
to evolve and support modern product values relevant to this world. As the market-sizing analysis
shows, NetApp would miss out on hyper-growth in this market segment if it continues to support
only the traditional product values.
Thus, InfiniScale is not just impacting and growing in Internet companies but is also home to rapidly
growing data stores in enterprises such as eBay, Intuit, Thomson Reuters, UBS, UHG and the like.
Jay Kidd, our CTO, provided the following sequence of causation that accentuates relevance of
InfiniScale to NetApp:
Demand for real-time analytics will drive creation and adoption of in-memory compute apps
and models
In-memory apps will drive demand for large storage class memory (SCM) extensions to
memory to work on larger working sets. This will drive reduced cost of SCM-loaded systems
and a virtuous cycle of adoption will begin.
The rise of in-memory/SCM stores will give rise to in-memory/SCM data management
models; Intel’s non-volatile memory ‘file’ system for SCM is one example. These data
management models will deal with a cache-line-sized block as the primitive and provide
distribution, protection, recovery and efficiency services, while maintaining low latency.
This data management model will put new demands on the capacity tier to provide efficient
capacity for cool data, excellent latency for warm data to feed the SCM. These capacity
stores must not assume traditional block or file structures to write to disk, but must start
with the performance requirements of the cache-line sized granular objects and figure out
how SSD and HDD can store them. In short, everything we know will change.
One of the fundamental trends is to leverage commodity hardware in InfiniScale solutions, which
allows scaled-out, shared-nothing architectures to be built and operated. As a result, the storage
attached to each host is managed by the middleware on that host, and collectively, across nodes,
this middleware presents an abstraction to the application.
Figure 3: Data Store Stack: The new OSI-like Model for Storage
Figure 3 presents an OSI-like model for emerging data stores, also called the Data Store Stack. Well-
defined, de-facto standard APIs are emerging between each of these layers. The various layers of
the data store stack are explained below:
1. Application: Applications are developed and deployed (typically) on a PaaS platform with
language bundled data service APIs. For example, eBay develops its web-service applications
on CloudFoundry and uses its bundled document store or MongoDB APIs.
2. Data Service: This encapsulates the underlying complexity of the stack and presents a
convenient API to use. For example, MongoDB provides a convenient JSON interface for a
transaction store permitting applications to persist data structures as Mongo documents. A
Data Service is also responsible for offering typical CRUD and/or query interfaces to the
application.
3. Data Model: A data model describes the logical relationships, ordering and organization of
data items, when accessed through their keys. Commonly used data models are key-value
stores (Riak, Acunu), document stores (MongoDB), graph stores (Neo4J) and columnar stores
From a NetApp perspective, it is the data distribution and data abstraction layers where we have an
opportunity to innovate and differentiate with well-engineered products and solutions.
Traditionally, NetApp has focused on the bottom half of the stack and has been most deeply invested
in the data layout, with WAFL. It is imperative that we expand that focus into the data distribution
layer, as our focus must also include addressing the needs of a more geo-dispersed cloud infrastructure.
In the past, we have driven efficiencies by coupling storage resiliency (through RAID) with data
layout. With a more geo-dispersed infrastructure, efficiencies will need to be driven by coupling
storage resiliency with the data distribution layer. This is a fundamental shift in thinking at NetApp,
but the industry is already trending in that direction.
InfiniScale solutions, as of this writing, are fueled primarily by the need to ingest and analyze large
amounts of machine-generated data. Riding on the trend of the Internet of Things, much of this
machine-generated data arrives as a continuous stream from a very large number of sources.
One of the first storage challenges is the ability to ingest large amounts of tiny data from a large
number of data sources, without dropping (losing) any data. If lost, that data packet might have
been carrying anomalous behavior information of the system. So, this is not about being statistically
correct in a large corpus. It is about gathering and analyzing all data. A slightly extreme case of such
data ingest is that of Twitter, which needs to handle close to 5,000 tweets/sec, where the number of
data sources might not be known beforehand. An enterprise deployment would be slightly more
predictable than that. One of the ways in which this workload challenge is addressed is by
minimizing system resource hold time. Thus, asynchronous response mechanisms must be
developed to help scale the solution better and not couple the front-end source-side processing with
the back-end data sink processing.
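The asynchronous decoupling described above can be sketched with a bounded queue between the front-end ingest path and the back-end sink; the function names and the Python threading machinery here are illustrative, not any product's implementation:

```python
import queue
import threading

def make_ingest_pipeline(sink, maxsize=1024):
    """Decouple front-end ingest from back-end processing: the caller
    gets an immediate acknowledgement once the datum is enqueued, and a
    background worker drains the queue into the (slower) sink."""
    q = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            datum = q.get()
            if datum is None:        # shutdown sentinel
                break
            sink(datum)

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    def ingest(datum):
        q.put(datum)                 # blocks only if the queue is full
        return "ack"                 # acknowledge before sink processing

    def shutdown():
        q.put(None)
        t.join()

    return ingest, shutdown
```

The key property is that source-side hold time is bounded by the enqueue, not by the sink, which is what lets the front-end scale independently of the back-end.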
Another challenge is to be able to lay out the tiny data on stable storage, reliably. This gets
challenging due to the huge amount of randomness that might be introduced by the requirement of
storing the incoming datum along a certain dimension mandated by the key. This is one such
challenge that Thomson Reuters has been faced with in their stock ticker service, where they cannot
afford loss of any input and the inputs received have to be organized along the stock timeline. One
of the ways in which the industry is addressing this challenge is by leveraging the in-memory data
layout capabilities of data stores, like Cassandra. Cassandra allows the application to specify a key,
which may be a compound string using the stock symbol. The value of that may be stored in a
Cassandra column, which may be time-versioned.
But that is only half the explanation: it describes what happens in the logical domain (the data
model), not how the data is physically organized to meet the ingest criteria. Most
of the InfiniScale solutions are heavy on the use of memory. This is because when data is ingested
into a Cassandra node, it is stored in memory, organized along a certain dimension, logged for
recoverability, replicated to another node for fault tolerance and then acknowledged to the client.
So, ingest has very low latency. Periodically, that data collected in memory is flushed to stable
storage. Organization of data in a specific form on stable storage is covered in the next sub-section.
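As a sketch of the ingest path just described (log for recoverability, sorted in-memory organization, periodic flush to stable storage), under the assumption of a single node, with replication to a peer omitted:

```python
import bisect
import json

class TinyLSMNode:
    """Sketch of the Cassandra-style ingest path: append to a commit
    log for recoverability, keep data sorted in memory (a memtable),
    and periodically flush the memtable to an immutable on-"disk"
    segment (an SSTable).  Replication to a peer node is omitted."""

    def __init__(self, flush_threshold=4):
        self.commit_log = []          # stand-in for a durable log file
        self.memtable = []            # sorted list of (key, value)
        self.sstables = []            # flushed, immutable segments
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append(json.dumps({"k": key, "v": value}))
        bisect.insort(self.memtable, (key, value))
        if len(self.memtable) >= self.flush_threshold:
            self.flush()
        return "ack"                  # ack after log + memtable insert

    def flush(self):
        self.sstables.append(list(self.memtable))
        self.memtable.clear()
        self.commit_log.clear()       # log no longer needed for this data

    def read(self, key):
        # Newest data first: memtable, then segments newest to oldest.
        for k, v in self.memtable:
            if k == key:
                return v
        for seg in reversed(self.sstables):
            for k, v in seg:
                if k == key:
                    return v
        return None
```

Note that the acknowledgement depends only on the log append and the in-memory insert, which is why ingest latency stays low regardless of stable-storage organization.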
It is worthwhile to examine how such heavy usage of memory (DRAM) is considered economically
feasible. First off, volume DRAM prices are at $15/GB, as of this writing, and dropping 20% every
year. Secondly, the data model of Cassandra enables a number of optimizations that help bring into
memory only the data that is needed. This is through the column family
abstractions of Cassandra. Other attributes (column families) of the object (row) in question are not
brought into memory. Thirdly, organization of data as column families leads to high compressibility.
Because of high similarity of content of a column family, data is highly compressible, and therefore
IO throughput is also kept high. Cassandra uses bloom filters to selectively read segments within a
column family. Thus, when data is finally brought into memory it is absolutely the data that needs to
be consumed. Research in the areas of data compression and database performance has also shown
that compressed data can be used directly without having the need for necessarily uncompressing
the same. This allows for better memory bandwidth utilization although at a slightly higher CPU
utilization. Given that the cost per CPU cycle halves every 2 years, and that memory bandwidth is
always challenged, these techniques go a long way to increase the usable memory capacity, making
heavy reliance on DRAM economically viable.
The above techniques give an InfiniScale solution the ability to acknowledge each datum ingested in
well under a millisecond. It also gives the capability to analyze historical data, over short windows,
and raise alarms for anomalous behaviors.
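The Bloom-filter-based segment selection mentioned above can be sketched as follows; the bit and hash counts are illustrative, not tuned values from Cassandra:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter of the kind Cassandra uses to decide, without
    touching the data, whether a segment may contain a key.  False
    positives are possible; false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive several independent bit positions from one key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= (1 << pos)

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

A per-segment filter lets a read skip segments that definitely do not hold the key, so only the data that will actually be consumed is brought into memory.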
From a NetApp perspective, Ontap can handle tiny data updates, as long as each update is not
individually addressed by subsequent accesses. In the context of emerging InfiniScale workloads, that
individual addressability is precisely what is needed: each data element (addressed by a key or
objectID) needs to be tracked in a sea of
data elements. These tiny data elements need to be versioned and accessed with a certain temporal
and spatial locality with other data elements. From this perspective, an InfiniScale solution is more
like a database to Ontap, as the inter-relationship among data elements is alien to Ontap. When
fine-grained data management happens at the higher levels, Ontap value is diminished. When the
intensity of workloads threatens to create a bottleneck at the controller, Ontap is not the storage of
choice.
Our intention here is not to analyze those 120+ data stores, but to state that in this era of
specialization, there is a growing need to adapt storage layouts to the needs of the application. This
is to increase operational efficiencies and improve application effectiveness. To demonstrate what
value a custom data store can provide to an application, we examine two more in this section
(Cassandra was covered in a previous sub-section).
Another reason for terming this era as an era of specialization is that there is a shift in the way
products are being constructed. Rather than a single large monolithic solution, like Oracle Database
(or even Btrfs, for that matter), which has just about every feature under-the-sun, the move is to
now have simpler and more nimble, but highly efficient products along a certain dimension. These
products do “one” thing and they do it well. Some of these perform as much as 100x better on
ingest and query performance.
To address the question of why a single in-memory data layout does not suffice, we will need to
look at the impact of treating DRAM as a purely random access medium. As an example, if we were
to traverse an array of pointers to data elements (which are dispersed in DRAM), we would be
accessing data with almost no spatial locality. This would result in data being brought in from DRAM
into CPU cache lines. If each access results in accessing a different cache line, we would be incurring
a 60-nanosecond access (to DRAM), as opposed to a 3-nanosecond access (to an L1 cache).
For this reason, if the data model of the InfiniScale solution wants to present the capability of being
able to provide efficient spatial access to related nodes in a graph structure, as an example, the
underlying layout should support that intent. A reference from a row structure to another row, to
simulate a graph will result in poor spatial locality, leading to poor performance of nearest neighbor
queries. Thus, a columnar layout is inappropriate for a graph database.
The data layout should be driven by the application intent. If the nature of the queries is known in
advance, and the essence of those queries is passed down to the layout as hints, the data layout can
be organized in a fashion that best serves those queries.
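As a small illustration of layout following intent, the same records can be held row-wise or column-wise; a planned aggregate over one attribute then scans a single contiguous array in the columnar form (the data and field names are invented):

```python
def to_columnar(rows, columns):
    """Re-organize row-oriented records into column arrays.  A query that
    scans one attribute (e.g. an aggregate over 'volume') then touches a
    small number of contiguous arrays instead of striding across whole
    rows with poor spatial locality."""
    return {c: [r[c] for r in rows] for c in columns}

rows = [
    {"symbol": "TRI", "price": 34.1, "volume": 900},
    {"symbol": "NTAP", "price": 37.8, "volume": 1200},
    {"symbol": "TRI", "price": 34.3, "volume": 400},
]
cols = to_columnar(rows, ("symbol", "price", "volume"))

# A planned query ("total volume per symbol") served from the columnar form:
total_tri = sum(v for s, v in zip(cols["symbol"], cols["volume"])
                if s == "TRI")
```

The converse holds too: a pointer-chasing graph traversal over this columnar form would scatter accesses, which is the layout mismatch described above.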
From a NetApp perspective, Ontap (WAFL) never bothered about such optimizations. What is driving
the solutions towards these optimizations? Ontap played in a world of 1-millisecond to 10-millisecond
access latencies, due to the network hop. When focus shifts towards a low-latency play, with the bulk
of data access from memory, latencies of the order of microseconds start to play an important part.
Thus, the engineering required in InfiniScale solutions is very different from the optimizations that
we have traditionally played in.
1. Traditional SQL, Converged DB: Examples of this are SAP Hana and Microsoft’s Hekaton. The
fundamental target is the traditional business applications, which are presented the familiar
SQL interface. These solutions run OLTP and OLAP on a single underlying database, which is
hosted in memory. The fundamental goal is to support interactive and real-time query
processing over hot data (transactions).
2. Emerging NoSQL, Converged: Most emerging applications are not SQL-based. They work
with specialized data models, which are closer to their problem domain. Examples are
GraphDB, Neo4J, Cassandra, Riak, HBase and the like. Scale of operation is the fundamental goal.
[1], [2] and [4] have led to a very significant shift in storage architectures, all lead by analytics.
Traditionally, there have been separate Online Transactional Processing (OLTP) data stores and
Online Analytical Processing (OLAP) data stores. The data model of the OLTP data stores (also called
operational data stores) is optimized for transactions and OLAP data stores follow a different data
model (the data warehouses). The organization of data in the OLAP data stores is typically along the
dimensions (attributes) of interest, along which planned queries will be executed.
Such levels of response time are feasible only if the batch operations of copying data from
operational data stores to the data marts can be avoided. Thus, the call is to have a single data store
which can serve as the transactional data store, as well as against which we can issue OLAP queries
(and as we evolve, even unplanned analytical queries).
Unfortunately, OLTP and OLAP have orthogonal workloads. OLTP constitutes small random writes,
while OLAP mandates the data store to be re-organized to leverage the large sequential read
throughputs of disks. But, this re-organization of the data along specific dimensions in the OLAP data
store was needed because of its underlying storage medium (disks) and was not a top-down
decision. Such orthogonal workloads on a single data store thus call for the data store to be placed
on random access medium. With DRAM prices down to nearly $15/GB, SAP chose to skip flash as a
medium and host the working set in DRAM. Microsoft, since 2009, has been known to be working on
Hekaton, an in-memory SQL server realization, which was announced to be in beta trials in
November 2012 [Lars12]. There is, however, a fundamental difference between the approaches of
SAP-HANA and Oracle Exalytics. Oracle has revived a 20-year-old database, TimesTen (an
embedded database), and has chosen to use it as an in-memory database. But they have not
converged the OLTP and OLAP databases.
From a NetApp perspective, the unfortunate part is that our OLTP customers might not ask us for a
change. It is the OLAP-side, where we don't have a substantial footprint and where we are not
exploring, which threatens to change the storage architectures and do it in a way that has a positive
side-effect of enhancing the transaction latency and throughput of OLTP workloads. We thus, run a
risk of being blind-sided while storage evolves towards the new world-order.
Due to this infrastructural shift (and the cost points of the same), the notional differences and gaps
between Tier-1 and Tier-2 diminish, and in some extreme cases the worlds collide. Thus, the so-
called emerging Tier-1 solutions threaten to enter into NetApp’s green zone Tier-2 business
processing, and eat into the same.
Because the application is now joined at the hip with the storage, the storage need not jump
through hoops to guess what the application is trying to do with storage. Instead,
the application can now specify what it needs of storage. The interface between the application and
storage is up for grabs and can be defined in ways that allow fine-grained, application-driven and in-
band data management. For example, a SQL query will now be able to specify, through SQL
constraints (as an example), that a transaction involving $3 Million is more important than a $30
transaction. This in-band specification of hints from application will allow the high-value transaction
to be synchronously committed to a remote DR site before acknowledging completion to the
application, while other transactions are protected using the usual asynchronous DR mechanisms.
This allows for application-driven and transaction-selective continuous data protection. This is an
example of application-driven in-band data management, which is transaction-granular.
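The transaction-selective protection described above might be sketched as a commit path that consults an in-band value hint; the callables and the threshold are hypothetical stand-ins, not any product interface:

```python
def commit(txn, sync_replicate, async_replicate, sync_threshold=1_000_000):
    """Sketch of application-driven, in-band data protection: a hint on
    the transaction (here its monetary value) decides whether the commit
    is synchronously shipped to the DR site before acknowledging, or
    handed to the usual asynchronous replication stream."""
    if txn["value"] >= sync_threshold:
        sync_replicate(txn)       # block until the DR site has it
        return "committed-sync"
    async_replicate(txn)          # queued; acknowledge immediately
    return "committed-async"
```

The point is that the protection decision is made per transaction, in-band, from an application-supplied hint, rather than per volume or per LUN as in traditional storage-side replication.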
SAP-HANA is the Google of this infrastructure because what they have done is pioneering. Most
enterprises would want to leverage what SAP has contributed, but can't build it themselves. It
should be our endeavor to partner with this Google. There is an opportunity to work with HANA and
re-draw the storage boundaries while they are still not cast-in-stone. There is also the opportunity to
drive these interfaces into de-facto community standards such as OpenStack and assume a
leadership position. Some of these interfaces are at the data abstraction layer of the data store
stack.
Another aspect is that the level of memory and structure optimizations done for HANA has been
done keeping a specific Data Layout in perspective. In the case of HANA, it is a custom data layout,
but essentially column-oriented, for the most part. And, this layout was chosen to allow for data
warehouse queries. However, as HANA moves into the territory of more complex and varied
analytics on its data streams, it is considering options for alternative data layout schemes. The
question that begs to be answered is: is there a single data layout scheme that can take us to the
point of 80th-percentile performance in 60% of the cases?
The transaction layer is going to be a very important central control point that we should stake a
claim on. Transactions help knit mutations on multiple objects, provide data storage consistency
points, and are critical to businesses. It is also what is expected to see the most evolution in the near future.
This change has been driven by technology limitations and fueled by business needs. In the past 5
years businesses have been pushing for disproportionately increased levels of ingest and query
performance.
Interestingly, most of the data growth has been happening as unstructured content. Thus the need
for ingesting and analyzing large volumes of high-throughput unstructured content led to the
evolution of storage architectures in the direction of leveraging scale-out and shared-nothing
paradigms. What would otherwise be considered low-value data, this unstructured content now
carries significant value when analyzed at scale.
29
Source: http://www.stanford.edu/class/ee282/08_handouts/L07-IO.pdf
To what extent up the stack should these shared-nothing notions be exposed? From one
perspective, cDOT is also a shared-nothing system, as a D-Blade of one system does not access data
from another D-Blade. But, the N-Blade can go across nodes (the remote path). There is also a cross-
node transactional binding in the control paths at the M-Host. And, the HA-pair works to protect its
partner by leveraging shared disks at the backplane. cDOT exports a POSIX interface (covered in the
next section) that hides the underlying NUDA (Non-Uniform Data Access) model from the
applications. Thus, an application could potentially run transactions and joins that span nodes, while
being totally agnostic of the underlying data distribution across nodes. This simplicity has been very
important to the enterprise applications we support.
Shared-nothing architectures on the other hand pass the complexity of topology boundaries up to
the middleware at the host. Most InfiniScale middleware would not allow for cross-node
transactions and data access in a single unit, at a certain low level of abstraction. This is not just for
better performance. It is also for a more robust system, by avoiding cross-node state and lock
maintenance.
However, at higher levels of programming abstractions, even Google, with its Megastore, has gone
down the path of simplifying programming abstractions while leveraging the scale-out and shared-
nothing paradigms as its underpinnings. Thus, Megastore builds a higher-level Data Model over a
lower-level Data Model.
Below is an attempt to classify when one would move from a centralized storage model to a shared-
nothing and potentially a peering model.
Application Model \ Storage Model | Shared Storage       | Shared-nothing Storage
Monolithic                        | 2-hops (Traditional) | 3-hops
Sharded Function30                | 3-hops               | 2-hops (Peering)
The above analysis is centered on the premise that a network hop significantly impacts the latency
seen by an application and thus alters its performance profile. The number of hops is counted from
the client, where the application initiates the request. These architectures are explained below:
1. Monolithic over Shared Storage: Access from the client would see two hops. One from the
client to Application Server and the second from the Application Server to Shared Storage.
This is a rather deterministic number of hops due to the monolithic nature of the application
at the Application Server. This is the traditional siloed data access model in an enterprise.
2. Monolithic over Scale-out, Shared-nothing: In this case, the number of hops involved would
be three. The additional hop is introduced at the storage layer. Counting the hops, the first
is from the client to the Application Server, the second is from the monolithic application at
the Application Server to a node at the scale-out storage layer, and the third is within the
storage layer, when the contacted node forwards the request to the node that owns the data.
30
Function transformations are co-located with the data shard, within a node boundary, that it operates on. This leads to
highly filtered data movement over the network.
The above analysis is an attempt to define the architecture of choice for InfiniScale and when the
scale-out and shared-nothing architecture becomes mainstream architecture of choice. Middleware
in InfiniScale solutions is already split and sharded and follows approach [4], as defined above. In this
model, each node hosts an Application Server function that serves to operate over the data and
provide a highly filtered data movement over the network. However, not all problems can be broken
down into nicely partitioned function blocks working over partitioned datasets. Most of traditional
data management, as seen by enterprise applications and admins, is built over our Snapshot®
technology. Coordinating a snapshot across a 1000-node cluster is as yet an unsolved hard problem.
Being able to do this in the time frame of mainstreaming of SCM would be important.
In the past, network latencies of 5 milliseconds matched up nicely with disk seek (rotational)
latencies, and thus disk-based shared storage was viable for a long time. Even with the advent of
flash, the 100 microseconds of read latency matched well with the comparable round-trip latencies
of 10GbE. Thus, there was still a compelling enough reason to stay with shared storage across the
network. But SCM on the horizon promises 100-nanosecond access latencies, which
leads to a singularity that fosters growth of shared-nothing architectures. This brings up other issues
around data availability and how one can make data resilient in the context of SCM, when working
with an order of magnitude higher latency interconnects. For further analysis of how SCM impacts
storage architectures, the reader is advised to refer to the DC2015 Technology Report. It should also
be stated that while SCM is on the horizon, some of these changes are happening today, with DRAM-
based storage becoming mainstream.
As covered thus far, most InfiniScale solutions are DRAM-based with data durability on locally
attached disks, which would shift to SCM, in the near future. Data replication across nodes is
employed to protect against data loss, in the event of a node fault.
From a NetApp perspective, InfiniScale architecture has control points at the host, and in the data
path. This isn’t a place that NetApp has traditionally played at. Data availability is also provided at
the host layers. For purposes of cross-data-center disaster recovery, some deployments might want
to rely on a storage array for replication, but most would continue to depend on the InfiniScale
solution to provide that capability. A related but orthogonal capability is to be able to replicate into a
much smaller cluster at the DR target. We examine and contrast these architectures with host-based
caching architectures in a later section.
While this might seem slightly provocative, it is also a fact. Most new and InfiniScale applications
being developed today are being programmed against InfiniScale middleware, which encapsulates
storage and presents higher-level abstractions to interface with. The APIs (Data Model) provided by
the InfiniScale middleware are the emerging de-facto standard. At this point, however, no single
interface has won the race, but there are clear favorites. Some popular InfiniScale interfaces through
which applications program storage are MongoDB’s JSON interface, Cassandra’s columnar structures
and Neo4J’s Graph APIs. These are becoming popular also because of the simplicity of their
integration with data structures in a programming environment.
POSIX is not dead, but it has been relegated to use within a node and at very low levels of
abstraction, almost making it irrelevant. As an analogy, programming against POSIX is becoming
similar to programming in the assembler of yesteryear. POSIX interfaces are used in InfiniScale
middleware for 3 different purposes:
- For large blob IO, where a blob layout is completely managed by the middleware
- For memory mapping a large segment of a file into the address space of the middleware for
manipulating the contents of the same, and
- For log management, for recoverability
Thus, in most cases, content layout within the storage region is managed almost completely by the
InfiniScale middleware. POSIX is relegated to managing those extents and for providing mapping of
those extents into the address space, as needed. More often than not, we have a key-value layout
within these extents. This usage trend will only intensify with the advent of SCM, as memory
mapping is also the preferred programming model being proposed as a standard in the NVM
Programming TWG in SNIA.
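The memory-mapping usage above can be sketched as follows: the file system merely provides an extent, and the middleware manages its own fixed-width key-value layout within it (the record format and sizes are an invented example):

```python
import mmap
import os
import struct
import tempfile

# Sketch of how InfiniScale middleware uses POSIX: the file system only
# provides an extent; the middleware memory-maps it and manages its own
# key-value layout inside (record here: 8-byte key, 8-byte value).
RECORD = struct.Struct("<qq")

def write_records(path, records, extent_size=4096):
    with open(path, "wb") as f:
        f.truncate(extent_size)            # pre-allocate the extent
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), extent_size) as mm:
            for i, (k, v) in enumerate(records):
                RECORD.pack_into(mm, i * RECORD.size, k, v)
            mm.flush()                      # msync: push to stable storage

def read_record(path, index):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)
```

POSIX here only allocates and maps the extent; the layout of tiny records within it never passes through the file system's data model.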
POSIX is also not being used as a presentation layer interface. POSIX was created largely for
purposes of large block IO, not tiny updates. This interface still has its roots in tape-based IO,
which is evident in its lack of capability to efficiently read back tiny updates. The data model presented by
POSIX is that of flat blobs, which isn’t very useful when the boundary required is very fine grained.
Another reason is that POSIX and scale-out are hard to get right. If each tiny datum was an
independent file, the metadata overheads are very high due to inodes and directory entries. Access
to a single file thus also has high metadata overhead. Most file systems also do not do much to
maintain spatial locality across files in a dataset. If each such file were distributed using a consistent
hash, listing of a directory would suffer. Serialization across nodes to get POSIX right is also hard and
difficult to scale. Thus, alternate richer and more flexible data models are used in InfiniScale
architectures.
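The consistent-hash placement mentioned above can be sketched as a simple hash ring; note how lexically adjacent file names scatter across nodes, which is exactly why a directory listing over hash-placed files loses locality (node names and virtual-node count are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: each node owns many points (virtual
    nodes) on the ring, and a key is placed on the first node clockwise
    from the key's hash."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self._tokens, self._hash(key)) % len(self._ring)
        return self._ring[i][1]
```

Placement is deterministic and balanced, but neighbouring keys (`dir/file-1`, `dir/file-2`, ...) land on unrelated nodes, so any cross-key operation fans out across the cluster.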
This trend is further fueled by gains seen when bypassing the kernel and the kernel buffer cache and
by taking fine-grained control of what is cached, and for how long, through judicious use of madvise.
This complexity is managed by the InfiniScale middleware, which becomes the new User-space
Kernel. Very significant CPU path lengths can be eliminated through these means and through judicious use of
lock-free data structures and algorithms. In one experiment on Intel’s Sandy Bridge, converting a
ConcurrentArrayQueue (a Java data structure) into a lock-free form yielded a 10x improvement in
operations per second, while reducing latency by almost 50%.
In summary, POSIX was a good abstraction to be working with when the resting place for the data
was over the IO bus. But with in-memory processing catching on, the resting place changes to DRAM
and the memory bus is used instead of the IO bus. This changes the expectations of latencies, and
suddenly the nicely structured IO subsystem appears burdensome. This leads to the call for a re-
think of the use of POSIX in the application stack.
31
Source: http://www.mysqlops.com/2012/04/09/linux-io-stack.html
Analysis presented in the previous section on POSIX provides a wedge into this topic, which is
related to low latency processing and strict guarantees to meet those latency requirements. The
deep stacks and layers of software that need to be traversed for IO processing not only impact the
latencies, but also increase the probability of missing deadlines due to the sheer complexity (and
thus uncertainty of operation) of the layers involved in the stack. Troubleshooting performance
issues in these deep layers has also proven to be a challenge. This, too, leads to the conclusion of
bypassing the entire IO stack and taking control of processing in the user-space kernel. It is also the
cost of modularity, paid for creating a stack applicable to a broad segment, as opposed to
custom-built stacks, in line with the one-size-does-not-fit-all paradigm.
One of the questions that we often encounter, given our NetApp lineage, is what data management
capabilities are needed in real-time data stores. A real-time data store is not attractive by virtue of
its rich data management features, as known to us, but because of its simplicity and its ability to
meet the performance SLO within strict and bounded deviations. That is the single most important
data management feature needed in real-time stores.
Metamarkets’ Druid, used by Netflix, is an example of a real-time data store. With 70 billion log
events per day and ingesting over 2TB of data per hour, it is one of the largest log-collection
infrastructures known, as of this writing. The kinds of operations that are subject to real-time
processing are: aggregation (group by), time-series roll-ups and generalized regular expression
searches.
In order to meet real-time needs, it is often required to lay out data in a way driven by the kind of
queries, filters and dimensional analysis that the data will be subject to. With sequential data
storage, in-memory allocation, and automatic text enumeration (as found in Lisp), searching for
a symbol is really just scanning for an integer in an array. That is why such data stores are a few
orders of magnitude faster than common relational databases for reading and analytics. So, it is
never about just placing data in-memory as-is and hoping for a speedup. There is substantial
engineering and hand-organization required to best organize data for cache-efficient access.
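The "scanning for an integer" point can be sketched with dictionary encoding; the symbols and arrays here are invented examples:

```python
def dictionary_encode(values):
    """Enumerate distinct strings into small integers (dictionary
    encoding).  A search for a symbol then becomes an integer scan over
    a dense array, which is what makes columnar analytic stores fast."""
    mapping = {}
    codes = []
    for v in values:
        codes.append(mapping.setdefault(v, len(mapping)))
    return mapping, codes

symbols = ["GOOG", "NTAP", "GOOG", "TRI", "NTAP", "GOOG"]
mapping, codes = dictionary_encode(symbols)

# "Find all rows for GOOG" is now an integer comparison over `codes`:
goog_rows = [i for i, c in enumerate(codes) if c == mapping["GOOG"]]
```

The encoded array is dense, contiguous and highly compressible, so both the memory-bandwidth and compressibility arguments made earlier apply to it directly.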
Writing data is a weakness of column-oriented storage. Because each column is an array (in-
memory) or file (on-disk), changing a single row means updating each array or file individually as
opposed to simply streaming the entire row at once. Furthermore, appending data in-memory or on-
disk is pretty straightforward, as is updating/inserting data in-memory, but updating/inserting data
on-disk is practically impossible. That is, the user can't change historical data without some massive
hack. For this reason, historical data (stored on-disk) is often considered append-only. In practice,
column-oriented data stores require the user to adopt a bi-temporal or point-in-time schema. Such a
scheme has been adopted by SAP’s HANA solution too.
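A minimal sketch of such a point-in-time, append-only scheme follows (assuming versions arrive in timestamp order; the class and names are ours, not HANA's):

```python
import bisect

class PointInTimeStore:
    """Append-only, point-in-time store of the kind column-oriented
    systems adopt: an 'update' appends a new (timestamp, value) version
    and never rewrites history.  Reads ask for the value as of a time.
    Assumes versions for a key are appended in timestamp order."""

    def __init__(self):
        self._ts = {}      # key -> sorted list of timestamps
        self._vals = {}    # key -> values parallel to _ts

    def put(self, key, ts, value):
        self._ts.setdefault(key, []).append(ts)
        self._vals.setdefault(key, []).append(value)

    def as_of(self, key, ts):
        # Latest version at or before ts, or None if none exists.
        times = self._ts.get(key, [])
        i = bisect.bisect_right(times, ts)
        return self._vals[key][i - 1] if i else None
```

Because history is never rewritten, the on-disk representation stays append-only, which is exactly what the columnar layout's write weakness demands.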
In the scope of in-memory data stores of an InfiniScale solution, the role of flash and other storage-
class memory technologies could be twofold: First, flash volumes can be used as major persistent
storage devices, leaving disks as backup and archiving devices. The insert-only paradigm of an in-
memory database matches the advantages of flash memory. In an insert-only database the number
of random writes can be reduced if not eliminated and the disadvantage of limited durability is
alleviated by the fact that no in-place updates occur and no data is deleted. Second, the low readout
latency of flash storage guarantees a fast system recovery in the event of a system shutdown or
even failure. In a second scenario, flash could be used as memory-mapped storage to keep less
frequently used data or large binary objects that are mainly used during read accesses. The
InfiniScale solution can transfer infrequently used columns to a special memory region representing
a flash volume based on a simple heuristic or manually by the user. The amount of main memory can
However, not all real-time analytics are in the order of a few milliseconds. Quite a few analytics
demand higher flexibility and drill-down capabilities and are willing to pay an additional order of
magnitude or two in latency for that flexibility. This is where a combination of data layout and
inverted indices is used to address the needs of ad-hoc and near-real-time analytics. However,
there are data stores, such as HyperDex, that solve the exact same problem through a concept they
call Hyperspace Hashing, which allows for simultaneous multi-dimensional analysis without having
to deal with multiple single-dimensional inverted indexes.
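A much-simplified sketch of the hyperspace-hashing idea: each searchable attribute hashes to one coordinate of a multi-dimensional grid, and a partially specified search only needs to consult the matching slice of cells (bucket counts and names are illustrative, not HyperDex's actual scheme):

```python
import hashlib

def _coord(value, buckets):
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % buckets

def hyperspace_cell(obj, dimensions, buckets=4):
    """Map an object to a cell in a multi-dimensional hyperspace: one
    hashed coordinate per searchable attribute."""
    return tuple(_coord(obj[d], buckets) for d in dimensions)

def candidate_cells(query, dimensions, buckets=4):
    """Cells a search must consult: a fixed coordinate for each attribute
    present in the query, and every bucket for unspecified ones."""
    axes = [
        [_coord(query[d], buckets)] if d in query else list(range(buckets))
        for d in dimensions
    ]
    cells = [()]
    for axis in axes:
        cells = [c + (b,) for c in cells for b in axis]
    return cells
```

A fully specified query resolves to exactly one cell; each additional unspecified dimension multiplies the cells to consult, without requiring a separate inverted index per dimension.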
One of the major impacts of the need for real-time ingest and query is that the authentic copy of
data will start to come into InfiniScale solutions, rather than first being ingested into shared
storage. This threat has started to become a reality with our customers, and will only grow, calling
for anti-caching32 solutions.
Given that an InfiniScale solution is sized for a certain working set, the size of the cluster need not
change unless business requirements change. Another important aspect is the analysis around
latencies. Most DRAM accesses are of the order of 100 nanoseconds (rounded up for simplicity of
analysis), while going over the network (to a storage array) involves a millisecond, a difference of 4
orders of magnitude. A cache miss is thus very expensive for the application, which it will not be
able to tolerate or choose not to tolerate. Thus, most InfiniScale applications would not use shared
storage. Tiering is adopted, but in the direction of InfiniScale to active archive (anti-caching), rather
than caching into InfiniScale. The amount of data kept as the working set is typically driven by
policies in an organization, which may vary from 1 week to 1 quarter.
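A sketch of the anti-caching direction described here: entries older than the working-set retention window are moved out of memory into the active archive (the policy and data structures are invented for illustration):

```python
def anti_cache(store, archive, now, retention_seconds):
    """Sketch of anti-caching: data older than the working-set retention
    window is moved out of the in-memory store into the active archive,
    the reverse of caching archive data into memory.  `store` maps
    key -> (timestamp, value); evicted entries land in `archive`."""
    for key in list(store):
        ts, value = store[key]
        if now - ts > retention_seconds:
            archive[key] = (ts, value)
            del store[key]
    return len(archive)
```

Note the direction of movement: the in-memory tier holds the authentic working set, and the shared/archive tier receives only what ages out of it.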
Thus, there is a clear split between the InfiniScale architectures and applications created for
InfiniScale and those created for active archives. Hadoop Map-Reduce is applicable to applications
for active archives, not for the real-time InfiniScale space. StorageGRID is an example of an active
archive. FlashAccel clearly falls short due to the issues with caching semantics, as described earlier.
We have a portfolio gap in not having a solution for InfiniScale.
32
http://istc-bigdata.org/index.php/anti-caching-and-non-volatile-memory-for-transactional-dbms/
Transactional updates with full ACID semantics are also supported, but only in limited scopes: in
MongoDB the supported scope is a document, while in Cassandra a transaction cannot span a row.
These limited semantics exist to sidestep hard technology problems, such as distributed
transactions and distributed locks, and to avoid maintaining any shared state. This not only simplifies
the data store's design and implementation, it also yields a blazingly fast solution for that need.
Avoiding shared state across nodes in a cluster also improves partition tolerance. Most applications
using these data stores do not need the missing features, and find it acceptable to absorb some
additional complexity themselves in case they need higher-level semantics.
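These limited scopes can be illustrated with a toy store in which atomicity holds only within a single key, the analogue of a MongoDB document or a Cassandra row. The class and method names are ours, not any product's API:

```python
import threading

class SingleKeyStore:
    """Toy store: updates are atomic per key, never across keys."""
    def __init__(self):
        self._data = {}
        self._locks = {}                 # one lock per key; no global lock
        self._registry = threading.Lock()

    def _lock_for(self, key):
        with self._registry:
            return self._locks.setdefault(key, threading.Lock())

    def compare_and_set(self, key, expected, new):
        """Atomic within one key: the whole value is swapped, or nothing is."""
        with self._lock_for(key):
            if self._data.get(key) != expected:
                return False             # a concurrent writer won; caller retries
            self._data[key] = new
            return True

    def get(self, key):
        return self._data.get(key)

store = SingleKeyStore()
store.compare_and_set("doc:1", None, {"views": 1})
ok = store.compare_and_set("doc:1", {"views": 1}, {"views": 2})      # succeeds
stale = store.compare_and_set("doc:1", {"views": 1}, {"views": 99})  # stale, fails
print(ok, stale, store.get("doc:1"))
```

Because no lock or state spans keys, there is nothing shared for the cluster to coordinate, which is exactly the design trade the text describes.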
There are, thus, elements of self-healing and self-management baked into the InfiniScale
middleware. This eliminates major costs: duplication in hardware can be eliminated because the
software self-heals and is built with failure, and recovery from various failure scenarios, in mind.
InfiniScale solutions are only getting better at self-healing with wider deployment exposure. Another
aspect of cost reduction and simplicity is that the solution is self-managed. For the most part, admins
monitor the state of the system rather than actively manage it. Thus, most of the management tools
used in such deployments (Puppet, Chef, Nagios, and a host of others) allow for simplified
provisioning and setup at the front end of the infrastructure lifecycle, and subsequently monitor the
performance of workloads. This is very different from the active monitoring and management of
workloads and infrastructure done in traditional enterprise deployments.
InfiniScale solutions are developed for analytics needs and are managed by analytical means. By
definition, any InfiniScale solution generates enough forensic data to provide insights into its
operations. Any system providing feedback control at machine speeds must be monitored and
controlled on its own timelines. Thus, most controls are built into the solution by design; other
aspects are monitored, and violations analyzed, through other analytical solutions.
A few questions typically posed in the context of InfiniScale solutions are discussed below:
- How important are storage efficiencies? Short answer: very important. Replication by mere
copying is used for both data resiliency and parallel data access, to avoid hot spots.
However, this approach becomes questionable at large scale as the total cost of ownership
increases. As an example, Acunu was able to engineer Cassandra clusters such that a 10-node
Acunu-engineered cluster could replace a 100-node stock-Cassandra cluster, through well-engineered
data layout and data distribution schemes. At small scale, 3TB versus 9TB is a matter of two
additional disks, but extending that
There is very significant innovation happening in the way InfiniScale solutions are designed,
constructed, deployed and managed. Some of the design tenets and management aspects of
InfiniScale have already been discussed in the previous sub-sections. It is out of scope of this
document to discuss the development process and methodologies, including their release models,
but we would like to say that the release model enables quick turn-around times and fosters faster
innovation. It enables features to be released and pulled-back with equal ease.
There are a few aspects that have proven to be useful to both simplify development as well as
achieve high levels of productivity. Some of these are touched upon here:
1. DevOps: It is important to see how development happens in the new world and what the
developers' workbench looks like. Many developers use Eclipse-based IDEs with Maven integration,
bringing build tools, tests, version control and release mechanisms together. In fact, these IDEs
integrate with a test infrastructure in the cloud, and even extend the development environment into
deployment. This is exactly what is referred to as DevOps: IT operations linked very closely to the
developers' workbench. SpringSource is one such community, created, supported and driven by
VMware. SpringSource integrates Grails, an open source, full-stack web application framework.
33 http://lwn.net/Articles/475681/
A couple of years back, Facebook started the Open Compute (OCP)34 and Open Rack35 projects in the
community. While this originally seemed like a good-Samaritan move, it was an excellent business
decision. Among the more recent developments in that community, Facebook contributed designs of
motherboards that allow Intel and AMD processors (and soon ARM processors) on the same board,
so an ARM part could potentially replace an Intel processor very easily. Upgrades become cheaper
because only the components that need upgrading are replaced, and processor upgrade cycles can
be decoupled from memory upgrade cycles. All this contributes towards their stated goal of the
"most efficient computing infrastructure at the lowest cost".
OCP says its standards promise to deliver hardware that is 24% more energy efficient and 38% more
cost efficient, on average, than so-called commodity hardware. The group is working on specs for
storage, motherboard and server design, racks, interoperability, hardware management, and data
center design. They plan to do this through disaggregation.
Disaggregation is about separating and modularizing storage, compute, interconnects, power,
cooling and other components so companies can custom configure to their workload requirements.
This approach also supports smarter technology refreshes, so companies can swap out and replace
quickly evolving components, such as CPUs, while keeping in service slowly evolving components,
such as memory and network interface cards.
With their new rack designs as shown in Figure 7, the node boundaries are eliminated and a whole
rack is a computer. This allows for efficiencies in the production, operations and upgrades of these
racks. This is now changing the DAS-based approaches and taking the industry toward a more
decomposable and open architecture. While this sounds like good news for NetApp, storage
continues to be treated as a commodity. Also, specialized storage options are emerging in that
community for high-speed IO, dense storage and long-term archival. The OpenVault36
solution looks to be competitive with E-Series. Another storage option is their cold storage
specification, which leverages shingled disks and is designed as a bulk-load fast archive. To achieve
low cost and high capacity, Shingled Magnetic Recording (SMR) hard disk drives are used in the cold
storage system. This kind of HDD is extremely sensitive to vibration; so only 1 drive of the 15 on an
34 http://www.opencompute.org/
35 http://www.opencompute.org/projects/open-rack/
36 http://www.opencompute.org/projects/open-vault-storage/
Figure 7: Open Compute Rack & One Open Compute Project Server in the rack.
Calxeda has adopted ARM's low-power processors and produced ARM-based motherboards,
coupling them with the OpenVault JBOD storage enclosures to produce a storage server. The
interesting observation they make is that a 32-bit processor is sufficient as a controller for cold
storage, and correspondingly it supports only 4GB of onboard RAM. While a 64-bit processor can be
supported, they do not recommend it, for the sake of staying true to controlling cost and power
consumption.
AMD and Facebook have also contributed designs of their micro-servers, which are essentially low
power servers that further enable the anti-virtualization drive. Each micro-server is a cluster of low-
power processors, a small amount (4GB) of low power SDRAM and about 128GB of MLC Flash.
The member organizations have, at this stage, contributed their previous-generation designs to the
community. This also ensures that the Open Compute community gets a validated design to claim
under its banner. But as we go along, other members (such as Intel) have started to bring design
proposals to the table, rather than validated designs. This is an interesting shift and a point of
inflection in the maturity curve of a community project. Intel has contributed specs to OCP for
Silicon Photonics interconnect technologies that already surpass 100 gigabits per second -- nearly
twice the speed of the fastest interconnect technologies currently available.
While many OCP members are hyper-scale Internet companies, we cannot ignore this space:
Goldman Sachs and Fidelity are leading innovation in data center design through OCP, and
Rackspace is known to have wanted to leverage OCP designs. The Open Compute and OpenStack
combination is evolving into a very potent one for InfiniScale solutions and beyond. At this point,
NetApp should track developments in this community.
We would like to draw focus on the bottom-half of the figure, which we have defined as InfiniScale
Storage, as it is responsible for core storage functions. Ingest is handled through Data Ingest module
by writing incoming data into the write-ahead log for local resiliency and replicating the same to
another node for cluster-wide availability. Finally, a write-optimized data store is updated with the
incoming update.
A read is handled by looking at the data layout through a combination of the write-optimized and
read-optimized data stores. Data is moved from the write-optimized store into the read-optimized
store, periodically.
The replication engine is a part of the data distribution layer, which is responsible for availability of
data in the cluster, and cluster-wide consistency semantics.
The Data Model is identical to the ones presented earlier. Data Analytics is part of the data service
layer of the 7-layer model.
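The ingest and read paths described above can be sketched as follows. The WAL append, the replication hook, and the periodic merge are simplified stand-ins for the actual modules:

```python
class IngestNode:
    """Sketch of the InfiniScale Storage bottom-half described above."""
    def __init__(self, replica=None):
        self.wal = []              # write-ahead log: local resiliency
        self.write_store = {}      # write-optimized store
        self.read_store = {}       # read-optimized store
        self.replica = replica     # peer node, for cluster-wide availability

    def ingest(self, key, value):
        self.wal.append((key, value))                  # 1. log locally
        if self.replica is not None:
            self.replica.wal.append((key, value))      # 2. replicate to a peer
        self.write_store[key] = value                  # 3. update write-optimized store

    def read(self, key):
        # Combined layout: the write-optimized store holds the newest data.
        if key in self.write_store:
            return self.write_store[key]
        return self.read_store.get(key)

    def merge(self):
        """Periodic move from the write-optimized to the read-optimized store."""
        self.read_store.update(self.write_store)
        self.write_store.clear()
        self.wal.clear()           # entries are now durable in the read store

peer = IngestNode()
node = IngestNode(replica=peer)
node.ingest("sensor:42", 3.7)
print(node.read("sensor:42"), len(peer.wal))   # served from write store; replicated
node.merge()
print(node.read("sensor:42"))                  # now served from read store
```

The point of the split is that ingest touches only append-friendly structures, while the read-optimized store is rebuilt in the background.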
The fact that InfiniScale solutions adopt a scale-out architecture leads to the need to distribute data
across the cluster. This distribution might be for capacity balancing, for higher availability, or for
parallel data access. Typically, data is distributed in its original form rather than coded into an
asymmetric form, because latency of operation is critical. Finally, data layout defines the
organization of data on the storage media, which may be flash or disk.
As the tiny data elements are collected together in InfiniScale solutions, the data chunks written to
stable storage are much larger than the potential 30-byte ingested fragments. The typical size of
data chunks written to storage is of the order of 10MB. Thus, if the larger data chunks (of order of
1GB) in a capacity store can be broken down into manageable data elements of the size of 10MB, we
can leverage the data store we used for InfiniScale in the capacity solution. This is what we use the
Data Chunking layer for. We then need to check if this chunk is already available within the cluster. If
it is, we will only need to update the metadata (not shown in Figure 10). The Data De-duplication
block does this function. Once we identify a chunk that is not already in the cluster, we will need to
store that chunk in the cluster with high efficiency. So that we can retrieve the data element even in
the case of a disaster, it should be possible to code the chunk such that its sub-chunks can be
distributed across geos. The Coding Layer performs this function. Finally, we distribute the sub-
chunks and store them on the identified nodes.
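A minimal sketch of the chunking and de-duplication steps above. The use of a SHA-256 content hash as the chunk identity and fixed-size chunking are illustrative assumptions, not the actual design:

```python
import hashlib

CHUNK_SIZE = 10 * 1024 * 1024   # ~10MB, the chunk size cited above

def chunk(data: bytes, size: int = CHUNK_SIZE):
    """Data Chunking layer: break a large blob into manageable chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

class DedupStore:
    """Data De-duplication block: store a chunk only if its content is new."""
    def __init__(self):
        self.chunks = {}     # content hash -> chunk bytes

    def put(self, data: bytes, size: int = CHUNK_SIZE):
        keys = []
        for c in chunk(data, size):
            h = hashlib.sha256(c).hexdigest()
            if h not in self.chunks:       # already in the cluster? metadata only
                self.chunks[h] = c
            keys.append(h)
        return keys          # metadata: the recipe to reassemble the blob

    def get(self, keys):
        return b"".join(self.chunks[k] for k in keys)

store = DedupStore()
blob = b"abcd" * 8
recipe = store.put(blob, size=4)           # 8 identical 4-byte chunks...
print(len(recipe), len(store.chunks))      # ...stored once
assert store.get(recipe) == blob
```

The coding and geo-distribution steps would then operate on these unique chunks rather than on the original blob.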
3.1.4 Summary
Real time data stores and Capacity based data stores thus fit into the data store stack discussed in
section 2.2.1 as shown in Figure 11.
A related and often ignored element is: How much should the InfiniScale solution be aware of NUMA
that is present in the system? From Figure 12, above, it is clear that a local DRAM bank costs 65ns
while a remote DRAM bank is 105ns, a 60% penalty. Thus, optimizing the in-memory layout and the
threading schemes to leverage this discontinuity of access latencies will result in significantly more
efficient systems leading to lowered TCO.
Figure 13 shows typical access latencies as one goes down the memory hierarchy. If data were
organized such that access required pointer chasing from one memory location to another, cache
locality would not be leveraged well. This would have a significant impact on the performance of the
InfiniScale solution; thus, a great deal of work goes into engineering the in-memory data layout.
Data Layouts thus have to be cache-aware while being cache-size oblivious. Cache-aware means that
they should be aware of the memory hierarchies and ignoring the same has significant performance
and operational costs. Cache-size oblivious refers to the condition that the algorithms should be
such that they do not tie themselves to the sizes of different caches. An extreme example is cDOT,
which treats 4KB as a special buffer size and optimizes around it; as the environment changes, it
takes a herculean effort to move away from that size affinity.
Another aspect we would like to highlight in the context of in-memory data stores is that random
reads are still very different from sequential reads, even in “random access memory”. The in-
memory data layout is driven by the access patterns. What is sequential for one access pattern
might turn out to be randomized access pattern for another algorithm. For example, a columnar
store can be optimized for reads along certain dimensions and is O(N) for the most part. But if the
same layout is subjected to a nearest-neighbor algorithm, it becomes O(N²). Workloads do not
change randomly, but do evolve over a period of time. To be able to evolve the layout based on
needs is one way of addressing this. Another way of addressing this is to invent a data layout that
can be used by a majority of analytics algorithms.
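The dependence of cost on layout can be seen even in a toy columnar store: a scan along the stored dimension is one contiguous pass, while reassembling whole rows requires a jump into every column. This is an illustrative sketch, not any particular engine's layout:

```python
# Toy column store: each column is stored contiguously.
columns = {
    "user":  ["a", "b", "c", "d"],
    "score": [10, 20, 30, 40],
    "ts":    [1, 2, 3, 4],
}

def column_scan(col):
    """Sequential for this layout: one contiguous pass over a single column."""
    return sum(columns[col]) if col != "user" else None

def fetch_row(i):
    """Random for this layout: one jump into every column per row."""
    return {name: vals[i] for name, vals in columns.items()}

print(column_scan("score"))   # one pass over contiguous data
print(fetch_row(2))           # len(columns) scattered accesses for one row
```

An algorithm that repeatedly needs whole rows (such as a nearest-neighbor search) pays the scattered-access cost on every row, which is the asymmetry the text describes.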
The ARIES transactional model [Mohan92], which has been the basis for transactional systems for the
past 20+ years, will be redefined in the context of SCM; ideally, we should not need a separate
write-ahead log for protecting transactions.
ATG Investigations
R1. Devising an in-memory layout for Cassandra: Cassandra is a very popular InfiniScale
middleware and a column-oriented data store. A column-oriented store is also our first choice, as it
helps address the needs of our enterprise customers, which are mostly database-oriented but need a
few specific functions from their data store that are not being met: rate of data ingest, tiny-data
ingest, scale of processing, and the like. For such customers, it would be good to investigate a
solution that is key-value oriented but can support a columnar structure. The goal is to optimize the
data layout in memory for subsequent storage on disk and efficient retrieval from it. It is also
important to investigate how the layout should evolve if the longer-term retention medium is flash,
as opposed to rotating media.
If we do treat InfiniScale as an IOPS tier over object stores, there has to be a translation from the
tiny keys to the large blobs into which large numbers of tiny key-value pairs are packed. Thus, given
a tiny key, one should be able to determine the blob key that encapsulates it. This metadata mapping
is internal to the InfiniScale solution.
There are multiple ways in which this can be achieved. Some of the solutions involve putting an
upper bound on the number of blobs that shall be inspected. Others need a more deterministic
mapping. As the value of the tiny key becomes smaller, it becomes more and more challenging to
meet the deterministic mapping schemes. This is because the amount of overhead to manage the
mapping from the tiny key to a blob key will be very large. We can thus resort to mechanisms such
as a probabilistic lookup into the blob to determine if the tiny key would be found in that blob with
high confidence. Once such a blob is determined, the key space within that blob is either searched,
or yet another key hash is looked up for a deterministic match.
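The probabilistic lookup just described is commonly realized with a Bloom filter per blob. The sketch below illustrates the idea; the filter parameters and helper names are ours:

```python
import hashlib

class BloomFilter:
    """Tiny probabilistic membership filter, one per blob (sketch only)."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0                     # bitset held in a Python int

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key: str):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key: str) -> bool:
        # No false negatives; rare false positives.
        return all(self.array >> p & 1 for p in self._positions(key))

# One filter per blob: a tiny-key lookup first asks each filter,
# then searches only the blobs that answer "maybe".
blob_filters = {}

def index_blob(blob_key, tiny_keys):
    f = BloomFilter()
    for k in tiny_keys:
        f.add(k)
    blob_filters[blob_key] = f

def candidate_blobs(tiny_key):
    return [b for b, f in blob_filters.items() if f.might_contain(tiny_key)]

index_blob("blob-001", ["k1", "k2"])
index_blob("blob-002", ["k3"])
print(candidate_blobs("k3"))   # narrows the search; may rarely include false positives
```

Once a candidate blob is identified, the key space within it is searched or a per-blob key hash is consulted for the deterministic match, as described above.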
The capacity tier can make no assumption about the metadata content within the blobs. The fact
that these blobs are immutable also allows one to consolidate the metadata signatures into a
central store that may be cached in memory. Once that is done, the capacity tier is just a large blob
object store, which can serve as the store for immutable data elements from InfiniScale as well as
for just about any other capacity-tier purpose.
ATG Investigations
R4. Unified Data Store: As stated, the bottom halves of the InfiniScale solutions and the capacity
solutions have the potential for common underpinnings; this validation is in order. It will also be
useful to construct a simple capacity solution with efficiency- and resiliency-based codes built over
the disk-based storage layout of a large blob store (on the order of 10MB per blob). We already have
IOPS solutions that consolidate tiny KV-pairs into large blobs. Having a single unified data store will
enable more seamless data tiering and movement between a cold capacity tier and a warmer
InfiniScale solution. It also enables one to have
Compression for columnar stores also lines up nicely, as similar KV-pairs are batched into a single
column that finds its way into a single physical file/blob on disk. For example, SAP HANA's
SanssouciDB37 uses dictionary encoding, where a dictionary maps each distinct value of a column to
a shorter, so-called value ID. Each value ID is then compressed using only as many bits as are needed
for the operating range. In the dictionary below, 3 bits are needed to encode the 5 values of the
highlighted column.
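The bit-width arithmetic behind dictionary encoding can be sketched as follows; the function name and column contents are illustrative:

```python
import math

def dictionary_encode(column):
    """Map each distinct value to a short value ID (dictionary-encoding sketch)."""
    dictionary = {v: i for i, v in enumerate(sorted(set(column)))}
    ids = [dictionary[v] for v in column]
    # Only ceil(log2(distinct values)) bits are needed per value ID.
    bits = max(1, math.ceil(math.log2(len(dictionary))))
    return dictionary, ids, bits

col = ["red", "blue", "red", "green", "cyan", "pink", "blue"]
dictionary, ids, bits = dictionary_encode(col)
# 5 distinct values -> 3 bits per value ID, matching the example above.
print(len(dictionary), bits)
```

The column itself is then stored as a packed array of these small IDs plus the (much smaller) dictionary.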
InfiniScale solutions also leverage open source compression techniques such as Snappy, a
compression/decompression library. It does not aim for maximum compression, or compatibility
with any other compression library; instead, it aims for very high speeds and reasonable
compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude
faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On
a single core of an Intel i7® processor in 64-bit mode, Snappy compresses at about 250 MB/sec or
more and decompresses at about 500 MB/sec or more.
ATG Investigations
R5. Storage Efficiency through read-compressed: Storage Efficiency in the InfiniScale context is
very different from traditionally known mechanisms, like de-duplication and compression.
37 http://link.springer.com/chapter/10.1007%2F978-3-642-29575-1_4
Master-slave architectures have centralized metadata management, which can be looked up to find
the nodes holding the data, from where it can be served [Kerr94, Vei01]. This also helps identify
less-loaded nodes so that load is better distributed. HDFS is an example of centralized metadata
management. This typically follows a tight consistency model and is fine within a data center, but
not across a high-latency wide-area link.
The other extreme is Cassandra, which follows a peering architecture that helps split the key-space
and routes requests within the cluster for a maximum of O(logN) hops, as in DHT. This has better
resiliency characteristics and decentralized balancing logic when nodes enter or leave the cluster
membership. It also spreads workloads better, as there is no single metadata server to become a
bottleneck.
What is ideally needed is the peering architecture of Cassandra with the single-hop operation of
HDFS. This is important in the context of Flash and more so in the context of SCM. Options in
between the above-mentioned extremes must thus be explored.
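The decentralized placement underlying the peering approach can be illustrated with a minimal consistent-hash ring. With full membership knowledge, any node can compute placement locally in a single hop; with only partial knowledge, as in a DHT, routing takes up to O(logN) hops. The hashing details are illustrative:

```python
import hashlib
from bisect import bisect_right

def _hash(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    """Consistent-hash ring: decentralized placement, no metadata master.
    A key belongs to the first node point at or after its hash on the ring."""
    def __init__(self, nodes):
        self.points = sorted((_hash(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        h = _hash(key)
        idx = bisect_right(self.points, (h, chr(0x10FFFF)))
        return self.points[idx % len(self.points)][1]   # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
placements = {k: ring.owner(k) for k in ("user:1", "user:2", "user:3")}
print(placements)   # any node can compute this locally; no central lookup
```

Since every node derives the same answer from the same membership view, there is no metadata server to become a bottleneck, which is the property the text seeks to combine with HDFS-style single-hop lookups.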
ATG Investigations
R6. Geo-scale Data Distribution: There is a clear opportunity to merge the notions of disaster
recovery and data resiliency (covered in the next sub-section), to achieve higher levels of
storage efficiency. However, mechanisms of cross-data-center distribution and wide-area topology
construction are not well understood within NetApp, as we have not played much in the WAN
space. Reconstruction typically happens in the context of a file/object. Given
that these are coded objects that were distributed, each data center must keep track of
(metadata) that associates the objects to their coded chunks. Protocol mechanisms, such as
torrents, will be needed to stream these data chunks to a point of reconstruction, which is
typically closest to the point-of-consumption. Data ingest should take into account capacity
balancing as well as safeguard erasures due to a disaster. It should be possible to reconstruct
and redistribute object chunks that were lost due to a whole data center outage.
One significant shift needed in the coding algorithms is that they should have the property of
logical reconstruction and must not be tied to the capacity of a physical device. They should also
spread the load of reconstruction over the entire group, yet be able to contain the fault boundary.
HDFS and most of the other InfiniScale solutions resort to this logical reconstruction by rebuilding
the contents of a logical bucket. This yields better capacity balancing across nodes in a cluster while
all nodes contribute to reconstruction, thereby reducing reconstruction time.
However, if we need a DR site, we will need to create a full replica of the primary data at that site.
Most cold data will be consolidated in the cloud. The cloud is fundamentally a geo-distributed
store with high bandwidth links between those sites. If disaster strikes one of those sites, it should
still be possible to recover all data from the other sites. So, if there are 10 data center sites, a
complete erasure of one site should not result in any data loss. There is also the need of load and
capacity balancing across those sites. If coding can be done in a way that all these aspects can be
provided by a single solution, the efficiency of the solution would be maximized, leading to optimal
TCO.
Hierarchical regenerative codes have the property of being able to regenerate all data locally, with
the help of a master. If the master and another node are lost locally, the master from another site
will be needed to complete reconstruction. This code has a very nice property of graceful
degradation, and the degree of degradation depends on the severity of the fault.
Other codes, such as network codes, have the property of minimizing data transfer over the WAN,
and extending those properties into storage will help with end-to-end storage efficiency. Typically,
the regeneration code is also sent along with the coded data to enable the destination to
reconstruct; if this regeneration code can be secured, the transmission is provably secure. Thus, by
combining two or more functions of this nature, the efficiencies of the end-to-end system are
further improved.
Finally, asymmetric erasure codes, such as (modified) Reed-Solomon codes, are known to be
expensive to compute but provide very good storage efficiency at high erasure rates. It is possible
to handle 5 erasures over 25 chunks at 30% overhead. Creating a replica at 3 known locations would
instead incur 200% overhead, with further overhead for maintaining metadata on where data
chunks can be found; and even during reconstruction, the entire bisection bandwidth cannot be
used. Thus, a replica-based solution at geo-scale does not work well, and the alternatives listed here
must be explored.
Yet another aspect in favor of asymmetric coding is the fact that Intel CPUs have SIMD instructions
that further help alleviate the cost of coding and reconstruction.
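To make the overhead comparison concrete, the sketch below uses a single XOR parity, the simplest erasure code: k data chunks plus one parity tolerate any single erasure at 1/k overhead, versus 200% for 3-way replication. The real designs discussed above use Reed-Solomon or regenerative codes, which this does not implement:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(chunks):
    """k data chunks + 1 XOR parity: tolerates any single erasure
    at 1/k storage overhead (vs 200% for 3-way replication)."""
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return chunks + [parity]

def reconstruct(stripe, lost_index):
    """Rebuild the lost chunk by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    out = survivors[0]
    for c in survivors[1:]:
        out = xor_bytes(out, c)
    return out

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # k = 4 -> 25% overhead
stripe = encode(data)
assert reconstruct(stripe, 2) == b"CCCC"      # recover an erased chunk
print("overhead:", f"{1 / len(data):.0%}")
```

Stronger codes generalize this idea to multiple simultaneous erasures, at the computational cost the text notes SIMD instructions help alleviate.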
ATG Investigations
R7. Mapping the Coding Landscape: Prof. Muriel has expressed deep interest in mapping the
coding landscape and co-authoring a survey paper with NetApp. The pros and cons of different
coding schemes severely lack thorough treatment, so comparisons have been difficult; the absence
of such a survey is seen as a significant gap in academic and industry circles.
R8. Experimentation with Coding Algorithms: Coding theory and implementation specifics have
evolved as newer techniques have been proposed and newer CPU capabilities have evolved.
Even newer implementations are posted in open source. It is thus required to make some
investments within ATG to assess some of these algorithms from a realization standpoint.
3.2.6 Others
One of the major areas where NetApp is looking to invest is solutions that meet the growing needs
of a large market. Cloud is one such market. This recommendation is thus an opportunistic one.
Amazon S3 has come to define how storage in the cloud is described. What would compete with it?
Or what would enable a provider to create a solution compelling enough in another dimension?
Amazon S3 is a capacity-based object store, and we just described a Unified Data Store that has an
object store as its underpinnings.
ATG Investigation
R9. NetApp Key-Value Appliance: This investigation is not about data layouts or building a
functional KV-store; that is covered by the unified data store and other investigations. It is about
the challenges of getting those into an appliance form factor. Some basic cost modeling shows that
E-Series can be a viable platform for a cost-sensitive storage tier. The fundamental persona needed
is that of an object store, and we just showed that the capacity tiers of InfiniScale can be leveraged
as an object store, which is a KV-store. Identifying the design options for retaining the optimal
aspects of the eos firmware while adding the functionality of a large-blob KV-store will enable a
low-cost, native object store. This native object store can then be combined with memory-heavy
hosts for an InfiniScale solution. In its native form, it can be used as a storage building block for the
cloud, in a 4U form factor. The more feature-rich StorageGRID software could even be deployed on
this solution, which can also be given an S3 persona to make it cloud-ready.
Acunu Data Platform: The Acunu Data Platform is a next-generation Big Data Database combining
Apache Cassandra, Acunu Control Center and the Acunu Storage Engine (also known as Acunu
Castle), as shown in Figure 15. The Acunu Storage Engine is at the heart of Acunu's distribution for
Apache Cassandra. It comprises a rewrite of the Linux storage stack that offloads much of the storage work
from Cassandra and includes advanced OS caching and buffering schemes that eliminate the need
for tuning and provide high and predictable performance for a wide range of workloads. Acunu
transforms Cassandra into an easy to use, enterprise-ready database system optimized for today's
demanding NOSQL workloads and cloud environments. The Acunu Control Center provides simple
web-based management to support common administrative tasks including cluster management
and database creation together with unique features such as cluster-wide snapshot and clone.
Acunu requires no changes to Cassandra applications; it's integrated, tested and hardened; and is
100% compatible with Apache Cassandra drivers and APIs including Thrift and CQL.
Acunu Castle: Acunu developed its platform from the ground up to leverage developments such as
distributed servers in the cloud and the huge throughput increases enabled by SSDs. Acunu built
Castle -- the storage core, an open-source Linux kernel module that contains optimizations and data
structures targeted to be deployed on commodity hardware. Castle offers a new storage interface,
where keys have any number of dimensions, and values can be very small or very large. Whole
ranges can be queried, and large values streamed in and out. It’s designed to be just general-
purpose enough to model simple key-value stores, BigTable data models such as Cassandra’s, Redis-
style data structures, graphs, and others.
Acunu Analytics: Acunu Analytics delivers a platform and toolset that makes it possible to build and
extend complex, real-time applications easily and quickly. It does this by layering flexible and
expressive data modeling on top of Cassandra's base 'key-value pair' data model and by delivering a
much richer query capability; one that is more recognizable to developers used to the ease of use
and power of SQL. Acunu Analytics runs as a layer above any Apache Cassandra ring. In fact, it is
actually a Cassandra client application, using the Hector library, so you can run it on a shared
Cassandra cluster, alongside your existing applications. Similar to Cassandra, Acunu Analytics is
scalable, high performance and handles node failures and cluster membership changes without
interruption.
Acunu Analytics provides SQL-like query constructs to the NoSQL world, enabling familiar concepts
such as SELECT, WHERE, JOIN, and GROUP BY and built in aggregating functions such as topK and
Standard Deviation. Data collection can be performed by a JSON-based API, custom data integration,
and integration with Apache Flume. Acunu Analytics also ships with Acunu Dashboards, a powerful
and flexible browser-based tool for building live dashboards, configuring Analytics schemas, and
visualizing results.
Bare Bones: The founders of Acunu have several publications validating their underlying data
structures. The first showed the copy-on-write B-tree finally being beaten by Andy Twigg et al.
[Twigg11], introducing their work on data structures for versioned data stores; a more detailed
account appears in [Byde11].
A startup database vendor based in Vienna, VA, launched in March 2013, claims that its database,
FoundationDB, delivers on the promise of true data consistency for a NoSQL database without a
huge loss of speed or flexibility. The initial release of the FoundationDB data store is
The founders of FoundationDB claim that CAP has been misunderstood by most people, and that
choosing C and P does not in fact preclude a system from being highly available in failure
scenarios.38
Database analyst Curt Monash, of Monash Research, has warned against data stores designed to
support multiple data models, noting that "To date, nobody has ever discovered a data layout that is
efficient for all usage patterns"39.
FoundationDB is a key-value-like storage engine that can support (multiple) layers of NoSQL data
models. It can support a document data model to replace MongoDB, or support a key-value model
to replace memcached, or support a graph model to replace Neo4J. This enables developers to much
more easily code their apps against FoundationDB. These layers, according to the founders, can't be
built on other key-value systems, because without consistent transactions they would not work. As
building a distributed, fault-tolerant, high-performance database with cross-node ACID transactions
was difficult enough, as many database features as possible were pushed out of the core and into
layers40. Thus, the data model is a simple ordered key-value store (like a dictionary) and the API is
simple, but ACID transactions make building higher-level data models and features very simple.
Also, since data is going to be consistent, applications won't have to be built to wait for data to
catch up within a given transaction, making apps less complex and easier to build.
FoundationDB, however, has found a way to offer both availability and consistency through Paxos --
an agreement algorithm, which ensures that multiple copies of the data -- the database keeps three
copies of all data it stores -- stay synchronized. Google engineers also used Paxos [Les01] in its
Spanner global database architecture [Corb12], though Google's setup is different from
FoundationDB's. Google's up-and-coming Spanner database, a second-generation distributed
database that could ultimately replace the search engine company's Bigtable systems, is being built
on the premise that transactional integrity has to be a part of that database, too.
FoundationDB does not offer the traditional SQL interface; instead it offers data access through C,
Python, Ruby, Node.js and Java APIs. It uses optimistic concurrency control and multi-version
concurrency control to construct a lock-free database, which is essential in a high-performance
distributed system. The transaction conflict resolution function is decoupled from the data storage
function, enabling the two to be optimized separately. FoundationDB is optimized to take advantage
of the high random I/O of SSDs, delivering high performance with strong durability guarantees. It is
written in Flow, a new language of FoundationDB's own design that extends C++ with Erlang-like
asynchronous functionality while retaining the performance advantages of C++. Flow also gave the
team the ability to simulate thousands of failure scenarios that could cause ACID violations.
The company has published detailed metrics based on running a $39k, 24-machine cluster against a
dataset of two billion key-value pairs. It reports a stable 500,000 operations per second at 90
percent reads and 10 percent writes, 150,000 operations per second at a 50/50 mix, and up to
1,080,000 writes per second across blocks of 140 adjacent keys. The software is not available as
open source, though the company has promised to release a no-cost community version. The full
general release is expected by the end of 2013. The software runs on Linux, OS X, and Windows, as
well as on Amazon's Elastic Compute Cloud (EC2).
38
http://foundationdb.com/#CAP
39
http://www.dbms2.com/2013/02/21/one-database-to-rule-them-all/
40
http://www.foundationdb.com/#layers
The sweet spot for BangDB is to run as many threads as there are CPUs on the machine. Since
performance was one of BangDB's main design goals, the database had to be concurrent and able to
take advantage of all the CPUs in the machine41. Concurrency adds complexity and overhead, but
settling for low performance on a high-capacity machine was not acceptable. Key elements of the
design42 include:
- Manipulation of the tree by any thread uses only a small, constant number of page locks at
any time
- Searching the tree does not prevent reading any node; the search procedure in fact does no
locking most of the time
- Based on the Lehman and Yao paper [Lehman81], but extended further for performance
improvement
- Separate pools for different types of data, giving flexibility and better performance when
managing data in the buffer pool in different scenarios
- Semi-adaptive data flush to ensure performance degrades gracefully when data overflows
the buffer
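The Lehman-Yao idea [Lehman81] behind BangDB's lock-free search can be sketched as follows: each node carries a high key and a right-sibling link, so a reader that reaches a node after a concurrent split simply chases the link rightward instead of taking a lock. The sketch below flattens the structure to a chain of leaves and omits the tree descent and all writer-side locking.

```python
# Simplified B-link sketch: high keys plus right-sibling links let readers
# recover from concurrent splits without holding any locks.

class Node:
    def __init__(self, keys, high_key, right=None):
        self.keys = keys          # sorted keys stored in this node
        self.high_key = high_key  # upper bound on keys this node may hold
        self.right = right        # link to the right sibling

def search(node, key):
    # If the key exceeds this node's high key, a split has moved it to a
    # sibling on the right -- follow the link instead of locking.
    while node is not None and key > node.high_key:
        node = node.right
    return node is not None and key in node.keys

# A split moved keys >= 30 into a new right sibling; a stale reader that
# still enters at the left node finds 40 by chasing the link.
right = Node([30, 40], high_key=float("inf"))
left = Node([10, 20], high_key=29, right=right)
found = search(left, 40)   # -> True
```

The invariant doing the work is that every key is always reachable by moving rightward, so a reader can never be stranded by an in-progress split.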
41
http://highscalability.com/blog/2012/11/29/performance-data-for-leveldb-berkley-db-and-bangdb-for-rando.html
42
http://www.iqlect.com/architecture.php
Other
Apart from the open source projects that contribute to InfiniScale solutions, two major players (EMC
and Amazon) were analyzed for their play in the InfiniScale space. We concluded with
recommendations for ATG to pursue, and potential technology targets that NetApp should consider
as inorganic options.
Given below are some key insights that were gathered during various phases of authoring this
report.
References
[Beer13] Leander Beernaert, Pedro Gomes, Miguel Matos, Ricardo Vilaça, and Rui Oliveira. Evaluating
Cassandra as a manager of large file sets. In Proceedings of the 3rd International Workshop on
Cloud Data and Platforms (CloudDP '13). ACM, New York, NY, USA, 25-30, 2013.
[Brewer00] Eric A. Brewer. Towards robust distributed systems. In PODC, page 7, 2000.
[Byde11] Andrew Byde, Andy Twigg. Optimal query/update tradeoffs in versioned dictionaries,
http://arxiv.org/abs/1103.2566, April 2011.
[Chen12] Jianjun Chen, Chris Douglas, Michi Mutsuzaki, Patrick Quaid, Raghu Ramakrishnan, Sriram Rao,
and Russell Sears. 2012. Walnut: a unified cloud object store. In Proceedings of the 2012 ACM
SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY,
USA, 743-754.
[Corb12] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman,
Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh,
Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura,
David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak,
Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google’s Globally-Distributed
Database, Proceedings of OSDI'12: Tenth Symposium on Operating System Design and
Implementation, Hollywood, CA, October, 2012.
[Gilbert02] Seth Gilbert and Nancy A. Lynch. Brewer’s conjecture and the feasibility of consistent, available,
partition-tolerant web services. SIGACT News, 33(2):51–59, 2002.
[Govil08] Jivika Govil; Kaur, N.; Kaur, H.; Jivesh Govil, "Data/Information Lifecycle Management: A Solution
for Taming Data Beast," Fifth International Conference on Information Technology: New
Generations, 2008. ITNG 2008., vol., no., pp.1226,1227, 7-9 April 2008.
[Gray81] J. Gray. The Transaction Concept, Virtues and Limitations. In Proceedings of VLDB, Cannes,
France, Sept 1981.
[Grid06] G. Grider, L. Ward, R. Ross, and G. Gibson, "A Business Case for Extensions to the POSIX I/O API
for High End, Clustered, and Highly Concurrent Computing,"
www.opengroup.org/platform/hecewg, 2006.
[Hild09] Dean Hildebrand, Arifa Nisar, and Roger Haskin. 2009. pNFS, POSIX, and MPI-IO: a tale of three
semantics. In Proceedings of the 4th Annual Workshop on Petascale Data Storage (PDSW '09).
ACM, New York, NY, USA, 32-36.
[Kan12] Kan, M.; Kobayashl, D.; Yokota, H., "Data layout management for energy-saving key-value storage
using a write off-loading technique," Cloud Computing Technology and Science (CloudCom), 2012
IEEE 4th International Conference on , vol., no., pp.74,81, 3-6 Dec. 2012.
[Kerr94] Kerr, A.U., "Towards Distributed Storage and Data Management Systems," Mass Storage Systems,
1994. 'Towards Distributed Storage and Data Management Systems.' First International
Symposium. Proceedings., Thirteenth IEEE Symposium on , vol., no., pp.1,, 1994.
[Khai11] Cho Cho Khaing; Thinn Thu Naing, "The efficient data storage management system on cluster-
based private cloud data center," Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE
International Conference on , vol., no., pp.235,239, 15-17 Sept. 2011.
[Lehman81] Philip L. Lehman and S. Bing Yao. Efficient locking for concurrent operations on B-trees.
ACM Trans. Database Syst. 6, 4, 650-670, December 1981.
[Les01] Lamport, Leslie. Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4
(Whole Number 121), 51-58, December 2001.
[Leva13] Justin J. Levandoski, David B. Lomet, Sudipta Sengupta, The Bw-Tree: A B-Tree for New Hardware
Platforms, 29th IEEE International Conference on Data Engineering, 2013.
[Ling03] Benjamin C. Ling and Armando Fox. 2003. The case for a session state storage layer. In
Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9 (HOTOS'03),
Vol. 9. USENIX Association, Berkeley, CA, USA, 30-30.
[Mohan92] C. Mohan , Don Haderle , Bruce Lindsay , Hamid Pirahesh , Peter Schwarz. Aries: A transaction
recovery method supporting fine-granularity locking and partial rollbacks using write-ahead
logging, ACM Transactions on Database Systems, Vol 17, 94-162, 1992.
[Nishi12] Nishikawa, N.; Nakano, M.; Kitsuregawa, M., "Energy Efficient Storage Management Cooperated
with Large Data Intensive Applications," Data Engineering (ICDE), 2012 IEEE 28th International
Conference on , vol., no., pp.126,137, 1-5 April 2012.
[Sakr11] Sherif Sakr, Anna Liu, Daniel M. Batista, and Mohammad Alomari, A Survey of Large Scale Data
Management Approaches in Cloud Environments , IEEE Communications Surveys & Tutorials, Vol.
13, No. 3, Third Quarter 2011, 311 - 336.
[Ston05] Michael Stonebraker and Ugur Cetintemel. 2005. "One Size Fits All": An Idea Whose Time Has
Come and Gone. In Proceedings of the 21st International Conference on Data Engineering (ICDE
'05). IEEE Computer Society, Washington, DC, USA, 2-11, 2005
[Twigg11] Andy Twigg, Andrew Byde, Grzegorz Milos, Tim Moreton, John Wilkes, Tom Wilkie. Stratified B-
trees and versioning dictionaries, http://arxiv.org/abs/1103.4282, March 2011.
[Vei01] Veitch, A.; Riedel, E.; Towers, S.; Wilkes, J., "Towards global storage management and data
placement," Hot Topics in Operating Systems, 2001. Proceedings of the Eighth Workshop on ,
vol., no., pp.184,, 20-22 May 2001.
[Voul11] Voulodimos, A.; Gogouvitis, S.V.; Mavrogeorgi, N.; Talyansky, R.; Kyriazis, D.; Koutsoutos, S.;
Alexandrou, V.; Kolodner, E.; Brand, P.; Varvarigou, T., "A Unified Management Model for Data
Intensive Storage Clouds," Network Cloud Computing and Applications (NCCA), 2011 First
International Symposium on , vol., no., pp.69,72, 21-23 Nov. 2011.
NetApp provides no representations or warranties regarding the accuracy, reliability, or serviceability of any
information or recommendations provided in this publication, or with respect to any results that may be
obtained by the use of the information or observance of any recommendations provided herein. The
information in this document is distributed AS IS, and the use of this information or the implementation of
any recommendations or techniques herein is the implementers’ responsibility and depends on their
ability to evaluate and integrate them into the operational environment. This document and
the information contained herein may be used solely in connection with the NetApp products discussed
in this document.
© 2013 NetApp, Inc. All rights reserved. No portions of this document may be reproduced without prior written consent of NetApp,
Inc. Specifications are subject to change without notice. NetApp, the NetApp logo and Go further, faster are trademarks or registered
trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered
trademarks of their respective holders and should be treated as such.