
ATG/Projects/Pissarro/Report

NetApp Confidential

Analysis of Column Oriented and NoSQL Databases


Author: Jiri Schindler

Draft date: August 31, 2011

v. 1.0

Contents

1 Introduction/Executive summary
1.1 Guide to the rest of the report
2 Background
2.1 Traditional DBMS architecture
2.2 Column-oriented DBs
2.3 Architectural changes
2.4 Canonical CODBMS architecture
2.4.1 Large memory single system image
2.4.2 Clustered system
3 Existing and emerging systems
3.1 Established commercial products
3.1.1 Sybase IQ
3.1.2 SAP/R3
3.1.3 Oracle RAC and Exadata Database Machine
3.2 Smaller players
3.2.1 Netezza
3.2.2 Aster Data Systems
3.2.3 Vertica
3.2.4 ParAccel
3.2.5 Endeca
3.3 HBase
3.3.1 General architecture
3.3.2 Storage with HDFS
3.3.3 Column format details
3.4 Other technologies
3.4.1 Schema-less or NoSQL databases
3.4.2 Research systems
4 Applications and usage patterns
4.1 Star schema
4.2 Query types
4.3 Updates
4.4 I/O profile
4.5 Automated on-line re-provisioning
4.6 High availability and system bring-up
5 Implications for storage systems
5.1 Different I/O behavior
5.2 System types based on I/O behavior
5.2.1 Type I: In-memory systems
5.2.2 Type II: Large-data clustered systems
5.2.3 Type III: DAS caching (Host-side flash)
5.2.4 Type IV: HDFS
5.2.5 Type V: Cassandra
6 Threats and opportunities for NetApp
6.1 FAS system mismatch
6.2 Opportunities
6.2.1 Fine-grained object management
6.2.2 InfiniVols and Searchlight
6.2.3 Improving basic R/W performance
6.2.4 E-series solutions
7 Closing the gaps
7.1 Short-term efforts
7.1.1 Understanding Type IV and V workloads
7.1.2 SnapManager for HBase
7.1.3 Core ONTAP features
7.2 Medium-term efforts
7.2.1 Host-side flash management
7.2.2 Solutions optimized for Type IV and Type V
7.2.3 ONTAP core performance
7.3 Long-term efforts
7.3.1 InfiniVols and E-Series
8 Proposed projects
8.1 Project I: Host-side flash storage management
8.2 Project II: Understanding HBase, Cassandra, and other technologies
8.3 Project III: ONTAP core technologies
8.3.1 Project III-A: Efficient extent writes
8.3.2 Project III-B: Extent-based cache management
8.4 Collaborative project with SAP
8.5 Leveraging university research
8.5.1 MIT Database Group
8.5.2 CMU Parallel Data Lab
9 Recommendations

Introduction/Executive summary
The ever-increasing number of CPU cores per die and the abundance of DRAM relative to the size of the active dataset profoundly affect the internal architecture
of systems and applications managing and accessing structured and semi-structured data. This in turn leads to different usage of and access patterns to persistent
data and forces us to re-consider how shared storage system infrastructure supporting these database management systems (DBMS) ought to be built.

On the workload front, the explosion of collected data leads to running ever-more complex business intelligence (real-time data analytics) queries. As a result, database workloads that were in the past dominated by record-level updates driven by OLTP (on-line transaction processing) are moving away from that profile. More fundamentally, the internal DBMS organization has moved from a whole-record (row-oriented) layout to a column-based one. The more important workload is no longer a record-level (small) update but instead the ability to quickly and efficiently scan large amounts of data for analytics (e.g., calculating aggregations, trends, etc.). On the technology front, the abundance of DRAM in particular allows most relevant (active) data to reside in main memory. Coupled with the large number of CPU cores, one can process large quantities of data at DRAM-access speeds (some quote the ability to scan 1,000,000 small fixed-size items or records per millisecond per core).

These trends have secondary implications for the architecture of database systems. First, to support the streaming efficiency of accessing data from DRAM through the CPU-level memory hierarchy (e.g., L1 or L2 caches), databases use compression and normalization (encoding) to make each attribute (column) fixed-size and as small as possible. Second, rather than keeping many indexes on various keys (attributes) and paying the cost of maintaining them up-to-date in the face of changes, database systems simply do not maintain very many (or any) indexes and rely on the speed and efficiency of streaming data from DRAM. It may be faster (and more efficient) to simply scan the entire column than to chase a few pointers to find a given item (a point query). And third, merging the record-level updates into the columns of the main database happens much less frequently. These merges can be expensive when it is necessary to re-encode the entire column, which leads to reading all of the old data and writing it out anew.

This fundamental shift in the internal organization or architecture of database systems and the resulting change in access patterns has significant consequences for how persistent data is accessed. Therefore, it influences the internal organization and the role of a traditional disk spindle-based storage system.

We envision three fundamental shifts in how storage systems are used in these environments:

1. The ability to do efficient small updates is diminished (a workload for which WAFL is tuned); data will be written in bulk.
2. The emergence of host-side flash memory (in SSDs, PCI-based adapters, or any other form factor) can satisfy most I/O needs for certain classes of applications.
3. For large data analytics applications with column-oriented organization, where the accessed data footprint is much larger than the available DRAM, storage systems whose ratio of system-provided bandwidth to cumulative raw disk bandwidth is close to one are desired. This in turn favors "cheap-and-deep" storage systems over systems that trade off raw performance for feature-rich options and a high mark-up per spindle/capacity (like NetApp FAS controllers).
Guide to the rest of the report

This report is structured into three parts. We describe the organization of the report to help the reader navigate through the text and choose the information relevant to his or her needs.

The first part includes two sections. Section #Background provides technical information on the general architecture of row-oriented databases and the transition to column-oriented databases. Section #Existing and emerging systems highlights architectures and products in this space. Both of these sections provide background for the interested reader. They can be skipped with no loss of understanding of the rest of the document.

The second part consists of sections #Applications and usage patterns and #Implications for storage systems that describe how column-oriented DBs for real-time
analytics view and use storage systems.

The third part of the document provides information specific to NetApp. Section #Threats and opportunities for NetApp lists how NetApp FAS products are
positioned in this application space. Section #Closing the gaps outlines a series of steps that would strengthen NetApp's position in this application domain.
Finally, sections #Proposed projects and #Recommendations outline recommendations for ATG investigations and projects in this space.

Background
Traditional DBMS architecture

In relational databases, a schema defines the structure and partitioning of information into individual relations (tables). Each table then contains a set of rows, one for each record, and each record includes a set of attributes or column fields. To connect information from one table to another, database systems use keys that "link" those records together. For example, a payroll table and an employee directory table can use the employee ID as the key.

Traditionally, relational database management systems (RDBMS) such as the commercial IBM DB2 or Oracle use a row-oriented architecture: the entire record (or row with all its attributes) is stored one after another. To locate each record within a page (say, 4 KB), the row-oriented page includes a slot array that lists the offset of each record within that page.
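
To make the slotted-page idea concrete, the sketch below (illustrative Python, not the page format of any particular DBMS) packs variable-length records into a fixed-size page and keeps their offsets and lengths in a slot array at the end of the page, so that a record can be located without scanning the page:

    import struct

    PAGE_SIZE = 4096  # 4 KB pages, as in the example above

    def build_page(records):
        """Pack variable-length records (bytes) into one page. Records grow from
        the front; a slot array of (offset, length) pairs plus a 2-byte record
        count sits at the very end of the page."""
        data = bytearray()
        slots = []
        for rec in records:
            slots.append((len(data), len(rec)))   # where the record starts, how long it is
            data += rec
        trailer = b"".join(struct.pack("<HH", off, ln) for off, ln in slots)
        trailer += struct.pack("<H", len(slots))
        pad = PAGE_SIZE - len(data) - len(trailer)
        assert pad >= 0, "records do not fit into a single page"
        return bytes(data) + b"\x00" * pad + trailer

    def get_record(page, n):
        """Locate record n via the slot array instead of scanning the page."""
        (count,) = struct.unpack_from("<H", page, PAGE_SIZE - 2)
        off, ln = struct.unpack_from("<HH", page, PAGE_SIZE - 2 - 4 * (count - n))
        return page[off:off + ln]

    page = build_page([b"alice|eng|2001-04-01", b"bob|sales|2003-07-15"])
    print(get_record(page, 1))  # b'bob|sales|2003-07-15'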

This row-based orientation favors efficient updates to one or more records, typical of on-line transaction processing (OLTP) workloads. To locate individual records quickly, database systems often build indexes on several keys (often one index per table within the schema) to look up a value in log(n) time. Additionally, for frequent queries that look up only some attributes but from several tables, databases maintain so-called materialized views. These are system-generated tables with attributes copied/duplicated from different tables. Thus, a single transaction may update a few values in a handful of tables in the original database schema as well as those in the system-generated and maintained structures such as indexes and materialized views. With several indexes and a few materialized views, a single user-visible transaction updating only a few values can result in touching (dirtying) many different pages.

For recovery, databases use write-ahead logging: the changes to these pages made by a single transaction are recorded in the log, while only the in-memory (cached) versions of the data are modified. Periodically, the log is checkpointed, writing in-memory dirty pages out to stable media up to a well-known point in time in the log. Thus, a checkpoint operation can amortize the cost of writing out a page across many update operations.

The advent of deeper memory hierarchies, with L1/L2 caches atop traditional ones consisting of only memory and disk, and the shift towards more data mining operations in addition to the traditional OLTP workloads lead to inefficient use of resources. In order to do aggregations (i.e., computations of sums, averages, etc.) across, say, one attribute, or column, the system has to read an entire page from disk and chase pointers within the page to locate the beginning of each record and retrieve the desired values for the column within that page. This causes scanning or skipping over data that is not needed. Moreover, the desire for faster execution of analytical queries through the use of materialized views and indexes leads to ever-more-costly updates and fine-grained synchronization.

Column-oriented DBs

The basic premise of column-oriented databases (CODBMS) is to do away with storing whole records for one table in a single page and instead partition a single table vertically into individual columns. This provides for fast and efficient scanning of entire columns. Typically, a given column includes only a small number of unique values, allowing for effective compression. With compressed, fixed-size columns that are cache friendly (e.g., aligned on cache line boundaries), an entire column can be scanned very fast. The ever-increasing size of main memory in servers compounds the effectiveness of both vertical decomposition into columns and data compression. Therefore, analytical queries can run entirely from main memory, requiring very little, if any, I/O to complete.

In a typical column-oriented architecture, each column stores compressed values in a set of fixed-size pages because most buffer caches are page-oriented. Fundamentally, column compression leverages the fact that there are only a few unique values in a column relative to the number of bits used to express them. For example, ASCII uses 8 bits per character, but effectively fewer than 6 bits are needed for human-readable values such as color names. Similarly, the practical range of values is much smaller for fixed-size fields such as date and time, or integers used for counts. For variable-length attributes, e.g., last names, an effective compression is a variant of Huffman coding (http://en.wikipedia.org/wiki/Huffman_coding) . A frequent value in the column, even one that uses a large number of bytes, is expressed by a code of only a few bytes. Thus, the CODBMS also maintains for each column a header with the column dictionary listing the encoding for each unique value in that column. Finally, each record value must have a unique record id (that can be calculated on the fly) and typically includes a timestamp with a valid/invalid bit.
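
As a simplified illustration of the dictionary idea (a sketch, not the encoding used by any particular product), the following Python fragment builds a per-column dictionary and stores the column body as small fixed-width integer codes; a predicate constant can then be translated into its code once and compared as a small integer during a scan:

    from math import ceil, log2

    def encode_column(values):
        """Dictionary-encode one column: unique values go into the column header,
        and the column body stores one fixed-width integer code per row."""
        dictionary = sorted(set(values))               # per-column dictionary (header)
        code = {v: i for i, v in enumerate(dictionary)}
        bits = max(1, ceil(log2(len(dictionary))))     # fixed code width in bits
        body = [code[v] for v in values]               # fixed-size, cache-friendly codes
        return dictionary, bits, body

    colors = ["red", "blue", "red", "green", "blue", "red"]
    dictionary, bits, body = encode_column(colors)
    print(dictionary)   # ['blue', 'green', 'red']
    print(bits, body)   # 2 [2, 0, 2, 1, 0, 2]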

Vertically partitioning the database schema with column compression can provide significant space savings. In one [real-world example], a database with 34 million records, representing 5 years of accounting data for a single enterprise, took 35 GB of space in a row-oriented representation. With a column-oriented design, the total space used was only 8 GB --- over 77% space savings.

The CODBMS model differs in several key aspects from that of the traditional row-oriented DBMS:
1. It eliminates the need for indexes. With memory-efficient scans of a million records per millisecond, it is faster to scan the entire column to find the value for a given record than to chase pointers across several index pages.
2. There is no need for materialized views in order to run data-mining queries efficiently. Rather than paying the expense of maintaining materialized views for the most frequent queries, one can run ad-hoc queries with the right set of columns and join the records together according to the record ids with simple offset/length arithmetic (see the sketch following this list).
3. Merging updates into pages written out to stable store is relatively expensive and therefore deferred as long as possible. With vertical decomposition, updating a few values in a single record touches as many pages, one for each attribute. Moreover, because of data encoding or compression, the addition of a new unique value may result in a situation whereby the entire column has to be re-encoded. This requires reading the entire column and writing it out again with the merged changes. Thus, a CODBMS keeps large update logs that tend to be merged with column-store pages infrequently relative to checkpoints in a traditional row-oriented RDBMS.
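
The offset arithmetic mentioned in point 2 can be sketched as follows (illustrative Python under the assumption of dictionary-encoded, fixed-width columns as above): because every column stores its values in record-id order, the attributes of record i are found by direct indexing rather than by following index pointers.

    # Each column is a (dictionary, fixed-width codes) pair, as in the encoding
    # sketch above; record ids are simply positions within the column body.
    columns = {
        "color":  (["blue", "green", "red"], [2, 0, 2, 1]),
        "status": (["paid", "unpaid"],       [1, 1, 0, 0]),
    }

    def fetch_record(record_id, wanted):
        """Stitch a record back together: position in each column == record id.
        With on-disk pages, the byte offset would be record_id * code_width."""
        return {name: columns[name][0][columns[name][1][record_id]] for name in wanted}

    print(fetch_record(2, ["color", "status"]))  # {'color': 'red', 'status': 'paid'}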

Architectural changes

The profound shift in server hardware architecture over the last decade or so brought about a fundamental change in the internal architecture of RDBMS and the move away from row-based to column-based orientation.

First, the rapid increase in the size of installed main memory, or DRAM, in servers far outpaced the growth in the overall footprint of databases for OLTP. In particular, in 2005, only two of the top ten largest OLTP databases (http://www.wintercorp.com/VLDB/2005_TopTen_Survey/2005TopTenWinners.pdf) were larger than 10 TB, measured as total storage capacity including tables, logs, and indexes. The same result held for 2003 (http://www.wintercorp.com/vldb/2003_TopTen_Survey/All%20Winners.pdf) . As a consequence, most online transactions could be executed in main memory. Second, in the past decade, the memory hierarchy deepened from a single level of small caches to up to three levels of on-die caches in front of the main memory. Third, the number of hardware threads per system increased many times over, from a single thread (one CPU) in 2000 to 48 or 64 hardware threads in a single system with a single shared main memory.

In the late 1990s, researchers recognized the emerging hardware trends and proposed new architectures and column-oriented DBMS organizations to address the concern that main memory, rather than I/O, could be the new bottleneck (see the original 1999 VLDB paper (http://www.ercim.eu/publication/ws-proceedings/12th-EDRG/EDRG12_BoMaKe.pdf) , the MonetDB project (http://www.monetdb.org/Home) that addresses this issue and is still ongoing, or the Tape is Dead, Disk is Tape, Flash is Disk, RAM Locality is King talk (http://research.microsoft.com/en-us/um/people/gray/talks/Flash_is_Good.ppt) by Jim Gray). The solutions moved away from

random memory access (a.k.a. memory pointer chasing)
fine-grained record-level locking
in-place record-level attribute value updates

in favor of architectures with

streaming memory access that leverages cache-line prefetches and DRAM's much more efficient serial access relative to truly random access
partitioned execution to minimize shared-state updates and the resulting cache-line invalidations due to exclusive-state writes
deferring record-level changes (to in-memory data) by logging them into difference logs and merging them with the in-memory and persistent copies.

The collection of these architectural and technological changes enables the execution of ad-hoc analytical and data-mining queries from main memory. This allows running both traditional OLTP and OLAP (online analytical processing) or decision-support (DSS) workloads, with very different access patterns, on a single system and in real time. This new architecture is therefore often termed column-oriented in-memory databases to signify the shift both in logical organization and towards in-memory access, away from traditional I/O to a storage system to run a query.

Canonical CODBMS architecture

While it is difficult to say that there is one typical or prevailing system architecture for column-oriented in-memory databases, we outline two common approaches to building systems for large-scale data mining and business analytics.

Large memory single system image

The most common approach, and for example the one taken by SAP (http://www.sap.com/) , is to run a single application server on a single machine with a large amount of memory (say 100+ GB) and very many cores, all controlled by a single OS image. In this architecture the dataset access is partitioned across the parallel threads and the amount of sharing is limited. However, a fault in this single-image system (CPU complex, OS, etc.) is very costly. In order to prevent unavailability, one has to run another identical system. Moreover, failover includes bringing up the system and warming 100s of GB of main memory from persistent storage.

More recently, SAP Labs has been experimenting with an even larger system, called Big Iron (http://ra.ziti.uni-heidelberg.de/coeht/pages/events/20110208/presentations/ike.pdf) . They take five individual servers connected by InfiniBand, each with 100+ GB of DRAM and 8-socket multi-core CPUs, to create a single virtual HW system image. This virtualized server runs a single-instance Linux OS with a single instance of the SAP server, creating a machine with 320 hardware threads and several TBs of DRAM. The hardware virtualization software is in effect a software-based ccNUMA shared-memory system with a directory-based MESI cache-coherence protocol (at the L4 cache). This is akin to the TreadMarks (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=485843) distributed shared memory research project.

Clustered system

Another approach, taken for example by Oracle RAC (http://www.oracle.com/technetwork/database/clustering/overview/index.html) (Real Application Cluster), HBase, or Netezza, is to present a single DBMS system image by pooling independent servers, each running a separate OS instance, and using some form of clustered or shared storage. In this approach, a single CPU complex/server failure can be masked by the clustering software by re-distributing the workload thanks to the shared storage subsystem. On the other hand, the cross-node communication incurs larger overheads compared to the tightly coupled HW architecture. Fundamentally, however, this architecture serves the same purpose --- partitioning computation to facilitate parallel data access and to effectively increase the processing power along with the cross-sectional main-memory-to-CPU bandwidth.

Existing and emerging systems


We briefly map out the landscape of existing commercial database systems specializing in real-time data analytics that favor a column-oriented design, as well as those still in the research or startup phase. The main purpose is neither to be exhaustive nor to provide a detailed overview of these systems. Instead, the intent is to list and put into the proper context of this report the names of relevant companies and products.

Established commercial products

Sybase IQ

One of the earliest commercial products with a pure column-oriented organization was Sybase IQ. It used vertical partitioning and aggressive column compression to normalize variable-sized attributes and to provide more efficient access to data for decision-support queries. It also used novel techniques for joins, using offset arithmetic and aggregate computation on encoded values. Sybase was acquired by SAP in May 2010 (http://www.sap.com/press.epx?pressid=13202) , mainly because of the other parts of its software stack (solutions for mobile platforms) rather than for its Sybase IQ engine.

SAP/R3

SAP is a typical representative of the in-memory column-oriented database design running on a single instance of a large server/host. It uses large memory to store columns and minimizes locking contention in favor of aggressive partitioning of execution across the many host cores. Furthermore, it separates transactional locking into two layers: the application-level transactions, which can be long-lived and can include more complex operations, and the transactions at the persistence layer, which serialize updates to the persistently stored data. Instead of updating values in place (in memory), it puts new values into difference logs and merges those updates into the column store.

Logically, NewDB (the latest re-design of their system, code-named Hana) represents a single column by a logical file object consisting of the dictionary and fixed-size encoded attribute values with timestamps. Physically, the entire database schema, i.e., all columns and dictionaries, is stored in a single file managed by the NewDB persistence manager. This large single file is essentially a file system that includes metadata about which pages (blocks) belong to which columns and manages consistent views of individual logical file objects. See the write-up about the SAP architecture for more details.

Oracle RAC and Exadata Database Machine

Oracle RAC (http://www.oracle.com/technetwork/database/clustering/overview/index.html) (Real Application Cluster) is a clustered solution of the traditional Oracle 11g database (row-oriented) engine. It pools individual servers (up to 100), each running a separate OS image (e.g., Linux), and uses cluster management and recovery software to present a single DBMS image. In particular, the clustering module partitions and coordinates work across tables that are stored in a large shared storage system (e.g., the Oracle clustered file system OCFS2 (http://oss.oracle.com/projects/ocfs2/) ). Naturally, transactional updates are "in place" and expensive, as they have to use some variant of a two-phase commit (2PC) protocol if updates occur across separate nodes. The primary application of Oracle RAC is to scale out performance for large-scale big-data analytics queries.

Oracle does not use column-based orientation. However, the internal page representation is optimized for memory access. It uses the Partition Attributes Across (PAX) page layout. While access at the memory and cache level is optimized, the entire page, including payload that is not needed, has to be read from the storage system and pollutes main memory.

The Exadata Database Machine (http://www.oracle.com/us/products/database/database-machine/index.html) is a bundled solution of hardware and software - the
Oracle RAC technology, with Solaris and Sun HW for storage. It also leverages Flash memory (or solid-state disk drives) for accessing "hot" data. It uses page-
based compression to save storage capacity. Another feature of the Exadata DB Machine is the ability to push predicates of compiled queries to the storage nodes.
When the storage node understands the format of the page it can perform a filtering operation and limit the amount of data (pages) that are sent to the database
node.

Smaller players

Most of the systems described below are successful start-ups that established themselves on the market with some customer base and were recently acquired by larger companies.

Netezza

The basic premise of Netezza was to provide a closed solution for high-throughput execution of analytical (DSS) queries across a cluster of nodes. The architecture would take a SQL query and compile the query plan into a special C or object-language representation. It would then further compile the query plan operators into Verilog to push simple predicates (i.e., filters like SELECT val WHERE val = X) and other computation into custom-built FPGA hardware present at each storage node (the concept of computation close to data). This approach would eliminate or reduce the data that needs to travel from storage nodes to aggregators. The original design did not support real-time updates, and the data had to be ingested and stored in a proprietary/closed format with aggressive column/attribute compression. Since the initial architecture, the specialized HW was dropped in favor of a software-only solution, with each node running a Linux OS, and extensible analytical functions were added to the dataflow. Netezza was bought by IBM in 2010 (http://venturefizz.com/funding/netezza-acquired-ibm-marlborough-ma) for $1.4 billion.

Aster Data Systems

Aster Data (http://www.asterdata.com) provides a parallel analytic platform that is a software-only competitor to Netezza (http://www.dbms2.com/2010/02/22/netezza-twinfin/) . It also uses a proprietary in-memory column-based database format. It specializes in rich analytic functions within an integrated development environment. It focuses on integration with SSDs, certifying its product with Fusion-io cards installed on the nodes of the Aster cluster. Aster was acquired by Teradata for $263 million (https://www.fis.dowjones.com/WebBlogs.aspx?aid=DJFVW00020110303e734000dx&ProductIDFromApplication=&r=wsjblog&s=djfvw) .

Vertica

Vertica is another representative of the pure column-oriented design with deferred updates. The updates are accumulated into a row-oriented store --- a form of the
difference log --- and eventually merged with the main column-oriented database. The basic idea behind the Vertica system was formulated in the C-Store paper
published in the VLDB Proceedings in 2005 (http://db.csail.mit.edu/projects/cstore/vldb.pdf) . The academic paper lays out a good rationale and motivation for a column-oriented organization with aggressive column-based data compression, even for environments that face frequent updates.

The architecture scales across individual nodes (with direct-attached disks or using shared storage) and focuses on real-time analytics (analytics on live data). It now uses a cascaded style of merges, folding in more and more updates at each stage, in an effort to minimize the data churn that requires reading, re-encoding, and rewriting all values in a single column. HP acquired Vertica in early 2011 (http://www.hp.com/hpinfo/newsroom/press/2011/110214xb.html) .

Vertica offers a series of connectors for applications, such as JDBC, ODBC, etc. It also includes plug-ins for Business Intelligence tools like Cognos (also used internally within NetApp by the Onaro group). Finally, it offers Hadoop (for Map/Reduce operations) and Pig (http://research.yahoo.com/node/90) (a data-flow language for big-data analytics and processing developed by Yahoo) connectors to the core database engine.

ParAccel

ParAccel is a start-up focusing on business analytics with a column-oriented architecture spread across discrete computational nodes for scalable performance,
similar to Vertica. The PADB uses column compression, cluster-wide load balancing, and other typical column-oriented features. It also integrates with
Hadoop/MapReduce and includes connectors and interfaces such as ODBC or JDBC. From the storage or I/O perspective, it combines node direct-attached storage
with SAN storage for fail-over capabilities. The PADB includes proprietary snapshot and D/R capabilities, mirroring data between DAS and SAN LUNs. ParAccel
expects about $15 million in revenue (http://blogs.wsj.com/venturecapital/2011/03/10/paraccel-feeling-fine-after-acquisition-smoke-clears/) this year.

Endeca

Endeca (http://www.endeca.com/) specializes in search through semi-structured data for business intelligence. The architecture leverages an in-memory vertically-partitioned data representation extracted from the original data. It creates the multi-dimensional data hierarchy typical of OLAP applications for drill-down queries by automatically building in-memory indexes.

HBase

HBase (http://hbase.apache.org) is an open-source implementation of Google BigTable (http://labs.google.com/papers/bigtable.html) distributed under the Apache license. In HBase, data is organized into columns where each value is time-stamped. A record can be stitched together by looking up the record id. HBase does not have a predefined schema. Columns can be added to the table at any point and can be sparse, meaning that a column need not contain a value for a given record id/key. HBase exhibits many of the attributes of a column-oriented organization, but it mostly targets data that is much larger than the amount of available memory in the cluster. In short, HBase is column-oriented, but not an in-memory database. Nevertheless, it represents an important type of an emerging new application, so we describe the HBase architecture in more detail than that of the systems outlined in the previous section.

General architecture

HBase runs on a cluster of workstations, with each node running a single instance of the HBase server. The server provides record management for a given piece (region) of the column. HBase leverages the Hadoop Distributed File System (HDFS) and additional components such as ZooKeeper to do so.

An HBase column is implemented as a collection of regions, each represented by a single HDFS file. In HDFS fashion, each file is split into chunks with a default size of 64 MB. Each chunk is located on a different HDFS node. The HDFS metadata service keeps track for HBase of which chunk is stored where. Also, by default, each HDFS chunk has three replicas. Finally, HDFS chunks are write-once. Any update requires reading the original chunk (optionally, if not already cached by the HRegionServer), copying the data into a new chunk, and replacing the old version with the new one, again creating three replicas in the process. Thus, an update of a few bytes can translate into a 64 MB read and (eventually) writing out 3 times 64 MB.
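
A back-of-the-envelope sketch of this write amplification, using the default parameters mentioned above:

    CHUNK_SIZE_MB = 64   # default HDFS chunk size
    REPLICAS = 3         # default HDFS replication factor

    def hdfs_update_cost(update_bytes, cached=False):
        """I/O triggered by logically updating a few bytes of a write-once HDFS
        chunk; the cost is independent of the (small) update size."""
        read_mb = 0 if cached else CHUNK_SIZE_MB    # re-read the old chunk unless cached
        write_mb = CHUNK_SIZE_MB * REPLICAS         # rewrite the whole chunk, three replicas
        return read_mb, write_mb

    print(hdfs_update_cost(update_bytes=100))  # (64, 192): a 100-byte update costs 64 MB read, 192 MB written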

HBase relies on ZooKeeper, a highly available service with a quorum of nodes, to keep track of pertinent HBase metadata: a pointer to the "root" region and a list of participating region servers. Thus, with the help of ZooKeeper, a client can map the record id to the region of each column file and thus to the requisite HRegionServer. This translation is then cached by the client and subsequently allows it to ask the right HRegionServer directly for the data of a record X. The HRegionServer manages regions, where a region is the requisite portion of the column; it issues reads to HDFS and caches data as necessary. The picture below shows the main components, including the mapping of individual files to HDFS chunks.

Storage with HDFS


Basic HBase architecture.

As depicted above, each HRegion, managed by a single instance of the HRegionServer, has the structure of a typical column-oriented DBMS: individual updates occur in main memory and are first accumulated in the HLog (akin to the difference log discussed earlier), which is an append-only file within HDFS. At some later point, the changes in the log are merged with the persistent data in the HFile. This merge operation can also perform optional compaction or LFS-style segment cleaning, akin to in-memory CODBMS re-encoding. Unlike traditional in-memory column-oriented systems, the data in an HBase column is not encoded, although HBase supports LZO or Gzip compression of the individual HFile chunks. The HRegionServer uncompresses the data when it is read from HDFS before storing it in its main memory.
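
The write path described above can be summarized by the following schematic sketch (illustrative Python, not the HBase implementation): every update is appended to a log for durability and buffered in memory, and a later flush merges the buffered updates into a new immutable store file.

    import time

    class RegionSketch:
        """Schematic region write path: append-only log plus an in-memory buffer,
        periodically flushed into immutable, sorted store files."""
        def __init__(self):
            self.log = []          # stands in for the append-only HLog in HDFS
            self.memstore = {}     # in-memory buffer: key -> (timestamp, value)
            self.store_files = []  # immutable "HFile"-like segments

        def put(self, key, value):
            ts = time.time()
            self.log.append((key, ts, value))   # durability record
            self.memstore[key] = (ts, value)    # newest version wins in memory

        def flush(self):
            # Merge buffered updates into a new immutable segment; a later
            # compaction (not shown) would merge segments and drop stale versions.
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore.clear()
            self.log.clear()    # log entries up to the flush point are no longer needed

    r = RegionSketch()
    r.put("row1:color", "red")
    r.put("row1:color", "blue")
    r.flush()
    print(r.store_files[0])  # only the newest version of row1:color was flushed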

Each HFile is broken into fixed-size "blocks" that are the logical units administered by HRegionServers. Multiple blocks form a single region of the column file. Each block (by default 64 KB, and labeled "Data" in the picture) has an internal structure that includes individual records, stored one after another. The file also includes indexes that allow a direct lookup of the record for key K within the blocks. The index points to the most recent time-stamped version of the record. Multiple versions can exist, for example, when the record is updated several times before the "block" is finalized and flushed out to HDFS.

Format of a single HFile for one column region.

HFiles are files in a distributed file system (e.g., HDFS). In principle, HBase does not require Hadoop (HDFS) underneath it, but HBase is optimized to work on
top of HDFS. There are patches to HDFS (https://issues.apache.org/jira/browse/HBASE-2467) to fix performance and/or correctness issues with HDFS. However,
an HBase distribution bundles both HDFS and ZooKeeper in addition to the HMaster and HRegionServer proper code.

For more details on the architecture and implementation of HBase storage, please refer to the HBase Storage description on NoSQL Wikipedia
(http://nosqlpedia.com/wiki/HBase_Storage) .

Column format details


HBase column record.

Each record is variable length and includes

1. the length of the key and the value (to support schema-less, user-defined keys and values)
2. the row and column identifiers (maintained internally by HBase)
3. the row key
4. a timestamp
5. the value

Both the key and the value are byte arrays. Thus, an application has to convert its data structures (e.g., strings or integers) into byte arrays before storing or retrieving data.
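
The fields listed above can be made concrete with the following packing sketch (an illustrative approximation in Python; the exact on-disk HBase encoding and field widths are not reproduced here):

    import struct, time

    def pack_record(row_key: bytes, column: bytes, value: bytes, timestamp=None):
        """Pack one variable-length record: key length and value length first,
        then the key (row key and column identifier with their lengths plus a
        timestamp), and finally the value. All parts are byte arrays."""
        ts = int((timestamp or time.time()) * 1000)           # millisecond timestamp
        key = (struct.pack("<H", len(row_key)) + row_key
               + struct.pack("<H", len(column)) + column
               + struct.pack("<q", ts))
        return struct.pack("<II", len(key), len(value)) + key + value

    rec = pack_record(b"row-42", b"cf:color", b"red")
    print(len(rec))  # total record size; the first 8 bytes hold the two length prefixes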
Each record is variable length, as described above, so each HBase block may also include some padding to make its size fixed. HBase packs blocks into the individual HFiles, or segments. Using the default HDFS chunk size of 64 MB and the default 64 KB block size, there are 1024 blocks per chunk. Depending on the amount of updates and flushes (commits), different block-size parameters are used. Some suggest setting the block size equal to the HDFS chunk size. However, this leads to a lot of wasted intra-chunk space if frequent updates occur. Recall that the HRegionServer caches each HDFS chunk and strives to defer writing the chunk out to HDFS to amortize the cost of merging, i.e., updating records within the HFile blocks.

The HBase format is storage-space inefficient. Our preliminary measurements of a single-node HBase installation show that over 2x the space is required due to the overheads within each record and block (timestamps, index, and the repeated storage of the key/row id). With 3 HDFS replicas, this yields a 6x storage requirement. This represents a best-case scenario of storing fixed-size records with no deletes or "invalid" entries. See the report for more details.

The HBase column format is also not optimized for efficient memory access. To scan through the rows of a given column attribute, one has to chase pointers. Each "row" in the column file consists of a key-value pair, where both the key and the value are variable length, prefixed with two numbers indicating the lengths of the key and value terms. The key, among other fields, includes a record ID and a timestamp.

Other technologies

This report focuses on databases with column-based organization. However, there exist other technologies and systems related to them in concept and, to some degree, in implementation (HBase being one example). They are sometimes lumped together under the term NoSQL or schema-less databases because they might not use SQL for expressing data queries, can have their schema extended (like HBase), or harness the combined DRAM of a large cluster to execute from main memory. This section highlights these new technologies. A more detailed understanding of their behavior and interactions with storage systems, however, is beyond the scope of this report. We direct the reader to the available overview articles that detail these systems and map out the landscape of these emerging technologies. The Scalable SQL and NoSQL Data Stores (http://www.cattell.net/datastores/Datastores.pdf) article by Rick Cattell gives a detailed overview and classification of the emerging systems. The detailed 100+ page NoSQL Databases (http://88.198.54.126/nosqldbs.pdf) report by Christof Strauch is exhaustive in covering existing systems and provides some technical detail and a summary of views and opinions from various technologists commenting on this NoSQL database trend.

Schema-less or NoSQL databases

The "NoSQL" databases can be divided into three broad categories:

"simple" key-value stores


document stores
extensible large-scale column (or semi-structured) data stores.

Voldemort (http://project-voldemort.com/design.php) (developed and contributed by LinkedIn) represents the first category. It is a scalable store where both keys and values are byte ranges defined by the user or application. Such stores support only efficient point queries (the lookup of a value for a key K).

Systems like MongoDB (http://www.mongodb.org/) or CouchDB (http://couchdb.apache.org/) represent the category of document stores. A document can be, for example, an XML document or a JSON (http://json.org/) (JavaScript Object Notation) object. Some of them support indexing and thus efficient range queries, i.e., the retrieval of documents for a specified range of keys.

Representatives of the third category are the aforementioned HBase and Cassandra (http://cassandra.apache.org/) , the latter originally developed by Facebook. Cassandra has many of the same goals as HBase. It provides extensible storage of semi-structured data with column families and no built-in transactional support. It uses MVCC (multi-version concurrency control) and a special-purpose storage layer, unlike HBase, which is built on top of a clustered system --- the Hadoop file system. For that reason, it is likely to have very different access and storage usage patterns from a "typical" CODBMS, including HBase. However, more investigation is needed to verify this assertion.

These systems exhibit the following general characteristics:

They are scalable, meaning that they can run on multiple machines and their performance, at least in theory, scales up with the number of nodes used.
They provide some sort of partitioning and load balancing without manual intervention from the application writer or complicated configuration by the DB admin.
They provide much "softer" semantics compared to the ACID properties of traditional RDBMS --- these are sometimes called BASE, which stands for Basically Available, Soft state, Eventually consistent. More formally, Brewer's CAP theorem states that a system can provide at most two of the three properties of consistency, availability, and partition tolerance. Traditional RDBMS with ACID properties satisfy C and A, while the "NoSQL" systems provide A and P. To accomplish that, some systems employ multi-version concurrency control (MVCC) (e.g., CouchDB) and/or rely on the application programmer to deal with conflicts and inconsistencies.
They all allow access methods more palatable to the Web 2.0-era programmer. RESTful Web interfaces (http://en.wikipedia.org/wiki/Representational_State_Transfer) , API bindings to languages such as Java, Python, Ruby, PHP, etc. (via, e.g., the Apache Thrift library (http://thrift.apache.org/) ), and the Map/Reduce paradigm are more important than the more traditional database connectors (ODBC/JDBC) and SQL.

Research systems

Research in academia is looking at providing database as a service, combining the scalability of the emerging systems mentioned in the previous section with the advantages of relational organization and ACID properties. Sam Madden at MIT has an ongoing Relational Cloud (http://relationalcloud.com/index.php?title=Main_Page) project that looks at the partitioning and deployment of large relational datasets across many nodes. Another effort, headed by Michael Stonebraker and called SciDB (http://www.scidb.org/) , is looking at building open-source data management and analytics software for scientific research. The basic premise is to provide the strong properties of relational databases for very large sparse arrays of arbitrary dimensions. See the notes from the talk for more details.

Applications and usage patterns


The main users of databases with columnar orientation are data analytics applications. In general, such applications run many queries (in parallel) that sift through the requisite portion of the data set to return the requested results in real time. Thus, low latency, or total query execution time, is the basic requirement for these systems. Hence, these systems prefer in-memory access rather than I/O to storage.

The May 2011 issue of Communications of the ACM published an article titled Stonebraker on Data Warehouses. The author outlines 10 rules that characterize data warehouses with columnar organization. The next sections highlight some of these points.

Star schema

Most systems are built in support of answering the five questions of Who, Where, When, Why, and How about a key fact. Thus, all information centers around the main fact, or lineitem, while the other tables exist to answer those questions. This configuration is called the star schema. Incidentally, the database schema for the TPC-H decision-support benchmark contains a table called lineitem that holds the most data, and most queries in the benchmark operate on this table.

Star schema.
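
A minimal sketch of this data shape (hypothetical tables, loosely inspired by the lineitem example, not the actual TPC-H schema): the central fact table references the dimension tables that answer the who/where/when questions, and a typical query aggregates facts grouped by a dimension.

    # Dimension tables answer the who/where questions about each fact.
    customers = {1: "ACME Corp", 2: "Globex"}      # who
    regions   = {10: "EMEA", 20: "APAC"}           # where
    # The central fact table (the "lineitem") references the dimensions by key.
    lineitems = [
        {"cust": 1, "region": 10, "date": "2011-07-01", "qty": 5, "price": 20.0},
        {"cust": 2, "region": 10, "date": "2011-07-02", "qty": 3, "price": 10.0},
        {"cust": 1, "region": 20, "date": "2011-07-03", "qty": 7, "price": 15.0},
    ]

    # A typical star-schema aggregation: revenue per region.
    revenue = {}
    for li in lineitems:
        name = regions[li["region"]]
        revenue[name] = revenue.get(name, 0.0) + li["qty"] * li["price"]
    print(revenue)  # {'EMEA': 130.0, 'APAC': 105.0}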

Query types

The queries run by a CODBMS are typically not a simple record lookup or search for a given value (i.e., a point query). Instead, they are a type of range query that involves scanning a relatively large portion of the data for aggregations, i.e., calculations based on the data as it is retrieved from the requisite columns. In these applications, indexes are not helpful. Instead, the CODBMS provides efficient scanning of the data in the column while calculating aggregate values. Large bandwidth is the key requirement here.

A certain class of applications may include imprecise queries for discovery, such as: find a part that looks like this, fits component X, and costs no more than $Y. Systems like Endeca specialize in the data organization and interfaces needed for efficient discovery through imprecise queries.

Another type of business application may run forecasting based on existing data. That can be forecasting of sales, business growth, etc. Typically, such queries are complex calculations across several attributes, or tables in the schema, combined with drill-down operations. In a traditional row-oriented RDBMS, these are pre-canned queries executed as stored procedures on the server against previously defined materialized views. The CODBMS architecture offers efficient running of ad-hoc queries without having to determine a priori the set of queries and data to use. Both discovery and forecasting may lead to scanning several columns or running several queries in parallel or in succession to get the desired results.

Updates

One study of data from several customer deployments of OLAP systems showed that, on average, only 5% of all data is ever updated during the lifetime of the database, with the largest observed portion being 11%. The updates fall into three categories:

1. Aggregate updates to whole records (e.g., downloads of data from a branch to the main database)
2. Status updates with a timestamp (e.g., toggle unpaid/paid field)
3. Values updates to a single variable (e.g., change in price of a line item).

Because logical updates are infrequent relative to the amount of data accessed for reading, and are often tracked over time, most CODBMS store them physically in append-only logs. In other words, updates are treated as inserts with a new timestamp and an invalidation of the record with the old value in the original column store. Since it may take a long time before updates propagate to the columns on stable storage, the delta or difference log reflects the current state of the database and must be stored persistently with each transaction.
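
A schematic sketch of this update style (illustrative Python, not the format of any particular product): an update is an append of a new timestamped version to the durable difference log, reads consult the log before the column store, and an infrequent merge folds the log into the column store, superseding the old versions.

    import time

    column_store = {7: ("unpaid", 1000.0)}  # persisted column: record id -> (value, timestamp)
    delta_log = []                          # append-only, stored durably with each transaction

    def update(record_id, new_value):
        """Logically an update; physically an append of a new timestamped version."""
        delta_log.append((record_id, new_value, time.time()))

    def read(record_id):
        """The current state is the column store plus any newer version in the delta log."""
        versions = [(ts, v) for rid, v, ts in delta_log if rid == record_id]
        if versions:
            return max(versions)[1]         # newest timestamp wins
        return column_store[record_id][0]

    def merge():
        """Infrequent and relatively expensive: fold the delta log into the column store."""
        for rid, v, ts in delta_log:
            column_store[rid] = (v, ts)     # the old value is invalidated/superseded
        delta_log.clear()

    update(7, "paid")
    print(read(7))              # 'paid', served from the delta log before the merge
    merge()
    print(column_store[7][0])   # 'paid'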

I/O profile

From the perspective of the CODBMS, data analytics applications are dominated by reads. However, if most of the active set fits into the system's main memory, the I/O breakdown and the access patterns exhibited by the CODBMS to storage systems may be quite different. Furthermore, the storage layer organization, or how the CODBMS manages storage, can profoundly influence the I/O access patterns. The next section outlines in detail the implications of column-oriented DBs for storage systems.

In cases when the active data does not fit into main memory, CODBMS tend to prefer accessing local, or direct-attached, storage because of the latency-sensitive nature of reads. From the perspective of the CODBMS designers, storage dedicated to a single host does not experience the performance variations that could potentially occur in a shared storage system. Similarly, CODBMS are ripe for the use of host-side flash (e.g., in PCI-attached or SATA-based SSDs) when the performance benefits offset its higher cost. Flash provides much higher IOPS and, with the right abstraction and interconnect technology, higher bandwidth than a network-attached disk-based system. ParAccel's PADB and Aster Data are examples of this approach.

Automated on-line re-provisioning

Compared to traditional row-oriented DBMS, the load on a CODBMS system can vary greatly thanks to the ad-hoc nature of the queries and the different uses for the data. Therefore, it is desirable for the system to automatically re-provision data for different load levels. For example, during peak times, it should have twice as many servers (or hardware threads) for the same data. Such an operation should not be disruptive and should not require manual intervention. For example, open systems such as HBase and the underlying HDFS both support such automatic elasticity. When HDFS storage nodes and/or HBase region servers are added, the system re-partitions the workload automatically. Similarly, Cassandra and MongoDB, for example, call this automatic sharding of storage. Some of the proprietary systems have similar capabilities.

High availability and system bring-up

Merging updates from logs into the data is expensive and thus relatively infrequent. Therefore, recovering the database from the difference log can take a very long time. To avoid that, most installations instead use replication and fail over to a replica if there is data corruption or a failure.

There are several ways to implement this. Clustered systems typically use a cold-standby method --- having another node take over the work of the just-failed one. With cold standby and shared storage, the logs can be taken over by the replica system, but warming the main memory (i.e., loading the data from the stable store) will take some time and lead to performance degradation similar to the one experienced by WAFL during failover after a controller failure. Single-image systems like SAP can use hot standby to minimize disruptions, but this doubles the system cost.

Similarly, for the same reasons outlined above, system bring-up is a complex operation. It may take a while to achieve steady performance for a system with 100s of GBs of main memory, which has to be filled with data read from storage and the difference logs so as to restore the current in-memory DB state.

Implications for storage systems


The access patterns and the use of storage systems by column-oriented DBs are quite different from those of traditional row-oriented systems running an OLTP workload. This section highlights these changes and attempts to categorize the use of storage by different types of column-oriented DBs.

Different I/O behavior

The I/O coming from an RDBMS running an OLTP workload is dominated by small random reads and small (random) updates. For some systems, serial reads after random updates are important. In contrast, the general I/O pattern of OLAP or data mining applications running atop a CODBMS has the following general characteristics:

1. Most reads are serial with large I/O sizes (100s of KB to single-digit MB).
2. Writes are append-only (rather than physical overwrites) thanks to the timestamp organization of both data and logs.
3. For data (columns), writes are typically large and much more bursty. Writes are generated during merges of difference logs with the columns. They tend to be appends to existing objects or writes to new locations; in-place overwrites are not very common.
4. There are also append-style small writes to the log.

System types based on I/O behavior

We now categorize CODBMS based on their usage of storage systems and the types of I/Os they generate.

Type I: In-memory systems

This type represents systems that, in steady state, contain their working set mostly in main memory; therefore the ratio of reads to writes, as seen by a storage system, is skewed towards writes. A representative of such a system is SAP/R3. Most (read) data accesses during query execution happen within the confines of the host. The storage system (either DAS or network-attached) sees I/Os due to these activities:

1. Start-up: a one-time occurrence when the system is booting and reading data (on demand) into main memory. Most I/Os are large serial reads. The server/application start-up time is largely governed, if not bottlenecked, by the I/O speed. Thus, serial read I/O performance is important for this type of system, even if it is not exercised frequently in steady state.
2. Merge operations of difference logs with the main/column stores generate bulk-write operations. Simple merges that do not need re-encoding of the column result in append-only (very rarely update-in-place) writes of small to medium size. Merges that require re-encoding generate lots of new, typically large and sequential writes. Some reads may occur as well.

Typically, these systems leverage the time-based nature of the data for data management operations. For example, they may put more recent data in main memory or on the devices with the best performance, while older data may be moved to a different class of storage - another file or volume on less-expensive devices. Similarly, they tend to have a storage layer that uses snapshots or some similar mechanism for maintaining consistent versions (views) of the data stores for fast point-in-time recovery. For that reason, occasional reading of data may occur (by a backup or D/R manager). Depending on the organization, some compaction may occur that shuffles data around.

The exact nature of the operations depends on many factors such as

the architecture of the storage layer (e.g., the use of NAS vs. SAN LUNs)
the breakdown or aggregation of individual database objects such as logs and columns into individual storage objects (files).

We now briefly describe how SAP/R3 (the NewDB engine) manages storage, as one possible representative. SAP uses NFS and stores all of its active data in a large file that it manages internally. This file acts as a container for a basic file system. This one large file includes metadata with a block tree that maps the page IDs of individual columns to the pages within the file (pages can range from 4 to 256 KB in size). SAP does no-overwrite updates and maintains consistent views (snapshots) within the file, very much like WAFL. From the point of view of the storage system (which only sees this one large file), the I/O pattern looks like updates in the middle of a file (since pages can become obsolete and be overwritten subsequently). The SAP persistency layer also performs a periodic copy of a snapshot to another system (by a separate process) and compaction/de-fragmentation of the data. See the storage layer section in the internal SAP architecture overview for more details and I/O behavior.

SAP's style of managing storage as one large file greatly influences the I/O pattern observed by a storage system. Depending on the internal fragmentation of the file, the large bulk writes may look like small writes that plug holes. Similarly, what could be large serial reads may turn into small random reads. All these issues are well known to anyone familiar with how WAFL works. In summary, this offers an opportunity for improvement, and we have proposed to SAP joint investigations to explore data storage and management alternatives that leverage the strengths of both systems and do not duplicate effort (e.g., implementing a file system with snapshot features on top of a file system (WAFL) that provides these features efficiently).

Type II: Large-data clustered systems

Clustered installations that routinely scan TBs of data must use external I/O to run their queries; it is not feasible to keep all data in main memory. This type of system is dominated by large serial reads. The write profile is very similar to the previous type --- large bulk updates that look like new writes (rather than overwrites of existing data). These systems tend to use local (DAS) storage to avoid a potential bottleneck in the network or interference at a shared storage system. Examples of this type would be Vertica, Aster Data, or certain types of ParAccel deployments. Compared to Type I, these systems typically do not implement (or rely on) more sophisticated data management features themselves, mostly because they are a nascent technology.

Type III: DAS caching (Host-side flash)

Type III is a specialized class of the previous type. The difference here is that these systems tend to rely heavily on DAS (either local disks or flash-based storage). They use DAS as their primary access path, but these devices act as caches for network-attached storage systems. The SAN or NAS-based systems provide the high availability and persistency in case of a failure of a node (along with its DAS). Thus, the shared system sees a very different profile from a Type II deployment, one similar to Type I:

1. the ratio of reads to writes is skewed towards writes
2. large reads are rare (most are serviced by DAS)
3. writes are append-style, large, and bursty, just like in Type I.

ParAccel is an example of a Type III system.

Type IV: HDFS

HBase is a representative of this type of column-oriented database system. It uses the services of a distributed (clustered) file system, HDFS, to store and manage its objects. In this scenario, the column is horizontally partitioned into individual files. A file in HDFS is append-only with no physical overwrites. A logical update (i.e., overwriting a portion of the file) results in a read-modify-write operation: read (if not already cached) the existing chunk and write its new version, updating the file's metadata at the HDFS name node. For the HBase HFiles that store column families, the typical operations include sequential reads that look to the storage system (HDFS data node) like reads of 64 MB chunks (or whatever the chunk size is set to) and large writes that are either logical updates in place or append-style ones.

The difference log (called HLog in HBase) is also managed as an HDFS file and its data is almost always just written out. Unlike the column files, the HLog file
sees lots of small updates that are flushed to the HDFS. Because of the no-overwrite style of writes in HDFS and large chunk size, a commit can appear as frequent
overwrites (new versions) of the tail chunk of the log file.

Type V: Cassandra
As described earlier, the Apache Cassandra columnar store represents a different approach to building distributed columnar databases. The interface and object model (column families with individual columns and an extensible schema) are very similar to HBase, but the implementation as a peer-to-peer system with a DHT architecture has a very different I/O profile. In Cassandra, the homogeneous nodes (both HW and SW) use DAS, and data replication is provided at the server level (rather than at the FS layer by keeping 3 copies of the same chunk, as in HDFS).

Cassandra is generally characterized by

append-style writes to the node-local log are relatively small (a transaction/client update can flush data to disk)
reads can be random if running point queries or very selective range queries, in part due to the hash-based distribution of data
across the nodes storing data for a given column family.

Compared to HBase or other types of CODBMS, Cassandra's I/Os are generally

1. smaller (HBase reads and writes in the default 64MB chunk size of HDFS)
2. more random (at least for reads) due to the fine-grained distribution of data based on the hash of the row key.
Therefore, using flash memory (e.g., flash-based SSDs) can greatly improve Cassandra performance compared to HDD-based DAS. There are startups that focus on improving Cassandra throughput by optimizing the storage layer for flash memory (e.g., Acunu (http://www.acunu.com/) ).

Threats and opportunities for NetApp


At the moment NetApp is not in a strong position to effectively compete in this market with its FAS product line. In general, column-oriented DBs look for

good serial performance (both reads and writes) at low cost - preferably with the ratio of overall system bandwidth to the aggregate
per-spindle bandwidth close to one (for example, a system with 14 data disk drives should be able to deliver a sustained bandwidth
on the front end 14 times the per-disk streaming bandwidth). The high per-I/O latency and the performance degradation of aged
WAFL volumes are a major hindrance to penetrating this market.
Advanced volume-based features such as de-duplication, quotas, or writable clones are not very useful in these deployments,
although snapshots and DR features (e.g., those based on Paloma or LRSE) can potentially provide some value (see below).
Multi-protocol access to the same data is not very useful. Most CODBMS use one method of access (e.g., iSCSI or NFS). CIFS
does not play any (significant) role in this space. Most solutions run on open (Unix/Linux) systems.
Multi-tenancy (sharing one system for several applications) is not desirable as it can lead to performance variation, or at
least that is the prevailing impression in the community. Effective performance isolation via the QoS effort (initially available
in the SN ONTAP release) could be useful.
FAS system mismatch

The feature-rich, unified FAS storage system with ONTAP and WAFL at its core is not best suited for CODBMS. It is relatively expensive in terms of dollars per
GB or MB/s (bandwidth), as explained in the New Apps position paper (http://cto.eng.netapp.com/spt/sections/news/NewAppsPositionPaper-v1.1.docx) .

The trend of building high-end FAS systems like Absolut (FAS6200 series) that have large caches and many hardware threads to handle lots of IOPS runs counter
to the CODBMS use of storage. In CODBMS deployments, FAS storage controller caches are not effective. Collectively, host servers running CODBMS systems
have at least as much, and often much more, DRAM than a modestly sized cluster of high-end FAS controllers running DOT 8.x. Therefore most out-of-core accesses by CODBMS touch large amounts of data, resulting in mostly large serial reads for which controller caching is ineffective.

Others within NetApp have recognized the price/performance mismatch between application server needs and FAS offerings and proposed a DAS-like alternative architecture called SimpleStore/DAS with a value proposition better suited for CODBMS. Another alternative to the FAS system is the NetApp E-Series system (from the recent Engenio acquisition); it shares many of the characteristics outlined in the SimpleStore document.

Opportunities

While FAS is not best suited as-is to compete in the space of low $/GB and effective performance for streaming reads and writes, with proper positioning and/or development of certain strategic features, ONTAP-based FAS storage can become a viable player in certain deployments, especially Type I, II, or III. In deployments that put a premium on cost and simplicity without many value-added features, the E-series as part of the HadoopPod (a bundle with servers and switches targeting Apache Hadoop deployments and analogous to the already offered FlexPod solution (http://www.netapp.com/us/technology/flexpod/) ) can be effective, especially for Type IV (HBase, which is an HDFS-based solution).

We believe that NetApp solutions for columnar DBs should be generally positioned as follows:

FAS systems cannot compete on performance (IOPS/BW) at a low price point alone. Offerings should leverage ONTAP's advanced
data management features for data at rest (replication, DR, etc.) along with data efficiency (RAID-DP and possibly deduplication)
at acceptable baseline performance.
Develop solutions that manage data both on host-attached storage devices and in a shared data center infrastructure (i.e.,
clustered ONTAP and across data centers for DR).
Focus on object-based (e.g., the CDMI model with a RESTful web interface) or NAS rather than block-based SAN solutions. Most
applications prefer a file-based/POSIX API, and the semantically richer protocol allows building solutions for fine-grained
object-level management (as opposed to whole LUNs).
More specifically, we believe the following ONTAP features are of benefit to column-oriented DBs:

1. features derived from WAFL snapshots such as data retention and replication for disaster recovery
2. features derived from object-level management (currently volume-level or individual file sis-clones)
3. storage efficiency derived from RAID-DP on SATA drives combined with HA (through failover/giveback) and multi-path access
(with C-mode S-blade for SAN protocols or NFS vservers).
4. seamless data/capacity management with boundary-less containers (InfiniVols).
We next highlight CODBMS-relevant ONTAP features that are either already present in the product, considered for future releases (though not yet committed for development), or not yet on the product roadmaps. We detail specific AD projects in a later section.

Fine-grained object management

Currently, ONTAP offers coarse-grained object management at the volume (FlexVol or LUN) level, such as volume-based SnapMirror (for D/R), snapshots (for point-in-time recovery), etc. However, in the existing CODBMS solutions the notion of a volume or a file system with a mount point is not very common. Instead, most systems manage individual objects directly, such as difference logs or individual columns. Put differently, the ONTAP features for managing storage objects are very useful to CODBMS, but not at the current whole-volume level. ONTAP ought to provide a set of features at the individual object (file) level.

This requirement has already been presented for virtualized data centers with VMDKs. CODBMS would also benefit from the file-level operations considered for those environments:

snapshots
copy offload
hole punching

In particular, the ability to snapshot a collection of application-specified files (e.g., a collection of columns) for point-in-time recovery would be of value to CODBMS. Recall that merge operations tend to either update certain pages of the existing column (i.e., punch holes in the existing file) or update whole columns when they must re-encode the data, creating new versions of files and bulk-deleting (unlinking) old versions. A selective snapshot of a collection of files would allow applications to do so safely. Similarly, efficient deletes that can reclaim space immediately are of importance to CODBMS as well, particularly for storage systems with a high cost of storage capacity.

The upcoming RR ONTAP release will provide a feature, called constant-time sis clones, that allows for clones of individual files or parts of files (sub-LUNs). The feature will allow between 200 and 500 simultaneous file clone operations. Exploiting this new feature and exploring new and effective ways of "atomically" cloning an application-specified collection of files is recommended, as such an ONTAP capability would provide value to CODBMS and could be a feature distinguishing NetApp from competing products.

Most CODBMS currently implement their own version of snapshots and backups for D/R. Copy offload functions of specific files, working in collaboration with
the CODBMS, would provide a more efficient solution that would reduce network utilization and simplify the CODBMS design and implementation. The
development of a host-side kit in the style of application-specific snap managers would add value to CODBMS Types I-V.

InfiniVols and Searchlight

CODBMS generally do not manage storage in discrete units of volumes. Instead, they prefer a single name space with infinite capacity that can grow seamlessly.
Therefore the introduction of InfiniVols as part of the Beacon project in BR is important for effective support of CODBMS.

The continuation of InfiniVols in SN (project Searchlight) is equally, if not more, important. It will offer multi-protocol access to the same data, expanding Beacon's NFS access with the Web RESTful interface defined as part of the CDMI standard.

InfiniVols are particularly attractive because they provide transparent storage management with non-disruptive increases in capacity. They can also provide the
basis for HDFS access (Type IV systems). The single vserver per InfiniVol is not a limitation as a single vserver can have multiple interfaces (vifs) spread across
Nblades. However, the namespace constituent volume that resides on a single Dblade can become a bottleneck.

Improving basic R/W performance

To compete successfully with the above-mentioned value-add features, ONTAP must first close the performance gaps around

1. efficient handling of bulk writes (i.e., avoiding back-to-back CPs and/or NV-RAM capacity limitations)
2. efficient non-degrading serial read performance.

For the first shortcoming, investigations around alternate layouts with the White Alligator project (the new WAFL write allocator infrastructure introduced in RR) or a design of a suitable Falafel region type are necessary. To overcome the second shortcoming, the recent efforts around WAFL contiguous segment cleaning and black-box readahead (aka RAVE) are a step in the right direction. However, it is not clear how much improvement these new features will offer to CODBMS workloads. Thus, we recommend evaluating these features in the context of this specific workload rather than with the traditional SPC-1 and similar performance benchmarks routinely run by the ONTAP performance teams.

The current biggest impediment (as of the BR ONTAP release) to better streaming read performance is a well-understood limitation in write allocation. WAFL allocates by default at most 64 contiguous blocks from a single file/LUN onto a single device (the -max-write-alloc-blocks option for C-mode volumes). Thus, in the best case (without any overwrites or WAFL layout fragmentation), the largest disk read I/O possible is 256 KB. This is too small to come anywhere near the disk's streaming bandwidth of 100+ MB/s: for a 2010 SATA (near-line) disk, 256 KB represents only 1/6th to 1/4th of a single disk drive track. The allocation size can be adjusted, and the WAFL Disk Performance team has experimented with this in the past to improve SPC-1 results. However, while some improvement was observed, the default value was not changed.
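
A back-of-the-envelope model illustrates why the 256 KB cap matters. The sketch below assumes representative (not measured) 2010-era near-line SATA parameters for positioning time and media rate; under those assumptions, capping I/Os at 256 KB leaves a single spindle at roughly 17 MB/s, whereas 2 MB I/Os would recover most of the streaming rate.

  public class DiskIoModel {
      // Effective single-disk throughput for a given I/O size:
      //   throughput = ioSize / (positioningTime + ioSize / streamingRate)
      static double effectiveMBps(double ioSizeMB, double positioningMs, double streamingMBps) {
          double transferMs = ioSizeMB / streamingMBps * 1000.0;
          return ioSizeMB / ((positioningMs + transferMs) / 1000.0);
      }

      public static void main(String[] args) {
          double positioningMs = 12.0;   // assumed average seek + rotational latency
          double streamingMBps = 100.0;  // assumed media (streaming) rate
          System.out.printf("256 KB I/Os: %.0f MB/s%n", effectiveMBps(0.25, positioningMs, streamingMBps)); // ~17 MB/s
          System.out.printf("  2 MB I/Os: %.0f MB/s%n", effectiveMBps(2.0, positioningMs, streamingMBps));  // ~63 MB/s
      }
  }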

Another impediment to more efficient execution of large I/Os within ONTAP is the large cost, in both CPU path length and metadata space overhead, of WAFL's approach to storage management at 4KB block granularity. Currently, the largest chain of buffers that can be sent as a single disk I/O is 32 buffers, assuming the requested data is laid out serially on the media. This is still too small an I/O - 128 KB at most.

The MetaWAFL project (as of this writing slated for the LB release, with a possibility of some limited functionality in SN) may provide some relief in terms of the metadata overhead --- it will only need one descriptor in the file's i-node for an extent of up to 32 4KB disk blocks (this number is implementation specific and may change in the final release). However, the current setting will not result in a larger I/O sent to the disk (as a single SCSI command). In addition to the more efficient descriptor for an extent of 4KB blocks that MetaWAFL offers, a more efficient buffer cache organization is needed. Currently the waflbuf structure (the in-memory header for each 4KB cached block) is 312 bytes (on x86_64 architectures). This imposes a large memory overhead (roughly 7-8% of the cached data itself, i.e., tens of GB of headers per TB of 4KB buffers) and thus increases the cost of the solution for workloads that predominantly deal with large I/O sizes.

Finally, another potential impediment to more efficient execution of large I/Os is the SpinNP data transport mechanism inside ONTAP. Currently, the SpinNP protocol supports messages with a data payload of up to approximately 128 KB. For NFS the SpinNP payload size is sufficient, since a single NFS RPC is typically not larger than 32 KB. In the SN ONTAP release the Searchlight feature will offer a data access method via the Web RESTful interface as part of the CDMI support. This interface allows for large read and write (many MB in size) operations on objects via a single HTTP operation (GET or PUT); the HTTP protocol supports multi-segment operation, similar to a scatter-gather iovec operation.

It is not clear, however, if the current SpinNP data payload size limitation will be a real or just a perceived bottleneck for the Web RESTful interface when reading or writing objects whose size is 10s of MBs. CDMI, or more specifically the Web RESTful interface, is relevant for one class of CODBMS systems - Type IV. Recall that the default HDFS chunk size is 64 MB. Thus, further investigation is needed to determine whether the limited transport message size is a problem or whether the above-mentioned WAFL and storage layer-related limitations are the primary bottleneck.

E-series solutions

The E-series product line coming from the recent Engenio acquisition offers cost-effective, bandwidth-efficient solutions that could play a significant role in the CODBMS space as an alternative to ONTAP. The E-Series offers iSCSI or SAS-based volumes with some basic snapshot capabilities. Compared to ONTAP-based solutions, the E-series provides much better bandwidth on the front end relative to the capabilities of the back-end disks. As mentioned before, a solution called HadoopPod is in the works. It is a hardware offering of NetApp E-Series systems with Cisco servers and switches targeting Apache Hadoop deployments, analogous to the already offered FlexPod solution (http://www.netapp.com/us/technology/flexpod/) . The initial target is the Hadoop analytics space, providing an attractive hardware offering for Hadoop-based clusters. Such a bundle can be especially effective for Type IV deployments (HBase, which is HDFS-based).

While the E-series is attractive for its low cost of storage capacity as well as bandwidth (MB/s achieved per unit of cost), it is not currently strongly positioned to offer the advanced data management features that ONTAP offers. First, it supports only LUN (or block volume-based) access via iSCSI, FC, or SAS. For CODBMS, an NFS or object-based approach is more advantageous. Second, volume-based snapshots introduce a potentially large performance penalty. E-series volumes support stripe unit sizes (the size of a single I/O per disk in a RAID group) of up to 512 KB (ONTAP, by default, offers only 128 KB). However, a single logical block update results in a copy-on-write (COW) operation on a 16KB chunk. Furthermore, the inheritance-based implementation of snapshots and clones can lead to multiple COW operations as a result of a single block update. For this reason, the current E-series 2600 product release limits the number of updates (or COW operations) to 16 per single I/O (i.e., 16 snapshots, or copies, per base volume); even so, a single small update can in the worst case trigger up to 16 x 16 KB = 256 KB of copy I/O.

In summary, the E-series, as currently implemented with its block-only access via iSCSI, SAS (serial SCSI), or FC, is potentially attractive for CODBMS Type IV and possibly Type V deployments, while less so for the other types. Alternatively, the E-series may be suited for Type II or III deployments that put a premium on high bandwidth from SCSI/SAN-based LUNs, can make do with a basic set of snapshot/DR capabilities (compared to ONTAP), and can tolerate the performance penalty associated with maintaining pointer chains for several snapshots.

Closing the gaps


We now describe a collection of possible steps that NetApp could undertake to address some of the issues described above and be better positioned in the CODBMS space. We outline short-, medium-, and long-term efforts that should be undertaken. Additionally, whenever applicable, we describe relevant activities already under way within NetApp. In the next section, we propose a set of concrete ATG-driven projects.

Short-term efforts

Most of the activities for the short term (~1 year time horizon) coincide with the Hadoop Analytics effort. Because the Type IV workload (HBase) leverages the
Hadoop platform's HDFS, there is a natural overlap between the CODBMS topics and Hadoop Analytics as part of the New Apps space. Thus, an initial solution
addressing the needs of MapReduce jobs and working with HDFS will be adequate, if not tailored, for HBase.

The Hadoop Analytics effort is already working on the following solutions:

a bundled hardware platform, termed HadoopPod 1.0, which is a modification of the FlexPod based on the E-Series 2600 controller
with the Wembley disk shelf (4U enclosure with 60 hot-swappable disks)
a software solution for managing data for Hadoop analytics: staging data from an ONTAP-based storage system to a Hadoop
cluster for processing and analysis and staging it back out for retention with the ONTAP value proposition derived from WAFL
features.
Understanding Type IV and V workloads

To better understand the workload characteristics and I/O demands of Type IV and Type V systems, we should evaluate their performance using standardized benchmarks and model workloads. NetApp should conduct hands-on experimentation with these two systems. This experimentation would allow us to better understand the differences in the behavior of the respective systems and the different demands they put on the storage system, as well as the implications of the different architectural and data integrity models of HBase and Cassandra (both open-source NoSQL columnar stores). The results would allow us to precisely define product requirements addressing the market needs. See Section #Project II: Understanding HBase, Cassandra, and other technologies for more details.

The TESO organization has already conducted a proof-of-concept test with a Hadoop cluster, focusing on how to configure and customize the various system settings when using FAS-provided storage (FAS3170) in place of DAS. However, this study focused on configuration, with only limited performance evaluation. Similarly, the ASUP on Hadoop effort plans to leverage HBase in the future as an alternative to the current Oracle RDBMS-based solution for storing processed data.

We propose to leverage those two efforts, focusing on performance and workload characterization of the two different CODBMS systems. Note that, in addition to
helping us define the requirements for customer-oriented solutions, this would also help an internal customer --- the ASUP team in charge of implementing
the Hadoop-based ASUP solution.

SnapManager for HBase

Additionally, we propose to explore a solution for providing a SnapManager specifically for HBase (the Type IV workload/system). This feature will become more relevant as more customers use the Hadoop platform for storing semi-structured data in HBase alongside the original (unstructured) format used for MapReduce (M/R) data analysis. A typical workflow in this scenario is to do the initial analysis of the raw data using M/R (e.g., leveraging Flume or custom-written MapReduce code) and then store its results (i.e., the derived and reduced data) in a semi-structured format (HBase). This reduced data is then used for more detailed analysis, data mining, trending, etc. In this scenario, the derived data stored in HBase has much higher value to the customer's business than the original raw data and is thus a good candidate for SnapManager features (recovery, off-site copy, etc.).
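
A hedged sketch of what such a host-side workflow might look like: quiesce HBase by flushing the table's MemStores so that the HFiles on HDFS are current, then take a storage-side snapshot of the corresponding files. The HBaseAdmin calls below are from the HBase client API; the storage-side snapshot interface and the table/path names are hypothetical placeholders for the ONTAP integration such a SnapManager would provide.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class HBaseSnapshotSketch {
      // Hypothetical storage-side interface -- a stand-in for the file/volume
      // snapshot mechanism a SnapManager for HBase would invoke on the storage system.
      interface StorageSnapshots {
          void snapshotPath(String hdfsPath, String snapshotName) throws Exception;
      }

      static void protectTable(StorageSnapshots storage, String tableName) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HBaseAdmin admin = new HBaseAdmin(conf);
          // Ask the region servers to flush the table's MemStores so that the data to
          // be protected is materialized as HFiles on HDFS. The flush request is
          // asynchronous, so a production tool would wait for or verify its completion.
          admin.flush(tableName);
          // Hypothetical call: snapshot the table's directory on the backing storage.
          storage.snapshotPath("/hbase/" + tableName,
                               tableName + "-snap-" + System.currentTimeMillis());
      }
  }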

Core ONTAP features

The plans for the next ONTAP release (SN) are largely fixed. Thus, we only summarize the features that will help position ONTAP better for the CODBMS market, to ensure that they are not dropped from the release. Most of them have been discussed above; they include:

constant-time sis-clone for a collection of files
InfiniVols with CDMI (RESTful access)
Hybrid aggregates

Medium-term efforts

Host-side flash management

In reaction to the trend of moving data closer to the CPU and main memory complex, we envision host-side flash memory playing a more central role in supporting most application IOPS needs. The current host-side flash memory-based solutions focus on transparent caching of blocks nominally stored on network-attached storage. Both the NetApp-initiated Mercury project and the recent acquisition of I/O Turbine by Fusion I/O (http://www.fusionio.com/press-releases/fusion-io-to-acquire-io-turbine/) exemplify this trend. However, while transparent caching may accelerate the I/O, it does not leverage the technology's full potential. On the other hand, explicit management of the flash memory by the application has the potential to do so. Naturally, such a solution requires the cooperation of both the OS (to enforce basic access protection and hide device specifics) and the data-at-rest storage system. One of the impediments is that most current caching solutions use a block-device abstraction rather than an object-level abstraction. This does not allow an application to explicitly communicate to the storage system which data is important and which will not be useful in the future. Instead, the storage layer that includes the flash memory has to implicitly infer that for each block. As we know from the ONTAP experience, block-at-a-time storage management for TBs of data is challenging and inefficient.

The various OS abstractions (especially in virtualized environments with guest and host OSes) incur a relatively large latency due to the large number of CPU cycles needed to generate an I/O. This high CPU cost diminishes the overall capability of the device. For example, a Fusion I/O card is a PCIe-attached device, but it uses a block-device (SSD) abstraction with a software-based FTL and other management software running in the OS kernel, consuming a relatively large amount of CPU resources.

In order to more effectively leverage the IOPS, and more importantly the bandwidth, of host-attached flash memory (PCIe 2.0 x16 can deliver 8 GB/s), we need different abstractions and, more importantly, a management model for the data in flash memory, one that allows the application to explicitly manage the device in tight cooperation with a data-at-rest (or network-attached) storage system. In order to deliver a solution in this space in the long term (4-5 years), we need to start exploring how to more effectively manage host-side flash as part of the end-to-end storage solution. See the project description in Section #Project I: Host-side flash storage management for more details.

Solutions optimized for Type IV and Type V

The previously discussed HadoopPod may address the initial needs of Type IV (HBase on top of HDFS) CODBMS. Building on the results and conclusions of the performance study described above in Section #Short-term efforts, we propose to devise deployment solutions that are optimized for, rather than merely working with, Type IV and V workloads. The particular emphasis would be on addressing the market needs of Type V systems (Cassandra etc.).

ONTAP core performance

For the medium term, we envision NetApp having a better understanding of the potential performance bottlenecks when running CODBMS on top of ONTAP, along with solutions for addressing some of them. Several problems with the performance of larger I/Os at scale are already being addressed (RR or RR.1 release) or will
be addressed by the SN release time frame. These include:

RAID
  RAID-MP work
  ZCSv2 and tetrises aligned on ZCS boundaries
WAFL
  White Alligator
  RAVE
  MetaWAFL
Nblade
  CDMI/HTTP-based RESTful interface (Searchlight)

In addition to the above-mentioned areas, we recommend adding support for more efficient execution of large I/Os within ONTAP (by changes to WAFL
allocation) and for more efficient use of RAM (WAFL buffer cache) that alleviates the high cost of 4KB-at-a-time buffer management. See Section #Project III:
ONTAP core technologies, which outlines specific projects that would address these shortcomings.

Long-term efforts

InfiniVols and E-Series

In its initial deployment, InfiniVols will use pointers from the namespace constituent volume to the data constituent volumes stored on different Dblades of the same ONTAP cluster. However, this pointer indirection opens the door for another possibility: the inclusion of heterogeneous storage platforms, or more specifically, the use of E-series systems as data storage nodes. This is akin to the pNFS architecture, with the ONTAP-based namespace constituent volume playing the role of the metadata server and the E-series systems serving as data nodes. This architecture, marrying ONTAP's data management features with the E-series' price/performance, is well suited for CODBMS, or more generally any application, looking for a seamless storage container and a good price-to-performance ratio.

Naturally, bringing this idea to fruition will entail significant engineering effort to solve the various technical challenges of cluster membership, redirection, etc. Similarly, the advanced snapshot features available today in ONTAP-based systems would have to be extended to cover the E-series data nodes. However, viewed differently, this solution could be a possible embodiment of the Falafel architectural vision, with ONTAP-based volumes performing the role of a Falafel volume while the E-series-based nodes function as regions storing individual extents. This solution would leverage technologies mostly available today: it would use the E-series block-based interface, with extents described by a starting LBN and a length. All metadata bookkeeping would be managed by ONTAP-based MetaWAFL volumes. Put differently, the data blocks (comprising a MetaWAFL extent) would be stored on an E-series block-based LUN while all requisite metadata (e.g., inodes and the indirect block tree) would be stored in an ONTAP-based volume.

Proposed projects
In this section we outline a series of investigative projects that advance the basic agenda of better aligning NetApp product offerings with the needs of column-oriented databases. The projects are related to the two major recommendations:

1. leveraging host-side flash and effectively integrating it with the feature-rich network-attached storage
2. providing both cost- and performance-effective ONTAP-based solutions with fine-grained object management

The projects vary in scope and leverage different approaches to carrying out ATC-sanctioned projects. Some of the projects are assumed to be carried out solely within ATG, while others are scoped for collaboration with BUs or other external entities, leveraging the model of outsourcing basic or speculative research work to universities.

Project I: Host-side flash storage management

We propose to explore ways to more tightly integrate host-side flash memory used by CODBMS with the features provided by a network-attached clustered ONTAP system. Currently, CODBMS use host-side flash abstracted as solid-state drives (SSDs) for fast out-of-core access (e.g., when scanning many GBs of data per node). However, the host-side flash is not integrated into the ecosystem of data set management providing high availability, D/R, etc. Instead, CODBMS use the direct-attached SSDs as caches, managing issues like data consistency, recovery, and durability themselves.

One approach would be to leverage the work done previously by the ATG project Mercury. Using Mercury's basic framework of abstracting a host-side flash device with a block interface, this project would focus on the set of mechanisms and policies around automatic management of data objects. More specifically, it would expand the basic caching with write-through semantics to the network-attached volume and introduce a set of management features that provide automated tiering of data, backup (i.e., durability), and other operations around data set life-cycle management. However, rather than focusing on basic mechanisms, which the Mercury project has done, the core of the proposed project would explore ways to use application-provided semantic knowledge of data objects (files) to facilitate migration of older data to a lower storage tier, leveraging the timestamp-based view of data. Put differently, the project would focus on the aspects of SLO-driven data management where the data is formatted and stored in flash memory in a potentially application-specific format rather than as the traditional storage management objects (e.g., volume, file, etc.). Finally, the project could explore mechanisms for automatically fetching data from a network-attached device so that it can be accessed as if it were resident in the host-side flash. Put another way, this project would explore snapshot manager-like functionality for use by CODBMS, replacing the specialized solutions offered by, for example, ParAccel.

Currently, the most common abstraction of a host-side flash-memory-based device is that of a block device. While this provides good IOPS, much of the bandwidth potential, especially for PCIe-attached devices like the Fusion I/O card, is not realized. This project does not intend to explore alternative abstractions of flash memory for efficient high-bandwidth access. Those issues will be explored by different projects defined below.
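
To make the proposed direction more concrete, the following is a minimal sketch of the kind of object-granular, write-through host cache with application hints that the project would explore. Every interface and method here is hypothetical; the sketch only illustrates how application-provided knowledge (e.g., "this column file is now cold") could drive placement, and it does not describe Mercury or any existing NetApp API.

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Hypothetical sketch of an object-granular host-side flash cache that writes
  // through to a network-attached (data-at-rest) storage system and accepts
  // application hints about object lifecycle.
  public class HostFlashCacheSketch {

      interface FlashDevice {                      // host-attached flash (PCIe card or SSD)
          void put(String objectId, byte[] data);
          byte[] get(String objectId);
          void evict(String objectId);
      }

      interface BackingStore {                     // e.g., a clustered ONTAP volume
          void write(String objectId, byte[] data);
          byte[] read(String objectId);
      }

      private final FlashDevice flash;
      private final BackingStore backing;
      private final Map<String, Boolean> resident = new ConcurrentHashMap<String, Boolean>();

      HostFlashCacheSketch(FlashDevice flash, BackingStore backing) {
          this.flash = flash;
          this.backing = backing;
      }

      // Write-through: durability and D/R come from the backing store; low-latency
      // reads come from the flash copy.
      void write(String objectId, byte[] data) {
          backing.write(objectId, data);
          flash.put(objectId, data);
          resident.put(objectId, Boolean.TRUE);
      }

      byte[] read(String objectId) {
          if (resident.containsKey(objectId)) {
              return flash.get(objectId);
          }
          byte[] data = backing.read(objectId);    // miss: fetch from the data-at-rest system
          flash.put(objectId, data);
          resident.put(objectId, Boolean.TRUE);
          return data;
      }

      // Application hint: this object (e.g., an old column version) will not be read
      // again soon, so reclaim its flash space; the authoritative copy stays at rest.
      void hintCold(String objectId) {
          flash.evict(objectId);
          resident.remove(objectId);
      }
  }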

Project II: Understanding HBase, Cassandra, and other technologies

To advance the New Apps agenda and to get hands-on experience with the relevant technologies, we propose to explore the use of HBase and the related technologies - Hive (a SQL-like engine on top of Map/Reduce jobs in Hadoop/HBase) and Pig (a procedural language for expressing data analytics jobs). As part of preparing this report, the authors conducted some preliminary hands-on explorations with a local installation of HBase. The purpose was to gain familiarity with the software and APIs and to understand the storage (in)efficiency of using HDFS files to represent HBase columns.

The proposed project would conduct a more in-depth investigation of the structure and I/O access patterns of the two open-source CODBMS listed above. For example, one option would be to measure the effectiveness of column compression in HBase with respect to space efficiency and performance overheads. An extension to this project would be an assessment of whether ONTAP file compression (a recently added ONTAP feature) could help HBase on top of HDFS. For Cassandra, DAS flash memory (in SSDs) speeds up Cassandra I/O execution. However, it is not known in great detail (to us) which aspects of the application's I/O benefit most from flash memory (append-style writes vs. the more random reads due to the fragmented, log-structured organization).
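
As one example of the kind of measurement the project would perform, the sketch below uses the Hadoop FileSystem API to compare the on-HDFS footprint of two HBase table directories (say, the same dataset loaded into a table with and without column-family compression); the directory paths are illustrative, not actual deployments.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.ContentSummary;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HBaseFootprint {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // Illustrative paths: the same dataset loaded into an uncompressed table
          // and into one with column-family compression enabled.
          for (String dir : new String[] { "/hbase/usertable_plain", "/hbase/usertable_gz" }) {
              ContentSummary cs = fs.getContentSummary(new Path(dir));
              System.out.printf("%s: %d MB logical, %d MB raw (including HDFS replication)%n",
                      dir,
                      cs.getLength() / (1024 * 1024),
                      cs.getSpaceConsumed() / (1024 * 1024));
          }
      }
  }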

In addition to contrasting the two technologies and providing more insight into their behavior and the demands they put on the storage system, we propose to frame these explorations in the context of the MEI BU, and in particular the OnCommand Insight product family. These systems would serve both as a model for monitoring a Hadoop cluster (for HBase) and, potentially, as a mechanism for addressing the growing needs of the product's analytical capabilities (currently, they use the Cognos data analytics package on top of MySQL).

Project III: ONTAP core technologies

We propose to research and devise ONTAP-internal features that can offer better performance and storage efficiency compared to the current ONTAP release.

Project III-A: Efficient extent writes

This project would focus on developing a data layout suitable for the CODBMS workload, i.e., a set of policies for devising workload-specific aggregate and volume layouts. One way of achieving this goal would be prototyping directly within ONTAP, leveraging the newly developed framework for extensible data allocation policies within WAFL, called walloc2 or White Alligator, introduced in the ONTAP RR release. Another approach would be to use Falafel.One as the prototyping platform directly and validate the ideas in a more controlled environment.

The White Alligator framework separates mechanisms from layout policies and offers abstractions and infrastructure for more effective allocation and an MP-scalable solution. The first ONTAP release with this framework will include only the basic WAFL allocation policies. These policies work well for certain kinds of workloads but are not generally suitable for the workload types exhibited by CODBMS. This project would focus on the policies governing data allocation that are best suited to CODBMS.

The second approach would be analogous to that chosen by the MetaWAFL team for exploring options for efficient object reference counting: using the Falafel.One framework for rapid prototyping with the intent to port/implement the results in the context of MetaWAFL, a WAFL BU-driven project developing a feature for inclusion in a future ONTAP release. Using the first approach (White Alligator) would aid in validating one of the White Alligator project's design goals: providing abstractions and mechanisms to easily plug in different allocation policies.

Regardless of the chosen approach (White Alligator or Falafel.One), this project would build upon previous ATG investments: MAPL and FLURry. MAPL
developed an efficient mechanism for capturing snapshots with valid-until snapshot ids and a b-tree organization. Recall that writable clones (with SLO changes) are not prevalent (or important) for CODBMS. MAPL structures were also optimized for this workload, but did not focus on providing named object abstractions;
MAPL looked at LUNs. In contrast, we expect named object (file) management to be important for CODBMS.

FLURry focused on efficient large/serial reads after small updates with per-I/O explicit SLOs and data sharing. FLURry also assumed the presence of a staging area implemented as append-only flash memory-based storage. If the chosen approach were based on the Falafel.One architecture, it would entail the design of a new volume type with the requisite specialized region types targeting CODBMS deployments and providing efficient handling of large reads and writes. The developed region type would be different from, and of lower complexity than, the FLURry region - data sharing is almost non-existent and CODBMS do not expect different SLOs in a single storage container. The new volume and region types would be optimized for large writes and whole-file unlinks (with immediate, rather than deferred, free-space reclamation).

Project III-B: Extent-based cache management

The 4KB-at-a-time block management is too costly for a PB-scale ONTAP cluster. Managing each 4KB WAFL block through the existing waflbuf structure (sized at 312 bytes, as noted above) is costly both in terms of space (the amount of DRAM consumed in metadata overhead) and time (consumed in creating a large I/O request by chaining many buffers to form a track-sized, efficient disk I/O). The same holds true for the limitations built into the existing prefetcher.

To alleviate this problem, we propose to design and prototype an extent-based buffer cache that works with the Falafel-style architecture. In particular, the extent-
based cache would allow for much more efficient tracking and execution of large I/Os and would support both MetaWAFL-style extents and the 3-way indexing
(tagging) of data extents in Falafel (volume, region, slab).
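
A minimal sketch of the data-structure change this project would prototype, assuming extents of up to 32 contiguous 4 KB blocks and the 3-way (volume, region, slab) tagging described above; this is our illustration of the idea, not the MetaWAFL or Falafel design. One descriptor per extent replaces up to 32 per-block buffer headers, which is where both the DRAM savings and the cheaper construction of large I/Os come from.

  import java.util.HashMap;
  import java.util.Map;
  import java.util.Objects;

  // Conceptual sketch of an extent-based buffer cache: data is tracked per extent
  // (up to 32 contiguous 4 KB blocks) rather than per 4 KB block, so a large I/O
  // needs one descriptor instead of a chain of 32 per-block headers.
  public class ExtentCacheSketch {

      static final int BLOCK_SIZE = 4096;
      static final int MAX_BLOCKS_PER_EXTENT = 32;

      static final class ExtentKey {
          final long volumeId, regionId, slabId;   // 3-way tagging (volume, region, slab)
          final long firstBlock;                   // first 4 KB block covered by the extent

          ExtentKey(long volumeId, long regionId, long slabId, long firstBlock) {
              this.volumeId = volumeId;
              this.regionId = regionId;
              this.slabId = slabId;
              this.firstBlock = firstBlock;
          }

          @Override public boolean equals(Object o) {
              if (!(o instanceof ExtentKey)) return false;
              ExtentKey k = (ExtentKey) o;
              return volumeId == k.volumeId && regionId == k.regionId
                      && slabId == k.slabId && firstBlock == k.firstBlock;
          }

          @Override public int hashCode() {
              return Objects.hash(volumeId, regionId, slabId, firstBlock);
          }
      }

      private final Map<ExtentKey, byte[]> extents = new HashMap<ExtentKey, byte[]>();

      void insert(ExtentKey key, byte[] data) {
          if (data.length > MAX_BLOCKS_PER_EXTENT * BLOCK_SIZE) {
              throw new IllegalArgumentException("extent larger than " + MAX_BLOCKS_PER_EXTENT + " blocks");
          }
          extents.put(key, data);                  // one cache entry for the whole extent
      }

      byte[] lookup(ExtentKey key) {
          return extents.get(key);
      }
  }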

Collaborative project with SAP

Jiri Schindler, with the help of Jeff Kimmel and Manfred Buchmann, proposed investigations in collaboration with SAP's Experimental Systems Group. We are currently in the process of finalizing the joint proposal, which includes two main threads:

1. A fine-grained object-level management layer that explicitly manages each SAP DB object and is aligned with ONTAP features
such as no-overwrite updates, snapshots, copy offload, or D/R copies.
2. Exploration of abstractions and interfaces for high-performance access to host-side flash, possibly combined with object-level
management. See the work-in-progress draft of the proposal text for more details.

The first aspect of the proposal is more tactical (better aligning the SAP architecture with the features and capabilities of the ONTAP-based storage system). The
second aspect, i.e., exploring the performance and data management implications of host-side flash, complements the proposed Project I and is more strategic in
nature.

Leveraging university research

In line with the ATC/CTO office approach of outsourcing basic research to universities, we outline two university efforts related to this work. These efforts could be very helpful in setting future directions with minimal investment from ATG.

MIT Database Group

NetApp recently awarded an NFF (NetApp Faculty Fellowship) grant to MIT Prof. Sam Madden to investigate database organizations at cloud scale. In addition to the regular oversight of the NFF grant, we propose to leverage this funding for co-authoring a publication and to use the NFF grant as leverage for the MIT group's NSF proposal.

The National Science Foundation (NSF) consistently looks for industry/academia collaboration in the grants it funds, providing increased funding to both the academic institution and the industrial partner. The fringe benefit would be increased recognition in academic and research circles with little (and no monetary) investment on NetApp's part.

CMU Parallel Data Lab

We propose to collaborate on ongoing research into suitable interfaces and access methods for a new class of host-attached storage in the form of storage-class
memory (SCM) such as flash or phase-change memory. This work would complement the proposed ATG Project I (see above). The expected outcome would be a
paper coauthored by the CMU PIs and a NetApp MTS.

NetApp, as a member of the PDL, was approached by Dave Anderson and Garth Gibson (both CS faculty at CMU and PDL affiliates) to collaborate on a paper/exploration of the right APIs and new abstractions for more efficient, low-overhead direct access. The key ideas of the potential collaboration are:

The OS provides protection, in a form akin to permissions in virtual memory page tables.
Read-only data in SCM is accessed directly, without an OS-abstracted interface (e.g., mmap or sockets).
Write accesses are "guarded" by the OS.

Recommendations
We conclude the report by recommending a set of next steps for ATG-driven projects. At a minimum, we recommend commencing Project I: Host-side flash storage management before the next ATG review. The project tackles an important trend in data centers and application architecture and fits the long-term vision common to several NetApp efforts: end-to-end efficiency of the IOPS layer. Stated differently, the project would fit under the umbrella of improving performance at the host and of more effectively managing data that is initially stored in host-side flash memory (cache).

Additionally, we recommend concurrently funding Project II: Understanding HBase, Cassandra, and other technologies. The scope and level of complexity of this project make it a good starter project for new ATG hires (with adequate guidance from a PI) or interns. More importantly, it targets an important tactical area in the New Apps space and complements the efforts already under way within NetApp in Hadoop Analytics. Finally, the project has a BU stakeholder: the MEI BU (the OnCommand Insight product suite).

Finally, we recommend executing on the proposed university outreach projects and the SAP collaboration as activities in parallel with the proposed ATG projects, as these would require little direct ATG involvement. With a higher level of investment, executing on Project III, or at least one of its parts, would be beneficial for improving core ONTAP efficiency. Not only would that address the gaps for CODBMS solutions, but other workloads and deployment scenarios would benefit as well.
