Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Performance Tradeoffs in Read-Optimized Databases

Stavros Harizopoulos Velen Liang Daniel J. Abadi Samuel Madden


MIT Computer Science and Artificial Intelligence Laboratory
32 Vassar Street
Cambridge, MA 02139
{stavros, vliang, dna, madden}@csail.mit.edu

ABSTRACT ing several columns in a table, a column store must ensure that it
Database systems have traditionally optimized performance for can read large sequential portions of each column, since the cost
write-intensive workloads. Recently, there has been renewed of seeking between two fields in the same tuple (stored separately
interest in architectures that optimize read performance by using on disk) can be prohibitively expensive. Similarly, writes can
column-oriented data representation and light-weight compres- actually be made relatively inexpensive as long as many inserts
sion. This previous work has shown that under certain broad are buffered together so they can be done sequentially. In this
classes of workloads, column-based systems can outperform row- paper, we carefully study how column orientation affects the per-
based systems. Previous work, however, has not characterized the formance characteristics of a read-optimized database storage
precise conditions under which a particular query workload can manager, using scan-mostly queries.
be expected to perform better on a column-oriented database.
In this paper we first identify the distinctive components of a 1.1 Read-Optimized DBMS Design
read-optimized DBMS and describe our implementation of a While column-oriented systems often have their own sets of col-
high-performance query engine that can operate on both row and umn-wise operators that can provide additional performance ben-
column-oriented data. We then use our prototype to perform an efit to column stores [1], we focus in this paper on the differences
in-depth analysis of the tradeoffs between column and row-ori- between column and row stores related solely to the way data is
ented architectures. We explore these tradeoffs in terms of disk stored on disk. To this end, we implement both a row- and col-
bandwidth, CPU cache latency, and CPU cycles. We show that umn-oriented storage manager from scratch in C++ and measure
for most database workloads, a carefully designed column sys- their performance with an identical set of relational operators. As
tem can outperform a carefully designed row system, sometimes data is brought into memory, normal row store tuples are created
by an order of magnitude. We also present an analytical model to in both systems, and standard row store operations are performed
predict whether a given workload on a particular hardware con- on these tuples. This allows for a fixed query plan in our experi-
figuration is likely to perform better on a row or column-based ments and a more direct comparison of column and row systems
system. from a data layout perspective.
Both of our systems are read-optimized, in the sense that the disk
1. INTRODUCTION representation we use is tailored for read-only workloads rather
A number of recent papers [21][7] have investigated column ori- than update-intensive workloads. This means, for example, that
ented physical database designs (column stores), in which rela- tuples on-disk are dense-packed on pages, rather than being
tional tables are stored by vertically partitioning them into single- placed into a slotted-tuple structure with a free-list per page. Fig-
column files. On first blush, the primary advantage of a column- ure 1 shows the basic components of a generalized read-opti-
oriented database is that it makes it possible to read just the sub- mized DBMS, upon which we base both of our systems (solid
set of columns that are relevant to a query rather than requiring lines show what we have actually implemented for the purposes
the database to read all of the data in a tuple, and the primary dis- of this paper). In this design we assume a staging area (the “write
advantage is that it requires updates to write in a number of dis- optimized store”) where updates are done, and a “read-optimized”
tinct locations on disk (separated by a seek) rather than just a
single file. compression
compression-aware
Surprisingly, these initial observations turn out to not be trivially advisor
operators
true. For example, to obtain a benefit over a row store when read- writes Query engine
MV
advisor reads
Permission to copy without fee all or part of this material is granted pro-
vided that the copies are not made or distributed for direct commercial MV
advantage, the VLDB copyright notice and the title of the publication and Write Read MV
its date appear, and notice is given that copying is by permission of the Optimized Optimized MV
Very Large Data Base Endowment. To copy otherwise, or to republish, to Store Store
post on servers or to redistribute to lists, requires a fee and/or special per-
mission from the publisher, ACM. merge DB storage
VLDB '06, September 12-15, 2006, Seoul, Korea.
Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09 Figure 1. Basic components of read-optimized DBMS.
store where tuples are permanently stored on disk. Tuples are speedup
periodically moved in bulk from the write store to the read store. 144
2
A compression advisor and a materialized view (MV) advisor 2-

Cycles per disk byte


choose compression schemes and appropriate vertical partition- 2 72
1.6-1.8
ing depending on the workload characteristics. This design is 2-
loosely inspired by the design of C-Store [21], a column-oriented 2
1.2-1.6 36
database that also maintains separate storage for writes and peri- 1-
odically merges new data into the read-only store. 2
0.8-1.2
1- 18
1.2 Performance Study 1
0.4-0.8
0-
We use our read-optimized systems to perform an in-depth analy- 9
1 8 12 16 20 24 28 32 36
sis of the tradeoffs between column and row-oriented architec-
tures, in terms of disk bandwidth, CPU cache latency, and CPU Tuple width
cycles. Our performance study is based on a set of scan-mostly Figure 2. Average speedup of columns over rows (see text
queries (which represent the majority of queries in read-mostly below for an explanation).
databases like data warehouses), operating on a database schema
derived from the TPC-H benchmark specification. The purpose of
column system over a row system when performing a simple scan
this paper is not to definitively settle the question of whether a
of a relation, selecting 10% of the tuples and projecting 50% of
column store is “better” than a row store, but to explore the situa-
the tuple attributes. The x-axis is the tuple width of the relation in
tions under which one architecture outperforms the other. We
bytes (either compressed or uncompressed). The y-axis represents
focus in particular on answering the following questions:
the system’s available resources in terms of CPU cycles per byte
1. As the number of columns accessed by a query increase, read sequentially from disk (cpdb). In Section 5 we show that
how does that affect the performance of a column store? typical configurations (varying the number of available CPUs,
disks, query requirements, and also competing traffic to both the
2. How is performance affected by the use of disk and L2 CPUs and the disks) range from 20 to 400 cpdb. As a reference, a
cache prefetching? modern single-disk, dual-processor desktop machine has a cpdb
3. On a modern workstation, under what workloads are col- of about 108.
umn and row stores I/O bound? For the graph of Figure 2, we compute the average speedup by
4. How do parameters such as selectivity, number of projected assuming the column system is accessing half of the columns
attributes, tuple width, and compression affect column store (50% projection). For example, for a tuple width of 8 bytes, a col-
performance? umn system reads only 4 bytes per tuple. Later we present results
that show that the speedup of columns over rows converges to 1
5. How are the relative performance tradeoffs of column and when the query accesses all attributes, but it can be as high as N
row stores affected by the presence of competition for I/O if the query only needs to access 1 / Nth of the tuple attributes. As
and memory bandwidth along with CPU cycles from com- Figure 2 shows, row stores have a potential advantage over col-
peting queries? umn stores only when a relation is lean (less than 20 bytes), and
In addition to a complete set of performance results comparing only for CPU-constrained environments (low values of cpdb). As
the column- and row-based systems, we present an analytical tuples get wider, column stores maintain a sizeable relative per-
model that provides accurate estimates of CPU and I/O costs for a formance advantage.
given hardware configuration and table layout. We note several
things that we explicitly chose not to study in this paper. Because 1.4 Contributions and Paper Organization
we are focused on the performance of queries in a read-optimized The main contribution of this paper is increased understanding of
system (e.g., warehouses) we do not model updates. Our view is the tradeoffs between column and row stores in read-optimized
that updates in these systems are not done via OLTP-style queries settings. We focus on isolating the important factors that affect
but rather happen via a bulk-loading tool. Building an efficient relative query performance in both column and row stores. Our
bulk-loader is an interesting but orthogonal research question. As performance study is based on a real implementation that strips
noted above, we also do not study the effects of column versus the problem to the bare minimum by avoiding irrelevant details.
row orientation on different database operators — our query plans This allows us to report results that are general and easy to repro-
are identical above the disk access layer. This decision allows us duce. Furthermore, by capturing important system parameters in a
to directly compare the two approaches, isolating implementation few simple measures, we are able to provide a set of equations
differences to the smallest possible part of the system. that can predict relative performance of column and row stores on
a wide range of different configurations.
1.3 Summary of Results The rest of the paper is organized as follows. Section 2 discusses
We find that, in general, on modern database hardware, column performance considerations in read-optimized databases (more
stores can be substantially more I/O efficient than row stores but specifically, the role of disk and main memory bandwidth), and
that, under certain circumstances, they impose a CPU overhead then describes our implementation. Section 3 contains the experi-
that may exceed these I/O gains. Figure 2 summarizes these mental setup, while Section 4 carries out the performance study.
results (Section 5 explains how we construct the graph). In this We present our analysis in Section 5. We discuss related work in
contour plot, each color represents a speedup range achieved by a Section 6 and conclude in Section 7.
2. READ-OPTIMIZED DBMS DESIGN RAID) can easily yield 100-300MB/sec of aggregate disk band-
In this section we first discuss performance considerations in width. When a sequential disk access pattern breaks, the disks
read-optimized database architectures, and then present our spend about 5-10 msec while the heads perform a seek to the new
implementation of a minimum-complexity, high-performance data location. To preserve high performance, the system needs to
relational query engine that can operate on both row and column- minimize the time spent on seeks. For multiple concurrent
oriented data. sequential requests, this is typically achieved by prefetching a
large amount of data from one file before seeking to read from
the next file. Modern database storage managers employ some
2.1 Performance Considerations form of prefetching. For example, the SQL Server Enterprise Edi-
Data warehousing and business intelligence applications rely on tion’s storage manager may prefetch up to 1024 pages (each page
the efficiency a database system can provide when running com- is 8KB).
plex queries on large data repositories. Most warehouses are bulk-
loaded during periods of low activity (e.g., over night) with data Even when database systems have the option to use an index on a
acquired from a company’s operation (such as sales, or clicks on large relation, in many cases it is better to use a plain sequential
web sites). Bulkloading is followed by periods when analysts scan. For instance, consider a query that can utilize a secondary,
issue long, read-only queries to mine the collected data. The data- unclustered index. Typically, the query probes the index and con-
base schema for these types of applications typically specifies a structs a list of record IDs (RIDs) to be retrieved from disk. The
few (or, just one) fact tables that store the majority of the newly list of RIDs is then sorted to minimize disk head movement. If we
collected information (e.g., sales/clicks), and a number of smaller were to assume a 5ms seek penalty and 300MB/sec disk band-
dimension tables for the rest of the data (e.g., customer/company width, then the query must exhibit less than 0.008% selectivity
info). In a typical schema, the fact tables store mostly foreign before it pays off to skip any data and seek directly to the next
keys that reference the collection of dimension tables. value (assuming 128-byte tuples and uniform value distribution).
In such environments, typical SQL queries examine fact tables When multiple concurrent queries scan the same table, often it
and compute aggregates, either directly from the fact table data, pays off to employ a single scanner and deliver data to multiple
or by first joining them with dimension tables. Data sometimes queries off a single reading stream (scan sharing). Teradata, Red-
needs to be corrected in a data warehouse, which can involve Brick, and SQL Server are among the commercial products that
updates to fact tables. Often, however, compensating facts are have been reported to employ this optimization, and, recently, it
used (e.g., a negative Sale amount). The majority of typical long- has also been demonstrated in the QPipe research prototype
running data warehousing queries involve sequential scans on the engine [13]. Such an optimization is orthogonal to data placement
fact tables. Speeding up the time it takes to read data off the disk (columns versus rows), and therefore we do not examine it in this
and, through the cache hierarchies, into the CPU can have a sig- paper.
nificant performance impact. Exactly this type of optimization
lies in the heart of a read-optimized DB design and comprises the 2.1.2 CPU Cache-Related Considerations
focus of this paper. Other types of optimizations (such as materi- In recent years, a thread of database research has been investigat-
alized view selection or multi-query optimization) are orthogonal ing CPU cache-related delays in database workloads, and has pro-
to scan-related performance improvements and are not examined posed cache-conscious algorithms and data placement techniques
in this paper. (relevant references and additional background can be found in
An important goal of read-optimized databases is to minimize the [3]). Recent studies have pointed out that commercial DBMS
number of bytes read from the disk when scanning a relation. For spend a significant amount of time on L2 cache misses [5][15].
a given access plan, there are two ways to achieve this goal: (a) The proposed methodologies calculate the time spent on L2 cache
minimize extraneous data read (information that is not needed to misses by measuring the actual number of misses (via CPU per-
evaluate a query), and (b) store data in a compressed form. To formance counters) and multiplying this number by the measured
achieve (a), read-optimized systems improve the space utilization memory latency. While this methodology accurately described
of a database page by packing attributes as densely as possible. DBMS performance on the machines used at the time (Intel Pen-
Since updates occur in bulk, there is no reason to leave empty tium Pro / Pentium II), it no longer applies to today's CPUs (such
slots in pages to accommodate inserts of a few tuples. Column- as Pentium 4). The reason is that all major chip manufacturers
oriented data can further reduce the number of bytes read, since now include both hardware and software data prefetching mecha-
only the columns relevant to the query need to be read. Compres- nisms1. Once the hardware detects that a memory region (such as
sion schemes can apply to both column and row data, and they a database page) is accessed sequentially, it will start prefetching
have been shown to speed up query execution [22][18][24][1]. data from main memory to the L2 cache. For example, Pentium 4
Database-specific compression techniques differ from general- will prefetch two cache lines (128 bytes each) for every accessed
purpose compression algorithms in that they need to be “light- cache line that is characterized by a predictable access pattern.
weight,” so that the disk bandwidth savings do not exceed the The implications of hardware cache prefetching are especially
decompression cost. In this paper we study three such compres- significant in data warehousing, as data pages are regularly
sion techniques (described later in Section 2.2.1). accessed sequentially. The bottleneck now becomes the memory
bus bandwidth, and not the cache latency, as previously thought.
2.1.1 Disk-Related Considerations This effect makes main memory behave somewhat similarly to a
As disk drives yield their peak data delivery bandwidth when data
is read sequentially, database systems embrace sequential storing 1. Intel Corporation. “IA-32 Intel® Architecture Software Devel-
and accessing of relations. As a reference, inexpensive worksta- oper's Manual, Volume 3: System Programming Guide.” (Order
tion configurations (e.g., plain SATA controllers and software Number 253668).
ROW STORAGE COLUMN STORAGE We compress data using three commonly used lightweight com-
4K page 4K page 4K page pression techniques: Dictionary, Bit packing, and FOR-delta (see
# rows A1 B1 Z1 A2 # val A1 A2 # val B1 B2 [22][10][1][24] for additional background on these schemes).
B2 Z2 . . . ... ... Compression schemes are typically chosen during physical
design. These three techniques yield the same compression ratio
.....
for both row and column data. They also produce compressed
Am Bm Zm page info
values of fixed length. We refrain from using techniques that are
An page info Bn page info
better suited for column data (such as run length encoding [1]) to
keep our performance study unbiased. We use bit-shifting instruc-
TABLE PA0 PA1 . . . eof PB0 PB1 . . . eof ...
P0 P1 eof tions to pack compressed values inside a page, and the reverse
stripe stripe procedure for decompressing at read time. Following is a brief
description of the three compression schemes we use.
Bit packing (or Null suppression). This scheme stores each
attribute using as many bits as are required to represent the maxi-
mum value in the domain. For example, if an integer attribute has
Figure 3.Page structure and disk storage layout for a single a maximum value of 1000, then we need at most 10 bits to store
table for Row data (left) and Column data (right). each attribute.
Dictionary. When loading data we first create an array with all
disk when it comes to sequential accesses: it allows the CPU to the distinct values of an attribute, and then store each attribute as
overlap computation with memory-to-cache transfers (on top of an index number to that array. For example, to compress an
disk-to-memory transfers). We describe in Section 4.1 how we attribute that takes the values “MALE” and “FEMALE,” we store
calculate memory-related delays when taking into consideration 0 or 1 for each attribute. We use Bit packing on top of Dictionary.
the hardware cache prefetcher. For configurations where memory At read time, we first retrieve the index value through bit-shift-
bandwidth is the bottleneck, database architectures can speed-up ing, and then make a look-up in the corresponding array.
query execution in the same way they deal with disk bandwidth:
FOR-delta. FOR (Frame-Of-Reference) maintains a base value
by avoiding reading extraneous data and by utilizing compressed
per block of values (in our case, per page), and then stores differ-
forms of data [24].
ences from that base value. For most of the experiments we use
FOR-delta, which also maintains a base value per block but stores
2.2 Implementing a Read-Optimized Engine the difference of a value from the previous one (the first value in
In this section we describe our implementation of a minimum- the block is the base value of that block). This scheme applies to
complexity, read-only relational query engine that operates on integer-type attributes that exhibit value locality. For example, a
both column and row data. We first describe page layouts and sorted ID attribute (100, 101, 102, 103, etc.), will be stored as (0,
compression schemes (Section 2.2.1), then the column and row 1, 2, 3, etc.) under plain FOR, and as (0, 1, 1, 1, etc.) under FOR-
data scanners (Section 2.2.2), and, lastly, the overall architecture delta; in both cases the base value for that page will be 100. At
and the rest of the relational operators (Section 2.2.3). read time, real values are calculated either from the base value (in
FOR) or from the base value and all the previous readings in that
2.2.1 Disk Storage for Columns and Rows page (in FOR-delta).
Since there are no real-time updates in a read-optimized DBMS,
we forego the traditional slotted-page structure in favor of a 2.2.2 Row and Column Table Scanners
tightly packed one (see Figure 3). For both row and column data, Scanners are responsible for applying predicates, performing pro-
a page contains an array of values: entire tuples for row data and jection and providing the output tuples to their parent operators.
single-attribute values for column data (we discuss in Section 6 Both the row and column scanner produce their output in exactly
alternative row page structures such as PAX [4]). At the begin- the same format and therefore they are interchangeable inside the
ning of the page, an integer value stores the number of entries. query engine (see Figure 4). Their major difference is that a row
We store additional page-specific information at the end of the scanner reads from a single file, whereas the column scanner
page (using a fixed offset). Such information includes the page ID must read as many files as the columns specified by the query.
(which, in conjunction with a tuple’s position in the page gives Both scanners employ the same I/O interface to receive buffers
the Record ID), along with compression-specific data. For sim- containing database pages. We describe the I/O interface in the
plicity, we use fixed-length attributes throughout this paper. For next sub-section (2.2.3). The row scanner is straightforward: it
variable-length attributes, additional offset values need to be iterates over the pages contained inside an I/O buffer, and, for
placed between values. each page, it iterates over the tuples, applying the predicates.
Tuples that qualify are projected according to the list of attributes
Pages are stored adjacently in a disk file. For column data, a table
selected by the query and are placed in a block of tuples. When
is stored using one file per column. In our system, we stripe data-
that block fills up, the scanner returns it to the parent operator.
base files across all available disks in a disk array. The page size
is a system parameter, and for all experiments in this paper we A column scanner consists of a series of pipelined scan nodes, as
use a 4KB value. For the type of sequential accesses we study, many as the columns selected by the query. The deepest scan
the page size has no visible effect on performance (as long as the node starts reading the column, creating {position, value} pairs
cost of crossing page boundaries is significantly lower than the for all qualified tuples. We use the same block-iterator model to
cost of processing a page). pass data from one scanner to the next. Once the second-deepest
final results and file storage. BerkeleyDB proved to be a mismatch to our
high-performance requirements since reading a page caused mul-
tiple copies between user and system space, and there was no
Aggr
Rel built-in prefetching. DB engines commonly use a multithreaded
block block-iterator
op storage manager to handle prefetching and also overlap I/O with
of tuples computation. We take advantage of the Asynchronous I/O (AIO)
model
interface in Linux 2.6 to implement a non-blocking prefetching
mechanism within our single-threaded application. Our AIO
attr1 | attr2 | attr3 attr1 | attr2 | attr3 interface performs reads at the granularity of an I/O unit.
Throughout the paper we use I/O units of 128KB per disk. Every
S pred1 | pred2 time we run the engine we also specify the depth of prefetching:
poslist2 + col1 + col2
how many I/O units are issued at once when reading a file. Disk
S
col3
S poslist1 + col1 transfers go directly to memory (i.e., using DMA) without involv-
ing the CPU, and without passing through the Linux file cache.
pred2 S There is no buffer pool (since it does not make a difference for
col2
sequential accesses). Instead, the AIO interface passes the query a
Column scanner Row scanner pointer to a buffer containing data from an I/O unit.
pred1 col1
Our I/O interface provides a minimum-overhead data path from
Figure 4. Query engine and scanner architecture. disk to memory. This allows us to report results that are not
biased by a non-efficient implementation. The code consists of
about 2,500 lines of C++ (without including the code to create,
scan node receives a block of tuples (containing position pairs), it
load, and query data), compiled with gcc, on Linux 2.6, using the
uses the position information to drive the inner loop, examining
libaio library for asynchronous I/O. Our code is publicly available
values from the second column. That is, we only evaluate predi-
and it can be used as a benchmarking tool for measuring the per-
cates corresponding to qualified records in the first column. If the
formance limit of TPC-H style queries, on both row and column
second predicate is true, we copy both values along with the posi-
data3.
tion to a new tuple that is passed on to the next scanner. When a
scan node has no predicates, it simply attaches all values corre-
sponding to the input positions, without re-writing the resulting 3. EXPERIMENTAL SETUP
tuples. At all times, the tuple blocks are reused between scanners 3.1 Workload
(and operators), so there is no new memory allocation during For all experiments we use two tables, LINEITEM and ORDERS,
query execution. As in most systems, we push scan nodes that based on the TPC-H4 benchmark specification. We populate the
yield few qualifying tuples as deep as possible. tables with data generated by the official TPC-H toolkit. Our only
modifications to the original specifications were the following:
2.2.3 Query Engine and I/O Architecture First, we use four-byte integers to store all decimal types (we pre-
Our query engine follows a pull-based, block-iterator model for serve the original values though). Second, for LINEITEM, we use
all operators. Each relational operator calls next on its child (or a fixed text type instead of a variable one for the L_COMMENT
children), and expects to receive a block (array) of tuples (Figure field, to bring the tuple size to 150 bytes. Finally, for ORDERS,
4). Adopting a block-iterator model instead of a tuple-by-tuple we drop two text fields, and change the size of another one, to
one amortizes the cost of procedure calls and also removes any bring the tuple size to 32 bytes. The schema of the two tables is
delays related to CPU instruction-cache performance [17][23]. shown in Figure 5. The row system uses 152 bytes to store a
We make the block size a tunable parameter and choose a value LINEITEM tuple (the extra 2 bytes are for padding purposes),
that makes the block fit in the L1 data cache to minimize L1 and 32 bytes to store an ORDERS tuple (same as the tuple
cache delays (on our machine, the L1 data cache is 16KB and we width). The column system packs attributes from each column
use blocks of 100 tuples). Relational operators are agnostic about contiguously without any storage overhead.
the database schema, and operate on generic tuple structures. We
The rationale behind the choice for the tuple sizes was to con-
have implemented the following operators: aggregation (sort-
struct a “wide” tuple that is larger than a cache line (the L2 cache
based and hash-based), merge join, and table scanners which can
line for Pentium 4 is 128 bytes), and a “narrow” one, that can fit
apply SARGable predicates.
multiple times within a cache line. Fixing the tuple size provides
In our implementation we took care to avoid overhead associated measurement consistency. TPC-H specifies a ratio of four
with extra functionality that was not necessary for the purposes of LINEITEM lines for every ORDERS line. To be able to compare
our study. We therefore omitted implementing a parser or opti- results between the two tables, we use scale 10 for LINEITEM
mizer; instead we use precompiled queries. With the same think- (60M tuples, 9.5GB on disk) and scale 40 for ORDERS (60M
ing, we opted for a single-threaded implementation. Evaluating a tuples, 1.9 GB on disk). TPC-H generated data allow for substan-
query in parallel is orthogonal to this study and our results trivi- tial compression. The compressed versions of the LINEITEM and
ally extend to multiple CPUs. ORDERS tuples are shown in Figure 5. We apply all three com-
Initially, we used the BerkeleyDB2 storage manager to manage pression techniques described in the previous section.
disk files, but eventually opted for a custom-built I/O interface
3. http://db.csail.mit.edu/projects/cstore/
2. http://www.sleepycat.com 4. http://www.tpc.org/tpch/
LINEITEM (150 bytes) LINEITEM-Z (52 bytes) 4. PERFORMANCE STUDY
1 L_PARTKEY int 1 non-compressed There are several factors that impact query performance when
2 L_ORDERKEY int 2Z delta, 8 bits comparing column and row stores. We study five such factors:
3 L_SUPPKEY int 3 non-compressed
4 L_LINENUMBER int 4Z pack, 3 bits • Query workload
5 L_QUANTITY int 5Z pack, 6 bits • Database physical design
6 L_EXTENDEDPRICE int 6 non-compressed • System parameters
7 L_RETURNFLAG text, 1 7Z dict, 2 bits
• Capacity planning
8 L_LINESTATUS text, 1 8 non-compressed
9 L_SHIPINSTRUCT text, 25 9Z dict, 2 bits • System load
10 L_SHIPMODE text, 10 10Z dict, 3 bits We discuss next how each factor may affect performance.
11 L_COMMENT text, 69 11Z pack, 28 bytes
(i) Query workload. The primary advantage of column over row
12 L_DISCOUNT int 12Z dict, 4 bits
stores is the ability to read from disk exactly the attributes
13 L_TAX int 13Z dict, 4 bits
selected by a query, thus saving on disk and main memory band-
14 L_SHIPDATE int 14Z pack, 2 bytes
width, along with CPU time that is required to read pages from
15 L_COMMITDATE int 15Z pack, 2 bytes
disk. We expect queries that select increasingly more attributes,
16 L_RECEIPTDATE int 16Z pack, 2 bytes
up to the entire tuple, to exhibit a convergence in the performance
of column and row stores. For queries with decreased selectivity
ORDERS (32 bytes) ORDERS-Z (12 bytes) (few qualifying tuples7), we expect the CPU time to decrease in
both column and row stores. The end-to-end performance of a
1 O_ORDERDATE int 1Z pack, 14 bits
query depends on the cost of other operators involved in the
2 O_ORDERKEY int 2Z delta, 8 bits
query plan, not just the table scans. Given an identical operator
3 O_CUSTKEY int 3 non-compressed
implementation in column and row stores, we expect the relative
4 O_ORDERSTATUS text, 1 4Z dict, 2 bits
end-to-end performance difference to decrease as the bulk of the
5 O_ORDERPRIORITYtext, 11 5Z dict, 3 bits
query cost moves to operators other than scans (we discuss this in
6 O_TOTALPRICE int 6 non-compressed
Section 5).
7 O_SHIPPRIORITY int 7Z pack, 1 bit
(ii) Database physical design. The tuple width in a table is spe-
Figure 5. Table schema. Compressed versions of the tables cific to a database schema, but it can change (to be narrower) dur-
are shown on the right (compressed attributes are noted ing the physical design phase, using vertical partitioning or
with the letter “Z” next to an attribute’s number). materialized view selection. During the design phase, the data-
base administrator (or an automatic tool) can also decide on
whether compression schemes will be used and how. Both nar-
rower (in the case of row stores) and compressed tuples remove
3.2 Hardware platform & measurement tools pressure from disk and main memory bandwidth requirements.
They also reduce CPU time spent executing disk read instructions
All experiments are performed on a Pentium 4 3.2GHz, with (CPU system time); for compressed tuples, we expect the CPU
1MB L2 cache, 1GB RAM, running Linux 2.6. Unless otherwise user time to slightly increase due to extra CPU instructions
noted, all data is read off a software RAID consisting of three required by decompression.
SATA disk drives, capable of delivering 180MB/sec bandwidth
(60MB/sec per disk). The default configuration of our system is a (iii) System parameters. A parameter that may differ across
128KB I/O unit with prefetch depth of 48 I/O units and tuple implementations is the prefetch unit size in large scans (how
block sizes of 100 entries. There is no file-caching in our system, much data the storage manager reads ahead). As we show later, a
so all data requests are served from the disk. sufficiently large prefetch buffer results into good performance
for both column and row stores for sequential scans, whereas
We use the standard UNIX utility iostat to monitor disk activ- small values negatively affect both systems. A small prefetch
ity. All measurements are collected through the PAPI5 library [16] buffer means that the disks spend more time seeking (moving the
which provides access to the CPU performance counters, disk heads) to read from multiple tables than actually reading,
enabling us to measure events at the micro-architectural level resulting into poor disk utilization.
(number of L2 cache misses, executed instructions, etc.). More
specifically, we use papiex6, which can monitor any binary and (iv) Capacity planning. The database administrator can specify
report performance counter events either in user mode or system how many disks a table will be striped over. When there are
(kernel) mode. The former are events that our application is more than one CPUs available, the DBMS decides how many
responsible for, while in user space (executing the code we CPUs will be used to evaluate the query in parallel (a parameter
wrote), and the latter are events that occurred while the CPU exe- known as “degree of parallelism” or DOP; the user may also
cutes system calls and OS code. The way we convert these mea- specify what the DOP is for a given query). Different ratios of
surements into time breakdowns is described in the next section, CPUs per disk for a given query may have different effects in col-
next to the results. umn stores than in row stores, depending on what the bottleneck
is (disk, CPU, or memory) for any given configuration.

7. Throughout the paper we adopt the following convention:


5. http://icl.cs.utk.edu/papi/ decreased or lower selectivity corresponds to a lower percentage
6. http://icl.cs.utk.edu/~mucci/papiex/ of qualified tuples.
60 12
usr-rest (top)
50 10 usr-L1

Elapsed time (sec)


Elapsed time (sec)

usr-L2
40 8
Row usr-uop
30 Column 6 sys
Row CPU
20 Column CPU 4

10 2

0 0
4 20 36 52 68 84 100 116 132 148 # of attributes selected (see Fig. 5 for the type
Selected bytes per tuple row store column store of each attribute)

Figure 6. Baseline experiment (10% selectivity, LINEITEM). Left: Total elapsed time (solid lines) and CPU time (dashed lines) for
column and row store. The total elapsed time is equal to I/O time since CPU time is overlapped. X-axis is spaced by the width of
selected attributes. Right: CPU time breakdowns. The first two bars are for row store, the rest are for column store.

(v) System load. The system load (disk utilization and CPU uti- shows the total time (solid lines), and the CPU time separately
lization) can significantly affect a single query’s response time. (dashed lines) as we vary the number of selected attributes on the
Obviously, a disk-bound query will see a big increase in the x-axis. Both systems are I/O-bound in our default configuration
response time if it is competing with other disk-bound queries. (1 CPU, 3 disks, no competition), and therefore the total time
Competing disk and CPU traffic may again have different effects reflects the time it takes to retrieve data from disk. Both systems
in column stores than in row stores. are designed to overlap I/O with computation (as discussed in
Section 2). As expected, the row store is insensitive to projectiv-
We base all experiments on a variant of the following query:
ity (since it reads all data anyway), and therefore its curve
select A1, A2 … from TABLE remains flat. The column store, however, performs better most of
where predicate (A1) yields variable selectivity the time, as it reads less data. Note that the x-axis is spaced by
Since the number of selected attributes per query is the most the width of the selected attributes (e.g., when selecting 8
important factor, we vary that number on the X-axis through all attributes, the column store is reading 26 bytes per LINEITEM
row, whereas for 9 attributes, it reads 51 bytes — see Figure 5 for
experiments. Table 1 summarizes the parameters considered, what
the expected effect is in terms of time spent on disk, memory bus detailed schema information).
and CPU, and which section discusses their effect. The “crossover” point that the column store starts performing
worse than the row store is when selecting more than 85% of a
tuple’s size. The reason it performs worse in that region is that it
Table 1: Expected performance trends in terms of elapsed makes poorer utilization of the disk. A row store, for a single
disk, memory transfer, and CPU time (arrows facing up mean scan, enjoys a full sequential bandwidth. Column stores need to
increased time), along with related experimentation sections. seek between columns. The more columns they select, the more
time they spend seeking (in addition to the time they spend read-
parameter Disk Mem CPU section
ing the columns). Our prefetch buffer (48 I/O units) amortizes
selecting more attributes some of the seek cost. A smaller prefetch buffer would lower the
(column store only) 4.1
performance of the column store in this configuration, but addi-
decreased selectivity 4.2 tional disk activity from other processes would make the cross-
narrower tuples 4.3 over point to move all the way to the right. We show these two
compression 4.4 scenarios later, in Section 4.5.
larger prefetch 4.5 While this specific configuration is I/O-bound, it still makes
sense to analyze CPU time (dashed lines in left graph of Figure
more disk traffic 4.5
6), as it can affect performance in CPU-bound configurations, or
more CPUs / when the relations are cached in main memory. The left graph of
5
more Disks
Figure 6 shows the total CPU time for both systems. We provide
a time breakdown of CPU costs in the graph on the right part of
Figure 6. The first two bars correspond to the row store, selecting
4.1 Baseline experiment 1 and 16 attributes (the two ends in our experiment). The rest of
the bars belong to the column store, selecting from 1 to 16
select L1, L2 … from LINEITEM
attributes. The height of each stacked bar is the total CPU time in
where predicate (L1) yields 10% selectivity
seconds. The bottom area (dark color), is the time (in sec) spent
As a reminder, the width of a LINEITEM tuple is 150 bytes, it in system mode. This is CPU time spent while Linux was execut-
contains 16 attributes, and the entire relation takes 9.5GB of ing I/O requests and we do not provide any further break-down of
space. Figure 6 shows the elapsed time for the above query for that. For the row store, the system time is the same regardless of
both row and column data. The graph on the left of the figure the number of selected attributes. For the column store it keeps
10 usr-rest (top)
12 12
Elapsed time (sec)

8 usr-L1 usr-rest (top)

Elapsed time (sec)


usr-L2 10 10 usr-L1
6 usr-uop usr-L2
8 8
4 sys usr-uop
6 6 sys
2
4 Row 4
0
2 Column 2
column store
row store # of attributes selected 0 0
Figure 7. Changing selectivity to 0.1% (LINEITEM table). 4 8 12 16 20 24 28 32 column store
Selected bytes per tuple # of attributes selected
Figure 8. 10% selection query on ORDERS (narrow tuples).
increasing as the system performs more I/O. The jumps in bars
#10, #11, #12, are due to the larger, string attribute types that are
(#12 - #16). These observations lead us to two questions and to
being added to the selection list. Note that the column store ends
our next experiments: what happens when the column store has
up spending more system time than the row store when selecting
less work for each additional column (decreased selectivity: Sec-
all attributes, due to more work needed by the Linux scheduler to
tion 4.2), and what happens when both systems access narrower
handle read requests for multiple files.
structures (Section 4.3 and Section 4.4).
The rest of the fields in the stacked bars are user time. Moving
upwards, the next field (usr-uop) is the minimum time the CPU 4.2 Effect of selectivity
could have possibly spent executing our code. Pentium 4 can exe-
cute a maximum of 3 micro-operations (uops) per cycle. By mea- select L1, L2 … from LINEITEM
suring the actual number of uops (and dividing by 3), we compute where predicate (L1) yields 0.1% selectivity
the minimum time. In reality, CPUs never achieve this time, due
Selecting fewer tuples through a very selective filter has no effect
to several stall factors [14], however, we can treat this time as a
on I/O performance. Figure 7 shows the CPU time breakdown in
metric of the work assigned to a CPU, and how far off the CPU is
the same format as Figure 6. As expected, system time remains
from a perfect execution. The next two fields (usr-L2 and usr-L1),
the same (since it depends on I/O). The row store also remains
is the time it took for data to move from main memory to L2, and
the same, since it has to examine every tuple in the relation to
from L2 to L1 cache, correspondingly.
evaluate the predicate, and so, memory delays continue to domi-
We carefully compute L2 delays for each access pattern, taking nate. The column store, however, behaves entirely different.
into consideration the hardware L2 prefetcher (see also discussion Selecting additional attributes adds negligible CPU work, since
in Section 2.1.2). For sequential memory access patterns, the the different scan nodes (other than the first) process on average
memory bus in our system can deliver 128 bytes (one L2 cache only one out of every 1000 values. The memory delay of transfer-
line) every 128 CPU cycles. The time spent fetching sequential ring large strings (attributes #9, #10, #11) is no longer an issue
data into L2 is overlapped with the usr-uop field (mentioned since their cost is much less than the cost of evaluating predicate
above). Whenever the CPU tries to access a non-prefetched cache L1.
line, the stall time is 380 cycles (measured time for random mem-
In general, for decreased selectivity of the first attribute in the
ory access time in our system). The usr-L2 area shown in the
query list, the pure CPU computation time (usr-uop) of column
stacked bar of Figure 6 is the minimum time the CPU was stalled
stores is very close to a row store, regardless of the number of
waiting for data to arrive in the L2, after subtracting any overlaps
selected attributes. However, as selectivity increases towards
with actual CPU computation. The usr-L1 field, on the other
100%, each additional column scan node contributes an increas-
hand, is the maximum possible time the CPU could have been
ing CPU component, causing the crossover point to move
stalled to L1 cache misses. In reality, that component is much
towards the left. Note that this behavior stems directly from the
smaller due to out-of-order execution of modern CPUs [14].
pipelined column scanner architecture used in this paper (shown
Finally, the remaining, light-colored field, is the amount of time
in Figure 4), where each scan node is driven by a separate value
the CPU was active in user mode. A number of other factors can
iterator. One optimization that a column system could employ is
contribute to the remaining time (branch mispredictions, func-
to utilize a non-pipelined, single-iterator scanner. The way such a
tional unit stalls, etc.).
scanner works is the following. It first fetches disk pages from all
There are two observations to make for the CPU-time graph of scanned columns into memory. Then, it uses memory offsets to
Figure 6. Column stores require increasingly more CPU work access all attributes within the same row, iterating over entire
than a row store as more attributes are selected from a relation. rows, similarly to a row store. This architecture is similar to PAX
Their average performance, however, is marginally better due to [4] and MonetDB [7]. Since such an optimization is out of the
less work in system mode, and less time spent in main-memory scope of this performance study, we do not further discuss it.
transfers in user mode. The second observation is that the type of
selected attributes is crucial for column stores. Once the query 4.3 Effect of narrow tuples
expands enough to include the three strings of LINEITEM (#9,
#10, #11), we suddenly see a significant L2/L1 component that select O1, O2 … from ORDERS
will remain the same as the query goes back to adding integers where predicate (O1) yields 10% selectivity
10 total time 10 CPU time 35 Column 2

Elapsed time (sec)


30 Column 4
Elapsed time (sec)

8 Row 8 Row Column 8


25
Column FOR-delta Column FOR-delta Column 16
Column FOR Column FOR 20 Column 48
6 6
15 Row
4 4 10
5
2 2 0
4 8 12 16 20 24 28 32
0 0
Selected bytes per tuple
4 8 12 16 20 24 28 32 4 8 12 16 20 24 28 32
Selected bytes per tuple Selected bytes per tuple Figure 10.Varying the prefetch size when scanning ORDERS.
(when uncompressed) (when uncompressed) The row system is not affected by prefetching in this scenario.
Figure 9. Selection query on ORDERS-Z (compressed), total
time (left) and CPU time (right). The column system uses two ues in the page to perform decompression. Plain FOR compres-
different compression schemes for one of the attributes.
sion for that attribute (storing the difference from a base value
instead of the previous attribute) requires more space (16 bits
instead of 8), but is computationally less intensive. This is appar-
We switch tables and use ORDERS (32 bytes wide tuple, 7
ent in the right graph of Figure 9. This experiment does not nec-
attributes). Figure 8 contains the results (total time on the left,
essarily suggest that plain FOR is a better compression scheme
CPU breakdowns on the right). Both systems are still I/O-bound,
than FOR-delta. If our system was disk-constrained (for exam-
and so we see the same behavior as before in total time. The CPU
ple, if we used one disk instead of three), then the I/O benefits
behavior, however, is different. First, the system time is a smaller
would offset the CPU cost of that specific compression scheme.
percentage of the overall CPU time; the reason is that we still
scan the same number of tuples as before (ORDERS have the
same cardinality as LINEITEM), but perform less I/O per tuple. 4.5 Effect of prefetch size, competing traffic
Second, the pure computation (usr-uop) is almost the same as
select O1, O2 … from ORDERS
before, however, the memory-related delays are no longer visible
where predicate (O1) yields 10% selectivity
in either system: the main memory bandwidth surpasses the rate
at which the CPU can process data. In a memory-resident dataset, We repeat the experiment from Figure 8, this time varying the
for this query, column stores would perform worse than row prefetch depth. So far we have been using a prefetch depth of 48
stores no matter how many attributes they select. However, if we I/O units (each unit is 128KB per disk). Figure 10 shows, in addi-
were to use decreased selectivity (see CPU usr-uop time for col- tion to 48 units, results for prefetch depth of 2, 4, 8, and 16 units.
umn data in Figure 7), both systems would perform similarly. Since there is only a single scan in the system, prefetch depth
does not affect the row system. The column system, however,
4.4 Effect of compression performs increasingly worse as we reduce prefetching, since it
spends more time seeking between columns on disk instead of
select Oz1, Oz2 … from ORDERS-Z reading. It therefore makes sense to aggressively use prefetching
where predicate (Oz1) yields 10% selectivity in a column system. However, the same is true in the general case
for row systems as well, as we show next.
In studying the effects of compression, we initially ran a selection
query on LINEITEM-Z. However, the results for total time did Figure 11 shows three graphs where both the row and column
not offer any new insights (the LINEITEM-Z tuple is 52 bytes, system use a prefetch size of 48, 8, and 2 correspondingly, in the
and we already saw the effect of a 32-byte wide tuple). We there- presence of a concurrent competing scan. The competing scan is
fore only show results with the ORDERS-Z table (12 bytes wide) issued by a separate process that executes a row-system scan on a
in Figure 9. The graph on the left shows the total time whereas different file on disk (LINEITEM). In each scenario, we matched
the graph on the right shows CPU time. The x-axis is spaced on the prefetch size of the competing process to the prefetch size of
the uncompressed size of selected attributes. For this experiment the systems under measurement, to present the disk controller
we show results for two different compression schemes for with a balanced load.
attribute #2 of ORDERS, FOR and FOR-delta (see Section 2.2.1).
As Figure 11 shows, the column system (solid-triangle line) out-
In this query, the column store becomes CPU-bound (the row performs the row system in all configurations. This contradicts
store is still I/O-bound), and therefore the crossover point moves our expectations that a column system selecting all columns in a
to the left. Note that the slightly irregular pattern for total time in relation would perform the same as a row system in the presence
the left graph of Figure 9 is due to imperfect overlap of CPU and of competing disk traffic (since both systems should spend the
I/O time. Both systems exhibit reduced system times. The row same time reading off disk and waiting on disk-head seeks). It
store, for the first time so far, shows a small increase in user CPU turns out that a column system benefits from the fact it employs
time when going from selecting one attribute to all seven (right several scans. When the first data for column #1 arrives from
graph in Figure 9). The reason for that is the cost of decompres- disk, the CPU is moving on to column #2 while the disk still
sion. The column store exhibits a peculiar behavior in the CPU serves column #1. At that point, column #2 submits its disk
time of the FOR-delta curve, in that there is a sudden jump when request and the CPU is halted until all disk requests for column
selecting the second attribute. FOR-delta requires reading all val- #1 are served and the first data from column #2 arrives. Being
80 Row -48 80 Row -8 80
Elapsed time (sec)

Column-48 slow Column-8 slow


60 Column-48 60 Column-8 60

40 40 40
Row -2
20 20 20
Column-2 slow
Column-2
0 0 0
4 8 12 16 20 24 28 32 4 8 12 16 20 24 28 32 4 8 12 16 20 24 28 32
Selected bytes per tuple Selected bytes per tuple Selected bytes per tuple

Figure 11.Repeating previous experiment for prefetch size of 48, 8, and 2 units, this time in the presence of another concurrent scan.
See text for an explanation of “slow” curve.

one step ahead allows the column system to be more aggressive Throughout the analysis, “DISK” and “CPU” refer to all disks
in its submission of disk requests, and, in our Linux system, to and all CPUs made available for executing the query. Paralleliz-
get favored by the disk controller. As a reference, we imple- ing a query is orthogonal to our analysis. If a query can run on
mented a version of the column system that waits until a disk three CPUs, for example, we will treat it as one that has three
request from one column is served before submitting a request times the CPU bandwidth as a query that runs on a single CPU.
from another column. The results (“slow” line in Figure 11) for
Disk analysis. We model the disk rate R DISK (in tuples/sec) as
this system are now closer to our initial expectations.
the sum of rates of all files read, weighted by the size of a file (in
bytes):
5. ANALYSIS
File1 SizeFile1 File2 SizeFile2
The previous section gave us enough experimental input to guide R DISK = R ⋅ ------------------------------- + R ⋅ ------------------------------- + … (2)
SizeFileALL SizeFileALL
the construction of a simple set of equations to predict relative
performance differences between row and column stores under For example, in the case of a merge-join, if File1 is 1GB and
various configurations. We are primarily interested in predicting File2 is 10GB, then the disks process on average one byte from
the rate of tuples/sec at which row and column systems will oper- File1 for every ten bytes from File2. The individual file rates are
ate for a given query and given configuration. We summarize in defined as:
Table 2 the parameters we are going to use in our analysis along
with what different configurations these parameters can model. File N DiskBW -
R = -------------------------------
TupleWidth N
Our analysis focuses on a setting where scan nodes continuously
read input tuples which are then passed on to other relational DiskBW is simply the available bandwidth from the disks in
operators in a pipelined fashion. We assume (as in our implemen- bytes/sec, which we divide by the width of the tuples to obtain
tation) that CPU and I/O are well overlapped. For simplicity, we the rate in tuples/sec. DiskBW is always the full sequential band-
do not model disk seeks (we assume that the disks read sequen- width (we assume large prefetching buffers that minimize time
tially the majority of the time). The rate (tuples/sec) at which a spent in disk seeks, as shown in the previous section). Note that:
query can process its input is simply the minimum of the rate R
the disks can provide data and the rate the CPUs can process SizeFile i = N i ⋅ TupleWidth i
these data:
where N i is the cardinality of relation i.
R = MIN ( R DISK, R CPU ) (1)
We can now rewrite (2) as:

N1 + N2 + …
R DISK = DiskBW ⋅ ------------------------------- (3)
Table 2: Summary of parameters used in the analysis. SizeFileALL

parameter what can it model For a column store, we can derive a similar equation to the one
above:
SizeFile
various database schemas
TupleWidth Columns N1 f1 + N2 f2 + …
R = DiskBW ⋅ ----------------------------------------- (4)
MemBytesCycle various speeds for the memory bus DISK SizeFileALL
number of attributes selected by a query where f1, f2, etc. are the factors by which a regular row tuple is
f
(projection)
larger than the size of the attributes needed by a query. For exam-
CPU work of each operator ple, if a column system needs to read only two integers (8 bytes)
I (can model various selectivities for scanners,
or various decompression schemes) from ORDERS (32 bytes), the factor f is 4 (= 32 / 8).
more/fewer disks CPU analysis. To model CPU rate, we assume a cascaded con-
cpdb more/fewer CPUs nection of all relational operators and scanners. If an operator
competing traffic for disk / CPU processes Op 1 tuples/sec, and is connected to another operator
CPU analysis. To model the CPU rate, we assume a cascaded connection of all relational operators and scanners. If an operator processes Op1 tuples/sec, and is connected to another operator with rate Op2, which in turn is connected to an operator with rate Op3 and so on, the overall CPU rate R_CPU is:

1/R_CPU = 1/Op1 + 1/Op2 + 1/Op3 + …    (5)

The above formula resembles the formula for computing the equivalent resistance of a circuit where resistors are connected in parallel; we adopt the same notation (two parallel bars) and rewrite (5) (this time including the rates for the various scanners):

R_CPU = Op1 || Op2 || … || Scan1 || Scan2 || …    (6)

As an example, consider one operator processing 4 tuples/sec, connected to an operator that processes 6 tuples/sec. The overall rate of tuple production in the system is:

Op1 || Op2 = (Op1 · Op2) / (Op1 + Op2)

or 2.4 tuples/sec.
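The parallel-rate notation maps directly to code. The helper below is an illustrative sketch (the names and rates are ours, not taken from the paper's query engine) that applies equations (5) and (6) and reproduces the 4 || 6 = 2.4 tuples/sec example.

#include <cstdio>
#include <initializer_list>

// Equation (5): cascaded operators add their per-tuple times, so
// 1/R = 1/Op1 + 1/Op2, i.e. Op1 || Op2 = Op1*Op2 / (Op1 + Op2).
double parallel(double op1, double op2) {
    return (op1 * op2) / (op1 + op2);
}

// Equation (6): fold the same combination over every operator and scanner in the plan.
double cpuRate(std::initializer_list<double> rates) {
    double inverseSum = 0.0;
    for (double r : rates) inverseSum += 1.0 / r;
    return 1.0 / inverseSum;
}

int main() {
    std::printf("4 || 6   = %.1f tuples/sec\n", parallel(4.0, 6.0));        // 2.4
    std::printf("4, 6, 12 = %.1f tuples/sec\n", cpuRate({4.0, 6.0, 12.0})); // 2.0
    return 0;
}

Adding a third operator at 12 tuples/sec drops the overall CPU rate to 2 tuples/sec, showing how each extra stage in the pipeline can only lower the combined rate.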
The rate of an operator is approximated as:

Op = clock / I_Op    (7)

where clock is the available cycles per second from the CPUs (e.g., for our single-CPU machine, clock is 3.2 billion cycles per second). I_Op is the total number of CPU instructions it takes the operator to process one tuple. Note that we approximate the rate by assuming 1 CPU cycle per instruction. If the actual ratio is available, we can replace it in the above formula.

To compute the rate of a scanner, we have to take into consideration the CPU-system time, the CPU-user time, and the time it takes the memory to deliver tuples to the L2 cache. We treat CPU-system and CPU-user as two different operators. Further, we compute the CPU-user rate as the minimum of the pure computation rate and the rate at which the memory can provide tuples to the L2 cache. The latter is equal to the memory bandwidth divided by the tuple width. We compute memory bandwidth as clock times the number of bytes that arrive per CPU cycle (MemBytesCycle). Therefore we can write the rate Scan of a scanner as:

Scan = (clock / I_system) || MIN(clock / I_user, clock · MemBytesCycle / TupleWidth)    (8)
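As a concrete reading of equation (8), the sketch below evaluates a scanner's rate; the instruction counts (I_system, I_user), MemBytesCycle, and tuple width are placeholder values rather than numbers measured in our experiments, and only the 3.2 GHz clock matches the machine described above.

#include <algorithm>
#include <cstdio>

// Op1 || Op2 from equation (5).
double parallel(double op1, double op2) { return (op1 * op2) / (op1 + op2); }

// Equation (8): the CPU-system component of a scanner runs in series with the
// slower of (a) its CPU-user computation and (b) memory delivery into the L2 cache.
double scanRate(double clock, double iSystem, double iUser,
                double memBytesCycle, double tupleWidth) {
    double systemRate = clock / iSystem;                     // equation (7), system time
    double userRate   = clock / iUser;                       // equation (7), user time
    double memRate    = clock * memBytesCycle / tupleWidth;  // memory bandwidth / tuple width
    return parallel(systemRate, std::min(userRate, memRate));
}

int main() {
    double clock = 3.2e9;   // cycles/sec, as for the single-CPU machine used in the paper
    // Placeholder per-tuple costs: 400 system and 600 user instructions,
    // 2 bytes arriving per cycle from memory, 32-byte tuples.
    std::printf("Scan = %.2e tuples/sec\n", scanRate(clock, 400.0, 600.0, 2.0, 32.0));
    return 0;
}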
Speedup of columns over rows. We are now ready to compute the speedup of columns over rows by dividing the corresponding rates (using (1), (3), (4), (6), (7) and (8)):

           MIN( (N1·f1 + N2·f2 + …) / SizeFileALL ,  cpdb · (1/I_op || 1/I_ScanC || …) )
Speedup = --------------------------------------------------------------------------------
           MIN( (N1 + N2 + …) / SizeFileALL ,  cpdb · (1/I_op || 1/I_ScanR || …) )

where I_ScanC and I_ScanR denote the per-tuple scanner cost in the column and row systems, respectively. To derive the above formula we divided all members by DiskBW, and replaced the following quantity:

cpdb = clock / DiskBW

cpdb (cycles per disk byte) combines into a single parameter the available disk and CPU resources for a given configuration. It shows how many (aggregate) CPU cycles elapse in the time it takes the disks to sequentially deliver a byte of information. For example, the machine used in this paper (one CPU, three disks) is rated at 18 cpdb. By operating on a single disk, the cpdb rating jumps to 54.

Parameters such as competing traffic and the number of disks/CPUs can be modeled through the cpdb rating. Since competing CPU traffic fights for cycles, the cpdb rating for a given query drops. On the other hand, competing disk traffic causes cpdb to increase. Looking up trends in CPU [3] and disk speed (http://www.hitachigst.com/hdd/technolo/overview/chart16.html), we find that, for a single CPU over a single disk, cpdb has been slowly growing, from 10 in 1995 to 30 in 2005. With the advent of multicore chips, we expect cpdb to grow faster. When calculating the cpdb rating of an arbitrary configuration, note that disk bandwidth is limited by the maximum bandwidth of the disk controllers.

We use the speedup formula to predict the relative performance of column systems over row systems for various configurations by changing the cpdb rating. In disk-bound systems (when the disk rate is lower than the CPU rate), column stores outperform row stores by a factor equal to the ratio of the total size of the files to the total bytes selected. In CPU-bound systems, either system can be faster, depending on the cost of the scanners. In the previous section we saw some conditions (low selectivity, narrow tuples) where row stores outperform column stores. Note that a high-cost relational operator lowers the CPU rate, and the difference between columns and rows in a CPU-bound system becomes less noticeable. The formula can also be used to examine whether a specific query on a specific configuration is likely to be I/O- or CPU-bound.

The graph shown in the introduction (Figure 2) is constructed from the speedup formula, filling in actual CPU rates from our experimental section. In that figure we use a 50% projection of the attributes in the relation, and 10% selectivity. The graph shows that row stores outperform column stores only in very limited settings.
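To show how the speedup formula is used in practice, the sketch below evaluates it for a single-table scan under assumed per-tuple costs and a 4x projection factor, sweeping cpdb over the range discussed above; all workload numbers are illustrative placeholders, not values taken from our experiments.

#include <algorithm>
#include <cstdio>
#include <initializer_list>

double parallel(double a, double b) { return (a * b) / (a + b); }

// Predicted rate (tuples/sec) of a single-table scan query: the minimum of its
// disk rate (equations (3)/(4)) and its CPU rate (equations (6)/(7)), per equation (1).
double rate(double diskBW, double bytesPerTuple, double clock,
            double iOp, double iScan) {
    double rDisk = diskBW / bytesPerTuple;
    double rCpu  = parallel(clock / iOp, clock / iScan);
    return std::min(rDisk, rCpu);
}

int main() {
    // Assumed workload: 32-byte row tuples of which the query needs 8 bytes (f = 4);
    // the per-tuple instruction counts are placeholders, not measurements.
    double width = 32.0, selected = 8.0;
    double iOp = 100.0, iScanColumns = 400.0, iScanRows = 200.0;
    double clock = 3.2e9;   // cycles/sec

    // Sweep cpdb = clock / DiskBW: roughly 10 (1995), 30 (2005), and higher going forward.
    for (double cpdb : {10.0, 30.0, 100.0}) {
        double diskBW = clock / cpdb;
        double cols = rate(diskBW, selected, clock, iOp, iScanColumns);
        double rows = rate(diskBW, width,    clock, iOp, iScanRows);
        std::printf("cpdb = %5.1f  predicted speedup (columns/rows) = %.2f\n",
                    cpdb, cols / rows);
    }
    return 0;
}

With these particular assumptions the prediction favors the row store at cpdb = 10 (both systems are CPU-bound and the row scanner is cheaper), favors the column store at cpdb = 30, and converges to the byte ratio f = 4 at high cpdb once both systems are disk-bound.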
6. RELATED WORK

The decomposition storage model (DSM) [8] was one of the first proposed models for a column-oriented system designed to improve I/O by reducing the amount of data needed to be read off disk. It also improves cache performance by maximizing inter-record spatial locality [6]. DSM differs from the standard row-store N-ary storage model (NSM) by decomposing a relation with n attributes into n vertically partitioned tables, each containing one column from the original relation and a column containing TupleIDs used for tuple reconstruction. The DSM model attempts to improve tuple reconstruction performance by maintaining a clustered index on TupleID; however, complete tuple reconstruction remained slow. As a result, a variety of optimizations and hybrid NSM/DSM schemes have been proposed in subsequent years: (a) partitioning of relations based on how often attributes appear together in a query [9], (b) using different storage models for different mirrors of a database, with queries that perform well on NSM data or DSM data being sent to the appropriate mirror [19], and (c) optimizing the choice of horizontal and vertical partitioning given a database workload [2].

Recently there has been a reemergence of pure vertically partitioned column-oriented systems as modern computer architecture trends favor the I/O and memory bandwidth efficiency that column stores have to offer [7][21]. PAX [4] proposes a column-based layout for the records within a database page, taking advantage of the increased spatial locality to improve cache performance, similarly to column-based stores. However, since PAX does not change the actual contents of the page, its I/O performance is identical to that of a row store. The Fates database system [20] organizes data on the disk in a column-based fashion and relies on clever data placement to minimize the seek and rotational delays involved in retrieving data from multiple columns.
While it is generally accepted that generic row stores are preferred for OLTP workloads, in this paper we complement recent proposals by exploring the fundamental tradeoffs between column- and row-oriented DBMS on read-mostly workloads. Ongoing work [12] compares row- and column-based pages using the Shore storage manager. The authors examine 100% selectivity and focus on 100% projectivity, both of which are the least favorable workloads for pipelined column scanners, as we showed earlier.

7. CONCLUSIONS

In this paper, we compare the performance of read-intensive column- and row-oriented database systems in a controlled implementation. We find that column stores, with appropriate prefetching, can almost always make better use of disk bandwidth than row stores, but that under a limited number of situations their CPU performance is not as good. In particular, they do less well when queries process very narrow tuples, use long projection lists, or apply non-selective predicates. We use our implementation to derive an analytical model that predicts query performance with a particular disk and CPU configuration, and find that current architectural trends suggest column stores, even without other advantages (such as the ability to operate directly on compressed data [1] or vectorized processing [7]), will become an even more attractive architecture with time. Hence, a column-oriented database seems to be an attractive design for future read-oriented database systems.

8. ACKNOWLEDGMENTS

We thank David DeWitt, Mike Stonebraker, and the VLDB reviewers for their helpful comments. This work was supported by the National Science Foundation under Grants 0520032, 0448124, and 0325525.

9. REFERENCES

[1] D. J. Abadi, S. Madden, and M. Ferreira. “Integrating Compression and Execution in Column-Oriented Database Systems.” In Proc. SIGMOD, 2006.
[2] S. Agrawal, V. R. Narasayya, and B. Yang. “Integrating Vertical and Horizontal Partitioning Into Automated Physical Database Design.” In Proc. SIGMOD, 2004.
[3] A. Ailamaki. “Database Architecture for New Hardware.” Tutorial. In Proc. VLDB, 2004.
[4] A. Ailamaki, D. J. DeWitt, et al. “Weaving Relations for Cache Performance.” In Proc. VLDB, 2001.
[5] A. Ailamaki, D. J. DeWitt, et al. “DBMSs on a modern processor: Where does time go?” In Proc. VLDB, 1999.
[6] P. A. Boncz, S. Manegold, and M. L. Kersten. “Database Architecture Optimized for the New Bottleneck: Memory Access.” In Proc. VLDB, 1999.
[7] P. Boncz, M. Zukowski, and N. Nes. “MonetDB/X100: Hyper-Pipelining Query Execution.” In Proc. CIDR, 2005.
[8] G. P. Copeland and S. Khoshafian. “A Decomposition Storage Model.” In Proc. SIGMOD, 1985.
[9] D. W. Cornell and P. S. Yu. “An Effective Approach to Vertical Partitioning for Physical Design of Relational Databases.” In IEEE Transactions on Software Engineering 16(2): 248-258, 1990.
[10] J. Goldstein, R. Ramakrishnan, and U. Shaft. “Compressing Relations and Indexes.” In Proc. ICDE, 1998.
[11] G. Graefe and L. D. Shapiro. “Data compression and database performance.” In Proc. ACM/IEEE-CS Symposium on Applied Computing, Kansas City, MO, 1991.
[12] A. Halverson, J. L. Beckmann, J. F. Naughton, and D. J. DeWitt. “A Comparison of C-Store and Row-Store in a Common Framework.” Technical Report TR1566, University of Wisconsin-Madison, Department of Computer Sciences, 2006.
[13] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. “QPipe: A Simultaneously Pipelined Relational Query Engine.” In Proc. SIGMOD, 2005.
[14] J. L. Hennessy and D. A. Patterson. “Computer Architecture: A Quantitative Approach.” 2nd ed., Morgan Kaufmann, 1996.
[15] K. Keeton, D. A. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker. “Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads.” In Proc. ISCA-25, 1998.
[16] P. J. Mucci, S. Browne, C. Deane, and G. Ho. “PAPI: A Portable Interface to Hardware Performance Counters.” In Proc. Department of Defense HPCMP Users Group Conference, Monterey, CA, June 1999.
[17] S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. “Block Oriented Processing of Relational Database Operations in Modern Computer Architectures.” In Proc. ICDE, 2001.
[18] M. Poss and D. Potapov. “Data Compression in Oracle.” In Proc. VLDB, 2003.
[19] R. Ramamurthy, D. J. DeWitt, and Q. Su. “A Case for Fractured Mirrors.” In Proc. VLDB, 2002.
[20] M. Shao, J. Schindler, S. W. Schlosser, A. Ailamaki, and G. R. Ganger. “Clotho: Decoupling memory page layout from storage organization.” In Proc. VLDB, 2004.
[21] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. “C-Store: A Column-oriented DBMS.” In Proc. VLDB, 2005.
[22] T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. “The Implementation and Performance of Compressed Databases.” In SIGMOD Rec., 29(3):55–67, Sept. 2000.
[23] J. Zhou and K. A. Ross. “Buffering Database Operations for Enhanced Instruction Cache Performance.” In Proc. SIGMOD, 2004.
[24] M. Zukowski, S. Heman, N. Nes, and P. Boncz. “Super-Scalar RAM-CPU Cache Compression.” In Proc. ICDE, 2006.