
contributed articles

doi:10.1145/1629175.1629197

MapReduce and Parallel DBMSs: Friends or Foes?

MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty.

by Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin

The MapReduce7 (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access.16 Some proponents claim the extreme scalability of MR will relegate relational database management systems (DBMS) to the status of legacy technology. At least one enterprise, Facebook, has implemented a large data warehouse system using MR technology rather than a DBMS.14 Here, we argue that using MR systems to perform tasks that are best suited for DBMSs yields less than satisfactory results,17 concluding that MR is more like an extract-transform-load (ETL) system than a DBMS, as it quickly loads and processes large amounts of data in an ad hoc manner. As such, it complements DBMS technology rather than competes with it. We also discuss the differences in the architectural decisions of MR systems and database systems and provide insight into how the systems should complement one another.

The technology press has been focusing on the revolution of “cloud computing,” a paradigm that entails the harnessing of large numbers of processors working in parallel to solve computing problems. In effect, this suggests constructing a data center by lining up a large number of low-end servers, rather than deploying a smaller set of high-end servers. Along with this interest in clusters has come a proliferation of tools for programming them. MR is one such tool, an attractive option to many because it provides a simple model through which users are able to express relatively sophisticated distributed programs.

Given the interest in the MR model both commercially and academically, it is natural to ask whether MR systems should replace parallel database systems. Parallel DBMSs were first available commercially nearly two decades ago, and, today, systems (from about a dozen vendors) are available. As robust, high-performance computing platforms, they provide a high-level programming environment that is inherently parallelizable. Although it might seem that MR and parallel DBMSs are different, it is possible to write almost any parallel-processing task as either a set of database queries or a set of MR jobs.

Our discussions with MR users lead us to conclude that the most common use case for MR is more like an ETL system. As such, it is complementary to DBMSs, not a competing technology, since databases are not designed to be good at ETL tasks. Here, we describe what we believe is the ideal use of MR technology and highlight the different MR and parallel DBMS markets.

We recently conducted a benchmark study using a popular open-source MR implementation and two parallel DBMSs.17 The results show that the DBMSs are substantially faster than the MR system once the data is loaded, but that loading the data takes considerably longer in the database systems. Here, we discuss the source of these performance differences, including the limiting architectural factors we perceive in the two classes of system, and conclude with lessons the MR and DBMS communities can learn from each other, along with future trends in large-scale data analysis.

Parallel Database Systems
In the mid-1980s the Teradata20 and Gamma projects9 pioneered a new architectural paradigm for parallel database systems based on a cluster of commodity computers called “shared-nothing nodes” (or separate CPU, memory, and disks) connected through a high-speed interconnect.19 Every parallel database system built since then essentially uses the techniques first pioneered by these two projects: horizontal partitioning of relational tables, along with the partitioned execution of SQL queries.

The idea behind horizontal partitioning is to distribute the rows of a relational table across the nodes of the cluster so they can be processed in parallel. For example, partitioning a 10-million-row table across a cluster of 50 nodes, each with four disks, would place 50,000 rows on each of the 200 disks. Most parallel database systems offer a variety of partitioning strategies, including hash, range, and round-robin partitioning.8 Under a hash-partitioning physical layout, as each row is loaded, a hash function is applied to one or more attributes of each row to determine the target node and disk where the row should be stored.
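For concreteness, the following is a minimal sketch of how such a layout might be declared. The DISTRIBUTED BY clause is Greenplum-style syntax used purely for illustration; other parallel DBMSs express the same physical design differently (Teradata, for example, hashes rows on a table’s primary index).

-- Illustrative sketch only (Greenplum-style dialect): rows of Sales are
-- assigned to a node and disk by hashing custId as they are loaded.
CREATE TABLE Sales (
    custId  INTEGER,
    amount  DECIMAL(10,2),
    date    DATE
) DISTRIBUTED BY (custId);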
lessons the MR and DBMS communi- The use of horizontal partitioning If the Sales table is round-robin
ties can learn from each other, along of tables across the nodes of a cluster partitioned across the nodes in the
with future trends in large-scale data is critical to obtaining scalable per- cluster, then the rows corresponding
analysis. formance of SQL queries8 and leads to any single customer will be spread
naturally to the concept of partitioned across multiple nodes. The DBMS
Parallel Database Systems execution of the SQL operators: selec- compiles this query into the three-op-
In the mid-1980s the Teradata20 and tion, aggregation, join, projection, and erator pipeline in Figure(a), then ex-
Gamma projects9 pioneered a new update. As an example how data parti- ecutes the query plan on all the nodes
architectural paradigm for parallel tioning is used in a parallel DBMS, con- in the cluster in parallel. Each SELECT
database systems based on a clus- sider the following SQL query: operator scans the fragment of the
ter of commodity computers called Sales table stored at that node. Any
“shared-nothing nodes” (or separate SELECT custId, amount FROM Sales rows satisfying the date predicate are
CPU, memory, and disks) connected WHERE date BETWEEN passed to a SHUFFLE operator that dy-
through a high-speed interconnect.19 “12/1/2009” AND “12/25/2009”; namically repartitions the rows; this
Every parallel database system built is typically done by applying a hash
since then essentially uses the tech- With the Sales table horizontally function on the value of the custId at-
niques first pioneered by these two partitioned across the nodes of the tribute of each row to map them to a
projects: horizontal partitioning of cluster, this query can be trivially particular node. Since the same hash
relational tables, along with the parti- executed in parallel by executing a function is used for the SHUFFLE op-
tioned execution of SQL queries. SELECT operator against the Sales eration on all nodes, rows for the same
The idea behind horizontal parti- records with the specified date predi- customer are routed to the single node
tioning is to distribute the rows of a cate on each node of the cluster. The where they are aggregated to compute
relational table across the nodes of intermediate results from each node the final total for each customer.
the cluster so they can be processed are then sent to a single node that per- As a final example of how SQL is par-
in parallel. For example, partitioning forms a MERGE operation in order to allelized using data partitioning, con-
sider the following query for finding
Parallel database query execution plans. (a) Example operator pipeline for calculating the names and email addresses of cus-
a single-table aggregate. (b) Example operator pipeline for performing a joining on two
partitioned tables.
tomers who purchased an item costing
more than $1,000 during the holiday
(a) (b) shopping period:

SELECT C.name, C.email FROM


Customers C, Sales S
Sum Join
WHERE C.custId = S.custId
AND S.amount > 1000
AND S.date BETWEEN
Shuffle Shuffle “12/1/2009” AND
“12/25/2009”;

Assume again that the Sales table is


Select Select round-robin partitioned, but we now
hash-partition the Customers table
on the Customer.custId attribute. The
DBMS compiles this query into the
Sales Sales
operator pipeline in Figure(b) that is
executed in parallel at all nodes in the
cluster. Each SELECT operator scans




Assume again that the Sales table is round-robin partitioned, but we now hash-partition the Customers table on the Customer.custId attribute. The DBMS compiles this query into the operator pipeline in Figure (b) that is executed in parallel at all nodes in the cluster. Each SELECT operator scans its fragment of the Sales table looking for rows that satisfy the predicate

S.amount > 1000 and S.date BETWEEN “12/1/2009” and “12/25/2009.”

Qualifying rows are pipelined into a shuffle operator that repartitions its input rows by hashing on the Sales.custId attribute. By using the same hash function that was used when loading rows of the Customer table (hash partitioned on the Customer.custId attribute), the shuffle operators route each qualifying Sales row to the node where the matching Customer tuple is stored, allowing the join operator (C.custId = S.custId) to execute in parallel on all the nodes.

Another key benefit of parallel DBMSs is that the system automatically manages the various alternative partitioning strategies for the tables involved in the query. For example, if Sales and Customers are each hash-partitioned on their custId attribute, the query optimizer will recognize that the two tables are both hash-partitioned on the joining attributes and omit the shuffle operator from the compiled query plan. Likewise, if both tables are round-robin partitioned, then the optimizer will insert shuffle operators for both tables so tuples that join with one another end up on the same node. All this happens transparently to the user and to application programs.
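To illustrate the co-partitioned case, the Customers table could be declared with the same hash-distribution attribute as the earlier Sales sketch (again using the Greenplum-style DISTRIBUTED BY clause purely as an illustrative dialect):

-- Illustrative sketch only: with both Customers and Sales hashed on
-- custId, the join C.custId = S.custId can run locally on every node,
-- and the optimizer can omit the shuffle operators entirely.
CREATE TABLE Customers (
    custId  INTEGER,
    name    VARCHAR(100),
    email   VARCHAR(100)
) DISTRIBUTED BY (custId);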
Many commercial implementations are available, including Teradata, Netezza, DataAllegro (Microsoft), ParAccel, Greenplum, Aster, Vertica, and DB2. All run on shared-nothing clusters of nodes, with tables horizontally partitioned over them.

Mapping Parallel DBMSs onto MapReduce
An attractive quality of the MR programming model is simplicity; an MR program consists of only two functions—Map and Reduce—written by a user to process key/value data pairs.7 The input data set is stored in a collection of partitions in a distributed file system deployed on each node in the cluster. The program is then injected into a distributed-processing framework and executed in a manner to be described. The MR model was first popularized by Google in 2004, and, today, numerous open source and commercial implementations are available. The most popular MR system is Hadoop, an open-source project under development by Yahoo! and the Apache Software Foundation (http://hadoop.apache.org/).

The semantics of the MR model are not unique, as the filtering and transformation of individual data items (tuples in tables) can be executed by a modern parallel DBMS using SQL. For Map operations not easily expressed in SQL, many DBMSs support user-defined functions18; UDF extensibility provides the equivalent functionality of a Map operation. SQL aggregates augmented with UDFs and user-defined aggregates provide DBMS users the same MR-style reduce functionality. Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL. Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL).
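Schematically, the correspondence can be written as a single SQL statement of the following shape. The functions my_map_key and my_map_value and the aggregate my_reduce are hypothetical user-defined functions standing in for arbitrary Map and Reduce logic, and InputTable and its columns are assumed for illustration; none of these names belongs to any particular product.

-- Hypothetical sketch: the inner SELECT plays the role of Map, the
-- GROUP BY plays the role of the MR shuffle, and the user-defined
-- aggregate my_reduce plays the role of Reduce.
SELECT mkey, my_reduce(mval)
FROM (SELECT my_map_key(t.c1, t.c2)   AS mkey,
             my_map_value(t.c1, t.c2) AS mval
      FROM InputTable t) AS mapped
GROUP BY mkey;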
The linear scalability of parallel DBMSs has been widely touted for two decades10; that is, as nodes are added to an installation, the database size can be increased proportionally while maintaining constant response times. Several production databases in the multi-petabyte range are run by very large customers operating on clusters of order 100 nodes.13 The people who manage these systems do not report the need for additional parallelism. Thus, parallel DBMSs offer great scalability over the range of nodes that customers desire. There is no reason why scalability cannot be increased dramatically to the levels reported by Jeffrey Dean and Sanjay Ghemawat,7 assuming there is customer demand.

Possible Applications
Even though parallel DBMSs are able to execute the same semantic workload as MR, several application classes are routinely mentioned as possible use cases in which the MR model might be a better choice than a DBMS. We now explore five of these scenarios, discussing the ramifications of using one class of system over another:

ETL and “read once” data sets. The canonical use of MR is characterized by the following template of five operations:
• Read logs of information from several different sources;
• Parse and clean the log data;
• Perform complex transformations (such as “sessionalization”);
• Decide what attribute data to store; and
• Load the information into a DBMS or other storage engine.
These steps are analogous to the extract, transform, and load phases in ETL systems; the MR system is essentially “cooking” raw data into useful information that is consumed by another storage system. Hence, an MR system can be considered a general-purpose parallel ETL system.
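The final “load” step of that template might be, for example, a bulk copy of the cleaned, MR-produced files into a typed warehouse table. The table, columns, and file path below are purely illustrative, using PostgreSQL-style COPY syntax:

-- Illustrative sketch only: load one cleaned output file produced by
-- the MR job into a typed warehouse table.
CREATE TABLE SessionizedVisits (
    userId     INTEGER,
    sessionId  INTEGER,
    url        VARCHAR(200),
    visitDate  DATE
);

COPY SessionizedVisits FROM '/data/mr_output/part-00000' WITH (FORMAT csv);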
For parallel DBMSs, many products perform ETL, including Ascential, Informatica, Jaspersoft, and Talend. The market is large, as almost all major enterprises use ETL systems to load large quantities of data into data warehouses. One reason for this symbiotic relationship is the clear distinction as to what each class of system provides to users: DBMSs do not try to do ETL, and ETL systems do not try to do DBMS services. An ETL system is typically upstream from a DBMS, as the load phase usually feeds data directly into a DBMS.

Complex analytics. In many data-mining and data-clustering applications, the program must make multiple passes over the data. Such applications cannot be structured as single SQL aggregate queries, requiring instead a complex dataflow program where the output of one part of the application is the input of another. MR is a good candidate for such applications.

Semi-structured data. Unlike a DBMS, MR systems do not require users to define a schema for their data. Thus, MR-style systems easily store and process what is known as “semi-structured” data. In our experience, such data often looks like key-value pairs, where the number of attributes present in any given record varies; this style of data is typical of Web traffic logs derived from disparate sources.
With a relational DBMS, one way to model such data is to use a wide table with many attributes to accommodate multiple record types. Each unrequired attribute uses NULLs for the values that are not present for a given record. Row-based DBMSs generally have trouble with these tables, often suffering poor performance. On the other hand, column-based DBMSs (such as Vertica) mitigate the problem by reading only the relevant attributes for any query and automatically suppressing the NULL values.3 These techniques have been shown to provide good performance on RDF data sets,2 and we expect the same would be true for simpler key-value data.
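For example, a single wide table might carry the union of the attributes seen across several record types, with NULLs wherever an attribute does not apply; a column store then touches only the columns a query actually names. The table and column names here are illustrative:

-- Illustrative wide table: most attributes are NULL for any given record.
CREATE TABLE WebEvents (
    eventTime   TIMESTAMP,
    sourceIP    VARCHAR(16),
    url         VARCHAR(200),
    referrer    VARCHAR(200),   -- NULL except for click records
    searchTerm  VARCHAR(100),   -- NULL except for search records
    adRevenue   DECIMAL(10,2),  -- NULL except for ad records
    userAgent   VARCHAR(200)
);

-- A column store reads only sourceIP and adRevenue for this query and
-- skips the other, mostly NULL, columns entirely.
SELECT sourceIP, SUM(adRevenue)
FROM WebEvents
GROUP BY sourceIP;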
To the extent that semistructured data fits the “cooking” paradigm discussed earlier (that is, the data is prepared for loading into a back-end data-processing system), then MR-style systems are a good fit. If the semistructured data set is primarily for analytical queries, we expect a parallel column store to be a better solution.

Quick-and-dirty analyses. One disappointing aspect of many current parallel DBMSs is that they are difficult to install and configure properly, as users are often faced with a myriad of tuning parameters that must be set correctly for the system to operate effectively. From our experiences with installing two commercial parallel systems, an open-source MR implementation provides the best “out-of-the-box” experience17; that is, we were able to get the MR system up and running queries significantly faster than either of the DBMSs. In fact, it was not until we received expert support from one of the vendors that we were able to get one particular DBMS to run queries that completed in minutes, rather than hours or days.

Once a DBMS is up and running properly, programmers must still write a schema for their data (if one does not already exist), then load the data set into the system. This process takes considerably longer in a DBMS than in an MR system, because the DBMS must parse and verify each datum in the tuples. In contrast, the default (therefore most common) way for MR programmers to load their data is to just copy it into the MR system’s underlying distributed block-based storage system.

If a programmer must perform some one-off analysis on transient data, then the MR model’s quick startup time is clearly preferable. On the other hand, professional DBMS programmers and administrators are more willing to pay in terms of longer learning curves and startup times, because the performance gains from faster queries offset the upfront costs.

Limited-budget operations. Another strength of MR systems is that most are open source projects available for free. DBMSs, and in particular parallel DBMSs, are expensive; though there are good single-node open source solutions, to the best of our knowledge, there are no robust, community-supported parallel DBMSs. Though enterprise users with heavy demand and big budgets might be willing to pay for a commercial system and all the tools, support, and service agreements those systems provide, users with more modest budgets or requirements find open source systems more attractive. The database community has missed an opportunity by not providing a more complete parallel, open source solution.

Powerful tools. MR systems are fundamentally powerful tools for ETL-style applications and for complex analytics. Additionally, they are popular for “quick and dirty” analyses and for users with limited budgets. On the other hand, if the application is query-intensive, whether semistructured or rigidly structured, then a DBMS is probably the better choice. In the next section, we discuss results from use cases that demonstrate this performance superiority; the processing tasks range from those MR systems ought to be good at to those that are quite complex queries.

DBMS “Sweet Spot”
To demonstrate the performance trade-offs between parallel DBMSs and MR systems, we published a benchmark comparing two parallel DBMSs to the Hadoop MR framework on a variety of tasks.17 We wished to discover the performance envelope of each approach when applied to areas inside and outside their target application space. We used two database systems: Vertica, a commercial column-store relational database, and DBMS-X, a row-based database from a large commercial vendor. Our benchmark study included a simple benchmark presented in the original MR paper from Google,7 as well as four other analytical tasks of increasing complexity we think are common processing tasks that could be done using either class of systems. We ran all experiments on a 100-node shared-nothing cluster at the University of Wisconsin-Madison. The full paper17 includes the complete results and discussion from all our experiments, including load times; here, we provide a summary of the most interesting results. (The source code for the benchmark study is available at http://database.cs.brown.edu/projects/mapreduce-vs-dbms/.)

Hadoop is by far the most popular publicly available version of the MR framework (the Google version might be faster but is not available to us), and DBMS-X and Vertica are popular row- and column-store parallel database systems, respectively.

In the time since publication of Pavlo et al.,17 we have continued to tune all three systems. Moreover, we have received many suggestions from the Hadoop community on ways to improve performance. We have tried them all, and the results here (as of August 2009) represent the best we can do with a substantial amount of expert help on all three systems. In fact, the time we’ve spent tuning Hadoop has now exceeded the time we spent on either of the other systems. Though Hadoop offers a good out-of-the-box experience, tuning it to obtain maximum performance was an arduous task. Obviously, performance is a moving target, as new releases of all three products occur regularly.




Original MR Grep task. Our first benchmark experiment is the “Grep task” from the original MR paper, which described it as “representative of a large subset of the real programs written by users of MapReduce.”7 For the task, each system must scan through a data set of 100B records looking for a three-character pattern. Each record consists of a unique key in the first 10B, followed by a 90B random value. The search pattern is found only in the last 90B once in every 10,000 records. We use a 1TB data set spread over the 100 nodes (10GB/node). The data set consists of 10 billion records, each 100B. Since this is essentially a sequential search of the data set looking for the pattern, it provides a simple measurement of how quickly a software system can scan through a large collection of records. The task cannot take advantage of any sorting or indexing and is easy to specify in both MR and SQL. Therefore, one would expect a lower-level interface (such as Hadoop) running directly on top of the file system (HDFS) to execute faster than the more heavyweight DBMSs.
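The exact schema and statements used in the benchmark appear in Pavlo et al.17; a simplified SQL rendering of the Grep task, with the table and column names assumed here for illustration, is just a sequential filter:

-- Simplified sketch: scan every record for a three-character pattern
-- in the 90-byte value field.
SELECT * FROM Data WHERE field LIKE '%XYZ%';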
Benchmark performance on a 100-node cluster.

           Hadoop    DBMS-X    Vertica    Hadoop/DBMS-X    Hadoop/Vertica
Grep         284s      194s       108s             1.5x              2.6x
Web Log    1,146s      740s       268s             1.6x              4.3x
Join       1,158s       32s        55s            36.3x             21.0x

However, the execution times in the table here show a surprising result: The database systems are about two times faster than Hadoop. We explain some of the reasons for this conclusion in the section on architectural differences.

Web log task. The second task is a conventional SQL aggregation with a GROUP BY clause on a table of user visits in a Web server log. Such data is fairly typical of Web logs, and the query is commonly used in traffic analytics. For this experiment, we used a 2TB data set consisting of 155 million records spread over the 100 nodes (20GB/node). Each system must calculate the total ad revenue generated for each visited IP address from the logs. Like the previous task, the records must all be read, and thus there is no indexing opportunity for the DBMSs. One might think that Hadoop would excel at this task since it is a straightforward calculation, but the results in the table show that Hadoop is beaten by the databases by a larger margin than in the Grep task.
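Again simplified (the benchmark schema appears in Pavlo et al.17; the UserVisits table and its columns are assumed here for illustration), the Web log task is a single aggregate query:

-- Simplified sketch: total ad revenue generated per visiting IP address.
SELECT sourceIP, SUM(adRevenue)
FROM UserVisits
GROUP BY sourceIP;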
Join task. The final task we discuss here is a fairly complex join operation over two tables requiring an additional aggregation and filtering operation. The user-visit data set from the previous task is joined with an additional 100GB table of PageRank values for 18 million URLs (1GB/node). The join task consists of two subtasks that perform a complex calculation on the two data sets. In the first part of the task, each system must find the IP address that generated the most revenue within a particular date range in the user visits. Once these intermediate records are generated, the system must then calculate the average PageRank of all pages visited during this interval.
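A simplified SQL sketch of the join task, with table and column names and the date range assumed for illustration (the full statements are in Pavlo et al.17), collapses the two subtasks into one statement:

-- Simplified sketch: for each source IP, join visits to PageRank values
-- within the date range, then keep the IP with the highest total revenue
-- along with the average PageRank of the pages it visited.
SELECT UV.sourceIP,
       SUM(UV.adRevenue) AS totalRevenue,
       AVG(R.pageRank)   AS avgPageRank
FROM UserVisits UV, Rankings R
WHERE UV.destURL = R.pageURL
  AND UV.visitDate BETWEEN '2000-01-15' AND '2000-01-22'
GROUP BY UV.sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;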
DBMSs ought to be good at analytical queries involving complex join operations (see the table). The DBMSs are a factor of 36 and 21, respectively, faster than Hadoop. In general, query times for a typical user task fall somewhere in between these extremes. In the next section, we explore the reasons for these results.

Architectural Differences
The performance differences between Hadoop and the DBMSs can be explained by a variety of factors. Before delving into the details, we should say these differences result from implementation choices made by the two classes of system, not from any fundamental difference in the two models. For example, the MR processing model is independent of the underlying storage system, so data could theoretically be massaged, indexed, compressed, and carefully laid out on storage during a load phase, just like a DBMS. Hence, the goal of our study was to compare the real-life differences in performance of representative realizations of the two models.

Repetitive record parsing. One contributing factor for Hadoop’s slower performance is that the default configuration of Hadoop stores data in the accompanying distributed file system (HDFS), in the same textual format in which the data was generated. Consequently, this default storage method places the burden of parsing the fields of each record on user code. This parsing task requires each Map and Reduce task to repeatedly parse and convert string fields into the appropriate type. Hadoop provides the ability to store data as key/value pairs as serialized tuples called SequenceFiles, but despite this ability it still requires user code to parse the value portion of the record if it contains multiple attributes. Thus, we found that using SequenceFiles without compression consistently yielded slower performance on our benchmark. Note that using SequenceFiles without compression was but one of the tactics for possibly improving Hadoop’s performance suggested by the MR community.

In contrast to repetitive parsing in MR, records are parsed by DBMSs when the data is initially loaded. This initial parsing step allows the DBMS’s storage manager to carefully lay out records in storage such that attributes can be directly addressed at runtime in their most efficient storage representation. As such, there is no record interpretation performed during query execution in parallel DBMSs.

There is nothing fundamental about the MR model that says data cannot be parsed in advance and stored in optimized data structures (that is, trading off some load time for increased runtime performance). For example, data could be stored in the underlying file system using Protocol Buffers (http://code.google.com/p/protobuf/), Google’s platform-neutral, extensible mechanism for serializing structured data; this option is not available in Hadoop. Alternatively, one could move the data outside the MR framework into a relational DBMS at each node, thereby replacing the HDFS storage layer with DBMS-style optimized storage for structured data.4

There may be ways to improve the Hadoop system by taking advantage of these ideas. Hence, parsing overhead is a problem, and SequenceFiles are not an effective solution. The problem should be viewed as a signpost for guiding future development.

Compression. We found that enabling data compression in the DBMSs delivered a significant performance gain. The benchmark results show that using compression in Vertica and DBMS-X on these workloads improves performance by a factor of two to four. On the other hand, Hadoop often executed slower when we used compression on its input files; at most, compression improved performance by 15%; the benchmark results in Dean and Ghemawat7 also did not use compression.




It is unclear to us why this improvement was insignificant, as essentially all commercial SQL data warehouses use compression to improve performance. We postulate that commercial DBMSs use carefully tuned compression algorithms to ensure that the cost of decompressing tuples does not offset the performance gains from the reduced I/O cost of reading compressed data. For example, we have found that on modern processors standard Unix implementations of gzip and bzip are often too slow to provide any benefit.

Pipelining. All parallel DBMSs operate by creating a query plan that is distributed to the appropriate nodes at execution time. When one operator in this plan must send data to the next operator, regardless of whether that operator is running on the same or a different node, the qualifying data is “pushed” by the first operator to the second operator. Hence, data is streamed from producer to consumer; the intermediate data is never written to disk; the resulting “back-pressure” in the runtime system will stall the producer before it has a chance to overrun the consumer. This streaming technique differs from the approach taken in MR systems, where the producer writes the intermediate results to local data structures, and the consumer subsequently “pulls” the data. These data structures are often quite large, so the system must write them out to disk, introducing a potential bottleneck. Though writing data structures to disk gives Hadoop a convenient way to checkpoint the output of intermediate map jobs, thereby improving fault tolerance, we found from our investigation that it adds significant performance overhead.

Scheduling. In a parallel DBMS, each node knows exactly what it must do and when it must do it according to the distributed query plan. Because the operations are known in advance, the system is able to optimize the execution plan to minimize data transmission between nodes. In contrast, each task in an MR system is scheduled on processing nodes one storage block at a time. Such runtime work scheduling at a granularity of storage blocks is much more expensive than the DBMS compile-time scheduling. The former has the advantage, as some have argued,4 of allowing the MR scheduler to adapt to workload skew and performance differences between nodes.

Column-oriented storage. In a column-store-based database (such as Vertica), the system reads only the attributes necessary for solving the user query. This limited need for reading data represents a considerable performance advantage over traditional, row-stored databases, where the system reads all attributes off the disk. DBMS-X and Hadoop/HDFS are both essentially row stores, while Vertica is a column store, giving Vertica a significant advantage over the other two systems in our Web log benchmark task.

Discussion. The Hadoop community will presumably fix the compression problem in a future release. Furthermore, some of the other performance advantages of parallel databases (such as column storage and operating directly on compressed data) can be implemented in an MR system with user code. Also, other implementations of the MR framework (such as Google’s proprietary implementation) may well have a different performance envelope. The scheduling mechanism and pull model of data transmission are fundamental to the MR block-level fault-tolerance model and thus unlikely to be changed.

Meanwhile, DBMSs offer transaction-level fault tolerance. DBMS researchers often point out that as databases get bigger and the number of nodes increases, the need for finer-granularity fault tolerance increases as well. DBMSs readily adapt to this need by marking one or more operators in a query plan as “restart operators.” The runtime system saves the result of these operators to disk, facilitating “operator level” restart. Any number of operators can be so marked, allowing the granularity of restart to be tuned. Such a mechanism is easily integrated into the efficient query execution framework of DBMSs while allowing variable granularity restart. We know of at least two separate research groups, one at the University of Washington, the other at the University of California, Berkeley, that are exploring the trade-off between runtime overhead and the amount of work lost when a failure occurs.




We generally expect ETL and complex analytics to be amenable to MR systems and query-intensive workloads to be run by DBMSs. Hence, we expect the best solution is to interface an MR framework to a DBMS so MR can do complex analytics, and interface to a DBMS to do embedded queries. HadoopDB,4 Hive,21 Aster, Greenplum, Cloudera, and Vertica all have commercially available products or prototypes in this “hybrid” category.

Learning from Each Other
What can MR learn from DBMSs? MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution. Engineers should stand on the shoulders of those who went before, rather than on their toes. There are many good ideas in parallel DBMS executors that MR system developers would be wise to adopt.

We also feel that higher-level languages are invariably a good idea for any data-processing system. Relational DBMSs have been fabulously successful in pushing programmers to a higher, more-productive level of abstraction, where they simply state what they want from the system, rather than writing an algorithm for how to get what they want from the system. In our benchmark study, we found that writing the SQL code for each task was substantially easier than writing MR code.

Efforts to build higher-level interfaces on top of MR/Hadoop should be accelerated; we applaud Hive,21 Pig,15 Scope,6 Dryad/Linq,12 and other projects that point the way in this area.

What can DBMSs learn from MR? The out-of-the-box experience for most DBMSs is less than ideal for being able to quickly set up and begin running queries. The commercial DBMS products must move toward one-button installs, automatic tuning that works correctly, better Web sites with example code, better query generators, and better documentation.

Most database systems cannot deal with tables stored in the file system (in situ data). Consider the case where a DBMS is used to store a very large data set on which a user wishes to perform analysis in conjunction with a smaller, private data set. In order to access the larger data set, the user must first load the data into the DBMS. Unless the user plans to run many analyses, it is preferable to simply point the DBMS at data on the local disk without a load phase. There is no good reason DBMSs cannot deal with in situ data. Though some database systems (such as PostgreSQL, DB2, and SQL Server) have capabilities in this area, further flexibility is needed.
Conclusion
Most of the architectural differences discussed here are the result of the different focuses of the two classes of system. Parallel DBMSs excel at efficient querying of large data sets; MR-style systems excel at complex analytics and ETL tasks. Neither is good at what the other does well. Hence, the two technologies are complementary, and we expect MR-style systems performing ETL to live directly upstream from DBMSs.

Many complex analytical problems require the capabilities provided by both systems. This requirement motivates the need for interfaces between MR systems and DBMSs that allow each system to do what it is good at. The result is a much more efficient overall system than if one tries to do the entire application in either system. That is, “smart software” is always a good idea.

Acknowledgment
This work is supported in part by National Science Foundation grants CRI-0707437, CluE-0844013, and CluE-0844480.

References
1. Abadi, D.J., Madden, S.R., and Hachem, N. Column-stores vs. row-stores: How different are they really? In Proceedings of the SIGMOD Conference on Management of Data. ACM Press, New York, 2008.
2. Abadi, D.J., Marcus, A., Madden, S.R., and Hollenbach, K. Scalable semantic Web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Databases, 2007.
3. Abadi, D.J. Column-stores for wide and sparse data. In Proceedings of the Conference on Innovative Data Systems Research, 2007.
4. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D.J., Silberschatz, A., and Rasin, A. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the Conference on Very Large Databases, 2009.
5. Boral, H. et al. Prototyping Bubba, a highly parallel database system. IEEE Transactions on Knowledge and Data Engineering 2, 1 (Mar. 1990), 4–24.
6. Chaiken, R., Jenkins, B., Larson, P., Ramsey, B., Shakib, D., Weaver, S., and Zhou, J. SCOPE: Easy and efficient parallel processing of massive data sets. In Proceedings of the Conference on Very Large Databases, 2008.
7. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Conference on Operating System Design and Implementation (Berkeley, CA, 2004).
8. DeWitt, D.J. and Gray, J. Parallel database systems: The future of high-performance database systems. Commun. ACM 35, 6 (June 1992), 85–98.
9. DeWitt, D.J., Gerber, R.H., Graefe, G., Heytens, M.L., Kumar, K.B., and Muralikrishna, M. GAMMA: A high-performance dataflow database machine. In Proceedings of the 12th International Conference on Very Large Databases. Morgan Kaufmann Publishers, Inc., 1986, 228–237.
10. Englert, S., Gray, J., Kocher, T., and Shah, P. A benchmark of NonStop SQL Release 2 demonstrating near-linear speedup and scaleup on large databases. Sigmetrics Performance Evaluation Review 18, 1 (1990), 245–246.
11. Fushimi, S., Kitsuregawa, M., and Tanaka, H. An overview of the system software of a parallel relational database machine. In Proceedings of the 12th International Conference on Very Large Databases. Morgan Kaufmann Publishers, Inc., 1986, 209–219.
12. Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. SIGOPS Operating System Review 41, 3 (2007), 59–72.
13. Monash, C. Some very, very, very large data warehouses. In NetworkWorld.com community blog, May 12, 2009; http://www.networkworld.com/community/node/41777.
14. Monash, C. Cloudera presents the MapReduce bull case. In DBMS2.com blog, Apr. 15, 2009; http://www.dbms2.com/2009/04/15/cloudera-presents-the-mapreduce-bull-case/.
15. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the SIGMOD Conference. ACM Press, New York, 2008, 1099–1110.
16. Patterson, D.A. Technical perspective: The data center is the computer. Commun. ACM 51, 1 (Jan. 2008), 105.
17. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S.R., and Stonebraker, M. A comparison of approaches to large-scale data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data. ACM Press, New York, 2009, 165–178.
18. Stonebraker, M. and Rowe, L. The design of Postgres. In Proceedings of the SIGMOD Conference, 1986, 340–355.
19. Stonebraker, M. The case for shared nothing. Data Engineering 9 (Mar. 1986), 4–9.
20. Teradata Corp. Database Computer System Manual, Release 1.3. Los Angeles, CA, Feb. 1985.
21. Thusoo, A. et al. Hive: A warehousing solution over a Map-Reduce framework. In Proceedings of the Conference on Very Large Databases, 2009, 1626–1629.

Michael Stonebraker (stonebraker@csail.mit.edu) is an adjunct professor in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, Cambridge, MA.

Daniel J. Abadi (dna@cs.yale.edu) is an assistant professor in the Department of Computer Science at Yale University, New Haven, CT.

David J. DeWitt (dewitt@microsoft.com) is a technical fellow in the Jim Gray Systems Lab at Microsoft Inc., Madison, WI.

Samuel Madden (madden@csail.mit.edu) is a professor in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, Cambridge, MA.

Erik Paulson (epaulson@cs.wisc.edu) is a Ph.D. candidate in the Department of Computer Sciences at the University of Wisconsin-Madison, Madison, WI.

Andrew Pavlo (pavlo@cs.brown.edu) is a Ph.D. candidate in the Department of Computer Science at Brown University, Providence, RI.

Alexander Rasin (alexr@cs.brown.edu) is a Ph.D. candidate in the Department of Computer Science at Brown University, Providence, RI.

© 2010 ACM 0001-0782/10/0100 $10.00
