
ing system that produces the data structures used for the Google web search service. The indexing system takes as input a large set of documents that have been retrieved by our crawling system, stored as a set of GFS files. The raw contents for these documents are more than 20 terabytes of data. The indexing process runs as a sequence of five to ten MapReduce operations. Using MapReduce (instead of the ad-hoc distributed passes in the prior version of the indexing system) has provided several benefits:

• The indexing code is simpler, smaller, and easier to understand, because the code that deals with fault tolerance, distribution and parallelization is hidden within the MapReduce library. For example, the size of one phase of the computation dropped from approximately 3800 lines of C++ code to approximately 700 lines when expressed using MapReduce.

• The performance of the MapReduce library is good enough that we can keep conceptually unrelated computations separate, instead of mixing them together to avoid extra passes over the data. This makes it easy to change the indexing process. For example, one change that took a few months to make in our old indexing system took only a few days to implement in the new system.

• The indexing process has become much easier to operate, because most of the problems caused by machine failures, slow machines, and networking hiccups are dealt with automatically by the MapReduce library without operator intervention. Furthermore, it is easy to improve the performance of the indexing process by adding new machines to the indexing cluster.

7 Related Work

Many systems have provided restricted programming models and used the restrictions to parallelize the computation automatically. For example, an associative function can be computed over all prefixes of an N element array in log N time on N processors using parallel prefix computations [6, 9, 13]. MapReduce can be considered a simplification and distillation of some of these models based on our experience with large real-world computations. More significantly, we provide a fault-tolerant implementation that scales to thousands of processors. In contrast, most of the parallel processing systems have only been implemented on smaller scales and leave the details of handling machine failures to the programmer.

Bulk Synchronous Programming [17] and some MPI primitives [11] provide higher-level abstractions that make it easier for programmers to write parallel programs. A key difference between these systems and MapReduce is that MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance.

Our locality optimization draws its inspiration from techniques such as active disks [12, 15], where computation is pushed into processing elements that are close to local disks, to reduce the amount of data sent across I/O subsystems or the network. We run on commodity processors to which a small number of disks are directly connected instead of running directly on disk controller processors, but the general approach is similar.

Our backup task mechanism is similar to the eager scheduling mechanism employed in the Charlotte System [3]. One of the shortcomings of simple eager scheduling is that if a given task causes repeated failures, the entire computation fails to complete. We fix some instances of this problem with our mechanism for skipping bad records.

The MapReduce implementation relies on an in-house cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines. Though not the focus of this paper, the cluster management system is similar in spirit to other systems such as Condor [16].

The sorting facility that is a part of the MapReduce library is similar in operation to NOW-Sort [1]. Source machines (map workers) partition the data to be sorted and send it to one of R reduce workers. Each reduce worker sorts its data locally (in memory if possible). Of course NOW-Sort does not have the user-definable Map and Reduce functions that make our library widely applicable.

River [2] provides a programming model where processes communicate with each other by sending data over distributed queues. Like MapReduce, the River system tries to provide good average case performance even in the presence of non-uniformities introduced by heterogeneous hardware or system perturbations. River achieves this by careful scheduling of disk and network transfers to achieve balanced completion times. MapReduce has a different approach. By restricting the programming model, the MapReduce framework is able to partition the problem into a large number of fine-grained tasks. These tasks are dynamically scheduled on available workers so that faster workers process more tasks. The restricted programming model also allows us to schedule redundant executions of tasks near the end of the job which greatly reduces completion time in the presence of non-uniformities (such as slow or stuck workers).

BAD-FS [5] has a very different programming model from MapReduce, and unlike MapReduce, is targeted to

To appear in OSDI 2004
