On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications
Abstract—The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce the merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.
MapReduce jobs. Condie et al. [23] have introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks. Lin and Dyer [24] have proposed an in-mapper combining scheme that exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer the emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities across multiple map tasks. Costa et al. [25] have proposed a MapReduce-like system to decrease the traffic by pushing aggregation from the edge into the network. However, it can only be applied to network topologies with servers directly linked to other servers, which is of limited practical use.

Different from existing work, we investigate network traffic reduction within MapReduce jobs by jointly exploiting traffic-aware intermediate data partition and data aggregation among multiple map tasks.
3 SYSTEM MODEL

MapReduce is a programming model based on two primitives: a map function and a reduce function. The former processes key/value pairs ⟨k, v⟩ and produces a set of intermediate key/value pairs ⟨k′, v′⟩. Intermediate key/value pairs are merged and sorted based on the intermediate key k′ and provided as input to the reduce function. A MapReduce job is executed over a distributed system composed of a master and a set of workers. The input is divided into chunks that are assigned to map tasks. The master schedules map tasks on the workers by taking data locality into account. The output of the map tasks is divided into as many partitions as the number of reducers for the job. Entries with the same intermediate key must be assigned to the same partition to guarantee the correctness of the execution. All the intermediate key/value pairs of a given partition are sorted and sent to the worker running the corresponding reduce task. Default scheduling of reduce tasks does not take any data locality constraint into consideration. As a result, the amount of data that has to be transferred through the network in the shuffle process may be significant.

In this paper, we consider a typical MapReduce job on a large cluster consisting of a set N of machines. We let d_xy denote the distance between two machines x and y, which represents the cost of delivering a unit of data. When the job is executed, two types of tasks, i.e., map and reduce, are created. The sets of map and reduce tasks are denoted by M and R, respectively, and both are already placed on machines. The input data are divided into independent chunks that are processed by map tasks in parallel. The generated intermediate results, in the form of key/value pairs, may be shuffled and sorted by the framework, and are then fetched by reduce tasks to produce the final results. We let P denote the set of keys contained in the intermediate results, and m_i^p denote the data volume of key/value pairs with key p ∈ P generated by mapper i ∈ M.

A set of aggregators is available to process the intermediate results before they are sent to reducers. These aggregators can be placed on any machine, and one aggregator per machine is sufficient if adopted. The data reduction ratio of an aggregator is denoted by α, which can be obtained via profiling before job execution.

The cost of delivering a certain amount of traffic over a network link is evaluated by the product of data size and link distance. Our objective in this paper is to minimize the total network traffic cost of a MapReduce job by jointly considering aggregator placement and intermediate data partition. All symbols and variables used in this paper are summarized in Table 1.

TABLE 1
Notations and Variables

N          a set of physical machines
d_xy       distance between two machines x and y
M          a set of map tasks in the map layer
R          a set of reduce tasks in the reduce layer
A          a set of nodes in the aggregation layer
P          a set of intermediate keys
A_i        a set of neighbors of mapper i ∈ M
K          maximum number of aggregators
m_i^p      data volume of key p ∈ P generated by mapper i ∈ M
π(u)       the machine containing node u
x_ij^p     binary variable denoting whether mapper i ∈ M sends data of key p ∈ P to node j ∈ A
f_ij^p     traffic for key p ∈ P from mapper i ∈ M to node j ∈ A
I_j^p      input data of key p ∈ P on node j ∈ A
M_j        a set of neighboring nodes of j ∈ A
O_j^p      output data of key p ∈ P on node j ∈ A
α          data reduction ratio of an aggregator
α_j        data reduction ratio of node j ∈ A
z_j        binary variable indicating if an aggregator is placed on machine j ∈ N
y_k^p      binary variable denoting whether data of key p ∈ P is processed by reducer k ∈ R
g_jk^p     network traffic regarding key p ∈ P from node j ∈ A to reducer k ∈ R
z_j^p      an auxiliary variable
λ_j^p      Lagrangian multiplier
m_j^p(t)   m_j^p at time slot t
α_j(t)     α_j at time slot t
φ_jj′      migration cost for an aggregator from machine j to j′
ψ_kk′(·)   cost of migrating intermediate data from reducer k to k′
C_M(t)     total migration cost at time slot t
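Before the formal formulation, the cost model above can be made concrete with a short sketch: traffic cost is data size times link distance, and an aggregator shrinks its output by the ratio α. All names below are illustrative assumptions, and tasks are identified with their host machines for brevity.

```python
# A minimal sketch of the traffic-cost model described above, assuming each
# task is identified with the machine hosting it. Names/data are illustrative.
def traffic_cost(m, d, route, assign, alpha):
    """m[(i, p)]   : volume of key p generated by mapper i (m_i^p)
    d[(a, b)]   : distance between machines a and b (d_ab)
    route[(i,p)]: aggregation-layer node that mapper i sends key p to
    assign[p]   : reducer in charge of key p
    alpha       : data reduction ratio of an aggregator"""
    cost = 0.0
    for (i, p), size in m.items():
        j = route[(i, p)]
        cost += size * d[(i, j)]                   # mapper -> aggregation layer
        cost += alpha * size * d[(j, assign[p])]   # reduced output -> reducer
    return cost

# Example: two mappers sending key "w" via node 2 to reducer 3, alpha = 0.35.
m = {(0, "w"): 100.0, (1, "w"): 80.0}
d = {(0, 2): 1, (1, 2): 1, (2, 3): 4}
print(traffic_cost(m, d, {(0, "w"): 2, (1, "w"): 2}, {"w": 3}, 0.35))  # 432.0
```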
4 PROBLEM FORMULATION

In this section, we formulate the network traffic minimization problem. To facilitate our analysis, we construct an auxiliary graph with a three-layer structure as shown in Fig. 3. The given placement of mappers and reducers applies in the map layer and the reduce layer, respectively. In the aggregation layer, we create a potential aggregator at each machine, which can aggregate data from all mappers. Since a single potential aggregator is sufficient at each machine, we also use N to denote all potential aggregators. In addition, we create a shadow
where the constant $\bar{O}_j^p = \sum_{i \in M_j} m_i^p$ is the upper bound of O_j^p. The MILP formulation after linearization is:

$$\min \sum_{p \in P} \Big( \sum_{i \in M} \sum_{j \in A_i} f_{ij}^p d_{ij} + \sum_{j \in A} \sum_{k \in R} g_{jk}^p d_{jk} \Big)$$

subject to: (1)–(7), (9), and (10).

Theorem 1. The traffic-aware partition and aggregation problem is NP-hard.

Proof: To prove the NP-hardness of our network traffic optimization problem, we prove the NP-completeness of its decision version by reducing the set cover problem to it in polynomial time.

The set cover problem: given a set U = {x_1, x_2, ..., x_n}, a collection of m subsets S = {S_1, S_2, ..., S_m} with S_j ⊆ U for 1 ≤ j ≤ m, and an integer K, the set cover problem seeks a subcollection C such that |C| ≤ K and ∪_{i∈C} S_i = U.

For each x_i ∈ U, we create a mapper M_i that generates only one key/value pair. All key/value pairs will be sent to a single reducer whose distance from each mapper is more than 2. For each subset S_j, we create a potential aggregator A_j at distance 1 from the reducer. If x_i ∈ S_j, we set the distance between M_i and A_j to 1; otherwise, their distance is greater than 1. The aggregation ratio α is defined to be 1. The constructed instance of our problem can be illustrated using Fig. 4. Given K aggregators, we look for a placement such that the total traffic cost is no greater than 2n. It is easy to see that a solution of the set cover problem generates a solution of our problem with cost 2n. Conversely, when we have a solution of our problem with cost 2n, each mapper must send its result to an aggregator at distance 1, which forms a solution of the corresponding set cover problem.

5 DISTRIBUTED ALGORITHM DESIGN

The problem above can be solved by efficient algorithms, e.g., branch-and-bound, and fast off-the-shelf solvers, e.g., CPLEX, for moderate-sized input.
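For such moderate-sized instances, the linearized MILP can indeed be handed to an off-the-shelf solver. The sketch below builds a deliberately tiny instance with PuLP and its bundled CBC solver rather than CPLEX. It is a simplified stand-in, not the paper's exact constraints (1)–(10): it forces all traffic through placed aggregators and uses the upper bound Ō_j^p as the big-M constant of the linearization.

```python
# A toy instance of the linearized joint partition/aggregation MILP (sketch).
from itertools import product
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

mappers, nodes, reducers, keys = range(3), range(3), range(2), range(2)
m = {(i, p): 10.0 * (i + 1) for i, p in product(mappers, keys)}     # m_i^p
d_mn = {(i, j): 1 + (i + j) % 3 for i, j in product(mappers, nodes)}
d_nr = {(j, k): 1 + (j + k) % 3 for j, k in product(nodes, reducers)}
alpha, K = 0.35, 1   # profiled reduction ratio and aggregator budget

prob = LpProblem("traffic_aware_mapreduce", LpMinimize)
x = LpVariable.dicts("x", (mappers, nodes, keys), cat="Binary")  # i -> j, key p
z = LpVariable.dicts("z", nodes, cat="Binary")                   # aggregator at j
y = LpVariable.dicts("y", (keys, reducers), cat="Binary")        # key p -> reducer k
g = LpVariable.dicts("g", (nodes, reducers, keys), lowBound=0)   # linearized traffic

O_bar = {p: sum(m[i, p] for i in mappers) for p in keys}         # big-M per key

for i, p in product(mappers, keys):              # each output goes to one node
    prob += lpSum(x[i][j][p] for j in nodes) == 1
for p in keys:                                   # each key to exactly one reducer
    prob += lpSum(y[p][k] for k in reducers) == 1
prob += lpSum(z[j] for j in nodes) <= K          # aggregator budget
for i, j, p in product(mappers, nodes, keys):    # route only via placed nodes
    prob += x[i][j][p] <= z[j]
for j, k, p in product(nodes, reducers, keys):   # g >= alpha * inflow when y = 1
    prob += g[j][k][p] >= alpha * lpSum(m[i, p] * x[i][j][p] for i in mappers) \
            - O_bar[p] * (1 - y[p][k])

prob += lpSum(m[i, p] * d_mn[i, j] * x[i][j][p]
              for i, j, p in product(mappers, nodes, keys)) \
      + lpSum(d_nr[j, k] * g[j][k][p]
              for j, k, p in product(nodes, reducers, keys))
prob.solve()
print("optimal traffic cost:", value(prob.objective))
```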
An additional challenge arises in dealing with a MapReduce job for big data. In such a job, there are hundreds or even thousands of keys, each of which is associated with a set of variables (e.g., x_ij^p and y_k^p) and constraints (e.g., (1) and (7)) in our formulation, leading to a large-scale optimization problem that can hardly be handled by existing algorithms and solvers in practice.

In this section, we develop a distributed algorithm to solve the problem on multiple machines in a parallel manner. Our basic idea is to decompose the original large-scale problem into several distributively solvable subproblems that are coordinated by a high-level master problem. To achieve this objective, we first introduce an auxiliary variable z_j^p such that our problem can be equivalently formulated as:

$$\min \sum_{p \in P} \Big( \sum_{i \in M} \sum_{j \in A_i} f_{ij}^p d_{ij} + \sum_{j \in A} \sum_{k \in R} g_{jk}^p d_{jk} \Big)$$

subject to:

$$x_{ij}^p \le z_j^p, \quad \forall j \in N,\ i \in M_j,\ p \in P, \qquad (11)$$
$$z_j^p = z_j, \quad \forall j \in N,\ p \in P, \qquad (12)$$

and (1)–(5), (7), (9), and (10).

The corresponding Lagrangian is as follows:

$$L(\lambda) = \sum_{p \in P} C^p + \sum_{j \in N} \sum_{p \in P} \lambda_j^p (z_j - z_j^p) = \sum_{p \in P} \Big( C^p - \sum_{j \in N} \lambda_j^p z_j^p \Big) + \sum_{j \in N} \sum_{p \in P} \lambda_j^p z_j \qquad (13)$$

Given λ_j^p, the dual decomposition results in two sets of subproblems: intermediate data partition and aggregator placement. The subproblem of data partition for each key p ∈ P is:

$$\text{SUB\_DP:} \quad \min \Big( C^p - \sum_{j \in N} \lambda_j^p z_j^p \Big)$$

subject to: (1)–(4), (7), (9), (10), and (11).

These subproblems for different keys can be solved distributively on multiple machines in a parallel manner. The subproblem of aggregator placement can be simply written as:

$$\text{SUB\_AP:} \quad \min \sum_{j \in N} \sum_{p \in P} \lambda_j^p z_j, \quad \text{subject to: (5).}$$

The values of λ_j^p are updated in the following master problem:

$$\min L(\lambda) = \sum_{p \in P} \tilde{C}^p + \sum_{j \in N} \sum_{p \in P} \lambda_j^p \tilde{z}_j - \sum_{j \in N} \sum_{p \in P} \lambda_j^p \tilde{z}_j^p, \quad \text{subject to: } \lambda_j^p \ge 0,\ \forall j \in A,\ p \in P, \qquad (14)$$

where C̃^p, z̃_j^p and z̃_j are the optimal solutions returned by the subproblems. Since the objective function of the master
problem is differentiable, it can be solved by the following gradient method:

$$\lambda_j^p(t+1) = \Big[ \lambda_j^p(t) + \xi \big( \tilde{z}_j(\lambda_j^p(t)) - \tilde{z}_j^p(\lambda_j^p(t)) \big) \Big]^+, \qquad (15)$$

where t is the iteration index, ξ is a positive step size, and [·]^+ denotes the projection onto the nonnegative orthant. In summary, we have the following distributed algorithm to solve our problem.

Algorithm 1 Distributed Algorithm
1: initialize all λ_j^p to nonnegative values;
2: for t < T do
3:   solve the subproblems SUB_DP and SUB_AP on multiple machines in a parallel manner;
4:   update λ_j^p according to (15);
5:   set t = t + 1;
6: end for
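To make the decomposition loop concrete, the following sketch mirrors Algorithm 1 and the update in (15). The callables solve_sub_dp and solve_sub_ap stand in for per-key SUB_DP and SUB_AP solvers (e.g., thin wrappers around an ILP solver); they and all names here are illustrative assumptions, not the paper's code.

```python
# A schematic sketch of Algorithm 1: dual decomposition coordinated by the
# subgradient update (15).
def distributed_solve(keys, nodes, solve_sub_dp, solve_sub_ap, T=100, step=0.01):
    # lam[j][p]: multiplier lambda_j^p for the coupling constraint z_j^p = z_j
    lam = {j: {p: 0.0 for p in keys} for j in nodes}
    for _ in range(T):
        # line 3: per-key SUB_DP instances are independent, so they can run
        # on different machines in parallel; SUB_AP places the aggregators.
        z_aux = {p: solve_sub_dp(p, {j: lam[j][p] for j in nodes}) for p in keys}
        z = solve_sub_ap(lam)
        # line 4, update (15): step along z_j - z_j^p, project onto lambda >= 0
        for j in nodes:
            for p in keys:
                lam[j][p] = max(0.0, lam[j][p] + step * (z[j] - z_aux[p][j]))
    return lam
```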
5.1 Network Traffic Traces

In this section, we verify that our distributed algorithm can be applied in practice using a real trace on a cluster consisting of 5 virtual machines with 1GB memory and 2GHz CPU. Our network topology is based on a three-tier architecture: an access tier, an aggregation tier and a core tier (Fig. 6). The access tier is made up of cost-effective Ethernet switches connecting rack VMs. The access switches are connected via Ethernet to a set of aggregation switches, which in turn are connected to a layer of core switches. An inter-rack link is the most contentious resource, as all the VMs hosted on a rack transfer data across it to the VMs on other racks. Our VMs are distributed in three different racks, and the map and reduce tasks are scheduled as in Fig. 6. For example, rack 1 consists of nodes 1 and 2; mappers 1 and 2 are scheduled on node 1 and reducer 1 is scheduled on node 2. The intermediate data forwarded between mappers and reducers must be transferred across the network. The hop distances between mappers and reducers are shown in Fig. 6; e.g., mapper 1 and reducer 2 have a hop distance of 6.

Fig. 6. A small example.

We tested the real network traffic cost in Hadoop using a real data source, the latest dump files of the English Wikipedia (http://dumps.wikimedia.org/enwiki/latest/). In the meantime, we executed our distributed algorithm on the same data source for comparison. Since our distributed algorithm relies on a known aggregation ratio α, we have done some experiments to evaluate it in the Hadoop environment. Fig. 5 shows the parameter α for different input scales. It turns out to be stable as the input size increases, and thus we use the average aggregation ratio 0.35 for our trace.

Fig. 5. Data reduction ratio (input sizes of 50MB, 150MB, 600MB and 1GB).

To evaluate the experimental performance, we choose the wordcount application in Hadoop. First of all, we tested inputs of 213.44MB, 213.40MB, 213.44MB, 213.41MB and 213.42MB for five map tasks to generate corresponding outputs, which turn out to be 174.51MB, 177.92MB, 176.21MB, 177.17MB and 176.19MB, respectively. Based on these outputs, the optimal solution is to place an aggregator on node 1 and to assign intermediate data according to the traffic-aware partition scheme. Since mappers 1 and 2 are scheduled on node 1, their outputs can be aggregated before forwarding to reducers. We list the size of the outputs after aggregation and the final intermediate data distribution between reducers in Table 2. For example, the aggregated data size on node 1 is 139.66MB, of which 81.17MB is for reducer 1 and 58.49MB for reducer 2.

The data size and hop distance for all intermediate data transfers obtained in the optimal solution are shown in Fig. 6 and Table 2. Finally, we get the network traffic cost as follows:

81.17 × 2 + 58.49 × 6 + 96.17 × 4 + 80.04 × 6 + 98.23 × 6 + 78.94 × 2 + 94.17 × 6 + 82.02 × 0 = 2690.48
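The arithmetic can be cross-checked with a few lines; the (size, hop distance) pairs below are the eight transfers reported above and in Table 2.

```python
# Verifying the example's traffic cost: sum of data size (MB) x hop distance.
transfers = [(81.17, 2), (58.49, 6), (96.17, 4), (80.04, 6),
             (98.23, 6), (78.94, 2), (94.17, 6), (82.02, 0)]
print(round(sum(size * hops for size, hops in transfers), 2))  # 2690.48
```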
Since our aggregator is placed on node 1, the outputs of mapper 1 and mapper 2 are merged into 139.66MB. The intermediate data from all mappers is transferred according to the traffic-aware partition scheme. We get a total network cost of 2690.48 in the real Hadoop environment, while the ideal network cost obtained from the optimal solution is 2673.49. The two turn out to be very close to each other, which indicates that our distributed algorithm can be applied in practice.

TABLE 2
Real Cost vs. Ideal Cost

6 ONLINE ALGORITHM

Until now, we have taken the data size m_i^p and the data aggregation ratio α_j as input to our algorithms. To get their values, we need to wait for all mappers to finish before starting the reduce tasks, or to conduct estimation via profiling on a small data set. In practice, map and reduce tasks may partially overlap in execution to increase system throughput, and it is difficult to estimate system parameters with high accuracy for big data applications. These observations motivate us to design an online algorithm that dynamically adjusts data partition and aggregation during the execution of map and reduce tasks.

In this section, we divide the execution of a MapReduce job into several time slots with a length of several minutes or an hour. We let m_j^p(t) and α_j(t) denote the parameters collected at time slot t, with no assumption about their distributions. As the job runs, an existing data partition and aggregation scheme may no longer be optimal under the current m_j^p(t) and α_j(t). To reduce traffic cost, we may need to migrate an aggregator from machine j to j′ with a migration cost φ_jj′. Meanwhile, the key assignment among reducers is adjusted. When we let reducer k′ process the data with key p instead of reducer k, which is currently in charge of this key, we use the function ψ_kk′(Σ_{τ=1}^{t} Σ_{j∈A} Σ_{k∈R} g_jk^p(τ)) to denote the cost of migrating all intermediate data received by the reducers so far. The total migration cost can be calculated by:

$$C_M(t) = \sum_{k,k' \in R} \sum_{p \in P} y_k^p(t-1)\, y_{k'}^p(t)\, \psi_{kk'}\Big( \sum_{\tau=1}^{t} \sum_{j \in A} \sum_{k \in R} g_{jk}^p(\tau) \Big) + \sum_{j,j' \in N} z_j(t-1)\, z_{j'}(t)\, \phi_{jj'}. \qquad (16)$$

Our objective is to minimize the overall cost of traffic and migration over a time interval [1, T], i.e.,

$$\min \sum_{t=1}^{T} \Big( C_M(t) + \sum_{p \in P} C^p(t) \Big), \quad \text{subject to: (1)–(7), (9), (10), and (16)}, \ t = 1, \ldots, T.$$

An intuitive method to solve the problem above is to divide it into T one-shot optimization problems:

$$\text{OPT\_ONE\_SHOT:} \quad \min\ C_M(t) + \sum_{p \in P} C^p(t)$$

subject to: (1)–(7), (9), (10), and (16), for time slot t.

Unfortunately, solving the above one-shot optimization in each time slot based on the information collected in the previous time slot will be far from optimal, because it may lead to frequent migration events. Moreover, the coupled objective function due to C_M(t) introduces additional challenges in distributed algorithm design.

We therefore design an online algorithm whose basic idea is to postpone the migration operation until the cumulative traffic cost exceeds a threshold. As shown in Algorithm 2, we let t̂ denote the time of the last migration operation, and obtain an initial solution by solving the OPT_ONE_SHOT problem.

Algorithm 2 Online Algorithm
1: t̂ = 1 and t = 1;
2: solve the OPT_ONE_SHOT problem for t = 1;
3: while t ≤ T do
4:   if Σ_{τ=t̂}^{t} Σ_{p∈P} C^p(τ) > γ · C_M(t̂) then
5:     solve the following optimization problem: min Σ_{p∈P} C^p(t), subject to (1)–(7), (9), and (10), for time slot t;
6:     if the solution indicates a migration event then
7:       conduct migration according to the new solution;
8:       t̂ = t;
9:       update C_M(t̂);
10:    end if
11:  end if
12:  t = t + 1;
13: end while

In each of the following time
slots, we check whether the accumulated traffic cost, i.e., Σ_{τ=t̂}^{t} Σ_{p∈P} C^p(τ), is greater than γ times C_M(t̂). If it is, we solve an optimization problem with the objective of minimizing the traffic cost, as shown in line 5. We conduct the migration operation according to the optimization results and update C_M(t̂) accordingly, as shown in lines 6 to 10. Note that the optimization problem in line 5 can be solved using the distributed algorithm developed in the last section.
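The control flow of Algorithm 2 can be sketched as follows. The callables solve_opt_one_shot, solve_traffic_only, traffic_cost, migration_cost and migrate are placeholder hooks (assumptions, not the paper's code), and plans are compared for equality to detect a migration event.

```python
# A schematic sketch of Algorithm 2: defer migration until the traffic cost
# accumulated since the last migration exceeds gamma times C_M(t_hat).
def online_adjust(T, gamma, solve_opt_one_shot, solve_traffic_only,
                  traffic_cost, migration_cost, migrate):
    plan = solve_opt_one_shot(1)            # lines 1-2: initial solution, t_hat = 1
    c_m = migration_cost(plan)              # C_M at the last migration time t_hat
    accumulated = 0.0                       # traffic cost accumulated since t_hat
    for t in range(1, T + 1):               # line 3
        accumulated += traffic_cost(plan, t)
        if accumulated > gamma * c_m:       # line 4: threshold exceeded?
            new_plan = solve_traffic_only(t)    # line 5: minimize traffic only
            if new_plan != plan:                # line 6: a migration event
                migrate(plan, new_plan)         # line 7
                plan = new_plan                 # line 8: t_hat = t
                c_m = migration_cost(plan)      # line 9: update C_M(t_hat)
                accumulated = 0.0
    return plan
```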
7 PERFORMANCE EVALUATION

In this section, we conduct extensive simulations to evaluate the performance of our proposed distributed algorithm DA by comparing it to the following two schemes.

HNA: Hash-based partition with No Aggregation. As the default method in Hadoop, it applies traditional hash partitioning to the intermediate data, which are transferred to reducers without going through aggregators (a minimal sketch of this default partitioner follows below).

HRA: Hash-based partition with Random Aggregation. It applies the same hash partitioning, but aggregators are placed on randomly chosen machines.
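For reference, the hash partitioning that HNA and HRA model assigns a key to reducer hash(key) mod R, independent of network topology and per-key data volume. The sketch below uses CRC32 for a run-to-run stable hash, an illustrative choice rather than Hadoop's exact hash function.

```python
import zlib

# Traffic-oblivious hash partitioning: key -> hash(key) mod R, regardless of
# network topology or per-key data volume (CRC32 chosen for determinism).
def hash_partition(key: str, num_reducers: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_reducers

print(hash_partition("word", 6))  # some reducer index in [0, 5]
```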
Fig. 7. Network traffic cost versus number of keys from 1 to 10.

Fig. 9. Network traffic cost versus data reduction ratio α.

Fig. 10. Network traffic cost versus maximum number of aggregators.

Fig. 11. Network traffic cost versus number of map tasks.

Fig. 12. Network traffic cost versus number of reduce tasks.
As the maximum number of available aggregators increases (Fig. 10), the network traffic costs decrease under both HRA and DA. In particular, when the number of aggregators is 6, the network traffic of the HRA algorithm is 2.2×10^5, while DA's cost is only 1.5×10^5, a 26.7% improvement. That is because aggregators are beneficial to intermediate data reduction in the shuffle process. As in Fig. 9, the performance of HNA shows as a horizontal line because it is not affected by the number of available aggregators.

We study the influence of the number of map tasks by increasing the mapper number from 0 to 60. As shown in Fig. 11, DA always achieves the lowest traffic cost, as expected, because it jointly optimizes data partition and aggregation. Moreover, as the mapper number increases, the network traffic of all algorithms increases.

We show the network traffic cost under different numbers of reduce tasks in Fig. 12, where the number of reducers is changed from 1 to 6. We observe that the highest network traffic occurs under all algorithms when there is only one reduce task. That is because all key/value pairs may be delivered to the only reducer, which may be located far away, leading to a large amount of network traffic due to the many-to-one communication pattern. As the number of reduce tasks increases, the network traffic decreases because more reduce tasks share the load of intermediate data. In particular, DA assigns key/value pairs to the closest reduce task, leading to the least network traffic. When the number of reduce tasks is larger than 3, the decrease in network traffic becomes slow because the capability of intermediate data sharing among reducers has been fully exploited.

The effect of the number of machines is investigated in Fig. 13 by changing the number of physical nodes from 10 to 60. We observe that the network traffic of all algorithms increases as the number of nodes grows. Furthermore, the HRA algorithm performs much worse than the other two algorithms under all settings.

7.2 Simulation results of online cases

We then evaluate the performance of the proposed algorithm in online cases by comparing it with two other schemes: OHRA and OHNA, which are online extensions of HRA and HNA, respectively. The default number of mappers is 20 and the number of reducers is 5. The maximum number of aggregators is set to 4, and we also vary it to examine its impact. Key/value pairs with random data sizes within [1, 100] are generated in different slots. The total number of physical machines is set to 10, and the distance between any two machines is randomly chosen within [1, 60]. Meanwhile, the default parameter α is set to 0.5. The migration costs ψ_kk′ and φ_jj′ are defined as constants 5 and 6, respectively. The initial migration cost C_M(0) is defined as 300, and γ is set to 1000. All simulation results are averaged over 30 random instances.
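One such random online instance can be generated as sketched below, following the settings just listed (20 mappers, 5 reducers, 10 machines, distances drawn from [1, 60], per-slot key volumes from [1, 100]); the generator's shape is an assumption for illustration.

```python
import random

# Generate one random online-case instance per the settings above.
def random_instance(num_keys, num_slots, seed=0):
    rng = random.Random(seed)
    machines = range(10)
    dist = {(a, b): (0 if a == b else rng.randint(1, 60))
            for a in machines for b in machines}
    # sizes[t][(i, p)]: volume of key p produced by mapper i in time slot t
    sizes = [{(i, p): rng.randint(1, 100)
              for i in range(20) for p in range(num_keys)}
             for _ in range(num_slots)]
    return dist, sizes
```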
Fig. 13. Network traffic cost versus number of machines.

Fig. 14. Network traffic cost versus size of time interval T.

Fig. 15. Network traffic cost versus number of keys.

Fig. 16. Network traffic cost versus data reduction ratio α.
We first study the performance of all algorithms under the default network setting in Fig. 14. We observe that the network traffic increases at the beginning and then tends to become stable under our proposed online algorithm. The network traffic of OHRA and OHNA always stays stable, because OHNA obeys the same hash partition scheme with no global aggregation in every time slot, and OHRA introduces only a slight migration cost since φ_jj′ is just 6. Our proposed online algorithm keeps updating the migration cost C_M(t) and executes the distributed algorithm in different time slots, which incurs some migration cost in the process.

The influence of the number of keys on the network traffic is studied in Fig. 15. We observe that our online algorithm performs much better than the other two algorithms. In particular, when the number of keys is 50, the network traffic of the online algorithm is about 2×10^5, while the traffic of OHNA is almost 3.1×10^5, a difference of 35%.

In Fig. 16, we compare the performance of the three algorithms under different values of α. The larger α is, the lower the aggregation efficiency of the intermediate data. We observe that the network traffic increases under our online algorithm and OHRA, whereas OHNA is not affected by the parameter α because it conducts no data aggregation. When α is 1, all algorithms have similar performance, because α = 1 means no data aggregation; under the other settings, our online algorithm outperforms OHRA and OHNA due to the joint optimization of traffic-aware partition and global aggregation.

We investigate the performance of the three algorithms under different numbers of aggregators in Fig. 17. We observe that the online algorithm outperforms the other two schemes. When the number of aggregators is 6, the network traffic of the OHNA algorithm is 2.8×10^5, while our online algorithm has a network traffic of 1.7×10^5, an improvement of 39%. As the number of aggregators increases, it becomes more beneficial to aggregate intermediate data, reducing the amount of data in the shuffle process. However, when the number of aggregators is set to 0, which means no global aggregation, OHRA has the same network traffic as OHNA, and our online algorithm still achieves the lowest cost.

8 CONCLUSION

In this paper, we study the joint optimization of intermediate data partition and aggregation in MapReduce to minimize the network traffic cost for big data applications. We propose a three-layer model for this problem and formulate it as a mixed-integer nonlinear problem, which is then transformed into a linear form that can be solved by mathematical tools. To deal with the large-scale formulation due to big data, we design a distributed algorithm based on the decomposition of the original problem, and we further design an online algorithm to adjust the data partition and aggregation scheme dynamically during job execution.