On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications
Abstract—The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce the merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.
MapReduce jobs. Condie et al. [23] have introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks. Lin and Dyer [24] have proposed an in-mapper combining scheme that exploits the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer the emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities across multiple map tasks. Costa et al. [25] have proposed a MapReduce-like system to decrease the traffic by pushing aggregation from the edge into the network. However, it can only be applied to network topologies with servers directly linked to other servers, which is of limited practical use.

Different from existing work, we investigate network traffic reduction within MapReduce jobs by jointly exploiting traffic-aware intermediate data partition and data aggregation among multiple map tasks.
3 SYSTEM MODEL

MapReduce is a programming model based on two primitives: a map function and a reduce function. The former processes key/value pairs ⟨k, v⟩ and produces a set of intermediate key/value pairs ⟨k′, v′⟩. Intermediate key/value pairs are merged and sorted based on the intermediate key k′ and provided as input to the reduce function. A MapReduce job is executed over a distributed system composed of a master and a set of workers. The input is divided into chunks that are assigned to map tasks. The master schedules map tasks on the workers by taking data locality into account. The output of the map tasks is divided into as many partitions as the number of reducers for the job. Entries with the same intermediate key must be assigned to the same partition to guarantee the correctness of the execution. All the intermediate key/value pairs of a given partition are sorted and sent to the worker running the corresponding reduce task. Default scheduling of reduce tasks does not take any data locality constraint into consideration. As a result, the amount of data that has to be transferred through the network in the shuffle process may be significant.

In this paper, we consider a typical MapReduce job on a large cluster consisting of a set N of machines. We let d_xy denote the distance between two machines x and y, which represents the cost of delivering a unit of data. When the job is executed, two types of tasks, i.e., map and reduce, are created. The sets of map and reduce tasks are denoted by M and R, respectively, and both are already placed on machines. The input data are divided into independent chunks that are processed by map tasks in parallel. The generated intermediate results, in the form of key/value pairs, may be shuffled and sorted by the framework, and are then fetched by reduce tasks to produce the final results. We let P denote the set of keys contained in the intermediate results, and m_i^p denote the data volume of key/value pairs with key p ∈ P generated by mapper i ∈ M.

A set of aggregators is available to process the intermediate results before they are sent to reducers. These aggregators can be placed on any machine, and one aggregator per machine is sufficient if adopted. The data reduction ratio of an aggregator is denoted by α, which can be obtained via profiling before job execution.

The cost of delivering a certain amount of traffic over a network link is evaluated by the product of data size and link distance. Our objective in this paper is to minimize the total network traffic cost of a MapReduce job by jointly considering aggregator placement and intermediate data partition. All symbols and variables used in this paper are summarized in Table 1.

TABLE 1
Notations and Variables

N          a set of physical machines
d_xy       distance between two machines x and y
M          a set of map tasks in the map layer
R          a set of reduce tasks in the reduce layer
A          a set of nodes in the aggregation layer
P          a set of intermediate keys
A_i        a set of neighbors of mapper i ∈ M
K          maximum number of aggregators
m_i^p      data volume of key p ∈ P generated by mapper i ∈ M
π(u)       the machine containing node u
x_ij^p     binary variable denoting whether mapper i ∈ M sends data of key p ∈ P to node j ∈ A
f_ij^p     traffic for key p ∈ P from mapper i ∈ M to node j ∈ A
I_j^p      input data of key p ∈ P on node j ∈ A
M_j        a set of neighboring nodes of j ∈ A
O_j^p      output data of key p ∈ P on node j ∈ A
α          data reduction ratio of an aggregator
α_j        data reduction ratio of node j ∈ A
z_j        binary variable indicating if an aggregator is placed on machine j ∈ N
y_k^p      binary variable denoting whether data of key p ∈ P is processed by reducer k ∈ R
g_jk^p     network traffic regarding key p ∈ P from node j ∈ A to reducer k ∈ R
z_j^p      an auxiliary variable
λ_j^p      Lagrangian multiplier
m_j^p(t)   m_j^p at time slot t
α_j(t)     α_j at time slot t
φ_jj′      migration cost for an aggregator from machine j to j′
ψ_kk′(·)   cost of migrating intermediate data from reducer k to k′
C_M(t)     total migration cost at time slot t
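Before the formal formulation, the cost model above can be made concrete with a short sketch: traffic cost is data size times link distance, and an aggregator shrinks its output by the ratio α. All names below are illustrative assumptions, and tasks are identified with their host machines for brevity.

```python
# A minimal sketch of the traffic-cost model described above, assuming each
# task is identified with the machine hosting it. Names/data are illustrative.
def traffic_cost(m, d, route, assign, alpha):
    """m[(i, p)]   : volume of key p generated by mapper i (m_i^p)
    d[(a, b)]   : distance between machines a and b (d_ab)
    route[(i,p)]: aggregation-layer node that mapper i sends key p to
    assign[p]   : reducer in charge of key p
    alpha       : data reduction ratio of an aggregator"""
    cost = 0.0
    for (i, p), size in m.items():
        j = route[(i, p)]
        cost += size * d[(i, j)]                   # mapper -> aggregation layer
        cost += alpha * size * d[(j, assign[p])]   # reduced output -> reducer
    return cost

# Example: two mappers sending key "w" via node 2 to reducer 3, alpha = 0.35.
m = {(0, "w"): 100.0, (1, "w"): 80.0}
d = {(0, 2): 1, (1, 2): 1, (2, 3): 4}
print(traffic_cost(m, d, {(0, "w"): 2, (1, "w"): 2}, {"w": 3}, 0.35))  # 432.0
```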
4 PROBLEM FORMULATION

In this section, we formulate the network traffic minimization problem. To facilitate our analysis, we construct an auxiliary graph with a three-layer structure as shown in Fig. 3. The given placement of mappers and reducers applies in the map layer and the reduce layer, respectively. In the aggregation layer, we create a potential aggregator at each machine, which can aggregate data from all mappers. Since a single potential aggregator is sufficient at each machine, we also use N to denote all potential aggregators. In addition, we create a shadow
where the constant $\bar{O}_j^p = \sum_{i \in M_j} m_i^p$ is the upper bound of O_j^p. The MILP formulation after linearization is:

$$\min \sum_{p \in P} \Big( \sum_{i \in M} \sum_{j \in A_i} f_{ij}^p d_{ij} + \sum_{j \in A} \sum_{k \in R} g_{jk}^p d_{jk} \Big)$$

subject to: (1)–(7), (9), and (10).

Theorem 1. The traffic-aware partition and aggregation problem is NP-hard.

Proof: To prove the NP-hardness of our network traffic optimization problem, we prove the NP-completeness of its decision version by reducing the set cover problem to it in polynomial time.

The set cover problem: given a set U = {x_1, x_2, ..., x_n}, a collection of m subsets S = {S_1, S_2, ..., S_m} with S_j ⊆ U for 1 ≤ j ≤ m, and an integer K, the set cover problem seeks a subcollection C such that |C| ≤ K and ∪_{i∈C} S_i = U.

For each x_i ∈ U, we create a mapper M_i that generates only one key/value pair. All key/value pairs will be sent to a single reducer whose distance from each mapper is more than 2. For each subset S_j, we create a potential aggregator A_j at distance 1 from the reducer. If x_i ∈ S_j, we set the distance between M_i and A_j to 1; otherwise, their distance is greater than 1. The aggregation ratio α is defined to be 1. The constructed instance of our problem can be illustrated using Fig. 4. Given K aggregators, we look for a placement such that the total traffic cost is no greater than 2n. It is easy to see that a solution of the set cover problem generates a solution of our problem with cost 2n. Conversely, when we have a solution of our problem with cost 2n, each mapper must send its result to an aggregator at distance 1, which forms a solution of the corresponding set cover problem.

5 DISTRIBUTED ALGORITHM DESIGN

The problem above can be solved by efficient algorithms, e.g., branch-and-bound, and fast off-the-shelf solvers, e.g., CPLEX, for moderate-sized input.
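For such moderate-sized instances, the linearized MILP can indeed be handed to an off-the-shelf solver. The sketch below builds a deliberately tiny instance with PuLP and its bundled CBC solver rather than CPLEX. It is a simplified stand-in, not the paper's exact constraints (1)–(10): it forces all traffic through placed aggregators and uses the upper bound Ō_j^p as the big-M constant of the linearization.

```python
# A toy instance of the linearized joint partition/aggregation MILP (sketch).
from itertools import product
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

mappers, nodes, reducers, keys = range(3), range(3), range(2), range(2)
m = {(i, p): 10.0 * (i + 1) for i, p in product(mappers, keys)}     # m_i^p
d_mn = {(i, j): 1 + (i + j) % 3 for i, j in product(mappers, nodes)}
d_nr = {(j, k): 1 + (j + k) % 3 for j, k in product(nodes, reducers)}
alpha, K = 0.35, 1   # profiled reduction ratio and aggregator budget

prob = LpProblem("traffic_aware_mapreduce", LpMinimize)
x = LpVariable.dicts("x", (mappers, nodes, keys), cat="Binary")  # i -> j, key p
z = LpVariable.dicts("z", nodes, cat="Binary")                   # aggregator at j
y = LpVariable.dicts("y", (keys, reducers), cat="Binary")        # key p -> reducer k
g = LpVariable.dicts("g", (nodes, reducers, keys), lowBound=0)   # linearized traffic

O_bar = {p: sum(m[i, p] for i in mappers) for p in keys}         # big-M per key

for i, p in product(mappers, keys):              # each output goes to one node
    prob += lpSum(x[i][j][p] for j in nodes) == 1
for p in keys:                                   # each key to exactly one reducer
    prob += lpSum(y[p][k] for k in reducers) == 1
prob += lpSum(z[j] for j in nodes) <= K          # aggregator budget
for i, j, p in product(mappers, nodes, keys):    # route only via placed nodes
    prob += x[i][j][p] <= z[j]
for j, k, p in product(nodes, reducers, keys):   # g >= alpha * inflow when y = 1
    prob += g[j][k][p] >= alpha * lpSum(m[i, p] * x[i][j][p] for i in mappers) \
            - O_bar[p] * (1 - y[p][k])

prob += lpSum(m[i, p] * d_mn[i, j] * x[i][j][p]
              for i, j, p in product(mappers, nodes, keys)) \
      + lpSum(d_nr[j, k] * g[j][k][p]
              for j, k, p in product(nodes, reducers, keys))
prob.solve()
print("optimal traffic cost:", value(prob.objective))
```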
An additional challenge arises in dealing with a MapReduce job for big data. In such a job, there are hundreds or even thousands of keys, each of which is associated with a set of variables (e.g., x_ij^p and y_k^p) and constraints (e.g., (1) and (7)) in our formulation, leading to a large-scale optimization problem that can hardly be handled by existing algorithms and solvers in practice.

In this section, we develop a distributed algorithm to solve the problem on multiple machines in a parallel manner. Our basic idea is to decompose the original large-scale problem into several distributively solvable subproblems that are coordinated by a high-level master problem. To achieve this objective, we first introduce an auxiliary variable z_j^p such that our problem can be equivalently formulated as:

$$\min \sum_{p \in P} \Big( \sum_{i \in M} \sum_{j \in A_i} f_{ij}^p d_{ij} + \sum_{j \in A} \sum_{k \in R} g_{jk}^p d_{jk} \Big)$$

subject to:

$$x_{ij}^p \le z_j^p, \quad \forall j \in N,\ i \in M_j,\ p \in P, \qquad (11)$$
$$z_j^p = z_j, \quad \forall j \in N,\ p \in P, \qquad (12)$$

and (1)–(5), (7), (9), and (10).

The corresponding Lagrangian is as follows:

$$L(\lambda) = \sum_{p \in P} C^p + \sum_{j \in N} \sum_{p \in P} \lambda_j^p (z_j - z_j^p) = \sum_{p \in P} \Big( C^p - \sum_{j \in N} \lambda_j^p z_j^p \Big) + \sum_{j \in N} \sum_{p \in P} \lambda_j^p z_j \qquad (13)$$

Given λ_j^p, the dual decomposition results in two sets of subproblems: intermediate data partition and aggregator placement. The subproblem of data partition for each key p ∈ P is:

$$\text{SUB\_DP:} \quad \min \Big( C^p - \sum_{j \in N} \lambda_j^p z_j^p \Big)$$

subject to: (1)–(4), (7), (9), (10), and (11).

These subproblems for different keys can be solved distributively on multiple machines in a parallel manner. The subproblem of aggregator placement can be simply written as:

$$\text{SUB\_AP:} \quad \min \sum_{j \in N} \sum_{p \in P} \lambda_j^p z_j, \quad \text{subject to: (5).}$$

The values of λ_j^p are updated in the following master problem:

$$\min L(\lambda) = \sum_{p \in P} \tilde{C}^p + \sum_{j \in N} \sum_{p \in P} \lambda_j^p \tilde{z}_j - \sum_{j \in N} \sum_{p \in P} \lambda_j^p \tilde{z}_j^p, \quad \text{subject to: } \lambda_j^p \ge 0,\ \forall j \in A,\ p \in P, \qquad (14)$$

where C̃^p, z̃_j^p and z̃_j are the optimal solutions returned by the subproblems. Since the objective function of the master
problem is differentiable, it can be solved by the following gradient method:

$$\lambda_j^p(t+1) = \Big[ \lambda_j^p(t) + \xi \big( \tilde{z}_j(\lambda_j^p(t)) - \tilde{z}_j^p(\lambda_j^p(t)) \big) \Big]^+, \qquad (15)$$

where t is the iteration index, ξ is a positive step size, and [·]^+ denotes the projection onto the nonnegative orthant. In summary, we have the following distributed algorithm to solve our problem.

Algorithm 1 Distributed Algorithm
1: initialize all λ_j^p to nonnegative values;
2: for t < T do
3:   solve the subproblems SUB_DP and SUB_AP on multiple machines in a parallel manner;
4:   update λ_j^p according to (15);
5:   set t = t + 1;
6: end for
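To make the decomposition loop concrete, the following sketch mirrors Algorithm 1 and the update in (15). The callables solve_sub_dp and solve_sub_ap stand in for per-key SUB_DP and SUB_AP solvers (e.g., thin wrappers around an ILP solver); they and all names here are illustrative assumptions, not the paper's code.

```python
# A schematic sketch of Algorithm 1: dual decomposition coordinated by the
# subgradient update (15).
def distributed_solve(keys, nodes, solve_sub_dp, solve_sub_ap, T=100, step=0.01):
    # lam[j][p]: multiplier lambda_j^p for the coupling constraint z_j^p = z_j
    lam = {j: {p: 0.0 for p in keys} for j in nodes}
    for _ in range(T):
        # line 3: per-key SUB_DP instances are independent, so they can run
        # on different machines in parallel; SUB_AP places the aggregators.
        z_aux = {p: solve_sub_dp(p, {j: lam[j][p] for j in nodes}) for p in keys}
        z = solve_sub_ap(lam)
        # line 4, update (15): step along z_j - z_j^p, project onto lambda >= 0
        for j in nodes:
            for p in keys:
                lam[j][p] = max(0.0, lam[j][p] + step * (z[j] - z_aux[p][j]))
    return lam
```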
5.1 Network Traffic Traces

In this section, we verify that our distributed algorithm can be applied in practice using a real trace on a cluster consisting of 5 virtual machines with 1GB memory and 2GHz CPU. Our network topology is based on a three-tier architecture: an access tier, an aggregation tier and a core tier (Fig. 6). The access tier is made up of cost-effective Ethernet switches connecting rack VMs. The access switches are connected via Ethernet to a set of aggregation switches, which in turn are connected to a layer of core switches. An inter-rack link is the most contentious resource, as all the VMs hosted on a rack transfer data across it to the VMs on other racks. Our VMs are distributed in three different racks, and the map and reduce tasks are scheduled as in Fig. 6. For example, rack 1 consists of nodes 1 and 2; mappers 1 and 2 are scheduled on node 1 and reducer 1 is scheduled on node 2. The intermediate data forwarded between mappers and reducers must be transferred across the network. The hop distances between mappers and reducers are shown in Fig. 6; e.g., mapper 1 and reducer 2 have a hop distance of 6.

Fig. 6. A small example.

We tested the real network traffic cost in Hadoop using a real data source, the latest dump files of the English Wikipedia (http://dumps.wikimedia.org/enwiki/latest/). In the meantime, we executed our distributed algorithm on the same data source for comparison. Since our distributed algorithm relies on a known aggregation ratio α, we have done some experiments to evaluate it in the Hadoop environment. Fig. 5 shows the parameter α for different input scales. It turns out to be stable as the input size increases, and thus we use the average aggregation ratio 0.35 for our trace.

Fig. 5. Data reduction ratio (input sizes of 50MB, 150MB, 600MB and 1GB).

To evaluate the experimental performance, we choose the wordcount application in Hadoop. First of all, we tested inputs of 213.44MB, 213.40MB, 213.44MB, 213.41MB and 213.42MB for five map tasks to generate corresponding outputs, which turn out to be 174.51MB, 177.92MB, 176.21MB, 177.17MB and 176.19MB, respectively. Based on these outputs, the optimal solution is to place an aggregator on node 1 and to assign intermediate data according to the traffic-aware partition scheme. Since mappers 1 and 2 are scheduled on node 1, their outputs can be aggregated before forwarding to reducers. We list the size of the outputs after aggregation and the final intermediate data distribution between reducers in Table 2. For example, the aggregated data size on node 1 is 139.66MB, of which 81.17MB is for reducer 1 and 58.49MB for reducer 2.

The data size and hop distance for all intermediate data transfers obtained in the optimal solution are shown in Fig. 6 and Table 2. Finally, we get the network traffic cost as follows:

81.17 × 2 + 58.49 × 6 + 96.17 × 4 + 80.04 × 6 + 98.23 × 6 + 78.94 × 2 + 94.17 × 6 + 82.02 × 0 = 2690.48
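The arithmetic can be cross-checked with a few lines; the (size, hop distance) pairs below are the eight transfers reported above and in Table 2.

```python
# Verifying the example's traffic cost: sum of data size (MB) x hop distance.
transfers = [(81.17, 2), (58.49, 6), (96.17, 4), (80.04, 6),
             (98.23, 6), (78.94, 2), (94.17, 6), (82.02, 0)]
print(round(sum(size * hops for size, hops in transfers), 2))  # 2690.48
```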
Since our aggregator is placed on node 1, the outputs of mapper 1 and mapper 2 are merged into 139.66MB. The intermediate data from all mappers is transferred according to the traffic-aware partition scheme. We get a total network cost of 2690.48 in the real Hadoop environment, while the ideal network cost obtained from the optimal solution is 2673.49. The two turn out to be very close to each other, which indicates that our distributed algorithm can be applied in practice.

TABLE 2
Real Cost vs. Ideal Cost

6 ONLINE ALGORITHM

Until now, we have taken the data size m_i^p and the data aggregation ratio α_j as input to our algorithms. To get their values, we need to wait for all mappers to finish before starting the reduce tasks, or to conduct estimation via profiling on a small data set. In practice, map and reduce tasks may partially overlap in execution to increase system throughput, and it is difficult to estimate system parameters with high accuracy for big data applications. These observations motivate us to design an online algorithm that dynamically adjusts data partition and aggregation during the execution of map and reduce tasks.

In this section, we divide the execution of a MapReduce job into several time slots with a length of several minutes or an hour. We let m_j^p(t) and α_j(t) denote the parameters collected at time slot t, with no assumption about their distributions. As the job runs, an existing data partition and aggregation scheme may no longer be optimal under the current m_j^p(t) and α_j(t). To reduce traffic cost, we may need to migrate an aggregator from machine j to j′ with a migration cost φ_jj′. Meanwhile, the key assignment among reducers is adjusted. When we let reducer k′ process the data with key p instead of reducer k, which is currently in charge of this key, we use the function ψ_kk′(Σ_{τ=1}^{t} Σ_{j∈A} Σ_{k∈R} g_jk^p(τ)) to denote the cost of migrating all intermediate data received by the reducers so far. The total migration cost can be calculated by:

$$C_M(t) = \sum_{k,k' \in R} \sum_{p \in P} y_k^p(t-1)\, y_{k'}^p(t)\, \psi_{kk'}\Big( \sum_{\tau=1}^{t} \sum_{j \in A} \sum_{k \in R} g_{jk}^p(\tau) \Big) + \sum_{j,j' \in N} z_j(t-1)\, z_{j'}(t)\, \phi_{jj'}. \qquad (16)$$

Our objective is to minimize the overall cost of traffic and migration over a time interval [1, T], i.e.,

$$\min \sum_{t=1}^{T} \Big( C_M(t) + \sum_{p \in P} C^p(t) \Big), \quad \text{subject to: (1)–(7), (9), (10), and (16)}, \ t = 1, \ldots, T.$$

An intuitive method to solve the problem above is to divide it into T one-shot optimization problems:

$$\text{OPT\_ONE\_SHOT:} \quad \min\ C_M(t) + \sum_{p \in P} C^p(t)$$

subject to: (1)–(7), (9), (10), and (16), for time slot t.

Unfortunately, solving the above one-shot optimization in each time slot based on the information collected in the previous time slot will be far from optimal, because it may lead to frequent migration events. Moreover, the coupled objective function due to C_M(t) introduces additional challenges in distributed algorithm design.

We therefore design an online algorithm whose basic idea is to postpone the migration operation until the cumulative traffic cost exceeds a threshold. As shown in Algorithm 2, we let t̂ denote the time of the last migration operation, and obtain an initial solution by solving the OPT_ONE_SHOT problem.

Algorithm 2 Online Algorithm
1: t̂ = 1 and t = 1;
2: solve the OPT_ONE_SHOT problem for t = 1;
3: while t ≤ T do
4:   if Σ_{τ=t̂}^{t} Σ_{p∈P} C^p(τ) > γ · C_M(t̂) then
5:     solve the following optimization problem: min Σ_{p∈P} C^p(t), subject to (1)–(7), (9), and (10), for time slot t;
6:     if the solution indicates a migration event then
7:       conduct migration according to the new solution;
8:       t̂ = t;
9:       update C_M(t̂);
10:    end if
11:  end if
12:  t = t + 1;
13: end while

In each of the following time
slots, we check whether the accumulated traffic cost, i.e., Σ_{τ=t̂}^{t} Σ_{p∈P} C^p(τ), is greater than γ times C_M(t̂). If it is, we solve an optimization problem with the objective of minimizing the traffic cost, as shown in line 5. We conduct the migration operation according to the optimization results and update C_M(t̂) accordingly, as shown in lines 6 to 10. Note that the optimization problem in line 5 can be solved using the distributed algorithm developed in the last section.
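The control flow of Algorithm 2 can be sketched as follows. The callables solve_opt_one_shot, solve_traffic_only, traffic_cost, migration_cost and migrate are placeholder hooks (assumptions, not the paper's code), and plans are compared for equality to detect a migration event.

```python
# A schematic sketch of Algorithm 2: defer migration until the traffic cost
# accumulated since the last migration exceeds gamma times C_M(t_hat).
def online_adjust(T, gamma, solve_opt_one_shot, solve_traffic_only,
                  traffic_cost, migration_cost, migrate):
    plan = solve_opt_one_shot(1)            # lines 1-2: initial solution, t_hat = 1
    c_m = migration_cost(plan)              # C_M at the last migration time t_hat
    accumulated = 0.0                       # traffic cost accumulated since t_hat
    for t in range(1, T + 1):               # line 3
        accumulated += traffic_cost(plan, t)
        if accumulated > gamma * c_m:       # line 4: threshold exceeded?
            new_plan = solve_traffic_only(t)    # line 5: minimize traffic only
            if new_plan != plan:                # line 6: a migration event
                migrate(plan, new_plan)         # line 7
                plan = new_plan                 # line 8: t_hat = t
                c_m = migration_cost(plan)      # line 9: update C_M(t_hat)
                accumulated = 0.0
    return plan
```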
7 PERFORMANCE EVALUATION

In this section, we conduct extensive simulations to evaluate the performance of our proposed distributed algorithm DA by comparing it to the following two schemes.

HNA: Hash-based partition with No Aggregation. As the default method in Hadoop, it applies traditional hash partitioning to the intermediate data, which are transferred to reducers without going through aggregators (a minimal sketch of this default partitioner follows below).

HRA: Hash-based partition with Random Aggregation. It applies the same hash partitioning, but aggregators are placed on randomly chosen machines.
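For reference, the hash partitioning that HNA and HRA model assigns a key to reducer hash(key) mod R, independent of network topology and per-key data volume. The sketch below uses CRC32 for a run-to-run stable hash, an illustrative choice rather than Hadoop's exact hash function.

```python
import zlib

# Traffic-oblivious hash partitioning: key -> hash(key) mod R, regardless of
# network topology or per-key data volume (CRC32 chosen for determinism).
def hash_partition(key: str, num_reducers: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_reducers

print(hash_partition("word", 6))  # some reducer index in [0, 5]
```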
Fig. 7. Network traffic cost versus number of keys from 1 to 10.

Fig. 9. Network traffic cost versus data reduction ratio α.

Fig. 10. Network traffic cost versus maximum number of aggregators.

Fig. 11. Network traffic cost versus number of map tasks.

Fig. 12. Network traffic cost versus number of reduce tasks.
As the maximum number of available aggregators increases (Fig. 10), the network traffic costs decrease under both HRA and DA. In particular, when the number of aggregators is 6, the network traffic of the HRA algorithm is 2.2×10^5, while DA's cost is only 1.5×10^5, a 26.7% improvement. That is because aggregators are beneficial to intermediate data reduction in the shuffle process. As in Fig. 9, the performance of HNA shows as a horizontal line because it is not affected by the number of available aggregators.

We study the influence of the number of map tasks by increasing the mapper number from 0 to 60. As shown in Fig. 11, DA always achieves the lowest traffic cost, as expected, because it jointly optimizes data partition and aggregation. Moreover, as the mapper number increases, the network traffic of all algorithms increases.

We show the network traffic cost under different numbers of reduce tasks in Fig. 12, where the number of reducers is changed from 1 to 6. We observe that the highest network traffic occurs under all algorithms when there is only one reduce task. That is because all key/value pairs may be delivered to the only reducer, which may be located far away, leading to a large amount of network traffic due to the many-to-one communication pattern. As the number of reduce tasks increases, the network traffic decreases because more reduce tasks share the load of intermediate data. In particular, DA assigns key/value pairs to the closest reduce task, leading to the least network traffic. When the number of reduce tasks is larger than 3, the decrease in network traffic becomes slow because the capability of intermediate data sharing among reducers has been fully exploited.

The effect of the number of machines is investigated in Fig. 13 by changing the number of physical nodes from 10 to 60. We observe that the network traffic of all algorithms increases as the number of nodes grows. Furthermore, the HRA algorithm performs much worse than the other two algorithms under all settings.

7.2 Simulation results of online cases

We then evaluate the performance of the proposed algorithm in online cases by comparing it with two other schemes: OHRA and OHNA, which are online extensions of HRA and HNA, respectively. The default number of mappers is 20 and the number of reducers is 5. The maximum number of aggregators is set to 4, and we also vary it to examine its impact. Key/value pairs with random data sizes within [1, 100] are generated in different slots. The total number of physical machines is set to 10, and the distance between any two machines is randomly chosen within [1, 60]. Meanwhile, the default parameter α is set to 0.5. The migration costs ψ_kk′ and φ_jj′ are defined as constants 5 and 6, respectively. The initial migration cost C_M(0) is defined as 300, and γ is set to 1000. All simulation results are averaged over 30 random instances.
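One such random online instance can be generated as sketched below, following the settings just listed (20 mappers, 5 reducers, 10 machines, distances drawn from [1, 60], per-slot key volumes from [1, 100]); the generator's shape is an assumption for illustration.

```python
import random

# Generate one random online-case instance per the settings above.
def random_instance(num_keys, num_slots, seed=0):
    rng = random.Random(seed)
    machines = range(10)
    dist = {(a, b): (0 if a == b else rng.randint(1, 60))
            for a in machines for b in machines}
    # sizes[t][(i, p)]: volume of key p produced by mapper i in time slot t
    sizes = [{(i, p): rng.randint(1, 100)
              for i in range(20) for p in range(num_keys)}
             for _ in range(num_slots)]
    return dist, sizes
```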
Fig. 13. Network traffic cost versus number of machines.

Fig. 14. Network traffic cost versus size of time interval T.

Fig. 15. Network traffic cost versus number of keys.

Fig. 16. Network traffic cost versus data reduction ratio α.
We first study the performance of all algorithms under the default network setting in Fig. 14. We observe that the network traffic increases at the beginning and then tends to become stable under our proposed online algorithm. The network traffic of OHRA and OHNA always stays stable, because OHNA obeys the same hash partition scheme with no global aggregation in every time slot, and OHRA introduces only a slight migration cost since φ_jj′ is just 6. Our proposed online algorithm keeps updating the migration cost C_M(t) and executes the distributed algorithm in different time slots, which incurs some migration cost in the process.

The influence of the number of keys on the network traffic is studied in Fig. 15. We observe that our online algorithm performs much better than the other two algorithms. In particular, when the number of keys is 50, the network traffic of the online algorithm is about 2×10^5, while the traffic of OHNA is almost 3.1×10^5, a difference of 35%.

In Fig. 16, we compare the performance of the three algorithms under different values of α. The larger α is, the lower the aggregation efficiency of the intermediate data. We observe that the network traffic increases under our online algorithm and OHRA, whereas OHNA is not affected by the parameter α because it conducts no data aggregation. When α is 1, all algorithms have similar performance, because α = 1 means no data aggregation; under the other settings, our online algorithm outperforms OHRA and OHNA due to the joint optimization of traffic-aware partition and global aggregation.

We investigate the performance of the three algorithms under different numbers of aggregators in Fig. 17. We observe that the online algorithm outperforms the other two schemes. When the number of aggregators is 6, the network traffic of the OHNA algorithm is 2.8×10^5, while our online algorithm has a network traffic of 1.7×10^5, an improvement of 39%. As the number of aggregators increases, it becomes more beneficial to aggregate intermediate data, reducing the amount of data in the shuffle process. However, when the number of aggregators is set to 0, which means no global aggregation, OHRA has the same network traffic as OHNA, and our online algorithm still achieves the lowest cost.

8 CONCLUSION

In this paper, we study the joint optimization of intermediate data partition and aggregation in MapReduce to minimize the network traffic cost for big data applications. We propose a three-layer model for this problem and formulate it as a mixed-integer nonlinear problem, which is then transformed into a linear form that can be solved by mathematical tools. To deal with the large-scale formulation due to big data, we design a distributed algorithm based on the decomposition of the original problem, and we further design an online algorithm to adjust the data partition and aggregation scheme dynamically during job execution.