An Optimized Algorithm For Reduce Task Scheduling
Xiaotong Zhang, Bin Hu, Jiafu Jiang
April 2014
Abstract: In this paper, we propose a novel algorithm to solve the starving problem of the small jobs and reduce the process time of the small jobs on the Hadoop platform. Current schedulers of MapReduce/Hadoop are quite successful in achieving data locality and schedule the reduce tasks with a greedy algorithm. Some jobs may have hundreds of map tasks and just several reduce tasks; in this case the reduce tasks of the large jobs require more waiting time, which results in the starving problem of the small jobs. Since the map tasks and the reduce tasks are scheduled separately, we can change the way the scheduler launches the reduce tasks without affecting the map phase. Therefore we develop an optimized algorithm that schedules the reduce tasks by the shortest remaining time (SRT) of the map tasks. We apply our algorithm to the fair scheduler and the capacity scheduler, which are both widely used in real production environments. The evaluation results show that the SRT algorithm can decrease the process time of the small jobs effectively.

Index Terms: MapReduce, Hadoop, scheduling, SRT

I. INTRODUCTION

As the name MapReduce implies, the reduce tasks are always performed after the map tasks. In general, the MapReduce framework contains a master node (the JobTracker) and some slave nodes (the TaskTrackers). To process a job, the client submits it to the JobTracker, which does some initial work for the job and adds it to a queue. When a TaskTracker has the capacity to run tasks, it sends a request (a heartbeat) to the JobTracker, which uses the TaskScheduler to pick some tasks of the jobs in the queue and assign them to the TaskTracker. The TaskTracker lives on each node of the Hadoop cluster and is responsible for executing tasks. It communicates with the JobTracker through a heartbeat protocol. In essence, the heartbeat is a mechanism for the TaskTrackers to announce their availability on the cluster. In addition to announcing availability, the heartbeat protocol also carries information about the state of the TaskTracker; heartbeat messages indicate whether the TaskTracker is ready for the next task or not. When the JobTracker receives a heartbeat message
innate relation with the map phase, this can lead to poor utilization.

The reduce phase includes three subphases: shuffle, sort, and reduce. In the shuffle phase, the input of a reduce task is the sorted output of the map tasks; the framework fetches the relevant partition of the output of all the map tasks via HTTP. In the sort phase, the framework groups the inputs of the reduce tasks by key (since different map tasks may have output the same key). The shuffle and sort phases occur simultaneously: as the map outputs are being fetched, they are merged. A reduce task copies all the data that belong to it from the intermediate data of all the map tasks, and sorts the data while copying. But it has to wait until all the intermediate data have been copied and sorted before it begins the third stage. That means the real reduce function cannot start until all the map tasks are done. Some jobs may have hundreds of map tasks and just several reduce tasks; in this case the reduce tasks may spend most of their time waiting for data while occupying resources. Therefore, this can lead to the starving problem of the small jobs [5].

Therefore, the role of a good scheduling algorithm is to efficiently assign each task a priority so as to minimize the makespan [6]. To solve this problem, Jian Tan et al. proposed a solution in which the scheduler gradually launches the reduce tasks depending on the map progress [7]. But it does not consider the factor of time: for example, a job with 80% of its map tasks completed may even need a longer time to finish all its map tasks than a job with only 20% of its map tasks completed. This paper aims at providing a simple way of launching the reduce tasks depending on the shortest remaining time (SRT) of the map tasks. In this way, we reduce the process time and the reduce time of the small jobs. We implement our algorithm based on the fair scheduler, which is widely used in real production environments. Our experimental results show that the reduce time of the jobs decreases a lot, especially for the small jobs (39%-61%). The process time of the small jobs can decrease by 55% at most. In addition, the process time of the large jobs can also decrease by 1%-4%.

The remainder of this paper is organized as follows. In Section II, related work on different scheduling algorithms of Hadoop is presented. In Section III, the implementation details of the SRT algorithm and its drawback are described. In Section IV, the experimental results and the performance analyses of the SRT scheduler in a real Hadoop environment are given. Finally, in Section V, conclusions and suggestions for future work are presented.

II. RELATED WORKS

In this section we briefly introduce some schedulers of Hadoop, including the built-in schedulers (the FIFO scheduler, the fair scheduler, and the capacity scheduler) and schedulers developed by others. Specially, we focus on the fair scheduler.

A. FIFO scheduler

The FIFO scheduler is the default scheduler of Hadoop. In this scheduler, all jobs are submitted to a single queue. When a TaskTracker's heartbeat comes, the scheduler simply picks the job at the head of the queue and tries to assign its tasks; if the job has no tasks to be assigned (for example, all its tasks have already been assigned), the scheduler tries the next job. The jobs in the queue are sorted by their priority, then by submission time.

The advantage of the FIFO scheduler is that the algorithm is quite simple and straightforward, and the load of the JobTracker is relatively low. But it can result in starvation when small jobs come after large jobs: a job has to wait until all the tasks (map and reduce) of the jobs before it have been launched.

B. Fair Scheduler

The fair scheduler is designed to assign resources to jobs such that each job gets roughly the same amount of resources over time. For example, when there is only one job submitted, it uses the entire cluster, and when other jobs are submitted, task slots that free up are assigned to the new jobs. In this way, the problem of the FIFO scheduler that we mention above is solved: the small jobs can be finished in reasonable time while not starving the large jobs. The fair scheduler organizes jobs into pools and shares resources fairly across all pools. By default, each user is allocated a separate pool and gets an equal share of the cluster no matter how many jobs are submitted. Within each pool, either the fair sharing or the FIFO algorithm is used for scheduling jobs. In addition to providing fair sharing, the fair scheduler allows assigning guaranteed minimum shares to pools, which is useful for ensuring that certain users, groups, or production applications always get sufficient resources. When a pool contains jobs, it gets at least its minimum share; when the pool does not need its full guaranteed share, the excess is split among the other pools. The fair scheduler implementation keeps track of the compute time of each job in the system. Periodically, the scheduler inspects the jobs to compute the difference between the compute time a job has received and the time it should have received under an ideal scheduler. The result determines the deficit of the task, and the fair scheduler ensures that the task with the highest deficit is launched next. The disadvantages of the fair scheduler are listed below:

(1) The greedy algorithm of launching reduce tasks will cause reduce starvation of the small jobs.
(2) The locality of the reduce tasks is not considered.

To solve the first problem, Jian Tan et al. proposed a scheduler called the Coupling Scheduler, which gradually launches the reduce tasks depending on the map progress [7]. But the Coupling Scheduler does not consider the factor of time: the jobs that have a larger rate of map progress may still need a longer time to finish all their map tasks. We implement our algorithm based on the shortest remaining time of the map tasks and apply it to the fair scheduler, which is widely used in real production environments. The details of our algorithm are provided in Section III.

C. Other schedulers

The capacity scheduler is designed to guarantee a minimum capacity for each organization. The central idea is that the available resources in the Hadoop cluster are partitioned among multiple organizations who collectively fund the cluster based on their computing needs. In the capacity scheduler, several queues are created, each of which is assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity). The queues are monitored, and if a queue is not consuming its allocated capacity, this excess capacity can be temporarily allocated to other queues. Given that the queues can represent a person or a larger organization, any available capacity is redistributed to other users. To improve the data locality of the fair scheduler, Matei Zaharia et al. developed a simple technique called delay scheduling [8], but it still ignored the locality of the reduce tasks. M. Hammoud et al. addressed this problem by incorporating the network locations and sizes of the partitions of the reduce tasks in the scheduling decisions, which mitigated network traffic and improved MapReduce performance [9]. Liying Li et al. [10] proposed a new improvement of the Hadoop data-locality scheduling algorithm based on LATE [11]. Besides, the authors of [12] and [13] proposed schedulers which managed jobs to meet deadlines. Similarly, schedulers suitable for real-time circumstances were proposed in [14] [15]. Shengbing Ren and Dieudonne Muheto proposed strategies to schedule tasks in OLAM (Online Analytic Mining) using MapReduce [16]. Specially, Xiaoli Wang et al. took full consideration of the relationship between the performance and the energy consumption of servers, proposed a new energy-efficient task scheduling model based on MapReduce, and gave the corresponding algorithm [17].

III. ALGORITHM DESCRIPTION

A. Algorithm Implementation

Suppose we have M jobs waiting to be processed; we use J to represent the set of them:

    J = {job_i | 1 <= i <= M}    (1)

For every job_i in J, we represent the map tasks of job_i by U_i:

    U_i = F_i ∪ R_i ∪ P_i    (2)

where F_i, R_i, and P_i represent the sets of the finished map tasks, the running map tasks, and the pending map tasks of job_i, respectively. In the fair scheduler, the pools are sorted by the fair sharing algorithm, while the jobs in a pool can be scheduled by either the FIFO algorithm or the fair sharing algorithm. We apply the SRT algorithm to the fair scheduler in this way: we change the way the fair scheduler sorts the jobs within a pool when scheduling the reduce tasks. To achieve that goal, we provide a comparator, called the SRT comparator, to compare two jobs when sorting them according to the remaining time of their map tasks. To calculate the remaining time of the map tasks of job_i, we have to estimate the remaining time of the tasks in both R_i and P_i. Let f(U_i) represent the remaining time it takes to finish all the tasks in U_i. We calculate f(U_i) as below:

    f(U_i) = max[f(R_i), f(P_i)]    (3)

In fact, we have three methods to calculate f(R_i): we can treat the running tasks as pending map tasks, or calculate the remaining time of each running task according to its progress and add them together, or simply ignore them. There is no mechanism through which we can obtain the progress of a specified running task in Hadoop, so the second method would be difficult to achieve unless we modify the underlying architecture of Hadoop. Therefore we choose the third way to calculate f(U_i). Note that two jobs with the same priority will have the same number of running tasks according to the fair scheduler. Besides, each job will have a small and limited number of running tasks when many jobs are processed simultaneously, in which case f(R_i) is much smaller than f(U_i). Suppose job_i has run for time T_i, the number of its finished map tasks is f_i, and the number of its pending map tasks is p_i; then f(U_i) can be expressed as below:

    f(U_i) = f(P_i) = { +∞                 if f_i = 0
                      { (T_i / f_i) * p_i  if f_i ≠ 0    (4)

That is, we predict the remaining time of the map tasks of a job as the average time of its finished map tasks multiplied by the number of its pending map tasks. For example, suppose that a job has been running for 4 minutes and has 2 finished map tasks and 1 pending map task; we calculate the remaining time of the map tasks as 4 min / 2 * 1 = 2 min. Some other methods to estimate the remaining time of jobs are provided in [18] [19] [20].

We let s(i) and e(i) represent the start time and end time of job_i, respectively, and let s'(i) and e'(i) represent the start time and end time of the reduce phase of job_i, respectively. Therefore,

    AverageJ = (1/M) * Σ_{i=1..M} [e(i) - s(i)]    (5)

is the average process time of all jobs, and

    AverageR = (1/M) * Σ_{i=1..M} [e'(i) - s'(i)]    (6)

is the average process time of the reduce phases of all jobs. Once the request of a TaskTracker comes, we sort the jobs in the waiting queue and assign tasks to the TaskTracker in order. The steps of assigning the reduce tasks using the SRT algorithm are shown in Fig. 1.

Step 1: The TaskTracker sends a request for the reduce tasks.
Step 2: The JobTracker forwards the request to the TaskScheduler.
Step 3: The TaskScheduler sorts the pools according to the fair scheduler. In this way the advantage of the fair scheduler is retained.
Step 4: The TaskScheduler selects a pool from the ordered pools (for example, pool 2), and sorts the jobs in pool 2 using the SRT algorithm.
Step 5: The TaskScheduler picks a reduce task of the job with the shortest remaining time of the map phase.
Step 6: The TaskScheduler returns the reduce task to the JobTracker.
Step 7: The JobTracker assigns the reduce task to the TaskTracker.

Figure 1. The steps of assigning the reduce tasks

The SRT comparator takes the status information of two jobs as input and outputs an integer which indicates the order of the two jobs. The details of the SRT comparator are shown in Algorithm 1 below, where f_k, p_k, and T_k denote the number of finished map tasks, the number of pending map tasks, and the run time of job_k.

Algorithm 1. The SRT comparator
 1: if f1 = 0 and f2 ≠ 0 then
 2:     return 1
 3: else if f1 ≠ 0 and f2 = 0 then
 4:     return -1
 5: else if f1 = 0 and f2 = 0 then
 6:     if p1 > p2 then
 7:         return 1
 8:     else if p1 < p2 then
 9:         return -1
10:     else
11:         return 0
12:     end if
13: else
14:     if p1 = 0 and p2 = 0 then
15:         if f1 > f2 then
16:             return 1
17:         else if f1 < f2 then
18:             return -1
19:         else
20:             return 0
21:         end if
22:     end if
23:     RemainingTime1 <- T1/f1 * p1
24:     RemainingTime2 <- T2/f2 * p2
25:     if RemainingTime1 > RemainingTime2 then
26:         return 1
27:     else if RemainingTime1 < RemainingTime2 then
28:         return -1
29:     else
30:         return 0
31:     end if
32: end if

With the SRT algorithm, the reduce tasks of the job with the shortest remaining time of the map phase get launched before others. The remaining time of the map phase changes dynamically according to the feedback. If two jobs have both finished their map phases, we let the job with fewer map tasks launch its reduce tasks ahead of the other. Therefore, the small jobs will not be starving and the process time of the small jobs decreases greatly. The SRT algorithm focuses on scheduling the reduce tasks, therefore it can be easily applied to many other schedulers of Hadoop.

B. Drawback

The SRT algorithm does not consider the time that the reduce tasks spend on the shuffle and sort phases, which will sometimes lead to resource underutilization. For instance, if job1 has a shorter remaining map time than job2 does, we put job1 ahead of job2 according to the SRT algorithm. But if job2 has more intermediate data to copy, then it may be better to launch the reduce tasks of job2, because these reduce tasks may spend much time copying intermediate data, and by the time they finish copying the intermediate data of the previous map tasks, all the map tasks may have finished, so the reduce tasks do not need to wait. On the contrary, job1 may have less intermediate data to copy and block waiting for new map intermediate data. Figure 2 below illustrates the problem. To remedy the defect described above, a more precise method to calculate the remaining time of the map and reduce phases is needed. Since optimal solutions are hard to compute (NP-hard) [21], we propose an approximation algorithm. In this model, the SRT algorithm will work fine and reduce the average process time of the small jobs.

Figure 2. Consider the shuffle phase and the sort phase

Figure 3. The process time of the small jobs
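The selection logic of Algorithm 1, together with the remaining-time estimate of Equation (4), can be sketched as follows. This is a minimal illustration in Python (the actual schedulers are Java classes inside Hadoop; the `Job` record and function names here are hypothetical):

```python
import math
from dataclasses import dataclass
from functools import cmp_to_key

@dataclass
class Job:
    run_time: float   # T_i: time the job has been running
    finished: int     # f_i: number of finished map tasks
    pending: int      # p_i: number of pending map tasks

def remaining_map_time(job: Job) -> float:
    """Equation (4): average finished-map time multiplied by pending tasks."""
    if job.finished == 0:
        return math.inf   # no finished maps yet, so the estimate is infinite
    return job.run_time / job.finished * job.pending

def srt_compare(j1: Job, j2: Job) -> int:
    """Return a negative value if j1 should launch its reduce tasks first."""
    if j1.finished == 0 and j2.finished != 0:
        return 1
    if j1.finished != 0 and j2.finished == 0:
        return -1
    if j1.finished == 0 and j2.finished == 0:
        # Neither job has a finished map task: fewer pending maps goes first.
        return (j1.pending > j2.pending) - (j1.pending < j2.pending)
    if j1.pending == 0 and j2.pending == 0:
        # Both map phases are done: the job with fewer map tasks goes first.
        return (j1.finished > j2.finished) - (j1.finished < j2.finished)
    r1, r2 = remaining_map_time(j1), remaining_map_time(j2)
    return (r1 > r2) - (r1 < r2)

# The example from the text: 4 minutes of running, 2 finished maps, and
# 1 pending map give an estimated 4/2*1 = 2 minutes of remaining map time.
job_a = Job(run_time=4.0, finished=2, pending=1)   # 2 min remaining
job_b = Job(run_time=4.0, finished=1, pending=3)   # 12 min remaining
queue = sorted([job_b, job_a], key=cmp_to_key(srt_compare))
```

The comparator contract (negative means "order first") mirrors the convention of Java's `Comparator`, which is what a fair-scheduler sorting hook would use.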
TABLE I. THE HARDWARE CONFIGURATION

No.  Type    CPU               Memory
0    Master  8-core, 2.13GHz   16GB
1    Slave   4-core, 2.13GHz   4GB
2    Slave   4-core, 2.13GHz   4GB
3    Slave   8-core, 2.13GHz   16GB
4    Slave   8-core, 2.13GHz   4GB
5    Slave   16-core, 2.66GHz  48GB
6    Slave   16-core, 2.66GHz  48GB
7    Slave   16-core, 2.66GHz  48GB
8    Slave   4-core, 2.13GHz   4GB
9    Slave   4-core, 2.13GHz   4GB
10   Slave   4-core, 2.13GHz   4GB
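The evaluation metrics of Equations (5) and (6), the average job process time (AverageJ) and the average reduce-phase time (AverageR), are both a mean over per-job intervals. A minimal sketch, using made-up job spans rather than measured values:

```python
def average_process_time(intervals):
    """Mean of e(i) - s(i) over all M jobs (Equations 5 and 6)."""
    return sum(end - start for start, end in intervals) / len(intervals)

# Hypothetical (start, end) times of each job, and of each job's reduce
# phase, in seconds; not values from the experiments.
job_spans = [(0, 100), (10, 70), (20, 50)]
reduce_spans = [(60, 100), (40, 70), (35, 50)]

average_j = average_process_time(job_spans)     # AverageJ, Eq. (5)
average_r = average_process_time(reduce_spans)  # AverageR, Eq. (6)
```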
!"#$%&m'(m)'$*
!"#$%&m'(m)'$*
!"#$$ %#&'(
Figure 5. The reduce time of the small jobs
Figure 8. The job process time when the small jobs come after the
large jobs
!"#$%&m'(m)'$*
Figure 9. The job process time of the small jobs under different number
of slave nodes
is 6% more than that of the fair scheduler. This may be
due to the large jobs being pushed behind the small jobs
and taking a longer time to finish.
We can see that the small jobs can get 39% decrease
of process time with the SRT algorithm. And the job
B. Small jobs come after large jobs process time of the large jobs is nearly the same under
In this experiment, we submit all small jobs after large two schedulers.
jobs. The test bed contains one master node (No.0) and
four slave nodes(No.1 No.4). We submit 10 large jobs,
wait for 5 seconds, and then submit 10 small jobs. The
time interval between two submission is used to ensure C. Scalability
that the jobs submitted earlier can be initialized before
those submitted later so that all the jobs are in the right In this experiment, we repeat the first experiment under
order we want. Figure 8 shows the results of the average different number of slave nodes to see the scalability
job process time of two kinds of jobs. of the SRT scheduler. We conduct four groups of tests,
each group has 4, 6, 8, 10 slave nodes respectively. For
example, in the fourth group of tests, the test bed contains
one master node (No.0) and ten slave nodes(No.1
No.10). In each test we submit 40 small jobs and 40 large
jobs. Each group of tests are repeated for twenty times.
Figure [9,10] show the results of the average job process
time of two kinds of jobs.
We can see that the SRT scheduler still uses less time
to finish the small jobs than the fair scheduler does. With
the number of slave nodes getting larger, the process time
of the small jobs of two schedulers is getting closer. This
is a normal phenomenon, since we can assume that we
!"#$%&m'(m)'$* have thousands of slave nodes, then all the tasks will be
running at the same time, therefore, the process time of
Figure 7. The process time of total jobs each job is invariable no matter what scheduler we use.
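The convergence argument can be illustrated with a toy list-scheduling model: once there are at least as many slots as tasks, every task starts immediately, so the makespan no longer depends on the order in which a scheduler picks tasks. The durations below are illustrative, not the paper's measurements:

```python
import heapq

def makespan(task_times, slots):
    """Greedy list scheduling: each task goes to the earliest-free slot."""
    free = [0.0] * slots        # time at which each slot becomes free
    heapq.heapify(free)
    finish = 0.0
    for t in task_times:
        start = heapq.heappop(free)
        finish = max(finish, start + t)
        heapq.heappush(free, start + t)
    return finish

tasks = [3.0, 1.0, 4.0, 1.0, 5.0]        # hypothetical task durations
few = makespan(tasks, slots=2)           # tasks queue behind one another
many = makespan(sorted(tasks), slots=10) # enough slots: order is irrelevant
```

With 2 slots the makespan is 9.0 and depends on the dispatch order, while with 10 slots it collapses to the longest single task (5.0) for every ordering, matching the observation that the two schedulers converge on large clusters.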
Figure 10. The job process time of the large jobs under different numbers of slave nodes

D. Apply the SRT algorithm to the capacity scheduler

The capacity scheduler is another popular scheduler of Hadoop, and we apply the SRT algorithm to it to see how it can reduce the process time of the small jobs. As we see from the scenario of Fig. 11, the SRT algorithm can effectively reduce the process time of the small jobs when working with the capacity scheduler.

Figure 11. The process time of the small jobs

V. CONCLUSIONS

In this paper, we propose a simple algorithm for scheduling the reduce tasks based on the shortest remaining time of the map tasks. We implement our algorithm by applying the SRT algorithm to the fair scheduler and the capacity scheduler, and validate it with several experiments. The results of the experiments show that the SRT algorithm decreases the reduce time and the process time of the small jobs without affecting the performance of the large jobs. However, the process time of the total jobs under the SRT scheduler is a little more than that under the fair scheduler. In the future, we will improve our algorithm by considering the time of the reduce phase and using a more precise method to estimate the remaining time of both the map and the reduce phases. We also plan to consider several high-level objectives, such as an intelligent scheduler based on a genetic algorithm, and energy saving.

ACKNOWLEDGMENT

This work was supported partially by the National High-Tech Research and Development Program of China under Grant No. 2011AA040101 and was jointly funded by the Fundamental Research Funds for the Central Universities of China, No. FRF-MP-12-007A.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, Jan. 2008.
[2] Apache Hadoop, http://hadoop.apache.org.
[3] Fair scheduler, http://hadoop.apache.org/docs/r1.0.4/fair scheduler.html.
[4] Capacity scheduler, http://hadoop.apache.org/mapreduce/docs/r0.21.0/capacityscheduler.html.
[5] J. Tan, X. Meng, and L. Zhang, "Delay tails in MapReduce scheduling," in Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '12), New York, NY, USA: ACM, 2012, pp. 5-16.
[6] Y. Xu, K. Li, T. T. Khac, and M. Qiu, "A multiple priority queueing genetic algorithm for task scheduling on heterogeneous computing systems," Liverpool, United Kingdom, 2012, pp. 639-646.
[7] J. Tan, X. Meng, and L. Zhang, "Coupling scheduler for MapReduce/Hadoop," in Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '12), New York, NY, USA: ACM, 2012, pp. 129-130.
[8] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), New York, NY, USA: ACM, 2010, pp. 265-278.
[9] M. Hammoud and M. Sakr, "Locality-aware reduce task scheduling for MapReduce," in Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, 2011.
[10] L. Li, Z. Tang, R. Li, and L. Yang, "New improvement of the Hadoop relevant data locality scheduling algorithm based on LATE," in Mechatronic Science, Electric Engineering and Computer (MEC), 2011 International Conference on, Aug. 2011, pp. 1419-1422.
[11] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," in OSDI, vol. 8, no. 4, 2008, p. 7.
[12] K. Kc and K. Anyanwu, "Scheduling Hadoop jobs to meet deadlines," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, 2010, pp. 388-392.
[13] L. Liu, Y. Zhou, M. Liu, G. Xu, X. Chen, D. Fan, and Q. Wang, "Preemptive Hadoop jobs scheduling under a deadline," in Semantics, Knowledge and Grids (SKG), 2012 Eighth International Conference on, 2012, pp. 72-79.
[14] X. Dong, Y. Wang, and H. Liao, "Scheduling mixed real-time and non-real-time applications in MapReduce environment," in Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, 2011, pp. 9-16.
[15] L. Phan, Z. Zhang, Q. Zheng, B. T. Loo, and I. Lee, "An empirical analysis of scheduling techniques for real-time cloud-based data processing," in Service-Oriented Computing and Applications (SOCA), 2011 IEEE International Conference on, 2011, pp. 1-8.