Abstract—Datacenter-scale clusters are evolving toward heterogeneous hardware architectures due to continuous server replacement. Meanwhile, datacenters are commonly shared by many users for quite different purposes, and they often exhibit significant performance heterogeneity due to multi-tenant interference. Deploying MapReduce on such heterogeneous clusters presents significant challenges in achieving good application performance compared to in-house dedicated clusters. As most MapReduce implementations are originally designed for homogeneous environments, heterogeneity can cause significant performance deterioration in job execution despite existing optimizations on task scheduling and load balancing. In this paper, we observe that the homogeneous configuration of tasks on heterogeneous nodes can be an important source of load imbalance and thus cause poor performance. Tasks should be customized with different configurations to match the capabilities of heterogeneous nodes. To this end, we propose a self-adaptive task tuning approach, Ant, that automatically searches for the optimal configurations of individual tasks running on different nodes. In a heterogeneous cluster, Ant first divides nodes into a number of homogeneous subclusters based on their hardware configurations. It then treats each subcluster as a homogeneous cluster and independently applies the self-tuning algorithm to it. Ant initially configures tasks with randomly selected configurations and gradually improves task configurations by reproducing the configurations of the best performing tasks and discarding poorly performing configurations. To accelerate task tuning and avoid being trapped in a local optimum, Ant uses a genetic algorithm during adaptive task configuration. Experimental results on a heterogeneous physical cluster with varying hardware capabilities show that Ant improves the average job completion time by 31%, 20%, and 14% compared to stock Hadoop (Stock), customized Hadoop with industry recommendations (Heuristic), and a profiling-based configuration approach (Starfish), respectively. Furthermore, we extend Ant to virtual MapReduce clusters in a multi-tenant private cloud. Specifically, Ant characterizes a virtual node based on two measured performance statistics: I/O rate and CPU steal time. It uses the k-means clustering algorithm to classify virtual nodes into configuration groups based on the measured dynamic interference. Experimental results on virtual clusters with varying interferences show that Ant improves the average job completion time by 20%, 15%, and 11% compared to Stock, Heuristic, and Starfish, respectively.
Index Terms—MapReduce Performance Improvement, Self-Adaptive Task Tuning, Heterogeneous Clusters, Genetic Algorithm
tasks even in a heterogeneous environment. Task-level parameters control the fine-grained task execution on individual nodes and can possibly be changed independently and on-the-fly at runtime. Parameter tuning at the task level opens up opportunities for improving performance in heterogeneous environments and is our focus in this work.

Hadoop installations pre-set the configuration parameters to default values assuming a reasonably sized cluster and typical MapReduce jobs. These parameters should be specifically tuned for a target cluster and individual jobs to achieve the best performance. However, there is very limited information on how the optimal settings can be determined. There exist rule-of-thumb recommendations from industry leaders (e.g., Cloudera [10] and MapR [11]) as well as academic studies [4], [6]. These approaches cannot be universally applied to a wide range of applications or heterogeneous environments. In this work, we develop an online self-tuning approach for task-level configuration. Next, we provide motivating examples to show the necessity of configuration tuning for heterogeneous workloads and hardware platforms.

TABLE 1. The three types of machines in the heterogeneous cluster.

2.2 Motivating Examples
We created a heterogeneous Hadoop cluster composed of the three types of machines listed in Table 1. Three MapReduce applications from the PUMA benchmark [12], i.e., WordCount, Terasort and Grep, each with 300 GB of input data, were run on the cluster. We configured each slave node with four map slots and two reduce slots, and the HDFS block size was set to 256 MB. The heap size mapred.child.java.opts was set to 1 GB and other parameters were set to the default values. We measured the map task completion time in two different scenarios: heterogeneous workload on homogeneous hardware and homogeneous workload on heterogeneous hardware. We show that heterogeneity in either the workload or the hardware makes the determination of the optimal task configuration difficult.

Fig. 2. The optimal task configuration changes with workloads and platforms.

Figure 2(a) shows the average map completion times of the three heterogeneous benchmarks on a homogeneous cluster consisting only of the T110 machines. The completion times changed as we altered the value of parameter io.sort.record.percent. The figure shows that wordcount, terasort, and grep achieved their minimum completion times when the parameter was set to 0.4, 0.2, and 0.6, respectively. Figure 2(b) shows the performance of wordcount on machines with different hardware configurations. Map completion times varied as we changed the value of parameter io.sort.mb. The figure suggests that uniform task configurations do not lead to optimal performance in a heterogeneous environment. For example, map tasks achieved the best performance on the Atom machine when the parameter was set to 125, while the optimal completion time on the T420 machine was obtained when the parameter was set to 275.

[Summary] We have shown that the performance of Hadoop applications can be substantially improved by tuning task-level parameters for heterogeneous workloads and platforms. However, parameter optimization is an error-prone process involving complex interplays among the job, the Hadoop framework and the architecture of the cluster. Furthermore, manual tuning remains difficult due to the large parameter search space. As many MapReduce jobs are recurring or have multiple waves of task execution, it is possible to learn the best task configurations based on the feedback of previous runs. These observations motivated us to develop a task self-tuning approach that automatically configures parameters for various Hadoop jobs and platforms in an online manner.

3 ANT DESIGN AND ASSUMPTIONS

3.1 Architecture
Fig. 3. The architecture of Ant.

Ant is a self-tuning approach for multi-wave MapReduce applications, in which job executions consist of several rounds of map and reduce tasks. Unlike traditional MapReduce implementations, Ant centers on two key designs: (1) tasks belonging to the same job run with different configurations matching the capabilities of the hosting machines; (2) the configurations of individual
only select parents with fitness scores that exceed the mean.
Aggressive selection algorithm. Based on the above two aggressive selection strategies, the parent selection of the self-tuning reproduction is operated by an integrated selection algorithm. As shown in Algorithm 2, Ant first selects the best configuration candidate with the highest fitness in the last interval as one of the reproduction parents. Then it selects another configuration set from the candidates with fitness scores that exceed the mean by applying the Roulette Wheel approach. Finally, Ant generates the new generation of configuration candidates by using the two selected candidates (i.e., C_best and C_WR) as the reproduction parents.
Furthermore, the aggressive selection strategies also reduce the impact of task skew during the tuning process. Long tasks caused by data skew may add noise to the proposed GA-based task tuning. By taking advantage of the aggressive selection, only the best configurations are used to generate new configurations, and it is unlikely that skewed tasks would be selected as reproduction candidates. Thus, Ant can find the best configurations for the majority of tasks.
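The following is a minimal, illustrative sketch of the integrated selection step described above, not Ant's actual implementation; candidate configurations and their fitness scores are assumed to be available as plain Python values, and names such as select_parents are hypothetical.

```python
import random

def roulette_wheel(candidates, fitness):
    """Pick one candidate with probability proportional to its fitness."""
    total = sum(fitness[c] for c in candidates)
    pick = random.uniform(0, total)
    acc = 0.0
    for c in candidates:
        acc += fitness[c]
        if acc >= pick:
            return c
    return candidates[-1]

def select_parents(fitness):
    """Integrated aggressive selection: the best candidate (elitist) plus a
    roulette-wheel pick among candidates whose fitness exceeds the mean."""
    candidates = list(fitness)
    c_best = max(candidates, key=lambda c: fitness[c])            # elitist parent
    mean_fit = sum(fitness.values()) / len(fitness)
    above_mean = [c for c in candidates if fitness[c] > mean_fit] or [c_best]
    c_wr = roulette_wheel(above_mean, fitness)                    # roulette-wheel parent
    return c_best, c_wr

# Example: fitness scores of task configurations observed in the last interval.
fitness = {"cfg_a": 0.42, "cfg_b": 0.77, "cfg_c": 0.51, "cfg_d": 0.93}
print(select_parents(fitness))
```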
5 MOVING ANT INTO THE CLOUD
Cloud computing offers users the ability to access large pools of computational and storage resources on demand. Using the cloud computing paradigm, developers can perform parallel data processing in a distributed and scalable environment at affordable cost and in reasonable time. In particular, MapReduce deployment in the cloud allows enterprises to cost-effectively analyze large amounts of data without creating large infrastructures of their own [15]. Using virtual machines (VMs) and storage hosted by the cloud, enterprises can simply create virtual MapReduce clusters to analyze their data. However, tuning the jobs hosted in virtual MapReduce clusters is a significant challenge to users. This is due to the fact that even virtual nodes configured with the same virtual hardware can have varying capacities due to interference from co-located users [16], [17], [18]. When running Hadoop in a virtual cluster, finding nodes with similar hardware capability is more challenging and complex than in a physical cluster.
When moving Ant into virtual MapReduce clusters, it is clearly inaccurate to divide nodes into a number of homogeneous subclusters based on their static hardware configurations. The cloud offers elasticity to slice large, underutilized physical servers into smaller, parallel virtual machines, enabling diverse applications to run in isolated environments on a shared hardware platform. As a result, such a cloud environment leads to various interferences among the different applications. In this work, we focus on two representative interference scenarios in the cloud, stable interference and dynamic interference, and extend Ant to virtual MapReduce clusters accordingly.
• In the first scenario of static interference, Ant only classifies VMs at the beginning of the adaptation process because interferences are relatively stable during the job execution process.
• In the second scenario of dynamic interference, the background interference could change and the performance of virtual nodes changes over time. Thus, a re-clustering of the virtual nodes is needed to form new tuning groups. Ant adapts to the variations of VM performance by periodically performing re-grouping if significant performance changes in VMs are detected.
In the following, we describe how to classify a virtual MapReduce cluster into a number of homogeneous subclusters in the above two scenarios.

5.1 Stable Interference Scenario
Ant estimates the actual capabilities of virtual nodes based on low-level utilizations of virtual resources. A previous study found that MapReduce jobs are mostly bottlenecked by the slow processing of a large amount of data [19]. An excessive I/O rate and a lack of CPU allocation are signs of slow processing. Ant therefore characterizes a virtual node based on two measured performance statistics: I/O rate and CPU steal time. Both statistics can be measured at the TaskTracker of individual nodes.
Ant monitors the number of data bytes written to disk during the execution of a task. Since there is little data reuse in MapReduce jobs, the volume of writes is a good indicator of I/O access rate and memory demand. When the in-memory buffer, controlled by parameter mapred.job.shuffle.merge.percent, runs out or reaches the threshold number of map outputs mapred.inmem.merge.threshold, it is merged and spilled to disk. The spilled records include both map and reduce spills. The TaskTracker automatically records the data size and spilling duration of each task, and we calculate the I/O access rate of individual nodes from these data spilling operations.
The CPU steal time is the amount of time during which the virtual node is ready to run but fails to obtain CPU cycles because the hypervisor is serving other users. It reflects the actual CPU time allocated to the virtual node and can effectively calibrate the configuration of the virtual hardware according to the experienced interference. We record the real-time CPU steal time by running the Linux top command on each virtual machine, and the TaskTracker reports this information to the master node via the heartbeat connection.
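As an illustration only (not Ant's code), the sketch below shows one way the two per-node statistics could be derived: an I/O rate from recorded spill sizes and durations, and a steal-time fraction read from /proc/stat, which exposes the same steal counter that top displays. The function names and example values are assumptions.

```python
import time

def io_rate_from_spills(spill_records):
    """spill_records: list of (bytes_written, duration_seconds) per spill,
    e.g. collected from TaskTracker spill logs. Returns bytes per second."""
    total_bytes = sum(b for b, _ in spill_records)
    total_time = sum(d for _, d in spill_records)
    return total_bytes / total_time if total_time > 0 else 0.0

def read_cpu_counters():
    """Read the aggregate 'cpu' line of /proc/stat.
    Fields: user nice system idle iowait irq softirq steal guest guest_nice."""
    with open("/proc/stat") as f:
        fields = f.readline().split()
    values = [int(v) for v in fields[1:]]
    return sum(values), values[7]          # (total jiffies, steal jiffies)

def steal_time_fraction(interval=5.0):
    """Fraction of CPU time stolen by the hypervisor over a sampling interval."""
    total0, steal0 = read_cpu_counters()
    time.sleep(interval)
    total1, steal1 = read_cpu_counters()
    dtotal = total1 - total0
    return (steal1 - steal0) / dtotal if dtotal > 0 else 0.0

# Example: characterize this virtual node with a 2-dimensional vector.
spills = [(256 * 1024 * 1024, 3.1), (128 * 1024 * 1024, 1.6)]   # illustrative values
node_vector = (io_rate_from_spills(spills), steal_time_fraction())
print(node_vector)
```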
Ant uses the k-means clustering algorithm to classify virtual nodes into configuration groups. Formally, given a set of VMs (M_1, M_2, ..., M_m), where each observation is a 2-dimensional real vector (i.e., I/O rate and CPU steal time), k-means clustering aims to partition the m observations into k (k ≤ m) sets S = {S_1, S_2, ..., S_k} so as to minimize the within-cluster sum of squared distances to the group centroids.
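A minimal sketch of this grouping step follows, assuming the per-VM (I/O rate, CPU steal time) vectors have already been collected; it uses a plain Lloyd-style k-means with normalized features and is only illustrative of the classification described above, not Ant's implementation.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm. points: (m, 2) array of (io_rate, steal_time)."""
    rng = np.random.default_rng(seed)
    # Normalize each feature so I/O rate and steal time are comparable.
    x = (points - points.mean(axis=0)) / (points.std(axis=0) + 1e-9)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each VM to its nearest centroid.
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: six VMs characterized by (I/O rate in MB/s, steal-time fraction).
vms = np.array([[55.0, 0.02], [60.0, 0.03], [30.0, 0.15],
                [28.0, 0.18], [12.0, 0.40], [10.0, 0.45]])
labels, _ = kmeans(vms, k=3)
print(labels)   # VMs with the same label form one homogeneous subcluster
```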
on the runtime. Fixing the number of groups in a cloud cluster may lead to poor quality of grouping and performance degradation. Thus, we propose a modified k-means method that finds the number of groups based on the quality of the grouping output on the fly. It relies on the following two performance metrics to qualify the grouping of similar VMs in a virtual MapReduce cluster.
• inter measures the separation of the homogeneous VM groups. It is the sum of the distances between the group centroids, defined as $inter = \sum \|\mu_i - \mu_x\|^2$, $i, x \in k$.
• intra measures the compactness of individual homogeneous VM groups. Here, we use the standard deviation as intra to evaluate the compactness of the data points in each group of VMs. It is defined as $intra = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(M_i - \mu_i)^2}$, where $\mu_i$ is the centroid of the group that VM $M_i$ belongs to.

As shown in Algorithm 3, users have the flexibility either to fix the number of groups or to specify a minimum number of groups. In the former case, the method works in the same way as the traditional k-means algorithm; the value of k is empirically determined based on historical workload records and MapReduce multi-wave behavior analysis. In the latter case, it lets the algorithm

re-grouping should be more frequent correspondingly so as to capture dynamic capacity changes of the virtual nodes.
• If the size of the virtual cluster is small, re-grouping should be less frequent. This is because it takes a certain amount of time to find favorable task configurations after each re-grouping adjustment, which may outweigh the benefit of re-grouping. On the contrary, re-grouping needs to be more frequent when the size of the virtual cluster is large.
Based on the above analysis, we formally formulate the dynamic capacity change of the virtual nodes based on the two performance metrics (i.e., inter and intra) as

$$Dynamic(t) = \frac{intra(t)}{intra(t-1)} \times \frac{inter(t-1)}{inter(t)}. \qquad (2)$$

Either an increase of intra or a decrease of inter indicates that the dynamic capacity changes of the VMs have become significant: growth of intra means the compactness of individual homogeneous VM groups becomes relaxed, and a reduction of inter means the separation of homogeneous VM groups becomes unstable, so re-grouping becomes necessary. We periodically monitor the two performance metrics.
TABLE 3
The characteristics of benchmark applications used in our experiments.

Category    Type            Label      Input size (GB)   Input data   # Maps          # Reduces
Wordcount   CPU intensive   J1/J2/J3   100/300/900       Wikipedia    400/1200/3600   14/14/14
Grep        I/O intensive   J4/J5/J6   100/300/900       Wikipedia    400/1200/3600   14/14/14
Terasort    I/O intensive   J7/J8/J9   100/300/900       TeraGen      400/1200/3600   14/14/14
Fig. 7. Job completion time on the physical cluster (JCT normalized to Stock for jobs J1-J9; Stock, Heuristic, Starfish, and Ant).
Fig. 8. Task-level parameter search process (value of io.sort.record.percent over time, in minutes): (a) Atom; (b) T110.
Fig. 11. Impact on task-level performance: (a) breakdown of JCT; (b) task completion times (Wordcount, Grep, Terasort).
Fig. 12. Comparison of different tuning approaches in a heterogeneous cluster: (a) without clustering; (b) with clustering.
Fig. 13. Performance improvement comparison between the static grouping and dynamic re-grouping approaches: (a) improvement comparison (normalized JCT per workload); (b) static k value selection (number of groups k); (c) dynamic re-grouping adjustment (group number over time, in minutes).
compared to stock Hadoop, customized Hadoop with industry recommendations, and a profiling-based configuration approach, respectively. Experimental results on two virtual cloud clusters with varying multi-tenancy interferences show that Ant improves the average job completion time by 20%, 15%, and 11% compared to Stock, Heuristic and Starfish, respectively. Ant can be deployed to different types of clusters, and is thus flexible and adaptive.
Our method Ant can be extended to other frameworks such as Spark, though some additional effort is needed. Different from Hadoop, which executes individual tasks in separate JVMs, Spark uses executors to host multiple tasks on worker nodes. To extend Ant to Spark, we would need to dynamically change executor sizes without restarting a launched job. Since running Spark on a generic cluster management middleware such as YARN is becoming increasingly popular, it is possible to enable malleable executors using resource containers. As such, Ant can monitor the completion times of individual tasks and use such information as feedback to determine the optimal size of Spark executors. In the future, we also plan to extend Ant to multi-tenant public cloud environments such as Azure and EC2.

ACKNOWLEDGEMENT
This research was supported in part by U.S. NSF research grants CNS-1422119, CNS-1320122, CNS-1217979, and NSF of China research grant 61328203. A preliminary version of the paper appeared in [9]. The authors are grateful to the editor and anonymous reviewers for their valuable suggestions for revising the manuscript.

REFERENCES
[1] F. Ahmad, S. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, "Tarazu: Optimizing mapreduce on heterogeneous clusters," in Proc. of Int'l Conf. on Architecture Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[2] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving mapreduce performance in heterogeneous environments," in Proc. of USENIX Symposium on Operating System Design and Implementation (OSDI), 2008.
[3] H. Herodotou, F. Dong, and S. Babu, "No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics," in Proc. of ACM Symposium on Cloud Computing (SoCC), 2011.
[4] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A self-tuning system for big data analytics," in Proc. of Conference on Innovative Data Systems Research (CIDR), 2011.
[5] H. Herodotou and S. Babu, "Profiling, what-if analysis, and cost-based optimization of mapreduce programs," in Proc. of Int'l Conf. on Very Large Data Bases (VLDB), 2011.
[6] P. Lama and X. Zhou, "AROMA: Automated resource allocation and configuration of mapreduce environment in the cloud," in Proc. of Int'l Conf. on Autonomic Computing (ICAC), 2012.
[7] M. Li, L. Zeng, S. Meng, J. Tan, L. Zhang, N. Fuller, and A. R. Butt, "MROnline: Mapreduce online performance tuning," in Proc. of ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2014.
[8] T. White, Hadoop: The Definitive Guide, 3rd ed. O'Reilly Media / Yahoo Press, 2012.
[9] D. Cheng, J. Rao, Y. Guo, and X. Zhou, "Improving mapreduce performance in heterogeneous environments with adaptive task tuning," in Proc. of ACM/IFIP/USENIX Int'l Middleware Conference (Middleware), 2014.
[10] Cloudera, "Configuration parameters," 2012, http://blog.cloudera.com/blog/author/aaron/.
[11] MapR, "The executive's guide to big data," 2013, http://www.mapr.com/resources/white-papers.
[12] "PUMA: Purdue mapreduce benchmark suite," 2012.
[13] Y. Guo, J. Rao, and X. Zhou, "iShuffle: Improving hadoop performance with shuffle-on-write," in Proc. of Int'l Conference on Autonomic Computing (ICAC), 2013.
[14] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Trans. on Evolutionary Computation, vol. 6, pp. 182-197, 2002.
[15] B. Igou, "User survey analysis: Cloud-computing budgets are growing and shifting; traditional IT services providers must prepare or perish," Gartner Report, 2010.
[16] D. Carrera, M. Steinder, I. Whalley, J. Torres, and E. Ayguadé, "Autonomic placement of mixed batch and transactional workloads," IEEE Trans. on Parallel and Distributed Systems (TPDS), 2012.
[17] B. Palanisamy, A. Singh, and L. Liu, "Cost-effective resource provisioning for mapreduce in a cloud," IEEE Trans. on Parallel and Distributed Systems (TPDS), 2014.
[18] B. Sharma, T. Wood, and C. R. Das, "HybridMR: A hierarchical mapreduce scheduler for hybrid data centers," in Proc. of IEEE Int'l Conference on Distributed Computing Systems (ICDCS), 2013.
[19] R. C. Chiang and H. H. Huang, "Interference-aware scheduling for data-intensive applications in virtualized environments," in Proc. of Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[20] B. Cho, M. Rahman, T. Chajed, I. Gupta, C. Abad, N. Roberts, and P. Lin, "Natjam: Eviction policies for supporting priorities and deadlines in mapreduce clusters," in Proc. of ACM Symposium on Cloud Computing (SoCC), 2013.
[21] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," in Proc. of Int'l Conf. on Very Large Data Bases (VLDB), 2010.
[22] X. Shi, M. Chen, L. He, X. Xie, L. Lu, Y. Jin, H. Chen, and S. Wu, "Mammoth: Gearing hadoop towards memory-intensive mapreduce applications," IEEE Trans. on Parallel and Distributed Systems (TPDS), 2015.
[23] A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich, "Trojan data layouts: Right shoes for a running elephant," in Proc. of ACM Symposium on Cloud Computing (SoCC), 2011.
[24] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache hadoop YARN: Yet another resource negotiator," in Proc. of ACM Symposium on Cloud Computing (SoCC), 2013.
[25] K. Kambatla, A. Pathak, and H. Pucha, "Towards optimizing hadoop provisioning in the cloud," in Proc. of USENIX HotCloud Workshop, 2009.
[26] Y. Guo, J. Rao, C. Jiang, and X. Zhou, "Moving mapreduce into the cloud with flexible slot management," in Proc. of IEEE/ACM Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2014.
[27] D. Carrera, M. Steinder, I. Whalley, J. Torres, and E. Ayguadé, "Enabling resource sharing between transactional and batch workloads using dynamic application placement," in Proc. of ACM/IFIP/USENIX Int'l Middleware Conference (Middleware), 2008.
[28] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K. Wu, and A. Balmin, "FLEX: A slot allocation scheduling optimizer for mapreduce workloads," in Proc. of ACM/IFIP/USENIX Int'l Middleware Conference (Middleware), 2010.
[29] R. Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson, and A. Rowstron, "Scale-up vs scale-out for hadoop: Time to rethink?" in Proc. of ACM Symposium on Cloud Computing (SoCC), 2013.
[30] X. Li, Y. Wang, Y. Jiao, C. Xu, and W. Yu, "CooMR: Cross-task coordination for efficient data management in mapreduce programs," in Proc. of Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013.
[31] Z. Li, Y. Cheng, C. Liu, and C. Zhao, "Minimum standard deviation difference-based thresholding," in Proc. of Int'l Conference on Measuring Technology and Mechatronics Automation, 2010.
[32] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource provisioning framework for mapreduce jobs with performance goals," in Proc. of ACM/IFIP/USENIX Int'l Middleware Conference (Middleware), 2011.
Fig. 15. Impact of mapred.reduce.tasks (normalized JCT as the number of reduce tasks varies from 1 to 98).
Fig. 17. The sensitivity of task-level parameters (maximized improvement when each of the task-level parameters g1-g6 is tuned separately).

APPENDIX A
IMPACT OF CLUSTER/JOB-LEVEL PARAMETERS
Currently, cluster- and job-level configurations mostly rely on personal experience and/or job-profiling based approaches. We quantify the performance impact of an important job-level parameter, mapred.reduce.tasks, and an important cluster-level parameter, dfs.block.size, in the experiment.
Impact of reduce number. Figure 15 quantifies the performance impact of one of the most important job-level parameters, mapred.reduce.tasks. The completion times changed as we altered the value of the parameter. Our experimental results suggest that the number of reduce tasks should be set to 1-5 times the number of available reduce slots in the cluster. The default setting of a single reduce task is the worst, while too many reduce tasks can also lead to performance degradation. Thus, for fairness, we set the value of parameter mapred.reduce.tasks to 0.9 times the number of available reduce slots in the cluster for all jobs in our experiments.
Impact of block size. Figure 16 shows how the job completion times varied as we changed the value of parameter dfs.block.size. Our experimental results suggest that the HDFS block size should be set between 64 MB and 512 MB in the cluster. The default value of 64 MB is not an optimal setting, while too large a block size can also lead to performance degradation. This is due to the fact that the number of map tasks and the completion time of map tasks are mainly determined by the block size of the input dataset. If most map tasks

APPENDIX B
DISCUSSIONS ON OVERHEAD AND MULTI-JOB SUPPORT
Overhead of Ant. Although Ant relies on a centralized control framework, it does not launch any new process for the controller. Instead, it only sends a tiny per-task configuration description file to TaskTrackers when initializing each task. These task-level configuration files are distributed to individual slave nodes along with LaunchTaskAction through Hadoop's default RPC connections. We find that the self-tuning optimization algorithm takes approximately 25-40 ms to complete in our testbed. This overhead is negligible compared to the control interval of 5 minutes in the experiments.
Multi-job support. As MapReduce scales to ever-larger platforms, concerns about cluster utilization grow [27]. Multi-tenancy support becomes an essential need in MapReduce. However, sharing a MapReduce cluster among multiple jobs from different users poses significant challenges in resource allocation and job configuration. For instance, interference among different jobs may create straggler tasks that significantly slow down the entire job execution and may lead to severe performance degradation [28].
Parameter sensitivity. Ant is designed to tune multiple task-level parameters jointly. Figure 17 shows that tuning different parameters can lead to different performance improvements. The improvement in job completion time varies from 10% to 37% when the six parameters are tuned separately. However, tuning parameters independently to their respective optimal values does not guarantee the best job performance, since the parameters are highly correlated and should be tuned together to improve job performance.
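The following is a small illustrative sketch of the sizing relationships discussed in Appendix A above (the 0.9x reduce-slot rule used in the experiments, and the way the block size determines the number of map tasks); the helper names and the example figures are hypothetical.

```python
import math

def reduce_task_count(total_reduce_slots, factor=0.9):
    """Number of reduce tasks, set to roughly 0.9x the available reduce slots."""
    return max(1, int(factor * total_reduce_slots))

def map_task_count(input_size_bytes, dfs_block_size_bytes):
    """Each HDFS block of the input is processed by one map task, so the
    block size largely determines the number (and length) of map tasks."""
    return math.ceil(input_size_bytes / dfs_block_size_bytes)

GB = 1024 ** 3
MB = 1024 ** 2

# Example: 16 reduce slots in total, 300 GB input with 256 MB blocks.
print(reduce_task_count(16))               # -> 14
print(map_task_count(300 * GB, 256 * MB))  # -> 1200
```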
TABLE 4
Performance improvement of Microsoft workload due to different approaches.

Input size      1-100 GB   0.1-1 TB   1-10 TB   Total
Job fraction    40%        20%        10%       70%
JCT_Heuristic   2.1 h      6.3 h      11.7 h    19.1 h
JCT_Starfish    2 h        5.9 h      11.3 h    18.2 h
JCT_Ant         2.3 h      5.3 h      9.1 h     16.7 h

TABLE 5
The number of spills (J3, J6 and J9).

Approach    Wordcount     Grep         Terasort      Improvement
Stock       2.4 × 10^4    4.7 × 10^4   1.7 × 10^5    baseline
Heuristic   1.9 × 10^4    3.8 × 10^4   1.4 × 10^5    18%
Starfish    1.75 × 10^4   3.6 × 10^4   1.25 × 10^5   23%
Ant         1.4 × 10^4    2.9 × 10^4   0.95 × 10^5   42%

Fig. 18. Effectiveness of aggressive selection in terms of job completion time and convergence rate: (a) job completion time; (b) convergence rate (elitist vs. non-aggressive selection).

APPENDIX C
ANT UNDER MICROSOFT WORKLOAD
To better understand the effectiveness of Ant in a production environment, we analyzed 174,000 jobs submitted to a production analytics cluster in a Microsoft datacenter in a single month in 2011 [29]. We use a synthetic workload, "Microsoft-Derived (MSD)", which models Microsoft's production workload. It is a scaled-down version of the workload studied in [29] since our cluster is significantly smaller. MSD contains jobs with widely varying execution times and data set sizes, representing a scenario where the cluster is used to run many different types of batch applications. We do not run the Microsoft code itself. Rather, we mimic the distribution characteristics of the job sizes by running the Terasort application with various input data sizes. We scale the workload down in two ways: we reduce the overall number of jobs to 87, and we eliminate the largest 10% of jobs and the smallest 20% of jobs.
Table 4 shows the job size distribution of the MSD workload used in the experiment and the job completion times achieved by different tuning approaches. The results demonstrate that Ant reduces the overall job completion time of the MSD workload by 12.5% and 8% compared with Heuristic and Starfish, respectively.

APPENDIX D
REDUCING NUMBER OF DATA SPILLS
Ant is designed to optimize task-level performance, particularly I/O operations during the map phase. The data spilling operations play an important role in the map phase and usually consume most of the execution time of map task processing [30]. Ant aims to reduce the number of map output data spills by tuning the buffer-related parameters during execution, e.g., io.sort.mb and io.sort.spill.percent. To evaluate the effectiveness, we measure the number of data spills of each map task by analyzing the spilling logs in the experiments. Table 5 shows that Ant reduces the average number of data spills by 42%, while Heuristic and Starfish only achieve reductions of 18% and 23%, compared with the stock Hadoop configuration. This is due to the fact that Ant optimizes the intermediate data spilling to satisfy the individual needs of different workloads and platforms in the heterogeneous cluster.

APPENDIX E
EFFECTIVENESS OF AGGRESSIVE SELECTION
We compare the performance impact of applying different selection strategies in Ant (described in Section 4) from the perspectives of job completion time and convergence rate, respectively. The experimental results demonstrate that the aggressive selection approach of Ant can effectively improve application performance for various workloads.
Figure 18(a) compares the job completion times achieved by different parent selection approaches. It shows that the elitist strategy improves the job completion time by 9% and the inferior strategy improves the job completion time by 17% compared with the non-aggressive strategy, respectively. Figure 18(a) also illustrates that applying the two strategies (i.e., elitist and inferior) together improves the job completion time by 22% on average for the three applications compared with applying the non-aggressive strategy.
Figure 18(b) illustrates the impact on convergence rate in terms of the standard deviation [31] when using the aggressive selection strategies of Ant. It shows that non-aggressive selection has the worst convergence rate, since the Roulette Wheel selection mechanism still has a low probability of selecting bad candidates as parents. However, both the elitist and inferior strategies improve
the convergence rate slightly in terms of the standard deviation compared with the non-aggressive strategy. Furthermore, the results demonstrate that applying the inferior and elitist strategies together can effectively accelerate the convergence rate during the job execution. Thus, we integrate the two strategies in the task evolution of Ant to improve its searching performance.
Furthermore, please refer to Appendix A for the impact of cluster- and job-level parameters on Ant performance, and Appendix B for a discussion of the overhead and multi-job support.

Fig. 19. Impact of environments (normalized JCT of Heuristic, Starfish, and Ant on the physical and virtual clusters).
Fig. 20. Task execution comparison (map and reduce task completion times for Heuristic, Starfish, and Ant).

APPENDIX F
IMPROVEMENT ON TASK EXECUTION
Reducing task execution time. Figure 20 shows the improvement comparison in terms of the average task completion time by Ant, Heuristic and Starfish on the physical cluster, respectively. It demonstrates that the map task completion time improvements are much larger than the reduce task improvements for all three configuration approaches. Ant outperforms Heuristic and Starfish significantly for map task performance improvement while obtaining similar reduce task performance improvement. This is due to the fact that Ant focuses on optimizing the intermediate data merging operation of multi-wave map tasks. Most production jobs have only one reduce wave since the number of reduce tasks in a job is typically set to 0.9 times the number of available reduce slots. Hence, map tasks usually occur in multiple waves, while reduce tasks tend to complete in one to two waves for most real-world workloads. Accordingly, Ant focuses on improving the execution time of map tasks, aiming to take advantage of the multi-wave behavior of map tasks. Understanding the wave behavior of tasks, such as the number of waves and the size of waves, would aid task configuration to improve cluster utilization.

APPENDIX G
STABLE INTERFERENCE SCENARIO
Experiment setup. In our testbed there are 24 VMs, each with 2 vCPUs, 4 GB RAM and 80 GB of hard disk space, which are hosted on three blade servers in our university cloud. Each blade server also hosts a representative transactional application, RUBiS, as the interference producer, which models an online auction site. We created three VMs, each with 4 vCPUs, 8 GB RAM and 80 GB of hard disk space, for the RUBiS benchmark application on the three servers, respectively. The number of concurrent users is set to 800, 2500 and 4500 for the three servers.
We empirically set the number of subclusters to three in this experiment since all the VMs used in the experiment are hosted on three blade servers in our university cloud. We then divide the virtual cluster into three subclusters by using the k-means clustering approach. VMs having similar CPU steal time and I/O rate are classified into the same subcluster. We obtain three subclusters as follows: a subcluster of 4 VMs with low steal time and high I/O rate, a subcluster of 8 VMs with medium steal time and I/O rate, and a subcluster of 12 VMs with high steal time and low I/O rate. Note that each subcluster is treated as an independent homogeneous cluster to evolve the task-level optimal configuration set.
Effectiveness of Ant with stable interference. Ant improves the average job completion time by 23%, 11% and 16% compared with Stock, Heuristic and Starfish on the virtual cluster, respectively. However, the performance of Starfish is slightly worse than that of Heuristic in this scenario since performance interference in the virtual environment significantly reduces the accuracy of Starfish job profiling.
Figure 19 compares the job completion time improvement achieved by Ant, Heuristic and Starfish under the physical and virtual environments, respectively. The results demonstrate that the average job completion time of Ant on the physical cluster is improved by 8% compared with that on the virtual cluster. This shows that Ant is more effective in the physical environment, due to the fact that MapReduce's perception of cluster topology and resource isolation may differ from the actual hardware settings when running in a virtual environment. As a result, the corresponding parameter tuning strategies are less effective in a virtual cluster than in a physical cluster. Figure 19 also shows that Ant improves the absolute job completion time by at least 9% and 7% compared with Heuristic and Starfish on the physical cluster and on the virtual cluster, respectively.
TABLE 6
Task-level parameter settings by current representative industry and academic approaches.

Approach     Rules-of-Thumb (Heuristic) from Cloudera      Starfish from Duke University
Parameter    g1   g2     g3    g4   g5   g6                g1   g2     g3    g4    g5   g6
Wordcount    10   100mb  0.35  0.8  4k   -Xmx200m          10   50mb   0.15  0.8   4k   -Xmx200m
Grep         10   200mb  0.05  0.6  16k  -Xmx300m          100  200mb  0.1   0.66  16k  -Xmx300m
Terasort     10   150mb  0.15  0.7  32k  -Xmx250m          10   200mb  0.15  0.7   32k  -Xmx350m

Fig. 21. Impact of cluster size (convergence time in minutes vs. number of machines).
Fig. 22. Impact of machine type (parameter search over time, in minutes).

APPENDIX H
SEARCH SPEED OF ANT
The searching speed of Ant represents the time cost to find a stable task configuration set for a running job on the cluster. A slow searching speed will reduce the benefit of a good task-level configuration, especially for small jobs. This is due to the fact that a small job may complete before Ant can find an optimal setting for it. This is also the reason why Ant is more effective for large jobs than for small jobs, as shown in Figures 21 and 22. Our experiments reveal that two main factors have a significant impact on the searching speed of Ant.
Impact of cluster size. The number of available nodes in the cluster is an important factor that affects the searching speed of Ant. Figure 21 enumerates the convergence cost of Ant with various numbers of machines in our experiments. The result shows that the convergence cost decreases as the number of machines increases in the cluster. The searching process relies on online task execution learning and needs multiple reconfiguration intervals to find good task-level parameters. The more nodes available in the cluster, the more opportunities there are to learn the characteristics of the running job.
Impact of machine type. Figure 22 demonstrates that the searching speed of Ant in the physical cluster is faster than that in the virtual cluster. We define a parameter value as stable when its standard deviation is smaller than 10%, a criterion widely used to evaluate convergence speed in a previous study [31]. The result in Figure 22 shows that Ant takes 18 and 35 minutes to find a good configuration in the physical cluster and in the virtual cluster, respectively. Note that the actual resource allocation for each node in the virtual cluster is typically time-varying due to interference issues [17], [32].
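As an illustration of the convergence criterion above, here is a minimal sketch; it takes the 10% figure as a relative standard deviation (standard deviation divided by the mean of recent samples), and the window length and example values are assumptions rather than settings from the paper.

```python
import statistics

def is_stable(recent_values, rel_threshold=0.10):
    """A parameter value is treated as stable when the standard deviation of its
    recent samples falls below 10% of their mean (relative standard deviation)."""
    if len(recent_values) < 2:
        return False
    mean = statistics.mean(recent_values)
    if mean == 0:
        return False
    return statistics.stdev(recent_values) / abs(mean) < rel_threshold

# Example: io.sort.record.percent values over the last few tuning intervals.
history = [0.38, 0.41, 0.40, 0.39, 0.40]
print(is_stable(history[-5:]))   # True once the search has converged
```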