
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2016.2594765, IEEE Transactions on Parallel and Distributed Systems.

Improving Performance of Heterogeneous MapReduce Clusters with Adaptive Task Tuning

Dazhao Cheng, Jia Rao, Yanfei Guo, Changjun Jiang and Xiaobo Zhou

Abstract—Datacenter-scale clusters are evolving toward heterogeneous hardware architectures due to continuous server replacement. Meanwhile, datacenters are commonly shared by many users for quite different uses, and such sharing often exhibits significant performance heterogeneity due to multi-tenant interference. The deployment of MapReduce on such heterogeneous clusters presents significant challenges in achieving good application performance compared to in-house dedicated clusters. As most MapReduce implementations are originally designed for homogeneous environments, heterogeneity can cause significant performance deterioration in job execution despite existing optimizations on task scheduling and load balancing. In this paper, we observe that the homogeneous configuration of tasks on heterogeneous nodes can be an important source of load imbalance and thus cause poor performance. Tasks should be customized with different configurations to match the capabilities of heterogeneous nodes. To this end, we propose a self-adaptive task tuning approach, Ant, that automatically searches for the optimal configurations of individual tasks running on different nodes. In a heterogeneous cluster, Ant first divides nodes into a number of homogeneous subclusters based on their hardware configurations. It then treats each subcluster as a homogeneous cluster and independently applies the self-tuning algorithm to it. Ant finally configures tasks with randomly selected configurations and gradually improves task configurations by reproducing the configurations from the best performing tasks and discarding poor performing configurations. To accelerate task tuning and avoid trapping in local optima, Ant uses a genetic algorithm during adaptive task configuration. Experimental results on a heterogeneous physical cluster with varying hardware capabilities show that Ant improves the average job completion time by 31%, 20%, and 14% compared to stock Hadoop (Stock), customized Hadoop with industry recommendations (Heuristic), and a profiling-based configuration approach (Starfish), respectively. Furthermore, we extend Ant to virtual MapReduce clusters in a multi-tenant private cloud. Specifically, Ant characterizes a virtual node based on two measured performance statistics: I/O rate and CPU steal time. It uses the k-means clustering algorithm to classify virtual nodes into configuration groups based on the measured dynamic interference. Experimental results on virtual clusters with varying interferences show that Ant improves the average job completion time by 20%, 15%, and 11% compared to Stock, Heuristic and Starfish, respectively.

Index Terms—MapReduce Performance Improvement, Self-Adaptive Task Tuning, Heterogeneous Clusters, Genetic Algorithm

• D. Cheng, J. Rao and X. Zhou are with the Department of Computer Science, University of Colorado, Colorado Springs, CO, 80918. E-mail: {dcheng,jrao,xzhou}@uccs.edu.
• C. Jiang is with the Department of Computer Science & Technology, Tongji University, 4800 Caoan Road, Shanghai, China. Email: cjjiang@tongji.edu.cn.
• Y. Guo is currently a Postdoc Fellow in the Argonne National Lab. Email: yguo@anl.gov.
• X. Zhou is the corresponding author.

1 INTRODUCTION

In the past few years, MapReduce has proven to be an effective platform to process a large set of unstructured data as diverse as sifting through system logs, running extract transform load operations, and computing web indexes. Since big data analytics requires distributed computing at scale, usually involving hundreds to thousands of machines, access to such facilities becomes a significant barrier to practicing big data processing in small businesses. Deploying MapReduce in datacenters or cloud platforms offers a more cost-effective model to implement big data analytics. However, the heterogeneity in datacenters and clouds presents significant challenges in achieving good MapReduce performance [1], [2].

Hardware heterogeneity occurs because servers are gradually upgraded and replaced in datacenters. Interference from multiple tenants sharing the same cloud platform can also cause heterogeneous performance even on homogeneous hardware. The difference in processing capabilities of MapReduce nodes breaks the assumption of homogeneous clusters in MapReduce design and can result in load imbalance, which may cause poor performance and low cluster utilization. To improve MapReduce performance in heterogeneous environments, work has been done to make task scheduling [2] and load balancing [1] heterogeneity aware. Despite these optimizations, most MapReduce implementations such as Hadoop still perform poorly in heterogeneous environments. For ease of management, MapReduce implementations use the same configuration for all tasks of a job. Existing research [3], [4] has shown that MapReduce configurations should be set according to cluster size and hardware configurations. Thus, running tasks with homogeneous configurations on heterogeneous nodes inevitably leads to sub-optimal performance.

In this work, we propose a task tuning approach that allows tasks to have different configurations on heterogeneous nodes, each optimized for the actual hardware capabilities. We address the following challenges in automatic MapReduce task tuning.


First, determining the optimal task configuration is a tedious and error-prone process. A large number of performance-critical parameters can have complex interplays on task execution. Previous studies [5], [4], [6], [7] have shown that it is difficult to construct models to connect parameter settings with MapReduce performance. Second, there is no one-size-fits-all model for different MapReduce jobs, and even different configurations are needed for different execution phases or input sizes. In a cloud environment, task configurations should also be changed in response to changes in multi-tenancy interference. Finally, most MapReduce implementations use fixed task configurations that are set during job initialization [8]. Adaptive task tuning requires new mechanisms for on-the-fly task reconfiguration.

We present Ant, a self-adaptive tuning approach for task configuration in heterogeneous environments. Ant monitors task execution in large MapReduce jobs, which comprise multiple waves of tasks, and optimizes task configurations as job execution progresses. It clusters worker nodes (either physical or virtual nodes) into groups according to their hardware configurations or their estimated capability based on interference statistics. For each node group, Ant launches tasks with different configurations and considers the ones that complete sooner as good settings. To accelerate tuning speed and avoid trapping in local optima, Ant uses the genetic functions crossover and mutation to generate task configurations for the next wave from the two best performing tasks in a group. We implement Ant in Hadoop, the popular open source implementation of MapReduce, and perform comprehensive evaluations with representative MapReduce benchmark applications. Experimental results on a physical cluster with three types of machines show that Ant improves the average job completion time by 31%, 20%, and 14% compared to stock Hadoop (Stock), customized Hadoop with industry recommendations (Heuristic), and a profiling-based configuration approach (Starfish), respectively. Our results also show that although Ant is not quite effective for small jobs with only a few waves of tasks, it can significantly improve the performance of large jobs. Experiments with Microsoft's MapReduce workload, which consists of 10% large jobs, demonstrate that Ant is able to reduce the overall workload completion time by 12.5% and 8% compared to the heuristic- and profiling-based approaches.

A preliminary version of the paper appeared in [9]. In this manuscript, we have extended Ant to virtual MapReduce clusters in multi-tenant private cloud environments. Specifically, Ant characterizes a virtual node based on two measured performance statistics: I/O rate and CPU steal time. We consider two representative interference scenarios in the cloud: stable interference and dynamic interference. Experimental results on virtual clusters with varying interferences show that Ant improves the average job completion time by 20%, 15%, and 11% compared to Stock, Heuristic and Starfish, respectively.

The rest of this paper is organized as follows. Section 2 gives motivations on improving the MapReduce configuration framework. Section 3 describes the design of Ant. Section 4 presents the details of the proposed self-tuning algorithm. Section 5 introduces moving Ant into the cloud. Section 6 gives Ant implementation details. Section 7 presents the experimental results and analysis on a physical cluster. Section 8 presents the experimental results and analysis on a virtual cluster. Section 9 reviews related work. Section 10 concludes the paper. The Appendix gives more details on Ant performance analysis.

2 MOTIVATIONS

Here, we first introduce the default Hadoop configuration framework and then provide motivating examples to demonstrate that static task configurations do not optimize performance across different jobs and platforms.

2.1 Background

MapReduce is a distributed parallel programming model originally designed for processing a large volume of data in a homogeneous environment. Based on the default Hadoop framework, a large number of parameters need to be set before a job can run in the cluster. These parameters control the behaviors of jobs during execution, including their memory allocation, level of concurrency, I/O optimization, and the usage of network bandwidth. As shown in Figure 1, slave nodes load configurations from the master node where the parameters are configured manually. By design, tasks belonging to the same job share the same configuration.

Fig. 1. The Hadoop framework.

In Hadoop, there are more than 190 configuration parameters, which determine the settings of the Hadoop cluster, describe a MapReduce job to the Hadoop framework, and optimize task execution [8]. Cluster-level parameters specify the organization of a Hadoop cluster and some long-term static settings. Changes to such parameters require rebooting the cluster to take effect. Job-level parameters determine the overall execution settings, such as input format, number of map/reduce tasks, and failure handling. These parameters are relatively easier to tune and have uniform effect on all tasks even in a heterogeneous environment.


TABLE 1
Multiple machine types in the cluster.

Machine model     CPU         Memory   Disk
Supermicro Atom   4*2.0GHz    8 GB     1 TB
PowerEdge T110    8*3.2GHz    16 GB    1 TB
PowerEdge T420    24*1.9GHz   32 GB    1 TB

Fig. 2. The optimal task configuration changes with workloads and platforms. (a) Heterogeneous workloads: average task completion time (s) of Wordcount, Terasort and Grep versus io.sort.record.percent. (b) Heterogeneous platforms: average task completion time (s) on Atom, T110 and T420 versus io.sort.mb (MB).

Task-level parameters control the fine-grained task execution on individual nodes and can possibly be changed independently and on-the-fly at runtime. Parameter tuning at the task level opens up opportunities for improving performance in heterogeneous environments and is our focus in this work.
Hadoop installations pre-set the configuration parameters to default values assuming a reasonably sized cluster and typical MapReduce jobs. These parameters should be specifically tuned for a target cluster and individual jobs to achieve the best performance. However, there is very limited information on how the optimal settings can be determined. There exist rule-of-thumb recommendations from industry leaders (e.g., Cloudera [10] and MapR [11]) as well as academic studies [4], [6]. These approaches cannot be universally applied to a wide range of applications or heterogeneous environments. In this work, we develop an online self-tuning approach for task-level configuration. Next, we provide motivating examples to show the necessity of configuration tuning for heterogeneous workloads and hardware platforms.

2.2 Motivating Examples

We created a heterogeneous Hadoop cluster composed of three types of machines listed in Table 1. Three MapReduce applications from the PUMA benchmark [12], i.e., WordCount, Terasort and Grep, each with 300 GB input data, were run on the cluster. We configured each slave node with four map slots and two reduce slots, and the HDFS block size was set to 256 MB. The heap size mapred.child.java.opts was set to 1 GB and other parameters were set to the default values. We measured the map task completion time in two different scenarios – heterogeneous workload on homogeneous hardware and homogeneous workload on heterogeneous hardware. We show that heterogeneity either in the workload or in the hardware makes the determination of the optimal task configuration difficult.

Figure 2(a) shows the average map completion times of the three heterogeneous benchmarks on a homogeneous cluster consisting only of the T110 machines. The completion times changed as we altered the value of parameter io.sort.record.percent. The figure shows that wordcount, terasort, and grep achieved their minimum completion times when the parameter was set to 0.4, 0.2, and 0.6, respectively. Figure 2(b) shows the performance of wordcount on machines with different hardware configurations. Map completion times varied as we changed the value of parameter io.sort.mb. The figure suggests that uniform task configurations do not lead to the optimal performance in a heterogeneous environment. For example, map tasks achieved the best performance on the Atom machine when the parameter was set to 125, while the optimal completion time on the T420 machine was achieved with the parameter set to 275.

[Summary] We have shown that the performance of Hadoop applications can be substantially improved by tuning task-level parameters for heterogeneous workloads and platforms. However, parameter optimization is an error-prone process involving complex interplays among the job, the Hadoop framework and the architecture of the cluster. Furthermore, manual tuning still remains difficult due to the large parameter search space. As many MapReduce jobs are recurring or have multiple waves of task execution, it is possible to learn the best task configurations based on the feedback of previous runs. These observations motivated us to develop a task self-tuning approach to automatically configure parameters for various Hadoop jobs and platforms in an online manner.

3 ANT DESIGN AND ASSUMPTIONS

3.1 Architecture

Ant is a self-tuning approach for multi-wave MapReduce applications, in which job executions consist of several rounds of map and reduce tasks. Unlike traditional MapReduce implementations, Ant centers on two key designs: (1) tasks belonging to the same job run with different configurations matching the capabilities of the hosting machines; (2) the configurations of individual


tasks dynamically change to search for the optimal settings. Ant first spawns tasks with random configurations and executes them in parallel. Upon task completion, Ant collects task runtimes and adaptively adjusts task settings according to the best-performing tasks. After several rounds of tuning, task configurations on different nodes converge to the optimal settings. Since task tuning starts with random settings and improves with job execution, Ant does not require any a priori knowledge of MapReduce jobs and is model independent. Figure 3 shows the architecture of Ant.

Fig. 3. The architecture of Ant.

• Self-tuning optimizer uses a genetic algorithm (GA)-based approach to generate task configurations based on the feedback reported by the task analyzer. Settings that are top-ranked by the task analyzer are used to reproduce the optimized configurations.
• Task analyzer uses a fitness (utility) function to evaluate the performance of individual tasks due to different configurations. The fitness function takes into account task completion time as well as other performance-critical execution statistics.

Ant operates as follows. When a job is submitted to the JobTracker, the configuration optimizer generates a set of parameters randomly in a reasonable range to initialize the task-level configuration. Then the JobTracker sends the randomly initialized tasks to their respective TaskTrackers. The steps of task tuning correspond to the multiple waves of task execution. Upon completing a wave, the task analyzer residing in the JobTracker recommends good configurations to the configuration optimizer for the next wave of execution. This process is repeated until the job completes.
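To make the per-wave control flow concrete, the following is a minimal, simplified sketch of how the optimizer/analyzer loop for one node group could be organized. It is not the authors' Java implementation: the names (run_wave, TaskResult, reproduce) are hypothetical, the fitness values are simulated, and the genetic operators are only stubbed here (they are detailed in Section 4).

```python
import random
from dataclasses import dataclass

# Hypothetical per-group search space; the real task-level ranges are in Table 2.
SEARCH_SPACE = {"io.sort.mb": (100, 500), "io.sort.record.percent": (0.05, 0.8)}

@dataclass
class TaskResult:
    config: dict     # configuration the task ran with
    fitness: float   # reported by the task analyzer (defined in Section 4.3)

def random_config():
    """Configuration optimizer: draw each parameter from its allowed range."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}

def reproduce(parent_a, parent_b):
    """Stand-in for the GA crossover/mutation operators detailed in Section 4.4."""
    return {k: random.choice([parent_a[k], parent_b[k]]) for k in SEARCH_SPACE}

def run_wave(configs):
    """Stand-in for dispatching one wave of tasks; fitness is simulated here."""
    return [TaskResult(c, random.random()) for c in configs]

def tune_group(slots=4, waves=10):
    """Per-wave loop for one homogeneous node group (the JobTracker's role)."""
    configs = [random_config() for _ in range(slots)]          # first wave: random
    for _ in range(waves):
        results = run_wave(configs)                            # task analyzer feedback
        best = sorted(results, key=lambda r: r.fitness, reverse=True)[:2]
        configs = [reproduce(best[0].config, best[1].config)   # breed next wave
                   for _ in range(slots)]
    return configs
```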
3.2 Assumptions

Our finding that uniform task configurations lead to sub-optimal performance in a heterogeneous environment motivated the design of Ant, a self-tuning approach that allows differentiated task settings in the same job. The effectiveness of Ant relies on two assumptions – substantial performance improvement can be achieved via task configurations, and the MapReduce jobs are long running ones (e.g., with multiple waves) which allow for task reconfiguration and performance optimization. There are two levels of heterogeneity that can affect task performance, i.e., task-level data skewness and machine-level varying capabilities. Although due to data skew some tasks inherently take longer to finish, Ant assumes that the majority of tasks have uniform completion time with identical configurations. Ant focuses on improving performance for average tasks by matching task configurations to the actual hardware capabilities. To address hardware heterogeneity, Ant groups nodes with similar hardware configurations or capabilities together and compares parallel executing tasks to determine the optimal configurations for the node group. However, task skew and varying hardware capabilities due to interferences in multi-tenant clouds can possibly impede getting good task configurations. We discuss how Ant addresses these issues in Appendix B.

4 SELF-ADAPTIVE TUNING

Ant identifies good configurations by comparing the performance of parallel executing tasks on nodes with similar processing capabilities in a self-tuning manner. Due to the multi-wave task execution in many MapReduce jobs, Ant is able to continuously improve performance by adapting task configurations. In this section, we first describe how Ant forms self-tuning task groups in which different configurations can be compared. We then discuss the set of parameters Ant optimizes and the utility function Ant uses to evaluate the goodness of parameters. Finally, we present the design of a genetic algorithm-based self-tuning approach and a strategy to accelerate the tuning speed.

4.1 Forming Self-tuning Groups

We begin by describing Ant's workflow in a homogeneous cluster and then discuss how to form homogeneous sub-clusters in a heterogeneous cluster.

Homogeneous cluster. In a homogeneous cluster, all nodes have the same processing capability. Thus, Ant considers the whole Hadoop cluster as a self-tuning group. Each node in the Hadoop cluster is configured with a predefined number of map and reduce slots. If the number of tasks (e.g., mappers) exceeds the available slots in the cluster (e.g., map slots), execution proceeds in multiple waves. Figure 4 shows the multi-wave task self-tuning process in a homogeneous cluster. Ant starts multiple tasks with different configurations concurrently in the self-tuning group and reproduces new configurations based on the feedback of completed tasks. We frame the tuning process as an evolution of configurations, in which each wave of execution refers to one configuration generation. The reproduction of generations is directed by a genetic algorithm which ensures that the good configurations in prior generations are preserved in new generations.

Fig. 4. Task self-tuning process in Ant.

Heterogeneous cluster. In a heterogeneous cluster, as shown in Figure 5, Ant divides nodes into a number of homogeneous subclusters based on their hardware

configurations. Hardware information can be collected by the JobTracker on the master node using the heartbeat connection. Ant treats each subcluster as a homogeneous cluster and independently applies the self-tuning algorithm to it. The outcomes of the self-tuning process are significantly improved task-level configurations, one for each subcluster. Since each subcluster has different processing capability, the optimized task configurations can be quite different across subclusters.

Fig. 5. Ant on a heterogeneous cluster.

4.2 Task-level Parameters

TABLE 2
Task-level parameters and search space.

Task-level parameter      Search space   Symbol
io.sort.factor            {1, 300}       g1
io.sort.mb                {100, 500}     g2
io.sort.record.percent    {0.05, 0.8}    g3
io.sort.spill.percent     {0.1, 0.9}     g4
io.file.buffer.size       {4K, 64K}      g5
mapred.child.java.opts    {200, 500}     g6

Task-level parameters control the behavior of task execution, which is critical to the Hadoop framework. Previous studies have shown that a small set of parameters is critical to Hadoop performance. Thus, as shown in Table 2, we choose task-level parameters which have significant performance impact as the candidates for tuning. We further shrink the initial search space of these parameters to a reasonable range in order to accelerate the search. This simple approach allows us to cut the search time down from a few hours to a few minutes.

4.3 Evaluating Task Configurations

To compare the performance of different task configurations, Ant requires a quantitative metric to rank configurations. As the goal of task tuning is to minimize job execution time, task completion time (TCT) is an intuitive metric to evaluate performance. However, TCT itself is not a reliable metric to evaluate task configurations. A longer task completion time does not necessarily indicate a worse configuration, as some tasks are inherently longer to complete. For example, due to data skew, tasks that have expensive records in their input files can take more than five times longer to complete. Thus, we combine TCT with another performance metric to construct a utility function (or a fitness function in genetic algorithms). We found that most task mis-configurations are related to task memory allocations and incur excessive data spill operations. If either the kvbuffer or the metadata buffer fills up, a map task spills intermediate data to local disks. The spills could lead to three times more I/O operations [10], [13]. Thus, Ant is designed to simultaneously minimize task completion time and the number of spills.

We define the fitness function of a configuration candidate (C_i) as: f(C_i) = 1 / (TCT²(C_i) × (#spills)), where TCT is the task completion time and #spills is the number of spill operations. Since the majority of tasks have little or no data skew, we give more weight to TCT in the formulation of the fitness function. Task configurations with high fitness values will be favored in the tuning process. Note that the fitness function does not address the issue of data skew due to non-uniform record distributions in task inputs. We believe that configurations optimized for a task with inherently more data can even be harmful to normal tasks, as allocating more memory to normal tasks incurs resource waste.
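As a concrete illustration, the fitness computation can be written directly from the formula above. This is a small sketch, not the authors' implementation; the completion time and spill count are assumed to come from the per-task counters mentioned in Section 6, and the handling of a zero spill count is our own assumption since the paper does not specify it.

```python
def fitness(tct_seconds, num_spills):
    """f(Ci) = 1 / (TCT^2(Ci) * #spills); higher is better.

    TCT is weighted more heavily (squared) because most tasks have little
    or no data skew. Guarding against a zero spill count is an assumption:
    tasks that never spill should not cause a division by zero.
    """
    spills = max(num_spills, 1)
    return 1.0 / (tct_seconds ** 2 * spills)

# Example: a 60-second task with 3 spills vs. a 70-second task with 1 spill.
print(fitness(60, 3))   # ~9.26e-05
print(fitness(70, 1))   # ~2.04e-04 -> similar TCT with fewer spills ranks higher
```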
4.4 Task Self-tuning

Ant deploys an online self-tuning approach based on a genetic algorithm to search for the optimal task-level configurations. We consider MapReduce jobs composed of multiple waves of map tasks. The performance of an individual task T is determined by its parameter set C. A candidate set C_i consisting of a number of selected parameters (referred to as genes in GA), denoted as C_i = [g_1, g_2, ..., g_n], represents a task configuration set, where n is the number of parameters. Each element g represents a task-level parameter as shown in Table 2.

Reproduction process. Ant begins with an initial configuration of randomly generated candidates for the task assignment. After that, it evolves individual task configurations to breed good solutions during each interval by using the genetic reproduction operations. As shown in Algorithm 1, Ant first evaluates the fitness of all completed tasks in the last control interval. Note that


M represents the total number of completed tasks in the last control interval. When there is any available slot in the cluster, it selects two configuration candidates as the evolving parents. Ant generates the new generation of configuration candidates by using the proposed genetic reproduction operations. Finally, it assigns the task with the newly generated configuration set to the available slot.

Algorithm 1 Ant task self-tuning algorithm.
1: /*Evaluate the fitness of each completed task*/
2: f(C_1), ..., f(C_i), ..., f(C_M)
3: repeat
4:   if any slot is available then
5:     Select two configuration candidates as parents;
6:     Do crossover and mutation operations;
7:     Use the obtained new generation C_new to assign the task to the available slot;
8:   end if
9: until the running job is completed.

Fig. 6. Reproduction operations.

There are many variations of the reproduction algorithm obtained by altering the selection, crossover, and mutation operators, as shown in Figure 6. The selection determines which two parents (task configuration sets) in the last generation will have offspring in the next generation. The crossover operator determines how genes are exchanged between the parents to create those offspring. The mutation allows for random alteration of genes. While the selection and crossover operators tend to increase the quality of the task execution in the new generation and force convergence, mutation tends to bring in divergence.

Parents selection. A popular selection approach is the Roulette Wheel (RW) mechanism. In this method, if f(C_i) is the fitness of a completed task in the candidate population, its probability of being selected is P_i = f(C_i) / Σ_{i=1}^{M} f(C_i), where M is the number of tasks completed in the previous interval. This allows candidates with good fitness values to have a higher probability of being selected as parents. The selection module ensures reproduction of more highly fit candidates compared to the number of less fit candidates.

Crossover. A crossover function is used to cut the sequence of elements from two chosen candidates (parents) and swap them to produce two new candidates (children). As the crossover operation is crucial to the success of Ant and is also problem dependent, an exclusive crossover operation is employed for each individual. We implement relative fitness crossover [14] instead of the absolute fitness crossover operation, because it moderates the selection pressure and controls the rate of convergence. The crossover operation is exercised on configuration candidates with a probability, known as the crossover probability (P_c).

Mutation. The mutation function aims to avoid trapping in a local optimum by randomly mutating an element with a given probability. Instead of performing gene-by-gene mutation at each generation, a random number r is generated for each individual. If r is larger than the mutation probability (P_m), the particular individual undergoes the mutation process. The mutation operation involves replacing a randomly chosen parameter with a new value generated randomly in its search space. This process prevents premature convergence of the population and helps Ant sample the entire solution space.
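The sketch below shows, under stated assumptions, how the Roulette Wheel probability, a one-cut-point crossover, and per-individual mutation could be realized over the Table 2 parameters. It is illustrative only: the cut-point crossover and the p_c = 0.7 / p_m = 0.2 values follow Section 6 rather than the relative fitness crossover of [14], all values are drawn as uniform reals for simplicity, and the helper names are hypothetical.

```python
import random

# Task-level parameter search space from Table 2 (ranges, simplified to reals).
GENES = ["io.sort.factor", "io.sort.mb", "io.sort.record.percent",
         "io.sort.spill.percent", "io.file.buffer.size", "mapred.child.java.opts"]
SPACE = {"io.sort.factor": (1, 300), "io.sort.mb": (100, 500),
         "io.sort.record.percent": (0.05, 0.8), "io.sort.spill.percent": (0.1, 0.9),
         "io.file.buffer.size": (4096, 65536), "mapred.child.java.opts": (200, 500)}
PC, PM = 0.7, 0.2  # crossover and mutation probabilities (Section 6)

def roulette_pick(candidates, fitnesses):
    """Roulette Wheel: P_i = f(C_i) / sum_j f(C_j)."""
    total = sum(fitnesses)
    return random.choices(candidates, weights=[f / total for f in fitnesses], k=1)[0]

def crossover(p1, p2):
    """Single cut point; genes beyond the cut are swapped between the parents."""
    if random.random() > PC:
        return dict(p1), dict(p2)
    cut = random.randrange(1, len(GENES))
    c1 = {g: (p1 if i < cut else p2)[g] for i, g in enumerate(GENES)}
    c2 = {g: (p2 if i < cut else p1)[g] for i, g in enumerate(GENES)}
    return c1, c2

def mutate(child):
    """Replace one randomly chosen gene with a fresh random value in its range.

    Mutation is applied here with probability PM (one common convention;
    the paper's wording of the r vs. P_m comparison is slightly different).
    """
    if random.random() < PM:
        g = random.choice(GENES)
        lo, hi = SPACE[g]
        child[g] = random.uniform(lo, hi)
    return child
```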
4.5 Aggressive Selection

The popular Roulette Wheel selection mechanism has a higher probability of selecting good candidates to be parents than bad ones. However, this approach still results in too many task evaluations, which in turn reduces the speed of convergence. Therefore, our selection procedure is more aggressive and deterministically selects good candidates to be parents. We use the following two strategies to accelerate the task-level parameter tuning.

Elitist strategy: We found that good candidates are more likely to produce good offspring. In this work, an elitist strategy is developed similar to the one proposed in the GA study [14]. Elitism provides a means for reducing genetic drift by ensuring that the best candidate is allowed to copy its attributes to the next generation. Since elitism can increase the selection pressure by preventing the loss of low salience genes of candidates due to deficient selection pressure, it improves the performance with regard to optimality and convergence. However, the elitism rate should be adjusted suitably and accurately because high selection pressure may lead to premature convergence. The best candidate with the highest fitness value in the previous generation will be preserved as one of the parents in the next generation.

Inferior strategy: We also found that it is unlikely for two low fitness candidates to produce an offspring with high fitness. This is due to the fact that bad performance is often caused by a few key parameters and these bad settings continue to be inherited in real clusters. For instance, for an application that is both CPU and shuffle-intensive in a cluster with excessive I/O bandwidth and limited CPU resources, enabling compression of map outputs would stress the CPU and degrade application performance, regardless of other settings. The selection method should eliminate this configuration quickly. In order to quickly eliminate poor candidates, we calculate the mean fitness of the completed tasks for each generation and

only select parents with fitness scores that exceed the mean.

Aggressive selection algorithm. Based on the above two aggressive selection strategies, the parent selection of the self-tuning reproduction is operated by an integrated selection algorithm. As shown in Algorithm 2, Ant first selects the best configuration candidate with the highest fitness in the last interval as one of the reproduction parents. Then it selects another configuration set from the candidates with fitness scores that exceed the mean by applying the Roulette Wheel approach. Finally, Ant generates the new generation of configuration candidates by using the two selected candidates (i.e., C_best and C_WR) as the reproduction parents.

Algorithm 2 Aggressive selection algorithm.
1: /*Select the best configuration candidate C_best*/
2: best = arg max_i [f(C_i)], i ∈ {1, ..., M}
3: /*Select another configuration set from the candidates with fitness scores that exceed the mean by RW*/
4: Avg_f = (1/M) Σ_{i=1}^{M} f(C_i)
5: if f(C_i) > Avg_f then
6:   Select C_WR = C_i with probability P_i;
7: end if
8: Use C_best and C_WR as parents.

Furthermore, the aggressive selection strategies also reduce the impact of task skew during the tuning process. Long tasks due to data skew may add noise to the proposed GA-based task tuning. Taking advantage of the aggressive selection, only the best configurations are possibly used to generate new configurations. It is unlikely that tasks with skew would be selected as reproduction candidates. Thus, Ant would find the best configurations for the majority of tasks.
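A compact sketch of this aggressive parent selection (elitist best candidate plus a Roulette Wheel draw restricted to above-average candidates) might look like the following. It assumes the fitness values of the last control interval's completed tasks are available, and the fallback when no candidate exceeds the mean is our own assumption.

```python
import random

def select_parents(configs, fitnesses):
    """Aggressive selection: elitist best + above-mean Roulette Wheel pick.

    configs:   list of configuration dicts from the last control interval
    fitnesses: matching fitness values f(C_i)
    """
    best_idx = max(range(len(configs)), key=lambda i: fitnesses[i])
    c_best = configs[best_idx]

    mean_f = sum(fitnesses) / len(fitnesses)
    pool = [(c, f) for c, f in zip(configs, fitnesses) if f > mean_f]
    if not pool:                      # degenerate case (assumption): reuse the best
        return c_best, c_best
    cands, fits = zip(*pool)
    total = sum(fits)
    c_wr = random.choices(cands, weights=[f / total for f in fits], k=1)[0]
    return c_best, c_wr
```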
5 MOVING ANT INTO THE CLOUD

Cloud computing offers users the ability to access large pools of computational and storage resources on demand. By using the cloud computing paradigm, developers are able to perform parallel data processing in a distributed and scalable environment with affordable cost and reasonable time. In particular, MapReduce deployment in the cloud allows enterprises to cost-effectively analyze large amounts of data without creating large infrastructures of their own [15]. Using virtual machines (VMs) and storage hosted by the cloud, enterprises can simply create virtual MapReduce clusters to analyze their data. However, tuning the jobs hosted in virtual MapReduce clusters is a significant challenge to users. This is due to the fact that even virtual nodes configured with the same virtual hardware could have varying capacity due to interference from co-located users [16], [17], [18]. When running Hadoop in a virtual cluster, finding nodes with similar hardware capability is more challenging and complex than in a physical cluster.

For moving Ant into virtual MapReduce clusters, it is apparently inaccurate to divide nodes into a number of homogeneous subclusters based on their static hardware configurations. The cloud offers elasticity to slice large, underutilized physical servers into smaller, parallel virtual machines, enabling diverse applications to run in isolated environments on a shared hardware platform. As a result, such a cloud environment leads to various interferences among the different applications. In this work, we focus on two representative interference scenarios in the cloud: stable interference and dynamic interference. We extend Ant to virtual MapReduce clusters in these cloud environments.

• In the first scenario of stable interference, Ant only classifies VMs at the beginning of the adaptation process, when interferences are relatively stable during the job execution process.
• In the second scenario of dynamic interference, the background interference could change and the performance of virtual nodes changes over time. Thus, a re-clustering of the virtual nodes is needed to form new tuning groups. Ant adapts to the variations of VM performance by periodically performing re-grouping if significant performance changes in VMs are detected.

In the following, we describe how to classify a virtual MapReduce cluster into a number of homogeneous subclusters in the above two scenarios.

5.1 Stable Interference Scenario

Ant estimates the actual capabilities of virtual nodes based on low-level resource utilizations of virtual resources. A previous study found that MapReduce jobs are mostly bottlenecked by the slow processing of a large amount of data [19]. An excessive I/O rate and a lack of CPU allocation are signs of slow processing. Ant characterizes a virtual node based on two measured performance statistics: I/O rate and CPU steal time. Both statistics can be measured at the TaskTracker of individual nodes.

Ant monitors the number of data bytes written to disk during the execution of a task. Since there is little data reuse in MapReduce jobs, the volume of writes is a good indicator of I/O access rate and memory demand. When the in-memory buffer, controlled by parameter mapred.job.shuffle.merge.percent, runs out or reaches the threshold number of map outputs mapred.inmem.merge.threshold, it is merged and spilled to disks. The spilled records are written to disks, which include both map and reduce spills. The TaskTracker automatically records the data writing size and spilling duration of each task. We calculate the I/O access rate of individual nodes based on the data spilling operations.

The CPU steal time is the amount of time that the virtual node is ready to run but fails to obtain CPU cycles because the hypervisor is serving other users. It reflects the actual CPU time allocated to the virtual node and can effectively calibrate the configuration of virtual hardware according to the experienced interference. We record the real-time CPU steal time by running the Linux top command on each virtual machine. The TaskTracker reports the information to the master node in the cluster via the heartbeat connection.
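CPU steal time is exposed by the guest kernel itself, so it can be sampled without hypervisor support. The sketch below reads the aggregate "cpu" line of /proc/stat on Linux and reports the steal share between two samples; this is one possible way to obtain the statistic (the paper uses the top utility), and reporting the value to the master node is left out.

```python
import time

def read_cpu_times():
    """Return (total_jiffies, steal_jiffies) from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()   # ['cpu', user, nice, system, idle, iowait, irq, softirq, steal, ...]
    values = [int(v) for v in fields[1:]]
    steal = values[7] if len(values) > 7 else 0   # 8th value is 'steal' on modern kernels
    return sum(values), steal

def steal_percent(interval=1.0):
    """Percentage of CPU time stolen by the hypervisor over the sampling interval."""
    t0, s0 = read_cpu_times()
    time.sleep(interval)
    t1, s1 = read_cpu_times()
    dt = t1 - t0
    return 100.0 * (s1 - s0) / dt if dt else 0.0

if __name__ == "__main__":
    print(f"CPU steal: {steal_percent():.2f}%")
```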
Ant uses the k-means clustering algorithm to classify virtual nodes into configuration groups. Formally, given a set of VMs (M_1, M_2, ..., M_m), where each observation is a 2-dimensional real vector (i.e., I/O rate and CPU steal time), k-means clustering aims to partition the m observations into k (≤ m) sets S = {S_1, S_2, ..., S_k} so as to


minimize the within-group sum of squares:

arg min_S Σ_{i=1}^{k} Σ_{M ∈ S_i} ||M − µ_i||².    (1)

Here µ_i is the mean of the points in S_i. ||M − µ_i||² is the chosen (intra) distance measure between a data point M_m and the group center µ_i. It is an indicator of the distance of the mth VM from its respective group center.
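The grouping step is a standard 2-dimensional k-means over the per-node (I/O rate, CPU steal time) vectors. A minimal sketch with scikit-learn follows (one possible library choice, not necessarily the one used by the authors); the measurements themselves are assumed to arrive via the TaskTracker heartbeats described above, and the normalization step is our own assumption since Eq. 1 is written on the raw vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_vms(io_rates, steal_times, k):
    """Partition VMs into k configuration groups by (I/O rate, CPU steal time).

    io_rates, steal_times: per-VM measurements of equal length
    Returns a list mapping VM index -> group id.
    """
    features = np.column_stack([io_rates, steal_times])
    # Normalize both dimensions so one metric does not dominate the distance.
    features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-9)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    return km.labels_.tolist()

# Example: 6 VMs, two of which suffer heavy interference.
labels = group_vms([20, 22, 19, 80, 85, 21], [1, 2, 1, 25, 30, 2], k=2)
print(labels)   # e.g. [0, 0, 0, 1, 1, 0]
```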
5.2 Dynamic Interference Scenario

The determination of the k value in the proposed k-means classification approach is more difficult and complex when Ant is applied in a highly dynamic cloud environment. The real capacity of virtual nodes in the cluster will change over time as the background interference changes. Furthermore, virtual nodes could be migrated among different physical host machines in the cloud, which also has a significant impact on the performance of virtual nodes. These observations motivate us to periodically perform re-grouping of the virtual nodes to form new tuning groups. Accordingly, we have to decide the number of homogeneous subclusters and the frequency of re-grouping the subclusters in a dynamic interference environment.

5.2.1 K value selection

As designed, the algorithm is not capable of determining the appropriate number of subclusters and it depends upon the user to identify this in advance. However, it is very difficult for Ant to set the number of groups in the dynamic interference scenario. In order to achieve the desired performance of Ant, it is necessary to find the number of groups for dynamic interferences among VMs at runtime. Fixing the number of groups in a virtual cloud cluster may lead to poor grouping quality and performance degradation. Thus, we propose a modified k-means method to find the number of groups based on the quality of the grouping output on the fly. It relies on the following two performance metrics to qualify the grouping of similar VMs in a virtual MapReduce cluster.

• inter is used to measure the separation of homogeneous VM groups. The term is the sum of the distances between the group centroids, which is defined as: inter = Σ ||µ_i − µ_x||², i, x ∈ k.
• intra is used to measure the compactness of individual homogeneous VM groups. Here, we use the standard deviation as intra to evaluate the compactness of the data points in each group of VMs. It is defined as: intra = √((1/n) Σ_{i=1}^{n} (M_i − µ_i)²).

As shown in Algorithm 3, users have the flexibility either to fix the number of groups or to enter the minimum number of groups. In the former case, it works in the same way as the traditional k-means algorithm does. The value of k is empirically determined based on the historical workload records and MapReduce multi-wave behavior analysis. In the latter case, the algorithm computes the new number of groups by incrementing the group counter by one in each iteration until it satisfies the grouping quality threshold, referring to line 14 in Algorithm 3.

Algorithm 3 Dynamic grouping of VMs.
1: Inputs: k: number of groups (initialize k = 2); M: number of VMs in a virtual cluster.
2: Randomly choose k VMs in the cluster as k groups;
3: repeat
4:   if any other VM is not in a group then
5:     assign the VM to a group according to Eq. 1;
6:     update the mean of the group;
7:   end if
8: until the minimum mean of the group is achieved;
9: if the number of groups is fixed then
10:   goto Output
11: end if
12: compute: inter = Σ ||µ_i − µ_x||², i, x ∈ k
13: compute: intra = √((1/n) Σ_{i=1}^{n} (M_i − µ_i)²)
14: if intra_k < intra_{k−1} and inter_k > inter_{k−1} then
15:   k ← k + 1 and goto line 3;
16: end if
17: Output: a set of k groups.

5.2.2 Dynamic Re-grouping

Since multi-tenancy interferences in the cloud are usually time-varying, grouping of homogeneous virtual nodes should be executed by the k-means approach periodically. Intuitively, the frequency of the re-grouping is determined by two factors: the dynamics of multi-tenancy interference in the cloud and the size of the virtual MapReduce cluster.

• If the multi-tenancy interferences change frequently, re-grouping should be more frequent correspondingly so as to capture dynamic capacity changes of the virtual nodes.
• If the size of the virtual cluster is small, re-grouping should be less frequent. This is because a certain amount of time is needed to find favorable task configurations after each re-grouping adjustment, which may outweigh the benefit of re-grouping. On the contrary, re-grouping needs to be more frequent when the size of the virtual cluster is large.

Based on the above analysis, we formally formulate the dynamic capacity change of the virtual nodes based on the two performance metrics (i.e., inter and intra) as:

Dynamic(t) = (intra(t) / intra(t−1)) × (inter(t−1) / inter(t)).    (2)

Either an increase of intra or a decrease of inter means that the dynamic capacity changes of VMs become significant. The growth of intra means the compactness of individual homogeneous VM groups becomes relaxed. An inter reduction means the separation of homogeneous VM groups becomes unstable and regrouping becomes necessary. We periodically monitor the two performance

metrics to reflect the interference in the cloud. Here, t is the interval at which we evaluate the dynamic capacity changes. We empirically set the value of t to one minute in the experiments. It is a tradeoff between the average task completion time and the fluctuation of interference in the cloud.

Algorithm 4 Dynamic re-grouping interval.
1: Inputs: T_n: the interval of the nth re-grouping (initialize T_1 = 10 minutes);
2: repeat
3:   Collect statistics of inter and intra periodically;
4:   Compute at the end of T_n: T_{n+1} = T_n / Avg_{t ∈ T_n} Dynamic(t), according to Eq. 2;
5: until the job is completed;
6: Output: New re-grouping interval T_{n+1}.

We formally design the dynamic re-grouping interval selection process as shown in Algorithm 4. The re-grouping interval (T_n) is initialized to 10 minutes at the beginning. It is then dynamically adjusted at runtime based on the dynamic capacity changes of the virtual nodes. As the dynamics increase, the re-grouping interval decreases, referring to line 4 in Algorithm 4. The term Avg_{t ∈ T_n} Dynamic(t) represents the average dynamics during the previous re-grouping interval T_n. It aims to mitigate the impact of time-varying dynamic interference in the cloud.
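To illustrate Eq. 2 and the interval update in Algorithm 4, the following sketch computes inter and intra from a current grouping, forms Dynamic(t), and shrinks or stretches the next re-grouping interval accordingly. The helper names and the averaging over one-minute samples are assumptions consistent with the text, not the authors' implementation.

```python
import numpy as np

def inter_intra(points, labels, centers):
    """inter: sum of squared distances between group centroids (over ordered pairs);
       intra: standard deviation of points around their own centroids."""
    inter = sum(float(np.sum((centers[i] - centers[x]) ** 2))
                for i in range(len(centers)) for x in range(len(centers)) if i != x)
    sq = [float(np.sum((p - centers[l]) ** 2)) for p, l in zip(points, labels)]
    intra = float(np.sqrt(np.mean(sq)))
    return inter, intra

def dynamic(intra_t, intra_prev, inter_t, inter_prev):
    """Eq. 2: Dynamic(t) = (intra(t)/intra(t-1)) * (inter(t-1)/inter(t))."""
    return (intra_t / intra_prev) * (inter_prev / inter_t)

def next_interval(t_n, dynamics_during_tn):
    """Algorithm 4, line 4: T_{n+1} = T_n / Avg_{t in T_n} Dynamic(t)."""
    avg = sum(dynamics_during_tn) / len(dynamics_during_tn)
    return t_n / avg

# Example: growing intra and shrinking inter -> Dynamic > 1 -> shorter interval.
print(next_interval(10.0, [dynamic(1.2, 1.0, 0.8, 1.0)]))   # 10 / 1.5 ≈ 6.67 minutes
```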
(12-core CPUs, 24 GB RAM and 1 T hard disk), and 1
6 I MPLEMENTATION T620 (24-core CPUs, 16 GB RAM and 1 T hard disk).
Hadoop modification: We implemented Ant by The master node is hosted on one T420 machine in the
modifying classes JobTracker, TaskTracker and cluster. The servers are connected with Gigabit Ethernet.
LaunchTaskAction based on Hadoop version 1.0.3. Each slave node is configured with four map slots and
We added a new interface taskConf, which is used two reduce slots. The block size of HDFS is set to 256M
to specify the configuration file of individual tasks due to the large input data size in the experiment.
while assigning them to slave nodes. Each set of task- Ant is modified based on the version 1.0.3 of Hadoop
level parameter set is tagged with its corresponding implementation. FIFO scheduler is used in our experi-
AttemptTaskID. Additionally, we added another ments. We evaluate Ant using three MapReduce appli-
new interface Optimizer to implement the GA cations from the PUMA benchmark [12] with different
optimization. During job execution, we created a method input sizes as shown in Table 3, which are widely used
taskAnalyzer to collect the status of each completed in evaluation of MapReduce performance by previous
task by using TaskCounter and TaskReport. works [20].
Ant execution process: At slave nodes, once a We compare the performance of Ant with two other
TaskTracker gets task execution commands from the main competitors in practical use: Starfish [4], a job
TaskScheduler by calling LaunchTaskAction, it re- profiling based configuration approach from Duke uni-
quires task executors to accept a launchTask() action versity, and Rules-of-Thumb 1 (Heuristic), another repre-
from a local TaskTracker. Ant uses the launchTask() sentative heuristic configuration approach from industry
RPC connection to pass on the task-level configuration leader Cloudera [10]. For reference, we normalize the Job
file description (i.e., taskConf), which is originally Completion Time (JCT) achieved by various approaches
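As a rough illustration of the per-task configuration mechanism described above (not the authors' Java code), the snippet below writes a Hadoop-style XML configuration file into a hypothetical per-attempt directory so that a task could load its own overrides during localization. The directory layout, file name and helper name are assumptions.

```python
import os
import xml.etree.ElementTree as ET

def write_task_conf(work_dir, attempt_id, params):
    """Write task-level parameter overrides as a Hadoop-style XML file.

    work_dir:   task working directory on the slave node (assumed layout)
    attempt_id: e.g. "attempt_201607011200_0001_m_000042_0" (illustrative)
    params:     dict of task-level parameters, e.g. {"io.sort.mb": 275}
    """
    conf = ET.Element("configuration")
    for name, value in params.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    task_dir = os.path.join(work_dir, attempt_id)
    os.makedirs(task_dir, exist_ok=True)
    path = os.path.join(task_dir, "task-conf.xml")
    ET.ElementTree(conf).write(path, xml_declaration=True, encoding="utf-8")
    return path
```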
7 EVALUATION ON A PHYSICAL CLUSTER

7.1 Experiment Setup

We evaluate Ant on a physical cluster composed of 1 Atom, 3 T110, 3 T420 (as shown in Table 1), 1 T320 (12-core CPUs, 24 GB RAM and 1 TB hard disk), and 1 T620 (24-core CPUs, 16 GB RAM and 1 TB hard disk). The master node is hosted on one T420 machine in the cluster. The servers are connected with Gigabit Ethernet. Each slave node is configured with four map slots and two reduce slots. The block size of HDFS is set to 256 MB due to the large input data size in the experiments.

Ant is implemented based on version 1.0.3 of Hadoop. The FIFO scheduler is used in our experiments. We evaluate Ant using three MapReduce applications from the PUMA benchmark [12] with different input sizes as shown in Table 3, which are widely used in evaluations of MapReduce performance by previous works [20].

We compare the performance of Ant with two other main competitors in practical use: Starfish [4], a job profiling based configuration approach from Duke University, and Rules-of-Thumb¹ (Heuristic), another representative heuristic configuration approach from industry leader Cloudera [10]. For reference, we normalize the Job Completion Time (JCT) achieved by the various approaches to the JCT achieved by the Hadoop stock parameter settings. Unless otherwise specified, we use the stock configuration setting of the Hadoop implementation for the other items that are not listed in Table 6 (please refer to the Appendix). Note that both Heuristic and Starfish always maintain identical configuration files for job executions as described in Section 2.

1. Cloudera recommends a set of configuration items based on its industry experience, e.g., io.sort.record.percent is recommended to be set as 16/(16 + avg-record-size), which is based on the average size of map output records. More rules are available in [10].


TABLE 3
The characteristics of benchmark applications used in our experiments.

Category    Type            Label      Input size (GB)   Input data   # Maps          # Reduces
Wordcount   CPU intensive   J1/J2/J3   100/300/900       Wikipedia    400/1200/3600   14/14/14
Grep        I/O intensive   J4/J5/J6   100/300/900       Wikipedia    400/1200/3600   14/14/14
Terasort    I/O intensive   J7/J8/J9   100/300/900       TeraGen      400/1200/3600   14/14/14

Fig. 7. Job completion time on the physical cluster (JCT normalized to Stock for jobs J1-J9).

Fig. 8. Task-level parameter search process: io.sort.record.percent value over time (minutes). (a) Atom. (b) T110.

For fairness, the cluster-level and job-level parameters for all approaches (including the baseline stock configuration) in the experiments are set to the values suggested by the rules of thumb from Cloudera [10]. For example, we roughly set the value of mapred.reduce.tasks (the number of reduce tasks of the job) to 0.9 times the total number of reduce slots in the cluster. We discuss the performance impact of such job-level parameters in Appendix A.

Our evaluation first demonstrates the performance improvement achieved by Ant and analyzes the improvement from the perspectives of job size and workload type. Then we demonstrate the effectiveness of Ant for searching task-level parameters.

7.2 Effectiveness of Ant

Reducing job completion time. Figure 7 compares the job completion times achieved by Heuristic, Starfish and Ant, respectively. The results demonstrate that all of these configuration approaches improve the job completion times more or less when compared with the performance achieved by the Hadoop stock parameter setting (Stock). Figure 7 shows that Ant improves the average job completion time by 31%, 20% and 14% compared with Stock, Heuristic and Starfish on the physical cluster, respectively. This is due to the fact that Stock, Heuristic and Starfish all rely on a unified and static task-level parameter setting. Such unified configuration is apparently inefficient in the heterogeneous cluster. The results also reveal that Starfish is more effective than Heuristic on the physical cluster since Starfish benefits from its job profiling capability. The learning process of Starfish is more accurate than the experience-based tuning of Heuristic in capturing the characteristics of individual jobs.

Impact of job size. Figure 7 shows that Ant only slightly reduces the completion time of small jobs compared with stock Hadoop, e.g., J1, J4 and J7. In contrast, Ant is more effective for large jobs, e.g., J3, J6 and J9. This is due to the fact that small jobs usually have short execution times and the self-tuning process cannot immediately find the optimal configuration solutions. Thus, such small jobs are not favored by Ant.

Impact of workload type. Figure 7 reveals that Ant reduces the job completion time of the I/O intensive workloads, i.e., Grep and Terasort, 10% more than that of the CPU intensive workload, i.e., Wordcount. This is due to the fact that Ant focuses on task-level I/O parameter tuning and accordingly it has a larger effect on I/O intensive workloads.

Overall, Ant achieves consistently better performance than Heuristic and Starfish do on the physical Hadoop cluster. This is due to its capability of adaptively tuning task-level parameters while considering various workload preferences and heterogeneous platforms. Please refer to Appendix C for the effectiveness of Ant in a production environment, Appendix E for the effectiveness of the aggressive selection strategy and Appendix F for the task execution improvement of Ant.

7.3 Ant Searching Process

In order to take a close look at the Ant searching process, Figure 8 depicts the io.sort.record.percent parameter search path on two representative machines (i.e., Atom and T110) in the physical cluster. Figure 8(a) shows that Atom spends around 20 minutes to find a stable parameter range. In contrast, Figure 8(b) shows that T110 takes only 7 minutes, due to the fact that there are three T110 machines in the cluster, which provide three times the concurrent running tasks of the Atom machine.


Fig. 9. Dynamic interference in the cloud: cluster-level CPU utilization (%) over time (minutes).

Fig. 10. Job completion time on the virtual cluster with dynamic performance interference (JCT normalized to the default setting for jobs J1-J9).
8 EVALUATION ON VIRTUAL CLUSTERS

8.1 Experiment Setup

We built a virtual cluster in our university cloud. VMware vSphere 5.1 was used for server virtualization. The VMware vSphere module controls the CPU usage limits in MHz allocated to the VMs. We created a pool of VMs with different hardware configurations from the virtualized blade server cluster and ran them as Hadoop slave nodes. All VMs ran Ubuntu Server 12.04 with Linux kernel 3.2. The cluster-level configurations for Hadoop are the same as those in the physical cluster (Section 7.1). The number of reduce tasks is set to 42, which is 0.9 times the number of available reduce slots in the cluster.

Stable interference scenario: Please refer to Appendix G for the effectiveness of Ant with stable interference and Appendix H for the speed at which Ant finds a good configuration.

Dynamic interference scenario: In the virtual cluster there are 24 VMs, each with 2 vCPUs, 4 GB RAM and 80 GB of hard disk space, hosted on blade servers in our cloud. Unlike the stable interference scenario, we do not run any specific applications in the cloud as the dynamic interference producer. Instead, various applications (e.g., scientific computations and multi-tier Web applications) are hosted in our university cloud over time. These different and time-varying applications play the role of the dynamic interference producer in the cloud environment. As shown in Figure 9, we record the cluster-level CPU utilization to demonstrate the interference in the cloud. We set the initial value of k to 2 in Algorithm 3 and then dynamically divide the virtual cluster into a few subclusters by using the modified k-means clustering approach. The re-grouping interval is dynamically adjusted based on Algorithm 4 so that the grouping of virtual nodes is updated dynamically to capture the interference changes in the cloud.

We first demonstrate the performance improvement achieved by Ant on the virtual cluster with dynamic interferences. We then evaluate the effectiveness of the proposed re-grouping approach in the dynamic interference environment. Finally, we analyze the effectiveness of the dynamic re-grouping interval selection.

8.2 Effectiveness of Ant under Dynamic Interference

Figure 10 compares the job completion times achieved by Heuristic, Starfish and Ant in the cloud with dynamic interference. It demonstrates that all of these configuration approaches improve the job completion time compared with that achieved by the stock Hadoop parameter setting (Stock). The result shows that Ant improves the average job completion time by 20%, 15% and 11% compared with Stock, Heuristic and Starfish on the virtual cluster, respectively. As in the physical cluster scenario, Stock, Heuristic and Starfish all rely on a unified and static task-level parameter setting. Such unified configurations are apparently inefficient in a virtual cluster with various interferences. In the cloud environment, task configurations should be changed in response to the changes of interference. The results also reveal that Heuristic is more effective than Starfish on the virtual cluster. Although the learning-based Starfish is more accurate than the experience-based Heuristic in the physical cluster, it fails to capture the characteristics of the various dynamic interferences in the cloud environment.

Figure 11 shows the detailed impact on task-level performance while applying Ant in the cloud environment. Figure 11(a) depicts the breakdown of the job completion time of the three jobs (i.e., J3, J6 and J9) used in the experiment. It reveals that Wordcount is map-intensive while Terasort and Grep are shuffle- and reduce-intensive. Figure 11(b) shows the average task completion times of map tasks and reduce tasks in the experiment. The result demonstrates that map tasks are relatively small compared to reduce tasks. This is due to the fact that the number of reduce tasks is configured to 0.9 times the total number of reduce slots in the cluster, which aims to complete the execution of reduce tasks in one wave to avoid the overhead caused by data shuffling.

Figure 12 shows the job completion time improvement achieved by Genetic with/without clustering, Starfish with/without clustering, and Heuristic with/without clustering in a single heterogeneous cluster.
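The node grouping step of Section 8.1 can be sketched as follows, assuming each virtual node is summarized by two measured statistics, CPU steal time and I/O rate. Plain k-means is used here in place of the modified clustering of Algorithm 3, and the node measurements are made-up values for illustration only.

    import random

    def kmeans(points, k, iters=50):
        # points: list of (cpu_steal_pct, io_rate_mb_s) tuples, one per virtual node.
        centers = random.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                idx = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
                groups[idx].append(p)
            centers = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centers[i]
                       for i, g in enumerate(groups)]
        return groups

    # Hypothetical (CPU steal time %, I/O rate MB/s) samples for six virtual nodes.
    nodes = [(2, 80), (3, 75), (15, 40), (18, 35), (30, 10), (28, 12)]
    for i, group in enumerate(kmeans(nodes, k=2)):
        print("subcluster", i, group)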


Fig. 11. Impact on task-level performance: (a) breakdown of JCT; (b) task completion times.

Fig. 12. Comparison of different tuning approaches in a heterogeneous cluster: (a) without clustering; (b) with clustering.

The results demonstrate that approaches with clustering achieve better performance improvement than approaches without clustering in heterogeneous environments. We find that Genetic with clustering achieves a slight performance improvement compared to the other two approaches. This is due to the fact that the main contribution of Ant relies on the dynamic clustering and task-level adaptive configuration in heterogeneous environments.

8.3 Effectiveness of Subcluster Re-grouping

Figure 13(a) compares the job completion time improvement achieved by Ant with the dynamic subcluster re-grouping capability and with the static grouping capability. It shows that Ant with dynamic re-grouping outperforms Ant with static grouping by almost 100% in terms of the job completion time improvement. This is due to the benefit of re-grouping virtual nodes into a number of homogeneous subclusters so as to mitigate the impact of interference on task tuning. Ant without re-grouping uses a static number of groups, similar to the stable interference scenario, and only groups the virtual nodes at the beginning of the task tuning process in the dynamic interference environment. Thus, Ant without the subcluster re-grouping capability cannot capture time-varying interference in the cloud. Figure 13(b) shows the impact of the number of groups on the job completion time under the static grouping approach. We empirically select the static number of groups (i.e., k = 2) for comparison with the dynamic subcluster re-grouping capability.

Figure 13(c) shows that both the number of groups and the re-grouping interval are dynamically adjusted based on the time-varying interference in the cloud. Ant periodically collects the dynamic capacity changes of the virtual nodes and then re-groups the virtual nodes according to Algorithms 3 and 4. Figure 13(c) depicts the dynamic tuning process of Ant in a 160-minute time window, which corresponds to the time-varying interference scenario shown in Figure 9. The number of groups is initialized to two at the beginning. It is then dynamically tuned based on the dynamic capacity changes. The result shows that the number of groups stays stable at two when the interference is relatively stable, such as in the period between the 120th and 160th minutes. However, the number of groups fluctuates when the interference changes frequently, such as in the period around the 100th minute.

At the same time, the re-grouping interval adaptively changes with the interference at runtime. When the cloud interference fluctuates significantly (e.g., from the 90th to the 110th minute), the re-grouping interval becomes small so that Ant can capture the capacity changes of the virtual nodes in the cluster; conversely, the re-grouping interval grows when the interference changes less frequently in the cloud. This reflects how sensitive the system is to the interference dynamics in our university cloud.

8.4 Sensitivity of GA Parameter Selection

We change the values of the crossover probability and the mutation rate to study their impact on the performance improvement in terms of job completion time. Figure 14(a) shows that the job completion time initially decreases as the crossover probability increases. However, increasing the crossover probability further leads to performance degradation; a very large crossover probability may cause job completion time deterioration. Thus, we empirically set the crossover probability to 0.7 in the experiment, as a tradeoff between the search speed and the job completion time improvement. Figure 14(b) shows that tuning the mutation rate exhibits a similar phenomenon. A large mutation rate (e.g., 0.5) leads to significant performance deterioration due to the instability of the task-level configuration. Thus, we empirically set the mutation rate to 0.2, which does not impact convergence in the experiment.
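The two GA operators whose settings are examined above can be sketched as follows: with crossover probability pc = 0.7 two parent configurations exchange a random tail of their parameter vectors, and with mutation rate pm = 0.2 each parameter is re-drawn from its allowed range. The parameter ranges shown are hypothetical placeholders rather than the ranges used by Ant.

    import random

    RANGES = {
        "io.sort.mb": (50, 300),
        "io.sort.factor": (5, 100),
        "io.sort.record.percent": (0.05, 0.8),
        "io.sort.spill.percent": (0.5, 0.95),
    }
    KEYS = list(RANGES)

    def crossover(a, b, pc=0.7):
        # Single-point crossover: swap the tail of the two parameter vectors.
        if random.random() < pc:
            cut = random.randrange(1, len(KEYS))
            for key in KEYS[cut:]:
                a[key], b[key] = b[key], a[key]
        return a, b

    def mutate(cfg, pm=0.2):
        # Re-draw each parameter from its range with probability pm.
        for key in KEYS:
            if random.random() < pm:
                lo, hi = RANGES[key]
                cfg[key] = round(random.uniform(lo, hi), 3)
        return cfg

    p1 = {k: round(random.uniform(*RANGES[k]), 3) for k in KEYS}
    p2 = {k: round(random.uniform(*RANGES[k]), 3) for k in KEYS}
    c1, c2 = crossover(dict(p1), dict(p2))
    print(mutate(c1))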


Fig. 13. Performance improvement comparison between the static grouping and dynamic re-grouping approaches: (a) improvement comparison; (b) static k value selection; (c) dynamic re-grouping adjustment.
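The dynamic re-grouping adjustment shown in Figure 13(c) can be mimicked with a minimal sketch: when the interference observed during the last interval (here, cluster-level CPU utilization samples) fluctuates strongly, the next re-grouping interval shrinks; when it is stable, the interval grows back toward a cap. The thresholds and bounds are assumptions for illustration and do not reproduce Algorithm 4.

    from statistics import pstdev

    def next_interval(current_min, recent_util, lo=5, hi=40, threshold=10.0):
        # recent_util: CPU utilization samples (%) collected during the last interval.
        if pstdev(recent_util) > threshold:
            return max(lo, current_min // 2)   # volatile interference: re-group sooner
        return min(hi, current_min * 2)        # stable interference: re-group less often

    print(next_interval(20, [35, 70, 30, 80, 45]))  # fluctuating utilization -> shorter interval
    print(next_interval(10, [48, 50, 52, 49, 51]))  # stable utilization -> longer interval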

Fig. 14. Sensitivity of the crossover probability and mutation rate tuning: (a) impact of pc; (b) impact of pm.

9 RELATED WORK

Framework modification. Studies [21], [22] have demonstrated that it is effective to improve MapReduce performance by modifying the default Hadoop framework. Jindal et al. [23] proposed a new data layout, coined Trojan Layout, which internally organizes data blocks into attribute groups according to the workload in order to improve data access times. It is able to schedule incoming MapReduce jobs to the data block replicas with the most suitable Trojan Layout. Dittrich et al. proposed Hadoop++ [21], a new index and join technique to improve the runtime of MapReduce jobs. Vavilapalli et al. presented the next generation of Hadoop known as YARN [24], which still employs a unified configuration policy for all tasks. None of those approaches considered modifying the default Hadoop configurations to improve MapReduce performance.

Heterogeneous environment. As heterogeneous hardware is applied to Hadoop clusters, how to improve MapReduce performance in heterogeneous environments has attracted much attention [1], [2]. Ahmad et al. [1] identified key reasons for poor MapReduce performance on heterogeneous clusters. Accordingly, they proposed an optimization-based approach, Tarazu, to improve MapReduce performance through communication-aware load balancing. Zaharia et al. [2] designed a robust MapReduce scheduling algorithm, LATE, to improve the completion time of MapReduce jobs in a heterogeneous environment. These efforts paid little attention to optimizing Hadoop configurations, which have a significant impact on the performance of MapReduce jobs, especially in a heterogeneous Hadoop cluster.

Parameter configuration. Recently, a few studies have started to explore how to optimize Hadoop configurations to improve job performance. Herodotou et al. [5] proposed several automatic optimization-based approaches for MapReduce parameter configuration to improve job performance. Kambatla et al. [25] presented a Hadoop job provisioning approach that analyzes and compares the resource consumption of applications, aiming to maximize job performance while minimizing the incurred cost. Lama and Zhou designed AROMA [6], an approach that automates resource allocation and configuration of Hadoop parameters to achieve performance goals while minimizing the incurred cost. Herodotou et al. proposed Starfish [4], an optimization framework that hierarchically optimizes from jobs to workflows by searching for good parameter configurations. These approaches mostly rely on the default Hadoop framework and configure the parameters with static settings. They are often not effective when the cluster platform becomes heterogeneous.

MapReduce in the cloud. Currently, there are several options for using MapReduce in cloud environments, such as using MapReduce as a service, setting up one's own MapReduce cluster on cloud instances, or using specialized cloud MapReduce runtimes that take advantage of cloud infrastructure services. Guo et al. [26] designed FlexSlot, an effective yet simple extension to the slot-based Hadoop task scheduling framework that adaptively changes the number of slots on each virtual node to promote efficient usage of the resource pool in a cloud environment. Chiang et al. [19] presented TRACON, a novel task and resource allocation control framework that mitigates the interference effects from concurrent data-intensive applications. Ant differentiates itself from those efforts through its capability of adaptive task-level tuning to achieve performance optimization.

10 CONCLUSIONS AND FUTURE WORK

Although a unified design framework, such as MapReduce, is convenient and easy to use for large-scale parallel and distributed programming, it ignores the differentiated needs in the presence of various platforms and workloads. In this paper, we tackle a practical yet challenging problem of automatic configuration of large-scale MapReduce workloads in heterogeneous environments. We have proposed and developed a self-adaptive task-level tuning approach, Ant, that automatically finds the optimal settings for individual jobs running on heterogeneous nodes. In Ant, tasks are customized with different settings to match the capabilities of heterogeneous nodes. It works best for large jobs with multiple rounds of map task execution. Our experimental results demonstrate that Ant can improve the average job completion time on a physical cluster by 31%, 20%, and 14% compared to stock Hadoop, customized Hadoop with industry recommendations, and a profiling-based configuration approach, respectively.


Experimental results on two virtual cloud clusters with varying multi-tenancy interferences show that Ant improves the average job completion time by 20%, 15%, and 11% compared to Stock, Heuristic and Starfish, respectively. Ant can be deployed to different types of clusters, and thus is flexible and adaptive.

Our method, Ant, can be extended to other frameworks such as Spark, though some additional effort is needed. Different from Hadoop, which executes individual tasks in separate JVMs, Spark uses executors to host multiple tasks on worker nodes. To extend Ant to Spark, we need to dynamically change executor sizes without restarting a launched job. Since running Spark on a generic cluster management middleware, such as YARN, is becoming increasingly popular, it is possible to enable malleable executors using resource containers. As such, Ant can monitor the completion times of individual tasks and use such information as feedback to determine the optimal size of Spark executors. In the future, we also plan to extend Ant to multi-tenancy public cloud environments such as Azure and EC2.

ACKNOWLEDGEMENT

This research was supported in part by U.S. NSF research grants CNS-1422119, CNS-1320122, CNS-1217979, and NSF of China research grant 61328203. A preliminary version of the paper appeared in [9]. The authors are grateful to the editor and anonymous reviewers for their valuable suggestions for revising the manuscript.

REFERENCES

[1] F. Ahmad, S. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, "Tarazu: optimizing mapreduce on heterogeneous clusters," in Proc. of Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[2] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica, "Improving mapreduce performance in heterogeneous environments," in Proc. of USENIX Symposium on Operating System Design and Implementation (OSDI), 2008.
[3] H. Herodotou, F. Dong, and S. Babu, "No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics," in Proc. of ACM Symposium on Cloud Computing (SoCC), 2011.
[4] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, "Starfish: A self-tuning system for big data analytics," in Proc. of Conference on Innovative Data Systems Research (CIDR), 2011.
[5] H. Herodotou and S. Babu, "Profiling, what-if analysis, and cost-based optimization of mapreduce programs," in Proc. Int'l Conf. on Very Large Data Bases (VLDB), 2011.
[6] P. Lama and X. Zhou, "AROMA: Automated resource allocation and configuration of mapreduce environment in the cloud," in Proc. of Int'l Conf. on Autonomic Computing (ICAC), 2012.
[7] M. Li, L. Zeng, S. Meng, J. Tan, L. Zhang, N. Fuller, and A. R. Butt, "MROnline: Mapreduce online performance tuning," in Proc. of ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2014.
[8] T. White, Hadoop: The Definitive Guide, 3rd ed. O'Reilly Media / Yahoo Press, 2012.
[9] D. Cheng, J. Rao, Y. Guo, and X. Zhou, "Improving mapreduce performance in heterogeneous environments with adaptive task tuning," in Proc. of ACM/IFIP/USENIX Int'l Middleware Conference (Middleware), 2014.
[10] Cloudera, "Configuration parameters," 2012, http://blog.cloudera.com/blog/author/aaron/.
[11] MapR, "The executive's guide to big data," 2013, http://www.mapr.com/resources/white-papers.
[12] "PUMA: Purdue mapreduce benchmark suite," 2012.
[13] Y. Guo, J. Rao, and X. Zhou, "iShuffle: Improving hadoop performance with shuffle-on-write," in Proc. of Int'l Conference on Autonomic Computing (ICAC), 2013.
[14] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Trans. on Evolutionary Computation, vol. 6, pp. 182-197, 2002.
[15] B. Igou, "User survey analysis: Cloud-computing budgets are growing and shifting; traditional IT services providers must prepare or perish," Gartner Report, 2010.
[16] D. Carrera, M. Steinder, I. Whalley, J. Torres, and E. Ayguade, "Autonomic placement of mixed batch and transactional workloads," IEEE Trans. on Parallel and Distributed Systems (TPDS), 2012.
[17] B. Palanisamy, A. Singh, and L. Liu, "Cost-effective resource provisioning for mapreduce in a cloud," IEEE Trans. on Parallel and Distributed Systems (TPDS), 2014.
[18] B. Sharma, T. Wood, and C. R. Das, "HybridMR: A hierarchical mapreduce scheduler for hybrid data centers," in Proc. IEEE Int'l Conference on Distributed Computing Systems (ICDCS), 2013.
[19] R. C. Chiang and H. H. Huang, "Interference-aware scheduling for data-intensive applications in virtualized environments," in Proc. of Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011.
[20] B. Cho, M. Rahman, T. Chajed, I. Gupta, C. Abad, N. Roberts, and P. Lin, "Natjam: Eviction policies for supporting priorities and deadlines in mapreduce clusters," in Proc. ACM Symposium on Cloud Computing (SoCC), 2013.
[21] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)," in Proc. Int'l Conf. on Very Large Data Bases (VLDB), 2010.
[22] X. Shi, M. Chen, L. He, X. Xie, L. Lu, Y. Jin, H. Chen, and S. Wu, "Mammoth: Gearing hadoop towards memory-intensive mapreduce applications," IEEE Trans. on Parallel and Distributed Systems (TPDS), 2015.
[23] A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich, "Trojan data layouts: Right shoes for a running elephant," in Proc. of ACM Symposium on Cloud Computing (SoCC), 2011.
[24] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler, "Apache hadoop yarn: Yet another resource negotiator," in Proc. ACM Symposium on Cloud Computing (SoCC), 2013.
[25] K. Kambatla, A. Pathak, and H. Pucha, "Towards optimizing hadoop provisioning in the cloud," in Proc. USENIX HotCloud Workshop, 2009.
[26] Y. Guo, J. Rao, C. Jiang, and X. Zhou, "Moving mapreduce into the cloud with flexible slot management," in Proc. of IEEE/ACM Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2014.
[27] D. Carrera, M. Steinder, I. Whalley, J. Torres, and E. Ayguade, "Enabling resource sharing between transactional and batch workloads using dynamic application placement," in Proc. ACM/IFIP/USENIX Int'l Conf. on Middleware (Middleware), 2008.
[28] J. Wolf, D. Rajan, K. Hildrum, R. Khandekar, V. Kumar, S. Parekh, K. Wu, and A. Balmin, "FLEX: A slot allocation scheduling optimizer for mapreduce workloads," in Proc. of the ACM/IFIP/USENIX Int'l Middleware Conference (Middleware), 2010.
[29] R. Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson, and A. Rowstron, "Scale-up vs scale-out for hadoop: Time to rethink?" in Proc. of ACM Symposium on Cloud Computing (SoCC), 2013.
[30] X. Li, Y. Wang, Y. Jiao, C. Xu, and W. Yu, "CooMR: Cross-task coordination for efficient data management in mapreduce programs," in Proc. of Int'l Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013.
[31] Z. Li, Y. Cheng, C. Liu, and C. Zhao, "Minimum standard deviation difference-based thresholding," in Proc. of Int'l Conference on Measuring Technology and Mechatronics Automation, 2010.
[32] A. Verma, L. Cherkasova, and R. H. Campbell, "Resource provisioning framework for mapreduce jobs with performance goals," in Proc. ACM/IFIP/USENIX Int'l Middleware Conference (Middleware), 2011.


Fig. 15. Impact of mapred.reduce.tasks.

Fig. 16. Impact of hdfs.block.size.

Fig. 17. The sensitivity of task-level parameters.

APPENDIX A
IMPACT OF CLUSTER/JOB-LEVEL PARAMETERS

Currently, cluster- and job-level configurations mostly rely on personal experience and/or job-profiling based approaches. We quantify the performance impact of an important job-level parameter, i.e., mapred.reduce.tasks, and an important cluster-level parameter, i.e., dfs.block.size, in the experiment.

Impact of reduce number. Figure 15 quantifies the performance impact of tuning one of the most important job-level parameters, i.e., mapred.reduce.tasks. The completion times changed as we altered the value of the parameter. Our experimental results suggest that the number of reduce tasks should be set to 1-5 times the number of available reduce slots in the cluster. The default setting of a single reduce task is the worst, while too many reduce tasks can also lead to performance degradation. Thus, for fairness, we set the value of parameter mapred.reduce.tasks to 0.9 times the number of available reduce slots in the cluster for all jobs in our experiments.

Impact of block size. Figure 16 shows how the job completion times varied as we changed the value of parameter dfs.block.size. Our experimental results suggest that the block size of the HDFS file system should be set to 64 MB-512 MB in the cluster. The default value of 64 MB is not an optimal setting, while too large a block size can also lead to performance degradation. This is due to the fact that the number of map tasks and the completion time of map tasks are mainly determined by the block size of the input dataset. If most map tasks take less than one minute to complete, the overhead caused by frequent task setup and scheduling can have a significant performance impact, especially for large jobs. In this case, the number of map tasks should be reduced by setting a larger block size to avoid such overhead. Therefore, if the job has more than 200 GB of input, we suggest increasing the block size of the input dataset to 128 MB or even 512 MB so that the number of map tasks will be smaller. As a result, the block size of HDFS is empirically set to 256 MB due to the large input data size in the experiment.

APPENDIX B
DISCUSSIONS ON OVERHEAD AND MULTI-JOB

Overhead of Ant. Although Ant relies on a centralized control framework, it does not launch any new process for the controller. Instead, it only sends a tiny per-task configuration description file to TaskTrackers when initializing each task. These task-level configuration files are distributed to individual slave nodes along with LaunchTaskAction by Hadoop's default RPC connections. We find that the self-tuning optimization algorithm takes approximately 25-40 ms to complete in our testbed. This overhead is negligible compared to the control interval of 5 minutes in the experiments.

Multi-job support. As MapReduce scales to ever-larger platforms, concerns about cluster utilization grow [27]. Multi-tenancy support becomes an essential need in MapReduce. However, sharing a MapReduce cluster among multiple jobs from different users poses significant challenges in resource allocation and job configuration. For instance, interference among different jobs may create straggler tasks that significantly slow down the entire job execution and may lead to severe performance degradation [28].

Parameter sensitivity. Ant is designed to tune multiple task-level parameters jointly. Figure 17 shows that tuning different parameters can lead to different performance improvements. The improvement of job completion time varies from 10% to 37% when the six parameters are tuned separately. However, tuning the parameters independently does not lead to the optimal overall performance. For instance, parameter io.sort.mb determines the total amount of buffer memory to use while sorting files, and parameter io.sort.factor specifies the number of streams to merge while sorting such files. These two parameters are highly correlated and should be tuned together to improve job performance.
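As a concrete illustration of how a parameter set like the ones discussed above could be applied to a single Hadoop 1.x job from the command line, the sketch below renders a parameter dictionary into -D generic options. Ant itself distributes per-task configuration files through TaskTrackers; this stand-alone rendering, the jar name and the chosen values are only illustrative assumptions.

    def to_hadoop_opts(params):
        # Render job-level parameters as Hadoop generic -D options.
        return " ".join("-D{}={}".format(k, v) for k, v in sorted(params.items()))

    job_params = {
        "mapred.reduce.tasks": 42,               # 0.9 x available reduce slots (Section 8.1)
        "dfs.block.size": 256 * 1024 * 1024,     # 256 MB HDFS block size (Appendix A)
        "io.sort.mb": 200,
        "io.sort.factor": 20,
        "io.sort.spill.percent": 0.8,
    }

    print("hadoop jar hadoop-examples.jar terasort " + to_hadoop_opts(job_params) + " input output")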


TABLE 4
Performance improvement of the Microsoft workload due to different approaches.

    Input size        1-100 GB    0.1-1 TB    1-10 TB    Total
    Job fraction      40%         20%         10%        70%
    JCT (Heuristic)   2.1 h       6.3 h       11.7 h     19.1 h
    JCT (Starfish)    2 h         5.9 h       11.3 h     18.2 h
    JCT (Ant)         2.3 h       5.3 h       9.1 h      16.7 h

TABLE 5
The number of spills (J3, J6 and J9).

    Approach    Wordcount     Grep          Terasort      Improvement
    Stock       2.4 x 10^4    4.7 x 10^4    1.7 x 10^5    baseline
    Heuristic   1.9 x 10^4    3.8 x 10^4    1.4 x 10^5    18%
    Starfish    1.75 x 10^4   3.6 x 10^4    1.25 x 10^5   23%
    Ant         1.4 x 10^4    2.9 x 10^4    0.95 x 10^5   42%

Fig. 18. Effectiveness of aggressive selection in terms of job completion time and convergence rate: (a) job completion time; (b) convergence rate.

APPENDIX C
ANT UNDER MICROSOFT WORKLOAD

To better understand the effectiveness of Ant in a production environment, we analyzed 174,000 jobs submitted to a production analytics cluster in a Microsoft datacenter in a single month in 2011 [29]. We use a synthetic workload, "Microsoft-Derived (MSD)", which models Microsoft's production workload. It is a scaled-down version of the workload studied in [29] since our cluster is significantly smaller. MSD contains jobs with widely varying execution times and data set sizes, representing a scenario where the cluster is used to run many different types of batch applications. We do not run the Microsoft code itself. Rather, we mimic the distribution characteristics of the job sizes by running the Terasort application with various input data sizes. We scale the workload down in two ways: we reduce the overall number of jobs to 87, and we eliminate the largest 10% of jobs and the smallest 20% of jobs.

Table 4 shows the job size distribution of the MSD workload used in the experiment and the job completion times achieved by different tuning approaches. The results demonstrate that Ant reduces the overall job completion time of the MSD workload by 12.5% and 8% compared with Heuristic and Starfish, respectively.

APPENDIX D
REDUCING THE NUMBER OF DATA SPILLS

Ant is designed to optimize task-level performance, particularly I/O operations during the map phase. The data spilling operations play an important role in the map phase and usually consume most of the execution time of map task processing [30]. Ant aims to reduce the number of map output data spills by tuning the buffer-related parameters during execution, e.g., io.sort.mb and io.sort.spill.percent. To evaluate the effectiveness, we measure the number of data spills of each map task by analyzing the spilling logs in the experiments. Table 5 shows that Ant reduces the average number of data spills by 42% while Heuristic and Starfish only achieve reductions of 18% and 23%, compared with the stock Hadoop configuration. This is due to the fact that Ant optimizes the intermediate data spilling, which satisfies the individual needs of different workloads and platforms in the heterogeneous cluster.

APPENDIX E
EFFECTIVENESS OF AGGRESSIVE SELECTION

We compare the performance impact of applying different selection strategies in Ant (described in Section 4) from the perspectives of job completion time and convergence rate. The experimental results demonstrate that the aggressive selection approach of Ant can effectively improve application performance for various workloads.

Figure 18(a) compares the job completion times achieved by different parent selection approaches. It shows that the elitist strategy improves the job completion time by 9% and the inferior strategy improves it by 17% compared with the non-aggressive strategy, respectively. Figure 18(a) also illustrates that applying the two strategies (i.e., elitist and inferior) together improves the job completion time by 22% on average for the three applications compared with the non-aggressive strategy.

Figure 18(b) illustrates the impact on the convergence rate in terms of the standard deviation [31] when using the aggressive selection strategies of Ant. It shows that non-aggressive selection has the worst convergence rate since the Roulette Wheel selection mechanism still has a low probability of selecting bad candidates to be parents. However, both the elitist and inferior strategies improve the convergence rate slightly in terms of the standard deviation compared with the non-aggressive strategy. Furthermore, the results demonstrate that applying the inferior and elitist strategies together can effectively accelerate the convergence rate during job execution. Thus, we integrate the two strategies in the task evolution of Ant to improve searching performance.

In addition, please refer to Appendix A for the impact of cluster- and job-level parameters on Ant performance, and Appendix B for a discussion of the overhead and multi-job support.
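A minimal sketch of the selection step discussed in Appendix E is given below: fitness is taken as the reciprocal of task completion time, parents are drawn by Roulette Wheel selection, the best candidate is always retained (elitist), and the worst candidate is discarded before selection (inferior). The population and completion times are made up for illustration.

    import random

    def roulette_pick(candidates, fitness):
        total = sum(fitness[c] for c in candidates)
        r = random.uniform(0, total)
        acc = 0.0
        for c in candidates:
            acc += fitness[c]
            if acc >= r:
                return c
        return candidates[-1]

    def select_parents(population, completion_time, n_parents):
        fitness = {c: 1.0 / completion_time[c] for c in population}
        ranked = sorted(population, key=lambda c: completion_time[c])
        elite, survivors = ranked[0], ranked[:-1]   # keep the best, drop the worst
        parents = [elite]
        while len(parents) < n_parents:
            parents.append(roulette_pick(survivors, fitness))
        return parents

    times = {"cfg%d" % i: t for i, t in enumerate([62, 55, 71, 90, 48, 66])}
    print(select_parents(list(times), times, n_parents=4))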


Fig. 19. Impact of environments (JCT improvement of each approach on the physical and virtual clusters).

Fig. 20. Task execution comparison (TCT improvement of map and reduce tasks for each approach).

APPENDIX F
IMPROVEMENT ON TASK EXECUTION

Reducing task execution time. Figure 20 compares the improvements in the average task completion time achieved by Ant, Heuristic and Starfish on the physical cluster. It demonstrates that the map task completion time improvements are much larger than the reduce task improvements for all three configuration approaches. Ant outperforms Heuristic and Starfish significantly in map task performance improvement while obtaining a similar reduce task performance improvement. This is due to the fact that Ant focuses on optimizing the intermediate data merging operation of multi-wave map tasks. Most production jobs have only one reduce wave since the number of reduce tasks in a job is typically set to 0.9 times the number of available reduce slots. Hence, map tasks usually occur in multiple waves, while reduce tasks tend to complete in one to two waves for most real-world workloads. Accordingly, Ant focuses on improving the execution time of map tasks, aiming to take advantage of the multi-wave behavior of map tasks. Understanding the wave behavior of tasks, such as the number of waves and the size of the waves, would aid task configuration to improve cluster utilization.

APPENDIX G
STABLE INTERFERENCE SCENARIO

Experiment setup. In our testbed there are 24 VMs, each with 2 vCPUs, 4 GB RAM and 80 GB of hard disk space, which are hosted on three blade servers in our university cloud. Each blade server also hosts a representative transactional application, RUBiS, as the interference producer, which models an online auction site. We created three VMs, each with 4 vCPUs, 8 GB RAM and 80 GB of hard disk space, for the RUBiS benchmark application on the three servers, respectively. The number of concurrent users is set to 800, 2500 and 4500 for the three servers.

We empirically set the number of subclusters to three in this experiment since all the VMs used in the experiment are hosted on three blade servers in our university cloud. We then divide the virtual cluster into three subclusters by using the k-means clustering approach. VMs having similar CPU steal time and I/O rate are classified into the same subcluster. We obtain three subclusters as follows: a subcluster of 4 VMs with low steal time and high I/O rate, a subcluster of 8 VMs with medium steal time and I/O rate, and a subcluster of 12 VMs with high steal time and low I/O rate. Note that each subcluster is treated as an independent homogeneous cluster to evolve the task-level optimal configuration set.

Effectiveness of Ant with stable interference. Ant improves the average job completion time by 23%, 11% and 16% compared with Stock, Heuristic and Starfish on the virtual cluster, respectively. However, the performance of Starfish is slightly worse than that of Heuristic in this scenario since performance interference in the virtual environment significantly reduces the accuracy of Starfish's job profiling.

Figure 19 compares the job completion time improvements achieved by Ant, Heuristic and Starfish in the physical and virtual environments, respectively. The results demonstrate that the average job completion time of Ant on the physical cluster is improved by 8% compared with that on the virtual cluster. It shows that Ant is more effective in the physical environment. This is due to the fact that MapReduce's perception of cluster topology and resource isolation may differ from the actual hardware settings when running in a virtual environment. As a result, the corresponding parameter tuning strategies are less effective in a virtual cluster than in a physical cluster. Figure 19 also shows that Ant improves the absolute job completion time by at least 9% and 7% compared with Heuristic and Starfish on the physical cluster and on the virtual cluster, respectively.

TABLE 6
Task-level parameter settings by current representative industry and academic approaches.

    Approach:      Rules-of-Thumb (Heuristic) from Cloudera     |  Starfish from Duke University
    Parameter:     g1   g2     g3    g4   g5   g6               |  g1   g2     g3    g4    g5   g6
    Wordcount      10   100mb  0.35  0.8  4k   -Xmx200m         |  10   50mb   0.15  0.8   4k   -Xmx200m
    Grep           10   200mb  0.05  0.6  16k  -Xmx300m         |  100  200mb  0.1   0.66  16k  -Xmx300m
    Terasort       10   150mb  0.15  0.7  32k  -Xmx250m         |  10   200mb  0.15  0.7   32k  -Xmx350m

Fig. 21. Impact of cluster size (time cost for convergence versus the number of machines).

Fig. 22. Impact of machine type (normalized deviation over time on the physical and virtual clusters).

APPENDIX H
SEARCH SPEED OF ANT

The searching speed of Ant represents the time cost to find a stable task configuration set for a running job on the cluster. A slow searching speed reduces the benefit of a good task-level configuration, especially for small jobs, because a small job may complete before Ant can find an optimal setting for it. This is also the reason that Ant is more effective for large jobs than for small jobs, as shown in Figures 21 and 22. Our experiments reveal two main factors that have a significant impact on the searching speed of Ant.

Impact of cluster size. The number of available nodes in the cluster is an important factor that affects the searching speed of Ant. Figure 21 enumerates the convergence cost of Ant with various numbers of machines in our experiments. The result shows that the convergence cost decreases as the number of machines in the cluster increases. The searching process relies on online learning from task execution and needs multiple reconfiguration intervals to find good task-level parameters. The more nodes available in the cluster, the more opportunities there are to learn the characteristics of the running job.

Impact of machine type. Figure 22 demonstrates that the searching speed of Ant in the physical cluster is faster than that in the virtual cluster. We define a parameter value as stable when its standard deviation is smaller than 10%, a criterion widely used to evaluate convergence speed in a previous study [31]. The result in Figure 22 shows that Ant takes 18 and 35 minutes to find a good configuration in the physical cluster and in the virtual cluster, respectively. Note that the actual resource allocation for each node in the virtual cluster is typically time-varying due to interference issues [17], [32].
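One reading of the stability criterion above is a coefficient-of-variation test: a searched parameter is treated as converged once the standard deviation of its recent values falls below 10% of their mean. The window length and the sample trace below are hypothetical.

    from statistics import mean, pstdev

    def is_stable(recent_values, threshold=0.10):
        m = mean(recent_values)
        return m != 0 and pstdev(recent_values) / abs(m) < threshold

    trace = [0.35, 0.22, 0.28, 0.17, 0.16, 0.15, 0.16, 0.15]  # successive values of one parameter
    window = 4
    for i in range(window, len(trace) + 1):
        print(trace[i - window:i], is_stable(trace[i - window:i]))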

1045-9219 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

You might also like