DPPACS: A NOVEL DATA PARTITIONING AND PLACEMENT AWARE COMPUTATION SCHEDULING SCHEME (2015)
Department of Computer Science and Engineering, National Institute of Science and Technology,
Cloud infrastructures are capable of leveraging massive computational as well as data processing capabilities in virtualized environments. Emerging applications on today's clouds are data intensive, and this has led to the trend of employing data-parallel frameworks, like Hadoop and its myriad descendants, for handling such massive data requirements. Scheduling of jobs in such frameworks is in essence a two-step process, where the block-data distribution is followed by the mapping of computations among those resources. Since most Hadoop-based systems make these two decisions independently, it seems a promising prospect to map computations within cloud resources based on the data blocks already distributed to them. This paper proposes the data partitioning and placement aware computation scheduling scheme (DPPACS), a data and computation scheduling framework that adopts the strategy of improving computation and data co-allocation within a Hadoop cloud infrastructure based on knowledge of data block availability. Accordingly, this paper proposes a data-partitioning algorithm, a novel partition-cum-placement algorithm and, finally, a computational scheduling algorithm that exploits knowledge of data availability at different clusters. The proposed DPPACS has been implemented on a test bed, and its comparative performance results with respect to Hadoop's default data placement strategy are presented. The experiments conducted herein conclusively demonstrate the efficacy of the proposed DPPACS.
1. INTRODUCTION
In the past decade, there have been myriad research efforts targeted toward the effective scheduling of massive computational jobs over a network of federated computational resources, even from different administrative domains [1]. But emerging applications and technologies, such as bioinformatics, astrophysics and the Internet-of-Things, have stimulated the advent of computations that are even more data-intensive in nature. Cloud computing has the promise of providing massive processing capabilities of computational and data resources in virtualized environments [2]. To handle the massive data requirements of applications, several data-parallel frameworks have been proposed.

Hadoop is one such data-parallel framework which, if employed, provides massive data handling capability to applications. Hadoop provides an open-source implementation of the map-reduce paradigm [3]. Hadoop allows programmers and developers to easily utilize the resources of a Hadoop cloud, by means of a runtime system that takes care of the details of input data, scheduling a program's execution across a set of machines, handling fault tolerance and managing the necessary communications. The Hadoop runtime splits a data file into multiple blocks and places these blocks within the cloud resources for reliability and performance reasons [3]. Since the performance of any large-scale system like a cloud depends on the performance of the underlying scheduling (data skewing and computation mapping) policies employed, it behooves us to arrive at appropriate scheduling algorithms for such data-intensive computational challenges on a Hadoop cloud.

Scheduling jobs in Hadoop-like map-reduce frameworks is in essence a two-step process, where data skewing, i.e. the block-data distribution (Hadoop splits a data file into multiple blocks and places these blocks within the Hadoop cloud resources), is followed by the mapping of computations among those resources. Hadoop distributions have typically designed and implemented these two steps independently. The fact that computations and data, when co-located within a map-reduce framework, can provide significant performance benefits [4] leads to the prospect of mapping computations within cloud resources intelligently, based on knowledge of the data blocks already distributed to them.

… employed for assessing the efficacy of the proposed DPPACS. Section 5 explains the experiments conducted on the aforementioned framework and provides intuitive discussions on the obtained results and their significance. Finally, the conclusions are presented in Section 6.

1.1. Related work

There have been several previous research efforts that have dealt with data placement-related issues in service-oriented infrastructures by employing concepts like grouping, in some specific way, to organize the data for high-performance data accesses. For instance, Amer et al. [14] proposed a file-grouping relation for better management of distributed file caches.
More recently, a number of works have dealt with HDFS data placement, like CDRM by Wei et al. [21], CoHadoop by Eltabakh et al. [22] and the work on energy-efficient data placement by Maheshwari et al. [23]. However, the motivation of none of these papers is to minimize inter-HDFS-cluster data movement. For instance, the main focus of Wei et al. [21] is to offer enhancements to Hadoop's default replication management for better performance and load balancing. CoHadoop [22] presents a lightweight extension of Hadoop that allows applications to control where data are stored for increased co-location. However, the emphasis of the performance analysis carried out there is limited to the context of log processing, with operations like indexing, grouping, aggregation, columnar storage, joins and sessionization. The DPPACS proposed in this paper is not restricted to a specific application type for obtaining performance benefits.

2. DATA PLACEMENT IN A NATIVE MAP-REDUCE HADOOP CLUSTER

Hadoop [6] provides an implementation of the map-reduce concept that thrives on balancing the overall data distribution among a cluster of nodes (or among different clusters of a data center, or among different data centers spread across the Internet), with the motivation to exploit the potential parallelism within the multiple map tasks of a map-reduce job. Hadoop places data blocks evenly among the storage units that constitute the HDFS using a random placement strategy. So, intuitively, it may be concluded that a map-reduce application that uses the underlying files stored in the full form (HDFS) uniformly may achieve the expected performance benefits. But unfortunately, most practical map-reduce applications exhibit some locality tendency among the datasets they require.
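As a point of reference for the schemes compared later, the random placement described above can be modelled very simply. The following Java sketch spreads blocks over clusters uniformly at random; it is a deliberately simplified illustration (real HDFS placement also accounts for replicas and rack awareness), and all class and method names here are illustrative rather than Hadoop APIs.

```java
import java.util.*;

/** Simplified model of HDDP-style placement: blocks spread uniformly at random. */
public class RandomBlockPlacer {

    /** Assigns every block to a cluster chosen uniformly at random. */
    public static Map<String, String> place(List<String> blocks, List<String> clusters) {
        Random random = new Random();
        Map<String, String> placement = new LinkedHashMap<>();
        for (String block : blocks) {
            placement.put(block, clusters.get(random.nextInt(clusters.size())));
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("b1", "b2", "b3", "b4", "b5");
        List<String> clusters = Arrays.asList("C1", "C2", "C3", "C4");
        System.out.println(place(blocks, clusters));
    }
}
```

Because the choice ignores which blocks are used together, related blocks frequently end up on different clusters, which is precisely the dispersion the illustrative example of Section 2.1 highlights.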
Figure 1 depicts HDDP. Scheduling jobs in such a framework is in essence a two-step process, where the block-data distribution (Hadoop splits a data file into multiple blocks and places these blocks within the cloud resources) is followed by the mapping of computations among those resources. In Fig. 1, the ovals denote a procedure (tasks that are generated by an application) and the rectangles represent data blocks. A data block fibj represents the jth block of a file 'i' that is currently in the distributed file system. HDDP's primary objective is load balancing, as has been denoted by the four nodes within each of the three clusters having data blocks. Subsequently, when a map-reduce task Ti is scheduled to data nodes, it may face performance barriers owing to reasons such as remote data access or queuing delay and so forth. Intuitively, it can be reasoned that by distributing data blocks with a priori knowledge of their demands by map-reduce tasks, such overheads can be alleviated. This paper presents a data partitioning and placement aware computation scheduling framework that bears the promise of reducing the aforementioned overheads; the details are provided in Section 3. However, to demonstrate the efficacy of DPPACS over HDDP, the data distribution strategies adopted by these two placement schemes have been depicted by means of an illustrative example in the next subsection.

2.1. HDDP: an illustration

In order to illustrate how HDDP works, this section describes the process with an example. The graph shown in Fig. 2 depicts the data access pattern of six map tasks among 16 data blocks which are stored in the underlying HDFS. The data block requirement of each task for its respective execution completion is shown in Table 1. It is possible to maintain this information very easily from NameNode logs. Specifically, the name node of a Hadoop cloud maintains logs for every system operation, including block usage information of files by the map tasks.

In a cloud environment, parallel application data management involves distributing these data blocks among the different racks of a cluster. All these split blocks are logically grouped into a number of partitions, namely {P1, P2, P3, P4}, based on some specific relation of the data. After the splitting of files into a number of blocks depending on architectural parameters, these blocks are randomly distributed to different clusters, and the blocks under a single cluster are logically partitioned. Table 2 shows one example of such a random distribution of blocks into four partitions on four clusters. Once these blocks are logically partitioned, all these partitions are subsequently distributed to different clusters. The distribution of partitions into clusters is founded on the principle of load balancing. For the execution of the map-reduce tasks (e.g. t2) at different clusters (the map-reduce task distribution is shown in Fig. 3), some blocks are required to be moved between the clusters. These movements definitely pose performance bottlenecks in cloud environments. Table 2 summarizes the total number of block movements for completion of the six map tasks (t1 through t6). A possible distribution of map tasks is shown in Fig. 3. Map tasks {t1, t2, t3, t4, t5, t6} are mapped to the four clusters C1 through C4, as has been depicted in Fig. 3.
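As a concrete aside, the per-task block requirements of Table 1 can be represented as a simple map from task IDs to block IDs. The Java sketch below builds such a map from access records; the one-record-per-line "taskId blockId" format is an assumption made purely for illustration and is not the actual NameNode log layout.

```java
import java.util.*;

/** Builds a task -> required-blocks map from simplified access records. */
public class BlockAccessMap {

    /** Each record is assumed to be "taskId blockId", e.g. "t1 b4". */
    public static Map<String, Set<String>> fromRecords(List<String> records) {
        Map<String, Set<String>> taskToBlocks = new HashMap<>();
        for (String record : records) {
            String[] parts = record.trim().split("\\s+");
            if (parts.length != 2) continue;              // skip malformed lines
            taskToBlocks.computeIfAbsent(parts[0], k -> new LinkedHashSet<>())
                        .add(parts[1]);
        }
        return taskToBlocks;
    }

    public static void main(String[] args) {
        // Mirrors the example of Section 3.1: t1 -> {b1, b4, b7, b9, b12, b16}.
        List<String> records = Arrays.asList(
                "t1 b1", "t1 b4", "t1 b7", "t1 b9", "t1 b12", "t1 b16",
                "t6 b4", "t6 b7", "t6 b3", "t6 b1", "t6 b15", "t6 b13");
        System.out.println(fromRecords(records));
    }
}
```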
FIGURE 2. Data access pattern of the six map tasks (t1–t6) over the 16 data blocks (b1–b16); the legend distinguishes data blocks, map tasks, data access by a map task from a data block and data access between map tasks.
TABLE 2. Columns: task; additional data blocks required for executing task (ti); no. of block movements; random partition of blocks and assigned partitions with HDDP; partition placed in cluster ID.
For instance, map tasks t1 and t5 have been placed in cluster C1 within logical partition P1, and this has been denoted as {t1, t5 → C1 with P1}. Similarly, the other map tasks have been placed as follows: {t2 → C2 with P2}, {t4 → C3 with P3} and {t3, t6 → C4 with P4}. Finally, Table 2 presents the number of data block movements among the clusters for completion of the map tasks. This has been depicted by dotted lines in Fig. 3. The data partitioning and placement aware computation scheduling scheme (DPPACS) proposed in this paper envisions reducing these data movements and, to accomplish this, three algorithms have been proposed; the details of each of these are presented in the subsequent sections.

3. DPPACS: AN OVERVIEW

With the intent to reduce the number of block movements at runtime, this paper proposes a data-partitioning scheme based on the interdependency of datasets and then places the data blocks into different clusters of a cloud data center. Finally, map-reduce tasks are scheduled to different clusters within a cloud with a priori awareness of the data placements, so that they can exploit the advantage of data-compute co-location.
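To make the block-movement overhead discussed above concrete, the tally summarized in Table 2 can be thought of as counting, for every scheduled task, the required blocks that do not already reside on the executing cluster. The following Java sketch performs that bookkeeping; it is an illustration of the metric, not the paper's exact procedure, and all identifiers and example placements are hypothetical.

```java
import java.util.*;

/** Counts inter-cluster block movements implied by a task-to-cluster assignment. */
public class BlockMovementCounter {

    /**
     * A block must be moved (or fetched remotely) whenever the cluster executing
     * a task does not already hold that block.
     */
    public static int countMovements(Map<String, Set<String>> taskToBlocks,
                                     Map<String, String> taskToCluster,
                                     Map<String, String> blockToCluster) {
        int movements = 0;
        for (Map.Entry<String, Set<String>> e : taskToBlocks.entrySet()) {
            String cluster = taskToCluster.get(e.getKey());
            for (String block : e.getValue()) {
                if (!cluster.equals(blockToCluster.get(block))) {
                    movements++;
                }
            }
        }
        return movements;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> taskToBlocks = new HashMap<>();
        taskToBlocks.put("t1", new HashSet<>(Arrays.asList("b1", "b4", "b7")));
        Map<String, String> taskToCluster = Map.of("t1", "C1");
        Map<String, String> blockToCluster = Map.of("b1", "C1", "b4", "C2", "b7", "C1");
        // b4 resides on C2 while t1 runs on C1, so one movement is reported.
        System.out.println(countMovements(taskToBlocks, taskToCluster, blockToCluster));
    }
}
```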
FIGURE 4. Indexing of files broken into multiple blocks to be used for DPPACS.
In short, the proposed DPPACS functions as a three-step process, where (a) a dependency-based partitioning of data blocks is followed by (b) the placing of these partitions within the Hadoop cluster, and finally (c) map-reduce tasks are mapped to cluster nodes so as to maximize data-compute co-location. Accordingly, three algorithms have been proposed herein, and they are discussed in the three subsequent subsections, namely 3.2 through 3.4, respectively. For easy comparison, the principles behind DPPACS' three algorithms have been detailed in Section 3.1 with the same illustrative example as presented in Section 2.1 for Hadoop's default placement strategy.

3.1. An illustrative example

A map-reduce job may be split into a number of sub map-reduce tasks to run on different virtual nodes in parallel. A job 'J' is split into a set of tasks and can be represented as J = {t1, t2, t3, ..., tn}, and the required data files are divided into a number of blocks. These blocks are placed over different data centers within a cloud environment. For example, a file 'f' may be split into a number of data blocks, for instance, f = {b1, b2, b3, ..., bn}. In DPPACS, these sets of data blocks are grouped into certain partitions according to block dependencies. Figure 4 depicts how a data file is split into a number of blocks in DPPACS and, while doing so, related data blocks are attempted to be placed within a single cluster. Algorithm 1 presents the above-said logical partition method, and Fig. 2 presents the interaction between data blocks and map tasks. For example, task ti accessing data block bj for its execution is denoted {ti → bj}. One map task can access data from different blocks concurrently for its execution. The logical partitions and their dependencies have been taken into account in this work. Based on this dependency, an adaptive technique has been used to group data blocks into logical partitions. These logical partitions are formed on the basis of the usage of data blocks by a map-reduce task. For example, if task t1 accesses six data blocks, namely b1, b4, b7, b9, b12 and b16, for its execution, it is represented as t1 → {b1, b4, b7, b9, b12, b16}, as has been depicted in Table 1.
TABLE 3. Dependency-based partition (DBP) table (partition: block IDs).
P1: 1, 4, 7, 9, 12, 16
P2: 3, 6, 11, 13, 15, 2
P3: 2, 10, 9, 13, 5, 8
P4: 5, 12, 6, 1, 4, 13
P5: 6, 14, 4, 10, 12, 8
P6: 7, 3, 1, 15

TABLE 4. Dependency-based partition placement (DPP) table.
C1: 0, 1, 6, 2
C2: 2, 5
C3: 2, 3
C4: 2, 4
These blocks are kept in one logical partition, P1, within Table 3, which has been referred to as the Partition-to-Block map (DBP) table. Similarly, task t6 accesses six data blocks, namely b4, b7, b3, b1, b15 and b13, and this has been depicted as t6 → {b4, b7, b3, b1, b15, b13}.

Block Id: No. of replicas
B1: 3
B2: 3
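Algorithm 1 itself is not reproduced in this excerpt; as a rough sketch of the underlying idea, blocks requested together by the same map task can be gathered into one logical partition, leaving blocks already claimed by an earlier partition where they are. The greedy rule below is an assumed stand-in for illustration only, not the authors' exact algorithm.

```java
import java.util.*;

/** Greedy dependency-based grouping of blocks into logical partitions. */
public class DependencyPartitioner {

    /**
     * Each task contributes the blocks it accesses to a single partition;
     * a block already claimed by an earlier partition is not reassigned.
     */
    public static Map<String, List<String>> partition(Map<String, Set<String>> taskToBlocks) {
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        Set<String> assigned = new HashSet<>();
        int next = 1;
        for (Set<String> blocks : taskToBlocks.values()) {
            List<String> partition = new ArrayList<>();
            for (String block : blocks) {
                if (assigned.add(block)) {      // true only for blocks not seen before
                    partition.add(block);
                }
            }
            if (!partition.isEmpty()) {
                partitions.put("P" + next++, partition);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> taskToBlocks = new LinkedHashMap<>();
        taskToBlocks.put("t1", new LinkedHashSet<>(Arrays.asList("b1", "b4", "b7", "b9", "b12", "b16")));
        taskToBlocks.put("t6", new LinkedHashSet<>(Arrays.asList("b4", "b7", "b3", "b1", "b15", "b13")));
        // P1 takes t1's blocks; P2 receives only t6's blocks not already in P1.
        System.out.println(partition(taskToBlocks));
    }
}
```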
TABLE 8. Summary of the illustrative example in Table 1 using the DPPACS placement scheme.
TABLE 9. Partition to cluster map (PCM) table.

… partitions within the same cluster. Based on this notion, Algorithm 1 maintains this information in Tables 6 and 8.
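The DPP and PCM tables record which cluster each logical partition is assigned to. One simple, load-balancing way to fill such a map, in the spirit of the partition-cum-placement step but not identical to Algorithm 2, is to always hand the next partition to the cluster currently holding the fewest blocks; the Java sketch below uses illustrative names throughout.

```java
import java.util.*;

/** Greedy, load-balanced assignment of logical partitions to clusters. */
public class PartitionPlacer {

    /** Returns a partition-to-cluster map, always choosing the least-loaded cluster. */
    public static Map<String, String> place(Map<String, List<String>> partitions,
                                            List<String> clusters) {
        Map<String, Integer> load = new HashMap<>();      // blocks placed per cluster
        for (String c : clusters) load.put(c, 0);

        Map<String, String> placement = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> p : partitions.entrySet()) {
            String target = Collections.min(load.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            placement.put(p.getKey(), target);
            load.merge(target, p.getValue().size(), Integer::sum);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        partitions.put("P1", Arrays.asList("b1", "b4", "b7", "b9", "b12", "b16"));
        partitions.put("P2", Arrays.asList("b3", "b13", "b15"));
        partitions.put("P3", Arrays.asList("b2", "b10"));
        System.out.println(place(partitions, Arrays.asList("C1", "C2", "C3", "C4")));
    }
}
```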
… of minimizing data movements. Detailed procedural steps are presented in Algorithm 3. For this scheduling purpose, the APIs of GridGain, a cloud middleware, have been employed within the DPPACS framework.

3.5. Implementing the Algorithms in Hadoop

Implementation of the three algorithms presented in the previous subsections (namely 3.2 through 3.4) requires modifications to Hadoop's source code. It required modifying/adding around 2000 lines of Java code to realize DPPACS' dependency-based partitioning and placement. For computational scheduling, GridGain's APIs were employed. For instance, data correlations had to be found among data blocks for building the dependency-based algorithm.
… unprocessed data among different nodes over a network becomes a critical issue and can affect Hadoop's performance. To improve the performance of Hadoop in heterogeneous clusters, DPPACS has been proposed here with the objective of minimizing data movement among cluster nodes. This goal can be achieved by an appropriate data placement scheme that distributes and stores data across multiple heterogeneous nodes while ensuring that the interdependencies among tasks are kept intact.

To test the efficacy of the three proposed algorithms, this paper employs a small cloud environment using GridGain's accelerator for Hadoop [6]. GridGain provides an integrated distributed environment for executing map-reduce tasks, both computation-intensive as well as data-intensive types, by exploiting in-memory map-reduce computations in its native distributed file system, hereafter referred to as the GGFS. It has to be noted that the GGFS encompasses the in-memory file systems of all participating nodes. In other words, in a multi-cluster cloud or in a multi-node cluster, GridGain provides a distributed file system across the main memory of all its components. Thus, it is often referred to as a distributed cache layer. Figure 5 depicts the niche of the GGFS within the GridGain-based cloud environment. On the contrary, Hadoop's HDFS is a disk-based distributed file system.

FIGURE 5. The niche of the GGFS within the GridGain-based cloud environment (labels in the figure include Bio Informatics, Analytics, RDBMS, HIVE, PIG, DATAFU, SCOOP, HBASE and HDFS).
Figure 6 depicts a higher-level schematic representation of the DPPACS framework. The labels alongside the arrows in Fig. 6 designate the sequence of steps: numerical labels depict the sequence of steps for the Hadoop framework, and alphabetical labels designate the sequence of steps for the GridGain framework. Data-intensive applications on the DPPACS framework can be handled either with all data files in-memory within the GGFS or on the disk-based HDFS. Since the main contribution of this paper pertains to a novel data placement strategy within the disk-based HDFS, the capability of the GGFS to provide a distributed caching facility over HDFS has been ignored. GridGain's APIs have been employed for computational scheduling only, leading to the GGFS being configured to work in proxy mode. The key enhancement in DPPACS is the analyzer block, which forms dependency-based groups from previous execution logs. The dependency-based grouping has been explained in Fig. 3 for block partitioning. Figure 7 gives a demonstration of the efficacy of the proposed data placement for DPPACS with respect to HDDP as has been shown in Fig. 1. The next section presents the experimental work performed and details the performance analysis of the proposed DPPACS vis-à-vis HDDP.

TABLE 10. Node configuration of the cloud test-bed set-up.
Head node with NFS storage: Make Dell Power Edge R610; Model dual Intel Xeon quad-core E5620; CPU 2.93 GHz; RAM 12 GB DDR2; internal hard disk 3 × 300 GB SAS HDD/RAID-5; network connection RPS/dual NIC; operating system CentOS 5.0; switch eight-port GB.
Compute node: Make Dell Power Edge R410; Model dual Intel Xeon quad-core E5620; CPU 2.4 GHz; RAM 12 GB; internal hard disk 3 × 300 GB SAS HDD/RAID-5; network connection RPS/dual NIC; operating system CentOS 5.0; switch eight-port GB.
5. EXPERIMENTS AND RESULTS

In this section, the results of the experiments carried out on the proposed DPPACS are presented. For the purpose of comparison, two applications have been employed to test DPPACS' efficacy against that of HDDP, namely a map-reduce version of the Bowtie indexing application that uses 40 GB of genome data, and a weather analysis program that uses NCDC's weather data. Bowtie is a fast and memory-efficient program for aligning short reads to mammalian genomes. Efficient indexing allows Bowtie to align more than 25 million 35-bp reads per CPU hour to the human genome in a memory footprint of as little as 1.1 GB [9]. Bowtie extends its extant indexing techniques with a quality-aware search algorithm that permits mismatches. The map-reduce version of Bowtie indexing allows greater alignment speed. The weather analysis map-reduce application performs numerous computation-intensive, comprehensive statistical analyses of weather parameters on NCDC's weather data, including computing the standard deviation (month-wise, year-wise and decade-wise trend analyses, etc.) and the correlation of temperature, pressure and other data fields of the dataset. The choice of these tasks provides a way to assess DPPACS' partitioning and placement strategies with varying degrees of correlated data: the genome data accessed by the former program have much more correlation compared with the weather data accessed by the latter. As already mentioned, DPPACS places data blocks within a Hadoop cluster accounting for the interdependencies among the programs accessing the data, contrary to the HDDP scheme, where data blocks are randomly placed.
FIGURE 9. (a) DPPACS' block partitioning for dataset 1. (b) DPPACS' block partitioning for dataset 2.

FIGURE 11. (a) Data movement in DPPACS for dataset 1. (b) Data movement in DPPACS for dataset 2.

FIGURE 12. (a) Completion progress percentage for dataset 1 (genome), HDDP versus DPPACS with max NR = 2. (b) Completion progress percentage for dataset 2 (NCDC), HDDP versus DPPACS with max NR = 2.
This section details the experiments carried out in order to test the performance of DPPACS vis-à-vis its HDDP counterpart.

5.1. Experiments conducted

In order to assess DPPACS' performance, this paper conducts three experiments on a small cloud environment, details of which have been summarized in the next subsection. In the first experiment, the data-distribution patterns among clusters have been studied, once for HDDP and then for DPPACS. The goal of this experiment is to test DPPACS' capability to place related data based on the dependencies depicted in Algorithm 2. The second experiment tests the relative efficacy of DPPACS with respect to HDDP in terms of block movements. Runtime block movements can be a performance bottleneck, and thus, the goal of this experiment is to validate the effect of DPPACS' dependency-based partition placement. The third experiment was conducted to observe the effect of the replication degree on the performance of DPPACS. For all these experiments, two datasets were employed, namely the 40 GB genome data [10, 11], referred to as dataset 1, and NCDC weather forecast data [12, 13], referred to as dataset 2. Two standard map-reduce programs have been employed for using these datasets: a Bowtie indexing program [7–9] that uses dataset 1 and a weather forecasting program extracted from [27] that uses dataset 2. The results of these experiments are presented in a subsequent subsection.

5.2. Test-bed set-up

In order to study the performance of DPPACS, the aforementioned experiments were carried out on a small cloud test-bed set-up. The test-bed consisted of four clusters, each of 16 nodes with Hadoop 2.0.0 installed. Out of the 64 nodes, one was configured as the NameNode and JobTracker, whereas the other 63 nodes were designated as DataNodes and TaskTrackers. The configurations of these nodes are summarized in Table 10. It has to be noted that the network characteristic settings and the capacity of the hardware switches do affect the overhead of data movement in federated Hadoop clusters. Although public cloud providers like Amazon can provide flexible network configuration settings, in this paper we stick to fixed-capacity hardware switches of 1 GB, as mentioned in Table 10, since studying the effect of such network settings is beyond the scope of this paper. The data-aware and computational scheduling algorithm, as enunciated in Algorithm 3, has been implemented on this cloud test-bed employing GridGain 5.2.

FIGURE 13. (a) Completion progress percentage for dataset 1 (genome), HDDP versus DPPACS with max NR = 3. (b) Completion progress percentage for dataset 2 (NCDC), HDDP versus DPPACS with max NR = 3.

5.3. Data distribution

The distribution of data blocks among the clusters of a cloud, and among the nodes within a cluster, is intuitively dependent on the data block uploading strategy. For instance, the 40 GB of dataset 1 may be uploaded all at once or may be uploaded based on category, that is, in a species-wise manner. Similarly, dataset 2 may be uploaded either in bulk mode or decade-wise. For an unbiased data distribution, the data are uploaded by employing both data uploading strategies, 20 times each. However, the patterns of the overall data distribution have been observed to be similar across these different runs.

5.4. Performance analysis

To validate the dependency-based partitioning and placement, we carried out a number of experiments to compare the performance of HDDP and DPPACS. We employed two map-reduce applications (Bowtie indexing on dataset 1 and weather analysis on dataset 2).
FIGURE 14. (a) Effect of replication degree on completion time for dataset 1. (b) Effect of replication degree on completion time for dataset 2.
A. Data Blocks Placement Among Cluster Nodes: Figure 8 depicts the performance of HDDP in placing data blocks among clusters, while Fig. 9 depicts the performance of DPPACS in placing data blocks among the same. Label (a) refers to dataset 1, whereas label (b) denotes dataset 2 in both Figs 8 and 9. Comparing Fig. 8a with Fig. 9a and Fig. 8b with Fig. 9b, we can easily see the dispersed nature of related data among different clusters for HDDP. This intuitively leads to expensive inter-cluster data movements. The pattern of block dispersion remains invariant in all the 20 runs of the experiment.

B. Data Blocks Movement Efficacy: Figure 10a and b shows the data movements between clusters required during runtime when employing HDDP's default placement, for each of the two datasets, respectively. It can be concluded from the dispersed nature of the distributions obtained that employing HDDP leads to related data (data of a specific species for dataset 1, or of a specific decade for dataset 2) being loaded into different clusters within a cloud (or different nodes within a cluster). On the contrary, DPPACS' logical partitioning helps to place related data within the same cluster. Such partitioning leads to reduced data movements during runtime, as explained in the next set of results.

Figure 11a and b shows the data movements between clusters required during runtime when employing DPPACS' placement, for the two datasets, respectively. Comparing Figs 10 and 11, the percentage reduction in block movements can be obtained for each of the datasets. It can be observed that, on average, block movements are reduced by ∼47% for the genome data and by about 44% for the NCDC data.

C. Computation Progress: Figure 12a and b depicts the progress of the map and reduce steps of the two respective map-reduce applications on the aforementioned datasets, using traces of two runs, once employing Hadoop's randomly placed data (HDDP) and thereafter employing DPPACS' reorganized data. Completion progression is the metric used to denote this. In both cases, the maximum degree of block replication is configured at NR = 2. Figure 13a and b, on the other hand, shows the effect of higher replication on the same, with NR = 3. The number of reducers in all these experiments is set as large as possible, so that the reduce phase does not become a bottleneck.

The pattern of completion time savings is depicted in Fig. 14a and b, the former revealing the savings obtained for dataset 1 and the latter those for dataset 2. The average savings in completion time have been found to vary between 13% and 18% for different species within dataset 1 (genome data), whereas the average savings in completion time for dataset 2 (NCDC data) have been observed to vary between 12% and 17%.

Moreover, the proposed DPPACS also exhibits faster completions with respect to Hadoop's default block distribution policy, amounting to around 32–39% for the different datasets used. This improvement can be attributed to the dependency-based grouping and subsequent data placement among nodes employed in the proposed DPPACS framework. The emergence of the proposed DPPACS as a generalized data placement policy requires further testing on applications that bear no significant correlation among the data within datasets, which shall be taken up as one of the future works of this research. Also, the effect of replication on price-performance tradeoffs in data-parallel frameworks can be taken up for future research.
[14] Amer, A., Long, D.D. and Burns, R.C. (2002) Group-Based Management of Distributed File Caches. Proc. 22nd Int. Conf. on Distributed Computing Systems, Vienna, Austria, July 2–5, pp. 525–534. IEEE.
[15] Wang, J., Shang, P. and Yin, J. (2014) DRAW: A New Data-Grouping-Aware Data Placement Scheme for Data Intensive Applications with Interest Locality. Cloud Computing for Data-Intensive Applications, pp. 149–174. Springer, New York.
[16] Yuan, D., Yang, Y., Liu, X. and Chen, J. (2010) A data placement strategy in scientific cloud workflows. Future Gener. Comput. Syst., 26, 1200–1214.
[17] Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J. and Qin, X. (2010) Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. IEEE Int. Symp. on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), Atlanta, GA, pp. 1–9. IEEE.
[22] Eltabakh, M.Y., Tian, Y., Gemulla, R., Krettek, A. and McPherson, J. (2011) CoHadoop: flexible data placement and its exploitation in Hadoop. Proc. VLDB Endowment, 4, 575–585.
[23] Maheshwari, N., Nanduri, R. and Varma, V. (2012) Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework. Future Gener. Comput. Syst., 28, 119–127.
[24] Krish, K.R., Anwar, A. and Butt, A.R. (2014) [phi]Sched: A Heterogeneity-Aware Hadoop Workflow Scheduler. 2014 IEEE 22nd Int. Symp. on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Paris, pp. 255–264. IEEE.
[25] Huang, S., Huang, J., Dai, J., Xie, T. and Huang, B. (2010) The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis. 2010 IEEE 26th Int. Conf.