The Computer Journal Advance Access published August 25, 2015

© The British Computer Society 2015. All rights reserved.


For Permissions, please email: journals.permissions@oup.com
doi:10.1093/comjnl/bxv062

DPPACS: A Novel Data Partitioning and Placement Aware Computation Scheduling Scheme for Data-Intensive Cloud Applications
K. Hemant Kumar Reddy and Diptendu Sinha Roy∗

Department of Computer Science and Engineering, National Institute of Science and Technology,

Palur Hills, Berhampur, Odisha 761008, India
∗Corresponding author: diptendu.sr@gmail.com

Cloud infrastructures are capable of leveraging massive computational as well as data processing capabilities in virtualized environments. Emerging applications on today's clouds are data intensive, and this has led to the trend of employing data-parallel frameworks, like Hadoop and its myriad descendants, for handling such massive data requirements. Scheduling of jobs in such frameworks is in essence a two-step process, where the block-data distribution is followed by the mapping of computations among those resources. Since most Hadoop-based systems make these two decisions independently, it seems a promising prospect to map computations within cloud resources based on the data blocks already distributed to them. This paper proposes the data partitioning and placement aware computation scheduling scheme (DPPACS), a data and computation scheduling framework that adopts the strategy of improving computation and data co-allocation within a Hadoop cloud infrastructure based on knowledge of data block availability. Accordingly, this paper proposes a data-partitioning algorithm, a novel partition-cum-placement algorithm and, finally, a computational scheduling algorithm that exploits knowledge of data availability at different clusters. The proposed DPPACS has been implemented on a test bed and its comparative performance results with respect to Hadoop's default data placement strategy have been presented. Experiments conducted herein conclusively demonstrate the efficacy of the proposed DPPACS.

Keywords: data-intensive computing; Hadoop; map-reduce; data placement


Received 27 September 2014; revised 29 April 2015
Handling editor: Amr El Abbadi

1. INTRODUCTION

In the past decade, there have been myriad research efforts targeted toward effective scheduling of massive computational jobs over a network of federated computational resources, even from different administrative domains [1]. But emerging applications and technologies, such as bioinformatics, astrophysics and the Internet-of-Things, have stimulated the advent of computations that are even more data-intensive in nature. Cloud computing holds the promise of providing massive processing capabilities of computational and data resources in virtualized environments [2]. To handle the massive data requirements of applications, several data-parallel frameworks have been proposed.

Hadoop is one such data-parallel framework, which, if employed, provides massive data handling capability to applications. Hadoop provides an open-source implementation of the map-reduce paradigm [3]. Hadoop allows programmers and developers to easily utilize the resources of a Hadoop cloud by means of a runtime system that takes care of the details of input data, scheduling the program's execution across a set of machines, handling fault tolerance and managing the necessary communications. The Hadoop runtime splits a data file into multiple blocks and places these blocks within the cloud resources for reliability and performance reasons [3].

Since the performance of any large-scale system like a cloud depends on the performance of the underlying scheduling (data skewing and computation mapping) policies employed, it behooves us to arrive at appropriate scheduling algorithms for such data-intensive computational challenges on a Hadoop cloud. Scheduling jobs in Hadoop-like map-reduce frameworks is in essence a two-step process, where data skewing, i.e. the block-data distribution (Hadoop splits a data file into multiple blocks and places these blocks within the Hadoop cloud resources), is followed by the mapping of computations among those resources. Hadoop distributions have typically designed and implemented these two steps independently. The fact that computations and data, when co-located within a map-reduce framework, can provide significant performance benefits [4] raises the prospect of mapping computations within cloud resources intelligently based on knowledge of the data blocks already distributed to them. This paper proposes a scheduling (data skew and computation mapping sequence) framework that adopts the strategy of improving computation and data co-location within a Hadoop cloud infrastructure based on knowledge of data block availability. This paper proposes three algorithms, namely a data-partitioning algorithm that divides the data blocks into logical partitions based on dependency among them (with respect to their usage in map-reduce tasks), a partition placement algorithm that subsequently places data blocks into different clusters and, finally, a computational scheduling algorithm that exploits knowledge of data availability at different clusters. Details of the underlying architecture as well as implementation details of these algorithms are presented in later sections. The proposed DPPACS has been implemented on a test bed, and the comparative performance results are presented. For demonstrating the efficacy of the proposed DPPACS framework, a pseudo-cloud infrastructure has been deployed using GridGain 5.2, and Hadoop 2.0 has been installed to provide the data-parallel framework. GridGain's Hadoop integration [5], termed the in-memory accelerator for Hadoop [6], has been employed for the performance study. The DPPACS framework has been implemented on top of this infrastructure. To test the efficacy of the proposed DPPACS framework, two map-reduce applications have been employed with associated data, namely a Bowtie indexing [7–10] map-reduce application that indexes chromosomes of different species using 40 GB of genome data available at the UCSC genome bioinformatics [10, 11] database, and a map-reduce application pertaining to weather forecasting and analysis of temperature data from the weather dataset obtained from the National Climatic Data Center (NCDC) [12, 13].

The remainder of this paper is organized as follows. Section 2 presents Hadoop's default data placement (HDDP) strategy and discusses its inherent limitations with an intuitive and illustrative example. Section 3 systematically presents the different aspects of DPPACS, explaining the rationale behind its underlying principles and design. This section also presents the three algorithms with a short discussion on the implementation details. Section 4 presents the GridGain–Hadoop framework employed for assessing the efficacy of the proposed DPPACS. Section 5 explains the experiments conducted on the aforementioned framework and provides intuitive discussions on the obtained results and their significance. Finally, the conclusions are presented in Section 6.

1.1. Related work

There have been several previous research efforts that have dealt with data placement-related issues in service-oriented infrastructures by employing concepts like grouping in some specific way to organize the data for high-performance data accesses. For instance, Amer et al. [14] proposed a file-grouping relation for better management of distributed file caches. Such grouping is intuitive, since group-accessed files are likely to be group accessed again. Although this works well in cache management, in data-intensive computing such grouping behavior is at a chunk level rather than a file level, since Hadoop and similar platforms split files into multiple blocks within their file systems. Wang et al. [15] presented DRAW, a Data-gRouping-AWare data placement scheme, which dynamically monitors data accesses from system log files. It extracts optimal data groupings and reorganizes data layouts to achieve improved parallelism per group subject to load balancing. Yuan et al. [16] presented a data-dependency-based data placement for scientific cloud workflows. The main focus of the work is identification of related data, in terms of dependency among datasets, and clustering them aggressively so as to reduce data movement during workflow execution time. However, this work presents simulation results only, and such aggressive clustering directly contradicts the fundamental philosophy of data-parallel frameworks, where data distribution is never intended solely for load balancing. Xie et al. [17] proposed an application framework running on a Hadoop map-reduce cluster, which considers data locality for mapping speculative map-reduce tasks in heterogeneous environments. This work focused on placement of data across nodes in such a way that each node has a balanced data processing load based on the network topology and disk space utilization of a cluster. DPPACS, however, does not consider these two factors for data placement; thus, the present work can be considered complementary to that proposed in [17]. Sehrish et al. [18] presented map-reduce with access patterns, a set of map-reduce APIs, for a possible intelligent data distribution scheme. However, this scheme has limited practicality, since the data distributions are based on a priori knowledge of future data access patterns. Cope et al. [19] proposed a data placement strategy for critical computing environments to guarantee the data's robustness. NUCA [20] employed a dual data placement scheme and replication management strategy for distributed caches to reduce data access latency. But none of the aforementioned works have considered reducing data block movement between data centers over the Internet.

More recently, a number of works have dealt with HDFS data placement, like CDRM by Wei et al. [21], CoHadoop by Eltabakh et al. [22] and the work on energy-efficient data placement by Maheshwari et al. [23]. However, the motivation of none of these papers is to minimize inter-HDFS-cluster data movement. For instance, the main focus of Wei et al. [21] is to offer enhancements to Hadoop's default replication management for better performance and load balancing. CoHadoop [22] presents a lightweight extension of Hadoop that allows applications to control where data are stored for increased co-location. However, the emphasis of the performance analysis carried out is limited to the context of log processing with operations like indexing, grouping, aggregation, columnar storage, joins and sessionization. DPPACS proposed in this paper is not restricted to a specific application type for obtaining performance benefits. The work by Maheshwari et al. [23], on the other hand, focuses on configuring Hadoop for energy efficiency by reconfiguring HDFS clusters based on current workload characteristics. ϕSched [24] presents an HDFS enhancement that focuses on minimizing inter-cluster data movements, much like the proposed DPPACS. However, the scope of ϕSched and the proposed DPPACS is different. ϕSched's main goal is to provide unique APIs for supporting workflow scheduling on heterogeneous clusters in a hardware-aware manner. It deals with detailed profiling of performance characteristics of different applications (12 applications of the HiBench Benchmark Suite [25]) on various hardware configurations. DPPACS, however, does not consider workflow scheduling. Of course, Krish et al. [24] present a region-aware data placement strategy for reducing inter-cluster data movements, but only TeraGen (generating 20 GB) and TeraSort (both from the HiBench suite) have been employed to assess the performance of that policy; thus, the validity of the region-aware placement policy proposed in [24] is rendered limited in scope, since randomly generated data cannot capture the inherent correlation that real-world datasets are generally endowed with. Moreover, DPPACS' data placement strategy has been tested with 40 GB of genome data and also with NCDC's weather data, with repeated runs to ensure its validity.

The unique contribution of DPPACS with respect to the state of the art lies in its dependency-based data partitioning and subsequent placement of the partitioned data. Besides, the fundamental difference between this work and that presented in [15, 16] is the fact that the prime motivation of those works is to form groups such that Hadoop's parallel data accesses can be exploited, whereas the dependency-based grouping enunciated in the present work is targeted at reducing runtime data movements by exploiting compute-data co-location. The replication degree can also be exploited for increasing parallel data accesses. Intuitively, it can also be concluded that the effect of fewer data movements at runtime will be more pronounced for larger cloud sizes. Thus, the proposed approach holds greater promise.

2. DATA PLACEMENT IN A NATIVE MAP-REDUCE HADOOP CLUSTER

Hadoop [6] provides an implementation of the map-reduce concept that thrives on balancing the overall data distribution among a cluster of nodes (or among different clusters of a data center, or among different data centers spread across the Internet) with the motivation of exploiting the potential parallelism within the multiple map tasks of a map-reduce job. Hadoop places data blocks evenly among the storage units that constitute the HDFS using a random placement strategy. So, intuitively, it may be concluded that a map-reduce application that uses the underlying files stored in the HDFS uniformly may achieve the expected performance benefits. But unfortunately, most practical map-reduce applications exhibit some locality tendency among the datasets they require. For instance, in the bioinformatics domain, only the X and Y chromosomes of human beings, out of all the 24 human chromosomes, are analyzed together for investigations related to offspring's gender [26]. In climate modeling and forecasting, scientists are interested in specific time periods [27]. Similarly, social networking data also reveal a high degree of grouping with respect to nation, language, profession and several other aspects [28]. Thus, it seems a worthwhile proposition to investigate the effect of data placement strategies among cloud resources. In order to better understand the data placement strategy proposed in this paper, this section provides a brief overview of the HDDP strategy.

In Hadoop's default random data placement strategy, data are distributed among the different cloud data centers to achieve load balance among them. The principle behind such a strategy is to exploit parallel data access by several map tasks out of the HDFS. It has to be borne in mind that a map-reduce job is split into a number of map tasks that run at different nodes in parallel for processing. A map task, if scheduled to a node where the required data are available locally, promises improved performance compared with the case when data have to be brought from another data center over a network. However, HDFS may be defined at varied levels of granularity. For instance, an HDFS instance might span a cluster of multiple nodes. Similarly, a number of clusters may combinedly comprise an HDFS instance spanning multiple data centers within a cloud. It can be concluded intuitively that Hadoop's data placement strategy is the same for all levels of granularity. In this section, however, HDDP has been considered at a fine level of abstraction where an HDFS instance over a multi-node cluster has been dealt with. Without loss of generality, the same principle applies to coarser HDFS instances; the only difference being increased block sizes for a larger grain size of the HDFS instance.

A file stored in HDFS is divided into a number of data blocks and then replicated over the file system for reliability and performance reasons [29]. The block sizes and the replication intensity are configurable and are decided by the system administrator based on the level of granularity.
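The block size and replication factor referred to above are ordinary Hadoop settings rather than anything specific to this paper. The following minimal Java sketch, written against the standard Hadoop 2.x client API, shows how an administrator or client would set them; the file path and values are illustrative assumptions only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch (not from the paper): configuring HDFS block granularity and
// replication intensity, the two knobs discussed in this section.
public class HdfsPlacementDefaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block granularity: 128 MB blocks; coarser HDFS instances would use larger blocks.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        // Replication intensity: three replicas per block, the value assumed in
        // the illustrative example later in the paper.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // The replication of an already stored file can also be changed afterwards
        // (the path below is hypothetical).
        fs.setReplication(new Path("/data/genome/chr1.fa"), (short) 3);
        fs.close();
    }
}
```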



FIGURE 1. An example of Hadoop’s default data placement (HDDP).

Figure 1 depicts HDDP. Scheduling jobs in such a framework is in essence a two-step process, where the block-data distribution (Hadoop splits a data file into multiple blocks and places these blocks within the cloud resources) is followed by the mapping of computations among those resources. In Fig. 1, the ovals denote procedures, i.e. tasks that are generated by an application, and the rectangles represent data blocks. A data block fi bj represents the jth block of a file 'i' that is currently in the distributed file system. HDDP's primary objective is load balancing, as denoted by the four nodes within each of the three clusters holding data blocks. Subsequently, when a map-reduce task Ti is scheduled to data nodes, it may face performance barriers owing to reasons such as remote data access, queuing delay and so forth. Intuitively, it can be reasoned that by distributing data blocks with a priori knowledge of their demands by map-reduce tasks, such overheads can be alleviated. This paper presents a data partitioning and placement aware computation scheduling framework that bears the promise of reducing the aforementioned overheads; the details are provided in Section 3. However, to demonstrate the efficacy of DPPACS over HDDP, the data distribution strategies adopted by these two placement schemes are first compared by means of an illustrative example in the next subsection.

2.1. HDDP: an illustration

In order to illustrate how HDDP works, this section describes the process with an example. The graph shown in Fig. 2 depicts the data access pattern of six map tasks among 16 data blocks, which are stored in the underlying HDFS. The data block requirement of each task for its respective execution completion is shown in Table 1. It is possible to maintain this information very easily from NameNode logs. Specifically, the name node of a Hadoop cloud maintains logs for every system operation, including block usage information of files by the map tasks.

In a cloud environment, parallel application data management involves distributing these data blocks among different racks of a cluster. All these split blocks are logically grouped into a number of partitions, namely {P1, P2, P3, P4}, based on some specific relation of the data. After splitting of files into a number of blocks depending on architectural parameters, these blocks are randomly distributed to different clusters and the blocks under a single cluster are logically partitioned. Table 2 shows one example of such a random distribution of blocks into four partitions on four clusters. Once the blocks are logically partitioned, all these partitions are subsequently distributed to different clusters. The distribution of partitions to clusters is founded on the principle of load balancing. For execution of the map-reduce tasks at different clusters (the map-reduce task distribution is shown in Fig. 3), some blocks are required to be moved between the clusters. These movements definitely create performance bottlenecks in cloud environments. Table 2 summarizes the total number of block movements for completion of the six map tasks (t1 through t6). A possible distribution of map tasks is shown in Fig. 3. Map tasks {t1, t2, t3, t4, t5, t6} are mapped to the four clusters C1 through C4 as depicted in Fig. 3.


FIGURE 2. Graph denoting the block access pattern of map tasks (nodes represent data blocks bn and map tasks ti; edges denote data accesses by map tasks from data blocks and data access between map tasks).

TABLE 1. Example of data accesses by map tasks for execution.

Task   Additional data blocks required for executing task (ti)   No. of block movements
t1     {b1, b4, b16, b9, b12}                                     5
t2     {b11, b13, b15, b2}                                        4
t3     {b2, b9, b13, b5}                                          4
t4     {b5, b6, b1}                                               3
t5     {b6, b14, b4, b10, b12, b8}                                6
t6     {b4, b7, b3, b1, b15, b13}                                 6

TABLE 2. Required block movements for HDDP.

Random partition of blocks and assigned partitions with HDDP   Partition placed in cluster Id
P1 = {b7, b2, b5, b15, b11}                                    C1
P2 = {b1, b3, b6, b9}                                          C2
P3 = {b4, b12, b13}                                            C3
P4 = {b8, b10, b14, b16}                                       C4

For instance, map tasks t1 and t5 have been placed in cluster C1 with logical partition P1, and this has been denoted as {t1, t5 → C1 with P1}. Similarly, the other map tasks have been placed as follows: {t2 → C2 with P2}, {t4 → C3 with P3} and {t3, t6 → C4 with P4}. Finally, Table 2 presents the number of data block movements among the clusters for completion of the map tasks; these movements are depicted by dotted lines in Fig. 3. The data partitioning and placement aware computation scheduling scheme (DPPACS) proposed in this paper envisions reducing these data movements; to accomplish this, three algorithms have been proposed, the details of which are presented in the subsequent sections.
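The "No. of block movements" column of Table 1 can be reproduced by a simple count: for every task, count the required blocks that are absent from the partition placed in the cluster the task was mapped to. The hypothetical Java sketch below (not the authors' code) hard-codes the example of Tables 1 and 2 and the task-to-cluster mapping of Fig. 3, and, for simplicity, assumes each cluster holds exactly one partition and ignores replicas.

```java
import java.util.*;

// Reproduces the per-task block-movement counts of Table 1 for the HDDP example.
public class HddpMovementCount {
    public static void main(String[] args) {
        // Table 1: blocks required by each map task.
        Map<String, Set<String>> need = new LinkedHashMap<>();
        need.put("t1", Set.of("b1", "b4", "b16", "b9", "b12"));
        need.put("t2", Set.of("b11", "b13", "b15", "b2"));
        need.put("t3", Set.of("b2", "b9", "b13", "b5"));
        need.put("t4", Set.of("b5", "b6", "b1"));
        need.put("t5", Set.of("b6", "b14", "b4", "b10", "b12", "b8"));
        need.put("t6", Set.of("b4", "b7", "b3", "b1", "b15", "b13"));

        // Table 2: random partitions and the cluster each one is placed in.
        Map<String, Set<String>> clusterBlocks = Map.of(
                "C1", Set.of("b7", "b2", "b5", "b15", "b11"),   // P1
                "C2", Set.of("b1", "b3", "b6", "b9"),           // P2
                "C3", Set.of("b4", "b12", "b13"),               // P3
                "C4", Set.of("b8", "b10", "b14", "b16"));       // P4

        // Fig. 3: task-to-cluster mapping used in the illustration.
        Map<String, String> placedOn = Map.of(
                "t1", "C1", "t5", "C1", "t2", "C2", "t4", "C3", "t3", "C4", "t6", "C4");

        int total = 0;
        for (Map.Entry<String, Set<String>> e : need.entrySet()) {
            Set<String> missing = new HashSet<>(e.getValue());
            // Blocks not held by the task's cluster must be moved in at runtime.
            missing.removeAll(clusterBlocks.get(placedOn.get(e.getKey())));
            System.out.println(e.getKey() + " needs " + missing.size() + " remote blocks");
            total += missing.size();
        }
        // Prints 5, 4, 4, 3, 6 and 6 (28 in total), matching Table 1; under the
        // DPPACS placement summarized later in Table 8 the same count drops to 2.
        System.out.println("Total block movements under HDDP: " + total);
    }
}
```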
3. DPPACS: AN OVERVIEW

With the intent of reducing the number of block movements at runtime, this paper proposes a data-partitioning scheme based on the interdependency of datasets and then places data blocks into different clusters of a cloud data center. Finally, map-reduce tasks are scheduled to different clusters within a cloud with a priori awareness of the data placements, so that they can exploit the advantage of data-compute co-location.



FIGURE 3. Block partition-cluster mapping.

FIGURE 4. Indexing of files broken into multiple blocks to be used for DPPACS.

In short, the proposed DPPACS functions as a three-step process, where (a) a dependency-based partitioning of data blocks is followed by (b) placing of these partitions within the Hadoop cluster and finally (c) map-reduce tasks are mapped to cluster nodes so as to maximize data-compute co-location. Accordingly, three algorithms have been proposed herein and they are discussed in the three subsequent subsections, namely 3.2 through 3.4, respectively. For easy comparison, the principles behind DPPACS' three algorithms are detailed in Section 3.1 with the same illustrative example as presented in Section 2.1 for Hadoop's default placement strategy.

3.1. An illustrative example

A map-reduce job may be split into a number of sub map-reduce tasks to run on different virtual nodes in parallel. A job 'J' is split into a set of tasks and can be represented as J = {t1, t2, t3, ..., tn}, and the required data files are divided into a number of blocks. These blocks are placed over different data centers within a cloud environment. For example, a file 'f' may be split into a number of data blocks, f = {b1, b2, b3, ..., bn}. In DPPACS, these sets of data blocks are grouped into partitions according to block dependencies. Figure 4 depicts how a data file is split into a number of blocks in DPPACS and, while doing so, related data blocks are attempted to be placed within a single cluster. Algorithm 1 presents this logical partitioning method and Fig. 2 presents the interaction between data blocks and map tasks. For example, task ti accessing data block bj for its execution is denoted {ti → bj}. One map task can access data from different blocks concurrently for its execution. The logical partitions and their dependencies have been taken into account in this work. Based on this dependency, an adaptive technique has been used to group data blocks into logical partitions. These logical partitions are formed on the basis of the usage of data blocks by map-reduce tasks. For example, if task t1 accesses six data blocks, namely b1, b4, b7, b9, b12, b16, for its execution, it is represented as t1 → {b1, b4, b7, b9, b12, b16}, as depicted in Table 1.


These blocks are kept in one logical partition, P1 of Table 3, which has been referred to as the Partition to Block map table (DBP). Similarly, task t6 accesses six data blocks, namely b4, b7, b3, b1, b15 and b13, depicted as t6 → {b4, b7, b3, b1, b15, b13}; these blocks are kept in a single logical partition, P6, as shown in Table 3. The blocks commonly used by these two tasks t1 and t6 are b1 and b4. Of these two blocks, three replicas of block b4 have already been placed in partitions P1, P4 and P5, as shown in Table 3. As this reaches its maximum replication range of three, a dependency is created between partitions P1 and P6. This has been depicted by the presence of partitions P1 and P6 in cluster C1 in Table 4, referred to as the DPP. For instance, block b4 is commonly used by tasks t1 and t6; thus, it should belong to partitions P1 and P6. But since b4 has already been placed in partitions P1, P4 and P5, as per the ongoing illustrative example, its maximum replication range of 3 (as per Table 5) is exhausted. Thus, b4 cannot be placed in partition P6, as can be seen from the last row of Table 3. So a dependency is maintained between these two partitions, namely P1 and P6, and this dependency value increases upon every such occurrence of dependencies (as shown in Table 6). Similarly, a dependency exists between partitions P2 and P6 owing to block b13. It has to be noted that for commonly shared data blocks for which the maximum degree of replication is not exhausted, such dependencies need not be maintained. For instance, tasks t3 and t5 use a common block b10, and this block is placed in the respective partitions P3 and P5; therefore, no dependency has been recorded in Table 4 (DPP). In order to maintain the dependency values and partitions, DPPACS maintains special dependency tables based on the data access patterns of map-reduce tasks. These patterns can be obtained from Hadoop's NameNode logs. Tables 3–8 depict these. For instance, Table 7 contains the block requirement of different tasks derived from the name node logs, and Table 5 holds the permitted number of replicas for each data block. It has to be noted that the number of replicas for a block is configurable in Hadoop. Table 3 holds the logical partitions made for the six tasks and 16 data blocks (as per Table 1) based on dependencies. Dependency among partitions is maintained in Table 4. The column maxPvalue(n) of Table 4 signifies the maximum number of partitions that can be placed within a cluster. It has to be noted that the maximum number of partitions in a cluster may vary. In Table 4, it has been assumed that all clusters can store at most three partitions; this is only for the purpose of simplicity. In Table 6, the mapping information of partitions to clusters is maintained. These tables are dynamically maintained, and the algorithms presented in the three subsequent subsections update and use them. Table 8 shows, in summary form, the effect of DPPACS' placement algorithm and reveals that, employing the same, the number of block movements can be reduced drastically. A data partition algorithm has been proposed to group the data blocks into partitions and to calculate the dependency values between the clusters, and it is presented in the following subsection.

TABLE 3. Dependency-based partition (DBP) table.

Partition Id   Blocks
P1             b1, b4, b7, b9, b12, b16
P2             b3, b6, b11, b13, b15, b2
P3             b2, b10, b9, b13, b5, b8
P4             b5, b12, b6, b1, b4, b13
P5             b6, b14, b4, b10, b12, b8
P6             b7, b3, b1, b15

TABLE 4. Dependency-based partition placement (DPP) table.

Cluster Id   MaxPvalue (3)   Partition(s)
C1           0               P1, P6, P2
C2           2               P5
C3           2               P3
C4           2               P4

TABLE 5. MaxReplicaValue of block-data (MRV) table.

Block Id   No. of replicas
b1         3
b2         3
b3         3
b4         3
...        3
b16        3

TABLE 6. Logical partition dependency (LPD) table.

Partition A   Partition B   Dependency value
P1            P6            1
P2            P6            1

TABLE 7. Data block requirement of individual tasks (DBR) table.

Task Id   Blocks
t1        b1, b4, b7, b9, b12, b16
t2        b3, b6, b11, b13, b15, b2
t3        b2, b10, b9, b13, b5, b8
t4        b5, b12, b6, b1, b4, b13
t5        b6, b14, b4, b10, b12, b8
t6        b4, b7, b3, b1, b15, b13
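The bookkeeping summarized in Tables 3, 5 and 6 can be sketched as follows. The Java program below is not the paper's Algorithm 1, only a simplified reading of the illustrative example: each task's block set becomes a logical partition, a block is copied into a new partition only while its replica budget (Table 5) is not exhausted, and otherwise a dependency between the first partition holding the block and the new partition is recorded. On the Table 7 inputs it reproduces Tables 3 and 6 exactly.

```java
import java.util.*;

// Simplified sketch of the dependency-based partitioning bookkeeping
// (a reading of the illustrative example, not the authors' Algorithm 1).
public class DependencyPartitioner {
    public static void main(String[] args) {
        int maxReplicas = 3;                                   // Table 5
        Map<String, List<String>> taskBlocks = new LinkedHashMap<>();
        taskBlocks.put("P1", List.of("b1", "b4", "b7", "b9", "b12", "b16"));  // t1
        taskBlocks.put("P2", List.of("b3", "b6", "b11", "b13", "b15", "b2")); // t2
        taskBlocks.put("P3", List.of("b2", "b10", "b9", "b13", "b5", "b8"));  // t3
        taskBlocks.put("P4", List.of("b5", "b12", "b6", "b1", "b4", "b13"));  // t4
        taskBlocks.put("P5", List.of("b6", "b14", "b4", "b10", "b12", "b8")); // t5
        taskBlocks.put("P6", List.of("b4", "b7", "b3", "b1", "b15", "b13"));  // t6

        Map<String, Integer> replicaCount = new HashMap<>();    // copies placed so far
        Map<String, String> firstPartitionOf = new HashMap<>(); // where a block first landed
        Map<String, Integer> dependency = new HashMap<>();      // "Pi-Pj" -> value (Table 6)
        Map<String, List<String>> partitions = new LinkedHashMap<>();

        for (Map.Entry<String, List<String>> e : taskBlocks.entrySet()) {
            String p = e.getKey();
            List<String> placed = new ArrayList<>();
            for (String b : e.getValue()) {
                int used = replicaCount.getOrDefault(b, 0);
                if (used < maxReplicas) {                       // budget left: copy the block in
                    placed.add(b);
                    replicaCount.put(b, used + 1);
                    firstPartitionOf.putIfAbsent(b, p);
                } else {                                        // budget exhausted: record a dependency
                    String key = firstPartitionOf.get(b) + "-" + p;
                    dependency.merge(key, 1, Integer::sum);
                }
            }
            partitions.put(p, placed);
        }
        System.out.println("Partitions (cf. Table 3): " + partitions);
        System.out.println("Dependencies (cf. Table 6): " + dependency);  // {P1-P6=1, P2-P6=1}
    }
}
```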


TABLE 8. Summary of the illustrative example in Table 1 using the DPPACS placement scheme.

Task   Data blocks assigned to logical partitions for execution   No. of block movements        Assigned partition   Partition placed in cluster Id
t1     P1 = {b1, b4, b7, b9, b12, b16}                            0                             P1                   C1
t2     P2 = {b3, b6, b11, b13, b15, b2}                           0                             P2                   C1
t3     P3 = {b2, b10, b9, b13, b5, b8}                            0                             P3                   C3
t4     P4 = {b5, b12, b6, b1, b4, b13}                            0                             P4                   C4
t5     P5 = {b6, b14, b4, b10, b12, b8}                           0                             P5                   C2
t6     P6 = {b7, b3, b1, b15}                                     Two, within clustered nodes   P6                   C1

TABLE 9. Partition to cluster map (PCM) table.

Task Id   Partition Id   Cluster Id
t1        P1             C1
t2        P2             C1
t3        P3             C3
t4        P4             C4
t5        P5             C2
t6        P6             C1

3.2. Partitioning of data blocks based on dependencies

This section presents the dependency-based partitioning scheme of data blocks within the proposed DPPACS framework. To form partitions, the required information is gathered from Tables 1, 2, 5, 7 and 9, and finally the information pertaining to the obtained logical partitions is stored in Table 9. Tables 4 and 6 are populated in the course of execution of the data-partitioning algorithm. As described in the previous subsection, based on the access interdependency of data blocks among map-reduce tasks, dependency values are computed in Tables 6 and 8, and the pseudo-code for the data-partitioning algorithm is presented in Algorithm 1. Comments, wherever necessary, have been provided in italics.

3.3. Dependency-based data partition placement

Once logical partitions of data blocks are formed as per Table 3, the partitions need to be placed among the different clusters within a cloud datacenter. For proper application data management, data grouping and placement play an important role in improving the performance of a cloud system. After grouping the data blocks into logical partitions depending on partition dependency values, these partitions need to be placed in an effective way so as to minimize data block movement during map-reduce task execution.

This section presents a partition placement algorithm that places the partitions among clusters within a cloud on the basis of dependency values. The higher the dependency value between clusters, the higher the probability of placing that group of partitions within the same cluster. Based on this notion, Algorithm 1 maintains Tables 6 and 8. It has to be understood that Table 6 maintains information of only those partitions for which the dependency value is greater than zero. For instance, Table 4 contains only two entries regarding the dependency between P1 and P6 with a dependency value of 1; similarly, partitions P2 and P6 have a dependency value of 1. For these two entries, two map-reduce tasks {t1, t6} use common blocks {b1, b4} for their completion (the respective blocks have been highlighted in Tables 5 and 4). However, a copy of block b1 can be placed in partition P6 because its replica value is less than the max-replica-value of 3 (as per Table 5), whereas a copy of block b4 cannot be placed in partition P6 since its replica value has already reached the max-replica-value. So a dependency is maintained in Table 4 between partitions P1 and P6 with a dependency value of 1; similar is the case for partitions P2 and P6. From these two entries in Table 4, it can be inferred that there is a dependency between partitions P1 and P6 and between P6 and P2. Thus, these three partitions can be placed within a single cluster and the rest of the partitions can be placed in other clusters (as shown in Table 6). Depending on the information obtained from Table 1, partitions are distributed among the different nodes of the clusters. The detailed procedural steps of DPPACS' partition placement strategy are explained in Algorithm 2, referred to as the DPPA.

3.4. Computation scheduling

Once the data blocks are grouped into partitions and the partitions are placed in appropriate clusters, a simple scheduling algorithm is used to map the map-reduce tasks to an appropriate cluster. For instance, once Table 6 (LPD) is prepared, its information is used to update the third column of Table 9 (PCM). Table 9 indicates, for each map-reduce task, the partition required for its completion, along with information about the partition placement.


Using this information from Table 9, scheduling decisions can be made for distributing the map-reduce tasks among cluster nodes in a data-aware manner, i.e. based on the data placement information available from Table 9; thus, data-aware scheduling can be accomplished with the goal of minimizing data movements. Detailed procedural steps are presented in Algorithm 3. For this scheduling purpose, APIs of GridGain, a cloud middleware, have been employed within the DPPACS framework.
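Once Table 9 is filled in, the scheduling decision itself reduces to a lookup. The plain-Java sketch below is a hypothetical stand-in for that decision step (it is not the paper's Algorithm 3 and does not use GridGain's APIs, which the authors employ for the actual dispatch); it simply maps each task of the running example to the cluster that already holds its partition.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the data-aware dispatch decision driven by Table 9 (PCM).
public class DataAwareScheduler {
    public static void main(String[] args) {
        // Table 9 (PCM): task -> {required partition, cluster holding that partition}.
        Map<String, String[]> pcm = new LinkedHashMap<>();
        pcm.put("t1", new String[]{"P1", "C1"});
        pcm.put("t2", new String[]{"P2", "C1"});
        pcm.put("t3", new String[]{"P3", "C3"});
        pcm.put("t4", new String[]{"P4", "C4"});
        pcm.put("t5", new String[]{"P5", "C2"});
        pcm.put("t6", new String[]{"P6", "C1"});

        // Dispatch each map-reduce task to the cluster that already holds its data;
        // in the actual framework this step is carried out through GridGain's APIs
        // rather than a print statement.
        for (Map.Entry<String, String[]> e : pcm.entrySet()) {
            String partition = e.getValue()[0];
            String cluster = e.getValue()[1];
            System.out.println(e.getKey() + " -> " + cluster
                    + " (data already in partition " + partition + ")");
        }
    }
}
```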


modifications to Hadoop’s source code. It required modi- placement. For computational scheduling, GridGain’s APIs
fying/adding around 2000 lines of Java code to accomplish were employed. For instance, data correlations had to be
realizing DPPACS’ dependency-based partitioning and found among data blocks for building the dependency-based



For instance, data correlations had to be found among data blocks for building the dependency-based partition table (as depicted in Table 3). To do this, a new parameter, namely dfs.depenencypartition.id, has been introduced in Hadoop's hdfs-site.xml configuration file. The cluster administrator can utilize this to identify the partition or physical cluster to which DataNodes belong. Thereafter, HDFS' DataNodeDescriptor data structure has been modified to provide an additional global characteristic of each DataNode. This can be employed by the DataNodeRegistration process of HDFS for registering the dependency-based partition placement of DataNodes with the federated NameNode. Moreover, the NameNode's ReplicationTargetChooser component has also been modified to support the dependency-based partition placement scheme proposed in DPPACS. HDFS maintains a NetworkTopologyStructure that keeps track of all racks within HDFS clusters. Users can also modify Hadoop's default data replication; for instance, if one wants to ensure that multiple replicas of a block are not placed on the same node, this can simply be done by adding the desired node to the excludenodelist.
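As a hedged illustration of the configuration hook described above, the snippet below shows how the custom property could be read back through Hadoop's standard Configuration API. The property name dfs.depenencypartition.id is quoted from the paper; the default value, class name and print-out are illustrative assumptions, not the authors' actual DataNodeRegistration code.

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: reading the dependency-partition identifier that the
// administrator sets in hdfs-site.xml.
public class DependencyPartitionConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");   // resource where the property is defined

        // Identifies the logical partition / physical cluster this DataNode belongs to.
        String partitionId = conf.get("dfs.depenencypartition.id", "undefined");
        System.out.println("DataNode registered with dependency partition: " + partitionId);
    }
}
```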
Moreover, NameNode’s ReplicationTargetChooser compo- performance.
nent has also been modified to support the dependency-based In heterogeneous cloud environment, the computing capa-
partition placement scheme proposed in DPPACS. HDFS bilities of nodes may vary significantly. A node with high
maintains a NetworkTopologyStructure that keeps track of computational capabilities may complete processing data
all racks within HDFS clusters. Users can modify Hadoop’s stored in a local disk of the node faster than another node with
default data replication. For instance, if one wants to ensure lower computational power. Thus, high-end nodes can lend
that multiple replicas of a block not to be placed on the same support toward load sharing by handling unprocessed data
node, it can simply be done by adding the desired node to the located in remote nodes. However, if the amount of transferred
excludenodelist. data due to load sharing is huge, the overhead of transferring


However, if the amount of data transferred due to load sharing is huge, the overhead of transferring unprocessed data among different nodes over a network becomes a critical issue and can affect Hadoop's performance. To improve the performance of Hadoop in heterogeneous clusters, DPPACS has been proposed here with the objective of minimizing data movement among cluster nodes. This goal can be achieved by an appropriate data placement scheme that distributes and stores data across multiple heterogeneous nodes while ensuring that the interdependency among tasks is kept intact.

[Figure 5 shows the GridGain–Hadoop software stack: bioinformatics, analytics and RDBMS applications over Hive, Pig, DataFu, Scoop and HBase; in-memory map-reduce and Hadoop map-reduce over the GridGain file system and HDFS; all running on the Java Virtual Machine and the operating system (Windows/Linux).]

FIGURE 5. GridGain–Hadoop default framework.

To test the efficacy of the three proposed algorithms, this paper employs a small cloud environment using GridGain's accelerator for Hadoop [6]. GridGain provides an integrated distributed environment for executing map-reduce tasks, both computation-intensive and data-intensive types, by exploiting in-memory map-reduce computations in its native distributed file system, hereafter referred to as the GGFS. It has to be noted that the GGFS encompasses the in-memory file systems of all participating nodes. In other words, in a multi-cluster cloud or in a multi-node cluster, GridGain provides a distributed file system across the main memory of all its components. Thus, it is often referred to as a distributed cache layer. Figure 5 depicts the niche of the GGFS within the GridGain-based cloud environment. On the contrary, Hadoop's HDFS is a disk-based distributed file system that supports the map-reduce paradigm and can thus handle files of much larger sizes. The Hadoop accelerator for GridGain provides an integrated in-memory-cum-disk-based distributed file system capable of providing map-reduce facilities to any application. Figure 5 depicts a high-level abstraction of the GridGain–Hadoop combined framework. It has to be noted that although the GGFS [30] may act as a distributed (in-memory) cache for HDFS, all experiments reported herein have been carried out with the GGFS acting in proxy mode. In other words, the GGFS caches nothing. The only utility of GridGain in this work is during the scheduling of computation as per Algorithm 3 presented in Section 3.4, where GridGain's APIs are employed to implement the said algorithm.

FIGURE 6. Schematic details of the DPPACS framework.



FIGURE 7. An example of DPPACS’ data placement.

Figure 6 depicts a higher-level schematic representation of the DPPACS framework. The labels alongside the arrows in Fig. 6 designate the sequence of steps: numerical labels depict the sequence of steps for the Hadoop framework and alphabetic labels designate the sequence of steps for the GridGain framework. Data-intensive applications on the DPPACS framework can be handled either with all data files in memory within the GGFS or on the disk-based HDFS. Since the main contribution of this paper pertains to a novel data placement strategy within disk-based HDFS, the capability of the GGFS to provide a distributed caching facility over HDFS has been ignored. GridGain's APIs have been employed for computational scheduling only, leading to the GGFS being configured to work in proxy mode. The key enhancement in DPPACS is the analyzer block, which forms dependency-based groups from previous execution logs. The dependency-based grouping for block partitioning has been explained with Fig. 3. Figure 7 gives a demonstration of the efficacy of the proposed data placement for DPPACS with respect to the HDDP shown in Fig. 1. The next section presents the experimental work performed and details the performance analysis of the proposed DPPACS vis-à-vis HDDP.

TABLE 10. Node configuration of the cloud test-bed set-up.

                     Head node with NFS storage          Compute node
Make                 Dell PowerEdge R610                 Dell PowerEdge R410
Model                Dual Intel Xeon quad-core E5620     Dual Intel Xeon quad-core E5620
CPU                  2.93 GHz processor                  2.4 GHz processor
RAM                  12 GB DDR2                          12 GB
Internal hard disk   3 × 300 GB SAS HDD / RAID-5         3 × 300 GB SAS HDD / RAID-5
Network connection   RPS / dual NIC                      RPS / dual NIC
Operating system     CentOS 5.0                          CentOS 5.0
Switch details       Eight-port GB                       Eight-port GB

5. EXPERIMENTS AND RESULTS

In this section, the results of experiments carried out on the proposed DPPACS are presented. For the purpose of comparison, two applications have been employed to test DPPACS' efficacy against that of HDDP, namely a map-reduce version of the Bowtie indexing application that uses 40 GB of genome data and a weather analysis program that uses NCDC's weather data. Bowtie is a fast and memory-efficient program for aligning short reads to mammalian genomes. Efficient indexing allows Bowtie to align more than 25 million 35-bp reads per CPU hour to the human genome in a memory footprint as small as 1.1 GB [9]. Bowtie extends its extant indexing techniques with a quality-aware search algorithm that permits mismatches. The map-reduce version of Bowtie indexing allows greater alignment speed. The weather analysis map-reduce application performs numerous computation-intensive, comprehensive statistical analyses of weather parameters on NCDC's weather data, including computing standard deviations (month-wise, year-wise and decade-wise trend analyses, etc.) and correlations of temperature, pressure and other data fields of the dataset. The choice of these tasks provides a way to assess DPPACS' partitioning and placement strategies with varying degrees of correlated data: the genome data accessed by the former program have much more correlation than the weather data accessed by the latter.


FIGURE 8. (a) Hadoop's default HDDP block partitioning for dataset 1. (b) Hadoop's default HDDP block partitioning for dataset 2.

FIGURE 9. (a) DPPACS' block partitioning for dataset 1. (b) DPPACS' block partitioning for dataset 2.

FIGURE 10. (a) Data movement in Hadoop's default HDDP for dataset 1. (b) Data movement in Hadoop's default HDDP for dataset 2.

FIGURE 11. (a) Data movement in DPPACS for dataset 1. (b) Data movement in DPPACS for dataset 2.


FIGURE 12. (a) Completion progress percentage for dataset 1 (genome), HDDP versus DPPACS with max NR = 2. (b) Completion progress percentage for dataset 2 (NCDC), HDDP versus DPPACS with max NR = 2.

As already mentioned, DPPACS places data blocks within a Hadoop cluster accounting for the interdependencies among the programs accessing the data, contrary to the HDDP scheme, where data blocks are randomly placed. This section details the experiments carried out in order to test the performance of DPPACS vis-à-vis its HDDP counterpart.

5.1. Experiments conducted

In order to assess DPPACS' performance, this paper conducts three experiments on a small cloud environment, details of which are summarized in the next subsection. In the first experiment, data-distribution patterns among clusters have been studied, once for HDDP and then for DPPACS. The goal of this experiment is to test DPPACS' capability to place related data based on the dependencies, as described in Algorithm 2. The second experiment tests the relative efficacy of DPPACS with respect to HDDP in terms of block movements. Runtime block movements can be a performance bottleneck, and thus the goal of this experiment is to validate the effect of DPPACS' dependency-based partition placement. The third experiment was conducted to observe the effect of the replication degree on the performance of DPPACS. For all these experiments, two datasets were employed, namely the 40 GB genome data [10, 11], referred to as dataset 1, and NCDC weather forecast data [12, 13], referred to as dataset 2. Two standard map-reduce programs have been employed for using these datasets: a Bowtie indexing program [7–9] that uses dataset 1 and a weather forecasting program extracted from [27] that uses dataset 2. The results of these experiments are presented in a subsequent subsection.

5.2. Test-bed set-up

In order to study the performance of DPPACS, the aforementioned experiments were carried out on a small cloud test-bed set-up. The test-bed consisted of four clusters, each of 16 nodes, with Hadoop 2.0.0 installed.


FIGURE 13. (a) Completion progress percentage for dataset 1 (genome), HDDP versus DPPACS with max NR = 3. (b) Completion progress percentage for dataset 2 (NCDC), HDDP versus DPPACS with max NR = 3.

Out of the 64 nodes, one was configured as the NameNode and JobTracker, whereas the other 63 nodes were designated as DataNodes and TaskTrackers. The configurations of these nodes are summarized in Table 10. It has to be noted that network characteristic settings and the capacity of the hardware switches do affect the overhead of data movement in federated Hadoop clusters. Although public cloud providers like Amazon can provide flexible network configuration settings, in this paper we stick to fixed-capacity hardware switches of 1 GB, as mentioned in Table 10, since studying the effect of such network settings is beyond the scope of this paper. The data-aware computational scheduling algorithm, as enunciated in Algorithm 3, has been implemented on this cloud test-bed employing GridGain 5.2.

5.3. Data distribution

Distribution of data blocks among the clusters of a cloud and the nodes within a cluster is intuitively dependent on the data uploading strategy. For instance, the 40 GB of dataset 1 may be uploaded all at once or may be uploaded based on category, that is, in a species-wise manner. Similarly, dataset 2 may be uploaded either in bulk mode or decade-wise. For an unbiased data distribution, the data are uploaded by employing both data uploading strategies, 20 times each. However, the patterns of overall data distribution have been observed to be similar across these different runs.

5.4. Performance analysis

To validate the dependency-based partitioning and placement, we carried out a number of experiments to compare the performance of HDDP and DPPACS. We employed two map-reduce applications (Bowtie indexing on dataset 1 and weather analysis on dataset 2).


FIGURE 14. (a) Effect of replication degree on completion time for dataset 1. (b) Effect of replication degree on completion time for dataset 2.

A. Data Block Placement Among Cluster Nodes: Figure 8 depicts the performance of HDDP in placing data blocks among clusters, while Fig. 9 depicts the performance of DPPACS in placing data blocks among the same. The (a) label refers to dataset 1, whereas label (b) denotes dataset 2 in both Figs 8 and 9. Comparing Fig. 8a with 9a and Fig. 8b with 9b, we can easily see the dispersed nature of related data among different clusters for HDDP, which intuitively leads to expensive inter-cluster data movements. The pattern of block dispersion remains invariant in all the 20 times that the experiment was conducted.

B. Data Block Movement Efficacy: Figure 10a and b shows the required data movements between clusters when employing HDDP's default placement at runtime for the two datasets, respectively. It can be concluded from the dispersed nature of the distributions obtained that employing HDDP leads to related data (data of a specific species for dataset 1 or a specific decade for dataset 2) being loaded into different clusters within a cloud (or different nodes within a cluster). On the contrary, DPPACS' logical partitioning helps to place related data within the same clusters. Such partitioning leads to reduced data movements during runtime, as the next set of results shows.

Figure 11a and b shows the required data movements between clusters when employing DPPACS' placement at runtime for the two datasets, respectively. Comparing Figs 10 and 11, the percentage of reduction in block movement can be obtained for each of the datasets. It can be observed that, on average, block movements are reduced by ∼47% for the genome data and by about 44% for the NCDC data.
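The reduction figures quoted above are assumed here to follow the ordinary definition of percentage reduction in runtime block movements (the formula is not stated in the paper; B_HDDP and B_DPPACS denote the total numbers of blocks moved at runtime under the two schemes, i.e. the quantities plotted in Figs 10 and 11):

\[
\text{reduction} = \frac{B_{\mathrm{HDDP}} - B_{\mathrm{DPPACS}}}{B_{\mathrm{HDDP}}} \times 100\%.
\]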

Section A: Computer Science Theory, Methods and Tools


The Computer Journal, 2015
18 K.H.K. Reddy and D.S. Roy

C. Computation Progress: Figure 12a and b depicts the progress of the map and reduce steps of the two respective map-reduce applications on the aforementioned datasets using traces of two runs, once employing Hadoop's randomly placed data (HDDP) and thereafter employing DPPACS's reorganized data. Completion progression is the metric used to denote this. In both cases, the maximum degree of block replication is configured at NR = 2. Figure 13a and b, on the other hand, shows the effect of higher replication on the same, with NR = 3.

The pattern of completion time savings is depicted in Fig. 14a and b, the former revealing the savings obtained for dataset 1 and the latter for dataset 2. The average savings in completion times have been found to vary between 13% and 18% for different species with dataset 1 (genome data), whereas the average savings in completion time for dataset 2 (NCDC data) have been observed to vary between 12% and 17%.

The number of reducers in all these experiments is set as large as possible, so that the reduce phase does not become a performance bottleneck. It is evident from the results that performance can be improved by tuning these parameters depending on the size of the datasets that an application uses. On average, both map phases (for the two datasets used) running on DPPACS's data finished nearly 44% earlier than when running on the randomly placed HDDP data. Applications' overall execution time also improves by approximately 34% using DPPACS's data in comparison with HDDP; the same can be seen from Figs 12a and 13a. The map-reduce job running on DPPACS's reorganized data has 52% of its maps benefiting from data locality, compared with 38% for the randomly placed HDDP data. Of course, the two applications chosen for experimental study in this paper have high correlation among their data. It can be concluded as a generalization that applications using uncorrelated datasets might not show such a pronounced degree of performance advantage of DPPACS over HDDP. However, almost all practical data show substantial correlation and thus the proposed DPPACS seems appealing as a generalized placement policy for future data placement strategies.

6. CONCLUSIONS

This paper proposes a data and computation scheduling framework, DPPACS, that adopts the strategy of improving computation and data co-allocation within a Hadoop cloud infrastructure based on knowledge of data block availability. Hadoop's default strategy places data blocks randomly among the clusters, which may lead to performance bottlenecks where correlation among datasets is prevalent. In this context, the DPPACS framework has been proposed, and three algorithms have been provided to implement the same. A pseudo-cloud environment has been set up, numerous experiments have been carried out on certain standard programs using two distinct datasets, and DPPACS' efficacy over Hadoop's default placement strategy has been compared. The experiments conducted reveal that, for the given applications, DPPACS shows a significant reduction of around 44–47% in the required number of block movements at runtime, as can be seen in Figs 10 and 11. Moreover, the proposed DPPACS also exhibits faster completions with respect to Hadoop's default block distribution policy, amounting to around 32–39% for the different datasets used. This improvement can be attributed to the dependency-based grouping and subsequent data placement among nodes employed in the proposed DPPACS framework. The emergence of the proposed DPPACS as a generalized data placement policy requires further testing on applications that bear no significant correlation among the data within their datasets, which is one of the future works of this research. The effect of replication on price-performance tradeoffs in data-parallel frameworks can also be taken up for future research.

ACKNOWLEDGEMENTS

This work has been facilitated by and carried out in parts at the High Performance Computing (HiPC) Lab and the Data Sciences Lab, Department of Computer Science and Engineering, National Institute of Science and Technology, Berhampur.

REFERENCES

[1] Foster, I., Zhao, Y., Raicu, I. and Lu, S. (2008) Cloud Computing and Grid Computing 360-Degree Compared. Grid Computing Environments Workshop, GCE'08, Austin, TX, USA, November 16, pp. 1–10. IEEE.
[2] Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J. and Brandic, I. (2009) Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility. Future Gener. Comput. Syst., 25, 599–616.
[3] Dean, J. and Ghemawat, S. (2008) MapReduce: simplified data processing on large clusters. Commun. ACM, 51, 107–113.
[4] Ibrahim, S., Jin, H., Lu, L., Qi, L., Wu, S. and Shi, X. (2009) Evaluating MapReduce on Virtual Machines: The Hadoop Case. In Cloud Computing, pp. 519–528. Springer, Berlin.
[5] www.gridgain.com (accessed November 5, 2013).
[6] http://hadoop.apache.org/ (accessed November 15, 2013).
[7] http://bowtie-bio.sourceforge.net/index.shtml
[8] http://enterix.cbcb.umd.edu/enteric/enteric-eco.html (accessed November 15, 2013).
[9] Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25. http://michael.dipperstein.com/bwt/
[10] http://genome.ucsc.edu/ (accessed November 15, 2013).
[11] Unipro UGENE. http://ugene.unipro.ru (accessed November 15, 2013).
[12] http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/integrated-surface-database-isd (accessed November 15, 2013).
[13] ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/ (accessed November 15, 2013).


[14] Amer, A., Long, D.D. and Burns, R.C. (2002) Group-based Management of Distributed File Caches. Proc. 22nd Int. Conf. on Distributed Computing Systems, Vienna, Austria, July 2–5, pp. 525–534. IEEE.
[15] Wang, J., Shang, P. and Yin, J. (2014) DRAW: A New Data-Grouping-Aware Data Placement Scheme for Data Intensive Applications with Interest Locality. Cloud Computing for Data-Intensive Applications, pp. 149–174. Springer, New York.
[16] Yuan, D., Yang, Y., Liu, X. and Chen, J. (2010) A data placement strategy in scientific cloud workflows. Future Gener. Comput. Syst., 26, 1200–1214.
[17] Xie, J., Yin, S., Ruan, X., Ding, Z., Tian, Y., Majors, J. and Qin, X. (2010) Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters. IEEE Int. Symp. on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), Atlanta, GA, pp. 1–9. IEEE.
[18] Sehrish, S., Mackey, G., Wang, J. and Bent, J. (2010) MRAP: A Novel MapReduce-based Framework to Support HPC Analytics Applications with Access Patterns. Proc. 19th ACM Int. Symp. on High Performance Distributed Computing, Chicago, IL, June 20–25, pp. 107–118. ACM.
[19] Cope, J.M., Trebon, N., Tufo, H.M. and Beckman, P. (2009) Robust Data Placement in Urgent Computing Environments. IEEE Int. Symp. on Parallel & Distributed Processing (IPDPS 2009), Rome, Italy, May 25–29, pp. 1–13. IEEE.
[20] Hardavellas, N., Ferdman, M., Falsafi, B. and Ailamaki, A. (2009) Reactive NUCA: Near-optimal Block Placement and Replication in Distributed Caches. ACM SIGARCH Computer Architecture News, 37(3), 184–195. ACM.
[21] Wei, Q., Veeravalli, B., Gong, B., Zeng, L. and Feng, D. (2010) CDRM: A Cost-effective Dynamic Replication Management Scheme for Cloud Storage Cluster. 2010 IEEE Int. Conf. on Cluster Computing (CLUSTER), Heraklion, Crete, Greece, September 20–24, pp. 188–196. IEEE.
[22] Eltabakh, M.Y., Tian, Y., Gemulla, R., Krettek, A. and McPherson, J. (2011) CoHadoop: flexible data placement and its exploitation in Hadoop. Proc. VLDB Endowment, 4, 575–585.
[23] Maheshwari, N., Nanduri, R. and Varma, V. (2012) Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework. Future Gener. Comput. Syst., 28, 119–127.
[24] Krish, K.R., Anwar, A. and Butt, A.R. (2014) ϕSched: A Heterogeneity-Aware Hadoop Workflow Scheduler. 2014 IEEE 22nd Int. Symp. on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), Paris, pp. 255–264. IEEE.
[25] Huang, S., Huang, J., Dai, J., Xie, T. and Huang, B. (2010) The HiBench Benchmark Suite: Characterization of the MapReduce-based Data Analysis. 2010 IEEE 26th Int. Conf. on Data Engineering Workshops (ICDEW), Long Beach, CA, March 1–6, pp. 41–51. IEEE.
[26] Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G. and Beasley, E. (2001) The sequence of the human genome. Science, 291, 1304–1351.
[27] Tripathi, S. and Govindaraju, R.S. (2009) Change Detection in Rainfall and Temperature Patterns Over India. Proc. Third Int. Workshop on Knowledge Discovery from Sensor Data, Paris, France, pp. 133–141. ACM.
[28] Wasserman, S. (1994) Social Network Analysis: Methods and Applications, Vol. 8. Cambridge University Press, Cambridge.
[29] Shvachko, K., Kuang, H., Radia, S. and Chansler, R. (2010) The Hadoop Distributed File System. 2010 IEEE 26th Symp. on Mass Storage Systems and Technologies (MSST), Incline Village, NV, pp. 1–10. IEEE.
[30] http://en.wikipedia.org/wiki/Google_File_System (accessed November 15, 2013).
