
NETWORKS & SECURITY

An Enhanced Hadoop Heartbeat Mechanism for MapReduce Task Scheduler Using Dynamic Calibration

Xinzhu Lu1, Keatkeong Phang2,*
1,2 Faculty of Computer Science and Information Technology, University of Malaya, 50603, Malaysia
* The corresponding author, email: phangkeatkeong@gmail.com

Abstract: MapReduce is a popular programming model for processing large-scale datasets in a distributed environment and is a fundamental component of current cloud computing and big data applications. In this paper, a heartbeat mechanism for MapReduce Task Scheduler using Dynamic Calibration (HMTS-DC) is proposed to address the unbalanced node computation capacity problem in a heterogeneous MapReduce environment. HMTS-DC uses two mechanisms to dynamically adapt and balance the tasks assigned to each compute node: 1) using heartbeat to dynamically estimate the capacity of the compute nodes, and 2) using the data locality of replicated data blocks to reduce data transfer between nodes. With the first mechanism, based on the heartbeats received during the early stage of the job, the task scheduler can dynamically estimate the computational capacity of each node. Using the second mechanism, unprocessed tasks local to each compute node are reassigned and reserved so that nodes with greater capacities reserve more local tasks than their weaker counterparts. Experimental results show that HMTS-DC performs better than Hadoop and the Dynamic Data Placement Strategy (DDP) in a dynamic environment. Furthermore, an enhanced HMTS-DC (EHMTS-DC) is proposed by incorporating historical data. In contrast to the "slow start" property of HMTS-DC, EHMTS-DC relies on the historical computation capacities of the slave machines. The experimental results show that EHMTS-DC outperforms HMTS-DC in a dynamic environment.

Keywords: dynamic load-balancing; Hadoop; MapReduce; replication mechanism; heartbeat mechanism

Received: Mar. 23, 2018; Revised: Jul. 11, 2018; Editor: Honggang Zhang

I. INTRODUCTION

The emergence of cloud computing, the big data era, and the proliferation of data centers is the inevitable consequence of the human ability to transform each event and interaction into digital data. A considerable amount of valuable information can be derived by analyzing and extracting value from data. Consequently, the ever-increasing volume of data leads to the search for more efficient data-intensive processing frameworks. MapReduce is a popular programming model for processing large-scale datasets in a distributed environment. Hadoop, an open source implementation of MapReduce that originated from the Google File System [1], is a programming framework that focuses on the processing and storage of large data in a distributed computing environment [2].


The Hadoop Distributed File System (HDFS) is the storage component of Hadoop. It is very similar to existing distributed file systems, but is optimized for high throughput and works best when handling large volumes of data. HDFS partitions the data related to a MapReduce job into chunks of fixed-size blocks called tasks (or blocks). It then replicates, distributes, and stores these data blocks or tasks on the related data nodes for parallel processing [5, 6].

Fig. 1 depicts the MapReduce architecture. The major entities in the architecture are the client application, the JobTracker, and the TaskTrackers. The JobTracker runs on a master node, and manages and coordinates the jobs distributed among the nodes. A TaskTracker runs on each compute node and launches and coordinates the tasks executed within that node. In figure 1, a MapReduce client application communicates with the JobTracker. The JobTracker accepts the job request from the client, breaks it down into smaller tasks, and assigns these tasks to TaskTrackers. Then, the TaskTrackers perform the assigned map and reduce tasks.

To process MapReduce jobs and tasks efficiently, many schedulers have been proposed [3, 9-16]. Besides these, there are alternative types of task schedulers that focus on adjusting task lists or optimizing task load. Our main concern is to improve the replication mechanism and the heartbeat to balance task loading on the cluster. On submission of a MapReduce job, the data file is partitioned into chunks of fixed-size blocks. HDFS replicates, distributes, and stores the data blocks on the related data nodes. Then, Hadoop tasks are initiated to process these data blocks. In the current MapReduce task scheduler implementation in Hadoop, a homogeneous environment is assumed, and these blocks are queued and accessed sequentially. Although simple and elegant, this may not be efficient in a heterogeneous dynamic environment in terms of local data block processing and the network communication overhead involved in data block transmission.

In view of the poor performance of current mechanisms in a heterogeneous dynamic environment, we propose adaptive MapReduce task schedulers. Our task schedulers are able to dynamically adapt and balance the tasks assigned to each compute node using two mechanisms: 1) using heartbeat to dynamically estimate the capacity of the compute nodes, and 2) using the data locality of replicated data blocks to reduce data transfer between nodes.

The experimental and simulation results show that our proposed algorithm, namely the heartbeat mechanism for MapReduce task scheduler using Dynamic Calibration (HMTS-DC), performs better than Hadoop [2] and the current Dynamic Data Placement Strategy (DDP) [9] in dynamic environments. HMTS-DC has been further enhanced into the Enhanced HMTS-DC (EHMTS-DC) with the introduction of historical data. In contrast to the "slow start" property of HMTS-DC, EHMTS-DC is able to assign local tasks to compute nodes quickly, in proportion to the initial historical capacity of each node. As the job progresses, EHMTS-DC, based on the periodic heartbeat, recalibrates the relative capacities of the compute nodes. The remaining uncomputed tasks are reassigned to compute nodes based on these revised capacities. The simulation and experimental results show that EHMTS-DC performs better than HMTS-DC.

Fig. 1. MapReduce architecture and job submission [4] (client application, JobTracker with active and retired/historical jobs, and TaskTrackers 1-3 running map (M) and reduce (R) tasks).

The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 presents our proposed model, HMTS-DC. Section 4 presents the experimental and simulation setup, the performance analysis, and the enhancement of EHMTS-DC. Section 5 presents experimental tests on real physical machines. Finally, Section 6 presents the conclusion of this study and future work.



II. RELATED WORKS

In this section, we describe research that attempts to improve the performance of the Hadoop ecosystem and to extend its range of application.

The job scheduler is an independent module in MapReduce, and can be replaced by the user with simple plugin commands. The best-known schedulers, the FIFO, Fair, and Capacity schedulers, are proposed in [2]. The FIFO Scheduler is the default Hadoop scheduler. Jobs are scheduled on a first-come first-serve basis; in other words, the FIFO Scheduler places jobs in a queue and runs them in the order of job submission. A received job is partitioned into smaller tasks, which are put into a queue to be accessed by the JobTracker. The strength of FIFO task scheduling in Hadoop lies in its simplicity; its disadvantage is that no consideration is given to differences in the computation capacity of the nodes. The Capacity Scheduler allows organizations to share a large cluster, and guarantees each organization a minimum capacity. The benefit of the Capacity Scheduler is that an organization can use and access excess capacity, i.e., capacity not being used by others. The Fair Scheduler assigns resources to jobs and ensures that all jobs obtain an equal share of resources over time. The scheduler ensures that short jobs finish within reasonable makespans without starving long jobs. It is worth noting that the task scheduler (the sub-job scheduler) is not independent; it is closely tied to the Hadoop heartbeat mechanism and other internal functions. Some researchers exploit the task scheduler together with these internal functions in various ways, for example by adjusting data locality or data placement.

Heartbeat is a signal generated periodically by hardware or software to synchronize nodes or to indicate the status of nodes, i.e., whether they are normal or failed. It is a commonly used mechanism in computer science; a minimal sketch of this failure-detection pattern appears at the end of this section. For instance, instantaneous messaging applications such as VANETs and wireless sensor networks usually rely on heartbeat in their operation. [17] reports the use of heartbeat to synchronize VANET nodes and improve VANET safety. That paper proposed a heartbeat-message-based misbehavior detection scheme, which is capable of identifying the source of false information and thus improving VANET safety. In [18], a heartbeat protocol is introduced along each path of a wireless sensor network in order to guarantee the timely detection of node failures. The proposed protocol is able to shorten the latency between a failure and a full recovery of the network. Hadoop (an implementation of MapReduce) uses a heartbeat mechanism for communication between the computing nodes, i.e., between the TaskTrackers and the JobTracker. The heartbeat is implemented as a remote procedure call (RPC) [19] based on TCP/IP socket communication. In general, a heartbeat as an RPC application will use the User Datagram Protocol (UDP) when sending data, and only fall back to the Transmission Control Protocol (TCP) when the data to be transferred does not fit into a single UDP datagram. Greater detail on how heartbeat is used in Hadoop is presented in Section 3.

Xie et al. [10] pointed out that ignoring data locality in heterogeneous environments can hinder MapReduce performance. They addressed the problem of data placement across nodes so that the data to be processed by each node is balanced. Data is allocated to a node based on the node capacity. However, as replication is removed, system fault tolerance is not preserved.

He et al. [11] proposed the Matchmaking algorithm to improve data locality at the task allocation level. Regardless of which job a task belongs to, local tasks always have higher execution priority than non-local tasks. In addition, each slave has a marker to ensure that each node has a fair chance of being assigned its local tasks.
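The heartbeat-based failure detection used in [17] and [18], and by Hadoop itself, follows one basic pattern: each node emits periodic beats, and a monitor marks a node as failed when no beat arrives within a timeout. Below is a minimal sketch of this pattern; it is our own illustration with hypothetical names such as HeartbeatMonitor, not the code of any of the cited systems.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of heartbeat-based failure detection (hypothetical names).
public class HeartbeatMonitor {
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();
    private final long timeoutMs;

    public HeartbeatMonitor(long timeoutMs) { this.timeoutMs = timeoutMs; }

    // Called whenever a heartbeat message is received from a node.
    public void onHeartbeat(String nodeId) {
        lastBeat.put(nodeId, System.currentTimeMillis());
    }

    // A node is considered failed if it has never reported or if its last
    // beat is older than the timeout.
    public boolean isFailed(String nodeId) {
        Long t = lastBeat.get(nodeId);
        return t == null || System.currentTimeMillis() - t > timeoutMs;
    }

    public static void main(String[] args) {
        // 3000 ms mirrors Hadoop's default 3 s heartbeat interval (Section 3).
        HeartbeatMonitor monitor = new HeartbeatMonitor(3000);
        monitor.onHeartbeat("slave1");
        System.out.println(monitor.isFailed("slave1")); // false: beat is recent
        System.out.println(monitor.isFailed("slave2")); // true: never reported
    }
}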



In other words, Matchmaking gives every slave node a fair chance to grab local tasks before any non-local tasks are assigned to any slave node. The drawback is that no consideration is given to heterogeneous environments.

Furthermore, historical records [9][13] are mostly used to pre-adjust data placement to improve the load balance between the data and the cluster. Lee et al. [9] proposed the Dynamic Data Placement (DDP) strategy for Hadoop in heterogeneous environments. DDP adapts and balances the data stored on each node based on the computing capacity of the individual node. Higher-capacity nodes are assigned more data. DDP is static (despite the title containing the term "Dynamic") and is based on historical run time data. DDP is highly accurate provided the computational environment remains static. However, should the environment become dynamic, i.e., should the computation capacity of the nodes fluctuate, DDP is unable to perform well. In addition, by excluding HDFS and data replication, DDP does not support fault tolerance.

Xu et al. [13] proposed the Dynamic Task Splitting Scheduler (DTSS) to address the tradeoff between fairness and data locality during job scheduling. On a non-data-local node, DTSS splits a task dynamically and executes the split tasks immediately to improve fairness. The drawback of DTSS is that copying the data blocks from the local node to the remote node increases the network communication cost.

Some researchers consider the entire process from job submission to completion, but these works are generally targeted and limited, e.g., addressing only short jobs [12].

Speculative tasks are those tasks that can delay the entire execution time late in the job execution process. Zaharia et al. [7] developed the LATE (Longest Approximate Time to End) algorithm. The three principles of LATE are: tasks are prioritized for speculation, fast nodes are chosen to execute tasks, and speculative tasks are capped to prevent thrashing. Contrary to Hadoop, which assumes that all compute nodes have a similar computing capacity, the LATE algorithm is one of the earliest works to consider the design of the cluster in a heterogeneous environment. The LATE algorithm determines the remaining tasks by identifying the actually slow tasks, i.e., the nodes that are relatively slow in completing their tasks. However, the problem with LATE is that it is difficult to identify the genuinely slow tasks.

Chen et al. [8] proposed a Self-Adaptive MapReduce scheduling algorithm (SAMR). SAMR adapts to variation in the environment by computing task progress dynamically, and attempts to improve on the computation of LATE. By analyzing historical records, SAMR is able to accurately identify the actual slow tasks and back up the slow tasks to conserve storage. SAMR separates slow tasks into 1) slow map tasks and 2) slow reduce tasks, and assigns the backup tasks to faster nodes. SAMR was shown to perform better than LATE.

Gu et al. [12] proposed SHadoop to improve MapReduce performance by optimizing the job execution mechanism in Hadoop clusters. SHadoop replaces the heartbeat mechanism with instant messaging to monitor tasks, which results in faster scheduling and execution of performance-sensitive tasks. SHadoop improves the makespan, particularly for short jobs.

Some studies help perfect Hadoop by adjusting the cluster structure to make it more suitable for processing and computing over massive data storage. In our discussion, clusters are classified into homogeneous and heterogeneous environments. A homogeneous computing environment is one that has a homogeneous network environment (Fig. 2(a)).

Anjos et al. [16] proposed MRA++, a novel MapReduce framework design that considers the heterogeneity of nodes during data distribution, task scheduling, and job control. MRA++ characterizes a smaller number of machines as stragglers and executes a larger number of tasks concurrently in a heterogeneous environment. A knowledge base of execution times is used prior to the data distribution. MRA++ achieves a shorter makespan.



The disadvantage of MRA++ is the exclusion of slow machines, which is a waste of resources.

Many researchers have done their best to shorten the entire job processing time. Current works show that, in a heterogeneous environment, the makespan can be reduced if workloads are assigned in proportion to the capacities of the heterogeneous compute nodes. In general, DDP [9] has the best performance in terms of a shorter makespan in static heterogeneous environments. By assuming a static compute environment, DDP is able to use the historical run time record to accurately assign to each compute node a portion of data blocks proportional to its computation power.

However, the static compute environment assumption of DDP is neither realistic nor accurate, as the computation power of the compute nodes may vary from time to time. In other words, a real environment, in which the compute node capacity is dynamic, renders this type of historical data [8, 9] less useful, and DDP will not be able to perform well.

Based on the current work, there is a need to design a novel task scheduler that is able to work in a more realistic environment. The new task scheduler should be able to respond to a dynamic heterogeneous environment. The next section details how our algorithms are formulated from Hadoop, DDP, and current systems.

III. PROPOSED HMTS-DC ALGORITHM

This section presents an analysis of the performance of Hadoop in a heterogeneous environment. The basic design of Hadoop assumes a homogeneous environment: in theory, an even distribution of data can avoid the delay caused by network transmission. In a heterogeneous environment, however, it is best to make the data distribution load proportional to the nodes' capabilities, but this is not easy to do.

Fig. 2. Homogeneous environment (a) and heterogeneous environment (b).

Fig. 3. DDP data allocation (blocks assigned to Node1-Node3 vs. execution time, comparing the Hadoop default with the best case).



In a rush to improve the makespan, the DDP algorithm [9] allocates data blocks in proportion to the computational capacity (gathered from historical runs) of each node, as depicted in figure 3. Assuming the relative computing capacities of nodes A, B, and C are 3, 2, and 1, respectively, the number of blocks assigned to these nodes follows this ratio. In DDP, since there is no data block movement between compute nodes during job processing, the makespan is shortened; the tasks on the distributed nodes can be completed at almost the same time.

However, the static compute environment assumption of DDP is neither realistic nor accurate, as the computation power of the compute nodes may vary from time to time. For instance, changes in the processes/applications running within the compute cluster, or changes in the hardware configuration of the compute cluster, such as the addition of RAM, the upgrading of CPUs, or the addition/removal of computing nodes within the cluster, render the historical data useless. In other words, should the environment become dynamic, as in an actual environment, the historical run time records will not be accurate and DDP will not perform well. Note also that DDP does not support fault tolerance, since only one set of data blocks is available for the entire cluster. By default, a MapReduce cluster responds to node failures automatically: if a node fails, the node's uncompleted data blocks are rescheduled to another node. DDP lacks this feature.

In order to respect the application of MapReduce (and simultaneously overcome the limitations of DDP and similar algorithms), the task scheduler should include a means to estimate the capacity of the nodes dynamically.
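To make the capacity-proportional allocation described above concrete, the following is a small sketch that splits a block count across nodes according to a capacity ratio such as 3:2:1. It is our own illustration, not the published DDP code, and the method name allocateBlocks is hypothetical.

import java.util.Arrays;

// Illustrative sketch of capacity-proportional block allocation in the
// style of DDP. Not the authors' implementation.
public class ProportionalAllocation {

    // Splits totalBlocks across nodes in proportion to capacity[i],
    // handing any rounding remainder to the most powerful node.
    static int[] allocateBlocks(int totalBlocks, int[] capacity) {
        int sum = Arrays.stream(capacity).sum();
        int[] blocks = new int[capacity.length];
        int assigned = 0, fastest = 0;
        for (int i = 0; i < capacity.length; i++) {
            blocks[i] = totalBlocks * capacity[i] / sum; // floor of the share
            assigned += blocks[i];
            if (capacity[i] > capacity[fastest]) fastest = i;
        }
        blocks[fastest] += totalBlocks - assigned;       // leftover blocks
        return blocks;
    }

    public static void main(String[] args) {
        // Capacity ratio 3:2:1 over 39 blocks yields [20, 13, 6].
        System.out.println(Arrays.toString(allocateBlocks(39, new int[]{3, 2, 1})));
    }
}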
When jobs are executed, the JobTracker schedules all jobs and distributes the tasks to the TaskTrackers. The latter then perform the tasks and return the results to the JobTracker. Hadoop uses a Heartbeat mechanism for communication between the JobTracker and the TaskTrackers. Each TaskTracker is assigned a set of task slots, within which the TaskTracker can work on tasks. When a task slot is empty, the TaskTracker uses a Heartbeat to inform the JobTracker. If there are more tasks to be completed, the JobTracker assigns tasks to the TaskTracker via the Heartbeat response. Through the regular Heartbeat, the JobTracker determines the status of the assigned tasks. The default Heartbeat interval of Hadoop is 3 s. Fig. 4 illustrates the Heartbeat mechanism in our proposed heartbeat mechanism for MapReduce Task Scheduler using Dynamic Calibration (HMTS-DC).

Fig. 4. Heartbeat mechanism in HMTS-DC.

Fig. 5. More local data blocks for the faster node.
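The calibration step enabled by this Heartbeat traffic can be pictured with a short sketch. This is our own illustration, with hypothetical names such as CapacityEstimator, and is not Hadoop or HMTS-DC source code: each heartbeat reports a slave's cumulative progress, and the relative capacity is estimated as its completion rate.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of heartbeat-driven capacity estimation: the relative
// capacity of a slave is approximated by blocks completed per second,
// recomputed from every heartbeat report. Hypothetical names.
public class CapacityEstimator {
    private final Map<String, Integer> blocksDone = new HashMap<>();
    private final long jobStartMs = System.currentTimeMillis();

    // Called when a heartbeat arrives carrying the slave's progress.
    public void onHeartbeat(String slaveId, int completedBlocks) {
        blocksDone.put(slaveId, completedBlocks);
    }

    // Relative capacity = completion rate (blocks per second) so far.
    public double capacityOf(String slaveId) {
        double elapsedSec = (System.currentTimeMillis() - jobStartMs) / 1000.0;
        return blocksDone.getOrDefault(slaveId, 0) / Math.max(elapsedSec, 1e-9);
    }
}

Estimates of this kind are what the scheduler uses to decide how many local blocks to reserve for each node, as described next.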



Unlike DDP, HMTS-DC is designed without the need to modify the current HDFS implementation. In the initial stage, data files are replicated to the compute nodes as instructed by the NameNode via HDFS. In the second stage, as the job progresses, the Heartbeat in Hadoop is used to estimate the current relative computing capacity of each compute node. As depicted in figure 4, the Hadoop Heartbeat is extended so that information regarding the current slave progress, the current reserved task list, and the non-reserved task list can piggy-back on the Heartbeat and be communicated to the JobTracker. The node capacity can be estimated using the information from the Heartbeat, and more local tasks are reserved for compute nodes with greater capacity. The number of local tasks reserved is proportional to the estimated relative capacity of the node.

Section 3.1 presents a detailed analysis of the Hadoop, DDP, and proposed HMTS-DC algorithms in static heterogeneous environments, using a real prototype testing record obtained from Section 5.

3.1 Hadoop, DDP, and HMTS-DC under a static heterogeneous environment

Table 1 depicts one possible scenario (one of the prototype tests). Consider three worker nodes (each running a TaskTracker and a DataNode) with replication set to two and 39 blocks with block IDs 1, 2, 3, 4, ..., 39. There is a total of 78 blocks after replication (2*39=78). Each slave is allocated a total of 26 blocks (i.e., total replicated blocks/total number of slaves = 78/3); the numbers shown in Table 1 are the task IDs.

Assume that the relative computing capacity of Slave3:Slave2:Slave1 is 10:2:1, where 10 and 1 represent the most and least powerful nodes, respectively. To simplify the discussion, and without loss of generality, the time required for the fastest node to complete a block is assumed to be 1 s; the time taken to process a block by slave 1 is 10 s; and the time taken to transmit a block over the network is assumed to be 6 s.

Table I. Data blocks replicated at the slaves (task IDs).

Slave 3: 15 16 23 10 20 04 33 32 18 30 24 11 12 38 27 26 03 35 09 36 17 06 31 02 39 29
Slave 2: 24 25 38 07 37 21 14 20 13 23 17 32 34 05 08 12 22 35 01 09 04 18 15 26 28 19
Slave 1: 06 02 29 03 27 37 33 34 28 07 21 30 16 08 10 25 13 05 11 36 01 39 19 14 22 31

Fig. 6. Hadoop: tasks processed vs. time (task IDs executed on Slave 1-Slave 3 over a 50 s execution timeline; blocks 21, 13, 05, and 19 are moved from slave 2 to slave 3).



3.1.1 Hadoop task processing over time

Figure 6 shows that at time t=20 s, slave 3 has processed all its local blocks and subsequently needs to move blocks 21, 13, 5, and 19 from slave 2. There are a total of four block movements, as depicted by the four black boxes in figure 6. The makespan can be calculated by adding the time to process a block (10 s for slave 1) to the start time of the last block processed by slave 1. The makespan is 50 s (i.e., 40 + 10). The total data movement recorded is 4.

3.1.2 DDP original block location (task allocation)

DDP allocates the blocks before MapReduce starts, as shown in Table 2; only one set of blocks exists in DDP, since no blocks are replicated. The original 39 blocks are partitioned and assigned to the slave nodes in proportion to the computational capacity of the nodes. Observe that no data movement is recorded. The last block processed by slave 3 completes at t=30 s (i.e., 29 + 1), since the time required for slave 3 to process a block is 1 s. Because the number of blocks assigned to the compute nodes is proportional to their computational capacity, all local blocks are processed and completed by the slaves almost simultaneously. DDP is highly efficient and has the shortest makespan.

3.1.3 HMTS-DC

HMTS-DC does not change the original principles of the HDFS and MapReduce mechanisms; the setup is therefore the same as in section 3.1.1. In figure 7, tasks 15, 16, ..., 29, marked in brown, are those that were processed from time 0 to 26 s. During these 26 seconds, the Heartbeat returns information to the JobTracker every 10 seconds (subject to round-off error). From the first Heartbeat response, once the capacity of each slave is known, the remaining uncompleted tasks on the right side of figure 7 are marked and reserved in proportion to the computation capacities of the slaves.

At the 10th second, the Heartbeat obtains the node capabilities, and each node's priority list of local tasks is computed as its reserved list. These reserved tasks have a higher priority to be executed by the slave itself. This is the big difference between HMTS-DC and Hadoop: if two replicas of tasks 06, 02, and 29 are located on two nodes of different capability, HMTS-DC can assign the tasks to the more powerful node, whereas Hadoop assigns them at random. At the next 10-second mark (20 s), the Heartbeat again obtains the node capabilities and counts the unprocessed local tasks; if the status differs from that at 10 s, the old reserved list is adjusted and replaced. In this way, the local tasks on a high-capability node are protected from being run on other nodes, and as few non-local blocks as possible are transferred.

Returning to our example, slave 3, as the fastest node, has all its remaining local blocks reserved to it (all its local tasks are in the reserved task list). The execution of these local blocks by slave 3 does not incur additional network cost and hence shortens the makespan. There is a total of ten unreserved blocks. There are no unreserved blocks in slave 1; the unreserved blocks in slave 2 and slave 3 are shown in figure 8. Considering the network transfer delay (2.5 s), more data movement implies a longer makespan. Only 2 block movements are recorded, and the makespan of HMTS-DC is 40 s (the last task required by slave 1 is task number 1, which completes between 30 s and 40 s, delaying the total completion time).

In this section, the task schedulers of Hadoop, DDP, and HMTS-DC were analyzed, and their makespans are 50, 30, and 40 s, respectively. Observe that our proposed HMTS-DC is unable to outperform DDP in a static environment.

Table II. DDP original block list.

Slave 3: 06 02 29 03 27 37 33 34 28 07 21 30 16 08 10 25 13 05 11 36 01 39 19 14 22 31 24 38 20
Slave 2: 23 17 32 12 35 09
Slave 1: 04 18 15 26
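The reservation rule described in Section 3.1.3 can be summarized in a few lines. This is our own simplification of the updateReservedBlock procedure given later in Algorithm 1(c); the name reservedFor is hypothetical, and the capacities are assumed to have already been estimated from the heartbeat.

// Illustrative sketch of the HMTS-DC reservation rule: at each calibration
// interval, the number of blocks reserved for a slave is proportional to
// its estimated relative capacity. Hypothetical names.
public class ReservationRule {

    // capacities: estimated relative capacities (e.g., {1, 2, 10})
    // remainingUnprocessed: blocks not yet processed cluster-wide
    static int reservedFor(int slave, double[] capacities, int remainingUnprocessed) {
        double sum = 0;
        for (double c : capacities) sum += c;
        // Proportional share, rounded down; in HMTS-DC the result is further
        // capped by the number of unprocessed blocks the slave holds locally.
        return (int) (capacities[slave] / sum * remainingUnprocessed);
    }

    public static void main(String[] args) {
        double[] cap = {1, 2, 10};                    // Slave1:Slave2:Slave3 = 1:2:10
        System.out.println(reservedFor(2, cap, 26));  // Slave3 reserves 20 of 26 blocks
    }
}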



IV. SIMULATION RESULTS AND DISCUSSION

A Java program was developed to simulate the performance of the algorithms and to analyze the performance of the proposed HMTS-DC compared with Hadoop and DDP. Simulations are carried out by varying the job size, the number of compute nodes, and the virtual machine configurations.

HMTS-DC, depicted in Algorithm 1 (a), (b), (c), and (d), assigns more of the tasks within the shaded area to more powerful nodes, thus allowing more powerful nodes to process more local tasks.

4.1 Simulation of Hadoop, DDP, and HMTS-DC

Simulation in a static environment. In this experiment (figures 9 and 10), the nodes have different computation powers. The capacity ratio is (1, 2, 7). It can be seen that Hadoop has the longest makespan and the most data movement, at 11 blocks. It is followed by HMTS-DC with a value of 4. DDP remains the best, with the shortest makespan and no data movement between the compute nodes. The capacity ratio is then set to a larger ratio of (1, 3, 11). It can be seen that Hadoop has the longest makespan, 1857 s, and the most data movement (14 blocks are moved). It is followed by HMTS-DC with a shorter makespan of 1646 s (only six blocks are moved). DDP remains the best, with the shortest makespan and no data movement between the compute nodes.
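The simulator mentioned at the start of this section can be skeletonized as follows. This is our own reconstruction under stated assumptions (a discrete tick loop in which an idle slave takes a reserved block first, then a local non-reserved block, then steals a non-reserved block from another slave, cf. Algorithm 1(b)), not the authors' actual Java program.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Reconstruction sketch of a tick-based scheduler simulator.
public class TickSimulator {

    // ticksPerBlock[i]: ticks slave i needs per block (inverse of capacity).
    static int makespan(int[] ticksPerBlock, List<Deque<Integer>> reserved,
                        List<Deque<Integer>> nonReserved) {
        int n = ticksPerBlock.length;
        int[] busyUntil = new int[n];
        int tick = 0, lastFinish = 0;
        while (true) {
            boolean active = false;
            for (int s = 0; s < n; s++) {
                if (busyUntil[s] > tick) { active = true; continue; } // still working
                Integer block = !reserved.get(s).isEmpty() ? reserved.get(s).poll()
                              : !nonReserved.get(s).isEmpty() ? nonReserved.get(s).poll()
                              : steal(nonReserved, s);                // remote block
                if (block != null) {
                    busyUntil[s] = tick + ticksPerBlock[s];
                    lastFinish = Math.max(lastFinish, busyUntil[s]);
                    active = true;
                }
            }
            if (!active) return lastFinish; // no blocks left and nobody busy
            tick++;
        }
    }

    static Integer steal(List<Deque<Integer>> nonReserved, int self) {
        for (int t = 0; t < nonReserved.size(); t++)
            if (t != self && !nonReserved.get(t).isEmpty()) return nonReserved.get(t).poll();
        return null;
    }

    public static void main(String[] args) {
        List<Deque<Integer>> res = List.of(new ArrayDeque<>(List.of(1, 2)), new ArrayDeque<>());
        List<Deque<Integer>> non = List.of(new ArrayDeque<>(), new ArrayDeque<>(List.of(3)));
        System.out.println(makespan(new int[]{1, 2}, res, non)); // prints 2
    }
}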

Fig. 7. Reserved list and non-reserved list of HMTS-DC (task IDs at Slave 1-Slave 3).

Fig. 8. Process of the HMTS-DC algorithm (task IDs executed on Slave 1-Slave 3 over a 40 s execution timeline; only two block movements occur).



Simulation in a dynamic environment. The computational environment is then changed to dynamic (Fig. 11 and Fig. 12). The capacity ratio is (1, 2, 7). The speed of the fastest node changes in the middle of the job processing: it is reduced from 7 to 6, i.e., by a small factor of 1/7. As a result, the makespan of all schedulers increases. However, it can be seen that DDP now has the worst performance, with the longest makespan, 3371 s, and no data movement (since the replication number for DDP is one, no data movement is allowed). This means that for DDP, the remaining tasks have to be completed by the now "slower" fastest node (Slave3) alone. The performance of HMTS-DC is the best, with the shortest makespan and only one data movement recorded between the compute nodes. In the second set of tests, the capacity ratio is reset to (1, 3, 11). Again, the speed of the fastest node is reduced in the middle of the computation, from 11 to 10. It can be seen that DDP now has the longest makespan, 2235 s, with no data movement. It is followed by Hadoop. The performance of HMTS-DC is the best, with the shortest makespan and only three data movements between the compute nodes.

4.2 Limitation of HMTS-DC and motivation to enhance HMTS-DC

As depicted in Table 2, some local blocks have to be processed before the Heartbeat can return information regarding the node capacity. As shown in figure 8, the number of local blocks that can be reserved is therefore highly limited; in this case, only a total of six local blocks can be reserved. Supposing that the relative capacities of the compute nodes were known prior to the processing of any local blocks, then all local blocks would be potential candidates for reservation.

Supposing that the historical record of the relative computational capacities of the compute nodes is known, the historical values can be used at the initialization stage, prior to the processing of any blocks. Algorithm 2 depicts the initial data allocation algorithm of EHMTS-DC. The rest of EHMTS-DC is the same as the HMTS-DC algorithm depicted in Algorithm 1 (a), (b), and (c). For any job, if a historical record of the run time capacities of the nodes exists, the record is used at the initial stage of job execution; otherwise, a general weightage of one is used.

Algorithm 1. HMTS-DC algorithm.

Variables:
TotalTask (a job partitioned into many tasks),
TotalBlock (the data file partitioned into many blocks; TotalBlock = TotalTask),
r (replication number) and n (number of slaves),
HisRecExist (whether a historical record of the job and the relative node capacities exists),
RelCapOfSlave (relative computing capacities of the slaves),
SetOfComputeNode (S1, S2, ..., SN),
update_CR_Now (time to update the computation ratio of each slave),
Block[S] (set of blocks assigned to slave S),
RBlock[S] (set of blocks assigned to slave S and reserved for slave S),
NBlock[S] (set of blocks assigned to slave S but not reserved for slave S),
RoneB[S] (one block in RBlock[S]),
NoneB[S] (one block in NBlock[S])

(a) Variables

Proc Main()
Begin
  // HDFS partitions the job data file into blocks {B1, B2, B3, ...}
  // The NameNode randomly distributes the blocks to the slave nodes
  Let k = int((TotalBlock * r) / n)
  For each slave node S
    RBlock[S] := ∅
    NBlock[S] := ∅
    RBlock[S] := RBlock[S] ∪ {Bj, Bj+1, ..., Bj+k}   // j is a random starting index
  End For
  // Loop while there are still unprocessed blocks at any slave
  Done := False
  Initialize()
  Ticks := 0   // Ticks is the simulated time
  While NOT Done
    If Ticks equals update_CR_Now
      updateReservedBlock()
    End If
    For each slave node S
      If S is not busy
        If RBlock[S] is not empty
          // Process a block from the reserved set
          RBlock[S] := RBlock[S] \ {RoneB[S]}
        Else If NBlock[S] is not empty
          // Process a block from the non-reserved set
          NBlock[S] := NBlock[S] \ {NoneB[S]}
        Else
          // Process a non-reserved block from some other slave T ≠ S
          NBlock[T] := NBlock[T] \ {NoneB[T]}
        End If
      End If
    End For
    Make-span := Ticks
    Ticks := Ticks + 1
  End While
  Return Make-span
End

(b) Main



Proc updateReservedBlock()
Begin
  // Receive the task status of each slave via Heartbeat
  Compute RelCapOfSlave := {c[1], c[2], ..., c[n]} based on the task status
  For each slave node S where S = {1, 2, ..., n}
    // Get the remaining number of unprocessed blocks: remUB
    Number of reserved blocks := min(c[S] / Sum(c[1], ..., c[n]) * remUB, unprocessed local blocks of S)
    Number of non-reserved blocks := remUB - number of reserved blocks
    Update the reserved blocks
    Update the non-reserved blocks
  End For
End

(c) updateReservedBlock

Proc Initialize()
Begin
  Set {cS1, cS2, ..., cSN} to {1, 1, ..., 1}
End

(d) Initialization

Algorithm 2. EHMTS-DC algorithm (initialization).

Proc Initialize()
Begin
  If NOT HisRecExist   // no history: set all capacities to 1
    Set {cS1, cS2, ..., cSN} to {1, 1, ..., 1}
  Else
    // Obtain from history the values of c[1], c[2], ..., c[N]
    RelCapOfSlave := {c[1], c[2], ..., c[N]}
    For each slave node S where S := {1, 2, ..., n}
      // Get the remaining unprocessed local blocks: remUB
      Number of reserved blocks := min(c[S] / Sum(c[1], ..., c[N]) * remUB, unprocessed local blocks of S)
      Number of non-reserved blocks := remUB - number of reserved blocks
      Update the reserved blocks
      Update the non-reserved blocks
    End For
  End If
End

4.2.1 Evaluation of HMTS-DC and EHMTS-DC

Figure 13 and figure 14 show that EHMTS-DC has a shorter makespan and less data block movement than HMTS-DC. This is because EHMTS-DC uses historical data from the start, whereas HMTS-DC needs time to "warm up" while waiting for the heartbeat to return the actual ratio of the computing powers of the compute nodes.

Figure 15 and figure 16 show that EHMTS-DC has both a shorter makespan and less data block movement than HMTS-DC in a cluster environment whose capabilities change over time.

In summary, the performance of the task schedulers, within experimental error, is as follows. In a static homogeneous environment, DDP has the best performance, whereas Hadoop and the proposed HMTS-DC are tied. In a static heterogeneous environment, the performance of the schedulers, in descending order, is DDP, HMTS-DC, and Hadoop. Lastly, in a dynamic heterogeneous environment, where the computational capacity of the nodes varies with time, HMTS-DC has the best performance, followed by Hadoop. DDP has the worst performance, as it is unable to adapt to dynamically changing environments.

In figure 17 and figure 18, we reset the cluster to five slaves with three replications. Since this represents a static environment, DDP has the best overall performance in terms of makespan and data movement. The performance of EHMTS-DC is intermediate, and the makespan and data movement of Hadoop are the poorest, at 2763 s with a data movement of 34 blocks.

Fig. 9. Makespan with nodes of different capacities (static env; total blocks: 300, rep: 2; capacity ratios (1,2,7) and (1,3,11)).
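A minimal sketch of Algorithm 2's fallback logic follows, assuming the historical capacities are available as a simple array; names such as initialCapacities are hypothetical, and this is not the authors' implementation.

import java.util.Arrays;
import java.util.Optional;

// Illustrative sketch of EHMTS-DC initialization: use historical relative
// capacities when a record exists, otherwise fall back to a uniform
// weightage of one per slave, as in Algorithm 2.
public class EhmtsInit {

    static double[] initialCapacities(int slaves, Optional<double[]> history) {
        if (history.isEmpty()) {
            double[] uniform = new double[slaves];
            Arrays.fill(uniform, 1.0);   // no history: every slave weighted 1
            return uniform;
        }
        return history.get().clone();    // history known: use the recorded ratios
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(initialCapacities(3, Optional.empty())));
        System.out.println(Arrays.toString(
                initialCapacities(3, Optional.of(new double[]{1, 2, 10}))));
    }
}

With history present, reservation can start from the first assignment rather than after the first heartbeats, which is exactly the "warm-up" saving discussed in Section 4.2.1.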



Fig. 10. Block movement with nodes of different capacities (static env): capacity (1,2,7): Hadoop 11, DDP 0, HMTS-DC 4 blocks; capacity (1,3,11): Hadoop 14, DDP 0, HMTS-DC 6 blocks.

In figures 19 and 20, the setup is five slaves with the replication set to three. Since this represents a dynamic environment, DDP has the worst overall performance, with makespans of 1596 and 4796 s for the two job sizes, respectively (no data movement is recorded for DDP, as its replication factor is one). The proposed EHMTS-DC has the best overall performance, with makespans of 843 and 2710 s, respectively. EHMTS-DC performs very well because it is able to reduce data movement: the data movements in figure 20 are two and zero blocks, respectively.

In this section, HMTS-DC was analyzed to identify areas where task scheduling could be improved further. Based on the findings, an enhanced version of HMTS-DC, EHMTS-DC, was proposed by augmenting HMTS-DC with a historical run time record. The experimental results show that EHMTS-DC performs better than HMTS-DC in terms of makespan and data transfer. Access to the historical run time relative computational capacities allows EHMTS-DC to start reservation early; thus, more local data blocks can be reserved. EHMTS-DC reduces data movement, and thus the makespan, effectively.

Fig. 11. Makespan of Hadoop, DDP, and HMTS-DC with different capacities (dynamic env; total blocks: 300, rep: 2).

Fig. 12. Block movement of Hadoop, DDP, and HMTS-DC with the capacity settings as in fig. 11 (dynamic env).

V. PROTOTYPE ENVIRONMENT SETUP

The prototype environment is developed using Hadoop version 1.2.1 [2]. The virtual machines run Ubuntu 14.10 desktop on a VMware ESXi 5.5 server.

The experimental environment consists of a computer cluster containing one manager (JobTracker) and three workers (TaskTrackers), as outlined in Table 3.

5.1 Prototype environment: static and dynamic

Two environments are created to test the task schedulers: (1) a static environment, where all compute nodes are dedicated and run no other programs/processes apart from the map-reduce jobs assigned to them; and (2) a dynamic environment, where non-dedicated compute nodes are used, and, in addition to the map-reduce jobs assigned to them, these nodes run other programs/processes.

To evaluate the performance of the Hadoop scheduler, the hardware configuration and virtual machines described in Table 4 are used, and the makespans of jobs with different data sizes are recorded.



To evaluate the DDP scheduler, the historical record of the relative computing speeds, i.e., the ratio of the compute nodes, is first benchmarked by running map-reduce jobs and recording the makespan of each job and the time taken by each node to complete its assigned tasks. Then, based on the ratio captured in the historical record, an appropriate number of data blocks is assigned to each compute node; more powerful nodes are assigned more blocks. To evaluate the HMTS-DC scheduler, jobs assigned to the compute nodes are executed and the Heartbeat is transmitted between the JobTracker and the TaskTrackers.

Fig. 13. Makespan of HMTS-DC and EHMTS-DC with different capacities (static env; total blocks: 300, rep: 2).

Fig. 14. Blocks moved by HMTS-DC and EHMTS-DC with different capacities (static env).

Fig. 15. Makespan of HMTS-DC and EHMTS-DC with different capacities (dynamic env; total blocks: 300, rep: 2).

Fig. 16. Blocks moved by HMTS-DC and EHMTS-DC with different capacities (dynamic env).

Fig. 17. Makespan of Hadoop, DDP, and EHMTS-DC in a 5-slave cluster with replication=3 (static env; relative capacities (1,2,6,8,24); two job sizes, 410 and 1230).

Fig. 18. Block movement of Hadoop, DDP, and EHMTS-DC in a 5-slave cluster with replication=3 (static env).



Fig. 19. Makespan of Hadoop, DDP, and EHMTS-DC in a 5-slave cluster with replication=3 (dynamic env; relative capacities (1,2,6,8,24); two job sizes, 410 and 1230).

Fig. 20. Block movement of Hadoop, DDP, and EHMTS-DC in a 5-slave cluster with replication=3 (dynamic env).

Table III. Hardware configuration.

Host1: HP Compaq Elite 8300 SFF, 4 CPUs × 3.392 GHz, 12 GB RAM, 1 TB disk
Host2: HP EliteDesk 800 G1 TWR, 4 CPUs × 3.392 GHz, 16 GB RAM, 1 TB disk
Host3: HP Compaq Elite 8300 CMT, 4 CPUs × 3.392 GHz, 20 GB RAM, 1 TB disk
Host4: Intel Core Quad CPU Q9400, 4 CPUs × 2.659 GHz, 6 GB RAM, 1 TB disk

Table IV. Virtual machine configuration.

Master: CPU 5120 MHz, RAM 3072 MB
Slave 1: CPU 900 MHz, RAM 3072 MB
Slave 2: CPU 1024 MHz, RAM 3072 MB
Slave 3: CPU 5120 MHz, RAM 3072 MB
Network: 100 Mbps; replication: 2

In Hadoop, heartbeats are sent to the JobTracker, containing information such as the task status, task counters, and data read/write statistics. Based on the dynamic information captured from the heartbeat, a ratio expressing the relative computing power of each node is computed. Unprocessed local blocks within each compute node are then re-assigned based on this ratio; more local blocks are reserved for the more powerful nodes.

5.2.1 Static prototype environment

Tables 5, 6, and 7 provide the detailed results of the static environment experiment. In Table 6, the job to be processed by DDP comprises 60 blocks that are assigned manually to the compute nodes; only one set of data, with no data replication, is involved for DDP. For Hadoop and HMTS-DC (Tables 5 and 7), the total number of blocks allocated is 120, since the data replication is set to two. Therefore, the number of blocks to be completed is half of the total blocks allocated, i.e., 120/2 = 60 blocks. Block allocation is dynamically assigned by Hadoop and HMTS-DC.

Figure 21 depicts the experimental outcome of the three task schedulers. The proposed HMTS-DC has an average makespan of 757 s, i.e., an improvement of approximately 15% over Hadoop. DDP has the best performance, at 690 s, followed by HMTS-DC and then Hadoop. DDP outperforms HMTS-DC because it is able to optimize the overall block ratio (in this case, 60 blocks) assigned to each compute node, whereas HMTS-DC only optimizes part of the local blocks (in this case, approximately 20 blocks) within each compute node.


Table V. Hadoop, static environment (average makespan: 888 s).

Slave1: CPU 900 MHz, data size 900 MB, 40 blocks allocated, 10 completed
Slave2: CPU 1024 MHz, data size 1024 MB, 40 blocks allocated, 13 completed
Slave3: CPU 5120 MHz, data size 3092 MB, 40 blocks allocated, 37 completed
Total: 120 blocks allocated, 60 completed

Table VI. DDP, static environment (average makespan: 690 s).

Slave1: CPU 900 MHz, data size 900 MB, historical record ratio 0.15, 9 blocks allocated, 9 completed
Slave2: CPU 1024 MHz, data size 1024 MB, historical record ratio 0.18, 11 blocks allocated, 11 completed
Slave3: CPU 5120 MHz, data size 3092 MB, historical record ratio 0.67, 40 blocks allocated, 40 completed
Total: 60 blocks allocated, 60 completed

Table VII. HMTS-DC, static environment (average makespan: 757 s).

Slave1: CPU 900 MHz, memory 900 MB, 40 blocks allocated, 9 completed
Slave2: CPU 1024 MHz, memory 1024 MB, 40 blocks allocated, 13 completed
Slave3: CPU 5120 MHz, memory 3092 MB, 40 blocks allocated, 38 completed
Total: 120 blocks allocated, 60 completed

Fig. 21. Makespan in the static environment (total blocks: 60; relative computational capacities (0.15, 0.18, 0.67)): Hadoop 888 s, DDP 690 s, HMTS-DC 757 s.

Fig. 22. Makespan in the dynamic environment (total blocks: 60; relative computational capacities (0.15, 0.18, 0.67)): Hadoop 950 s, DDP 1010 s, HMTS-DC 882 s.

HMTS-DC outperforms Hadoop FIFO because it is able to reserve some of the local tasks within the faster node. By doing so, HMTS-DC reduces the number of data blocks to be transferred from one compute node to another. In this experiment, from Tables 5, 6, and 7, Slave3 under HMTS-DC and Hadoop manages to execute only 38 and 37 of its 40 local tasks, respectively. Some local tasks of Slave3 are replicated on Slave1 or Slave2, and these tasks were executed on the slower nodes (either Slave1 or Slave2) instead of by Slave3. DDP is able to complete all 40 local blocks using the fastest node (Slave3), whereas HMTS-DC and Hadoop FIFO finish only 38 and 37 blocks, respectively, using the fastest node. In this experiment, DDP outperforms HMTS-DC in a static environment. This may not be the case if the computing resources are dynamic, as shown in the next experiment.

5.2.2 Dynamic prototype environment

In this experiment, the computational environment of Slave3 is dynamic.


In other words, during the MapReduce job, other processes are invoked, resulting in a reduction in the computation power of Slave3.

As shown in Tables 8, 9, and 10, to simulate a dynamic environment on Slave3, a Java program is executed while Slave3 processes the Hadoop tasks. The Java program utilizes 50% of the computational resources of Slave3; the remaining 50% of the computation power of Slave3 is used to complete the remaining Hadoop tasks. This slows down the task execution of Slave3.

In this experiment, as depicted in figure 22, DDP has the worst performance and the longest makespan, followed by Hadoop FIFO and then HMTS-DC. DDP is unable to perform well even though the number of blocks allocated is proportional to the static computation power of each node. Since the status of Slave3 is dynamic, the computational power of Slave3 varies from time to time. As shown in Table 9, DDP partitioned the tasks according to the historical record. With the reduction in the speed of Slave3, additional blocks of Slave3 are transferred to Slave1 and Slave2, prolonging the makespan. HMTS-DC has the best performance in this experiment.

Table VIII. Hadoop (dynamic environment; average makespan: 950 s).

Slave1: CPU 900 MHz, data size 900 MB, 40 blocks allocated, static
Slave2: CPU 1024 MHz, data size 1024 MB, 40 blocks allocated, static
Slave3: CPU 5120 MHz, data size 3092 MB, 40 blocks allocated, dynamic

Table IX. DDP (dynamic environment; average makespan: 1010 s).

Slave1: CPU 900 MHz, data size 900 MB, historical record ratio 0.15, 9 blocks allocated, static
Slave2: CPU 1024 MHz, data size 1024 MB, historical record ratio 0.18, 11 blocks allocated, static
Slave3: CPU 5120 MHz, data size 3092 MB, historical record ratio 0.67, 40 blocks allocated, dynamic

Table X. HMTS-DC (dynamic environment; average makespan: 882 s).

Slave1: CPU 900 MHz, data size 900 MB, 40 blocks allocated, static
Slave2: CPU 1024 MHz, data size 1024 MB, 40 blocks allocated, static
Slave3: CPU 5120 MHz, data size 3092 MB, 40 blocks allocated, dynamic

VI. CONCLUSIONS

MapReduce is a software framework that allows the easy deployment of parallel applications that process large data sets on large computing clusters. The default MapReduce implementation in Hadoop is based on the assumption of a homogeneous environment, which is unrealistic: in a heterogeneous environment, where compute nodes have varying capacities, the homogeneity assumption actually hinders MapReduce performance.

The major contributions of our work are the two proposed MapReduce task schedulers, i.e., the HMTS-DC scheduler and the EHMTS-DC scheduler. The HMTS-DC algorithm is an improved task scheduler in which the relative computational power of the compute nodes is estimated using the heartbeat mechanism. In HMTS-DC, the proportion of local blocks reserved for a compute node is proportional to the relative computational power of that node.

Note that if the node capacity in a cluster is static, i.e., the capacity remains unchanged from that recorded in the historical runs, DDP outperforms Hadoop and HMTS-DC by having a shorter makespan and no data movement between compute nodes.


However, in a more realistic environment, where the capacity is dynamic and fluctuates, both Hadoop and HMTS-DC are able to adapt to the dynamic environment and their relative performance improves. DDP, which depends on the historical record, is unable to adapt to the dynamic situation and lags behind, giving it the worst performance. HMTS-DC, which depends on the estimated real-time current capacity of the compute nodes, has the best performance. The task locality of HMTS-DC is further optimized in EHMTS-DC by incorporating historical information [9] on the relative computing capacities of the compute nodes during the initial stage of the job. The experimental results show that, in a dynamic heterogeneous environment, both the makespan and the data transfer are reduced in EHMTS-DC compared to HMTS-DC.

In future work, a prediction model could be included in the proposed algorithms to predict the remote blocks to be processed. These remote blocks could be pre-fetched and transferred to the compute node to reduce the network communication time should there be a need to move blocks between nodes. Soft computing frameworks, such as fuzzy logic, could also be used to model and infer the computational capacity of the compute nodes and their dynamic state.

References
[1] Ghemawat S, Gobioff H, Leung S-T. The Google File System. In Proc. SOSP '03, ACM, 2003.
[2] Welcome to Apache™ Hadoop®! http://hadoop.apache.org/.
[3] Job Scheduler in Apache Hadoop. Retrieved from https://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/
[4] Holmes A. Hadoop in Practice. Manning Publications Co., 2012.
[5] Shvachko K, Kuang H, Radia S, Chansler R. The Hadoop distributed file system. In Proc. the IEEE 26th Symp. on Mass Storage Systems and Technologies (MSST), Lake Tahoe, 2010, pp. 1-10. doi: 10.1109/MSST.2010.5496972
[6] Wang F, Qiu J, Yang J, Dong B, Li X, Li Y. Hadoop High Availability through Metadata Replication. In Proc. 1st Int. Workshop on Cloud Data Management, 2009, pp. 37-44.
[7] Zaharia M, Konwinski A, Joseph A D, Katz R, Stoica I. Improving MapReduce performance in heterogeneous environments. In Proc. the 8th USENIX Conference on Operating Systems Design and Implementation, San Diego, California, 2008.
[8] Chen Q, Zhang D, Guo M, Deng Q, Guo S. SAMR: A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment. In Proc. the 2010 IEEE 10th International Conference on Computer and Information Technology.
[9] Lee C-W, Hsieh K-Y, Hsieh S-Y, Hsiao H-C. A Dynamic Data Placement Strategy for Hadoop in Heterogeneous Environments. Big Data Research, 1, 14-22. doi: http://dx.doi.org/10.1016/j.bdr.2014.07.002
[10] Xie J, et al. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, 2010.
[11] He C, Lu Y, Swanson D. Matchmaking: A New MapReduce Scheduling Technique. In Proc. the 2011 IEEE Third International Conference on Cloud Computing Technology and Science.
[12] Gu R, Yang X, Yan J, Sun Y, Wang B, Yuan C, Huang Y. SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters. Journal of Parallel and Distributed Computing, 74(3), 2166-2179. doi: http://dx.doi.org/10.1016/j.jpdc.2013.10.003
[13] Xu Y, Cai W. Hadoop Job Scheduling with Dynamic Task Splitting. In Proc. the 2015 International Conference on Cloud Computing Research and Innovation (ICCCRI).
[14] Anjos J C S, Carrera I, Kolberg W, Tibola A L, Arantes L B, Geyer C R. MRA++: Scheduling and data placement on MapReduce for heterogeneous environments. Future Generation Computer Systems, 42, 22-35. doi: http://dx.doi.org/10.1016/j.future.2014.09.001
[15] WordCount Example. Retrieved from https://wiki.apache.org/hadoop/WordCount (2017).
[16] Anjos J C S, Carrera I, Kolberg W, Tibola A L, Arantes L B, Geyer C R. MRA++: Scheduling and data placement on MapReduce for heterogeneous environments. Future Generation Computer Systems, 42, 22-35, 2015. doi: https://doi.org/10.1016/j.future.2014.09.001
[17] Barnwal R P, Ghosh S K. Heartbeat Message Based Misbehavior Detection Scheme for Vehicular Ad-hoc Networks. In Proc. the 2012 International Conference on Connected Vehicles and Expo (ICCVE '12), pp. 29-34.
[18] Scazzoli D, et al. A redundant gateway prototype for wireless avionic sensor networks. In Proc. the IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), 2017.
[19] White T. Hadoop: The Definitive Guide. O'Reilly Media, Inc., Oct. 2010.



Biographies

Xinzhu Lu, is a master student in the Faculty of Computer Science and Technology, University of Malaya. She received her Bachelor's degree in computer science from the same university in 2012. Her research interests include cloud computing and big data.

KeatKeong Phang, is an associate professor in the Faculty of Computer Science and Technology, University of Malaya. He received his Ph.D. degree in computer science from the same university in 2004. His research interests include high speed networks, cloud computing and big data.

