
Load Balancing through Block Rearrangement Policy

for Hadoop Heterogeneous Cluster


Ankit Shah
Information Technology
Shankersinh Vaghela Bapu Institute of Technology
Gandhinagar, India
shah_ankit101@yahoo.co.in

Mamta Padole
Computer Science & Engineering
The M.S. University of Baroda
Baroda, India
mpadole29@rediffmail.com

Abstract— Hadoop is the most common tool used by researchers and scientists to store and analyze Big Data. Hadoop stores huge amounts of data using the Hadoop Distributed File System (HDFS). HDFS uses a block placement policy to split a very large file into blocks and place them across the cluster in a distributed manner. Hadoop and HDFS have been designed to work efficiently on homogeneous clusters, but in this era of networking we cannot expect a cluster to consist of homogeneous nodes only. There is therefore a need for a storage policy that works efficiently on both homogeneous and heterogeneous clusters, so that applications can be executed time-efficiently in either environment. Data locality in Hadoop maps a data block to a process on the same node, but when dealing with Big Data it is often necessary to map data blocks to processes across multiple nodes. To handle this, Hadoop copies the data block to where the mappers are running, which causes considerable performance degradation, especially on a heterogeneous cluster, due to I/O delay or network congestion. Here we present a novel algorithm that balances data blocks onto specific nodes only (i.e. custom block placement) by dividing the nodes into two categories, such as homogeneous vs. heterogeneous, or high-performing vs. low-performing nodes. This policy achieves better load rearrangement among the nodes and lets us place data blocks exactly where we want them to be for processing.

Keywords— Load Balancing in Hadoop, Heterogeneous Load Balancing, Hadoop Performance Improvement, Improved Block Arrangement Policy, Hadoop Balancer, Performance Optimization in Hadoop

I. INTRODUCTION

Hadoop [1] has become popular due to its capability to utilize general-purpose computers. Hadoop has three main constituents: the Hadoop Distributed File System (HDFS) [2], MapReduce [3] and Yet Another Resource Negotiator (YARN) [4]. HDFS allows nodes to store data on the distributed cluster; it is said to be fault-tolerant and can be deployed on any machine. MapReduce is a programming framework for writing applications that run in parallel on the distributed platform of the Hadoop environment.

To place data blocks on nodes, Hadoop uses the HDFS block placement policy [2]. A Hadoop cluster becomes imbalanced at times, due to overutilization of a few nodes relative to less utilized nodes, or due to newly added nodes with no blocks stored on them. To resolve this situation, Hadoop provides a built-in tool called the HDFS Balancer [2].

A. HDFS Block Placement Policy
1. Hadoop stores data blocks on the different nodes of the cluster based on the block placement policy available in HDFS. The policy helps distribute data blocks uniformly among the cluster nodes. It places the blocks according to the following approach: split each file into blocks and replicate the blocks according to the replication factor defined in the hdfs-site.xml file.
2. If the write request comes from a datanode that is itself a node of the cluster, put the first replica of the block on that datanode. Otherwise, place it randomly on any datanode of the cluster.
3. The second replica of the block is placed on a node of another rack if one is available, or possibly on the same rack as the first replica.
4. The third replica is placed on any node of the rack where the second replica is placed.

Figure 1 demonstrates the default HDFS block placement policy; a short code sketch of these steps follows.
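The four numbered steps above can be expressed compactly in code. The following Python fragment is a minimal simulation of the default heuristic, not Hadoop's actual Java implementation; the function name `choose_replica_nodes` and the dictionary layout are our illustrative assumptions.

```python
import random

def choose_replica_nodes(cluster, client=None, replication=3):
    """Simplified simulation of the default HDFS replica placement.

    cluster: {rack_name: [node_name, ...]}
    client:  name of the writing node, if it belongs to the cluster.
    The replication factor normally comes from the dfs.replication
    property in hdfs-site.xml (default 3).
    """
    all_nodes = [(rack, node)
                 for rack, nodes in cluster.items()
                 for node in nodes]

    # Step 2: first replica on the requesting datanode if the writer
    # is part of the cluster, otherwise on a random datanode.
    if client is not None:
        first = next(pair for pair in all_nodes if pair[1] == client)
    else:
        first = random.choice(all_nodes)

    # Step 3: second replica on another rack when one exists,
    # otherwise on another node of the first replica's rack.
    other_racks = [p for p in all_nodes if p[0] != first[0]]
    candidates = other_racks or [p for p in all_nodes if p != first]
    second = random.choice(candidates)

    # Step 4: third replica on the rack holding the second replica.
    same_rack = [p for p in all_nodes
                 if p[0] == second[0] and p not in (first, second)]
    third = random.choice(same_rack or [p for p in all_nodes
                                        if p not in (first, second)])

    return [first, second, third][:replication]

cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(choose_replica_nodes(cluster, client="dn1"))
```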
B. HDFS Balancer

Load imbalance may occur in a cluster when the load (i.e. blocks) is not uniformly distributed among the nodes. In the case of Hadoop, another possible cause of cluster imbalance is its elasticity: datanodes can be added to an existing cluster at any time, and newly added nodes will not hold any data blocks. A balanced cluster is essential for efficient results, and the HDFS Balancer tool takes care of load balancing the Hadoop cluster.



FIGURE I. DEFAULT HDFS BLOCK PLACEMENT POLICY [5]. [A]: TWO RACKS; [B]: SINGLE RACK. (The figure shows datanodes across Rack 1/Rack 2 and on a single rack, with the positions of the first, second and third replicas of a block to be placed.)

The number of blocks the Balancer moves depends on the threshold specified for the cluster. When the HDFS Balancer runs with its default threshold of 10%, it tries to keep each node's disk usage within 10% of the overall cluster usage. That means if overall cluster usage is 60% of cluster capacity, each node should end up with disk usage between 50% and 70% [6].

There are a few issues with the HDFS Balancer. First, it moves blocks of data without considering the processing capacity of the nodes to which the blocks are moved. Second, there is no mechanism to control the movement of data blocks of a specified file or dataset only. Moreover, moving data blocks from one node to another requires high bandwidth if cluster balancing is to take place quickly.
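The threshold rule above is easy to state in code. The real tool, invoked as `hdfs balancer -threshold 10`, also plans and executes the actual block moves; the sketch below only captures the balance criterion, and the function name and argument layout are ours.

```python
def is_balanced(node_used, node_capacity, cluster_used, cluster_capacity,
                threshold=0.10):
    """A node counts as balanced when its disk utilization lies within
    the threshold of the overall cluster utilization."""
    node_util = node_used / node_capacity
    cluster_util = cluster_used / cluster_capacity
    return abs(node_util - cluster_util) <= threshold

# Example: cluster at 60% utilization with the default 10% threshold.
# A node at 72% usage falls outside the 50%-70% band, so the balancer
# would select it as a source of blocks to move.
print(is_balanced(node_used=72, node_capacity=100,
                  cluster_used=600, cluster_capacity=1000))  # False
```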
In this paper, we propose a novel approach that moves all blocks to specified, pre-identified nodes only. This lets jobs run faster by minimizing inter-rack and inter-node transfer. We present a load rearrangement algorithm, applied to specific datasets only, that achieves better performance by categorizing nodes based on their processing capacity or heterogeneity.

This paper is structured as follows. Section 2 describes related work on Hadoop load balancing. Section 3 gives the motivation for our work. Sections 4 and 5 explain the proposed approach for load balancing/block rearrangement and the proposed algorithm. Section 6 describes the experimental setup on Grid'5000 [7]. Section 7 presents the results of the proposed algorithm compared with the Balancer, and the last section concludes the paper and outlines future work.

II. RELATED WORK

There are two common ways to address the load balancing issue in Hadoop: rebalance the cluster by migrating data, or migrate tasks. Researchers mainly work on optimizing task migration rather than data migration, as data transfer among nodes increases the time complexity. Hadoop schedulers can achieve load balancing by minimizing the migration of jobs or tasks to some extent [8]. Several researchers have developed improved scheduling algorithms, beyond the built-in scheduler provided by Hadoop, that contribute to load balancing [9, 10].

In paper [11] the authors propose load balancing based on disk latency and disk space utilization. They test their custom block placement approach on a heterogeneous cluster and show up to 20% improvement in job running time. The authors of paper [12] propose a dynamic load balancing algorithm for heterogeneous clusters; the algorithm uses a sliding-window approach to achieve significant improvement in Hadoop job processing. In paper [13] the authors propose to use spatial-temporal efficiency in a heterogeneous cluster, applying a predictive machine learning approach to forecast job execution time and space occupation. The Hadaps tool [14] was designed for load balancing using a custom block placement policy that considers the hardware generation of the nodes. A novel dynamic load rebalancing algorithm is proposed by the authors of paper [15]; the algorithm is tested in a heterogeneous environment and emphasizes reducing block movement to achieve better performance.

In paper [16] the authors propose a block placement scheme based on the computing capacity of heterogeneous cluster nodes; it balances blocks of data by considering the computing capacity of each node.

Most of the research work on load balancing focuses either on load balancing during scheduling [9, 10] or on load balancing during the MapReduce phase only [12, 13]. However, in papers [11, 14, 15, 16] the authors try to balance the whole cluster based on disk latency, disk utilization or a custom block placement policy. These approaches achieve better results, especially in the case of a heterogeneous cluster.
III. MOTIVATION: OPTIMIZING TASK MIGRATION OVER DATA MIGRATION

Hadoop is more productive in a homogeneous distributed environment than in a heterogeneous one [3]. In reality, a Hadoop cluster may not consist of homogeneous nodes only; it may contain nodes of different storage and processing capacities.

The HDFS block placement policy places data on nodes as discussed in Section 1. The major issue is that we have no control over placing the data blocks according to our choice. This causes problems in MapReduce when load balancing is not done properly. First, if a MapReduce task does not find the data block it needs on its node, it fetches the block from a node that already holds it. This requires inter-node or inter-rack block transfer, which degrades the performance of MapReduce considerably. Second, if MapReduce processes data blocks on a node with low computation capacity, performance gets skewed, and the job may require a longer execution time to finish. The cost of a MapReduce job depends primarily on the data being processed; load balancing is strongly influenced by how data is partitioned in HDFS and where it is stored for processing. The standard Hadoop distribution does not necessarily meet these needs, which leads to an imbalance between the map and reduce phases. Our proposed approach solves both problems discussed above.

As discussed in Section 1, the HDFS block placement policy places the first block replica on the client node itself if the client node is part of the cluster. This makes the cluster quite unbalanced, as that node stores one copy of the whole dataset. If that node is not leveraged for MapReduce processing, data processing takes more time.

We know that in distributed computing "Moving Computation is Cheaper than Moving Data" [17]. Therefore, we propose to store all blocks of a specific file evenly on selected nodes only, so that minimal data transfer takes place during MapReduce job processing. Once the blocks are on specific nodes, we can apply YARN node labels [18] to get the full advantage of the proposed scheme: using node labels we can group these nodes under a single roof and tell our job where to run. The idea is that if we can select our nodes for processing, it is better if those nodes contain the blocks of that job, so that minimal inter-node and inter-rack transfer takes place and we achieve better results.

IV. THE NOVEL APPROACH

By default, Hadoop moves the job to where the data is already lying. But the node where the data lies, when placed by the default HDFS block placement policy, may not have sufficient processing capability. Therefore, even if the data is there and the job is moved to it, the lack of resource capacity means processing may take more time, or other nodes will get the job done by moving the data to them. Hence, we propose an algorithm that places the data according to the pre-identified configuration requirements of the job to be run, exploiting the best capability of the cluster.

We start with two scenarios in mind: first, a group of heterogeneous nodes with different computational processing capacities; second, a cluster of heterogeneous nodes of which a few are homogeneous in nature. Initially, the HDFS block placement policy places the data blocks as shown in Fig. 2.

In the first case, we group nodes by processing capacity. Consider that a few nodes of the cluster are very slow and degrade the performance of the whole cluster. We assign priority=2 to the nodes where we want to store all blocks and priority=1 to the nodes that will not store any block of data. This setting is made in a config.xml file, and our algorithm uses the parameter at runtime for custom block placement. We thus make a list of the nodes with higher processing capacity and allocate all the data blocks to those nodes only. In the second case, we group the homogeneous nodes, assigning priority=2 to them and priority=1 to the rest of the nodes. While placing the blocks, our proposed algorithm also takes care that the blocks of a file are distributed evenly over the selected nodes by considering the disk usage of each node. Each time, before placing a block of the file, it checks that the target node does not already contain a replica of the same block. The fact that Hadoop works better in a homogeneous environment is thus exploited by our strategy even though the cluster is heterogeneous. Fig. 3 demonstrates the proposed custom block placement strategy, and a sketch of the priority configuration follows.
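The paper does not specify the layout of this config.xml, so the following Python sketch assumes a hypothetical schema (one <node> element per datanode, carrying a name and a priority) purely to make the priority grouping concrete:

```python
import xml.etree.ElementTree as ET

def load_node_priorities(path="config.xml"):
    """Return {hostname: priority} from the hypothetical config.xml.

    Assumed layout (illustrative only, not the authors' schema):
      <nodes><node><name>parasilo-1</name><priority>2</priority></node>...</nodes>
    """
    priorities = {}
    for node in ET.parse(path).getroot().iter("node"):
        priorities[node.findtext("name")] = int(node.findtext("priority"))
    return priorities

# Priority-2 nodes receive all blocks of the target file,
# priority-1 nodes receive none:
# node_list2 = [n for n, p in load_node_priorities().items() if p == 2]
```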

FIGURE II. HDFS DEFAULT BLOCK PLACEMENT POLICY

FIGURE III. PROPOSED APPROACH GROUP-1: PRIORITY=2, GROUP-2: PRIORITY=1

V. PROPOSED ALGORITHM

Input: HDFS location of the input files to be balanced. The input is the list of data blocks of that path, placed in HDFS by the default block placement policy.
Output: The data blocks are placed on specific nodes only, based on the given priority factor.

Step 1: Select all blocks of the selected HDFS path from the NameNode metadata.
Step 2: Select all data nodes from DatanodeInfo.
Step 3: Assign the priority to the nodes in the config.xml file (priority=2 nodes will store all data blocks; priority=1 nodes will not store any block of data).
Step 4: Prepare the data node lists with priority=1 (i.e. node_list1) and priority=2 (i.e. node_list2) from the config.xml file.
Step 5: Sort the priority=2 nodes by disk utilization in ascending order and store them in node_list2.
Step 6: For each block, follow the steps below (a Python sketch of the whole procedure follows):
- Initialize block_list containing all blocks of the HDFS path.
- Initialize node_list2, the nodes having priority=2.
- For each block replica in block_list, do the following:
  o Select the node from node_list2 that does not contain a replica of the block and has the lowest disk utilization in the list.
  o Put the block on this node and remove the block from block_list.
  o Remove the selected node from node_list2.
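The steps translate almost directly into code. The sketch below is our Python rendering of the procedure, operating on plain in-memory dictionaries rather than real NameNode metadata; the function name `rearrange_blocks` and the data layout are illustrative assumptions, not the authors' implementation.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in Hadoop 2.x

def rearrange_blocks(block_replicas, node_priority, disk_usage):
    """Sketch of Steps 1-6 of the proposed rearrangement.

    block_replicas: {block_id: replica count}     (Step 1)
    node_priority:  {node: 1 or 2}                (Steps 2-4, from config.xml)
    disk_usage:     {node: bytes currently used}  (Step 5)
    Returns {block_id: [target nodes]}; only priority-2 nodes receive blocks.
    """
    placement = {}
    for block, replicas in block_replicas.items():
        # Step 5: candidate priority-2 nodes, least utilized first.
        node_list2 = sorted((n for n, p in node_priority.items() if p == 2),
                            key=lambda n: disk_usage[n])
        targets = []
        for _ in range(replicas):
            if not node_list2:   # fewer priority-2 nodes than replicas
                break
            # Step 6: the least-utilized candidate cannot already hold a
            # replica of this block, because each chosen node is removed
            # from node_list2 before the next replica is placed.
            node = node_list2.pop(0)
            targets.append(node)
            disk_usage[node] += BLOCK_SIZE
        placement[block] = targets
    return placement

blocks = {"blk_1001": 3, "blk_1002": 3, "blk_1003": 3}
priority = {"parasilo-1": 2, "parasilo-2": 2, "parasilo-3": 2,
            "parasilo-4": 2, "parapide-1": 1, "parapide-2": 1}
usage = {n: 0 for n in priority}
print(rearrange_blocks(blocks, priority, usage))
```

Because node_list2 is rebuilt in ascending order of disk usage for every block, blocks spread evenly over the priority-2 nodes, mirroring the disk-usage balancing described in Section IV.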

VI. EXPERIMENTAL TEST BED

To perform the experiments we used the Grid'5000 heterogeneous cluster. Grid'5000 is a large-scale distributed testbed that lets researchers run experiments on highly configurable clusters. We used 10 nodes for our experiment; the configuration of the nodes is shown in Table 1 below.

VII. RESULTS

We tested the proposed algorithm using four datasets: (1) text datasets copied from the local file system, of size 3 GB and 6 GB; (2) datasets generated by TeraGen, of size 3 GB and 6 GB. The experiments were carried out on Hadoop 2.7.2 with the default replication factor of 3, starting with all nodes empty.

Experiment 1:

In the first test, we used two text datasets of size 3 GB and 6 GB. First we placed the blocks using the default HDFS block placement policy, and afterwards we applied our proposed algorithm with the settings in Table 2. The results (Fig. 4 and 5) show that the strategy places all data blocks on the nodes with priority=2 and also tries to keep the disk usage of the nodes from differing much.

TABLE I. DETAILS OF NODES OF THE CLUSTER

Parasilo-1 to 6 [Parasilo-1: NameNode & DataNode]
  CPU: Intel Xeon E5-2630 v3, 2.40 GHz, 2 CPUs/node, 8 cores/CPU
  Memory: 128 GB
  Network: 1 Gigabit Ethernet
  Storage: 558 GB HDD SATA

Parapide-1 to 4 [DataNodes]
  CPU: Intel Xeon X5570, 2.93 GHz, 2 CPUs/node, 4 cores/CPU
  Memory: 24 GB
  Network: 1 Gigabit Ethernet
  Storage: 465 GB HDD SATA

TABLE II. SETTINGS FOR NODE PRIORITY

Priority = 2: parasilo-1.rennes.grid5000.fr, parasilo-2.rennes.grid5000.fr, parasilo-3.rennes.grid5000.fr, parasilo-4.rennes.grid5000.fr, parasilo-5.rennes.grid5000.fr, parasilo-6.rennes.grid5000.fr
Priority = 1: parapide-1.rennes.grid5000.fr, parapide-2.rennes.grid5000.fr, parapide-3.rennes.grid5000.fr, parapide-4.rennes.grid5000.fr

FIGURE IV. PROPOSED ALGORITHM FOR DATASET – 3 GB
FIGURE V. PROPOSED ALGORITHM FOR DATASET – 6 GB

Experiment 2:

In the second test, we used TeraGen, which generates data using MapReduce and places it onto the nodes of the cluster. The purpose is to check our algorithm against Hadoop's built-in data generation tool. Using TeraGen we generated data files of 3 GB and 6 GB and tested them using the same settings as in Experiment 1 (Table 2).
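As a point of reference (the paper does not show the commands): TeraGen writes fixed 100-byte rows, so a target size maps directly to a row count. A 3 GB dataset corresponds to 30,000,000 rows, generated with something like `hadoop jar hadoop-mapreduce-examples-2.7.2.jar teragen 30000000 /input/tera-3gb`, where the jar path and output directory are our placeholders.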

FIGURE VI. PROPOSED ALGORITHM FOR TERAGEN – 3 GB
FIGURE VII. PROPOSED ALGORITHM FOR TERAGEN – 6 GB

From Fig. 6 and 7 it is observed that after applying the proposed algorithm the data blocks are placed on the selected nodes only. It can also be observed that an approximately uniform amount of data is placed on each node, i.e. disk utilization is more or less the same on each node.

Timing Analysis

Fig. 8 shows the execution time taken by the proposed algorithm in Experiment 1 and Experiment 2. The time taken for balancing the blocks can be compensated during job processing, as the rearrangement reduces inter-node and inter-rack transfer.

FIGURE VIII. BALANCING TIME OF PROPOSED ALGORITHM

VIII. CONCLUSION AND FUTURE WORK

We propose a custom block placement policy that takes the processing capacity of the nodes into account during block placement. This approach helps MapReduce minimize inter-node and inter-rack transfer. We have demonstrated that we can place the data blocks of a specific file on specific nodes only. The approach does not affect the overall load balancing of the cluster, as the rest of the files are not affected. The experimental results show that this scheme can be applied to heterogeneous as well as homogeneous clusters.

Future enhancement focuses on using the concept of "Node Labeling" together with the proposed scheme, which will allow the user to run a job on specific nodes only. Node labeling combined with our proposed algorithm improves the performance of Hadoop: it facilitates processing on selected nodes only, while our algorithm arranges for the data blocks to be on those same nodes, optimizing processing further. In an extended version, we intend to test the proposed algorithm with a full application, so that the actual benefit of the proposed scheme can be evaluated.

REFERENCES

[1] "Welcome to Apache™ Hadoop®!", Hadoop.apache.org, 2018. [Online]. Available: http://hadoop.apache.org/. [Accessed: 09-Jul-2018].
[2] K. Shvachko, H. Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System", 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
[3] J. Dean and S. Ghemawat, "MapReduce", Communications of the ACM, vol. 51, no. 1, p. 107, 2008.
[4] V. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, E. Baldeschwieler, A. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe and H. Shah, "Apache Hadoop YARN", Proceedings of the 4th annual Symposium on Cloud Computing - SOCC '13, 2013.
[5] N. Maheshwari, R. Nanduri and V. Varma, "Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework", Future Generation Computer Systems, vol. 28, no. 1, pp. 119-127, 2012.
[6] "HDFS Balancers | 5.7.x | Cloudera Documentation", Cloudera.com, 2018. [Online]. Available: https://www.cloudera.com/documentation/enterprise/5-7-x/topics/admin_hdfs_balancer.html. [Accessed: 07-Jul-2018].
[7] "Grid5000", Grid5000.fr, 2018. [Online]. Available: https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home. [Accessed: 07-Jul-2018].

[8] A. Shah and M. Padole, "Performance Analysis of Scheduling Algorithms in Apache Hadoop."
[9] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments", Proceedings of the 8th conference on Symposium on Operating Systems Design and Implementation, pp. 29-42, 2008.
[10] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker and I. Stoica, "Delay scheduling", Proceedings of the 5th European conference on Computer systems - EuroSys '10, 2010.
[11] Dharanipragada, S. Padala, B. Kammili and V. Kumar, "Tula: A disk latency aware balancing and block placement strategy for Hadoop", 2017 IEEE International Conference on Big Data (Big Data), 2017.
[12] Y. Liu, W. Jing, Y. Liu, L. Lv, M. Qi and Y. Xiang, "A sliding window-based dynamic load balancing for heterogeneous Hadoop clusters", Concurrency and Computation: Practice and Experience, vol. 29, no. 3, p. e3763, 2016.
[13] Q. Liu, W. Cai, J. Shen, Z. Fu, X. Liu and N. Linge, "A speculative approach to spatial-temporal efficiency with multi-objective optimization in a heterogeneous cloud environment", Security and Communication Networks, vol. 9, no. 17, pp. 4002-4012, 2016.
[14] "fluxroot/hadaps", GitHub, 2018. [Online]. Available: https://github.com/fluxroot/hadaps. [Accessed: 08-Jul-2018].
[15] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors, A. Manzanares and X. Qin, "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters", 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010.
[16] H. Hsiao, H. Chung, H. Shen and Y. Chao, "Load Rebalancing for Distributed File Systems in Clouds", IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 5, pp. 951-962, 2013.
[17] "Apache Hadoop 2.7.2 – HDFS Architecture", Hadoop.apache.org, 2018. [Online]. Available: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. [Accessed: 09-Jul-2018].
[18] "Apache Hadoop 2.7.2 – YARN Node Labels", Hadoop.apache.org, 2018. [Online]. Available: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/NodeLabel.html. [Accessed: 09-Jul-2018].

