Tula: A Disk Latency Aware Balancing and Block Placement Strategy for Hadoop
Abstract—Heterogeneity can occur for various reasons in a Hadoop cluster. This work primarily focuses on heterogeneity caused by varying read/write latencies of disks. When disk latencies are similar across the datanodes, balancing the blocks uniformly is a suitable choice. However, with time, disk latencies can increase due to mechanical problems and bad sectors. Further, disks that crash and become non-functional are replaced with newer disks, which may be of a newer generation and have a higher RPM. This introduces disk-level heterogeneity into an otherwise homogeneous cluster, and balancing uniformly according to disk utilization may not give optimal job runtime. To address this issue we propose a disk latency aware balancer, which balances the cluster taking both disk latency and disk space utilization into consideration. This balancing strategy ensures that a low latency disk receives more blocks than a high latency disk. Furthermore, we introduce a custom block placement strategy that considers disk latency along with other factors. Our preliminary results show an improvement of up to 20% in job runtime.

Keywords—Load Balancing, Block Placement, Disk Latency

I. INTRODUCTION

Hadoop [1] has become the de facto standard for Big Data analytics on commodity clusters. The Hadoop Distributed File System (HDFS) [2] is one of the core components of Hadoop. In HDFS, a file is partitioned into blocks of equal size (usually 64 MB or 128 MB) and each block is replicated (3 times by default) across the cluster. These replicas are distributed in the cluster according to the default block placement policy. The default block placement policy does not ensure uniform block distribution in the cluster due to various competing considerations. Moreover, imbalance also occurs due to node failures, addition of nodes or racks, node upgrades, etc. Hadoop provides a utility called balancer¹, which balances HDFS according to disk utilization.

Load balancing in Hadoop has been explored in many previous works. [3] proposes a block placement policy that adapts block placement according to disk utilization. [4] proposes a fully distributed approach to load rebalancing. [5] proposes a dynamic algorithm to balance the workload between the racks of a cluster by analysing the log files of Hadoop. [6] handles heterogeneity of nodes by computing a metric called compute ratio, which measures the performance of the nodes on a relative scale by running a job of 1 GB size on each node.

Among these works, only [4] and [5] use the rebalancing approach to load balance the cluster; the rest place the blocks according to their placement strategy when a file is uploaded to the cluster. However, none of these works take the disk read/write latencies into account while performing the balancing operation. In this work, we observe that a disk's performance may degrade with time due to mechanical failures and bad sectors. Also, disks may crash completely and be replaced with newer disks with better RPM. This creates heterogeneity in the cluster, and balancing by disk utilization alone might not be optimal. Tasks running on nodes with high disk latency will be slower, since most of the jobs are data intensive, whereas tasks running on low latency nodes will finish fast, so those nodes will be used to run more tasks than the slower nodes. When the blocks are uniformly distributed, tasks running on a fast node will also access blocks from the slower nodes. This burdens the slow nodes with the additional overhead of streaming blocks to remote tasks, which further degrades the performance of tasks on those nodes. Hence, to improve the performance of jobs in such a scenario, we propose Tula, a disk latency aware balancer (LatencyBalancer) and block placement strategy for Hadoop clusters.

The major contributions of this paper are:
• A disk latency aware block placement strategy.
• A modified balancer algorithm called LatencyBalancer which uses disk latencies as well as disk utilization to balance the cluster.
• Experimental evaluation and comparison of Tula with Hadoop's default block placement and default balancer.

This paper is organized as follows. Section II gives the related background on Hadoop, the default balancer and block placement strategy, the Zero Copy API, and the Message Filter framework, and discusses related work. Section III describes the architecture and implementation of Tula. Section IV evaluates it empirically. Finally, we conclude and discuss future work in Section V.

¹http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer
is present) close to the node where the map task is running. Reading non-local data by mapper tasks can cause network congestion and can degrade the performance of the current job as well as of other jobs running in the cluster. Although one can reduce this problem by increasing the replication factor, that is costly from a cloud provider's perspective.

In a homogeneous cluster, where all the nodes perform the same, it is easy to balance the blocks in the cluster based on disk utilization; the default balancer utility provided by Hadoop comes to our rescue. But when there is some level of heterogeneity in the cluster due to disk failure or disk performance degradation, balancing uniformly will not yield good results. The primary reason for disk failure is bad blocks, which can be inspected by monitoring SMART values [11]. An increase in the reallocated sector count indicates an increase in bad sectors [12], [13]. Moreover, non-functional disks get replaced with newer ones with better RPMs, and disks with many bad sectors perform poorly. Due to this heterogeneity, map tasks running on the nodes with faster disks can perform better than those on the slower nodes. Hence, the faster nodes should be allocated more blocks of data than the slower nodes; otherwise, tasks on faster nodes may end up accessing non-local data, and from slower nodes at that. This motivates us to develop a new balancer and block placement algorithm for Hadoop that takes disk latency into account.

III. TULA: ARCHITECTURE AND IMPLEMENTATION

and syscallwrapper.cc, and one filter file, FilesysFilter.cc, is created. sendfile64 gets triggered when Hadoop blocks are read, as discussed in Section II.

The sendfile64 entry point is the macro SYSCALL_DEFINE4(sendfile64, ..), which is defined in read_write.c. This calls the sys_sendfile64 wrapper, which in turn calls filter_sendfile64 in FilesysFilter.cc; this function checks whether the call is a Hadoop block read request. To do so, it uses the getName(fileDescriptor) function, which returns the filename with its complete path in the local file system for the supplied file descriptor. For a Hadoop block read request, it measures the latency of cumulative read requests totalling 64 MB (the default block size). It exports this value through the /proc filesystem, from where it is later picked up by HDFS. Given below is the pseudo code for the important changes in the message filter.

Listing 1: Find read latency of datanode

class FileSysFilter : public MessageFilter {
protected:
    MessageFilter *filesysfilter;
public:
    FileSysFilter(MessageFilter *filesysfilter);
    int filter_open(const char *filename, int flags, umode_t mode);
    ssize_t filter_sendfile64(int out_fd, int in_fd, loff_t pos,
                              size_t count, char *flag);
};

ssize_t FileSysFilter::filter_sendfile64(int out_fd, int in_fd,
                                         loff_t pos, size_t count,
                                         char *flag) {
    ssize_t size = 1;
    struct timeval start, end;        // timestamps bracketing the read
    char *filename = getName(in_fd);  // resolve the fd to a local path
    if (filename != NULL) {
        if (/* file is a block stored in HDFS */) {
            /* Accumulate the time taken by this read; once the
               cumulative reads reach 64 MB, export the measured
               latency via the /proc filesystem. */
        }
    }
    return size;  // pseudocode: the actual read and its result are elided
}
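On the HDFS side, the datanode then only needs to read this exported value before reporting it to the namenode. The following is a minimal Java sketch of that step; the /proc entry name "/proc/hdfs_read_latency" is a hypothetical name chosen for illustration, since the paper does not state the actual entry.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Sketch: read the disk read latency that the kernel message filter
 * exports through /proc. The entry name below is an assumption.
 */
public class ProcLatencyReader {
    private static final String PROC_ENTRY = "/proc/hdfs_read_latency";

    /** Returns the exported latency value, or -1 if it is unavailable. */
    public static long readLatency() {
        try {
            String value =
                new String(Files.readAllBytes(Paths.get(PROC_ENTRY))).trim();
            return Long.parseLong(value);
        } catch (IOException | NumberFormatException e) {
            return -1; // entry missing or malformed; caller can fall back
        }
    }
}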
by the protoc compiler, and a Java file DatanodeProtocolProtos.java gets generated, containing getter and setter functions for all the types defined in the protocol buffer message structure, such as setLatency(long value).

The file DatanodeProtocol.java in the package org.apache.hadoop.hdfs.server.protocol declares the interface DatanodeProtocol, which is used by a datanode to communicate with the namenode. Both the namenode and the datanode implement it.

The file DatanodeProtocolClientSideTranslatorPB.java in the package org.apache.hadoop.hdfs.protocolPB defines the class DatanodeProtocolClientSideTranslatorPB, which implements the DatanodeProtocol interface. It translates client-side requests made on the DatanodeProtocol interface to the RPC server implementing DatanodeProtocolPB. This is where the BlockReport and Heartbeat messages get constructed using a namenode rpcProxy object. In the blockReport function, while building the block report, we add the latency of the datanode.

The file DatanodeProtocolServerSideTranslatorPB.java in the package org.apache.hadoop.hdfs.protocolPB defines the class DatanodeProtocolServerSideTranslatorPB, which implements DatanodeProtocolPB. Its blockReport function is used by the namenode to retrieve the latencies of all datanodes, via the getter function on the request object of type BlockReportRequestProto.

The init function is used to classify each datanode as high, aboveAverage, belowAverage or low with respect to a threshold value. After this step, the LatencyBalancer tries to balance the cluster in multiple iterations, first matching high with low, then high with belowAverage, then aboveAverage with belowAverage, and then moves the blocks. For each block to be moved, a proxy source (a node that is comparatively free in terms of CPU, memory and network utilization) is found, which can be the source node itself (since every block has several replicas). The block is then moved to the destination node, and after the operation completes successfully, the block at the actual source node is deleted.

Algorithm 1 Initialize average latency utilization product
function initAvgOfLatencyUtilizationProduct
Input: List of DatanodeStorageReport, reports;
       HashMap of datanode IP address and latency, latencies
foreach StorageType t : StorageTypes do
    double sum = 0;
    foreach DatanodeStorageReport r : reports do
        foreach StorageReport s : r.getStorageReports() do
            if t == s.getStorage().getStorageType() then
                sum += latencies.get(r.getDatanodeInfo().getIpAddr())
                       * getUtilization(r, t);
    avgProducts.set(t, sum / reports.size());
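Rendered in Java, Algorithm 1 looks roughly as follows. The container types here are simplified stand-ins for Hadoop's DatanodeStorageReport, StorageReport and StorageType, and getUtilization is modelled as used space over capacity, so this is a sketch of the listing above rather than the exact Tula source.

import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Sketch of Algorithm 1: average (latency x utilization) per storage type. */
public class LatencyBalancerInit {
    enum StorageType { DISK, SSD, ARCHIVE, RAM_DISK }

    static class StorageReport {
        final StorageType type;
        final long capacity, dfsUsed;
        StorageReport(StorageType type, long capacity, long dfsUsed) {
            this.type = type; this.capacity = capacity; this.dfsUsed = dfsUsed;
        }
    }

    static class DatanodeStorageReport {
        final String ipAddr;
        final List<StorageReport> storageReports;
        DatanodeStorageReport(String ipAddr, List<StorageReport> reports) {
            this.ipAddr = ipAddr; this.storageReports = reports;
        }
    }

    /** Average latency-utilization product, one entry per storage type. */
    final Map<StorageType, Double> avgProducts = new EnumMap<>(StorageType.class);

    void initAvgOfLatencyUtilizationProduct(List<DatanodeStorageReport> reports,
                                            Map<String, Long> latencies) {
        for (StorageType t : StorageType.values()) {
            double sum = 0;
            for (DatanodeStorageReport r : reports) {
                for (StorageReport s : r.storageReports) {
                    if (t == s.type) { // only storages of the type being averaged
                        sum += latencies.get(r.ipAddr) * utilization(s);
                    }
                }
            }
            avgProducts.put(t, sum / reports.size());
        }
    }

    /** Fraction of capacity in use on one storage (simplified getUtilization). */
    private static double utilization(StorageReport s) {
        return s.capacity == 0 ? 0 : (double) s.dfsUsed / s.capacity;
    }
}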
the datanode latencies from the respective /proc filesystem entries to the namenode. The namenode maintains a HashMap of read latencies for each datanode in the cluster. Given below is the algorithm for the custom block placement. The algorithm

Preliminary results show that the job runtime has improved by up to 20%.

A. Default Balancer vs Latency Balancer

1) Environment: All experiments were done on a 3-node (VM) homogeneous cluster connected by a 1 Gbps LAN. The VMs have been configured such that the master node's disk latency is half that of the other two nodes. Each node has an Intel(R) Xeon(R) E5-1620 CPU with 4 cores at 3.6 GHz, a 10240 KB L3 cache, 4 GB RAM and 8 GB swap space, and runs the BOSS MOOL operating system. We use the Hadoop 2.7.1 source code for implementing the LatencyBalancer.

2) Datasets: We use the Pizza & Chili corpus [14] as the input data for the word count and word mean programs. The input file is 10 GB.
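The custom placement algorithm itself is cut off in this excerpt, but its stated idea, favouring datanodes with lower disk latency, can be illustrated with a small sketch. The weighting scheme and class name below are assumptions made for illustration, not the paper's actual algorithm: a candidate is chosen with probability proportional to the inverse of its reported read latency.

import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Illustrative sketch only: pick a target datanode for a new block replica
 * with probability proportional to 1/latency, so low-latency disks receive
 * more blocks. This is an assumed weighting, not Tula's published algorithm.
 */
public class LatencyAwareChooser {
    private final Random rng = new Random();

    /** @param latencies read latency per candidate datanode IP (e.g., ms per 64 MB) */
    public String chooseTarget(List<String> candidates, Map<String, Long> latencies) {
        double totalWeight = 0;
        for (String node : candidates) {
            totalWeight += 1.0 / latencies.get(node);
        }
        // Sample a node: faster disks (smaller latency) get larger slices.
        double r = rng.nextDouble() * totalWeight;
        for (String node : candidates) {
            r -= 1.0 / latencies.get(node);
            if (r <= 0) {
                return node;
            }
        }
        return candidates.get(candidates.size() - 1); // numeric edge case
    }
}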
2) Datasets: Two datasets of size 2 GB and 2.5 GB are generated using the TeraGen program shipped with Hadoop.

3) Results: We use a completely heterogeneous setting to compare the default block placement and our custom block placement policy by running the TeraSort job on the 2 GB and 2.5 GB data respectively. Figures 4a and 4b show the blocks allocated to the various nodes by the default block placement policy of Hadoop and by our custom block placement policy for the 2 GB and 2.5 GB TeraGen datasets respectively. Figure 4c shows the corresponding runtime comparison. We observe that job performance improves in a heterogeneous environment when blocks are allocated using the custom block placement strategy.

REFERENCES

[1] "Apache Hadoop," http://hadoop.apache.org.

[2] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2010, pp. 1-10.

[3] N. E. Pius, L. Qin, F. Yang, and Z. H. Ming, "Optimizing hadoop block placement policy & cluster blocks distribution," vol. 6, pp. 1224-1231, 2013. [Online]. Available: http://waset.org/Publications?p=70

[4] H.-C. Hsiao, H.-Y. Chung, H. Shen, and Y.-C. Chao, "Load rebalancing for distributed file systems in clouds," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 5, pp. 951-962, 2013.