
Received May 17, 2018, accepted June 25, 2018, date of publication June 29, 2018, date of current version July 30, 2018.

Digital Object Identifier 10.1109/ACCESS.2018.2851571

Block Placement in Distributed File Systems Based on Block Access Frequency

JIANWEI LIAO 1,2, ZHIGANG CAI 1, FRANCOIS TRAHAY 3, AND XIAONING PENG 4
1 College of Computer and Information Science, Southwest University, Chongqing 400715, China
2 School of Computer Science and Engineering, Huaihua University, Huaihua 418000, China
3 Telecom SudParis, 91000 Évry, France
4 Key Laboratory of Intelligent Information Technology, Huaihua University, Huaihua 418000, China

Corresponding author: Xiaoning Peng (hhpxn@163.com)


This work was supported in part by the Hunan Provincial Natural Science Foundation of China under Grant 2018JJ2309, in part by the
Fundamental Research Funds for the Central Universities under Grant XDJK2017B044, and in part by the Opening Project of State Key
Laboratory for Novel Software Technology under Grant KFKT2016B05.

ABSTRACT This paper proposes a new data placement policy to allocate data blocks across the storage servers of distributed/parallel file systems, for yielding an even distribution of block access workloads. To this end, we first analyze the history of the block access sequence of a specific application and then introduce a k-partition algorithm to divide the data blocks into multiple groups, by referring to their access frequency. As a result, each group has almost the same access workload, and we can thus distribute these block groups onto the storage servers of the distributed file system, to achieve the goal of uniformly assigning data blocks when running the application. In summary, this newly proposed data placement policy can yield not only an even data distribution but also block access balance. The experimental results show that the proposed scheme can greatly reduce I/O time and better utilize the storage servers when running database-relevant applications, compared with the commonly used block data placement strategy, i.e., the round-robin placement policy.

INDEX TERMS Distributed file systems, k-partition, block access balance, data re-distribution, I/O performance.

I. INTRODUCTION
Parallel/distributed file systems have been widely adopted as the back-end storage for Big Data applications in high performance computing [1]. As a result, the aggregate transfer rate can be much higher, and the overall storage capacity is supposed to be significantly enhanced [2]. In fact, big files are supposed to be divided into many blocks, which are striped over multiple storage servers, to enable these servers to process file data in parallel [24].

The data layout policy of a distributed/parallel file system, which organizes the physical data layout across storage servers, is a key factor in determining parallel I/O performance [5]. Therefore, preferable data layout policies can result in data locality and workload balance among storage servers, to primarily boost system performance. In general, distributed/parallel file systems take advantage of the round-robin placement policy [3], or the uniform-at-random algorithm [4], to allocate data blocks across the storage servers, for yielding an even distribution of data amount.

On the other side, different from conventional applications, one of the main features of big data applications, such as database applications, is the tight coupling relationship between data and computation, as computation tasks can be conducted only when the required data are available [3]. In other words, not only task assignment, but also data placement deeply affects the execution of big data applications. In fact, several research works [30], [31] have analyzed how file system performance can be impacted by many workload factors, such as the distribution manners of files, I/O request sizes, and I/O access characteristics. As a conclusion, it is argued that a major issue in this context is to identify where to allocate data blocks and block replicas for these applications, to reach the targets of optimizing disk utilization, reducing network congestion, and improving data throughput [34].

Therefore, a number of advanced block (file) data placement schemes have been proposed for distributed file systems, by considering various factors. Specifically, Song et al. [6] introduced a cost model to calculate the


overall I/O cost of any given application, to guide choosing an appropriate layout policy for the application.¹ Also, they have presented a server-side adaptive data layout strategy in parallel file systems, for data-intensive applications with non-uniform data access patterns [7]. Gu et al. [3] and Wang et al. [5] have respectively proposed their own data placement policies, for minimizing I/O overhead, and thus speeding up the execution of applications. Yuan et al. [11] have explored data access patterns of scientific cloud workflows and then introduced a clustering data placement strategy in distributed file systems, for the purpose of reducing the I/O time of scientific applications. A novel scheme to initially place data blocks across storage nodes of the Hadoop distributed file system (HDFS) has been presented in [12], for balancing data processing while running MapReduce applications. Similar to [35], Yazd et al. [13] have presented a flexible block data placement strategy on storage servers in Hadoop-based datacenters, named Mirrored Data Block Replication, to better cut down the energy utilization of storage servers.

However, these existing schemes may result in a waste of resources on storage nodes, and fail to speed up the execution of applications. The big data workloads associated with different files, and even with different parts of the same file, might be totally diverse, which leads to varied levels of access frequency to data blocks. As a consequence, a storage server with frequently requested block data may stay busy, but other storage nodes having less frequently accessed data blocks may keep idle, though both kinds of servers have the same number of data blocks. Eventually, the execution of applications depends on the completion time of the requested I/O services, provided by the busiest storage server.

To address this issue, we propose a novel data placement policy on the basis of access frequency to data blocks, for the purpose of achieving an even access distribution. Firstly, we collect access statistics on all data blocks when running the application. After that, data blocks are evenly divided into multiple groups, by referring to their access frequency and access time. At last, different block groups can be allocated onto separate storage nodes. As a result, not merely the block load balance, but also the block access balance across the storage nodes can be ensured. To put it from another angle, the storage servers can be well utilized and provisioned, to maximize the performance of big data applications, in the case of running them over multiple cycles. In summary, this paper makes the following two contributions:
• We have performed a detailed theoretical analysis on the history of block accesses belonging to a specific application. Then, we argue that distributing data blocks onto the storage servers of a distributed file system for yielding both data load balance and access load balance is an NP-complete problem. Regarding this issue, we have also introduced principles to guarantee achieving approximately optimal solutions to our target problem or similar ones.
• We have proposed the k-partition algorithm to sub-optimally classify data blocks into k groups. Consequently, these divided groups will be placed onto k storage servers of the distributed file system, to balance block access workloads, as well as speed up I/O processing. As a consequence, the execution time of the application can be noticeably cut down, and the overall performance of the file system can also be enhanced.

The rest of the paper is organized as follows: Section II discusses background knowledge and related work on data placement in distributed/parallel file systems. Section III presents our motivating factors and a theoretical analysis of the block distribution problem. The design and implementation details of the newly proposed scheme of block data distribution are described in Section IV. Section V evaluates the implemented system, and then presents experimental results. At last, we conclude the paper in Section VI.

¹ Note that we intentionally describe some of the most related work here to further motivate our proposal, but a more general survey of related work will be presented in Section II.

II. BACKGROUND AND RELATED WORK
Many efforts have been contributed to decreasing the time resulting from I/O operations, through providing flexible data (replica) placement solutions, when executing the target applications.

A. DATA PLACEMENT IN CONVENTIONAL DISTRIBUTED/PARALLEL FILE SYSTEMS
Xie et al. [12] have introduced a novel way to initially allocate data blocks across storage nodes of the Hadoop distributed file system (HDFS), for the purpose of balancing data processing load, according to the computing capacity of each node. Furthermore, they have proposed data reorganization and redistribution algorithms, to adaptively regulate data layout on the basis of data access patterns by considering dynamic data insertions and deletions. Herodotou et al. [26] have introduced a self-tuning system for tuning big data analytics within the Hadoop stack. To reduce the imbalanced data placement of write-read workflows in the storage, they have proposed to use a round-robin policy for chunk placement instead of the default random policy.

Co-Hadoop is an extension to the Hadoop distributed file system [15]. It can help the applications to control data placement at the file-system level [16]. To this end, the authors introduced a new file level property called locator, and then modified the default data placement policy of Hadoop to make use of this locator property for neatly placing data.

Song et al. [7] have proposed a strategy that adopts different stripe sizes for different file servers in PVFS [24], according to the data access characteristics on each individual server. Then, the busy file servers can fully utilize bandwidth to hold more data, and the file servers having limited request service rate can manage less data. Weil et al. [9] proposed CRUSH in Ceph, which is a scalable, pseudorandom data distribution function designed for distributed object-based storage


systems. It efficiently maps data objects to storage devices without relying on a central directory.

B. ADAPTIVE DATA PLACEMENT POLICIES
Wang et al. [14] recently argued that keeping the physical data distribution and maintaining data blocks should be separated out from the metadata management and conducted by each storage node autonomously, for the purpose of decreasing the memory cost and maintenance cost on the master node. Bhadkamkar et al. [10] described BORG, a self-optimizing layer in the storage stack, which is able to automatically regulate disk data layout for adapting to the disk access patterns of workloads. The basic idea of BORG is to optimize both read and write traffic dynamically through making reads and writes more sequential. Wang et al. [5] have proposed a parallel file system for reducing the I/O time required by remote sensing applications. This file system can adaptively offer application-aware data layout policies, to benefit different data access patterns of remote sensing applications from the server side.

Hsiao et al. [34] have designed an attractive load rebalancing scheme for distributed file systems in clouds. Because compute nodes may be dynamically upgraded, replaced, and added in the cloud system, bringing about data load imbalance, their proposal is able to balance the loads of nodes and decrease the demanded movement cost by using the information on physical network locality and node heterogeneity. Besides, in order to eliminate the imbalance of parallel writes on distributed file systems, Huang et al. [27] proposed Opass, which employs a heatmap for monitoring the I/O state of storage servers, and uses a novel heuristic policy to choose an optimal storage node for serving write requests.

C. APPLICATION-SPECIFIC DATA (FILE) PLACEMENT
Targeting different application contexts, many researchers have presented their specific schemes for purposely allocating file data. Yuan et al. [11] have explored data access patterns of scientific cloud workflows and then introduced a clustering data placement strategy. In other words, this strategy is able to automatically distribute application data among storage nodes based on data dependencies. Consequently, it can minimize the overhead caused by data movement during the execution of workflows, as well as reduce the time spent waiting for the required data, since relevant data are managed locally.

Agarwal et al. [19] have presented an automated data placement mechanism, named Volley, for geo-distributed cloud services, by taking several factors into account, including WAN bandwidth cost, data center capacity limits, and data inter-dependencies. Similarly, Yu and Pan [33] have proposed a data placement policy to enhance the co-location of associated data and the localized data serving under the premise of ensuring the balance among storage nodes. Wolf et al. [18] investigated how to conduct a disk placement of Video-on-Demand (VoD) file copies on the servers, and the amount of load capacity assigned to each file copy, for minimizing the communication cost while ensuring quality of service of VoD services. In addition, Jin et al. [25] have proposed a joint optimization scheme that simultaneously optimizes virtual machine (VM) placement and network flow routing to maximize energy savings on the host machines.

None of the aforementioned approaches, however, focuses on both data load balance and access load balance when placing file data onto the storage servers of a distributed/parallel file system.

III. MOTIVATIONS AND THEORETICAL ANALYSIS
This section first discusses our motivation to properly allocate data blocks onto storage servers. After that, we perform a detailed analysis on the issue of evenly distributing data blocks, and then demonstrate that this problem is NP-hard.

A. MOTIVATING FACTORS
To illustrate the significance of even data placement on the storage servers of a distributed/parallel file system, we launch a job running on a 12-node cluster to perform analysis on a small scale of TPC-C [21]. The file data are divided into multiple blocks (the block size is 64 KB), and these blocks are striped in a round-robin manner onto 4 storage servers in the PVFS file system [24].

There are varied levels of access frequency to the blocks; the access counts to the first 128 data blocks of a file are shown in Figure 1(a). As presented, the access frequency to data blocks is far from balanced. In other words, a small portion of the blocks contains most of the desired data, while a major part of the blocks are accessed only once. We argue that this situation causes an imbalanced access workload distribution over the storage nodes during parallel execution. Figure 1(b) further demonstrates that the storage servers do have different data throughput, and this imbalance could seriously degrade the execution performance in many sub-dataset analyses. That is to say, the application's I/O time (even its execution time) is determined by the storage server having the heaviest I/O access workloads.

FIGURE 1. (Simplified) Imbalanced data access workloads in parallel execution of a database application. (a) Access frequency on data blocks. (b) Data throughput over storage servers.

In conclusion, properly distributing data blocks onto the storage servers of a distributed file system is critical to not only balancing I/O access workloads, but also speeding up the execution of applications.


B. THEORETICAL ANALYSIS
This section conducts a theoretical analysis on preferably allocating data blocks onto the storage servers of a distributed/parallel file system, as expected.

To put it from another angle for stating this problem: there is a set of n positive numbers, i.e. a_1, a_2, ..., a_n, which represents the block access frequency of n data blocks B_1, B_2, ..., B_n. On the other side, assume we have k storage servers in total, labeled as S_1, S_2, ..., S_k. The target is to partition the n positive numbers into k subsets, in which the sums of the elements of the subgroups are equal or at least nearly equal, and the number of elements in each subgroup is almost the same. Thus, it indicates we can perfectly distribute the n data blocks onto the k storage servers of the distributed file system, by also considering the access frequency to the blocks.

In order to express this division problem in a mathematical manner, we first define a dummy variable x_{ij}, to indicate whether the data block B_i is distributed onto the storage server S_j. In the case that the value of x_{ij} is 1, it means the data block B_i is allocated on S_j; otherwise, the data block is not allocated on the server S_j. Next, we define a variable A_j to represent the total access count with respect to the storage server S_j:

    A_j = \sum_{i=1}^{n} a_i \cdot x_{ij}    (1)

The average access count over all storage servers can be expressed as the following equation:

    \bar{A} = \sum_{j=1}^{k} A_j / k    (2)

Equation 3 defines the total number of blocks mapped onto the storage server S_j.

    D_j = \sum_{i=1}^{n} x_{ij}    (3)

Consequently, the average block number among all storage servers can be defined as:

    \bar{D} = \sum_{j=1}^{k} D_j / k    (4)

At last, we define a variable N for showing the sum of access counts to all data blocks.

    N = \sum_{i=1}^{n} a_i = \sum_{j=1}^{k} A_j    (5)

Since we choose variance to measure the dispersion of a set of data, we can transform the problem into a standard optimization problem:

    Min  \lambda \sum_{j=1}^{k} (A_j - \bar{A})^2 + (1 - \lambda) \sum_{j=1}^{k} (D_j - \bar{D})^2
    s.t. \sum_{j=1}^{k} x_{ij} = 1  (i = 1, 2, ..., n, and x_{ij} \in \{0, 1\})    (6)

where 0 ≤ λ ≤ 1 represents the weighing coefficient for balancing access counts and data amount.

Following the general rule, we let λ equal 0.5, and note that both \bar{A} and \bar{D} are constants for a given block access sequence. As a consequence, Equation 6 can be further simplified as Equation 7.

    Min  \sum_{j=1}^{k} A_j^2 + \sum_{j=1}^{k} D_j^2
    s.t. \sum_{j=1}^{k} x_{ij} = 1  (i = 1, 2, ..., n, and x_{ij} \in \{0, 1\})    (7)

In the model presented with Equation 7, the first part of the objective function, \sum_{j=1}^{k} A_j^2, expresses the goal of equalizing the access counts over all storage servers. That is to say, this part may reach its minimum of N^2/k when all A_j are (closely) equal to \bar{A}. The second part, \sum_{j=1}^{k} D_j^2, implies that all data blocks are supposed to be evenly distributed onto all storage servers. In the case that all D_j are (nearly) equal to \bar{D}, this part can reach its minimum value of n^2/k. It is worth noting that we may not achieve the ideal optimal value, because the variable x_{ij} is a 0/1 dummy variable, instead of a continuous variable.

Through the aforementioned analysis, we can identify that the target problem is a 0-1 quadratic programming problem, which is a typical NP-hard problem [17]. Therefore, we can understand that the above defined problem is also hard to solve. In order to minimize the time overhead caused by solving this problem, we have proposed a partition algorithm to classify data blocks into multiple groups, which have almost the same sum of access counts to the in-group blocks. Section IV will depict the details of the proposed partition algorithm.
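To make the objective concrete, the following minimal Python sketch scores a candidate block-to-server assignment with the variance objective of Equation 6 (Equation 7 being its simplified equivalent). The function name, the toy access counts, and the example assignments are illustrative assumptions of ours, not part of the paper's implementation.

def placement_cost(access_counts, assignment, k, lam=0.5):
    # access_counts[i] is a_i, the access frequency of block B_i.
    # assignment[i] is the index j of the storage server S_j holding block B_i.
    A = [0.0] * k           # A_j: total access count mapped to server j
    D = [0.0] * k           # D_j: number of blocks mapped to server j
    for a_i, j in zip(access_counts, assignment):
        A[j] += a_i
        D[j] += 1
    A_bar = sum(A) / k
    D_bar = sum(D) / k
    # lam * sum_j (A_j - A_bar)^2 + (1 - lam) * sum_j (D_j - D_bar)^2
    return (lam * sum((a - A_bar) ** 2 for a in A)
            + (1 - lam) * sum((d - D_bar) ** 2 for d in D))

# Round-robin vs. a frequency-aware grouping for 6 blocks on 2 servers:
counts = [50, 1, 40, 2, 3, 45]
print(placement_cost(counts, [0, 1, 0, 1, 0, 1], k=2))   # round-robin: 506.25
print(placement_cost(counts, [0, 0, 1, 0, 0, 1], k=2))   # frequency-aware: 211.25

The lower cost of the frequency-aware assignment reflects exactly the imbalance that the round-robin policy leaves behind when access counts are skewed.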
IV. DESIGN PHILOSOPHY AND IMPLEMENTATION
Apart from presenting the system architecture, the algorithm used to classify data blocks into multiple groups, and its implementation specifications with the PVFS file system, are elaborately described in this section.

A. SYSTEM ARCHITECTURE
Figure 2 demonstrates a comparison overview of the round-robin placement policy (left side) and the proposed placement policy (right side). Both of them are able to yield an even data amount distribution across the storage servers of a distributed/parallel file system, but the newly proposed one can also achieve the goal of balancing block accesses over the storage servers.

FIGURE 2. Comparison overview of the round-robin block placement policy and the proposed placement policy.

As shown, the commonly used round-robin manner distributes data blocks evenly to the storage servers of the distributed file system, regardless of the block access frequency. Consequently, the storage servers having many frequently accessed data blocks may become the bottleneck of the file system, even though some storage servers may stay idle. To overcome this limitation, we have proposed a new scheme

to distribute data blocks, by taking the information on block access frequency into account.

In addition to yielding an even block allocation in quantity, we intend to achieve the goal of balanced block access workloads associated with these storage servers. The basic idea is to divide all data blocks into multiple groups, in accordance with the number of storage servers. More importantly, the number of total accesses to the blocks in a group should be approximately equal to the sums of the other groups.

B. k-PARTITION GROUPING ALGORITHM
As discussed in Section III-B, the problem of evenly grouping blocks is NP-hard in the ordinary sense [28]. It is not possible to have any polynomial-time solution for such a problem [29]. In our application contexts, the selected approximation solution for this problem is based on the first-fit decreasing algorithm, which is often used to generate approximate solutions to similar NP-hard problems.

Regarding a sequence of block access counts belonging to a specific application, the modified first-fit decreasing algorithm is able to evenly divide the numbers (i.e. access counts) in the sequence into multiple (e.g. k) partitions. It first sorts the numbers in descending order. Next, it distributes the first k numbers (the largest k numbers) across the k subgroups. After that, it iterates over the remaining n−k numbers, and adds each number to the subgroup that currently has the lowest sum. Finally, a group of n integer numbers (i.e. n data blocks having varied access counts) can be divided fairly into k subgroups, such that the maximum of the sums of the k subgroups is minimized.
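As an illustration, the short Python sketch below follows the modified first-fit decreasing procedure just described; the function and variable names are our own, and the example data are hypothetical.

def group_by_access_count(access_counts, k):
    # Divide block indices into k groups with near-equal total access counts.
    groups = [[] for _ in range(k)]     # block indices assigned to each server
    sums = [0] * k                      # current access-count sum of each group
    # Sort block indices by access count, largest first.
    order = sorted(range(len(access_counts)),
                   key=lambda i: access_counts[i], reverse=True)
    for i in order:
        # Put the block into the group that currently has the lowest sum;
        # the first k (largest) blocks therefore land in k distinct groups.
        target = sums.index(min(sums))
        groups[target].append(i)
        sums[target] += access_counts[i]
    return groups, sums

# Example: 8 blocks with skewed access counts, grouped for 4 storage servers.
groups, sums = group_by_access_count([120, 3, 45, 1, 1, 60, 2, 40], 4)
print(groups)   # groups of block indices, one group per storage server
print(sums)     # per-group access workloads, roughly balanced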
Algorithm 1 The Equal k-Partition Algorithm
Input: S is the list, and k (# of parts);
Output: k (nearly) equal parts of S;
1   if k ≤ 1 then
2       return S;
3   if k ≥ len(S) then
4       return [s] for s in S;
5   parts_between = [];
6   for i in range(k - 1) do
7       parts_between.append((i + 1) * len(S) / k);
8   a = sum(S) / k;
9   adjust_cnt = 0;
10  # loop over possible partition decisions
11  while true do
12      # partition the list into k subsets
13      parts = [];
14      index = 0;
15      for div in parts_between do
16          parts.append(S[index:div]);
17          index = div;
18      parts.append(S[index:]);
19      worst_diff = 0;
20      worst_parts_idx = -1;
21      for p in parts do
22          diff = a - sum(p);
23          if abs(diff) > abs(worst_diff) then
24              worst_diff = diff;
25              worst_parts_idx = parts.index(p);
26      while adjust_cnt < COUNTS and abs(worst_diff) > DIFFS do
27          adjust_part(worst_parts_idx, worst_diff, parts_between, parts);
28          adjust_cnt += 1;
29  return parts;

Algorithm 1 specifically demonstrates this newly proposed grouping method in a Python style. In fact, this algorithm arranges the integers in the list S, which represent the access counts to a number of data blocks, into k partitions. The approach defines partition boundaries that divide the integer list S into roughly equal numbers of elements, and then repeatedly searches for better partition decisions. To put it from another


angle, we first create a list of partition boundary indexes, parts_between, using the index on the left of each partition to represent its start index. The algorithm can then roughly partition the list S into equal groups of len(S)/k elements, but note that the last group may have a different size. Then, for the purpose of determining whether the current partitions are acceptable or not, the algorithm computes the sum differences of the partitions, as shown in Lines 21−25. At last, the process of partitioning ends when the iterative search reaches the predefined number of iterations (i.e. COUNTS) or the maximum difference among partitions is less than a predefined threshold (i.e. DIFFS).

The criteria to end the algorithm are important for yielding a good result with complex data, and changing such predefined thresholds is a good way to experiment with getting improved results. Occasionally, we have to adjust the current partitions, when the ending criteria are not met, as presented in Lines 26−28 of Algorithm 1.

The basic idea of adjusting the worst partition, i.e. the one whose sum deviates most from the ideal, is to move it closer to the ideal size. Algorithm 2 illustrates the details of partitioning adjustment. To be specific, the overall goal is to take the worst partition and adjust its size to make its sum closer to the ideal. In general, if the worst partition is too big, we want to shrink it by moving one of its ends into the smaller of the two neighboring parts. On the other side, if the worst partition is too small, it is better to grow it by expanding it towards the larger of the two neighboring parts.
Algorithm 2 Adjust Partition (adjust_part(...))
Input: parts_between (partition boundaries), worst_parts_idx, worst_diff, and parts;
Output: adjusted partition boundaries in parts_between;
1   if worst_parts_idx == 0 then
2       if worst_diff < 0 then
3           parts_between[0] -= 1;
4       else
5           parts_between[0] += 1;
6   else if worst_parts_idx == len(parts) - 1 then
7       if worst_diff < 0 then
8           parts_between[-1] += 1;
9       else
10          parts_between[-1] -= 1;
11  else
12      left_bound = worst_parts_idx - 1;
13      right_bound = worst_parts_idx;
14      if worst_diff < 0 then
15          if sum(parts[worst_parts_idx - 1]) > sum(parts[worst_parts_idx + 1]) then
16              parts_between[right_bound] -= 1;
17          else
18              parts_between[left_bound] += 1;
19      else
20          if sum(parts[worst_parts_idx - 1]) > sum(parts[worst_parts_idx + 1]) then
21              parts_between[left_bound] -= 1;
22          else
23              parts_between[right_bound] += 1;
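For completeness, the self-contained Python sketch below combines the logic of Algorithms 1 and 2 into one runnable routine. It is a re-implementation under our own assumptions: max_rounds and tolerance stand in for the COUNTS and DIFFS thresholds, and partitions are kept contiguous in file order, as in Algorithm 1.

def equal_k_partition(counts, k, max_rounds=200, tolerance=1):
    # Split the access-count list into k contiguous parts whose sums are
    # as close as possible to the ideal average (cf. Algorithms 1 and 2).
    if k <= 1:
        return [list(counts)]
    if k >= len(counts):
        return [[c] for c in counts]

    bounds = [(i + 1) * len(counts) // k for i in range(k - 1)]
    ideal = sum(counts) / k

    def split(bounds):
        parts, start = [], 0
        for b in bounds + [len(counts)]:
            parts.append(counts[start:b])
            start = b
        return parts

    for _ in range(max_rounds):
        parts = split(bounds)
        # Find the part whose sum deviates most from the ideal (cf. Lines 21-25).
        diffs = [ideal - sum(p) for p in parts]
        worst = max(range(k), key=lambda i: abs(diffs[i]))
        if abs(diffs[worst]) <= tolerance:
            break
        # Adjust the worst part by moving one of its boundaries (cf. Algorithm 2).
        if worst == 0:
            bounds[0] += 1 if diffs[0] > 0 else -1
        elif worst == k - 1:
            bounds[-1] += -1 if diffs[-1] > 0 else 1
        else:
            left, right = worst - 1, worst
            if diffs[worst] < 0:    # too big: hand an element to a neighbour
                if sum(parts[worst - 1]) > sum(parts[worst + 1]):
                    bounds[right] -= 1
                else:
                    bounds[left] += 1
            else:                   # too small: take an element from a neighbour
                if sum(parts[worst - 1]) > sum(parts[worst + 1]):
                    bounds[left] -= 1
                else:
                    bounds[right] += 1
        # Keep boundaries inside the list so every slice stays valid.
        bounds = sorted(max(1, min(len(counts) - 1, b)) for b in bounds)
    return split(bounds)

# Example: 8 blocks with skewed access counts, split over 3 storage servers.
print(equal_k_partition([40, 3, 2, 1, 25, 2, 30, 1], 3))

Since each resulting part is a contiguous range of blocks, the k ranges can be handed directly to the k storage servers as the block groups discussed above.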
C. IMPLEMENTATION
We have implemented the proposed block data placement algorithm in PVFS2, which is a high-performance distributed file system designed for high-bandwidth parallel access to large data files [24]. The current physical distribution mechanism used by PVFS2 is a simple, round-robin striping scheme. The function of distributing data is described with three parameters:
- base: the index of the starting I/O node, with 0 being the first node in the file system.
- pcount: the number of I/O servers on which data will be stored.
- ssize: strip size, i.e. the size of the contiguous chunks stored on I/O servers.
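The following toy Python function is only meant to illustrate the semantics of these three parameters (it is not PVFS code): under round-robin striping, they fully determine which I/O server stores a given byte offset.

def server_for_offset(offset, base, pcount, ssize):
    # Contiguous strips of `ssize` bytes are laid out round-robin over
    # `pcount` I/O servers, starting at server index `base`.
    strip_index = offset // ssize
    return (base + strip_index) % pcount

# With base=0, pcount=4 and 64 KB strips, offsets 0, 64K, 128K, 192K and 256K
# land on servers 0, 1, 2, 3 and 0, respectively.
for off in (0, 65536, 131072, 196608, 262144):
    print(off, server_for_offset(off, base=0, pcount=4, ssize=65536))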
In fact, the physical block placement is determined once the file is first created. Thus, it is possible to specify the aforementioned three parameters when using the function pvfs_open() to create the file.

We have modified the implementation of pvfs_open(), to allow flexible block data distribution by a new request scheduler module with the given placement specifications.

As discussed, the algorithm of grouping data blocks can split the blocks into several groups, by referring to the history of access frequency to them. In this way, we are able to finally obtain a block distribution specification regarding the file accessed by the application. Figure 3 shows an example specification of block distribution, when placing 1021 data blocks onto 4 storage nodes in the case of running the target application. As seen, 256 blocks having in total 627 access requests are supposed to be mapped to NODE0, and the detailed information on these blocks is illustrated in the array of blocks.

FIGURE 3. A simplified example of distributing 1021 blocks of the file onto 4 storage nodes, generated by running the k-partition algorithm.
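Below is a hypothetical, simplified rendering of one entry of such a block-distribution specification, based on the NODE0 example above; the field names and the truncated block list are illustrative only.

# One entry per storage server; the placement specification guides block
# distribution when the file is created or opened (see below).
node0_spec = {
    "node": "NODE0",            # target storage server
    "block_num": 256,           # number of blocks assigned to this server
    "access_count": 627,        # total access requests observed for them
    "blocks": [0, 7, 13, 21],   # indices of the assigned blocks (truncated)
}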
Given that a file accessed by the application links against the modified PVFS distributed file system, the block distribution settings are loaded with the relevant application, and the specified settings will guide the process of block distribution when the application tries to open the file.

V. EXPERIMENTS AND EVALUATION
This section describes the experimental methodology for evaluating the file system with the newly proposed block placement policy, and then reports the experimental results. First, we describe the experimental setup, including the experimental platform and the selected benchmarks. Next,


the experimental results and relevant discussions are presented, to show the feasibility and applicability of our newly proposed mechanism. At last, we make a brief summary about the findings obtained from the evaluation experiments.

A. EXPERIMENTAL SETUP
1) PLATFORM SPECIFICATIONS
We employed one cluster and one local area network (LAN) to conduct the evaluation experiments. One active metadata server and five storage servers of the distributed file system are deployed on a 5-node cluster (the metadata server is deployed with a storage server on the same machine). All client file systems are installed on 12 machines of a LAN. Specifically, 12 client file systems are installed on the machines of the LAN, which is connected with the server cluster by a 1 GigE Ethernet. Table 1 presents the specifications of the machines on the server cluster as well as on the LAN. All clients are equipped with MPICH2-1.4.1.

TABLE 1. Machine specifications of storage servers and clients.

2) BENCHMARKS
Since one of the targeted contexts to use the proposed mechanism is the database-related application, we selected the following two widely used, representative database-related benchmarks for conducting evaluation experiments:
- TPC-C, which is a typical online transaction processing (OLTP) benchmark, issued by the Transaction Processing Performance Council (TPC) [21]. It consists of simple short-running transactions with frequent updates and less frequent index scans. Specifically, TPC-C reflects an OLTP database for an order-entry environment [23].
- TPC-E, which is the most recently standardized OLTP benchmark by TPC, as well. It includes transactions for real-time business intelligence combined with client-side requests [22]. Namely, TPC-E models a financial brokerage house, and acts like a typical OLTP and online analytic processing benchmark [23].

Table 2 summarizes the high-level specification comparison of the two mentioned database benchmarks. As shown, one remarkable distinction of TPC-E from TPC-C is that the majority of its transaction (mix) workloads are Read-Only (RO), whereas TPC-C has 92.0% Read-Write (RW) transaction (mix) workloads. In addition, there are different types and quantities of read/write requests targeting the tables in the benchmarks. For example, over 90% of the write operations are related to only 8 tables, among the 33 tables in TPC-E.

TABLE 2. Read/write workload analysis of TPC-C and TPC-E.

B. PARAMETER SETTINGS
We have evaluated disk-based query processing frameworks by running the selected TPC-C and TPC-E benchmarks, which are update-intensive and read-intensive OLTP/OLAP workloads respectively. We run these two benchmarks by employing MySQL (version 5.6.10) as the back-end database [20]. All 12 compute nodes perform the same job script, to execute the selected benchmarks. We set each client process to repeat transaction submissions 10 times, for yielding average experimental statistics.

To emulate varied levels of data sharing between client processes running on the compute nodes, we have adopted an evaluation method similar to the one introduced by Yan and Cheung [32]. In the evaluation experiments of running the two selected benchmarks, we changed the data set size for each transaction access, to yield varying levels of data access frequency.


C. RESULTS AND DISCUSSIONS
1) TPC-C EVALUATION
For the TPC-C benchmark, we adjusted the data size by setting the number of warehouses from 1 to 32, and the total size of the database is around 3 GB. In order to emulate a large amount of transactions, we make each client issue requests repeatedly.

Figure 4 shows the performance throughput of running TPC-C with a varied number of warehouses, when using the round-robin policy and the proposed k-partition policy for placing data blocks. Specifically, Figures 4(a) and 4(b) show the results of TPC-C having either payment transactions or new order transactions, respectively. Figure 4(c) reports the results of TPC-C with 50% payment transactions and 50% new order transactions. Figure 4(d) describes the results of TPC-C with a standard mix of five types of transactions, according to the specification. As seen, the newly proposed k-partition data placement policy can yield the best system performance in the metric of transactions per second. In other words, k-partition can improve transaction performance by up to 21.2%.

FIGURE 4. Transaction throughput of TPC-C with varied number of warehouses. (a) TPC-C payment transaction. (b) TPC-C new order transaction. (c) TPC-C mix (50% payment, 50% new order). (d) TPC-C mix of all transactions.

We have also explored the completion time of running the TPC-C benchmark, which consists of the time required for computation and the time used for I/O processing. Figure 5 shows the breakdown analysis of the execution time of TPC-C. As we expected, the scheme of k-partition can reduce the time caused by allocating the data blocks by more than 12.7% on average. The reduced completion time results from the reduced I/O time when running the benchmark, because the computing time is the same when employing the two selected data placement schemes.

FIGURE 5. Normalized time of TPC-C with varied number of warehouses. (a) TPC-C payment transaction. (b) TPC-C new order transaction. (c) TPC-C mix (50% payment, 50% new order). (d) TPC-C mix of all transactions.

2) TPC-E EVALUATION
For the TPC-E benchmark, we altered the number of trades in each transaction access. Because the trade update transaction is employed to show the modifications to a set of trades, adjusting the number of trades can correspondingly change the amount of data in the transaction access. Specifically, we changed the number of trades from 1K to 512K (the total number of trades of the entire trade table is 576K).

Figure 6 shows the results of transaction throughput when running TPC-E. Similar to the results of running TPC-C that were illustrated in Section V-C.1, the newly proposed k-partition based data placement policy outperforms the round-robin scheme. That is to say, our proposal can yield better transaction performance when running the selected benchmarks.

FIGURE 6. Transaction throughput of TPC-E with varied number of trades. (a) TPC-E trade order transaction. (b) TPC-E trade update transaction. (c) TPC-E mix (50% order, 50% results). (d) TPC-E mix of all transactions.


Besides, the normalized time required for running the benchmark is recorded, and the relevant data are reported in Figure 7. As shown, the proposed scheme of k-partition does result in less time needed for processing I/O operations, so that the time required for completing the benchmark is consequently cut down.

FIGURE 7. Normalized response time of TPC-E with varied number of trades. (a) TPC-E trade order transaction. (b) TPC-E trade update transaction. (c) TPC-E mix (50% order, 50% results). (d) TPC-E mix of all transactions.

3) ACCESS LOAD BALANCE
The main motivation of access frequency-based data placement is to yield uniform I/O processing workloads over the storage servers in the distributed file system. Therefore, we have measured the data throughput of the storage servers when running the benchmarks, and Figure 8 presents the relevant results.

FIGURE 8. Data throughput of storage servers in the file system when running the benchmarks of TPC-C and TPC-E. (a) TPC-C benchmark. (b) TPC-E benchmark.

In contrast to the default data distribution policy, i.e. round-robin in the PVFS file system, the newly proposed scheme noticeably helps to balance data throughput over the storage servers of the distributed file system. Consequently, not only can the I/O processing time be decreased, but the utilization ratio of the storage servers can also be greatly enhanced.

D. SUMMARY
We have demonstrated that the proposed data placement policy can effectively and practically balance access workloads, and then reduce the I/O time for the target benchmarks. Since we carry out the block partitioning offline, the time overhead of partitioning the data blocks according to their access counts is not presented. This is because the offline analysis does not affect the performance of the file system when running the application.

With respect to comparing the proposed block data placement mechanism with the conventionally used round-robin placement policy, we emphasize the following two key observations. First, the proposed data placement policy can yield not only an even data distribution, but also uniform data access workloads on the storage servers, especially in the case that applications have different levels of access frequency to the data blocks during their life cycles. Second, the longest I/O time, caused by the storage server having the most access workloads, is basically close to the shortest I/O time, introduced by the server having the least access workloads. We then conclude that the proposed data placement policy can observably decrease the time overhead required for completing I/O requests when running big data applications, such as online transaction processing applications.

VI. CONCLUDING REMARKS
This paper proposes a novel block placement policy, which can ensure balanced access workloads with respect to the storage servers in distributed/parallel file systems. To this end, we first analyze the history of the block access sequence, for calculating the access frequency to the blocks on the storage servers. Then, the k-partition algorithm has been introduced to classify blocks into k groups, which are supposed to be correspondingly distributed to k storage servers. At last, the storage servers are able to process read/write access requests evenly, and there should not be any storage server taking a particularly long time to complete the I/O requests of the application.

The experiments on database-relevant benchmarks show that proper distribution of data blocks onto storage servers can definitely decrease the I/O time required by applications, as well as balance the I/O workloads on the storage nodes. In brief, the proposed access frequency-based data placement policy is well suited to the cases where the application is executed over multiple cycles. Therefore, it is possible to direct data placement on the basis of the block access history in the first cycle(s).

In future work, we plan to further explore dynamic data block movement among storage servers, by referring to the access frequency to data blocks. Also, improving algorithm scalability is another direction of our future work.

REFERENCES
[1] J. Liao, L. Li, H. Chen, and X. Liu, "Adaptive replica synchronization for distributed file systems," IEEE Syst. J., vol. 9, no. 3, pp. 865–877, Sep. 2015.


[2] J. H. Hartman and J. Ousterhout, "The Zebra striped network file system," ACM Trans. Comput. Syst., vol. 13, no. 3, pp. 274–310, 1995.
[3] L. Gu, D. Zeng, P. Li, and S. Guo, "Cost minimization for big data processing in geo-distributed data centers," IEEE Trans. Emerg. Topics Comput., vol. 2, no. 3, pp. 314–323, Sep. 2014.
[4] A. D. Ferguson and R. Fonseca, "Understanding filesystem imbalance in Hadoop," in Proc. USENIX Annu. Technol. Conf., 2010.
[5] L. Wang, Y. Ma, A. Y. Zomaya, R. Ranjan, and D. Chen, "A parallel file system with application-aware data layout policies for massive remote sensing image processing in digital Earth," IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 6, pp. 1497–1508, Jun. 2015.
[6] H. Song, Y. Yin, Y. Chen, and X.-H. Sun, "A cost-intelligent application-specific data layout scheme for parallel file systems," in Proc. ACM Int. Symp. High Perform. Distrib. Comput. (HPDC), 2011, pp. 37–48.
[7] H. Song, H. Jin, J. He, X.-H. Sun, and R. Thakur, "A server-level adaptive data layout strategy for parallel file systems," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops PhD Forum, May 2012, pp. 2095–2103.
[8] V. Pai et al., "Locality-aware request distribution in cluster-based network servers," ACM SIGOPS Oper. Syst. Rev., vol. 32, no. 11, pp. 205–216, 1998.
[9] S. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn, "CRUSH: Controlled, scalable, decentralized placement of replicated data," in Proc. ACM/IEEE Conf. Supercomput. (SC), Nov. 2006, p. 122.
[10] M. Bhadkamkar et al., "BORG: Block-reorganization for self-optimizing storage systems," in Proc. FAST, 2009, pp. 183–196.
[11] D. Yuan, Y. Yang, X. Liu, and J. Chen, "A data placement strategy in scientific cloud workflows," Future Gener. Comput. Syst., vol. 26, no. 8, pp. 1200–1214, 2010.
[12] J. Xie et al., "Improving MapReduce performance through data placement in heterogeneous Hadoop clusters," in Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops PhD Forum, Apr. 2010, pp. 1–9.
[13] S. A. Yazd, S. Venkatesan, and N. Mittal, "Boosting energy efficiency with mirrored data block replication policy and energy scheduler," ACM SIGOPS Oper. Syst. Rev., vol. 47, no. 2, pp. 33–40, 2013.
[14] J. Wang et al., "Deister: A light-weight autonomous block management in data-intensive file systems using deterministic declustering distribution," J. Parallel Distrib. Comput., vol. 108, pp. 3–13, Oct. 2017.
[15] D. Borthakur, "The Hadoop distributed file system: Architecture and design," Apache Softw. Found., 2007. [Online]. Available: https://cs.uwec.edu/~buipj/teaching/cs.491.s13/static/pdf/03_hadoop_distributed_file_system_architecture_and_design.pdf
[16] M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, and J. McPherson, "CoHadoop: Flexible data placement and its exploitation in Hadoop," Proc. VLDB Endowment, vol. 4, no. 9, pp. 575–585, 2011.
[17] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA, USA: Freeman, 1979.
[18] J. L. Wolf, S. Y. Philip, and H. Shachnai, "Disk load balancing for video-on-demand systems," Multimedia Syst., vol. 5, no. 6, pp. 358–370, 1997.
[19] S. Agarwal, J. Dunagan, N. Jain, and S. Saroiu, "Volley: Automated data placement for geo-distributed cloud services," in Proc. 7th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2010, pp. 17–32.
[20] MySQL Database. Accessed: Jul. 2014. [Online]. Available: http://dev.mysql.com/downloads/
[21] Transaction Processing Performance Council. TPC-C Benchmark Revision 5.11.0. Accessed: Feb. 2016. [Online]. Available: http://www.tpc.org/tpcc/
[22] Transaction Processing Performance Council. TPC-E Benchmark Version 1.14.0. Accessed: Feb. 2016. [Online]. Available: http://www.tpc.org/tpce/
[23] C. Curino, E. Jones, Y. Zhang, and S. Madden, "Schism: A workload-driven approach to database replication and partitioning," Proc. VLDB Endowment, vol. 3, nos. 1–2, pp. 48–57, 2010.
[24] R. B. Ross and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proc. Linux Showcase Conf., 2000, p. 28.
[25] H. Jin et al., "Joint host-network optimization for energy-efficient data center networking," in Proc. 11th Int. Symp. Parallel Distrib. Process. (ISPA), May 2013, pp. 623–634.
[26] H. Herodotou et al., "Starfish: A self-tuning system for big data analytics," in Proc. CIDR, 2011, pp. 261–272.
[27] D. Huang et al., "Achieving load balance for parallel data access on distributed file systems," IEEE Trans. Comput., vol. 67, no. 3, pp. 388–402, Mar. 2018.
[28] K. Andreev and H. Racke, "Balanced graph partitioning," Theory Comput. Syst., vol. 39, no. 6, pp. 929–939, 2006.
[29] C. E. Ferreira, A. Martin, C. C. de Souza, R. Weismantel, and L. A. Wolsey, "The node capacitated graph partitioning problem: A computational study," Math. Program., vol. 81, no. 2, pp. 229–256, 1998.
[30] F. Isaila and W. F. Tichy, "Clusterfile: A flexible physical layout parallel file system," in Proc. IEEE Int. Conf. Cluster Comput. (Cluster), Oct. 2001, pp. 37–44.
[31] F. Wang et al., "File system workload analysis for large scale scientific computing applications," in Proc. 21st IEEE/12th NASA Goddard Conf. Mass Storage Syst. Technol. (MSST), Jan. 2004, pp. 139–152.
[32] C. Yan and A. Cheung, "Leveraging lock contention to improve OLTP application performance," Proc. VLDB Endowment, vol. 9, no. 5, pp. 444–455, 2016.
[33] B. Yu and J. Pan, "Location-aware associated data placement for geo-distributed data-intensive applications," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), Apr. 2015, pp. 603–611.
[34] H. C. Hsiao, H. Y. Chung, H. Shen, and Y. C. Chao, "Load rebalancing for distributed file systems in clouds," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 5, pp. 951–962, May 2013.
[35] C.-Y. Lin and Y.-C. Lin, "An overall approach to achieve load balancing for Hadoop distributed file system," Int. J. Web Grid Services, vol. 13, no. 4, pp. 448–466, 2017.

JIANWEI LIAO received the Ph.D. degree in computer science from The University of Tokyo, Japan, in 2012. He is currently with the College of Computer and Information Science, Southwest University, China. His research interest is high performance storage systems for distributed computing environments.

ZHIGANG CAI received the M.S. degree from Chongqing University in 2005 and the Ph.D. degree in economics from the Renmin University of China in 2014. He joined the College of Computer and Information Science, Southwest University, China, in 2005. His research interests are in probability theory, random processes, and probability analysis.

FRANCOIS TRAHAY received the M.S. and Ph.D. degrees in computer science from the University of Bordeaux, France, in 2006 and 2009, respectively. He was a Post-Doctoral Researcher with the Ishikawa Laboratory, RIKEN Institute, The University of Tokyo, in 2010. He is currently an Associate Professor with Telecom SudParis, France. His research interests include runtime systems for high-performance computing and performance analysis.

XIAONING PENG was a Visiting Scholar with the School of Computer Science, Birmingham University, Birmingham, from 2010 to 2011. He is currently a Full Professor with the School of Computer Science and Engineering, Huaihua University. His research interests include database systems and system software.
