
J. Parallel Distrib. Comput. 72 (2012) 1254–1268

Contents lists available at SciVerse ScienceDirect

Journal homepage: www.elsevier.com/locate/jpdc

A dynamic and adaptive load balancing strategy for parallel file system with
large-scale I/O servers
Bin Dong, Xiuqiao Li, Qimeng Wu, Limin Xiao, Li Ruan
State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Article info

Article history: Received 1 July 2011; Received in revised form 2 May 2012; Accepted 14 May 2012; Available online 23 May 2012.

Keywords: Distributed load balancing; Parallel file systems; On-line load prediction; Load collection; Dynamic file migration; Adaptive algorithm.

Abstract

Many solutions have been proposed to tackle the load imbalance issue of parallel file systems. However, all these solutions either adopt centralized algorithms, or lack consideration of both the network transmission and the tradeoff between the benefits and side-effects of each dynamic file migration. Therefore, existing solutions will be prohibitively inefficient in large-scale parallel file systems. To address this problem, this paper presents SALB, a dynamic and adaptive load balancing algorithm which is totally based on a distributed architecture. To also be aware of the network transmission, SALB on the one hand adopts an adaptively adjusted load collection threshold in order to reduce the message exchanges for load collection, and on the other hand it employs an on-line load prediction model with a view to reducing the decision delay caused by the network transmission latency. Moreover, SALB employs an optimization model for selecting the migration candidates so as to balance the benefits and the side-effects of each dynamic file migration. Extensive experiments are conducted to prove the effectiveness of SALB. The results show that SALB achieves an optimal performance not only on the mean response time but also on the resource utilization among the schemes for comparison. The simulation results also indicate that SALB is able to deliver high scalability.

© 2012 Elsevier Inc. All rights reserved.

* Corresponding authors. E-mail addresses: Bdong@cse.buaa.edu.cn (B. Dong), xiaolm@buaa.edu.cn (L. Xiao).

1. Introduction

The disparity between the rate at which scientific applications can calculate results and the rate at which the applications can store their data onto persistent storage (i.e., hard disks) is an unavoidable issue for high-end computer systems [32]. As an attractive solution, the parallel I/O system allows data to be concurrently transferred between the memory and the persistent storage device. The parallel file system, one of the important components of a parallel I/O system, is responsible for striping data onto I/O servers and then permits these data to be accessed concurrently. Hence, parallel file systems play an important role in data management and have received a lot of attention in the recent past [25,7,13].

In order to fully reap the performance of parallel I/O systems, the load among the I/O servers situated in parallel file systems should be distributed uniformly [33]. Evenly distributed load across the I/O servers can eliminate performance bottlenecks, thereby optimizing the mean response time and the resource utilization. However, some factors such as unsuitable file striping sizes [45], improper file allocations [56,60], application competitions [27,6] and heterogeneous computing environments [54] may lead to load imbalance among the I/O servers. Consequently, the variance of the response time of parallel file systems is enlarged and therefore the whole parallel I/O system is underutilized [38].

Many solutions [38,46,31,35] have been proposed to tackle the load imbalance issue of the I/O servers. However, with the growth of the parallel I/O system scale, parallel file systems are now facing significant challenges caused by the management of large-scale I/O servers. As a result, the load balancing algorithm for parallel file systems needs to deal with the following three new challenges.

The first challenge for the load balancing algorithm is how to provide the scalability and the availability required by the steadily growing parallel I/O system. High-performance computing is in the era of the petabyte scale and is expected to reach the exabyte scale in the 2018–2020 time frame [22]. Accordingly, it is estimated that the I/O requirement of scientific applications will increase from 0.2 TB/s to 20 TB/s [29]. Since the parallel I/O system provides a high-speed data transfer rate by aggregating individual device performance, in order to match such fast data transfer rates, the scale of the parallel I/O system must be enlarged [1]. Highly scalable and highly available software is the enabling technology that allows such systems to fully deliver their performance. However, most existing load balancing solutions [38,46,31] employ centralized algorithms, the scalability of which is limited by fixed memory size, CPU power, and network

bandwidth [35]. Moreover, if the central server crashes, the whole load balancing algorithm may be down. A scalable and available load balancing method is required for parallel file systems, which may maintain hundreds or even thousands of I/O servers.

The second challenge for the load balancing algorithm is how to take the network transmission into account. On the one hand, the message exchanges for the load collection should be considered by the load balancing algorithm. One reason is that the load balancing decision should be made based on the load condition of the whole system, but frequently sending/receiving messages to collect the load may degrade the performance of the whole system [4]. On the other hand, since the network transmission latency grows as the distance among the I/O servers increases [47], keeping the load information collected for the load balancing decision up-to-date is very important to reduce decision delays. However, in most existing load balancing solutions [38,46,31,35], which are based on a central decision-maker or group decision-makers, sponsoring a load balancing action needs at least two network transmissions: one for load collection and the other for issuing the load balancing command. In such a case, the collected load information may become obsolete. For this reason, the sponsored load balancing actions tend to lag behind the realistic workload conditions. Network transmission needs to be considered to achieve an effective load balancing algorithm for parallel file systems.

The third challenge for the load balancing algorithm is how to effectively realize its load migration. In contrast to the dynamic file reallocation, which may lead to system-wide interruption of service, dynamic file migration can transfer the load between I/O servers on-line [28]. Hence, dynamic file migration is an effective way to implement load migration. Using a proper dynamic file migration, benefits such as a reduced mean response time and improved utilization can be obtained. However, in order to ensure the consistency of the file under migration, the file requests targeted on it must be properly handled (e.g. delayed, rejected) [31]. Such side-effects of dynamic file migration may degrade the performance of the whole system. The benefits and the side-effects of each dynamic file migration need to be balanced to achieve an effective dynamic load migration.

To address the challenges above, inspired by peer-to-peer computing [36] and autonomic computing [17], we propose a dynamic and adaptive load balancing algorithm named self-acting load balancing (SALB) for parallel file systems. SALB runs on each I/O server and the instances cooperate with each other to keep the load across the I/O servers balanced. Specifically, there are three key characteristics of SALB:

1. SALB is totally based on a distributed load balancing decision-maker without a central infrastructure. Moreover, SALB running on each I/O server would automatically construct its on-line load prediction model, adjust its load collection threshold, gather the load of other I/O servers, make decisions, choose migration candidates, and sponsor dynamic file migrations by itself. Therefore, SALB is able to deliver both the scalability and the availability required by the steadily growing parallel I/O system.

2. SALB is aware of the network transmission. As SALB, running on each I/O server, makes its own decision, the network transmission to issue the load balancing command is avoided. Hence, sponsoring a load balancing action in SALB only needs one network transmission, which is used for load collection. Moreover, since the on-line load prediction model of SALB can automatically estimate the future load of an I/O server, other I/O servers can collect the forecast load to make decisions. In such a case, the decision delay caused by the network transmission latency can be further reduced. Furthermore, in order to reduce message exchanges for load collection, SALB employs a load collection threshold which can also be dynamically adjusted.

3. SALB adopts dynamic file migration to realize its load migration. By this method, SALB can work without interrupting the service of the whole system. Moreover, an optimization model for selecting migration candidates is incorporated into SALB so as to balance the benefits and the side-effects of each dynamic file migration.

To prove the effectiveness of SALB, comprehensive experiments were conducted. The on-line load prediction model of SALB was evaluated with BTIO [37], MADbench2 [9] and FLASH I/O [49], all of which are I/O kernels of scientific applications. The results show that the on-line load prediction model can fit the load series of I/O servers with a small mean square prediction error. Then, SALB was compared against traditional load balancing algorithms under a synthetic workload. The results show that SALB consistently delivers an optimal performance on the mean response time and resource utilization among the existing schemes for comparison. Finally, the configurations of SALB were discussed with a popular parallel I/O benchmark, IOR [42], and the scalability of SALB was evaluated in the context of large-scale I/O servers which are built in the widely accepted High-End Computing I/O Simulator (HECIOS) [41].

The rest of this paper is organized as follows. We survey the related work in Section 2. In Section 3, we outline the architecture of SALB and describe its components in Sections 4 to 8. This is followed in Section 9 by SALB's pseudocode. In Section 10, we describe the experiments evaluating SALB. Finally, Section 11 concludes this paper and discusses future research directions.

2. Related work and motivation

Evenly distributed load across the I/O servers situated in parallel file systems can help optimize the mean response time and resource utilization. However, some factors such as unsuitable file striping sizes [45], improper file allocations [56,60], application competitions [27,6] and heterogeneous computing environments [54] can lead to load imbalance among the I/O servers. The literature proposed for solving the load imbalance problem can be put into two categories: static and dynamic.

Typically, static load balancing algorithms assign the files onto the available I/O servers before they are accessed [33]. If the load of each file can be known beforehand, such a method is simple and effective. However, the load of most scientific applications varies over time and is hard to know in advance [18,43,44]. Actually, without the load distribution known beforehand, static load balancing is like the static file assignment problem, which has been proved to be NP-hard [24]. For this reason, many heuristic static load balancing algorithms [56,60] have been proposed. Typically, the existing heuristic static algorithms not only aim at non-partitioned files but also heavily depend on file statistics such as the file access rate. Therefore, their application may be limited in a dynamic or production environment where files are dynamically partitioned and allocated in most cases. In most mainstream parallel file systems such as PVFS [14], Lustre [11], and GPFS [39], a file is always partitioned into equal-sized chunks which are then distributed onto the available I/O servers in a round-robin fashion. In such a case, some factors such as unsuitable file striping sizes [45], improper file allocations [56,60], application competitions [27,6] and heterogeneous computing environments [54] may turn some I/O servers into hot spots, thereby degrading the performance of the whole system [54,27,34].

Dynamic load balancing algorithms, on the other hand, can keep the load among I/O servers balanced on-line to adapt to varying workloads without their distribution being known in advance [38]. In order to understand the existing dynamic load balancing algorithms

in depth, we believe that classifying the dynamic load balancing algorithms by the distribution of decision-makers is more helpful. According to the distribution of load balancing decision-makers, we put the dynamic load balancing algorithms into the following three categories.

Fig. 1. Central, group and distributed decision-makers.

2.1. Central decision-maker based load balancing algorithm

The central decision-maker based load balancing algorithms run on a management node of a parallel file system. The popular central decision-maker based load balancing algorithms include disk cooling by Scheuermann et al. [38] and the user-space load balancer by Kunkel [31]. Typically, as Fig. 1 shows, the central decision-maker collects the load information of the other I/O servers first and then determines whether a load balancing action should be launched or not. In such a case, a global optimization of the load distribution can be pursued. However, in a large-scale parallel file system, the central decision-maker has the following two limitations. (1) The central decision-maker may have poor performance and be unreliable in a large-scale parallel system [35,40]. As a central decision-maker may run on a single node, its capability is limited by the fixed CPU power and network bandwidth [35]. Moreover, if the central node crashes, the whole load balancing algorithm may be down. (2) The load balancing decision delay of the central decision-maker is obvious. As Fig. 1 indicates, each load balancing action needs at least two network transmissions: one for load collection and the other for issuing the load balancing command. In such a case, the load balancing actions sponsored by the central decision-maker tend to lag behind the real workload conditions.

2.2. Group decision-maker based load balancing algorithm

The group decision-maker based load balancing algorithms divide the whole system into groups and then distribute the decision-makers among the groups, as illustrated in Fig. 1. The well-known group decision-maker based load balancing algorithm is proposed by Liu et al. in [35]. Due to the decreased communication cost of a load balancing decision, such a load balancing algorithm has a small impact on the whole system performance, thereby yielding better performance. However, since a group decision-maker makes a decision without considering the whole system load condition, it may achieve only a local optimization for load balancing. For this reason, pursuing a global optimization may need special mechanisms to balance the load among groups. Moreover, sponsoring a load balancing action by a group decision-maker still needs at least two network transmissions: one for load collection and the other for issuing the load balancing command. Hence, the load balancing decision delay of the group decision-maker may still be obvious.

2.3. Distributed decision-maker based load balancing algorithm

As Fig. 1 indicates, distributed decision-maker based load balancing algorithms permit each I/O server to make its own decision. There are two advantages of the distributed load balancing decision-maker. (1) The distributed decision-maker is able to provide both the scalability and the availability required by a large-scale I/O system [57]. (2) The load balancing decision delay is small in the distributed decision-maker. This is because each load balancing action sponsored by a distributed decision-maker only needs one network transmission, which is used to gather the load of the other I/O servers.

On the other hand, as each distributed decision-maker may need to collect the load of all I/O servers to make a decision, the message exchanges for load collection may degrade the performance of the whole system. To address this issue, the method presented in this work adopts an adaptive load collection threshold to prevent frequent load collection. Moreover, an on-line load prediction model is employed in our approach to forecast the load of the I/O server. In such a case, the decision delay is expected to be further reduced. The distributed decision-maker based load balancing algorithm has been successfully and widely applied in different fields: business strategy study [26], grid computing [57], cloud computing [59], and so on. To the best of our knowledge, this is the first work that approaches the load balancing problem of parallel file systems with a distributed decision-maker based load balancing algorithm.

Fig. 2. Architecture of typical parallel file systems and the flowchart of SALB.

3. System overview

Typically, a parallel file system like PVFS [14] consists of three main components: the I/O server (IOS), the metadata server (MDS), and the client. As Fig. 2 indicates, these three components are connected by a network. The client runs on compute nodes and provides a file system interface. The MDS stores metadata such as directory and layout information of files. The IOS holds the actual data of the files. SALB, presented in this article, runs on each IOS and the instances cooperate with each other to keep the load among all IOSes balanced. Such a distributed architecture of SALB ensures its scalability and availability. Fig. 2 also illustrates SALB's five major components, along with their interactions in keeping the load among I/O servers balanced. We discuss each component from the view of a single IOS.

1. An on-line load prediction algorithm, which is used to estimate the future load of this IOS. The forecast load can be collected by the other IOSes for their load balancing decisions. Hence, the load balancing decision delay caused by the network transmission latency can be reduced.

2. An efficient load collection mechanism. Setting a load threshold can help reduce the load collection message exchanges which may degrade the whole system performance. When the forecast load of this IOS is larger than the load collection threshold, this IOS considers itself heavily loaded and needs to gather the forecast load of all other IOSes for a load balancing decision.

3. A robust and effective distributed load balancing decision-maker. Based on the load condition of both this IOS itself and the whole system, the distributed load balancing decision-maker is responsible for deciding whether to trigger a dynamic file migration or not.

4. A migration candidate selection model. When this IOS needs to transfer its load, the migration candidates, including the subfile to be migrated and the target IOS accepting it, are chosen through this model to balance the benefit and side-effect of the dynamic file migration.

5. A dynamic file migration algorithm. Dynamic file migration is used to transport the selected subfile to the target IOS without interrupting system service. Also, a mutually exclusive strategy is required to ensure the consistency of the subfile under migration.

In the ensuing sections, we explain these five components in greater detail.

4. On-line load prediction model construction

4.1. Analysis of the load series of an I/O server

Note that the load in this study is measured with the throughput of an IOS. Even though most previous studies [58,20,55] have analyzed the load series of computing nodes, statistical analysis of the load series of an IOS is rare. Hence, before constructing the on-line load forecast model, the statistical characteristics of the load series of an IOS should be investigated.

To achieve this target, the load series of MADbench2 [8] is investigated in this section. MADbench2 is derived directly from a scientific application in the field of Cosmic Microwave Background data analysis. The configurations of MADbench2 are presented in Table 3 of the evaluation section. The load series sampled at the interval 2 s and the interval 4 s are shown in Fig. 3(a) and Fig. 3(d) respectively. It is interesting to see that the load series sampled at different time intervals exhibit similar characteristics. Both of them can be divided into three stages: (1) the first stage of MADbench2 is dominated by writes, and the load series has the highest value and an upward trend; (2) the load at the second stage decreases sharply and shows considerable deviation because reads and writes are mixed; (3) MADbench2 reads data from disk at the third stage and the load at this stage fluctuates until it reaches zero. In summary, the raw load series sampled directly from an I/O server shows instability and variability.

Time series models (e.g., AR, MA, ARMA and ARIMA) and their corresponding analysis methods [10] are attractive approaches to fit complex I/O load series such as those of MADbench2. Among these models, ARIMA can fit non-stationary series, but it is unacceptable for on-line model construction because the time to build it on-line is not determined [21]. In the ARIMA model, the idea behind eliminating the trend component of an unstable series is to difference the series with a certain lag. The MADbench2 load series differenced by lag one and lag two are shown in Fig. 3(b) and Fig. 3(e) respectively. From the figures, we can see that the once-differenced load series has become stable and the twice-differenced series shows no obvious improvement. Actually, most non-stationary series can be transformed into stationary ones through once-differencing [50].

In terms of the stationary load series, the model selection among AR, MA and ARMA can be determined by analyzing the ACF (autocorrelation function) and PACF (partial autocorrelation function) [10] of the load series. One reason is that, according to Table 1, the AR, MA and ARMA models have different ACF and PACF patterns. The ACF and PACF of the once-differenced load series are presented in Fig. 3(c) and Fig. 3(f) respectively. We can see that the ACF shows exponential decay and the PACF cuts off after lag one.

Table 1
ACF and PACF of AR, MA, and ARMA.

Model       ACF                          PACF
AR(p)       Exponential decay            Cut off to zero after lag p
MA(q)       Cut off to zero after lag q  Exponential decay
ARMA(p, q)  Exponential decay            Exponential decay

So, the load series of MADbench2 with the trend eliminated can be modeled with an AR time series model. Actually, the AR model is desirable for an on-line prediction model because the time to build it is determined.

In summary, the original load series of an I/O server may show high instability and variability, but it can be transformed into a stationary one through once-differencing. Moreover, the once-differenced load series of an I/O server has the properties of the AR time series model.

4.2. On-line load prediction model construction algorithm

A time series model is usually built manually and off-line [16]. In this section, we discuss how to build the load prediction model on-line. According to the discussion in the previous section, the construction algorithm of the AR model can be extended to build the on-line load prediction model. The on-line load prediction model construction algorithm, named FcstLoad, is illustrated in Fig. 4. FcstLoad takes the load series L = {Lt, Lt-1, ..., Lt-p} of an IOS as input and returns the forecast load of this IOS.

Through the first loop (steps 1–3), FcstLoad eliminates the trend of the load series L. The resulting stationary load series is stored in L'. Because the mean of the differenced series L' may not be equal to zero, L' is zero-mean normalized in step 4 and the result is stored in L''. According to the work [10], the maximum order of the fitted AR time series model is set to (p + 1)/2 in the 5th step. From steps 6 to 8, the autocorrelations of L'' are computed and stored in the array R. Based on the autocorrelation array R, FcstLoad identifies the order and the coefficients of an AR time series model from steps 9 to 14. Specifically, the Levinson–Durbin (LD) algorithm [10], which has O(1) complexity per order, is employed to compute the coefficients of each order, and the optimal order p' is selected with the Akaike Information Criterion [2]. In step 15, least-squares regression is applied to compute accurate coefficients of AR(p'). Then, FcstLoad computes the one-step ahead forecast of the stationary series L'' in step 16, adds it to the load Lt in step 17, and finally returns the forecast load.

5. Load collection

In order to make correct load balancing decisions, SALB running on an IOS needs to collect as much load information of the whole system as possible. However, frequently sending/receiving messages may degrade the performance of the whole system. In order to reduce such message exchanges, a load collection threshold (denoted by LCt) is employed in SALB. In order to explain the rationale for setting this threshold, we explore the impact of the load of an I/O server on its response time through experiments. The test results are plotted in Fig. 5, where the horizontal axis is the load of the I/O server and the vertical axis is the response time. Note that we denote the load as a percentage of the maximum load (Lmax). For instance, the 40 on the first sub-figure's X axis means 40% of 10.84 MiB/s. The real values are denoted with small circles. The smooth line in each figure is fitted with a polynomial of order three.

As the test results indicate, we conducted experiments with respect to different file request sizes. One reason is that we

Fig. 3. Analysis of the load series of MADbench2: (a) sample rate 2 s; (b) once-differenced; (c) ACF; (d) sample rate 4 s; (e) twice-differenced; (f) PACF.

Fig. 4. FcstLoad(): The on-line load forecast algorithm.
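As a rough illustration of the procedure that Fig. 4 describes, the Python sketch below differences the load series once, fits AR coefficients with a Levinson–Durbin recursion, picks the order by AIC, and produces a one-step forecast. It is a hedged reconstruction, not the authors' implementation; the sample series and helper names are made up, and the least-squares refinement step of FcstLoad is only noted in a comment.

```python
import numpy as np

def fcst_load(load_series):
    """One-step load forecast in the spirit of FcstLoad (Fig. 4):
    difference once, fit an AR model with Levinson-Durbin + AIC,
    then undo the differencing.  Illustrative only."""
    L = np.asarray(load_series, dtype=float)
    d = np.diff(L)                      # steps 1-3: remove the trend
    mu = d.mean()
    z = d - mu                          # step 4: zero-mean normalize
    n = len(z)
    max_p = max(1, (len(L) + 1) // 2)   # step 5: cap the AR order
    r = np.array([np.dot(z[:n - k], z[k:]) / n for k in range(max_p + 1)])

    # steps 6-14: Levinson-Durbin recursion, keep the AIC-best order
    best = (np.inf, np.array([]))       # (AIC, coefficients)
    phi_prev, v = np.array([]), r[0]
    for k in range(1, max_p + 1):
        refl = (r[k] - np.dot(phi_prev, r[1:k][::-1])) / v
        phi = np.append(phi_prev - refl * phi_prev[::-1], refl)
        v *= (1.0 - refl ** 2)
        aic = n * np.log(max(v, 1e-12)) + 2 * k
        if aic < best[0]:
            best = (aic, phi)
        phi_prev = phi

    phi = best[1]                       # (a least-squares refit could follow)
    # steps 16-17: forecast the differenced series, then add back L[-1]
    z_hat = np.dot(phi, z[-len(phi):][::-1]) + mu
    return L[-1] + z_hat

# Hypothetical raw load samples (MiB/s) from one IOS
print(fcst_load([5.1, 6.0, 7.2, 8.9, 9.5, 8.1, 6.4, 5.9, 5.2, 4.8]))
```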

found that the file request sizes have a significant impact on the maximum load (Lmax). Nevertheless, from all these figures, we can see that the relationship between the response time and the load of an I/O server has the same shape as that of Fig. 6. When the load is less than a certain value (denoted by LC't in the figure), the impact of load changes on the response time is small. In such a case, we believe that there is no need to sponsor load collections at this IOS. When the load is larger than LC't, a small increase in the load of the I/O server will result in a dramatic growth of the response time. Under such conditions, we think this IOS needs to trigger load collection and consider load balancing. Hence, setting LCt equal to LC't can help reduce message exchanges for load collection. Fig. 5 also indicates that LC't for different file request sizes may be different, and thus we can choose the smallest one as LCt.

However, when the whole system with n IOSes is overloaded (which means that the load of each I/O server is greater than LC't), setting LCt equal to LC't will still result in at most n^2 message exchanges. In order to address this issue, an adaptive load collection threshold adjusting algorithm named AdjustLCT is proposed for SALB. In such a case, the message exchanges are expected to be n^2/2 in the worst case. AdjustLCT is presented in Fig. 7. The idea behind AdjustLCT is to replace LCt with the average load of all I/O servers when the whole system is overloaded, as shown in steps 4 and 5.

With the number of IOSes increasing, the load collection over all IOSes can be time-consuming. In order to reduce the time spent in load collection, a load collection function named CollLoad is also implemented in a parallel fashion. Specifically, when an IOS needs to gather the load of the other IOSes, it first sends the requests to all I/O servers concurrently and then receives the load in batch fashion. In SALB, one IOS can respond to the load collection with different values:

Fig. 5. Relationship between the response time and the load of one I/O server: (a) request size 64 KB, Lmax = 10.84 MiB/s; (b) 128 KB, 18.85 MiB/s; (c) 256 KB, 28.71 MiB/s; (d) 512 KB, 33.38 MiB/s; (e) 1024 KB, 35.20 MiB/s; (f) 2048 KB, 34.29 MiB/s.

• When one IOS answers the load collection with its forecast load (which is always larger than or equal to zero), this IOS can be selected as a dynamic file migration target.

• When one IOS responds to the load collection with a minus value, it means that this I/O server is either taking part in a dynamic file migration or its load is greater than the load collection threshold. In such a case, this IOS should not be selected by the other IOSes as a dynamic file migration target.

Fig. 6. The relationship between the load and the response time of an I/O server.

Fig. 7. AdjustLCT(): The LCt adaptive adjustment algorithm.

6. Distributed load balancing decision

The distributed load balancing decision plays an important role in SALB because it is the prerequisite for its scalability. To implement a robust and effective distributed load balancing decision mechanism, the following two issues should be addressed.

First, SALB running on each IOS should independently make its own decision and also take into account the load condition of both the local IOS and the whole system. In SALB, the load condition of the local IOS is measured with the forecast load of its on-line load prediction model, and the load condition of the whole system is measured with the efficiency of load balancing (ELB), which is defined as:

ELB = [ (1/N) * sum_{i=1}^{N} lai ] / max_{1<=i<=N} lai,    (1)

where {la1, ..., laN} is the load series of the N IOSes. Obviously, the ELB falls between zero and one, and the more closely the ELB approaches one, the more evenly the load is distributed. However, since a high ELBt value may result in frequent migrations which have a negative impact on the performance of the whole system, an ELB threshold (denoted by ELBt) is employed in SALB. How to choose a proper ELBt will be discussed in the evaluation section.

Second, parallel load migrations should be supported by SALB. With the number of IOSes increasing, parallel migrations among different IOS pairs can accelerate the load balancing progress of a heavily imbalanced system. On the other hand, as the previous section states, the IOSes refusing to be considered as migration targets will respond to the load collection with minus values, but all other IOSes will respond with their load. In such a case, the IOSes which are making decisions may choose the same IOS as a migration target. As a result, the load from different IOSes may be migrated to the same I/O server, which may become a new hot spot. Hence, supporting parallel migrations and avoiding migrating the load from different IOSes to the same one must both be considered by the distributed load balancing decision.

In order to achieve the above two targets, the distributed load balancing decision mechanism of SALB is summarized as follows: if the ELB computed by an IOS is less than the threshold ELBt and the load of this IOS is at the same time the maximum among its collected load, this IOS can sponsor a dynamic file migration; otherwise, no dynamic file migration should be sponsored at this

IOS. There are two reasons for requiring the IOS sponsoring the load migration to own the maximum load among its collected load. First, requiring the local IOS to own the maximum load can prevent different IOSes from migrating their load to the same I/O server. Second, an IOS whose load is not the maximum but is greater than the load collection threshold should not make its own decision with the load of other servers.

Fig. 8. An example of the distributed load balancing decision. The number in parentheses is the load of each IOS.

Fig. 9. SelCand(): The migration candidate selection algorithm.

To explain how the distributed load balancing decision works, consider the example presented in Fig. 8, where IOS0 is transferring its load to IOS6 and both IOS1 and IOS5 are collecting the load of the other IOSes to make decisions independently. As the load series collected by IOS1 and IOS5 are the same, the ELB computed by both is 0.40, which is less than the ELBt (0.7 in this example). Since the load of IOS1 is the maximum among its collected load, IOS1 can choose one server, such as IOS7, as a migration target. In such a case, even though the ELB computed by IOS5 is also equal to 0.40, IOS5 cannot trigger a load migration because its load is not the maximum among its collected load. In this example, we can see that migrations from IOS1 and IOS5 to the same I/O server are prevented. In the next load balancing period, the ELB computed by IOS5 is 0.91. In such a case, there is no need to sponsor a load migration. Hence, requiring the I/O server with the maximum load to sponsor the dynamic file migration can prevent an IOS from making fake decisions with the load of other servers. In this example, we can also see that two migrations are happening concurrently between two IOS pairs. Hence, SALB can balance the load of the whole system quickly through parallel load migrations.

7. Optimization model for the migration candidate selection

The migration candidates, including the local subfile to be migrated and the target IOS which will accept it, should be selected before the load migration occurs. The subfile (also called datafile) refers to the parts of a file that reside on the same IOS. The objective of the migration candidate selection is to balance the benefit and the side-effects of the dynamic file migration.

The benefit of load migration includes the reduced response time and improved resource utilization which may arise from the balanced load between the local I/O server and the target I/O server. After a migration between two IOSes has finished, the ideal condition is that both IOSes own an equal load. Let the target I/O server have load ll and the local I/O server have load lh. After the file with the load li is migrated, (lh - li) should be equal to (ll + li) in the ideal condition. In such a case, as shown in Eq. (2), the load li of the migrated subfile should be half of the load difference between these two IOSes. Since the load of the subfiles is a discrete value, it is reasonable to choose the migrated file with Eq. (3). Hence, among the files whose load obeys Eq. (3), the subfile with the maximum load should be selected to maximize the load migration benefit.

li = (lh - ll) / 2,    (2)

li <= (lh - ll) / 2.    (3)

On the other hand, the load migration side-effect, such as the degradation of the whole system, may arise when file requests which are targeted on the subfiles under migration are rejected or delayed to keep the file data consistent. In order to reduce the migration side-effect, the time spent on migrating a subfile should be as short as possible. Usually, the time spent on transferring file data is proportional to the size of the migrated data. Hence, the load migration side-effect can be considered to be proportional to the size si of the subfile to be migrated. Therefore, in order to reduce the load migration side-effect, SALB should choose the subfile with the smallest size.

Another factor SALB should consider is the load collection threshold LCt. One reason is that when the load of the target IOS after migration is larger than LCt, it will consider itself heavily loaded and may sponsor a load collection, which will result in more load collection actions in the whole system. Hence, after a migration finishes, the load of the target I/O server should be less than LCt.

Combining all the above discussions, in order to balance the benefit and the side-effect of a dynamic file migration, the objective function of the migration candidate selection model is defined as follows:

max_{(i)}  li / si,    (4)

subject to:

    li <= (lh - ll) / 2,
    ll + li < LCt,
    ll = min_{1<=j<=N} {laj},
    li in {l1, l2, ..., ln}, and 0 < i <= n,

where {l1, ..., ln} is the load series of the n local subfiles and {la1, ..., laN} is the load series of the N I/O servers. The SelCand algorithm illustrated in Fig. 9 is used to find the solution of this optimization model. SelCand is like a linear search algorithm with extra constraints and therefore has low time complexity.

8. Dynamic file migration

In this section, the dynamic file migration algorithm which is responsible for transferring a subfile between two IOSes is

investigated. The dynamic file migration algorithm in SALB is based on a client-server architecture. The IOS which sponsors the load migration is the client and the IOS which accepts the migrated subfile is the server. The sequence of the dynamic file migration is presented in Fig. 10. Since the distribution information of the migrated subfile should be updated after the dynamic file migration, the MDS is also involved. The migration client is implemented as a MigrationClient function, which takes IOStarget and flocal as input. The first step of the MigrationClient is to retrieve the attributes of the file flocal. Then, the MigrationClient sends a migration request to the target I/O server IOStarget. After the target IOS accepts this request, the MigrationServer function will be invoked to create a new subfile and then respond with the new subfile's handle. After the client receives the handle, it posts a flow [14] to transfer the file data. When the flow completes, the MigrationClient will send another request to the metadata server to update the distribution information of the migrated subfile. After the metadata is updated, the MigrationClient removes the local subfile to terminate the migration process. In the process of load migration, the mutually exclusive strategy presented in the following paragraph is employed to keep the data under migration consistent.

A mutually exclusive strategy is necessary to ensure the consistency of the subfile under migration. One reason is that the parallel file system permits a file to be concurrently accessed by all clients. The implementation of the mutually exclusive strategy is platform dependent, especially for file systems with client-side caches. PVFS2, adopted as the testbed of the proposed SALB, has an attribute cache (acache) and a name space cache (ncache) at the client side [14]. In this study, the scheduler of the IOS is employed to prevent file requests from accessing the subfile under migration by sending predefined error information to the clients which request access to this subfile. Then, these clients will invalidate their local caches and request the new distribution information of the subfile from the MDS when the metadata has been updated. By adding these new cycles, file data consistency can be guaranteed.

Fig. 10. Sequence diagram of the dynamic file migration.

9. Put it all together: the SALB self-acting load balancing algorithm

The SALB algorithm is illustrated in Fig. 11. Each IOS periodically invokes this algorithm and feeds it with a load collection threshold (LCt), an efficiency of load balancing threshold (ELBt), and its own load series L = (Lt-p, ..., Lt-1, Lt) which is sampled at the previous p + 1 time intervals.

The first step of SALB is to estimate the one-step ahead load lf of this IOS with the FcstLoad function, which implements the on-line load prediction model construction algorithm and is presented in Fig. 4. The second step is to compare the forecast load lf with the load collection threshold LCt. SALB terminates if the forecast load lf is smaller than LCt. Otherwise, it continues to gather the load information of all I/O servers through the CollLoad function and then stores the collected load of all I/O servers in an array {la1, ..., laN} in step 5. After that, the maximum load of all I/O servers is selected and stored in lmax. In step 7, elb is computed with Eq. (1). If elb violates the threshold ELBt and this server at the same time owns the maximum load among the collected load, a dynamic file migration should be triggered. Before the dynamic file migration, the migration candidate is selected with the function SelCand, which is illustrated in Fig. 9. Because a file stored in the parallel file system may be shared among all clients, the mutually exclusive strategy presented in Section 8 needs to be initialized before the dynamic file migration happens. In step 13, the function MigrationClient presented in Fig. 10 is invoked to transfer the file data from the local server to the target server and then update the file distribution metadata. If the migration finishes without error, the mutually exclusive strategy is cleared and the algorithm terminates. If the migration has some errors, these errors should be handled before the algorithm terminates.

10. Evaluation

In order to prove the effectiveness of the proposed SALB, comprehensive experiments are conducted in this section. The configurations of the testbed used in these experiments are presented in Table 2. The experiments are carried out according to the following steps:

• First, the on-line load prediction model is tested with BTIO [37], MADbench2 [8] and FLASH I/O [49], all of which are I/O kernels of scientific applications and exhibit real-world I/O behaviors.

• Second, SALB is compared with traditional load balancing schemes by tracing the average response time and the throughput of the I/O servers under a synthetic load.

• Finally, the configurations of SALB are discussed with the parallel I/O benchmark IOR [42] and the scalability of SALB is evaluated with the widely accepted High-End Computing I/O Simulator (HECIOS) [41].

10.1. On-line load prediction model evaluation

In this section, we first apply the on-line load prediction model to three I/O kernels of scientific applications: (1) BTIO, based on the Block-Tridiagonal problem of NPB, which derives from computational fluid dynamics applications, (2) MADbench2, the I/O kernel of the MADspec astronomy code, and (3) FLASH I/O, the benchmark created to model I/O precisely as in the code of the FLASH astrophysics application. Then, we investigate the performance of the on-line load prediction under a mixed load where different applications are running simultaneously.

10.1.1. On-line load prediction model evaluation

The problem class of BTIO used in this experiment is C [37]. The problem class C is the second largest problem size in the BTIO configuration and it is also widely used to evaluate I/O performance optimization strategies. The configurations of MADbench2 in this experiment are presented in Table 3. The load of the FLASH I/O is extracted from a trace log of the FLASH I/O which runs on 512 processes [49]. Moreover, the mixed load which is used to evaluate the on-line prediction model is sampled at a random I/O server when BTIO and MADbench2 are running simultaneously.

The experimental results are presented in Fig. 12. As the figure shows, the load series of all applications show high variability and instability. However, it is easy to identify that the forecast load of the on-line load prediction model can fit the observed load very well in all cases. Specifically, the mean square prediction errors

Table 2
Configurations of the experimental system.

No. of CPUs per node: 4        CPU: 2.33 GHz          No. of clients: 64
SATA hard disk: 80 GB          Memory: 4 GB           No. of IOS: 14
Operating system: CentOS 5.2   Ethernet: 1 Gbps       No. of MDS: 14
Local file system: ext3        PVFS2 version: 2.8.2   MPICH2 version: 1.1
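For reference, the candidate-selection rule formalized in Section 7 (Eqs. (2)-(4)) and implemented by SelCand (Fig. 9) can be read as a constrained linear scan. The Python below is only an illustration under that reading; the subfile records, example numbers and the strict form of the LCt constraint are hypothetical, not the authors' code.

```python
def sel_cand(subfiles, server_loads, lc_t):
    """Pick (subfile, target server) balancing benefit and side-effect.

    subfiles     : list of dicts {"name", "load", "size"} for local datafiles
                   (illustrative structure, not the PVFS2 representation).
    server_loads : forecast loads of the IOSes; the minimum is the target.
    lc_t         : load collection threshold LCt.
    """
    l_h = max(server_loads)            # this (heavily loaded) server
    l_l = min(server_loads)            # least loaded server = migration target
    target = server_loads.index(l_l)
    half_gap = (l_h - l_l) / 2.0

    best, best_ratio = None, 0.0
    for sf in subfiles:
        if sf["load"] > half_gap:            # Eq. (3): do not overshoot
            continue
        if l_l + sf["load"] >= lc_t:         # keep the target below LCt
            continue
        ratio = sf["load"] / sf["size"]      # Eq. (4): benefit per byte moved
        if ratio > best_ratio:
            best, best_ratio = sf, ratio
    return best, target                      # best is None if nothing fits

# Example: three local subfiles, four IOSes, threshold 60
cand, tgt = sel_cand(
    [{"name": "f1", "load": 12, "size": 64},
     {"name": "f2", "load": 9, "size": 16},
     {"name": "f3", "load": 30, "size": 128}],
    [80.0, 35.0, 20.0, 35.0], lc_t=60.0)
print(cand["name"], "->", "IOS", tgt)
```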

Fig. 11. SALB: The self-acting load balancing algorithm.
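Similarly, the per-IOS control flow summarized by Fig. 11 and Section 9 can be sketched as follows. Here fcst_load and sel_cand refer to the sketches shown earlier, while collect_loads, migrate, lock and unlock are hypothetical placeholders for the PVFS2-level machinery, so this is an outline of the decision logic rather than a working implementation of SALB.

```python
def salb_round(history, subfiles, lc_t, elb_t,
               collect_loads, migrate, lock, unlock):
    """One SALB period on a single IOS (illustrative control flow only)."""
    lf = fcst_load(history)                  # steps 1-2: forecast own load
    if lf < lc_t:                            # lightly loaded: skip collection
        return

    others = [la for la in collect_loads() if la >= 0]  # minus value = "skip me"
    if not others:
        return
    all_loads = others + [lf]

    elb = (sum(all_loads) / len(all_loads)) / max(all_loads)   # Eq. (1)
    if elb >= elb_t or lf < max(all_loads):  # act only if imbalanced and hottest
        return

    subfile, target = sel_cand(subfiles, all_loads, lc_t)
    if subfile is None:
        return
    lock(subfile)                            # mutually exclusive strategy (Sec. 8)
    try:
        migrate(subfile, target)             # file data transfer + MDS update
    finally:
        unlock(subfile)
```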

Table 3
MADbench2 configuration.

Parameter   Value   Parameter    Value
NO_PIX      2000    NO_GANG      1
NO_BIN      128     FBLOCKSIZE   256
RMOD        1       SBLOCKSIZE   256
WMOD        1       PROCESSES    64

for BTIO, MADbench2, FLASH I/O and the mixed load are 2.81, 0.67, 2.48, and 1.93 respectively. The mean square prediction error for MADbench2 is the smallest. One reason is that MADbench2 has a smaller average load than the other workloads. Hence, the on-line load prediction model presented in this paper is an effective approach to forecast the one-step ahead load of an I/O server.

10.1.2. Discussion of the fitted on-line load prediction models

The on-line load prediction model is based on the AR time series model. The maximum order of the fitted AR models is six and the minimum order is one. The percentages of each AR time series model among all fitted models are presented in Fig. 13. Among all the models identified for BTIO, MADbench2, and FLASH I/O, the sum of the AR(1) and AR(2) models accounts for 97.9%, 82.2%, and 70.11% respectively. Even though the sum of the AR(1) and AR(2) models identified for the mixed load is around 20.97%, the models whose order is smaller than five account for up to 94.81%. In one word, the AR models with low orders dominate the on-line load prediction models. Actually, according to the classic locality principle [19], which states that most programs need the same data or instruction sequence multiple times, the clients accessing data from one I/O server will visit the same I/O server next time with maximum probability. Therefore, the load sampled at one I/O server at adjacent times tends to show high correlation. Hence, extending the AR model to build the on-line load prediction model is reasonable.

The time to build the extended AR time series models is also an important factor, because they are invoked at each cycle of load balancing. The average time for building extended AR models of different orders is presented in Fig. 14. As the figure shows, the time to build the extended AR(1) model is the minimum, and there is an upward trend as the order increases. The time for fitting the AR(6) time series model has the maximum value, 0.7 ms. Nevertheless, the time to build the on-line load prediction model is so short that it can be ignored. Hence, the proposed on-line load prediction model can be built in a determined and short time, which is desirable in an on-line load balancing algorithm.
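The low-order identification discussed above follows the ACF/PACF patterns of Table 1: on a once-differenced load series, the ACF decays while the PACF cuts off after a small lag, which is what pushes the fitted orders towards AR(1) and AR(2). A toy Python sketch of that diagnostic is given below; the sample data and helpers are illustrative, not the paper's tooling.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function of a (zero-meaned) series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

def pacf(x, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(x, max_lag)
    phi = np.zeros((max_lag + 1, max_lag + 1))
    pac = np.zeros(max_lag + 1)
    pac[0], v = 1.0, 1.0
    for k in range(1, max_lag + 1):
        phi[k, k] = (r[k] - np.dot(phi[k - 1, 1:k], r[1:k][::-1])) / v
        for j in range(1, k):
            phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        v *= (1.0 - phi[k, k] ** 2)
        pac[k] = phi[k, k]
    return pac

# Hypothetical raw load samples (MiB/s); difference once as in Section 4.1
load = np.array([5.1, 6.0, 7.2, 8.9, 9.5, 8.1, 6.4, 5.9, 5.2, 4.8])
diffed = np.diff(load)
print(acf(diffed, 4))    # gradual decay suggests AR behaviour
print(pacf(diffed, 4))   # cut-off lag hints at the AR order
```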

Fig. 12. The on-line load prediction model evaluation: (a) BTIO; (b) MADbench2; (c) FLASH I/O; (d) mixed load.

Fig. 13. The popularity of each fitted AR time series model: (a) BTIO; (b) MADbench2; (c) FLASH I/O; (d) mixed load.

Fig. 14. Time to build the different AR models on-line.
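One rough way to reproduce the kind of measurement behind Fig. 14 is to time AR fits of increasing order. The least-squares fit below is only a stand-in for the paper's extended AR construction (the sample series is random), so the absolute numbers will differ from the reported 0.7 ms, but the upward trend with order should be visible.

```python
import time
import numpy as np

def fit_ar_ls(z, p):
    """Least-squares AR(p) fit on a zero-mean series (illustrative)."""
    X = np.column_stack([z[p - i - 1:len(z) - i - 1] for i in range(p)])
    y = z[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

rng = np.random.default_rng(0)
z = rng.standard_normal(200)           # stand-in for a differenced load series
for p in range(1, 7):                  # orders 1..6, as in Fig. 14
    t0 = time.perf_counter()
    fit_ar_ls(z, p)
    print(p, (time.perf_counter() - t0) * 1e3, "ms")
```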

10.2. Evaluation of the SALB algorithm under a synthetic workload

In this section, the proposed SALB is compared against traditional load balancing solutions under a synthetic workload. The synthetic load generator models the imbalanced load allocation across I/O servers with the self-similar distribution [30]. The self-similar distribution is widely used in the load balancing literature [33,38]. If the number of I/O servers is N, then for given parameters x and y, the probability of accessing the I/O servers numbered up to s (s <= N) is computed with the following equation:

P(i <= s) = (s / N)^(log(x/100) / log(y/100)).    (5)

The average response time of all I/O servers is tested with the mpi-io-test program, which is distributed with the PVFS source [14]. The mpi-io-test program runs on separate clients and each of its processes is dedicated to accessing one I/O server. The file requests among the mpi-io-test processes are synchronized with the MPI barrier function. The average response time of all I/O servers is computed with the MPI reduce function. Moreover, the migration frequencies among all I/O servers are also recorded. The experiment is repeated three times under the following three conditions:

• NOLB: The original PVFS2 without load balancing.

• SALB: PVFS2 with the proposed SALB. The implementation details of SALB on PVFS2 are presented in our previous work [23].

• DCLB: PVFS2 with disk cooling [38], which is a centralized load balancing algorithm without the on-line load prediction function.

Fig. 15. Mean response time (MRT) and the number of migrations among I/O servers under the synthetic load.

10.2.1. Mean response time

The mean response time of this experiment is presented in Fig. 15. The horizontal axis is the time periods during which the synthetic load generator is running, and the vertical axis is the mean response time of all I/O servers. As the test results show, at the beginning stage, the mean response time of both the DCLB and SALB is the same as that of the NOLB. One reason is that the effect of the load balancing is offset by the sharply growing load. After around the 75th period, the load balancing benefit begins to appear. The mean response time for both the DCLB and SALB is smaller than that of the NOLB until the end of the test. Hence, we can conclude that when the load among I/O servers becomes imbalanced, the mean response time of I/O servers with the load balancing function has obvious advantages over that without the load balancing function.
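The synthetic generator of Section 10.2 rests on the self-similar distribution of Eq. (5); one way to draw server indices from it is inverse-transform sampling, sketched below. The exponent log(x/100)/log(y/100) follows the usual (x, y) formulation of this distribution and is an assumption about the exact form of Eq. (5); the function and parameter names are illustrative.

```python
import numpy as np

def sample_servers(n_requests, n_servers, x=40.0, y=60.0, seed=0):
    """Draw server indices so that roughly x% of accesses hit the first
    y% of servers (self-similar distribution, cf. Eq. (5))."""
    theta = np.log(x / 100.0) / np.log(y / 100.0)
    rng = np.random.default_rng(seed)
    u = rng.random(n_requests)
    # invert P(i <= s) = (s / N) ** theta  ->  s = N * u ** (1 / theta)
    idx = np.ceil(n_servers * u ** (1.0 / theta)).astype(int)
    return np.clip(idx, 1, n_servers) - 1    # 0-based server index

hits = np.bincount(sample_servers(100000, 14), minlength=14)
print(hits / hits.sum())    # skewed load across the 14 IOSes of Table 2
```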

Moreover, it is easy to identify from Fig. 15 that the mean response time for SALB starts to decline before the time when the mean response time for the DCLB reaches its peak. One reason is that the load balancing delay in SALB is small. Specifically, the load balancing decision-maker of SALB is distributed onto each I/O server. Therefore, sponsoring a load balancing action only needs one network transmission. Moreover, the on-line load prediction model is employed by SALB to forecast the one-step ahead load of each I/O server. In such a case, SALB running on each I/O server can make its decision with the forecast load of the other servers and therefore the decision delay in SALB can be further reduced. By contrast, the DCLB is a centralized algorithm, where sponsoring a load migration needs at least two network transmissions: one for load collection and the other for issuing the load balancing command. Hence, we can conclude that the load balancing decision delay in SALB is smaller than that in the DCLB.

Further, the average response time of the DCLB is 0.0352 s and the mean response time of the NOLB is 0.0434 s. Hence, it is easy to conclude that load balancing among I/O servers can reduce the average response time of parallel file systems. Moreover, since the decision delay in SALB is small, SALB can sponsor the load balancing as soon as the load condition of the system requires it. In other words, in SALB, preemptive load migrations may be triggered to balance the load among I/O servers before the load imbalance becomes worse. The average response time for SALB is 0.0312 s, which is 11.36% lower than that of the DCLB. In summary, we can conclude that the proposed SALB delivers a better mean response time for all I/O servers situated in PVFS2.

10.2.2. Analysis of the traced migration log

In order to understand in depth how SALB works during these tests, the migration logs among the I/O servers are also traced. The results are presented in Fig. 15. The horizontal axis is the time periods during which the synthetic load generator is running. The vertical axis is the number of migrations in each time period. As expected, the first migration for SALB happens at the 5th interval, but the first migration of the DCLB happens at the 7th interval. Actually, compared with the migration times of the DCLB, the migration times of SALB show a forward shift. The reason is that the decision delay in SALB is smaller than the decision delay in the DCLB. These test results are consistent with those presented in the previous section.

Meanwhile, the number of migrations sponsored by SALB is 49 and the number of migrations for the DCLB is 36. Even though the number of migrations sponsored by SALB is larger than that of the DCLB, the mean response time of SALB is less than that of the DCLB, as shown in the previous section. One reason is that each dynamic load migration between two I/O servers is carefully evaluated by SALB, and the migration candidates are always selected with the proposed optimization model to balance the benefits and side-effects of the dynamic file migration. Hence, in SALB, preemptive dynamic file migrations may be triggered by the on-line load prediction model, and the tradeoff between the benefits and the side-effects of the dynamic file migration is well balanced by the optimization model for selecting migration candidates.

10.2.3. Throughput of I/O servers

The standard deviation of the throughput and the mean throughput of the I/O servers are two important factors which reflect the load distribution and resource utilization [24]. Fig. 16 presents the experimental results with respect to these two factors. As the figure shows, the standard deviation of the throughput for the NOLB is 2.20 and the standard deviation of the throughput for the DCLB is 0.91. Moreover, the I/O servers with SALB have almost equal throughput. In such a case, the standard deviation of the throughput with SALB decreases to 0.78, which is 14.2% lower than that of the DCLB. The average throughput of all I/O servers without the load balancing function is 2.38 MiB/s. The average throughput of all I/O servers with the DCLB is 3.20 MiB/s. The average throughput with SALB is 3.53 MiB/s, which is 10.31% higher than that of the DCLB. Hence, we can conclude that the proposed self-acting load balancing algorithm also delivers improved resource utilization.

Fig. 16. The mean and variance of the throughput of all I/O servers.

10.3. Discussion of the configurations and the scalability of SALB

In this section, the impact of the SALB configurations on the parallel I/O performance is discussed with the IOR [42] first. Then, the scalability of SALB is evaluated in the context of a large-scale I/O storage system which is simulated with the High-End Computing I/O Simulator (HECIOS) [41].

10.3.1. Discussion of SALB's configurations

Two important parameters of SALB, LCt and ELBt, are discussed with the IOR [42] in this section. The IOR is a widely used parallel I/O performance benchmark because it is able to model the I/O actions of different real scientific applications [42]. In this experiment, the IOR iterates 1000 times and accesses the I/O servers with a 5 M transferSize through the MPI-IO interface. In order to model an imbalanced load distribution among the fourteen I/O servers, the blockSize of each process is set to twenty-one times the transferSize. In such a case, the load of seven I/O servers is higher than the load of the others.

First, the impact of different values of the load collection threshold (LCt) on the performance of the parallel I/O system is tested when the ELB threshold (ELBt) is set to 0.8. The LCt is adopted by SALB to prevent frequent load collection because high-frequency load collection may consume much network bandwidth and CPU time, thereby degrading the system performance. In order to understand the impact of LCt on the performance in depth, the number of load collections and the number of file migrations across all I/O servers are traced. The experimental results are presented in Table 4. As the results show, the value of LCt ranges from 40% to 90% of the maximum load of an I/O server. As LCt decreases from 90% to 40%, the number of load collections, the number of migrations, and the ratio between them grow accordingly. Also, it is obvious that the growth of all three values is slow when the value of LCt is larger than 60%. One reason is that the average load of all I/O servers is about 60%. In such a case, only a portion of the I/O servers need to sponsor load collections. From the test results, we can also see that the maximum performance improvement for writes and reads is achieved when the value of LCt is around 60% and 55% respectively. When LCt is less than 60% or 55%, the load collection frequency is so high that the performance of parallel I/O degrades; when LCt is greater than 60%
Table 4
Impact of different LCt values on the performance improvement, the number of load collections, and the number of migrations.

LCt (%)   Write imprv. (%)   Read imprv. (%)   No. of load collections   No. of migrations   Collections/migrations
90         4.54               0.96                  1                        1                 1
85         6.51               1.47                  1                        1                 1
80        10.12               5.49                  3                        3                 1
75        10.93               6.74                  2                        2                 1
70        10.05               2.90                  4                        2                 2
65         9.63               2.88                  9                        4                 2.25
60        17.02               6.99                165                       56                 2.95
55        16.72               8.16                444                      107                 4.15
50        12.65               5.20                704                      153                 4.60
45        13.26               5.94               1217                      225                 5.41
40        14.59               2.28               1630                      258                 6.31
As the results show, the value of LCt ranges from 40% to 90% of the maximum load of an I/O server. As LCt decreases from 90% to 40%, the number of load collections, the number of migrations, and the ratio between them all grow. It is also clear that the growth of all three values is slow while LCt is larger than 60%. One reason is that the average load of all I/O servers is about 60%; in that case, only a portion of the I/O servers need to sponsor load collections. The test results also show that the maximum performance improvement for the write and the read is achieved when LCt is around 60% and 55%, respectively. When LCt is below these values, the load collection frequency is so high that the performance of the parallel I/O system degrades; when LCt is above them, the load of an I/O server violates the threshold so rarely that dynamic file migrations are seldom triggered. Hence, it is reasonable to conclude that the proposed self-acting load balancing delivers improved performance over a wide range of load collection thresholds, and that setting the load collection threshold to the average load of all I/O servers maximizes its performance.
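The role of LCt can be summarized in a small sketch. The struct and method names below are hypothetical and are not taken from the SALB implementation; the sketch only illustrates the two behaviors discussed above, namely that a server sponsors a load collection only when its own (predicted) load violates LCt, and that LCt is kept near the average load observed across the I/O servers.

```cpp
#include <numeric>
#include <vector>

// Minimal sketch of the LCt gate, assuming a load value normalized to [0, 1],
// where 1.0 is the maximum load of a server. Names are illustrative only.
struct LoadCollector {
    double load_collection_threshold;   // LCt, e.g. 0.60

    // A server sponsors a load collection (asks its peers for their loads)
    // only when its own predicted load violates the threshold.
    bool shouldSponsorCollection(double predicted_load) const {
        return predicted_load > load_collection_threshold;
    }

    // Adaptive adjustment suggested by the experiments: keep LCt close to the
    // average load reported by the I/O servers in the last collection.
    void adaptThreshold(const std::vector<double>& collected_loads) {
        if (collected_loads.empty()) return;
        double mean = std::accumulate(collected_loads.begin(),
                                      collected_loads.end(), 0.0) /
                      collected_loads.size();
        load_collection_threshold = mean;   // about 0.60 in the IOR experiment
    }
};
```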
Fig. 17. Impact of different ELBt values on the performance improvement of IOR.

Second, the impact of different values of the ELB threshold (ELBt) on the performance of the parallel I/O system is tested. In this experiment, LCt is set to 60% of the maximum load of an I/O server. The ELBt is another important parameter of the proposed SALB algorithm. The results of the experiment are presented in Fig. 17. As the figure shows, the value of ELBt ranges from 0.4 to 0.9. The performance improvement for both the read and the write increases at the initial stage and reaches a peak when the value of ELBt is 0.8. After that, the performance improvement for the read declines while that for the write remains stable. One reason is that when the value of ELBt is too large, dynamic file migrations become so frequent that they degrade the performance of the whole system. Hence, in order to fully reap the performance improvement of the proposed self-acting load balancing, the efficiency-of-load-balancing threshold is suggested to be set to 0.8.
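A minimal sketch of this ELBt gate is given below. The exact definition of the efficiency of load balancing (ELB) is the one introduced earlier in the paper; here it is replaced by an illustrative benefit-versus-cost ratio, and all type and function names are hypothetical.

```cpp
// Sketch only: SALB proceeds with a candidate migration when its estimated
// efficiency of load balancing reaches the ELB threshold (0.8 in the experiments).
struct MigrationPlan {
    double expected_benefit;   // e.g., estimated reduction in response time
    double migration_cost;     // e.g., time to move the subfile over the network
};

double efficiencyOfLoadBalancing(const MigrationPlan& p) {
    double total = p.expected_benefit + p.migration_cost;
    if (total <= 0.0) return 0.0;          // degenerate case: nothing to gain
    return p.expected_benefit / total;     // illustrative stand-in for the real ELB
}

bool worthMigrating(const MigrationPlan& p, double elb_threshold /* e.g. 0.8 */) {
    return efficiencyOfLoadBalancing(p) >= elb_threshold;
}
```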
10.3.2. Scalability evaluation for SALB

In this section, SALB is evaluated in the context of a large-scale parallel I/O storage system simulated with the High-End Computing I/O Simulator (HECIOS) [41]. HECIOS is driven by the trace logs of real scientific applications and permits us to evaluate optimizations proposed for large-scale storage systems that are not widely available, or even at scales that are not yet in production. The simulator has already been used to evaluate metadata load balancing for petabyte-scale file systems [52], and HECIOS has been applied to study the client-side cache [41], the middleware-level client cache [5], scalable directories for parallel file systems [53], and server-to-server communication [15]. Hence, it is reasonable to employ HECIOS to evaluate the scalability of SALB.
HECIOS is built around the OMNeT++ simulation package [51] and shares almost the same architecture as PVFS2. However, HECIOS is a large-scale parallel I/O system simulator driven by the trace logs of real scientific applications: all of its actions are triggered by clients, so there is no performance scheduler at the I/O servers of HECIOS. To resolve this problem, we employ OMNeT++ self-messages to invoke SALB periodically at each I/O server. Another major consideration when implementing SALB in HECIOS is that there is no stand-alone metadata server in HECIOS; instead, HECIOS stores its metadata in a C++ singleton object. In such a case, there is no need to perform communication to update the metadata of a file after its migration has finished. The other parts of SALB in HECIOS are the same as in the real PVFS2. Each file system server in HECIOS feeds SALB with a load collection threshold, an efficiency-of-load-balancing threshold, and its own previous load series.
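A minimal sketch of the self-message mechanism, written against the modern OMNeT++ API, is shown below; the module and method names (SalbScheduler, runSalbOnce) and the evaluation interval are illustrative and are not part of HECIOS.

```cpp
#include <omnetpp.h>
using namespace omnetpp;   // OMNeT++ 5.x/6.x style; older releases omit the namespace

// Sketch: an OMNeT++ module that re-arms a self-message so that SALB is
// invoked periodically at an I/O server, even though HECIOS itself is
// entirely client-driven.
class SalbScheduler : public cSimpleModule {
    cMessage *salbTimer = nullptr;
    simtime_t interval = 1.0;   // load-evaluation period (assumed value)

  protected:
    void initialize() override {
        salbTimer = new cMessage("salbTimer");
        scheduleAt(simTime() + interval, salbTimer);
    }
    void handleMessage(cMessage *msg) override {
        if (msg->isSelfMessage()) {
            runSalbOnce();                          // hypothetical hook into SALB
            scheduleAt(simTime() + interval, msg);  // re-arm the timer
        }
    }
    void runSalbOnce() { /* collect/predict load, decide, trigger migration */ }

  public:
    ~SalbScheduler() override { cancelAndDelete(salbTimer); }
};

Define_Module(SalbScheduler);
```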
In order to simulate load imbalance among the I/O servers, the self-similar distribution function defined in Eq. (5) above is used in the initial stage of HECIOS to guide the assignment of subfiles among the available I/O servers. The value of x/y is set to 40/60, which means that 40% of the subfiles are assigned to 60% of the servers.
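The sketch below illustrates the intent of such a skewed initial placement. It is a simplified, deterministic stand-in for the self-similar distribution of Eq. (5): a fraction x of the subfiles is spread over a fraction y of the servers while the remaining subfiles crowd onto the other servers. All names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Returns, for each subfile, the index of the server it is placed on.
// Assumes numServers is large enough that both server groups are non-empty.
std::vector<std::size_t> skewedAssignment(std::size_t numSubfiles,
                                          std::size_t numServers,
                                          double x = 0.40, double y = 0.60) {
    std::vector<std::size_t> serverOfSubfile(numSubfiles);
    std::size_t lightSubfiles = static_cast<std::size_t>(x * numSubfiles);
    std::size_t lightServers  = static_cast<std::size_t>(y * numServers);
    for (std::size_t i = 0; i < numSubfiles; ++i) {
        if (i < lightSubfiles)   // x of the subfiles spread over y of the servers
            serverOfSubfile[i] = i % lightServers;
        else                     // the remaining subfiles go to the other servers
            serverOfSubfile[i] = lightServers + (i % (numServers - lightServers));
    }
    return serverOfSubfile;
}
```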
In this experiment, HECIOS uses settings similar to those of the Beowulf cluster Palmetto [48] at Clemson University. Palmetto provides two interconnection networks: a Gigabit Ethernet network and a Myrinet Myri-10G network. HECIOS supports both networks through the INET network simulation components of OMNeT++. However, INET only provides an accurate TCP/IP simulation, and the Myrinet model of HECIOS is obtained by adjusting the Ethernet network settings to approximate Myrinet performance. Therefore, we directly use the Gigabit Ethernet networking model of HECIOS to evaluate SALB. The trace log files used to drive the experiment were gathered at Argonne National Laboratory and can be downloaded from the Parallel Architecture Research Laboratory of Clemson University [49].

Fig. 18. Scalability evaluation for SALB.

The numbers of I/O servers evaluated in this experiment are 8, 16, 32, 64, 128, and 256. The results are presented in Fig. 18, where the horizontal axis is the number of I/O servers and the vertical axis is the read and write performance. From the figure, we can see that both the read and the write performance of HECIOS scale well. When HECIOS is integrated with SALB, the performance for both the read and the write increases from hundreds of megabytes per second to almost four gigabytes per second; in particular, the performance with SALB is almost three times higher than that without load balancing. The performance improvement for HECIOS is more obvious than that for IOR in the previous section. The reason is that, when the subfile assignment of HECIOS is guided by the self-similar distribution in this test, some I/O servers hold more than one subfile while other I/O servers are idle. For example, when the number of I/O servers in the test is 256, the number of I/O servers that are actually assigned subfiles is approximately 215, and among these 215 servers, 25.7% of the subfiles are assigned to around 20 servers. In such a case, the heavily loaded I/O servers with SALB can select the idle I/O servers as migration targets, and as a result the load is distributed evenly among the I/O servers. Hence, we can conclude that SALB delivers high scalability and can improve the performance of parallel file systems when the load among their I/O servers is imbalanced.
11. Conclusion and further work
In this paper, we have presented a dynamic and adaptive load balancing algorithm named self-acting load balancing (SALB) to tackle the load imbalance issue among the I/O servers of parallel file systems. The feasibility of SALB has been demonstrated in a number of performance experiments.

SALB is entirely based on a distributed architecture, in which each I/O server can make load balancing decisions and sponsor dynamic file migrations by itself. This distributed design enables SALB to deliver the scalability and availability required by steadily growing parallel I/O systems. We have shown that SALB is a very effective method for balancing load in the context of large-scale I/O servers, which were built with the widely accepted High-End Computing I/O Simulator (HECIOS) [41]. Moreover, in the exascale era, high-performance computing applications such as climate analytics [3] may handle data distributed across several countries. In such a case, a distributed load balancing algorithm should play an even more important role in data management. Hence, we believe that SALB provides a good frame of reference for future work.

We have developed an on-line load prediction model to forecast the one-step-ahead load of an I/O server. In such a case, SALB running on one I/O server can collect the forecast loads of the other I/O servers to make its load balancing decision, so the impact of the network transmission latency on the decision delay is reduced. Taking into account both the workload characteristics of scientific applications and the time needed to build different time series models, we chose the AR time series model as the basis of the on-line load prediction model. We have shown that the developed model fits the load of an I/O server with small mean square prediction errors. The on-line load prediction model can be further improved; for example, the load series of the I/O servers may contain seasonal components [12], which are generated by loops in the applications, and taking such seasonal components into account should further improve the accuracy of the on-line load prediction.
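For illustration, a one-step-ahead AR(p) forecast of a load series can be written as the short sketch below. The on-line fitting of the coefficients described in the paper is omitted here and the coefficients are assumed to be given; all names are illustrative.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// One-step-ahead AR(p) forecast: next = c + phi[0]*x[t] + phi[1]*x[t-1] + ...
// recentLoads holds the observed load samples with the newest value last.
double arForecastOneStep(const std::deque<double>& recentLoads,
                         const std::vector<double>& phi,   // fitted AR coefficients
                         double intercept) {
    double prediction = intercept;
    const std::size_t p = phi.size();
    for (std::size_t i = 0; i < p && i < recentLoads.size(); ++i)
        prediction += phi[i] * recentLoads[recentLoads.size() - 1 - i];
    return prediction;   // forecast of the next load sample of this I/O server
}
```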
In order to reduce the message exchanges among the I/O servers for load collection, SALB adopts an adaptive load collection threshold adjustment algorithm that prevents frequent load collection. We have shown that the adaptive load collection threshold sharply reduces the message exchanges while still guaranteeing the effectiveness of the load balancing decisions. Even though the message exchanges among n I/O servers are around n²/2 in the worst case, we believe that the cost of load collection can be reduced further through the on-line prediction model. For example, the on-line load prediction model could be extended to produce long-term forecasts of the loads collected from the other I/O servers; in such a case, SALB running on one I/O server could make its decisions based on previously collected load information.

Furthermore, SALB employs dynamic file migration to implement its load migration. One advantage of dynamic file migration is that it can move load without interrupting the system service. Moreover, in order to balance the benefits and the side-effects of dynamic file migration, SALB employs an optimization model for selecting the migration candidates. Even though our load balancing strategies are orthogonal to fault tolerance techniques, the relationship between load balancing and data replication could be explored to further improve the effectiveness of dynamic file migration. For example, the I/O servers that hold replicated data could be given a higher priority than others when SALB selects migration candidates, so that the side-effects of dynamic file migration are reduced further.
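As a rough illustration only, the sketch below scores migration candidates by benefit minus cost and favors candidates whose data is replicated elsewhere, in the spirit of the suggestion above. It is not the optimization model used by SALB, and all names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

struct Candidate {
    double loadRemoved;     // benefit: load taken off the overloaded server
    double migrationCost;   // side-effect: bytes to move, service disturbance
    bool   hasReplica;      // a replica elsewhere makes the move much cheaper
};

// Pick the candidate with the best benefit/side-effect balance, or nullptr
// if the list is empty. Replicated candidates waive the migration cost.
const Candidate* pickCandidate(const std::vector<Candidate>& candidates) {
    auto score = [](const Candidate& c) {
        return c.hasReplica ? c.loadRemoved : c.loadRemoved - c.migrationCost;
    };
    auto it = std::max_element(candidates.begin(), candidates.end(),
                               [&](const Candidate& a, const Candidate& b) {
                                   return score(a) < score(b);
                               });
    return it == candidates.end() ? nullptr : &*it;
}
```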
Acknowledgments

The final version has benefited greatly from the many detailed comments and suggestions of the anonymous reviewers; the authors gratefully acknowledge them. The work described in this paper is supported by the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2009ZX-01, the National Natural Science Foundation of China under Grants No. 60973007 and No. 61003015, the Doctoral Fund of the Ministry of Education of China under Grant No. 20101102110018, the Fundamental Research Funds for the Central Universities under Grant No. YWF-10-02-058, the Hi-tech Research and Development Program of China (863 Program) under Grant No. 2011AA01A205, and the National Core Electronic Devices, High-end General Purpose Chips and Fundamental Software Project under Grant No. 2010ZX01036-001-001.

References

[1] R. Ross, A. Shoshani, S. Klasky, Scientific data management: challenges and approaches in the extreme scale era, in: Proceedings of the 2010 Scientific Discovery through Advanced Computing (SciDAC) Conference, USA, Jul. 2010.
[2] H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Control 19 (1974) 716–723.
[3] G. Aloisio, S. Fiore, Towards exascale distributed data management, Int. J. High Perform. Comput. Appl. 23 (2009) 398–400.
[4] M. Andreolini, S. Casolari, M. Colajanni, Models and framework for supporting runtime decisions in web-based systems, ACM Trans. Web 2 (2008) 17:1–17:43.
[5] M. Bassily, A middle-ware level client cache for a high performance computing I/O simulator, Ph.D. Thesis, Clemson University, 2009.
[6] A. Batsakis, R. Burns, A. Kanevsky, J. Lentini, T. Talpey, CA-NFS: a congestion-aware network file system, Trans. Storage 5 (2009) 15:1–15:24.
[7] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, M. Wingate, PLFS: a checkpoint file system for parallel applications, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, Nov. 2009, pp. 21:1–21:12.
[8] J. Borrill, J. Carter, L. Oliker, D. Skinner, Integrated performance monitoring of a cosmology application on leading HEC platforms, in: Proceedings of the 2005 International Conference on Parallel Processing, ICPP '05, 2005, pp. 119–128.
[9] J. Borrill, L. Oliker, J. Shalf, H. Shan, A. Uselton, HPC global file system performance analysis using a scientific-application derived benchmark, Parallel Comput. 35 (2009) 358–373.
[10] G.E.P. Box, G. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day Incorporated, 1990.
[11] P.J. Braam, The Lustre storage architecture, Tech. Rep., Aug. 2004. Available: http://wiki.lustre.org/.
[12] P.J. Brockwell, R.A. Davis, Introduction to Time Series and Forecasting, second ed., Springer, 2002.
[13] P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, T. Ludwig, Small-file access in parallel file systems, in: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, May 2009, pp. 1–11.
[14] P.H. Carns, W.B. Ligon III, R.B. Ross, R. Thakur, PVFS: a parallel file system for Linux clusters, in: Proceedings of the 4th Annual Linux Showcase & Conference, Volume 4, Oct. 2000, pp. 317–327.
[15] P.H. Carns, B.W. Settlemyer, W.B. Ligon III, Using server-to-server communication in parallel file systems to simplify consistency and improve performance, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, Nov. 2008, pp. 1–8.
[16] S. Casolari, M. Colajanni, Short-term prediction models for server management in Internet-based contexts, Decis. Support Syst. 48 (2009) 212–223.
[17] IBM Corp., An architectural blueprint for autonomic computing, Tech. Rep., 2006. Available: http://users.encs.concordia.ca/.
[18] P.E. Crandall, R.A. Aydt, A.A. Chien, D.A. Reed, Input/output characteristics of scalable parallel applications, in: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, Supercomputing '95, Dec. 1995, p. 59.
[19] P.J. Denning, The locality principle, Commun. ACM 48 (2005) 19–24.
[20] P.A. Dinda, Design, implementation, and performance of an extensible toolkit for resource prediction in distributed systems, IEEE Trans. Parallel Distrib. Syst. 17 (2006) 160–173.
[21] P.A. Dinda, D.R. O'Hallaron, Host load prediction using linear models, Cluster Comput. 3 (2000) 265–280.
[22] J. Dongarra, P. Beckman, T. Moore, et al., The international exascale software project roadmap, Int. J. High Perform. Comput. Appl. 25 (2011) 3–60.
[23] B. Dong, X. Li, L. Xiao, L. Ruan, B. Yu, Self-acting load balancing with parallel sub file migration for parallel file system, in: Proceedings of the 2010 Third International Joint Conference on Computational Science and Optimization, Volume 02, CSO '10, 2010, pp. 317–321.
[24] L.W. Dowdy, D.V. Foster, Comparative models of the file assignment problem, ACM Comput. Surv. 14 (1982) 287–313.
[25] M. Eshel, R. Haskin, D. Hildebrand, M. Naik, F. Schmuck, R. Tewari, Panache: a parallel file system cache for global file access, in: Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST '10, Feb. 2010, pp. 155–168.
[26] G. Fragidis, K. Tarabanis, The business strategy perspective on the development of decision support systems, in: Proceedings of the CIMCA-IAWTIC '06, Vol. 02, IEEE Computer Society, Washington, DC, USA, 2005, pp. 968–975.
[27] W. Frings, F. Wolf, V. Petkov, Scalable massively parallel I/O to task-local files, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, Nov. 2009, pp. 1–11.
[28] B. Gavish, O.R. Liu Sheng, Dynamic file migration in distributed computer systems, Commun. ACM 33 (1990) 177–189.
[29] A. Geist, Paving the roadmap to exascale, Tech. Rep., Oak Ridge National Laboratory, 2010. Available: http://www.scidacreview.org.
[30] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, 1973.
[31] J.M. Kunkel, Towards automatic load balancing of a parallel file system with subfile based migration, Master's Thesis, Heidelberg University, 2007.
[32] S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, W. Allcock, I/O performance challenges at leadership scale, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, Nov. 2009, pp. 1–12.
[33] L. Lee, File assignment in parallel I/O systems with minimal variance of service time, IEEE Trans. Comput. 49 (2) (2000) 127–140.
[34] D. Lee, R.S. Ramakrishna, Improving disk I/O load prediction using statistical parameter history in online for grid computing, IEICE Trans. Inf. Syst. E89-D (2006) 2484–2490.
[35] W. Liu, M. Wu, X. Ou, W. Zheng, M. Shen, Design of an I/O balancing file system on web server clusters, in: Proceedings of the 2000 International Workshop on Parallel Processing, ICPP '00, 2000, pp. 119–126.
[36] D.S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, Z. Xu, Peer-to-peer computing, Tech. Rep., Hewlett-Packard Company, 2005. Available: http://www.hpl.hp.com/.
[37] P. Wong, R.F. Van der Wijngaart, NAS parallel benchmarks I/O, Version 2.4, Tech. Rep., NASA Advanced Supercomputing Division, 2003. Available: http://www.nas.nasa.gov/publications/npb.html.
[38] P. Scheuermann, G. Weikum, P. Zabback, Data partitioning and load balancing in parallel disk systems, The VLDB J. 7 (1998) 48–66.
[39] F. Schmuck, R. Haskin, GPFS: a shared-disk file system for large computing clusters, in: Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02, USENIX Association, 2002, pp. 231–244.
[40] B. Schroeder, G. Gibson, A large-scale study of failures in high-performance computing systems, IEEE Trans. Dependable Secur. Comput. 7 (2010) 337–351.
[41] B.W. Settlemyer, A study of client-based caching for parallel I/O, Ph.D. Thesis, Clemson University, 2008.
[42] H. Shan, K. Antypas, J. Shalf, Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, Nov. 2008, pp. 42:1–42:12.
[43] E. Smirni, D.A. Reed, Workload characterization of input/output intensive parallel applications, in: Proceedings of the 9th International Conference on Computer Performance Evaluation: Modelling Techniques and Tools, Springer-Verlag, 1997, pp. 169–180.
[44] E. Smirni, D.A. Reed, Lessons from characterizing the input/output behavior of parallel scientific applications, Perform. Eval. 33 (1998) 27–44.
[45] H. Song, Y. Yin, X.-H. Sun, R. Thakur, S. Lang, A segment-level adaptive data layout scheme for improved load balance in parallel file systems, in: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID '11, 2011, pp. 414–423.
[46] W. Sun, J. Shu, W. Zheng, Dynamic file allocation in storage area networks with neural network prediction, in: International Symposium on Neural Networks, Lecture Notes in Computer Science, 2004, pp. 133–140.
[47] Y. Tamura, S. Kasahara, Y. Takahashi, S. Kamei, R. Kawahara, Inconsistency of logical and physical topologies for overlay networks and its effect on file transfer delay, Perform. Eval. 65 (2008) 725–741.
[48] Top500. Available: http://www.top500.org/.
[49] MPI I/O test trace log and FLASH I/O 512P trace log. Available: http://www.parl.clemson.edu/, 2011.
[50] N. Tran, D.A. Reed, Automatic ARIMA time series modeling for adaptive I/O prefetching, IEEE Trans. Parallel Distrib. Syst. 15 (4) (2004) 362–377.
[51] A. Varga, The OMNeT++ discrete event simulation system, in: Proceedings of the European Simulation Multiconference, Jun. 2001, pp. 319–324.
[52] S.A. Weil, K.T. Pollack, S.A. Brandt, E.L. Miller, Dynamic metadata management for petabyte-scale file systems, in: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC '04, 2004, p. 4.
[53] Y. Wu, A study for scalable directory in parallel file systems, Ph.D. Thesis, Clemson University, 2009.
[54] C. Wu, R. Burns, Handling heterogeneity in shared-disk file systems, in: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC '03, Nov. 2003, p. 7.
[55] Y. Wu, Y. Yuan, G. Yang, W. Zheng, Load prediction using hybrid model for computational grid, in: Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, GRID '07, 2007, pp. 235–242.
[56] T. Xie, Y. Sun, A file assignment strategy independent of workload characteristic assumptions, Trans. Storage 5 (2009) 10:1–10:24.
[57] B. Yagoubi, Distributed load balancing model for grid computing, ARIMA J. 12 (2010) 43–60.
[58] Y. Zhang, W. Sun, Y. Inoguchi, Predicting running time of grid tasks based on CPU load predictions, in: Proceedings of the 7th IEEE/ACM International Conference on Grid Computing, GRID '06, 2006, pp. 286–292.
[59] Y. Zhao, W. Huang, Adaptive distributed load balancing algorithm based on live migration of virtual machines in cloud, in: Proceedings of the 2009 Fifth International Joint Conference on INC, IMS and IDC, 2009, pp. 170–175.
[60] Y. Zhu, Y. Yu, W.Y. Wang, S.S. Tan, T.C. Low, A balanced allocation strategy for file assignment in parallel I/O systems, in: Proceedings of the 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage, NAS '10, Jul. 2010, pp. 257–266.

Bin Dong received his B.S. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2008. He is currently pursuing a Ph.D. degree in computer science at Beihang University. His research interests include parallel file systems, operating systems, high performance computing, and mathematical modeling.
Xiuqiao Li received his B.S. degree in computer science and technology and his M.S. degree in computer architecture from Shandong University, China, in 2005 and 2008, respectively. Currently, he is a Ph.D. student in computer architecture at Beihang University. His research interests include parallel file systems, clusters, and cloud computing.

Qimeng Wu received his B.S. degree in computer science and technology from Beijing Jiaotong University, China, in 2009. Currently, he is a graduate student majoring in computer architecture at Beihang University, China. His research interests include parallel file systems and cloud computing.

Limin Xiao was born in 1970. He has a Ph.D., is a Professor, and is a Senior Member of the China Computer Federation. His main research areas are computer architecture, computer system software, high performance computing, virtualization, and cloud computing.

Li Ruan was born in 1978. She has a Ph.D., is a Lecturer, and is a Member of the China Computer Federation. Her main research areas are computer architecture, computer system software, high performance computing, virtualization, and cloud computing.