J. Parallel Distrib. Comput. 72 (2012) 1254–1268
A dynamic and adaptive load balancing strategy for parallel file system with
large-scale I/O servers
Bin Dong, Xiuqiao Li, Qimeng Wu, Limin Xiao, Li Ruan
State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
bandwidth [35]. Moreover, if the central server crashes, the whole load balancing algorithm may be down. A scalable and available load balancing method is required for parallel file systems, which may maintain hundreds or even thousands of I/O servers.

The second challenge for the load balancing algorithm is how to take the network transmission into account. On the one hand, the message exchanges for the load collection should be considered by the load balancing algorithm. One reason is that the load balancing decision should be made based on the load condition of the whole system, but frequently sending/receiving messages to collect the load may degrade the performance of the whole system [4]. On the other hand, since the network transmission latency grows as the distance among the I/O servers increases [47], keeping the load information collected for the load balancing decision up to date is very important to reduce decision delays. However, in most existing load balancing solutions [38,46,31,35], which are based on a central decision-maker or group decision-makers, sponsoring a load balancing action needs at least two network transmissions: one for load collection and the other for issuing the load balancing command. In such a case, the collected load information may become obsolete. For this reason, the sponsored load balancing actions tend to lag behind the realistic workload conditions. Network transmission therefore needs to be considered to achieve an effective load balancing algorithm for parallel file systems.

The third challenge for the load balancing algorithm is how to effectively realize its load migration. In contrast to dynamic file reallocation, which may lead to a system-wide interruption of service, dynamic file migration can transfer the load between I/O servers on-line [28]. Hence, dynamic file migration is an effective way to implement load migration. Using a proper dynamic file migration, benefits such as a reduced mean response time and improved utilization can be obtained. However, in order to ensure the consistency of the file under migration, the file requests targeted at it must be properly handled (e.g. delayed or rejected) [31]. Such side-effects of dynamic file migration may degrade the performance of the whole system. The benefits and the side-effects of each dynamic file migration need to be balanced to achieve an effective dynamic load migration.

To address the challenges above, inspired by peer-to-peer computing [36] and autonomic computing [17], we propose a dynamic and adaptive load balancing algorithm named self-acting load balancing (SALB) for parallel file systems. SALB runs on each I/O server, and the servers cooperate with each other to keep the load across the I/O servers balanced. Specifically, there are three key characteristics of SALB:

1. SALB is totally based on a distributed load balancing decision-maker without a central infrastructure. Moreover, SALB running on each I/O server would automatically construct its on-line load prediction model, adjust its load collection threshold, gather the load of other I/O servers, make decisions, choose migration candidates, and sponsor dynamic file migrations by itself. Therefore, SALB is able to deliver both the scalability and the availability required by the steadily growing parallel I/O systems.

2. SALB is aware of the network transmission. As SALB, running on each I/O server, makes its own decision, the network transmission to issue the load balancing command is avoided. Hence, sponsoring a load balancing action in SALB only needs one network transmission, which is used for load collection. Moreover, since the on-line load prediction model of SALB can automatically estimate the future load of an I/O server, other I/O servers can collect the forecast load to make decisions. In such a case, the decision delay caused by the network transmission latency can be further reduced. Furthermore, in order to reduce message exchanges for load collection, SALB employs a load collection threshold which can also be dynamically adjusted.

3. SALB adopts dynamic file migration to realize its load migration. By this method, SALB can work without interrupting the service of the whole system. Moreover, an optimization model for selecting migration candidates is incorporated into SALB so as to balance the benefits and the side-effects of each dynamic file migration.

To prove the effectiveness of SALB, comprehensive experiments were conducted. The on-line load prediction model of SALB was evaluated with BTIO [37], MADbench2 [9] and FLASH I/O [49], all of which are I/O kernels of scientific applications. The results show that the on-line load prediction model can fit the load series of I/O servers with a small mean square prediction error. Then, SALB was compared against traditional load balancing algorithms under a synthetic workload. The results show that SALB consistently delivers the best mean response time and resource utilization among the existing schemes used for comparison. Finally, the configurations of SALB were discussed with the popular parallel I/O benchmark IOR [42], and the scalability of SALB was evaluated in the context of large-scale I/O servers built from the widely accepted High-End Computing I/O Simulator (HECIOS) [41].

The rest of this paper is organized as follows. We survey the related work in Section 2. In Section 3, we outline the architecture of SALB and describe its components in Sections 4 to 8. This is followed in Section 9 by SALB's pseudocode. In Section 10, we describe the experiments evaluating SALB. Finally, Section 11 concludes this paper and discusses future research directions.

2. Related work and motivation

Evenly distributed load across the I/O servers situated in parallel file systems can help optimize the mean response time and resource utilization. However, some factors such as unsuitable file striping sizes [45], improper file allocations [56,60], application competition [27,6] and heterogeneous computing environments [54] can lead to load imbalance among the I/O servers. The approaches proposed in the literature for solving the load imbalance problem can be put into two categories: static and dynamic.

Typically, static load balancing algorithms assign the files onto the available I/O servers before they are accessed [33]. If the load of each file can be known beforehand, such a method is simple and effective. However, the load of most scientific applications varies over time and is hard to know in advance [18,43,44]. Actually, without the load distribution known beforehand, static load balancing is like the static file assignment problem, which has been proved to be NP-hard [24]. For this reason, many heuristic static load balancing algorithms [56,60] have been proposed. Typically, the existing heuristic static algorithms not only target non-partitioned files but also heavily depend on file statistics such as the file access rate. Therefore, their application may be limited in a dynamic or production environment where files are dynamically partitioned and allocated in most cases. In most mainstream parallel file systems such as PVFS [14], Lustre [11], and GPFS [39], a file is always partitioned into equal-sized chunks which are then distributed onto the available I/O servers in a round-robin fashion. In such a case, factors such as unsuitable file striping sizes [45], improper file allocations [56,60], application competition [27,6] and heterogeneous computing environments [54] may turn some I/O servers into hot spots, thereby degrading the performance of the whole system [54,27,34].

Dynamic load balancing algorithms, on the other hand, can keep the load among I/O servers balanced on-line to adapt to varying workloads without their distribution being known in advance [38]. In order to understand the existing dynamic load balancing algorithms
the whole system, a distributed load balancing decision-maker is responsible for deciding whether to trigger a dynamic file migration or not.

4. A migration candidate selection model. When this IOS needs to transfer its load, the migration candidates, including the subfile to be migrated and the target IOS accepting it, are chosen through this model to balance the benefit and side-effect of dynamic file migration.

5. A dynamic file migration algorithm. Dynamic file migration is used to transport the selected subfile to the target IOS without interrupting the system service. Also, a mutually exclusive strategy is required to ensure the consistency of the subfile under migration.

In the ensuing sections, we will explain these five components in greater detail.

Model        ACF                            PACF
AR(p)        Exponential decay              Cut off to zero after lag p
MA(q)        Cut off to zero after lag q    Exponential decay
ARMA(p, q)   Exponential decay              Exponential decay

So, the load series of MADbench2 with the trend eliminated can be modeled with an AR time series model. Actually, the AR model is desirable for the on-line prediction model because the time to build it is deterministic.

In summary, the original load series of an I/O server may show high instability and variability, but it can be transformed into a stationary one through once-differencing. Moreover, the once-differenced load series of an I/O server owns the property of the AR time series model.
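To make the once-differencing and AR-fitting idea concrete, the short NumPy sketch below differences a load series once, fits an AR(p) model to the differenced values by least squares, and produces a one-step-ahead forecast. It is only a minimal illustration of the modelling idea described above, not the FcstLoad implementation used in SALB; the helper names fit_ar and forecast_next and the sample load values are our own assumptions.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model y_t = c + a1*y_{t-1} + ... + ap*y_{t-p} by least squares."""
    y = np.asarray(series, dtype=float)
    # One row of p lagged values (newest first) for every predictable point.
    rows = [y[t - p:t][::-1] for t in range(p, len(y))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, a1, ..., ap]

def forecast_next(load_series, p=2):
    """One-step-ahead load forecast built on a once-differenced load series."""
    y = np.asarray(load_series, dtype=float)
    d = np.diff(y)                     # once-differencing removes the trend
    coef = fit_ar(d, p)
    recent = d[-1:-p - 1:-1]           # the most recent p differences, newest first
    next_diff = coef[0] + coef[1:] @ recent
    return y[-1] + next_diff           # undo the differencing

# Example: forecast the next sample of a short, hypothetical load series.
load = [10.2, 11.0, 12.5, 12.1, 13.4, 14.0, 13.8, 15.1]
print(round(forecast_next(load, p=2), 2))
```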
found that the file request sizes have a significant impact on the maximum load (Lmax). Nevertheless, from all these figures, we can see that the relationship between the response time and the load of an I/O server has the same skew as that of Fig. 6. When the load is less than a certain value (denoted by LC′t in the figure), the impact of a load change on the response time is small. In such a case, we believe that there is no need to sponsor load collections at this IOS. When the load is larger than LC′t, a small increase in the load of the I/O server will result in a dramatic growth of the response time. Under such conditions, we think this IOS needs to trigger load collection and consider load balancing. Hence, setting LCt equal to LC′t can help reduce message exchanges for load collection. Fig. 5 also indicates that LC′t for different file request sizes may be different, and thus we can choose the smallest one as LCt.

However, when the whole system with n IOSes is overloaded (which means that the load of each I/O server is greater than LCt), setting LCt equal to LC′t will still result in at most n² message exchanges. In order to address this issue, an adaptive load collection threshold adjusting algorithm named AdjustLCT is proposed for SALB. In such a case, the message exchanges are expected to be at most n²/2 in the worst case. AdjustLCT is presented in Fig. 7. The idea behind AdjustLCT is to replace LCt with the average load of all I/O servers when the whole system is overloaded, as shown in steps 4 and 5.

With the number of IOSes increasing, the load collection for all IOSes can be time-consuming. In order to reduce the time spent in load collection, a load collection function named CollLoad is also implemented in a parallel fashion. Specifically, when an IOS needs to gather the load of other IOSes, it first sends the requests to all I/O servers concurrently and then receives the loads in a batch fashion. In SALB, one IOS can respond to the load collection with different values.
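As a rough sketch of the threshold-adjustment idea described above, the snippet below replaces the load collection threshold with the average load of all I/O servers once every collected load exceeds the current threshold. It paraphrases the behaviour attributed to AdjustLCT (Fig. 7) rather than reproducing its actual steps; the function name adjust_lct and the sample loads are assumptions of ours.

```python
def adjust_lct(current_lct, collected_loads):
    """Sketch of the adaptive load collection threshold adjustment.

    When every I/O server reports a load above the current threshold,
    the whole system is considered overloaded and the threshold is
    replaced by the average load of all servers, so that only servers
    above the average keep sponsoring load collections.
    """
    if collected_loads and all(load > current_lct for load in collected_loads):
        return sum(collected_loads) / len(collected_loads)
    return current_lct

# Example: every server is above the threshold, so it is raised to the mean.
loads = [72.0, 80.5, 95.0, 88.0]    # hypothetical loads gathered by CollLoad
print(adjust_lct(70.0, loads))      # -> 83.875
```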
Fig. 5. Relationship between the response time and the load of one I/O server. (a) Request size is 64 KB and Lmax is 10.84 MiB/s; (b) 128 KB, 18.85 MiB/s; (c) 256 KB, 28.71 MiB/s; (d) 512 KB, 33.38 MiB/s; (e) 1024 KB, 35.20 MiB/s; (f) 2048 KB, 34.29 MiB/s.

Fig. 8. The example of the distributed load balancing decision. The number in parentheses is the load of each IOS.

Fig. 9. SelCand(): the migration candidate selection algorithm.
There are two reasons for requiring the IOS sponsoring load migration to own the maximum load among its collected load. First, requiring the local IOS to own the maximum load can prevent different IOSes from migrating their load to the same I/O server. Second, the IOS whose load is not the maximum but greater than the load collection threshold should not make its own decision with the load of other servers.

To explain how the distributed load balancing decision works, we can see the example presented in Fig. 8, where IOS0 is transferring its load to IOS6 and both IOS1 and IOS5 are collecting the load of other IOSes to make decisions independently. As the load series collected by IOS1 and IOS5 are the same, the ELB computed by both is 0.40, which is less than the ELBt (0.7 in this example). Since the load of IOS1 is the maximum among its collected load, IOS1 can choose one server, such as IOS7, as a migration target. In such a case, even though the ELB computed by IOS5 is also equal to 0.40, IOS5 cannot trigger a load migration because its load is not the maximum among its collected load. In this example, we can see that migrations from IOS1 and IOS5 to the same I/O server are prevented. In the next load balancing period, the ELB computed by IOS5 is 0.91. In such a case, there is no need to sponsor a load migration. Hence, requiring the I/O server with the maximum load to sponsor dynamic file migration can prevent an IOS from making fake decisions with the load of other servers. In this example, we can also see that two migrations happen concurrently between two IOS pairs. Hence, SALB can balance the load of the whole system quickly through parallel load migrations.

7. Optimization model for the migration candidate selection

The migration candidates, including the local subfile to be migrated and the target IOS which will accept it, should be selected before the load migration occurs. The subfile (also called a datafile) refers to the parts of a file that reside on the same IOS. The objective of the migration candidate selection is to balance the benefit and the side-effects of the dynamic file migration.

The benefit of load migration includes reduced response time and improved resource utilization, which may arise from the balanced load between the local I/O server and the target I/O server. After a migration between two IOSes has finished, the ideal condition should be that both IOSes own an equal load. Let the target I/O server own load ll and the local I/O server own load lh. After the file with the load li is migrated, (lh − li) should be equal to (ll + li) in the ideal condition. In such a case, as shown in Eq. (2), the load li of the migrated subfile should be half of the load difference between these two IOSes. Since the load of the subfiles is a discrete value, it is reasonable to choose the migrated file with Eq. (3). Hence, among the files whose load obeys Eq. (3), the subfile with the maximum load should be selected to maximize the load migration benefit.

li = (lh − ll) / 2,   (2)

li ≥ (lh − ll) / 2.   (3)

On the other hand, the load migration side-effect, such as the degradation of the whole system, may arise when file requests which are targeted at the subfiles under migration are rejected or delayed to keep the file data consistent. In order to reduce the migration side-effect, the time spent on migrating a subfile should be as short as possible. Usually, the time spent on transferring file data is proportional to the size of the migrated data. Hence, the load migration side-effect can be considered to be proportional to the size si of the subfile to be migrated. Therefore, in order to reduce the load migration side-effect, SALB should choose the subfile with the smallest size.

Another factor SALB should consider is the load collection threshold LCt. One reason is that when the load of the target IOS after migration is larger than LCt, it will consider itself heavily loaded and may sponsor a load collection, which will result in more load collection actions in the whole system. Hence, after a migration finishes, the load of the target I/O server should be less than LCt.

Combining all the above discussions, in order to balance the benefit and the side-effect of a dynamic file migration, the objective function for the migration candidate selection model is defined as follows:

max_(i)  li / si,   (4)

which is subject to:

li ≥ (lh − ll) / 2,
ll + li < LCt,
ll = min_{j=1,...,N} {laj},
li ∈ {l1, l2, . . . , ln}, and 0 < i ≤ n,

where {l1, . . . , ln} is the load series of the n local subfiles and {la1, . . . , laN} is the load series of the N I/O servers. The SelCand algorithm illustrated in Fig. 9 is used to find the solution of this optimization model. SelCand is like a linear search algorithm with extra constraints and therefore owns a low time complexity.
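The objective function of Eq. (4) and its constraints can be evaluated with a single linear scan over the local subfiles, which is how we read the SelCand description. The sketch below is our own rendering of that optimization model, not the algorithm of Fig. 9 itself; the subfile tuples and the example numbers are purely illustrative.

```python
def select_candidate(local_load, server_loads, subfiles, lc_t):
    """Pick the (subfile, target server) pair that maximizes load/size under the
    constraints of the migration candidate selection model.

    local_load   -- lh, load of the sponsoring (heavily loaded) server
    server_loads -- collected loads {la_1, ..., la_N} of the other servers
    subfiles     -- list of (name, load, size) tuples for the local subfiles
    lc_t         -- load collection threshold LCt
    """
    target_load = min(server_loads)           # ll: the lightest server becomes the target
    target = server_loads.index(target_load)
    half_gap = (local_load - target_load) / 2.0

    best, best_ratio = None, -1.0
    for name, load, size in subfiles:
        if load < half_gap:                   # benefit constraint, Eq. (3)
            continue
        if target_load + load >= lc_t:        # the target must stay below LCt after migration
            continue
        ratio = load / size                   # objective of Eq. (4): benefit per byte moved
        if ratio > best_ratio:
            best, best_ratio = (name, target), ratio
    return best                               # None if no subfile satisfies the constraints

# Example with made-up numbers: the second server (index 1) is the lightest,
# and subfile "b" gives the best load/size ratio among the feasible choices.
print(select_candidate(
    local_load=90.0,
    server_loads=[60.0, 35.0, 70.0],
    subfiles=[("a", 20.0, 512.0), ("b", 30.0, 256.0), ("c", 55.0, 2048.0)],
    lc_t=80.0))
```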
8. Dynamic file migration

In this section, the dynamic file migration algorithm which is responsible for transferring a subfile between two IOSes is investigated. The dynamic file migration algorithm in SALB is based on a client–server architecture. The IOS which sponsors the load migration is the client and the IOS which accepts the migrated subfile is the server. The sequence of the dynamic file migration is presented in Fig. 10. Since the distribution information of the migrated subfile should be updated after the dynamic file migration, the MDS is also involved. The migration client is implemented as a MigrationClient function, which takes IOStarget and flocal as input. The first step of the MigrationClient is to retrieve the attributes of the file flocal. Then, the MigrationClient sends a migration request to the target I/O server IOStarget. After the target IOS accepts this request, the MigrationServer function will be invoked to create a new subfile and then respond with the new subfile's handle. After the client receives the handle, it posts a flow [14] to transfer the file data. When the flow completes, the MigrationClient will send another request to the metadata server to update the distribution information of the migrated subfile. After the metadata is updated, the MigrationClient removes the local subfile to terminate the migration process. In the process of load migration, the mutually exclusive strategy presented in the following paragraph is employed to keep the data under migration consistent.

Fig. 10. Sequence diagram of the dynamic file migration.

A mutually exclusive strategy is necessary to ensure the consistency of the subfile under migration. One reason is that the parallel file system permits a file to be concurrently accessed by all clients. The implementation of the mutually exclusive strategy is platform dependent, especially for file systems with client-side caches. PVFS2, adopted as the testbed of the proposed SALB, has an attribute cache (acache) and a name space cache (ncache) at the client side [14]. In this study, the scheduler of the IOS is employed to prevent file requests from accessing the subfile under migration by sending predefined error information to the clients which require access to this subfile. Then, these clients will invalidate their local caches and request the new distribution information of the subfile from the MDS once the metadata has been updated. By adding these new cycles, file data consistency can be guaranteed.
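To summarize the message flow just described, the sketch below mirrors the client side of the migration as we read it: retrieve the subfile attributes, ask the target IOS to create a new subfile and return a handle, transfer the data, update the distribution metadata at the MDS, and remove the local copy. Every class and method name here is a stand-in invented for illustration; the actual PVFS2-based MigrationClient/MigrationServer implementation is not reproduced in this excerpt.

```python
class Subfile:
    """Minimal stand-in for a local subfile; real PVFS2 objects differ."""
    def __init__(self, name, data):
        self.name, self.data = name, data
    def get_attributes(self):
        return {"name": self.name, "size": len(self.data)}
    def remove(self):
        self.data = None

class TargetIOS:
    """Stand-in for the I/O server that accepts the migrated subfile."""
    def __init__(self):
        self.store = {}
    def create_subfile(self, attrs):       # MigrationServer role: allocate and hand back a handle
        handle = len(self.store)
        self.store[handle] = b""
        return handle
    def transfer(self, subfile, handle):   # the "flow" that moves the file data
        self.store[handle] = subfile.data

class MDS:
    """Stand-in metadata server recording where each subfile now lives."""
    def __init__(self):
        self.distribution = {}
    def update_distribution(self, name, handle):
        self.distribution[name] = handle

def migration_client(target, mds, f_local):
    """Outline of the MigrationClient sequence as described in the text."""
    attrs = f_local.get_attributes()            # 1. read the attributes of the local subfile
    handle = target.create_subfile(attrs)       # 2. ask the target IOS to create a new subfile
    target.transfer(f_local, handle)            # 3. post a flow to copy the file data
    mds.update_distribution(attrs["name"], handle)  # 4. update the distribution metadata
    f_local.remove()                            # 5. remove the local subfile to finish

target, mds = TargetIOS(), MDS()
migration_client(target, mds, Subfile("subfile-7", b"payload"))
print(mds.distribution)                         # {'subfile-7': 0}
```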
9. Put it all together: SALB self-acting load balancing algorithm

The SALB algorithm is illustrated in Fig. 11. Each IOS periodically invokes this algorithm and feeds it with a load collection threshold (LCt), an efficiency of load balancing threshold (ELBt), and its own load series L = (Lt−p, . . . , Lt−1, Lt), which is sampled at the previous p + 1 time intervals.

The first step of SALB is to estimate the one-step-ahead load lf of this IOS with the FcstLoad function, which implements the on-line load prediction model construction algorithm and is presented in Fig. 4. The second step is to compare the forecast load lf with the load collection threshold LCt. SALB will terminate if the forecast load lf is smaller than LCt. Otherwise, it continues to gather the load information of all I/O servers through the CollLoad function and then stores the collected load of all I/O servers in an array {la0, . . . , laN} in step 5. After that, the maximum load of all I/O servers is selected and stored in lmax. In step 7, elb is computed with Eq. (1). If elb violates the threshold ELBt and this server owns the maximum load among the collected load at the same time, a dynamic file migration should be triggered. Before the dynamic file migration, the migration candidate is selected with the function SelCand, which is illustrated in Fig. 9. Because a file stored in the parallel file system may be shared among all clients, the mutually exclusive strategy presented in Section 8 needs to be initialized before the dynamic file migration happens. In step 13, the MigrationClient function presented in Fig. 10 is invoked to transfer the file data from the local server to the target server and then update the file distribution metadata. If the migration has finished without error, the mutually exclusive strategy is cleared and the algorithm terminates. If the migration has some errors, these errors should be handled before the algorithm terminates.
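Condensed into a control loop, the walkthrough above reads roughly as follows. This is our paraphrase of the steps described in the text, not the pseudocode of Fig. 11; forecast_load, collect_loads, compute_elb (Eq. (1) is not reproduced in this excerpt), select_candidate and migrate are placeholders standing in for the corresponding SALB functions.

```python
def salb_round(lc_t, elb_t, load_series,
               forecast_load, collect_loads, compute_elb,
               select_candidate, migrate):
    """One periodic SALB invocation on a single I/O server (illustrative only)."""
    lf = forecast_load(load_series)        # estimate the one-step-ahead load (FcstLoad)
    if lf < lc_t:                          # below the load collection threshold: do nothing
        return "idle"
    loads = collect_loads()                # loads of all I/O servers, this one included (CollLoad)
    lmax = max(loads)                      # the highest collected load
    elb = compute_elb(loads)               # efficiency of load balancing, Eq. (1) (not shown here)
    if elb < elb_t and lf >= lmax:         # imbalance detected and this server is the most loaded
        subfile, target = select_candidate(lf, loads)   # SelCand: subfile and target IOS
        migrate(subfile, target)           # mutual exclusion setup + MigrationClient transfer
        return "migrated"
    return "no-migration"

# Toy invocation with stand-in callbacks.
print(salb_round(
    lc_t=50.0, elb_t=0.7, load_series=[40.0, 55.0, 60.0, 72.0],
    forecast_load=lambda s: s[-1] + (s[-1] - s[-2]),
    collect_loads=lambda: [84.0, 30.0, 45.0],
    compute_elb=lambda loads: min(loads) / max(loads),
    select_candidate=lambda lf, loads: ("subfile-3", loads.index(min(loads))),
    migrate=lambda subfile, target: None))
```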
10. Evaluation

In order to prove the effectiveness of the proposed SALB, comprehensive experiments are conducted in this section. The configurations of the testbed used in these experiments are presented in Table 2. The experiments are carried out according to the following steps:

First, the on-line load prediction model is tested with BTIO [37], MADbench2 [8] and FLASH I/O [49], all of which are I/O kernels of scientific applications and own real-world I/O behaviors.

Second, SALB is compared with the traditional load balancing schemes by tracing the average response time and the throughput of I/O servers under a synthetic load.

Finally, the configurations of SALB are discussed with the parallel I/O benchmark IOR [42], and the scalability of SALB is evaluated with the widely accepted High-End Computing I/O Simulator (HECIOS) [41].

10.1. On-line load prediction model evaluation

In this section, we first apply the on-line load prediction model to three I/O kernels of scientific applications: (1) BTIO, based on the Block-Tridiagonal problem of NPB, which derives from computational fluid dynamics applications; (2) MADbench2, the I/O kernel for the MADspec astronomy code; and (3) FLASH I/O, the benchmark created to model the I/O precisely as in the code of the FLASH astrophysics application. Then, we investigate the performance of the on-line load prediction under a mixed load where different applications are running simultaneously.

10.1.1. On-line load prediction model evaluation

The problem class of BTIO used in this experiment is C [37]. Problem class C is the second largest problem size in the BTIO configuration and it is also widely used to evaluate I/O performance optimization strategies. The configuration of MADbench2 in these experiments is presented in Table 3. The load of the FLASH I/O is extracted from a trace log of the FLASH I/O which runs on 512 processes [49]. Moreover, the mixed load which is used to evaluate the on-line prediction model is sampled at a random I/O server when BTIO and MADbench2 are running simultaneously.

The experimental results are presented in Fig. 12. As the figure shows, the load series of all applications show high variability and instability. However, it is easy to identify that the forecast load of the on-line load prediction model can fit the observed load very well in all cases.
Table 2
Configurations of the experimental system.

Table 3
MADbench2 configuration.

Parameter   Value    Parameter     Value
NO_PIX      2000     NO_GANG       1
NO_BIN      128      FBLOCKSIZE    256
RMOD        1        SBLOCKSIZE    256
WMOD        1        PROCESSES     64

Specifically, the mean square prediction errors for BTIO, MADbench2, FLASH I/O and the Mixed Load are 2.81, 0.67, 2.48, and 1.93, respectively. The mean square prediction error for MADbench2 is the smallest. One reason is that MADbench2 has a smaller average load than the other workloads. Hence, the on-line load prediction model presented in this paper is an effective approach to forecast the one-step-ahead load of an I/O server.

10.1.2. Discussion of the fitted on-line load prediction model

The on-line load prediction model is based on the AR time series model. The maximum order of the fitted AR models is six and the minimum order is one. The percentages of each AR time series model among all fitted models are presented in Fig. 13. Among all the models identified for BTIO, MADbench2, and FLASH I/O, the sum of the AR(1) and the AR(2) models accounts for 97.9%, 82.2%, and 70.11%, respectively. Even though the sum of the AR(1) and the AR(2) models identified for the Mixed Load is only around 20.97%, the models whose order is smaller than five account for 94.81%. In a word, the AR models with low orders dominate the on-line load prediction models. Actually, according to the classic locality principle [19], which states that most programs need the same data or instruction sequence multiple times, the clients accessing data from one I/O server will visit the same I/O server next time with maximum probability. Therefore, the load sampled at one I/O server at adjacent times tends to show high correlation. Hence, extending the AR model to build the on-line load prediction model is reasonable.

The time to build the extended AR time series models is also an important factor, because they are invoked at each cycle of load balancing. The average time for building extended AR models with different orders is presented in Fig. 14. As the figure shows, the time to build the extended AR(1) model is the minimum, and there is then an upward trend as the order increases. The time for fitting the AR(6) time series model has the maximum value, 0.7 ms. Nevertheless, the time to build the on-line load prediction model is so short that it can be ignored. Hence, the proposed on-line load prediction model can be built in a deterministic and short time, which is desirable in an on-line load balancing algorithm.
Fig. 13. The popularity of each fitted AR time series model: (a) BTIO; (b) MADbench2; (c) FLASH I/O; (d) Mixed load.
Table 4
Impact of different LCt values on the performance improvement, number of load collections, and number of migrations.

LCt    Write imprv. (%)   Read imprv. (%)   No. of load collections   No. of migrations   No. of load collections / No. of migrations
90     4.54               0.96              1                         1                   1
85     6.51               1.47              1                         1                   1
80     10.12              5.49              3                         3                   1
75     10.93              6.74              2                         2                   1
70     10.05              2.90              4                         2                   2
65     9.63               2.88              9                         4                   2.25
60     17.02              6.99              165                       56                  2.95
55     16.72              8.16              444                       107                 4.15
50     12.65              5.20              704                       153                 4.60
45     13.26              5.94              1217                      225                 5.41
40     14.59              2.28              1630                      258                 6.31
or 55%, the occurrence of the load of the I/O server violating the threshold is so rare that dynamic file migrations are not triggered frequently. Hence, it is reasonable to conclude that the proposed self-acting load balancing can deliver improved performance for a wide range of load collection thresholds, and setting the load collection threshold to be the average load of all I/O servers can maximize the performance of the proposed self-acting load balancing.

Second, the impact of different values of the ELB threshold (ELBt) on the performance of the parallel I/O is tested. In this experiment, LCt is set to 60% of the maximum load of an I/O server. ELBt is another important parameter for the proposed SALB algorithm. The results of the experiment are presented in Fig. 17. As the figure shows, the value of ELBt ranges from 0.4 to 0.9. The performance improvement percentage for both read and write increases at the initial stage and reaches a peak when the value of ELBt is 0.8. Then, the performance improvement percentage for the read declines and the performance improvement percentage for the write keeps stable. One reason is that when the value of ELBt is too big, dynamic file migrations become so frequent that they would degrade the performance of the whole system. Hence, in order to fully reap the performance improvement from the proposed self-acting load balancing, the threshold of the efficiency of load balancing is suggested to be set to 0.8.

Fig. 17. Impact of different ELBt values on the performance improvement of IOR.

10.3.2. Scalability evaluation for SALB

In this section, SALB is evaluated in the context of a large-scale parallel I/O storage system which is simulated with the High-End Computing I/O Simulator (HECIOS) [41]. HECIOS is driven by the trace logs of real scientific applications and permits us to evaluate the optimizations proposed for large-scale storage systems that are not widely available, or even at scales that are not yet in production. Actually, simulation has been successfully used to evaluate the metadata load balancing for petabyte-scale file systems [52], and HECIOS has been applied to study the client-side cache [41], the middle-ware level client cache [5], the scalable directory of parallel file systems [53] and the server-to-server communication [15]. Hence, it is reasonable to employ HECIOS to evaluate the scalability of SALB.

HECIOS is built around the OMNeT++ simulation package [51] and shares almost the same architecture as PVFS2. However, HECIOS is a large-scale parallel I/O system simulator driven by the trace logs of real scientific applications. In such a case, all the actions of HECIOS are triggered by clients and thus there is no performance scheduler at the I/O servers of HECIOS. To resolve this problem, we employ the OMNeT++ self-messages to periodically invoke SALB at each I/O server. Another major consideration about implementing SALB in HECIOS is that there is no stand-alone metadata server in HECIOS. Instead, HECIOS stores its metadata in one object of a C++ singleton class. In such a case, there is no need to perform communication to update the metadata of the file after its migration has finished. Other parts of SALB in HECIOS are the same as those in the real PVFS2. Each file system server in HECIOS feeds SALB with a load collection threshold, an efficiency of load balancing threshold, and its own previous load series. In order to simulate the load imbalance among I/O servers, in the initial stage of HECIOS, the self-similar distribution function defined in Eq. (5) above is used to guide the assignment of subfiles among the available I/O servers in the file system. The value of x/y is set to 40/60, which means 40% of the subfiles would be assigned to 60% of the I/O servers.

In this experiment, HECIOS uses settings similar to the Beowulf cluster Palmetto [48] of Clemson University. Palmetto provides two interconnection networks: a Gigabit Ethernet network and a Myrinet Myri-10G network. HECIOS supports both of the networks by using the INET network simulation components of OMNeT++. However, INET just provides accurate TCP/IP simulation, and the Myrinet of HECIOS is offered through adjusting the Ethernet network settings to approximate Myrinet performance. Therefore, we directly use the Gigabit Ethernet networking model of HECIOS to evaluate SALB. The trace log files used to drive the experiment are gathered from the Argonne National Laboratory and can be downloaded from the parallel architecture research laboratory of Clemson University [49].

The number of I/O servers we evaluated in this experiment covers 8, 16, 32, 64, 128, and 256. The results are presented in Fig. 18. The horizontal axis is the number of I/O servers and the vertical axis is the read and write performance. From the figure, we can see that both the read and write performance of HECIOS own high scalability. When HECIOS is integrated with SALB, the performance for both the read and the write increases from hundreds of megabytes per second to almost four gigabytes per second. Specifically, the performance with SALB is almost three times higher than that without load balancing. The performance improvement for HECIOS is more obvious than that for IOR in the previous section. The reason is that, when the subfile assignment of HECIOS is guided by the
self-similar distribution in this test, some I/O servers would hold more than one subfile and some I/O servers are idle. For example, when the number of I/O servers in the test is 256, the number of I/O servers which are actually assigned subfiles approximates 215. Further, among these 215 I/O servers, 25.7% of the subfiles are assigned to around 20 I/O servers. In such a case, the heavily loaded I/O servers with SALB can select these idle I/O servers as migration targets. As a result, the load is distributed evenly among the I/O servers. Hence, we can conclude that SALB has high scalability and can improve the performance of parallel file systems when the load among the I/O servers situated in them is imbalanced.

11. Conclusion and further work

In this paper, we present a dynamic and adaptive load balancing algorithm named self-acting load balancing (SALB) to tackle the load imbalance issue among the I/O servers situated in parallel file systems. The feasibility of SALB has been shown in a number of performance experiments.

SALB is totally based on a distributed architecture, where each I/O server can make a load balancing decision and sponsor dynamic file migration by itself. The distributed property of SALB enables it to deliver the scalability and availability required by the steadily growing parallel I/O systems. We have shown that SALB is a very effective method for balancing load in the context of large-scale I/O servers built from the widely accepted High-End Computing I/O Simulator (HECIOS) [41]. Moreover, in the exascale era, high performance computing applications such as climate analytics [3] may handle data distributed across several countries. In such a case, a distributed load balancing algorithm should play a more important role in data management. Hence, we believe that SALB provides a good framework of reference for future work.

We have developed an on-line load prediction model to forecast the one-step-ahead load of an I/O server. In such a case, SALB running on one I/O server can collect the forecast load of other I/O servers to make its load balancing decision. For this reason, the impact of network transmission latency on the decision delay can be reduced. By taking into account both the workload characteristics of scientific applications and the time to build different time series models, we choose the AR time series as the basis of the on-line load prediction model. We have shown that the developed model can fit the load of an I/O server with small mean square prediction errors. The on-line load prediction model can be further improved. For example, the load series of the I/O servers may own seasonal components [12], which are generated by the loop code in applications. Through taking seasonal components into account, the accuracy of the on-line load prediction should be further improved.

In order to reduce the message exchanges among I/O servers for load collection in SALB, we propose an adaptive load collection threshold adjustment algorithm to prevent frequent load collection. We have shown that the adaptive load collection threshold can sharply reduce the message exchanges and guarantee the effectiveness of load balancing decisions simultaneously. Even though the message exchanges among n I/O servers are around n²/2 in the worst case, we believe that the load collection can be further reduced through our on-line prediction model. For example, the on-line load prediction model can be extended to do a long-term load forecast for the load collection from other I/O servers. In such a case, SALB running on one I/O server can make its decision based on the previously collected load information.

Furthermore, SALB employs dynamic file migration to implement its load migration. One advantage of dynamic file migration is that it can perform load migration without interrupting the system service. Moreover, in order to balance the benefits and the side-effects of dynamic file migration, SALB employs an optimization model for selecting the migration candidates. Even though our load balancing strategies are orthogonal to fault tolerance techniques, the relationship between load balancing and data replication can be explored to further improve the effectiveness of dynamic file migration. For example, the I/O servers which hold replicated data should own a higher priority than others when SALB is selecting migration candidates. Therefore, the side-effects of the dynamic file migration may be further reduced.

Acknowledgments

The final version has benefited greatly from the many detailed comments and suggestions from the anonymous reviewers. The authors gratefully acknowledge these comments and suggestions. The work described in this paper is supported by the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2009ZX-01, the National Natural Science Foundation of China under both Grant No. 60973007 and No. 61003015, the Doctoral Fund of Ministry of Education of China under Grant No. 20101102110018, the Fundamental Research Funds for the Central Universities under Grant No. YWF-10-02-058, the Hi-tech Research and Development Program of China (863 Program) under Grant No. 2011AA01A205, and the National Core Electronic Devices, High-end General Purpose Chips and Fundamental Software Project under Grant No. 2010ZX01036-001-001.

References

[1] A. Shoshani, S. Klasky, R. Ross, Scientific data management: challenges and approaches in the extreme scale era, in: Proceedings of the 2010 Scientific Discovery through Advanced Computing (SciDAC) Conference, USA, Jul. 2010.
[2] H. Akaike, A new look at the statistical model identification, IEEE Trans. Autom. Control 19 (1974) 716–723.
[3] G. Aloisio, S. Fiore, Towards exascale distributed data management, Int. J. High Perform. Comput. Appl. 23 (2009) 398–400.
[4] M. Andreolini, S. Casolari, M. Colajanni, Models and framework for supporting runtime decisions in web-based systems, ACM Trans. Web 2 (2008) 17:1–17:43.
[5] M. Bassily, A middle-ware level client cache for a high performance computing I/O simulator, Ph.D. Thesis, Clemson University, 2009.
[6] A. Batsakis, R. Burns, A. Kanevsky, J. Lentini, T. Talpey, CA-NFS: a congestion-aware network file system, Trans. Storage 5 (2009) 15:1–15:24.
[7] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, M. Wingate, PLFS: a checkpoint file system for parallel applications, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC'09, Nov. 2009, pp. 21:1–21:12.
[8] J. Borrill, J. Carter, L. Oliker, D. Skinner, Integrated performance monitoring of a cosmology application on leading HEC platforms, in: Proceedings of the 2005 International Conference on Parallel Processing, ICPP'05, 2005, pp. 119–128.
[9] J. Borrill, L. Oliker, J. Shalf, H. Shan, A. Uselton, HPC global file system performance analysis using a scientific-application derived benchmark, Parallel Comput. 35 (2009) 358–373.
[10] G.E.P. Box, G. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day Incorporated, 1990.
[11] P.J. Braam, The Lustre storage architecture, Tech. Rep., Aug. 2004. Available: http://wiki.lustre.org/.
[12] P.J. Brockwell, R.A. Davis, Introduction to Time Series and Forecasting, second ed., Springer, 2002.
[13] P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, T. Ludwig, Small-file access in parallel file systems, in: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS'09, May 2009, pp. 1–11.
[14] P.H. Carns, W.B. Ligon III, R.B. Ross, R. Thakur, PVFS: a parallel file system for Linux clusters, in: Proceedings of the 4th Annual Linux Showcase & Conference, Volume 4, Oct. 2000, pp. 317–327.
[15] P.H. Carns, B.W. Settlemyer, W.B. Ligon III, Using server-to-server communication in parallel file systems to simplify consistency and improve performance, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC'08, Nov. 2008, pp. 1–8.
[16] S. Casolari, M. Colajanni, Short-term prediction models for server management in Internet-based contexts, Decis. Support Syst. 48 (2009) 212–223.
[17] IBM Corp., An architectural blueprint for autonomic computing, Tech. Rep., 2006. Available: http://users.encs.concordia.ca/.
[18] P.E. Crandall, R.A. Aydt, A.A. Chien, D.A. Reed, Input/output characteristics of scalable parallel applications, in: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, Supercomputing'95, Dec. 1995, p. 59.
[19] P.J. Denning, The locality principle, Commun. ACM 48 (2005) 19–24.
[20] P.A. Dinda, Design, implementation, and performance of an extensible toolkit for resource prediction in distributed systems, IEEE Trans. Parallel Distrib. Syst. 17 (2006) 160–173.
[21] P.A. Dinda, D.R. O'Hallaron, Host load prediction using linear models, Cluster Comput. 3 (2000) 265–280.
[22] J. Dongarra, P. Beckman, T. Moore, et al., The international exascale software project roadmap, Int. J. High Perform. Comput. Appl. 25 (2011) 3–60.
[23] B. Dong, X. Li, L. Xiao, L. Ruan, B. Yu, Self-acting load balancing with parallel sub file migration for parallel file system, in: Proceedings of the 2010 Third International Joint Conference on Computational Science and Optimization, Volume 02, CSO'10, 2010, pp. 317–321.
[24] L.W. Dowdy, D.V. Foster, Comparative models of the file assignment problem, ACM Comput. Surv. 14 (1982) 287–313.
[25] M. Eshel, R. Haskin, D. Hildebrand, M. Naik, F. Schmuck, R. Tewari, Panache: a parallel file system cache for global file access, in: Proceedings of the 8th USENIX Conference on File and Storage Technologies, FAST'10, Feb. 2010, pp. 155–168.
[26] G. Fragidis, K. Tarabanis, The business strategy perspective on the development of decision support systems, in: Proceedings of the CIMCA-IAWTIC'06, Vol. 02, IEEE Computer Society, Washington, DC, USA, 2005, pp. 968–975.
[27] W. Frings, F. Wolf, V. Petkov, Scalable massively parallel I/O to task-local files, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC'09, Nov. 2009, pp. 1–11.
[28] B. Gavish, O.R. Liu Sheng, Dynamic file migration in distributed computer systems, Commun. ACM 33 (1990) 177–189.
[29] A. Geist, Paving the roadmap to exascale, Tech. Rep., Oak Ridge National Laboratory, 2010. Available: http://www.scidacreview.org.
[30] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, 1973.
[31] J.M. Kunkel, Towards automatic load balancing of a parallel file system with subfile based migration, Master's Thesis, Heidelberg University, 2007.
[32] S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, W. Allcock, I/O performance challenges at leadership scale, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC'09, Nov. 2009, pp. 1–12.
[33] L. Lee, File assignment in parallel I/O systems with minimal variance of service time, IEEE Trans. Comput. 49 (2) (2000) 127–140.
[34] D. Lee, R.S. Ramakrishna, Improving disk I/O load prediction using statistical parameter history in online for grid computing, IEICE Trans. Inf. Syst. E89-D (2006) 2484–2490.
[35] W. Liu, M. Wu, X. Ou, W. Zheng, M. Shen, Design of an I/O balancing file system on web server clusters, in: Proceedings of the 2000 International Workshop on Parallel Processing, ICPP'00, 2000, pp. 119–126.
[36] D.S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, Z. Xu, Peer-to-peer computing, Tech. Rep., Hewlett-Packard Company, 2005. Available: http://www.hpl.hp.com/.
[37] P. Wong, R.F. Van der Wijngaart, NAS parallel benchmarks I/O, Version 2.4, Tech. Rep., NASA Advanced Supercomputing Division, 2003. Available: http://www.nas.nasa.gov/publications/npb.html.
[38] P. Scheuermann, G. Weikum, P. Zabback, Data partitioning and load balancing in parallel disk systems, The VLDB J. 7 (1998) 48–66.
[39] F. Schmuck, R. Haskin, GPFS: a shared-disk file system for large computing clusters, in: Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST'02, USENIX Association, 2002, pp. 231–244.
[40] B. Schroeder, G. Gibson, A large-scale study of failures in high-performance computing systems, IEEE Trans. Dependable Secur. Comput. 7 (2010) 337–351.
[41] B.W. Settlemyer, A study of client-based caching for parallel I/O, Ph.D. Thesis, Clemson University, 2008.
[42] H. Shan, K. Antypas, J. Shalf, Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC'08, Nov. 2008, pp. 42:1–42:12.
[43] E. Smirni, D.A. Reed, Workload characterization of input/output intensive parallel applications, in: Proceedings of the 9th International Conference on Computer Performance Evaluation: Modelling Techniques and Tools, Springer-Verlag, 1997, pp. 169–180.
[44] E. Smirni, D.A. Reed, Lessons from characterizing the input/output behavior of parallel scientific applications, Perform. Eval. 33 (1998) 27–44.
[45] H. Song, Y. Yin, X.-H. Sun, R. Thakur, S. Lang, A segment-level adaptive data layout scheme for improved load balance in parallel file systems, in: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID'11, 2011, pp. 414–423.
[46] W. Sun, J. Shu, W. Zheng, Dynamic file allocation in storage area networks with neural network prediction, in: International Symposium on Neural Networks, in: Lecture Notes in Computer Science, 2004, pp. 133–140.
[47] Y. Tamura, S. Kasahara, Y. Takahashi, S. Kamei, R. Kawahara, Inconsistency of logical and physical topologies for overlay networks and its effect on file transfer delay, Perform. Eval. 65 (2008) 725–741.
[48] Top500. Available: http://www.top500.org/.
[49] MPI I/O test trace log and FLASH I/O 512P trace log. Available: http://www.parl.clemson.edu/, 2011.
[50] N. Tran, D.A. Reed, Automatic ARIMA time series modeling for adaptive I/O prefetching, IEEE Trans. Parallel Distrib. Syst. 15 (4) (2004) 362–377.
[51] A. Varga, The OMNeT++ discrete event simulation system, in: Proceedings of the European Simulation Multiconference, Jun. 2001, pp. 319–324.
[52] S.A. Weil, K.T. Pollack, S.A. Brandt, E.L. Miller, Dynamic metadata management for petabyte-scale file systems, in: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC'04, 2004, p. 4.
[53] Y. Wu, A study for scalable directory in parallel file systems, Ph.D. Thesis, Clemson University, 2009.
[54] C. Wu, R. Burns, Handling heterogeneity in shared-disk file systems, in: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC'03, Nov. 2003, p. 7.
[55] Y. Wu, Y. Yuan, G. Yang, W. Zheng, Load prediction using hybrid model for computational grid, in: Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, GRID'07, 2007, pp. 235–242.
[56] T. Xie, Y. Sun, A file assignment strategy independent of workload characteristic assumptions, Trans. Storage 5 (2009) 10:1–10:24.
[57] B. Yagoubi, Distributed load balancing model for grid computing, ARIMA J. 12 (2010) 43–60.
[58] Y. Zhang, W. Sun, Y. Inoguchi, Predicting running time of grid tasks based on CPU load predictions, in: Proceedings of the 7th IEEE/ACM International Conference on Grid Computing, GRID'06, 2006, pp. 286–292.
[59] Y. Zhao, W. Huang, Adaptive distributed load balancing algorithm based on live migration of virtual machines in cloud, in: Proceedings of the 2009 Fifth International Joint Conference on INC, IMS and IDC, 2009, pp. 170–175.
[60] Y. Zhu, Y. Yu, W.Y. Wang, S.S. Tan, T.C. Low, A balanced allocation strategy for file assignment in parallel I/O systems, in: Proceedings of the 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage, NAS'10, Jul. 2010, pp. 257–266.

Bin Dong received his B.S. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2008. He is currently pursuing a Ph.D. degree in computer science at Beihang University. His research interests include parallel file systems, operating systems, high performance computing, and mathematical modeling.
Xiuqiao Li received his B.S. degree in computer science and technology and his M.S. degree in computer architecture from Shandong University, China, in 2005 and 2008, respectively. Currently, he is a Ph.D. student in computer architecture at Beihang University. His research interests include parallel file systems, clusters, and cloud computing.

Qimeng Wu received his B.S. degree in computer science and technology from Beijing Jiaotong University, China, in 2009. Currently, he is a graduate student majoring in computer architecture at Beihang University, China. His research interests include parallel file systems and cloud computing.

Limin Xiao was born in 1970. He has a Ph.D., is a Professor, and is a Senior Member of the China Computer Federation. His main research areas are computer architecture, computer system software, high performance computing, virtualization, and cloud computing.

Li Ruan was born in 1978. She has a Ph.D., is a Lecturer, and is a Member of the China Computer Federation. Her main research areas are computer architecture, computer system software, high performance computing, virtualization, and cloud computing.