
Network Load Balancing Methods: Experimental

Comparisons and Improvement


Shafinaz Islam
Electrical and Electronic Engineering
Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
shafinaz.orthy16@gmail.com
Abstract—Load balancing algorithms play critical roles in systems where the workload has to be distributed across multiple resources, such as cores in a multiprocessor system, computers in distributed computing, and network links. In this paper, we study and evaluate four load balancing methods: random, round robin, shortest queue, and shortest queue with stale load information. We build a simulation model and compare the mean delay of the systems under each load balancing method. We also provide a method to improve load balancing with shortest queue and stale load information. A performance analysis of the improvement is also presented in this paper.

Keywords: Load balancing, Load balancing simulation, Random, Round robin, Shortest queue.

I. INTRODUCTION

Load balancing is a significant component of current network infrastructure and computer systems, where resources are distributed over a large number of systems and have to be shared by a large number of end users. Load balancing is the way of distributing workload across resources in order to achieve optimal resource utilization, minimum response time, or reduced overload.

Load balancing is also a basic problem in many practical systems in daily life. A common example is a supermarket model, where a central balancer, or dispatcher, assigns each arriving customer to one of a collection of servers so as to minimize response time.

It is the task of the load balancer to determine how jobs will be distributed to resources. To do this, load balancers use some type of metric or scheduling algorithm. A load balancer may or may not use the state of the servers in its scheduling strategy. If it does not use the state of the resources, the strategy is called stateless. One example of a stateless strategy is random load balancing, where the load balancer distributes the workload by picking a server at random for each job. Round robin load balancing is also a stateless strategy, where workloads are distributed in round robin fashion. On the other hand, if the load balancer uses information about the resources, the strategy is called stateful. One type of stateful load balancing is shortest queue: the load balancer obtains the updated queue lengths of the resources and assigns each job to the resource with the shortest queue.

In this paper we evaluate four types of load balancing methods: random, round robin, shortest queue, and shortest queue with stale load information. We also explore how load balancing with out-of-date state information can be improved over a simple shortest queue method. The primary performance measure of interest is the mean delay for an arriving job to complete service. Another measure of interest is the utilization of resources. To this end, we build a simulation model using CSIM20 [1], a development toolkit for simulation and modeling. We run the experiments and present the results in this paper.

II. RELATED WORKS

In [2] Zhou presents a trace-driven simulation study of dynamic load balancing in homogeneous distributed systems. From a production system, they collected jobs' CPU and I/O demands and used them as input to a simulation model that includes a representative CPU scheduling policy. They simulated seven load-balancing algorithms and compared their performance. They conclude that periodic and non-periodic information-exchange-based algorithms provide similar performance, and that periodic policies that use a salient agent to collect and distribute load information decrease overhead and provide better scalability. Eager et al. [3] explore system load information in adaptive load sharing for locally distributed systems. They studied three simple abstract load sharing policies (Random, Threshold, Shortest) and compared their performance to each other and to two bounding cases, no load sharing and perfect load sharing. The results suggest that simple load sharing policies using a small amount of state information perform dramatically better than no load sharing and nearly as well as policies that use more complex information. Dahlin [4] examines the problem of load balancing in large systems when the information about server load is stale. The paper develops strategies that interpret load information based on its age. The crux of the strategy is to use not only the stale load information of a server, but also the age of that information and a prediction of the rate at which new information arrives to change it. Load balancing in dynamic distributed systems with old information has also been studied by Mitzenmacher [5]. The paper concludes that choosing the least loaded of two random choices performs well over a large range of system parameters and better than similar strategies, in terms of the expected response time of the system.
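As an illustration, the two-choice rule studied in [5] is simple to state: sample two servers uniformly at random and send the job to the less loaded one. A minimal C sketch of this rule (our own illustration, not code from [5]):

```c
#include <stdlib.h>

/* Two-choice rule: sample two servers uniformly at random (with
 * replacement) and return the index of the one with the shorter queue. */
int two_choice_pick(const int queue_len[], int n)
{
    int a = rand() % n;
    int b = rand() % n;
    return (queue_len[b] < queue_len[a]) ? b : a;
}
```

With n servers and a unique shortest queue, a dispatch lands on that shortest queue with probability 1 - ((n-1)/n)^2 instead of 1/n under pure random choice, which is what drives the improvement reported in [5].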
Gupta et al. [6] present an analysis of join-the-shortest-queue routing for web server farms. A server farm consists of a router, or dispatcher, which takes requests and distributes jobs to one of a collection of servers for processing. They presented the idea of Single-Queue Approximation (SQA), where one designated queue in the farm is analyzed in isolation using a state-dependent arrival rate. They conclude that SQA, in some sense, captures the effect of the other queues.

For a system of two or more servers where jobs arrive according to some stochastic process, if the queue length is the only system information known, then dispatching each job to the server with the shortest queue length is one of the ways to minimize the long-run average delay to serve each customer. But in [7] Whitt shows that there are service time distributions for which it is not optimal to dispatch the job to the server with the shortest queue length. Also, if the elapsed service time of the customers in service is known, dispatching each job to the server that minimizes its individual expected delay does not always optimize the long-run average delay.

Soklic [8] introduces a diffusive load balancing algorithm and compares its performance with three other load balancing algorithms: static, round robin, and shortest queue load balancing. The results show that diffusive load balancing is better than round robin and static load balancing in a dynamic environment.

III. MODEL AND NOTATION

Our system consists of a front-end dispatcher (load balancer) that receives all incoming jobs, and five similar servers. Each server has its own queue, where a job is put if the server is currently busy. Jobs arrive at the dispatcher and are immediately transferred to one of the servers, where they join the server's queue.

The following assumptions for the system are made:

- Each server performs jobs on a first-come-first-served basis.
- Jobs arrive according to a Poisson process with arrival rate λ.
- Service times are independent and exponentially distributed with a mean of 1.0 second.
- Each server has the same service rate μ.
- When a job comes, the dispatcher is able to make a decision immediately, at wire speed. That means it takes the dispatcher zero time to decide to which server to transfer a job.
- A job can be transferred to a server in zero time with zero cost.
- A server's queue is not infinite. When there are more than 200 jobs in any queue, the system is considered to be broken.
- Once a job is sent to a server, it cannot be resent to another server. Only the dispatcher can decide to which server a job will be sent.
- Jobs are not preemptible: we cannot interrupt a task once a server starts to process it.

The dispatcher implements a load balancing algorithm, which chooses the server to transfer each job to. According to [2], a load balancing algorithm in general consists of three components:

1) The information policy specifies what information is available to the dispatcher when it makes the decision, and the way this information is distributed.
2) The transfer policy determines the eligibility of a job for load balancing.
3) The placement policy chooses, for an eligible job, a server to transfer the job to.

We assume that every job is eligible for load balancing and, thus, will be transferred to a server by the dispatcher. The information policy and the placement policy will be defined in Section IV for each of the load balancing strategies.

IV. LOAD BALANCING STRATEGIES

In this paper we consider two main types of load balancing strategies that use information about the system to make a decision:

- With up-to-date state information. The dispatcher has current information every time it makes a decision.
- With stale information. The information is updated periodically, so the dispatcher does not operate with the most recent information each time it has to transfer a job.

The case with stale information is more realistic, because in a real system we have different latencies for each server and it is extremely hard to get all the information quickly and at one time. Also, querying every server each time a job arrives generates a lot of additional traffic in the network. Thus, strategies with stale information scale better, which is very important for real systems.

A. Random strategy

In the Random load balancing strategy, an incoming job is sent to server i with probability 1/N, where N is the number of servers. This strategy equalizes the expected number of jobs at each server. With the Random strategy, the dispatcher has no information about the job or the servers' states.

B. Round Robin strategy

In the Round Robin strategy, jobs are sent to the servers in a cyclical fashion: the i-th job is sent to server i mod N, where N is the number of servers in the system. This strategy also equalizes the expected number of jobs at each server. According to [9], it produces less variability in the jobs' interarrival times at each server than the Random strategy. With the Round Robin strategy, the dispatcher stores only the ID of the server to which the last job was sent.

C. Shortest Queue strategy with up-to-date information

In the Shortest Queue strategy, the dispatcher sends each job to the server with the least number of jobs in its queue. If there are several servers with the smallest queue length, the dispatcher randomly chooses a server from this list. This strategy tries to equalize the current number of jobs at each server.
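The selection rule just described can be sketched in C as follows. This is our own illustration of the scan-with-random-tie-breaking logic, not the paper's CSIM implementation:

```c
#include <stdlib.h>

/* Join-the-shortest-queue with uniform random tie-breaking:
 * scan the queue lengths once; when a length ties the current best,
 * keep the newcomer with probability 1/(ties seen so far), which
 * leaves every tied server equally likely to be chosen. */
int shortest_queue_pick(const int queue_len[], int n)
{
    int best = 0;
    int ties = 1;                   /* candidates seen at the best length */
    for (int i = 1; i < n; i++) {
        if (queue_len[i] < queue_len[best]) {
            best = i;
            ties = 1;
        } else if (queue_len[i] == queue_len[best]) {
            ties++;
            if (rand() % ties == 0) /* reservoir sampling over the ties */
                best = i;
        }
    }
    return best;
}
```

A single pass suffices, so the dispatcher's decision cost stays linear in the number of servers even when many queues tie.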
Within this Shortest Queue strategy, the dispatcher obtains the lengths of the servers' queues every time it makes a decision. Here, the length of the queue is taken to be the number of jobs in the server and in the server's queue.

Winston [10] shows that the Shortest Queue strategy is optimal if the following conditions are met:

- There is no prior knowledge about the job.
- Jobs are not preemptible.
- The dispatcher makes a decision and transfers a job immediately when it comes.
- Each server processes tasks in first-come-first-served order.
- The job size has an exponential distribution.

In his work, Winston defines optimality as maximizing the number of jobs completed by some constant time T. So, as our model meets all the requirements, Shortest Queue with up-to-date information can be considered the optimal strategy. The problem is that we cannot use this strategy in a real system, because we cannot provide the information about the system state every time the dispatcher makes a decision.

Later in the paper we will sometimes refer to this strategy as the USQ strategy.

D. Shortest Queue strategy with stale information

In this Shortest Queue strategy, the dispatcher also sends each job to the server with the least number of jobs in its queue. The difference from strategy C is that the information about the system's state is updated only periodically, so the dispatcher does not have up-to-date information every time it sends a job to a server. Ties within this strategy are broken in the same way: if there are several servers with the smallest queue length, the dispatcher randomly chooses a server from this list.

As we said, within this Shortest Queue strategy the dispatcher also uses the lengths of the server queues when it makes the decision. But in this case the information is not recent and is updated on a periodic basis.

Later in the paper we will sometimes refer to this strategy as the SSQ strategy.

E. Shortest Queue strategy with stale information, keeping the history of job scheduling during the stale period

As we will see in Section VI, the Shortest Queue strategy works much worse when it operates with stale information compared to its results with up-to-date information. Therefore, we come up with a new strategy that helps us to improve the load balancing result when information is updated only periodically.

The main idea of this strategy is to keep a history of the job scheduling during the stale period. That is, during the stale period we keep track of which server we sent each job to, and our next decision is based not only on the queue length information but also on this history. More precisely, we increase the known queue length for a server each time we send a job to it. When an information update comes, it overwrites the lengths of the queues and thus deletes our history for that stale period. After the update we start our history from scratch.

To keep the history in this way, we do not even need additional memory, because we reuse the array in which we keep the queue lengths. So the history does not take much memory and, as we will see, can significantly increase the performance of the dispatcher.

The pseudo-code of the algorithm that implements this strategy is listed below:

GENERATE a random number for each server;
FIND the server with the shortest queue;
IF several servers have the shortest queue length
THEN pick the one with the least generated random number;
INCREASE the length of the queue for the chosen server by 1;
RETURN the ID of the chosen server as the result;

As we mentioned above, an update overwrites the information about the queue lengths every time.

Within this strategy the dispatcher also operates only with server queue length information that is updated periodically. Later in the paper we will sometimes refer to this strategy as the HSQ strategy.

V. EXPERIMENTS DESIGN

There are many tools that can help to build a simulation model. In this study we use CSIM 20 for C to build the model. CSIM is a function library that provides the developer with numerous methods and data structures to simplify the implementation of a model. The first version of the library was announced in the 1980s, so it is very stable. Also, as a C library, it provides great flexibility during project development.

During the experiments, the performance of the model is measured by the mean response time of the jobs. A job's response time is defined as the simulation time between the job's arrival and its completion at one of the servers. We want to mention that in the experiments CSIM gives us the opportunity to work with simulation time instead of CPU time. This makes our measurements more precise and independent of the machine we use.

In the experiments we want to study the influence of the following factors:

- Load balancing strategy. Within this project we compare the five different strategies listed in Section IV.
- System workload. The system workload is defined as the ratio of the arrival rate to the service rate and is usually represented as a percentage. It is easy to see that if the arrival rate is greater than the service rate, the system is not stable: the number of jobs in the queues will grow infinitely and the system will break once some queue hits the 200-job threshold. Within this project we vary the workload from 0% to 90%. We do not consider workloads above 90%, because they are unlikely to be met in a real system: someone would likely increase the number of servers before the workload got this high.
- Length of the stale period. The length of the stale period is the time between system information updates. In this project we vary this factor from 1 second to 25 seconds to study its influence.
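The workload factor can also be sanity-checked analytically for the Random strategy: splitting a Poisson stream of rate λ uniformly over N servers gives each server an independent Poisson stream of rate λ/N, so each server behaves as an M/M/1 queue with mean response time 1/(μ − λ/N) when stable. A small C sketch of this check (our own addition, not part of the CSIM model):

```c
/* System workload as a percentage: arrival rate over total service capacity. */
double workload_pct(double lambda, double mu, int n)
{
    return 100.0 * lambda / (mu * (double)n);
}

/* Expected response time of one server under the Random strategy.
 * Each server sees a Poisson stream of rate lambda/n, i.e. an M/M/1
 * queue with mean response time 1/(mu - lambda/n).  Returns -1.0 when
 * unstable (the queue grows without bound toward the 200-job cap). */
double random_strategy_mean_response(double lambda, double mu, int n)
{
    double per_server = lambda / (double)n;
    if (per_server >= mu)
        return -1.0;
    return 1.0 / (mu - per_server);
}
```

For example, with μ = 1 job/second and N = 5 servers, λ = 4 jobs/second gives a workload of 80% and a per-server mean response time of about 5 seconds under the Random strategy.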
We will use a factorial design for the experiments. Within this project we want to do the following:

- Compare the performance of the different load balancing strategies.
- Understand how the performance of the system depends on the system workload.
- Understand how the length of the stale period decreases the performance of a load balancing strategy.
- Examine whether keeping history in the Shortest Queue strategy helps us to decrease the mean response time of the system.

To make the results significant we use the CSIM facility called run length control. Using it, we stop the simulation when the results achieve the desired accuracy with 95% probability. In this project an error of 0.01 during measurements can be seen as insignificant, so this accuracy is used.

As we said, the results do not depend on the machine used. But, to be precise, we used a machine with an Intel i5 processor and 4 GB of RAM running Windows 7 Home Edition. To build the model we used C in Visual Studio 2010 Professional Edition and the CSIM 20 for C library.

VI. EXPERIMENTS AND RESULTS

A. Comparing strategies with up-to-date information

First of all, we want to decide which strategy is better if we have up-to-date information every time the dispatcher makes a decision. In Section IV Part C we showed that the Shortest Queue load balancing strategy is optimal for the system we use. The experimental results are consistent with this theoretical result: the Shortest Queue strategy works much better than the Random and Round Robin strategies. The results of the experiments are shown in Figure 1.

Figure 1. Increase of the mean response time depending on the system's workload for the Random, Round Robin and USQ strategies.

B. Using stale information in the dispatcher to make a decision

It is practically impossible to get fresh information about the system every time we need to transfer a customer to a server. Thus, stale information is used in real systems. Using such information adversely affects the mean response time. The dependence of the mean delay on the length of the stale period (the time between information updates) is shown in Figure 2. We can see that the results become worse as the stale period increases. The delay grows more significantly if the system is highly loaded. In this respect the SSQ balancing strategy is similar to the strategies in which up-to-date information is used.

Figure 2. Mean response time for the SSQ balancing strategy depending on the stale period (the time between information updates). System workloads of 20%, 60% and 80% are presented.

Comparing the mean response time of the SSQ strategy to that of the Random strategy, we can see that SSQ is worse even when the stale period is only 4 seconds. With 2 seconds between updates, the SSQ strategy works better than Random, but still much worse than USQ, which is considered the ideal strategy. Figure 3 shows the results of the experiments that support these statements.

Figure 3. Comparison of the mean response time of the Shortest Queue balancing strategy with stale information to the Random and Up-to-Date Shortest Queue strategies. Two different stale periods (2 and 4 seconds) are shown.

C. Studying the HSQ strategy performance

The purpose of developing the new load balancing strategy is to decrease the mean response time of the system when stale information is used. Our experiments show that we have achieved this goal. The results, shown in Figure 4, indicate that HSQ works much better than SSQ with the same stale period. We can also see that the delay increases much more slowly when using the HSQ strategy instead of the SSQ.
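The dispatch-plus-bookkeeping cycle behind these HSQ results (Section IV-E) is small enough to sketch in C. This is our own condensed illustration, with ties broken by lowest index rather than by the random draw used in the paper's pseudo-code:

```c
#include <string.h>

#define N_SERVERS 5

static int known_len[N_SERVERS];  /* dispatcher's view of the queue lengths */

/* HSQ dispatch: send the job to the shortest *known* queue, then record
 * the decision by incrementing that entry (the "history"). */
int hsq_dispatch(void)
{
    int best = 0;
    for (int i = 1; i < N_SERVERS; i++)
        if (known_len[i] < known_len[best])
            best = i;
    known_len[best]++;            /* remember the dispatch until the next update */
    return best;
}

/* Periodic update: fresh measurements overwrite the accumulated history. */
void hsq_update(const int fresh[N_SERVERS])
{
    memcpy(known_len, fresh, sizeof known_len);
}
```

Starting from equal known lengths, five consecutive dispatches visit each of the five servers exactly once, matching the round-robin-like behaviour the paper observes for long stale periods.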
Figure 4. Mean response time for the Shortest Queue balancing strategy with stale information and history, depending on the stale period (the time between information updates). System workloads of 60% and 80% are presented. The results are compared to the regular Shortest Queue with stale information.

Comparing the HSQ strategy to the USQ strategy, we can see that although it works worse than the optimal strategy, the difference is not as large as for the other strategies. The results of this comparison are shown in Figure 5. We also want to mention that even with an extremely long time between updates, the performance of the HSQ strategy is very close to that of the Round Robin strategy. This can be explained by careful examination of the strategy. If the stale period is very long and a significant number of jobs arrives during it, the HSQ strategy works as follows:

1) It equalizes the lengths of the servers' queues. As it transfers each job to the server with the shortest known queue and then increments that queue's length, it gives jobs to the servers with smaller queue lengths until all the queue lengths become equal.
2) The strategy then starts to work in round robin fashion, since it has equalized the lengths of the servers' queues. First one server is chosen randomly out of the 5, then one out of the remaining 4, and so on. Thus the next five jobs are scheduled to the five servers, and then the procedure repeats from the beginning.

Figure 5. Comparison of the mean response time of the Shortest Queue balancing strategy with stale information and history to the Round Robin and Up-to-Date Shortest Queue strategies, for workloads of 60% and 80%. Two different stale periods are shown.

Thus, we can say that the value of the mean response time for the HSQ strategy in the case of a very long stale period will be very close to that of the Round Robin strategy.

VII. CONCLUSION

In this paper, we studied four load balancing algorithms using a simulation model consisting of one load balancer and five servers. The results show that shortest queue with up-to-date server load information gives the lowest delay, as expected. But if the load/state information is significantly out of date, the shortest queue strategy performs even worse than random load balancing. To improve stale-information-based shortest queue, we have implemented a new history-based shortest queue load balancing strategy. Although our new approach works worse than the optimal strategy, the difference is not as large as for the other strategies. Even when the state information is very old, our approach performs close to the round robin strategy. The improved strategy can be used in real systems, as it has low memory requirements and the state information does not have to be updated very frequently to achieve reduced delay compared to the other load balancing strategies discussed in this paper.

REFERENCES

[1] http://www.mesquite.com/
[2] S. Zhou. A trace-driven simulation study of dynamic load balancing. IEEE Transactions on Software Engineering, 14:1327–1341, 1988.
[3] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, 12:662–675, 1986.
[4] M. Dahlin. Interpreting stale load information. IEEE Transactions on Parallel and Distributed Systems, 11(10):1033–1047, October 2000.
[5] M. Mitzenmacher. How useful is old information? In Principles of Distributed Computing (PODC) '97, pages 83–91, 1997.
[6] V. Gupta, M. Harchol-Balter, K. Sigman, and W. Whitt. Analysis of join-the-shortest-queue routing for web server farms. Performance Evaluation, 64:1062–1081, 2007.
[7] W. Whitt. Deciding which queue to join: some counterexamples. Operations Research, 34:226–244, 1986.
[8] M. E. Soklic. Simulation of load balancing algorithms: a comparative study. SIGCSE Bulletin, 34(4):138–141, December 2002.
[9] M. Harchol-Balter. Task assignment with unknown duration. Journal of the ACM, 49(2):260–288, March 2002.
[10] W. Winston. Optimality of the shortest line discipline. Journal of Applied Probability, 14:181–189, 1977.
