
CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 9(1), 1–19 (JANUARY 1997)

Application-level load migration and its implementation on top of PVM
JIANJIAN SONG1 , HENG KEK CHOO1 AND KUOK MING LEE2
1 National Supercomputing Research Center, National University of Singapore, Singapore 0511
2 Singapore Aerospace Ltd, 540 Airport Road, Singapore 1953

SUMMARY
We describe the development of, and experiments with, a load (process) migration scheme that is conceptually similar to moving house. The basic idea is to migrate a process by starting a new process
on another processor with checkpoint data prepared by the old process itself but transferred
automatically by the migration system. The new process will then unpack the data and resume
the computation. The migration mechanism of our facility is implemented by a set of library calls
on top of PVM. It performs functions such as freezing and unfreezing communications, checking
load conditions, selecting destination processors, starting new processes and receiving migrated
data. Before migrating, a process needs to freeze communication, handle pending messages in
the receive buffer and pack checkpoint data. Besides the usual merits of concurrency, location
transparency and the absence of residual dependency, our scheme solves the incoming message
problem at the application level and is portable and easy to use in a heterogeneous environment.
Our experiment shows that our facility can help to utilize 74% of idle CPU cycles of a network
of workstations with less than 6% overhead on their normal operations.

1. INTRODUCTION
Krueger and Chawla in [1] discussed the results of a survey of 199 Sun-3/50 diskless
workstations and 18 Sun servers connected on a network at the Ohio State University over
a 3-month period. Their conclusions are the following: (i) some of the users do need more
computing power; (ii) there is a huge computing resource that can be tapped from a network
of workstations; (iii) this combined resource can help to reduce the computing loads of
those heavily used workstations. Many successes have been reported on using workstation
networks for load sharing and parallel computing[2–6].
Although computing power can be borrowed from other people’s workstations, the
borrower must withdraw from using the machines when their owners want to use them. If
the borrower’s programs have been running on the machines for some time, they should
be migrated with the intermediate results to other workstations to continue. This way, the
owner has priority over his machines while the borrowed power is not wasted.
This paper discusses one solution to this type of load migration problem, where a distributed-memory architecture is assumed. Load migration can be defined as moving computation
load (processes, tasks, objects, etc.) from one processor to another without affecting the
execution of the computation and its environment. Since our discussion is in the context
of the UNIX operating systems, we shall use the term ‘process migration’ more often but
bear in mind that our scheme does not really migrate UNIX processes. There are a number
of other approaches to the problem such as migrating UNIX processes (also called ‘task
transfer’), threads, objects or user data, etc.[7,8].

CCC 1040–3108/97/010001–19
©1997 by John Wiley & Sons, Ltd.
Received 1 October 1994
Revised 1 August 1995

There are a number of other reasons why load migration is needed, such as load sharing,
load balancing, application concurrency and owner protection[9]. Another reason for pro-
cess migration is to move a process to where its data resides[10]. Load migration techniques
can also help hide the internal complexity of computing technology.
There are two basic choices in migrating computation load. One is to physically copy
a process (or a thread) to another processor and the other is to start a new process to
continue the computation. Most previous work on process migration has concentrated on the
former. Many techniques have been proposed for mechanisms to migrate live processes[7].
However, migrating a live process has had limited success even in homogeneous environments, because the operating-system kernel may be involved and a large amount of checkpoint data must be transferred. In addition, issues such as virtual memory, open files, message channels, execution state and other kernel state must be handled. There are further difficulties such as portability and asynchronous communications.
Migrating a live process is conceptually similar to moving house without the awareness
of the owner. It is difficult if not impossible to move house to a new location without
the knowledge of the residents while trying to keep all the functions and communications
exactly the same as they appeared to the residents. These approaches have had limited success, and only under strict conditions: the old and new houses must be similar in size and design, or at least appear the same to the residents. Various techniques have been tried to move every item in the old house and place it in exactly the same location and state in the new one. Such approaches are not only time consuming but also carry too many restrictions to be portable.
We have designed and implemented a new scheme for load migration that is much closer
in principle to the practice of moving house. The owner of a process packs the items
belonging to the process (checkpoint data) and then asks our process migration system (the
mover) to move the process with the packed items to another processor as well as to change
communication links automatically. A new process will be started on the new processor
and unpack the items to resume the computation.
Our scheme is faster than moving a live process because it involves less data. It is also
easy to port because it is implemented on top of PVM which is portable. In addition, it is
easy to use because there are only five calls that a user needs to know, besides the PVM
calls.
Our implementation was made on top of PVM because we did not want to modify the
source code of PVM. Our experiments with CFD and digital filter design applications show
that our facility can help to utilize 74% of idle CPU cycles of a two-workstation network
with less than 6% overhead to their normal operations.
One problem with our implementation is the possibility of some messages being lost
while the receiving process is being migrated. Our experiments suggest some remedies
for the problem. We also point out how PVM internals can be modified to eliminate the
problem.
At a high level, process migration may involve the following steps:

1. checkpoint a process
2. decide whether to migrate
3. stop the process and assign it to another processor
4. inform all the related processes of the change
5. start the migrating process from its checkpoint.

A load migration scheme may consist of two parts: mechanism and policies.
Migration mechanism involves collecting statistics about processors and transferring
processes. Statistics include data on machine load (number of processes, links, and CPU
and network loads), individual processes (age, state, CPU utilization, and communication
rate), and selected active links (packets sent and received), etc.[9]. Transferring processes
involves handling virtual memory space, opened files, message channels, execution state
and other kernel states, and signals.
The mechanism is usually implemented by the operating system and its utilities. It could
be included in the operating system kernel such as in Charlotte[9] and Sprite[11], or outside
the kernel as in Condor [12], or at an even higher level as in Piranha[13].
Migration policies, on the other hand, deal with what processes to migrate (selection
policy), when to migrate (transfer policy), where to migrate (location policy), what infor-
mation to use (information policy), and who makes policies, etc.[11,14]. Policies can be
embedded within the mechanism or decided by the user. There is a tradeoff between the
complexity and portability of policy strategies.
The parallel virtual machine (PVM) is a software package developed at the Oak Ridge National Laboratory in the USA. It handles data conversion, process initialization, termination, communication and synchronization across networks of heterogeneous UNIX
computers[15]. PVM processes communicate with each other by using message-passing
constructs. Applications can be parallelized using PVM to run on a network of heteroge-
neous UNIX machines.

2. OTHER APPROACHES
This Section summarizes the mechanism and policies of some typical solutions to the load
migration problem, namely, Charlotte, Condor, Piranha and ADM. Although there are a
number of other systems such as Sprite[11], the V system[16], and the EMPS[17], etc.,
the four chosen here are representative of four distinctive approaches to load migration in
terms of the user’s involvement.
The extent of the user’s participation in load migration increases from Charlotte, Condor,
Piranha, to ADM. In Charlotte, the mechanism of detaching, transferring and reattaching a
migrated process is fixed in its kernel and the address space of a migrating process is copied
to the destination processor. In Condor, the mechanism is outside the UNIX kernel and a
checkpoint file is created by the Condor system and copied to the destination processor.
Piranha leaves the checkpointing to the user and uses shared memory to migrate data
implicitly. ADM leaves the whole responsibility of load redistribution to the user while the
system supplies load information only.

2.1. Charlotte
Charlotte is a message-based distributed operating system developed at the University of
Wisconsin[9]. Process migration is implemented in the kernel of Charlotte.
Mechanism. Charlotte collects statistics such as load (number of processes, links, CPU load
and network load), individual processes (age, state, CPU utilization and communication
rate) and selected active links (packets sent and received).
The address space of a migrating process is copied to the destination processor. The
link of the process is changed and passed to the related processes. Kernel data structures
pertaining to the process are marshaled, transferred and demarshaled.
Policies. A starter process decides when to migrate a process based on the statistics and
manual directions from a policy procedure given by the user. Two processors will negotiate
to see if a process can be migrated from one to the other. Charlotte process migration
does not work in a heterogeneous environment as it transfers address space between two
processors.

2.2. Condor
‘Condor is a software package for executing long-running, computation-intensive jobs on
workstations which would otherwise be idle[12].’ Condor uses a mechanism called ‘Remote
UNIX’ which starts a shadow process locally and a remote process on a remote processor.
The remote process carries out all the computation but will send any UNIX system calls to
the shadow process. Therefore, a consistent file system and environment can be maintained
although the remote process may be migrated to other processors.
Mechanism. Condor checkpoints a process regularly. A checkpoint file contains the text, data, bss and stack segments of a process, the registers, the status of open files, and any messages sent by the process to its shadow for which a reply has not been received.
Migration is pre-emptive. When a process is migrated, its checkpoint file is copied to
another processor and the process is restored from its checkpoint.
Policies. Condor’s local scheduler checks the load average and/or keyboard activity to see
if a foreign process should be suspended. If a foreign process has been suspended longer
than a prespecified time, it will be asked to leave the processor. If the process does not
leave sometime after it has been asked to do so, it will be forced to terminate.
Condor has a central controller that polls each processor at two-minute intervals to
find idle processors as well as processors with waiting background tasks. A processor is
considered idle if its owner has not been active for at least 12.5 min. The controller sends
a remote task to an idle processor.
The main advantage of Condor is its portability since it is implemented outside the UNIX
kernel. The main shortcoming of Condor is that it assumes that a migrating process should
have no interprocess communication, signals or child processes.

2.3. Piranha
Piranha implements the master–worker model of parallel computing. The master (called
‘feeder’ in the Piranha system) resides on a processor permanently throughout computation
but the workers (‘piranha’) can migrate among a number of processors. To restore the global
state of a process, checkpointing is implemented by a function called ‘retreat’ which is
executed just before a worker is terminated. The retreat function is written by the user to
save intermediate results so that the process can continue on another processor[13].
Mechanism. Piranha implements globally shared memory and therefore it does not need
the migration mechanism associated with copying the virtual address space of a process.
Policies. Each processor has a Piranha daemon running at low priority to monitor whether
the processor is available for foreign processes. If a processor becomes unavailable, the
retreat function is called and the foreign processes running on it are terminated.
Piranha uses a distributed job scheduler to decide where to migrate a process. A migrating
process first returns to a global pool of pending jobs and an available processor will get a
job from the pool.
An obvious advantage of Piranha is that there is no need to have a complex migration
mechanism. One difficulty of writing a Piranha program, however, is to write the retreat
function so that it can be called any time and still be able to restore the correct global state
of a computation.

2.4. ADM
ADM (adaptive data movement) is ‘a programming methodology for writing programs
that perform adaptive load distribution through data movement[8].’ It is suitable for data
parallel applications where there is no communication among the processes each of which
computes part of a data set. ADM uses the master–slave model of programming where a
master process assigns work (data) to a number of slave processes. The slave processes
may decide whether the machines on which they are running are too busy, and if so they
will inform the master process, who will in turn redistribute the data set to exclude those
slave processes on busy machines from doing any computation. Those idle slave processes
are put to sleep except for checking the machine loads periodically (once every minute
in [8]). Whenever a busy machine becomes idle, the sleeping processes on it will report
the fact to the master, who will redistribute the data set to include those processes in the
computation.
Mechanism. No mechanism is needed for ADM as the processes are not really migrated
but are put to sleep if the machines on which they are running are overloaded. Only load
information will be provided by the system to the slave processes.
Policies. In [8], slave processes check their machine loads after each iteration of compu-
tation. If the load of a machine is too large, the slave processes on the machine will stop
computing and report the fact to the master process. The other policies for data migration
are all left to the application programmer.

3. A NEW LOAD MIGRATION FACILITY – MPVM


This Section outlines our load migration facility, called MPVM, to migrate a program
instead of a live process, which is similar to Piranha's idea but is based on the message-passing paradigm. (The term MPVM is also used in [8] to describe their live process migration strategy, which is different from ours.) Our facility will start a new process
on another processor to continue the computation from where the old process left off. It
divides the responsibilities of process migration between the migrating process and the
system support as follows.
Mechanism:

1. the migrating process will pack a checkpoint data set to be transferred to the new
process
2. the migrating process will make a call to freeze communication and the new process
will make a call to defreeze communication
3. the migration system is responsible for sending future messages to the new process
4. the system is responsible for sending the checkpoint data set to the new process
5. the system will start a new process on another processor.

Policies:
1. the migrating process will check explicitly to see if migration is necessary, because the owner comes back or the load is too high, etc.
2. the system will decide where to migrate the process. (Our current implementation
will choose a destination processor that is the most lightly loaded.)

3.1. Routines for process migration


A process may invoke five calls for process migration and two other calls to interface our
migration system with the daemons of PVM.

3.1.1. mpvm_chkload()


mpvm_chkload() checks the one-minute load average of a machine. It compares the load average against a user-specified threshold and returns a 'yes' or 'no' indicator to show whether migration is necessary. 'Load average' is defined as the average number of runnable jobs over the load-averaging interval and is obtained from the UNIX command uptime.

3.1.2. mpvm_migrate(int bufid)


mpvm_migrate(int bufid) will migrate the caller to another processor by spawning a new process and passing the checkpoint data in the buffer pointed to by bufid to the new process. The new process will be started with the same parameters as the old one when the latter was started by the pvm_spawn() call. The routine checks for any processor in the virtual machine (including the local one) having a one-minute load average less than a user-specified figure. If a processor satisfies the condition, the new process is spawned on that processor. Otherwise, the routine calls sleep() for a fixed time interval and polls the virtual machine again.

3.1.3. mpvm_freeze()


mpvm_freeze() stops the caller from receiving any further messages. It blocks until all processors involved have replied, indicating receipt of the call.

3.1.4. mpvm_defreeze()


mpvm_defreeze() will resume communication between the caller and its peer processes.

Table 1. The structure of the TID table managed by mpvmd

    Original TID   Current TID   Freeze flag   Spawn parameters
    3              4             yes           task, argv, flag, where, ntask
    4              4             no            task, argv, flag, where, ntask

3.1.5. mpvm_getluggage()


mpvm_getluggage() checks if the caller is a 'migrant'. If so, it will get the 'luggage' sent by the migrated process. The luggage contains the checkpoint data the new process needs to continue from where the old process left off before the migration.
Two more calls are provided to interface our facility with the pvmds of PVM: mpvm_send() and mpvm_spawn(). Their parameters and functions are the same as those of pvm_send() and pvm_spawn() in the original PVM, except that our routines communicate with our own daemon, which is explained next.

3.2. The daemon mpvmd


In PVM Version 3, each process has a unique task identification number called TID, which
is the only destination information needed to send a message to a process. When a process
is migrated by spawning a new process on another processor, the new process will receive
another TID. But the other processes know only the TID of the migrated process. Our
facility makes the TID change of a process transparent to the other processes.
We have written a daemon routine called mpvmd to carry out our migration mechanism
together with the seven library calls. One copy of mpvmd resides on every participating
processor along with the normal pvmd. mpvmd is essentially a table-lookup daemon. By keeping a table of processes as shown in Table 1, mpvmd stores information on processes spawned by mpvm_spawn() and processes in the migrating stage. Each mpvmd has an identical copy of the table, and every copy is updated whenever an entry is added or changed.
The original TID is the TID assigned when the process is spawned. The problem of the TID change is solved by storing the TID of the new process in the current TID field of the table. The current TID is passed to the table by mpvm_migrate() when it transfers the caller to another processor by spawning a replica of the caller with the same parameters as in the caller's entry of the TID table. If the original and current TIDs are the same, the process is alive. Otherwise, it has been migrated and has become the process with the current TID.
The freeze flag indicates whether the process is frozen. The spawn parameters are the parameters passed to pvm_spawn() of PVM.
All the calls except mpvm_chkload() and mpvm_getluggage() communicate with mpvmd. mpvm_spawn() works exactly like pvm_spawn() of PVM except that it also stores its call parameters in the TID table. mpvm_send() works exactly like pvm_send() of PVM except for two modifications: it goes through the TID table to find the latest migrant and then calls pvm_send() to deliver the message to it; and it consults the TID table to see if the destination process is frozen, returning an error flag if it is. mpvm_freeze() sets the freeze flag of the entry for the caller; it is a blocking call and will not return until all copies of the table are updated. mpvm_defreeze() resets the freeze flag and is non-blocking: it changes the freeze flag of the caller's predecessor by searching the current TID field for the original TID index and updating the entry pointed to by that index. mpvm_migrate() will get the spawn parameters from the TID table.

Figure 1. mpvmd at work with PVM and an application code

Figure 1 shows how mpvmd works with PVM and an application code.

3.3. A typical scenario


A typical structure of a user program is given in Table 2, which shows the code fragments that are active just before and just after a migration. The structure is self-explanatory.

4. A NUMBER OF DESIGN ISSUES


From the user’s point of view, the minimum information needed to migrate a process is
the intermediate results that the process has produced just before it is to be transferred. If
the user knows when the process is to migrate, he should know what data to collect and
transfer. The user is hence in a better position than the operating system to choose the
checkpoint data set. Therefore, our approach is not to migrate a live process pre-emptively
but to restart a process that has been asked to migrate with the same parameters that the
original process started with except for initialization using the checkpoint data.
The rest of this Section lists options for process migration mechanism and policies, and
explains our choices.

4.1. Process transfer


The process state can be transferred in the following manner:

1. Start a new process by copying the virtual address space of the old one (Charlotte[9],
V[16], Sprite[11]).
2. Start a new process with the process image outside the operating system kernel
(Condor[12]).
3. Start a new process with checkpoint data stored in shared memory (Piranha[13]).
4. Start a new process with checkpoint data passed in a message (ours).

Table 2. A typical structure of a user's program

Conceptual procedure (the first fragment is active just after a
migration, the second when a process needs to migrate):

    if (just migrated) {
        get the luggage.
        unpack the luggage.
        tell others I am ready to receive messages.
    }
    continue.

    if (need to migrate) {
        tell others not to send any message.
        pack luggage for migration.
        migrate with the luggage.
    }
    continue.

An example program with mpvm calls (the same code runs before and
after a migration; only the active fragments differ):

    if (bufid = mpvm_getluggage()) {
        unpack migrated data in the active buffer;
        process migrated data;
        mpvm_defreeze();
    }
    application code;
    if (mpvm_chkload()) {
        mpvm_freeze();
        process messages;
        pack data into buffer bufid;
        mpvm_migrate(bufid);
        pvm_exit();
        exit();
    }

When it is done by the operating system, checkpointing is time-consuming and difficult to port. An example is the Condor checkpoint file, which may hold a 6 Mbyte address space and could take 2 min to pack and transfer[12]. In addition, it is hard to include any message
or signal information in a checkpoint file.
Checkpointing can be done much more easily at the application level. Since the user knows his program well, he can pack the intermediate results as checkpoint data. The example
CFD program in our experiment has a memory requirement of over 8 Mbytes but the size
of the user-prepared checkpoint data is only 2.6 Mbytes.
Our choice is therefore to start a new process with the same parameters as the old one
and send the checkpoint data automatically to the new process.

4.2. Transfer policy


If the transfer policy is the responsibility of a process migration facility, process migration
will usually be pre-emptive. However, it is almost impossible for the user to design a
checkpoint data set for pre-emptive migration because the user cannot know the state of
his process at the time of migration. For example, Piranha allows the user to prepare a
checkpoint data set but its migration mechanism is pre-emptive. Its ‘retreat’ function works
like a signal handler that is called asynchronously and therefore is hard to write so that it
will work well in all circumstances.
Hence, our choice is to let the user decide when to migrate so that the user will know
what to save in the checkpoint data set at the time of migration. The system will provide
load statistics to help the user.

4.3. Messages and signals


One problem with process migration is how to handle signals and messages. There is no
good way to deal with signals because they are asynchronous. There are a number of ways
to treat messages:
1. to tell all peers to stop sending and then process all pending messages (our approach)
2. to delay and redirect incoming messages (as in Charlotte)
3. to reject incoming messages (as in V).
We chose the first option to tell other processes not to send any messages to the migrating
process and then let the migrating process receive all pending messages before it exits.
Again, since the user is the one who knows his program well, our scheme requires the
user’s process to handle all messages in the receive buffer before it is migrated.

4.4. Location policy


Our current implementation sends the migrating process directly to the most lightly loaded machine. When mpvm_migrate() is called, it will select a new processor based on a load-average figure specified by the user in the $HOME/.mpvm_config file. The config file contains two floating-point numbers on two lines. The first number is the upper load threshold of a processor, above which foreign processes must be vacated. The second is the lower load threshold, below which foreign processes are allowed to run. mpvm_migrate() reads the second number and checks if any processor in the virtual machine (including the local one) has a load average less than that number. If a processor satisfies the condition, the process is migrated to it. Otherwise, the routine calls sleep() for a few seconds and polls the virtual machine again for an 'idle' processor.

5. EXPERIMENTAL RESULTS
The owner of a computer and its borrower may have different opinions about a process
migration facility. The owner wants to see no deterioration of the response time of his
machine and the borrower wants to fully utilize the idle cycles of the machine. This Section
discusses some experimental results to show how our facility satisfies both the concern of
the owner and the need of the borrower.
One of our experiments indicates that (i) the response time of a machine to interactive
users will not deteriorate too much (say, less than 6%) due to the existence of a foreign (or
remote) process and (ii) the foreign process can take advantage of the idle CPU power of a
network of computers by migrating among them.
One potential problem with our implementation for message handling is the possibility
of some messages being lost while the receiving process is being migrated. An experiment
was conducted to understand the problem.
We assumed that the activities of a typical programmer would be editing, compiling
and testing a program. We created a program to simulate an engineer’s activities of writ-
ing a program for digital filter design. The foreign process was a parallel program for
computational fluid dynamics (CFD).
Two workstations on a 10 Mbit/s Ethernet were used in our experiment. The machines
were IBM RS6000-370 with a clock rate of 42 MHz and 128 Mbyte RAM. Each workstation
executed one copy of the digital filter design program. The CFD program tried to use the
idle cycles of the two machines by migrating back and forth between them.

5.1. AGNES with process migration capability


A sequential code called AGNES for CFD is discussed in detail in [18]. AGNES has been
parallelized with PVM[19].
The problem domain of CFD is three-dimensional (3-D) physical space. 3-D objects such
as the space around an aircraft are divided into smaller pieces called blocks that have simple
geometries. A block is divided further into small cells to form a 3-D grid. In an iterative
solution on a sequential computer, the computation can be carried out on a block-by-block
basis during each iteration and the new results of each block can be incorporated after they
are obtained as described by the following algorithm.
For each iteration do {
For each block do {
Compute the block.
Update the boundary conditions related to this block.
}
}
The parallel AGNES program has one master process to distribute blocks and collect
results and a number of ‘worker’ processes to compute blocks. Each worker process is
given a fixed number of blocks at the beginning of the computation. All worker processes
are identical and they co-operate with each other through synchronization or messages to
complete a computation. The master and worker processes are illustrated below:
Master process {
Read in parameters.
Set up PVM worker processes.
Perform static load balancing.
Read in grid.
Loop (while less than maximum number of iterations) {
Get residues from worker processes.
} end of loop
} end of master process

Worker process {
if (migrated) {
Unpack data from the migration buffer;
mpvm_defreeze(); }
else
Receive initial parameters and blocks from master;
Loop (while less than maximum number of iterations) {
Process blocks.
Update boundary conditions synchronously.
if(mpvm_chkload()) {
mpvm_freeze();
Process messages;
Pack data;
mpvm_migrate(bufid);
exit();
}
} end of loop
Send final results back to the master.
} end of worker process

Table 3. Filters and their design times

    Filter type   Filter length   Elapsed time (s)
    1             27              57
    2             35              115
    3             39              687
    4             41              438
    5             43              1365

The worker process has migrating capability. While the worker uses about 8.2 Mbytes
of memory with no dynamic storage requirement, its checkpoint data is about 2.6 Mbytes.
Each iteration takes about 9 s of CPU time of the RS6000-370 and each process migration
operation takes about 5 s.

5.2. A simulated program development environment


The activities of a typical programmer are editing, compiling and testing a program. To
simulate a programmer’s use of a computer, we created a program that mimics the computing
activities of an engineer writing a program for digital filter design. The design code uses
linear programming techniques to choose optimal filter coefficients that are powers of
two[20,21]. The engineer would edit his program, compile it and test it with example
designs.
There are five example designs with different filter lengths and running times as shown
in Table 3.
We simulated the computing activities where two engineers were working on their own
workstations. Engineer 1 would spend between 1 and 200 s of elapsed time to edit his
program and then compile it and run one of the example designs. Engineer 2 would spend
between 1 and 300 s of elapsed time modifying his program before compiling and running
it. The actual editing time is a random variable with uniform distribution. Each example
design has an equal probability of being chosen to run. As shown by the lengths of their
editing times during each design iteration in Figure 2, the activities of the two engineers
are not correlated.

5.3. Overhead and idle CPU utilization


Three cases were tested to see how process migration would affect the response time for
the engineers as shown below:
Figure 2. The editing times of the two engineers are not correlated

Case 1: Each engineer uses one machine exclusively.
Case 2: Each engineer shares one machine with one AGNES worker.
Case 3: The engineers have exclusive usage of their own machines except for one
AGNES worker migrating between the two machines.
Table 4 shows the timing results for the first two cases and Figure 3 compiles the
total execution times for all three cases. (Case 3 is explained in more detail later in this
section.) The response times of the engineers in Case 2 increased by 48% and
34.8%, respectively, when the AGNES worker executed concurrently with them. But the
response time increases for Case 3 were only 3.2% for both engineers when the AGNES
worker was migrating between the two machines, as will be seen later.
Another measurement of the performance of process migration is how much the AGNES
program can take advantage of the idle CPU cycles of the two machines. We designed three
cases as listed below:
Case 1: AGNES runs exclusively on one machine.

Figure 3. The elapsed times for the engineers in the three cases
Table 4. Running times for Cases 1 and 2 of the engineers

                       #Design      Editing time (s)    Execution time (s)        Response time
                       iterations   Range     Total     Real     User    System   change (%)
Engineer 1   Case 1    40           200       3346      11697    8136    165      0
             Case 2    40           200       3346      17306    8163    162      48
Engineer 2   Case 1    30           300       5055      14663    9265    204      0
             Case 2    30           300       5055      19760    9327    205      34.8

Case 2: AGNES shares one machine with one engineer. (Same as Case 2 for the engi-
neers.)
Case 3: One AGNES worker migrates between two machines shared with the two
engineers. (Same as Case 3 for the engineers.)
Table 5 gives the results for the first two cases, where the total number of iterations was
1000. When the AGNES program ran concurrently with the engineer’s computations, its
execution time increased by as much as 67% in comparison to Case 1.
Our migrating facility uses two user-specified load figures to decide how to migrate a
process: the ‘upper load-threshold’ on the current machine and the ‘lower load-threshold’ on
the destination machine. When the one-minute load average of the current machine exceeds
the upper load-threshold, the AGNES worker is stopped and mpvm_migrate() is called to
find another machine whose load is less than the lower load-threshold and migrate the
worker to it. If neither machine’s load is below the lower load-threshold, the worker is
put to sleep and mpvm_migrate() polls the machines at fixed intervals until one becomes
available.
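The threshold test just described can be sketched as follows; the real decision lives inside mpvm_migrate() and mpvm_chkload(), whose internals are not shown in this paper, so the function and its name below are illustrative only:

```python
# Hypothetical sketch of the two-threshold migration decision.
def pick_destination(local_load, peer_loads, upper, lower):
    """Return the index of a peer to migrate to, or None (stay or sleep)."""
    if local_load <= upper:           # current machine still acceptable
        return None
    # Only peers below the lower load-threshold are candidates;
    # pick the least loaded one.
    candidates = [i for i, load in enumerate(peer_loads) if load < lower]
    return min(candidates, key=lambda i: peer_loads[i]) if candidates else None

assert pick_destination(0.9, [0.1], upper=1.2, lower=0.95) is None       # stay put
assert pick_destination(1.5, [0.3, 0.8], upper=1.2, lower=0.95) == 0     # migrate
assert pick_destination(1.5, [1.1, 2.0], upper=1.2, lower=0.95) is None  # sleep and poll
```

When the function returns None under a high local load, the caller would sleep and poll again after the fixed interval, as mpvm_migrate() does.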
The execution times for the engineer’s programs and the AGNES program in Case 3
are therefore related to three factors: the upper and lower load-thresholds and the polling
interval. Table 6 shows some experimental results when the upper and lower load-thresholds
and the polling interval varied.
It can be seen from Table 6 that the deterioration was quite small although the upper and
lower loads did affect the execution times for the engineers. The increase in the running
times for the engineers ranged from 0.18% to 5.8% as compared to their having exclusive
use of their machines in Case 1 for the engineers.
Our experiment has also indicated that the AGNES worker could take advantage of the
idle CPU power of the two machines. Figure 4 compares the cascade execution of the digital
filter design program and AGNES program with that of the migrating AGNES program of
Case 3. In the cascade cases, the digital design program would run to completion before
the AGNES program could start. When the load figures were 1.6 and 0.95, the migrating
AGNES worker took 14153 s to complete. As can be seen from Figure 4, the migrating
AGNES worker finished 6042 s earlier than the cascade run on Machine 1 and 8965 s earlier
than that on Machine 2, a time saving of 30% to 39%.
Furthermore, when the load figures were 1.6 and 0.95, the AGNES worker was executed
on Machine 1 exclusively for (14153 − 12378) = 1775 s after Engineer 1 stopped. But
the worker takes at least 8455 s to complete. Therefore, it stole (8455 − 1775) = 6680 s
from the two engineers. The maximum possible idle time (editing time) for the engineers
was (3346 + 5055) = 8396 s, out of which 6680 s were used by the AGNES program. It
Table 5. Execution times for Cases 1 and 2 of AGNES

                            Real time (s)    Time change (%)
Workstation 1   Case 1      8498             0.0
                Case 2      14244            67.62
Workstation 2   Case 1      8455             0.0
                Case 2      13659            61.55

can therefore be claimed that 74% of the idle CPU power was utilized.

5.4. Message losses


One potential problem with our approach to message handling is the possibility of some
messages being lost because of the asynchronous nature and delay of message transmission.
A migrating process calls mpvm_freeze() and then checks for remaining messages. But it
can never be sure that there is no message somewhere in the communication channels
before it starts migrating. Furthermore, the mpvm_send() call will send to its local mpvmd
a request for the freeze-status of the destination process before sending the actual message
to the pvmd of PVM. A race condition may occur when the request for the freeze-status
is made at about the same time as the destination process calls mpvm_freeze(). It may
happen that the sender receives a defreeze flag just as the receiver starts migrating, in
which case the message is lost because the receiver process no longer exists.
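The race can be replayed deterministically with dictionaries standing in for the mpvmd state; the forced interleaving below is the worst case, in which the receiver migrates between the sender's freeze-status check and the actual send. All names here are invented for illustration:

```python
# Deterministic replay of the freeze-status race described above.
mailboxes = {"worker": []}          # TID -> receive queue
frozen = {"worker": False}

def migrate(tid):
    frozen[tid] = True
    del mailboxes[tid]              # old process exits; new TID is elsewhere

def mpvm_send_sketch(dest, msg):
    if frozen.get(dest, True):      # sender asks for the freeze-status first
        return "HELD"               # a frozen destination: message held back
    # Race window: between the status check and the actual send,
    # the receiver freezes and migrates away.
    migrate(dest)
    if dest in mailboxes:
        mailboxes[dest].append(msg)
        return "DELIVERED"
    return "LOST"                   # destination process no longer exists

assert mpvm_send_sketch("worker", 6) == "LOST"
```

This is exactly why the lost integers in the experiment were the first ones sent after each migration was triggered.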
In order to find out how severe the message loss could be, we constructed an experiment
with the two workstations. The workstations communicate through a 10 Mbit/s Ethernet
which connects more than 20 computers. The experiment has two processes. One will
send integers consecutively from 1 to N with N being from 100 to 4000. The other will

Figure 4. The migrating AGNES worker utilized 74% of idle CPU power
Table 6. Execution times for Case 3 of AGNES and the engineers as the one-minute upper and
lower load-thresholds and the polling interval varied

Upper   Lower   Engineer 1   Change   Engineer 2   Change   AGNES   #Migra-   Polling
load    load    (s)          (%)      (s)          (%)      (s)     tions     interval (s)
1.2     0.2     11718        0.18     14735        0.49     17458   24        30
1.2     0.8     11931        2.0      14748        0.58     14955   46        30
1.2     0.95    12056        3.1      14846        1.25     14728   97        5
1.4     0.95    12270        4.9      14991        2.24     14347   44        5
1.6     0.95    12378        5.8      15111        3.06     14153   77        5

receive the integers and start to migrate to another machine whenever it receives an integer
divisible by 5, i.e. after receiving the integers 5, 10, 15, etc. More than 15 test runs for
each value of N were performed and message losses as well as the number of migrations
were recorded. The test runs were executed at various times during working hours of the
day when the Ethernet was being utilized by an average of 11 people performing tasks like
word-processing, editing, compiling, UNIX job execution, file transfer and host access, etc.
The number of migrations differs from case to case due to the nature of the receiver which
‘cleans up’ any pending messages in the receive buffer before migrating. For example, the
receiver starts migrating after receiving the integer 15. At this point, the incoming buffer
might have been filled with other integers because the sender is constantly sending them.
The receiver should appropriately handle these outstanding messages by first ‘freezing’
itself (by calling mpvm_freeze()) and then processing the messages in the receive buffer, if
any, before calling mpvm_migrate().
The test results indicate that more than 99% of the lost integers were the first ones after
integers divisible by 5, e.g. 6, 31, 46, 101, etc. These were the integers which were sent
at the critical moment when the receiver was either close to migrating or in the process of
migrating.
Figure 5 shows the relationship among the total number of messages, the number of lost
messages and the number of migrations for each test case. It can be seen that the number
of lost messages increases approximately linearly when the number of migrations exceeds
about 350. This is attributed to the race condition being aggravated by the growing length
of the TID-lookup table maintained by the mpvmd as the number of migrations increases.
As each migration adds a new process entry to the table, response times of the calls to the
mpvmds become longer.
Figure 6 shows the relationship between the lost messages and their sequence numbers.
The message sequence numbers are ordinal numbers from 0 to 3999. Figure 6
indicates that later messages may have a much greater probability of being lost than earlier
ones, due to the increasing length of the table.
Our experiment simulated the extreme case where the receiver will migrate immediately
after checking that its local message queue is empty. We have performed another experiment
that showed that if the receiver had been made to wait for about 0.2 s after being frozen and
before it checked for pending messages, all transient messages would arrive at the receiver,
resulting in zero message losses for all the cases of the experiment.
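The zero-loss variant can be sketched as a drain loop with a grace period (the 0.2 s figure is the one measured in our experiment); the queue below is a stand-in for the PVM receive buffer, and the names are invented for illustration:

```python
import queue
import threading
import time

# After freezing, wait a short grace period so transient messages can
# land, then drain the local queue before packing data and migrating.
def drain_before_migrate(inbox, grace=0.2):
    time.sleep(grace)               # let in-flight messages arrive
    pending = []
    while True:
        try:
            pending.append(inbox.get_nowait())
        except queue.Empty:
            return pending          # now safe to migrate

inbox = queue.Queue()
# Simulate a transient message still in flight when the freeze happens.
threading.Timer(0.05, inbox.put, args=(16,)).start()
assert drain_before_migrate(inbox) == [16]
```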
Figure 5. Effect of migrations on message loss

6. DISCUSSIONS
The development and experiment of a new load migration scheme are reported. The basic
idea is to migrate a process by starting a new process on another processor with the
checkpoint data transferred from the old process to the new one. The checkpoint data are
the intermediate results generated by the old process and transferred automatically by the
migration system. Our facility is easily portable and easy to use because it is built on top
of the PVM and the user does not need to know the details of the migration mechanism. It
is also faster than migrating a live process because the amount of data to be transferred in
our approach is much smaller.

Figure 6. Lost messages

A daemon, mpvmd, is written to implement the migration mechanism. The user interfaces
with mpvmd through four calls: mpvm_migrate(), mpvm_freeze(), mpvm_defreeze() and
mpvm_getluggage(). Another call, mpvm_chkload(), can be used to check load conditions
to decide whether migration is necessary. Two more calls, mpvm_send() and mpvm_spawn(),
are written to collect spawn information and check whether a process is being migrated.
With the above calls and the daemon, it is fairly easy for a user to incorporate the process
migration feature into his program. We modified the CFD code called AGNES in two
weeks to add the migration capability to it. The amount of checkpoint data for the AGNES
program is 70% smaller than that for migrating the program as a live process.
Our experiments with CFD and digital filter design applications show that our facility
can help to utilize 74% of idle CPU cycles of a two-workstation network with less than 6%
overhead to their normal operations.
The mpvmd may be modified to include message buffers and a message forwarding
facility in order to overcome the problem of possible message loss. Another solution to the
problem is to synchronize message transmissions, which may result in degradation of the
communication performance.
We wrote the daemon mpvmd because we did not want to modify the source code of PVM.
There are a number of things we would do differently if we could include our migration calls
into the PVM code. Instead of freezing communication and processing pending messages
as our facility does, PVM daemons could redirect messages to the new process since the
daemons would know the TID of the new process. This way, only two calls would be
needed to migrate a process: mpvm_migrate() and mpvm_getluggage(). pvm_send() would
always return success and let the PVM daemons hold the messages if their destination process
is being migrated. The process migration mechanism would then be even more transparent to
the user.
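This forwarding alternative can be sketched with a TID indirection table maintained by the daemon; the class and method names below are invented for illustration and are not part of PVM or MPVM:

```python
# Sketch of the alternative design: the daemon keeps a TID redirection
# table, so senders never see a freeze window and no message is lost.
class DaemonSketch:
    def __init__(self):
        self.forward = {}           # old TID -> current TID
        self.queues = {}            # current TID -> held messages

    def register(self, tid):
        self.queues[tid] = []

    def send(self, tid, msg):       # always succeeds, like pvm_send()
        while tid in self.forward:  # follow the redirection chain
            tid = self.forward[tid]
        self.queues[tid].append(msg)

    def migrate(self, old_tid, new_tid):
        self.register(new_tid)
        self.queues[new_tid] = self.queues.pop(old_tid)  # hand over backlog
        self.forward[old_tid] = new_tid

d = DaemonSketch()
d.register(1)
d.send(1, "a")
d.migrate(1, 2)
d.send(1, "b")                      # the old TID still works after migration
assert d.queues[2] == ["a", "b"]
```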
The MPVM system is freely available. Interested parties should send an email message
to songjj@nsrc.nus.sg for more information.

ACKNOWLEDGEMENTS
The authors are greatly in debt to the two anonymous referees who examined the first version
of this paper. We especially express our gratitude to one of the referees for his/her three-page
comments, which have helped to improve the readability of our paper tremendously.

REFERENCES
1. P. Krueger and R. Chawla, ‘The Stealth distributed scheduler’, The 11th Int. Conf. on Distr.
Computing Systems, 1991, pp. 336–343.
2. R. E. Benner and J. M. Harris, ‘Applications, algorithms, and software for massively parallel
computing’, AT&T Tech. J., Nov. 1991, 59–72.
3. R. Fatoohi and S. Weeratunga, ‘Performance evaluation of three distributed computing environ-
ments for scientific applications’, Supercomputing ‘94, pp. 400–408.
4. W. T. C. Kramer et al., ‘Clustered workstations and their potential role as high speed compute
processors’, RNS-94-003, NASA Ames Research Center, 1994.
5. R. E. Ewing et al., ‘Distributed computation of wave propagation models using PVM’, IEEE
Parallel Distrib. Technol., 2, (1), 26–31 (1994).
6. M. Deshpande et al., ‘Application of a distributed network in computational fluid dynamic
simulation’, Supercomput. Appl. High Performance Comput., 8, (1), 64–67 (1994).
7. M. Nuttall, ‘A brief survey of systems providing process or object migration facilities’, Oper.
Syst. Rev., 28, (4), 64–80 (1994).
8. J. Casas et al., ‘Adaptive load migration systems for PVM’, Supercomputing ‘94, pp. 390–399.
9. Y. Artsy and R. Finkel, ’Designing a process migration facility – The Charlotte experience’,
IEEE Computer, September 1989, pp. 47–56.
10. W. C. Hsieh, P. Wang and W. E. Weihl, ‘Computation migration: enhancing locality for
distributed-memory parallel systems’, Proc. of the 4th ACM SIGPLAN Symp. on Princ. &
Practice of Parallel Prog., 1993, pp. 239–248.
11. F. Douglis and J. Ousterhout, ‘Transparent process migration: design alternatives and the Sprite
implementation’, Softw. - Pract. and Exp., 21, (8), 757–785 (1991).
12. M. Litzkow and M. Solomon, ‘Supporting checkpointing and process migration outside the
UNIX kernel’, Usenix Winter Conference, 1992.
13. D. Gelernter and D. Kaminsky, ‘Supercomputing out of recycled garbage: preliminary experience
with Piranha’, Sixth ACM Int. Conf. on Supercomputing, July 1992.
14. N. G. Shivaratri, P. Krueger and M. Singhal, ‘Load distributing for locally distributed systems’,
IEEE Computer, 25, (12), 33–44 (1992).
15. A. Geist et al., PVM 3.0 User’s Guide and Reference Manual, Oak Ridge National Laboratory,
ORNL/TM-12187, 1993, pp. 1–41.
16. D. R. Cheriton, ‘The V distributed system’, Commun. ACM, 31, (3), 314–333 (1988).
17. G. J. W. van Dijk and M. J. van Gils, ‘Efficient process migration in the EMPS multiprocessor
system’, 6th Int. Parallel Process. Symp., 1992, pp. 58–66.
18. K. M. Lee and K. C. Fong, ‘A zonal Euler/Navier–Stokes solver for complex configuration
aerodynamics’, Eng. Technol., 1, (3), 30–36 (1991).
19. Kuok Ming Lee, Heng Kek Choo, Jianjian Song, ‘Cluster computing for CFD on a network of
engineering workstations’, J. High Performance Comput., 1, (1), 2–9 (1994).
20. Dongning Li, Jianjian Song, and Yong Ching Lim, ‘A polynomial-time algorithm for designing
digital filters with power-of-two coefficients’, The 1993 Int. Conf. on Circuits and Systems, May
1993, Chicago, USA.
21. Yong Ching Lim, Dongning Li and Jianjian Song, ‘A new signed power-of-two (SPT) term
allocation scheme for the design of digital filters with a given total number of SPT terms for the
coefficient values’, submitted to IEEE Trans. Circuits Syst.
22. A. Bricker, M. Litzkow and M. Livny, Condor Technical Summary, Dept. of Computer Sciences,
U. of Wisconsin, Madison, WI 53706, USA.
23. N. Carriero, D. Gelernter, D. Kaminsky and H. Westbrook, ‘Adaptive parallelism with Piranha’,
Dept. of Comp. Science, Yale U., 1993, pp. 1–18.
24. D. Freedman, ‘Experience building a process migration subsystem for UNIX’, in Proc. 1991
Winter USENIX Conf., Dallas, Texas, January 1991.
25. M. Litzkow, M. Livny and M. W. Mutka, ‘Condor - a hunter of idle workstations’, 8th IEEE Int.
Conf. on Distr. Comp. Sys., 1988, pp. 104–111.
26. M. Litzkow and M. Livny, ‘Experience with the Condor distributed batch system’, IEEE Work-
shop on Experimental Distributed Systems, 1990.
27. J. M. Smith and G. Q. Maguire, ‘Process migration: Effects on scientific computation’, ACM
SIGPLAN Not. 23, (3), 102–106 (1988).
28. J. M. Smith, ‘A survey of process migration mechanisms’, Oper. Syst. Rev., 22, (3), 28–40
(1988).
29. Jianjian Song, ‘A partially asynchronous and distributed algorithm for load balancing’, Parallel
Comput., 20, (6), 853–868 (1994).
30. M. M. Theimer and K. A. Lantz, ‘Finding idle machines in a workstation-based distributed
system’, IEEE Trans. Softw. Eng., 15, (11), 1444–1458 (1989).
31. M. M. Theimer and B. Hayes, ‘Heterogeneous process migration by recompilation’, Xerox Palo
Alto Research Center, Report No. CSL-92-3, 1992.