Application-Level Load Migration and Its Implementation On Top of PVM
J. Song, H. K. Choo and K. M. Lee
SUMMARY
The development and experimental evaluation of a load (process) migration scheme conceptually
similar to moving house are described. The basic idea is to migrate a process by starting a new process
on another processor with checkpoint data prepared by the old process itself but transferred
automatically by the migration system. The new process will then unpack the data and resume
the computation. The migration mechanism of our facility is implemented by a set of library calls
on top of PVM. It performs functions such as freezing and unfreezing communications, checking
load conditions, selecting destination processors, starting new processes and receiving migrated
data. Before migrating, a process needs to freeze communication, handle pending messages in
the receive buffer and pack checkpoint data. Besides the usual merits of concurrency, location
transparency and the absence of residual dependency, our scheme solves the incoming message
problem at the application level and is portable and easy to use in a heterogeneous environment.
Our experiment shows that our facility can help to utilize 74% of idle CPU cycles of a network
of workstations with less than 6% overhead on their normal operations.
1. INTRODUCTION
Krueger and Chawla in [1] discussed the results of a survey of 199 Sun-3/50 diskless
workstations and 18 Sun servers connected on a network at the Ohio State University over
a 3-month period. Their conclusions are the following: (i) some of the users do need more
computing power; (ii) there is a huge computing resource that can be tapped from a network
of workstations; (iii) this combined resource can help to reduce the computing loads of
those heavily used workstations. Many successes have been reported on using workstation
networks for load sharing and parallel computing[2–6].
Although computing power can be borrowed from other people’s workstations, the
borrower must withdraw from using the machines when their owners want to use them. If
the borrower’s programs have been running on the machines for some time, they should
be migrated with the intermediate results to other workstations to continue. This way, the
owner has priority over his machines while the borrowed power is not wasted.
This paper discusses one solution to this type of load migration problem where distributed
memory architecture is assumed. Load migration can be defined as moving computation
load (processes, tasks, objects, etc.) from one processor to another without affecting the
execution of the computation and its environment. Since our discussion is in the context
of the UNIX operating system, we shall use the term ‘process migration’ more often, but
bear in mind that our scheme does not actually migrate UNIX processes. There are a number
of other approaches to the problem such as migrating UNIX processes (also called ‘task
transfer’), threads, objects or user data, etc.[7,8].
There are a number of other reasons why load migration is needed, such as load sharing,
load balancing, application concurrency and owner protection[9]. Another reason for
process migration is to move a process to where its data resides[10]. Load migration techniques
can also help hide the internal complexity of computing technology.
There are two basic choices in migrating computation load. One is to physically copy
a process (or a thread) to another processor and the other is to start a new process to
continue the computation. Most previous work on process migration has concentrated on the
former. Many techniques have been proposed for mechanisms to migrate live processes[7].
However, migrating a live process has had limited success even in homogeneous environments,
because the kernel of the operating system may be involved and a large amount
of checkpoint data must be transferred. In addition, issues such as virtual memory, opened
files, message channels, execution state and other kernel states must be handled. There
are a number of other difficulties, such as portability and asynchronous communications.
Migrating a live process is conceptually similar to moving house without the residents
being aware of it. It is difficult, if not impossible, to move a house to a new location without
the knowledge of the residents while keeping all functions and communications exactly
as they appeared to them. Such approaches succeed only under strict conditions, for
example that the old and new houses must be similar in size and design, or at least appear
virtually the same. Various techniques have been tried to move every item in the old house
and place it in exactly the same location and state in the new one. These approaches are
not only time consuming but also too restrictive to be portable.
We have designed and implemented a new scheme for load migration that is much closer
in principle to the practice of moving house. The owner of a process packs the items
belonging to the process (checkpoint data) and then asks our process migration system (the
mover) to move the process with the packed items to another processor as well as to change
communication links automatically. A new process will be started on the new processor
and unpack the items to resume the computation.
Our scheme is faster than moving a live process because it involves less data. It is also
easy to port because it is implemented on top of PVM which is portable. In addition, it is
easy to use because there are only five calls that a user needs to know, besides the PVM
calls.
Our implementation was made on top of PVM because we did not want to modify the
source code of PVM. Our experiments with CFD and digital filter design applications show
that our facility can help to utilize 74% of idle CPU cycles of a two-workstation network
with less than 6% overhead to their normal operations.
One problem with our implementation is the possibility of some messages being lost
while the receiving process is being migrated. Our experiments suggest some remedies
for the problem. We also point out how PVM internals can be modified to eliminate the
problem.
At the high level, process migration may have the following steps:
1. checkpoint a process
2. decide whether to migrate
3. stop the process and assign it to another processor
4. inform all the related processes of the change
5. start the migrating process from its checkpoint.
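At this level the scheme can be pictured as a checkpoint/restart round trip. The sketch below is an illustration only, not the actual facility: it is plain C with no PVM, all names (Checkpoint, run, migrate_demo) are ours, and a memcpy stands in for the message that would carry the checkpoint data. It shows that a computation resumed from packed checkpoint data (steps 1, 3 and 5 above) ends with the same result as an uninterrupted run.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical checkpoint: the state a worker must carry to a new host. */
typedef struct { int next_i; long partial_sum; } Checkpoint;

/* Run from iteration c->next_i up to n, as either the old or the new
   process would. */
static long run(Checkpoint *c, int n) {
    for (; c->next_i < n; c->next_i++)
        c->partial_sum += c->next_i;
    return c->partial_sum;
}

long migrate_demo(int n, int stop_at) {
    Checkpoint old = {0, 0};
    run(&old, stop_at);              /* steps 2-3: stop mid-computation     */
    char buf[sizeof(Checkpoint)];    /* step 1: pack checkpoint data        */
    memcpy(buf, &old, sizeof old);
    Checkpoint fresh;                /* step 5: new process unpacks, resumes */
    memcpy(&fresh, buf, sizeof fresh);
    return run(&fresh, n);
}
```

Steps 3 and 4 (assigning a processor and informing the related processes) are exactly what the migration system described below automates.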
A load migration scheme may consist of two parts: mechanism and policies.
Migration mechanism involves collecting statistics about processors and transferring
processes. Statistics include data on machine load (number of processes, links, and CPU
and network loads), individual processes (age, state, CPU utilization, and communication
rate), and selected active links (packets sent and received), etc.[9]. Transferring processes
involves handling virtual memory space, opened files, message channels, execution state
and other kernel states, and signals.
The mechanism is usually implemented by the operating system and its utilities. It could
be included in the operating system kernel such as in Charlotte[9] and Sprite[11], or outside
the kernel as in Condor[12], or at an even higher level as in Piranha[13].
Migration policies, on the other hand, deal with what processes to migrate (selection
policy), when to migrate (transfer policy), where to migrate (location policy), what
information to use (information policy), and who makes the policies[11,14]. Policies can be
embedded within the mechanism or decided by the user. There is a tradeoff between the
complexity and portability of policy strategies.
The parallel virtual machine (PVM) is a software package written by the Oak Ridge
National Laboratory in the USA. It handles data conversion, process initialization,
termination, communication and synchronization across networks of heterogeneous UNIX
computers[15]. PVM processes communicate with each other by using message-passing
constructs. Applications can be parallelized using PVM to run on a network of
heterogeneous UNIX machines.
2. OTHER APPROACHES
This Section summarizes the mechanism and policies of some typical solutions to the load
migration problem, namely, Charlotte, Condor, Piranha and ADM. Although there are a
number of other systems such as Sprite[11], the V system[16], and the EMPS[17], etc.,
the four chosen here are representative of four distinctive approaches to load migration in
terms of the user’s involvement.
The extent of the user’s participation in load migration increases from Charlotte, Condor,
Piranha, to ADM. In Charlotte, the mechanism of detaching, transferring and reattaching a
migrated process is fixed in its kernel and the address space of a migrating process is copied
to the destination processor. In Condor, the mechanism is outside the UNIX kernel and a
checkpoint file is created by the Condor system and copied to the destination processor.
Piranha leaves the checkpointing to the user and uses shared memory to migrate data
implicitly. ADM leaves the whole responsibility of load redistribution to the user while the
system supplies load information only.
2.1. Charlotte
Charlotte is a message-based distributed operating system developed at the University of
Wisconsin[9]. Process migration is implemented in the kernel of Charlotte.
Mechanism. Charlotte collects statistics such as load (number of processes, links, CPU load
and network load), individual processes (age, state, CPU utilization and communication
rate) and selected active links (packets sent and received).
The address space of a migrating process is copied to the destination processor. The
link of the process is changed and passed to the related processes. Kernel data structures
pertaining to the process are marshaled, transferred and demarshaled.
Policies. A starter process decides when to migrate a process based on the statistics and
manual directions from a policy procedure given by the user. Two processors will negotiate
to see if a process can be migrated from one to the other. Charlotte process migration
does not work in a heterogeneous environment as it transfers address space between two
processors.
2.2. Condor
‘Condor is a software package for executing long-running, computation-intensive jobs on
workstations which would otherwise be idle[12].’ Condor uses a mechanism called ‘Remote
UNIX’ which starts a shadow process locally and a remote process on a remote processor.
The remote process carries out all the computation but will send any UNIX system calls to
the shadow process. Therefore, a consistent file system and environment can be maintained
although the remote process may be migrated to other processors.
Mechanism. Condor checkpoints a process regularly. A checkpoint file contains the text,
data, bss and stack segments of a process, the registers, the status of open files, and any
messages sent by the process to its shadow for which a reply has not been received.
Migration is pre-emptive. When a process is migrated, its checkpoint file is copied to
another processor and the process is restored from its checkpoint.
Policies. Condor’s local scheduler checks the load average and/or keyboard activity to see
if a foreign process should be suspended. If a foreign process has been suspended longer
than a prespecified time, it will be asked to leave the processor. If the process does not
leave within a certain time after it has been asked to do so, it will be forcibly terminated.
Condor has a central controller that polls each processor at two-minute intervals to
find idle processors as well as processors with waiting background tasks. A processor is
considered idle if its owner has not been active for at least 12.5 min. The controller sends
a remote task to an idle processor.
The main advantage of Condor is its portability since it is implemented outside the UNIX
kernel. The main shortcoming of Condor is that it assumes that a migrating process should
have no interprocess communication, signals or child processes.
2.3. Piranha
Piranha implements the master–worker model of parallel computing. The master (called
‘feeder’ in the Piranha system) resides on a processor permanently throughout computation
but the workers (‘piranha’) can migrate among a number of processors. To restore the global
state of a process, checkpointing is implemented by a function called ‘retreat’ which is
executed just before a worker is terminated. The retreat function is written by the user to
save intermediate results so that the process can continue on another processor[13].
Mechanism. Piranha implements globally shared memory and therefore it does not need
the migration mechanism associated with copying the virtual address space of a process.
Policies. Each processor has a Piranha daemon running at low priority to monitor whether
the processor is available for foreign processes. If a processor becomes unavailable, the
retreat function is called and the foreign processes running on it are terminated.
Piranha uses a distributed job scheduler to decide where to migrate a process. A migrating
process first returns to a global pool of pending jobs and an available processor will get a
job from the pool.
An obvious advantage of Piranha is that there is no need to have a complex migration
mechanism. One difficulty of writing a Piranha program, however, is to write the retreat
function so that it can be called any time and still be able to restore the correct global state
of a computation.
2.4. ADM
ADM (adaptive data movement) is ‘a programming methodology for writing programs
that perform adaptive load distribution through data movement[8].’ It is suitable for data
parallel applications where there is no communication among the processes each of which
computes part of a data set. ADM uses the master–slave model of programming where a
master process assigns work (data) to a number of slave processes. The slave processes
may decide whether the machines on which they are running are too busy, and if so they
will inform the master process, who will in turn redistribute the data set to exclude those
slave processes on busy machines from doing any computation. Those idle slave processes
are put to sleep except for checking the machine loads periodically (once every minute
in [8]). Whenever a busy machine becomes idle, the sleeping processes on it will report
the fact to the master, who will redistribute the data set to include those processes in the
computation.
Mechanism. No mechanism is needed for ADM as the processes are not really migrated
but are put to sleep if the machines on which they are running are overloaded. Only load
information will be provided by the system to the slave processes.
Policies. In [8], slave processes check their machine loads after each iteration of
computation. If the load of a machine is too large, the slave processes on the machine will stop
computing and report the fact to the master process. The other policies for data migration
are all left to the application programmer.
Mechanism:
1. the migrating process will pack a checkpoint data set to be transferred to the new
process
2. the migrating process will make a call to freeze communication and the new process
will make a call to defreeze communication
3. the migration system is responsible for sending future messages to the new process
4. the system is responsible for sending the checkpoint data set to the new process
5. the system will start a new process on another processor.
Policies:
1. the migrating process will check explicitly to see if migration is necessary because
the owner comes back or the load is too high, etc.
2. the system will decide where to migrate the process. (Our current implementation
will choose a destination processor that is the most lightly loaded.)
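The location policy just described (choose the most lightly loaded destination) can be sketched as a small selection routine. This is an illustration under our own names, not the facility's internal code; the load values would in practice come from the system's load statistics.

```c
#include <assert.h>

/* Hypothetical location policy: return the index of the most lightly
   loaded candidate machine, or -1 if the candidate list is empty. */
int pick_destination(const double *load, int n) {
    if (n <= 0) return -1;
    int best = 0;
    for (int i = 1; i < n; i++)
        if (load[i] < load[best])
            best = i;                /* keep the lightest load seen so far */
    return best;
}
```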
Figure 1 shows how mpvmd works with PVM and an application code.
There are four basic ways to start a new process for a migrating computation:
1. Start a new process by copying the virtual address space of the old one (Charlotte[9],
V[16], Sprite[11]).
2. Start a new process with the process image outside the operating system kernel
(Condor[12]).
3. Start a new process with checkpoint data stored in shared memory (Piranha[13]).
4. Start a new process with checkpoint data passed in a message (ours).
5. EXPERIMENTAL RESULTS
The owner of a computer and its borrower may have different opinions about a process
migration facility. The owner wants to see no deterioration of the response time of his
machine and the borrower wants to fully utilize the idle cycles of the machine. This Section
discusses some experimental results to show how our facility satisfies both the concern of
the owner and the need of the borrower.
One of our experiments indicates that (i) the response time of a machine to interactive
users will not deteriorate too much (say, less than 6%) due to the existence of a foreign (or
remote) process and (ii) the foreign process can take advantage of the idle CPU power of a
network of computers by migrating among them.
One potential problem with our implementation for message handling is the possibility
of some messages being lost while the receiving process is being migrated. An experiment
was conducted to understand the problem.
We assumed that the activities of a typical programmer would be editing, compiling
and testing a program. We created a program to simulate an engineer’s activities of
writing a program for digital filter design. The foreign process was a parallel program for
computational fluid dynamics (CFD).
Two workstations on a 10 Mbit/s Ethernet were used in our experiment. The machines
were IBM RS6000-370 with a clock rate of 42 MHz and 128 Mbyte RAM. Each workstation
executed one copy of the digital filter design program. The CFD program tried to use the
idle cycles of the two machines by migrating back and forth between them.
Worker process {
    if (migrated) {
        Unpack data from the migration buffer;
        mpvm_defreeze();
    } else {
        Receive initial parameters and blocks from master;
    }
    Loop (while less than maximum number of iterations) {
        Process blocks;
        Update boundary conditions synchronously;
        if (mpvm_chkload()) {
            mpvm_freeze();
            Process messages;
            Pack data;
            mpvm_migrate(bufid);
            exit();
        }
    } /* end of loop */
    Send final results back to the master;
} /* end of worker process */
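The ‘Pack data’ and ‘Unpack data’ steps above follow PVM's typed-packing style, in which values must be unpacked in the same order they were packed. The stand-in below uses a flat byte buffer instead of PVM's pvm_pkint()/pvm_upkint() (which need a running PVM daemon); MigBuf, pk_int and upk_int are our illustrative names, not PVM's.

```c
#include <assert.h>
#include <string.h>

/* A toy migration buffer standing in for PVM's send buffer. */
typedef struct { unsigned char data[256]; size_t pos; } MigBuf;

static void pk_int(MigBuf *b, int v) {
    memcpy(b->data + b->pos, &v, sizeof v);
    b->pos += sizeof v;
}

static int upk_int(MigBuf *b) {
    int v;
    memcpy(&v, b->data + b->pos, sizeof v);
    b->pos += sizeof v;
    return v;
}

/* Old process packs an iteration count and a block id; the new process
   unpacks them in the same order, as PVM also requires. */
int roundtrip(int iter, int block) {
    MigBuf b = { {0}, 0 };
    pk_int(&b, iter);
    pk_int(&b, block);
    b.pos = 0;                  /* "transfer": reader starts at the front */
    int iter2  = upk_int(&b);
    int block2 = upk_int(&b);
    return iter2 * 1000 + block2;
}
```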
The worker process has migrating capability. While the worker uses about 8.2 Mbytes
of memory with no dynamic storage requirement, its checkpoint data is about 2.6 Mbytes.
Each iteration takes about 9 s of CPU time of the RS6000-370 and each process migration
operation takes about 5 s.
Figure 2. The editing times of the two engineers are not correlated
Figure 3. The elapsed times for the engineers in the three cases
Case 2: AGNES shares one machine with one engineer. (Same as Case 2 for the engineers.)
Case 3: One AGNES worker migrates between two machines shared with the two
engineers. (Same as Case 3 for the engineers.)
Table 5 gives the results for the first two cases, where the total number of iterations was
1000. When the AGNES program ran concurrently with the engineer’s computations, its
execution time increased by as much as 67% in comparison to Case 1.
Our migrating facility uses two user-specified load figures to decide how to migrate a
process: the ‘upper load-threshold’ on the current machine and the ‘lower load-threshold’ on
the destination machine. When the one-minute load average of the current machine is greater
than the upper load-threshold, the AGNES worker will be stopped and mpvm_migrate()
called to find another machine whose load is less than the lower load-threshold and
migrate the worker there. If neither machine’s load is less than the lower load-threshold,
the worker will be put to sleep and mpvm_migrate() will poll the machines at
fixed time intervals until one machine becomes available.
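The two-threshold policy just described can be sketched as a single decision function. This is a simplified model under our own names (the enum values and the function decide are not part of the facility's API): migrate only when the local one-minute load average exceeds the upper threshold and some other machine sits below the lower one; otherwise stay, or sleep and poll again.

```c
#include <assert.h>

enum Action { STAY = 0, MIGRATE = 1, SLEEP = 2 };

/* Sketch of the upper/lower load-threshold policy. On MIGRATE, *dest is
   set to the index of the chosen destination machine. */
int decide(double local_load, const double *remote, int n,
           double upper, double lower, int *dest) {
    if (local_load <= upper)
        return STAY;                 /* local machine still acceptable */
    for (int i = 0; i < n; i++)
        if (remote[i] < lower) {     /* found a lightly loaded machine */
            *dest = i;
            return MIGRATE;
        }
    return SLEEP;                    /* poll again after a fixed interval */
}
```

With the thresholds used in the experiment (1.6 and 0.95), a worker on a machine with load 2.0 would migrate to any machine whose load is below 0.95, and sleep otherwise.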
The execution times for the engineer’s programs and the AGNES program in Case 3
are therefore related to three factors: the upper and lower load-thresholds and the polling
interval. Table 6 shows some experimental results when the upper and lower load-thresholds
and the polling interval varied.
It can be seen from Table 6 that the deterioration was quite small although the upper and
lower loads did affect the execution times for the engineers. The increase in the running
times for the engineers ranged from 0.18% to 5.8% as compared to their having exclusive
use of their machines in Case 1 for the engineers.
Our experiment has also indicated that the AGNES worker could take advantage of the
idle CPU power of the two machines. Figure 4 compares the cascade execution of the digital
filter design program and AGNES program with that of the migrating AGNES program of
Case 3. In the cascade cases, the digital filter design program would run to completion before
the AGNES program could start. When the load figures were 1.6 and 0.95, the migrating
AGNES worker took 14153 s to complete. As can be seen from Figure 4, the migrating
AGNES worker finished 6042 s earlier than the cascade run on Machine 1 and 8965 s
earlier than on Machine 2, resulting in a time saving of 30% to 39%.
Furthermore, when the load figures were 1.6 and 0.95, the AGNES worker was executed
on Machine 1 exclusively for (14153 − 12378) = 1775 s after Engineer 1 stopped. But
the worker took at least 8455 s to complete. Therefore, it stole (8455 − 1775) = 6680 s
from the two engineers. The maximum possible idle time (editing time) for the engineers
was (3346 + 5055) = 8396 s, out of which 6680 s were used by the AGNES program. It
can therefore be claimed that 74% of the idle CPU power was utilized.
Figure 4. The migrating AGNES worker utilized 74% of idle CPU power
Table 6. Execution times for Case 3 of AGNES and the engineers when the upper and lower load
thresholds and polling interval varied
receive the integers and starts to migrate to another machine whenever it receives an integer
divisible by 5, i.e. after receiving the integers 5, 10, 15, etc. More than 15 test runs for
each value of N were performed and message losses as well as the number of migrations
were recorded. The test runs were executed at various times during working hours of the
day when the Ethernet was being utilized by an average of 11 people performing tasks like
word-processing, editing, compiling, UNIX job execution, file transfer and host access, etc.
The number of migrations differs from case to case due to the nature of the receiver which
‘cleans up’ any pending messages in the receive buffer before migrating. For example, the
receiver starts migrating after receiving the integer 15. At this point, the incoming buffer
might have been filled with other integers because the sender is constantly sending them.
The receiver should appropriately handle these outstanding messages by first ‘freezing’
itself (by calling mpvm_freeze()) and then processing the messages in the receive buffer, if
any, before calling mpvm_migrate().
The test results indicate that more than 99% of the lost integers were the first ones after
integers divisible by 5, e.g. 6, 31, 46, 101, etc. These were the integers which were sent
at the critical moment when the receiver was either close to migrating or in the process of
migrating.
Figure 5 shows the relationship among the total number of messages, the number of lost
messages and the number of migrations for each test case. It can be seen that the number
of lost messages increases approximately linearly when the number of migrations exceeds
about 350. This is attributed to the race condition being aggravated by the growing length
of the TID-lookup table maintained by the mpvmd as the number of migrations increases.
As each migration adds a new process entry to the table, response times of the calls to the
mpvmds become longer.
Figure 6 shows the relationship between the lost messages and their sequence numbers.
The message sequence numbers run from 0 to 3999. Figure 6
indicates that later messages may have much greater probability of being lost than the earlier
ones due to the increasing length of the table.
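The growing TID-lookup table can be pictured as a forwarding chain: each migration appends an (old TID → new TID) entry, and resolving a message's destination walks the whole chain, so lookup work grows with the number of migrations. The toy model below uses our own names, not mpvmd's internals, purely to illustrate why later lookups cost more.

```c
#include <assert.h>

#define MAX_MIG 1024
static int from_tid[MAX_MIG], to_tid[MAX_MIG];
static int n_entries = 0;

/* Record one migration: messages for `old_t` must now go to `new_t`. */
void record_migration(int old_t, int new_t) {
    from_tid[n_entries] = old_t;
    to_tid[n_entries]   = new_t;
    n_entries++;
}

/* Resolve the current TID by following forwarding entries in order;
   every additional migration lengthens this scan. */
int resolve(int tid) {
    for (int i = 0; i < n_entries; i++)
        if (from_tid[i] == tid)
            tid = to_tid[i];
    return tid;
}
```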
Our experiment simulated the extreme case where the receiver will migrate immediately
after checking that its local message queue is empty. We have performed another experiment
that showed that if the receiver had been made to wait for about 0.2 s after being frozen and
before it checked for pending messages, all transient messages would arrive at the receiver,
resulting in zero message losses for all the cases of the experiment.
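The remedy amounts to: freeze, wait long enough for in-flight messages to land, then drain everything pending before migrating. The sketch below uses an in-memory queue standing in for the PVM receive buffer; the 0.2 s figure is from our experiment, and the names (Queue, drain_pending) are illustrative, not a PVM API.

```c
#include <assert.h>

#define QCAP 64
typedef struct { int msg[QCAP]; int head, tail; } Queue;

static int dequeue(Queue *q) {
    if (q->head == q->tail) return -1;   /* queue empty */
    return q->msg[q->head++];
}

/* Drain every pending message into `out`; returns the count drained.
   In the real facility a ~0.2 s settle delay sits between mpvm_freeze()
   and this check, so that transient messages have arrived. */
int drain_pending(Queue *q, int *out) {
    int n = 0, m;
    /* (settle delay would happen here) */
    while ((m = dequeue(q)) != -1)
        out[n++] = m;
    return n;
}
```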
6. DISCUSSIONS
The development and experimental evaluation of a new load migration scheme are reported. The basic
idea is to migrate a process by starting a new process on another processor with the
checkpoint data transferred from the old process to the new one. The checkpoint data are
the intermediate results generated by the old process and transferred automatically by the
migration system. Our facility is easily portable and easy to use because it is built on top
of the PVM and the user does not need to know the details of the migration mechanism. It
is also faster than migrating a live process because the amount of data to be transferred is smaller.
ACKNOWLEDGEMENTS
The authors are greatly indebted to the two anonymous referees who examined the first version
of this paper. We especially express our gratitude to one of the referees for his/her three-page
comments, which have helped to improve the readability of our paper tremendously.
REFERENCES
1. P. Krueger and R. Chawla, ‘The Stealth distributed scheduler’, The 11th Int. Conf. on Distr.
Computing Systems, 1991, pp. 336–343.
2. R. E. Benner and J. M. Harris, ‘Applications, algorithms, and software for massively parallel
computing’, AT&T Tech. J., Nov. 1991, 59–72.
3. R. Fatoohi and S. Weeratunga, ‘Performance evaluation of three distributed computing
environments for scientific applications’, Supercomputing ’94, pp. 400–408.
4. W. T. C. Kramer et al., ‘Clustered workstations and their potential role as high speed compute
processors’, RNS-94-003, NASA Ames Research Center, 1994.
5. R. E. Ewing et al., ‘Distributed computation of wave propagation models using PVM’, IEEE
Parallel Distrib. Technol., 2, (1), 26–31 (1994).