
Simulation Practice and Theory 5 (1997) 167-184
An algorithm for parallel combined continuous
and discrete-event simulation
Dirk L. Kettenis
Computer Science Department, Agricultural University, Dreijenplein 2,
6703 HB Wageningen, The Netherlands
Received 6 May 1995; revised 9 March 1996
Abstract
As the level of detail and the complexity of a model increase, the computation time needed to
simulate it grows rapidly. A parallel computer system consisting of many processors may
be needed to generate the behavior of a model within a reasonable time. Therefore, an algorithm for
parallel combined continuous and discrete-event simulation has been designed and implemented. In
this article the algorithm is described and its performance is reported. The developed
combined simulation algorithm extends the parallel discrete-event simulation algorithm Time Warp
to combined simulation. Experiments show that, compared to the sequential implementation
of the model, a modest speed-up can be achieved.
Keywords: Combined simulation; Parallel simulation; Simulation language
1. Combined simulation
Because of its flexibility, computer simulation has replaced mathematical analysis
and physical prototyping as the preferred method of research in many applications. The
most important types of models considered in computer simulation can be categorized
in continuous models and discrete-event models. Discrete-event models are applied in,
among others, operations research to study queueing problems. Continuous models are
applied mainly to study the dynamical behavior of technological systems. Often, the time
scales on which phenomena occur in reality or in a model are far apart. In a continuous
model, a phenomenon that happens fast in relation to other phenomena can be modeled
by a discontinuity or by a discrete event or both. For example, the dynamics of a ball are
Email: kettenis@rcl.wau.nl.
0928-4869/97/$17.00 Copyright © 1997 Elsevier Science B.V. All rights reserved
PII S0928-4869(96)00004-3
governed by the laws of gravitation known as Newton's laws. The mathematical model is
a second-order differential equation, and, therefore, is of continuous type. When the ball
bounces on an obstacle, such as a wall or floor, Newton's laws are not valid anymore.
However, the time scale of the bounce process is fast relative to the time scale of the
moving ball. The bounce process, therefore, is often modeled by a discontinuity and a
discrete event. This example shows that even a simple system may be modeled by a
combined continuous and discrete-event model.
The technique of combining continuous and discrete-event models has been proposed
for the first time by Fahrland [7]. Although the idea of combined simulation is rather
old, the number of combined simulation languages is rather limited. One such language
is COSMOS [13,14].
As the level of detail and complexity of a model increase, the computation time to simu-
late the system grows rapidly. The speed of computation may then become a limiting
factor. Since hardware costs are decreasing,
multiprocessor computer systems are becoming affordable for many applications. Un-
fortunately, at present, programming of (massively) parallel computer systems is still a
task for specialists. Most simulation users do not have the necessary skills. A simulation
language implemented to generate programs that run on such computer systems will
give the simulation community far better access to these advanced computers. Parallel
discrete simulation languages exist, to mention a few: SIM++ [1], MAISIE [2], and
YADDES [18]. Some well-known discrete-event simulation languages, such as GPSS,
SIMSCRIPT, SLAM, and SIMAN, have been implemented on so-called vector computers
like the CRAY. The same holds for continuous simulation languages such as ACSL and CSSL IV.
Recent developments in ESL [9] support parallel simulation to a limited degree. To
my knowledge there is no combined simulation language compiler available that gener-
ates code for message passing multiprocessor computers. Therefore, a project has been
started to study whether it is feasible to implement the combined simulation language
COSMOS to generate code for such parallel computer architectures. This article focuses
on recently developed algorithms for combined simulation on such systems. In [15] a
more elaborate treatment of the project has been presented.
The outline of this article is as follows. Because some aspects of the developed
algorithms are influenced by features of the COSMOS language, some characteristics
of this language and its implementation will be discussed first. In the next section
an overview of parallel discrete-event simulation and parallel continuous simulation
is presented. Thereafter the developed parallel combined simulation algorithm will be
discussed. This algorithm can be used for other combined simulation languages as well.
The article will be concluded with a discussion of performance of the algorithm based
on one specific model and the article ends with conclusions.
2. The simulation language COSMOS
COSMOS is a simulation language for description and implementation of continuous,
discrete-event, or combined continuous and discrete models. For continuous models, the
COSMOS language definition constitutes a superset of CSSL [20] with many additional
structuring elements being incorporated, for example, for convenient modeling of piece-
wise continuous models. In a piecewise continuous model discontinuous behavior occurs
at a limited number of instants.
In COSMOS, discrete-event models are coded according to the so-called process-
interaction world view. In the process-interaction world view, acting entities are modeled
through processes. Each process may be thought of as a pattern or template. An active
version of a process (a so-called object) has to be created and activated explicitly. The
most important advantage of the process-interaction world view is that a system can
be composed of several objects, preferably objects that can be distinguished in the real
system. Composing a model of several objects has become known as object-oriented
modeling. In this respect, COSMOS is an object-oriented simulation language, and it
has extended the object-oriented approach towards continuous simulation. In a COSMOS
program, continuous objects and discrete objects cooperate to model a system. Since
one may create several objects of each process type, models with a variable structure can
be built easily. For process-view discrete-event simulation languages, variable structure
models are common. However, very few continuous simulation languages permit the
user to change the structure of the model and with the structure the total number of
differential equations present in the model.
As has been discussed above, COSMOS provides structures to partition a model into
objects. Since COSMOS supports variable structure models, the object-oriented structure
must be preserved at run time. This is despite the inefficiencies caused by information
exchange on a continuous basis among continuous objects. Continuous communication
requires special attention to the numerical integration process responsible for solution
of the differential equations. To prevent stability and accuracy problems while solving
the set of differential equations composed of all continuous objects active in the model,
the differential equations have to be solved synchronously with respect to the different
objects. A well-known Runge-Kutta integration algorithm is:

k_0 = f(t_n, y_n),
k_1 = f(t_n + h/3, y_n + (h/3)k_0),
k_2 = f(t_n + (2/3)h, y_n + h(-(1/3)k_0 + k_1)),
k_3 = f(t_n + h, y_n + h(k_0 - k_1 + k_2)),
k_4 = f(t_n + h, y_n + h((1/8)k_0 + (3/8)k_1 + (3/8)k_2 + (1/8)k_3)),
y^3_{n+1} = y_n + h((1/8)k_0 + (3/8)k_1 + (3/8)k_2 + (1/8)k_3),
y^4_{n+1} = y_n + h((1/8)k_0 + (3/8)k_1 + (3/8)k_2 + (1/16)k_3 + (1/16)k_4).   (1)
In (1) f, y, and k_0 to k_4 are vectors. This algorithm computes new states in five
stages in which the evaluation of the derivatives at different locations is required. The
derivatives are stored in k_0 to k_4. The difference between the third-order solution y^3_{n+1}
and the fourth-order solution y^4_{n+1} is used to estimate the error in each integration step. The
error estimate is used to verify whether the computations have been accurate enough
and to compute the step size h for the next integration step.
A synchronous Runge-Kutta algorithm computes ko for all differential equations of
the communicating objects. After stage one is finished the second stage starts based
on updated state values exchanged among the objects. This process is repeated until
all stages of the integration algorithm have been executed. The sketched procedure
gives correct results only if the objects communicate through state variables. In the
current version of the software communication among objects through so-called auxiliary
variables is not permitted. This is not a real limitation, since auxiliary variables can
always be expressed by state variables.
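The synchronous stage-by-stage procedure described above can be sketched as follows. The object classes, the coupling through a state dictionary, and the two-stage midpoint (RK2) scheme are illustrative assumptions chosen for brevity, not the COSMOS implementation.

```python
# Sketch of one synchronous integration step for two communicating
# continuous objects.  After every stage the provisional states are
# exchanged before the next stage is computed, as described in the text.

class ContinuousObject:
    def __init__(self, name, y0, deriv):
        self.name = name
        self.y = y0          # state variable visible to other objects
        self.deriv = deriv   # f(t, states-of-all-objects)

def synchronous_step(objs, t, h):
    """Advance all objects of one cluster together by one step h."""
    y0 = {o.name: o.y for o in objs}
    # Stage 1: every object evaluates its derivative on the same states.
    k1 = {o.name: o.deriv(t, y0) for o in objs}
    # Exchange provisional midpoint states before stage 2 starts.
    y_mid = {n: y0[n] + 0.5 * h * k1[n] for n in y0}
    k2 = {o.name: o.deriv(t + 0.5 * h, y_mid) for o in objs}
    for o in objs:
        o.y = y0[o.name] + h * k2[o.name]

# Coupled pair: a' = -b, b' = a (a harmonic oscillator split over two
# objects that communicate only through state variables).
a = ContinuousObject("a", 1.0, lambda t, s: -s["b"])
b = ContinuousObject("b", 0.0, lambda t, s: s["a"])
for i in range(100):
    synchronous_step([a, b], i * 0.01, 0.01)
# After t = 1 the pair approximates (cos 1, sin 1).
```

Because each stage uses states exchanged at the end of the previous stage, both objects always evaluate their derivatives on mutually consistent values, which is exactly why communication must go through state variables.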
A discontinuity in a piecewise continuous model is described by a relation, for example
x ≥ y. To deal with discontinuities properly, three tasks have to be performed: (1)
Detect presence of a discontinuity in an integration step, (2) locate the earliest discon-
tinuity, and (3) pass the earliest discontinuity. It is of the utmost importance that the
model will not change within one integration step. Therefore, the conditions in the model
will not be changed during one integration step. A well-known technique is to create a
discontinuity function or switching function from the relation specified in the model. For
example, x ≥ y will generate the discontinuity function Φ = x - y. At each mesh point of
the numerical integration algorithm the discontinuity functions are evaluated. In case a
discontinuity function has changed sign within one integration step, a discontinuity has occurred.
The task of locating the discontinuity is to find the zero of the switching functions. Zero-
finding routines are iterative procedures; Regula Falsi is a popular one. The first zero is
chosen as the next mesh point. The actions belonging to the discontinuity have to be carried
out, and the condition in the model has to change its value, before the next integration step
can be taken.
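The three tasks can be sketched concretely. The particular switching function below, derived from an assumed relation x ≥ y with x(t) = t² and y(t) = 4, is only an example; any discontinuity function from the model would take its place.

```python
# Sketch: detect and locate a discontinuity via a switching function.
# The relation x >= y yields phi(t) = x(t) - y(t); a sign change of phi
# between two mesh points signals a discontinuity within that step.

def phi(t):
    # Illustrative switching function: x(t) = t*t, y(t) = 4,
    # so the crossing lies at t = 2.
    return t * t - 4.0

def detect(t0, t1):
    """Task 1: has phi changed sign within the step [t0, t1]?"""
    return phi(t0) * phi(t1) < 0.0

def regula_falsi(t0, t1, tol=1e-10, max_iter=100):
    """Task 2: locate the zero of phi inside the bracket [t0, t1]."""
    f0, f1 = phi(t0), phi(t1)
    for _ in range(max_iter):
        t = t1 - f1 * (t1 - t0) / (f1 - f0)   # secant through the bracket
        f = phi(t)
        if abs(f) < tol:
            return t
        if f0 * f < 0.0:                      # keep the sign-changing half
            t1, f1 = t, f
        else:
            t0, f0 = t, f
    return t

# Task 3: the located zero becomes the next mesh point; the event
# actions are executed there and integration resumes afterwards.
step = (1.5, 2.5)
if detect(*step):
    t_event = regula_falsi(*step)
```

In a combined model with several switching functions, the earliest of the located zeros would be chosen as the next mesh point, as the text prescribes.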
3. Parallel simulation
Parallel simulation tries to exploit the parallelism present in the model and present in
the techniques employed to experiment with the model. In this section, a survey of the
research in parallel continuous simulation and parallel discrete-event simulation will be
presented. Each section will be concluded with remarks on how the research relates to the
parallel algorithm described in this article.
3.1. Overview of parallel continuous simulation
The part of the research in parallel continuous simulation of interest for the research
presented in this article is oriented towards speed-up of the solution process of dif-
ferential equations. The basic idea is that speed-up of computation can be achieved
by distributing computational tasks across several processors (so-called nodes) of the
parallel computer. The tasks consist of computation of the derivatives, where the model
equations are involved, and the numerical integration. Approaches for distributing com-
putational tasks, therefore, can be placed into two basic categories [12]. One category
is known as the equation segmentation approach, the second category can be named par-
allel algorithm approach. In the equation segmentation approach, the model equations
are partitioned and distributed across the processors of the computer system. Classical
numerical integration algorithms are applied. In the parallel algorithm approach often
classical numerical integration algorithms are adapted to exploit parallelism.
In an equation segmentation implementation each processor is responsible for calcula-
tion of derivatives and updates of the states assigned to it as a result of the partitioning.
In general the load balance is not optimal since some differential equations are more
time-consuming than others. Furthermore, caused by the imbalance in the load, syn-
chronization problems occur since processors have to wait for information of other pro-
cessors. Since communication among processing units is slow compared to processing
speed, communication inhibits speed-up of the computational task drastically. In general,
for each model there will be a (sometimes relatively small) number of processors to
yield optimum speed-up.
In the parallel algorithm approach, the full model is replicated on each of the available
processors and a parallel integration algorithm is applied. Often the parallel integration
algorithm is an adapted version of a sequential algorithm. Birta and Abou-Rabia [3],
for example, adapted the so-called block implicit method to parallel computer systems.
Another development is based on multistage, implicit, one-step Runge-Kutta methods
[21]. In the mentioned methods the workload in the different stages of the integration
algorithm is comparable; therefore, a reasonable workload balance will result. Since the
work of each stage of the algorithm is carried out by one processing unit, this method,
however, will limit the speed-up factor to the number of stages of the numerical
integration algorithm.
Kerckhoffs [12] reports that numerical features of the known parallel integration
methods are generally worse than those of the classical sequential integration methods.
Since in COSMOS the user partitions the model equations into various continuous objects,
it was decided to implement the equation segmentation approach and apply adapted
traditional integration algorithms in the current study.
3.2. Overview of parallel discrete simulation
Each real system consists of one or more physical entities. In parallel discrete-event
simulation, each entity will be represented in the model by a process object. On a
multiprocessor computer the process objects may be implemented on different processors
and may communicate through so-called event messages. In discrete-event models some
events will be mutually dependent. The order in which dependent events are processed
has to be preserved. Other events are independent and such events can be processed
in arbitrary order or concurrently. In a parallel realization of discrete-event models
one must take care to safeguard the order in which events are processed, so that no
event can occur unless all events on which it causally depends have already occurred.
One approach to constructing parallel programs is to recognize parallelism in existing
sequential programs. This approach does not seem to be generally successful. It is
particularly poor for discrete-event simulation because of frequent manipulation of the
event-list [6]. Alternative approaches are required. Fujimoto and Nicol [8,17] presented
an overview of the research in the field of parallel discrete-event simulation.
The methods applied to synchronize different process objects in a parallel simulation,
can be classified into two categories:
- Conservative synchronization. In the conservative approach, simulation time can
  only be advanced if it can be assured that essential constraints are not broken
  [4-6].
- Optimistic synchronization. In the optimistic approach, simulation time can be
  advanced until a violation of essential constraints is discovered. Simulation
  time is rolled back in case such a violation occurs [10,11].
In parallel discrete-event programs objects communicate with each other by so-called
time-stamped messages. The time stamp of the message specifies, among others, the
receive time meaning the simulation time for processing the event. Progress of objects,
in fact, is initiated by the mentioned messages. Each object has associated with it
a variable called local simulation time that records simulation time of the last event
executed by that object. In the conservative methods all objects in the simulation progress
forward in simulation time together, with no object ahead in time of any other as long as
there is still a chance that messages may arrive at such an object. An object in the model may
receive messages from several other objects and may send messages to several other
objects. Each incoming communication link has its own input message queue. Since the
conservative simulation algorithm cannot take into account non-existing incoming links,
the communication topology and, therefore, the number of objects must be fixed.
The most prominent implementation of the optimistic approach is Time Warp [10,11].
Time Warp speeds up simulations by automatically exploiting concurrency available in
the model. Each object acts independently of other objects, except that each object considers
the messages arriving in its input message queue. Since an object has no notion of global
time, the term virtual time is used instead of simulation time. Each object contains its
own so-called Local Virtual Time (LVT). The LVT acts as the simulation clock for that
object. Because each object acts independently from other objects in Time Warp, in any
snapshot of a simulation some objects will have LVT values greater than others. In the
Time Warp mechanism there is no notion of a fixed communication graph or network
connecting the objects. On the contrary, any object may interact with any other object
at any time. In Time Warp, each object receives and acts upon messages in the input
queue one by one in the order of their time stamps until the queue is exhausted. An
object never waits until it can safely process the next message; it always charges
ahead. An object will be blocked only when its input queue is exhausted, and then only
until another message arrives.
Since the LVT of each object differs, a message can arrive at an object whose LVT
is greater than the receive time of the incoming message. In other words, the message
arrived in the (local) past. Such a message is called a straggler. Events have to be
processed in the right time order, so when a straggler arrives a causality error appears.
Since each message might change the state of the object and could cause event messages
to other objects, the causality error has to be corrected. In case of a causality error, the
receiving object will roll back to an earlier state before or equal to the time stamp of the
straggler. To be able to roll back, the Time Warp mechanism saves the state and input
messages of the objects. After restoring a previous state, the simulation starts simulating
forward again. For that purpose the input messages that have arrived since the moment
of the restored state of the object will be processed again. Restoring the object that
receives the straggler is not sufficient, since the object may have sent messages to other
objects during the period the object charged ahead incorrectly. Possibly these messages
would not have been sent had the straggler arrived earlier, or they could have had
a different impact. To cancel the effects of these event messages, so-called antimessages
have to be sent to objects that have received the original event messages. An antimessage
may arrive at an object at a moment the original message has already been processed.
Therefore, the receiving object has to roll back too. Consequently, rollbacks most likely
will propagate in the model.
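The Time Warp mechanism just described can be illustrated with a minimal sketch. The class below, its field names, and the simple snapshot scheme are assumptions made for illustration; a real Time Warp kernel additionally manages antimessage queues, fossil collection, and global virtual time, which are omitted here.

```python
# Minimal sketch of Time Warp state saving and rollback for one object.
# Field names and the snapshot policy are illustrative assumptions.

class TWObject:
    def __init__(self, state):
        self.lvt = 0.0         # Local Virtual Time
        self.state = state
        self.saved = []        # (lvt, state) snapshots for rollback
        self.sent = []         # (send_lvt, msg): basis for antimessages

    def process(self, msg_time, new_state, out_msgs=()):
        # Save the current state before charging ahead to msg_time.
        self.saved.append((self.lvt, self.state))
        self.lvt = msg_time
        self.state = new_state
        self.sent.extend((msg_time, m) for m in out_msgs)

    def receive(self, msg_time):
        """Return the antimessages to emit if msg_time is a straggler."""
        if msg_time >= self.lvt:
            return []                    # normal message, no rollback
        # Straggler: restore a state at or before the straggler's time.
        while self.saved and self.lvt > msg_time:
            self.lvt, self.state = self.saved.pop()
        # Cancel every message sent during the rolled-back period.
        anti = [m for t, m in self.sent if t > msg_time]
        self.sent = [(t, m) for t, m in self.sent if t <= msg_time]
        return anti

obj = TWObject(state=0)
obj.process(10.0, 1, out_msgs=["e1"])
obj.process(20.0, 2, out_msgs=["e2"])
anti = obj.receive(15.0)   # straggler with receive time 15
```

After the straggler arrives, the object is back at local virtual time 10 and the antimessage for "e2" is returned; sending it to the original receiver is what makes rollbacks propagate through the model.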
Since in COSMOS both the objects active in the simulation and communication links
among the objects may change, the basic Time Warp algorithm has been extended in
this research towards handling continuous objects in addition to discrete objects.
4. A view on parallel combined simulation
In this section the view of the author on parallel combined simulation will be presented.
Although the ideas will be presented in relation to the simulation language COSMOS,
it is believed that they will apply to other simulation languages as well.
In a combined model, in general, there will be discrete and continuous objects.
Compared with discrete objects, continuous objects will require a larger part of the
computing power of a node of the multiprocessor. One may think of partitioning the
computational task of solving the differential equations of one continuous object over
several nodes. Since in COSMOS objects can be created and deleted during the lifetime
of the simulation, it is easier to keep the equations of one object in one module.
Objects in a model will cooperate with other objects. Cooperation will lead to two
categories of messages: event messages and informational messages. The exchange of the
value of a variable is an example of an informational message. Event messages are equivalent
to the so-called scheduling statements of process-interaction discrete-event simulation
languages. A discrete or continuous object may send event and informational messages
to discrete objects and continuous objects. An event message sent by a continuous object
can only be a result of fulfillment of a certain condition such as crossing a threshold.
Apart from information exchange at discrete moments, some continuous objects com-
municate continuously. As has been discussed in Section 2 the solution of the differential
equations in a continuous model has to be carried out in a synchronous way. This needs
many messages and the transfer of a lot of data. On a parallel computer the organiza-
tional overhead must be limited. Therefore, synchronization will be restricted to objects
that communicate with each other on a continuous basis; continuously communicating
objects will be placed in what will be called a cluster. In many small and medium
scale continuous models all objects communicate in a continuous fashion. Therefore, in
these models there will be only one cluster of continuous objects and speed-up will be
limited. In a large scale combined model normally only a limited number of continu-
ous objects exchange information. Consequently, we may expect that some large scale
models consist of several clusters of communicating continuous objects. An example is
a large scale road traffic model in which the dynamical behavior of the moving objects
is described by differential equations. In such a system only objects moving close to-
gether will affect dynamics of each other. The integration process of one cluster will be
independent of other clusters in the model. The integration step, for example, will be
chosen per cluster and may be different from the size of the step of another cluster.
During parts of an experiment with a model there may be continuous objects that do
not communicate continuously with other continuous objects. These objects will be called
individual continuous objects in the sequel. Individual continuous objects will progress
independently. Since in COSMOS the communication links may change dynamically, a
continuous object may be a member of several clusters during a simulation study.
Moreover, a continuous object may be an individual object from time to time. The
distribution of objects across clusters changes only at some of the discrete events in the
discrete part of the model or discontinuities in the continuous part of the model.
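The grouping of continuously communicating objects into clusters can be sketched as computing the connected components of a communication graph. The graph encoding below (objects as keys, continuous links as edge pairs) is an assumption made for illustration.

```python
# Sketch: partition continuous objects into clusters.  Objects joined
# by continuous-communication links end up in one cluster; objects
# without such links become individual continuous objects.

def clusters(objects, links):
    """Connected components of the continuous-communication graph."""
    neighbours = {o: set() for o in objects}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for o in objects:
        if o in seen:
            continue
        comp, stack = set(), [o]          # simple depth-first search
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(neighbours[n] - comp)
        seen |= comp
        result.append(comp)
    return result

# Road-traffic flavoured example: cars 1-3 drive close together,
# cars 4-5 form a second group, and car 6 drives alone.
parts = clusters([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Because the links may change at discrete events or discontinuities, such a partition would have to be recomputed at exactly those moments, which matches the statement that the distribution of objects across clusters changes only there.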
In the developed combined simulation algorithm, discrete objects, individual contin-
uous objects, and clusters of continuous objects progress individually. So, there is a
chance that a straggler arrives at a continuous object. In such a case the continuous
object must roll back. When such an object belongs to a cluster, all objects belonging to
that cluster have to roll back. To enable continuous objects to roll back, the status
of these objects must be saved. The status includes so-called state variables and state
of discontinuity functions. Solving the differential equations backwards to the receive
time of the straggler is an alternative for storing values of state variables. Since solution
is not guaranteed to be stable when differential equations are solved in the negative
direction of the independent variable, this method has not been used.
5. Node manager
In the developed algorithm each processor has a program to organize the work of the
objects allocated to that processor. In the sequel this node manager will be discussed
in more detail.
5.1. Algorithm for the node manager
Continuous and discrete objects progress in different ways. A discrete object pro-
gresses by handling received messages. A continuous object progresses based on numer-
ical integration, and incoming messages are handled when a continuous object reaches
receive time of the message. The algorithm for the node manager presented in Fig. 1 is
for discrete objects and individual continuous objects. If there are no continuous objects
on the node, the given algorithm performs as the Time Warp algorithm for discrete-event
simulation. When, during the simulation, all individual continuous objects have been re-
moved from the model and there are no messages for objects residing on the node,
the inner WHILE-loop comes to an end. Since new (continuous) objects are created
as the result of a message, the WHILE-loop starts again when a new message arrives.
In case a message arrives with a receive time larger than the LVT of the receiving
continuous object, the numerical integration algorithm has to choose one of the future mesh
points of the integration algorithm to coincide with receive time of the message. When
the message is a straggler, the receiving object has to roll back to a moment before or
at the receive time of the straggler. Rollback includes restoring values of state variables
and discontinuity functions, and cancelling messages sent by the object. These messages
include messages sent as a result of crossing a discontinuity function. When the object
arrives at the receive time of the message, the accompanying activity can be processed.
After that moment the object may continue to charge ahead.
A processor may be responsible for execution of more than one object. In the event
that more than one object is ready to execute, the node manager has to decide which
of the objects is to be processed first. Preiss et al. [19] give the following three strategies
for incrementing time of discrete objects: Round-Robin, Minimum-Virtual-Time,
and Minimum-Message-Time-stamp. For the developed implementation the Minimum-
Message-Time-stamp strategy has been chosen for discrete objects. The continuous
object with the smallest LVT will be processed first. In fact, the chosen algorithm is the
Minimum-Virtual-Time strategy of parallel discrete-event simulation. Application of the
Minimum-Virtual-Time strategy for continuous objects allocated to one node helps keep
the LVTs of the objects close together. If the workload balance of the nodes is
perfect, the progress of time of all continuous and discrete objects in the model is in accor-
dance with each other. Consequently, it is expected that the number of stragglers and,
therefore, rollbacks will be kept to a minimum. Progress of objects will be discussed in
more detail in the next section.
5.2. Progress of objects
In Fig. 1 the progress in time for individual continuous objects is depicted as "compute
a step for x". In the process of computing a step, the routine of the chosen numerical
integration algorithm is called. As described in Section 2, at each step the discontinuity
functions of the object, if present, have to be evaluated and any discontinuities present
have to be located. The step actually taken may be smaller than expected because of
discontinuities and, when a variable step integration algorithm is applied, because of accuracy
WHILE not terminate DO
    WHILE there is at least one object with
          an input message OR continuous objects DO
        Let t_min be the smallest receive time of all
            messages for objects on this node
            OR infinity if there is no such message;
        IF there is a continuous object with LVT < t_min
        THEN
            Let x be a continuous object with smallest LVT;
            Let m be the receive time of the first input message to x
                OR infinity if none exists;
            Compute a step for x, step size <= m - LVT_x;
        ELSE
            Process a message with receive time t_min;
        END IF;
    END WHILE;
    Wait until a message arrives;
END WHILE;

Fig. 1. Algorithm for node manager.
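The node-manager loop of Fig. 1 can be rendered in Python roughly as follows. The flat list of message receive times, the nominal step size of 1.0, and the finish condition at time 5.0 are simplifying assumptions for illustration; in the real algorithm the step size comes from the integration routine and messages are queued per object.

```python
# Rough Python rendering of the node-manager loop of Fig. 1.  The
# message representation and step/finish constants are illustrative.

INF = float("inf")

def node_manager_pass(messages, continuous):
    """One inner-loop pass.  `messages` is a list of receive times,
    `continuous` maps object name -> LVT.  Returns a trace of the
    actions taken, in order."""
    trace = []
    while messages or continuous:
        t_min = min(messages) if messages else INF
        lagging = {n: t for n, t in continuous.items() if t < t_min}
        if lagging:
            # Minimum-Virtual-Time: step the continuous object with the
            # smallest LVT, and never step past the pending message.
            name = min(lagging, key=lagging.get)
            step = min(1.0, t_min - continuous[name])   # nominal step 1.0
            continuous[name] += step
            trace.append(("step", name, continuous[name]))
            if continuous[name] >= 5.0:                 # finish condition
                del continuous[name]
        else:
            messages.remove(t_min)
            trace.append(("event", t_min))
    return trace

trace = node_manager_pass(messages=[2.5], continuous={"x": 0.0})
```

In this run the continuous object is stepped until one of its mesh points coincides with the message's receive time, the event is then processed, and the object charges ahead until the finish condition, mirroring the behaviour described in the text.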
[Figure: time axes (LVT) of continuous objects 1 and 2 and discrete objects 1 and 2;
marks on the axes are mesh points of the numerical integration, numbered in the order
in which they are processed; arrows mark the receive times of messages I and II.]

Fig. 2. Progress of objects.
requirements. After the computations needed for a step have been executed, control will
be passed to the node manager to decide what has to be done next.
The manner in which objects progress in time is illustrated in Fig. 2. In this figure,
the order in which time steps of the integration algorithm are processed is indicated
by Arabic numerals. Messages are indicated by Roman numerals. The figure shows that
the individual continuous object with lowest LVT is progressed first. Each individual
continuous object has its own step size. According to the algorithm given in Fig. 1, the
message indicated with I will be dealt with after steps 3 and 5 have been computed. A
consequence of the chosen strategy is that, if handling message I results in a message
to, for example, continuous object 1 with a receive time smaller than the LVT at position
3, continuous object 1 receives a straggler.
In Fig. 2 message II is sent to continuous object 1. Therefore, the size of step 9 is
limited by the receive time of the message. If step 9 cannot reach the receive time of the
message, an additional step has to be computed. Only the step size of object 1 is limited
by message II; every other continuous object will progress to an LVT up to or past the
receive time of message II. In the given example, message II will be processed after
integration step 10.
In the algorithm for the node manager, a continuous object that does not receive a
message will be simulated until the finish condition of the model is reached.
5.3. Cluster objects
In the previous section, progress of individual continuous objects has been discussed.
Apart from synchronization of the integration of differential equations, progress of a
cluster is similar. The integration step size will be selected based on the receive time of
messages to all objects belonging to the cluster.
Synchronization of objects of a cluster depends on the category of the numerical
integration algorithm that is used. The discussion will be limited to the third and
fourth-order Runge-Kutta algorithm (1) presented in Section 2. In Fig. 3 the numerical
integration process is presented.
[Figure: message traffic between the controller and nodes 1 and 2 during one integration
step, plotted against computing time: requests for stages k0 and k1, requests for and
receipts of state variables, and "node ready" acknowledgements.]

Fig. 3. Cluster integration process.
To be able to synchronize the integration process of all objects belonging to a cluster
but allocated to different nodes, a controller is implemented. This controller runs on one
of the nodes, and it sends the necessary messages to the objects to execute the operations
needed to progress in time. At the end of the last stage of the integration process, each
node reports the estimate for the next step size to the controller. From the estimated
step sizes of all of the objects belonging to the cluster, the controller computes next
step size for the cluster. This synchronizing controller will be treated like an individual
continuous object in the algorithm given in Fig. 1.
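The controller's step-size decision at the end of the last stage can be sketched simply: since the whole cluster must advance with one common step, the controller takes the smallest of the per-node estimates. The safety factor and the bounds below are common refinements assumed for illustration; the article does not specify them.

```python
# Sketch of the cluster controller's step-size decision.  After the
# last integration stage each node reports a step-size estimate for
# its own objects; the cluster advances with one common step, so the
# controller picks the smallest estimate.  The safety factor and the
# min/max bounds are illustrative assumptions.

def next_cluster_step(estimates, h_min=1e-6, h_max=1.0, safety=0.9):
    """Combine per-node step-size estimates into one cluster step."""
    h = safety * min(estimates)
    return max(h_min, min(h, h_max))

h = next_cluster_step([0.05, 0.2, 0.08])
```

Taking the minimum guarantees that the accuracy requirement of the most demanding object in the cluster is respected by every node.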
As is clear from Fig. 3, many messages are involved in synchronizing the
step-forward process of cluster objects implemented on different processing nodes.
The process of bringing a cluster a step forward is often blocked
until new information is available. During the time the cluster cannot continue with
the integration process, however, the processor of the node can execute tasks for other
clusters, individual continuous objects, and discrete objects. Nevertheless, the question
rises under what conditions the described approach will lead to a speed-up of the
simulation. This will be studied with the help of an example.
6. Performance of the developed algorithm
In this section the results of a test simulation study will be presented. As became
clear from the discussions so far, parallel combined simulation is expected to be of
service to large-scale models. Ideally, such a model would consist of:
• A large number of objects. The objects in the model are not necessarily of the
same type.
• Continuous object types for which solution of the differential equations is a
computationally intensive job. This means that the state changes of the objects are
governed by a large number of differential equations or by highly non-linear differ-
ential equations or both.
• A small number of continuous objects that interact with each other. To put it differently,
the number of continuous objects in a cluster is limited. Besides, each object needs
the value of only a few state variables of other objects. Speed-up is large when
there are many so-called individual continuous objects implemented on different
nodes.
Since no model of a real-world system that meets all these requirements was available,
an adapted example model was used.
6.1. The test model
As test model, a combined model of the soaking pit furnace in the steel-making process
has been used. In this system, relatively cool ingots arrive at a soaking pit in a steel plant.
When the ingots have to wait, they cool down as a result of the lower ambient
temperature. The soaking pit furnace heats an ingot so that it can be rolled economically
in the next stage of the steel-making process. The temperatures of the furnace and of the
ingots are each described by a linear first-order differential equation; therefore, ingot and
furnace are continuous objects. Discrete events appear in the model when the heat source
of the furnace is switched on and off. When the required temperature has been reached,
the ingot is removed from the furnace and the cold ingot with the longest waiting
time moves into the furnace. In a realistic model, cold ingots would arrive according
to some random distribution and the temperature of an arriving ingot might
vary. To be able to measure the influence of the number of processing nodes used to
simulate this system, the number of continuous objects and their arrival temperature are
kept constant during the study.
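The continuous part of this model can be sketched as follows (Python, with forward Euler instead of the variable-step Runge-Kutta method actually used, and with invented coefficients, since the paper gives no parameter values):

```python
def simulate_ingot(t_end, dt=0.01, k=0.1, T_ingot0=300.0,
                   T_furnace0=1200.0, T_on=1150.0, T_off=1250.0,
                   heat_rate=5.0, cool_rate=2.0):
    """One ingot heating in the furnace.

    dT_ingot/dt = k * (T_furnace - T_ingot)  -- first-order linear.
    Switching the heat source on (below T_on) and off (above T_off)
    are the model's discrete events.  All coefficients are invented
    for illustration only.
    """
    T_i, T_f, heater = T_ingot0, T_furnace0, False
    t = 0.0
    while t < t_end:
        if not heater and T_f < T_on:
            heater = True           # discrete event: heat source on
        elif heater and T_f > T_off:
            heater = False          # discrete event: heat source off
        T_i += dt * k * (T_f - T_i)
        T_f += dt * (heat_rate if heater else -cool_rate)
        t += dt
    return T_i, T_f

T_ingot, T_furnace = simulate_ingot(10.0)
```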
The model consists of one furnace object implemented on node 0. During the ex-
periments there were 100 ingots present in the model. The ingot objects are evenly
distributed across the, maximally 32, processing nodes. Each time an object leaves the
furnace, a new ingot object is created on the node where the departing object was located.
The furnace has room for 32 ingots; this number equals the number of nodes available on
the multiprocessor computer used for the experiments. Therefore, on every node there is
always at least one object that is part of the cluster. Since only one cluster is present,
one cluster synchronization object is active in the test program. During the experiments
this object resided on node 0, see Fig. 4.
6.2. Workload and message balance
Each node is responsible for an almost equal number of continuous objects. The
number of messages sent and received by the cluster objects, formed by the
furnace and the 32 ingots inside the furnace, however, is not balanced. Two categories of
messages are needed in the program to solve the equations of the continuous objects
of the cluster: so-called organizational messages and state messages. An overview of
the messages in this model is presented in Fig. 4. In the implemented variable-step
integration algorithm, the number of messages needed to complete one integration step depends
on the number of iterations needed to find a solution with an estimated error smaller than
the tolerance. Consequently, the part of the program that is responsible for synchronizing
the cluster sends and receives many synchronizing messages.

Fig. 4. Flow of messages in the model. The arrows distinguish: requests for the temperature of an ingot or the furnace (state message); the temperature of an ingot or the furnace (state message); requests for the next phase in computing the next state (organizational); and ready signals for the next phase in computing the state, with information on the success of the integration, a detected discontinuity, the location of the discontinuity, and the step size (organizational).
State messages are responsible for the communication among the objects in a cluster.
In the current software an object requests the value of a state variable; as a consequence,
two messages are involved in transporting one state variable. This overhead has been taken
for granted at this stage of the study. Each time there is a need to exchange state
information, the furnace object sends 64 state messages and receives 64 state messages.
This is because the furnace needs the temperature of the 32 ingots and the 32 ingots need
the temperature of the furnace. Each ingot present in the furnace sends two state messages
and receives two state messages. Consequently, the node responsible for computation of
the state of the furnace handles the major part of the state messages.
Rollback appears in the model only for an ingot at the moment the ingot enters the
furnace. This is because incrementing the clock of the individual continuous objects is
an easy task for the processing nodes; therefore it is most likely that these individual
continuous objects are ahead of the objects belonging to the cluster.
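This rollback follows the Time Warp discipline: a straggler message forces the object back to its most recent saved state at or before the message's receive time. A minimal sketch (the real run-time system also has to re-execute from the restored state and cancel output with anti-messages, which is omitted here; all names are illustrative):

```python
import bisect

class TimeWarpObject:
    """Minimal state-saving and rollback skeleton for one object."""

    def __init__(self, state):
        self.lvt = 0.0                          # local virtual time
        self.state = dict(state)
        self._saved = [(0.0, dict(state))]      # (time, snapshot) pairs

    def advance(self, new_lvt, new_state):
        """Step forward and save the new state for possible rollback."""
        self.lvt = new_lvt
        self.state = dict(new_state)
        self._saved.append((new_lvt, dict(new_state)))

    def rollback(self, msg_time):
        """Restore the latest snapshot taken at or before msg_time."""
        times = [t for t, _ in self._saved]
        i = bisect.bisect_right(times, msg_time) - 1
        self.lvt, self.state = self._saved[i][0], dict(self._saved[i][1])
        del self._saved[i + 1:]                 # drop invalidated states

# An ingot running ahead of the cluster is rolled back when the
# "enter furnace" message arrives in its past.
ingot = TimeWarpObject({"T": 300.0})
ingot.advance(5.0, {"T": 310.0})
ingot.advance(9.0, {"T": 320.0})
ingot.rollback(6.5)
```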
6.3. Experiments and results
The experiments have been executed on an nCUBE2 [16] distributed multiprocessor
system built up according to the hypercube topology. Each processor has a clock fre-
quency of 20 MHz. The system consisted of 32 general-purpose 64-bit processors with
4 Mbytes of DRAM each. Its peak performance is 73.6 Mflops in double-precision mode.
Dynamical behavior of each object in the model is governed by one first-order linear
differential equation. Because solution of a first-order linear differential equation is an
easy task, it is expected that the speed-up of the execution of the program, if present, will
be limited.

Fig. 5. Execution time versus number of equations.

A more demanding task has been created artificially by solving per object the
differential equation of the object several times. This strategy has been preferred since
the purpose of these experiments was to study the influence of an increasing workload
caused by solving differential equations. With the chosen strategy this is achieved since:
• The number of discontinuity functions and passed discontinuities is not affected.
• The numerical integration algorithm chooses the same integration step size.
• The number of state messages and organizational messages, and therefore the commu-
nication load, is not affected.
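The workload-scaling trick then amounts to evaluating the same derivative repeatedly, as in this sketch (function names and the coefficient are illustrative, not taken from the test program):

```python
def derivative(T_i, T_f, k=0.1):
    """dT_i/dt for the first-order linear ingot equation."""
    return k * (T_f - T_i)

def scaled_derivative(T_i, T_f, n_repeat, k=0.1):
    """Evaluate the same derivative n_repeat (>= 1) times.

    The numerical result, the discontinuities, and the chosen step
    size are unchanged; only the computational load grows.
    """
    d = derivative(T_i, T_f, k)
    for _ in range(n_repeat - 1):
        d = derivative(T_i, T_f, k)     # redundant work on purpose
    return d
```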
Fig. 5 presents the results of runs with 1, 10, 50, 100, and 200 equations per object.
In the figure, execution times, given in seconds, are depicted versus the number of object
equations. The given lines connect the measurements for the parallel algorithm on 1, 2,
4, 8, 16, and 32 processing nodes. To be able to study whether the parallel algorithm
gives speed-up compared to the original sequential implementation of the simulation,
the original C program generated by the (sequential) COSMOS compiler has been run
on one node of the nCUBE2. When the number of nodes increases, the number of
organizational messages and the number of state messages to solve the differential
equations of the cluster objects grows. This is shown in Table 1 where the total number
of messages sent over the communication channels by all objects during a run are given,
categorized in state and other messages. The number specified by the other messages
includes messages for the pure organization of the simulation process. Besides, event
messages to insert objects into the bank of cold ingots and to insert objects into the
furnace are included in this number.

Table 1
Number of messages sent by objects during a run

Number of nodes    State: Total / Node 0    Other: Total / Node 0
 2                  23266 / 11633             2044 /  1022
 4                  35760 / 17880             5529 /  2598
 8                  41702 / 20896            12147 /  5593
16                  44808 / 22404            24805 / 11307
32                  46316 / 23158            50421 / 22890

Both furnace object and synchronizing object are
allocated to node 0. The resulting message imbalance is illustrated by the figures given
in Table 1. Node 0 is responsible for sending approximately half of the total number
of messages. The other half of the messages is sent by the other nodes collectively.
These messages are almost evenly distributed across the nodes.
As depicted in Fig. 5, the execution time of the simulation with the sequential algorithm
carried out on one node of the parallel computer is found to be smaller than the execution
time of the simulation with the parallel algorithm running on one node. The causes are:
• Information exchange in the parallel algorithm is less efficient, since messages are
involved.
• The efficiency of the parallel simulation algorithm is lower than the efficiency of the
sequential algorithm. The node manager, saving the states of the objects, and rollback do
not appear in the sequential program.
From Fig. 5 it can be concluded that the sequential algorithm is faster than any parallel
implementation for this model if the objects have fewer than about 70 differential equa-
tions. The program running on eight nodes is faster than the other parallel implementations.
However, Figs. 5 and 6 show that the resulting speed-up differs only slightly from the
speed-up achieved with four nodes. Execution time increases if more than eight nodes
are used. This is caused by the increase in the number of messages needed to solve the
equations of the cluster objects. Execution time of the programs increases almost linearly
with the number of object equations. This can be understood since when the number of
equations increases, the additional work needed to complete the simulation is only the
solution of the additional differential equations. Besides, since the portion of additional
work can be shared by several nodes, the slope of the curve decreases when the number
of nodes increases.
In Fig. 6, the data gathered during the experiments is displayed in a different way.
In this figure the speed-up (S) is depicted as a function of number of processors for 1,
10, 100, and 200 object equations. The speed-up Sk is defined by

    Sk = T1 / Tk    (2)

where T1 is the time needed to execute the parallel simulation with one processor and Tk
the time needed to execute the parallel program with k processors.
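Both performance measures are straightforward to compute from measured run times; a small sketch with invented example timings (not the measurements of the paper):

```python
def speed_up(t1, tk):
    """S_k = T_1 / T_k: run time on one processor over run time on k."""
    return t1 / tk

def efficiency(t1, tk, k):
    """eta_k = S_k / k: the fraction of ideal linear speed-up achieved."""
    return speed_up(t1, tk) / k

# Invented example: 120 s on one node, 25 s on eight nodes.
S8 = speed_up(120.0, 25.0)
eta8 = efficiency(120.0, 25.0, 8)
```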
Fig. 6. Speed-up S of the example program versus number of processors (2-32), for 1, 10, 100, and 200 object equations.
Fig. 7. Efficiency (η) of the example program versus number of processors (2-32), for 1, 10, 100, and 200 object equations.
The efficiency ηk for a k-processor system of the parallel program is defined as
ηk = Sk/k. The efficiency of this parallel program is presented in Fig. 7.
7. Discussion
As mentioned before, the distribution of the objects across the processing nodes is
a reason for imbalance in the message handling. In the model used for the tests, both
the furnace object and the cluster synchronizing object are placed on node 0. Some
runs have been executed in which the furnace object was situated on node 1; the
cluster synchronizing object was still on node 0. This gave lower execution times, but
the results were not significantly better for execution with up to eight nodes. The reason
is that the burden of the state messages is the determining factor.
The key problem in speeding up the simulation of continuous objects is the large number
of messages involved. The number of state messages needed in the presented algorithm
can be halved if the objects of a cluster send the needed information to the other
cluster objects without a request for it. Further research is needed to find a method to
implement this in both the compiler and the run-time system.
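The expected halving can be illustrated by counting messages under the two schemes (Python sketch; the dictionary below encodes the furnace cluster's needs as described in Section 6.2, and all names are illustrative):

```python
def request_reply_messages(needs):
    """Current scheme: one request plus one reply per state variable."""
    return 2 * sum(len(vars_needed) for vars_needed in needs.values())

def push_messages(needs):
    """Push scheme: each needed variable is sent once, unsolicited."""
    return sum(len(vars_needed) for vars_needed in needs.values())

# The furnace needs one temperature from each of the 32 ingots, and
# each ingot in the furnace needs the furnace temperature.
needs = {"furnace": ["T_ingot_%d" % i for i in range(32)]}
needs.update({"ingot_%d" % i: ["T_furnace"] for i in range(32)})

total_now = request_reply_messages(needs)   # request/reply scheme
total_push = push_messages(needs)           # push scheme: half as many
```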
As is well known, sending a message to another node involves a substantial overhead,
in particular when the length of the message in bytes is small. Therefore, packing
of messages is the main means of reducing the high cost of message passing in distributed-
memory computers. This has been illustrated by experiments with the given model
in which ten times more information was exchanged per communication event: the
execution time needed was at most 2% longer. Fifty times more information exchange
led to an increase of approximately 10% in the execution time of the program. It would not be
wise, however, to delay the sending of messages too much for the sake of packing
them. Because of the delay, the progress of some objects will be held up or objects will have
to roll back, and this will inhibit speed-up of the simulation.
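Packing can be sketched as a small per-destination batching buffer (illustrative Python; the batch threshold is exactly the parameter that must be tuned against the rollback risk mentioned above):

```python
class PackingChannel:
    """Accumulate small payloads per destination node and ship each
    batch as one physical message to reduce per-message overhead."""

    def __init__(self, transport, max_batch=8):
        self.transport = transport      # callable(dest, payload_list)
        self.max_batch = max_batch
        self.pending = {}

    def send(self, dest, payload):
        batch = self.pending.setdefault(dest, [])
        batch.append(payload)
        if len(batch) >= self.max_batch:
            self.flush(dest)

    def flush(self, dest):
        batch = self.pending.pop(dest, [])
        if batch:
            self.transport(dest, batch)

# Seven small state values to node 1 travel in three physical messages.
sent = []
channel = PackingChannel(lambda dest, batch: sent.append((dest, batch)),
                         max_batch=3)
for value in range(7):
    channel.send(1, value)
channel.flush(1)
```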
8. Conclusion
The software developed has been tested with one example model. It is difficult to
draw conclusions about the speed-up of models in general based on such a limited
number of tests. Nevertheless, the conclusion is that combined simulation programs that
can be segmented into objects executed on a message-passing distributed-memory parallel
computer system result in a speed-up compared to execution of a sequential program if:
• The work for the computer to compute the derivatives of the state variables of the
model is large.
• The number of differential equations is large.
• Communicating continuous objects need only a small number of state variables.
• The model can be divided into independent clusters.
The resulting speed-up may be increased if the exchange of information among the computing
nodes can be reduced. The main efforts of future research must be oriented towards
reducing communication overhead without introducing delays. Furthermore, application
of multiprocessor computers will become more feasible when the communication speed of
the hardware can be improved and the software overhead can be decreased.
Acknowledgement
The author wishes to thank the Dutch Foundation Knowledge-Based Systems (SKBS),
Maastricht, for providing access to the NCUBE2. This computer was installed at the
Knowledge-Based Engineering Group of the Faculty of Technical Mathematics and
Informatics of Delft University of Technology, The Netherlands. The author expresses
his indebtedness to Prof. Dr. E.J.H. Kerckhoffs and his co-workers of Delft University
for their support. Furthermore, the author is grateful for the suggestions to improve
the text by Prof. M.S. Elzas, Wageningen Agricultural University, and the anonymous
referees.
References
[1] D. Baezner, G. Lomow and B.W. Unger, Sim++: The transition to distributed simulation, in: D. Nicol, ed., Distributed Simulation (SCS, P.O. Box 17900, San Diego, CA, 1990) 211-218.
[2] R.L. Bagrodia and W.-T. Liao, Maisie: A language and optimizing environment for distributed simulation, in: D. Nicol, ed., Distributed Simulation (SCS, P.O. Box 17900, San Diego, CA, 1990) 205-210.
[3] L.G. Birta and O. Abou-Rabia, Performance of a class of parallel methods for ODEs, in: A. Javor, ed., Simulation in Research and Development (North-Holland, Amsterdam, 1985) 31-36.
[4] R.E. Bryant, Simulation of packet communication architecture computer systems, Tech. Rept. MIT/LCS/TR-188, Massachusetts Institute of Technology, Cambridge, MA, 1977.
[5] K.M. Chandy and J. Misra, Distributed simulation: A case study in design and verification of distributed programs, IEEE Trans. Software Engineering 5 (5) (1979) 440-452.
[6] K.M. Chandy and J. Misra, Asynchronous distributed simulation via a sequence of parallel computations, Comm. ACM 24 (11) (1981) 198-206.
[7] D.A. Fahrland, Combined discrete-event continuous systems simulation, Simulation 14 (2) (1970) 61-72.
[8] R.M. Fujimoto, Parallel discrete event simulation, Comm. ACM 33 (10) (1990) 30-53.
[9] J.L. Hay, Real-time distributed simulation with ESL, in: A. Verbraeck and E.J.H. Kerckhoffs, eds., Proceedings of the European Simulation Symposium 1993 (SCS, P.O. Box 17900, San Diego, CA, 1993) 439-444.
[10] D. Jefferson and H. Sowizral, Fast concurrent simulation using the time warp mechanism, Part I: Local control, Tech. Rept. N-1906-AF, Rand, Santa Monica, CA 90406, 1982.
[11] D.R. Jefferson, Virtual time, ACM Trans. Programming Languages Systems 7 (3) (1985) 404-425.
[12] E.J.H. Kerckhoffs, Parallel processing and advanced environments in continuous system simulation, Ph.D. Thesis, University of Ghent, Belgium, 1986.
[13] D.L. Kettenis, The COSMOS modelling and simulation language, in: W. Ameling, ed., Proceedings of the First European Simulation Congress (Springer, Berlin, 1983) 251-260.
[14] D.L. Kettenis, COSMOS: A simulation language for continuous, discrete and combined models, Simulation 58 (1) (1992) 32-41.
[15] D.L. Kettenis, Issues of parallelization in implementation of the combined simulation language COSMOS, Ph.D. Thesis, Delft University of Technology, The Netherlands, 1994.
[16] nCUBE2 Programmers Reference Manual (nCUBE Corporation, Beaverton, OR, 1992).
[17] D. Nicol and R. Fujimoto, Parallel simulation today, Ann. Oper. Res. 53 (1994) 249-285.
[18] B.R. Preiss, The Yaddes distributed discrete event simulation specification language and execution environment, in: B. Unger and R. Fujimoto, eds., Distributed Simulation 1989 (SCS, P.O. Box 17900, San Diego, CA, 1989) 139-144.
[19] B.R. Preiss, I.D. MacIntyre and W.M. Loucks, On the trade-off between time and space in optimistic parallel discrete-event simulation, in: M. Abrams and P.F. Reynolds Jr, eds., Proceedings of the 6th Workshop on Parallel and Distributed Simulation (SCS, P.O. Box 17900, San Diego, CA, 1992) 33-42; also: Simulation 24 (3).
[20] J.C. Strauss, The SCi continuous simulation language, Simulation 9 (6) (1967) 281-303.
[21] P.J. van der Houwen and B.P. Sommeijer, Iterated Runge-Kutta methods on parallel computers, Tech. Rept. NM-R9001, Centre for Mathematics and Computer Science, P.O. Box 4079, 1009 AB Amsterdam, The Netherlands, 1990.
