
Computer Physics Communications 62 (1991) 198-216
North-Holland

Multi-million particle molecular dynamics


I. Design considerations for vector processing
D.C. Rapaport
Physics Department, Bar-Ilan University, Ramat-Gan 52100, Israel (permanent address)

and
Höchstleistungsrechenzentrum, Kernforschungsanlage Jülich, W-5170 Jülich, Germany
Received 12 January 1990

Recent progress in developing enhanced methods for carrying out molecular dynamics simulation on vector supercomputers
is described. The techniques in general use for rapid evaluation of the interactions between particles require modification in
order to allow efficient implementation within the pipelined processing environment intrinsic to practically all supercomputers. These modifications, while effective in terms of processor utilization, consume substantial amounts of storage, and
methods of reducing these requirements have had to be developed. The techniques discussed in this paper have been used in
feasibility tests involving systems with up to 2.5 million particles.

1. Introduction
This is the first of two papers that address the problem of implementing extremely large-scale molecular
dynamics simulations on modern supercomputers. There are two outstanding architectural characteristics
that are common to essentially all machines of this class, namely, that the computations are carried out by
means of vector processing, and that performance is enhanced by distributing the computational effort
across several processing units. The first feature is practically universal, while the second is becoming
increasingly widespread. In order to carry out simulations that involve systems containing from a few
hundred thousand to as many as several million particles on machines of this type, it is necessary,
depending on hardware, to incorporate either or both of these architectural features into the computational algorithms. The manner of doing this for a vector processor is discussed in the present paper; in the
second paper of the series [1] (hereinafter II) the approach used for distributed processing is described.
Both papers deal with extensions of work originally described in ref. [2].
Vector processing is to computation what the production line is to manufacturing; for no matter how
fast a particular generation of electronic device technology allows a computer to process data, if the
machine is given vector-processing hardware based on similar technology it will be able to perform many
of its tasks a great deal faster. Distributing the computational load over several processors will of course
lead to further improvements, as will become apparent in II. The hurdle faced when implementing
algorithms on vector machines is the need to ensure that the data is organized so that as much of the
computation as possible makes effective use of the vector hardware. This is often a non-trivial and, on
occasion, an even impossible task.
The majority of the enormous body of work carried out using molecular dynamics simulation [3] over
the past three decades has not involved systems of the sizes addressed here. The reason for this is simple:
for the majority of problems there is absolutely nothing to be gained by studying such large systems, and a
few hundred to a few thousand particles generally suffice. The physics that underlies the phenomena
modeled in most molecular dynamics studies involves short-ranged spatial correlations: correlated motion
that extends over a distance of order several times the mean interatomic separation can be adequately
accommodated within systems whose edges are typically of length some ten times the mean separation
(even smaller sizes are still employed on occasion). Periodic boundaries are generally used in order to
reduce finite-size effects; otherwise, a substantial fraction of the particles would lie close to a boundary,
and relatively few in the bulk interior.
On the other hand there do exist problems in which the characteristic length scales of the phenomena of
interest are orders of magnitude greater than the mean interatomic separation, examples of which include
polymers [4], incommensurate surface phases [5], and spontaneous structure formation in hydrodynamic
flow [6]. Such problems cannot be seriously studied without resorting to systems of at least several tens of
thousands of particles, and it is reasonable to expect the demand for simulations of this (and even greater)
magnitude to grow with time as the value of the simulational approach becomes more widely appreciated.
It is a fact of life that algorithms designed for small problems do not always prove suitable when the
problems are scaled up by several orders of magnitude. This is especially true in the case of molecular
dynamics. The algorithm used for small numbers of particles is little more than trivial [7]. While such an
algorithm is adequate, perhaps even optimal, for systems containing up to a few hundred particles, some
form of enhancement is required to deal with larger systems, and it has long been recognized that the
introduction of certain bookkeeping techniques, namely cells and/or neighbor lists [7,8], can greatly
enhance the performance, even by orders of magnitude. Until recently, however, it was thought that such
techniques were only marginally useful on vector processors, but it has since been demonstrated that both
bookkeeping schemes can be very effective on computers of this type [2,9].
The availability of improved algorithms has made possible extensive simulations of systems containing
as many as 2 × 10^5 particles [10], taking full advantage of the benefits of vector processing. The
bookkeeping schemes exact a penalty, however, namely one of storage, and the substantial price paid in
terms of storage requirements can make it unreasonable to consider systems of even greater size unless an
inordinate amount of storage is available. An examination of the algorithms reveals that this appetite for
storage can be overcome at the expense of further algorithmic complication, but, more importantly, with
only minor impact on the efficiency of the computation. Test runs involving over 2 × 10^6 particles have
been used to establish the effectiveness of the approach. In the large-system limit the fraction of storage
required for the bookkeeping functions tends to zero. The focus of this paper is on the interaction
computations, which constitute the bulk of the work in any molecular dynamics simulation. There are of
course many other details [7], such as integrating the equations of motion, constructing the initial state,
and making measurements of various properties in the course of the simulation; these consume a relatively
small fraction of the effort, and demand little by way of special treatment when transferred to a vector
environment.

2. Molecular dynamics - conventional methods

2.1. Basic approach


In order to provide a framework for introducing novel methods for handling molecular dynamics
calculations in other than the most conventional computing milieux, we begin with a summary of the most
straightforward approach for simple systems. The term "simple" as used here means that the molecules of
the system are reduced to particles having spherical symmetry, with interactions defined in terms of
two-body forces whose strength and direction depend only on the relative separation of the particles
involved. Many, perhaps the majority, of current applications of molecular dynamics deal with more
complex molecules [3,7] which may have a rigid or flexible internal structure, interactions involving several
force centers on each molecule or an explicit dependence on relative orientation, even three-body
potentials or polarizability.
For a broad range of problems it suffices to consider the simplest of particles, the atom, whose only
structural feature is the volume it occupies to the exclusion of all others. Such a particle is exemplified by
the hard sphere (or hard disk in two dimensions). While a truly hard sphere cannot be represented by a
differentiable potential, the step potential that characterizes the hard sphere can be replaced by a suitably
shaped differentiable function that acts repulsively over a very limited range and diverges rapidly when the
separation drops below the effective core diameter (typically following an r^{-12} law), but not too rapidly to
prevent numerical integration of the equations of motion. A common example of such a potential is one
derived from the Lennard-Jones form by an appropriate shift and truncation, namely

U(r) = 4(r^{-12} - r^{-6} + 1/4),   r < r_c,
     = 0,                           r >= r_c,

F(r) = 48(r^{-14} - r^{-8}/2) r,    r < r_c,
     = 0,                           r >= r_c,

where U and F are the interaction energy and force at pair separation r (with r ≡ |r|) in familiar reduced
units, and the interaction range is r_c = 2^{1/6}. While not suitable for modeling the liquid-gas phase
transition, this potential has been used extensively in studying melting transitions, fluid structure,
transport coefficients, liquid flow, and so on;
in fact any situation where the behavior is primarily
attributable to excluded volume and not to other aspects of the potential. The algorithmic developments
for vector and distributed processing will be described in terms of this simplest of systems, but the
underlying principles readily generalize.
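As a concrete illustration of the interaction just defined, the following fragment (Python is used purely
for illustration; it is not part of the original work) evaluates the shifted, truncated Lennard-Jones energy
and force factor from the squared pair separation, the quantity that also indexes the tabulated form
introduced later.

    # Shifted, truncated Lennard-Jones ("excluded volume") interaction in the
    # reduced units of the text; a minimal sketch, not the paper's own code.
    RC2 = 2.0 ** (1.0 / 3.0)          # squared cutoff, r_c = 2^(1/6)

    def pair_interaction(r2):
        """Return (U, F/r) for a squared pair separation r2."""
        if r2 >= RC2:
            return 0.0, 0.0
        inv_r2 = 1.0 / r2
        inv_r6 = inv_r2 ** 3
        energy = 4.0 * inv_r6 * (inv_r6 - 1.0) + 1.0            # U(r) = 4(r^-12 - r^-6 + 1/4)
        force_over_r = 48.0 * inv_r6 * (inv_r6 - 0.5) * inv_r2  # F(r)/r = 48(r^-14 - r^-8/2)
        return energy, force_over_r

Multiplying the second value by the components of the pair separation vector gives the Cartesian force,
which equals the acceleration in these units.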
The molecular dynamics algorithm [3,7] involves the computation of the force acting on a given atom by
considering all other atoms in the system; once the total force acting on each atom has been computed, the
equations of motion are integrated by standard means. Starting from some initial state, the evolution of
the system is followed over a sequence of time steps, typically of the order of 10^4, although this might be an
order of magnitude (or more) greater for some problems. The amount of computation required for the
interactions varies quadratically with the number of atoms. For systems with long-range interactions this is
irreducible unless approximations are made, but in the case of short-range interactions only the immediate
environment of each atom really ought to be examined to fully determine the forces. Clearly some scheme
for representing the system state in a manner that allows the determination of neighborhood relationships
in an efficient manner is crucial.

2.2. Cell data organization


The most obvious way to reduce computational effort is to subdivide the system into a set of relatively
small boxes, or cells. If the edge of each cell exceeds r_c, then each atom can interact only with atoms that
are in either the same cell or in one of the immediately adjacent cells (26 in three dimensions, 8 in two).
Since determining the cell membership for each atom requires a fixed computational effort, and the
interaction computations that rely on cell partitioning involve a neighborhood whose size depends on r_c
but not on the total number of atoms in the system, N, the entire computation grows linearly with N, the
slowest growth rate possible.

Beyond the requirement that the minimum cell edge exceed r_c, there is no precise recipe for determining
optimal cell size. The preferred size is one in which the mean cell occupancy is close to unity; in a high
density system this will also be the smallest size allowed, but at lower densities unit cell occupancy rather
than minimum cell size might prove a more effective criterion, to avoid excessive processing of empty cells.
The ideal solution is an empirical one that amounts to varying the cell size for the system at the state point
to be studied and determining the value at which the simulation runs fastest.
There is also the question of how to represent the information describing cell membership. Given the
permissible range of cell occupancies determined by the extreme limits of local density fluctuation, the
most economical approach, storage-wise, is to use a linked list of cell occupants. A separate list is used for
each cell, and all storage needed by the lists can be taken from a common pool whose overall size is just N.
To complete this particular data access scheme an additional set of pointers is introduced, so that for each
cell there is a pointer to the first atom it contains; starting from this atom, the linked list provides access to
the remaining atoms in the cell. Assuming the total number of cells to be of order N, the storage required
to implement this scheme is also proportional to N.
The algorithm is summarized below. Two techniques for improving computational efficiency are
included. One is to use precomputed tables instead of evaluating the interactions, the other is to eliminate
the need for dealing with periodic boundaries by making replicas [2] of atoms within distance i~ of the
boundaries that are suitably offset to the opposite sides of the system.
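The precomputed tables mentioned above can be generated once at the start of a run; the short sketch
below (an illustration only, assuming the reduced-unit potential of section 2.1, with hypothetical names
F_TAB, U_TAB and L_TAB) indexes the entries by the squared separation, as in the algorithm that follows.

    import numpy as np

    L_TAB = 1000                                   # table length (a free choice)
    RC2 = 2.0 ** (1.0 / 3.0)                       # squared interaction range
    DELTA_TAB = RC2 / L_TAB                        # fixed r^2 increment per entry

    r2 = (np.arange(L_TAB) + 0.5) * DELTA_TAB      # midpoint of each r^2 bin
    inv_r6 = r2 ** -3
    F_TAB = np.append(48.0 * inv_r6 * (inv_r6 - 0.5) / r2, 0.0)   # F(r)/r, final entry zero
    U_TAB = np.append(4.0 * inv_r6 * (inv_r6 - 1.0) + 1.0, 0.0)   # U(r),   final entry zero

    # lookup for a squared separation d2 (pairs beyond r_c hit the zero entry):
    # j = min(int(d2 / DELTA_TAB), L_TAB); force_over_r = F_TAB[j]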
The edges of the simulated region are of length L_x, ..., the cell array used for the interactions is of size
N_c = M_x × M_y × M_z, and w_x, ... are the cell edge lengths. The coordinates of atom i are r_xi, ...; to reduce
the work in computing cell membership the coordinates would normally range from 0 to L_x (etc.) rather
than be centered about the origin, but to allow space for the shifted replica atoms the actual coordinate
range of non-replica atoms is changed to w_x <= r_xi < L_x + w_x. In general, in the interest of brevity only the
x-components of quantities taking part in the calculation are mentioned explicitly; the version of the
algorithm shown is for three dimensions.
The replication of atoms near boundaries is as follows. It should be noted that replication can be used
effectively for all molecular dynamics simulations with periodic boundaries, provided the interaction range
is small compared with the overall size. The programming constructs used here and elsewhere need little
explanation, except for the remark that explicit terminators of loops and conditional blocks are omitted if
the contents can be fit into a single line of text. Note that the loop over x, y, z is only indicative of what is
to be done, and should not be taken literally. After replication there are a total of N* atoms to be
considered; atoms near corners can be replicated more than once.
N* ← N
for x, y, z do
    n ← N*
    for i = 1 to N* do
        if r_xi < w_x + r_c then
            n ← n + 1
            r_xn ← r_xi + L_x;   r_yn ← r_yi;   r_zn ← r_zi
        elseif r_xi > w_x + L_x - r_c then
            n ← n + 1
            r_xn ← r_xi - L_x;   r_yn ← r_yi;   r_zn ← r_zi
        endif
    enddo
    N* ← n
enddo
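In array terms the replication step amounts to appending shifted copies of the boundary atoms; a possible
NumPy rendering (illustrative names, x-direction only) is:

    import numpy as np

    def replicate_x(pos, Lx, wx, rc):
        """Append x-shifted replicas of atoms within rc of the two x boundaries.

        pos is an (N, 3) array whose x coordinates lie in [wx, Lx + wx); the same
        operation would then be repeated for y and z, acting on the replicas too.
        """
        x = pos[:, 0]
        low = pos[x < wx + rc].copy()           # atoms near the low boundary
        low[:, 0] += Lx                         # shifted copy placed above the region
        high = pos[x > wx + Lx - rc].copy()     # atoms near the high boundary
        high[:, 0] -= Lx                        # shifted copy placed below the region
        return np.concatenate([pos, low, high], axis=0)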

Assignment to cells results in a set of linked lists in which the pointers associated with both cells and
atoms are stored in a common set of N* + N_c array elements {p_i}; pointers between atoms are stored first,
followed by pointers to the first atoms in the cells.
for c = N* + 1 to N* + N_c do p_c ← 0
for i = 1 to N* do
    c ← ([r_xi/w_x] × M_y + [r_yi/w_y]) × M_z + [r_zi/w_z] + N* + 1
    p_i ← p_c;   p_c ← i
enddo
The interaction calculations consider each cell, pair it with its neighbors (actually just half), and then
examine all pairs of atoms that appear in the linked lists. The organization used here differs from the more
familiar form of the computation [2] in that the outermost loop is over the offsets between adjacent cells
rather than the cells themselves; there is no loss of efficiency, and the rearrangement hints at subsequent
developments.
The acceleration components of atom i are a_xi, ..., and E_u is the total potential energy. The table
entries used for the force (which is identical to the acceleration in reduced units) and energy terms are
denoted by F_j and U_j; these are tabulated at fixed increments of r^2, of a size chosen so as not to
introduce significant numerical error beyond that already present in the integration method. The tables are
of length L_tab, so Δ_tab = r_c^2/L_tab. The adjustments δ'_xk, δ''_xk, etc., that are made to the cell loop limits
have values 0 or 1 depending on the offset index k; the ranges are chosen to ensure that only the correct cell
pairings are considered (determination of the actual values is left as an exercise). The offsets themselves
are δ_xk, ...; they equal 0 or ±1 and the three components of each offset can be combined into a single
value (as will be done in section 4). The case k = 1 corresponds to cells being paired with themselves, and
covers the intracell interactions.
for i = 1 to N* do
    a_xi ← 0;   ...
enddo
E_u ← 0
for k = 1 to 14 do
    for m_x = δ'_xk to M_x - 1 - δ''_xk do
        for m_y = δ'_yk to M_y - 1 - δ''_yk do
            for m_z = δ'_zk to M_z - 1 - δ''_zk do
                c ← (m_x × M_y + m_y) × M_z + m_z + N* + 1
                m'_x ← m_x + δ_xk;   m'_y ← m_y + δ_yk;   m'_z ← m_z + δ_zk
                c' ← (m'_x × M_y + m'_y) × M_z + m'_z + N* + 1
                i ← p_c
                while i ≠ 0 do
                    i' ← p_c'
                    while i' ≠ 0 do
                        if k > 1 or i' > i then
                            d_x ← r_xi - r_xi';   ...
                            j ← [(d_x^2 + ...)/Δ_tab] + 1
                            if j ≤ L_tab then
                                a_xi ← a_xi + F_j × d_x;   a_xi' ← a_xi' - F_j × d_x;   ...
                                E_u ← E_u + U_j
                            endif
                        endif
                        i' ← p_i'
                    enddo
                    i ← p_i
                enddo
            enddo
        enddo
    enddo
enddo
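The scalar cell algorithm can be summarized in executable form as follows (a sketch in Python; for
brevity periodicity is handled here by wrapping cell indices and the minimum-image convention rather
than by the replica scheme above, and the tabulated interaction is replaced by the direct expression of
section 2.1).

    import numpy as np
    from itertools import product

    def cell_forces(pos, box, rc):
        """Linked-list cell method for short-range forces; assumes at least three
        cells in each direction so the half-offset scheme counts each pair once."""
        n = len(pos)
        M = (box // rc).astype(int)                 # cells per direction
        w = box / M                                 # cell edge lengths
        cidx = np.minimum((pos // w).astype(int), M - 1)

        head = -np.ones(tuple(M), dtype=int)        # first atom in each cell (-1 = empty)
        link = -np.ones(n, dtype=int)               # next atom in the same cell
        for i in range(n):
            c = tuple(cidx[i])
            link[i] = head[c]
            head[c] = i

        # zero offset plus half of the 26 neighbour offsets
        offsets = [(0, 0, 0)] + [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]

        acc = np.zeros_like(pos)
        for mx, my, mz in product(range(M[0]), range(M[1]), range(M[2])):
            for k, off in enumerate(offsets):
                c2 = ((mx + off[0]) % M[0], (my + off[1]) % M[1], (mz + off[2]) % M[2])
                i = head[mx, my, mz]
                while i >= 0:
                    j = head[c2]
                    while j >= 0:
                        if k > 0 or j > i:           # count intracell pairs once only
                            d = pos[i] - pos[j]
                            d -= box * np.rint(d / box)      # minimum image
                            r2 = d @ d
                            if r2 < rc * rc:
                                f = 48.0 * (r2 ** -7 - 0.5 * r2 ** -4)   # F(r)/r
                                acc[i] += f * d
                                acc[j] -= f * d
                        j = link[j]
                    i = link[i]
        return acc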

While the linked-list method is suitable for scalar processors, the fact that it requires accessing memory
in what amounts to a haphazard fashion means that it ceases to be effective when optimal performance
insists that data be read from and written to memory sequentially, a feature of all modern vector
supercomputers. Alternative approaches that extend the cell technique in a manner compatible with
vectorization will be discussed in section 4.
2.3. Neighbor-list data organization

The observation that in a fluid of moderate to high density the environment of each atom changes only
gradually (relative to the size of time step used for integrating the equations of motion) suggests that
information on neighborhood relationships continues to be valid, at least approximately, for a certain
period of time subsequent to its original generation. The neighborhood is defined to be a spherical
(circular in two dimensions) region with radius r_n > r_c. If a list of all the atoms present in the neighborhood
of a given atom is prepared [8], then it is clear that this information will remain useful in the sense that it
still contains all the interaction partners of that atom for a period spanning several time steps; the actual
duration of this period depends on the maximum velocity of the atoms involved (as well as on r_n itself).
The gain in performance over the cell method depends on the ratio of the volume of the neighborhood
region to the combined volume of all the cells which would have to be examined otherwise. Continual
monitoring of the atomic displacements can be used as a means of determining when regeneration of the
neighbor lists is required, namely the earliest instant at which an atom not originally in the neighborhood
could possibly become an interaction partner. Insofar as the representation of lists of neighbors is concerned,
the information can be stored as a sequence of atom pairs, or in a condensed format where data is grouped
according to one of the neighbors; in either case, the fact that the neighbor relationship is commutative
halves the total storage requirement.
The amount of storage needed for the neighbor data depends directly on the number of occupants of
the neighborhood region: the radius r_n is set equal to r_c plus a value representing the thickness of the
bordering shell (or annulus in two dimensions). The larger r_n, the less frequent the time-consuming
operation of regenerating the neighbor data, but once the proportion of atom pairs that are classified as
neighbors but which are separated by more than r_c becomes substantial, the performance will begin to
drop; the optimal size must once again be determined by experimentation. The process of preparing the
neighbor lists should utilize the cell approach as a preliminary step, with the cell size now chosen to exceed
r_n rather than just r_c.
The storage cost can be substantial; the neighbor-list approach exhibits an obvious largess in terms of
memory utilization in order to gain speed; for very large systems the method might not be viable for lack
of memory. Neighbor lists can be used on a vector computer in a straightforward manner provided each
atom has a moderately large number of neighbors (this excludes the case of very short-range forces), and
further improvements in performance are possible using a variant of the approach described in section 4.
The extension of the vectorized cell method by the use of partitioning (section 5) that is intended for
dealing with extremely large systems does not apply to neighbor lists. The reason for this is that the
neighbor data is generated once for the entire system and then used over the course of several time steps,
whereas the partitioned approach is designed for storage economy so that only the bare minimum of data
is retained for those parts of the system not under immediate consideration.
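The essentials of the neighbor-list bookkeeping can be expressed compactly as follows (a sketch; an
O(N^2) construction is shown for brevity, whereas the text recommends cells of edge exceeding r_n as the
preliminary step, and the refresh test is the familiar one based on the two largest displacements).

    import numpy as np

    def build_neighbor_list(pos, box, rn):
        """Return the list of pairs (i, j), i < j, closer than rn, stored once per pair."""
        pairs = []
        for i in range(len(pos) - 1):
            d = pos[i + 1:] - pos[i]
            d -= box * np.rint(d / box)                         # minimum image
            close = np.flatnonzero((d * d).sum(axis=1) < rn * rn)
            pairs.extend((i, i + 1 + j) for j in close)
        return pairs

    def needs_refresh(pos, pos_at_build, box, rn, rc):
        """Regenerate once two atoms could have closed the gap rn - rc between them."""
        d = pos - pos_at_build
        d -= box * np.rint(d / box)
        disp = np.sort(np.sqrt((d * d).sum(axis=1)))
        return disp[-1] + disp[-2] > rn - rc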

3. Vector processing
3.1. Architecture of vector computers

The vector supercomputer [11] represents a compromise between the ability to maximize performance
for only a limited set of operations and a need for the fastest possible computations over a broad range of

204

D.C. Rapaport

/ Multi-million particle molecular dynamics.

problems. The dominance of the former consideration is reflected in the fact that performance figures
quoted by manufacturers are almost always beyond the reach of the user [12], often by an order of
magnitude or more (these unachievable figures are sometimes referred to as "machoflops"); the situations
where the performance potential of the supercomputer is far from realized are all too frequent.
What distinguishes algorithms that vectorize effectively is the manner in which data is accessed and the
nature of the processing involved. The reason for a preferred mode of operation is that the processor
handles memory access and arithmetic in a pipelined fashion, with a resulting throughput substantially
greater than what would be possible if each operation were to be carried out separately. The pipelining is
only possible if the same operation is performed repeatedly on a set of data items arranged in a specific
manner; the preferred manner generally involves data stored in consecutive memory locations, although
evenly spaced items may be equally acceptable (with certain restrictions imposed by memory interleaving).
Any deviation from a general operational pattern of this kind results in reduced performance. However,
with the exception of limited kinds of computation, mainly involving matrices, which adhere precisely
to this prescription, such a state of perfection is rarely (if ever) attained. In addition to the data
organizational requirements, each vectorized operation has a fixed startup period independent of the
number of data items processed; this can sometimes be made to overlap (fully or partly) with a previous
vector operation. A paradoxical consequence of this overhead is that if the vectors are too short, vector
processing leads to reduced performance; the minimal vector length requirements vary and depend on
both the type of operation and the machine itself. The issue is how to achieve the best performance given
the preferred manner of operation of the hardware.

3.2. Vector operations

The term "vector" as used here has nothing to do with the vectors of physics and mathematics; it
merely denotes a sequence of data items that are processed as a single entity by the hardware. The data
items themselves may be integers, floating-point numbers, memory addresses, single bits whatever the
machine is prepared to accept. A vector is characterized by data type, the number of items involved, and
the starting address in memory. In a language such as Fortran, the name of the vector denotes the default
starting address, although this can be altered with an explicit index. The length would be either the size to
which the vector is dimensioned or some smaller value. The data type might be implicit in the name or
specified separately. Some language implementations (such as Cyber Fortran) allow the full description of
a vector to be summarized in a single quantity called a descriptor.
The concise notation for describing vector operations introduced previously [2] will be used here. The
vector x stands for an ordered set of n elements {x_1, x_2, ..., x_n}; the final index will be shown only if not
apparent from the context. A subvector of x will be denoted by x[n_1 ... n_2], or just x[n_1] if the upper limit
is obvious. A typical arithmetic operation z ← x + y stands for

for i = 1 to n do
    z_i ← x_i + y_i.

An example of a comparison operation with output stored as a bit vector (where a set bit corresponds to
the test being satisfied) is b ← x > y, equivalent to

for i = 1 to n do
    b_i ← x_i > y_i.

Other operations used are the sum over the elements of a vector, Σx, and the count of the number of
one-bits in a bit vector, #b.
To help the user implement an algorithm whose intrinsic data organization bears little resemblance to
that needed for efficient vector processing, the instruction sets of most vector computers include some
capability for reorganizing data at a relatively high rate, generally intermediate between the vector and
scalar processing speeds. Different approaches to dealing with data reorganization exist, and not all are to
be found on all machines. Furthermore, even when a particular scheme for rearranging data is implemented in hardware, the questions of how fast such operations are carried out relative to peak computation speed, and whether the compiler is even capable of utilizing the hardware feature, must be taken into
account.
The two principal schemes for reordering data are known as gather-scatter and compress-expand.
Data gathering uses a vector of indices c to access, in no particular order as far as the computer is
concerned, some or all of the elements of a set of items (items can be accessed more than once) which are
then stored consecutively in another vector. The notation used is z ← x@c, corresponding to the loop

for i = 1 to n do
    z_i ← x_{c_i}.

The scatter operation is the converse, in that the index vector is used to help store a consecutive set of data
items in some alternative order in another (possibly longer) vector; not all elements of the destination
vector need be affected, and destination elements may actually be stored into several times (assuming this
is meaningful for the particular context). The notation z@c ← x is shorthand for

for i = 1 to n do
    z_{c_i} ← x_i.

Compression involves selecting a subset of data items from a vector and storing them consecutively, in the
same order, in another vector; expansion is the converse. Because data order is preserved under these
operations, addressing information can be represented by means of a bit vector b; this provides an
extremely compact alternative for handling sparse data compared to the index vector needed for gather
and scatter. The compression operation is denoted by z ← x↓b, representing the loop

j ← 0
for i = 1 to n do
    if b_i = 1 then j ← j + 1;   z_j ← x_i
enddo

while expansion z ← x↑b corresponds to

j ← 0
for i = 1 to n do
    if b_i = 1 then j ← j + 1;   z_i ← x_j
    else z_i ← 0
enddo

In the event that no order is required, the proportion of elements participating in an operation on a subset
of a vector might be used to determine whether gathering or compression is preferable, provided the
choice exists.
One further operation will be introduced here, namely index selection. The operation c ← @(x > 0) (for
example) stands for

j ← 0
for i = 1 to n do
    if x_i > 0 then j ← j + 1;   c_j ← i
enddo

with c the resulting vector of indices showing which elements of x satisfy the given condition. The length
(j) of the index vector is also a product of the operation. In terms of bit vectors the example is equivalent
to

b ← x > 0
c ← {1, 2, ..., n}↓b
j ← #b

This defines the unary form of the operator @, previously introduced in binary form for gather/scatter.
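For readers who think in terms of array languages, the operations just defined have direct NumPy
counterparts; the correspondence below is an illustration only and is not part of the original notation.

    import numpy as np

    x = np.array([3.0, -1.0, 4.0, -1.5, 5.0])
    y = np.array([1.0,  2.0, 0.5,  2.0, 1.0])

    c = np.array([4, 0, 0, 2])              # index vector; elements may repeat
    z_gather = x[c]                         # gather:  z <- x@c
    z_scatter = np.zeros_like(x)
    z_scatter[c] = y[:len(c)]               # scatter: z@c <- y (repeated index: one store wins)

    b = x > y                               # comparison yielding a "bit vector"
    packed = x[b]                           # compress: keep selected elements, in order
    expanded = np.zeros_like(x)
    expanded[b] = packed                    # expand: spread them back, zero elsewhere

    idx = np.flatnonzero(x > 0)             # index selection (the unary @ operator)
    count = int(b.sum())                    # #b, the number of set bits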

3.3. Efficiency and portability

Ideally, one of the tasks of the compiler should be to produce a machine translation of the source
program delivering close to optimal performance on the designated hardware, without any special effort on
the part of the author of the program. Unfortunately, such an idealized situation is rare indeed. Judging by
the achievements to date, compiler efficiency is an even more complex issue than hardware efficiency, and
compiler performance, even among products from the same manufacturer, exhibits considerable variation.
Irrespective of whether the logical structure of the algorithm is too complex to be analyzed by an
automated procedure, or whether the compiler simply has not been taught to recognize certain basic
computational patterns, the onus is on the programmer to meet the requirements of the compiler, and if no
alternative exists, to resort to additional measures (see further) that will ensure an efficient, although less
intelligible and portable program.
Even when the compiler is competent at mapping the source program to the hardware, there are
situations in which certain relatively simple constructs may, in principle, prevent the compiler vectorizing
parts of the program. One example involves operations whose general form implies a potential dependence
on something that has only just been computed; such operations are not generally vectorizable because of
the manner in which vector pipelining restricts data dependence. In those instances where it is known that
no dependence exists, the capability of conveying such information by means of compiler directives (that
are not part of the actual language) ought to help the compiler perform its task. The capacity for aiding the
compiler in this way varies.
The alternative to total reliance on the compiler is to use machine instructions directly. This can always
be done by programming in assembly language, but is best avoided (with intelligibility in mind) in favor of
a sometimes available alternative which allows access to hardware via subroutine calls from higher level
languages such as Fortran. On the Cyber 205 and ETA processors, for example, q8vgathr,
q8vscatr, q8vcmprs and q8vxpnd do exactly what their names suggest; on Cray computers the
functions gather and scatter are available, while a set of functions with names such as whenne can be
used to carry out index selection. These functions correspond directly to the vector operations introduced
above.
Even when it appears that the portions of the program consuming most of the execution time have been
fully vectorized, there is usually no indication given as to whether the machine code produced is the most
efficient possible. The machine may, for example, be capable of achieving a given result in more than one
way, while the judicious employment of temporary registers, or the simultaneous use of multiple functional
units in the processor (typically by feeding the results of one vector operation into the next, a process
known as chaining), can result in substantial performance gains. Only by analyzing the assembly language
listing produced by the compiler is it possible to determine whether the performance level reflects what the
machine is really capable of achieving, but it is doubtful whether such a thorough analysis is often carried
out; the megaflop rate attained may well have little to do with the overall efficiency of the computation.
The issue of program portability is an acute one. Vector supercomputers tend to be very sensitive to
program and data structure (for reasons already given) and a program that runs well on one brand of
machine can fail to perform as expected on another, unless modifications are made. Performance can even
vary substantially among different models of a particular product line, depending on the kinds of
instructions implemented in hardware, the degree to which functional units are replicated, memory
organization and bandwidth, as well as other more subtle factors, such as the timing of individual
instructions, that might be entirely unknown to the user. Performance can also change between different
versions (or releases) of a compiler, and there is no guarantee that the code generation and optimization
capabilities improve monotonically with time. These considerations apply especially to vectorized implementations of molecular dynamics algorithms which, as pointed out in the course of this article, tend to
require machine-dependent adaptations in order to run efficiently.

4. Layer data organization


4.1. Inhibiting vectorization
A molecular dynamics algorithm based on cells involves a set of linked lists, one per cell, in which the
list elements contain the identities of the atoms belonging to the cell at a given instant. As pointed out in
section 2, the reason for preferring linked lists over sequential storage is that the number of atoms per cell
can fluctuate considerably; the alternative requires that the storage reserved for each cell allows for the
possibility of maximal occupancy.
Linked lists are handled very inefficiently on a vector processor, since the use of pointers to connect
related data items inhibits vectorization. The cell technique, which has shown itself to be very effective on
scalar computers, must be modified in a way that renders it vectorizable. Obviously it would be
unreasonable to abandon the use of cells entirely and return to the original method which considers all
pairs of atoms; though fully vectorizable, and even efficient for small systems, there comes a point at
which the gain due to vectorization can no longer compensate for the O(N^2) dependence. The layer
method of reorganizing cell data, which will now be described, provides a solution that retains the benefits
of the cell framework.
4.2. Layers

In the cell version of the algorithm (section 2.2), the interaction computations involve a series of nested
loops. The outermost loop is over the possible offsets between pairs of neighboring cells, including the case
of zero offset where cells are paired with themselves. Scanning the cell array is the responsibility of the
next series of loops. The two innermost loops generate pairings of occupants from the mutually offset cells,
with the case of zero offset incorporating a test to ensure that atom pairs are considered once only. Note
that it is the innermost loops that have the fewest numbers of iterations because of low mean cell
occupancy (typically unity). This fact rules out the possibility of vectorization. An essential requirement
for effective vector processing [11] is that the vector lengths are adequate to amortize fixed startup costs
over maximum useful computation. While the loop order just described fails to obey this criterion, a
reordering so that cell scanning (the cells themselves, not their contents) is done during the innermost loop
would constitute a satisfactory solution. How is this realized in practice?
The scheme calls for a reorganization of the cell data. Instead of representing cell occupancy using
linked lists, the identities of atoms in the cells are placed in a set of arrays; each array contains one
element per cell, and the total number of arrays is not less than the maximum expected cell occupancy.
These arrays will be referred to as layers, and in fact amount to a return to the approach that was
dismissed earlier where a fixed amount of storage is allocated for each cell; alternative ways for
overcoming the storage problem will be presented. While scanning the atoms during cell assignment, the
first atom encountered in a given cell is assigned to the corresponding position in the first layer, the second
atom in the cell (if any) to the second layer, and so on. Unfilled layer positions are assigned a value
distinct from all valid atom identity numbers (such as zero). Layer generation is carried out as shown
below, with NL signifying the number of layers generated.
Initially {c_i; i = 1, ..., N*} are the cells to which the atoms (including replicas) belong, but as atoms are
assigned to layers the corresponding c_i are zeroed. {e_jm; j = 1, ..., N_c} describe the contents of the mth
layer, and {s_i; i = 1, ..., n} are the (n) atoms remaining at the end of each layer. Several atoms may be
assigned to a particular layer position, but only the last assignment is effective; the other atoms will be
candidates for subsequent layers.
for i = 1 to N* do
    c_i ← ([r_xi/w_x] × M_y + [r_yi/w_y]) × M_z + [r_zi/w_z] + 1
    s_i ← i
enddo
n ← N*;   m ← 0
while n > 0 do
    m ← m + 1
    for j = 1 to N_c do e_jm ← 0
    for i = 1 to n do e_{c_{s_i},m} ← s_i
    for j = 1 to N_c do
        if e_jm ≠ 0 then c_{e_jm} ← 0
    enddo
    j ← 0
    for i = 1 to n do
        if c_{s_i} ≠ 0 then j ← j + 1;   s_j ← s_i
    enddo
    n ← j
enddo
N_L ← m
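The same construction can be phrased as a sequence of array operations; the sketch below (NumPy,
illustrative names) relies on a scatter in which one of several stores to the same cell survives, exactly the
property exploited in the listing above.

    import numpy as np

    def build_layers(cell_of_atom, n_cells):
        """Distribute atoms over layers so that each layer holds at most one atom per cell.

        cell_of_atom: 0-based cell index of every atom (replicas included).
        Returns a list of integer arrays of length n_cells, with -1 marking an
        empty position; the list length is the number of layers N_L.
        """
        remaining = np.arange(len(cell_of_atom))
        layers = []
        while len(remaining):
            layer = -np.ones(n_cells, dtype=int)
            layer[cell_of_atom[remaining]] = remaining        # repeated cells: one store wins
            layers.append(layer)
            placed = layer[cell_of_atom[remaining]] == remaining
            remaining = remaining[~placed]                    # losers try again in the next layer
        return layers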

The interaction computation, based on the layers just constructed, consists of pairing layers using all
allowed offsets, with an innermost loop that processes pairs of atoms specified in the layers. Only in cases
where both layer positions specify valid atoms is the calculation actually carried out. The details appear
elsewhere [2] and will not be repeated here; the algorithm can also be deduced from the vectorized versions
described below (but there would be little point to a non-vectorized implementation).
Two vectorized forms of the layer-based interaction computation have been constructed, each designed
bearing in mind the specific hardware features of the target processor. One version developed for use on
the (functionally identical) Cyber 205 and ETA machines represents cells requiring attention in each layer
by means of bit vectors. The other version developed for Cray computers, but having wider applicability,
employs sets of atom indices to represent layer contents and does not attempt to condense the
information. Both approaches lead to fully vectorized computations, but the inability to represent sparse
data on the Cray with the aid of bit vectors means that the storage scheme for the layers is inefficient; an
alternative scheme that uses storage more economically is described in section 5.
The pipelined nature of vector processing imposes certain restrictions on the data contained in the
vectors, the most significant being that processing of each data item can be carried out independently of
the others. The implication is that a particular atom can be mentioned no more than once in a set of atoms
that are processed in a single vector operation. Use of layers guarantees this to be the case since an atom
can only appear once in a layer (even when a layer is paired with itself, the two appearances of each atom
are in separate vectors). The compiler will not be aware of this fact, however, and it is necessary to inform
it, by means of the special directives referred to earlier, that certain loops involved in the layer
processing can be safely vectorized without fear of data dependence.

4.3. Vectorized layer algorithm using bit vectors

The computation begins with replication of atoms within r_c of each of the boundaries. The vector
notation discussed in section 3 is used here.
n ← N
for x, y, z do
    b ← r_x[1 ... n] < w_x + r_c;   n_1 ← #b
    r_x[n+1] ← r_x↓b + L_x;   r_y[n+1] ← r_y↓b;   ...
    b ← r_x[1 ... n] > w_x + L_x - r_c;   n_2 ← #b
    r_x[n+n_1+1] ← r_x↓b - L_x;   r_y[n+n_1+1] ← r_y↓b;   ...
    n ← n + n_1 + n_2
enddo
N* ← n

The next stage is layer assignment, resulting in a series of compressed layers that are accessed using bit
vectors. The identities of the atoms packed into the layers are stored in the vector e^(c), and bit vectors
{b_m}, in which a single bit corresponds to each cell in the mth layer, are used to associate atoms with
occupied cells. The quantities q_m and n_m are the starting position in e^(c) of the data for the mth layer, and
the number of atoms in that layer. The total length of e^(c) is N*, while the storage needed for all the b_m
amounts to N_c × N_L bits. Other temporary quantities make an appearance, but their meanings should be
obvious. Note that the first test for c ≠ 0 produces an all-ones bit vector and is unnecessary here, but
would be needed if partitioning (section 5) is used, since gaps could then appear in the vectors holding the
atom data.

m ← 0;   q_0 ← 1;   n_0 ← 0
s ← {1, 2, ..., N*};   b[1 ... N*] ← c ≠ 0
while #b > 0 do
    m ← m + 1;   q_m ← q_{m-1} + n_{m-1}
    e[1 ... N_c] ← 0
    e@(c↓b) ← s↓b
    b_m[1 ... N_c] ← e ≠ 0;   n_m ← #b_m
    e^(c)[q_m] ← e↓b_m
    c@(e↓b_m) ← 0
    b[1 ... N*] ← c ≠ 0
enddo
N_L ← m
Interaction calculations based on these compressed layers follow. The offset values s_k are linear
combinations of the indices δ_xk, ... (section 2). Since s_k can be negative, the bit vector indices shown here
can also become negative (they would not contribute to the final result, but might lead to invalid memory
references); to ensure positive indices each bit vector is augmented by a constant margin [2] and the index
adjusted accordingly; these margins are omitted here. An additional bit vector b* is used to distinguish
boundary cells, that contain only replica atoms, from interior cells; it is used to ensure that all pairings
involve at least one interior cell. Valid pairings between occupied cells in the layers are collected in the bit
vector b. Vectors r_x, a_x, etc., are all of length N_c and hold data for expanded layers; vectors such as r_x^(c)
hold packed data after rearrangement according to layers. The energy calculation is omitted. F (and later
also U) holds the tabulated interaction terms; the possibility that a vectorized evaluation of the interaction
function might be faster than table lookup should not be overlooked. All 27 offsets are used when distinct
layers are paired, whereas only 13 are needed when a layer is paired with itself (k = 1 corresponds to zero
offset and is skipped in this case).
While this algorithm follows the lines of one described previously [2], several changes have been made
to show that alternative implementations are possible, as well as to bring out the similarity with the
subsequent version based on index vectors. Here, the acceleration updates are done when the layer data is
in expanded form; if the atoms are separated by a distance greater than r_c the accelerations are still
updated, but using the final zero entry in the table; replica atom accelerations are eliminated after undoing
the initial layer rearrangement, simply by truncating a_x, ... to length N. (For the case m' = m, a_x and
a_x' correspond to a single vector that is updated twice per offset; the present notation does not adequately
convey this fact and, ideally, a separate sequence in which a_x' is replaced by a_x should have been included.
The descriptor variable mentioned in section 3.2 readily handles this case without any modification, and is
implicit in the syntax used here.) The operations ∧ and ∨ denote bitwise Boolean and and or.
r_x^(c)[1 ... N*] ← r_x[1 ... N*]@e^(c);   ...
a_x^(c)[1 ... N*] ← 0;   ...
for m = 1 to N_L do
    r_x ← r_x^(c)[q_m]↑b_m;   ...
    a_x ← a_x^(c)[q_m]↑b_m;   ...
    for m' = m to N_L do
        if m' ≠ m then
            r_x' ← r_x^(c)[q_m']↑b_m';   ...
            a_x' ← a_x^(c)[q_m']↑b_m';   ...
            k_min ← 1;   k_max ← 27
        else
            r_x' ← r_x;   a_x' ← a_x;   ...
            k_min ← 2;   k_max ← 14
        endif
        for k = k_min to k_max do
            b[1 ... N_c] ← b_m ∧ b_m'[s_k] ∧ (b* ∨ b*[s_k])
            if #b > 0 then
                d_x ← r_x↓b - r_x'↓b[s_k];   ...
                j ← min([(d_x × d_x + ...)/Δ_tab] + 1, L_tab)
                t ← F@j
                d_x ← t × d_x;   ...
                a_x ← a_x + d_x↑b;   a_x' ← a_x' - d_x↑b[s_k];   ...
            endif
        enddo
        if m' ≠ m then a_x^(c)[q_m'] ← a_x'↓b_m';   ...
    enddo
    a_x^(c)[q_m] ← a_x↓b_m;   ...
enddo
a_x[1 ... N*]@e^(c) ← a_x^(c);   ...
In order to overcome problems associated with a Cyber 205 hardware restriction on maximum vector
length as well as reduce a more general requirement for temporary storage used during the computation, a
scheme for spatially subdividing the system during the layer construction was developed [2] (the forerunner
of the slice method described in section 5). At each time step the initial assignment of atoms to cells is
carried out for the entire system (taking care to break up vectorized loops that become too long), but the
layer data is then grouped according to which part of the system it addresses. Provided adjacent
subdivisions are extended to overlap by an amount r_c, each group of layer data can be treated
independently during the interaction computations. Atoms lying in the overlap regions have some of their
interaction terms computed twice, but the bookkeeping ensures that this extra data is readily identified
and not used subsequently. The effort expended on duplicate interactions is small, depending on the
amount of overlap. The approach proved to be quite effective and was used in the large-scale production
runs, but is contingent on the ability to pack sparse data with the aid of bit vectors.
4.4. Vectorized layer algorithm using index vectors
The formulation in terms of index vectors might give the impression of a more concise algorithm than
before, but this has little bearing on whether the implementation is more efficient on a computer that
supports both approaches. As before, the computation begins with replication, but now based on index
vectors generated with the aid of the unary @ operator.
n ← N
for x, y, z do
    q[1 ... n_1] ← @(r_x[1 ... n] < w_x + r_c)
    r_x[n+1] ← r_x@q + L_x;   r_y[n+1] ← r_y@q;   ...
    q[1 ... n_2] ← @(r_x[1 ... n] > w_x + L_x - r_c)
    r_x[n+n_1+1] ← r_x@q - L_x;   r_y[n+n_1+1] ← r_y@q;   ...
    n ← n + n_1 + n_2
enddo
N* ← n

Layer assignment based entirely on index vectors follows. The identities of the atoms in the mth layer are
contained in the non-zero elements of e_m; no attempt is made here to pack the layer data since this is
addressed later. To handle a partitioned system (section 5), the preparation of the index vector s could
allow for gaps in the atom data. Note that when s is compacted at the end of each iteration, a new value
for its length, n, is also produced. There are several ways of employing @ operations to arrange atoms into
layers; the algorithm that follows is just one example.
m ← 0;   n ← N*;   s ← {1, 2, ..., N*}
while n > 0 do
    m ← m + 1
    e_m[1 ... N_c] ← 0
    e_m@(c@s[1 ... n]) ← s
    c@(e_m@(c@s)) ← 0
    s ← s@(@(c@s ≠ 0))
enddo
N_L ← m

Finally, the interactions are evaluated. The vector g, with one element per cell, is used exactly as the
bit vector b* earlier, to distinguish boundary cells from interior cells (elements of g are 0 or 1). Vector q
is filled with indices of cell positions that correspond to offset cell pairs in which at least one of the cells is
not a boundary cell and both are occupied. The margins mentioned previously are again omitted from the
description, and are now also required for the vectors e_m. The potential energy is computed here; to avoid
spurious contributions from replica atoms, it is evaluated separately for each atom (in vector u) and
accumulated at the end.

a_x[1 ... N*] ← 0;   ...;   u[1 ... N*] ← 0
for m = 1 to N_L do
    for m' = m to N_L do
        if m' ≠ m then k_min ← 1;   k_max ← 27
        else k_min ← 2;   k_max ← 14
        for k = k_min to k_max do
            q[1 ... n] ← @(((g + g[s_k]) × e_m × e_m'[s_k]) ≠ 0)
            p ← e_m@q;   p' ← e_m'[s_k]@q
            d_x ← r_x@p - r_x@p';   ...
            j ← min([(d_x × d_x + ...)/Δ_tab] + 1, L_tab)
            t ← F@j
            d_x ← t × d_x;   ...
            a_x@p ← a_x@p + d_x;   a_x@p' ← a_x@p' - d_x;   ...
            t ← U@j
            u@p ← u@p + t;   u@p' ← u@p' + t
        enddo
    enddo
enddo
E_u ← Σu[1 ... N]/2
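The structure of this computation carries over directly to an array language; the following sketch
(NumPy; offsets are applied by rolling the cell array, and periodicity is handled by the minimum-image
convention instead of replicas and margins, so it illustrates the idea rather than the listing itself) pairs
layers such as those produced by the build_layers sketch of section 4.2.

    import numpy as np
    from itertools import product

    def layer_forces(pos, layers, M, box, rc):
        """Vectorized cell-pair force evaluation from layers.

        layers: list of arrays, one atom index (or -1) per cell, as in build_layers;
        M: cells per direction (at least three in each); box: region edges.
        """
        acc = np.zeros_like(pos)
        cell_ids = np.arange(int(np.prod(M))).reshape(tuple(M))
        offsets = list(product((-1, 0, 1), repeat=3))
        for a, lay_a in enumerate(layers):
            for b in range(a, len(layers)):
                lay_b = layers[b]
                for off in offsets:
                    if a == b and off <= (0, 0, 0):
                        continue                    # same layer: skip zero offset, use half set
                    shifted = np.roll(cell_ids, [-o for o in off], axis=(0, 1, 2)).ravel()
                    i, j = lay_a, lay_b[shifted]    # occupants of cell and offset cell
                    ok = (i >= 0) & (j >= 0)        # both cells occupied
                    i, j = i[ok], j[ok]
                    d = pos[i] - pos[j]
                    d -= box * np.rint(d / box)     # minimum image
                    r2 = (d * d).sum(axis=1)
                    sel = r2 < rc * rc
                    i, j, d, r2 = i[sel], j[sel], d[sel], r2[sel]
                    f = (48.0 * (r2 ** -7 - 0.5 * r2 ** -4))[:, None] * d
                    np.add.at(acc, i, f)            # atoms are unique within a layer,
                    np.add.at(acc, j, -f)           # but use the safe scatter-add anyway
        return acc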

4.5. Performance
On the Cyber 205 and ETA-10Q computers extensive production using two-dimensional systems with
as many as 2 × 10^5 atoms has been carried out in exploring the applicability of molecular dynamics
simulation to the modeling of fluid flow instability. Test runs of up to 5 × 10^5 atoms were also conducted.
Moderately high densities were used; the values of area, or volume, per atom were 2.0 and 1.4 in two
and three dimensions. With cell size chosen to give an average cell occupancy near unity, the first layer is
almost fully populated, and the occupancy of subsequent layers drops sharply to zero after three to five
layers. The time required per atom step on the Cyber was 4.8 μs with the potential energy included in the
computation (25 μs in three dimensions), or 4.1 without, irrespective of system size (beyond a minimum of
several thousand atoms), with calculations carried out in 32-bit arithmetic. (On the ETA, which should
have given similar performance figures, a substantial but unexplained size dependence was noted,
possibly attributable to paging.)
Tests were also run using single processors of multiprocessor Cray XMP/48 and YMP systems. The
tests on the XMP only considered systems of up to approximately 7000 atoms at this juncture (the
partitioning scheme of section 5 was adopted for larger systems on the YMP) and used 64-bit arithmetic.
In two dimensions the time per atom step on the XMP was 5.2 μs, while the YMP performed the same
computation in 3.4 or 3.9 μs depending on the compiler (CFT and CFT77, respectively, the latter requiring
assistance from the special vector subroutines discussed in section 3). In three dimensions the XMP
required 23.5 μs for the corresponding computation. One surprising detail emerged from these measurements: the bulk of the processing time on the Cray XMP (of the order of 70-80%) was spent in the steps
leading to the construction of the index vectors p and p', and not in evaluating the interactions. The
times required when layers are not used are typically an order of magnitude larger.

4.6. Layers and neighbor lists


The layer approach can also be applied to neighbor lists. The idea is to associate the layers with cells
large enough to cover the neighborhood range r_n, and then use the layers to generate neighbor tables
segmented in a manner that permits no atom to appear more than once per segment. The subsequent
processing of such sets of atoms is then fully vectorizable. This scheme is practically identical to the
implementation of the layer approach just described; the principal difference being that the sets of indices
(p and p') would be stored for use over several time steps. For a three-dimensional system similar to
that tested here the Cray XMP required only 4.3 μs per atom step, with the neighbor lists being refreshed
once every 29 time steps [9]; given the reduced number of interacting pairs that have to be considered, this
result is hardly surprising. The storage requirements for this method are approximately doubled, and if the
neighbor list refresh rate is increased for any reason (such as in studies of fluid flow with stationary walls
that produce shearing, or even by the use of a larger time step, where accuracy permits) the benefits
would be less. Partitioning schemes aimed at reducing storage are of course not applicable.

5. Partitioning for storage economy


5.1. Schemes for partitioning
If the layer data can be compressed and the information needed for reconstruction stored as compact
bit vectors, the storage overhead resulting from the introduction of layers is, for practical purposes,
negligible, amounting to a single index variable per atom (in e^(c)) together with one bit per cell (in b_m) for
each layer. If the efficient bit-vector representation of sparse data is not supported by the processor in
question then, when the simulations become large enough that memory utilization becomes a serious
problem, it is necessary to consider approaches to subdividing the system, but in a manner different from
that outlined previously which was specifically designed for use with bit vectors.
The method of choice is to partition the system spatially and treat each of the subsystems separately. As
before, the separation cannot be complete since interactions will occur between atoms on opposite sides of
boundaries between subsystems, and atoms must also be allowed to cross these boundaries. In a manner
reminiscent of the approach which might be adopted for multiprocessor systems, the data for each
subsystem is stored separately, and when atoms do cross boundaries their associated data is explicitly
transferred from one storage area to another. The storage overhead associated with layers is then
proportional to the number of atoms per partition rather than the total N, and for very large systems that
are split into a substantial number of subsystems, ideally without incurring a costly performance penalty,
the relative increase in storage needed to deal with layers falls to a low level.
The benefits of the partitioned approach go beyond mere economy of storage. Modern computer
systems tend to utilize a hierarchy of storage methods, and this can be put to use in a computation which is
organized so that each part of the system is processed essentially on its own (how the coupling between
parts is handled will be addressed below, but the basic idea remains unaltered). The main processor
memory is the place where the application keeps its data; even faster cache storage and yet faster sets of
registers may also exist, but these are of limited size and often beyond user control. Main memory is an
expensive commodity on the fastest of machines, but it is possible to augment storage by using what is
sometimes known as solid-state disk, cheaper memory accessed in much larger blocks than normal, but
an efficient approach provided accesses are carried out in a manner similar to the way a disk is used (namely
by means of blocks of data rather than individual items). The next stage in the hierarchy is a real disk, and
again there is a tradeoff: more and cheaper storage but slower response times. Virtual memory systems
operate in this way without the user being aware. A multilevel memory could well prove suitable for a
subdivided computation since those parts of the system not actually being processed do not need to remain
in main memory; the computation proceeds in a predictable fashion so that data for each part of the
system need only be delivered to main memory just prior to processing, and afterwards the data are
allowed to migrate to more economical levels in the storage hierarchy. Naturally an intricate organizational
problem of this kind would only be attempted when the systems are large, typically 10^6 atoms or more.
There is a certain amount of flexibility in the way the partitioning is carried out, with the simplest
one-dimensional approach being used here: the system is cut into slices that span the entire region in all
directions except one. (The apparently optimal partitioning scheme is one which minimizes the surface to
volume ratio, but the gain can easily be outweighed by the extra computation required.) The computational scheme described here considers each slice once per time step, even when the boundaries are
periodic (so that the first and last slices are adjacent). The slices are treated in cyclic order, and three
adjacent slices are required at any instant; only the central one is actually being processed, while the other
two are either contributing data for inter-slice interactions or dealing with atoms that cross slice
boundaries. The slice thickness, and hence number of slices, must be optimized by experiment; if the slice
is too thick storage costs will grow, but too thin a slice will call for additional work to deal with
interactions across boundaries.
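The cyclic reuse of three slice buffers can be pictured schematically as follows (a Python sketch; load,
save and process are stand-in names, the first slice is simply re-read for the periodic wraparound instead
of having its edge data retained in a separate buffer as in section 5.3, and questions of exactly when edge
copies are taken relative to the integration are glossed over).

    def time_step(n_slices, load, save, process):
        """Advance one step, holding at most three slices in main memory.

        load(n) / save(n, s) transfer slice n between bulk and main storage;
        process(low, mid, high) performs the molecular dynamics work for the
        middle slice, its neighbours supplying inter-slice interaction data.
        """
        bufs = {n_slices - 1: load(n_slices - 1), 0: load(0)}
        for n in range(n_slices):
            lo, hi = (n - 1) % n_slices, (n + 1) % n_slices
            if hi not in bufs:
                bufs[hi] = load(hi)                 # bring in the next slice
            process(bufs[lo], bufs[n], bufs[hi])
            save(lo, bufs.pop(lo))                  # the previous slice is finished
        for n, s in bufs.items():                   # write back the two slices still held
            save(n, s)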

5.2. Data operations

Several sequences of data transfer and transformation appear in the various versions of the molecular
dynamics algorithm for subdivided systems, irrespective of whether one (as described here) or several (as in
II) processors are involved.
The copy operation is used to transfer atom coordinates between independently processed subregions
of the system; the subregions are processed either sequentially within the same processor, or concurrently
by distinct processors. For short-range interactions, only atoms close to subregion edges will be involved.
If a periodic boundary lies between the two subregions in question, then an appropriate shift in the
affected component of the coordinates is required. While it is usually only the coordinates that are
required in computing forces, other data associated with the atoms such as indices that might be used in
grouping atoms into polymers might also be needed. Copied data is discarded once the interactions have
been computed; the integration of the equations of motion for these atom takes place while processing the
subregion to which they belong.
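As an illustration (not the paper's code), a copy operation for the high-x edge of a slice might look as
follows; the shift argument is nonzero only when a periodic boundary separates the two slices involved.

    import numpy as np

    def copy_hi(slice_pos, x_high, rc, shift=0.0):
        """Coordinates of atoms within rc of a slice's upper x boundary, optionally
        shifted (e.g. by -Lx when the data wraps around to the first slice)."""
        edge = slice_pos[slice_pos[:, 0] > x_high - rc].copy()
        edge[:, 0] += shift
        return edge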
The move operation transfers all the data associated with an atom that has drifted between
subregions to the storage area associated with the new subregion (this may be in the same or another
processor). Periodic boundaries are once again taken into account. The version of the data in the original
subregion is flagged as invalid and the storage made available for reuse.

The number of sets of copy and move operations required per time step, as well as the actual data
involved in each, depend on the integration method used. In contrast to other integration methods, the
simple leapfrog technique [7] requires one evaluation of the interactions per time step, makes only a single
computation of coordinates and velocities, involves no higher time derivatives of the coordinates than
second (the accelerations), and does not require information from steps prior to the current one.
Higher-order methods will require more data to be transferred (typically either accelerations from earlier
time steps or, equivalently, higher derivatives of the acceleration at the current step), while use of a
predictor-corrector solver will require two sets of move operations per time step.
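For reference, the leapfrog update itself amounts to only a few lines; this generic C sketch (unit mass and two dimensions assumed, names illustrative) makes it clear that no data from earlier time steps has to accompany an atom.

    /* Leapfrog step: v(t+dt/2) = v(t-dt/2) + dt*a(t),  x(t+dt) = x(t) + dt*v(t+dt/2). */
    static void leapfrog_step(int n, double dt,
                              double *x, double *y,
                              double *vx, double *vy,
                              const double *ax, const double *ay)
    {
        for (int i = 0; i < n; i++) {
            vx[i] += dt * ax[i];
            vy[i] += dt * ay[i];
            x[i]  += dt * vx[i];
            y[i]  += dt * vy[i];
        }
    }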
The replicate operation, introduced earlier, is used to handle periodic boundaries in directions not
involved in the spatial subdivision, and is of course needed only if the dimensionality of the subdivision is
less than that of the system itself (e.g. a two- or three-dimensional system cut into slices). The need to address the
issue of periodicity when computing interactions is thereby eliminated.
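A replicate operation for the y-direction of a two-dimensional system might be sketched as follows (again purely illustrative C; the in-place append and all names are assumptions): atoms within the cutoff of either y-boundary are duplicated with a shifted y-coordinate, so the force loop can ignore periodicity in y.

    /* Append shifted replicas of atoms near the y-boundaries to the working
       arrays; returns the new total count including replicas. */
    static int replicate_y(int n, int n_max, double Ly, double r_cut,
                           double *x, double *y)
    {
        int m = n;
        for (int i = 0; i < n && m < n_max; i++) {
            if (y[i] < r_cut) {              /* near the low-y boundary  */
                x[m] = x[i];  y[m] = y[i] + Ly;  m++;
            } else if (y[i] > Ly - r_cut) {  /* near the high-y boundary */
                x[m] = x[i];  y[m] = y[i] - Ly;  m++;
            }
        }
        return m;
    }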
5.3. Algorithm for partitioned systems

The scheme based on a one-dimensional subdivision is described here. No explicit reference is made to
any memory hierarchy but this could easily be added, even at the control level outside the actual program
(this is operating system dependent). Extension to allow overlap of computation with the transfer of data
in and out of main memory is comparatively straightforward, but provision would have to be made for
additional buffering.
Slices are subdivisions of the region in the x-direction, with the full size being used in y- and
z-directions. Periodicity in the x-coordinate is accommodated by retaining data from the first slice (both
the copy before and the move after the integration) for use with the last slice. Note that the first slice
is input while the last is still being processed, and the last is output during processing of the first on the
subsequent time step. This scheme for reducing transfers does not completely cover the initial and final
time steps in a sequence, where a few extra transfers are required. Storage for three active slices is provided
(since transfers do not overlap computation) and these are used in cyclic fashion. The absolute slice
numbers are used both for accessing external storage and for determining the spatial limits of the slice.
The storage used for the N_s slices of the system is denoted by S_n (n = 0, ..., N_s - 1), with B_b
(b = 0, 1, 2) denoting storage for the three slices currently receiving attention. (Use of indices beginning from zero
simplifies the arithmetic for determining adjacent slices.)
To make the approach more flexible and, in particular, to make it easier to adapt to a multiprocessing
environment, several buffers are used to hold data destined for transfer between slices. Not all are really
necessary for a one-processor implementation, since some of the data could be transferred directly between
storage areas associated with neighboring slices, but the clarity is improved at minimal cost. Buffers
labeled C_h and C_l are used to hold data for copy operations in the high and low x-directions, while C_w
is for copied data necessitated by the periodic wraparound between the first and last slices. Buffers M_h, M_l
and M_w play similar roles in the move operations.
The description which follows shows a schematic outline of the computations involved in this approach.
Low-level details are omitted. The symbol ← denotes a transfer of all relevant data for a copy, move or
slice input/output operation. Functions such as copy_hi() are responsible for copying or moving data to
a buffer in the high-x or low-x direction from the specified slice, and for making the necessary correction for
periodic wraparound (specified by the second argument); the actual molecular dynamics work, including
replication, is done by process():
    B_1 ← S_0
    B_0 ← S_{N_s-1}
    b ← 0
    for num_steps iterations do
        C_w ← copy_lo(B_{(b+1) mod 3}, L_x)
        for n = 0 to N_s - 1 do
            b ← (b + 1) mod 3
            B_{(b+1) mod 3} ← S_{(n+1) mod N_s}
            if n < N_s - 1 then
                C_h ← copy_lo(B_{(b+1) mod 3}, 0)
            else
                C_h ← C_w
            if n > 0 then
                C_l ← copy_hi(B_{(b-1) mod 3}, 0)
            else
                C_l ← copy_hi(B_{(b-1) mod 3}, L_x)
            process(B_b, C_l, C_h)
            if n > 0 then
                B_b ← B_b + M_h
                M_l ← move_lo(B_b, 0)
                B_{(b-1) mod 3} ← B_{(b-1) mod 3} + M_l
                S_{n-1} ← B_{(b-1) mod 3}
            else
                M_w ← move_lo(B_b, L_x)
                S_{(n-1) mod N_s} ← B_{(b-1) mod 3}
            if n < N_s - 1 then
                M_h ← move_hi(B_b, 0)
            else
                M_h ← move_hi(B_b, L_x)
                B_{(b+1) mod 3} ← B_{(b+1) mod 3} + M_h
                B_b ← B_b + M_w
        enddo
    enddo
As atoms move out of a slice, vacancies will appear in the arrays used for storage. Vacated storage need
not be recovered immediately, and it may be sufficient to flag the elements in question by setting the atom
identifiers to zero (presuming identifiers are used in the calculation) to eliminate them from further
processing; such an approach does not preclude vectorization using the methods of section 4. Since very
few atoms move between slices at each time step, the appearance of vacancies is slow, but, especially for
vector processing, excessive fragmentation of storage will lead to reduced performance. The storage arrays
will have to be compressed periodically to eliminate gaps. The calculation must continually monitor
storage utilization to ensure that there is always sufficient space for copied and moved atoms.
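The compression pass itself is a simple gather; an illustrative C sketch (names assumed, with further per-atom arrays treated in the same sweep) is given below.

    /* Squeeze out entries whose identifier is zero (vacancies left by moved
       atoms) so the slice arrays become contiguous again; returns the new count. */
    static int compress_slice(int n, int *id, double *x, double *y,
                              double *vx, double *vy)
    {
        int m = 0;
        for (int i = 0; i < n; i++) {
            if (id[i] != 0) {              /* keep occupied entries only */
                id[m] = id[i];
                x[m]  = x[i];   y[m]  = y[i];
                vx[m] = vx[i];  vy[m] = vy[i];
                m++;
            }
        }
        return m;
    }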
5.4. Performance
Timing measurements were carried out on the Cray YMP for a series of two-dimensional systems
ranging in size up to 2.56 × 10^6 atoms. The additional computation time required to deal with subdivision
was minimal, amounting to no more than five percent (the time increased from 3.9 to 4.1 µs per atom
step), and attributable principally to the additional computations involved in processing interactions
across boundaries. Less than 16 × 10^6 words of storage were required for the largest system, which was
subdivided into 80 subregions; if the leapfrog method is used for integrating the equations of motion, only
four words of storage per atom must be reserved for atoms not in the subregion being processed (or five words if
atom identifiers are also required).
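As a rough consistency check, four words for each of the 2.56 × 10^6 atoms amounts to some 10.2 × 10^6 words for the atoms outside the slices currently in memory, the balance of the 16 × 10^6 word total presumably being taken up by the working arrays and buffers for the three active slices.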
5.5. Application to shared-memory multiprocessors
While the partitioned approach shares a lot in common with the implementation using a set of
communicating processors each with its own private storage (see II), it was pointed out above that it can
also form the basis for a version of the program designed to run on a multiprocessor system using shared
memory. One of the problems encountered when several processors attempt to use memory that is
common to all of them is access conflict. A spatial subdivision of the computation ensures that each
processor will spend most of its time working on its own private data, and even when data is transferred
between subregions there is little cause for conflict; the spatial subdivision approach should therefore
prove effective in such a situation. In fact two levels of partitioning are envisaged for a multiprocessor
implementation of this kind, one that splits the system among the processors, the other to economize on
storage needed for the layers in each processor. This is a subject for future exploration.

6. Conclusion
Supercomputers, with their uncompromising insistence on careful management of data, present a
challenge when it comes to implementing algorithms whose data is not structured in the required way.
Molecular dynamics simulation, especially in cases where the interactions are limited to a very short range,
provides an example of such a problem. However, by careful reformulation of the algorithm, it is possible
to arrive at a computational scheme whose data is organized in a manner that can be processed efficiently
by a vector computer. While the performance achieved in this way may still fall far short of the theoretical
maxima claimed for such machines, owing to the considerable amount of data rearrangement that goes on
throughout the computation, it is substantially better than what would otherwise be
achieved. Although these methods call for extra storage, further enhancement of the algorithms
keeps such requirements to a minimum.

Acknowledgements
I would like to thank David Landau (University of Georgia), Kurt Binder (University of Mainz) and
Dietrich Stauffer (KFA Jülich) for their hospitality during the periods in which much of this work was carried
out. Burkhard Dünweg and Shlomo Harari are thanked for helpful discussions.

References
[1] D.C. Rapaport, Comput. Phys. Commun. 62 (1991) 217, this issue.
[2] D.C. Rapaport, Comput. Phys. Rep. 9 (1988) 1.
[3] G. Ciccotti and W.G. Hoover, eds., Molecular Dynamics Simulation of Statistical Mechanical Systems, Proceedings of the
Enrico Fermi International School of Physics, Course XCVII, Varenna, 1985 (North-Holland, Amsterdam, 1986).
[4] K. Kremer, G.S. Grest and I. Carmesin, Phys. Rev. Lett. 61 (1988) 566.
[5] F.F. Abraham, Adv. Phys. 35 (1986) 1.
[6] D.C. Rapaport, Phys. Rev. A 36 (1987) 3288.
[7] M.P. Allen and D.J. Tildesley, Computer Simulation of Liquids (Oxford Univ. Press, Oxford, 1987).
[8] L. Verlet, Phys. Rev. 159 (1967) 98.
[9] G.S. Grest, B. Dünweg and K. Kremer, Comput. Phys. Commun. 55 (1989) 269.
[10] D.C. Rapaport, to be published.
[11] R.W. Hockney and C.R. Jesshope, Parallel Computers, 2nd ed. (Adam Hilger, Bristol, 1988).
[12] J.J. Dongarra, Argonne National Lab., Math. and Comp. Sci. Tech. Memo no. 23 (1988).
