SURFACE
Northeast Parallel Architecture Center L.C. Smith College of Engineering and Computer Science
1-1-1993
Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse
Ravi Ponnusamy
University of Maryland, Computer Science Department; Syracuse University, Northeast Parallel Architecture Center
Joel Saltz
University of Maryland, Computer Science Department
Alok Choudhary
Syracuse University, Northeast Parallel Architectures Center
Follow this and additional works at: http://surface.syr.edu/npac
Part of the Computer Sciences Commons

Recommended Citation
Ponnusamy, Ravi; Saltz, Joel; and Choudhary, Alok, "Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse" (1993). Northeast Parallel Architecture Center. Paper 10. http://surface.syr.edu/npac/10
This Working Paper is brought to you for free and open access by the L.C. Smith College of Engineering and Computer Science at SURFACE. It has been accepted for inclusion in Northeast Parallel Architecture Center by an authorized administrator of SURFACE. For more information, please contact surface@syr.edu.
Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse*

Ravi Ponnusamy†‡        Joel Saltz†        Alok Choudhary‡

† Computer Science Department, University of Maryland, College Park, MD 20742
‡ Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244
Abstract

In this paper, we describe two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, the spatial location of array elements, and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors (e.g. communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). The compiler generates code that, at runtime, records for each inspector whether any indirection array referenced by the inspector may have been modified since the last time the inspector was invoked. We present performance results for these mechanisms from a Fortran 90D compiler implementation.
1 Introduction

In sparse and unstructured problems, the data access pattern is determined by variable values known only at runtime. In these cases, the data movement and the work to be carried out by each processor of a distributed memory machine cannot be determined at compile time; runtime preprocessing must compute the communication schedules, decide how data and loop iterations are to be partitioned, and remap distributed arrays between processor memories. For instance, when the arrays of a mesh-based computation are partitioned in an irregular manner, assigning array elements and computational work to processors requires preprocessing, because the compiler does not have a usable compile-time description of the mesh. Over the past several years, promising partitioning heuristics for such problems have been studied [24, 25, 19, 17, 2, 13]. Our long term goal is to make it possible for compilers to make use of these heuristics and to generate the runtime support needed to handle irregular computations efficiently on distributed memory architectures.

We have implemented a prototype of our runtime compilation techniques in the context of the Fortran 90D compiler [28]. The compiler transformations associate a communication schedule and a standardized graph representation with each irregularly accessed loop; this information is used to partition data and loop iterations, to remap distributed arrays, and to minimize interprocessor communication. In this paper, we demonstrate that these techniques make it possible to produce code whose performance closely matches that of hand parallelized versions.

* This work was supported by grants NAG-1-1485 and SC292-1-22913, and by an NSF Young Investigator award (CCR-9357840). The content of the information does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
Figure 1: Example irregular loops. Loop L1 is a single statement FORALL that references y(ia(i)); loop L2 sweeps over mesh edges and references x(end_pt1(i)) and x(end_pt2(i)).

Figure 2: Phases of compiler-linked runtime preprocessing: (A) generate the GeoCoL data structure, (B) partition data, (C) partition loop iterations, (D) remap arrays and preprocess loops, (E) execute loops.
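As a concrete illustration of the inspector/executor split that underlies phases D and E, the following Python sketch builds a gather schedule from an indirection array under a block distribution. The function names and the dictionary-based schedule are illustrative assumptions, not the CHAOS/PARTI interface.

```python
# Sketch of the inspector/executor pattern: the inspector scans the
# indirection array once to build a communication schedule; the
# executor replays that schedule for every subsequent loop sweep.

def block_owner(global_index, block_size):
    """Owner of an element of a block-distributed array."""
    return global_index // block_size

def inspector(my_proc, indices, block_size):
    """Build a schedule: which off-processor elements must be fetched."""
    schedule = {}          # owner -> list of global indices to gather
    for g in indices:
        owner = block_owner(g, block_size)
        if owner != my_proc:
            schedule.setdefault(owner, []).append(g)
    return schedule

def executor(schedule, remote_data):
    """Gather off-processor copies according to a saved schedule.

    remote_data stands in for the actual message exchange."""
    return {owner: [remote_data[g] for g in gids]
            for owner, gids in schedule.items()}

# Processor 0 owns elements 0..3 (block size 4); the loop references
# y(ia(i)) with ia = [1, 5, 6, 2], so elements 5 and 6 are off-processor.
sched = inspector(my_proc=0, indices=[1, 5, 6, 2], block_size=4)
print(sched)   # {1: [5, 6]}
```

The point of the split is that the (expensive) inspector runs once, while the executor can run every iteration of an outer time-step loop.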
To our knowledge, ours is the first implementation of this kind of support in a prototype Fortran compiler. We present our language extensions in the context of Fortran D; the same transformations could be carried out in the context of Vienna Fortran or HPF. To provide the runtime support required, we have developed a library called CHAOS, aimed at irregular problems on distributed memory machines; CHAOS is a superset of the PARTI library [21, 26, 23]. The loops we handle may consist of a sequence of statements with indirectly referenced arrays; loop carried dependencies other than reduction operations (accumulation, max, min) are not considered here.

In Figure 1, we employ two loops from an unstructured computational fluid dynamics code to depict the preprocessing steps. The first loop, L1, is a single statement FORALL; the second, L2, is a loop in which accumulations to indirectly referenced distributed arrays are carried out. We use these loops as running examples when we describe, in the sections that follow, how data and loop iterations are mapped onto processors using our runtime support.
In Phase A of Figure 2, procedures can be called to construct a graph data structure (the GeoCoL data structure) that records the data access patterns associated with a particular loop and describes how data elements are related. In Phase B, the newly constructed GeoCoL structure is passed to a partitioner, which calculates how the distributed arrays are to be partitioned. In Phase C, loop iterations are partitioned among processors; this calculation takes into account the actual data access patterns of the loop. In Phase D, we remap the arrays and carry out the preprocessing needed to generate communication schedules and to allocate space for copies of off-processor data. Finally, in Phase E, we use this information to execute the loops, retrieving copies of off-processor array data as necessary.

Our experimental results reveal that the performance of the compiler generated code is within 10% of the hand parallelized version. The rest of this paper is organized as follows. In Section 2, we give an overview of the CHAOS runtime support and of existing language support. In Section 3, we describe the technique used to reuse previously computed communication schedules. In Section 4, we describe the directives and data structures used to couple data and loop iteration partitioners to the compiler. Section 5 outlines the compiler transformations, Section 6 presents performance results, Section 7 discusses related work, and we conclude in Section 8.
S1   REAL*8 x(N), y(N)
S2   INTEGER map(N)
S3   DECOMPOSITION reg(N), irreg(N)
S4   DISTRIBUTE reg(BLOCK)
S5   ALIGN map WITH reg
     ... set map array using some mapping method ...
S6   DISTRIBUTE irreg(map)
S7   ALIGN x, y WITH irreg

Figure 3: Fortran D irregular distribution declarations.

2 Overview of Runtime and Language Support

2.1 The CHAOS Runtime Library

The CHAOS procedures [23] are aimed at a variety of applications, including adaptive computational fluid dynamics and molecular dynamics codes. The library supports the preprocessing outlined in Section 1: it builds communication schedules, allocates space for copies of off-processor data, remaps irregularly distributed arrays, and carries out the communication needed to execute irregular loops on distributed memory multiprocessors.
2.2 Overview of Existing Language Support

The directives we employ for irregular problems will be presented in the context of Fortran D; the same optimizations could be carried out with other languages, such as Vienna Fortran and HPF. Languages such as Fortran D (and proposed extensions of Vienna Fortran and Fortran 90) provide a rich set of data decomposition declarations [10, 8]; Figure 3 gives an example of such declarations, as currently defined. The difficulty is that these declarations specify how the irregularly distributed arrays are to be partitioned only indirectly, through the values of a map array, and the constructs are not rich enough to tell a compiler how the map array itself should be produced. While a wealth of partitioning heuristics is available, there is no standard interface to such partitioners; users must explicitly code the calls that produce the map array separately for each partitioner, which can represent a significant coding effort.
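The distinction between a regular BLOCK distribution and a map-array driven irregular distribution can be made concrete with a small sketch. Python is used purely for illustration here; the helper names are assumptions, not Fortran D semantics.

```python
# Two ways of deciding which processor owns element i of an array:
# a BLOCK distribution, where ownership is a closed-form formula, and
# an irregular distribution, where a user-filled map array names the
# owner of every element (as in DISTRIBUTE irreg(map) of Figure 3).

def block_dist(n, nprocs):
    """BLOCK: element i goes to processor i // ceil(n / nprocs)."""
    block = -(-n // nprocs)              # ceiling division
    return [i // block for i in range(n)]

def irregular_dist(map_array):
    """Irregular: map_array[i] is the owner of element i."""
    return list(map_array)

print(block_dist(8, 2))               # [0, 0, 0, 0, 1, 1, 1, 1]
print(irregular_dist([0, 1, 1, 0]))   # owners chosen by a partitioner
```

The language problem described above is exactly that nothing in the declarations says how the values fed to `irregular_dist` should be computed.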
The distribution of an array in Fortran D is specified using two declarations: DECOMPOSITION and DISTRIBUTE. A decomposition is a template that gives the size and dimensionality of a distributed array; the DISTRIBUTE statement specifies how a decomposition is to be mapped onto processors. Fortran D provides the user with a choice of several regular distributions and, in addition, lets a user explicitly associate a map array with a distribution. An array is associated with a decomposition by the ALIGN statement. In Figure 3, statement S3 declares two one-dimensional decompositions of size N each, reg and irreg. In statement S4, reg is distributed in equal sized blocks, with one block assigned to each processor. In statement S5, the integer array map is aligned with reg; element map(i) records the processor to which element i of decomposition irreg is assigned. Once map has been set, using some mapping method, statement S6 distributes irreg irregularly using map, and S7 aligns x and y with irreg.

3 Communication Schedule Reuse

The cost of carrying out an inspector [12, 7] (phases B, C and D in Figure 2) can be amortized when the information produced by the inspector is used repeatedly. Compile time analysis can sometimes prove that the communication schedules computed for a loop L can be reused, for instance when there is no possibility that the indirection arrays referenced in loop L have been modified between inspector invocations. When the indirection arrays associated with loop L may have been modified since the last invocation, however, the inspector must be repeated. We propose a simple conservative method that in many cases allows inspector results (e.g. communication schedules) to be reused. The compiler generates code that, at runtime, records for each inspector whether any indirection arrays may have been modified since the last time the inspector was invoked. In this presentation, we assume that we are carrying out an inspector for a forall loop.
Consider a forall loop L that makes references of the form y(ia(i)) to a distributed array y, where ia is a distributed indirection array. A data access descriptor, DAD(a), is associated with each distributed array a; among other things, the DAD records how a is distributed, and with each DAD we record a timestamp. To generate the timestamp maintenance code, note that whenever the compiler generates code that references a distributed array, the compiler must have access to the array's DAD; in our scheme, we also maintain a global variable nmod that represents the cumulative number of times any distributed array has been modified at runtime. Whenever the program modifies a distributed array a in some way the compiler can observe (an array intrinsic, an array assignment, or a remapping), the generated code increments nmod and sets timestamp(DAD(a)) = nmod. We are not counting simple assignments to individual distributed array elements; in data parallel Fortran programs such element-level changes arise primarily inside the data parallel loops themselves, so the runtime overhead of maintaining the timestamps is likely to be small.

For the inspector Ls associated with forall loop L, we maintain, for each distributed array x_i referenced in L, the data access descriptor DAD(x_i) that was current when Ls last carried out its preprocessing. For each unique indirection array ind_j, 1 <= j <= n, we also store:

1. DAD(ind_j), the data access descriptor of ind_j recorded by the last inspector, and
2. timestamp(DAD(ind_j)), the value of the timestamp recorded at that point.

When loop L is encountered again, the inspector can reuse its previously computed schedules if each array still has the same data access descriptor and each timestamp(DAD(ind_j)) is unchanged; otherwise the inspector must be repeated. Similar bookkeeping makes it possible to avoid repartitioning data when no indirection array has changed. Interprocedural analysis to identify the arrays that must be tracked at runtime, and a fuller exploration of this optimization, are left to future work.
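A minimal model of this timestamp check, with the DAD reduced to a per-array timestamp and a global modification counter, might look as follows. This is illustrative Python, not the code the compiler actually generates.

```python
# Conservative schedule reuse: a global counter nmod is bumped whenever
# a distributed array is written or remapped; an inspector's saved
# results stay valid only if every indirection array it read is
# unmodified since the save.

nmod = 0        # cumulative count of distributed-array modifications
stamps = {}     # array name -> value of nmod at its last modification

def modify(name):
    """Record that distributed array `name` was written or remapped."""
    global nmod
    nmod += 1
    stamps[name] = nmod

saved = {}      # loop id -> (timestamp snapshot, saved schedule)

def run_inspector(loop_id, indirection_arrays, build_schedule):
    """Reuse the saved schedule when no indirection array has changed."""
    snap = {a: stamps.get(a, 0) for a in indirection_arrays}
    if loop_id in saved and saved[loop_id][0] == snap:
        return saved[loop_id][1]          # reuse previous result
    sched = build_schedule()              # recompute and save
    saved[loop_id] = (snap, sched)
    return sched

modify("ia")
s1 = run_inspector("L1", ["ia"], lambda: "schedule-v1")
s2 = run_inspector("L1", ["ia"], lambda: "schedule-v2")
print(s2)        # schedule-v1 (reused: ia unchanged)
modify("ia")
s3 = run_inspector("L1", ["ia"], lambda: "schedule-v3")
print(s3)        # schedule-v3 (recomputed)
```

The check is conservative in the same sense as the scheme above: a bumped counter forces recomputation even if the new array contents happen to equal the old ones.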
4 Coupling Data and Loop Iteration Partitioners

In irregular problems, it is often desirable to allocate computational work to processors by assigning all the computation associated with a given loop iteration to a single processor. We partition using a two-phase approach, which has proven to be practical: in the first phase, the distributed arrays are partitioned; in the second phase, loop iterations are partitioned, using the array distributions produced by the first phase, so as to balance the workload. The next two subsections describe the two phases, using the loops in Figure 1 as examples. The weight associated with a vertex of the partitioning graph for the second loop of Figure 1 would be proportional to the degree of the vertex.
4.1 Data Partitioning

When we partition distributed arrays, the owner computes rule, in which the processor owning the left hand side of each statement carries out the computation, cannot be used as the sole partitioning rule: except in embarrassingly parallel problems, non-local references may dominate. A given loop may reference several distributed arrays; we partition the arrays to minimize non-local references, and we assume that arrays referenced together in a loop are given identical distributions. Our approach can make use of partitioners based on combinations of connectivity, geometrical (spatial), and load (weight) information; many such partitioning heuristics are available [24, 2, 25, 13]. Currently, however, partitioners must be coupled to user programs in a manual fashion. This manual coupling is particularly troublesome because different (but similar) data structures are used by different partitioners to represent the required information, making it extremely difficult and tedious to adapt a program to a new partitioner. Since the coupling is problem dependent, it is also difficult for a compiler to direct data partitioning without additional information from the user.

4.1.1 Interface Data Structures for Data Partitioners

We link partitioners to programs by using a data structure, called the GeoCoL data structure, that stores the kinds of information on which partitioners operate: GEOmetrical information, COnnectivity information, and Load information. Graph-based partitioners [24, 15, 19] employ undirected graphs whose vertices represent array elements or loop iterations and whose edges represent data dependencies; geometry-based partitioners instead make use of the spatial coordinates of the entities to be partitioned.
4.1.2 GeoCoL Construction Directives

The GeoCoL data structure is specified using the CONSTRUCT directive. Geometrical information is declared with the keyword GEOMETRY. The following is an example of a GeoCoL declaration that specifies spatial information:

C$ CONSTRUCT G1 (N, GEOMETRY(3, xcord, ycord, zcord))

This declares a GeoCoL structure G1 having N vertices; the arrays xcord, ycord and zcord, each of N elements, specify the spatial coordinates associated with the vertices. Similarly, vertex weights are specified using the keyword LOAD. The following example illustrates a GeoCoL construct called G2 consisting of N vertices, with weight(i) giving the computational load associated with vertex i:

C$ CONSTRUCT G2 (N, LOAD(weight))

Connectivity information is specified in a GeoCoL declaration using the keyword LINK:

C$ CONSTRUCT G3 (N, LINK(E, edge_list1, edge_list2))

Here, the GeoCoL structure G3 has N vertices and E edges; edge i of the graph connects vertices edge_list1(i) and edge_list2(i). Combinations of GEOMETRY, LOAD and LINK information may appear in a single CONSTRUCT statement.

In some cases, connectivity or geometry information can be obtained directly from the problem. For instance, in codes that sweep over the edges of a finite element or finite difference mesh, each mesh point has a spatial location, and the indirection arrays of the loop define the edges; these structures can be used to partition the arrays. Vertices may also be assigned estimated computational costs. In the absence of explicit LOAD information, an implicit rule is used to estimate computational work: we assume that the cost associated with executing a statement is attributed to the processor owning the left hand side array reference. This results in a graph with unit vertex weights.
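The three kinds of vertex information named by the CONSTRUCT directives can be pictured as one record. The field names in this Python sketch are assumptions for illustration, not the CHAOS layout.

```python
# A GeoCoL-style record: GEOMETRY (spatial coordinates), LOAD (vertex
# weights) and LINK (edges) gathered into one structure that any
# partitioner can consume.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class GeoCoL:
    nverts: int
    coords: List[Tuple[float, ...]] = field(default_factory=list)  # GEOMETRY
    weights: List[float] = field(default_factory=list)             # LOAD
    edges: List[Tuple[int, int]] = field(default_factory=list)     # LINK

# Mimics CONSTRUCT G (nnode, GEOMETRY(2, x, y), LOAD(w),
#                     LINK(nedge, end_pt1, end_pt2)):
end_pt1, end_pt2 = [0, 1, 2], [1, 2, 0]
g = GeoCoL(nverts=3,
           coords=[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
           weights=[1.0, 1.0, 1.0],     # unit weights: the implicit rule
           edges=list(zip(end_pt1, end_pt2)))
print(len(g.edges))   # 3
```

A partitioner that only needs geometry simply ignores the edge list, which is what makes a single interface structure workable across partitioner families.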
C$ CONSTRUCT G (nnode, GEOMETRY(3, xc, yc, zc))
C$ SET distfmt BY PARTITIONING G USING RCB
C$ REDISTRIBUTE reg(distfmt)
   ... loop over edges involving x, y ...

Figure 4: Example of implicit mapping using geometric information.

C$ CONSTRUCT G (nnode, LINK(nedge, end_pt1, end_pt2))
C$ SET distfmt BY PARTITIONING G USING RSB
C$ REDISTRIBUTE reg(distfmt)
   ... loop over faces involving x, y ...

Figure 5: Example of implicit mapping using connectivity information.

Figure 4 illustrates how arrays x and y, distributed in Fortran 90D, can be partitioned implicitly using geometric information. The SET statement (S6 in the figure) invokes a library partitioner; here, recursive coordinate bisection (RCB) uses the coordinate arrays xc, yc and zc to compute a new distribution, distfmt, and the REDISTRIBUTE statement (S7) remaps decomposition reg, and all arrays aligned with it, accordingly. A sequence of customized partitioners (recursive coordinate bisection, recursive spectral bisection, and so on) can be provided, and the user can choose any one of them, as long as the calling sequence matches. Figure 5 shows the same example as Figure 4, except that the GeoCoL structure is constructed with the LINK directive from the edge arrays end_pt1 and end_pt2, so that connectivity rather than geometric information drives the partitioning.

Once the GeoCoL data structure is constructed, data partitioning is carried out by calls to the runtime partitioning procedures. We assume that there are P processors; the partitioning procedure partitions the GeoCoL graph into P subsets, and the resulting distribution is used to remap the distributed arrays.
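Recursive coordinate bisection itself is simple enough to sketch. This toy version, an illustration rather than the library partitioner the directives invoke, splits recursively along the coordinate axis of largest extent.

```python
# Minimal recursive coordinate bisection (RCB): repeatedly sort the
# vertices along the widest axis and split the sorted list in half,
# until the requested number of parts (a power of two) is reached.

def rcb(ids, coords, nparts):
    """Return {vertex id: partition number} for nparts a power of two."""
    if nparts == 1:
        return {i: 0 for i in ids}
    dims = len(coords[ids[0]])
    axis = max(range(dims),
               key=lambda d: max(coords[i][d] for i in ids)
                           - min(coords[i][d] for i in ids))
    order = sorted(ids, key=lambda i: coords[i][axis])
    half = len(order) // 2
    left = rcb(order[:half], coords, nparts // 2)
    right = rcb(order[half:], coords, nparts // 2)
    return {**left, **{i: p + nparts // 2 for i, p in right.items()}}

coords = {0: (0.0,), 1: (0.1,), 2: (0.9,), 3: (1.0,)}
part = rcb([0, 1, 2, 3], coords, 2)
print(part)   # {0: 0, 1: 0, 2: 1, 3: 1}
```

Geometric partitioners of this kind are cheap and need only the GEOMETRY part of the GeoCoL structure; spectral bisection, by contrast, consumes the LINK (connectivity) information.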
4.2 Linking Data Partitioners

In Figure 4 we illustrate a possible set of coupling directives for the loop L2 in Figure 1. Statements S1 to S4 (Figure 4) produce a default initial distribution of arrays x and y and of the indirection arrays. The CONSTRUCT and SET statements direct the generation of a GeoCoL graph that captures the relationship, in the loop, between the left hand side arrays and the indirection arrays end_pt1 and end_pt2; the GeoCoL structure is then passed to the named partitioner, and the resulting distribution is used to redistribute the arrays.

4.3 Loop Iteration Partitioning

Once we have partitioned data, we must partition computational work. The default convention is the owner computes rule: if the left hand side of statement S references a distributed array element, then the processor owning that element executes S. For irregular loops, we instead support a scheme that assigns all the work associated with a loop iteration to a single processor. For example, consider the loop

FORALL i = 1, nedge
S1   x(ib(i)) = ...
S2   y(ia(i)) = ...
END FORALL

Under the owner computes rule, the work of S1 is assigned to OWNER(ib(i)) and the work of S2 to OWNER(ia(i)); whenever OWNER(ib(i)) is not equal to OWNER(ia(i)), iteration i is split between processors, and the values of ia(i) and ib(i) must be communicated. Were the whole iteration assigned to a single processor, this communication could be avoided. We have developed runtime procedures that assign each loop iteration to a processor based on the distribution of the distributed array elements the iteration references.
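One plausible reading of this iteration partitioning rule, assigning each iteration to the processor that owns most of the data it references, can be sketched as follows. The helper names are hypothetical, not the CHAOS interface.

```python
# Assign a loop iteration to the processor owning the largest share of
# the distributed-array elements it touches (ties broken by lowest
# processor rank), so the whole iteration runs on one processor.

from collections import Counter

def assign_iteration(referenced_elements, owner):
    """owner maps a global element index to its processor."""
    votes = Counter(owner(e) for e in referenced_elements)
    best = max(votes.values())
    return min(p for p, v in votes.items() if v == best)

owner = lambda e: e // 4          # block distribution, block size 4
# iteration i references elements 1 (proc 0), 5 and 6 (proc 1):
print(assign_iteration([1, 5, 6], owner))   # 1
```

Whatever tie-breaking policy is used, the key property is that each iteration gets exactly one home processor, so indirection array values fetched for that iteration can be reused across its statements.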
5 Compiler Support

In Section 4 we presented directives a programmer can use to specify implicitly how data and loop iterations are to be partitioned. In this section we outline how the compiler transformations embed calls to the CHAOS mapper coupler procedures in the generated code. A (simplified) version of the code produced for a program that uses the directives is shown in Figure 6. When a CONSTRUCT statement is encountered, the compiler generates calls to CHAOS procedures that build the GeoCoL data structure during program execution, starting from the user specified initial distribution. The GeoCoL structure is then passed to the partitioner named in the SET statement, and the REDISTRIBUTE statement is translated into calls to CHAOS data remapping procedures that move the irregularly distributed arrays from the initial distribution (reg) to the new distribution (distfmt). Finally, loop iterations are partitioned and loops are preprocessed, as described in Section 4.3, whenever a loop accesses an irregularly distributed array.

K1  Read the mesh; start with a BLOCK distribution of end_pt1, end_pt2, x and y
K2  Call CHAOS procedures to generate the GeoCoL structure G (nnode, LINK(nedge, end_pt1, end_pt2))
K3  Partition G using RSB to obtain distribution distfmt
K4  Remap arrays from distribution reg to distribution distfmt

Figure 6: Compiler transformations for implicit data mapping.
6 Experimental Results

6.1 Timing Results for Schedule Reuse

In this section, we present performance data for the schedule saving technique described in Section 3. The timings were obtained on an Intel iPSC/860 for two kinds of codes: a loop that sweeps over the edges of a 3-D unstructured mesh in an Euler solver, and the non-bonded force calculation loop of a molecular dynamics code (a 648 atom water simulation). Each code was run with two different kinds of distributions: a recursive bisection based distribution [24] and a simple block distribution. Tables 3 and 4 report the time needed to carry out the inspector (build the communication schedules), to remap the arrays, and to execute 100 iterations of the loops, for varying numbers of processors. Because the schedules are computed once and then reused over the 100 iterations, the inspector and remapping costs are amortized and the executor time dominates; the comparison between these one-time costs and the executor time emphasizes the importance of schedule reuse.

6.2 Timing Results Using the Mapper Coupler

In this section, we present timings that compare the costs incurred by the compiler generated mapper coupler, which was incorporated in the Fortran 90D compiler developed at Syracuse University, with the costs incurred by a hand embedded mapper coupler. These timings involve a loop over the edges of a 3-D unstructured Euler mesh, run on an Intel iPSC/860 with two different kinds of partitioners. The row labeled Spectral Bisection in Table 2 depicts the time needed to partition the GeoCoL graph data structure using a parallelized version of Simon's eigenvalue based recursive spectral bisection [24]; Executor depicts the time needed to carry out the communication and computation of the loop.
Table 2: Unstructured 53K mesh: timings (in seconds) for the hand coded and compiler generated versions, using spectral bisection and block partitions, on 16 to 64 processors; the rows give partitioner, schedule generation, executor and total times.
The results demonstrate that the performance of the compiler generated code is within about 10% of the hand coded version. The timings also point out that the choice of partitioner matters: when a coordinate bisection partition was chosen, the executor ran significantly faster than when the arrays were partitioned into the contiguous blocks directly supported by HPF, and a detailed timing analysis concluded that the slower block partitioned executor arose from the communication associated with the blocked distribution of the mesh. The work outlined here consequently requires irregular distributions of arrays that are not directly supported by HPF.

7 Related Research

There are a number of projects aimed at providing programming environment and compiler support for distributed memory machines, among them the Jade project at Stanford, the CODE project at Austin, Marina Chen's compiler project, and the Kali and PARTI runtime compilation projects [27, 16]. The Kali compiler [16] and the ARF compiler described in our earlier work were, to our knowledge, the first compilers to implement inspector/executor style runtime preprocessing for irregularly distributed arrays. Related work has also been carried out by von Hanxleden [11], who developed value based partitioners, which decompose arrays based on the values of array elements.

8 Conclusions

In this paper, we have described two new runtime compilation mechanisms and presented timings from a prototype implementation in the Fortran 90D compiler.
Table 3: Coordinate bisection partitioned, with schedule reuse: inspector, remap, executor and total times (in seconds) for the 53K mesh on 16, 32 and 64 processors and for the 648 atom simulation on 4 and 8 processors.

Table 4: Performance of block partitioning with schedule reuse (time in seconds):

                53K Mesh     648 Atoms
Processors         32         4      8
Inspector         1.9        2.7    1.5
Remap             2.8        4.5    2.6
Executor         54.7       10.3    7.6
Total            59.4       17.5   11.7
The first mechanism invokes a user specified mapping procedure via compiler directives; the directives allow the user to use program arrays to describe graph connectivity, the spatial location of array elements, and computational load. The second mechanism is a simple conservative method by which a compiler can recognize when it is possible to reuse previously computed results from inspectors, such as:

- communication schedules,
- loop iteration partitions, and
- information that associates off-processor copies of data with on-processor buffer locations.

The schedule reuse mechanism is independent of the partitioning directives and should be applicable to other compilers that embed runtime preprocessing. We consider this work to be part of an integrated effort towards developing runtime support and compiler methods for unstructured problems in High Performance Fortran and related languages. The CHAOS procedures described in this paper are available from netlib via anonymous ftp.

Acknowledgments

The authors would like to thank Chuck Koelbel for many fruitful discussions about Fortran D compilers, Sanjay Ranka and Alan Sussman for many helpful discussions and suggestions, and Raja Das for help in proofreading. The authors would also like to thank Geoffrey Fox and Reinhard von Hanxleden, whose work on static and dynamic distributed array partitioners influenced this paper, for their helpful suggestions.
References

[1] S. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. SIAM Journal on Scientific and Statistical Computing, 12(1), January 1991.

[2] M. J. Berger and S. H. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, C-36(5):570-580, May 1987.

[3] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory machines. Concurrency: Practice and Experience, June 1991.

[4] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4:187, 1983.

[8] B. Chapman, P. Mehrotra, and H. Zima. Vienna Fortran: a Fortran language specification. Report ACPC-TR92-4, Austrian Center for Parallel Computation, University of Vienna, March 1992.

[9] Z. Bozkus et al. Compiling Fortran 90D/HPF for distributed memory MIMD computers. Technical Report SCCS-444, NPAC, Syracuse University, 1993.

[10] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report TR90-141, Department of Computer Science, Rice University, December 1990.

[11] R. von Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. Compiler analysis for irregular problems in Fortran D. In Proceedings of the 5th Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.

[13] R. von Hanxleden and L. R. Scott. Load balancing on message passing architectures. Journal of Parallel and Distributed Computing, 1991.

[14] High Performance Fortran Forum. Draft High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, January 1993.

[16] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 177-186, 1990.

[17] N. Mansour. Physical optimization algorithms for mapping data to distributed memory multiprocessors. Technical report, School of Computer and Information Science, Syracuse University, 1992.

[20] D. J. Mavriplis. Three dimensional unstructured multigrid for the Euler equations. In AIAA 10th Computational Fluid Dynamics Conference, paper 91-1549cp, June 1991.

[21] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, pages 140-152, July 1988.

[22] B. Nour-Omid, A. Raefsky, and G. Lyzenga. Solving finite element equations on concurrent computers. In Symposium on Parallel Computations and Their Impact on Mechanics, 1986.

[23] J. Saltz, H. Berryman, and J. Wu. Runtime compilation for multiprocessors. Concurrency: Practice and Experience, 1991.

[24] H. Simon. Partitioning of unstructured mesh problems for parallel processing. In Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications, 1991.

[27] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5):457-482, 1991.

[28] H. Zima, H.-J. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.