
Syracuse University

SUrface
Northeast Parallel Architecture Center L.C. Smith College of Engineering and Computer Science

1-1-1993

Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse
Ravi Ponnusamy
University of Maryland, Computer Science Department; Syracuse University, Northeast Parallel Architecture Center

Joel Saltz
University of Maryland, Computer Science Department

Alok Choudhary
Syracuse University, Northeast Parallel Architectures Center

Follow this and additional works at: http://surface.syr.edu/npac
Part of the Computer Sciences Commons

Recommended Citation
Ponnusamy, Ravi; Saltz, Joel; and Choudhary, Alok, "Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse" (1993). Northeast Parallel Architecture Center. Paper 10. http://surface.syr.edu/npac/10

This Working Paper is brought to you for free and open access by the L.C. Smith College of Engineering and Computer Science at SUrface. It has been accepted for inclusion in Northeast Parallel Architecture Center by an authorized administrator of SUrface. For more information, please contact surface@syr.edu.

Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse*

Ravi Ponnusamy†‡        Joel Saltz†        Alok Choudhary‡

† Computer Science Department, University of Maryland, College Park, MD 20742
‡ Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244
Abstract

In this paper, we describe two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors (e.g. communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.

1 Introduction

In sparse and unstructured problems, the data access pattern is determined by variable values known only at runtime. On distributed memory machines, large data arrays need to be partitioned between the local memories of processors; these partitioned arrays are called distributed arrays, and long term storage of distributed array data is assigned to specific processor and memory locations [23]. It is frequently advantageous to partition distributed arrays in an irregular manner. For instance, the way in which the nodes of an irregular computational mesh are numbered frequently does not have a useful correspondence to the connectivity pattern of the mesh. When we partition the data structures of such a problem in a way that minimizes interprocessor communication, we may need to assign arbitrary array elements to each processor. In recent years, promising heuristic methods for carrying out such partitionings have been developed, and the tradeoffs associated with the different methods have been studied [24, 25, 19, 17, 2, 13].

The runtime preprocessing needed to schedule data movement and to partition data and computation can also be generated by a distributed memory compiler. In this paper, we present runtime compilation methods and a prototype implementation that make it possible for compilers to efficiently handle irregular problems coded using a set of language extensions closely related to Fortran D [10] or Vienna Fortran [28].

On distributed memory architectures, loops with indirect array accesses can be handled by transforming the original loop into two sequences of code: an inspector and an executor [21]. The inspector partitions loop iterations, allocates local memory for each unique off-processor distributed array element accessed by the loop, and builds a communication schedule to prefetch the required off-processor data. In the executor phase, the actual communication and computation are carried out. The ARF [26] and KALI [16] compilers used this kind of transformation to handle irregular loops.
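To make the inspector/executor split concrete, the sketch below is a toy single-process inspector written by us for illustration; it is not the ARF or KALI code, and the block distribution, problem sizes and access pattern are assumptions. For a loop reading x(ib(i)), it lists, per processor, the off-processor elements of x that would have to be fetched before the executor runs.

#include <stdio.h>

/* Toy inspector for:  forall i = 1,N: y(i) = x(ib(i)).
   x and the iterations are block distributed over P processors.
   The fetch lists printed here stand in for a communication
   schedule; all names are illustrative. */
#define N   16
#define P   4
#define BLK (N / P)

static int owner(int g) { return g / BLK; }   /* block distribution */

int main(void) {
    int ib[N];
    for (int i = 0; i < N; i++)
        ib[i] = (7 * i + 3) % N;              /* an irregular access pattern */

    for (int p = 0; p < P; p++) {             /* inspector, per processor */
        printf("proc %d must fetch:", p);
        for (int i = p * BLK; i < (p + 1) * BLK; i++)
            if (owner(ib[i]) != p)            /* off-processor reference */
                printf(" x[%d] from proc %d", ib[i], owner(ib[i]));
        printf("\n");
    }
    /* Executor phase (not shown): gather the listed elements into local
       buffers according to the schedule, then run the loop with indices
       translated to local buffer positions. */
    return 0;
}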

We have implemented compiler transformations that allow users to specify the information needed to produce a customized distribution of array elements and loop iterations. This information can consist of a description of graph connectivity, of the spatial location of array elements, and of the computational load associated with loop iterations. The compiler generates code that, at runtime, produces a standardized representation of this information. Based on user directives, the compiler then passes the standardized representation to a (user specified) partitioner, and it also generates code that, at runtime, produces a data structure that is used to partition loop iterations.

The compiler additionally generates code that, at runtime, maintains a record of when a Fortran 90D loop, array statement or intrinsic may have written to a distributed array. This record is used to indirectly check, each time an inspector could be reused, whether any indirection arrays may have been modified since the last time the inspector was invoked. In many cases this simple conservative scheme makes it possible to reuse previously computed inspector results.

To our knowledge, the Fortran 90D compiler described in this paper is the first implementation of this kind of distributed memory compiler support. We also note that in the Vienna Fortran [28] language definition, a user can also specify a customized distribution function; the compiler transformations and runtime support described here can also be applied to Vienna Fortran.

*This work was sponsored in part by ARPA (NAG-1-1485), NSF (ASC 9213821) and ONR (SC292-1-22913). Author Choudhary was also supported by an NSF Young Investigator award (CCR-9357840). The content of the information does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
C Single statement loop L1
FORALL i = 1, N
   y(ia(i)) = x(ib(i)) + ... x(ic(i))
END FORALL

C Sweep over edges: Loop L2
FORALL i = 1, N
   REDUCE (ADD, y(end_pt1(i)), f(x(end_pt1(i)), x(end_pt2(i))))
   REDUCE (ADD, y(end_pt2(i)), g(x(end_pt1(i)), x(end_pt2(i))))
END FORALL

[Figure 2 is a flowchart of five phases: Phase A - generate the GeoCoL graph and partition data; Phase B - generate the loop iteration graph and partition loop iterations; Phase C - remap arrays and loop iterations; Phase D - preprocess loops; Phase E - execute loops.]
Figure 1: Example Irregular Loops

Figure 2: Solving Irregular Problems with Compiler-Linked Runtime Data Partitioning
We have implemented our methods as part of the Fortran 90D compiler being developed at Syracuse University [9]. Our implementation results on simple templates reveal that the performance of the compiler generated code is within 10% of the hand parallelized version.

This paper is organized as follows. We set the context in Section 2. In Section 3, we describe the runtime technique used to save and reuse communication schedules. In Section 4 we describe the procedures used to couple data and loop iteration partitioners to compilers. In Section 5 we present an overview of our compiler effort and describe the transformations used to generate the runtime data structures and preprocessing. In Section 6 we present performance results that characterize our methods. We briefly discuss related work in Section 7, and we conclude in Section 8.

2 Overview

2.1 Overview of CHAOS

We have developed a runtime support library to deal with irregular problems that consist of a sequence of clearly demarcated concurrent computational phases; the library is called CHAOS. CHAOS is a superset of the PARTI library developed earlier in our project [21, 26, 23], and provides new capabilities required to efficiently support irregular problems on distributed memory machines.

We assume that irregular accesses arise from loops that reference a distributed array indexed by another distributed array (an indirection array), with a single level of such indirection. We also assume that the only dependencies carried by a loop are left hand side reductions (e.g. accumulation, max, min), and that loops are expressed directly in Fortran D syntax. In the example shown in Figure 1, the first loop is a single statement loop without dependencies; the second is a loop in which we carry out reduction operations. These loops are similar to those found in unstructured computational fluid dynamics codes and molecular dynamics codes, and we use them to demonstrate our runtime procedures and compiler transformations in the following sections.

Solving an irregular problem using our runtime support involves five steps (Figure 2); the first three involve mapping data and computations onto processors. We provide a brief description of these steps here and describe them in detail in later sections.

Initially, the distributed arrays are decomposed in a known regular manner. In Phase A of Figure 2, CHAOS procedures can be called to construct a graph data structure (the GeoCoL data structure) associated with a particular set of loops; the GeoCoL data structure represents how data arrays are accessed by those loops. The GeoCoL data structure is passed to a partitioner, which calculates how the data arrays should be distributed. In Phase B, the newly calculated array distributions are used to decide how loop iterations are to be partitioned among processors; this calculation takes into account the actual loop iteration data access patterns. In Phase C we carry out the remapping of data arrays and loop iterations. In Phase D, we carry out the preprocessing needed to (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, and (3) support a shared name space. This preprocessing involves generating communication schedules, translating globally indexed array references into local indices, and allocating buffer space for copies of off-processor data.
It is also necessary to retrieve the required data from the distributed memories. Finally, in Phase E we use the information from the earlier phases to carry out the computation.

CHAOS and PARTI procedures have been used in a variety of applications, including sparse matrix linear solvers, adaptive computational fluid dynamics codes, molecular dynamics codes, and a prototype compiler [23] aimed at distributed memory multiprocessors.

2.2 Overview of Existing Language Support

S1      REAL*8 x(N), y(N)
S2      INTEGER map(N)
S3      DECOMPOSITION reg(N), irreg(N)
S4      DISTRIBUTE reg(block)
S5      ALIGN map with reg
S6      ... set map array using some mapping method ...
S7      DISTRIBUTE irreg(map)
S8      ALIGN x, y with irreg

Figure 3: Fortran D Irregular Distribution
The data decomposition directives we employ for irregular problems will be presented in the context of Fortran D. While our work is presented in the context of Fortran D, the same optimizations and analogous language extensions could be used for a wide range of languages such as Vienna Fortran and HPF. Fortran D and Fortran 90D (which evolved from Fortran D and Fortran 90) provide a rich set of data decomposition specifications; a definition of these language extensions may be found in [10, 8]. These languages, as currently defined, require that users explicitly specify how distributed array elements are to be partitioned between processors.

In Figure 3, we present an example of an irregular distribution of a distributed array. In Fortran D, one declares a template, called a DECOMPOSITION, which is used to characterize the significant attributes of a distributed array; a decomposition fixes the name, dimensionality and size of the distributed array template. A distribution is produced using two declarations. The first is the DECOMPOSITION declaration itself. The second is DISTRIBUTE, an executable statement that specifies how a template is to be mapped onto processors. Fortran D provides the user with a choice of several regular distributions; in addition, a user can explicitly specify how a distribution is to be mapped onto the processors. A specific array is associated with a distribution using the Fortran D statement ALIGN.

In statement S3 of Figure 3, two one dimensional decompositions, each of size N, are defined. In statement S4, decomposition reg is partitioned into equal sized blocks, with one block assigned to each processor. In statement S5, array map is aligned with distribution reg; map will be used to specify (in statement S7) how distribution irreg is to be partitioned between processors. An irregular distribution is specified using an integer array: when map(i) is set equal to p, element i of the distribution irreg is assigned to processor p.

The difficulty with the declarations depicted in Figure 3 is that it is not obvious how to partition the irregularly distributed array. The map array, which gives the distribution pattern of irreg, has to be generated separately by running a partitioner, and the Fortran D constructs are not rich enough to let the user couple the generation of the map array to the compilation process. While there is a wealth of partitioning heuristics available, there is no standard interface between the partitioners and the application codes, and coding partitioners from scratch can represent a significant effort.
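As a concrete illustration of this map-based scheme, the sketch below is ours; the replicated translation table and all names are assumptions made for illustration (real systems such as PARTI distribute the translation table itself rather than replicating it). It computes, for each global index, the owning processor given by map and a local offset on that owner.

#include <stdio.h>

/* Toy translation for DISTRIBUTE irreg(map): element i lives on
   processor map[i]; local[i] is its offset in that processor's
   local storage.  A replicated table like this is only workable
   for small N. */
#define N 8
#define P 2

int main(void) {
    int map[N] = {0, 1, 0, 1, 1, 0, 1, 0};   /* partitioner output */
    int local[N], count[P] = {0};

    for (int i = 0; i < N; i++)               /* assign local offsets */
        local[i] = count[map[i]]++;

    for (int i = 0; i < N; i++)
        printf("x(%d) -> processor %d, local offset %d\n",
               i, map[i], local[i]);
    return 0;
}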

3 Communication Schedule Reuse

The cost of carrying out an inspector (phases B, C and D in Figure 2) can be amortized when the information computed by the inspector is used repeatedly. Compile time analysis that supports this kind of reuse is described in [12, 7]. We propose a simple conservative runtime method that in many cases allows us to reuse the results from inspectors. The results of the inspector for a loop L can be reused as long as: 1) the distributions of the data arrays referenced in loop L have remained unchanged since the last time the inspector was invoked, and 2) there is no possibility that the indirection arrays associated with loop L have been modified since the last inspector invocation.

The compiler generates code that at runtime maintains a record of when a Fortran 90D loop, array statement or intrinsic may have written to a distributed array. This record is used to indirectly check, each time an inspector could be reused, whether any indirection arrays may have been modified since the last time the inspector was invoked.

In this presentation, we assume that we are carrying out an inspector for a forall loop.
We also assume that all indirect references to a distributed array y are of the form y(ia(i)), where ia is a distributed array and i is the forall loop index.

A data access descriptor (DAD) of a distributed array contains (among other things) the current distribution type (e.g. block, cyclic, irregular) and the size of the array. In order to generate correct distributed memory code, whenever the compiler generates a reference to a distributed array it must have access to the DAD of that array. In our scheme, we also maintain a global data structure that records when any distributed array may have been modified. A global variable nmod represents the cumulative number of loops, array statements or intrinsics executed that may have modified any distributed array. Note that we are not counting the number of assignments to a distributed array; we are counting the number of times the program executes any block of code that writes to a given distributed array, so nmod may be viewed as a global timestamp. Each time an array a may be modified at runtime, we increment nmod and set last_mod(DAD(a)) = nmod; last_mod(DAD(a)) thus records the value of the global timestamp when a was last (possibly) written. If the array a is remapped, we set last_mod(DAD(a)) = nmod in the same way.

When an inspector for a forall loop L is carried out, it must perform potentially expensive preprocessing. Assume that L has m data arrays x_i, 1 <= i <= m, and n indirection arrays ind_j, 1 <= j <= n. Each time an inspector for L is carried out, we store the following information:

- L.DAD(x_i), the data access descriptor of each unique data array x_i, 1 <= i <= m, and L.DAD(ind_j), the data access descriptor of each unique indirection array ind_j, 1 <= j <= n, and
- L.last_mod(L.DAD(ind_j)), the values of last_mod(DAD(ind_j)) for 1 <= j <= n.

For a given forall loop L, we therefore maintain two sets of data access descriptors. For each data array x_i we maintain DAD(x_i), the current data access descriptor associated with x_i, and L.DAD(x_i), a record of the data access descriptor that was associated with x_i when L last carried out its inspector. For each indirection array ind_j we also maintain two timestamps: last_mod(DAD(ind_j)), the global timestamp associated with the current data access descriptor of ind_j, and L.last_mod(L.DAD(ind_j)), the global timestamp of DAD(ind_j) last recorded by L's inspector.

After the first time L's inspector has been executed, the following checks are performed before each subsequent execution of L. If any of the following conditions is false, the inspector must be repeated for L:

1. DAD(x_i) == L.DAD(x_i), 1 <= i <= m
2. DAD(ind_j) == L.DAD(ind_j), 1 <= j <= n
3. last_mod(DAD(ind_j)) == L.last_mod(L.DAD(ind_j)), 1 <= j <= n

Because the above algorithm tracks all possible modifications of distributed arrays, there is potential for high runtime overhead. The overhead is likely to be small in most computationally intensive parallel Fortran 90D codes (see Section 6): calculations in such codes primarily occur in loops, array intrinsics and array statements, so we need to record possible modifications only once per loop, intrinsic or statement call. We employ the same method to track changes to the arrays used in the construction of the GeoCoL graph data structure (Section 4.1.1); this makes it possible to avoid generating a new GeoCoL graph, and carrying out a potentially expensive repartitioning, when no change has occurred.

We could further optimize our inspector reuse mechanism by noting that there is no need to record all possible modifications of distributed arrays; we could instead limit ourselves to recording modifications of distributed arrays that are used as indirection arrays. Such an optimization would require interprocedural analysis to identify the sets of arrays that must be tracked at runtime. Future work will include an exploration of this optimization.
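The following sketch, written by us for illustration, shows the shape of this bookkeeping and of the three-condition reuse test; the structure layout, field names and helper functions are our assumptions, not the actual compiler-generated code.

#include <stdbool.h>
#include <stdio.h>

/* Simplified data access descriptor and reuse check.  All names are
   illustrative stand-ins for the compiler-maintained structures. */
typedef struct {
    int  dist_type;      /* e.g. 0 = block, 1 = cyclic, 2 = irregular */
    int  size;
    long last_mod;       /* nmod value at last possible write/remap   */
} DAD;

static long nmod = 0;    /* global modification timestamp             */

/* Called for any loop, array statement or intrinsic that may write a. */
static void note_write(DAD *a) { a->last_mod = ++nmod; }

static bool same_dad(const DAD *a, const DAD *b) {
    return a->dist_type == b->dist_type && a->size == b->size;
}

enum { M = 1, NIND = 1 };            /* one data, one indirection array */
typedef struct {                     /* record saved by L's inspector   */
    DAD  data_saved[M], ind_saved[NIND];
    long ind_stamp[NIND];
} LoopRecord;

/* Conditions 1-3: true means the saved schedule may be reused. */
static bool can_reuse(const LoopRecord *L, const DAD *x, const DAD *ind) {
    if (!same_dad(x,   &L->data_saved[0])) return false;   /* condition 1 */
    if (!same_dad(ind, &L->ind_saved[0]))  return false;   /* condition 2 */
    return ind->last_mod == L->ind_stamp[0];               /* condition 3 */
}

int main(void) {
    DAD x = {2, 1000, 0}, ia = {0, 1000, 0};
    LoopRecord L = {{x}, {ia}, {ia.last_mod}};      /* inspector ran once   */
    printf("reuse? %d\n", can_reuse(&L, &x, &ia));  /* 1: nothing changed   */
    note_write(&ia);                                /* indirection modified */
    printf("reuse? %d\n", can_reuse(&L, &x, &ia));  /* 0: rerun inspector   */
    return 0;
}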

4 Coupling Partitioners

In irregular problems, it is often desirable to allocate computational work by assigning all the computations that involve a given loop iteration to a single processor [3]. Consequently, we partition both distributed arrays and loop iterations, using a two-phase approach (Figure 2). In the first phase, termed data partitioning, the distributed arrays are partitioned. In the second phase, called workload or loop iteration partitioning, loop iterations are partitioned using the information from the first phase. This appears to be a practical approach, as in many cases the same set of distributed arrays is used by many loops. The following two subsections describe the two phases.

4.1 Data Partitioning

When we partition distributed arrays, we have not yet assigned loop iterations to processors. We assume that loop iterations will be partitioned so as to attempt to minimize non-local distributed array references (although computational work will not necessarily be carried out on the processor associated with the left hand side of each statement appearing in the loop - we call this the almost owner computes rule).

4.1.1 Interface Data Structures for Data Partitioners

We link partitioners to user programs by using a data structure that stores the information on which the data partitioning is to be based. Partitioners can make use of different kinds of program information. Some partitioners operate on data structures that represent undirected graphs [24, 15]; the graph vertices represent array indices and the edges represent dependencies between those indices. Consider the loops in Figure 1. In both loops, the graph vertices represent the N elements of arrays x and y. The graph edges in the first loop of Figure 1 are the union of the edges linking vertices ia(i) and ib(i) and the edges linking vertices ia(i) and ic(i), for i = 1,N. The graph edges in the second loop of Figure 1 are the edges linking vertices end_pt1(i) and end_pt2(i), for i = 1,N.

In some cases, it is possible to associate geometrical information with a problem. For instance, meshes often arise from finite element or finite difference discretizations; in such cases, each mesh point has a location in space. We can assign each graph vertex a set of coordinates that describe its spatial location, and these spatial locations can be used to partition the data structures [2, 22]. Vertices may also be assigned weights to represent estimated computational costs. In order to estimate these costs accurately, we need information on how the work associated with a loop will be partitioned. One way of deriving vertex weights is to make the implicit assumption that an owner computes rule will be used to partition the work; under this assumption, the computational cost of executing a statement is attributed to the processor owning the left hand side array reference. This assumption results in a graph with unit vertex weights for the first loop in Figure 1, while the vertex weights for the second loop of Figure 1 would be proportional to vertex degree when the functions f and g have identical computational costs. Vertex weights cannot be used as the sole partitioning criterion except in embarrassingly parallel problems in which communication costs do not dominate.

A given partitioner can make use of combinations of connectivity, geometrical and load information. For instance, it is sometimes important to take estimated computational costs into account when carrying out geometrically based partitionings (e.g. coordinate or inertial bisection [5]) for problems in which costs vary from node to node; other partitioners make use of both connectivity and geometrical information. There are many partitioning heuristics available, based on physical phenomena and physical proximity [24, 2, 25, 13]. Currently, these partitioners must be coupled to user programs in a manual fashion. Manual coupling is particularly troublesome when we wish to make use of parallelized partitioners. Further, different partitioners use different (but similar) data structures and problem-dependent information, making it extremely difficult and tedious to adapt programs to different partitioners.

Since the data structure we use to store the information on which data partitioning is based can represent any combination of Geometrical, Connectivity and Load information, we call it the GeoCoL data structure.

4.1.2 Generating the GeoCoL Data Structure

We propose a directive, CONSTRUCT, that a user can employ to direct the compiler to generate a GeoCoL data structure. Spatial information is specified using the keyword GEOMETRY. The following is an example of a GeoCoL declaration that specifies geometrical information:

C$ CONSTRUCT G1 (N, GEOMETRY(3, xcord, ycord, zcord))

This statement defines a GeoCoL data structure called G1 having N vertices, with spatial coordinate information given by the arrays xcord, ycord and zcord. The GEOMETRY construct is closely related to the geometry based decomposition proposed by von Hanxleden [11].

Similarly, vertex weights are specified using the keyword LOAD, as follows:

C$ CONSTRUCT G2 (N, LOAD(weight))

This declaration defines a GeoCoL construct called G2 consisting of N vertices, with vertex i having load weight(i).

The following example illustrates how connectivity information is specified in a GeoCoL declaration:

C$ CONSTRUCT G3 (N, LINK(E, edge_list1, edge_list2))

The keyword LINK is used to specify the edges associated with the GeoCoL graph. The integer arrays edge_list1 and edge_list2 list the vertices associated with each of the E graph edges.

Any combination of spatial, load and connectivity information can be used to generate a GeoCoL data structure. For instance, a GeoCoL data structure that uses both geometry and connectivity information can be specified as follows:

C$ CONSTRUCT G4 (N, GEOMETRY(3, xcord, ycord, zcord), LINK(E, edge_list1, edge_list2))
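To suggest what a partitioner on the geometry side of this interface does with GEOMETRY data, here is a toy recursive coordinate bisection (RCB) partitioner in C. It is our illustration only: a serial, two-dimensional median split, whereas production RCB partitioners such as [2] run in parallel and choose splitting dimensions adaptively.

#include <stdio.h>
#include <stdlib.h>

static int     split_dim;   /* dimension used by the comparator        */
static double *pts;         /* pts[2*v + d] = coordinate d of vertex v */

static int cmp(const void *a, const void *b) {
    double x = pts[2 * *(const int *)a + split_dim];
    double y = pts[2 * *(const int *)b + split_dim];
    return (x > y) - (x < y);
}

/* Assign the n vertices listed in ids[] to `parts` parts numbered from
   `base`, by recursive median splits that alternate dimensions. */
static void rcb(int *ids, int n, int parts, int base, int d, int *part) {
    if (parts == 1) {
        for (int i = 0; i < n; i++) part[ids[i]] = base;
        return;
    }
    split_dim = d;
    qsort(ids, n, sizeof *ids, cmp);       /* median split on dimension d */
    int h = n / 2;
    rcb(ids,     h,     parts / 2,         base,             1 - d, part);
    rcb(ids + h, n - h, parts - parts / 2, base + parts / 2, 1 - d, part);
}

int main(void) {
    double xy[] = {0,0, 1,0, 2,0, 3,0, 0,1, 1,1, 2,1, 3,1};  /* 8 vertices */
    int ids[8], part[8];
    pts = xy;
    for (int v = 0; v < 8; v++) ids[v] = v;
    rcb(ids, 8, 4, 0, 0, part);                              /* 4 parts */
    for (int v = 0; v < 8; v++)
        printf("vertex %d -> part %d\n", v, part[v]);
    return 0;
}

The resulting part array plays the same role as the map array of Figure 3: it is exactly the kind of irregular distribution a GeoCoL-driven partitioner hands back to the runtime.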

Once the GeoCoL data structure is constructed, data partitioning is carried out. Assuming that there are P processors:

1. At compile time, dependency coupling code is generated. When the program executes, this code generates calls to the runtime support that construct the GeoCoL data structure.

2. The GeoCoL data structure is passed to a data partitioning procedure that partitions the GeoCoL graph into P subgraphs.

3. The GeoCoL vertices assigned to each subgraph specify an irregular distribution of the distributed arrays.

        REAL*8 x(nnode), y(nnode)
        INTEGER end_pt1(nedge), end_pt2(nedge)
S1      DYNAMIC, DECOMPOSITION reg(nnode), reg2(nedge)
S2      DISTRIBUTE reg(BLOCK), reg2(BLOCK)
S3      ALIGN x, y with reg
S4      ALIGN end_pt1, end_pt2 with reg2
        ....
C       Loop over edges involving x, y
        ....
        call read_data(end_pt1, end_pt2)
S5      CONSTRUCT G (nnode, LINK(nedge, end_pt1, end_pt2))
S6      SET distfmt BY PARTITIONING G USING RSB
S7      REDISTRIBUTE reg(distfmt)
C       Loop over faces involving x, y
        ....

Figure 4: Example of Implicit Mapping Using Connectivity Information

        ....
S5      CONSTRUCT G (nnode, GEOMETRY(3, xc, yc, zc))
S6      SET distfmt BY PARTITIONING G USING RCB
S7      REDISTRIBUTE reg(distfmt)
C       Loop over edges involving x, y
        ....

Figure 5: Example of Implicit Mapping Using Geometric Information

4.2 Linking Data Partitioners

In Figure 4 we illustrate a possible set of partitioner coupling directives for the loop L2 in Figure 1. We use statements S1 to S4 to produce a default initial distribution of the arrays x and y and of the indirection arrays end_pt1 and end_pt2. Statements S5 and S6 direct the generation of code to construct the GeoCoL data structure. Statement S5 indicates, using the keyword LINK, that the GeoCoL graph edges are to be generated from the relationship between elements of arrays x and y given by the indirection arrays end_pt1 and end_pt2. Statement S6 calls the partitioner with the GeoCoL structure as input; here a recursive spectral bisection (RSB) partitioner is used. A library of commonly used partitioners will be provided, and the user can choose any one of them; the user can also link a customized partitioner as long as the calling sequence matches. Finally, in statement S7 the arrays aligned with decomposition reg are remapped using the new distribution (distfmt) returned by the partitioner.

Figure 5 illustrates code similar to that of Figure 4, except that here the GeoCoL data structure is constructed using spatial information. The coordinate arrays xc, yc and zc, which give the spatial coordinates of the elements of x and y, are aligned with the same decomposition to which x and y are aligned. Statement S5 specifies that the GeoCoL data structure is to be constructed using the coordinate information, and statement S6 specifies that recursive coordinate bisection (RCB) is used to partition the data.

4.3 Loop Iteration Partitioning

Once we have partitioned data, we must partition computational work. One convention is to compute a program statement S in the processor associated with the distributed array element on S's left hand side; this is normally referred to as the owner computes rule. (If the left hand side of S references a replicated variable, then the work is carried out in all processors.) One drawback to the owner computes rule in sparse codes is that we may need to generate communication within loops even in the absence of loop carried dependencies. For example, consider the following loop:

        FORALL i = 1, N
S1          x(ib(i)) = ...
S2          y(ia(i)) = ... x(ib(i)) ...
        END FORALL

This loop has no loop carried dependencies. Were we to use the owner computes rule, for iteration i statement S1 would be computed on the owner of ib(i) (OWNER(ib(i))), while statement S2 would be computed on the owner of ia(i) (OWNER(ia(i))). The value of x(ib(i)) would consequently have to be communicated whenever OWNER(ib(i)) is not equal to OWNER(ia(i)).

An alternate convention is to assign all the work associated with a loop iteration to a single processor. We have developed data structures and procedures to support schemes of this kind. Our current default is to assign each loop iteration to the processor that is the home of the largest number of distributed array elements referenced in that iteration, as the sketch below illustrates.
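The following small C function is our illustration of that default heuristic; the fixed processor count and the owner lists are assumptions made for the example. It counts how many of an iteration's references each processor owns and picks the processor with the most.

#include <stdio.h>

#define P 4   /* number of processors (illustrative) */

/* owners[r] = owner of the r-th distributed array element referenced
   by this iteration; returns the processor assigned the iteration. */
static int assign_iteration(const int owners[], int nrefs) {
    int count[P] = {0}, best = owners[0];
    for (int r = 0; r < nrefs; r++)
        count[owners[r]]++;
    for (int p = 0; p < P; p++)
        if (count[p] > count[best]) best = p;   /* home of most references */
    return best;
}

int main(void) {
    /* One iteration referencing x(ib(i)), y(ia(i)), x(ic(i)), whose
       owners happen to be processors 2, 1 and 2 respectively. */
    int owners[] = {2, 1, 2};
    printf("iteration assigned to processor %d\n",
           assign_iteration(owners, 3));        /* prints 2 */
    return 0;
}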

C$      Start with a BLOCK distribution of arrays x, y and end_pt1, end_pt2
K1      Read Mesh (end_pt1, end_pt2, ...)
K2      CONSTRUCT G (nnode, LINK(nedge, end_pt1, end_pt2))
        Call CHAOS procedures to generate GeoCoL data structure G
K3      SET distfmt BY PARTITIONING G USING RSB
        Pass GeoCoL graph to RSB partitioner
        Obtain new distribution format (distfmt) from the partitioner
K4      REDISTRIBUTE reg(distfmt)
        Remap arrays (x and y) aligned with reg to distribution distfmt

Figure 6: Compiler Transformations for Implicit Data Mapping

5 Compiler Support

In the previous section we presented the directives a programmer can use to implicitly specify how data and loop iterations are to be partitioned between processors. In this section we outline the compiler transformations and the embedded CHAOS procedure calls used to carry out this implicitly defined mapping. We use the example in Figure 4 to show how the compiler embeds the mapper coupler procedures; a (simplified) version of the transformed code is shown in Figure 6.

We start with a program in which the distributed arrays have a regular (BLOCK) default distribution. When the CONSTRUCT statement is encountered, the compiler generates calls to the CHAOS procedures that build the GeoCoL data structure (K2). The GeoCoL structure is then passed to the user specified partitioner (K3), which returns an irregular distribution format (distfmt). When the REDISTRIBUTE statement is encountered, the compiler generates the CHAOS procedures needed to remap the data arrays (x and y) aligned with the initial distribution reg to the new distribution (K4). Loop iterations are partitioned at runtime, using the method described in Section 4.3, whenever a loop accesses at least one irregularly distributed array.
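The sketch below shows, in C, the overall shape of the call sequence the compiler emits for Figure 6. Every function here is a stub we invented for illustration; the actual CHAOS entry points and signatures are not given in this paper.

#include <stdio.h>

typedef int DistFmt;   /* stand-in for an irregular distribution format */

/* K2-K3: build the GeoCoL graph from the indirection arrays and run
   the selected partitioner (RSB in Figure 6). */
static DistFmt build_geocol_and_partition(int nnode, int nedge,
                                          const int *e1, const int *e2) {
    printf("K2: GeoCoL graph with %d vertices, %d edges\n", nnode, nedge);
    printf("K3: partition with RSB, obtain distfmt\n");
    (void)e1; (void)e2;
    return 0;
}

/* K4: remap one array aligned with reg to the new distribution. */
static void redistribute(const char *name, DistFmt fmt) {
    printf("K4: remap %s to distribution %d\n", name, fmt);
}

int main(void) {
    int end_pt1[4] = {0, 1, 2, 3}, end_pt2[4] = {1, 2, 3, 0};  /* K1 */
    DistFmt fmt = build_geocol_and_partition(4, 4, end_pt1, end_pt2);
    redistribute("x", fmt);
    redistribute("y", fmt);
    return 0;
}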

6 Experimental Results

6.1 Timing Results for Schedule Reuse

In this section, we present performance data for the schedule saving technique proposed in Section 3. These timings involve compiler generated code for two application templates: a loop from an unstructured mesh Euler solver [20], run on 3-D meshes of 10K and 53K points, and the electrostatic force calculation loop from the CHARMM molecular dynamics code [4], run on a 648 atom water simulation. Both loops sweep over edges in the manner of loop L2 in Figure 1. Table 1 depicts the performance of the compiler generated code with and without schedule reuse, varying the number of processors of an Intel iPSC/860 hypercube. The table presents the execution time for 100 iterations, with the distributed arrays irregularly decomposed using a recursive binary dissection partitioner. The results shown in the table emphasize the importance of schedule reuse.

                      10K Mesh           53K Mesh           648 Atoms
                      Processors         Processors         Processors
                      4     8     16     16    32    64     4     8     16
  No Schedule Reuse   400   214   123    668   398   239    707   384   227
  Schedule Reuse      17.6  10.8  7.7    30.4  23.0  17.4   15.2  9.7   8.0

Table 1: Compiler Generated Code With and Without Schedule Reuse (Time in Secs)

6.2 Timing Results Using the Mapper Coupler

In this section, we present timings that compare the costs incurred by the compiler generated mapper coupler with the costs of a hand embedded mapper coupler. These timings involve the unstructured Euler solver loop and the molecular dynamics force calculation loop described above. The compiler-linked mapping technique was incorporated in the Fortran 90D compiler being developed at Syracuse University, and we present the performance of our runtime techniques on different numbers of processors of an Intel iPSC/860. To map the arrays we employed two different kinds of parallelized partitioners: 1) a geometry based partitioner (recursive coordinate bisection [2]), and 2) a connectivity based partitioner (recursive spectral bisection [24]).

The performance of the compiler embedded mapper version and the hand embedded mapper version of the unstructured mesh template is shown in Table 2. In the table, Partitioner depicts the time needed to partition the arrays, Inspector the time needed to build the communication schedule, and Executor the time needed to carry out the actual communication and computation. The Spectral Bisection entries use a parallelized version of Simon's eigenvalue partitioner [24] to partition the GeoCoL graph data structure; we partitioned the GeoCoL graph into a number of subgraphs equal to the number of processors employed, and it should be noted that any parallelized graph partitioner could be used instead. Graph Generation depicts the time needed to generate the GeoCoL graph, and the executor phase was carried out 100 times. The results demonstrate that the performance of the compiler generated code is within 10% of the hand coded version.

[Table 2 compares the hand coded and compiler generated versions of the 53K mesh template on 32 processors, for coordinate bisection, spectral bisection and block partitionings; the rows give graph generation, partitioner, inspector, executor and total times. The numeric entries are largely illegible in this copy.]

Table 2: Unstructured Mesh Template - 53K Mesh - 32 Processors (Time in Secs)

A detailed timing breakdown of the compiler-linked coordinate bisection partitioner with schedule reuse is presented in Table 3. In Table 4, we present results for a naive BLOCK partitioning of the same problems, in which we assigned each processor contiguous blocks of array elements, in order to quantify the performance effects of the decision to partition the problem. We see that the use of either a coordinate bisection or a spectral bisection partitioner leads to a factor of two to three reduction in executor time compared to the use of block partitioning. This comparison also points out the importance of the way in which loop iterations are partitioned: when arrays and iterations are block partitioned, the executor incurs significantly higher communication overhead.

                10K Mesh           53K Mesh           648 Atoms
                Processors         Processors         Processors
                4     8     16     16    32    64     4     8     16
  Partitioner   0.6   0.6   0.4    1.6   2.5   1.8    0.1   0.1   0.1
  Inspector     1.2   0.6   0.4    1.9   0.7   2.0    2.2   1.2   0.7
  Remap         3.1   1.6   0.9    5.1   3.0   1.9    4.8   2.6   1.5
  Executor      12.7  7.0   6.0    -     17.2  12.3   8.1   5.8   5.7
  Total         17.6  10.8  7.7    -     -     17.4   15.2  9.7   8.0

Table 3: Performance of Compiler-Linked Coordinate Bisection Partitioner with Schedule Reuse (Time in Secs; entries marked - are illegible in the source)

                10K Mesh           53K Mesh           648 Atoms
                Processors         Processors         Processors
                4     8     16     16    32    64     4     8     16
  Inspector     1.5   0.9   0.5    3.9   1.9   1.0    2.7   1.5   0.8
  Remap         3.1   1.6   0.8    4.9   2.8   1.7    4.5   2.6   1.5
  Executor      26.0  20.8  14.7   74.1  54.7  35.3   10.3  7.6   7.3
  Total         30.4  23.3  16.0   82.9  59.4  38.0   17.5  11.7  9.6

Table 4: Performance of Block Partitioning with Schedule Reuse (Time in Secs)

7 Related Work

Work on compiler directives that decompose arrays based on array element values has been carried out by von Hanxleden [11]; such partitioners are called value based decomposition partitioners. Our GEOMETRY construct can be viewed as a particular type of value based decomposition.

Several researchers have developed programming environments that are targeted at particular classes of irregular or adaptive problems. Williams [25] describes a programming environment (DIME) for calculations with unstructured triangular meshes on distributed memory machines. Baden [1] has developed a programming environment targeted at particle computations; this environment provides facilities that support dynamic load balancing.

There are a variety of compiler projects targeted at distributed memory multiprocessors [27, 16], including the DINO project at Colorado, the Jade project at Stanford, and the CODE project at Austin, which provide parallel programming environments. Runtime compilation methods are employed in four compiler projects: the Fortran D project [14], the Kali project [16], Marina Chen's work at Yale [18], and our PARTI project [21, 26, 23]. The Kali compiler [16] was the first compiler to implement inspector/executor type runtime preprocessing, and the ARF compiler was the first compiler to support irregularly distributed arrays [26]. In earlier work, several of the authors outlined (but did not implement) a strategy by which a compiler could generate connectivity based data decompositions directly from marked loops [6]. The approach described here requires more input from the user, but it makes it possible to couple a much wider range of partitioners and supports irregular distributions of arrays that perform far better than the block distributions supported by HPF.

8 Conclusions

In this paper, we have described and presented timing results for two new ideas for dealing effectively with irregular computations. The first mechanism invokes a user specified mapping procedure using a set of compiler directives. The second is a simple conservative method by which a compiler can recognize, in many cases, the potential for reusing previously computed results from inspectors (e.g. communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). The Fortran 90D compiler implementation described here demonstrates both ideas.

We view the CHAOS procedures described in this paper as forming a portion of a portable, compiler independent, runtime support library. The CHAOS runtime support library contains procedures that

- link partitioners to user programs,
- partition loop iterations and indirection arrays,
- remap arrays from one distribution to another, and
- carry out index translation, buffer allocation and communication schedule generation.

We consider our work to be part of an ARPA sponsored integrated effort towards developing powerful compiler independent runtime support for parallel programming languages. The runtime support described here can be employed in other compilers; in fact, it has also been incorporated into the Vienna Fortran compiler. We tested our prototype compiler on templates extracted from an unstructured mesh computational fluid dynamics code [20] and from a molecular dynamics code [4]; we also embedded our runtime support by hand and compared its performance against the compiler generated code. The compiler's performance on these templates was within about 10% of the hand coded versions.

The CHAOS procedures described in this paper are available for public distribution; they can be obtained from netlib or from the anonymous ftp site hyena.cs.umd.edu.

Acknowledgments

The authors would like to thank Alan Sussman and Raja Das for many fruitful discussions and for help in proofreading. The authors would like to thank Geoffrey Fox, Chuck Koelbel and Sanjay Ranka for many enlightening discussions about compilers and about integrating partitioners into compilers. We would also like to thank Chuck Koelbel, Ken Kennedy and Seema Hiranandani for many useful discussions about how to embed such partitioners into Fortran-D and about static and dynamic distributed array partitioning. Our special thanks go to Reinhard von Hanxleden for many helpful suggestions. The authors gratefully acknowledge the help of Zeki Bozkus and Tom Haupt, and the time they spent orienting us to the internals of the Fortran 90D compiler. We would also like to thank Horst Simon for the use of his unstructured mesh partitioning software.
References

[1] S. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. SIAM J. Sci. and Stat. Computation, 12(1), January 1991.

[2] M. J. Berger and S. H. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, C-36(5):570-580, May 1987.

[3] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory machines. Concurrency: Practice and Experience, 3(3):159-178, June 1991.

[4] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4:187, 1983.

[5] T. W. Clark, R. v. Hanxleden, J. A. McCammon, and L. R. Scott. Parallelization strategies for a molecular dynamics program. In Intel Supercomputer University Partners Conference, Timberline Lodge, Mt. Hood, OR, April 1992.

[6] R. Das, R. Ponnusamy, J. Saltz, and D. Mavriplis. Distributed memory compiler methods for irregular problems - data copy reuse and runtime partitioning. In J. Saltz and P. Mehrotra, editors, Compilers and Runtime Software for Scalable Multiprocessors. Elsevier, Amsterdam, The Netherlands, 1991.

[7] R. Das and J. H. Saltz. Program slicing techniques for compiling irregular problems. In Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.

[8] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report COMP TR90-141, Department of Computer Science, Rice University, December 1990.

[9] Z. Bozkus et al. Compiling Fortran 90D/HPF for distributed memory MIMD computers. Technical Report SCCS-444, NPAC, Syracuse University, 1993.

[10] B. Chapman, P. Mehrotra, and H. Zima. Vienna Fortran - a language specification. Report ACPC-TR92-4, Austrian Center for Parallel Computation, University of Vienna, Vienna, Austria, March 1992.

[11] R. v. Hanxleden. Compiler support for machine independent parallelization of irregular problems. Technical report, Center for Research on Parallel Computation, Rice University, 1992.

[12] R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. Compiler analysis for irregular problems in Fortran D. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.

[13] R. v. Hanxleden and L. R. Scott. Load balancing on message passing architectures. Journal of Parallel and Distributed Computing, 13:312-324, 1991.

[14] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler support for machine-independent parallel programming in Fortran D. In J. Saltz and P. Mehrotra, editors, Compilers and Runtime Software for Scalable Multiprocessors. Elsevier, Amsterdam, The Netherlands, to appear.

[15] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291-307, February 1970.

[16] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 177-186. ACM, March 1990.

[17] W. E. Leland. Load-balancing heuristics and process behavior. In Proceedings of Performance 86 and ACM SIGMETRICS 86, pages 54-69, 1986.

[18] L. C. Lu and M. C. Chen. Parallelizing loops with indirect array references or pointers. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, August 1991.

[19] N. Mansour. Physical optimization algorithms for mapping data to distributed-memory multiprocessors. Ph.D. dissertation, School of Computer Science, Syracuse University, 1992.

[20] D. J. Mavriplis. Three dimensional unstructured multigrid for the Euler equations. In AIAA 10th Computational Fluid Dynamics Conference, paper 91-1549cp, June 1991.

[21] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, pages 140-152, July 1988.

[22] B. Nour-Omid, A. Raefsky, and G. Lyzenga. Solving finite element equations on concurrent computers. In Proceedings of the Symposium on Parallel Computations and Their Impact on Mechanics, Boston, December 1987.

[23] J. Saltz, H. Berryman, and J. Wu. Runtime compilation for multiprocessors. Concurrency: Practice and Experience, 3(6):573-592, 1991.

[24] H. Simon. Partitioning of unstructured mesh problems for parallel processing. In Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications. Pergamon Press, 1991.

[25] R. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5):457-482, February 1991.

[26] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation methods for multicomputers. In Proceedings of the 1991 International Conference on Parallel Processing, volume 2, pages 26-30, 1991.

[27] D. Loveman (editor). Draft High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, January 1993.

[28] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. Vienna Fortran - a language specification, version 1.1. Interim Report 21, ICASE, NASA Langley Research Center, Hampton, VA, 1992.