
Syracuse University

SUrface
Northeast Parallel Architecture Center L.C. Smith College of Engineering and Computer Science

1-1-1993

Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse
Ravi Ponnusamy
University of Maryland, Computer Science Department; Syracuse University, Northeast Parallel Architecture Center

Joel Saltz
University of Maryland, Computer Science Department

Alok Choudhary
Syracuse University, Northeast Parallel Architectures Center

Follow this and additional works at: http://surface.syr.edu/npac
Part of the Computer Sciences Commons

Recommended Citation
Ponnusamy, Ravi; Saltz, Joel; and Choudhary, Alok, "Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse" (1993). Northeast Parallel Architecture Center. Paper 10. http://surface.syr.edu/npac/10

This Working Paper is brought to you for free and open access by the L.C. Smith College of Engineering and Computer Science at SUrface. It has been accepted for inclusion in Northeast Parallel Architecture Center by an authorized administrator of SUrface. For more information, please contact surface@syr.edu.

Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse*

Ravi Ponnusamy†‡        Joel Saltz†        Alok Choudhary‡

† Computer Science Department, University of Maryland, College Park, MD 20742
‡ Northeast Parallel Architectures Center, Syracuse University, Syracuse, NY 13244
Abstract

In this paper, we describe two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors (e.g. communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.

1 Introduction

In sparse and unstructured problems, the data access pattern is determined by variable values known only at runtime. On distributed memory machines, large data arrays need to be partitioned between the local memories of processors; these partitioned arrays are called distributed arrays, and long term storage of distributed array data is assigned to specific processor and memory locations [23]. It is frequently advantageous to partition distributed arrays in an irregular manner. For instance, the way in which the nodes of an irregular computational mesh are numbered frequently does not have a useful correspondence to the connectivity pattern of the mesh. When we partition the data structures of such a problem in a way that minimizes interprocessor communication, we may need to assign arbitrary array elements to each processor. In recent years, promising heuristic methods for carrying out such partitionings have been developed, and the tradeoffs associated with the different methods have been studied [24, 25, 19, 17, 2, 13].

The runtime preprocessing needed to schedule data movement and to partition data and computation can also be generated by a distributed memory compiler. In this paper, we present runtime compilation methods and a prototype implementation that make it possible for compilers to efficiently handle irregular problems coded using a set of language extensions closely related to Fortran D [10] or Vienna Fortran [28].

On distributed memory architectures, loops with indirect array accesses can be handled by transforming the original loop into two sequences of code: an inspector and an executor [21]. The inspector partitions loop iterations, allocates local memory for each unique off-processor distributed array element accessed by the loop, and builds a communication schedule to prefetch the required off-processor data. In the executor phase, the actual communication and computation are carried out. The ARF [26] and KALI [16] compilers used this kind of transformation to handle irregular loops.
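To make the inspector/executor split concrete, the sketch below is a toy single-process inspector written by us for illustration; it is not the ARF or KALI code, and the block distribution, problem sizes and access pattern are assumptions. For a loop reading x(ib(i)), it lists, per processor, the off-processor elements of x that would have to be fetched before the executor runs.

#include <stdio.h>

/* Toy inspector for:  forall i = 1,N: y(i) = x(ib(i)).
   x and the iterations are block distributed over P processors.
   The fetch lists printed here stand in for a communication
   schedule; all names are illustrative. */
#define N   16
#define P   4
#define BLK (N / P)

static int owner(int g) { return g / BLK; }   /* block distribution */

int main(void) {
    int ib[N];
    for (int i = 0; i < N; i++)
        ib[i] = (7 * i + 3) % N;              /* an irregular access pattern */

    for (int p = 0; p < P; p++) {             /* inspector, per processor */
        printf("proc %d must fetch:", p);
        for (int i = p * BLK; i < (p + 1) * BLK; i++)
            if (owner(ib[i]) != p)            /* off-processor reference */
                printf(" x[%d] from proc %d", ib[i], owner(ib[i]));
        printf("\n");
    }
    /* Executor phase (not shown): gather the listed elements into local
       buffers according to the schedule, then run the loop with indices
       translated to local buffer positions. */
    return 0;
}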

We have implemented compiler transformations that allow users to specify the information needed to produce a customized distribution of array elements and loop iterations. This information can consist of a description of graph connectivity, of the spatial location of array elements, and of the computational load associated with loop iterations. The compiler generates code that, at runtime, produces a standardized representation of this information. Based on user directives, the compiler then passes the standardized representation to a (user specified) partitioner, and it also generates code that, at runtime, produces a data structure that is used to partition loop iterations.

The compiler additionally generates code that, at runtime, maintains a record of when a Fortran 90D loop, array statement or intrinsic may have written to a distributed array. This record is used to indirectly check, each time an inspector could be reused, whether any indirection arrays may have been modified since the last time the inspector was invoked. In many cases this simple conservative scheme makes it possible to reuse previously computed inspector results.

To our knowledge, the Fortran 90D compiler described in this paper is the first implementation of this kind of distributed memory compiler support. We also note that in the Vienna Fortran [28] language definition, a user can also specify a customized distribution function; the compiler transformations and runtime support described here can also be applied to Vienna Fortran.

*This work was sponsored in part by ARPA (NAG-1-1485), NSF (ASC 9213821) and ONR (SC292-1-22913). Author Choudhary was also supported by an NSF Young Investigator award (CCR-9357840). The content of the information does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.
C Single statement loop L1
FORALL i = 1, N
   y(ia(i)) = x(ib(i)) + ... x(ic(i))
END FORALL

C Sweep over edges: Loop L2
FORALL i = 1, N
   REDUCE (ADD, y(end_pt1(i)), f(x(end_pt1(i)), x(end_pt2(i))))
   REDUCE (ADD, y(end_pt2(i)), g(x(end_pt1(i)), x(end_pt2(i))))
END FORALL

[Figure 2 is a flowchart of five phases: Phase A - generate the GeoCoL graph and partition data; Phase B - generate the loop iteration graph and partition loop iterations; Phase C - remap arrays and loop iterations; Phase D - preprocess loops; Phase E - execute loops.]
Figure 1: Example Irregular Loops

Figure 2: Solving Irregular Problems with Compiler-Linked Runtime Data Partitioning
We have implemented our methods as part of the Fortran 90D compiler being developed at Syracuse University [9]. Our implementation results on simple templates reveal that the performance of the compiler generated code is within 10% of the hand parallelized version.

This paper is organized as follows. We set the context in Section 2. In Section 3, we describe the runtime technique used to save and reuse communication schedules. In Section 4 we describe the procedures used to couple data and loop iteration partitioners to compilers. In Section 5 we present an overview of our compiler effort and describe the transformations used to generate the runtime data structures and preprocessing. In Section 6 we present performance results that characterize our methods. We briefly discuss related work in Section 7, and we conclude in Section 8.

2 Overview

2.1 Overview of CHAOS

We have developed a runtime support library to deal with irregular problems that consist of a sequence of clearly demarcated concurrent computational phases; the library is called CHAOS. CHAOS is a superset of the PARTI library developed earlier in our project [21, 26, 23], and provides new capabilities required to efficiently support irregular problems on distributed memory machines.

We assume that irregular accesses arise from loops that reference a distributed array indexed by another distributed array (an indirection array), with a single level of such indirection. We also assume that the only dependencies carried by a loop are left hand side reductions (e.g. accumulation, max, min), and that loops are expressed directly in Fortran D syntax. In the example shown in Figure 1, the first loop is a single statement loop without dependencies; the second is a loop in which we carry out reduction operations. These loops are similar to those found in unstructured computational fluid dynamics codes and molecular dynamics codes, and we use them to demonstrate our runtime procedures and compiler transformations in the following sections.

Solving an irregular problem using our runtime support involves five steps (Figure 2); the first three involve mapping data and computations onto processors. We provide a brief description of these steps here and describe them in detail in later sections.

Initially, the distributed arrays are decomposed in a known regular manner. In Phase A of Figure 2, CHAOS procedures can be called to construct a graph data structure (the GeoCoL data structure) associated with a particular set of loops; the GeoCoL data structure represents how data arrays are accessed by those loops. The GeoCoL data structure is passed to a partitioner, which calculates how the data arrays should be distributed. In Phase B, the newly calculated array distributions are used to decide how loop iterations are to be partitioned among processors; this calculation takes into account the actual loop iteration data access patterns. In Phase C we carry out the remapping of data arrays and loop iterations. In Phase D, we carry out the preprocessing needed to (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, and (3) support a shared name space. This preprocessing involves generating communication schedules, translating globally indexed array references into local indices, and allocating buffer space for copies of off-processor data.
It is also necessary to retrieve the required data from the distributed memories. Finally, in Phase E we use the information from the earlier phases to carry out the computation.

CHAOS and PARTI procedures have been used in a variety of applications, including sparse matrix linear solvers, adaptive computational fluid dynamics codes, molecular dynamics codes, and a prototype compiler [23] aimed at distributed memory multiprocessors.

2.2 Overview of Existing Language Support

S1      REAL*8 x(N), y(N)
S2      INTEGER map(N)
S3      DECOMPOSITION reg(N), irreg(N)
S4      DISTRIBUTE reg(block)
S5      ALIGN map with reg
S6      ... set map array using some mapping method ...
S7      DISTRIBUTE irreg(map)
S8      ALIGN x, y with irreg

Figure 3: Fortran D Irregular Distribution
The data decomposition directives we employ for irregular problems will be presented in the context of Fortran D. While our work is presented in the context of Fortran D, the same optimizations and analogous language extensions could be used for a wide range of languages such as Vienna Fortran and HPF. Fortran D and Fortran 90D (which evolved from Fortran D and Fortran 90) provide a rich set of data decomposition specifications; a definition of these language extensions may be found in [10, 8]. These languages, as currently defined, require that users explicitly specify how distributed array elements are to be partitioned between processors.

In Figure 3, we present an example of an irregular distribution of a distributed array. In Fortran D, one declares a template, called a DECOMPOSITION, which is used to characterize the significant attributes of a distributed array; a decomposition fixes the name, dimensionality and size of the distributed array template. A distribution is produced using two declarations. The first is the DECOMPOSITION declaration itself. The second is DISTRIBUTE, an executable statement that specifies how a template is to be mapped onto processors. Fortran D provides the user with a choice of several regular distributions; in addition, a user can explicitly specify how a distribution is to be mapped onto the processors. A specific array is associated with a distribution using the Fortran D statement ALIGN.

In statement S3 of Figure 3, two one dimensional decompositions, each of size N, are defined. In statement S4, decomposition reg is partitioned into equal sized blocks, with one block assigned to each processor. In statement S5, array map is aligned with distribution reg; map will be used to specify (in statement S7) how distribution irreg is to be partitioned between processors. An irregular distribution is specified using an integer array: when map(i) is set equal to p, element i of the distribution irreg is assigned to processor p.

The difficulty with the declarations depicted in Figure 3 is that it is not obvious how to partition the irregularly distributed array. The map array, which gives the distribution pattern of irreg, has to be generated separately by running a partitioner, and the Fortran D constructs are not rich enough to let the user couple the generation of the map array to the compilation process. While there is a wealth of partitioning heuristics available, there is no standard interface between the partitioners and the application codes, and coding partitioners from scratch can represent a significant effort.
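As a concrete illustration of this map-based scheme, the sketch below is ours; the replicated translation table and all names are assumptions made for illustration (real systems such as PARTI distribute the translation table itself rather than replicating it). It computes, for each global index, the owning processor given by map and a local offset on that owner.

#include <stdio.h>

/* Toy translation for DISTRIBUTE irreg(map): element i lives on
   processor map[i]; local[i] is its offset in that processor's
   local storage.  A replicated table like this is only workable
   for small N. */
#define N 8
#define P 2

int main(void) {
    int map[N] = {0, 1, 0, 1, 1, 0, 1, 0};   /* partitioner output */
    int local[N], count[P] = {0};

    for (int i = 0; i < N; i++)               /* assign local offsets */
        local[i] = count[map[i]]++;

    for (int i = 0; i < N; i++)
        printf("x(%d) -> processor %d, local offset %d\n",
               i, map[i], local[i]);
    return 0;
}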

3 Communication Schedule Reuse

The cost of carrying out an inspector (phases B, C and D in Figure 2) can be amortized when the information computed by the inspector is used repeatedly. Compile time analysis that supports this kind of reuse is described in [12, 7]. We propose a simple conservative runtime method that in many cases allows us to reuse the results from inspectors. The results of the inspector for a loop L can be reused as long as: 1) the distributions of the data arrays referenced in loop L have remained unchanged since the last time the inspector was invoked, and 2) there is no possibility that the indirection arrays associated with loop L have been modified since the last inspector invocation.

The compiler generates code that at runtime maintains a record of when a Fortran 90D loop, array statement or intrinsic may have written to a distributed array. This record is used to indirectly check, each time an inspector could be reused, whether any indirection arrays may have been modified since the last time the inspector was invoked.

In this presentation, we assume that we are carrying out an inspector for a forall loop.
We also assume that all indirect references to a distributed array y are of the form y(ia(i)), where ia is a distributed array and i is the forall loop index.

A data access descriptor (DAD) of a distributed array contains (among other things) the current distribution type (e.g. block, cyclic, irregular) and the size of the array. In order to generate correct distributed memory code, whenever the compiler generates a reference to a distributed array it must have access to the DAD of that array. In our scheme, we also maintain a global data structure that records when any distributed array may have been modified. A global variable nmod represents the cumulative number of loops, array statements or intrinsics executed that may have modified any distributed array. Note that we are not counting the number of assignments to a distributed array; we are counting the number of times the program executes any block of code that writes to a given distributed array, so nmod may be viewed as a global timestamp. Each time an array a may be modified at runtime, we increment nmod and set last_mod(DAD(a)) = nmod; last_mod(DAD(a)) thus records the value of the global timestamp when a was last (possibly) written. If the array a is remapped, we set last_mod(DAD(a)) = nmod in the same way.

When an inspector for a forall loop L is carried out, it must perform potentially expensive preprocessing. Assume that L has m data arrays x_i, 1 <= i <= m, and n indirection arrays ind_j, 1 <= j <= n. Each time an inspector for L is carried out, we store the following information:

- L.DAD(x_i), the data access descriptor of each unique data array x_i, 1 <= i <= m, and L.DAD(ind_j), the data access descriptor of each unique indirection array ind_j, 1 <= j <= n, and
- L.last_mod(L.DAD(ind_j)), the values of last_mod(DAD(ind_j)) for 1 <= j <= n.

For a given forall loop L, we therefore maintain two sets of data access descriptors. For each data array x_i we maintain DAD(x_i), the current data access descriptor associated with x_i, and L.DAD(x_i), a record of the data access descriptor that was associated with x_i when L last carried out its inspector. For each indirection array ind_j we also maintain two timestamps: last_mod(DAD(ind_j)), the global timestamp associated with the current data access descriptor of ind_j, and L.last_mod(L.DAD(ind_j)), the global timestamp of DAD(ind_j) last recorded by L's inspector.

After the first time L's inspector has been executed, the following checks are performed before each subsequent execution of L. If any of the following conditions is false, the inspector must be repeated for L:

1. DAD(x_i) == L.DAD(x_i), 1 <= i <= m
2. DAD(ind_j) == L.DAD(ind_j), 1 <= j <= n
3. last_mod(DAD(ind_j)) == L.last_mod(L.DAD(ind_j)), 1 <= j <= n

Because the above algorithm tracks all possible modifications of distributed arrays, there is potential for high runtime overhead. The overhead is likely to be small in most computationally intensive parallel Fortran 90D codes (see Section 6): calculations in such codes primarily occur in loops, array intrinsics and array statements, so we need to record possible modifications only once per loop, intrinsic or statement call. We employ the same method to track changes to the arrays used in the construction of the GeoCoL graph data structure (Section 4.1.1); this makes it possible to avoid generating a new GeoCoL graph, and carrying out a potentially expensive repartitioning, when no change has occurred.

We could further optimize our inspector reuse mechanism by noting that there is no need to record all possible modifications of distributed arrays; we could instead limit ourselves to recording modifications of distributed arrays that are used as indirection arrays. Such an optimization would require interprocedural analysis to identify the sets of arrays that must be tracked at runtime. Future work will include an exploration of this optimization.
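The following sketch, written by us for illustration, shows the shape of this bookkeeping and of the three-condition reuse test; the structure layout, field names and helper functions are our assumptions, not the actual compiler-generated code.

#include <stdbool.h>
#include <stdio.h>

/* Simplified data access descriptor and reuse check.  All names are
   illustrative stand-ins for the compiler-maintained structures. */
typedef struct {
    int  dist_type;      /* e.g. 0 = block, 1 = cyclic, 2 = irregular */
    int  size;
    long last_mod;       /* nmod value at last possible write/remap   */
} DAD;

static long nmod = 0;    /* global modification timestamp             */

/* Called for any loop, array statement or intrinsic that may write a. */
static void note_write(DAD *a) { a->last_mod = ++nmod; }

static bool same_dad(const DAD *a, const DAD *b) {
    return a->dist_type == b->dist_type && a->size == b->size;
}

enum { M = 1, NIND = 1 };            /* one data, one indirection array */
typedef struct {                     /* record saved by L's inspector   */
    DAD  data_saved[M], ind_saved[NIND];
    long ind_stamp[NIND];
} LoopRecord;

/* Conditions 1-3: true means the saved schedule may be reused. */
static bool can_reuse(const LoopRecord *L, const DAD *x, const DAD *ind) {
    if (!same_dad(x,   &L->data_saved[0])) return false;   /* condition 1 */
    if (!same_dad(ind, &L->ind_saved[0]))  return false;   /* condition 2 */
    return ind->last_mod == L->ind_stamp[0];               /* condition 3 */
}

int main(void) {
    DAD x = {2, 1000, 0}, ia = {0, 1000, 0};
    LoopRecord L = {{x}, {ia}, {ia.last_mod}};      /* inspector ran once   */
    printf("reuse? %d\n", can_reuse(&L, &x, &ia));  /* 1: nothing changed   */
    note_write(&ia);                                /* indirection modified */
    printf("reuse? %d\n", can_reuse(&L, &x, &ia));  /* 0: rerun inspector   */
    return 0;
}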

4 Coupling Partitioners

In irregular problems, it is often desirable to allocate computational work by assigning all the computations that involve a given loop iteration to a single processor [3]. Consequently, we partition both distributed arrays and loop iterations, using a two-phase approach (Figure 2). In the first phase, termed data partitioning, the distributed arrays are partitioned. In the second phase, called workload or loop iteration partitioning, loop iterations are partitioned using the information from the first phase. This appears to be a practical approach, as in many cases the same set of distributed arrays is used by many loops. The following two subsections describe the two phases.

4.1 Data Partitioning

When we partition distributed arrays, we have not yet assigned loop iterations to processors. We assume that loop iterations will be partitioned so as to attempt to minimize non-local distributed array references (although computational work will not necessarily be carried out on the processor associated with the left hand side of each statement appearing in the loop - we call this the almost owner computes rule).

4.1.1 Interface Data Structures for Data Partitioners

We link partitioners to user programs by using a data structure that stores the information on which the data partitioning is to be based. Partitioners can make use of different kinds of program information. Some partitioners operate on data structures that represent undirected graphs [24, 15]; the graph vertices represent array indices and the edges represent dependencies between those indices. Consider the loops in Figure 1. In both loops, the graph vertices represent the N elements of arrays x and y. The graph edges in the first loop of Figure 1 are the union of the edges linking vertices ia(i) and ib(i) and the edges linking vertices ia(i) and ic(i), for i = 1,N. The graph edges in the second loop of Figure 1 are the edges linking vertices end_pt1(i) and end_pt2(i), for i = 1,N.

In some cases, it is possible to associate geometrical information with a problem. For instance, meshes often arise from finite element or finite difference discretizations; in such cases, each mesh point has a location in space. We can assign each graph vertex a set of coordinates that describe its spatial location, and these spatial locations can be used to partition the data structures [2, 22]. Vertices may also be assigned weights to represent estimated computational costs. In order to estimate these costs accurately, we need information on how the work associated with a loop will be partitioned. One way of deriving vertex weights is to make the implicit assumption that an owner computes rule will be used to partition the work; under this assumption, the computational cost of executing a statement is attributed to the processor owning the left hand side array reference. This assumption results in a graph with unit vertex weights for the first loop in Figure 1, while the vertex weights for the second loop of Figure 1 would be proportional to vertex degree when the functions f and g have identical computational costs. Vertex weights cannot be used as the sole partitioning criterion except in embarrassingly parallel problems in which communication costs do not dominate.

A given partitioner can make use of combinations of connectivity, geometrical and load information. For instance, it is sometimes important to take estimated computational costs into account when carrying out geometrically based partitionings (e.g. coordinate or inertial bisection [5]) for problems in which costs vary from node to node; other partitioners make use of both connectivity and geometrical information. There are many partitioning heuristics available, based on physical phenomena and physical proximity [24, 2, 25, 13]. Currently, these partitioners must be coupled to user programs in a manual fashion. Manual coupling is particularly troublesome when we wish to make use of parallelized partitioners. Further, different partitioners use different (but similar) data structures and problem-dependent information, making it extremely difficult and tedious to adapt programs to different partitioners.

Since the data structure we use to store the information on which data partitioning is based can represent any combination of Geometrical, Connectivity and Load information, we call it the GeoCoL data structure.

4.1.2 Generating the GeoCoL Data Structure

We propose a directive, CONSTRUCT, that a user can employ to direct the compiler to generate a GeoCoL data structure. Spatial information is specified using the keyword GEOMETRY. The following is an example of a GeoCoL declaration that specifies geometrical information:

C$ CONSTRUCT G1 (N, GEOMETRY(3, xcord, ycord, zcord))

This statement defines a GeoCoL data structure called G1 having N vertices, with spatial coordinate information given by the arrays xcord, ycord and zcord. The GEOMETRY construct is closely related to the geometry based decomposition proposed by von Hanxleden [11].

Similarly, vertex weights are specified using the keyword LOAD, as follows:

C$ CONSTRUCT G2 (N, LOAD(weight))

This declaration defines a GeoCoL construct called G2 consisting of N vertices, with vertex i having load weight(i).

The following example illustrates how connectivity information is specified in a GeoCoL declaration:

C$ CONSTRUCT G3 (N, LINK(E, edge_list1, edge_list2))

The keyword LINK is used to specify the edges associated with the GeoCoL graph. The integer arrays edge_list1 and edge_list2 list the vertices associated with each of the E graph edges.

Any combination of spatial, load and connectivity information can be used to generate a GeoCoL data structure. For instance, a GeoCoL data structure that uses both geometry and connectivity information can be specified as follows:

C$ CONSTRUCT G4 (N, GEOMETRY(3, xcord, ycord, zcord), LINK(E, edge_list1, edge_list2))
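To suggest what a partitioner on the geometry side of this interface does with GEOMETRY data, here is a toy recursive coordinate bisection (RCB) partitioner in C. It is our illustration only: a serial, two-dimensional median split, whereas production RCB partitioners such as [2] run in parallel and choose splitting dimensions adaptively.

#include <stdio.h>
#include <stdlib.h>

static int     split_dim;   /* dimension used by the comparator        */
static double *pts;         /* pts[2*v + d] = coordinate d of vertex v */

static int cmp(const void *a, const void *b) {
    double x = pts[2 * *(const int *)a + split_dim];
    double y = pts[2 * *(const int *)b + split_dim];
    return (x > y) - (x < y);
}

/* Assign the n vertices listed in ids[] to `parts` parts numbered from
   `base`, by recursive median splits that alternate dimensions. */
static void rcb(int *ids, int n, int parts, int base, int d, int *part) {
    if (parts == 1) {
        for (int i = 0; i < n; i++) part[ids[i]] = base;
        return;
    }
    split_dim = d;
    qsort(ids, n, sizeof *ids, cmp);       /* median split on dimension d */
    int h = n / 2;
    rcb(ids,     h,     parts / 2,         base,             1 - d, part);
    rcb(ids + h, n - h, parts - parts / 2, base + parts / 2, 1 - d, part);
}

int main(void) {
    double xy[] = {0,0, 1,0, 2,0, 3,0, 0,1, 1,1, 2,1, 3,1};  /* 8 vertices */
    int ids[8], part[8];
    pts = xy;
    for (int v = 0; v < 8; v++) ids[v] = v;
    rcb(ids, 8, 4, 0, 0, part);                              /* 4 parts */
    for (int v = 0; v < 8; v++)
        printf("vertex %d -> part %d\n", v, part[v]);
    return 0;
}

The resulting part array plays the same role as the map array of Figure 3: it is exactly the kind of irregular distribution a GeoCoL-driven partitioner hands back to the runtime.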

Once the GeoCoL data structure is constructed, data partitioning is carried out. Assuming that there are P processors:

1. At compile time, dependency coupling code is generated. When the program executes, this code generates calls to the runtime support that construct the GeoCoL data structure.

2. The GeoCoL data structure is passed to a data partitioning procedure that partitions the GeoCoL graph into P subgraphs.

3. The GeoCoL vertices assigned to each subgraph specify an irregular distribution of the distributed arrays.

        REAL*8 x(nnode), y(nnode)
        INTEGER end_pt1(nedge), end_pt2(nedge)
S1      DYNAMIC, DECOMPOSITION reg(nnode), reg2(nedge)
S2      DISTRIBUTE reg(BLOCK), reg2(BLOCK)
S3      ALIGN x, y with reg
S4      ALIGN end_pt1, end_pt2 with reg2
        ....
C       Loop over edges involving x, y
        ....
        call read_data(end_pt1, end_pt2)
S5      CONSTRUCT G (nnode, LINK(nedge, end_pt1, end_pt2))
S6      SET distfmt BY PARTITIONING G USING RSB
S7      REDISTRIBUTE reg(distfmt)
C       Loop over faces involving x, y
        ....

Figure 4: Example of Implicit Mapping Using Connectivity Information

        ....
S5      CONSTRUCT G (nnode, GEOMETRY(3, xc, yc, zc))
S6      SET distfmt BY PARTITIONING G USING RCB
S7      REDISTRIBUTE reg(distfmt)
C       Loop over edges involving x, y
        ....

Figure 5: Example of Implicit Mapping Using Geometric Information

4.2 Linking Data Partitioners

In Figure 4 we illustrate a possible set of partitioner coupling directives for the loop L2 in Figure 1. We use statements S1 to S4 to produce a default initial distribution of the arrays x and y and of the indirection arrays end_pt1 and end_pt2. Statements S5 and S6 direct the generation of code to construct the GeoCoL data structure. Statement S5 indicates, using the keyword LINK, that the GeoCoL graph edges are to be generated from the relationship between elements of arrays x and y given by the indirection arrays end_pt1 and end_pt2. Statement S6 calls the partitioner with the GeoCoL structure as input; here a recursive spectral bisection (RSB) partitioner is used. A library of commonly used partitioners will be provided, and the user can choose any one of them; the user can also link a customized partitioner as long as the calling sequence matches. Finally, in statement S7 the arrays aligned with decomposition reg are remapped using the new distribution (distfmt) returned by the partitioner.

Figure 5 illustrates code similar to that of Figure 4, except that here the GeoCoL data structure is constructed using spatial information. The coordinate arrays xc, yc and zc, which give the spatial coordinates of the elements of x and y, are aligned with the same decomposition to which x and y are aligned. Statement S5 specifies that the GeoCoL data structure is to be constructed using the coordinate information, and statement S6 specifies that recursive coordinate bisection (RCB) is used to partition the data.

4.3 Loop Iteration Partitioning

Once we have partitioned data, we must partition computational work. One convention is to compute a program statement S in the processor associated with the distributed array element on S's left hand side; this is normally referred to as the owner computes rule. (If the left hand side of S references a replicated variable, then the work is carried out in all processors.) One drawback to the owner computes rule in sparse codes is that we may need to generate communication within loops even in the absence of loop carried dependencies. For example, consider the following loop:

        FORALL i = 1, N
S1          x(ib(i)) = ...
S2          y(ia(i)) = ... x(ib(i)) ...
        END FORALL

This loop has no loop carried dependencies. Were we to use the owner computes rule, for iteration i statement S1 would be computed on the owner of ib(i) (OWNER(ib(i))), while statement S2 would be computed on the owner of ia(i) (OWNER(ia(i))). The value of x(ib(i)) would consequently have to be communicated whenever OWNER(ib(i)) is not equal to OWNER(ia(i)).

An alternate convention is to assign all the work associated with a loop iteration to a single processor. We have developed data structures and procedures to support schemes of this kind. Our current default is to assign each loop iteration to the processor that is the home of the largest number of distributed array elements referenced in that iteration, as the sketch below illustrates.
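The following small C function is our illustration of that default heuristic; the fixed processor count and the owner lists are assumptions made for the example. It counts how many of an iteration's references each processor owns and picks the processor with the most.

#include <stdio.h>

#define P 4   /* number of processors (illustrative) */

/* owners[r] = owner of the r-th distributed array element referenced
   by this iteration; returns the processor assigned the iteration. */
static int assign_iteration(const int owners[], int nrefs) {
    int count[P] = {0}, best = owners[0];
    for (int r = 0; r < nrefs; r++)
        count[owners[r]]++;
    for (int p = 0; p < P; p++)
        if (count[p] > count[best]) best = p;   /* home of most references */
    return best;
}

int main(void) {
    /* One iteration referencing x(ib(i)), y(ia(i)), x(ic(i)), whose
       owners happen to be processors 2, 1 and 2 respectively. */
    int owners[] = {2, 1, 2};
    printf("iteration assigned to processor %d\n",
           assign_iteration(owners, 3));        /* prints 2 */
    return 0;
}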

C$      Start with a BLOCK distribution of arrays x, y and end_pt1, end_pt2
K1      Read Mesh (end_pt1, end_pt2, ...)
K2      CONSTRUCT G (nnode, LINK(nedge, end_pt1, end_pt2))
        Call CHAOS procedures to generate GeoCoL data structure G
K3      SET distfmt BY PARTITIONING G USING RSB
        Pass GeoCoL graph to RSB partitioner
        Obtain new distribution format (distfmt) from the partitioner
K4      REDISTRIBUTE reg(distfmt)
        Remap arrays (x and y) aligned with reg to distribution distfmt

Figure 6: Compiler Transformations for Implicit Data Mapping

5 Compiler Support

In the previous section we presented the directives a programmer can use to implicitly specify how data and loop iterations are to be partitioned between processors. In this section we outline the compiler transformations and the embedded CHAOS procedure calls used to carry out this implicitly defined mapping. We use the example in Figure 4 to show how the compiler embeds the mapper coupler procedures; a (simplified) version of the transformed code is shown in Figure 6.

We start with a program in which the distributed arrays have a regular (BLOCK) default distribution. When the CONSTRUCT statement is encountered, the compiler generates calls to the CHAOS procedures that build the GeoCoL data structure (K2). The GeoCoL structure is then passed to the user specified partitioner (K3), which returns an irregular distribution format (distfmt). When the REDISTRIBUTE statement is encountered, the compiler generates the CHAOS procedures needed to remap the data arrays (x and y) aligned with the initial distribution reg to the new distribution (K4). Loop iterations are partitioned at runtime, using the method described in Section 4.3, whenever a loop accesses at least one irregularly distributed array.
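The sketch below shows, in C, the overall shape of the call sequence the compiler emits for Figure 6. Every function here is a stub we invented for illustration; the actual CHAOS entry points and signatures are not given in this paper.

#include <stdio.h>

typedef int DistFmt;   /* stand-in for an irregular distribution format */

/* K2-K3: build the GeoCoL graph from the indirection arrays and run
   the selected partitioner (RSB in Figure 6). */
static DistFmt build_geocol_and_partition(int nnode, int nedge,
                                          const int *e1, const int *e2) {
    printf("K2: GeoCoL graph with %d vertices, %d edges\n", nnode, nedge);
    printf("K3: partition with RSB, obtain distfmt\n");
    (void)e1; (void)e2;
    return 0;
}

/* K4: remap one array aligned with reg to the new distribution. */
static void redistribute(const char *name, DistFmt fmt) {
    printf("K4: remap %s to distribution %d\n", name, fmt);
}

int main(void) {
    int end_pt1[4] = {0, 1, 2, 3}, end_pt2[4] = {1, 2, 3, 0};  /* K1 */
    DistFmt fmt = build_geocol_and_partition(4, 4, end_pt1, end_pt2);
    redistribute("x", fmt);
    redistribute("y", fmt);
    return 0;
}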

6 Experimental Results

6.1 Timing Results for Schedule Reuse

In this section, we present performance data for the schedule saving technique proposed in Section 3. These timings involve compiler generated code for two application templates: a loop from an unstructured mesh Euler solver [20], run on 3-D meshes of 10K and 53K points, and the electrostatic force calculation loop from the CHARMM molecular dynamics code [4], run on a 648 atom water simulation. Both loops sweep over edges in the manner of loop L2 in Figure 1. Table 1 depicts the performance of the compiler generated code with and without schedule reuse, varying the number of processors of an Intel iPSC/860 hypercube. The table presents the execution time for 100 iterations, with the distributed arrays irregularly decomposed using a recursive binary dissection partitioner. The results shown in the table emphasize the importance of schedule reuse.

                      10K Mesh           53K Mesh           648 Atoms
                      Processors         Processors         Processors
                      4     8     16     16    32    64     4     8     16
  No Schedule Reuse   400   214   123    668   398   239    707   384   227
  Schedule Reuse      17.6  10.8  7.7    30.4  23.0  17.4   15.2  9.7   8.0

Table 1: Compiler Generated Code With and Without Schedule Reuse (Time in Secs)

6.2 Timing Results Using the Mapper Coupler

In this section, we present timings that compare the costs incurred by the compiler generated mapper coupler with the costs of a hand embedded mapper coupler. These timings involve the unstructured Euler solver loop and the molecular dynamics force calculation loop described above. The compiler-linked mapping technique was incorporated in the Fortran 90D compiler being developed at Syracuse University, and we present the performance of our runtime techniques on different numbers of processors of an Intel iPSC/860. To map the arrays we employed two different kinds of parallelized partitioners: 1) a geometry based partitioner (recursive coordinate bisection [2]), and 2) a connectivity based partitioner (recursive spectral bisection [24]).

The performance of the compiler embedded mapper version and the hand embedded mapper version of the unstructured mesh template is shown in Table 2. In the table, Partitioner depicts the time needed to partition the arrays, Inspector the time needed to build the communication schedule, and Executor the time needed to carry out the actual communication and computation. The Spectral Bisection entries use a parallelized version of Simon's eigenvalue partitioner [24] to partition the GeoCoL graph data structure; we partitioned the GeoCoL graph into a number of subgraphs equal to the number of processors employed, and it should be noted that any parallelized graph partitioner could be used instead. Graph Generation depicts the time needed to generate the GeoCoL graph, and the executor phase was carried out 100 times. The results demonstrate that the performance of the compiler generated code is within 10% of the hand coded version.

[Table 2 compares the hand coded and compiler generated versions of the 53K mesh template on 32 processors, for coordinate bisection, spectral bisection and block partitionings; the rows give graph generation, partitioner, inspector, executor and total times. The numeric entries are largely illegible in this copy.]

Table 2: Unstructured Mesh Template - 53K Mesh - 32 Processors (Time in Secs)

A detailed timing breakdown of the compiler-linked coordinate bisection partitioner with schedule reuse is presented in Table 3. In Table 4, we present results for a naive BLOCK partitioning of the same problems, in which we assigned each processor contiguous blocks of array elements, in order to quantify the performance effects of the decision to partition the problem. We see that the use of either a coordinate bisection or a spectral bisection partitioner leads to a factor of two to three reduction in executor time compared to the use of block partitioning. This comparison also points out the importance of the way in which loop iterations are partitioned: when arrays and iterations are block partitioned, the executor incurs significantly higher communication overhead.

                10K Mesh           53K Mesh           648 Atoms
                Processors         Processors         Processors
                4     8     16     16    32    64     4     8     16
  Partitioner   0.6   0.6   0.4    1.6   2.5   1.8    0.1   0.1   0.1
  Inspector     1.2   0.6   0.4    1.9   0.7   2.0    2.2   1.2   0.7
  Remap         3.1   1.6   0.9    5.1   3.0   1.9    4.8   2.6   1.5
  Executor      12.7  7.0   6.0    -     17.2  12.3   8.1   5.8   5.7
  Total         17.6  10.8  7.7    -     -     17.4   15.2  9.7   8.0

Table 3: Performance of Compiler-Linked Coordinate Bisection Partitioner with Schedule Reuse (Time in Secs; entries marked - are illegible in the source)

                10K Mesh           53K Mesh           648 Atoms
                Processors         Processors         Processors
                4     8     16     16    32    64     4     8     16
  Inspector     1.5   0.9   0.5    3.9   1.9   1.0    2.7   1.5   0.8
  Remap         3.1   1.6   0.8    4.9   2.8   1.7    4.5   2.6   1.5
  Executor      26.0  20.8  14.7   74.1  54.7  35.3   10.3  7.6   7.3
  Total         30.4  23.3  16.0   82.9  59.4  38.0   17.5  11.7  9.6

Table 4: Performance of Block Partitioning with Schedule Reuse (Time in Secs)

7 Related Work

Work on compiler directives that decompose arrays based on array element values has been carried out by von Hanxleden [11]; such partitioners are called value based decomposition partitioners. Our GEOMETRY construct can be viewed as a particular type of value based decomposition.

Several researchers have developed programming environments that are targeted at particular classes of irregular or adaptive problems. Williams [25] describes a programming environment (DIME) for calculations with unstructured triangular meshes on distributed memory machines. Baden [1] has developed a programming environment targeted at particle computations; this environment provides facilities that support dynamic load balancing.

There are a variety of compiler projects targeted at distributed memory multiprocessors [27, 16], including the DINO project at Colorado, the Jade project at Stanford, and the CODE project at Austin, which provide parallel programming environments. Runtime compilation methods are employed in four compiler projects: the Fortran D project [14], the Kali project [16], Marina Chen's work at Yale [18], and our PARTI project [21, 26, 23]. The Kali compiler [16] was the first compiler to implement inspector/executor type runtime preprocessing, and the ARF compiler was the first compiler to support irregularly distributed arrays [26]. In earlier work, several of the authors outlined (but did not implement) a strategy by which a compiler could generate connectivity based data decompositions directly from marked loops [6]. The approach described here requires more input from the user, but it makes it possible to couple a much wider range of partitioners and supports irregular distributions of arrays that perform far better than the block distributions supported by HPF.

8 Conclusions

In this paper, we have described and presented timing results for two new ideas for dealing effectively with irregular computations. The first mechanism invokes a user specified mapping procedure using a set of compiler directives. The second is a simple conservative method by which a compiler can recognize, in many cases, the potential for reusing previously computed results from inspectors (e.g. communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). The Fortran 90D compiler implementation described here demonstrates both ideas.

We view the CHAOS procedures described in this paper as forming a portion of a portable, compiler independent, runtime support library. The CHAOS runtime support library contains procedures that

- link partitioners to user programs,
- partition loop iterations and indirection arrays,
- remap arrays from one distribution to another, and
- carry out index translation, buffer allocation and communication schedule generation.

We consider our work to be part of an ARPA sponsored integrated effort towards developing powerful compiler independent runtime support for parallel programming languages. The runtime support described here can be employed in other compilers; in fact, it has also been incorporated into the Vienna Fortran compiler. We tested our prototype compiler on templates extracted from an unstructured mesh computational fluid dynamics code [20] and from a molecular dynamics code [4]; we also embedded our runtime support by hand and compared its performance against the compiler generated code. The compiler's performance on these templates was within about 10% of the hand coded versions.

The CHAOS procedures described in this paper are available for public distribution; they can be obtained from netlib or from the anonymous ftp site hyena.cs.umd.edu.

Acknowledgments

The authors would like to thank Alan Sussman and Raja Das for many fruitful discussions and for help in proofreading. The authors would like to thank Geoffrey Fox, Chuck Koelbel and Sanjay Ranka for many enlightening discussions about compilers and about integrating partitioners into compilers. We would also like to thank Chuck Koelbel, Ken Kennedy and Seema Hiranandani for many useful discussions about how to embed such partitioners into Fortran-D and about static and dynamic distributed array partitioning. Our special thanks go to Reinhard von Hanxleden for many helpful suggestions. The authors gratefully acknowledge the help of Zeki Bozkus and Tom Haupt, and the time they spent orienting us to the internals of the Fortran 90D compiler. We would also like to thank Horst Simon for the use of his unstructured mesh partitioning software.
References

[1] S. Baden. Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors. SIAM J. Sci. and Stat. Computation, 12(1), January 1991.

[2] M. J. Berger and S. H. Bokhari. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, C-36(5):570-580, May 1987.

[3] H. Berryman, J. Saltz, and J. Scroggs. Execution time support for adaptive scientific algorithms on distributed memory machines. Concurrency: Practice and Experience, 3(3):159-178, June 1991.

[4] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4:187, 1983.

[5] T. W. Clark, R. v. Hanxleden, J. A. McCammon, and L. R. Scott. Parallelization strategies for a molecular dynamics program. In Intel Supercomputer University Partners Conference, Timberline Lodge, Mt. Hood, OR, April 1992.

[6] R. Das, R. Ponnusamy, J. Saltz, and D. Mavriplis. Distributed memory compiler methods for irregular problems - data copy reuse and runtime partitioning. In J. Saltz and P. Mehrotra, editors, Compilers and Runtime Software for Scalable Multiprocessors. Elsevier, Amsterdam, The Netherlands, 1991.

[7] R. Das and J. H. Saltz. Program slicing techniques for compiling irregular problems. In Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.

[8] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu. Fortran D language specification. Technical Report COMP TR90-141, Department of Computer Science, Rice University, December 1990.

[9] Z. Bozkus et al. Compiling Fortran 90D/HPF for distributed memory MIMD computers. Technical Report SCCS-444, NPAC, Syracuse University, 1993.

[10] B. Chapman, P. Mehrotra, and H. Zima. Vienna Fortran - a language specification. Report ACPC-TR92-4, Austrian Center for Parallel Computation, University of Vienna, Vienna, Austria, March 1992.

[11] R. v. Hanxleden. Compiler support for machine independent parallelization of irregular problems. Technical report, Center for Research on Parallel Computation, Rice University, 1992.

[12] R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. Compiler analysis for irregular problems in Fortran D. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.

[13] R. v. Hanxleden and L. R. Scott. Load balancing on message passing architectures. Journal of Parallel and Distributed Computing, 13:312-324, 1991.

[14] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler support for machine-independent parallel programming in Fortran D. In J. Saltz and P. Mehrotra, editors, Compilers and Runtime Software for Scalable Multiprocessors. Elsevier, Amsterdam, The Netherlands, to appear.

[15] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291-307, February 1970.

[16] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 177-186. ACM, March 1990.

[17] W. E. Leland. Load-balancing heuristics and process behavior. In Proceedings of Performance 86 and ACM SIGMETRICS 86, pages 54-69, 1986.

[18] L. C. Lu and M. C. Chen. Parallelizing loops with indirect array references or pointers. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, August 1991.

[19] N. Mansour. Physical optimization algorithms for mapping data to distributed-memory multiprocessors. Ph.D. dissertation, School of Computer Science, Syracuse University, 1992.

[20] D. J. Mavriplis. Three dimensional unstructured multigrid for the Euler equations. In AIAA 10th Computational Fluid Dynamics Conference, paper 91-1549cp, June 1991.

[21] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, pages 140-152, July 1988.

[22] B. Nour-Omid, A. Raefsky, and G. Lyzenga. Solving finite element equations on concurrent computers. In Proceedings of the Symposium on Parallel Computations and Their Impact on Mechanics, Boston, December 1987.

[23] J. Saltz, H. Berryman, and J. Wu. Runtime compilation for multiprocessors. Concurrency: Practice and Experience, 3(6):573-592, 1991.

[24] H. Simon. Partitioning of unstructured mesh problems for parallel processing. In Proceedings of the Conference on Parallel Methods on Large Scale Structural Analysis and Physics Applications. Pergamon Press, 1991.

[25] R. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5):457-482, February 1991.

[26] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation methods for multicomputers. In Proceedings of the 1991 International Conference on Parallel Processing, volume 2, pages 26-30, 1991.

[27] D. Loveman (editor). Draft High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, January 1993.

[28] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. Vienna Fortran - a language specification, version 1.1. Interim Report 21, ICASE, NASA Langley Research Center, Hampton, VA, 1992.