
Southern Federal University
Supercomputers and Neurocomputers Research Center
Taganrog, Russia

Scalable Parallel Description of Information Graphs with Associative Operations by Means of Set@l Programming Language

Levin I., Dordopulo A., Pisarenko I., Mikhailov D., Melnikov A.

September 27-28, 2021, Moscow, Russia


Problems of Parallel Programming
State-of-the-art supercomputers use various computational architectures.

Relevant research direction: the development of hybrid (heterogeneous) computer systems which combine universal processors with other types of computing devices (GPUs, special coprocessors, FPGAs).

Software for high-performance computing: parallel calculations, pipeline calculations, procedural calculations.

Challenges:
• difficulty of development and further support;
• increased development time and costs;
• complexity of porting.

Currently available programming methods and tools have serious architectural limitations.
As a result, the porting of parallel programs between different architectures requires the development of new code in another programming language.
Problems of Parallel Programming
Architectural orientation of methods and tools for parallel programming:
• OpenMP: multiprocessor computer systems with shared memory;
• MPI: multiprocessor computer systems with distributed memory;
• OpenACC, CUDA C, C++ AMP, ATI Stream: CPU+GPU;
• COLAMO: reconfigurable computer systems with an FPGA field;
• OpenCL: CPU+GPU, CPU+FPGA.

Porting of parallel applications:
• the mathematical sense of the algorithm remains unchanged;
• features of the implementation (parallelization, memory, synchronization, etc.) are modified according to the specific computer system.

Architectural limitations of available programming languages are caused by the inseparable description of the algorithm and the details of its implementation.
Set@l Language of Architecture-Independent Programming
Set@l (Set Aspect-Oriented Language) is a next-generation language of architecture-
independent parallel programming for high-performance computer systems.

Set@l combines:
• a set-theoretical code view, relational calculus and the partition of collections (following SETL (Jacob T. Schwartz, 1960s), SETL2, SETLX);
• source code that describes the information graph of a computational problem (following COLAMO, Common Language for Architectures of Multi Objects);
• the paradigm of aspect-oriented programming, AOP (Gregor Kiczales, 1997).
Features of Set@l Programs
Program in Set@l includes the following modules:
• architecture-independent source code (program) describing the algorithm;
• system of aspects (aspect) that adapts the algorithm to the architecture and
configuration of the specific computer system.
Sets are the essential objects of the Set@l programming language.
Set declaration: set(<name of set>); set(P); set(A,B,C);
Elements of sets:
• enumeration of all elements: A=set(1,2,4,8,16,32);
• range specification:
<name of set>=<type>(<1st el.>,<2nd el.>...<the last el.>);
A=set(1,3...q); A=set(1...q);
• relational expressions:
<name of set>=<type>( <variable> | <predicate> );
A=set(p | p in N and mod(p,2)=1);
B=set(p | p in N and mod(p,2)=0);
C=set(p | p in N and p>=a and p<=b);
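
As an informal illustration (Python, not Set@l), the three ways of defining collections shown above can be mirrored as follows; the bound q, the interval [a, b] and the finite stand-in for N are assumed example values.

# Python analogue of the Set@l set definitions above (illustrative only).
q, a, b = 32, 10, 20                      # assumed example bounds
N = range(1, 101)                         # finite stand-in for the set N

A_enum = {1, 2, 4, 8, 16, 32}             # enumeration: A=set(1,2,4,8,16,32);
A_odd_range = set(range(1, q + 1, 2))     # range: A=set(1,3...q);
A_range = set(range(1, q + 1))            # range: A=set(1...q);
A = {p for p in N if p % 2 == 1}          # A=set(p | p in N and mod(p,2)=1);
B = {p for p in N if p % 2 == 0}          # B=set(p | p in N and mod(p,2)=0);
C = {p for p in N if a <= p <= b}         # C=set(p | p in N and p>=a and p<=b);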
Set Classification in Set@l

Sets are classified:
• by the parallelism of their elements: implicit description of algorithm parallelization;
• by the definiteness of elements: modification of algorithms during architectural adaptation;
• by any user-defined criteria, declared as:
attribute <name>(<set/element>): <attribute description>; end(<name>);

Types of collections, their processing parallelism and description formats:
• Set: parallel-independent processing; par(1…p);
• Tuple: procedural (sequential) processing; seq(1…p);
• Pipeline tuple: pipeline processing; pipe(1…p);
• Set of processing by iterations: parallel-dependent processing (symbolic notation: {1,2,…,p} marked with an arrow); conc(1…p);
• Implicit collection: the type is defined in another aspect (symbolic notation: [[1,2,…,p]]); imp(1…p).

Examples of typing the same collection of subgraphs:
G = conc(G1, G2, G3, G4, G5, G6);
G = conc(conc(G1, G2, G3), conc(G4, G5, G6));
G = seq(G1, G2, G3, G4, G5, G6).
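
Purely as a conceptual model (not Set@l syntax), a collection together with its imposed parallelism type can be sketched in Python as a tagged pair; the three typings above are read here as one parallel-dependent set, two nested parallel-dependent subsets, and a sequential tuple.

from dataclasses import dataclass
from typing import Literal, Tuple, Union

ParType = Literal["par", "seq", "pipe", "conc", "imp"]
Element = Union[str, "Typed"]

@dataclass(frozen=True)
class Typed:
    kind: ParType                  # parallelism type imposed on the elements
    elems: Tuple[Element, ...]     # elements or nested typed collections

G_flat   = Typed("conc", ("G1", "G2", "G3", "G4", "G5", "G6"))
G_nested = Typed("conc", (Typed("conc", ("G1", "G2", "G3")),
                          Typed("conc", ("G4", "G5", "G6"))))
G_seq    = Typed("seq",  ("G1", "G2", "G3", "G4", "G5", "G6"))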

Basic Topologies of Graphs with Associative Operations
Sequential («head/tail», H/T): a chain of operational vertices f over the inputs a1, …, a8 with intermediate results s1, …, s6 and final result Res; latency τ = n − 1.

Parallel («half-splitting», DIV2): a binary pyramid of operational vertices f over the inputs a1, …, a8; latency τ = log2 n.

If the hardware resource R is enough to implement all operational vertices: «half-splitting».

If R is not sufficient for the implementation of the full graph: scaling of calculations using performance reduction methods.

«Half-splitting»: irregular interconnections between iterations and dependence of the decomposition on the problem dimension → cumbersome and inconvenient description.

«Head/tail»:
• regular interconnection structure suitable for efficient reduction;
• high specific performance only at minimal R.
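
A minimal Python sketch of the two limit topologies (addition is used as an assumed example of the associative operation f): the «head/tail» chain performs n − 1 information-dependent steps, while the «half-splitting» pyramid needs only ceil(log2 n) levels of independent operations.

from math import ceil, log2

def head_tail(a, f):
    # sequential «head/tail» chain: n - 1 dependent applications of f
    res = a[0]
    for x in a[1:]:
        res = f(res, x)
    return res

def half_splitting(a, f):
    # parallel «half-splitting» pyramid: ceil(log2 n) levels of pairwise operations
    level = list(a)
    while len(level) > 1:
        nxt = [f(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])      # odd element is carried to the next level
        level = nxt
    return level[0]

data = [1, 2, 3, 4, 5, 6, 7, 8]
add = lambda x, y: x + y
assert head_tail(data, add) == half_splitting(data, add) == sum(data)
print("steps:", len(data) - 1, "vs levels:", ceil(log2(len(data))))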
Combined Topologies of Graphs with Associative Operations

Solution: for intermediate configurations of the hardware resource, it is reasonable to use a combined topology that contains both sequential and parallel fragments of calculations.

Disadvantages of combined topologies:
• do not have a regular and isomorphic structure;
• hamper automatic scaling and performance reduction (in terms of classical programming languages);
• are rarely used in practice.

Aspect-oriented program in Set@l:
• limit cases: «half-splitting» (DIV2) and «head/tail» (H/T);
• general case: a set of combined topologies plus rules for the transition between them, which provides performance reduction for different configurations of supercomputers.
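
A hedged Python sketch of the combined principle (the block size and the operation are assumed example values): each block is reduced by a «half-splitting» pyramid, and the partial results are accumulated by a sequential «head/tail» chain.

from functools import reduce

def tree_reduce(block, f):
    # «half-splitting» (DIV2) pyramid over one block
    while len(block) > 1:
        block = [f(block[i], block[i + 1]) for i in range(0, len(block) - 1, 2)] + \
                ([block[-1]] if len(block) % 2 else [])
    return block[0]

def combined_reduce(a, f, block_size):
    # DIV2 pyramids over blocks, then a sequential «head/tail» chain over partial results
    partial = [tree_reduce(a[i:i + block_size], f) for i in range(0, len(a), block_size)]
    return reduce(f, partial)

data = list(range(1, 17))
assert combined_reduce(data, lambda x, y: x + y, 4) == sum(data)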
Resource-Independent Program in Set@l. «Head/tail» and «half-splitting»

«Head/tail» attribute (sequential chain over a1, …, a8 with intermediate results s1, …, s6):

attribute [F(A,Res)|F=(Rec(f),H/T)]:
operand(set(A),element(Res));
element(s);
F(A,Res)=break[card(Tail(A))=1:
f(Head(Tail(A)),Head(A),Res)],
main[conc(F(Tail(A),s),f(s,Head(A),Res))];
end(F);

«Half-splitting» attribute (binary pyramid over a1, …, a8):

attribute [F(A,Res)|F=(Rec(f),DIV2)]:
operand(set(A),element(Res));
d2(A,A1,A2);
element(s1,s2);
F(A,Res)=break[card(A1)=1 and card(A2)=1:
f(Head(A1),Head(A2),Res)],
main[F(A1,s1),F(A2,s2),f(s1,s2,Res)];
end(F);
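
For readers without a Set@l background, the recursion of the two attributes can be mirrored in Python (a sketch only: Head, Tail and d2 are modelled with list slicing, f returns its result instead of writing to Res, and the input length is assumed to be a power of two).

def F_head_tail(A, f):
    # break: card(Tail(A)) = 1  ->  f(Head(Tail(A)), Head(A), Res)
    if len(A) == 2:
        return f(A[1], A[0])
    # main: conc(F(Tail(A), s), f(s, Head(A), Res))
    s = F_head_tail(A[1:], f)
    return f(s, A[0])

def F_div2(A, f):
    A1, A2 = A[:len(A) // 2], A[len(A) // 2:]      # d2(A, A1, A2)
    # break: card(A1) = 1 and card(A2) = 1  ->  f(Head(A1), Head(A2), Res)
    if len(A1) == 1 and len(A2) == 1:
        return f(A1[0], A2[0])
    # main: F(A1, s1), F(A2, s2), f(s1, s2, Res)
    return f(F_div2(A1, f), F_div2(A2, f))

data = [1, 2, 3, 4, 5, 6, 7, 8]
add = lambda x, y: x + y
assert F_head_tail(data, add) == F_div2(data, add) == sum(data)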
Resource-Independent Program in Set@l. Combined topologies

[Figure: combined topology for 16 input elements: A = { {a1 … a4}, {a5 … a8}, {a9 … a12}, {a13 … a16} }; each four-element subset is reduced by a «half-splitting» (DIV2) pyramid, and the partial results are accumulated by a sequential «head/tail» (H/T) chain producing Res.]
A = H/T [subA1, subA2, … , subAr]; subAp = DIV2{ab , ab+1, … , ac},

Q=floor(R/R0); // number of hardware-implemented vertices;


r=ceil(n/Q); // reduction coefficient;
graph_modification(Array,A,r); // decomposition and typing of set A;

G = { {subG1, subG2, … , subGr}, v1, v2, … , vr–1 }

Gr = < subG1, {subG2, v1}, {subG3, v2}, … , {subGr, vr–1} >
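
The decomposition step can be sketched in Python under the stated assumptions (R, R0 and n are example inputs; the helper is illustrative and is not the graph_modification routine itself, and each subset subAp is read as holding at most Q elements).

from math import ceil, floor

def decompose(n, R, R0):
    Q = floor(R / R0)            # number of hardware-implemented vertices
    r = ceil(n / Q)              # reduction coefficient: number of DIV2 subsets
    # A = H/T[subA1, ..., subAr], subAp = DIV2{...}: r blocks of at most Q elements
    blocks = [list(range(p * Q, min((p + 1) * Q, n))) for p in range(r)]
    return Q, r, blocks

Q, r, blocks = decompose(n=16, R=40, R0=10)       # assumed example values
print(Q, r, blocks)                               # 4 vertices, 4 blocks of 4 indices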
Modification of Combined Topology. Motivation

Previously, we considered the idealized case of operational vertices with a unit latency.

In practice, the latency of a vertex performing an associative operation (e.g. the addition or
multiplication of fixed-point numbers) typically exceeds one cycle.

Therefore, the proposed transformation technique does not yield an efficient pipeline implementation of calculations: if the delay of the feedback circuit is increased to the latency of an operational vertex, the addition of all partial sums appearing at the output of the pyramid structure is not ensured.

To take into account the non-unit latency of operational vertices and to form the correct sequence of partial-sum additions, we propose to modify the topology of the information graph with associative operations.

By combining sequential and parallel fragments of calculations, it is possible to synthesize a topology in accordance with the available amount of hardware resource and the latency of the operational vertex, and to reduce the total problem solution time by providing a dense data flow at the input of the computing structure.
Modification of Combined Topology. Full information graph
[Figure: full information graph: subgraphs Gr1, Gr2, …, Grl, each composed of pyramid subgraphs psG over the input elements a1, …, amk and a sequential unit hsG, connected through the pyramid subgraph pG.]

• The resulting graph contains l isomorphic subgraphs Gr1, Gr2, ... , Grl connected by means of the
“half-splitting” principle through pyramid subgraph pG.

• Each subgraph Gri is formed using the combined principle “half-splitting + head/tail”: it includes m
isomorphic and pyramid subgraphs psG(i-1)·m+1, psG(i-1)·m+2, ... , psGi·m and one sequential unit hsGi
that calculates intermediate results and incorporates (m–1) operational vertices.

• In turn, each pyramid subgraph psGj processes k elements of the original data array.

• If l = 1, the subgraph pG does not contain operational vertices, and the topology corresponds to the previously described limit case.
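
The decomposition described above can be checked with a short Python sketch: for assumed l, m and k it groups N = l·m·k input elements and counts operational vertices, confirming that the structure still performs exactly N − 1 applications of f.

def build_structure(l, m, k):
    N = l * m * k
    data = list(range(N))
    # l subgraphs Gr_i, each holding m pyramid blocks psG of k elements
    Gr = [[data[(i * m + j) * k:(i * m + j + 1) * k] for j in range(m)]
          for i in range(l)]
    # each psG has k-1 vertices, each hsG has m-1 vertices, pG has l-1 vertices
    vertices = l * m * (k - 1) + l * (m - 1) + (l - 1)
    assert vertices == N - 1                  # a complete reduction of N elements
    return Gr, vertices

Gr, vertices = build_structure(l=4, m=3, k=8)     # assumed example parameters
print(len(Gr), len(Gr[0]), len(Gr[0][0]), vertices)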
Modification of Combined Topology. Computing structure
[Figure: computing structure after the first stage: blocks mGr1, mGr2, …, mGrl, each containing a subcadr Ci fed with tuples <a1, ak+1, …>, <a2, ak+2, …>, … and a feedback vertex vi, connected through pG.]
• At the first stage, the information-independent subgraphs psG(i-1)·m+1, psG(i-1)·m+2, ... , psGi·m in each subgraph Gri are transformed into a subcadr Ci.

• In the cadr, tuples of data elements are supplied to the inputs of Ci.

• The block hsGi of information-dependent operational vertices forms an additional vertex vi with a feedback circuit delay of l cycles.

Modification of Combined Topology. Computing structure
[Figure: computing structure after the second stage: a single fragment MG fed with nested tuples <<a1, ak+1, …>, <amk+1, am(k+1)+1, …>, …>, the feedback vertex v, and the advanced pG* structure with delays from 1 to l–1 replacing pG.]

• At the second stage of the transformation, the pG fragment is replaced with the advanced pG* structure, which performs an associative operation on l elements in l cycles: the correct result of processing l operands appears at the output during the last, l-th cycle.

• For this purpose, every data element is delayed for the corresponding number of cycles from 0 to (l–1).

• At the same time, the blocks mGr1, mGr2, ..., mGrl are replaced with the single fragment MG, to the inputs of which flows of tuples of input data with a duty cycle of 1 are supplied. In this case, the data streams at each input are represented as nested tuples of elements of the original set.
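
The effect of a feedback delay equal to the vertex latency can be modelled cycle by cycle in Python (a conceptual sketch, not a hardware description): a vertex with latency l fed by a dense stream accumulates l interleaved partial results y1, …, yl, which still have to be combined.

def feedback_accumulate(stream, f, l, identity=0):
    # one operational vertex with latency l and a feedback of l cycles:
    # operand number t is combined with the partial result produced l cycles earlier,
    # so the dense stream splits into l interleaved accumulation slots
    slots = [identity] * l
    for t, x in enumerate(stream):
        slots[t % l] = f(slots[t % l], x)
    return slots                               # y1 ... yl, still to be reduced

stream = list(range(1, 17))                    # assumed example input
partials = feedback_accumulate(stream, lambda a, b: a + b, l=4)
assert sum(partials) == sum(stream)            # nothing is lost, but l results remain
print(partials)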

Modification of Combined Topology. Computing structure
[Figure: the serially connected W fragments and the pyramid P are replaced by a structure with a multiplexer (MX), a demultiplexer (DMX), a single W fragment and delay elements.]
• Fragments W calculate intermediate results sequentially. Therefore, according to the “embedded pipeline” principle, it is reasonable to replace them with the structure shown on the right.

• The serial connection of the multiplexer (MX) and demultiplexer (DMX) with the delays can be reduced, leaving a single W fragment.

• After the analogous transformation of each iteration in the pyramid P, we obtain the accumulating pipeline structure with operational vertices and delays.
Modification of Combined Topology. Computing structure

[Figure: accumulating pipeline with input delays 1, 2, 4, …, l/2.]

• At the output of the block v with the feedback, we obtain l intermediate results y1 = f (x1, xl+1), y2 = f (x2, xl+2), ... , yl = f (xl, x2l), and there is a need to perform additional operations f on these data elements.

• In order to calculate f (y1, y2), f (y3, y4), … , f (yl–1, yl), the first vertex of the accumulating pipeline structure is supplied with the same stream of operands delayed for one cycle. In the 2nd, 4th, 6th, etc. cycles, at the output of the first vertex, we get the result of executing operation f on data elements y1 and y2, y3 and y4, y5 and y6, etc. At this stage, the size of the intermediate data sequence is l/2.

• Therefore, for the second vertex, data operands are delayed by two cycles, and its output gives
l/4 results f (y1, y2, y3, y4), f (y5, y6, y7, y8) etc. Similarly, in the third and fourth vertices, the delays
are 4 and 8 cycles, respectively.

• At the output of the last vertex with the number log2(l)+1, after passing the entire array of input data, we obtain the result of performing the associative operation f on two operands, which is equal to the result of processing all elements of the initial data array.
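
The chain of vertices with input delays 1, 2, 4, …, l/2 effectively performs a binary-tree reduction of the l interleaved partial results; a compact Python model of this collapse (assuming, as required, that l is a power of two):

def collapse(partials, f):
    # each loop iteration models one vertex of the accumulating pipeline:
    # the vertex whose input is delayed by 2**stage cycles halves the number of results
    while len(partials) > 1:
        partials = [f(partials[i], partials[i + 1]) for i in range(0, len(partials), 2)]
    return partials[0]

ys = [10, 20, 30, 40, 50, 60, 70, 80]          # l = 8 partial results (assumed example)
assert collapse(ys, lambda a, b: a + b) == sum(ys)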

Modification of Combined Topology. Resulting computing structure
[Figure: resulting computing structure: the feedback vertex with delay l followed by log2 l accumulating vertices with input delays 1, 2, 4, …, l/2.]

l = 2^ceil(log2 L), where L is the latency of the operational vertex;

k = floor(R/R0) − log2 l, where R is the amount of available hardware resource and R0 is the hardware resource required for the implementation of a single operational vertex;

m = ceil( N / (l · (floor(R/R0) − log2 l)) ), where N is the dimension of the processed data array (N >> l).

It is possible to organize the described computing structure only if the value of l is equal to an integer power of two. Otherwise, the data flow is extended with neutral elements, and the feedback circuit is supplemented with additional delay elements up to l.
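
The configuration parameters follow directly from the formulas above; a small Python sketch with assumed input values:

from math import ceil, floor, log2

def configuration(L, R, R0, N):
    l = 2 ** ceil(log2(L))                      # feedback delay, rounded up to a power of two
    k = floor(R / R0) - int(log2(l))            # vertices available for the pyramid part
    m = ceil(N / (l * (floor(R / R0) - int(log2(l)))))
    return l, k, m

print(configuration(L=5, R=640, R0=10, N=10**6))   # -> (8, 61, 2050)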
Modified Resource-Independent Program in Set@l
General description of the topology of the information graph with associative operations in
the source code of the resource-independent program in Set@l:

After the translation of the source code, the following sets with imposed types of
parallelism are formed:

G = { {Gr1, Gr2, … , Grl}, pG }

Gri = { {psG(i–1)·m+1, psG(i–1)·m+2, … , psGi·m}, hsGi }

hsGi = { vi,1, vi,2, … , vi,m–1 }
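
For concreteness, the nested collections formed after translation can be sketched in Python for given l and m (label strings stand in for the typed Set@l sets; this is an illustration, not generated code).

def translated_sets(l, m):
    # hsG_i = { v_i,1, ..., v_i,m-1 }
    hsG = {i: [f"v{i},{j}" for j in range(1, m)] for i in range(1, l + 1)}
    # Gr_i = { {psG_(i-1)m+1, ..., psG_im}, hsG_i }
    Gr = {i: ([f"psG{(i - 1) * m + j}" for j in range(1, m + 1)], hsG[i])
          for i in range(1, l + 1)}
    # G = { {Gr_1, ..., Gr_l}, pG }
    G = ([Gr[i] for i in range(1, l + 1)], "pG")
    return G, Gr, hsG

G, Gr, hsG = translated_sets(l=2, m=3)             # assumed small example
print(Gr[1])    # (['psG1', 'psG2', 'psG3'], ['v1,1', 'v1,2'])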

Conclusions
• In this paper, we propose a method that rearranges the vertices of the information graph with associative operations and performs further optimization of the computing structure in order to reduce the problem solution time by a factor corresponding to the latency of the operational vertex.

• The designed general graph topology combines sequential and parallel fragments of calculations and provides the formation of a dense data flow for the available hardware resource. The developed method extends the technique considered in our previous paper to the cases in which the latency of the associative vertex exceeds one cycle.

• The architecture-independent Set@l programming language makes it possible to describe the transformations in a compact resource-independent form. In comparison to traditional parallel programming languages, in which a change of the information graph topology requires the modification of the program source code, Set@l specifies many implementation variants in one program. The synthesis of a particular computing structure is performed automatically according to the configuration parameters specified by the user (the amount of available computational resource and the latency of the basic operation).
Acknowledgments
The reported study was funded by the Russian Foundation for Basic Research, project
number 20-07-00545.

Thank you for your attention!
