Pisarenko RSCD 2021
Challenges:
• difficult development and maintenance;
• increased development time and cost;
• complex porting.
Currently available parallel programming methods and tools have serious architectural limitations.
As a result, porting a parallel program to a different architecture requires
writing new code in another programming language.
Problems of Parallel Programming
Architectural orientation of methods and tools for parallel programming:
Set@l
Set@l draws on:
• SETL (Jacob T. Schwartz, 1960s), SETL2, SETLX;
• COLAMO (Common Language for Architectures of Multi Objects);
• aspect-oriented programming (Gregor Kiczales, 1997).
Features of Set@l Programs
A program in Set@l includes the following modules:
• architecture-independent source code (program) that describes the algorithm;
• a system of aspects (aspect) that adapts the algorithm to the architecture and
configuration of a specific computer system.
Sets are the essential objects of the Set@l programming language.
Set declaration: set(<name of set>); set(P); set(A,B,C);
Elements of sets:
• enumeration of all elements: A=set(1,2,4,8,16,32);
• range specification:
<name of set>=<type>(<1st el.>,<2nd el.>...<the last el.>);
A=set(1,3...q); A=set(1...q);
• relational expressions:
<name of set>=<type>( <variable> | <predicate> );
A=set(p | p in N and mod(p,2)=1);
B=set(p | p in N and mod(p,2)=0);
C=set(p | p in N and p>=a and p<=b);
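The three ways of defining a set can be mirrored in Python as a rough analogue. This is only a sketch: the values of `q`, `a`, `b` and the bound `n_max` are illustrative assumptions, N is assumed to start at 1, and a Python `set` is unordered and finite, which may differ from Set@l semantics.

```python
# Python analogues of the three Set@l ways to define a set.
# q, a, b and n_max are illustrative values, not taken from the slides.
q, a, b, n_max = 9, 3, 7, 20

# Enumeration of all elements: A=set(1,2,4,8,16,32);
A_enum = {1, 2, 4, 8, 16, 32}

# Range specification: A=set(1,3...q); (first, second and last element)
first, second = 1, 3
A_range = set(range(first, q + 1, second - first))

# Relational expressions over the natural numbers, truncated to 1..n_max
# because a Python set must be finite (assumption: N starts at 1).
N = range(1, n_max + 1)
A_odd = {p for p in N if p % 2 == 1}
B_even = {p for p in N if p % 2 == 0}
C_between = {p for p in N if a <= p <= b}

print(sorted(A_range))    # [1, 3, 5, 7, 9]
print(sorted(C_between))  # [3, 4, 5, 6, 7]
```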
Set Classification in Set@l
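[Formula residue: example classifications of the set G with elements G1, G2, G3, G4, G5, G6, written as a single collection and as two subsets {G1, G2, G3} and {G4, G5, G6}; the bracket and arrow marks that distinguish the set types are not recoverable.]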
Basic Topologies of Graphs with Associative Operations
[Figure: two information graphs over operands a1 ... a8 with operational vertices f, intermediate results s1 ... s6 and output Res.]
Sequential («head/tail», H/T): τ = n − 1
Parallel («half-splitting», DIV2): τ = log2 n
When the hardware resource R is insufficient to implement the full graph, calculations are
scaled using performance reduction methods.
«Head/tail»:
• regular interconnection structure suitable for efficient reduction;
• high specific performance only at minimal R.
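To make the two values of τ concrete, the sketch below tabulates how many operational vertices fire at each step of both topologies. The function names are hypothetical, and n is assumed to be a power of two for the «half-splitting» case.

```python
def head_tail_levels(n):
    """Vertices active at each step of the sequential chain: one per step."""
    return [1] * (n - 1)


def half_splitting_levels(n):
    """Vertices active at each level of the balanced tree (n a power of two)."""
    levels = []
    width = n // 2
    while width >= 1:
        levels.append(width)
        width //= 2
    return levels


n = 8
ht = head_tail_levels(n)
d2 = half_splitting_levels(n)
print(len(ht), sum(ht))  # 7 7  (tau = n - 1 steps, n - 1 vertices)
print(len(d2), sum(d2))  # 3 7  (tau = log2 n steps, n - 1 vertices)
```

Both graphs contain the same n − 1 vertices; they differ only in how those vertices are arranged in time.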
Combined Topologies of Graphs with Associative Operations
Solution:
for intermediate configurations of the hardware resource, it is reasonable to use a combined
topology that contains both sequential and parallel fragments of calculations.
General case:
a set of combined topologies, together with rules for the transition between them, provides
performance reduction for different configurations of supercomputers.
Resource-Independent Program in Set@l. «Head/tail» and «half-splitting»
«Head/tail» (H/T):

    attribute [F(A,Res)|F=(Rec(f),H/T)]:
    operand(set(A),element(Res));
    element(s);
    F(A,Res)=break[card(Tail(A))=1:
    f(Head(Tail(A)),Head(A),Res)],
    main[conc(F(Tail(A),s),f(s,Head(A),Res))];
    end(F);

«Half-splitting» (DIV2):

    attribute [F(A,Res)|F=(Rec(f),DIV2)]:
    operand(set(A),element(Res));
    d2(A,A1,A2);
    element(s1,s2);
    F(A,Res)=break[card(A1)=1 and card(A2)=1:
    f(Head(A1),Head(A2),Res)],
    main[F(A1,s1),F(A2,s2),f(s1,s2,Res)];
    end(F);
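The two recursive attributes can be read as ordinary recursive functions. The Python sketch below mirrors their break and main branches. It assumes that `d2` splits A into its first and second halves and that the DIV2 input length is a power of two; the function names are hypothetical.

```python
from operator import add


def reduce_head_tail(a, f):
    """Mirror of the H/T attribute: Res = f(F(Tail(A)), Head(A))."""
    head, tail = a[0], a[1:]
    if len(tail) == 1:                 # break[card(Tail(A))=1: ...]
        return f(tail[0], head)        # f(Head(Tail(A)), Head(A), Res)
    s = reduce_head_tail(tail, f)      # F(Tail(A), s)
    return f(s, head)                  # f(s, Head(A), Res)


def reduce_div2(a, f):
    """Mirror of the DIV2 attribute; len(a) is assumed to be a power of two >= 2."""
    mid = len(a) // 2
    a1, a2 = a[:mid], a[mid:]          # d2(A, A1, A2), assumed to halve A
    if len(a1) == 1 and len(a2) == 1:  # break[card(A1)=1 and card(A2)=1: ...]
        return f(a1[0], a2[0])         # f(Head(A1), Head(A2), Res)
    s1 = reduce_div2(a1, f)            # F(A1, s1)
    s2 = reduce_div2(a2, f)            # F(A2, s2)
    return f(s1, s2)                   # f(s1, s2, Res)


data = list(range(1, 9))               # a1 ... a8
print(reduce_head_tail(data, add))     # 36
print(reduce_div2(data, add))          # 36
```

Both functions return the same value for an associative and commutative f such as addition; only the shape of the call tree differs.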
Resource-Independent Program in Set@l. Combined topologies
[Figure: combined topology for 16 operands a1 ... a16. The operands are split into four groups, each processed by a DIV2 subgraph, and the output is Res.]
A = a1 ... a4, a5 ... a8, a9 ... a12, a13 ... a16
G = {{subG1, subG2, … , subGr}, v1, v2, … , vr–1}
Gr = <subG1, {subG2, v1}, {subG3, v2}, … , {subGr, vr–1}>
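One possible reading of the combined topology is sketched below in Python: each group subG1 ... subGr is reduced by «half-splitting», and the group results are chained through r − 1 sequential vertices v1 ... vr–1. The function names, the equal group sizes and the depth bookkeeping are assumptions made for illustration.

```python
from operator import add


def tree_reduce(a, f):
    """Half-splitting (DIV2) reduction of one group; returns (result, depth)."""
    if len(a) == 1:
        return a[0], 0
    mid = len(a) // 2
    left, d_left = tree_reduce(a[:mid], f)
    right, d_right = tree_reduce(a[mid:], f)
    return f(left, right), max(d_left, d_right) + 1


def combined_reduce(a, r, f):
    """Split a into r equal groups, reduce each by DIV2, then chain the
    group results through r - 1 sequential vertices (head/tail)."""
    size = len(a) // r                 # assumes len(a) is divisible by r
    groups = [a[i * size:(i + 1) * size] for i in range(r)]
    partial = [tree_reduce(g, f) for g in groups]
    result, depth = partial[0]
    for value, d in partial[1:]:
        result = f(result, value)
        depth = max(depth, d) + 1
    return result, depth


data = list(range(1, 17))              # a1 ... a16
print(combined_reduce(data, 4, add))   # (136, 5)
print(combined_reduce(data, 1, add))   # (136, 4)   pure half-splitting
print(combined_reduce(data, 16, add))  # (136, 15)  pure head/tail
```

Varying r moves the structure between the two limit cases, which is the kind of transition the combined topologies are meant to cover.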
Modification of Combined Topology. Motivation
Previously, we considered the idealized case of operational vertices with unit latency.
In practice, the latency of a vertex performing an associative operation (e.g. the addition or
multiplication of fixed-point numbers) typically exceeds one cycle.
In this case, the proposed transformation technique does not yield an efficient
pipeline implementation of calculations: if the delay of the feedback circuit is increased to the
latency of an operational vertex, the summation of all partial sums that appear at the output of
the pyramid structure is no longer ensured.
To take into account the non-unit latency of operational vertices and to form the correct
order of adding the partial sums, we propose to modify the topology of the information
graph with associative operations.
• The resulting graph contains l isomorphic subgraphs Gr1, Gr2, ... , Grl connected according to the
“half-splitting” principle through the pyramid subgraph pG.
• Each subgraph Gri is formed using the combined principle “half-splitting + head/tail”: it includes m
isomorphic pyramid subgraphs psG(i-1)·m+1, psG(i-1)·m+2, ... , psGi·m and one sequential unit hsGi
that calculates intermediate results and incorporates (m–1) operational vertices.
• In turn, each pyramid subgraph psGj processes k elements of the original data array.
• If l = 1, the subgraph pG contains no operational vertices and the topology corresponds to the
previously described limit case.
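The three-level structure described above can be sketched functionally in Python. The assignment of contiguous blocks of the array to the subgraphs is an assumption (the slides do not fix the order in which elements reach each pyramid), l and k are assumed to be powers of two, and all names are hypothetical.

```python
from functools import reduce
from operator import add


def pyramid(values, f):
    """Half-splitting reduction; len(values) is assumed to be a power of two."""
    while len(values) > 1:
        values = [f(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return values[0]


def modified_topology(a, l, m, k, f):
    """l subgraphs Gr, each with m pyramids psG of k elements and one
    sequential unit hsG; the l results are merged by the pyramid pG."""
    assert len(a) == l * m * k
    gr_results = []
    for i in range(l):
        block = a[i * m * k:(i + 1) * m * k]
        ps = [pyramid(block[j * k:(j + 1) * k], f) for j in range(m)]
        gr_results.append(reduce(f, ps))   # hsG: m - 1 operational vertices
    return pyramid(gr_results, f)          # pG: no vertices when l = 1


def vertex_count(l, m, k):
    """Operational vertices in the pyramids psG, the units hsG and pG."""
    return l * m * (k - 1) + l * (m - 1) + (l - 1)


data = list(range(1, 17))
print(modified_topology(data, 2, 2, 4, add))  # 136
print(vertex_count(2, 2, 4))                  # 15
```

The vertex count always equals l·m·k − 1, as expected for a reduction over l·m·k operands.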
Modification of Combined Topology. Computing structure
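[Figure: computing structure with blocks mGr1, mGr2, ... , mGrl connected to pG; the tuples <a1, ak+1, ...>, <a2, ak+2, ...> are shown at the inputs of C1, together with vertex v1.]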
• In the cadr, tuples of data elements are supplied to the inputs of Ci.
Modification of Combined Topology. Computing structure
[Figure: block MG with input streams <<a1, ak+1, ...>, <amk+1, am(k+1)+1, ...>, ...> and <<a2, ak+2, ...>, <amk+2, am(k+1)+2, ...>, ...>, vertex v, and the structure pG* that replaces pG.]
• At the second stage of the transformation, the pG fragment is replaced with the advanced pG*
structure, which performs the associative operation on l elements in l cycles: the correct result of
processing l operands appears at the output during the last, l-th cycle.
• For this purpose, every data element is delayed by the corresponding number of cycles, from 0
to (l–1).
• At the same time, the blocks mGr1, mGr2,..., mGrl are replaced with a single fragment MG, whose
inputs receive streams of input data tuples with a duty cycle of 1. In this case, the
data streams at each input are represented as nested tuples of elements of the original set.
Modification of Combined Topology. Computing structure
[Figure: pyramid P built from fragments W, with a multiplexer (MX), a demultiplexer (DMX) and delay elements of up to l-1 cycles.]
• The serial connection of the multiplexer (MX) and the demultiplexer (DMX) with the delays
can be reduced to a single W fragment.
• After the analogous transformation of each iteration in the pyramid P, we obtain an
accumulating pipeline structure with operational vertices and delays.
Modification of Combined Topology. Computing structure
[Figure: accumulating pipeline structure with delays of 1, 2, 4, ... , l/2 cycles.]
• At the output of the block v with the feedback, we obtain l intermediate results y1 = f (x1, xl+1),
y2 = f (x2, xl+2), ... , yl = f (xl, x2l), and additional operations f must be performed on these
data elements.
• In order to calculate f (y1, y2), f (y2, y3), … , f (yl–1, yl), the first vertex of the accumulating pipeline
structure is supplied with the same stream of operands delayed by one cycle. In the 2nd, 4th, 6th etc.
cycles, the output of the first vertex gives the result of executing operation f on data
elements y1 and y2, y3 and y4, y5 and y6 etc. At this stage, the size of the intermediate data
sequence is l/2.
• Therefore, for the second vertex, data operands are delayed by two cycles, and its output gives
l/4 results f (y1, y2, y3, y4), f (y5, y6, y7, y8) etc. Similarly, in the third and fourth vertices, the delays
are 4 and 8 cycles, respectively.
• At the output of the last vertex, with the number log2(l)+1, after the entire array of input
data has passed, we obtain the result of performing the associative operation f on two operands, which is equal to
the result of processing all elements of the initial data array.
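The delay-doubling scheme described in these bullets can be checked with a small stream simulation in Python. This is a simplified sketch: each vertex is modelled as having zero latency, stream positions stand in for cycles, and the function names are hypothetical. Because the feedback block interleaves operands, the check uses addition, which is commutative as well as associative.

```python
from operator import add


def feedback_accumulate(x, l, f):
    """Block v with a feedback delay of l cycles: element i is combined with
    the partial result produced l cycles earlier, giving l results y1..yl."""
    y = list(x[:l])
    for i in range(l, len(x)):
        y[i % l] = f(y[i % l], x[i])
    return y


def delay_stage(stream, d, f):
    """One pipeline vertex: combine the stream with itself delayed by d cycles.
    Positions without a valid delayed operand hold None."""
    out = []
    for i, cur in enumerate(stream):
        prev = stream[i - d] if i >= d else None
        out.append(None if prev is None or cur is None else f(prev, cur))
    return out


def accumulating_pipeline(x, l, f):
    """Apply delays 1, 2, 4, ..., l/2 (l a power of two); the last output
    position holds the result for the whole input array."""
    stream = feedback_accumulate(x, l, f)
    d = 1
    while d < l:
        stream = delay_stage(stream, d, f)
        d *= 2
    return stream[-1]


x = list(range(1, 9))                   # x1 ... x8 with l = 4
print(feedback_accumulate(x, 4, add))   # [6, 8, 10, 12]
print(accumulating_pipeline(x, 4, add)) # 36
```

After the stage with delay 1, useful pair results sit at every second position; after the stage with delay 2, at every fourth, and so on, which matches the halving of the intermediate sequence described above.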
Modification of Combined Topology. Resulting computing structure
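[Figure: resulting computing structure; labels: l, log2 l, delays 1, 2, 4, ... , l/2.]
$$l = 2^{\lceil \log_2 L \rceil}, \qquad k = \lfloor R / R_0 \rfloor - \log_2 l, \qquad m = \frac{N}{l \left( \lfloor R / R_0 \rfloor - \log_2 l \right)}$$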
The described computing structure can be organized only if l is an integer power of two.
Otherwise, the data flow is extended with neutral elements, and
the feedback circuit is supplemented with additional delay elements up to l.
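The parameter formulas can be evaluated with a short Python helper. This is a sketch under stated assumptions: L is read as the latency of an operational vertex in cycles, R as the available hardware resource, R0 as the resource of one vertex and N as the number of data elements; rounding l up to the next power of two and rounding m up are also assumptions, as are the numeric values.

```python
import math


def topology_parameters(N, R, R0, L):
    """Return (l, k, m) for the modified topology.

    Assumptions: l is L rounded up to a power of two, and m is rounded up
    when N is not divisible by l * k.
    """
    l = 1 << (L - 1).bit_length()          # smallest power of two >= L, L >= 1
    log2_l = l.bit_length() - 1
    k = math.floor(R / R0) - log2_l
    if k <= 0:
        raise ValueError("resource R is too small for this latency")
    m = math.ceil(N / (l * k))
    return l, k, m


print(topology_parameters(N=1024, R=110, R0=10, L=5))  # (8, 8, 16)
```

In this example l · m · k = 8 · 16 · 8 = 1024, so the three levels of the structure cover the whole array.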
Modified Resource-Independent Program in Set@l
General description of the topology of the information graph with associative operations in
the source code of the resource-independent program in Set@l:
After the translation of the source code, the following sets with imposed types of
parallelism are formed:
Conclusions
• In this paper, we propose a method that rearranges the vertices of an information graph
with associative operations and further optimizes the computing structure in
order to reduce the problem solution time by a factor equal to
the latency of the operational vertex.
• The designed general graph topology combines sequential and parallel fragments of
calculations and provides a dense data flow for the available hardware
resource. The developed method extends the technique considered in our previous
paper to cases in which the latency of an associative vertex exceeds one cycle.