Simulation Manual
for
Configurable MapReduce Accelerator
(work in progress)
Gheorghe M. Stefan
The emergence of the hybrid computation domain is an incipient process. Roughly speaking, it is about a system
containing two parts: a standard computing engine, used as host and to run the complex part of the code1, and an accelerator,
for running the intense part of the code2. While for the host there are a few consecrated solutions (off-the-shelf mono- or
multi-core processors), for the position of the accelerator only a few solutions compete. Some of them have a considerable
advance: various GPUs (such as Nvidia's, AMD's ATI), MICs3 (such as Intel's Xeon Phi, Adapteva's Epiphany), or FPGA-
implemented circuits. GPU solutions are limited because the architecture is biased by the graphics functionality legacy,
while MIC processors are limited because of their ad hoc organization. The FPGA solutions look the most
promising because of their flexibility. The flexibility is used to provide well-fitted solutions and, at the same time, it helps in
the prototyping process when the final target is an ASIC-implemented hybrid system.
The only drawback in using FPGAs is the requirement of circuit design skills for defining and implementing the
circuit used as accelerator. A good compromise is to use a predefined framework for the FPGA design as a configurable
programmable parallel system. In the following, a configurable MapReduce programmable structure [11] is considered
as a generic accelerator engine.
In the second section the structure of the simulated system is described. The assembly language is described in the third
section. The fourth section contains examples. The fifth section develops a library of functions. The last section is reserved
for upgrades expected as outcomes of the evaluation process.
1A code is said to be complex if its size is in the same range as its execution time.
2A code is said to be intense if its size is much smaller than its execution time.
3 Many Integrated Core
Contents

1 Functional Electronics  4

I SIMULATOR  5

2 The General Description of Configurable MapReduce Accelerator  5

II LIBRARY OF FUNCTIONS  47

5 The list of the reserved storage resources  47
  5.1 The list of the reserved storage resources in the scalar memory  47
  5.2 The list of the reserved storage resources in the vector memory  48

6 Transfer Functions  49
  6.1 Two-dimension Array Transfer  49
      6.1.1 Load N full horizontal vectors  50
      6.1.2 Store N full horizontal vectors  51
      6.1.3 Load M m-component vertical vectors  52
      6.1.4 Store M m-component vertical vectors  52
  6.2 Two-dimension Arrays Transfer  52

8 Sparse Linear Algebra  62
  8.1 Sparse matrix representation  62
      8.1.1 Band matrices representation  62
      8.1.2 Sparse matrices with randomly distributed non-zero elements representation  62
  8.2 Band Matrix Operations  63
      8.2.1 Band Matrix Vector Multiplication  63
  8.3 Random Sparse Matrix Operations  65
      8.3.1 Sparse Matrix Transpose  65
      8.3.2 Sparse Matrix Vector Multiplication  67
      8.3.3 Sparse Matrices Multiplication  69

9 Graphs  73
  9.1 Minimum Spanning Tree  74
  9.2 All-Pairs Shortest Path  75
  9.3 Breadth-First Search  76

III UPGRADES  77

References  78
1 Functional Electronics
The evolution of electronics tends naturally toward the emergence of systems where circuits interleave with information
in order to achieve high functional capabilities. The action of Moore's Law provides big circuits, but there is no
Moore's Law for functional complexity. Structures get big, but only if they remain simple, characterized by repetitive
patterns. Complexity comes only if flexible informational structures can be inserted in the big pattern-based physical
structures.
Indeed, it is easy to design, verify, implement and test silicon chips with billions of transistors, but only if the description
of these circuits is kept within reasonable limits. If the structure is big & complex (the description of the circuit has a size
of the same order of magnitude as the circuit itself), then it is impossible to provide a verifiable design and a credible
test procedure for it.
In this context, Functional Electronics is the emergent domain of the functionally big & complex systems built by
tightly interleaving pattern-based big circuits with complex information. Thus:
Circuit & Information = Functional Electronics
Because circuits are naturally parallel engines, Functional Electronics is equivalent in the commercial space with
Parallel Embedded Systems.
The accelerator described in the following is a typical product of Functional Electronics with applications in the Parallel
Embedded Systems domain. As a circuit, it is based on an N-order digital system with a scan super (global) loop, a reduction
super (global) loop and a controlled super (global) loop [12]. As a computation engine, it is based on the synergy between
Stephen Kleene's mathematical model of computation and John Backus's Functional Programming Systems [11].
The physical implementation of the accelerator provides, in 28 nm technology, for less than 10 Watt:
2 32-bit TOPS
The degree of parallelism depends on the application. For the linear algebra domain it tends to be more than 90%. For
molecular dynamics it is already proved to be over 75%. In the Artificial Neural Network domain the performance of our
programmable solution is similar to that of ASICs.
Part I
SIMULATOR
2 The General Description of Configurable MapReduce Accelerator
The structure of the development system we consider (see Figure 1) consists of:
- Host + External Memory, with functionality specified only at the level of the interaction assembly language used to control the accelerator;
- ACCELERATOR.
The Host is supposed to run a program whose intense part is sent to be run by the Accelerator. The entire program is loaded
in External Memory. The intense part of the program and the associated data are sent to the Accelerator using the interface
subsystem containing:
- DMA: Direct Memory Access controller, which receives commands from the Host, through inFIFO, or from the Accelerator's Controller;
- inFIFO: used to receive
  - commands from the Host,
  - program from the External Memory under the Host's control,
  - data from the External Memory;
- outFIFO: used to
  - send back the result of computation,
  - send requests for data from the External Memory.
The computational part of the accelerator (see Figure 2) performs functions dealing with scalars or vectors and consists
of three parts:
- CONTROLLER: performing functions defined on scalars with values in scalars; it has a Harvard RISC architecture with its program memory (prog mem), data memory (mem) and execution unit (eng);
- MAP section: performing functions defined on vectors with values in vectors; it is a linear array of cells, each with its own data memory and execution unit similar to those of the controller;
- REDUCTION network: performing functions defined on vectors with values in scalars; it is a log-depth circuit.
Figure 2: the MAP array of eng + mem cells fed from inFIFO; the CONTROLLER, with its eng, mem and prog mem, connected to/from the DMA and to outFIFO; the REDUCE network collecting the array outputs.
The user's image of the system is presented in Figure 3. It consists of the memory resources accessible at the level of
the assembly language. There are three levels of storage in the system we simulate:
External Memory: loaded at the beginning of the simulation with program and data; at the end of the simulation it
contains the results.
Controller's Memory resources are:
Accumulator Register: a 32-bit register in the accumulator-based execution unit; it provides one of the
operands and stores the result of the unary and binary operations performed by the execution unit
reg [n-1:0] acc
Carry Bit: a 1-bit register whose content is updated at each arithmetic operation (shifts are arithmetic
operations)
reg cr
Scalar Memory: the data memory of the controller; it provides, as a rule, the second operand for binary
operations
reg [n-1:0] mem[0:(1<<s)-1]
Address Register: a register used to form the address for the Scalar Memory when the relative addressing mode is
used; its content is added to the immediate value provided by the controller's instruction
reg [s-1:0] addr
Program Memory: contains at each location a pair of instructions, one for the CONTROLLER and another for the
MAP-REDUCE array; it is loaded under the control of the DMA unit
reg [31:0] progMem[0:(1<<p)-1]
Vector Memory: contains m = 2^v p-component vectors, where p = 2^x is the number of cells
reg [n-1:0] vectMem[0:(1<<x)-1][0:(1<<v)-1]
as follows:
Figure 3: the user's image of the system: the ARRAYS MEMORY (Serial Register r0 r1 ... ri ..., the vectors vector[0] ... vector[j] ... vector[m-1], Address Vector a0 a1 ... ai ..., Carry Vector c0 c1 ... ci ..., Accumulator Vector v0 v1 ... vi ... doubled by the Boolean Vector b0 b1 ... bi ..., and the constant Index Vector 0 1 ... i ... 2^x - 1), the CONTROLLER'S MEMORY (Address Register, Carry Bit, Accumulator Register) and the External Memory m0 m1 ... ml ...
...
vector[j]: reg [n-1:0] vectMem[0:(1<<x)-1][j]
...
vector[m-1]: reg [n-1:0] vectMem[0:(1<<x)-1][m-1]
Serial Register: a serial-parallel register distributed along the MAP's cells; each of the p cells contains an n-bit
parallel register serially connected to the previous and the next cell
reg [n-1:0] serialReg[0:(1<<x)-1]
Index Vector: a constant vector used to index the p cells of the MAP section
reg [x-1:0] ixVect[0:(1<<x)-1]
There are the following five operation modes in the storage space just described:
1. vector to scalar mode: performed in the REDUCTION section, starting from accVect and providing a value in acc
or back to the MAP section.
Important note: the REDUCTION unit is a log-depth circuit with a latency λ(p) = 1 + 0.5·log2(p). Therefore, any
scalar generated at the output of the REDUCTION unit is valid with a λ-cycle delay, i.e., between the instruction
which sets the content of accVect submitted to a reduction operation and the instruction which uses the result of
the reduction operation λ instructions must be inserted; if there is nothing to do, then no-operation instructions are
used.
2. scalar-scalar to scalar mode: performed in the CONTROLLER between acc and mem[i], an immediate value con-
tained in the instruction, or coOperand, with the result in acc; coOperand is the scalar value received, with λ cycles latency,
through the REDUCTION unit from the MAP section
3. vector-scalar to vector mode: performed in the MAP section between accVect and an immediate value contained in the instruction
or coOperand, with the result in accVect; coOperand is the scalar value received from the CONTROLLER or, with
λ cycles latency, from the REDUCTION unit
4. vector-vector to vector mode: performed in the MAP section between accVect and vectMem[j]
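The four modes above can be mimicked with a few lines of plain Python over a toy machine state. This is an illustrative sketch of the data flow only, not of the hardware or its timing; the names acc_vect, mem and vect are ours.

```python
# Toy state: p cells, each with an accumulator (plain Python ints here).
p = 8
acc_vect = list(range(p))        # accVect: one accumulator per cell
mem = [0] * 16                   # controller's scalar memory
mem[3] = 7

# 1. vector to scalar: a log-depth reduction (here: add) over accVect;
#    in hardware the result is valid only lambda(p) cycles later
reduction_add = sum(acc_vect)

# 2. scalar-scalar to scalar: in the CONTROLLER, e.g. cADD(3)
acc = 5
acc = acc + mem[3]

# 3. vector-scalar to vector: acc broadcast as co-operand, e.g. CADD
acc_vect = [a + acc for a in acc_vect]

# 4. vector-vector to vector: accVect with a stored vector, per cell
vect = [1] * p
acc_vect = [a + b for a, b in zip(acc_vect, vect)]
```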
3 The Assembly Language
Instruction formats:
executed by MapReduce Accelerator (MRA) on its internal data structures (see Figure 3):
mraInstr[31:0] = {controllerInstr[7:0], value[7:0], arrayInstr[7:0], value[7:0]}
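As a sketch, the 32-bit instruction pair can be packed with plain shifts, following the field order of the format above; pack_mra is our name, not part of the toolchain.

```python
def pack_mra(controller_instr, c_value, array_instr, a_value):
    """Pack {controllerInstr[7:0], value[7:0], arrayInstr[7:0], value[7:0]}."""
    # each field is an 8-bit quantity
    assert all(0 <= f < 256 for f in (controller_instr, c_value, array_instr, a_value))
    return (controller_instr << 24) | (c_value << 16) | (array_instr << 8) | a_value
```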
3.1 Host-Accelerator Interface
The host-accelerator interface allows program load from the external memory and data transfers between the external memory
and the internal vector memory.
@yyy
yyyyyyyy
yyyyyyyy
...
yyyyyyyy
where y is a hexadecimal symbol. The file starts with yyy, which represents the starting address in extMem. This address must be
carefully chosen in accordance with the size of the program loaded starting from the address 0.
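A minimal loader for this file format might look as sketched below; load_hex and ext_mem are our names, and the simulator's actual loader may differ.

```python
def load_hex(lines, ext_mem):
    """Parse the '@yyy' hex format: an @-prefixed start address in extMem,
    then one hexadecimal word per line, stored at consecutive addresses."""
    addr = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('@'):
            addr = int(line[1:], 16)       # starting address in extMem
        else:
            ext_mem[addr] = int(line, 16)  # one word per line
            addr += 1
    return ext_mem

ext_mem = {}
load_hex(["@00c", "0000002a", "deadbeef"], ext_mem)
```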
3.2 MapReduce Accelerator
The parameters used to configure the ACCELERATOR are the following:
parameter
  n = 32, // word size
  x = 10, // index size -> 2^x = 1024 cells
  v = 11, // vector memory address size -> 2048 1024-component vectors
  s = 9,  // scalar memory address size -> 512 32-bit scalars
  p = 8,  // program memory address size -> 256 pairs of instructions
  c = 8,  // value size in instruction
  a = 5   // size of activation counter -> 32 embedded WHEREs
- 32-bit word
- 1024-cell array
- 2048-word local memory in each cell, which translates into a Vector Memory of 2048 vectors of 1024 32-bit scalars each
3.2.1 Input-output instructions
send from mem[cScalar] to the DMA unit the size of the vector to be transferred:
cLSIZE(cScalar): size <= mem[cScalar[s-1:0]][x-1:0] // in DMA unit
send from mem[cScalar] to the DMA unit the address in the external memory where the transfer starts:
cLADDR(cScalar): addr <= mem[cScalar[s-1:0]][28:0] // in DMA unit
send from mem[cScalar] to the DMA unit the size of the stride in the external memory:
cLSTRIDE(cScalar): stride <= mem[cScalar[s-1:0]][28:0] // in DMA unit
send from mem[cScalar] to the DMA unit the type of transfer and the transfer start command:
cTRUN(cScalar): case(cScalar[2:0])
  001: load
  010: store
  011: strided load
  100: strided store
  101: gathered load
  110: scattered store
endcase
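The transfer-type decode can be sketched as follows; this is our Python rendering of the case statement above, and the DMA transfers themselves are not modelled.

```python
# Map of the defined cScalar[2:0] codes to transfer types.
TRANSFER = {
    0b001: "load",
    0b010: "store",
    0b011: "strided load",
    0b100: "strided store",
    0b101: "gathered load",
    0b110: "scattered store",
}

def c_trun(c_scalar):
    """Decode cTRUN: only the low three bits select the transfer type.
    Codes 000 and 111 are not defined by the manual (KeyError here)."""
    return TRANSFER[c_scalar & 0b111]
```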
3.2.2 Load instructions
The subset of load instructions are used to load, from various storage resources inside the accelerator, n-bit words in the
accumulator of controller, acc, or in the accumulators of each cell, acc[i].
no operation :
cNOP: acc <= acc
NOP: acc[i] <= acc[i]
index load :
IXLOAD: acc[i] <= i
immediate load :
cVLOAD(cScalar): acc <= {{(n-c){cScalar[c-1]}}, cScalar}
VLOAD(aScalar): acc[i] <= {{(n-c){aScalar[c-1]}}, aScalar}
absolute load :
cLOAD(cScalar): acc <= mem[cScalar]
LOAD(aScalar): acc[i] <= vectMem[i][aScalar]
relative load :
cRLOAD(cScalar): acc <= mem[addr + cScalar]
RLOAD(aScalar): acc[i] <= vectMem[i][addrVect[i] + aScalar]
co-operand load :
cCLOAD(0): acc <= reductionAdd
cCLOAD(1): acc <= reductionMin
cCLOAD(2): acc <= reductionMax
cCLOAD(3): acc <= reductionFlag
cCLOAD(4): acc <= serialReg[0]
cCLOAD(5): acc <= serialReg[(1<<x)-1]
CLOAD: acc[i] <= acc
CALOAD: acc[i] <= vectMem[i][acc]
CRLOAD: acc[i] <= vectMem[i][addrVect[i] + acc]
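The immediate forms above sign-extend the c-bit instruction value to n bits, written {{(n-c){cScalar[c-1]}}, cScalar} in the descriptions. A sketch of that extension with n = 32, c = 8 (the function name is ours):

```python
N, C = 32, 8

def sign_extend(value, c=C, n=N):
    """Replicate the sign bit cScalar[c-1] over the upper n-c bits."""
    value &= (1 << c) - 1            # keep the c-bit immediate field
    if value & (1 << (c - 1)):       # negative in c-bit two's complement
        value -= 1 << c
    return value & ((1 << n) - 1)    # result as an n-bit word
```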
3.2.3 Store instructions
The subset of store instructions are used to store, into various storage resources inside the accelerator, n-bit words from the
accumulator of controller, acc, or from the accumulators of each cell, acc[i].
absolute store :
cSTORE(cScalar): mem[cScalar] <= acc
STORE(aScalar): vectMem[i][aScalar] <= acc[i]
relative store :
cRSTORE(cScalar): mem[addr + cScalar] <= acc
RSTORE(aScalar): vectMem[i][addrVect[i] + aScalar] <= acc[i]
co-operand store :
CSTORE: vectMem[i][acc] <= acc[i]
cCRSTORE(0): mem[addr + reductionAdd] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
cCRSTORE(1): mem[addr + reductionMin] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
cCRSTORE(2): mem[addr + reductionMax] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
cCRSTORE(3): mem[addr + reductionFlag] <= acc (reduction applied to acc[i] (1+x/2) cycles before)
CRSTORE: vectMem[i][addrVect[i] + acc] <= acc[i]
3.2.4 Address register load instructions
These instructions are used to set the value of the address register in the controller, addr, and in each cell of the array,
addrVect[i]. The address register is used to differentiate the address space of each local data memory
distributed in the array at cell level.
address register takes the value from accumulator :
cADDRLD: addr <= acc
ADDRLD: addrVect[i] <= acc[i]
3.2.5 Two-operand n-bit integer instructions
The pattern for the two-operand instruction is presented using the function ADD (addition). Each of the two-operand in-
struction has the following 12 forms (5 for Controller and 7 for Array) according to the way the second operand is selected.
(For the sake of simplicity, in the following, acc[i] stands for accVect[i] and cr[i] stands for crVect[i].)
immediate add :
cVADD(cScalar): {carry, acc} <= acc + {{(n-8){cScalar[7]}}, cScalar}
VADD(aScalar): {carry[i], acc[i]} <= acc[i] + {{(n-8){aScalar[7]}}, aScalar}
absolute add :
cADD(cScalar): {carry, acc} <= acc + mem[cScalar]
ADD(aScalar): {carry[i], acc[i]} <= acc[i] + vectMem[i][aScalar]
relative add :
cRADD(cScalar): {carry, acc} <= acc + mem[addr + cScalar]
RADD(aScalar): {carry[i], acc[i]} <= acc[i] + vectMem[i][addrVect[i] + aScalar]
co-operand add :
cCADD(0): {carry, acc} <= acc + reductionAdd (applied to acc[i] (1+x/2) cycles before)
cCADD(1): {carry, acc} <= acc + reductionMin (applied to acc[i] (1+x/2) cycles before)
cCADD(2): {carry, acc} <= acc + reductionMax (applied to acc[i] (1+x/2) cycles before)
cCADD(3): {carry, acc} <= acc + reductionFlag (applied to acc[i] (1+x/2) cycles before)
cCADD(4): {carry, acc} <= acc + serialReg[0]
cCADD(5): {carry, acc} <= acc + serialReg[(1<<x)-1]
CADD: {carry[i], acc[i]} <= acc[i] + acc
CAADD: {carry[i], acc[i]} <= acc[i] + vectMem[i][acc]
CRADD: {carry[i], acc[i]} <= acc[i] + vectMem[i][addrVect[i] + acc]
For the following mnemonics, the previously described 12 instruction forms are the same:
ADDC - add with carry: {carry, acc} <= acc + op + carry
SUB - subtract: {carry, acc} <= acc - op
RSUB - reverse SUB: {carry, acc} <= op - acc
SUBC - SUB with carry: {carry, acc} <= acc - op - carry
RSUBC - reverse SUBC: {carry, acc} <= op - acc - carry
DIV - division: acc <= acc / op
RDIV - reverse DIV: acc <= op / acc
MULT - multiplication: acc <= acc * op
AND - bitwise and: acc <= acc & op
OR - bitwise or: acc <= acc | op
XOR - bitwise xor: acc <= acc ^ op
COMPARE - compare: {carry, acc} <= ((acc - op) & {1'b1, {(n-1){1'b0}}}) | {1'b0, acc}
Thus, instead of the suffix ADD in one of the previous 12 instruction descriptions, any of the previous mnemonics can be used; for
example, VADDC instead of VADD. Thus, 12 × 13 instructions are already described.
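Reading the COMPARE definition: the carry takes the sign bit of acc - op, while acc itself is left unchanged. A sketch of that behavior, assuming two's complement n-bit words (the function name is ours):

```python
N = 32

def compare(acc, op, n=N):
    """COMPARE: carry <= sign bit of (acc - op); acc is not modified."""
    diff = (acc - op) & ((1 << n) - 1)  # n-bit two's complement difference
    carry = diff >> (n - 1)             # the sign bit
    return carry, acc
```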
3.2.6 Floating point instructions
The floating point set of instructions uses as co-operand only the local memory content addressed by the immediate value
from the instruction: mem[cScalar] for the controller and vectMem[i][aScalar] for each array cell. The execution times
for float operations are:
float add: 3 cycles for the following sequence of instructions (exemplified for controller):
float multiplication: 2 cycles for the following sequence of instructions (exemplified for controller):
float division: 26 cycles for the following sequence of instructions (exemplified for controller):
second step float division :
cMDIV: 24-cycle operation on mantissa
MDIV: 24-cycle operation on mantissa
3.2.7 Shift instructions
shift right one bit position :
cSHRIGHT: {cr, acc} <= {acc[0], 1'b0, acc[n-1:1]}
SHRIGHT: {cr[i], acc[i]} <= {acc[i][0], 1'b0, acc[i][n-1:1]}
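As specified above, SHRIGHT shifts the accumulator right by one position, inserting 0 at the top and catching the dropped bit in the carry. A sketch (the function name is ours):

```python
N = 32

def shright(acc, n=N):
    """{cr, acc} <= {acc[0], 1'b0, acc[n-1:1]}: carry takes the bit
    shifted out, a zero enters at the most significant position."""
    cr = acc & 1
    return cr, (acc >> 1) & ((1 << n) - 1)
```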
3.2.8 Send controller's operand as co-operand for the array
These instructions are executed only by the controller and are used to send the controller's operand as co-operand for the
array. They must be used in conjunction with an instruction which, in the array, requests the co-operand.
send as co-operand to array the output of the reduction unit selected with cScalar[1:0] :
cCSEND(0): opVect[k] = reductionAdd
cCSEND(1): opVect[k] = reductionMin
cCSEND(2): opVect[k] = reductionMax
cCSEND(3): opVect[k] = reductionFlag
cCSEND(4): opVect[k] = serialReg[0]
cCSEND(5): opVect[k] = serialReg[(1<<x)-1]
This subset of instructions is used in conjunction with the instructions CXXX, where XXX is one of the two-operand
instructions previously defined. For example:
3.2.9 Sequential control instructions
unconditional jump to the instruction labeled with LB(cScalar):
cJMP(cScalar): pc <= pc + valueComputedByAssembler
branch if acc is zero to the instruction labeled with LB(cScalar) & decrement:
cBRZDEC(cScalar): pc <= (acc == 0) ? pc + valueComputedByAssembler : pc + 1
acc <= acc - 1
branch if acc is not zero to the instruction labeled with LB(cScalar) & decrement:
cBRNZDEC(cScalar): pc <= (acc == 0) ? pc + 1 : pc + valueComputedByAssembler
acc <= acc - 1
branch if acc+1 is zero to the instruction labeled with LB(cScalar) & increment:
cBRZINC(cScalar): pc <= (acc + 1 == 0) ? pc + valueComputedByAssembler : pc + 1
acc <= acc + 1
branch if acc+1 is not zero to the instruction labeled with LB(cScalar) & increment:
cBRNZINC(cScalar): pc <= (acc + 1 == 0) ? pc + 1 : pc + valueComputedByAssembler
acc <= acc + 1
branch if acc is negative to the instruction labeled with LB(cScalar):
cBRSGN(cScalar): pc <= (acc[n-1] == 1) ? pc + valueComputedByAssembler : pc + 1
branch if acc is not negative to the instruction labeled with LB(cScalar):
cBRNSGN(cScalar): pc <= (acc[n-1] == 1) ? pc + 1 : pc + valueComputedByAssembler
halt:
cHALT: pc <= pc
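A counted loop built on cBRNZDEC behaves as sketched below: the branch tests acc and then acc is always decremented, so a loop entered with acc = k runs its body k + 1 times. This is our reading of the definition above, not code from the simulator.

```python
def loop_count(start):
    """Count how many times a cBRNZDEC-controlled loop body executes
    when acc is initialized with `start`."""
    acc = start
    iterations = 0
    while True:
        iterations += 1              # the loop body runs once per pass
        branch_taken = (acc != 0)    # cBRNZDEC tests acc ...
        acc -= 1                     # ... and always decrements it
        if not branch_taken:
            break
    return iterations
```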
3.2.10 Spatial control instructions
The instructions from this subset are used to select the active cells.
Cell i is active if boolVect[i] = 1, where boolVect[i] = (actVect[i] == 0).
activate all cells :
ACTIVATE: actVect[i] <= 0
save the active cells going back to the previous selection pattern :
SAVEACT: actVect[i] <= actVect[i] - 1
3.2.11 Global shift instructions
global rotate by one position:
GROTATE: acc[i] <= acc[(i+1)%(1<<x)]
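GROTATE moves every accumulator one cell toward index 0 with wrap-around, as acc[i] <= acc[(i+1) % 2^x] states. A sketch over a small array (the function name is ours):

```python
def grotate(acc_vect):
    """acc[i] <= acc[(i+1) % p]: each cell takes its right neighbour's
    accumulator; the first cell's value wraps around to the last."""
    p = len(acc_vect)
    return [acc_vect[(i + 1) % p] for i in range(p)]
```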
3.2.12 Global search/insert/delete instructions
search for co-operand in all cells :
SRCALL: boolVect[i] <= (acc[i] == acc) ? 1b1 : 1b0
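SRCALL marks, in boolVect, every cell whose accumulator equals the co-operand broadcast from the controller. A sketch (the function name is ours):

```python
def srcall(acc_vect, acc):
    """boolVect[i] <= (acc[i] == acc) ? 1 : 0, for every cell i."""
    return [1 if a == acc else 0 for a in acc_vect]
```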
3.2.13 Serial register instructions
push right cScalar in serial register :
cVPUSHR(cScalar): serialReg[i] <= (i < (1<<x)-1) ? serialReg[i+1] : {{(n-c){cScalar[c-1]}}, cScalar}
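cVPUSHR shifts the distributed serial register one cell toward index 0 and inserts the sign-extended immediate at the top end. A sketch of that step (function name ours; n, c as in the parameter list):

```python
def vpushr(serial_reg, c_scalar, c=8, n=32):
    """serialReg[i] <= serialReg[i+1] for i < p-1; the last cell takes
    the c-bit immediate sign-extended to n bits."""
    val = c_scalar & ((1 << c) - 1)
    if val >> (c - 1):                               # negative immediate:
        val |= ((1 << n) - 1) ^ ((1 << c) - 1)       # fill upper bits with 1s
    return serial_reg[1:] + [val]
```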
4 How to Use the Assembler
In order to evaluate the maximum performance of our architecture, the assembly language must be used.
Therefore, the initial stage of evaluation must be done in assembly language. (The next stage, beyond the scope of our
approach, is to provide an efficient compiler from a high-level language to the machine language.) We provide a few sim-
ple examples of using the previously described assembly language. The behavioral description of the generic structure is
simulated on the ISE Design Suite 14.2 provided by Xilinx.
For simulation reasons, the engine is kept small. It is defined by the content of the 00 parameters.v file:
parameter
  n = 32, // word size
  x = 4,  // index size -> 16 cells
  v = 6,  // vector memory address size -> 64 vectors
  s = 8,  // scalar memory address size -> 256 scalars
  p = 8,  // program memory address size -> 256 32-bit instructions
  c = 8,  // value size in instruction
  a = 5   // size of activation counter
For editorial reasons, the simulator's monitor has the following, compressed form:
initial begin
  $monitor("t=%0d pc=%d a=%0d a[0]=%0d a[1]=%0d a[2]=%0d ... a[14]=%0d a[15]=%0d
    b=%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d%0d cc=%0d",
    $time/2,
    dut.pc,
    dut.acc,
    dut.accVect[0],
    dut.accVect[1],
    dut.accVect[2],
    ...
    dut.accVect[14],
    dut.accVect[15],
    dut.boolVect[0], dut.boolVect[1], dut.boolVect[2], dut.boolVect[3],
    dut.boolVect[4], dut.boolVect[5], dut.boolVect[6], dut.boolVect[7],
    dut.boolVect[8], dut.boolVect[9], dut.boolVect[10], dut.boolVect[11],
    dut.boolVect[12], dut.boolVect[13], dut.boolVect[14], dut.boolVect[15],
    dut.cc
  );
end
cXXX: generic instruction executed in Controller
XXX: generic instruction executed in the MapReduce array
//
cPLOAD;           // executed by DMA
// follows the program executed by accelerator
cXXX;    XXX;
...      ...
LB(50) cXXX; XXX; // the starting line of code in the program PPP
...      ...
cXXX;    XXX;
// ends the program executed by the accelerator
cPRUN(50);        // executed by DMA
//
Example 4.3 Working in slave mode, the accelerator receives a sequence of unrequested operations.
The sequence of unrequested operations is:
//
cPLOAD;
...            // here goes the program SSS
cPRUN(12);
cLSIZE([128]); // set the vector size; [...] means actual value
cTRUN(1);      // load the first vector; SSS knows where in vector memory
cTRUN(1);      // load the second vector; SSS knows where in vector memory
cTRUN(2);      // when the program ends, sends the result back
//
Listing 4: Data transfer in the program YYY
/
YYY
4.2 How to Program the ACCELERATOR
The following examples are presented in order to show how the main features of the MapReduce engine, with the generic
architecture, work. The following classes of operations are exemplified:
3. reduction operations
4.2.1 Data transfer programs
The following data transfer operations are possible in this generic version of the accelerator:
load: vector load, which requests the following parameters:
gathered load: vector gathered load, which requests the following parameters:
size: the number of n-bit scalars of the vector containing the n addresses
scattered store: vector scattered store, which requests the following parameters:
size: the number of n-bit scalars of the vector containing the n/2 address-data pairs
Once a parameter of a certain type is received by the DMA unit, it is maintained until a new value is sent to the DMA unit
for the same type of parameter. For example, once the vector size is established for the first data transfer, we don't need to
re-send the size for the next transfers if the size remains the same.
Example 4.5 Load vector requests two parameters: the size of the vector and the starting address in the external memory,
because the program knows where to load the vector in the vector memory. The parameters of the transfer are loaded
in two locations of the controller's memory in order to be used in more complex programs, where eventually they can be
submitted to some computations. In our example the size of the vector is stored in mem[1] and the initial address in external
memory is stored in mem[2]. These values are used in the instructions cLSIZE and cLADDR, pointed by the locations in mem
where they are stored: 1 and 2, respectively.
Important note: the instruction cIOWAIT cannot follow in the next cycle the instruction cTRUN. At least one cycle of delay
must be introduced before starting to wait for the end of the transfer. Any kind of processing instruction can be executed meantime.
In our example we inserted a cNOP instruction.
//
// TEST PROGRAM FOR: load vector
//
cPLOAD;     NOP;      // load command
// BEGIN PROGRAM
cSTART;     NOP;      // start cycle counter
cVLOAD(16); NOP;      // size = 16
cSTORE(1);  NOP;      // save size at mem[1]
cVLOAD(0);  ACTIVATE; // addr = 0; activate all cells
cSTORE(2);  IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);  NOP;      // send addr to DMA
cLSIZE(1);  NOP;      // send size to DMA
cTRUN(1);   NOP;      // start in DMA the load operation
cNOP;       NOP;      // wait to start before wait to end
cIOWAIT;    NOP;      // wait the transfer end
cSTOP;      IOLOAD;   // stop cycle counter; load ioReg in accVect
cHALT;      NOP;      // halt the program
// END PROGRAM
cPRUN(0);   NOP;      // run command
//
Example 4.6 Store vector requests two parameters: the size of the vector and the starting address in the external memory,
because the program knows from where, in the vector memory, to take the vector. The parameters of the transfer are
loaded in two locations of the controller's memory in order to be used in more complex programs, where eventually they can
be submitted to some computations. In our example the size of the vector is stored in mem[1] and the initial address in external
memory is stored in mem[2]. These values are used in the instructions cLSIZE and cLADDR, pointed by the locations in mem
where they are stored: 1 and 2, respectively.
Important note: the instruction cIOWAIT cannot follow in the next cycle the instruction cTRUN. At least one cycle of delay
must be introduced before starting to wait for the end of the transfer. Any kind of processing instruction can be executed meantime.
In our example we inserted a cNOP instruction.
//
// TEST PROGRAM FOR: store vector
// The program:
//   loads 10 in mem[1] and 14 in mem[2]
//   stores the first mem[1] components of the index vector in the external memory
//   starting from the address mem[2]
//
cPLOAD;     NOP;      // load command
// BEGIN PROGRAM
cSTART;     NOP;      // start cycle counter
cVLOAD(10); NOP;      // size = 10
cSTORE(1);  NOP;      // save size at mem[1]
cVLOAD(14); ACTIVATE; // addr = 14; activate all cells
cSTORE(2);  IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);  NOP;      // send addr to DMA
cLSIZE(1);  IOSTORE;  // send size to DMA; load io register with acc
cTRUN(2);   NOP;      // start in DMA the store operation
cNOP;       NOP;      // wait to start before wait to end
cIOWAIT;    NOP;      // wait the transfer end
cSTOP;      NOP;      // stop cycle counter
cHALT;      NOP;      // halt the program
// END PROGRAM
cPRUN(0);   NOP;      // run command
//
The program ends with the stream 0, 1, ... 9 loaded in the external memory starting from the address 14.
Example 4.7 Store - load vector is a program which stores the full index vector in the external memory starting from the
address 32, then loads the same vector back into the accumulator vector. Meantime, the content of the accumulator vector
is incremented (VADD(1)).
//
// TEST PROGRAM FOR: store & load vector
//
cPLOAD;     NOP;      // load command
// BEGIN PROGRAM
cSTART;     NOP;      // start cycle counter
cVLOAD(16); NOP;      // size = 16
cSTORE(1);  NOP;      // save size at mem[1]
cVLOAD(32); ACTIVATE; // addr = 32; activate all cells
cSTORE(2);  IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);  NOP;      // send addr to DMA
cLSIZE(1);  IOSTORE;  // send size to DMA; load io register with acc
cTRUN(2);   NOP;      // start the send operation in DMA
cNOP;       VADD(1);  // wait to start before wait to end; accVect + 1
cIOWAIT;    NOP;      // wait the transfer end
cLADDR(2);  NOP;      // send addr to DMA
cLSIZE(1);  NOP;      // send size to DMA
cTRUN(1);   NOP;      // start the load operation in DMA
cNOP;       NOP;      // wait to start before wait to end
cIOWAIT;    NOP;      // wait the transfer end
cSTOP;      IOLOAD;   // stop cycle counter; load ioReg in accVect
cHALT;      NOP;      // halt the program
// END PROGRAM
cPRUN(0);   NOP;      // run command
//
Example 4.8 Strided load vector program has two parts. The first part loads the external memory with two full vectors
starting from the location 16. The first vector is the index vector and the second is the incremented index vector.
/*
TEST PROGRAM FOR: strided load vector
*/
//
cPLOAD;         NOP;      // load command
// BEGIN PROGRAM
cSTART;         NOP;      // start cycle counter
cVLOAD(16);     NOP;      // size = 16
cSTORE(1);      NOP;      // save size at mem[1]
cVLOAD(16);     ACTIVATE; // addr = 16; activate all cells
cSTORE(2);      IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);      NOP;      // send addr to DMA
cLSIZE(1);      IOSTORE;  // send size to DMA; load io register with acc
cTRUN(2);       NOP;      // start the send operation in DMA
cVLOAD(32);     NOP;      // addr = 32
cSTORE(2);      VADD(1);  // save addr at mem[2]; increment accVect
cIOWAIT;        NOP;      // wait the transfer end
cLADDR(2);      IOSTORE;  // send addr to DMA; load io register with acc
cTRUN(2);       NOP;      // start the send operation in DMA
cNOP;           NOP;      // wait to start before wait to end
cIOWAIT;        NOP;      // wait the transfer end
// the strided part
cVLOAD(16);     NOP;      // size = 16
cSTORE(1);      NOP;      // mem[1] <= size
cVLOAD(18);     NOP;      // addr = 18
cSTORE(2);      NOP;      // mem[2] <= addr
cVLOAD(4);      NOP;      // burst = 4
cSTORE(3);      NOP;      // mem[3] <= burst
cVLOAD(7);      NOP;      // stride = 7
cSTORE(4);      NOP;      // mem[4] <= stride
cLSIZE(1);      NOP;      // size -> DMA
cLADDR(2);      NOP;      // addr -> DMA
cLBURST(3);     NOP;      // burst -> DMA
cLSTRIDE(4);    NOP;      // stride -> DMA
cTRUN(3);       NOP;      // run strided load
cNOP;           NOP;      // wait to start the transfer
cIOWAIT;        NOP;      // wait to end the transfer
cSTOP;          IOLOAD;   // stop cycle counter; load ioReg in accVect
cHALT;          NOP;      // halt the program
// END PROGRAM
cPRUN(0);       NOP;      // run command
/////
The program loads 4 words starting from location 18 in the external memory, then loads another 4 words starting from
18 + 7, and so on, until 16 words are transferred as a 16-word vector into the accumulator vector accVect of the array.
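The strided access pattern just described can be sketched in a few lines of Python. This is an illustrative model only, not the simulator's DMA (which is a hardware unit); the dict-based memory and the parameter names are assumptions chosen to mirror the example, with the memory pre-filled so that extMem[a] = a for readability.

```python
# Minimal model of the strided load (cTRUN(3)): read `burst` consecutive
# words starting at `addr`, jump ahead by `stride`, and repeat until
# `size` words are collected.
def strided_load(ext_mem, addr, size, burst, stride):
    out = []
    base = addr
    while len(out) < size:
        for k in range(burst):
            out.append(ext_mem[base + k])
        base += stride
    return out[:size]

ext_mem = {a: a for a in range(128)}  # illustrative content: extMem[a] = a
words = strided_load(ext_mem, addr=18, size=16, burst=4, stride=7)
# bursts start at addresses 18, 25, 32 and 39
```

With the example's parameters (addr = 18, size = 16, burst = 4, stride = 7) the four bursts cover addresses 18..21, 25..28, 32..35 and 39..42.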
Example 4.9 Strided store vector program prepares the parameters of the transfer in 4 locations in the controller's data memory
starting from 1. Then it loads the accumulator vector, accVect, into the input-output register, ioReg, and runs the transfer.
The program waits for the end of the transfer, stops the cycle counter and halts the accelerator.
/*
TEST PROGRAM FOR: strided store vector
*/
//
cPLOAD;        NOP;      // load command
// BEGIN PROGRAM
cSTART;        NOP;      // start cycle counter
cVLOAD(12);    NOP;      // size = 12
cSTORE(1);     NOP;      // save size at mem[1]
cVLOAD(14);    ACTIVATE; // addr = 14; activate all cells
cSTORE(2);     IXLOAD;   // save addr at mem[2]; load index in all cells
cVLOAD(2);     NOP;      // burst = 2
cSTORE(3);     NOP;      // save burst at mem[3]
cVLOAD(5);     NOP;      // stride = 5
cSTORE(4);     NOP;      // save stride at mem[4]
cLBURST(3);    NOP;      // send burst to DMA
cLSTRIDE(4);   NOP;      // send stride to DMA
cLADDR(2);     NOP;      // send addr to DMA
cLSIZE(1);     IOSTORE;  // send size to DMA; load io register with acc
cTRUN(4);      NOP;      // start the strided store operation in DMA
cNOP;          NOP;      // wait to start before wait to end
cIOWAIT;       NOP;      // wait the transfer end
cSTOP;         NOP;      // stop cycle counter
cHALT;         NOP;      // halt the program
// END PROGRAM
cPRUN(0);      NOP;      // run command
/////
The program stores in the external memory, starting from the address 14, a burst of 2 words, then does the same from the
address 14 + 5, and so on, until 12 words from the accumulator vector are loaded into the external memory.
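The write side of the same pattern can be modeled symmetrically. Again a sketch under assumed names, not the simulator's DMA; the data argument stands in for the content of ioReg.

```python
# Minimal model of the strided store (cTRUN(4)): write `burst` consecutive
# words starting at `addr`, jump ahead by `stride`, and repeat until the
# data (here, the content of accVect sent through ioReg) is exhausted.
def strided_store(ext_mem, data, addr, burst, stride):
    base, i = addr, 0
    while i < len(data):
        for k in range(burst):
            if i == len(data):
                break
            ext_mem[base + k] = data[i]
            i += 1
        base += stride

ext_mem = {}
strided_store(ext_mem, data=list(range(12)), addr=14, burst=2, stride=5)
# the 12 words land at addresses 14,15, 19,20, 24,25, 29,30, 34,35, 39,40
```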
Example 4.10 Scattered store starts having in accReg a vector containing pairs of address-data words. The first ele-
ment of the pair is used to address in the external memory the location where the second element of the pair is stored. Therefore,
the size of the transfer, the only parameter of this transfer type, must be an even number.
/*
TEST PROGRAM FOR: scattered store
*/
//
cPLOAD;       NOP;      // load command
// BEGIN PROGRAM
cSTART;       NOP;      // start cycle counter
cVLOAD(10);   ACTIVATE; // size = 10; activate all cells
cSTORE(1);    IXLOAD;   // save size at mem[1]; load index in all cells
cLSIZE(1);    IOSTORE;  // send size to DMA; load io register with acc
cTRUN(6);     NOP;      // start the scattered store operation in DMA
cNOP;         NOP;      // wait to start before wait to end
cIOWAIT;      NOP;      // wait the transfer end
cSTOP;        NOP;      // stop cycle counter
cHALT;        NOP;      // halt the program
// END PROGRAM
cPRUN(0);     NOP;      // run command
/////
Because the accumulator register, accReg, is loaded with the index vector, the program stores at the address 0 in the
external memory the value 1, at 2 the value 3, and so on, until storing at 8 the value 9. The size of the transfer is 10, thus 5
values are transferred into the external memory.
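The pairing rule can be made concrete with a short sketch; the function name and dict-based memory are illustrative, not simulator API.

```python
# Minimal model of the scattered store (cTRUN(6)): ioReg holds
# (address, data) pairs; each even-indexed element addresses the external
# memory location where the following element is written, hence the
# transfer size must be even.
def scattered_store(ext_mem, io_reg):
    assert len(io_reg) % 2 == 0
    for a, d in zip(io_reg[0::2], io_reg[1::2]):
        ext_mem[a] = d

ext_mem = {}
scattered_store(ext_mem, io_reg=list(range(10)))  # the index vector, size 10
# 5 writes: ext_mem[0] = 1, ext_mem[2] = 3, ..., ext_mem[8] = 9
```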
Example 4.11 Gathered load program first prepares the content of the external memory, loading from the location 16 the
index vector followed by the incremented index vector. Then, the accumulator is incremented by 17 and is used as address
vector to gather data from the external memory.
/*
TEST PROGRAM FOR: gathered load
*/
//
cPLOAD;       NOP;      // load command
// BEGIN PROGRAM
cSTART;       NOP;      // start cycle counter
cVLOAD(16);   NOP;      // size = 16
cSTORE(1);    NOP;      // save size at mem[1]
cVLOAD(16);   ACTIVATE; // addr = 16; activate all cells
cSTORE(2);    IXLOAD;   // save addr at mem[2]; load index in all cells
cLADDR(2);    NOP;      // send addr to DMA
cLSIZE(1);    IOSTORE;  // send size to DMA; load io register with acc
cTRUN(2);     NOP;      // start the send operation in DMA
cVLOAD(32);   NOP;      // addr = 32
cSTORE(2);    VADD(1);  // save addr at mem[2]; increment accVect
cIOWAIT;      NOP;      // wait the transfer end
cLADDR(2);    IOSTORE;  // send addr to DMA; load io register with acc
cTRUN(2);     VADD(17); // start the send operation in DMA; accVect + 17
cNOP;         NOP;      // wait to start before wait to end
cIOWAIT;      NOP;      // wait the transfer end
cVLOAD(16);   NOP;      // size = 16
cSTORE(1);    NOP;      // save size at mem[1]
cLSIZE(1);    IOSTORE;  // send size to DMA; load io register with acc
cTRUN(5);     NOP;      // start the gathered load operation in DMA
cNOP;         NOP;      // wait to start before wait to end
cIOWAIT;      NOP;      // wait the transfer end
cSTOP;        IOLOAD;   // stop cycle counter; load ioReg in accVect
cHALT;        NOP;      // halt the program
// END PROGRAM
cPRUN(0);     NOP;      // run command
/////
The program loads in accVect data gathered from the external memory starting from the address 18, because the
address vector loaded in ioReg is 18, 19, . . . , 33. The vector coming back in accVect is 2, 3, . . . , 15, 1, 2.
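A sketch of the gather step makes the result above easy to check. The memory setup reproduces the first part of the program (index vector at 16..31, incremented index vector at 32..47); the function and variable names are illustrative only.

```python
# Minimal model of the gathered load (cTRUN(5)): ioReg holds an address
# vector, and each cell receives the external memory word found at its
# address.
def gathered_load(ext_mem, addr_vect):
    return [ext_mem[a] for a in addr_vect]

ext_mem = {}
for i in range(16):
    ext_mem[16 + i] = i      # index vector at 16..31
    ext_mem[32 + i] = i + 1  # incremented index vector at 32..47

acc_vect = gathered_load(ext_mem, [i + 18 for i in range(16)])
# acc_vect == [2, 3, ..., 15, 1, 2], as stated above
```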
4.2.2 Simple Vector & Reduction Programs
Example 4.12 The program which provides in the controller's accumulator, acc, the sum of the indexes loaded in the accumu-
lators of each cell, accVect[i], is:
/*
Test program: 03_examples.v (appropriately commented)
    activate all cells
    acc[i] <= i, for i = 0, 1, ..., 15
    acc <= acc[0] + acc[1] + ... + acc[15]
    only three latency steps are inserted because x = 4 (lambda = 2 + x)
*/
//
cSTART;       ACTIVATE; // start cycle counter; activate all cells
cNOP;         IXLOAD;   // load the index of each cell in accumulator
cNOP;         NOP;      // latency step 1
cNOP;         NOP;      // latency step 2
cNOP;         NOP;      // latency step 3
cNOP;         NOP;      // latency step 4
cNOP;         NOP;      // latency step 5
cNOP;         NOP;      // latency step 6
cCLOAD(0);    NOP;      // acc <= sum of indexes
cSTOP;        NOP;      // stop cycle counter
cNOP;         NOP;      // to show cycle counter stopped
cHALT;        NOP;
/////
Appropriately commented means //* instead of /* before the first line of code.
The assembled code, provided by the simulator, is:
progMem[0] = 00110111000000000111011100000000
progMem[1] = 01101111000000000000000000000000
progMem[2] = 00000000000000000000000000000000
progMem[3] = 00000000000000000000000000000000
progMem[4] = 00000000000000000000000000000000
progMem[5] = 00000000000000000110010000000000
progMem[6] = 00000000000000000111111100000000
progMem[7] = 00000000000000000000000000000000
progMem[8] = 00000000000000000000011100000000
In the initial cycle (t=0) the system reset resets the cycle counter, cc = 0. In the first cycle (t=1) the program activates
all the cells of the array (the Boolean vector is filled up with 1s). The operation is validated in the next cycle, when b <=
11...1. Then the accumulator in each cell takes the value of its index. During three cycles the reduction network computes
the sum of the indexes. Then, in t=7, the controller's accumulator is loaded with the sum of the indexes, i.e., the sum
acc = 0 + 1 + 2 + ... + 15 = 120, because we instantiated for our simulation an array with 16 cells. The cycle
counter stops in the next cycle on the value 6, which means the program performed the task in cc - 1 = 5 cycles (the
instruction which stops the counter is also counted).
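The reduction network can be modeled as log2(p) pairwise addition stages, one per tree level. A sketch under that assumption (the simulator's network is a hardware tree, and p must be a power of two here):

```python
# Minimal model of the reduction-add network: at each stage adjacent
# pairs are summed, halving the vector, until one value remains.
def reduction_add(v):
    while len(v) > 1:
        v = [v[i] + v[i + 1] for i in range(0, len(v), 2)]
    return v[0]

p = 16
acc_vect = list(range(p))      # IXLOAD: acc[i] <= i
acc = reduction_add(acc_vect)  # cCLOAD(0): acc <= sum of indexes
# acc == 120, as reported by the simulation
```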
Example 4.13 The program which stores at mem[24] the inner product of the index vector with itself is:
/*
Test program: 03_examples.v (appropriately commented)
    activate all cells
    acc[i] <= i, for i = 0, 1, ..., 15
    memVect[i][4] <= acc[i], for i = 0, 1, ..., 15
    acc[i] <= acc[i] x vectMem[i][4]
    acc <= acc[0] + acc[1] + ... + acc[15]
    mem[24] <= acc = innerProduct(index, index)
*/
//
cSTART;       ACTIVATE;  // activate all cells
cNOP;         IXLOAD;    // acc[i] <= index
cNOP;         STORE(4);  // memVect[i][4] <= acc[i], for all i
cNOP;         MULT(4);   // acc[i] <= acc[i] x memVect[i][4]
cNOP;         NOP;       // latency step 1
cNOP;         NOP;       // latency step 2
cNOP;         NOP;       // latency step 3
cNOP;         NOP;       // latency step 4
cNOP;         NOP;       // latency step 5
cNOP;         NOP;       // latency step 6
cCLOAD(0);    NOP;       // acc <= reductionAdd(acc[i])
cSTORE(24);   NOP;       // mem[24] <= acc
cSTOP;        NOP;       // stop cycle counter
cHALT;        NOP;
///
//=============================================================================
The simulation provides the following results, slightly edited to fit the page: in cycle 9 the controller's accumulator is
loaded with the value of the inner product and in the next cycle its content is stored in the local scalar memory.
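The value the program stores at mem[24] can be checked with a direct sketch of the per-cell multiply followed by the reduction; names are illustrative.

```python
# Minimal model of the inner product computed above: component-wise
# multiply (MULT(4)) followed by the reduction add (cCLOAD(0)).
p = 16
ix = list(range(p))                     # IXLOAD: acc[i] <= i
prod = [a * b for a, b in zip(ix, ix)]  # acc[i] <= acc[i] x memVect[i][4]
inner = sum(prod)                       # reduction add
mem24 = inner                           # cSTORE(24): mem[24] <= acc
# inner == 0^2 + 1^2 + ... + 15^2 == 1240
```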
Example 4.14 The program which provides in acc the number of components of the index vector not smaller than 5 and smaller
than 15 is:
/*
Test program: 03_examples.v (appropriately commented)
    activate all cells
    acc[i] <= i
    keep active cells where (acc[i] >= 5)
    keep active cells where (acc[i] < 15)
    acc[i] <= 1 only in all active cells
    acc <= acc[0] + acc[1] + ... + acc[15] only for the active cells
*/
//
cNOP;         ACTIVATE;    // activate all cells
cNOP;         IXLOAD;      // acc[i] <= index
cNOP;         VSUB(5);     // {cr, acc[i]} <= acc[i] - 5
cNOP;         WHERENCARRY; // where cr = 1 remain active
cNOP;         VSUB(10);    // {cr, acc[i]} <= acc[i] - (15 - 5)
cNOP;         WHERECARRY;  // where cr = 0 remain active
cNOP;         VLOAD(1);    // acc[i] <= 1 in the active cells
cNOP;         ENDWHERE;    // reactivate where the second WHERE acted
cNOP;         ENDWHERE;    // reactivate where the first WHERE acted
cNOP;         NOP;         // latency step 3
cNOP;         NOP;         // latency step 4
cNOP;         NOP;         // latency step 5
cNOP;         NOP;         // latency step 6
cCLOAD(0);    NOP;         // acc <= number of active cells
cHALT;        NOP;
///
//=============================================================================
Indeed, the index vector contains 10 components not smaller than 5 and smaller than 15.
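The predicated counting can be sketched as follows; the Boolean list stands in for the hardware's active-cell vector, and the names are illustrative.

```python
# Minimal model of the predicated count: the two WHERE instructions
# shrink the active set, VLOAD(1) writes 1 only in the active cells,
# and the reduction add counts them.
p = 16
ix = list(range(p))
active = [5 <= i < 15 for i in ix]          # WHERENCARRY, then WHERECARRY
acc_vect = [1 if a else 0 for a in active]  # VLOAD(1) in active cells only
count = sum(acc_vect)                       # cCLOAD(0): acc <= count
# count == 10 (indexes 5..14)
```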
Example 4.15 Load the index in the cells' accumulators and do, l = 9 times: divide by 2 (integer operation) and increment with
99 each accumulator. The program is:
/*
Test program: 03_examples.v (appropriately commented)
    activate all cells
    acc <= 8; initialize the loop counter with l - 1
    acc[i] <= i; load index
    do (acc + 1) times
        acc[i] <= acc[i] / 2
        acc[i] <= acc[i] + 99
*/
//
cNOP;            ACTIVATE;
cVLOAD(8);       IXLOAD;
LB(1); cNOP;     SHRIGHT;
cBRNZDEC(1);     VADD(99);  // branch if acc != 0 and acc <= acc - 1
cHALT;           NOP;
///
//=============================================================================
The initial value of accVect is {0, 1, ..., 14, 15}. After the 9 executions of the two-cycle loop it becomes: {197,
197, ..., 197, 197}.
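The fixed point 197 is easy to confirm with a sketch of the loop body; the iteration count follows the do-(acc + 1)-times semantics stated in the header.

```python
# Minimal model of the loop: the counter starts at 8 and BRNZDEC runs
# the two-instruction body (SHRIGHT then VADD(99)) a total of 9 times.
acc_vect = list(range(16))               # IXLOAD: acc[i] <= i
for _ in range(9):
    acc_vect = [(a >> 1) + 99 for a in acc_vect]
# every component converges to the fixed point 197 (197 // 2 + 99 == 197)
```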
Example 4.16 Add, in accVect, the index vector with the sum of all indexes. The program is:
/*
Test program: 03_examples.v (appropriately commented)
    activate all cells
    acc[i] <= i; load index
    acc <= acc[0] + acc[1] + ... + acc[15]
    acc[i] <= acc[i] + acc
*/
In 6 cycles a computation consisting of 29 additions is performed. In the general case, for a number of p cells, the
execution time is 3 + (1 + 0.5 log p) = 4 + 0.5 log p, where 1 + 0.5 log p is the latency of the reduction net.
Therefore, 2p - 1 additions are performed in 4 + 0.5 log p cycles by an engine with p cells. The acceleration belongs to
O(p / log p), which is normal for a computation involving communication.
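Functionally, the program reduces once and broadcasts the sum back into every cell; a sketch under illustrative names:

```python
# Minimal model of the broadcast add: the reduction sum is computed once,
# then added back into every cell accumulator.
p = 16
acc_vect = list(range(p))               # IXLOAD: acc[i] <= i
acc = sum(acc_vect)                     # reduction add: acc <= 120
acc_vect = [a + acc for a in acc_vect]  # acc[i] <= acc[i] + acc
# acc_vect == [120, 121, ..., 135]
```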
Example 4.17 Example of program with data transfer:
Part II
LIBRARY OF FUNCTIONS
5 The list of the reserved storage resources
5.1 The list of the reserved storage resources in the scalar memory
mem[16] = numberOfLines : number of lines in the array, i.e., number of vectors in vectMem, for functions
    05_arrayLoad
    05_arrayStore
mem[17] = numberOfColumns : number of columns in the array, i.e., number of components per vector, for functions
    05_arrayLoad
    05_arrayStore
mem[18] = vectorAddress : the address of the first line (vector) in vectMem, for functions
    05_arrayLoad
    05_arrayStore
mem[19] = scalarAddress : the address of the first location in extMem where the stream of vectors to be trans-
ferred starts, for functions
    05_arrayLoad
    05_arrayStore
mem[20] = burstSize :
mem[21] = strideSize :
mem[22] = size : the edge size of the matrix submitted to one of the following functions
    05_matrixTranspose
    05_matrixVectorMultiply
    05_matrixMatrixMultiply
mem[23] = destMatrix : the address in vectMem of the first line of the result matrix, for functions
    05_matrixTranspose
    05_matrixMatrixMultiply
mem[24] = destVect : the address of the result vector for 05_matrixVectorMultiply
mem[25] = firstMatrix : the address pointing to the matrix used as operand for
    05_matrixTranspose
    05_matrixVectorMultiply, for which it points to the last line
    05_matrixMatrixMultiply, as multiplicand
mem[26] = secondMatrix : the address of the first line in the matrix used as second operand for
    05_matrixMatrixMultiply
mem[27] = operandVector : the address of the vector used as operand for 05_matrixVectorMultiply
mem[28] =
mem[29] =
mem[30] =
mem[31] =
5.2 The list of the reserved storage resources in the vector memory
vectMem[16]:
vectMem[17]:
vectMem[18]:
vectMem[19]:
vectMem[20]:
vectMem[21]:
vectMem[22]:
vectMem[23]:
vectMem[24]:
vectMem[25]:
vectMem[26]:
vectMem[27]:
vectMem[28]:
vectMem[29]:
vectMem[30]:
vectMem[31]:
6 Transfer Functions
The reserved locations in the controller's data memory are: mem[16], ..., mem[21].
/*
FUNCTION NAME:
AUTHOR: Gheorghe M. Stefan
DATE:
*/
//
/////
6.1.1 Load N full horizontal vectors
/*
FUNCTION NAME: Two-dimension array load
AUTHOR: Gheorghe M. Stefan
DATE: Sept. 25 2016

The parameters for the function are set in the controller's data memory in four successive
locations starting with 16. Recommended parameter initialization sequence:

Example: if in the data memory of the controller there is the following content
    mem[16] = 4
    mem[17] = 16
    mem[18] = 8
    mem[19] = 16
and in the external memory
    extMem[16] = 16
    extMem[17] = 17
    ...
    extMem[79] = 79
then the run of the function provides in the vector memory
    vect[8]  = <16, 17, ..., 31>
    vect[9]  = <32, 33, ..., 47>
    vect[10] = <48, 49, ..., 63>
    vect[11] = <64, 65, ..., 79>
*/
//
cLSIZE(17);          NOP;       // size -> DMA
LB(16); cLADDR(19);  NOP;       // addr -> DMA
cTRUN(1);            NOP;       // run load vector -> DMA
cLOAD(16);           NOP;       // acc <= number of transfers
cVADD(255);          NOP;       // acc <= acc - 1
cSTORE(16);          NOP;       //
cLOAD(19);           NOP;       // acc <= addr
cADD(17);            NOP;       // acc <= acc + size = next address
cSTORE(19);          NOP;       // save next address
cLOAD(18);           NOP;       // acc <= vector address
cVADD(1);            CADDRLD;   // inc vector address; addrVect[i] = acc
cSTORE(18);          NOP;       // save next vector address
cIOWAIT;             NOP;       // wait the end of load
cLOAD(16);           IOLOAD;    // load number of lines; accVect[i] <= ioReg[i]
cBRNZ(16);           RSTORE(0); // if acc != 0 jump to LB(16)
/////
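A functional model of what this routine computes can be written in a few lines. The parameter names mirror mem[16]..mem[19]; the dict-based memories and the function name are illustrative, not simulator API.

```python
# Minimal model of the two-dimension array load: `n_lines` vectors of
# `n_cols` words are read from the external memory stream starting at
# `scalar_addr` into consecutive lines of the vector memory starting at
# `vect_addr`.
def array_load(ext_mem, n_lines, n_cols, vect_addr, scalar_addr):
    vect = {}
    for l in range(n_lines):
        vect[vect_addr + l] = [ext_mem[scalar_addr + l * n_cols + c]
                               for c in range(n_cols)]
    return vect

ext_mem = {a: a for a in range(16, 80)}  # extMem[16] = 16, ..., extMem[79] = 79
vect = array_load(ext_mem, n_lines=4, n_cols=16, vect_addr=8, scalar_addr=16)
# vect[8] == [16, ..., 31], ..., vect[11] == [64, ..., 79], as in the header
```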
6.1.2 Store N full horizontal vectors
/*
FUNCTION NAME: Two-dimension array store
AUTHOR: Gheorghe M. Stefan
DATE: Sept. 25 2016

The parameters for the function are set in the controller's data memory in four successive
locations starting with 16. Recommended parameter initialization sequence:

Example: if in the data memory of the controller there is
    mem[16] = 4
    mem[17] = 16
    mem[18] = 8
    mem[19] = 16
and in the vector memory
    vect[8]  = <16, 17, ..., 31>
    vect[9]  = <32, 33, ..., 47>
    vect[10] = <48, 49, ..., 63>
    vect[11] = <64, 65, ..., 79>
then, the function stores in the external memory, starting from the address 16, the
following stream of data: <16, 17, ..., 79>
*/
//
cLOAD(18);           NOP;
cLSIZE(17);          CADDRLD;   // size -> DMA; addrVect[i] <= acc
LB(17); cLADDR(19);  RILOAD(0); // addr -> DMA; acc[i] <= memVect[addrVect[i]]
cTRUN(2);            IOSTORE;   // run store vector -> DMA; ioReg[i] <= acc[i]
cLOAD(16);           RILOAD(1); // acc <= numberOfTransfers; addrVect[i] <= addrVect[i] + 1
cVADD(255);          NOP;       // acc <= acc - 1
cSTORE(16);          NOP;       //
cLOAD(19);           NOP;       // acc <= addr
cADD(17);            NOP;       // acc <= acc + size = next address
cSTORE(19);          NOP;       // save next address
cIOWAIT;             NOP;       // wait the end of store
cLOAD(16);           NOP;       // load number of lines
cBRNZ(17);           NOP;       // if acc != 0 jump to LB(17)
/////
6.1.3 Load M m-component vertical vectors
6.1.4 Store M m-component vertical vectors
7 Dense Linear Algebra
The main features of the architecture stressed by this dwarf [1] are vector multiplication and reduction add.
7.1 Matrix-Vector Multiplication
/*
FUNCTION NAME: Matrix-vector multiplication (reserved prefix: MV_)
FILE NAME: 05_matrixVectorMultiply.v
AUTHOR: Gheorghe M. Stefan
DATE: January 05 2017

The function multiplies an NxN matrix with a vector.
Initial:
    addr[i] = M    : address of the last line in the matrix
    acc[i]  = V[i] : the vector
Final:
    acc[i] = result
EXAMPLE: for the definition below
acc = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
addr = 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
vect [0] = x x x x x x x x x x x x x x x x
vect [1] = x x x x x x x x x x x x x x x x
vect [2] = x x x x x x x x x x x x x x x x
vect [3] = x x x x x x x x x x x x x x x x
vect [4] = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
vect [5] = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
vect [6] = 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
vect [7] = 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
vect [8] = 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
vect [9] = 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
vect [10] = 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
vect [11] = 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
vect [12] = 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
vect [13] = 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
vect [14] = 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
vect [15] = 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
vect [16] = 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
vect [17] = 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13
vect [18] = 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14
vect [19] = 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
then, finally, the following changes are produced:
acc = 39 52 65 78 91 104 117 130 143 156 169 182 195 0 0 0
addr = 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
vect [0] = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
DEFINITIONS
Parameters:
    define MV_N 13        // matrix edge size
    define MV_W 0         // working space: to save the vector
    define MV_S (x/2 - 2) // latency size
Labels:
    define MV_M 1         // main loop label
    define MV_L 2         // latency loop label
*/
cNOP;                     STORE(MV_W);  // mem[i][W] <= acc[i] = V[i]
cVLOAD(MV_N);             RLOAD(0);     // acc <= N; acc[i] <= last matrix line
cVSUB(1);                 MULT(MV_W);   // acc <= N - 1; acc[i] <= acc[i] x mem[i][W]
LB(MV_M); cCPUSHL(0);     RILOAD(255);  // push redSum; acc[i] <= previous line
cBRNZDEC(MV_M);           MULT(MV_W);   // loop control; acc[i] <= acc[i] x mem[i][W]
cVLOAD(MV_S);             NOP;          // load for latency loop: x/2 - 2
LB(MV_L); cBRNZDEC(MV_L); NOP;          // latency loop
cNOP;                     SRLOAD;       // load result in acc[i]
The execution time is: 2n + 4 + 0.5x. In our example, there are only 2 latency steps because x = 4 (the simulation is for
an array with 2^x = 16 cells).
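Functionally, each matrix line is multiplied component-wise with the vector and reduced to one result component. The sketch below reproduces the header example under the assumption, visible in the listing above, that the matrix rows are the constant rows stored in vect[7]..vect[19] and the vector is all ones; names are illustrative.

```python
# Functional model of the matrix-vector multiplication: one reduction
# per matrix row, as produced by the MULT / cCPUSHL loop above.
def matrix_vector(M, V):
    return [sum(m * v for m, v in zip(row, V)) for row in M]

N = 13
M = [[r] * N for r in range(3, 16)]  # constant rows 3, 4, ..., 15
V = [1] * N
result = matrix_vector(M, V)
# result == [39, 52, 65, ..., 195], matching acc in the example above
```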
7.2 Matrix Transpose
/*
FUNCTION NAME: Matrix transpose (reserved prefix: MT_)
FILE NAME: 05_matrixTranspose.v
AUTHOR: Gheorghe M. Stefan
DATE: January 05 2017

M:  the matrix to be transposed, stored starting from the address MT_S
MT: the transposed matrix, stored starting from the address MT_D
N:  the size of the square matrix, named MT_N

Example: for the definitions below, if the initial state is:
vect [16] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [17] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [18] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [19] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [20] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [21] = x x x x x x x x x x x x x x x x
then the final state is:
vect [0] = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0
vect [1] = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0
vect [2] = 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0
vect [3] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [4] = 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11
vect [5] = x x x x x x x x x x x x x x x x
...
vect [16] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [17] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [18] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [19] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [20] = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
vect [21] = x x x x x x x x x x x x x x x x
...
vect [32] = 0 0 0 0 0 5 5 5 5 5 10 10 10 10 10 15
vect [33] = 1 1 1 1 1 6 6 6 6 6 11 11 11 11 11 0
vect [34] = 2 2 2 2 2 7 7 7 7 7 12 12 12 12 12 0
vect [35] = 3 3 3 3 3 8 8 8 8 8 13 13 13 13 13 0
vect [36] = 4 4 4 4 4 9 9 9 9 9 14 14 14 14 14 0
vect [37] = x x x x x x x x x x x x x x x x
The work space used by the function is: vectMem[0], ..., vectMem[4]
DEFINITIONS
Parameters:
    define MT_N 5  // matrix edge size
    define MT_S 16 // source address in vector memory
    define MT_D 32 // destination address in vector memory
Labels:
    define MT_M 1  // main loop label
    define MT_L 2  // left shift loop label
    define MT_R 3  // right shift loop label
*/
cNOP;           IXLOAD;         // acc[i] <= index
cNOP;           VDIV(MT_N);     // acc[i] <= index/N
cNOP;           VMULT(MT_N);    // acc <= N(index/N) in integers
cNOP;           STORE(0);       // mem[0][i] <= acc[i]
cVLOAD(MT_N);   IXLOAD;         // acc[i] <= index
cVSUB(1);       SUB(0);         // acc <= acc - 1; acc[i] <= index - N(index/N) = ixModN
cSTORE(1);      STORE(0);       // mem[5] <= size - 1 = cycles; mem[0][i] <= ixModN[i]
// mem[1][i] <= sAddr[i] = (ixModN[i] - cycles) modN
cNOP;           CSUB;           // acc <= ixModN - cycles
cNOP;           WHERECARRY;     // acc <= N; select where carry
cNOP;           VADD(MT_N);     // acc[i] <= acc[i] + acc
cNOP;           ENDWHERE;       // reselect all cells
cNOP;           STORE(1);       // store at mem[1][i]
// mem[2][i] <= dAddr[i] = (ixModN[i] + cycles) modN
cNOP;           LOAD(0);        // acc <= cycles; acc[i] <= ixModN[i]
cNOP;           CADD;           // acc <= N; acc[i] <= ixModN[i] + cycles
cNOP;           VCOMPARE(MT_N); // compare with N (acc - N)
cNOP;           WHERENCARRY;    // select where not carry
cNOP;           VSUB(MT_N);     // acc[i] <= acc - N
cNOP;           ENDWHERE;       // reselect all cells
cNOP;           STORE(2);       // store at mem[2][i]
// read on diagonal
LB(MT_M);
cVLOAD(MT_S);   LOAD(1);        // load source; load (ixModN[i] - cycles) modN
cNOP;           ADDRLD;         // addr[i] <= (ixModN[i] - cycles) modN
cLOAD(1);       CRLOAD;         // acc[i] <= mem[i][S + (ixModN[i] - cycles) modN]
// local, modN rotate with cycles
cVSUB(1);       STORE(3);       // save the diagonal (a register should be good)
LB(MT_L);
cBRNZDEC(MT_L); GLSHIFT;        // global left shift, cycles times
cNOP;           STORE(4);       // save the left shifted diagonal
// write on diagonal
cNOP;           LOAD(2);        // load dest; load (ixModN[i] + cycles) modN
cNOP;           ADDRLD;         // addr[i] <= (ixModN[i] + cycles) modN
cVLOAD(MT_N);   LOAD(4);        // reload the shifted diagonal
cSUB(1);        RSTORE(MT_D);   // mem[i][D + (ixModN[i] + cycles) modN] <= acc[i]
cVSUB(1);       LOAD(3);        // reload the diagonal
LB(MT_R);
cBRNZDEC(MT_R); GRSHIFT;        // global right shift, N - cycles times
cVLOAD(MT_N);   STORE(4);       // save the right shifted diagonal
cSUB(1);        LOAD(0);        // acc <= cycles; acc[i] <= ixModN[i]
cNOP;           CCOMPARE;       // compare ixModN[i] with cycles
cNOP;           WHERENCARRY;    // where not carry
cVLOAD(MT_D);   LOAD(4);        // restore the right shifted diagonal
cNOP;           CRSTORE;        // mem[i][D + (ixModN[i] + cycles) modN] <= acc[i]
cVLOAD(MT_N);   ENDWHERE;       // acc <= N; reselect all cells
// increment source diagonal
cVSUB(1);       LOAD(1);        // acc <= N - 1; load source diagonal addresses
cNOP;           CSUB;           // acc[i] <= (ixModN[i] + cycles) modN - (N - 1)
cVLOAD(MT_N);   WHERENZERO;     // select where not zero
cNOP;           CADD;           // acc[i] <= acc[i] + acc
cNOP;           ENDWHERE;       // reselect all cells
cNOP;           STORE(1);       // store back sAddr[i]
// decrement dest diagonal
cVSUB(1);       LOAD(2);        // acc <= N - 1; acc[i] <= dAddr[i] = (ixModN[i] + cycles) modN
cNOP;           WHEREZERO;      // select where zero
cNOP;           CLOAD;          // where 0: acc[i] <= N - 1
cLOAD(1);       ELSEWHERE;      // select where not zero
cVSUB(1);       VSUB(1);        // acc <= cycles; acc[i] <= acc[i] - 1
cSTORE(1);      ENDWHERE;       // acc <= acc - 1; reselect all cells
cBRNZ(MT_M);    STORE(2);       // mem[5] <= cycles; store back dAddr[i]
// move the diagonal
cVLOAD(MT_S);   LOAD(0);        // acc <= S; acc[i] <= ixModN[i]
cNOP;           ADDRLD;         // addr[i] <= ixModN[i]
cVLOAD(MT_D);   CRLOAD;         // acc <= D; acc[i] <= mem[i][S + ixModN[i]]
cNOP;           CRSTORE;        // mem[i][D + ixModN[i]] <= acc[i]
The execution time is: T(N) = N^2 + 29N - 7.
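The end result of the diagonal-rotation scheme is an ordinary transpose of the leading NxN block, moved from vect[MT_S] to vect[MT_D]. A functional sketch, reproducing the header example where every source row holds the index vector (so only its first N columns matter):

```python
# Functional model of what the matrix transpose code computes: element
# (r, c) of the source becomes element (c, r) of the destination.
N = 5
M = [[c for c in range(N)] for _ in range(N)]        # vect[16]..vect[20]
MT = [[M[r][c] for r in range(N)] for c in range(N)] # vect[32]..vect[36]
# MT[0] == [0, 0, 0, 0, 0], MT[1] == [1, 1, 1, 1, 1], ..., as above
```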
7.3 Matrix-Matrix Multiplication
/*
FUNCTION NAME: Matrix-matrix multiplication (reserved prefix: MM_)
FILE NAME: 05_matrixMatrixMultiply.v
AUTHOR: Gheorghe M. Stefan
DATE: January 07 2017

The parameters for the function are set in the controller's data memory in four successive
locations starting with 26. Recommended cell activation & parameter initialization
sequence:

EXAMPLE: if the initial state of the vector memory is:
vect [0] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
...
vect [16] = 1 2 3 4 5 6 7 8 9 10 11 x x x x x
vect [17] = 2 3 4 5 6 7 8 9 10 11 12 x x x x x
vect [18] = 3 4 5 6 7 8 9 10 11 12 13 x x x x x
vect [19] = 4 5 6 7 8 9 10 11 12 13 14 x x x x x
vect [20] = 5 6 7 8 9 10 11 12 13 14 15 x x x x x
vect [21] = 6 7 8 9 10 11 12 13 14 15 16 x x x x x
vect [22] = 7 8 9 10 11 12 13 14 15 16 17 x x x x x
vect [23] = 8 9 10 11 12 13 14 15 16 17 18 x x x x x
vect [24] = 9 10 11 12 13 14 15 16 17 18 19 x x x x x
vect [25] = 10 11 12 13 14 15 16 17 18 19 20 x x x x x
vect [26] = 11 12 13 14 15 16 17 18 19 20 21 x x x x x
...
vect [48] = 0 0 0 0 0 0 0 0 0 0 0 x x x x x
vect [49] = 1 1 1 1 1 1 1 1 1 1 1 x x x x x
vect [50] = 2 2 2 2 2 2 2 2 2 2 2 x x x x x
vect [51] = 3 3 3 3 3 3 3 3 3 3 3 x x x x x
vect [52] = 4 4 4 4 4 4 4 4 4 4 4 x x x x x
vect [53] = 5 5 5 5 5 5 5 5 5 5 5 x x x x x
vect [54] = 6 6 6 6 6 6 6 6 6 6 6 x x x x x
vect [55] = 7 7 7 7 7 7 7 7 7 7 7 x x x x x
vect [56] = 8 8 8 8 8 8 8 8 8 8 8 x x x x x
vect [57] = 9 9 9 9 9 9 9 9 9 9 9 x x x x x
vect [58] = 10 10 10 10 10 10 10 10 10 10 10 x x x x x
...
then the final state is:
vect [0] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
vect [1] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
vect [2] = 0 1 2 3 4 5 6 7 8 9 10 x x x x x
vect [3] = 10 0 1 2 3 4 5 6 7 8 9 x x x x x
vect [4] = 0 0 0 0 0 0 0 0 0 0 10 x x x x x
...
vect [16] = 1 2 3 4 5 6 7 8 9 10 11 x x x x x
vect [17] = 2 3 4 5 6 7 8 9 10 11 12 x x x x x
vect [18] = 3 4 5 6 7 8 9 10 11 12 13 x x x x x
vect [19] = 4 5 6 7 8 9 10 11 12 13 14 x x x x x
vect [20] = 5 6 7 8 9 10 11 12 13 14 15 x x x x x
vect [21] = 6 7 8 9 10 11 12 13 14 15 16 x x x x x
vect [22] = 7 8 9 10 11 12 13 14 15 16 17 x x x x x
vect [23] = 8 9 10 11 12 13 14 15 16 17 18 x x x x x
vect [24] = 9 10 11 12 13 14 15 16 17 18 19 x x x x x
vect [25] = 10 11 12 13 14 15 16 17 18 19 20 x x x x x
vect [26] = 11 12 13 14 15 16 17 18 19 20 21 x x x x x
...
vect [32] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [33] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [34] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [35] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [36] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [37] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [38] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [39] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [40] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [41] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
vect [42] = 440 495 550 605 660 715 770 825 880 935 990 x x x x x
...
vect [48] = 0 0 0 0 0 0 0 0 0 0 0 x x x x x
vect [49] = 1 1 1 1 1 1 1 1 1 1 1 x x x x x
vect [50] = 2 2 2 2 2 2 2 2 2 2 2 x x x x x
vect [51] = 3 3 3 3 3 3 3 3 3 3 3 x x x x x
vect [52] = 4 4 4 4 4 4 4 4 4 4 4 x x x x x
vect [53] = 5 5 5 5 5 5 5 5 5 5 5 x x x x x
vect [54] = 6 6 6 6 6 6 6 6 6 6 6 x x x x x
vect [55] = 7 7 7 7 7 7 7 7 7 7 7 x x x x x
vect [56] = 8 8 8 8 8 8 8 8 8 8 8 x x x x x
vect [57] = 9 9 9 9 9 9 9 9 9 9 9 x x x x x
vect [58] = 10 10 10 10 10 10 10 10 10 10 10 x x x x x
vect [59] = x x x x x x x x x x x x x x x x
DEFINITIONS:
Parameters:
define N  11        // matrix edge size
define M1 16        // first matrix address
define M2 48        // second matrix address
define MT 32        // transposed matrix address
define MR 32        // result matrix address
Label:
define MM 0         // matrix multiply loop label
Parameters for matrix transpose:
define S  M2        // source address in vector memory
define D  MT        // destination address in vector memory
Labels for matrix transpose:
define TL 1         // transpose loop label
define LS 2         // left shift loop label
define RS 3         // right shift loop label
Parameters for matrix vector multiply:
define M  (M1+N-1)  // address of the LAST line in matrix
define W  0         // working space: to save vector
define L  (x/2-2)   // latency size
Labels for matrix vector multiply:
define MV 4         // loop label
define LL 5         // latency loop label
*/
include 05matrixTranspose.v
include 05matrixVectorMultiply.v
8 Sparse Linear Algebra
Two kinds of sparse matrices are investigated: band matrices and randomly structured sparse matrices.
The algorithms for sparse matrices presented in this section are designed only for the limited case N ≤ P.
8.2 Band Matrix Operations
8.2.1 Band Matrix Vector Multiplication
/*
FUNCTION NAME: Band matrix vector multiplication
FILE NAME: 05BdMV.v
AUTHOR: Gheorghe M. Stefan
DATE: January 7, 2017
where: v != 0
THE ALGORITHM
========================================================================
z = md
vector[rva] <= <0 0 ... 0>
for i = 0; i < bw; i = i + 1;
    z <= z - 1
    if (!(z < 0))
        vector[rva] <= vector[rva] + (vector[fda+i] * vector[va]) << z
    else
        vector[rva] <= vector[rva] + (vector[fda+i] * vector[va]) >> |z|
========================================================================
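The diagonal-shifting scheme above can be sketched as a host-side Python model (a sketch, not accelerator code; the function name and list-based storage of the diagonals are illustrative assumptions):

```python
def band_matvec(diags, v, md):
    """Band matrix-vector product computed from the stored diagonals.

    diags[i] is the i-th stored diagonal (top-most first), padded with
    zeros to the length of v; md is the main diagonal position, so
    diagonal i sits at offset md-1-i from the main diagonal.
    """
    n = len(v)
    r = [0] * n
    z = md
    for d in diags:
        z -= 1                                  # offset of this diagonal
        prod = [d[j] * v[j] for j in range(n)]  # element-wise product
        if z >= 0:                              # super-diagonal: shift left by z
            shifted = prod[z:] + [0] * z
        else:                                   # sub-diagonal: shift right by |z|
            shifted = [0] * (-z) + prod[:z]
        r = [r[j] + shifted[j] for j in range(n)]
    return r
```

With the four diagonals of the 8x8 example in this section and v = (1 ... 8), this reproduces the result vector 20 30 40 50 60 70 44 23.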
DEFINITIONS:
Parameters:
define N 8   // matrix edge size
define R 4   // result vector address
define F 8   // first diagonal address
define W 4   // number of diagonals
define M 3   // main diagonal position
define V 6   // vector address
Labels:
define ML 0  // main loop label
define LS 1  // left shift label
define RS 2  // right shift label
define SK 3  // skip label
EXAMPLE: consider the operation performed with the previously defined parameters and labels:
|2 3 4 0 0 0 0 0 | |1| |20|
|1 2 3 4 0 0 0 0 | |2| |30|
|0 1 2 3 4 0 0 0 | |3| |40|
|0 0 1 2 3 4 0 0 | |4| |50|
|0 0 0 1 2 3 4 0 | X |5| = |60|
|0 0 0 0 1 2 3 4 | |6| |70|
|0 0 0 0 0 1 2 3 | |7| |44|
|0 0 0 0 0 0 1 2 | |8| |23|
With the previous initialization, the data is represented in the vector memory as follows:
vect [6] = 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0
vect [7] = x x x x x x x x x x x x x x x x
vect [8] = 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0
vect [9] = 0 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0
vect [10] = 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0
vect [11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
The final content of the vector memory is:
vect [0] = 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0
vect [1] = 20 30 40 50 60 70 44 23 0 0 0 0 0 0 0 0
vect [2] = x x x x x x x x x x x x x x x x
vect [3] = x x x x x x x x x x x x x x x x
vect [4] = 20 30 40 50 60 70 44 23 0 0 0 0 0 0 0 0
vect [5] = x x x x x x x x x x x x x x x x
vect [6] = 1 2 3 4 5 6 7 8 0 0 0 0 0 0 0 0
vect [7] = x x x x x x x x x x x x x x x x
vect [8] = 0 0 4 4 4 4 4 4 0 0 0 0 0 0 0 0
vect [9] = 0 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0
vect [10] = 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0
vect [11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
*/
cVLOAD(V);   ACTIVATE;   // acc <= va; activate all cells
cVLOAD(W);   CALOAD;     // acc <= bw; acc[i] <= mem[va]
cSTORE(0);   STORE(0);   // mem[0] <= bw; mem[0][i] <= v
cVLOAD(F);   VLOAD(0);   // acc <= fda; acc[i] <= 0
cVSUB(1);    STORE(1);   // acc <= fda-1; mem[1][i] <= 0
cVLOAD(M);   CLOAD;      // acc <= md; acc[i] <= fda-1
cSTORE(1);   ADDRLD;     // mem[1] <= md; addr[i] <= fda-1
LB(ML); cLOAD(1); NOP;   // acc <= mem[1]
cVSUB(1);    RILOAD(1);  // acc <= acc-1; acc[i] <= mem[addr+1]
cSTORE(1);   MULT(0);    // mem[1] <= acc; acc[i] <= acc[i] * mem[0][i]
cBRSGN(RS);  NOP;        // if acc[n-1] jmp (32) (2nd step of float mult)
cBRZDEC(SK); NOP;        // if acc=0 jmp 33
LB(LS); cBRNZDEC(LS); GLSHIFT; // if acc!=0 jmp (33); acc[i] <= acc[i+1]
cJMP(SK);    NOP;
Evaluation: T_BdMVmax(W) = 0.5W^2 + 9.5W + 8. For floating point operations the execution time is the same. There are
NOPs reserved for the second and third steps of the floating point operations. See parentheses like (2nd step of float
mult) in the comments of the program.
8.3 Random Sparse Matrix Operations
8.3.1 Sparse Matrix Transpose
/*
FUNCTION NAME: Sparse matrix transpose
AUTHOR: Gheorghe M. Stefan
DATE: Oct. 30, 2016
w: (......)  // working vector
x, y < N
INITIALIZATION is done by the following sequence in the controller's side of the code:
EXAMPLE:
| 8 0 0 7 | | 8 0 4 0 |
| 0 6 0 5 | | 0 6 0 2 |
| 4 0 3 0 | | 0 0 3 0 |
| 0 2 0 1 |T = | 7 5 0 1 |
The algorithm is embarrassingly simple:
the line vector is swapped with the column vector.
If the data memory of the controller is initialized by the sequence:
cVLOAD(8);
cSTORE(25);   // mem[25] <= 8 = fmv
cVLOAD(14);
cSTORE(27);   // mem[27] <= 14 = wm
vect [8] = 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0
vect [9] = 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4
vect [10] = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4
vect [14] = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4
*/
cLOAD(25);  NOP;     // acc <= l = address of line vector
cLOAD(27);  CALOAD;  // acc <= w = address of working vector; acc[i] <= mem[l][i]
cLOAD(25);  CSTORE;  // acc <= l; mem[w] <= ml
cVADD(1);   NOP;     // acc <= l+1 = c: address of column vector
cVSUB(1);   CALOAD;  // acc <= l; acc[i] <= mem[c][i]
cLOAD(27);  CSTORE;  // acc <= w; mem[l][i] <= acc[i] = mem[c][i]
cLOAD(25);  CALOAD;  // acc <= l; acc[i] <= ml
cVADD(1);   NOP;     // acc <= c;
cNOP;       CSTORE;  // mem[c][i] <= ml
//
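In the (value, line, column) representation used here, the swap can be modeled in a few lines of Python; `to_dense` is only a hypothetical checking helper, not part of the accelerator program:

```python
def coo_transpose(values, lines, cols):
    # Transposing a matrix kept in (value, line, column) form only
    # exchanges the line-index and column-index vectors.
    return values, cols, lines

def to_dense(values, lines, cols, n):
    # Checking helper: expand the sparse form into an n x n matrix.
    a = [[0] * n for _ in range(n)]
    for val, l, c in zip(values, lines, cols):
        a[l][c] = val
    return a
```

Applied to the 4x4 example above (values 8 7 6 5 4 3 2 1, lines 0 0 1 1 2 2 3 3, columns 0 3 1 3 0 2 1 3), the swapped form expands to the transposed matrix.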
8.3.2 Sparse Matrix Vector Multiplication
/*
FUNCTION NAME: Sparse matrix vector multiplication
AUTHOR: Gheorghe M. Stefan
DATE: Nov. 10, 2016
The function multiplies in w a sparse NxN matrix, A, with a dense vector, v, stored in
the vector memory. The matrix is represented by three sequences in vectors of P elements
(P: number of cells), as follows:
x, y < N
w: result register
z: working register
sv: serial vector implemented in hardware, distributed along the cells
THE ALGORITHM
==========================
sr <= v
for i = 0; i < N; i = i + 1;
    select (where cs == i)
    z <= sr[0]
    sr <= sr << 1
z <= z * vs
for i = N-1; i >= 0; i = i - 1;
    select (where ls == i)
    sr <= {redAdd(z), sr}
w <= sr
==========================
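The three phases of the algorithm (distribute the vector components by column index, multiply, reduce by line index) can be modeled sequentially in Python; this is a sketch, and the parameter names mirror the value/line/column sequences of the representation:

```python
def sparse_matvec(vs, ls, cs, v, n):
    """Sparse matrix-vector product; the matrix is given as the value
    sequence vs, the line sequence ls and the column sequence cs."""
    # distribute & multiply: each non-zero picks its vector component
    prods = [vs[k] * v[cs[k]] for k in range(len(vs))]
    # add lines: reduction-add over the products of each line
    w = [0] * n
    for k in range(len(vs)):
        w[ls[k]] += prods[k]
    return w
```

For the example in this section the intermediate products are 32 7 18 5 16 6 1 (the values visible in vect[0] of the final memory state) and the result is w = (39, 23, 16, 7).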
INITIALIZATION is done by the following sequence in the controller's side of the code:
cVLOAD(4);
cSTORE(22);   // mem[22] <= N; matrix/vector size
cVLOAD(15);
cSTORE(23);   // mem[23] <= wsa; result vector (ws) address
cVLOAD(8);
cSTORE(24);   // mem[24] <= vsa; value sequence (vs) address
cVLOAD(9);
cSTORE(25);   // mem[25] <= lsa; line sequence (ls) address
cVLOAD(10);
cSTORE(26);   // mem[26] <= csa; column sequence (cs) address
cVLOAD(11);
cSTORE(27);   // mem[27] <= va; vector (v) address
EXAMPLE: the initial data is
A = |8 0 0 7| v = |4|
|0 6 0 5| |3|
|4 0 0 0| |2|
|0 2 0 1| |1|
then, the initial content of the vector memory is:
vect [8] = 8 7 6 5 4 2 1 0 0 0 0 0 0 0 0 0
vect [9] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
vect [10] = 0 3 1 3 0 1 3 4 4 4 4 4 4 4 4 4
vect [11] = 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0
The final content of the vector memory (after 26 clock cycles):
vect [0] = 32 7 18 5 16 6 1 x x x x x x x x x
vect [1] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
...
vect [8] = 8 7 6 5 4 2 1 0 0 0 0 0 0 0 0 0
vect [9] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
vect [10] = 0 3 1 3 0 1 3 4 4 4 4 4 4 4 4 4
vect [11] = 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0
...
vect [15] = 39 23 16 7 0 0 0 0 0 0 0 0 0 0 0 0
which corresponds to: w = |39|
                          |23|
                          |16|
                          | 7|
*/
cLOAD(27);   ACTIVATE;  // acc <= va; activate all cells
cNOP;        CALOAD;    // acc[i] <= mem[va][i] = v[i]
cLOAD(26);   SRSTORE;   // acc <= csa; sr[i] <= v[i]
cVLOAD(0);   CALOAD;    // acc <= 0; acc[i] <= mem[csa][i] = cs[i]
cNOP;        STORE(1);  // mem[1][i] <= cs[i]
// DISTRIBUTE VECTOR COMPONENTS
LB(26); cVADD(1); SEARCH; // acc <= acc + 1; select column acc
cCSEND(4);   CLOAD;     // coOp = sr[0]; acc[i] <= sr[0]
cVPUSHR(0);  STORE(0);  // pop sr[0]; mem[0][i] <= acc[i] = v[j]
cSKIPEQ(22); ACTIVATE;  // skip if (mem[22] = N); activate all cells
cJMP(26);    LOAD(1);   // jump to LB(26); acc[i] <= cs[i]
// MULTIPLY
cLOAD(24);   LOAD(0);   // acc <= vsa; acc[i] <= mem[0][i] = multiplier
cNOP;        CAMULT;    // acc[i] <= acc[i] * vs
cLOAD(25);   STORE(0);  // acc <= lsa; mem[0][i] <= products
// ADD LINES
cLOAD(22);   CALOAD;    // acc <= N; acc[i] <= ls[i]
cVSUB(1);    STORE(1);  // acc <= N-1; mem[1][i] <= line indexes
cNOP;        SRCALL;    // acc <= N-2; search N-1 in ls
8.3.3 Sparse Matrices Multiplication
/*
FUNCTION NAME: Sparse matrix multiplication
AUTHOR: Gheorghe M. Stefan
DATE: Oct. 31, 2016
wm: (0 0 ... 0 0 0 ...)  // working matrix with the shape of sm
x, y, z, w < N
THE ALGORITHM
=================================================
initialize wm to zero
do
    select first non-zero column in sm
    take column index: c
    do
        select first non-zero scalar in column c
        take value v
        remove first scalar
        take line index: l
        select column l in fm
        multiply in wm column l in fm with v
    loop until (no scalar in column)
    do
        select first non-empty line in wm
        take index: l
        compute redAdd: r
        store (r, l, c) in rm
        remove line l in wm
    loop until (no non-zero line in wm)
    clear first non-zero column in sm
loop until (no non-zero column in sm)
=================================================
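The column-at-a-time scheme above can be modeled sequentially in Python; both operands are assumed to be in the (value, line, column) form of the example that follows, and the function name is illustrative:

```python
def sparse_matmul(fv, fl, fc, sv, sl, sc, n):
    """Product of two sparse matrices, fm and sm, both given in
    (value, line, column) form; returns rm in the same form."""
    rv, rl, rc = [], [], []
    for c in range(n):                   # columns of the second matrix (sm)
        wm = [0] * n                     # working column, initialized to zero
        for k in range(len(sv)):
            if sc[k] == c:               # scalar (sv[k], sl[k], c) of sm
                l, v = sl[k], sv[k]
                for j in range(len(fv)): # multiply column l of fm with v
                    if fc[j] == l:
                        wm[fl[j]] += fv[j] * v
        for l in range(n):               # redAdd done; store non-zero lines in rm
            if wm[l] != 0:
                rv.append(wm[l]); rl.append(l); rc.append(c)
    return rv, rl, rc
```

For the two 4x4 matrices of the example below, the returned sequences match the result vectors vect[15], vect[16] and vect[17] of the final memory state.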
INITIALIZATION is done by the following sequence in the controller's side of the code:
cVLOAD(wm);   // acc <= wm = address of the working vector
cSTORE(27);   // mem[27] <= wm
cVLOAD(rmv);  // acc <= rmv = address of the result matrix value vector
cSTORE(23);   // mem[23] <= rmv
EXAMPLE:
| 8 0 0 7 | | 1 0 1 0 | | 8 7 8 7 |
| 0 6 0 5 | | 1 1 0 0 | | 6 11 0 5 |
| 4 0 3 0 | | 0 0 1 0 | | 4 0 7 0 |
| 0 2 0 1 | X | 0 1 0 1 | = | 2 3 0 1 |
If the data memory of the controller is initialized by the sequence:
cVLOAD(4);
cSTORE(22);   // mem[22] <= 4 = N
cVLOAD(8);
cSTORE(25);   // mem[25] <= 8 = fmv
cVLOAD(13);
cSTORE(26);   // mem[26] <= 13 = smc
cVLOAD(14);
cSTORE(27);   // mem[27] <= 14 = wm
cVLOAD(15);
cSTORE(23);   // mem[23] <= 15 = rmv
first matrix:
vect [8] = 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0
vect [9] = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4
vect [10] = 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4
second matrix:
vect [11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
vect [12] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
vect [13] = 0 2 0 1 2 1 3 4 4 4 4 4 4 4 4 4
working vector:
vect [14] = x x x x x x x x x x x x x x x x
space reserved for result:
vect [15] = x x x x x x x x x x x x x x x x
vect [16] = x x x x x x x x x x x x x x x x
vect [17] = x x x x x x x x x x x x x x x x
The final content of the vector memory is:
vect [8] = 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0
vect [9] = 0 0 1 1 2 2 3 3 4 4 4 4 4 4 4 4
vect [10] = 0 3 1 3 0 2 1 3 4 4 4 4 4 4 4 4
vect [11] = 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
vect [12] = 0 0 1 1 2 3 3 4 4 4 4 4 4 4 4 4
vect [13] = 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
vect [14] = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
vect [15] = 8 6 4 2 7 11 3 8 7 7 5 1 x x x x
vect [16] = 0 1 2 3 0 1 3 0 2 0 1 3 4 4 4 4
vect [17] = 0 0 0 0 1 1 1 2 2 3 3 3 4 4 4 4
*/
cLOAD(27);  VLOAD(0);
cLOAD(22);  CSTORE;   // wm <= 0
cLOAD(23);  CLOAD;    // acc[i] <= N
cVADD(1);   NOP;
cVADD(1);   CSTORE;   // rml <= N
cNOP;       CSTORE;   // rmc <= N
cLOAD(26);  CSTORE;   // mem[14] <= wm
cNOP;       CALOAD;   // acc[i] <= smc
cNOP;       NOP;      // latency
cLOAD(22);  NOP;         // acc <= N
cLOAD(3);   SEARCH;      // acc <= r; select free space in rm
cNOP;       NOP;         // latency for first
cNOP;       WHEREFIRST;  //
cLOAD(23);  CINSERT;     // acc[first] <= r
cLOAD(2);   CSTORE;      // acc <= l; mem[15][first] <= r
cLOAD(23);  CINSERT;     // acc[first] <= l
cVADD(1);   NOP;         //
cLOAD(1);   CSTORE;      // acc <= c; mem[16][first] <= l
cLOAD(23);  CINSERT;     // acc[first] <= c
cVADD(2);   NOP;
cNOP;       CSTORE;      // mem[17][first] <= c
cLOAD(27);  ACTIVATE;    // all cells activated
cNOP;       CALOAD;      //
cNOP;       NOP;         // latency
cLOAD(27);  NOP;         // latency
cNOP;       CALOAD;      // latency
// cNOP;    NOP;         // latencies
cCLOAD(0);  NOP;
cBRNZ(25);  NOP;
cLOAD(26);  NOP;
cLOAD(22);  CALOAD;
cNOP;       CSUB;
// cNOP;    NOP;         // latencies
cNOP;       NOP;
cLOAD(26);  NOP;         // latency
cNOP;       NOP;         // latency
cCLOAD(0);  CALOAD;
cBRNZ(23);  NOP;         // branch; latency
//
9 Graphs
A possible weakness of the pRISC circuit is the option for the simplest interconnection network between the cells in the
MAP section. In the worst case, the distance from one cell to another is in O(log P) (the depth of the reduction
network). The advantage is the small size of the pRISC circuit (S_pRISC ∈ O(P)) and the small inter-connectivity compared,
for example, with a hyper-cube interconnection organization. The ninth computational motif, graph traversal, is used to
prove that, despite the simplicity of the interconnection network, the pRISC-based hybrid computing version achieves the
same performance as the hyper-cube version of parallel engines, according to [5], whose size is in O(P log P).
9.1 Minimum Spanning Tree
The evaluation used Prim's algorithm for computing the minimum spanning tree, MST, of a graph with N vertices.
The main functions of pRISC involved in providing an efficient algorithm are: vector-to-scalar functions (reduction-minim,
reduction-add) and spatial control functions (WHERE[COND], SEARCH, WHEREFIRST). The evaluation program provides for
dense graphs:
T_MSTDense = (N - 1)(20 + log2(P)) ∈ O(N log P)
while for sparse graphs:
T_MSTSparse = 2(N - 1) log2(P) + 31N - 24 ∈ O(N log P)
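A host-side Python sketch of Prim's algorithm shows where the vector-wide update and the reduction-minim occur; the dense adjacency matrix uses inf for missing edges, and this models the algorithm, not the accelerator program itself:

```python
import math

def prim_mst(adj):
    """Prim's MST over a dense adjacency matrix (math.inf = no edge).
    Returns the total tree weight and the parent of each vertex."""
    n = len(adj)
    dist = [math.inf] * n          # cheapest edge to the growing tree
    parent = [-1] * n
    in_tree = [False] * n
    dist[0] = 0
    total = 0
    for _ in range(n):
        # reduction-minim over the vertices not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: dist[i])
        in_tree[u] = True
        total += dist[u]
        for w in range(n):         # vector-wide update, one cell per vertex
            if not in_tree[w] and adj[u][w] < dist[w]:
                dist[w] = adj[u][w]
                parent[w] = u
    return total, parent
```

Each of the N iterations performs one O(log P) reduction and one O(1) vector update, which is where the (N - 1)(20 + log2(P)) dense-graph time above comes from.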
9.2 All-Pairs Shortest Path
The N x N adjacency matrix A of graph G is used to compute the matrix of the shortest paths in G, A*, using the modified
matrix multiplication, X ⊙ Y. If X and Y are matrices, then computing X ⊙ Y means to substitute, in the matrix
multiplication algorithm, the scalar multiplication with addition and the reduction sum with reduction minim. The
algorithm is A* = A ⊙ A ⊙ ... ⊙ A, with A appearing (N - 1) times. The algorithm is not the optimal one, but it is used
in systems which perform (modified) matrix multiplication efficiently. The time for computing A* from A is
T_APSP ∈ O(N^2 log P).
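The modified product reads, as a Python sketch (min-plus algebra; the diagonal of A is assumed 0 and inf marks a missing edge):

```python
def min_plus(x, y):
    """Modified matrix multiplication X (.) Y: the scalar
    multiplication is replaced by addition, the reduction sum by
    reduction minim."""
    n = len(x)
    return [[min(x[i][k] + y[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def apsp(a):
    """A* obtained as the modified product of N-1 copies of A."""
    r = a
    for _ in range(len(a) - 2):    # N-1 factors need N-2 products
        r = min_plus(r, a)
    return r
```

Replacing the inner `min`/`+` pair with `sum`/`*` recovers the ordinary matrix product, which is why the same accelerated kernel serves both.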
9.3 Breadth-First Search
The breadth-first search algorithm uses mainly the same specific functions as the minimum spanning tree algorithm (only,
instead of reduction-minim, reduction-maxim is used). The simulation program provides for dense graphs:
Part III
UPGRADES
Envisaged versions:
Stack-based engines instead of accumulator-based engines.
References
[1] Krste Asanovic, et al., The landscape of parallel computing research: A view from Berkeley, 2006.
See at: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
[2] John Backus, Can programming be liberated from the von Neumann style? A functional style and its algebra of
programs. Communications of the ACM 21, 8 (August) 1978. 613-641.
[3] Calin Bira, Radu Hobincu, Lucian Petrica, OPINCAA: A Light-Weight and Flexible Programming Environment For
Parallel SIMD Accelerators. Romanian Journal of Information Science and Technology, Volume 16, Number 4, 2013,
336-350.
[4] Stephen Kleene, General recursive functions of natural numbers. Mathematische Annalen 112, 5, 1936. 727-742.
[5] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing. Design and Analysis of Algorithms,
The Benjamin/Cummings Pub. Comp., Inc., 1994.
[6] Mihaela Malita, Gheorghe M. Stefan, Dominique Thiebaut, Not Multi-, but Many-Core: Designing Integral Parallel
Architectures for Embedded Computation. ACM SIGARCH Computer Architecture News, Vol. 35, No. 5, December
2007. 32-39.
[7] Mihaela Malita, and Gheorghe M. Stefan, Backus language for functional nano-devices. CAS 2011, vol. 2, 331-334.
[8] Gheorghe M. Stefan, et al., The CA1024: A fully programmable system-on-chip for cost-effective HDTV media
processing. Hot Chips: A Symposium on High Performance Chips. Memorial Auditorium, Stanford University.
[9] Gheorghe M. Stefan, One-chip TeraArchitecture. Proceedings of the 8th Applications and Principles of Information
Science Conference. Okinawa, Japan, 2009.
See at: www.dropbox.com/s/5oqncu71t7zf8es/teraArchitecture.pdf?dl=
[10] Gheorghe M. Stefan, Integral parallel architecture in system-on-chip designs. The 6th International Workshop on
Unique Chips and Systems, Atlanta, GA, USA, December 4, 2010, pp. 23-26.
[11] Gheorghe M. Stefan, Mihaela Malita, Can One-Chip Parallel Computing Be Liberated From Ad Hoc Solutions? A
Computation Model Based Approach and Its Implementation, 18th Inter. Conf. on Circuits, Systems, Communications
and Computers, Santorini, July 17-21, 2014, 582-597.
See at: www.dropbox.com/s/rtzzs1d06526jzj/COMPUTERS2-42.pdf?dl=0
[12] Gheorghe Stefan, Loops & Complexity in DIGITAL SYSTEMS. Lecture Notes on Digital Design in Giga-Gate/Chip
Era, (work in endless progress) 2016 version.
See at: www.dropbox.com/s/neooi2cca5y8lxa/0-BOOK.pdf?dl=0