Professional Documents
Culture Documents
The Application of Systolic Architectures in VLSI Design
The Application of Systolic Architectures in VLSI Design
STARS
THE APPLICATION OF SYSTOLIC
ARCHITECTURES IN VLSI DESIGN
Retrospective Theses and Dissertations
1986
STARS Citation
Manno, Steven V., "The Application of Systolic Architectures in VLSI Design" (1986). Retrospective Theses
and Dissertations. 4974. Fall Term
https://stars.library.ucf.edu/rtd/4974 1986
I
I
I
iii
LIST OF FIGURES INTRODUCTION
Figure 1. Geometric Configurations of Systolic Very large scale integrated (VLSI) circuit technology
Structures • . • • • • • • • • • • • 10
makes feasible the implementation of high-performance
Figure 2. The Systolic Stack . 13
special-purpose computational systems. These special-purpose
Figure 3. Sample Operation of the Systolic Stack • 15
devices are useful for application specific tasks and for
Figure 4. Sample Operation of the Systolic Queue • 18
off-loading general-purpose computers of time-consuming
Figure 5. Band Matrix Multiplication • • • 19
computations. Because both processing elements and memory
Figure 6. The Inner Product Step Processor • 20
elements can be implemented in VLSI, it is possible to
Figure 7. Hexagonal Array for Band Matrix
Multiplication • • • • • • • • • ...... 22 construct highly concurrent systems on a single silicon
chip.
better performance.
iv
2
arrays are applicable to a wide range of applications such The cost of designing VLSI components is much larger
as fast Fourier transform, matrix multiplication, LU than the cost of the components themselves. Design costs
decomposition, priority queues, pattern matching, and become especially significant if the component is only
3
4 5
The way to build a fast computer system is to use fast A special-purpose system often receives data and
components. The way to build sti 11 faster computer systems returns results by means of a host computer. The special-
is to use concurrency. "With current technology, tens of purpose system is implementing a compute-bound algorithm if
thousands of gates can be put in a single chip, but no gate the number of computing operations is larger than the number
is much faster than its TTL counterpart of ten years ago" of input and output elements. The I/O bandwidth between the
(Kung 1982). Since component density is increasing much host and the special-purpose hardware may limit the maximum
more quickly than component speed, improvements in system rate of operations in a compute-bound situation if the
performance require the use of concurrent processing. special-purpose machine needs to access the host's memory
for each operation. The required I/O bandwidth in compute-
Communication and Memory Locality
bound problems can be greatly reduced by pipelining.
Since VLSI is a planar technology, the interconnections
Pipelining allows the special-purpose machine to perform
between the many devices on a chip may cost more in chip
more than one operation per I/O access by storing
area and device performance than the devices themselves
intermediate results internally.
(Leiserson 1979). Restricting the interconnections between
the components on a chip to local communication between
neighboring components improves chip density and reduces
parasitic interconnect delay. The complete elimination of
global or broadcast signals from the design is the ideal
situation.
array or systolic tree. Information in a systolic system The basic principle of a systolic system is simple.
flows rhythmically between neighboring processing elements Replacing a single processing element with an array of
in a pipelined fashion. Communication between a systolic processing elements wi 11 result in a higher computation
system and the outside world occurs only through the throughput without an increase in memory bandwidth (Kung
processing elements located at the boundaries of the 1982). The goal of the systolic approach is to make
systolic array or systolic tree. exhaustive use of data brought out of memory before it is
returned. As the data from memory passes through the
The name systolic is derived from the word systole.
systolic system, it is available for use by each of the
Systole refers to the rhythmic contraction of the heart
processing elements it encounters along the way.
during which blood is forced onward through the body. The
rhythmic pumping of data through a systolic system is Computational tasks can be classified as either
analogous to the rhythmic pumping of blood through the body. compute-bound or I/O-bound. An example of a compute-bound
task is matrix-matrix multiplication, because the number of
multiplication and addition . operations is larger than the
6
CHAPTER III, APPLICATIONS OF SYSTOLIC STRUCTURES
9
10
11
TABLE 1
A Systolic Stack
The basic function of the stack is to provide memory
storage where data elements are inserted and deleted in a INSERT DELETE
occupied memory cells containing elements ordered such that operation, neighboring processors may exchange the elements
the element ready for deletion was the last element inserted contained in their memory cells among each other. Elements
into the stack. are exchanged according to the following two rules:
c c
operations respectively. Note that the number of cycles
14 I '3 I2 I I1 I I I 14 I '3 I I2 I1 I
needed to ready the stack is independent of the length of c D l
Is 14 I 13 I2 I I1 I I 14 I 13 I I2 I I I I 1
the stack. I t c
Is I 14 '3 I I I 2 1 I 14 I 13 I I2 I I I 1
c c
Is 14 I '3 I I I I
2 1 14 I hi I I
2 I1 I
c c
Is I 14 '3 I I I I 2 1 14 I '3 I 12 I I I
1
c c
Is Is I 14 13 I 12 , , I 14 I 13 I I2 I I I 1
I t c
Is I Is 14 I 13 12 I ,, I 14 I 13 I I2 I I I 1
c
Is Is I 14 13 I I2 I1 I
c
Is I Is 14 I 13 12 I I I
c
1
Note that the second and third rules are identical to the
The Stack Becomes a Priority Queue
rules for the stack. The systolic queue will be ready for
The basic function of the priority queue is to provide
an insert or delete operation if the previous insert or
a way for records to be inserted unordered into a set and to
delete operation is followed by four cycle operations.
be retrieved from that set ordered. · The records are ordered
by key such that the first record retrieved has the smallest Figure 4 shows the operation of the systolic priority
key. queue as it responds to a sample input sequence. The
numerals indicate the key value of the records in the queue,
The queue is constructed in the same manner as the
and the letters I, D, and C represent insert, delete, and
stack with the exception of a few modifications. In the
cycle operations respectively. As with the systolic stack,
queue, the memory cells store records with a field for a key
the number of cycles needed to ready the queue is
value rather than data elements as in the stack. The
independent of the length of the queue.
processors of the queue must follow an additional rule
during a cycle operation.
c c
~,,~,~l~s~,3~,-4~1-19-,-s-,-----
I I1 13 I
c c
I I1 I '3 I I
c c
I I1 I '3 I I c
I al I a12
0
b11 b I: b 13
0
cl I c, 2 C13 C14
0
c a11 a22 an b11 b12 b 13 b24 C.:?I C22 C~3 C:4
~,4~,-,~,~,-3~,-------------
I I
I ... a31 aJ:? a33 a34 b 31 b 33 b34 b 35 C31 C32 C33 c 34
D
1, 14 I 13 I I ..
J
04 2 b43 C41 C42
c c
--.....--..------------ 0 0 0
,~,~,---,4~'3--.-1
I
c c A B c
~,-,-,~'3~,-4~,----------- I 13 I 14 I Is Is I Is I
c c
~,-,-,~13-,---,4~,----------- I
c
I
0
Figure 5. Band Matrix Multiplication.
( I 14 I Is I Is I I 191
c c
,~,-l---ls_'3
__1 -,4--1---------- 14 I Is I Is I jg I
c c
,~-,-,-,3-,-5~,-1-4-1 --------- I 1.. I Is I Is I jg I
c c The Inner Product Step Processor
~, -,-,-,3-1---15-1.--1--------- I 14 I Is I Is I I 19 I I be
c
t c
able
The processor for the matrix · multiplication
to perform an operation called the inner product step.
must
flows through the array are timed such that the correct
entries of matrices A, B, and C intersect to form the inner
products.
c
t
B
t
c
I
I
I I 1
I
I
I ~//
........ I I
' ' I
I
1 I I I
I /
/ for implementation in the VLSI technology.
033 023 I / b32 b33
' ,1 I I
t, ..... 1
I I I ,,,fI"'
I
I
I
I
I
I r"'
;I Reduced I/O Bandwidth
I I I I
042 032 a~2 b12 b 23 . b14
I
' '
I
I /
/
/ Because systolic designs make multiple use of each
' '
input data item, systolic systems can provide high
throughput with relatively small I/O bandwidth requirements.
Data in a systolic system is used repetitively at ~ach
I I
I l
I I I I processor, so I/O activity is restricted to the input of
I
I I I l
I I I \ I
I I;
I
I
data and the output of results. Intermediate results are
' I
I v '{ I
I I \
I I \
I
I
stored internally to the systolic system and can be excluded
I C31 C22 C13 \
I
I I
I; \ I
y '\ \ from I/O activity.
I
I
( C41 c 32 C23 C14 \
I
Extensive Use of Concurrency
1
23
24
Modular Design
operational frequencies.
25
REFERENCES
26
_J