
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 53, NO. 8, AUGUST 2006 707

Hardware-Efficient Systolization of DA-Based Calculation of Finite Digital Convolution
Pramod Kumar Meher, Senior Member, IEEE

Abstract—Novel one- and two-dimensional systolic structures are designed for computation of circular convolution using distributed arithmetic (DA). The proposed structures involve significantly less memory and less area-delay complexity compared with the existing DA-based structures for circular convolution. Besides, it is shown that the proposed systolic designs for circular convolution can be used for computation of linear convolution as well.

Index Terms—Circular convolution, linear convolution, systolic array, VLSI.

Manuscript received May 21, 2005; revised October 17, 2005 and December 12, 2005. This paper was recommended by Associate Editor C.-T. Lin.
The author is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: aspkmeher@ntu.edu.sg).
Digital Object Identifier 10.1109/TCSII.2006.877277
1057-7130/$20.00 © 2006 IEEE

I. INTRODUCTION

CALCULATION of finite digital convolution is frequently encountered in several digital signal processing (DSP) applications [1]–[3]. Efficient VLSI implementation of digital convolution for real-time DSP applications is, therefore, an important task. Among the existing VLSI systems, systolic architectures have been widely popular owing not only to the simplicity of their design and development but also to their potential for a high level of pipelining in a small chip area [4]. Several different systolic architectures have, therefore, been suggested for VLSI implementation of digital convolution [4]–[6]. In most DSP applications, one of the convolving sequences is derived from the input samples while the other sequence is usually fixed (e.g., the impulse response of a filter or the coefficients of a sinusoidal transform kernel). This behavior of DSP algorithms makes it possible to use distributed arithmetic (DA) for computation of digital convolution. DA yields faster output compared with multiplier-accumulator-based designs because it stores the precomputed partial results in memory elements [7]. The DA-based technique is, therefore, widely used in various DSP applications [7]–[10]. The memory requirement of DA-based computation of convolution, however, increases exponentially with the convolution length. Attempts have, therefore, been made to use offset binary coding [11] to reduce the ROM size by a factor of 2. Recently, Chen et al. [12] have suggested a group DA approach in a nonsystolic structure with ROM of the order of words for implementation of $N$-point circular convolution. But the memory requirement still remains prohibitively large for long-length convolution. To bring it down further, they have prescribed the decomposition method of Agarwal and Cooley [2]. According to the two-factor decomposition of [2], an $N$-point circular convolution can be computed through $N_1$ number of $N_2$-point convolutions followed by $N_2$ number of $N_1$-point convolutions, where $N = N_1 N_2$. The memory size can be restricted to some extent by this approach, but it involves a significant amount of hardware and time as overhead for mapping of input samples and convolved output. In the context of the aforesaid observations, in this brief we aim at presenting a novel scheme for hardware-efficient unified implementation of both circular and linear convolutions by systolic decomposition of DA-based computation.

In Section II, we describe the formulation of the proposed algorithms for circular and linear convolutions. In Section III, we derive the systolic structures from the proposed algorithms. The hardware and time complexities of the proposed structures are discussed, and compared with the existing structures, in Section IV. The conclusion along with the scope for future work is presented in Section V.

II. FORMULATION OF ALGORITHM

For simplicity of discussion, we have assumed the signal samples to be unsigned words of size $L$, although the proposed algorithm can be used for two's complement coding and offset binary coding also.

A. Algorithm for Circular Convolution

The circular convolution of two $N$-point sequences $\{h(n)\}$ and $\{x(n)\}$ can be given by

$$y(n) = \sum_{k=0}^{N-1} h(k)\, x\big((n-k)_N\big), \quad \text{for } n = 0, 1, \ldots, N-1 \tag{1}$$

where $(n-k)_N = (n-k) \bmod N$.

Let $\{h(n)\}$ be a fixed sequence and $\{x(n)\}$ be the input sequence, which may change from time to time. It can be found that the sequence $\{x_n(k)\} = \{x((n-k)_N)\}$ for $0 \le k \le N-1$ is given by $S^n\{x_0(k)\}$. $S$ is a circular-right-shift operator that shifts the elements of a sequence circularly right by one position, such that $S\{x_n(k)\} = \{x_{n+1}(k)\}$. $x_n(k)$ for $0 \le k \le N-1$, for any given value of $n$, may be expressed in expanded form as

$$x_n(k) = \sum_{l=0}^{L-1} [x_n(k)]_l\, 2^l \tag{2}$$

where $[x_n(k)]_l$ denotes the $l$th bit of $x_n(k)$.

Substituting the expansion of $x_n(k)$ as given in (2), (1) becomes

$$y(n) = \sum_{k=0}^{N-1} h(k) \sum_{l=0}^{L-1} [x_n(k)]_l\, 2^l. \tag{3}$$
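As a quick check of the expansion in (1)–(3), the following minimal Python sketch compares a direct evaluation of the circular convolution with its bit-level form. The function names, the sample data, and the word length `L = 4` are illustrative assumptions and do not come from the brief itself.

```python
def circular_convolution(h, x):
    """Direct evaluation of (1): y(n) = sum_k h(k) * x((n - k) mod N)."""
    N = len(h)
    return [sum(h[k] * x[(n - k) % N] for k in range(N)) for n in range(N)]


def circular_convolution_bitwise(h, x, L):
    """Evaluation of (3): each input sample is expanded into its L bits
    (unsigned words assumed) before being multiplied by h(k)."""
    N = len(h)
    y = []
    for n in range(N):
        # x_n(k) = x((n - k) mod N): the circularly shifted input sequence
        x_n = [x[(n - k) % N] for k in range(N)]
        acc = 0
        for k in range(N):
            for l in range(L):
                bit = (x_n[k] >> l) & 1        # [x_n(k)]_l
                acc += h[k] * bit * (1 << l)   # h(k) * bit * 2^l
        y.append(acc)
    return y


h = [3, 1, 4, 1]   # fixed sequence (hypothetical values)
x = [5, 9, 2, 6]   # input sequence; every sample fits in L = 4 bits
print(circular_convolution(h, x))             # [38, 58, 41, 61]
print(circular_convolution_bitwise(h, x, 4))  # same values
```

Both functions return the same result as long as every input sample fits in `L` unsigned bits, which is the assumption stated at the start of Section II.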



Interchanging the order of summations, (3) may be written as

$$y(n) = \sum_{l=0}^{L-1} 2^l \left[ \sum_{k=0}^{N-1} h(k)\, [x_n(k)]_l \right]. \tag{4}$$

When $N$ is a composite number given by $N = MP$ ($M$ and $P$ may be any two positive integers), one can map the index $k$ into $(m + pM)$ for $m = 0, 1, \ldots, M-1$ and $p = 0, 1, \ldots, P-1$ so as to express (4) in the form

$$y(n) = \sum_{l=0}^{L-1} 2^l \left[ \sum_{p=0}^{P-1} C^{\,n}_{l,p} \right] \tag{5a}$$

where

$$C^{\,n}_{l,p} = \sum_{m=0}^{M-1} h(m + pM)\, [x_n(m + pM)]_l \tag{5b}$$

for $l = 0, 1, \ldots, L-1$ and $p = 0, 1, \ldots, P-1$.

For any given sequence $\{h(n)\}$, the $2^M$ possible values of $C^{\,n}_{l,p}$ corresponding to the $2^M$ permutations of the $M$-point bit sequence $\{[x_n(m + pM)]_l\}$ for $m = 0, 1, \ldots, M-1$, for each $p$, may be stored in a look-up table (LUT) of $2^M$ words. These values of $C^{\,n}_{l,p}$ can be read out when the bit sequence is fed to the ROM as address. Equation (5) may, thus, be written in terms of memory-read operations as

$$y(n) = \sum_{l=0}^{L-1} 2^l \left[ \sum_{p=0}^{P-1} F_p\big(\mathbf{b}^{\,n}_{l,p}\big) \right] \tag{6}$$

where $F_p(\mathbf{b}^{\,n}_{l,p}) = C^{\,n}_{l,p}$ and $\mathbf{b}^{\,n}_{l,p} = \{[x_n(pM)]_l, [x_n(1 + pM)]_l, \ldots, [x_n(M - 1 + pM)]_l\}$, for $l = 0, 1, \ldots, L-1$ and $p = 0, 1, \ldots, P-1$. The bit vector $\mathbf{b}^{\,n}_{l,p}$ is used as address word for the LUT and $F_p(\cdot)$ is the memory-read operation.

B. Algorithm for Linear Convolution

The finite linear convolution of an input sequence $\{x(n)\}$ with an $N$-point fixed sequence $\{h(n)\}$ can be given by

$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k) \tag{7}$$

where $x(n-k) = 0$ for $n < k$.

The computation involved in the linear convolution of (7) is closely similar to the computation of circular convolution given by (1). In (1), the input sequence is circularly shifted for computing every new value of the convolved output, while in (7) the input sequence is derived from serially shifted input samples using a window of size $N$. The linear convolution of (7) may, therefore, also be computed via memory-read operations [as given by (6) for circular convolution], and may be written as

$$y(n) = \sum_{l=0}^{L-1} 2^l \left[ \sum_{p=0}^{P-1} F_p\big(\mathbf{d}^{\,n}_{l,p}\big) \right]. \tag{8}$$

The bit vector $\mathbf{d}^{\,n}_{l,p}$ (corresponding to the $n$th convolution output), which can be used as address word for the LUT, is given by

$$\mathbf{d}^{\,n}_{l,p} = \{[x(n - pM)]_l, [x(n - 1 - pM)]_l, \ldots, [x(n - M + 1 - pM)]_l\} \tag{9}$$

for $l = 0, 1, \ldots, L-1$ and $p = 0, 1, \ldots, P-1$.

It may be observed here that the computations of (6) and (8) for circular and linear convolutions, respectively, differ only by the input bit vectors.

Fig. 1. DG for DA-based computation of circular convolution. (a) DG. (b) Function of node A. (c) Function of node B.

III. PROPOSED STRUCTURES

We derive here the proposed one-dimensional (1-D) and two-dimensional (2-D) systolic arrays for computation of circular and linear convolutions.

A. Proposed 1-D Systolic Array for Digital Convolution

The dependence graph (DG) for computation of circular convolution according to (6) is shown in Fig. 1. It consists of $L$ rows, where each row consists of $P$ number of node A and one boundary node B. The functions of node A and node B are depicted in Fig. 1(b) and (c), respectively. A bit vector $\mathbf{b}^{\,n}_{l,p}$ consisting of a sequence of $M$ bits [derived from the $l$th bit of the elements of the input sequence as given in (6)] is fed to the node A on the row associated with bit index $l$ and the column associated with index $p$. The node uses the sequence of input bits of the input bit vector as address for an LUT, and reads the content stored at the location specified by the address. The value read from the LUT is then added to the input available from its left, and the sum is passed to the node on its right. Node B performs a shift-add (SA) operation: it left-shifts the bits of the input available from the top, adds the input available from the left to the left-shifted value, and passes the result down to its adjacent node. The DG can be projected vertically along the projection direction

with default schedule [4] to derive a linear array consisting of $P$ number of processing elements (PEs) and an output cell, as shown in Fig. 2. The bits of the vector $\mathbf{b}^{\,n}_{l,p}$ derived from the input sequence are fed to the PE associated with index $p$ in most-significant-bit (MSB) to least-significant-bit (LSB) order, such that the $(L-1)$th bits of the $L$-bit input values are fed to the PE first and the zeroth bits are fed at the end. Besides, the input to each PE is staggered by one cycle period with respect to the preceding PE to meet the causality requirement. The function of the PEs is described in Fig. 2(b). Each PE consists of a ROM of $2^M$ words, where $M = N/P$. During a cycle period (time step), each PE reads the content of its ROM at the location specified by the input bit vector. The value read from the ROM is then added to the input available to the PE from its left. During every cycle period, the sum is then transferred as output to its right. The function of the output cell is shown in Fig. 2(c). Each output cell contains a shift register and an adder. During a cycle period it shifts the content of its register left by one position and then adds the available input to the shifted content of its register. After $L$ cycles it delivers a desired convolution output. The structure will yield its first convolved output cycles after the first input is fed to the first PE, while the successive convolution outputs become available at intervals of $L$ cycles. For computation of circular convolution, the input buffer is a circular-shift buffer, where the content of the buffer is right-shifted by one position and also transferred in parallel to a bit-serial word-parallel converter once in every $L$ cycles. For high-throughput applications one may, however, have a structure with $Q$ number of 1-D arrays, which would yield a convolved output in every $L/Q$ cycles. The DG of Fig. 1 may also be used for computing the linear convolution, and the 1-D systolic array of Fig. 2 may, therefore, also be used for computing the linear convolution when the input buffer is replaced by a serial-in serial-cum-parallel-out buffer. The content of the buffer is serially right-shifted by one position and transferred in parallel to the bit-serial word-parallel converter in every $L$ cycles.

Fig. 2. Proposed 1-D array for circular convolution. (a) Linear systolic array. (b) Function of PE. (c) Function of output cell. $\Delta$ stands for a unit delay.

Fig. 3. Proposed 2-D array for circular convolution. (a) The 2-D systolic array. (b) Function of PE. (c) Function of SA cell. $\Delta$ stands for a unit delay.

B. Proposed 2-D Systolic Structure for Digital Convolution

For high-throughput computation of the circular convolution, each node of the DG of Fig. 1 can also be assigned exclusively to a PE to obtain a 2-D systolic array of $L$ rows and $(P+1)$ columns, as shown in Fig. 3. Each row of the structure consists of $P$ number of PEs and an SA cell. The computation of all the subsequent values of the convolution output may also be given by similar DGs, and the computation of corresponding nodes of all such DGs may be folded to the same structure. The input samples are fed to a bit-parallel word-serial converter which generates $L$ number of bit streams of the input sequence, where each bit stream contains the corresponding bits of all the input words. The output of the bit-parallel word-serial converter is fed to the circularly right-shift registers, through bit-stream buffers associated with

each of the rows, as shown in Fig. 3(a). The bit vector $\mathbf{b}^{\,n}_{l,p}$, consisting of $M$ number of bits from the circularly right-shift register of the row associated with bit index $l$, is loaded to the PE associated with index $p$ on that row (for $0 \le l \le L-1$ and $0 \le p \le P-1$). Each PE [shown in Fig. 3(b)] uses the bit vector as address for the LUT of the PE to read a partial result. The PE then adds the input available from the left to its recently read partial result, and passes the sum out to its right. Each row of the structure is terminated with an SA cell. The function of the SA cell is depicted in Fig. 3(c). During a cycle period, each SA cell left-shifts its input available from the top and adds the shifted value to its input available from the left. The sum is then passed downward to its adjacent SA cell. To meet the data-dependence requirement, the SA cell on every row is staggered by one cycle period with respect to the SA cell on the preceding row. Each of the circularly right-shift registers of the structure is reloaded with a new bit stream after every $N$ cycles for computing the cyclic convolution of a new input sequence.

Linear convolution can also be computed by the 2-D structure of circular convolution by a simple alteration, as shown in Fig. 4. The $L$ number of bit streams generated by the bit-parallel word-serial converter are fed in this case to the PEs of each of the rows of the structure through serial-in parallel-out shift registers. The functions of the PEs and the SA cells in this case are exactly the same as those of the structure of circular convolution as shown in Fig. 3(b) and (c), respectively.

In the single-array structure (Fig. 2), the processing of different bit streams is time-multiplexed onto the same PEs, while in the 2-D structure (Figs. 3 and 4) each bit stream is processed by a separate row of PEs. We can also derive a structure with $Q$ number of such linear arrays (for $L = QR$, where $Q$ and $R$ are positive integers) by projecting the nodes of $R$ number of rows of the DG to a single array structure instead of projecting the nodes of all the rows to a single linear array. One may, therefore, opt to derive a structure with multiple linear arrays, and similarly may also opt for a suitable value of $P$ ($P$ = number of PEs on one row of the array) for flexible implementation to meet the hardware and time specification of constraint-driven systems.

Fig. 4. Proposed 2-D array for linear convolution. $\Delta$ stands for a unit delay.

IV. HARDWARE AND TIME COMPLEXITIES

In Section III, we have proposed 1-D and 2-D systolic arrays for linear and circular convolution. Each 1-D array consists of $P$ number of PEs, where each PE consists of an adder and a ROM of $2^M$ words. The output cell consists of an adder and a shift register. The latency of the structure is cycles, the duration of each cycle being $T = T_M + T_A$, where $T_M$ and $T_A$ are, respectively, the time required to perform a memory-access operation and an addition in the PE. It will yield the subsequent convolution outputs at intervals of $L$ cycle periods. A $Q$-array structure consisting of $Q$ number of proposed 1-D arrays will involve $Q$ times more adders and memory words, but will yield $Q$ times more throughput compared with that of a single 1-D array. The proposed 2-D structure has $LP$ number of PEs arranged in $L$ number of linear arrays. Each PE of this 2-D array consists of an adder and a ROM of $2^M$ words, as in the case of the PE of the 1-D array. The latency of the structure is the same cycles as that of the single-array structure. It gives one convolved output in every cycle period thereafter. The memory requirement, adder complexity, latency, and throughput of the proposed 1-D and 2-D structures are listed in Table I along with those of the existing DA-based systolic structure for circular convolution of [10], and also those of the structure of [12].

TABLE I
HARDWARE- AND TIME-COMPLEXITIES OF PROPOSED STRUCTURES AND EXISTING STRUCTURES OF [10] AND [12]

It is found that the proposed 2-D structure involves a ROM that is smaller by several orders of magnitude, and offers the same throughput with a small increase in latency compared with the structure of [10], at the cost of a -time increase in adder complexity; e.g., for , , and , the proposed 2-D structure requires 80 adders and ROM of words, while the structure of [10] will involve 32 adders and ROM of words. The area of memory cells and adders, along with the addition time and access time for memories of various sizes, are determined using the Synopsys DesignWare 0.18-μm TSMC library for 16-bit data width [13]. Accordingly, the average computation time per output $T$, the area $A$ of the proposed 2-D structure, and the VLSI complexity measure $AT$ (for ) are calculated and listed in Table II with those of the structure of [12]. Since the size of the memory modules of the proposed structures depends on the ratio $N/P$, it is possible to use memory of small and convenient size even for higher values of $N$. Due to this behavior, the time complexity of the proposed structure is independent of the convolution length $N$, and the area complexity increases linearly with $N$. But in the structure of [12], although the computation time falls linearly with $N$, the memory size rises exponentially with $N$. Therefore, the $AT$ complexity of [12] is very high compared with that of the proposed structure. The increase in computational delay of the existing structures due to the rise in memory access time may, however, be avoided,

and it may be reduced to one addition time if the memory access and additions are carried out in separate pipelined stages. The area-delay complexity of the proposed structures for and is plotted with that of the existing structure of [12] in Fig. 5, where the cycle period is taken as one addition time. The area-delay product of the structure of [12] is less than that of the proposed structure for values of . But, for large values of $N$, the area-delay product of the proposed structure is , and it is less by several orders of magnitude than that of the other, which is . The area complexity of the proposed 1-D structure for $N = 64$ and $N = 128$ is evaluated for different values of $P$ and plotted in Fig. 6. It is observed that the area complexity falls rapidly with the increase of the decomposition factor $P$. However, the reduction of the total area of the structure is not prominent for larger values of $P$.

TABLE II
AT VLSI PERFORMANCE MEASURE OF PROPOSED 2-D STRUCTURE AND STRUCTURE OF [12] FOR SMALL CONVOLUTION LENGTHS

Fig. 5. Comparison of area-delay product (in micrometer square nanoseconds) of the proposed structure with the structure of Chen et al. [12].

Fig. 6. Variation of area-complexity of proposed 1-D structure in micrometer square with the decomposition factor P for N = 64 and N = 128.

V. CONCLUSION

A novel decomposition scheme for DA-based computation of finite digital convolution is described for hardware-efficient implementation in fully pipelined 1-D and 2-D systolic structures. The proposed 1-D array can be used for both circular and linear convolution by changing the input buffer only. Similarly, the proposed 2-D structure for circular convolution can also be used for linear convolution by changing the input loading. The proposed structures can offer a reduction of ROM size by several orders of magnitude over the existing DA-based systolic structures, with a comparatively small increase in adder complexity, at the same throughput rate of computation. The function of each PE in the proposed structures can also be realized in a systolic array by another level of systolic decomposition, using some more adders, to bring down the memory requirement further. Besides, one can choose the number of rows and the number of columns of PEs of the structure for flexible implementation in a constraint-driven system, and derive a systolic core for efficient DA-based realization of digital filters and discrete unitary transforms for various DSP applications. The proposed two-factor decomposition of DA-based computation can also be used for efficient computation of inner products, and can be extended further to a three-factor decomposition to achieve better area-time efficiency.

REFERENCES

[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1996.
[2] R. Agarwal and J. Cooley, “New algorithms for digital convolution,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 5, pp. 392–410, Oct. 1977.
[3] K. J. Jones, “Prime number DFT computation via parallel circular convolvers,” Proc. IEE, Radar Signal Process., vol. 137, no. 3, pp. 205–212, Jun. 1990.
[4] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[5] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, “A systolic array architecture for the discrete sine transform,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347–2354, Sep. 2002.
[6] B. K. Mohanty and P. K. Meher, “Novel flexible systolic mesh architecture for parallel VLSI implementation of finite digital convolution,” IETE J. Res., vol. 44, no. 6, pp. 261–266, Nov.–Dec. 1998.
[7] S. A. White, “Applications of distributed arithmetic to digital signal processing: A tutorial review,” ASSP Mag., vol. 6, no. 3, pp. 4–19, Jul. 1989.
[8] S. Hwang, G. Han, S. Kang, and J. Kim, “New distributed arithmetic algorithm for low-power FIR filter implementation,” IEEE Signal Process. Lett., vol. 11, no. 5, pp. 463–466, May 2004.
[9] H.-C. Chen, J.-I. Guo, C.-W. Jen, and T.-S. Chang, “Distributed arithmetic realisation of cyclic convolution and its DFT application,” Proc. IEE, Circuits, Devices, Syst., pp. 615–629, Dec. 2005.
[10] J. I. Guo, “A new distributed arithmetic algorithm and its hardware architecture for the discrete Hartley transform,” Pattern Recognit. Image Anal., vol. 10, no. 3, pp. 368–378, 2000.
[11] J.-P. Choi, S.-C. Shin, and J.-G. Chung, “Efficient ROM size reduction for distributed arithmetic,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS'00), May 2000, vol. 2, pp. 61–64.
[12] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, “A memory-efficient realization of cyclic convolution and its application to discrete cosine transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445–453, Mar. 2005.
[13] Synopsys, DesignWare Foundry Libraries, Mountain View, CA, 2005. [Online]. Available: http://www.synopsys.com/
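The memory-read formulation of (6) and (8) and the MSB-first shift-add accumulation of Section III can be illustrated by a small functional model. The following Python sketch is a behavioral model only, under stated assumptions: it ignores pipelining, staggering, and cycle timing, and the names `build_luts`, `da_output`, `circular_da`, and `linear_da`, as well as the sample data, are hypothetical. It assumes unsigned `L`-bit samples and a convolution length `N` that is divisible by the number of PEs `P`.

```python
def build_luts(h, P):
    """One LUT per PE: entry `addr` of LUT p holds the sum of h[m + p*M]
    over all m whose bit is set in `addr` (2**M words, with M = N // P)."""
    M = len(h) // P
    return [
        [sum(h[m + p * M] for m in range(M) if (addr >> m) & 1)
         for addr in range(1 << M)]
        for p in range(P)
    ]


def da_output(luts, window, L):
    """One convolution output from LUT reads; window[k] multiplies h[k].
    Bit planes are processed MSB first, as in the shift-add cells."""
    P = len(luts)
    M = len(window) // P
    acc = 0
    for l in range(L - 1, -1, -1):
        row_sum = 0
        for p in range(P):
            addr = 0
            for m in range(M):
                # bit l of the sample paired with h[m + p*M] forms address bit m
                addr |= ((window[m + p * M] >> l) & 1) << m
            row_sum += luts[p][addr]       # memory read, then add along the row
        acc = (acc << 1) + row_sum         # shift-add accumulation
    return acc


def circular_da(h, x, P, L):
    """Circular convolution: the window is the circularly shifted input."""
    N = len(h)
    luts = build_luts(h, P)
    return [da_output(luts, [x[(n - k) % N] for k in range(N)], L)
            for n in range(N)]


def linear_da(h, x, P, L):
    """Linear convolution: the window slides serially over the input,
    with zeros outside the input range."""
    N = len(h)
    luts = build_luts(h, P)
    out = []
    for n in range(len(x) + N - 1):
        window = [x[n - k] if 0 <= n - k < len(x) else 0 for k in range(N)]
        out.append(da_output(luts, window, L))
    return out


h = [3, 1, 4, 1]   # fixed sequence (hypothetical values)
x = [5, 9, 2, 6]   # unsigned 4-bit input samples
print(circular_da(h, x, P=2, L=4))   # [38, 58, 41, 61]
print(linear_da(h, x, P=2, L=4))     # [15, 32, 35, 61, 23, 26, 6]
```

In this model the same LUTs and the same accumulation routine serve both cases; only the construction of the input window differs, which mirrors the observation that (6) and (8) differ only by the input bit vectors.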
