Tcsii 2006 877277
Abstract—Novel one- and two-dimensional systolic structures are designed for computation of circular convolution using distributed arithmetic (DA). The proposed structures involve significantly less memory and less area-delay complexity compared with the existing DA-based structures for circular convolution. Besides, it is shown that the proposed systolic designs for circular convolution can be used for computation of linear convolution as well.
Index Terms—Circular convolution, linear convolution, systolic array, VLSI.
Manuscript received May 21, 2005; revised October 17, 2005 and December 12, 2005. This paper was recommended by Associate Editor C.-T. Lin.
The author is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: aspkmeher@ntu.edu.sg).
Digital Object Identifier 10.1109/TCSII.2006.877277
I. INTRODUCTION
CALCULATION of finite digital convolution is frequently encountered in several digital signal processing (DSP) applications [1]–[3]. Efficient VLSI implementation of the digital convolution for real-time DSP applications is, therefore, an important task. Amongst the existing VLSI systems, systolic architectures have been extensively popular owing not only to the simplicity of their design and development; but also for the potential of using high level of pipelining in a small chip-area [4]. Several different systolic architectures are, therefore, suggested for VLSI implementation of digital convolution [4]–[6]. In most DSP applications, one of the convolving sequences is derived from the input samples while the other sequence is usually fixed (e.g., impulse response of a filter or coefficients of the sinusoidal transform kernel, etc.). This behavior of DSP algorithms makes it possible to use distributed arithmetic (DA) for computation of digital convolution. It yields faster output compared with the multiplier-accumulator-based designs because it stores the pre-computed partial results in the memory elements [7]. The DA-based technique is, therefore, widely used in various DSP applications [7]–[10]. The memory requirement of DA-based computation of convolution, however, increases exponentially with the convolution length. Attempts are, therefore, made to use offset binary coding [11] to reduce the ROM size by a factor of 2. Recently, Chen et al. [12] have suggested a group DA approach in a nonsystolic structure with ROM of the order of words for implementation of -point circular convolution. But, the memory requirement still remains prohibitively large for long-length convolution. To bring it further down, they have prescribed to use the decomposition method of Agarwal and Cooley [2]. According to the two-factor decomposition of [2], -point circular convolution can be computed through number of -point convolutions followed by number of -point convolutions where . The memory size can be restricted to some extent by this approach but, it involves significant amount of hardware and time as overhead for mapping of input samples and convolved output. In the context of aforesaid observations, in this brief, we aim at presenting a novel scheme for hardware-efficient unified implementation of both circular and linear convolutions by systolic decomposition of DA-based computation.
In Section II, we have described the formulation of the proposed algorithms for circular and linear convolutions. In Section III, we have derived the systolic structures from the proposed algorithms. The hardware and time complexities of the proposed structures are discussed, and compared with the existing structures, in Section IV. The conclusion along with the scope for future work is presented in Section V.
II. FORMULATION OF ALGORITHM
For simplicity of discussion, we have assumed the signal samples to be unsigned words of size , although the proposed algorithm can be used for two’s complement coding and offset binary coding also.
A. Algorithm for Circular Convolution
The circular convolution of two -point sequences and can be given by
for (1)
where .
Let be a fixed sequence and be the input sequence which may change from time to time. It can be found that the sequence for . is a circular-right-shift operator that shifts the elements of a sequence circularly right by one position, such that .
for , for any given value of may be expressed in expanded form as
(2)
where denotes the th bit of .
Substituting the expansion of as given in (2), (1) becomes
(3)
(4)
(5a)
where
(5b)
for and .
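The expansion leading to (5) can be checked numerically. The following is a minimal Python sketch, not the author's implementation: the names h (fixed sequence), x (input sequence), N (convolution length), L (word size), and the split N = P * Q with tap index k = p * Q + q are assumptions introduced here, since the original symbols and index mapping are not recoverable from this copy. It compares the direct circular convolution with the bit-plane form, in which each output is a sum over L bit planes and over P groups of Q taps.

```python
def circular_convolution(h, x):
    """Direct form: y[n] = sum_k h[k] * x[(n - k) mod N]."""
    N = len(h)
    return [sum(h[k] * x[(n - k) % N] for k in range(N)) for n in range(N)]


def bit(value, l):
    """Return the l-th bit (LSB is bit 0) of an unsigned integer."""
    return (value >> l) & 1


def da_circular_convolution(h, x, L, P, Q):
    """Bit-plane form: sum over L bit planes and P groups of Q taps.

    Assumed index mapping (hypothetical): tap k = p * Q + q.
    """
    N = len(h)
    assert N == P * Q, "assumed two-factor split N = P * Q"
    y = []
    for n in range(N):
        total = 0
        for l in range(L):
            plane = 0
            for p in range(P):
                # Partial inner product of Q coefficients with Q input bits.
                plane += sum(
                    h[p * Q + q] * bit(x[(n - p * Q - q) % N], l)
                    for q in range(Q)
                )
            total += plane << l  # weight bit plane l by 2**l
        y.append(total)
    return y


h = [3, 1, 4, 1, 5, 9, 2, 6]
x = [2, 7, 1, 8, 2, 8, 1, 8]  # unsigned samples that fit in L = 4 bits
assert da_circular_convolution(h, x, L=4, P=2, Q=4) == circular_convolution(h, x)
```

Each inner sum over q depends only on Q coefficient values and Q input bits, which is what makes it possible to precompute it and store it in a table.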
For any given sequence , the possible values of corresponding to the permutations of -point bit sequence for for may be stored in a look-up table (LUT) of words. These values of can be read out when the bit sequence is fed to the ROM as address. Equation (5) may, thus, be written in terms of memory-read operation as
(6)
where and , for and .
The bit vector is used as address word for the LUT and is the memory-read operation.
B. Algorithm for Linear Convolution
The finite linear convolution of an input sequence with an -point fixed sequence can be given by
(7)
where .
The computation involved in linear convolution of (7) is closely similar to computation of circular convolution given by (2). In (2), the input sequence is circularly shifted for computing every new value of convolved output while in (7), the input sequence is derived from serially shifted input samples using a window of size . The linear convolution of (7) may also, therefore, be computed via memory-read operations [as given by (6) for circular convolution], and may be written as
(8)
The bit vector (corresponding to the th convolution output) which can be used as address word for the LUT is given by
(9)
for and .
It may be observed here that the computations of (6) and (8) for circular and linear convolutions, respectively, differ only by the input bit vectors.
Fig. 1. DG for DA-based computation of circular convolution. (a) DG. (b) Function of node A. (c) Function of node B.
III. PROPOSED STRUCTURES
We derive here the proposed one-dimensional (1-D) and two-dimensional (2-D) systolic arrays for computation of circular and linear convolutions.
A. Proposed 1-D Systolic Array for Digital Convolution
The dependence graph (DG) for computation of circular convolution according to (6) is shown in Fig. 1. It consists of rows, where each row consists of number of node A and one boundary node B. The functions of node A and node B are depicted in Fig. 1(b) and (c), respectively. A bit vector consisting of a sequence of bits (derived from the th bit of the elements of the input sequence as given in (6)) is fed to node A on th row and th column. The node uses the sequence of input bits of the input bit vector as address for an LUT, and reads the content stored at the location specified by the address. The value read from the LUT is then added to the input available from its left, and the sum is passed to the node on its right. Node B performs a shift-add (SA) operation such that it makes a left-shift of the bits of the input available from the top, then adds the input available from the left to the left-shifted value, and passes the result down to its adjacent node. The DG can be projected vertically along the projection direction
MEHER: SYSTOLIZATION OF DA-BASED CALCULATION 709
Fig. 2. Proposed 1-D array for circular convolution. (a) Linear systolic array. (b) Function of PE. (c) Function of output cell. stands for a unit delay.
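The node functions described for the DG (an LUT read followed by an addition in node A, and a shift-add in node B) can be imitated in software. The Python sketch below is an illustration under stated assumptions, not the structure of the brief: the symbols h, x, L, P, Q, the split N = P * Q, the address bit ordering, the MSB-first row order, and all function names are hypothetical. It also illustrates the remark that circular and linear convolution differ only in the bit vectors used as LUT addresses.

```python
def build_luts(h, P, Q):
    """One 2**Q-word LUT per group p (assumed layout): the word at address a
    is the sum of the coefficients h[p*Q + q] whose address bit q is 1."""
    luts = []
    for p in range(P):
        group = h[p * Q:(p + 1) * Q]
        luts.append([
            sum(c for q, c in enumerate(group) if (addr >> q) & 1)
            for addr in range(1 << Q)
        ])
    return luts


def address(x_index, l, p, Q):
    """Pack the l-th bits of Q input samples into one LUT address.
    x_index(k) returns the sample paired with coefficient h[k]."""
    addr = 0
    for q in range(Q):
        addr |= ((x_index(p * Q + q) >> l) & 1) << q
    return addr


def da_output(luts, x_index, L, P, Q):
    """One output sample; rows are bit planes (MSB first), columns are groups."""
    acc = 0  # value handed down the column of node B cells
    for l in reversed(range(L)):
        row_sum = 0  # value passed left to right through node A cells
        for p in range(P):
            row_sum += luts[p][address(x_index, l, p, Q)]  # node A: read, add
        acc = (acc << 1) + row_sum  # node B: shift-add
    return acc


def da_circular(h, x, L, P, Q):
    """Circular convolution: addresses come from circularly shifted input."""
    N = len(h)
    luts = build_luts(h, P, Q)
    return [da_output(luts, lambda k, n=n: x[(n - k) % N], L, P, Q)
            for n in range(N)]


def da_linear(h, x, L, P, Q):
    """Linear convolution: addresses come from a sliding, zero-padded window."""
    N = len(h)
    luts = build_luts(h, P, Q)

    def sample(m):
        return x[m] if 0 <= m < len(x) else 0

    return [da_output(luts, lambda k, n=n: sample(n - k), L, P, Q)
            for n in range(len(x) + N - 1)]


h = [3, 1, 4, 1]
x = [2, 7, 1, 5]  # unsigned samples that fit in L = 3 bits
direct = [sum(h[k] * x[(n - k) % 4] for k in range(4)) for n in range(4)]
assert da_circular(h, x, L=3, P=2, Q=2) == direct
```

In this sketch the same LUT contents serve both da_circular and da_linear; only the function that selects the input samples for each address changes.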
TABLE I
HARDWARE- AND TIME-COMPLEXITIES OF PROPOSED STRUCTURES AND
EXISTING STRUCTURES OF [10] AND [12]
TABLE II
AT VLSI PERFORMANCE MEASURE OF PROPOSED 2-D STRUCTURE AND STRUCTURE OF [12] FOR SMALL CONVOLUTION LENGTHS
Fig. 5. Comparison of area-delay product (in micrometer square nanoseconds) of the proposed structure with the structure of Chen et al. [12].
Fig. 6. Variation of area-complexity of proposed 1-D structure in micrometer square with the decomposition factor P for N = 64 and N = 128.
and it may be reduced to one addition-time if the memory access and additions are carried out in separate pipelined stages. The area-delay complexity of the proposed structures for and is plotted with that of the existing structure of [12] in Fig. 5, where cycle period is taken as one addition time. The area-delay product of the structure of [12] is less than that of the proposed structure for values of . But, for large values of , the area-delay product of the proposed structure is , and it is less by several orders of magnitude than that of the other which is . The area complexity of the proposed 1-D structure for N = 64 and N = 128 is evaluated for different values of P and plotted in Fig. 6. It is observed that the area-complexity falls rapidly with the increase of decomposition factor P. However, the reduction of total area of the structure is not prominent for larger values of .
V. CONCLUSION
A novel decomposition scheme for DA-based computation of finite digital convolution is described for hardware-efficient implementation in fully pipelined 1-D and 2-D systolic structures. The proposed 1-D array can be used for both circular and linear convolution by changing the input buffer only. Similarly, the proposed 2-D structure for circular convolution can also be used for linear convolution by changing the input loading. The proposed structures can reduce the ROM size by several orders of magnitude over the existing DA-based systolic structures, with a comparatively small increase in adder complexity, at the same throughput rate of computation. The function of each PE in the proposed structures can also be realized in a systolic array by another level of systolic decomposition, using some more adders to bring down the memory requirement further. Besides, the number of rows and the number of columns of PEs can be chosen flexibly to suit implementation in a constraint-driven system, and to derive a systolic core for efficient DA-based realization of digital filters and discrete unitary transforms for various DSP applications. The proposed two-factor decomposition of DA-based computation can also be used for efficient computation of inner-product, and can be extended further to three-factor decomposition to achieve better area-time efficiency.
REFERENCES
[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1996.
[2] R. Agarwal and J. Cooley, “New algorithms for digital convolution,” IEEE Trans. Acoust., Speech, Signal Process., vol. 25, no. 5, pp. 392–410, Oct. 1977.
[3] K. J. Jones, “Prime number DFT computation via parallel circular convolvers,” Proc. IEE, Radar Signal Process., vol. 137, no. 3, pp. 205–212, Jun. 1990.
[4] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[5] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, “A systolic array architecture for the discrete sine transform,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2347–2354, Sep. 2002.
[6] B. K. Mohanty and P. K. Meher, “Novel flexible systolic mesh architecture for parallel VLSI implementation of finite digital convolution,” IETE J. Res., vol. 44, no. 6, pp. 261–266, Nov.–Dec. 1998.
[7] S. A. White, “Applications of distributed arithmetic to digital signal processing: A tutorial review,” ASSP Mag., vol. 6, no. 3, pp. 4–19, Jul. 1989.
[8] S. Hwang, G. Han, S. Kang, and J. Kim, “New distributed arithmetic algorithm for low-power FIR filter implementation,” IEEE Signal Process. Lett., vol. 11, no. 5, pp. 463–466, May 2004.
[9] H.-C. Chen, J.-I. Guo, C.-W. Jen, and T.-S. Chang, “Distributed arithmetic realisation of cyclic convolution and its DFT application,” Proc. IEE, Circuits, Devices, Syst., pp. 615–629, Dec. 2005.
[10] J. I. Guo, “A new distributed arithmetic algorithm and its hardware architecture for the discrete Hartley transform,” Pattern Recognit. Image Anal., vol. 10, no. 3, pp. 368–378, 2000.
[11] J.-P. Choi, S.-C. Shin, and J.-G. Chung, “Efficient ROM size reduction for distributed arithmetic,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS’00), May 2000, vol. 2, pp. 61–64.
[12] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, “A memory-efficient realization of cyclic convolution and its application to discrete cosine transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 445–453, Mar. 2005.
[13] Synopsys, DesignWare Foundry Libraries, Mountain View, CA, 2005 [Online]. Available: http://www.synopsys.com/