VLSI Synthesis of DSP Kernels - Algorithmic and Architectural Transformations
by
MAHESH MEHENDALE
Texas Instruments (India), Ltd.
and
SUNIL D. SHERLEKAR
Silicon Automation Systems Ltd.
List of Figures
List of Tables
Foreword
Acknowledgments
Preface
1. INTRODUCTION
1.1 An Example
1.2 The Design Process: Constraints and Alternatives
1.3 Organization of the Book
1.4 For the Reader
2. PROGRAMMABLE DSP BASED IMPLEMENTATION
2.1 Power Dissipation - Sources and Measures
2.1.1 Components Contributing to Power Dissipation
2.1.2 Measures of Power Dissipation in Busses
2.1.3 Measures of Power Dissipation in the Multiplier
2.2 Low Power Realization of DSP Algorithms
2.2.1 Allocation of Program, Coefficient and Data Memory
2.2.2 Bus Coding
2.2.2.1 Gray Coded Addressing
2.2.2.2 T0 Coding
2.2.2.3 Bus Invert Coding
2.2.3 Instruction Buffering
2.2.4 Memory Architectures for Low Power
2.2.5 Bus Bit Reordering
2.2.6 Generic Techniques for Power Reduction
2.3 Low Power Realization of Weighted-sum Computation
2.3.1 Selective Coefficient Negation
2.3.2 Coefficient Ordering
2.3.2.1 Coefficient Ordering Problem Formulation
2.3.2.2 Coefficient Ordering Algorithm
2.3.3 Adder Input Bit Swapping
2.3.4 Swapping Multiplier Inputs
2.3.5 Exploiting Coefficient Symmetry
GENE FRANTZ
Senior Fellow, Digital Signal Processing
Texas Instruments Inc.
Houston, Texas
April 2001
Acknowledgments
First and foremost, we would like to express our sincere gratitude to Milind
Sohoni, Vikram Gadre and Supratim Biswas (all of IIT Bombay), G. Venkatesh
(with Sasken Communication Technologies Ltd., earlier with IIT Bombay)
and Rubin Parekhji of Texas Instruments (India) for their insightful comments,
critical remarks and feedback which enriched the quality of this book.
We are thankful to Bobby Mitra and Sham Banerjee of Texas Instruments
(India) for their help, support and guidance.
We are grateful to Texas Instruments (India) for sponsoring the doctoral
studies of the first author. We deeply appreciate the support and encouragement
of IIT Bombay and Sasken Communication Technologies Ltd.
We are thankful to Amit Sinha, Somdipta Basu Roy, M.N. Mahesh, Satrajit
Gupta, Anand Pande, Sunil Kashide and Vikas Agrawal (all with Texas Instru-
ments (India) when the work was done) for their assistance in implementing
some of the techniques discussed in this book.
Our warm thanks to our children - Aarohi Mehendale and Aparna &
Nachiket Sherlekar - for putting up with our long hours at work. Finally, thanks
are due to our wives - Archana Mehendale and Gowri Sherlekar - for being
there with us at all times.
MAHESH MEHENDALE
SUNIL D. SHERLEKAR
Preface
D. E. Knuth, in his seminal paper "Structured Programming with Goto Statements",
underlines the importance of optimizing the inner loop in a computer
program. More than twenty-five years and a revolution in semiconductor
technology have not diminished the importance of the inner loop.
This book is about synthesis of the 'inner loop' or the kernel of Digital
Signal Processing (DSP) systems. These systems process - in real time -
digital information in the form of text, data, speech, images, audio and video.
The wide variety of these systems notwithstanding, their kernels or inner loops
share a common class of computation. This is the weighted sum (Σ A[i]·X[i]).
It occurs in Finite Impulse Response (FIR) and Infinite Impulse Response (IIR)
filters, in signal correlation and in computing signal transforms.
Unlike general purpose computation which asks for computation to be 'as
fast as possible', DSP systems require performance that is characterized by
the arrival rate of a data stream which, in turn, is determined by the Nyquist
sampling rate of the signal to be processed. The performance of the system is
therefore a constraint within which one must optimize the area (cost) and power
(battery life). This is usually a matter of tradeoff.
The area-power tradeoff is complicated by additional requirements of flexi-
bility. Flexibility is important to track evolving standards, to cater to multiplicity
of standards (such as air interfaces in mobile communication) and fast-paced
innovation in algorithms. Flexibility is achieved by implementation in software,
but a completely soft implementation is likely to be ruinous for power. It is
therefore imperative that the requirements of flexibility be carefully predicted
and the system be partitioned into hardware and software components.
In this book, we present several algorithmic and architectural transformations
to optimize weighted-sum based DSP kernels over the area-delay-power space.
These transformations address implementation technologies that offer varying
degrees of programmability (and therefore flexibility) ranging from software
programmable processors to customized hardwired solutions using standard-
cell or gate-array based ASICs. We consider both the multiplier-less and the
hardware multiplier-based implementations of the weighted-sum computation.
To start with, we present a comprehensive framework that encapsulates tech-
niques for low power implementation of DSP algorithms on programmable
DSPs. These techniques complement one another and address power reduction
in various components such as the program and data memory busses and the
multiplier-accumulator datapath of a Harvard architecture based digital signal
processor. The techniques are then specialized for weighted sum computations
and then for FIR filters.
Next we present architectural transforms for power optimization for hardwired
implementation of FIR filters. Multirate architectures are presented as an
important and interesting transform. A detailed analysis of the computational
complexity of multirate architectures is presented with results that indicate sig-
nificant power savings compared to other FIR filter structures.
Distributed Arithmetic (DA) has been presented in the literature as one of
the approaches for multiplier-less implementation of weighted-sum computa-
tion. We present techniques for deriving multiple DA based structures that
represent different data-points in the area-delay space. We look at improving
area-efficiency of DA based implementations and specifically show how the
flexibility in coefficient partitioning can be exploited to reduce the area of a DA
structure using two look-up-tables. We also address the problem of reducing
power dissipation in the input data shift-registers of DA based FIR filters. Our
technique is based on a generic nega-binary representation scheme which is cus-
tomized for a given distribution profile of input data values, so as to minimize
toggles in the shift-registers.
For non-adaptive signal processing applications in which the weight val-
ues are constant and known at design time, an area-efficient realization can be
achieved by implementing the weighted sum computation using shift and add
operations. We present techniques for minimizing additions in such multiplier-
less implementations. These techniques are also useful for efficient implemen-
tation of weighted-sum computations on programmable processors that do not
support a hardware multiplier.
We address a special class of weighted-sum computation problems, where the
weight values are restricted to {0, 1, −1}. We present techniques for optimized
code generation of one dimensional and two dimensional multiplication-free
linear transforms. These are targeted to both register-rich and single-register,
accumulator based architectures.
Residue Number Systems (RNS) have been proposed for high-speed paral-
lel implementation of addition, subtraction and multiplication operations. We
explain how the power of RNS can be exploited for optimizing the implemen-
tation of weighted sum computations. In particular, RNS is proposed as a
method to enhance the results of other techniques presented in this book. RNS
is also proposed as a technique to enhance the precision of computations on a
programmable DSP.
To tie up all these techniques, a methodology is presented to systematically
identify transformations that exploit the characteristics of a given DSP
algorithm and of the implementation style to achieve tradeoffs in the
area-delay-power space.
MAHESH MEHENDALE
SUNIL D. SHERLEKAR
Bangalore
April 2001
Chapter 1
INTRODUCTION
Today's digitally networked society has seen the emergence of many appli-
cations that process and transceive information in the form of text, data, speech,
images, audio and video. Digital Signal Processing (DSP) is the key technology
enabling this digital revolution. With advances in semiconductor technology
the number of devices that can be integrated on a single chip has been growing
exponentially. Experts forecast that Moore's law of exponential growth in chip
density will hold good at least till the year 2010. By then, the minimum feature
size of 0.07 micron will enable the integration of as many as 800 million
transistors on a single chip [69]. As we move into the era of ULSI (Ultra Large
Scale Integration), the electronic systems which required multi-chip solutions
can now be implemented on a single chip. Single chip solutions are now avail-
able for applications such as Video Conferencing, DTADs (Digital Telephone
Answering Devices), cellular phones, pagers, modems etc.
1.1. An Example
As an example, consider the electronics of a Digital Still Camera (DSC) [26]
shown in figure 1.1. The system-level components are the CCD image sen-
sor, the A/D conversion front-end, the DSP engine for image processing and
compression and various interface and memory drivers.
Although there are no intrinsic real-time constraints for such a system, it has
performance requirements dictated by the need to have as short a shot-to-shot
delay as possible. Besides, many DSCs now have a provision of attaching an
audio clip with each picture which requires real-time compression and storage.
Of course, being a portable device, the most important constraint on the system
design is the need for low power to ensure a long battery life.
Figure 1.2 shows the DSP pipeline of the DSC [26]. The following blocks
are of particular interest:
[Figure 1.1: block diagram of the DSC engine - CCD input, image processing and image compression blocks, LCD display, NTSC/PAL output, Universal Serial Bus, RS232 serial interface and flash memory.]
• Fault pixel correction: Large pixel CCD-arrays may have defective pixels.
During the normal operation of the DSC, the image values at the faulty pixel
locations are computed using an interpolation technique.
• CFA interpolation: The nature of the front-end is such that only one of the
R, G or B values is available for each pixel. The other values need to be
interpolated from the neighboring pixels.
• Color space conversion: While the CCD sensor produces RGB values, typical
image compression techniques use YCrCb. These values are weighted
sums of the RGB values, as sketched below.
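As a concrete illustration of this weighted sum, the conversion can be written as below. This is a minimal sketch using the common ITU-R BT.601 weights, which are an illustrative choice and not taken from this book.

/* RGB to YCrCb color space conversion as a weighted sum of the
   RGB values (ITU-R BT.601 weights, shown for illustration). */
void rgb_to_ycrcb(double r, double g, double b,
                  double *y, double *cr, double *cb)
{
    *y  = 0.299 * r + 0.587 * g + 0.114 * b;
    *cr = 0.713 * (r - *y);  /* ~  0.500*r - 0.419*g - 0.081*b */
    *cb = 0.564 * (b - *y);  /* ~ -0.169*r - 0.331*g + 0.500*b */
}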
[Figure 1.2: the DSP pipeline of the DSC - optical lens, CFA-patterned CCD (R, G, B), analog processing and A/D, black clamp, lens distortion compensation, fault pixel correction, white balance, gamma correction, CFA interpolation, RGB to YCrCb color conversion, edge detection, false color suppression, JPEG compression, write to flash, auto focus, and scaling for monitor/LCD preview.]
[Figure 1.3: embedded DSP SoC design methodology - system specification, system partitioning supported by estimators, validation, and hardware and software synthesis using a library.]
Implementing such a system on a chip not only requires the integration of a
processor, peripherals and custom hardwired logic but also involves developing
the software that implements the desired functionality. These hardware and
software components are strongly interdependent and hence need to be co-developed.
Figure 1.3 shows the methodology for designing an embedded real-time dig-
ital signal processing SoC. The design process starts with a specification of
the system in terms of its functionality and the design constraints/objectives.
Various mechanisms exist to specify the system functionality. One approach
is to use a high-level specification language such as Silage [25] or Lustre [24].
CAD systems such as Ptolemy [79], DSP-Station 1 from Mentor Graphics and
COSSAP from Synopsys provide block diagram editors that support hierar-
chical system specification. The blocks represent various functions and their
interconnections represent the data flow. These systems support both the syn-
chronous dataflow (SDF) and the dynamic dataflow (DDF) models [10, 79]
for capturing the specifications of DSP algorithms. These environments also
provide a rich library of commonly used DSP functions such as filters, FFT,
linear transforms, matrix multiplication etc. This can significantly reduce the
time required to specify the functionality of an embedded DSP system.
Area, delay (performance) and power constitute the three important design
constraints for most systems.
The area constraint is driven primarily by considerations of cost. An area
efficient implementation results in a smaller die size and hence is more cost
effective. It also enables integrating more functionality on a single chip.
1 Product and company names appearing here and elsewhere in the book are trademarks owned by the
respective companies.
All the above requirements imply some sort of programmability. The downside
of a programmable implementation, however, is a penalty in terms of either
area or power or both.
Fortunately, there is an interesting reason why programmable implementations
are becoming increasingly feasible for DSP systems. With the advances
in semiconductor technology, digital circuits can be fabricated with increasing
chip density and can operate at increasing speeds. However, some of the fun-
damental properties of signals and systems in nature have remained more or
less the same. These include parameters such as the frequency range of signals
audible to the human ear, the frequency range of light visible to the human eye,
the duration of persistence of human vision etc. Thus more and more real-time
DSP functions which earlier required dedicated hardwired solutions can now
be implemented using a programmable processor for the same cost. Increasing
speeds means that power can be reduced by dropping the clock rate and the
operating voltage.
For any technology, however, a hardwired implementation is always more
efficient in area and power than a software one. The system design method-
ology must trade programmability for area and power by considering imple-
mentation technologies with varying degree of programmability. These range
from software programmable solutions offered by programmable processors
and hardware programmable solutions offered by FPGAs to dedicated hard-
wired functions implemented as standard cell or gate-array based ASICs.
An important step in the system design process, therefore, is to partition the
system into various components and decide on the implementation approach
for each component. This decision process involves determining whether to
implement a component in hardware or software, and also assigning area, delay
and power budgets so as to meet the system level design constraints. For a given
function to be implemented in hardware or software, multiple alternatives exist,
each representing a different data point in the area-delay-power space.
Most approaches to system partitioning [9, 21, 31] model it as a combinato-
rial optimization problem and use integer programming or other heuristic tech-
niques to arrive at a solution. However, these approaches assume the availabil-
ity of area, delay and power estimates for different implementation alternatives.
The quality of partitioning thus depends on the accuracy of these estimates. A
serious barrier to accurate estimation is that the deeper we move into submicron
geometries, the lesser the correlation between high-level descriptions (or
even optimized logic equations) and the size and speed of the circuit [69]. One
approach to address this limitation is to actually perform hardware/software
synthesis [71] and extract the design parameters. However, this method is time
consuming and can limit the search space that can be explored. The other ap-
proach is to partition the system in terms of pre-characterized library functions.
While it is virtually impossible to build a comprehensive library of functions
which can realize any system behavior, such an approach can successfully be
used for designing systems belonging to specific application domains such as
DSP. Most DSP systems can be characterized in terms of the core algorithms
or kernels they use. These include functions such as filtering (both Finite
Impulse Response (FIR) and Infinite Impulse Response (IIR) filtering), correlation
and linear transforms (matrix multiplication). All these perform the weighted-sum
(Σ A[i]·X[i]) as the core computation. This class of DSP algorithms forms the
focus of this book.
The optimality of a system partition can be greatly influenced by providing
a rich set of implementation alternatives. This book covers the entire solution
space, as shown in figure 1.4, for realizing weighted-sum based DSP kernels.
The figure represents implementation styles that offer varying degrees of
programmability and perform the weighted-sum computation with or without a
hardware multiplier.
[Figure 1.4: the implementation solution space - with a hardware multiplier: hardwired multiplier(s) and adder(s), or programmable digital signal processors; without a hardware multiplier: distributed arithmetic (DA) based implementation, implementation using adders and shifters, Residue Number System (RNS) based implementation, or processors with no dedicated hardware multiplier.]
Residue Number Systems (RNS) have been proposed for high-speed parallel
implementation of addition, subtraction and multiplication operations. Chapter
7 describes RNS based implementation of the weighted-sum computation and
presents transformations that aim at reducing area, delay and power dissipation
of the implementation. Chapter 7 also presents RNS as a transformation to
improve performance and reduce power dissipation of DSP algorithms which
need the data and the coefficients to have a higher bit precision than what is
supported by the target DSP architecture.
To tie up all these techniques, a methodology is presented in Chapter 8
to systematically identify transformations that exploit the characteristics of a
given DSP algorithm and of the implementation style to achieve tradeoffs in
the area-delay-power space.
Chapter 9 summarizes the key topics covered in this book to address the
VLSI synthesis and optimization of DSP kernels - primarily the weighted-sum
based kernels.
Chapter 2
PROGRAMMABLE DSP BASED IMPLEMENTATION
• Compute Intensive: Most DSP kernels are compute intensive with weighted-sum
being the core computation. A programmable DSP hence incorporates
a dedicated hardwired multiplier and its datapath supports a single cycle
multiply-accumulate (MAC) operation.
• Data Intensive: In most DSP kernels, each multiply operation of the weighted-sum
computation is performed on a new set of coefficient and data values.
A programmable DSP is hence pipelined with an operand read stage before
the execute stage, has an address generator unit that operates in parallel with
the execute datapath and uses a Harvard architecture with multiple busses
to program and data memory.
The rest of this chapter is organized as follows. Section 2.1 identifies the
main sources of power dissipation and develops measures for estimating power
dissipated in each of the sources. Various techniques for low power realization
of DSP algorithms are discussed in section 2.2. Section 2.3 presents algorithmic
and architectural transformations which are specific to weighted-sum
computation. Section 2.4 presents additional transformations for low power
realization of FIR (Finite Impulse Response) filters - a DSP kernel which is
based on weighted-sum computation. Finally, section 2.5 integrates various
transformations into a comprehensive framework for low power realization of
FIR filters on programmable DSPs.
The block diagram of a 4x4 bit parallel array multiplier is shown in figure 2.2. The
multiplier consists of AND gates to compute partial inner products and an
array of adders to compute the complete product. The power dissipation of a
multiplier is directly proportional to the number of switchings at all the internal
nodes of the multiplier. These are the outputs of the AND gates and of the
1-bit adders. The number of internal node switchings depends on the multiplier
input values. This dependence can be analyzed using the 'Transition Density'
measure of circuit activity.
'Transition Density' [68] of a signal is the average number of transitions/
toggles of the signal per unit time. Consider a combinational logic block with
inputs x1, x2, ..., xn and the output y. Let T_x1, T_x2, ..., T_xn be the transition
densities at the inputs and let P_x1, P_x2, ..., P_xn be the probabilities of the input
signal values being 1. Assuming the input values to be mutually independent,
the transition density at the output y is given by

T_y = Σ_{i=1}^{n} P(∂y/∂x_i) · T_xi    (2.2)

where ∂y/∂x_i is the Boolean difference of y with respect to x_i. For a two input
AND gate (y = a·b), the probability P_y of the output being 1 is given by (P_a · P_b).
The transition density at the output of a two input XOR gate (y = a ⊕ b) is given
by (T_y = T_a + T_b). These relationships indicate that the multiplier power
is directly dependent on the transition densities and the probabilities of the
multiplier inputs.
The transition densities of the multiplier inputs depend on the Hamming
distance between successive input values. The input signal probabilities depend
on the number of 1s in the input signal values of the multiplier. These two thus
form the measures of multiplier power dissipation.
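These two measures are easy to compute for a given input trace; a minimal sketch in C, with illustrative function names (not from the book):

#include <stdint.h>

/* Number of 1s in a value: correlates with input signal probability. */
static int popcount16(uint16_t v) {
    int n = 0;
    while (v) { n += v & 1; v >>= 1; }
    return n;
}

/* Hamming distance between successive bus values: correlates with
   the transition density at the multiplier inputs. */
static int hamming16(uint16_t a, uint16_t b) {
    return popcount16(a ^ b);
}

/* Accumulate both measures over a sequence of multiplier inputs. */
void multiplier_power_measures(const uint16_t *x, int n,
                               long *ones, long *hdist) {
    *ones = 0; *hdist = 0;
    for (int i = 0; i < n; i++) {
        *ones += popcount16(x[i]);
        if (i > 0) *hdist += hamming16(x[i - 1], x[i]);
    }
}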
It can also be noted in figure 2.2 that the transitions in input bits B0 and
B1 affect more internal nodes than the transitions in input bit B3. In general,
transitions in lower order bits of the input signal contribute more to the multiplier
power dissipation than the higher order bits. Thus while minimizing transition
densities of all the input bits is important, higher gains can be achieved by
focusing on the lower order bits of the input signals.
These measures have been experimentally verified by simulating an 8x8
parallel array multiplier. One input of the multiplier was kept constant and
1000 random numbers were fed to the other input. The total toggle count at all
the internal nodes and the inputs of the multiplier was measured. The toggle
count measurement was carried out for all 256 (0 to 255) values of the constant.
This data was then used to compute the average toggle count as a function of
the number of 1s in the constant input. This relationship, shown in figure 2.3,
confirms the analysis that the multiplier power is a direct function of the number
of 1s in its inputs.
The second experiment used sets of 1000 random numbers such that the
Hamming distance between consecutive numbers within a set was constant.
Seven such sets of numbers were generated, corresponding to seven Hamming
distance values (1 to 7). The total toggle count was measured by applying these
random numbers to one input of the multiplier while keeping the other input
constant. The toggle count vs Hamming distance relationship for four different
constants is shown in figure 2.4. It confirms the analysis that the multiplier
power is a direct function of the Hamming distance between its successive
inputs.
In addition to the array multiplier, other multiplier topologies based on Booth
encoding and the Wallace tree are also common in programmable DSPs [42]. While
the measure of Hamming distance between successive inputs applies to all these
topologies, the measure based on input data pattern may vary across topologies.
For example, power analysis of a Booth multiplier [36] shows that the power
dissipation is directly dependent on the number of 1s in the Booth encoded
input. This chapter focuses on techniques for reducing the Hamming distance
between the successive inputs of the multiplier. These techniques are hence applicable
to all multiplier topologies.
Figure 2.3. Toggle Count as a Function of Number of Ones in the Multiplier Inputs

Figure 2.4. Toggle Count as a Function of Hamming Distance between Successive Inputs
Figure 2.5. Sum of Total Hamming Distance and Adjacent Opposite-Direction Toggles as a Function of the Start Address
The power dissipation in the address busses can thus be reduced by appropriately
selecting the start locations of the code segments. The same technique
can also be applied for storing coefficients and data values in the memory for
weighted-sum computation, in which these are accessed sequentially. Figure 2.5
shows the sum of the total Hamming distance and the total number of adjacent signals
toggling in opposite directions in the consecutive addresses, as a function of the start
location, for a 24 word memory block. The analysis shows that a start address of
0x14 results in 14% more power dissipation in the address busses compared to
the start address of 0x00. The power dissipation in the address busses can
thus be reduced by aligning the start addresses for the program, coefficient and
data blocks with the beginning of a memory page. The capacitive loading for
the address bus transitions, and hence the power dissipation, can also be reduced
by storing the most frequently accessed program segments, coefficients and
data in the on-chip memory.
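The cost metric used in figure 2.5 can be evaluated with a short routine; a minimal sketch under the stated cost model (Hamming distance plus adjacent lines toggling in opposite directions), with illustrative names:

#include <stdint.h>

/* Cost of one address transition: Hamming distance plus the number of
   adjacent bit pairs toggling in opposite directions (cross-coupling). */
static int transition_cost(uint32_t a, uint32_t b, int bits) {
    uint32_t t = a ^ b;
    int cost = 0;
    for (int i = 0; i < bits; i++)
        if ((t >> i) & 1) cost++;          /* Hamming distance term */
    for (int i = 0; i + 1 < bits; i++) {
        int ti = (t >> i) & 1, tj = (t >> (i + 1)) & 1;
        /* opposite direction: both lines toggle and their new values differ */
        if (ti && tj && (((b >> i) & 1) != ((b >> (i + 1)) & 1)))
            cost++;
    }
    return cost;
}

/* Total cost of sequentially accessing 'words' locations from 'start'. */
long sequential_access_cost(uint32_t start, int words, int bits) {
    long total = 0;
    for (int k = 1; k < words; k++)
        total += transition_cost(start + k - 1, start + k, bits);
    return total;
}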
2.2.2.2 T0 Coding
The power dissipation in the address busses during sequential access can be
further reduced by using the asymptotic zero-transition encoding referred to as
T0 coding in [8]. Figure 2.9 shows the memory access scheme based on T0
coding.
Figure 2.7. Memory Reorganization to Support Gray Coded Addressing
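Gray coding guarantees that consecutive binary values differ in exactly one bit, so a sequential access causes a single address line transition. A minimal conversion sketch (standard formulas; not code from the book):

#include <stdint.h>

/* Binary to Gray: adjacent sequential values differ in exactly one bit. */
uint32_t bin2gray(uint32_t b) { return b ^ (b >> 1); }

/* Gray to binary: undo the prefix XOR with doubling shifts. */
uint32_t gray2bin(uint32_t g) {
    for (uint32_t s = 1; s < 32; s <<= 1)
        g ^= g >> s;
    return g;
}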
At the beginning of the series of sequential accesses, the processor sends the
start location on the address bus and the same is loaded in the counter associated
with the memory. On subsequent sequential accesses the counter is incremented
locally while the address bus is held unchanged, asymptotically eliminating
address bus transitions.

[Figure 2.9: memory access scheme based on T0 coding - a counter with increment control between the CPU and the program/coefficient memory.]
2.2.3 Instruction Buffering
[Figures: instruction buffering schemes - an instruction buffer placed between the program memory and the CPU's decode logic, and a decoded instruction buffer placed between the decode logic and the execute logic.]
2.2.4 Memory Architectures for Low Power
[Figures 2.13 and 2.14: low power memory architectures - an even/odd interleaved memory pair clocked at CLK/2 through a toggle flip-flop, and a double-width memory with a prefetch buffer, accessed at CLK/2 while the CPU receives one word every cycle.]
During sequential access, an interleaved even/odd memory organization allows
each memory bank to be clocked at half the CPU clock, resulting in power
reduction. Figure 2.13 shows such a memory architecture.
The property of sequential access can also be exploited by using a wider
memory and reading two words per memory access. The data can be stored in
a pre-fetch buffer such that while the memory is accessed at half the CPU clock
rate, the CPU gets the data on every cycle during sequential access. Figure 2.14
shows such a memory architecture.
It can be noted that this scheme can be generalized such that for B bit
data, the memory width can be set to N*B to read N words per memory access
and consequently clock the memory at 1/N times the CPU clock, with the
prefetch buffer widened accordingly.
2.2.5 Bus Bit Reordering
[Figure 2.15: bus bit reordering between the program/coefficient memory and the CPU - address lines A0 to A7 rerouted with a reordering span of ±2.]
Table 2.1. Adjacent Signal Transitions in Opposite Direction as a Function of the Bus-reordering Span

#taps   Initial   ±1   ±2   ±3   ±4   ±5   ±6   ±7   ±8
24        38      32   18   18   16   12    8    8    8
27        38      30   24   16    4    4    4    4    4
32        26      14   12   10   10   10   10   10   10
36        32      20   14   10    8    8    8    8    8
40        40      24   24   18   18   18   18   18   16
64        62      38   38   34   34   28   22   22   22
72        68      54   40   40   40   40   40   40   40
96        84      74   64   60   58   54   54   54   54
128      112      94   84   84   78   78   78   78   78
The problem of finding the optimum bit order can then be mapped onto the
problem of finding the lowest cost Hamiltonian path in an edge-weighted graph,
i.e. the traveling salesman problem.
As can be noted from figure 2.15, the bus bit reordering scheme has the
downside of increasing the bus netlength and hence the interconnect capacitance.
This overhead can be minimized if the reordering span for each bus
bit is kept within a limit. For example, the bus reordering scheme shown in
figure 2.15 uses a reordering span of ±2. The optimum bit order thus needs
to satisfy the constraint in terms of the maximum reordering span. This is
achieved by suitably modifying the edge-weights such that all edge weights W_{i,j}
are made infinite if |i − j| > MaxSpan.
The algorithm starts with the normal order as the initial order. It uses a
hill-climbing based iterative improvement approach to arrive at the optimum bit
ordering. During each iteration a new feasible order is derived and is accepted
as a new solution if it results in a lower cost function (i.e. a lower number of
adjacent signal transitions in opposite directions), as sketched below.
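A sketch of this iterative improvement loop; all names are illustrative, cost() is assumed to return the adjacent opposite-direction toggle count over a representative trace of bus values, and the neighbor move swaps two bit positions subject to the MaxSpan constraint:

/* Hill-climbing search for a bus bit order minimizing adjacent
   opposite-direction toggles. order[i] gives the physical position of
   logical bit i; a candidate order is rejected if any bit moves more
   than max_span from its normal position. */
void reorder_bits(int *order, int bits, int max_span,
                  long (*cost)(const int *order)) {
    long best = cost(order);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int i = 0; i < bits; i++) {
            for (int j = i + 1; j < bits; j++) {
                /* try swapping positions i and j */
                int t = order[i]; order[i] = order[j]; order[j] = t;
                int feasible = 1;
                for (int k = 0; k < bits; k++)
                    if (order[k] - k > max_span || k - order[k] > max_span)
                        feasible = 0;
                long c = feasible ? cost(order) : best + 1;
                if (c < best) { best = c; improved = 1; }
                else { t = order[i]; order[i] = order[j]; order[j] = t; }
            }
        }
    }
}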
The impact of bit reordering on the power reduction was analyzed in the
context of a DSP code that performs FIR filtering. Nine filters with the number
of taps ranging from 24 to 128 were used. For each case, the algorithm was
applied with the reordering span constraint ranging from ±1 to ±8.
The results in table 2.1 show a significant reduction in the number of
adjacent signal transitions in opposite directions. The reduction increases
with the bus reordering span, as expected. However, as mentioned earlier,
a higher reordering span implies a higher interconnect length.
Figure 2.16. %Reduction in the Number of Adjacent Signal Transitions in Opposite Directions as a Function of the Bus Reordering Span

Figure 2.16 plots the average percentage reduction as a function of the bus
reordering span. As can be seen from the plot, the incremental saving in the
number of adjacent signal transitions gets smaller beyond the reordering span
of ±4. For the span of ±4, the cross-coupling related power dissipation in the
program memory data bus reduces on the average by 54%. A span of ±4 hence
offers a good tradeoff between power reduction and interconnect overhead.
Figure 2.17. Coefficients of a 32 Tap Linear Phase Low Pass FIR Filter
Table 2.2. Impact of Selective Coefficient Negation on Total Number of 1s in the Coefficients
2.3.2 Coefficient Ordering
A four term weighted sum, for example, can be computed either as

Y[n] = A[0]·X[0] + A[1]·X[1] + A[2]·X[2] + A[3]·X[3]    (2.3)

or as

Y[n] = A[1]·X[1] + A[3]·X[3] + A[0]·X[0] + A[2]·X[2]    (2.4)

The weighted sum computation also does not impose any restriction on
how the coefficient and data values are stored. The address generator needs to
comprehend the locations and generate the correct pair of addresses (to access
the coefficient and the corresponding data sample value) for each product
computation. The order of coefficient-data product computation directly affects the
sequence of coefficients appearing on the coefficient memory data bus. This
order thus determines the power dissipation in the bus.
The following subsection formulates the problem of finding an optimum or-
der of the coefficients such that the total Hamming distance between consecutive
coefficients is minimized.
For a coefficient order A[π(0)], A[π(1)], ..., A[π(N−1)], the cost to be minimized
is the total Hamming distance

HD_total = Σ_{i=0}^{N−2} HD(A[π(i)], A[π(i+1)])    (2.6)
The algorithms proposed [33] to solve this class of traveling salesman problems
include nearest neighbor, nearest insertion, farthest insertion, cheapest
insertion, nearest merger etc. Experiments with various low pass FIR filters
show that in almost all cases, the nearest neighbor algorithm performs the best.
Procedure Order-Coefficients-for-Low-Power
Inputs: N coefficients A[0] to A[N-1]
Output: A coefficient order which results in minimum total Hamming distance
between successive coefficient values
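The body of the procedure is not reproduced in this extract; below is a minimal nearest neighbor sketch consistent with the stated inputs and output (all names illustrative):

#include <stdint.h>

static int hd(uint16_t a, uint16_t b) {
    int n = 0;
    for (uint16_t t = a ^ b; t; t >>= 1) n += t & 1;
    return n;
}

/* Nearest neighbor ordering: start from coefficient 0 and repeatedly
   append the unused coefficient closest (in Hamming distance) to the
   last one chosen. 'order' receives a permutation of 0..n-1. */
void order_coefficients(const uint16_t *a, int n, int *order) {
    int used[64] = {0};          /* sketch assumes n <= 64 */
    order[0] = 0; used[0] = 1;
    for (int k = 1; k < n; k++) {
        int best = -1, bestd = 1 << 30;
        for (int j = 0; j < n; j++) {
            if (!used[j] && hd(a[order[k - 1]], a[j]) < bestd) {
                bestd = hd(a[order[k - 1]], a[j]);
                best = j;
            }
        }
        order[k] = best; used[best] = 1;
    }
}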
Table 2.3. Impact of Coefficient Ordering on Hamming Distance and Adjacent Toggles
Since selective coefficient negation also helps in reducing the total Hamming
distance between the successive coefficient values, it can be applied in
conjunction with coefficient ordering to achieve further power reduction.
2.3.3 Adder Input Bit Swapping
Consider two successive additions performed on the adder input busses:

       (in1)   (in2)
Y1 =   0011  +  1100
Y2 =   0100  +  1011

Since each 1-bit adder slice is symmetric in its two operand bits, the bits of
the two operands can be swapped position by position without changing the
sum. Swapping the bits of the second operand pair wherever doing so preserves
the previous bus values gives

       (in1)   (in2)
Y1 =   0011  +  1100
Y2 =   0011  +  1100
This computation results in zero toggles in both the databusses and consequently
has no pairs of adjacent signals toggling in opposite direction. As can be seen
from this example, appropriate bit swapping can significantly reduce power
dissipation.
Figure 2.18 shows a scheme to perform the bit swapping so as to minimize
the toggle count. The scheme compares, for every bit, the new value with the
current value, and performs bit swapping if the two values are different, as
sketched below.
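A behavioral sketch of the per-bit swap decision (illustrative names; the hardware of figure 2.18 realizes this with multiplexers and exclusive-or gates):

#include <stdint.h>

/* Swap bits between the two adder operands, position by position, so
   that each input bus re-uses its previous value wherever possible.
   The sum is unchanged because a 1-bit adder slice is symmetric in
   its two operand bits. */
void swap_adder_inputs(uint16_t *in1, uint16_t *in2,
                       uint16_t prev1, uint16_t prev2) {
    uint16_t a = *in1, b = *in2;
    for (int i = 0; i < 16; i++) {
        uint16_t m = (uint16_t)(1u << i);
        /* if the incoming bit differs from bus 1's previous value but the
           other operand's bit matches it, route the bits crosswise */
        if (((a ^ prev1) & m) && !((b ^ prev1) & m)) {
            uint16_t t = (a ^ b) & m;   /* nonzero only if the bits differ */
            a ^= t; b ^= t;
        }
    }
    *in1 = a; *in2 = b;
}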
As can be seen from figure 2.18, the reduction in the toggles in the adder
inputs is achieved at the expense of additional logic i.e. the multiplexers and
the exclusive-or gates. The power dissipated in this logic offsets power savings
in the adder and its input busses. The final savings depend on the data values
being accumulated and also on the relative capacitance of the adder input busses
and the multiplexer inputs.
To evaluate the effectiveness of the input bit swapping technique for power
reduction in the adder and its input busses, 1000 random number pairs were
generated with bit widths of 8, 12 and 16. Table 2.4 gives the results in terms
of total Hamming distance between consecutive data values and total number
of adjacent signals toggling in opposite direction, in both the busses. As can
be seen from the results the proposed scheme saves more than 25% power in
the two input data busses of the adder and also results in power savings in the
adder itself.
Figure 2.18. Scheme for Reducing Power in the Adder Input Busses
Table 2.4. Power Optimization Results Using Input Bit Swapping for 1000 Random Number
Pairs
Figure 2.19. Data Flow Graph of a Weighted-sum Computation with Coefficient Symmetry
The corresponding data flow graph is shown in figure 2.19. While the core
computation in equation 2.9 is also multiply-accumulate, the coefficient is
multiplied with the sum of two input samples. Architectures such as the one shown in
figure 2.1 do not support single cycle execution of this computation. While
it is possible to compute the data sum and use it to perform the MAC, the resultant
code would require more cycles and more data memory accesses than
the direct implementation of equation 2.8, which ignores coefficient symmetry.
Figure 2.20 shows a suitable abstraction of the datapath of the TMS320C54x
DSP [97] that supports single-cycle execution (FIRS instruction) of the multiply-accumulate
computation of equation 2.9.
This architecture has an additional data read bus which enables fetching the
coefficient and the two data values in a single cycle.
[Figure 2.20: datapath abstraction with a program counter, program/coefficient memory, data memory, two data read address registers and a data write address register feeding the CPU.]
Its datapath has an adder and a MAC unit, so that the sum of the input data samples and the multiply-accumulate
operation can be performed simultaneously in a single cycle. Since
the computational complexity of equation 2.9 is lesser than that of equation 2.8,
the corresponding implementation of equation 2.9 is significantly more power
efficient.
An N tap FIR filter computes the output as

Y[n] = Σ_{i=0}^{N−1} A[i] · X[n − i]    (2.10)
The weights (A[i]) in the expression are the filter coefficients. The number
of taps (N) and the coefficient values are derived so as to satisfy the desired filter
response in terms of passband ripple and stopband attenuation. Unlike IIR filters,
FIR filters are all-zero filters and are inherently stable [73]. FIR filters with
symmetric coefficients (A[i] = A[N-1-i]) have a linear phase response [73] and
are hence an ideal choice for applications requiring minimal phase distortion.
While the techniques described in the earlier two sections can be applied in
the context of FIR filters, this section describes additional low power techniques
specific to FIR filters.
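For reference, a direct form implementation of equation 2.10 in C; a minimal sketch (names illustrative) of the weighted-sum kernel that the rest of this section transforms:

/* Direct form FIR filter: y[n] = sum of A[i] * x[n - i], i = 0..N-1.
   'state' holds the N most recent input samples, newest first. */
double fir_direct(double *state, const double *a, int n, double x_in) {
    /* shift the delay line and insert the new sample */
    for (int i = n - 1; i > 0; i--)
        state[i] = state[i - 1];
    state[0] = x_in;

    /* weighted-sum (multiply-accumulate) loop */
    double y = 0.0;
    for (int i = 0; i < n; i++)
        y += a[i] * state[i];
    return y;
}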
Consider the multirate architecture shown in figure 2.22. Assuming an even
number of taps, each of the sub-filters is of length (N/2) and hence requires N/2
multiplications and (N/2)−1 additions. There are four more additions required
to compute the two outputs Y0 and Y1. This architecture hence requires 3N/4
multiplications per output, which is less than the direct form architecture for all
values of N, and requires (3N+2)/4 additions per output, which is less than the
direct form architecture for ((N − 1) > (3N + 2)/4), i.e. (N > 6).
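For example, for N = 32 the direct form structure requires 32 multiplications and 31 additions per output, while the multirate architecture requires 3N/4 = 24 multiplications and (3N+2)/4 = 24.5 additions per output (the half arises because the four extra additions are shared between the two outputs Y0 and Y1).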
Table 2.5 shows the implementation of a direct form FIR filter on the TMS320C2x
[40, 95]. The coefficients are stored in the program memory and the data is
stored in the data memory.
The power reduction due to multirate architecture based FIR filter implementation
can be analyzed as follows. Since the multirate architecture requires
fewer cycles, the frequency can be lowered in proportion to the cycle counts:

f_multirate / f_direct = Cycles_multirate / Cycles_direct    (2.14)

With the lowered frequency, the processor gets more time to execute the instructions.
This time-slack can be used to appropriately lower the supply voltage,
using the CMOS delay relationship:

T_delay ∝ V_DD / (V_DD − V_T)²    (2.15)

Since most programmable DSPs [95, 96, 97] are implemented using a fully
static CMOS technology, such voltage scaling is indeed possible.
In terms of capacitance, the main computation loop in the direct form realization
requires N multiplications, N additions and N memory reads. The
multirate implementation has three computation loops corresponding to the
three sub-filters. These loops require 3N/4 multiplications, 3N/4 additions and
3N/4 memory reads per output. Based on this observation,

C_total_multirate / C_total_direct ≈ 0.75
C_multirate / C_direct ≈ (0.75 × 4 × (N + 19)) / (3N + 82)

Based on this analysis, for a 32 tap FIR filter,

f_multirate / f_direct = (3 × 32 + 82) / (4 × (32 + 19)) = 0.87
For this lowering of frequency, based on equation 2.15, the voltage can be
reduced from 5 volts to 4.55 volts.
Thus using the multirate architecture, the power dissipation of a 32 tap FIR filter
implemented on the TMS320C2x processor can be reduced by 38%. Similar
analysis for TMS320C5x processor based implementation shows the power
reduction by 35%.
Figure 2.23. Normalized Power Dissipation as a Function of Number of Taps for the Multirate FIR Filters Implemented on TMS320C2x

Figure 2.23 shows the power dissipation as a function of the number of taps for the
multirate FIR filters implemented on TMS320C2x. The power dissipation is
normalized with respect to the direct form FIR structure. As can be seen from
the figure, the power dissipation reduces with increasing order of the filter. The
power savings can be as much as 40% for filters with more than 42 taps.
All the approaches presented so far assume a given set of coefficient values
that meet the desired filter response. The following two techniques optimally
modify the filter coefficients such that they result in power reduction while still
meeting the desired filter response.
Thus the coefficients of the scaled filter are given by (K · A[i]). Given the
allowable range of scaling (e.g. ±3 db), an optimal scaling factor K can be
found such that the total Hamming distance between consecutive coefficient
values is minimized. This technique thus reduces the power dissipation in the
coefficient memory data bus and also the multiplier.
Due to the finite precision effects, the scaled coefficients may in some cases
violate the filter characteristics. This can be avoided by scaling the full pre-
cision coefficients and then quantizing them to the desired number of bits. It
is verified that the scaled coefficients satisfy the desired filter characteristics
before accepting them.
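A sketch of this scaling search, assuming helper routines quantize(), total_hd() and meets_response() (all illustrative; the ±3 db range and the 0.01 db step are arbitrary choices):

#include <math.h>
#include <stdint.h>

extern uint16_t quantize(double c);                  /* to B-bit fixed point */
extern long total_hd(const uint16_t *c, int n);      /* sum of successive HDs */
extern int meets_response(const uint16_t *c, int n); /* filter spec check */

/* Scan scale factors K in the allowed gain range and keep the one that
   minimizes total Hamming distance while still meeting the response. */
double best_scale(const double *a, int n, uint16_t *out) {
    double best_k = 1.0;
    long best_cost = -1;
    uint16_t c[256];                                 /* sketch: n <= 256 */
    for (double db = -3.0; db <= 3.0; db += 0.01) {
        double k = pow(10.0, db / 20.0);
        for (int i = 0; i < n; i++)
            c[i] = quantize(k * a[i]);               /* scale, then quantize */
        if (!meets_response(c, n)) continue;         /* reject spec violations */
        long cost = total_hd(c, n);
        if (best_cost < 0 || cost < best_cost) {
            best_cost = cost; best_k = k;
            for (int i = 0; i < n; i++) out[i] = c[i];
        }
    }
    return best_k;
}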
In the case of the steepest descent strategy, for every coefficient its nearest higher
and nearest lower coefficient values are identified. A new set of coefficients
can be formed by replacing one of the coefficients with its nearest higher or
nearest lower value. This approach is used to generate 2N sets of coefficients
for an N tap filter during each iteration of the optimization process. From the
2N sets of coefficients, the coefficient set that maximizes the gain function is
selected and is used as the current set of coefficients for the next iteration. The
gain function γ is computed as follows:

γ = Tolerance · HD_red
Tolerance = (Pdb_req − Pdb)/Pdb_req + (Sdb − Sdb_req)/Sdb_req

where
HD_red is the reduction in the total Hamming distance for the new set of coefficients
compared to the total Hamming distance for the current set of coefficients,
Pdb_req is the desired passband ripple,
Sdb_req is the desired stopband attenuation,
Pdb is the passband ripple of the new set of coefficients, and
Sdb is the stopband attenuation for the new set of coefficients.
In case of the additional requirement of retaining the linear phase response, the
filter coefficients are perturbed in pairs (A[i], A[N−i−1]) to maintain symmetry.
Thus for an N tap filter, N different sets of coefficients are generated during
each iteration and the set that maximizes the gain function γ is selected.
In the case of the first improvement strategy, the optimization quality depends on
the order in which the coefficients are perturbed. The coefficient order is
randomized, and for a selected coefficient, whether to search for the nearest higher or
nearest lower value is also selected randomly. During each iteration, the first
perturbation that reduces the Hamming distance and satisfies the filter characteristics
is accepted and is used to form the current set of coefficients for the next
iteration. The dependence on the coefficient order is minimized by generating
5 or 10 different coefficient orders and selecting the one that results in the least
total Hamming distance.
Procedure Optimize-Coefficients-for-Low-Power
Inputs: Low pass filter characteristics in terms of passband ripple Pdb_req and
stopband attenuation Sdb_req. An initial set of N filter coefficients A[0] to A[N-1]
that meet the specified filter response.
Output: An updated set of filter coefficients A[0] to A[N-1] which minimize
total Hamming distance between successive coefficient values and still meet
the desired filter characteristics.
repeat {
    for each coefficient A[i] (i=0,N-1) {
        Find a coefficient value A[i+] such that:
            (HD(A[i],A[i-1]) + HD(A[i],A[i+1])) > (HD(A[i+],A[i-1]) + HD(A[i+],A[i+1]))
            and (A[i+] - A[i]) is minimum
        Generate a new set of coefficients by replacing A[i] with A[i+]
        Compute the passband ripple (Pdb_i+) and the stopband attenuation (Sdb_i+)
        if (Pdb_i+ < Pdb_req) and (Sdb_i+ > Sdb_req) {
            Find the tolerance given by
                Tol_i+ = (Pdb_req - Pdb_i+)/Pdb_req + (Sdb_i+ - Sdb_req)/Sdb_req
        } else { Tol_i+ = 0 }
        Find a coefficient value A[i-] such that:
            (HD(A[i],A[i-1]) + HD(A[i],A[i+1])) > (HD(A[i-],A[i-1]) + HD(A[i-],A[i+1]))
            and (A[i] - A[i-]) is minimum
        Generate a new set of coefficients by replacing A[i] with A[i-]
        Compute the passband ripple (Pdb_i-) and the stopband attenuation (Sdb_i-)
        if (Pdb_i- < Pdb_req) and (Sdb_i- > Sdb_req) {
            Find the tolerance given by
                Tol_i- = (Pdb_req - Pdb_i-)/Pdb_req + (Sdb_i- - Sdb_req)/Sdb_req
        } else { Tol_i- = 0 }
    }
    Find the coefficient value among the A[i+]'s and A[i-]'s for which
    the gain function γ given by (Tolerance · HD_reduction) is maximum.
    if (γ > 0) {
        Replace the original coefficient with the new value
    } else { Optimization_possible = FALSE }
} until (!Optimization_possible)
The above algorithm can be easily modified to handle the additional requirement
of retaining the linear phase characteristics. This can be achieved by
modifying both A[i] and A[N-1-i] with A[i+] (and later with A[i-]) to generate
the new set of coefficients, and searching only the first (N+1)/2 coefficients
during each iteration.
The 'first improvement' based version of the algorithm uses a random
number generator to pick a coefficient (A[i]) for perturbation and also to
decide whether the A[i+] or the A[i-] value needs to be considered. The new coefficient
value is accepted if the new values of passband ripple and stopband attenuation
are within the allowable limits. The optimization process stops when no
coefficient is perturbed for the specified number of iterations.
The techniques of coefficient scaling and coefficient optimization were applied
to the following six low pass FIR filters.
Table 2.7. Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed
by Steepest Descent and First Improvement Optimization with No Linear Phase Constraint
Figure 2.26 shows the frequency domain characteristics of the 24 tap FIR
filter for three sets of coefficients corresponding to the initial solution, optimization
with no linear phase constraint and optimization with the linear phase constraint.
Table 2.8. Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed
by Steepest Descent and First Improvement Optimization with Linear Phase Constraint

Table 2.9. Hamming Distance and Adjacent Signal Toggles for Steepest Descent and First Improvement
Optimization with and without Linear Phase Constraint (with No Coefficient Scaling)
The results show that the algorithm using both scaling and coefficient optimization
with no linear phase constraint results in up to 36% reduction in the
total Hamming distance and up to 88% reduction in the total number of adjacent
signal toggles. Similar savings are achieved even with the linear phase
constraints.
Figure 2.26. Frequency Domain Characteristics of a 24 Tap FIR Filter Before and After Optimization
[Figure: low pass filter magnitude response template with passband ripple δ_1 and stopband attenuation δ_2.]

Passband frequency: ω_p
Stopband frequency: ω_s
Passband ripple: 2δ_1, i.e. the passband magnitude response lies between (1 − δ_1) and (1 + δ_1)
Stopband attenuation: δ_2
N: the number of coefficients
B: the number of bits of precision for the fixed point representation of the coefficients
H_n: the value of the n'th coefficient
Variables
The coefficient bit values form the variables, where C_{n,b} is the value of the b'th
bit of the n'th coefficient and C_{n,b} ∈ {0, 1}.
Objective Function
Let HD_{n,b} be the Hamming distance between bits C_{n,b} and C_{n+1,b}, given by
HD_{n,b} = C_{n,b} ⊕ C_{n+1,b}
The same can be represented as −HD_{n,b} ≤ (C_{n,b} − C_{n+1,b}) ≤ HD_{n,b}
The objective function then is to minimize Σ_{n=0}^{N−2} Σ_{b=0}^{B−1} HD_{n,b}
Constraints
The coefficient values can be computed as follows:
H_n = −C_{n,0} + Σ_{b=1}^{B−1} C_{n,b} · 2^{−b}
Given the filter coefficients, the magnitude response at a given frequency ω can
be computed using the following equations:
For N odd,
F_ω = Σ_{k=0}^{(N−1)/2} a[k] cos(ωk)
where a[0] = H[(N−1)/2] and
a[k] = 2H[(N−1)/2 − k], k = 1, 2, ..., (N−1)/2
For N even,
F_ω = Σ_{k=1}^{N/2} b[k] cos(ω(k − 1/2))
where b[k] = 2H[N/2 − k], k = 1, 2, ..., N/2
The frequency response should meet the following constraints:
For all frequencies ω: 0 ≤ ω ≤ ω_p → (1 − δ_1) ≤ F_ω ≤ (1 + δ_1)
For all frequencies ω: ω_s ≤ ω ≤ π → F_ω ≤ δ_2
Any of the available 0-1 programming packages can be used to arrive at C_{n,b}
values that satisfy the filter characteristics and minimize the total Hamming
distance between successive coefficients.
[Figure 2.28 (flowchart), decision points and associated techniques: ±3 db gain acceptable? → multirate architecture (reduced computational complexity, up to 40% overall power reduction); coefficient scaling and coefficient optimization (up to 35% reduction in coefficient data bus power, with power savings in the multiplier); architecture support for single cycle multiply-add/multiply-subtract in a repeat loop, and for non-sequential data addressing? → selective coefficient negation and coefficient ordering (up to 88% reduction in adjacent signal toggles); pipelined architecture with high capacitance busses feeding the ALU? → Gray/T0 coded addressing and bus-invert coding (up to 50% reduction in the address busses plus power reduction in the data busses); embedded DSP with control over routing of memory-CPU busses? → adder input bit swapping (up to 25% reduction in the adder input busses plus power reduction in the adder). Result: low power FIR filter on a programmable DSP.]

Figure 2.28. Framework for Low Power Realization of FIR Filters on a Programmable DSP
Chapter 3
IMPLEMENTATION USING HARDWARE MULTIPLIER(S) AND ADDER(S)
Parallel processing and pipelining allow the supply voltage to be lowered while
maintaining the same throughput, thus reducing the power, which is proportional to
the square of the supply voltage. This, however, is achieved at the expense of
significant silicon area overhead. Thus for an implementation with a fixed and
limited number of hardware resources, these techniques do not offer a significant
advantage.
The architectures that reduce the computational complexity of FIR filters include
block FIR implementations [41, 77] and multirate architectures [66]. The
algorithm for block FIR filters presented in [77] performs transformations on
the direct form state space structure. It reduces the number of multiplications
at the expense of an increased number of additions. Since the multiplier area and
delay are significantly higher than the adder area and delay, these transformations
result in low power FIR implementation. Block FIR filters are typically
used for filters of lower order. Their structures are not as regular as the direct
form structure, which results in control logic overhead in their implementations.
The multirate architectures [66] reduce the computational complexity of the
FIR filter while partially retaining the direct form structure. These architectures
can hence enable low power FIR realization on a programmable DSP and
also as a dedicated ASIC implementation. The basic two level decimated multirate
architecture was presented in the previous chapter; this chapter provides
a more detailed analysis of the computational complexity of various multirate
architectures and also evaluates their effectiveness in reducing power dissipation
of linear phase FIR filters.
Differential Coefficients Method [84] is another approach for reducing com-
putational complexity and hence the power dissipation in hardwired FIR filters.
The filter structure transformed using this method requires multiplication with
coefficient differences having lesser precision than the coefficients themselves.
Since the coefficient differences are stored for use in future iterations, this
method results in significant memory overhead.
Figure 3.1. Direct Form FIR Filter Structure

Figure 3.2. Scheduled DFG Using One Multiplier and One Adder

Figure 3.3. Scheduled DFG Using One Pipelined Multiplier and One Adder
As can be seen from the figure, with one level loop unrolling the delay per
output computation reduces to 5T, thus enabling further lowering of supply
voltage and hence further power reduction to achieve the throughput of 9T per
output.
Retiming has been presented in the literature as a transform that reduces the
critical path delay and hence the power dissipation. The direct form structure
shown in figure 3.1 has a critical path delay of 5T (three adders and one
multiplier). In general, a direct form structure of an N tap filter has a critical path
delay of one multiplier and (N−1) adders. The re-timing transform has the same
effect as applying the transposition theorem and results in the multiple constant
multiplication (MCM) structure shown in figure 3.5.
As can be seen from the figure this structure has a critical path delay of
one multiplier and one adder. While this critical path is significantly smaller
than the direct form structure, it can be truly exploited only if the filter is to be
implemented using many multipliers and adders.
Figure 3.6 shows the scheduled data flow graph of the re-timed filter using
one pipelined multiplier and one adder. As can be seen from the figure, this
structure has a delay of 5T which is marginally lesser than the delay of 6T for
the direct form structure shown in figure 3.2.
Figure 3.4. Loop Unrolled DFG Using One Pipelined Multiplier and One Adder
Figure 3.5. Multiple Constant Multiplication (MCM) Based FIR Structure
The delay per FIR filter output computation can also be reduced by using
multiple functional units. This can be considered as parallel processing at a
micro level. Figure 3.7 shows the scheduled data flow graph of the direct form
structure that uses two pipelined multipliers and one adder.
Figure 3.6. MCM DFG Using One Pipelined Multiplier and One Adder
Figure 3.7. Direct Form DFG Using Two Pipelined Multipliers and One Adder
Figure 3.8. MCM DFG Using Two Pipelined Multipliers and Two Adders
As can be seen from the figure, with one more multiplier the delay per output
does reduce to 5T. It is also interesting to note that for the re-timed, MCM
based structure, the delay continues to be 5T even if two pipelined multipliers
are available. The parallelism inherent to this structure can be truly exploited
by using multiple multiplier-adder pairs. Figure 3.8 shows the scheduled data
flow graph for the MCM based structure using two pipelined multipliers and two
adders.
As can be seen from the figure, using two multiplier-adder pairs reduces the
delay to 4T. This analysis shows that the delay per output can be reduced by
using multiple functional units. This can be used to lower the supply voltage
and hence reduce the power dissipation, if the throughput requirement is the same
as that achieved using one multiplier and one adder.
Figure 3.9. Energy and Peak Power Dissipation as a Function of Degree of Parallelism
With the degree of parallelism N, the amount of capacitance switched per cycle
goes up by a factor of N. Since the power is proportional to V², the peak power
dissipation can be reduced only if the supply voltage is reduced by a factor of
√N. Figure 3.9 plots both the energy (or average power) and the peak power as
a function of the degree of parallelism N for V_DD = 3V and V_T = 0.7V. As can be seen
from the figure, while the energy per output or the average power dissipation
reduces with an increasing degree of parallelism, the peak power dissipation
increases beyond N = 4.
For a given degree of parallelism N, the following condition should be satisfied
for the peak power dissipation to be less than with degree one:

(V_DD / (V_DD − V_T)²) · N > (V_DD/√N) / ((V_DD/√N) − V_T)²    (3.2)

This gives the following relationship between V_DD, V_T and N:

V_DD / V_T > (N^(3/4) − 1) / (N^(1/4) − 1)    (3.3)
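The algebra from (3.2) to (3.3) is not spelled out in the text; a short sketch, assuming the reconstruction of (3.2) above. Writing the right hand side of (3.2) as \(\sqrt{N}\,V_{DD}/(V_{DD}-\sqrt{N}\,V_T)^2\),
\[
\frac{N\,V_{DD}}{(V_{DD}-V_T)^2} > \frac{\sqrt{N}\,V_{DD}}{(V_{DD}-\sqrt{N}\,V_T)^2}
\;\Longrightarrow\;
\sqrt{N}\,(V_{DD}-\sqrt{N}\,V_T)^2 > (V_{DD}-V_T)^2
\]
Taking square roots on both sides,
\[
N^{1/4}(V_{DD}-\sqrt{N}\,V_T) > V_{DD}-V_T
\;\Longrightarrow\;
\frac{V_{DD}}{V_T} > \frac{N^{3/4}-1}{N^{1/4}-1}
\]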
Figure 3.10 plots this relationship as the lower limit on VDD/VT for no
increase in the peak power dissipation with the given degree of parallelism N.
Figure 3.10. Lower Limit of VDD/VT for Reduced Peak Power Dissipation as a Function of Degree of Parallelism
Figure 3.13. Signal Flow Graph of a Direct Form FIR Structure with Non-linear Phase
Figure 3.14. Signal Flow Graph of a Direct Form FIR Structure with Linear Phase
of length (N+1)/2 and the third sub-filter (H1) of length (N−1)/2. The multirate architecture thus requires (3N+1)/4 multiplications, which is less than the direct form architecture for all values of N, and requires (3N+3)/4 additions per output, which is less than the direct form architecture for (N−1) > (3N+3)/4, i.e. N > 7.
If N is even, the three decimated sub-filters have the following Z-domain transfer functions:

H0(Z) = Σ_{k=0}^{N/2−1} A[2k] · (Z²)^{−k}   (3.5)

H1(Z) = Σ_{k=0}^{N/2−1} A[2k+1] · (Z²)^{−k}   (3.6)

(H0 + H1)(Z) = Σ_{k=0}^{N/2−1} (A[2k] + A[2k+1]) · (Z²)^{−k}   (3.7)
The coefficient symmetry of the sub-filters can be analyzed using the relationship in equation 3.4 to show that the sub-filters H0 and H1 do not have linear phase, while the sub-filter (H0 + H1) does have linear phase characteristics.
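This behavior is easy to check mechanically. The following C fragment (an illustrative sketch, not from the original text; the example coefficients are made up) constructs the sub-filter coefficient sets of equations 3.5-3.7 for a symmetric (linear phase) filter:

#include <stdio.h>

#define N 8   /* number of taps, assumed even */

int main(void) {
    double A[N] = {1, 2, 3, 4, 4, 3, 2, 1};   /* symmetric example filter */
    double H0[N/2], H1[N/2], H01[N/2];
    for (int k = 0; k < N/2; k++) {
        H0[k]  = A[2*k];              /* equation 3.5 */
        H1[k]  = A[2*k + 1];          /* equation 3.6 */
        H01[k] = H0[k] + H1[k];       /* equation 3.7 */
    }
    for (int k = 0; k < N/2; k++)
        printf("k=%d : H0=%g H1=%g H0+H1=%g\n", k, H0[k], H1[k], H01[k]);
    return 0;
}

For this example H0 = {1,3,4,2} and H1 = {2,4,3,1} are mirror images of each other rather than individually symmetric, while H0+H1 = {3,7,7,3} is symmetric, in line with the statement above that only the (H0 + H1) sub-filter retains linear phase.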
Computational Complexity - linear phase FIR filters with even number of taps
Since H0 and H1 have non-linear phase, they require (N/2) multiplications and (N/2)−1 additions each. Since the H0 + H1 sub-filter has linear phase, it requires N/4 multiplications and (N/2)−1 additions if N/2 is even, and requires (N+2)/4 multiplications and (N/2)−1 additions if N/2 is odd.
Thus the topology-I multirate architecture requires, per output, 5N/8 multiplications and (3N+2)/4 additions if N/2 is even, and (5N+2)/8 multiplications and (3N+2)/4 additions if N/2 is odd. In both cases, the number of multiplications required is more than for the direct form structure. The primary reason for the multirate architecture requiring a higher number of multiplications is the fact that two of the three sub-filters have non-linear phase characteristics.
The topology-II multirate architecture has sub-filters with transfer functions (H0 + H1)/2, (H0 − H1)/2 and H1. Since H0 + H1 has linear phase, the sub-filter (H0 + H1)/2 also has linear phase characteristics. It can be shown that the coefficients of (H0 − H1)/2 are anti-symmetric (i.e. A_i = −A_{N−1−i}). This sub-filter hence has the same computational complexity as (H0 + H1)/2. This multirate architecture hence requires N/2 multiplications and (3N+6)/4 additions if N/2 is even, and needs (N+2)/2 multiplications and (3N+6)/4 additions if N/2 is odd. While this multirate architecture requires fewer multiplications than the topology-I architecture, it is still not below the number of multiplications required by the direct form structure.
Thus in case of linear phase FIR filters, one level decimated multirate architectures can at best require the same number of multiplications as the direct form structure, when N/2 is even. They require fewer additions for (N − 1) > (3N + 6)/4, i.e. N > 10.
Computational Complexity - linear phase FIR with odd number of taps
In case of a linear phase filter with an odd number of taps, it can be shown that the sub-filters H0 and H1 both have linear phase, but the sub-filter H0 + H1 has non-linear phase characteristics. Since H0 is of length (N+1)/2 and H1 is of length (N−1)/2, the two sub-filters together require (N+1)/2 multiplications. The topology-I multirate architecture hence requires (N+1)/2 multiplications and (3N+3)/4 additions per output. Thus for the linear phase FIR filter with an odd number of taps, the one level decimated multirate architecture can at best require the same number of multiplications as the direct form structure. It requires fewer additions for (N − 1) > (3N + 3)/4, i.e. N > 7.
The above analysis (summarized in table 3.1) demonstrates how the multirate architectures can reduce the computational complexity of FIR filters. Each of the sub-filters in the one level decimated architectures (shown in figure 3.11) can itself be decimated further, leading to the two level decimated multirate architecture shown in figure 3.15.
Figure 3.15. Signal Flow Graph of a Two Level Decimated Multirate Architecture
The reduced frequency for the multirate architecture directly translates into its
lower power dissipation.
The lowering of the frequency has another important advantage. Since the clock period is increased, the logic delays can be correspondingly higher without affecting the overall throughput. In CMOS logic, the supply voltage is one of the factors that affects the delays. The delay dependence on the supply voltage is given by the following relationship:

Delay ∝ VDD/(VDD − VT)²   (3.9)

where VDD is the supply voltage and VT is the threshold voltage of the transistor.
Figure 3.16 shows this delay vs VDD relationship for VT = 0.8V. The delay values are normalized with respect to the delay at VDD = 5V. Since the multirate architectures allow higher logic delays, the supply voltage can be appropriately lowered. This reduces the power proportional to the square of the reduction in the supply voltage.
The analysis shown below assumes that the total capacitance charged/discharged per output is proportional to the total area of the multipliers and the adders required to compute each output. Let Am be the area of a multiplier and Aa be the area of an adder. For an N tap FIR filter with non-linear phase, the total capacitance for the direct form structure is proportional to N × Am + (N − 1) × Aa. The capacitance per cycle Cdirect for the direct form realization is hence given by

Cdirect ∝ (N × Am + (N − 1) × Aa)/fdirect   (3.12)

The capacitance per cycle Cmultirate for the multirate architecture can be obtained in the same manner from the sub-filter operation counts. It can be noted that if the area ratio Am/Aa is the same as the delay ratio δm/δa, and fmultirate is appropriately scaled to maintain the same throughput, the two capacitance values Cdirect and Cmultirate are the same.
The above analysis shows that for a non-linear phase 32 tap FIR filter, the one-level decimated multirate architecture (figure 3.11) results in a 50% reduction in the power dissipation.
The amount of power reduction using the multirate architecture is mainly dependent on the amount by which the frequency can be lowered. The lowered frequency not only reduces power directly, but also enables reducing the voltage, which has a bigger impact on power reduction. The frequency ratio relationship presented above indicates that the amount of frequency reduction depends on the number of taps and also on the delay ratio δm/δa. Using this relationship, it can be shown that frequency lowering is possible if (δm/δa) > (6/N − 1). This relationship indicates that for N > 6 the frequency of the multirate architecture can always be lowered, independent of the (δm/δa) ratio.
(Figure: Normalized Power Dissipation plotted against Number of Taps, with curves labeled L_20, NL_10 and NL_20.)
Table 3.2. Comparison with Direct Form and Block FIR Implementations
The results show that the power dissipation reduces with increasing number of taps in all the 3 cases.
For non-linear phase FIR implementation, the one level decimated multirate architecture results in a power saving of up to 50%. The two level decimated multirate architecture results in a power saving of up to 73%. This reduction is more than the 64% power reduction achieved using the parallel processing technique. The significant point to note is that the power reduction using the multirate architecture requires no datapath area overhead, compared to 240% area overhead [15] in the parallel processing approach. It can be noted that the multirate architectures do, however, result in coefficient and data storage overhead. For a non-linear phase N tap FIR filter, the multirate architecture shown in figure 3.11 requires (N/2) more coefficient memory locations and (N/2) more data memory locations, compared to the direct form implementation.
In case of linear phase FIR filters, since the one level decimated multirate architectures do not reduce the number of multiplications per output, the power saving is primarily due to the reduction in the number of additions. For the filter with N (number of taps) = 20, the frequency can be lowered by a factor of 1.03, which translates into a 7% power reduction. For higher values of N, power reduction of up to 9% is achieved. Depending on the number of taps, the two level multirate architectures use an appropriate combination of topology-I and topology-II to minimize the number of multiplications. The two level decimated multirate architectures result in up to 35% power reduction.
As can be seen from the results, the doubly-decimated multirate architecture requires fewer operations and less area than the block FIR implementation. This shows the effectiveness of multirate architectures in reducing power dissipation with minimal area overhead.
Chapter 4

DISTRIBUTED ARITHMETIC BASED IMPLEMENTATION
values are known at design time, the look-up-tables stored in these memory
modules can be implemented as hardwired logic blocks. The chapter proposes
a coefficient partitioning technique so as to minimize the area of these logic
blocks.
The chapter also presents techniques for reducing power dissipation of the
DA based structure. With the primary focus on the power dissipated in the input
data shift registers, it proposes a data coding technique to minimize the number
of toggles in these registers. For a given profile of input data distribution, an optimum coding scheme can be derived so as to minimize power dissipation.
where the A[n]'s are the fixed coefficients and the X[n]'s are K-bit input data words. If each X[n] is a 2's-complement binary number scaled such that |X[n]| < 1, then X[n] can be represented as

X[n] = −b_n0 + Σ_{k=1}^{K−1} b_nk · 2^{−k}   (4.2)

where the b_nk are the bits 0 or 1, b_n0 is the sign bit and b_n,K−1 is the LSB. Combining equations 4.1 and 4.2 gives

Y = Σ_{n=1}^{N} A[n] · (−b_n0 + Σ_{k=1}^{K−1} b_nk · 2^{−k})   (4.3)
Since each b_nk may take on the values 0 and 1 only, expression 4.5 may have 2^N possible values. Instead of computing these values on-line, they can be precomputed and stored in a look-up-table memory. The input data can then be used
16 Word Memory

Address   Contents
0000      0
0001      A3
0010      A2
0011      A2+A3
0100      A1
0101      A1+A3
0110      A1+A2
0111      A1+A2+A3
1000      A0
1001      A0+A3
1010      A0+A2
1011      A0+A2+A3
1100      A0+A1
1101      A0+A1+A3
1110      A0+A1+A2
1111      A0+A1+A2+A3
to directly access the memory, and the result can be added to the accumulator. Y can thus be obtained after K such cycles using K−1 additions.
Figure 4.1 shows the DA based implementation of a 4 tap FIR filter. The input data values X[n] to X[n-3] are stored in input shift registers. During each cycle the last bits (LSBs) of the registers are used as an address to look up the coefficient memory, and the value read is added to the right shifted accumulator. The shift register chain is then right shifted. Since the input values are stored in 2's complement form, the value read from the coefficient memory during the K'th iteration is subtracted from the right shifted accumulator. The output Y[n] is thus available in the accumulator after every K cycles. Figure 4.1 also shows the coefficient memory map for a 4 tap filter with coefficients A[0] to A[3].
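To make the mechanism concrete, the following C simulation (an illustrative sketch, not code from the original text) mimics the structure of figure 4.1 for a 4 tap weighted sum. It uses integer 2's complement inputs instead of the fractional scaling above, and weights the LUT output up by 2^k instead of right-shifting the accumulator, which is arithmetically equivalent:

#include <stdio.h>

#define TAPS 4
#define K 8                 /* input bit precision */

int main(void) {
    int A[TAPS] = {3, -5, 7, 2};       /* fixed coefficients A[0]..A[3] */
    int x[TAPS] = {17, -42, 5, -1};    /* inputs X[n]..X[n-3] */

    /* Precompute the 2^TAPS word LUT: sum of A[i] for every set address bit. */
    long lut[1 << TAPS];
    for (int addr = 0; addr < (1 << TAPS); addr++) {
        lut[addr] = 0;
        for (int i = 0; i < TAPS; i++)
            if (addr & (1 << i)) lut[addr] += A[i];
    }

    /* Bit-serial accumulation: K cycles, with the value read during the
       sign-bit (MSB) cycle subtracted, as described above. */
    long acc = 0;
    for (int k = 0; k < K; k++) {
        int addr = 0;
        for (int i = 0; i < TAPS; i++)          /* k-th bit of each input */
            addr |= (int)(((unsigned)x[i] >> k) & 1u) << i;
        if (k == K - 1) acc -= lut[addr] * (1L << k);  /* sign bit cycle */
        else            acc += lut[addr] * (1L << k);
    }

    long ref = 0;                               /* direct weighted sum */
    for (int i = 0; i < TAPS; i++) ref += (long)A[i] * x[i];
    printf("DA result = %ld, direct result = %ld\n", acc, ref);
    return 0;
}

Both results agree (294 for the sample data), confirming that one LUT access and one add (or subtract) per bit of input precision suffice.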
4 Word Memory

Address   Contents
00        0
01        A1
10        A0
11        A0+A1
16 Word Memory

Address   Contents
0000      0
0001      A1
0010      2*A1
0011      3*A1
0100      A0
0101      A0+A1
0110      A0+2*A1
0111      A0+3*A1
1000      2*A0
1001      2*A0+A1
1010      2*A0+2*A1
1011      2*A0+3*A1
1100      3*A0
1101      3*A0+A1
1110      3*A0+2*A1
1111      3*A0+3*A1
Σ_{n=1}^{N1} (A[n]·b_nk) + Σ_{n=N1+1}^{N1+N2} (A[n]·b_nk) + ... + Σ_{n=N−Nm+1}^{N} (A[n]·b_nk)   (4.6)
(Figure: DA structure with multiple memory banks - ROMs of 2^{N1}, 2^{N2}, ..., 2^{Nm} words, addressed by N1, N2, ..., Nm register outputs respectively.)
However, for a three memory bank implementation with 2BAAT data access, the number of additions required (3K/2 − 1) is less than for the two bank implementation with 1BAAT, and the coefficient memory required (3·2^{2N/3}) is less than for the single bank implementation with 1BAAT data access. The three bank implementation with 2BAAT data access thus represents a data point on the area-delay curve between the single bank 1BAAT and the two bank 1BAAT DA implementations.
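Assuming the pattern generalizes in the obvious way - m banks with b bits-at-a-time access needing m·K/b − 1 additions per output and m·2^{bN/m} words of coefficient memory (an extrapolation from the cases above, not a formula stated in the text) - the candidate area-delay points can be enumerated with a few lines of C:

#include <stdio.h>
#include <math.h>

/* Enumerate DA area-delay points for an N tap, K bit filter, assuming
   additions = m*K/b - 1 and memory = m * 2^(b*N/m) words. */
int main(void) {
    int N = 12, K = 16;
    for (int m = 1; m <= 3; m++)            /* number of memory banks */
        for (int b = 1; b <= 2; b++) {      /* bits accessed at a time */
            double adds = (double)m * K / b - 1.0;
            double mem  = m * pow(2.0, (double)b * N / m);
            printf("m=%d, %dBAAT: %4.0f additions, %8.0f words\n",
                   m, b, adds, mem);
        }
    return 0;
}

For N = 12 and K = 16 this reproduces the cases above: the single bank 1BAAT point (15 additions, 4096 words), the two bank 1BAAT point (31 additions, 128 words) and the three bank 2BAAT point (23 additions, 768 words), along with points whose memory requirement is clearly impractical (e.g. one bank with 2BAAT).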
architecture [66] that uses a decimation factor of three. In this architecture, the decimated sub-filters H0, H1 and H2 are derived by grouping every third filter coefficient, as shown below:
a0 = X2 − X1                        b0 = H0
a1 = (X0 − X2·Z^{-3}) − (X1 − X0)   b1 = H1
a2 = −a0·Z^{-3}                     b2 = H2
a3 = (X1 − X0)                      b3 = H0 + H1
a4 = (X0 − X2·Z^{-3})               b4 = H1 + H2
a5 = X0                             b5 = H0 + H1 + H2

mi = ai · bi,  i = 0, 1, 2, 3, 4, 5

Y0 = m2 + (m4 + m5)
Y1 = m1 + m3 + (m4 + m5)
Y2 = m0 + m3 + m5
This multirate architecture has six sub-filters of length N/3. Each of these filters can be implemented using the DA based approach, thus requiring a total coefficient memory of 6·2^{N/3} words. These sub-filters require 6(K − 1) additions. There are 10 more additions required, four of which are at the input and can be implemented bit-serially. Thus this architecture requires a total of (6(K − 1) + 6)/3 = 2K additions per output.
The area-delay tradeoff of this architecture with 2BAAT data access can be analyzed in much the same way as the earlier multirate architecture. It can be shown that with 2BAAT data access this architecture requires K additions per output and 6·2^{2N/3} words of coefficient memory.
For an N tap filter, where N is an integer multiple of three, it can be shown that the sub-filters H0 and H2 have the same set of coefficients and can hence share the same coefficient memory of size 2^{N/3}. Similarly, the sub-filters H0+H1 and H1+H2 have the same set of coefficients and can hence share the same coefficient memory of size 2^{N/3}. The sub-filters H1 and H0+H1+H2 have symmetric coefficients and hence require a total of 2·2^{N/6} words of coefficient memory. Thus the total coefficient memory required for the linear phase filter is given by (2^{N/3+1} + 2^{N/6+1}).
As can be seen from the results, the techniques discussed in this chapter enable achieving different points in the area-delay space for the DA based implementation of FIR filters. For a given filter, some of these points can be eliminated because their memory requirements are very high, or because they require more memory for the same number of additions compared to another implementation. Even with these eliminations, as many as eight meaningful data points can be achieved on the area-delay curve. Figure 4.7 shows these memory vs. number-of-additions plots for the 8 tap and 12 tap FIR filters with 16 bits of input data precision.
The following section looks at DA based implementation of FIR filters whose coefficients are known at design time. It presents a technique to improve the area efficiency of a DA structure that uses two LUTs. It can be noted that
Table 4.1. Coefficient Memory and Number of Additions for DA based Implementations
(Figure 4.7: coefficient memory in words plotted against the number of additions.)
overall precision required in the LUTs is less and the implementation area can
be reduced.
For filters with fixed coefficient values, the required area can be drastically reduced by removing the redundancy inherent in a memory structure, using a two level PLA implementation or the more efficient multi-level logic optimization. In a two LUT implementation, the functionality of the LUTs depends on the coefficient partitioning. Experiments indicate [86] that 20% to 25% swings in implementation area occur based on the type of partition. Hence this flexibility needs to be explored.
In general, a 2N tap filter can be partitioned into two halves in (2N choose N)/2 ways. Clearly, even for a modestly sized 16 tap filter this implies a search space with 6435 partitions. Performing an optimized area mapping for the exhaustive set and then choosing the most efficient partition is clearly infeasible. A set of heuristics is hence required for estimating the area of different partitions, so as to speed up the search for the most area efficient partition.
Table 4.2. A Few Functions and Their Corresponding Correlations with Actual Area
Based on the analysis of the coefficients of various low pass filters with taps ranging from 16 to 40, the following heuristic rule [86] can be used to choose an efficient partition:
Step 1: Separate the coefficients into positive and negative sets.
Step 2: Sort each set by magnitude.
Step 3: Group the top half of each set as the first partition and the remaining as the second partition.
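A direct C rendering of these three steps might look as follows (an illustrative sketch, not code from [86]; the example coefficients and tie handling are assumptions):

#include <stdio.h>
#include <stdlib.h>

#define NC 8

/* Sort by descending magnitude. */
static int cmp_mag(const void *a, const void *b) {
    int x = abs(*(const int *)a), y = abs(*(const int *)b);
    return (x < y) - (x > y);
}

int main(void) {
    int coef[NC] = {23, -7, 91, -44, 5, -62, 18, -3};
    int pos[NC], neg[NC], np = 0, nn = 0;

    for (int i = 0; i < NC; i++)             /* Step 1: split by sign */
        if (coef[i] >= 0) pos[np++] = coef[i];
        else              neg[nn++] = coef[i];

    qsort(pos, np, sizeof(int), cmp_mag);    /* Step 2: sort by magnitude */
    qsort(neg, nn, sizeof(int), cmp_mag);

    printf("Partition 1:");                  /* Step 3: top half of each set */
    for (int i = 0; i < np/2; i++) printf(" %d", pos[i]);
    for (int i = 0; i < nn/2; i++) printf(" %d", neg[i]);
    printf("\nPartition 2:");
    for (int i = np/2; i < np; i++) printf(" %d", pos[i]);
    for (int i = nn/2; i < nn; i++) printf(" %d", neg[i]);
    printf("\n");
    return 0;
}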
Functions 4 and 5 form the basis of the area comparison procedure and will be explained in detail later. Function 1 gives a very naive estimate, assuming that the number of ones is a measure of the number of min-terms that need to be implemented. It does not consider the minimizations that occur because of
particular groupings of 1s. However, it could be used effectively for filters with sparsely populated truth-tables. Function 2 is similar to a fan-in type algorithm used for FSMs [19, 63]. It reflects the fact that additional area results when two particular outputs have a larger Hamming distance between their corresponding min-terms. However, the fact that it sums up over all possible combinations of rows results in favorable pairs being overshadowed by area expensive ones. Function 3 tries to group outputs with maximum overlap between them and adds the extra non-overlap cost. However, it does not account for simplifications that could arise from row overlaps. Further, it pairwise sums up all best case column groupings without accounting for the fact that one favorable grouping might exclude the possibility of another one.
ROF = Σ_{i=0}^{n−1} Σ_{j=1}^{2^{n−i}−1} Σ_{k=0}^{2^{i}−1} hd(j·2^i + k, j·2^i − (k+1))   (4.8)
where hd(p, q) represents the Hamming distance between the pth and the qth row entries in the modified truth-table. It can be observed that when the Hamming distance between two output rows is larger, the number of input min-terms that can be combined is smaller, hence the added cost. ROF gives a very high correlation with actual implementation area, but its correlation deteriorates as the column overlap factor begins to dominate. Consider the following simple example:
example,
1111111111111111 1010101010101010
0000000000000000 0101010101010101
Clearly, Case 1 would require less area, because the greater column overlap implies that only one min-term needs to be implemented. To account for this, the Column Overlap Factor (COF) is computed.
Column Overlap Factor (COF)
COF computation is based on the minimum-spanning-tree algorithm [18]. It begins with one output column, tries to locate another one which is closest to it (in terms of maximum '1' overlap), then a third one which is closest to either of them, and so on. In each case it adds to the cost function the amount of non-overlap. Assuming m outputs in the truth-table, COF is computed using Prim's technique [18] for minimum-spanning-tree computation as follows:
The graph G consists of the columns as nodes. The edge-weight (ew) is the extra non-overlap cost between a pair of columns, and COF is the sum of the edge-weights of the minimum-spanning-tree.
Define, G = {Ck | Ck → kth output column, k = [0, m−1]}
ew_ij = ones(Cj) − overlap(Ci, Cj)
where overlap(Ci, Cj) gives the number of positions where both Ci and Cj have '1' entries in corresponding rows.
Step 1: Initialize count = 0; COF = ones(C0); and the span set as Spantree = {C0}.
Step 2: Repeat Steps 3-4 while count ≤ m − 1.
Step 3: Find Ck ∉ Spantree such that ew_ik is minimum over all Ci ∈ Spantree (i ≠ k).
Step 4: Increment count; add Ck to Spantree and the edge-weight (the extra non-overlap cost) to COF:
COF = COF + ew_ik
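In code, the COF computation is a small variant of Prim's algorithm. The following C sketch (illustrative; the truth-table data is made up) computes COF for a set of output columns stored one bit per row:

#include <stdio.h>

#define ROWS 8
#define COLS 4

static int ones(const unsigned char c[ROWS]) {
    int s = 0;
    for (int r = 0; r < ROWS; r++) s += c[r];
    return s;
}

static int overlap(const unsigned char a[ROWS], const unsigned char b[ROWS]) {
    int s = 0;
    for (int r = 0; r < ROWS; r++) s += a[r] & b[r];
    return s;
}

int main(void) {
    unsigned char col[COLS][ROWS] = {      /* example output columns */
        {1,0,1,0,1,0,1,0}, {1,0,1,0,0,0,1,0},
        {0,1,0,1,0,1,0,1}, {1,1,1,0,1,0,1,1}
    };
    int inTree[COLS] = {1, 0, 0, 0};       /* Step 1: Spantree = {C0} */
    int cof = ones(col[0]);

    for (int count = 1; count < COLS; count++) {   /* Steps 2-4 */
        int bestK = -1, bestW = ROWS + 1;
        for (int k = 0; k < COLS; k++) {
            if (inTree[k]) continue;
            for (int i = 0; i < COLS; i++) {
                if (!inTree[i]) continue;
                int ew = ones(col[k]) - overlap(col[i], col[k]);
                if (ew < bestW) { bestW = ew; bestK = k; }
            }
        }
        inTree[bestK] = 1;                 /* add Ck to the span set */
        cof += bestW;                      /* COF += minimum edge-weight */
    }
    printf("COF = %d\n", cof);
    return 0;
}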
CF2 is computed using a linear combination of COF and ROF. It was observed that CF2 values had as much as 90% correlation with actual areas.
Computation of CF2
A linear weighted combination of normalized COF and ROF (cost function CF2) was tested on truth-tables of filter coefficients generated using the Parks-McClellan algorithm, with taps ranging from 16 to 36.
CF2 = k3 · ROF' + k4 · COF'   (4.9)

where ROF' and COF' represent the normalized values of ROF and COF computed for the different partitions. The values of k3 and k4 for maximum correlation were found to be 0.92 and 0.08 respectively. Correlations of 80% to 90% with actual area were observed. The values of CF2 obtained for the various partitions can therefore be used to obtain a minimized search space of the desired size. Figure 4.10 shows a typical correlation trend obtained using CF2 for 25 different 16 tap filter partitions. As can be seen, the two most area efficient partitions could be isolated. The 'kinks' in between correspond to the 'transition zone' where
Figure 4.10. Area vs Normalized CF2 Plot for 25 Different Partitions of a 16 Tap Filter
neither a row nor a column overlap factor dominates; there occurs a complex
interdependency in row/column simplifications.
where k1 and k2 are the corresponding weights and ci represents the ith coefficient in the partition; hd is a simple Hamming distance between the two input vectors and ones is the number of '1' entries in ci.
The function was implemented on all possible uniform twin partitions of filters with the number of coefficients ranging from 8 to 20. The values of k1 and k2 that resulted in the highest correlation between the value of CF1 and the actual area were found to be 0.83 and 0.17 respectively. Experiments indicated that the correlation values remained almost the same after the CF1 values obtained for the individual coefficient sets in a given partition were added up and compared with the sum of the individual implementation areas.
As can be seen, 8% to 10% area saving can easily be obtained from a good partition. Table 4.4 compares the area required for a ROM implementation with that of a hardwired implementation (using SIS) for different numbers of 16-bit precision coefficients. The area mapping was performed using a library of TSC4000 0.35μm CMOS standard gates from Texas Instruments [100]. It can be seen that more savings result for a smaller number of coefficients, as the decoder overhead does not decrease proportionately for the ROM even though the memory size decreases. Table 4.5 illustrates the kind of area variations that occur depending on the partitioning of coefficients for some typical filters.
Table 4.4. ROM vs Hardwired Area (Equivalent NA210 NAND Gates) Comparison

Table 4.5. Area (Equivalent NA210 NAND Gates) Statistics for All Possible Coefficient Partitions
Clearly the correct choice of partition results in 20% to 25% area saving, and so a proper algorithm for choosing the ideal partition is altogether justified. CF1 and CF2 were implemented on filters with taps ranging from 8 to 40. For filters with fewer than 20 coefficients, all possible partitions were generated, while for larger ones a comparable number of random partitions were generated. In each case the actual area mapping of the simplified circuit was obtained through SIS simulation.
• 82% to 90% probability of choosing the most area optimal partition using CF1 and CF2.
• Over 95% probability of having the most optimal partition in the search
space reduced to 2% of its original size.
• All cases yielded partitions close to minimal area in the reduced search
space.
D Flip-Flop (TI standard cell [100])    Cpd,toggle (pF)    Cpd,no-toggle (pF)    % extra
DTP10                                   0.157              0.070                 124%
DTP20                                   0.180              0.071                 154%
DTP1A                                   0.155              0.069                 138%
DTP10                                   0.167              0.070                 139%
• The CF1, CF2 estimates had greater correlation as the size of the search space increased; and a larger sized domain is where CF1 and CF2 have their real application.
• For an 8 input truth-table with 256 rows and 16 output columns, SIS required
a CPU time of 350.53s on a Sun SPARC 5 station while CF2 computation
required only 0.15s, a speed advantage of around 2400. Further, this speed
advantage increased sharply with filters of higher order.
For applications where the distribution of data values is known, a data coding scheme can be derived which, for a given distribution profile of data values, results in fewer toggles in the shift registers. The main constraint
In the above case, powers of 2 alternate in signs. While the 2's complement representation has the range [−2^{N−1}, 2^{N−1} − 1], this nega-binary scheme has the range [−(4^{⌈N/2⌉} − 1)/3, 2·(4^{⌊N/2⌋} − 1)/3]. It can be noted that in general the nega-binary scheme results in a different range of numbers than the 2's complement representation. Thus there can be a number that has an N bit 2's complement representation but does not have an N bit nega-binary representation. This issue has been addressed in section 4.3.1.2. Here is a simple example that demonstrates how the nega-binary scheme can result in a reduced number of toggles. Consider the 2's complement number 01010101_B. Using a nega-binary scheme with alternating positive and negative signs (weights −(−2)^i), the corresponding representation will be 11111111_NB. Clearly, while the first case has the maximum possible toggles, the second one has the minimum. If instead the number was 10101010_B, this nega-binary scheme would result in a representation with the same number of toggles as the 2's complement.
(Figure content: a number line marking −31, −8, 0, 7 and 31; the range for the − − − − − scheme lies to the left, and the range for the + + + + + scheme to the right.)
Figure 4.11. Range of Represented Values for N=4, 2's Complement and N+1=5, Nega-binary
Figure 4.12. Typical Audio Data Distribution for 25000 Samples Extracted from an Audio File
Figure 4.12 illustrates the distribution profile of typical audio data extracted from an audio file. The non-uniform nature of the distribution is at once apparent. A nega-binary scheme which has the minimum number of toggles in data values with very high probability of occurrence will substantially reduce power consumption. Further, each of the 2^N + 1 overlap cases has a different 'region' of minimum toggle over the range, which implies that there exists a nega-binary representation which minimizes total weighted toggles for a data distribution peaking at a different 'region' in the range. While the relative data distribution of typical audio data is similar to that shown in figure 4.12, its mean can shift depending on factors such as volume control. The flexibility of selecting a coding scheme depending on the 'mean' value is hence very critical for such applications. Section 4.3.1.4 shows that the binary to nega-binary conversion can be made programmable, so that the desired nega-binary representation can be selected (even at run-time) by simply programming a register.
It can be noted that the toggle reduction using the nega-binary coding comes at the cost of an extra bit of precision. The amount of saving hence reduces as the distribution becomes more and more uniform. This is to be expected, as any exhaustive N-bit code (i.e. one that comprises all possible combinations
Figure 4.13. Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: + − − + − + +
of 1s and 0s) will necessarily have the same total number of toggles (summed over all its representations) as any other similar code. Therefore, as the data distribution becomes more and more uniform, i.e. as all possible values tend to occur with equal probability, the toggle reduction decreases.
Figures 4.13 and 4.14 illustrate the difference in the number of toggles between a 6-bit, 2's complement representation and two different 7-bit, nega-binary representations for each data value. Figures 4.15 and 4.16 show two profiles for 6-bit Gaussian distributed data. As can be seen, the nega-binary scheme of figure 4.13 can be used effectively for a distribution like the one shown in figure 4.15, resulting in 34.2% toggle reduction. Similarly, the nega-binary scheme of figure 4.14 can be used for a distribution like the one shown in figure 4.16, resulting in 34.6% toggle reduction. Figures 4.13 and 4.14 depict two out of a total of 65 possibilities. Each of these schemes peaks differently (i.e. the corresponding nega-binary scheme has fewer toggles than the 2's complement case over a different region of values), and hence, for a given distribution, a nega-binary scheme can be selected to reduce power dissipation.
Figure 4.14. Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: − + + − + − +
Procedure Optimum-Nega-binary-Scheme
Input: Array Bit[N] - N bit 2's complement representation of a number, with Bit[0] being the LSB and Bit[N-1] being the MSB.
(Figure 4.15: density function vs. value, a 6-bit Gaussian distribution profile.)
for (i = 0 to N-2) {
    if (Bit[i+1] == 1) {
        Sign[i] = '+'
    } else {
        Sign[i] = '-'
    }
}
if (Bit[N-1] == 1) {
    Sign[N-1] = '+'
    Sign[N] = '-'
} else {
    Sign[N-1] = '-'
    Sign[N] = '+'
}
(Figure 4.16: density function vs. value, a 6-bit Gaussian distribution profile.)
Procedure Binary-to-Nega-binary
Inputs: Array Bit[N] - N bit 2's complement representation of a number, with Bit[0] being the LSB and Bit[N-1] being the MSB.
Array Sign[N] - N bit nega-binary sign pattern, with Sign[0] being the sign for the LSB.
Output: Array NegaBit[N] - N bit nega-binary representation for the number.
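The body of the conversion procedure is not reproduced here. One workable implementation (an assumption, not necessarily the book's algorithm) resolves one bit per step, LSB first: the parity of the remaining value fixes each bit, and a non-zero final remainder flags a value outside the range of the chosen scheme (the range issue discussed in section 4.3.1.2):

#include <stdio.h>

#define NB 7   /* nega-binary bits: one more than the 6 bit data */

/* Convert v to bits under sign pattern s[i] (each +1 or -1, the sign of
   the 2^i weight). Returns 0 on success, -1 if v is not representable. */
static int to_negabinary(long v, const int s[NB], int bit[NB]) {
    for (int i = 0; i < NB; i++) {
        bit[i] = (v % (1L << (i + 1))) != 0;   /* parity fixes bit i */
        v -= (long)bit[i] * s[i] * (1L << i);
    }
    return (v == 0) ? 0 : -1;
}

int main(void) {
    int s[NB] = {+1, -1, +1, -1, +1, -1, +1};  /* weights (-2)^i */
    int bit[NB];
    if (to_negabinary(21, s, bit) == 0) {
        printf("21 ->");
        for (int i = NB - 1; i >= 0; i--) printf(" %d", bit[i]);
        printf("\n");                          /* prints 0 0 1 0 1 0 1 */
    }
    return 0;
}

Since only the sign pattern s is a parameter, the same routine covers all the candidate nega-binary schemes, which is what makes the run-time programmability mentioned above possible.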
%Saving = (Σ_i p(i) · (togs(i) − negatogs(i))) / (Σ_i p(i) · togs(i)) × 100   (4.12)

where p(i) is the probability of occurrence of a data value i, N is the 2's complement bit-precision used, and togs(i) and negatogs(i) are the number of toggles in the representation of i for the 2's complement case and the nega-binary case respectively.
The above saving computation does not account for 'inter-data' toggles that result from two data values being placed adjacent to each other in the shift register sequence. It may be observed that for a T tap filter with N-bit precision registers, an architecture similar to figure 4.1 would imply a virtual shift register (obtained by concatenating all the individual registers) of length T×N. Actual shift simulations were performed sample by sample for different data profiles and different numbers of samples, to find the nega-binary scheme that results in maximum saving. These simulations showed that in all cases the nega-binary scheme that resulted in the best saving was the same as the scheme that resulted in the maximum estimate of power saving. This can be attributed to the observation (based on the simulations) that the contribution due to inter-data toggles is almost identical across the various nega-binary schemes. Hence the power saving estimate, given in equation 4.12, can be used to arrive at the optimum nega-binary scheme. There are two advantages of choosing a nega-binary scheme this way. One, it does not require actual sample by sample data; only an overall distribution profile is sufficient. Two, the run times for computing the best nega-binary scheme are orders of magnitude smaller.
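Under these observations the scheme selection reduces to evaluating equation 4.12 for each candidate sign pattern. The sketch below (illustrative C; a made-up triangular profile stands in for p(i)) estimates the weighted toggle saving of one alternating scheme:

#include <stdio.h>

#define NB 6            /* 2's complement bit precision */

static int toggles(const int *b, int n) {      /* adjacent-bit toggles */
    int t = 0;
    for (int k = 1; k < n; k++) t += (b[k] != b[k-1]);
    return t;
}

static int convert(long v, const int *s, int *bit, int n) {
    for (int i = 0; i < n; i++) {              /* as sketched earlier */
        bit[i] = (v % (1L << (i + 1))) != 0;
        v -= (long)bit[i] * s[i] * (1L << i);
    }
    return (v == 0) ? 0 : -1;
}

int main(void) {
    int s[NB + 1] = {+1, -1, +1, -1, +1, -1, +1};
    double num = 0.0, den = 0.0;
    for (int v = -(1 << (NB - 1)); v < (1 << (NB - 1)); v++) {
        double p = 32.0 - (v > 16 ? v - 16 : 16 - v);   /* toy profile */
        if (p <= 0.0) continue;
        int b2[NB], bn[NB + 1];
        for (int k = 0; k < NB; k++) b2[k] = ((unsigned)v >> k) & 1u;
        if (convert(v, s, bn, NB + 1) != 0) continue;
        num += p * (toggles(b2, NB) - toggles(bn, NB + 1));
        den += p * toggles(b2, NB);
    }
    printf("estimated saving = %.1f%%\n", 100.0 * num / den);
    return 0;
}

Evaluating this estimate for each candidate sign pattern (the 2^N + 1 overlap cases mentioned earlier) and keeping the largest selects the optimum scheme; a negative estimate simply means that particular pattern does not suit the given profile.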
• The results given in table 4.7 also indicate that the amount of power saving reduces with increasing bit precision. This trend can be explained as follows. The dynamic range of data with B bits of precision is given by 2^B. With one extra bit of precision (i.e. B+1), the dynamic range doubles (i.e. 2^{B+1}). Since the mean and the standard deviation of the data distribution are specified as a fraction of max, these also scale (double) with an additional bit of precision. Thus a data value D with B bits of precision gets mapped onto two data values (2D) and (2D+1) with (B+1) bits of precision. As can be seen from table 4.7, the nega-binary representation for (B+1) bits of precision is derived by appending '+' to the nega-binary representation for B bits of precision. Thus the nega-binary representation of (2D) and (2D+1) is
Table 4.7. Best Nega-binary Schemes for Gaussian Data Distribution (mean = max/2; SD = 0.17 max)
Figure 4.18. Saving vs SD Plot for N=8, Gaussian Distributed Data with Mean = max/2
reduction in the LUT outputs as well. Such a reduction, apart from saving power in the adder, also results in substantial power savings in the LUT itself. Table 4.8 shows the number of Repeated Consecutive Addresses (RCAs) to the LUT for the 2's complement and the nega-binary case. It is easy to observe that the number of repeated consecutive addresses in the shift register outputs gives the number of times no toggles occur in the LUT outputs (since the same contents
Table 4.8. Toggle Reduction in LUT (for 10,000 Samples; Gaussian Distributed Data)
are being read). This toggle reduction is, therefore, independent of the filter
coefficients.
• 2's complement RCAs were obtained by counting the number of cases (out of a possible 10000×N times the LUT is addressed) where two consecutive addresses were identical. A similar computation was performed for the best nega-binary scheme obtained using the techniques presented in the previous sections (the total number of cases in this case is obviously 10000×(N+1)).
• Toggle reduction was computed by finding the difference between the number of times at least one toggle occurred at the LUT output for the two schemes.
• For all the three different precisions a Gaussian distribution with mean =
max/2 and an SD = 0.2 max was used (max being the largest positive 2's
complement number represented).
Figure 4.21. Shiftless Implementation of DA Based FIR with Fixed Gray Sequencing
Two enhancements are possible in the basic DA implementation:
1. Using a gray sequence in the counter (column decoder) for selecting subsequent bits - this reduces the toggling in the counter outputs which drive the multiplexers to the theoretical minimum.
2. Using the flexibility of having several gray schemes to choose a data distribution dependent scheme which minimizes toggles in the multiplexer outputs.
Gray coded addressing has been presented in the literature [61] as a technique for significantly reducing power dissipation in an address bus, especially in case of sequential access. Figure 4.21 illustrates a DA based FIR with a fixed gray sequencing scheme. This results in the theoretical minimum possible toggles occurring in the counter output. As can be seen, such an implementation requires no additional hardware in the basic DA structure.
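The gray sequence itself is trivial to generate and check in software. The following C fragment (illustrative) compares the address-line toggles of a binary counter with those of the standard binary-reflected gray counter g = i ^ (i >> 1) over a K cycle scan:

#include <stdio.h>

static int hamming(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int n = 0;
    while (x) { n += x & 1u; x >>= 1; }
    return n;
}

int main(void) {
    unsigned K = 16;
    int tbin = 0, tgray = 0;
    for (unsigned i = 1; i < K; i++) {
        tbin  += hamming(i, i - 1);
        tgray += hamming(i ^ (i >> 1), (i - 1) ^ ((i - 1) >> 1));
    }
    printf("binary: %d toggles, gray: %d toggles\n", tbin, tgray);
    return 0;
}

For K = 16 this prints 26 toggles for the binary counter against 15 for the gray counter: exactly one toggle per cycle, the theoretical minimum referred to above.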
An N bit gray code can be obtained in N! ways (in a gray code any two columns can be swapped to obtain another gray code). This freedom can be exploited to obtain a data specific gray code which minimizes the toggle count as successive bits are selected within the register. This gives dual power saving: one, in the counter output lines themselves, and two, in the multiplexer output which drives the LUT (i.e. the LUT address bus). There is an additional overhead of course. Since the register is not scanned sequentially, a simple shift register can no longer be used; the bits are instead selected through multiplexers, as shown in figure 4.22.
Figure 4.22. Shiftless Implementation of DA Based FIR with Any Sequencing Possible
Table 4.9. Comparison of Weighted Toggle Data for Different Gray Sequences
Tables 4.7, 4.8 and 4.9 highlight the effectiveness of the proposed techniques in reducing power dissipation in the DA based implementation of FIR filters. Here are some more results on power savings obtained for different numbers of bits of data precision and different distribution profiles of data values. Table 4.10 shows the percentage reduction in the number of toggles for two different Gaussian distributions.
Table 4.10. Toggle Reduction as a Percentage of 2's Complement Case for Two Different Gaussian Distributions
TR1 is the weighted toggle reduction as computed using the saving formula; TR2 is the percentage toggle reduction obtained by using 25000 actual samples (i.e. it accounts for the 'inter-data' toggles as well as the other factors mentioned in section 4.3.1.5) in an 8 tap filter. The predictable trend in the best case nega-binary scheme for different precisions is at once apparent. Further, it can be
Table 4.11. Toggle Reduction with Gray Sequencing for N = 8 and Some Typical Distributions
observed that as the precision increases, the TR1 and TR2 values approach each other, for the reasons mentioned in section 4.3.1.5.
Table 4.11 shows the best case gray sequencing toggle reduction, in the LUT address bus, obtained for 8-bit precision data with five different distributions. The first four are Gaussian, and the last one is a Gaussian bimodal distribution. As pointed out before, the toggle reduction decreases as the distribution becomes more and more uniform (i.e. as SD increases).
With gray sequencing, toggle reductions are obtained even with bimodal distributions. Nega-binary representations with low toggle regions symmetrically distributed about the origin do not exist, and therefore for such distributions the nega-binary architecture does not give good results.
Chapter 5
MULTIPLIER-LESS IMPLEMENTATION
Many DSP applications involve linear transforms whose coefficients are fixed at design time. Examples of such transforms include DCT, IDCT and color space conversion kernels such as RGB-to-YUV. Since the coefficients are fixed, the flexibility of a multiplier is not necessary, and an efficient implementation of such transforms can be obtained using adders and shifters. This chapter presents techniques for area efficient implementation of fixed coefficient 1-D and 2-D linear transforms.
A linear transform that maps N inputs to M outputs can be performed using matrix multiplication as shown below:
Y[1]     A[1,1]  A[1,2]  ...  A[1,N]     X[1]
Y[2]  =  A[2,1]  A[2,2]  ...  A[2,N]  ·  X[2]
...      ...     ...          ...        ...
Y[M]     A[M,1]  A[M,2]  ...  A[M,N]     X[N]
Y = A0·X3 + A1·X2 + A2·X1 + A3·X0   (5.1)

Y = X3 + X3<<1 + X3<<3 + X3<<4 + X3<<5 +
    X2 + X2<<1 + X2<<3 + X2<<5 +
    X1 + X1<<1 + X1<<4 + X1<<5 − X1<<7 +
    X0<<1 + X0<<3 + X0<<6 − X0<<7   (5.2)

Y = X3<<4 + X1 + X1<<4 + X1<<5 + X0<<3 + X0<<6 +
    X23 + X23<<1 + X23<<3 + X23<<5 +
    X01<<1 − X01<<7   (5.3)

Y = X3<<4 + X1<<4 + X0<<3 + X0<<6 +
    X23<<1 + X23<<3 + X01<<1 − X01<<7 +
    X123 + X123<<5   (5.4)

where X23 = X2 + X3, X01 = X0 + X1 and X123 = X1 + X23 are the precomputed common subexpressions.
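The saving is easy to verify numerically. The following C fragment (illustrative; the input values are arbitrary) evaluates equation 5.4 with the shared terms and checks it against the direct weighted sum of equation 5.1:

#include <stdio.h>

int main(void) {
    long X0 = 7, X1 = 3, X2 = 11, X3 = 2;      /* arbitrary test inputs */
    long A0 = 59, A1 = 43, A2 = -77, A3 = -54; /* the coefficients above */

    long direct = A0*X3 + A1*X2 + A2*X1 + A3*X0;   /* equation 5.1 */

    long X23 = X2 + X3, X01 = X0 + X1, X123 = X1 + X23;
    long Y = (X3 << 4) + (X1 << 4) + (X0 << 3) + (X0 << 6)
           + (X23 << 1) + (X23 << 3) + (X01 << 1) - (X01 << 7)
           + X123 + (X123 << 5);                   /* equation 5.4 */

    printf("direct = %ld, with shared terms = %ld\n", direct, Y);
    return 0;
}

Both evaluate to −18 for the sample inputs, confirming that the twelve add/subtract operations of equation 5.4 (including the three that build X23, X01 and X123) reproduce the eighteen-term expansion of equation 5.2.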
the total number of non-zero bits in the matrix less one. The coefficient matrix for the 4 term weighted-sum mentioned above is shown below.
      7   6   5   4   3   2   1   0
A0    0   0   1   1   1   0   1   1
A1    0   0   1   0   1   0   1   1
A2   -1   0   1   1   0   0   1   1
A3   -1   1   0   0   1   0   1   0
      7   6   5   4   3   2   1   0
A0    0   0   1   1   1   0   1   1
A1    0   0   1   0   1   0   1   1
A2    0   0   1   1   0   0   0   1
A3    0   1   0   0   1   0   0   0
XA23 -1   0   0   0   0   0   1   0
Figure 5.2. Coefficient Subexpression Graph for the 4-term Weighted-sum Computation
for each edge Eij, its end nodes i and j and all the other edges connecting to
the end nodes are deleted from the graph. The modified graph is searched to
find the edge with the highest weight. This weight is assigned as the one level
lookahead weight for the edge Eij. The subexpression corresponding to the
edge with highest one level lookahead weight is selected for elimination.
Y0 = A0·X   (5.7)
Y1 = A1·X   (5.8)
Y2 = A2·X   (5.9)
Y3 = A3·X   (5.10)

Y0 = X + X<<1 + X<<3 + X<<4 + X<<5   (5.11)
Y1 = X + X<<1 + X<<3 + X<<5   (5.12)
Y2 = X + X<<1 + X<<4 + X<<5 − X<<7   (5.13)
Y3 = X<<1 + X<<3 + X<<6 − X<<7   (5.14)

Y0 = X_01 + X<<3 + X<<4 + X<<5   (5.15)
Y1 = X_01 + X<<3 + X<<5   (5.16)
Y2 = X_01 + X<<4 + X<<5 − X<<7   (5.17)
Y3 = X<<1 + X<<3 + X<<6 − X<<7   (5.18)

Y0 = X_01 + X<<3 + X_45   (5.19)
Y1 = X_01 + X<<3 + X<<5   (5.20)
Y2 = X_01 + X_45 − X<<7   (5.21)
Y3 = X<<1 + X<<3 + X<<6 − X<<7   (5.22)

where X_01 = X + X<<1 and X_45 = X<<4 + X<<5.
ii. CSAB+− in which the bit values at the two bit locations are non-zero for more than one coefficient, and the values are either 1 and −1 or −1 and 1.
      7   6   5   4   3   2   1   0   X35
A0    0   0   0   0   1   0   1   1    1
A1    0   0   1   0   1   0   1   0    1
A2   -1   0   0   0   0   0   1   1    1
A3   -1   1   0   0   1   0   1   0    0
in terms of X_01, so as to reduce the number of additions by one and the number of shifts by one.
In the fourth and final phase of the optimization process, the coefficient matrix (first B columns) is searched for two 2 bit subexpressions with a 'shift' relationship between them. Such expressions can also be eliminated so as to reduce the number of additions. For example, consider two coefficients A0 = 0.0101010 and A1 = 0.1000101, with the corresponding Y0 and Y1 computations given by:
Y0 = X<<1 + X<<3 + X<<5   (5.27)
Y1 = X + X<<2 + X<<6   (5.28)
While no CSABs can be found for these coefficients, there exist subexpressions X_13 (in A0) and X_02 (in A1) that are related by a 'shift' relationship. Y0 and Y1 can hence be recomputed in terms of X_02 (= X + X<<2) as follows
Procedure Minimize-Additions-for-MCM
Input: N filter coefficients represented using B bits of precision
Output: Data flow graph representation of the MCM computation, with the nodes of the flow graph restricted to add, subtract and shift operations.
can be reduced by an average factor of 2.2. This is much higher than the factor of 1.43 (avg.) for the FIR filters presented in [78].
5 The above mentioned upper bound does not account for the reduction achieved using the common subexpression precomputation technique. This point has also been highlighted in [78].
Based on the above observations, a tighter upper bound on the number of additions can be obtained by first coming up with a subset of constants which have more than one 1. This subset can be further reduced by eliminating those constants that can be obtained by left-shifting other constants in the subset. In other words, in the reduced subset no two constants are related by just a shift operation.
For a given constant N1 (with more than one 1 in its binary representation), another constant N2 can always be found such that N2 has one less 1 and the Hamming distance between N1 and N2 is one. The multiplication N1·X can hence be computed with one addition, as N2·X plus an appropriately left shifted X.
Based on the above analysis, the multiplication by each member of the reduced subset can be computed using just one addition. The upper bound on the number of additions is thus given by the cardinality of the reduced subset.
It can be noted that no two constants with '1' as their LSB can be related by just a shift operation. It can also be noted that for a constant N1 whose LSB is '0', there always exists a constant N2 with '1' as its LSB, such that N1 = N2<<K, where the amount of left shift K is given by the number of continuous 0s at the LSB end of N1. For example, for N1 = 00101000, there exists N2 = 00000101, such that N1 = N2<<3. Based on these observations, the reduced subset consists of those constants which have more than one 1 and have '1' as their LSB. It can easily be shown that for a B bit number, the cardinality of such a reduced subset is given by 2^{B−1} − 1. This hence is an upper bound on the number of additions for MCM computation.
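The bound is easy to confirm by enumeration. The following C fragment (illustrative) counts the B bit unsigned constants that have more than one 1 and a 1 in the LSB position:

#include <stdio.h>

int main(void) {
    int B = 8, count = 0;
    for (unsigned c = 0; c < (1u << B); c++) {
        int pop = 0;                     /* number of 1s in c */
        for (unsigned x = c; x; x >>= 1) pop += x & 1u;
        if (pop > 1 && (c & 1u)) count++;
    }
    printf("reduced subset size = %d (2^(B-1) - 1 = %d)\n",
           count, (1 << (B - 1)) - 1);
    return 0;
}

For B = 8 both numbers are 127: of the 128 odd constants, only the constant 1 has a single non-zero bit.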
The analysis presented above assumes constants with a B bit unsigned representation. A similar analysis can be performed in case of constants with a B bit 2's complement representation. These constants (except for −2^{B−1}) can also be represented using a B bit signed-magnitude representation. It can be noted that multiplying a variable X with a negative constant can be achieved by multiplying −X with the corresponding positive constant. Thus once −X is computed, multiplication by all negative constants which have only one '1' in their magnitude part can be implemented using only a shift operation. The multiplication by the constant −2^{B−1} can also be handled the same way. It can also be noted that a multiplication of a negative constant (say −N1) with a variable X can be computed using one subtraction, as (−N1·X) = 0 − (N1·X).
The B bit constants can be divided into a set of (B−1) bit positive constants and a set of (B−1) bit negative constants. Since the reduced subset of (B−1) bit constants has a cardinality of (2^{B−2} − 1), the MCM computation using the positive constants can be achieved using (2^{B−2} − 1) additions. Similarly, the
Y0 = 17·X = 00010001·X
Y1 = 19·X = 00010011·X

Using the common subexpression precomputation technique, the above computation can be performed using two additions as follows

Y0 = X + X<<4
Y1 = Y0 + X<<1

Using the 8 bit CSD scheme no common subexpression exists. This computation thus requires two additions and one subtraction (one extra compared to the uni-sign representation), as shown below

Y0 = 00010001·X = X + X<<4
Y1 = 0001010N·X = X<<4 + X<<2 − X
Here is another example, where CSD reduces the total number of non-zero bits in the coefficients but does not minimize the total number of additions and subtractions across the coefficient multiplications. Consider the following computation, with coefficients represented in 8 bit 2's complement form

Y = 00010101·X1 + 10011101·X2 + 10011001·X3

Using the techniques presented in section 5.1, the above computation can be performed using seven additions and one subtraction as follows:

T1 = X1 + X2
T2 = X2 + X3
Y = X3 + X1<<4 + T1 + T1<<2 + T2<<3 + T2<<4 − T2<<7
Using CSD representation, the computation to be performed is

Y = 00010101·X1 + N0100N01·X2 + N010N001·X3

Using the techniques presented in section 5.1, this computation can be performed using five additions and three subtractions as follows:

T2 = X2 + X3
Y = X1 + X1<<2 + X1<<4 + T2 + T2<<5 − T2<<7 − X2<<2 − X3<<3
While the total number of non-zero bits is reduced by 1 (12 to 11) using the CSD representation, the total number of additions and subtractions is not reduced.
The FIR signal flow graph can be transformed so as to compute the output Y[n] in terms of input data values and the previously computed output Y[n−1].
Y[n] = Σ_{i=0}^{N−1} A[i] · X[n−i]   (5.32)

Adding the LHS of equation 5.31 to, and subtracting the RHS of equation 5.31 from, the RHS of equation 5.32 gives:

Y[n] = Σ_{i=0}^{N−1} A[i] · X[n−i] − Σ_{i=0}^{N−1} A[i] · X[n−1−i] + Y[n−1]   (5.33)

Y[n] = A[0] · X[n] + Σ_{i=1}^{N−1} A[i] · X[n−i] − Σ_{i=0}^{N−2} A[i] · X[n−1−i] − A[N−1] · X[n−N] + Y[n−1]   (5.34)

Y[n] = A[0] · X[n] + Σ_{k=1}^{N−1} (A[k] − A[k−1]) · X[n−k] − A[N−1] · X[n−N] + Y[n−1]   (5.35)
Figure 5.4 shows the signal flow graph of a 4 tap FIR filter transformed using
the above mentioned approach.
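The equivalence of the two structures can be checked by direct simulation. The following C fragment (illustrative; filter and data values are made up) runs a 4 tap filter in both its direct and transformed forms:

#include <stdio.h>

#define TAPS 4
#define LEN  10

int main(void) {
    int A[TAPS] = {2, 5, 5, 2};                  /* linear phase example */
    int x[LEN]  = {1, -2, 3, 0, 4, -1, 2, 2, -3, 1};

    int C[TAPS + 1];                  /* difference coefficients, eq. 5.35 */
    C[0] = A[0];
    for (int k = 1; k < TAPS; k++) C[k] = A[k] - A[k-1];
    C[TAPS] = -A[TAPS - 1];

    long yprev = 0;                   /* Y[n-1], zero initial condition */
    for (int n = 0; n < LEN; n++) {
        long yd = 0;                  /* direct form, equation 5.32 */
        for (int i = 0; i < TAPS; i++)
            if (n - i >= 0) yd += (long)A[i] * x[n - i];

        long yt = yprev;              /* transformed form, equation 5.35 */
        for (int k = 0; k <= TAPS; k++)
            if (n - k >= 0) yt += (long)C[k] * x[n - k];

        printf("n=%d : direct=%ld, transformed=%ld\n", n, yd, yt);
        yprev = yt;
    }
    return 0;
}

The two columns match for every n. Note that for this symmetric filter C[2] = A[2] − A[1] = 0, illustrating the claim proved in the following paragraphs that the transformation preserves the multiplication count for linear phase filters.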
The direct form structure of an N tap FIR filter requires N multiplications and N−1 additions. With the above mentioned SFG transformation, the resultant structure (figure 5.4) requires (N+1) multiplications and (N+1) additions. While this transform results in more computation, it also modifies the filter coefficients. If the saving in the number of additions due to the modified filter coefficients is more than the overhead of the additional computation, this transformation can result in an area-efficient multiplier-less FIR implementation. Such a possibility
is higher in case of linear phase FIR filters, because for such filters this SFG transformation retains the number of multiplications required to compute the output. This can be proved by analyzing the coefficient symmetry property of the transformed SFG.
The coefficient symmetry property (stated below) of the linear phase filters can be used to reduce the number of multiplications by half in the direct form FIR implementation.
For an N tap FIR filter, the corresponding transformed structure (equation 5.35) has N+1 coefficients C[0] to C[N]. If the original filter has symmetric coefficients (linear phase), the coefficients of the transformed structure are anti-symmetric, i.e. C[i] = −C[N−i], as shown below.
From 5.35, C[0] = A[0] and C[N] = −A[N−1]
From 5.36, A[0] = A[N−1]
From the above two equations, C[0] = −C[N] ... proved for i=0
From 5.35, C[j] = A[j] − A[j−1] and C[N−j] = A[N−j] − A[N−j−1]
From 5.36, A[N−j] = A[j−1] and A[N−j−1] = A[j]
From the above two equations, C[j] = A[j] − A[j−1] and C[N−j] = A[j−1] − A[j]
hence C[j] = −C[N−j] ... proved.
An N tap linear phase filter requires N/2 multiplications if the number of coefficients is even, and (N+1)/2 multiplications if the number of coefficients is odd. If N is odd, the transformed filter has an even number (N+1) of coefficients which are anti-symmetric, and hence requires (N+1)/2 multiplications. For N even, the transformed filter has an odd number (N+1) of coefficients and hence requires (N+2)/2 multiplications. However, since from (5.36) A[N/2] = A[N/2−1], the coefficient C[N/2] = A[N/2] − A[N/2−1] = 0. Thus for N even, the transformed filter requires N/2 multiplications. For example, consider the SFG shown in figure 5.4. If the original filter has linear phase, the coefficient values A[1] and A[2] are the same, hence the coefficient (A[2]−A[1]) in this SFG is 0. This SFG thus requires two multiplications and four additions, as against the two multiplications and three additions required by the direct form 4 tap linear phase filter.
The above analysis shows that this signal flow graph transformation retains the number of multiplications required in case of linear phase FIR filters, and provides an opportunity to reduce the number of additions by altering the coefficient values. As an example, consider the case of A[2] = 19 = 00010011 and A[3] = -13 = 0000NN0N. The transformed structure will have a coefficient C[3] = A[3] − A[2] = -32 = 00N00000, which has just one non-zero bit.
The computation of Y[n] in terms of Y[n−1] can also be achieved by subtracting the LHS of equation 5.31 from, and adding the RHS of equation 5.31 to, the RHS of equation 5.32.
VI. diff-csd: Transformed SFG (as in figure 5.4) with coefficient differences represented in CSD form.
VII. sum-csd: Transformed SFG (as in figure 5.5) with coefficient sums represented in CSD form.
The bar chart in figure 5.6 shows the average reduction factor for the various transforms. This can be used to analyze the impact of the coefficient transforms on the amount of minimization achieved using common subexpression elimination. As can be seen from the bar chart, common subexpression elimination results in maximum reduction when the coefficients are represented in 2's complement form. It can be noted that among all the coefficient transforms, the 2's complement coefficient representation results in the maximum number of additions in the initial solution (i.e. without common subexpression elimination). The trend in figure 5.6 shows that, for a given filter, the higher the total number of non-zero bits in the coefficient representations produced by a transform, the higher the reduction achieved using common subexpression elimination. This also indicates that if a coefficient transform results in the best initial solution, it may not always give the best final solution. Hence it is important to explore across coefficient transforms to get the most efficient implementation.
Figure 5.7. Best Reduction Factors Using Coefficient Transforms Without Common Sub-expression Elimination
The bar chart in figure 5.7 gives the best reduction factors with respect to the initial solutions (i.e. 2's complement representation) using coefficient transforms alone (i.e. without applying common subexpression elimination). It can be noted that reduction by a factor as high as 4.9 can be achieved. Figure 5.7 also shows, for each filter, the coefficient transform that results in the most reduction. As expected, the CSD representation results in maximum reduction for all the filters. It can also be noted that CSD used in conjunction with the SFG transformations that compute Y[n] in terms of Y[n−1] results in maximum reduction in 50% of the cases.
The bar chart in figure 5.8 gives the best reduction factors using the coefficient transforms in conjunction with common subexpression elimination. The reduction factors are computed with respect to an initial solution given by applying common subexpression elimination on coefficients represented in 2's complement form. It can be noted that reduction by a factor as high as 2 can be achieved. Figure 5.8 also shows, for each filter, the coefficient transforms that result in the most reduction. It can be noted that for a filter, multiple coefficient transforms can result in the best final solution. As an example, for LP64 three coefficient transforms (II. allp, V. csd and VI. diff-csd) result in the best final solution. It can also be noted that a coefficient transform that results in the best
Figure 5.8. Best Reduction Factors Using Coefficient Transforms with Common Sub-expression Elimination
initial solution may not in all cases result in the best final solution. As an example consider LP48, for which the CSD (V) transform results in the best initial solution but the allp (II) transform results in the best final solution.
The bar chart in figure 5.9 gives the number of times each coefficient transform results in the best final solution. As can be seen from the figure, while CSD gives the best solution in most of the cases, the uni-sign representation can also in some cases perform better than CSD. The figure also highlights the role of the SFG transformations shown in figures 5.4 and 5.5, which result in the best final solution for 8 out of the 14 filters.
This data can also be used to compare the two signal flow graph transformations shown in figures 5.4 and 5.5, which result in SFG structures with coefficient-differences and coefficient-sums respectively. As discussed earlier, the transformation in figure 5.5 always results in one more coefficient multiplication in case of even order filters. It hence results in a relatively higher number of additions for even order filters. Overall, the SFG transformation based on coefficient differences (figure 5.4) provides higher reduction than the transform based on coefficient sums (figure 5.5).
Figure 5.9. Frequency of Various Coefficient Transforms Resulting in the Best Reduction Factor with Common Sub-expression Elimination
The results presented in this section thus demonstrate the role of various transforms in minimizing the number of computations in multiplier-less implementations of FIR filters. The various coefficient transforms discussed in this chapter enable exploration of a wider search space, resulting in the best final solution.
Since register allocation, functional unit binding and scheduling are interdependent, an approach that unifies these three steps in one synthesis algorithm is necessary to get an optimal implementation. Such an integrated and precision sensitive approach to the synthesis of multi-precision data flow graphs has been presented in [3].
Chapter 6
IMPLEMENTATION OF MULTIPLICATION-FREE
LINEAR TRANSFORMS ON A PROGRAMMABLE
PROCESSOR
and then use a list-scheduling-based algorithm for instruction scheduling. In the case
of multiplication-free linear transforms, since primarily ADD and SUB instructions
are used, the operand part of the instructions dominates the overall power
dissipation. Instead of assigning registers to the variables before instruction
scheduling, the approach presented in this chapter first performs instruction
scheduling using DAG variables and then does register assignment for low
power code generation. The chapter also presents a technique that reorders the
nodes of the DAG so as to minimize the power dissipation in the case of single
register architectures.
This chapter is organized into two sections, 6.1 and 6.2, which present code
generation techniques targeted at register-rich and single register architectures,
respectively. Each section also gives results that highlight the effectiveness of
these techniques in terms of reducing the number of cycles and the power
dissipation for various multiplication-free linear transforms.
The target architecture supports ADD and SUB instructions with operands Sr1,
Sr2, 'Shift' and Dr, where Sr1 and Sr2 are the source registers, Dr is the destination
register and 'Shift' is the amount by which Sr2 is left shifted before being added
to/subtracted from Sr1. In addition to these instructions, the architecture also supports
load and store instructions for movement of data between the registers and the
data memory.
It is assumed that a transform is implemented as a function and all the input
data values are loaded into the registers before calling the function. Within the
main body of the function, the transform is performed as a series of ADD and
SUB instructions that operate on the data stored in the registers and produce
outputs which are stored in the registers.
The code generation phase takes the DAG as input and performs instruction
scheduling and register assignment aimed at producing code that is smallest
in terms of program size, executes in the minimum number of cycles, uses the
minimum number of registers and dissipates the least amount of power.
[Figure: register-rich target architecture: program memory feeds an instruction register; the decode stage generates register addresses (Sr1, Sr2, Dr) and control; the execute stage reads the register file, applies the shifter (<<) to Sr2 and the adder/subtracter (+/-), with load/store paths to data memory]
[Figure 6.2: a 3x3 transform window with weights W1-W9 and the corresponding 3x3 pixel window with values X1-X9]
Since this sequence is primarily decided by the sequence in which the instructions
are fetched, this component of the power dissipation is also dependent on
the Hamming distance between successive instructions. Thus it can be noted
that minimizing the Hamming distance between consecutive instructions reduces
power dissipation in all three stages of the pipeline.
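As a concrete measure, the switching activity on the instruction bus can be estimated by the bit-level Hamming distance between successive instruction words. Here is a minimal Python sketch of that measure; the 8-bit instruction encodings are hypothetical and chosen only for illustration:

def hamming_distance(word_a, word_b):
    # Number of bit positions that toggle when word_b follows word_a
    return bin(word_a ^ word_b).count("1")

# Hypothetical 8-bit encodings of two successive instructions; fewer
# toggled bits means less switching power on the instruction bus.
prev_instr = 0b01010011
next_instr = 0b01010110
print(hamming_distance(prev_instr, next_instr))  # prints 2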
Examples of such transforms include the 3x3 pixel window transforms [27]
used in image processing. Consider a 3x3 pixel window of an image (figure 6.2)
with values X[1] to X[9] and the corresponding transform window with weights
W[1] to W[9]. The transform is then computed as:

Y = Σ_{i=1}^{9} W[i] · X[i]    (6.2)
Figure 6.3 shows the Prewitt window transform [27] which is used for edge
detection. The corresponding DAG is also shown in figure 6.3.
This subsection looks at low power code generation for such transforms.
As discussed earlier, the code generation should aim at reducing the Hamming
distance between successive instructions. An instruction has two parts: the first
part gives the operator (ADD or SUB) and the second part gives the operands
(source and destination registers). The Hamming distance in the operator part
of the instruction can be reduced during instruction scheduling by maximizing
the sequences of consecutive ADD and consecutive SUB operations. The other
technique is to modify the DAG itself so as to maximize nodes/operations of
the same type. For example, the DAG in figure 6.3 can be transformed to
[Figure 6.3: the Prewitt window transform, with weights (1 0 -1) in each of the three rows, and its DAG with inputs X1, X3, X4, X6, X7, X9; Figure 6.4: the transformed DAG in which all five nodes are of SUB type, pairing (X1, X3), (X6, X4) and (X9, X7)]
the DAG shown in figure 6.4. While the initial DAG in figure 6.3 has three
SUB and two ADD nodes, the transformed graph has all five nodes of SUB
type. Consequently, the code generated from the transformed DAG has zero
Hamming distance in the operator part of the instructions.
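To make the equivalence concrete, here is a small Python check of one all-SUB rewriting consistent with figure 6.4, assuming the vertical Prewitt computation Y = X1 - X3 + X4 - X6 + X7 - X9; the sample values are arbitrary:

# Original DAG: three SUB and two ADD nodes.
# Transformed DAG: five SUB nodes, hence zero Hamming distance in the opcode field.
X1, X3, X4, X6, X7, X9 = 5, 2, 7, 1, 4, 3

y_mixed   = (X1 - X3) + (X4 - X6) + (X7 - X9)   # 3 SUBs + 2 ADDs
y_all_sub = (X1 - X3) - (X6 - X4) - (X9 - X7)   # 5 SUBs
assert y_mixed == y_all_sub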
The reduction in the Hamming distance in the operands part of the instructions
results in power reduction in the register file as well, and thus has a bigger
impact on the overall power reduction. Here is a technique that suitably transforms
the DAG and generates code with minimum total Hamming distance
between successive instructions.
[Figure: a chain of nodes starting from X1 and consuming X4, X7, X3, X6 and X9 in sequence to produce Y]
Figure 6.5. Chain-type DAG for Prewitt Window Transform
Step 1: Convert the DAG to a 'Chain' structure and reorder the nodes so
as to group all ADD nodes together. The reordering minimizes the Hamming
distance in the operator part of the instructions. Figure 6.5 shows such a DAG
for the Prewitt Window transform.
Step 2: The chain structure fixes the scheduling of the operations and is
given by the order of nodes in the DAG. Generate the instruction sequence
using variables as operands. It can be noted that in such a sequence, the first
source and the destination variables of all instructions can be assigned to the
same register. Such an assignment results in zero Hamming distance in the
Sr1 and Dr operands of the instructions. For the variables in Sr2, the registers
are assigned in a Gray code sequence so as to minimize the Hamming distance
between successive Sr2 operands of the instructions. Figure 6.6 shows the code
for the Prewitt window transform before and after the register assignment.
It can be noted that this two-step algorithm generates code that is optimal in
terms of minimum program size, minimum number of cycles, minimum number
of registers and minimum power dissipation.
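The following Python sketch illustrates the two steps under stated assumptions: the chain ordering of figure 6.5 is taken as given, the register shared by Sr1 and Dr is called R0, and the Sr2 registers are handed out in Gray-code order (the helper names and register numbering are illustrative, not the book's notation):

def gray(n):
    # n-th Gray code; successive values differ in exactly one bit
    return n ^ (n >> 1)

def chain_code(first_var, chain_ops):
    # chain_ops: list of (op, variable) applied left to right, e.g.
    # [("ADD", "X4"), ("SUB", "X3")]; Sr1 and Dr always use R0, so only
    # the Sr2 field can contribute Hamming distance between instructions.
    code = ["; R0 holds " + first_var]
    for i, (op, var) in enumerate(chain_ops):
        sr2 = "R%d" % gray(i + 1)      # R1, R3, R2, R6, R7, ...
        code.append("%s R0 %s R0   ; %s holds %s" % (op, sr2, sr2, var))
    return code

for line in chain_code("X1", [("ADD", "X4"), ("ADD", "X7"),
                              ("SUB", "X3"), ("SUB", "X6"), ("SUB", "X9")]):
    print(line)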
Consider, as an example, the 4x4 Haar transform, given by the transformation matrix:

[Y1]   [ 1  1  1  1 ] [X1]
[Y2] = [ 1 -1  0  0 ] [X2]
[Y3]   [ 1  1 -1 -1 ] [X3]
[Y4]   [ 0  0  1 -1 ] [X4]
Two types of common subexpressions can be identified across the columns of the transformation matrix:
1. CS++, in which the elements in two columns of the matrix are both 1 or both
-1 for more than one row (e.g. X12+: columns 1, 2 for rows 1 and 3).
2. CS+-, in which the elements in two columns of the matrix are (+1, -1) or
(-1, +1) for more than one row.
Every iteration involves selecting the common subexpression that results in the maximum
reduction in the number of operations used to perform the transform. This
is the same heuristic that is used for minimizing additions in the multiplier-less
implementation of weighted-sum and MCM computations.
Once the subexpression is identified, the transformation matrix is updated
to reflect the precomputation. This is done by adding a new column to the
transformation matrix and suitably updating the matrix elements. For example,
consider the X12+ common subexpression. The modified transformation matrix is
shown below:
[Y1]   [ 0  0  1  1  1 ] [X1]
[Y2] = [ 1 -1  0  0  0 ] [X2]
[Y3]   [ 0  0 -1 -1  1 ] [X3]
[Y4]   [ 0  0  1 -1  0 ] [X4]
                         [T1]

where the fifth column corresponds to the precomputed subexpression T1 = X1 + X2.
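A minimal sketch of the subexpression identification step, assuming the transformation matrix is given as a list of rows with entries in {-1, 0, 1}; CS++ pairs have equal non-zero entries in two columns for more than one row, CS+- pairs have opposite entries:

from itertools import combinations

def find_common_subexpressions(matrix):
    # Returns ((col_i, col_j), kind, rows) for every candidate pair
    n_cols = len(matrix[0])
    found = []
    for ci, cj in combinations(range(n_cols), 2):
        same = [r for r, row in enumerate(matrix)
                if row[ci] != 0 and row[ci] == row[cj]]
        oppo = [r for r, row in enumerate(matrix)
                if row[ci] != 0 and row[ci] == -row[cj]]
        if len(same) > 1:
            found.append(((ci, cj), "CS++", same))
        if len(oppo) > 1:
            found.append(((ci, cj), "CS+-", oppo))
    return found

haar4 = [[1, 1, 1, 1], [1, -1, 0, 0], [1, 1, -1, -1], [0, 0, 1, -1]]
print(find_common_subexpressions(haar4))
# ((0, 1), 'CS++', [0, 2]) is X12+ shared by rows 1 and 3;
# ((2, 3), 'CS++', [0, 2]) is the corresponding X34 subexpression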
[Figure 6.7: optimized DAG for the 4x4 Haar transform, with T1 = X1 + X2, T2 = X3 + X4, Y1 = T1 + T2, Y2 = X1 - X2, Y3 = T1 - T2, Y4 = X3 - X4]
Figure 6.7 shows the optimized DAG for the 4x4 Haar transform. It requires
six computations compared to the eight computations required if the transform is
computed as four 1x4 transforms.
ADD X1 X2 T1
SUB X1 X2 Y2
SUB X3 X4 Y4
ADD X3 X4 T2
ADD T1 T2 Y1
SUB T1 T2 Y3
T2, which differs in three fields (all three operands), and Y4, which differs in
all four fields.
While computing the difference between the current node and the latest scheduled
node, if the current node is of type ADD, use commutativity to swap
the source operands and check whether the difference reduces with the swapped
operands.
For example, if the latest scheduled node corresponds to SUB X2 X1 Y1,
then for a node with operation ADD X1 X2 Y2 the difference will be 4; however,
with the inputs swapped, the same operation, ADD X2 X1 Y2, will have a difference
of two.
} until ready-to-be-scheduled-list is empty
Figure 6.8 gives the output of instruction scheduling for the DAG shown in
figure 6.7.
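A sketch of the difference computation used during scheduling, including the commutativity-based operand swap for ADD nodes; representing instruction fields as simple tuples is an assumption made here for illustration:

def field_difference(prev, curr, swap=False):
    # prev, curr: (op, src1, src2, dst); count differing fields,
    # optionally with curr's sources swapped (legal only for ADD)
    op, s1, s2, d = curr
    if swap:
        s1, s2 = s2, s1
    return sum(a != b for a, b in zip(prev, (op, s1, s2, d)))

prev = ("SUB", "X2", "X1", "Y1")
curr = ("ADD", "X1", "X2", "Y2")
print(field_difference(prev, curr))              # 4
print(field_difference(prev, curr, swap=True))   # 2

This reproduces the example above: with the inputs swapped, only the opcode and destination fields differ.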
Step 2: Register Assignment
From the schedule derived in step 1, find the lifetimes of all the variables.
Figure 6.9 shows the data flow graph for the scheduled DAG and the lifetime
spans for all the variables.
Construct a register-conflict graph as follows. Each node in the graph represents
a variable in the data flow graph. Connect two nodes of the graph if the
lifetimes of the corresponding variables overlap.
Figure 6.10 shows the register-conflict graph for the data flow graph shown
in figure 6.9.
The register assignment approaches discussed in the literature [4] solve the
problem as a graph coloring problem, where no two nodes which are connected
by an edge are assigned the same color and the graph is thus colored using the
minimum number of colors.
In this approach, the number of registers is minimized only to the extent of
eliminating register spills, and the focus is more on low power considerations.
The instruction schedule is analyzed to build a consecutive-variables graph in
which each node represents a variable in the data flow graph.
[Figure: variable lifetime chart: control steps C1-C6 against variables X1-X4, T1, T2 and Y1-Y4 for the scheduled 4x4 Haar transform code]
Figure 6.9. Data Flow Graph and Variable Lifetimes for 4x4 Haar Transform
[Figures 6.10 and 6.11: register-conflict graph and consecutive-variables graph over the variables X1-X4, T1, T2 and Y1-Y4]
Two nodes of the graph are connected if the corresponding variables appear in
consecutive cycles at the same operand location. Each edge E[i,j] in the graph is
assigned a weight W[i,j] given by the number of times variables i and j appear
consecutively in the instruction sequence.
Figure 6.11 shows the consecutive-variables graph for the DFG shown in
figure 6.9. It can be noted that for this graph, all the edges have the same
weight (=1).
[Figure: Gray code based register assignment for the scheduled code (e.g. R1 = 0001, R5 = 0101, R7 = 0111), with the Hamming distance between successive ADD/SUB instructions annotated]
CF = Σ_i Σ_j HD[i,j] · W[i,j]    (6.3)
This cost function is the same as that used for FSM state encoding. Many techniques
have been proposed to solve this problem, including approaches based on
simulated annealing [63] and stochastic evolution [47].
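A minimal sketch of evaluating this cost function for a candidate register encoding; the consecutive-variables edges and the 4-bit register codes below are illustrative only:

def encoding_cost(encoding, edges):
    # CF = sum over edges (i, j) of HD(code_i, code_j) * W[i, j]
    return sum(bin(encoding[i] ^ encoding[j]).count("1") * w
               for (i, j), w in edges.items())

# Toy consecutive-variables graph with unit edge weights, as in figure 6.11.
edges = {("X1", "X2"): 1, ("X2", "T1"): 1, ("T1", "T2"): 1}
encoding = {"X1": 0b0000, "X2": 0b0001, "T1": 0b0011, "T2": 0b0010}
print(encoding_cost(encoding, edges))  # prints 3

A stochastic search would perturb the encoding and keep changes that lower this cost.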
Sobel Window Transform    Spatial High-Pass Filter    Spatial Low-Pass (Averaging) Filter
[  1  2  1 ]              [ -1 -1 -1 ]                [ 1 1 1 ]
[  0  0  0 ]              [ -1  8 -1 ]                [ 1 1 1 ]
[ -1 -2 -1 ]              [ -1 -1 -1 ]                [ 1 1 1 ]
these DAGs. The third column presents results for the C code that represents
the reordered DAG and consequently re-scheduled instructions. The fourth
column gives the Hamming distance measure for the code generated by the low
power code generator. The results assume that the Hamming distance between
the ADD and the SUB opcodes is one.
As can be seen from the results, significant power reduction can be achieved
by using a low power driven code generation approach. To compare this approach
with the approach that first does register assignment and then performs
cold scheduling, the register assignment done by the TMS470R1x C compiler
was used to cold schedule the Prewitt window transform. The total Hamming
distance for the resultant code was eight, compared to the measure of five for
the low power code generator. This justifies the approach of first scheduling
the instructions and then performing low power register assignment.
[Figure: single register, accumulator based architecture: the program address bus (PAB) and program data bus (PDB) connect the program memory to the instruction register; an address generator drives the data read/write address busses (DRAB, DWAB); the data read bus (DRDB) feeds a shifter and an adder/subtracter whose result is held in the accumulator (ACC), which writes back to data memory over the data write bus (DWDB)]
[Figure 6.16: example DAG with inputs X1-X4, intermediate nodes T1-T3 and outputs Y1, Y2]
Let 'current' node be the latest evaluated node and 'new' node be the new
node for which the code is being generated.
2. If the 'current' node is not one of the fanin nodes of the 'new' node, save
the 'current' node (SAC instruction), load the left fanin node of the 'new'
node (LAC instruction) and ADD/SUBTRACT the right fanin node of the
'new' node.
3. If the 'current' node is a right fanin node of the 'new' node and the 'new'
node function is SUBTRACT, negate the 'current' node (NEG instruction)
and ADD the left fanin node of the 'new' node.
Consider the sequences {T1, T2, T3, Y1, Y2}, {T2, T3, T1, Y1, Y2} and {T2,
T1, Y1, T3, Y2} for the DAG shown in figure 6.16. The corresponding code is
shown in table 6.2.
As can be seen from this example, for a given DAG, the code size and
consequently the number of cycles depend on the sequence in which the nodes
are evaluated. The code optimization problem thus maps onto the problem of
finding an optimum sequence of DAG node evaluations.
Procedure DAG-Schedule
Input: DAG representation of the computation to be implemented on a single
register, accumulator based machine
Output: An order in which the DAG nodes need to be scheduled so as to
generate code with minimum accumulator spills
scheduled-node-list = {}
current-node = ∅
while (no.-of-scheduled-nodes < total-no.-of-intermediate+output-nodes) {
    /* build candidate-node-list */
    candidate-node-list = {}
    for all (node_i ∉ scheduled-node-list) {
        if ((node_i.left-fanin ∈ (input-node-list + scheduled-node-list)) .and.
            (node_i.right-fanin ∈ (input-node-list + scheduled-node-list)))
            candidate-node-list += node_i
    }
    /* assign weights to the candidate-nodes */
    for all (node_i ∈ candidate-node-list) {
        node_i.weight = 1
        if ((node_i ∈ output-node-list) .or. (node_i.fanout >= 2))
            node_i.weight++
        if ((node_i.left-fanin = current-node) .or.
            ((node_i.right-fanin = current-node) .and. (node_i.op = ADD)))
            node_i.weight += 2
        if (node_i.fanout-node.right-fanin ∈ scheduled-node-list)
            node_i.weight += 2
    }
    /* select the node with the highest weight for scheduling */
    Find (node_m ∈ candidate-node-list) such that node_m.weight is maximum
    scheduled-node-list += node_m
    current-node = node_m
}
In the above algorithm, at each stage of node selection, there can be more
than one node with the same weight. During the first phase of the algorithm a
node is selected randomly. In the iterative refinement phase, a node selected at
each stage is replaced by another node (if available) with the same weight. The
resultant schedule is compared with the initial schedule and accepted if it results
in fewer cycles.
The scheduling algorithm, when applied to the DAG shown in figure 6.16,
generates the schedule T2, T3, T1, Y1, Y2, with no further improvement possible
in the iterative refinement phase. During the first iteration of the algorithm,
the candidate-node-list consists of nodes T1 and T2, with weights 1 and 4
respectively. Node T2 is hence scheduled first. In the second iteration, the
candidate list consists of nodes T1 and T3, both having weight 3. Selecting T3
results in the schedule T2, T3, T1, Y1, Y2, which requires 11 cycles. During
the iterative refinement phase of the algorithm, T1 is selected instead of T3,
resulting in the schedule T2, T1, Y1, T3, Y2, which also requires 11 cycles.
The code generated by the algorithm presented in this chapter was compared
with that generated using the optimizing C compiler for the TMS320C5x. The DAGs
for the 4x4 Walsh-Hadamard transform shown in figures 6.17, 6.18 and 6.23 were
converted to an equivalent C program and compiled with the highest optimization
level. The generated code, which used indirect addressing, was converted to
use direct addressing, thus reducing the number of cycles. Table 6.3 shows the
comparison in terms of the number of cycles, assuming that the program and data
are available in on-chip memories.
The results show that the code generator generates code as compact as that of the
'C5x C compiler for the first two DAGs.
[Figure: DAGs for the 4x4 Walsh-Hadamard transform computed as four 1x4 transforms, mapping inputs X1-X4 to outputs Y1-Y4]
It does better in the case of the DAG in figure 6.23. The main reason for this is that the C compiler, during its optimization
phase, modifies the DAG and in the process generates code with more
cycles.
[Equation and figure: the 4x4 Walsh-Hadamard transform matrix, with entries 1 and -1, applied to the inputs X1-X4, and the corresponding DAG producing outputs Y1-Y4]
20 cycles, the DAG in figure 6.18 requires 22 cycles to compute the transform,
even though it has four fewer nodes. Clearly, fewer nodes do not
always translate into fewer cycles. The main reason for the DAG in
figure 6.18 requiring more cycles is that all its intermediate nodes have a fanout
of two. For single register, accumulator based architectures, such intermediate
nodes result in accumulator spills, and consequently in 'store' and 'load'
overhead.
[Figure: tree-to-chain conversion. Code for the tree-structured DAG (left, with an accumulator spill through T1) and for the equivalent chain DAG (right):

LAC X1            LAC X1
ADD X2            ADD X2
SAC T1            ADD X3
LAC X3            ADD X4
ADD X4            SAC Y1
ADD T1
SAC Y1 ]
shows the serialized DAGs, which require five cycles compared to the six cycles
required for the butterfly computation.
As can be seen from the figure, there are two ways of serializing a butterfly,
depending on whether Y1 is computed in terms of Y2 (Y1 = Y2 + 2*X2) or
Y2 is computed in terms of Y1 (Y2 = Y1 - 2*X2). The choice of the transform
depends on the context in which the butterfly appears in the overall DAG.
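The algebra behind the two serializations is easy to verify; a minimal Python check with arbitrary sample values:

# Butterfly: Y1 = X1 + X2, Y2 = X1 - X2
X1, X2 = 9, 4
Y1, Y2 = X1 + X2, X1 - X2
assert Y1 == Y2 + (X2 << 1)   # serialization computing Y1 from Y2
assert Y2 == Y1 - (X2 << 1)   # serialization computing Y2 from Y1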
"1"
LAC XI
LAC XI
ADD X2 LAC XI X3;fYl
ADD X2 ADD X2
SAC TI XI~
:::t
X2--'" SAC YI
X2--'" SAC Y2 OR
SAC YI
SUB X4 LAC XI
LAC TI
--'" Y2 ADD X2
ADD X4 ADD X3
X4 ADD X4
SAC Y2 SAC YI X4 Y2
X3 YI SAC Y2
[Figure: DAG transformations (fanout reduction and merging) illustrated on DAGs mapping inputs X1-X4 to outputs Y1-Y4]
This section presents a technique for the synthesis of spill-free DAGs that are optimal
in terms of the number of cycles.
Procedure Synthesize-Spill-free-DAG
Input: Two dimensional matrix representing the multiplication-free linear
transform
Output: Spill-free DAG representation of the computation with the DAG having
minimum number of nodes
already-computed-output-list = {}
most-recently-computed-output = ∅
/* Construct initial graph and compute edge costs */
for (i = 0; i < no-of-outputs; i++) {
    edge[i,i].cost = number of non-zero entries in row 'i' + 1
}
repeat {
    Find the edge E(M,N) with the lowest cost.
    if (M == N) { /* self loop */
        Generate the DAG to compute output(N) in terms of only the inputs
    } else {
        Generate DAG to compute output(N) in terms of inputs and output(M)
    }
    /* Update the graph */
    Delete edge E(N,N)
    for each node (i ∈ already-computed-output-list) {
        Delete edge E(i,N)
    }
    already-computed-output-list += N
    for each node (i ∈ yet-to-be-computed-output-list) {
        E(most-recently-computed-output, i).cost++
    }
    most-recently-computed-output = N
    for each node (i ∈ yet-to-be-computed-output-list) {
        Add edge E(N,i)
        E(N,i).cost = number of mismatches between row N and row 'i'
                      of the transformation matrix
    }
} until (yet-to-be-computed-output-list == {})
Figure 6.23 shows each iteration of the algorithm applied to the 4x4 Walsh-
Hadamard transform matrix, and the resultant DAG. It can be noted that the
resultant DAG is spill-free and requires just 14 cycles to compute the transform.
Here are the results of applying these transformations to the 8x8 Walsh-Hadamard
transform, the 8x8 Haar transform and the 4x4 Slant transform.
[Figures: iterations of the spill-free DAG synthesis on the 4x4 Walsh-Hadamard transform (figure 6.23), and DAGs for the 8x8 Walsh-Hadamard transform mapping inputs X1-X8 to outputs Y1-Y8]
The 8x8 Haar transform is given by:

Y1   [ 1  1  1  1  1  1  1  1 ]   X1/√8
Y2   [ 1  1  1  1 -1 -1 -1 -1 ]   X2/√8
Y3   [ 1  1 -1 -1  0  0  0  0 ]   X3/2
Y4   [ 0  0  0  0  1  1 -1 -1 ]   X4/2
Y5   [ 1 -1  0  0  0  0  0  0 ]   X5/√2
Y6   [ 0  0  1 -1  0  0  0  0 ]   X6/√2
Y7   [ 0  0  0  0  1 -1  0  0 ]   X7/√2
Y8   [ 0  0  0  0  0  0  1 -1 ]   X8/√2
The direct computation of this transform requires 24 additions + subtractions
and the corresponding code executes in 40 cycles. The number of additions +
subtractions can be minimized to 14 using the common subexpression precom-
putation algorithm. The resultant DAG is shown in figure 6.26. The code
corresponding to this DAG requires 39 cycles. The DAG was optimized by
applying transformations to serialize all the butterflies. The resultant DAG is
also shown in figure 6.26. This DAG also has 14 nodes but the corresponding
code requires 30 cycles.
The spill-free DAG for the 8x8 Haar transform has 20 nodes and the corresponding
code requires 32 cycles.
[Figure 6.26: DAG for the 8x8 Haar transform after common subexpression precomputation, and the DAG after serializing the butterflies, both mapping inputs X1-X8 to outputs Y1-Y8]
The 4x4 Slant transform [27] can be transformed into a 4x8 multiplication-free
transform as shown below:

[Equations: the 4x4 Slant transform matrix, with entries 1, -1, 3 and -3 and row scale factors 1/2 and 1/(2√5), rewritten as a 4x8 matrix with entries 0, 1 and -1 operating on the scaled inputs X1, X2/√5, X3, X4/√5]
It can be noted that the left half of the 4x8 matrix is the same as the Walsh-Hadamard
transform. The direct computation of the 4x8 transform requires
16 additions + subtractions and the corresponding code executes in 24 cycles.
The number of additions + subtractions can be minimized to 12 using the common
subexpression precomputation algorithm. The code corresponding to the
resultant DAG requires 26 cycles.
Interestingly, the spill-free DAG can be synthesized directly from the 4x4
matrix with elements 1, -1, 3 and -3. The 4 outputs can be computed as
Y1 = X1 + X2 + X3 + X4, Y2 = Y1 + X1<<1 - X3<<1 - X4<<2,
Y3 = Y2 - X1<<1 - X2<<1 + X4<<2, Y4 = Y3 - X2<<1 + X3<<2 - X4<<1
The DAG for the above computation has 12 nodes and requires 17 cycles.
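A quick Python check that the chained, shift-based evaluation indeed realizes the 4x4 matrix with elements 1, -1, 3 and -3 (the sample inputs are arbitrary):

X1, X2, X3, X4 = 3, 1, 4, 2

Y1 = X1 + X2 + X3 + X4
Y2 = Y1 + (X1 << 1) - (X3 << 1) - (X4 << 2)
Y3 = Y2 - (X1 << 1) - (X2 << 1) + (X4 << 2)
Y4 = Y3 - (X2 << 1) + (X3 << 2) - (X4 << 1)

assert Y2 == 3*X1 + X2 - X3 - 3*X4     # row ( 3  1 -1 -3)
assert Y3 == X1 - X2 - X3 + X4         # row ( 1 -1 -1  1)
assert Y4 == X1 - 3*X2 + 3*X3 - X4     # row ( 1 -3  3 -1)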
The results presented so far are summarized in table 6.4.
Table 6.4. Number of Nodes (Ns) and Cycles (Cs) for Various DAG Transforms
The following subsections present a technique that reorders the nodes of the
DAG to achieve low power realization of multiplication-free linear transforms
on a single register, accumulator based architecture.
The problem of finding an optimum node order can be reduced to the problem
of finding the lowest cost Hamiltonian path in an edge-weighted graph, i.e. the
traveling salesman problem. Since the last instruction has to be SAC (store
the accumulator contents into the specified data memory location), the algorithm
uses the corresponding node as the starting point and works backwards to get
the lowest cost Hamiltonian path. It also comprehends the constraint that the
first instruction has to be LAC (load the accumulator with the contents of
the specified data memory location) and that only those variables that are to be
multiplied by a weight of +1 can be used for the LAC instruction.
For a two dimensional transform, the spill-free DAG structure is used as
the starting point and is partitioned into sub-DAGs bounded by primary output
computations (i.e. the SAC instructions). For example, the spill-free DAG
of the Walsh-Hadamard transform shown in figure 6.23 is partitioned into four
sub-DAGs bounded by the SAC Y1, SAC Y3, SAC Y2 and SAC Y4 instructions.
The nodes within each of the sub-DAGs can then be reordered without affecting
the overall functionality, thus resulting in code with reduced power dissipation.
Table 6.5 gives the Hamming distance based measure (described in section 6.2.7)
for six multiplication-free linear transforms. For the 3x3 window
transforms, it is assumed that the variables X1 to X9 are stored at locations
0x00 to 0x08 and the output Y is stored at location 0x0F. For the 4x4 Haar and
Walsh transforms, it is assumed that the inputs X1 to X4 are stored at locations
0x00 to 0x03 and the outputs Y1 to Y4 are stored at locations 0x08 to 0x0B.
It is also assumed that the opcodes for LAC, ADD, SUB and SAC are 1-hot
encoded and hence have a Hamming distance of two between any two of them.
As can be seen from the results, the proposed node reordering technique results in
significant power reduction.
Chapter 7
RESIDUE NUMBER SYSTEM BASED
IMPLEMENTATION

M = M1 · M2 · ... · Mn    (7.2)

where M is the dynamic range of the given moduli set. So, the moduli
set is determined based on the bit-precision needed for the computation. For
example, for 19-bit precision the moduli set {5, 7, 9, 11, 13, 16} can be used [87].
Let X, Y and Z have the residue representations X = (X1, X2, X3, ..., Xn),
Y = (Y1, Y2, Y3, ..., Yn) and Z = (Z1, Z2, Z3, ..., Zn) respectively, and let Z =
(X Op Y), where Op is any operation among addition, multiplication and subtraction.
Thus we have in RNS:

Zi = (Xi Op Yi) mod Mi,  for i = 1, 2, ..., n
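A minimal Python sketch of this channel-wise arithmetic, using the 19-bit moduli set mentioned above (the helper names are illustrative):

from math import prod

MODULI = (5, 7, 9, 11, 13, 16)   # pairwise coprime; dynamic range M = prod(MODULI)

def to_rns(x):
    # residue representation (X1, ..., Xn)
    return tuple(x % m for m in MODULI)

def rns_op(xr, yr, op):
    # each Zi = (Xi op Yi) mod Mi, computed independently in low precision
    return tuple(op(a, b) % m for a, b, m in zip(xr, yr, MODULI))

x, y = 1234, 567
zr = rns_op(to_rns(x), to_rns(y), lambda a, b: a * b)
assert zr == to_rns((x * y) % prod(MODULI))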
Since the Xi's and Yi's require less precision than X and Y, the computation
of the Zi's can be performed faster than the computation of Z. Moreover, since the
[Figure: RNS based implementation of an FIR filter: a binary-to-RNS converter generates the residues of the coefficients A[0]..A[N-1] and of the data for each modulus M1, M2, ..., MK; K modulo-Mi MAC units operate in parallel and an RNS-to-binary converter produces the output Y]
[Figure: structure of a modulo-M MAC unit built from a modulo-M multiplier and a modulo-M adder with an accumulator register clocked by Clk, together with the truth table of a modulo-3 multiplier over 2-bit encoded operands]
[Figure: RNS based FIR filter for a single modulus M1, with a data address generator reading the X[i] mod M1 values, parallel modulo-M1 MAC units and an RNS-to-binary converter producing Y]
Figure 7.4. RNS Based Implementation of FIR Filters with Parallel Processing Transformation
where HD[i,j] is the Hamming distance between the coding of residues 'i'
and 'j', and W[i, j] is the edge weight. It can be noted that the cost function
CF is similar to that used for FSM state encoding. The stochastic evolution
based optimization strategy described in [47] can thus be used to perform the
coefficient residue encoding.
[Figure: two look-up-table based modulo-M multiplier schemes (Scheme I and Scheme II) with input operand swapping]
For example, consider the look-up-table for modulo 3 multiplication and an
input swapping scheme in which the inputs to the look-up-table are swapped
if the LSB of the first input is 1. It can be noted that with such a scheme, the
input pairs 01 00 and 01 10 will never be fed to the LUT. Thus the number of
entries of the look-up-table can be reduced from 9 to 7. While this reduction is
less than the reduction from 9 to 6 entries using the scheme mentioned earlier,
overall it may be more area efficient and faster (no comparator delay).
Table 7.1. Area estimates for PLA based modulo adder implementation
The number of columns in the PLA is equal to 2*(no. of input bits) + no. of output bits. The number of rows
in the PLA is obtained from the residue-encoded truth table minimized using
Espresso [62].
It can be noted that the redundancy elimination technique can be used in
conjunction with residue encoding. Tables 7.1, 7.2 and 7.3 show the area reduction
for such combinations of transformations as well. Here are the cases for which
the results are presented:
As can be seen from the results, the PLA area can be reduced by as much as
45% (corresponding to modulo-13 addition using Xform3+1). The results also
show that no one combination of techniques results in the most area reduction
across all moduli. While the (Xform3+1) combination gives maximum area reduction
in most cases, it has an associated delay overhead. For modulo-7 addition,
the (Xform3+2) combination gives minimum area and it comes with minimal
delay overhead.
Table 7.2. Area estimates for PLA based modulo multiplier implementation
Table 7.3. Area estimates for PLA based modulo MAC implementation
In the case of MAC computation, the results show that the PLA area can be
reduced by as much as 66%. In this case Xform3+1 gives the minimum area
for all the moduli (i.e. 5, 7, 9, 11 and 13).
Y[n] = (K · Σ_{i=0}^{N-1} A[i] · X[n-i]) · (1/K) = (Σ_{i=0}^{N-1} (K · A[i]) · X[n-i]) · (1/K)    (7.5)
Thus the FIR filter output can be calculated using coefficients scaled by a
factor K and the weighted-sum result scaled by 1/K.
Since the residue values of the scaled coefficients are different from the
residue values of the original coefficients, scaling can be used as a transformation
for optimizing the coefficient residues.
[Figure: X[i] feeds a bank of modulo-M multipliers (*2, *3, ..., *(M-1)); a connection array selects their outputs into a chain of modulo-M adders and delay elements producing Y[i]]
Figure 7.6. Modulo MAC structure for Transposed Form FIR Filter
It can be noted that the reduction in the number of unique residues across
the moduli set results in an area efficient implementation of the transposed FIR
filter structure shown in Figure 7.6 because of the reduction in the number of
modulo multipliers needed in the structure.
The impact of this transformation can be appreciated from the results for
6 low pass filters shown below. These filters vary in terms of desired filter
characteristics and consequently in the number of coefficients. These filters
have been synthesized using the Parks-McClellan algorithm [73] for the minimum
number of taps. The optimization has been performed using the first improvement
and the steepest descent strategies. The two strategies differ in how moves are
selected: in the steepest descent approach, the move which gives the maximum
gain is selected in each iteration of the optimization, while in the first improvement
approach, the first move which gives a gain is selected in each
iteration.
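A skeleton of the two strategies, with the moves and the gain function left abstract (here they stand for single-coefficient perturbations and the resulting reduction in unique residues); this is an illustrative sketch, not the authors' implementation:

def local_search(solution, gen_moves, gain, apply_move, strategy="steepest"):
    # 'first': accept the first positive-gain move found in each iteration;
    # 'steepest': scan all moves and accept the one with maximum gain.
    while True:
        best_move, best_gain = None, 0
        for move in gen_moves(solution):
            g = gain(solution, move)
            if g > best_gain:
                best_move, best_gain = move, g
                if strategy == "first":
                    break
        if best_move is None:
            return solution
        solution = apply_move(solution, best_move)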
The coefficient values quantized to 16-bit 2's complement fixed point representation
form the initial set of coefficients for optimization. The coefficient optimization
algorithm has been applied across the moduli set {5, 7, 9, 11, 13, 17} for
PLA based implementations of the modulo multiplier and modulo MAC. The area
improvements and total number of unique residues for different optimization
strategies for the modulo multiplier and modulo MAC modules, relative to the
conventional implementation, are shown in table 7.5. The metric chosen for PLA
area is the product of the number of rows and columns in the PLA. The number
of unique residues per modulus for each filter and strategy is listed below.
Filter                            5    7    9   11   13   17   Total
LP1 - lp_16L3KA.5L2A2_24
  Conventional                    5    6    6    9    7    7    40
  Steepest                        4    4    4    6    5    5    28
  1st Impr.                       4    6    3    6    5    3    27
LP2 - lp_12L2L3K.12A5_28
  Conventional                    4    7    9    8   10    9    47
  Steepest                        4    5    5    5    6    3    28
  1st Impr.                       5    5    4    5    3    5    27
LP3 - lp_IOL2K_3K_0.05AO_29
  Conventional                    5    6    8    8   10   10    47
  Steepest                        5    5    5    4    4    4    27
  1st Impr.                       4    6    5    3    7    4    29
LP4 - lp_12K_2.2K_3.1K_.16A9_34
  Conventional                    5    7    5    9    8   11    45
  Steepest                        3    6    6    6    6    5    32
  1st Impr.                       4    5    7    4    7    4    31
LP5 - lp_IOK_1.8K_2.5L.15_60AI
  Conventional                    5    7    9    9    9   14    53
  Steepest                        5    7    8    8    8    9    45
  1st Impr.                       5    7    7    8    6    9    42
LP6 - lp_IOK_I.8L2.5L.03_70_55
  Conventional                    5    7    9   11   10   14    56
  Steepest                        5    7    9   10    9   11    51
  1st Impr.                       5    7    8    9    8   13    50
Table 7.5. Impact of Coefficient Optimization on the Area of Modulo Multiplier and Modulo
MAC
1. The moduli set should be pairwise relatively prime to enable high dynamic
range.
2. The moduli set has to be selected such that residue computations are easy
(e.g. 2^n, 2^n - 1) [92].
4. The moduli should have simple multiplicative inverses. This ensures conversion
from the residue to the binary domain with fewer computations.
Table 7.6. RNS based FIR filter with 24-bit precision on C5x
Table 7.7. Number of Operations for RNS based FIR filter with 24-bit precision on C5x
As can be seen from table 7.6, the program and data memory requirements of
the RNS based implementation (RNS-FIR) are much higher than those of the
implementation in the binary domain. In terms of execution time, however,
the RNS based implementation requires fewer cycles for filters with more than
15 taps. As can be seen from table 7.7, the RNS based implementation requires
fewer loads and stores and performs fewer additions than the binary
domain implementation for filters with more than 8 taps. Thus the RNS based
implementation is also power efficient for higher filter orders.
It can be noted that the transformation based on the multirate architectures
presented in chapter 2 (section 2.4.2) can be applied in conjunction with the
RNS based implementation [44] to further improve performance and reduce
power dissipation.
Chapter 8
SUMMARY
3 Data Coding
The area, delay and power parameters of most implementation styles are impacted
by the bit pattern of the processed data. Data coding techniques use
different number representation schemes so as to appropriately alter the bit
pattern and hence impact the area, delay and power parameters. Data coding
has the associated overhead of encoding and decoding. The transformations
of this type include:
and also increase the control complexity. The transformations of this type
include:
• Linear Phase FIR Filters, which exploit the coefficient symmetry and
use the distributivity property to reduce by half the number of multiplications
(section 3.4.1.2).
• Coefficient Scaling, discussed in sections 2.4.4 and 7.2.1.
• Selective Coefficient Negation, discussed in section 2.3.1.
• Coefficient Ordering, discussed in sections 2.3.2 and 7.1.3.
• Selective Bit Swapping of ALU Inputs, discussed in section 2.3.3.
• Computing the output Y[n] of an FIR Filter in terms of Y[n-1], discussed in
section 5.3.2.
• DAG Node Reordering for Low Power, discussed in section 6.1.3.
• DAG Transformation - Tree to Chain Conversion, discussed in section 6.2.5.1.
• DAG Transformation - Serializing a Butterfly, discussed in section 6.2.5.2.
• DAG Transformation - Fanout Reduction, discussed in section 6.2.5.3.
• DAG Transformation - Merging, discussed in section 6.2.5.4.
6 Exploiting the Relationship between the Real Value Domain and the Binary
Domain
While the frequency response of an FIR filter depends on the real values of
the coefficients, the area-delay-power parameters are affected by the properties
of the binary representation of the coefficients. A small change in the
value of a coefficient has minimal impact on the filter characteristics but
can impact the binary representation significantly. For example, the numbers
31 and 32 have a small difference in the real value domain, but in the binary
domain 31 has five 1s while 32 has just one 1. This relationship can be exploited
to suitably alter the filter coefficients while still meeting the desired
filter characteristics in terms of passband ripple and stopband attenuation.
The transformations of this type include:
[Figure: transformation selection flowchart. Starting from the 'baseline' mapping of the algorithm on the target implementation style:
• Is the algorithm repetitive in nature?
  - Can the supply voltage be scaled, and is an increase in the area acceptable? -> Use parallel processing
  - Is an increase in the area and latency acceptable? -> Use pipelining
  - Is an increase in the control complexity acceptable? -> Use loop unrolling
• Is there any redundancy in the computation that can be exploited, and is the potential loss in regularity and the corresponding increase in control complexity acceptable? -> Use common subexpression precomputation or other transforms to reduce computational complexity
• Are there busses (e.g. the address bus) that see sequential data access, and is the encoder-decoder overhead acceptable? -> Use Gray coding or T0 coding
• Are there busses that see values that are data dependent, and is the encoder-decoder overhead acceptable? -> Use bus-invert coding
• Are representative traces of data values available for the busses, and is there layout-level flexibility in routing a bus? -> Use bus bit-reordering during routing
• Does the implementation have an ALU with large capacitive input busses, and is the area overhead of a mux and an xor gate per input bit acceptable? -> Use selective bit-swapping for the ALU inputs
The result is an implementation with the desired area-power tradeoff.]
beyond which the peak power dissipation starts increasing. The DFG transformations
mentioned above do not impact the computational complexity; they
achieve power reduction at the expense of increased area. In the context of FIR
filters, multirate architectures were presented as structures which reduce computational
complexity and thus achieve power reduction with minimal datapath
area overhead.
Multiplier-less Implementation
The problem of minimizing the number of additions in the multiplier-less
implementations of 1-D and 2-D linear transforms was presented. A common
subexpression precomputation technique that can be applied to both weighted-
sum computation and the multiple constant multiplication (MCM) computation
was presented. In the context of FIR filters, coefficient transformations were
identified which, in conjunction with the common subexpression precompu-
tation technique, realize area-efficient multiplier-less FIR filters. Since the
resultant data flow graphs have operators and variables with varying bit preci-
sion, an approach to precision-sensitive high level synthesis was discussed.
[1] J. W. Adams and A. N. Willson, "Some Efficient Digital Prefilter Structures", IEEE
Transactions on Circuits and Systems, May 1984, pp. 260-265
[2] M. Agarwala and P. T. Balsara, "An Architecture for a DSP Field Programmable
Gate Array", IEEE Transactions on VLSI Systems, March 1995, pp. 136-141
[3] Vikas Agrawal, Anand Pande, Mahesh Mehendale, "High Level Synthesis of
Multi-precision Data Flow Graphs", 14th International Conference on VLSI Design,
January 2001, pp. 411-416
[4] A. Aho, R. Sethi and J. Ullman, Compilers: Principles, Techniques and Tools,
Addison-Wesley, 1986
[5] G. Araujo, S. Malik, M. T-C. Lee, "Using Register-Transfer Paths in Code Generation
for Heterogeneous Memory-Register Architectures", ACM/IEEE 33rd Design
Automation Conference, 1996, pp. 591-596
[9] N. Binh, M. Imai, A. Shiomi and N. Hikichi, "A Hardware/Software Partitioning
Algorithm for Designing Pipelined ASIPs with Least Gate Counts", 33rd
ACM/IEEE Design Automation Conference, DAC-1996, pp. 527-532
[10] Ivo Bolsens, Hugo J. De Man, Bill Lin, Karl Van Rompaey, Steven Vercauteren and
Diederik Verkest, "Hardware/Software Co-Design of Digital Telecommunication
Systems", Proceedings of the IEEE, March 1997, pp. 391-418
[16] Jui-Ming Chang and Massoud Pedram, "Energy Minimization Using Multiple
Supply Voltages", IEEE Transactions on VLSI Systems, December 1997, pp. 436-443
[17] Pallab Chatterjee and Graydon Larrabee, "Gigabit Age Microelectronics and Their
Manufacture", IEEE Transactions on VLSI Systems, March 1993, pp. 7-21
[18] N. Deo, Graph Theory with Applications to Engineering and Computer Science,
Prentice Hall India, 1989
[20] W.E. Dougherty, D.J. Pursley, D.E. Thomas, "Instruction subsetting: Trading
power for programmability", IEEE Computer Society Workshop on VLSI'98,
1998, pp. 42-47
[22] Daniel Gajski, Nikil Dutt, A. C-H. Wu, S. Y-L. Lin, High-Level Synthesis: Introduction
to Chip and System Design, Kluwer Academic Publishers, 1992
[26] K. Illgner, H-G. Gruber, P. Gelabert, J. Liang, Y. Yoo, W. Rabadi and Raj Talluri,
"Programmable DSP Platform for Digital Still Cameras", International Conference
on Acoustics, Speech and Signal Processing, ICASSP-1999, pp. 2235-2238
[27] Anil K. Jain, Fundamentals of Digital Image Processing, Prentice Hall Inc., 1989
[28] R. Jain, et al., "Efficient CAD Tools for Coefficient Optimization of Arbitrary
Integrated Digital Filters", IEEE International Conference on Acoustics, Speech
and Signal Processing, 1984
[29] W. K. Jenkins and B. Leon, "The Use of Residue Number System in the Design
of Finite Impulse Response Filters", IEEE Transactions on Circuits and Systems,
April 1977, pp. 191-201
[31] I. Karkowski and R.H.J.M. Otten, "An Automatic Hardware-Software Partitioner
Based on the Possibilistic Programming", European Design and Test Conference,
1996, pp. 467-472
[32] D. Kodek and K. Steiglitz, "Comparison of Optimal and Local Search Methods for
Designing Finite Wordlength FIR Digital Filters", IEEE Transactions on Circuits
and Systems, January 1981, pp. 28-32
[33] E.L. Lawler, J.K. Lenstra, A.H.G. Rinnooy Kan, and D.B. Shmoys (eds.), The
Travelling Salesman Problem, John Wiley & Sons Ltd, 1985
[34] Edward A. Lee, "Programmable DSP Architectures: Part I", IEEE ASSP Magazine,
October 1988, pp. 4-19
[35] Edward A. Lee, "Programmable DSP Architectures: Part II", IEEE ASSP Magazine,
January 1989, pp. 4-14
[36] M. T.-C. Lee, V. Tiwari, S. Malik and M. Fujita, "Power Analysis and Minimization
Techniques for Embedded DSP Software", IEEE Transactions on VLSI Systems,
March 1997, pp. 123-135
[37] H. Lekatsas, J. Henkel, W. Wolf, "Code Compression for Low Power Embedded
System Design", ACM/IEEE Design Automation Conference, DAC-2000, pp.
294-299
[38] Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, "Instruction Selection
Using Binate Covering for Code Size Optimization", IEEE International Conference
on CAD, ICCAD-1995, pp. 393-399
[39] Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, Albert Wang, "Storage
Assignment to Decrease Code Size", ACM Conference on Programming Language
Design and Implementation, 1995
[40] Kun-Shan Lin (ed.), Digital Signal Processing Applications with the TMS320
Family - Theory, Algorithms and Implementations - Vol I, Texas Instruments, 1989
[41] E. Lueder, "Generation of Equivalent Block Parallel Digital Filters and Algorithms
by a Linear Transformation", IEEE International Symposium on Circuits
and Systems, 1993, pp. 495-498
[42] Gin-Kou Ma and Fred J. Taylor, "Multiplier Policies For Digital Signal Processing",
IEEE ASSP Magazine, January 1990, pp. 6-19
[43] M. N. Mahesh, Satrajit Gupta and Mahesh Mehendale, "Improving Area Efficiency
of Residue Number System based Implementation of DSP Algorithms",
12th International Conference on VLSI Design, January 1999, pp. 340-345
[47] Mahesh Mehendale, B. Mitra, "An Integrated Approach to State Assignment and
Sequential Element Selection for FSM Synthesis", 7th International Conference
on VLSI Design, 1994, pp. 369-372
[57] Mahesh Mehendale, Amit Sinha, S. D. Sherlekar, "Low Power Realization of FIR
Filters Implemented Using Distributed Arithmetic", Asia and South Pacific Design
Automation Conference, ASP-DAC'98, pp. 151-156
[61] Huzefa Mehta, R. M. Owens, M. J. Irwin, "Some Issues in Gray Code Addressing",
GLS-VLSI'96, 6th Great Lakes Symposium on VLSI, 1996, pp. 178-181
[62] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994
[63] Biswadip Mitra, Shantanu Jha and P. Pal Chaudhuri, "A Simulated Annealing
Based State Assignment Approach for Control Synthesis", 4th CSI/IEEE International
Symposium on VLSI Design, 1991, pp. 45-50
[64] Biswadip Mitra, P. R. Panda and P. Pal Chaudhuri, "Estimating the Complexity of
Synthesized Designs from FSM Specifications", 5th International Conference on
VLSI Design, 1992, pp. 175-180
[66] Z.J. Mou and P. Duhamel, "Short-Length FIR Filters and Their Use in Fast Nonrecursive
Filtering", IEEE Transactions on Signal Processing, June 1991, pp. 1322-1332
[68] Farid Najm, "Transition Density: A New Measure of Activity in Digital Circuits",
IEEE Transactions on CAD, Feb 1993, pp. 310-323
[70] B. New, "A Distributed Arithmetic Approach to Designing Scalable DSP Chips",
Electronic Design News, August 17, 1995
[72] W. J. Oh and Y. H. Lee, "Cascade/Parallel Form FIR Filters with Powers-of-Two
Coefficients", IEEE International Symposium on Circuits and Systems, ISCAS-1994,
Vol. II, pp. 545-548
[73] A.V. Oppenheim and R.W. Schafer, Discrete Time Signal Processing, Prentice
Hall, 1989
[74] Preeti Panda and Nikil Dutt, "Reducing Address Bus Transitions for Low Power
Memory Mapping", European Design and Test Conference, 1996, pp. 63-68
[75] Anand Pande, Sunil Kashide, Hardware Software Codesign of DSP Algorithms,
ME Thesis, Centre for Electronics Design and Technology, Indian Institute of
Science, Bangalore, India, January 2000
[76] K.K. Parhi, "Algorithms and Architectures for High-Speed and Low-Power Digital
Signal Processing", 4th International Conference on Advances in Communications
and Control, 1993, pp. 259-270
[77] D.N. Pearson and K.K. Parhi, "Low-Power FIR Digital Filter Architectures", IEEE
International Symposium on Circuits and Systems, Vol I, 1995, pp. 231-234
[80] Wu Qing, M. Pedram, Wu Xunwei, "Clock-gating and its application to low power
design of sequential circuits", IEEE Transactions on Circuits and Systems I: Fundamental
Theory and Applications, Vol. 47, No. 3, March 2000, pp. 415-420
[81] Anand Raghunathan and Niraj Jha, "Behavioral Synthesis for Low Power", Proceedings
of International Conference on Computer Design, ICCD-1994, pp. 318-322
[82] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages and
Applications, Academic Press, 1990
[83] H. Samueli, "An Improved Search Algorithm for the Design of Multiplierless
FIR Filters with Powers-of-Two Coefficients", IEEE Transactions on Circuits and
Systems, July 1989, pp. 1044-1047
[84] N. Sankarayya and K. Roy, "Algorithms for Low Power FIR Filter Realization
Using Differential Coefficients", International Conference on VLSI Design, 1997,
pp. 174-178
[85] H. Schroder, "High Word-Rate Digital Filters with Programmable Table Look-Up",
IEEE Transactions on Circuits and Systems, May 1977, pp. 277-279
[86] Amit Sinha, Mahesh Mehendale, "Improving Area Efficiency of FIR Filters Implemented
Using Distributed Arithmetic", International Conference on VLSI Design,
VLSI Design'98, pp. 104-109
[88] M. R. Stan and W. P. Burleson, "Bus Invert Coding for Low Power I/O", IEEE
Transactions on VLSI Systems, March 1995, pp. 49-58
[89] Ching-Long Su, Chi-Ying Tsui and Alvin M. Despain, "Saving Power in the Control
Path of Embedded Processors", IEEE Design and Test of Computers, Winter
1994, pp. 24-30
[90] Ashok Sudarsanam, Sharad Malik, "Memory Bank and Register Allocation in
Software Synthesis for ASIPs", IEEE International Conference on CAD, ICCAD-1995,
pp. 388-392
[91] Earl Swartzlander Jr., VLSI Signal Processing Systems, Kluwer Academic Publishers,
1985
[92] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer
Technology, McGraw-Hill, 1967
[93] N. Tan, S. Eriksson and L. Wanhammar, "A Power-Saving Technique for Bit-Serial
DSP ASICs", ISCAS 94, Vol. IV, pp. 51-54
[100] TSC4000ULV 0.35µm CMOS Standard Cell, Macro Library Summary, Application
Specific Integrated Circuits, Texas Instruments, 1996
[101] Y. Tsividis and P. Antognetti, editors, Design of MOS VLSI Circuits for Telecommunications,
Prentice-Hall, 1985
[102] C.-Y. Wang and K. Roy, "Control Unit Synthesis Targeting Low-Power Processors",
IEEE International Conference on Computer Design, October 1995, pp.
454-459
[103] Ching-Yi Wang and K. K. Parhi, "High-Level DSP Synthesis Using Concurrent
Transformations, Scheduling and Allocation", IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, March 1995, pp. 274-295
[105] Q. Zhao and Y. Tadokoro, "A Simple Design of FIR Filters with Powers-of-Two
Coefficients", IEEE Transactions on Circuits and Systems, May 1988, pp. 566-570
Index

Accumulator based Architecture, 153
Adder Input Bit Swapping, 31
Array Multiplier, 14
Binary to Nega-Binary Conversion, 101
Bitwise Commutativity of ADD Operation, 31
Block FIR Filters, 56
Bus Bit Reordering, 24
Bus Coding, 17
Bus Invert Coding, 20
Characteristics of DSP Algorithms, 11
Circular Buffer, 36
Classification of Transformations, 187
Code Generation of 1-D Transform, 144
Coefficient Optimization, 43, 137, 180
Coefficient Ordering, 28, 175
Coefficient Partitioning, 86
Coefficient Scaling, 42, 179
Color Space Conversion, 2, 113
Common Subexpression Elimination, 116, 122, 147
Consecutive-Variables Graph, 150
CSD Representation, 129
DAG Optimizing Transformations, 159
DAG Transform - Fanout Reduction, 160
DAG Transform - Merging, 161
DAG Transform - Serializing a Butterfly, 159
DAG Transform - Tree to Chain Conversion, 159
DCT, 2, 113, 172, 189
Decoded Instruction Buffer, 21
DFG Transformations, 56
Digital Still Camera, 1
Distributed Arithmetic, 76
DSC Image Pipeline, 1
DSP Architecture, 11
Dynamic/Switching Power in CMOS, 12
Energy vs Peak Power Tradeoff, 61
FFT, 44, 172, 189
FIR Filter, 35
Gaussian Data Distribution, 104
Generic Techniques for Low Power, 26
Graph Coloring Problem, 151
Gray Coded Addressing, 17, 108
Haar Transform, 147, 165
Hardware-Software Partitioning, 6
High Level Synthesis of Multiprecision DFGs, 138
High Precision Signal Processing, 183
Instruction Buffering, 21
Instruction Scheduling, 148
Linear Phase FIR Filters, 65
Loop Unrolling, 59
Low Power Code Generation, 148, 168
LUT Redundancy Elimination, 176
MAC (Multiply-Accumulate) Instruction, 12
Memory Architectures for Low Power, 22
Memory Partitioning for Low Power, 23
Memory Prefetch Buffer, 23
Modulo MAC, 173
Multiple Constant Multiplication (MCM), 113, 120
Multiplication-free Linear Transform, 141
Multirate Architectures, 37, 63, 81
Nega-binary Coding, 95
Optimization using 0-1 Programming, 50
Parallel Processing, 59, 174
Pixel Window Transform, 144
Power Analysis of Multirate Architectures, 68
Power Dissipation due to Cross Coupling, 13
Power Dissipation in a Bus, 13
Power Dissipation in a Multiplier, 13
Power Dissipation in CMOS, 12
Pre-Filter Structures, 138
Precision Sensitive Binding, 139
Precision Sensitive Register Allocation, 138
Precision Sensitive Scheduling, 140
Prewitt Window Transform, 144
Register Assignment, 149
Register-Conflict Graph, 150
Register-rich Architecture, 143
Residue Encoding, 174, 177
Residue Number System, 171
Retiming, 59
RNS Moduli Selection, 183
Selective Coefficient Negation, 27
Shiftless DA Implementation, 107
Signal Flow Graph Transformations, 130
Slant Transform, 167
SoC Design Methodology, 4
Sobel Window Transform, 152
Solution Space for DSP Implementation, 6
Spatial High Pass Filter, 152
Spatial Low Pass Filter, 152
Spill-free DAG, 162
Stochastic Evolution, 151
T0 coding, 18
TMS320C2x/C5x, 39, 153
TMS320C54x - FIRS Instruction, 34
Transformation Framework, 51, 191
Transition Density, 14
Transposed FIR Filter, 41, 180
Traveling Salesman Problem, 29
Truth Table Area Estimation, 88
Two Dimensional Linear Transform, 113
Uni-sign Representation, 129
Walsh-Hadamard Transform, 158, 164
About the Authors