VLSI Synthesis of DSP Kernels

VLSI SYNTHESIS OF DSP KERNELS
Algorithmic and Architectural Transformations

VLSI SYNTHESIS
OF DSP KERNELS
Algorithmic and Architectural Transformations
by
MANESH MEHENDALE
Texas Instruments (India), Ltd.
and
SUNILD. SHERLEKAR
Silicon Automation Systems Ltd.
Springer Science+Business Media, LLC

A C.I.P. Catalogue record for this book is available from the Library ofCongress.
ISBN 978-1-4419-4904-2 ISBN 978-1-4757-3355-6 (eBook)

DOI 10.lO07/978-1-4757-3355-6
Printed on acid-free paper
All Rights Reserved

© 200 1 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers, Boston in 200l.
Softcover reprint ofthe hardcover 1st edition 2001
No part of the material protected by this copyright notice may be reproduced or
utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage and
retrieval system, without written permission from the copyright owner.
Contents
List of Figures xi
List of Tables xv
Foreword xvii
Acknow ledgments xix
Preface XXI
1. INTRODUCTION
1.1 An Example
1.2 The Design Process: Constraints and Alternatives 3
1.3 Organization of the Book 7
1.4 For the Reader 9
2. PROGRAMMABLE DSP BASED IMPLEMENTATION 11
2.1 Power Dissipation - Sources and Measures 13
2.1.1 Components Contributing to Power Dissipation 13
2.1.2 Measures of Power Dissipation in Busses 13
2.1.3 Measures of Power Dissipation in the Multiplier 13
2.2 Low Power Realization of DSP Algorithms 16
2.2.1 Allocation of Program, Coefficient and Data Memory 16
2.2.2 Bus Coding 17
2.2.2.1 Gray Coded Addressing 17
2.2.2.2 TO coding 18
2.2.2.3 Bus Invert Coding 20
2.2.3 Instruction Buffering 21
2.2.4 Memory Architectures for Low Power 22
2.2.5 Bus Bit Reordering 24
2.2.6 Generic Techniques for Power Reduction 26
2.3 Low Power Realization of Weighted-sum Computation 26
2.3.1 Selective Coefficient Negation 27
2.3.2 Coefficient Ordering 28
2.3.2.1 Coefficient Ordering Problem Formulation 29
2.3.2.2 Coefficient Ordering Algorithm 30
2.3.3 Adder Input Bit Swapping 31
2.3.4 Swapping Multiplier Inputs 33
2.3.5 Exploiting Coefficient Symmetry 34
v
VI VLSI SYNTHESIS OF DSP KERNELS
2.4 Techniques for Low Power Realization of FIR Filters 35

2.4.1 Circular Buffer 36
2.4.2 Multirate Architectures 37
2.4.2.1 Computational Complexity of Multirate Architectures 37
2.4.2.2 Multirate Architecture on a Programmable DSP 38
2.4.3 Architecture to Support Transposed FIR Structure 41
2.4.4 Coefficient Scaling 42
2.4.5 Coefficient Optimization 43
2.4.5.1 Coefficient Optimization - Problem Definition 43
2.4.5.2 Coefficient Optimization - Problem Formulation 43
2.4.5.3 Coefficient Optimization Aigorithm - Components 44
2.4.5.4 Coefficient Optimization Aigorithm 45
2.4.5.5 Coefficient Optimization Using 0-1 Programming 50
2.5 Framework for Low Power Realization of FIR Filters on a
Programmable DSP 51
3. IMPLEMENTATION USING HARDWARE MULTIPLIER(S)
AND ADDER(S) 55
3.1 Architectural Transformations 55
3.2 Evaluating the Effectiveness of DFG Transformations 56
3.3 Low Energy vs Low Peak Power Tradeoff 61
3.4 Multirate Architectures 63
3.4.1 Computational Complexity of Multirate Architectures 64
3.4.1.1 Non-linear Phase FIR Filters 64
3.4.1.2 Linear Phase FIR Filters 65
3.5 Power Analysis of Multirate Architectures 68
3.5.1 Power Analysis for One Level Decimated Multirate
Architectures 68
3.5.1.1 Power Analysis - an Example 70
3.5.1.2 Power Reduction Using Multirate Architectures 71
4. DISTRIBUTED ARITHMETIC BASED IMPLEMENTATION 75

4.1 DA Structures for Area-Delay Tradeoff 76
4.1.1 DA Based Implementation of Linear Phase FIR Filters 77
4.1.2 I-Bit-At-A-Time vs 2-Bits-At-A-Time Access 78
4.1.3 Multiple Coefficient Memory Banks 79
4.1.4 Multiple Memory Bank Implementation with 2BAAT
Access 80
4.1.5 DA Based Implementation of Multirate Architectures 81
4.1.6 Multirate Architecture with a Decimation Factor ofThree 82
4.1.7 Multirate Architectures with Two Level Decimation 84
4.1.8 Coefficient Memory vs Number of Additions Tradeoff 84
4.2 Improving Area Efficiency of Two LUT Based DA Structures 85
4.2.1 Minimum Area Partitions for Two ROM Implementation 87
4.2.2 Minimum Area Partitions for Hardwired Logic 88
Contents Vll
4.2.2.1 CF2: Estimating Area from the Actual Truth-Table 89

4.2.2.2 CF1: Estimating Area from the Coefficients in Each
Partition 91
4.2.3 Evaluating the Effectiveness ofthe Coefficient Partitioning
Technique 92
4.3 Techniques for Low Power Implementation of DA Based FIR
Filters 94
4.3.1 Toggle Reduction Using Data Coding 95
4.3.1.1 Nega-binary Coding 95
4.3.1.2 2's Complement vs Nega-binary Representation 96
4.3.1.3 Deriving an Optimum Nega-binary Scheme for a Given
Data Distribution 99
4.3.1.4 Incorporating a Nega-binary Scheme into the DA Based
FIR Filter Implementation 101
4.3.1.5 A Few Observations 103
4.3.1.6 Additional Power Saving with Nega-binary Architecture 104
4.3.2 Toggle Reduction in Memory Based Implementations
by Gray Sequencing and Sequence Reordering 107
5. MULTIPLIER-LESS IMPLEMENTATION 113
5.1 Minimizing Additions in the Weighted-sum Computation 114
5.1.1 Minimizing Additions - an Example 114
5.1.2 2 Bit Common Subexpressions 116
5.1.3 Problem Formulation 116
5.1.4 Common Subexpression Elimination 118
5.1.5 The Algorithm 119
5.2 Minimizing additions in MCM Computation 120
5.2.1 Minimizing Additions - an Example 120
5.2.2 2 Bit Common Subexpressions 122
5.2.3 Problem Formulation 123
5.2.4 Common Subexpression Elimination 124
5.2.5 The Algorithm ] 24
5.2.6 An UpperBoundon theNumberof Additions forMCM
Computation 126
5.3 Transformations for Minimizing Number of Additions 128
5.3.] Number Theoretic Transforms 128
5.3.1.] 2's Complement Representation 128
5.3.1.2 Uni-sign Representation 129
5.3.1.3 Canonical Signed Digit (CSD) Representation 129
5.3.2 Signal Flow Graph Transformations 130
5.3.3 Evaluating Effectiveness of the Transformations 133
5.3.4 Transformations for Optimal Initial Solution 137
5.3.4.1 Coefficient Optimization ] 37
5.3.4.2 Efficient Pre-Filter Structures 138
5.4 High Level Synthesis of Multiprecision DFGs 138
viii VLSI SYNTHESlS OF DSP KERNELS
5.4.1 Precision Sensitive Register Allocation 138

5.4.2 Precision Sensitive Functional Unit Binding 139
5.4.3 Precision Sensitive Scheduling 140
6. IMPLEMENTATION OFMULTIPLICATION-FREE LINEAR
TRANSFORMS 141
6.1 Optimum Code Generation for Register-rich Architectures 142
6.1.1 Generic Register-rich Architecture Model 142
6.1.2 Sources and Measures of Power Dissipation 143
6.1.3 Optimum Code Generation for 1-D Transforms 144
6.1.4 Minimizing NumberofOperations in Two Dimensional
Tran sform s 146
6.1.5 Low Power Code Generation 148
6.2 Optimum Code Generation for Single Register, Accumulator
Based Architectures 153
6.2.1 Single Register, Accumulator Based Architecture Model 153
6.2.2 Code Generation Rules 154
6.2.3 Computation Scheduling Algorithm 156
6.2.4 ImpactofDAG Structure on the Optimality ofGenerated
Code 158
6.2.5 DAG Optimizing Transformations 159
6.2.5.1 Transformation I - Tree to Chain Conversion 159
6.2.5.2 Transformation 11 - Serializing a Butterfly 159
6.2.5.3 Transformation III - Fanout Reduction 160
6.2.5.4 Transformation IV - Merging 161
6.2.6 Synthesis of Spill-free DAGs 162
6.2.7 Sources and Measures of Power Dissipation 168
6.2.8 Low Power Code Generation 168
7. RESIDUE NUMBER SYSTEM BASED IMPLEMENTATION 171
7.1 Optimizing RNS based Implementation of the Weighted-sum
Computation 172
7.1.1 Parallel Processing 174
7.1.2 Residue Encoding for Low Power 174
7.1.3 Coefficient Ordering 17 5
7.1.4 Exploiting Redundancy 176
7.1.5 Residue Encoding for minimizing LUT area 177
7.2 Optimizing RNS based Implementation of FIR Filters 179
7.2.1 Coefficient Scaling 179
7.2.2 Coefficient Optimization for Low Power 180
7.2.3 RNS based Implementation of Transposed FIR Filter
Strucrure lW
7.2.4 Coefficient Optimization for Area Reduction 180
7.3 RNS as an Optimizing Transformation for High Precision Signal
Processing 183
Contcnts IX
8. A FRAMEWORK FOR ALGORITHMIC AND ARCHITECTURAL

TRANSFORMATIONS 187
8.1 Classification of Algorithmic and Architectural Transformations 187
8.2 A Snapshot of the Framework ] 91
9. SUMMARY ] 95
References ]99
Topic Index 207
About the Authors 209
List of Figures
1.1 Digital Still Camera System 2

1.2 DSC Image Pipeline 3
1.3 Hardware-Software Codesign Methodology for a System-
on-a-chip 4
1.4 Solution Space for Weighted-Sum Computation 7
2.1 Generic DSP Architecture 12
2.2 4x4 Array Multiplier 14
2.3 Toggle Count as a Function of Number of Ones in the
Multiplier Inputs 16
2.4 Toggle Count as a Function of Hamming Distance be-
tween Successive Inputs 16
2.5 Address Bus Power Dissipation as a Function of Start Address 17
2.6 Binary to Gray Code Conversion 18
2.7 Memory Reorganization to Support Gray Coded Addressing 19
2.8 Programmable Binary to Gray Code Converter 19
2.9 TO Coding Scheme 20
2.10 TO Coding Scheme 21
2.11 Instruction Buffering 22
2.12 Decoded Instruction Buffering 22
2.13 Memory Partitioning for Low Power 23
2.14 Prefetch Buffer 23
2.15 Bus Reordering Scheme for Power Reduction in PD bus 24
2.16 %Reduction in the Number of Adjacent Signal Transi-
tions in Opposite Directions as a Function of the Bus
Reordering Span 26
2.17 Coefficients of a 32 Tap Linear Phase Low Pass FIR Filter 27
2.18 Scheme for Reducing Power in the Adder Input Busses 33
2.19 Data Flow Graph of a Weighted-sum Computation with
Coefficient Symmetry 34
2.20 Suitable Abstraction of TMS320C54x Architecture for
Exploiting Coefficient Symmetry 35
2.21 Signal Flow Graph of a Direct Form FIR Filter 36
2.22 One Level Decimated Multirate Architecture 38
Xl
XII VLSI SYNTHESIS OF DSP KERNELS
2.23 Normalized Power Dissipation as a Function ofNumber

of Taps for the Multirate FIR Filters Implemented on
TMS32OC2x 41
2.24 Signal Flow Graph of the Transposed FIR Filter 42
2.25 Architecture to Support Efficient Implementation ofTrans-
posed FIR Filter 42
2.26 Frequency Domain Characteristics of a 24 Tap FIR Fil-
ter Before and After Optimization 49
2.27 Low Pass Filter Specifications 50
2.28 Framework for Low Power Realization of FIR Filters
on a Programmable DSP 53
3.1 Direct Form Structure of a 4 Tap FIR Filter 57
3.2 Scheduled DFG Using One Multiplier and One Adder 57
3.3 Scheduled DFG Using One Pipelined Multiplier and
One Adder 58
3.4 Loop Unrolled DFG Using 1 Pipelined Multiplier and 1 Adder 59
3.5 Retimed 4 Tap FIR Filter 59
3.6 MCM DFG Using One Pipelined Multiplier and One Adder 60
3.7 Direct Form DFG Using Two Pipelined Multipliers and
One Adder 60
3.8 MCM DFG Using Two Pipelined Multipliers and Two Adders 61
3.9 Energy and Peak Power Dissipation as a Function of
Degree of Parallelism 62
3.10 LowerLimit of VDD/VT for Reduced Peak Power Dis-
sipation as a Function of Degree of Parallelism 63
3.11 One Level Decimated Multirate Architecture: Topology-I 63
3.12 One Level Decimated Multirate Architecture: Topology - 11 64
3.13 Signal Flow Graph of a Direct Form FIR Structure with
Non-linear Phase 65
3.14 Signal Flow Graph of a Direct Form FIR Structure with
Linear Phase 65
3.15 Signal Flow Graph of a Two Level Decimated Multirate
Architecture 68
3.16 Normalized Delay vs Supply Voltage Relationship 69
3.17 Normalized Power Dissipation vs Number of Taps 71
4.1 DA Based 4 Tap FIR Filter 77
4.2 4 Tap Linear Phase FIR Filter 78
4.3 2 Tap FIR Filter with 2BAAT 79
4.4 Using Multiple Memory Banks 80
4.5 Multirate Architecture 81
4.6 DA Based 4 Tap Multirate FIR Filter 82
4.7 Area-Delay Curves for FIR Filters 85
List 0/ Figures Xlll
4.8 Two Bank Implementation - Simple Coefficient Split 86

4.9 Two Bank Implementation - Generic Coefficient Split 86
4.10 Area vs Normalized CF2 Plot for 25 Different Partitions
of a 16 Tap Filter 91
4.11 Range ofRepresented Values for N=4, 2's Complement
and N+ 1=5, Nega-binary 96
4.12 Typical Audio Data Distribution for 25000 SampIes Ex-
tracted from an Audio File 97
4.13 Difference in Toggles for N=6, 2's Complement and
Nega-binary Scheme : + - - + - + + 98
4.14 Difference in Toggles for N=6, 2's Complement and
Nega-binary Scheme : - + + - + - + 99
4.15 Gaussian Distributed Data with N=6, Mean=22, SD=6 100
4.16 Gaussian Distributed Data with N=6, Mean=-22, SD=6 101
4.17 DA Based FIR Architecture Incorporating the Nega-
binary Scheme 102
4.18 Saving vs SD Plot for N=8, Gaussian Distributed Data
with Mean = max/2 105
4.19 Narrow (SD=8) Gaussian Distribution 106
4.20 Broad (SD=44) Gaussian Distribution 107
4.21 Shiftless Implementation of DA Based FIR with Fixed
Gray Sequencing 108
4.22 Shiftless Implementation of DA Based FIR with Any
Sequencing Possible 109
5.1 Data Flow Graph for a 4-term Weighted-sum Computation 114
5.2 Coefficient Subexpression Graph for the 4-term Weighted-
sum Computation 118
5.3 Data Flow Graph for 4 term MCM Computation 121
5.4 SFG Transformation - Computing Y[n] in Terms of
Y[n-l] 131
5.5 SFG Transformation - Computing Y[n] in Terms of
Y[n-I] 133
5.6 Average Reduction Factor Using Common Subexpres-
sion Elimination 134
5.7 Best Reduction Factors Using Coefficient Transforms
Without Common Sub-expression Elimination 135
5.8 Best Reduction Factors Using Coefficient Transforms
with Common Sub-expression Elimination 136
5.9 Frequency of Various Coefficient Transforms Result-
ing in the Best Reduction Factor with Common Sub-
expression Elimination 137
5.10 Precision Sensitive Register Allocation 139
XIV VLSI SYNTHESIS OF DSP KERNELS
5.11 Precision Sensitive Register Allocation 139

5.12 Precision Sensitive Scheduling 140
6.1 Generic Register-rich Architecture 143
6.2 3x3 Pixel Window Transform 144
6.3 Prewitt Window Transform 145
6.4 Transformed DAG with All SUB Nodes 145
6.5 Chain-type DAG for Prewitt Window Transform 146
6.6 Optimized Code for Prewitt Window Transform 146
6.7 Optimized DAG for 4x4 Haar Transform 148
6.8 Schedu1ed Instructions for 4x4 Haar Transform 149
6.9 Data Flow Graph and Variable Lifetimes for 4x4 Haar
Transform 150
6.10 Register-Conflict Graph 150
6.11 Consecutive-Variables Graph 150
6.12 Register Assignment for Low Power 151
6.13 Code Optimized for Low Power 151
6.14 3x3 Window Transforms 152
6.15 Single Register, Accumulator Based Architecture 153
6.16 Example DAG 154
6.17 DAG for 4x4 Walsh-Hadamard Transform 158
6.18 Optimized DAG for 4x4 Walsh-Hadamard Transform 159
6.19 Transformation I - Tree to Chain Conversion 160
6.20 Transformation 11 - Serializing a Butterfly 160
6.21 Transformations III and IV 161
6.22 Optimizing DAG Using Transformations 161
6.23 Spill-free DAG Synthesis 164
6.24 DAGs for 8x8 Walsh-Hadamard Transform 165
6.25 Spill-free DAGs for 8x8 Walsh-Hadamard Transform 166
6.26 DAGs for 8x8 Haar Transform 166
7.1 RNS Based Implementation of FIR Filters 173
7.2 Modulo MAC using look-up-tables 173
7.3 Modulo MAC using a single LUT 174
7.4 RNS Based Implementation ofFIR Filters with Parallel
Processing Transformation 175
7.5 Minimizing Look Up Tab\e Area by Exploiting Redundancy 177
7.6 Modulo MAC structure for Transposed Form FIR Filter 181
8.1 A Framework for Area-Power Tradeoff 192
8.2 A Framework for Area-Power Tradeoff - continued 193
List of Tables
2.1 Adjacent Signal Transitions in Opposite Direction as a

Function of the Bus-reordering Span 25
2.2 Impact of Selective Coefficient Negation on Total Num-
ber of 1s in the Coefficients 28
2.3 Impact of Coefficient Ordering on Hamming Distance
and Adjacent Toggles 31
2.4 Power Optimization Results Using Input Bit Swapping
for 1000 Random Number Pairs 33
2.5 TMS320C2x Code for Direct Form Architecture 38
2.6 TMS320C2x Code for the Multirate Architecture 40
2.7 Hamming Distance and Adjacent Signal Toggles After
Coefficient Scaling Followed by Steepest Descent and
First Improvement Optimization with No Linear Phase
Constraint 47
2.8 Hamming Distance and Adjacent Signal Toggles After
Coefficient Scaling Followed by Steepest Descent and
First Improvement Optimization with Linear Phase Constraint 48
2.9 Hamming Distance and Adjacent Signal Toggles for
Steepest Descent and First Improvement Optimization
with and without Linear Phase Constraint (with No Co-
efficient Scaling) 48
3.1 Computational Complexity of Multirate Architectures 67
3.2 Comparison with Direct Form and Block FIR Implementations 72
4.1 Coefficient Memory and Number of Additions for DA
based Implementations 85
4.2 A Few Functions and Their Corresponding Correlations
with Actual Area 88
4.3 ROM Areas as a % of Maximum Theoretical Area 92
4.4 ROM vs Hardwired Area (Equivalent NA210 NAND
Gates) Comparison 93
4.5 Area (Equivalent NA210 NAND Gates) Statistics for
All Possible Coefficient Partitions 93
4.6 Toggle and No-toggle Power Dissipation in Some D FFs 94
xv
xvi VLSI SYNTHESIS OF DSP KERNELS
4.7 Best Nega-binary Schemes for Gaussian Data Distribu-

tion ( mean = max/2; SD = 0.17 max ) 105
4.8 Toggle Reduction in LUT (for 10,000 SampIes; Gaus-
sian Distributed Data) 106
4.9 Comparison ofWeighted Toggle Data for Different Gray
Sequerices 110
4.10 Toggle Reduction as a Percentage of 2's Complement
Case for Two Different Gaussian Distributions 110
4.11 Toggle Reduction with Gray Sequencing for N = 8 and
Some Typical Distributions 111
5.1 Number of Additions+Subtractions (Initial and After
Minimization) 120
5.2 Numberof Additions+Subtractions for Computing MCM
Intermediate Outputs 126
6.1 Total Hamming Distance Between Successive Instructions 152
6.2 Code Dependance on the Scheduling of DAG Nodes 155
6.3 Comparison of Code Generator with 'C5x C Compiler 157
6.4 NumberofNodes (Ns) and Cycles(Cs) for Various DAG
Transforms 167
6.5 Hamming Distance Measure for Accumulator based Ar-
chitectures 169
7.1 Area estimates for PLA based modulo adder implementation 178
7.2 Area estimates for PLA based modulo multiplier imple-
mentation 179
7.3 Area estimates for PLA based modulo MAC implementation 179
7.4 Distribution of Residues across the Moduli Set 182
7.5 Impact of Coefficient Optimization on the Area of Mod-
ulo Multiplier and Modulo MAC 183
7.6 RNS based FIR filter with 24-bit precision on C5x 184
7.7 Number of Operations for RNS based FIR filter with
24-bit precision on C5x 184
Foreword
Technology is a driving force in society. At times it seems to be driving us

faster than we want to go. At the same time it seems to patiently wait for us to
adapt to it and, finally, adopt it as our own. Let me give a few examples.
The answering machine is a good example of us adopting a technology.
Twenty years ago if you had called my horne and I had an answering machine,
rather than me, responding to your call, you would have thought, "Oh, how
rude of hirn! I don't want to talk to a machine, I want to talk to Gene". Today,
twenty years later, if you call my horne and do not get my answering machine
(or me), you will think, "Oh, how rude of hirn! He should have an answering
machine so that I can at least leave a message". We have actually gone far
beyond the answering machine in this respect. We now have cellular phones -
with answering machines. Forget "Snail" mail, even Email is not fast enough;
we have Instant Messaging. But although I have a videophone on my desk, no
one else seems to have one. I guess we haven't adopted all of the technology
that we are introduced to.
Another example of the advance of technology is seen in the definition of
"portable". The term has changed over the last several decades as a result of
advances in integrated circuit technology. Think of the Personal Computer.
Not long ago, "portable" meant one (strong!) person could carry a Personal
Computer on an airplane without putting it in the checked-in baggage. Now,
"portable" means I can put my computer in my briefcase and still have room for
other things. It is beginning to mean that I can put my computer in my pocket. In
the future, it may very weil be that each one ofus wears multiple computers as a
matter of daily life. We will have a communications computer, an entertainment
computer, an information computer, a personal medical computer to name a few.
They will all communicate with one another over a personal area network on
our bodies. I like to call this personal area network the "last meter".
The definition of "portable" has also changed in the area of portable phones.
We have graduated from car phones - where the electronics were hidden in the
trunk of the car - to cellular phones so small that they can easily get lost in a
shirt pocket.
There are many more examples of how the marriage of Digital Signal Pro-
cessing to Integrated Circuit Technology has revolutionized our lives. But rather
than continue in that direction, I would like to turn to abrief historical perspec-
tive of how successful this marriage has been. After looking at history, 1 would
like to tie all of this to the value of this book.
Digital Signal Processing, depending on your view of history, has been
around for only about forty years. It began as a university curiosity in the
1960s. This was about the same time that digital computers were becoming
XVll
XVlll VLSI SYNTHESIS OF DS? KERNELS
useful. In the 1970s, Digital Signal Processing became a military advantage

for those nations who could afford it. It was in the late I 970s and early 1980s
that Integrated Circuit Technology became mature enough to impact Digital
Signal Processing with the introduction of a new device called "Digital Signal
Processor". With this new device, Digital Signal Processing moved from the
laboratory and military advantage to being a commercial success. Telecommu-
nications was the earliest to adopt Digital Signal Processing with many others to
follow. It was in the decade of the 1990s that Digital Signal Processing moved
from being a commercial success to being a consumer success. This was a
direct result of the advances in Integrated Circuit Technology. These advances
yielded four significant benefits: I) lower cost, 2) higher performance, 3) lower
power and 4) more transistors per device. The industry began to think in terms
of a System on a Chip (SoC). This led us to where we are now and will lead us
to where we will go in the coming decades.
What I see in our future is the opportunity to take advantage of these four
benefits of Integrated Circuit Technology as it is applied to Digital Signal Pro-
cessing. SoC technology will either complicate or simplify our decisions on
how best to implement Digital Signal Processing solutions on VLSI. We will
need to optimize on the best combination of Performance, Power dissipation
and Price. We will not only continue to change the definition of "portable" but
will begin to change the definitions of "personal", "good enough" and "pro-
grammable".
This book focuses on this very marriage of Digital Signal Processing to
Integrated Circuit Technology. It addresses implementation options as we try
to create new products which will impact society. These new products will
need to have good enough performance, low enough power dissipation and a
low enough price. At the same time they will need to be quick to market.
So, read this book! It will give you insights and arm you with techniques to
make the inevitable tradeoffs necessary to implement Digital Signal Processing
on Integrated Circuits to create new products.
One last thought on the marriage of Digital Signal Processing to Integrated
Circuit technology. Over the last several years, I have observed that every time
the performance of Digital Signal Processors increases significantly, the rules
of how we apply Digital Signal Processing theory change. Isn't this a great
time we live in?
GENE FRANTZ
Senior Fellow, Digital Signal Processing
Texas Instruments Inc.
Houston, Texas
April 2001
Acknowledgments
First and foremost, we would like to express our sincere gratitude to Milind
Sohoni, Vi kram Gadre and Supratim Biswas (all ofIIT Bombay), G. Venkatesh
(with Sasken Communication Technologies Ltd., earlier with IIT Bombay)
and Rubin Parekhji of Texas Instruments (India) for their insightful comments,
critical remarks and feedback which enriched the quality of this book.
We are thankful to Bobby Mitra and Sham Banerjee of Texas Instruments
(India) for their help, support and guidance.
We are grateful to Texas Instruments (India) for sponsoring the doctoral
studies of the first author. We deeply appreciate the support and encouragement
of IIT Bombay and Sasken Communication Technologies Ltd.
We are thankful to Amit Sinha, Somdipta Basu Roy, M.N. Mahesh, Satrajit
Gupta, Anand Pande, Sunil Kashide and Vikas Agrawal (all with Texas Instru-
ments (India) when the work was done) for their assistance in implementing
some of the techniques discussed in this book.
Our warm thanks to our children - Aarohi Mehendale and Apama &
Nachiket Sherlekar for putting up with our long hours at work. Finally, thanks
are due to our wives - Archana Mehendale and Gowri Sherlekar for being
there with us at all times.
MAHESH MEHENDALE
SUNIL D. SHERLEKAR
Preface
D.E Knuth in his seminal paper "Structured Programming with Goto State-
ments" underlines the importance of optimizing the inner loop in a computer
program. More than twenty five years and a revolution in semiconductor tech-
nology have not diminished the importance of the inner loop.
This book is about synthesis of the 'inner loop' or the kernel of Digital
Signal Processing (DSP) systems. These systems process - in real time -
digital information in the form of text, data, speech, images, audio and video.
The wide variety of these systems notwithstanding, their kerneis or inner loops
share a common dass of computation. This is the weighted sum (L: A[i]X[i]).
It occurs in Finite Impulse Response (FIR) and Infinite Impulse Response (HR)
filters, in signal correlation and in computing signal transforms.
Unlike general purpose computation which asks for computation to be 'as
fast as possible', DSP systems require performance that is characterized by
the arrival rate of a data stream which, in turn, is determined by the Nyquist
sampling rate of the signal to be processed. The performance of the system is
therefore a constraint within which one must optimize the area (cost) and power
(battery life). This is usually a matter of tradeoff.
The area-power tradeoff is complicated by additional requirements of flexi-
bility. Flexibility is important to track evolving standards, to cater to multiplicity
of standards (such as air interfaces in mobile communication) and fast-paced
innovation in algorithms. Flexibility is achieved by implementation in software,
but a completely soft implementation is likely to be ruinous for power. It is
therefore imperative that the requirements of flexibility be carefully predicted
and the system be partitioned into hardware and software components.
In this book, we present several algorithmic and architectural transformations
to optimize weighted-sum based DSP kerneis over the area-delay-power space.
These transformations address implementation technologies that offer varying
degrees of programmability (and therefore flexibility) ranging from software
programmable processors to customized hardwired solutions using standard-
cell or gate-array based ASICs. We consider both the multiplier-less and the
hardware multiplier-based implementations of the weighted-sum computation.
To start with, we present a comprehensive framework that encapsulates tech-
niques for low power implementation of DSP algorithms on programmable
DSPs. These techniques complement one another and address power reduction
XXI
xxii VLSI SYNTHESIS OF DSP KERNELS
in various components such as the program and data memory busses and the
multiplier-accumulator datapath of a Harvard architecture based digital signal
processor. The techniques are then specialized for weighted sum computations
and then for FIR filters.
Next we present architectural transforms for power optimization for hard-
wired implementation ofFIR filters. Multirate architectures are presented as an
important and interesting transform. A detailed analysis of the computational
complexity of multirate architectures is presented with results that indicate sig-
nificant power savings compared to other FIR filter structures.
Distributed Arithmetic (DA) has been presented in the literature as one of
the approaches for multiplier-less implementation of weighted-sum computa-
tion. We present techniques for deriving multiple DA based structures that
represent different data-points in the area-delay space. We look at improving
area-efficiency of DA based implementations and specifically show how the
fiexibility in coefficient partitioning can be exploited to reduce the area of a DA
structure using two look-up-tables. We also address the problem of reducing
power dissipation in the input data shift-registers of DA based FIR filters. Our
technique is based on a generic nega-binary representation scheme which is cus-
tomized for a given distribution profile of input data values, so as to minimize
toggles in the shift-registers.
For non-adaptive signal processing applications in which the weight val-
ues are constant and known at design time, an area-efficient realization can be
achieved by implementing the weighted sum computation using shift and add
operations. We present techniques for minimizing additions in such multiplier-
less implementations. These techniques are also useful for efficient implemen-
tation of weighted-sum computations on programmable processors that do not
support a hardware multiplier.
We address a special dass of weighted-sum computation problem, where the
weight-values are restricted to {O, 1, -I}. We present techniques for optimized
code generation of one dimensional and two dimensional multiplication-free
linear transforms. These are targeted to both register-rich and single-register,
accumulator based architectures.
Residue Number Systems (RNS) have been proposed for high-speed paral-
lel implementation of addition, subtraction and multiplication operations. We
explain how the power of RNS can be exploited for optimizing the implemen-
tation of weighted sum computations. In particular, RNS is proposed as a
method to enhance the results of other techniques presented in this book. RNS
is also proposed as a technique to enhance the precision of computations on a
programmable DSP.
To tie up all these techniques, a methodology is presented to systematically
identifying transformations that exploit the characteristics of a given DSP al-
PREFACE XXIll
gorithm and of the implementation style, to achieve tradeoffs in the area-delay-

power space.
This book is meant for practicing DSP system designers, who understand
that optimal design can never be a push-button activity. We sincerely hope that
they can benefit from the variety of techniques presented in this book. Each
of the techniques has a potential benefit to offer. But actual benefit will accrue
only from a proper selection from these techniques and their appropriate imple-
mentation: something that is in the realm of human expertise and judgement.
MAHESH MEHENDALE
SUNIL D. SHERLEKAR
Bangalore
April 2001
Chapter 1
INTRODUCTION
Today's digitally networked society has seen the emergence of many appli-
cations that process and transceive information in the form of text, data, speech,
images, audio and video. Digital Signal Processing (DSP) is the key technology
enabling this digital revolution. With advances in semiconductor technology
the number of devices that can be integrated on a single chip has been growing
exponentially. Experts forecast that Moore's law of exponential growth in chip
density will hold good atIeast tilI year 2010. By then, the minimum feature
size of 0.07 micron will enable the integration of as many as 800 million tran-
sistors on a single chip [69]. As we move into the era of ULSI (Ultra Large
Scale Integration), the electronic systems which required multi-chip solutions
can now be implemented on a single chip. Single chip solutions are now avail-
able for applications such as Video Conferencing, DTADs (Digital Telephone
Answering Devices), cellular phones, pagers, modems etc.
1.1. An Example
As an example, consider the electronics of a Digital Still Camera (DSC) [26]
shown in figure 1.1. The system-level components are the CCD image sen-
sor, the A/D conversion front-end, the DSP engine for image processing and
compression and various interface and memory drivers.
Although there are no intrinsic real-time constraints for such a system, it has
performance requirements dictated by the need to have as short a shot-to-shot
delay as possible. Besides, many DSCs now have a provision of attaching an
audio clip with each picture which requires real-time compression and storage.
Of course, being a portable device, the most important constraint on the system
design is the need for low power to ensure a long battery life.
Figure 1.2 shows the DSP pipeline of the DSC [26]. The following blocks
are of particular interest:
M. Mehendale et al., VLSI Synthesis of DSP Kernels

© Springer Science+Business Media New York 2001
2 VLSI SYNTHESIS OF DSP KERNELS
CCD
Driver Correlated Automatie AID

~" Double Gain
Converter
~/ Sampling Control
Timing
Generator
DSC Engine
LCD /L-
Display ~ Image
I Proeessing I
Image
NTSC/PAL /L- Compression
~~
output
Universal
~t
Flash
t
Serial RS232
Memory
Bus
Figure 1.1. Digital Still Camera System
• Fault pixel eorreetion: Large pixel CCD-arrays may have defeetive pixels.
During the normal operation of the DSC, the image values at the faulty pixel
loeations are computed using an interpolation technique.
• CFA interpolation: The nature of the front-end is such that only one of
R, G or B values is available for each pixel. The other values need to be
interpolated from the neighboring pixels.
• Color space conversion: While the CCD sensor produces RGB values, typ-
ical image compression techniques use YCrCb. These values are weighted
sums of the RGB values.
• Edge enhancement: CFA interpolation introduces low-pass filtering in the

image and this needs to be corrected to restore image sharpness.
• Compression: To reduce memory requirement, images are compressed typi-

cally using the JPEG compression scheme. This involves DCT computation.
Introduction 3
I
R G
1 R G Analog
I
Optical Lens
H
I
I
I
G B G B processing --t black --t distortion t--
I
I
R G R G and AlD cJamp compensation
I
I G B G B
I
~------------------I Auto exposure

C'
CFA Fault
color Gamma White
<J- <}-,- <J-
,---
correction balance
pixel ~
interpolation correction
-- L
- - Auto focus
r
Edge Write
----c detection r-
lPEG
- ----c to flash
color compression
4 conversion
- "v
RGB to YCrCb False
----c color r-
-"v
Scaling for For
suppression 4 monitor/LCD preview
Figure 1.2. DSC Image Pipeline
The common characteristics of each of these blocks is that they require

computation of weighted sums of pixel values. The optimal computation of
such weighted sums in the main concern of this book. Also note that the
coefficients in most of the weighted-sum computations in a DSC are constant.
Some of the techniques presented in this book are specifically targeted to such
situations.
1.2. The Design Process: Constraints and Alternatives

The process of designing a System-on-a-Chip (SoC), such as for the pre-
ceding example, not only involves integrating processor cores with memory,
.1"-
v System Specification
~
System 'A
System Partitioning IA
Estimators
I"
Validation
~
IA
i
Hardware
rt---c Synthesis
I" ,'7
Software Library
Synthesis ~
l-1
I~ "v
Figure 1.3. Hardware-Software Codesign Methodology far a System-on-a-chip
peripherals and custom hardwired logic but also involves developing the soft-
ware that implements the desired functionality. These hardware and software
components are strongly interdependent and hence need to be co-developed.
Figure 1.3 shows the methodology for designing an embedded real-time dig-
ital signal processing SoC. The design process starts with a specification of
the system in terms of its functionality and the design constraints/objectives.
Various mechanisms exist to specify the system functionality. One approach
is to use a high-level specification language such as Silage [25] or Lustre [24].
CAD systems such as Ptolemy [79], DSP-Station 1 from Mentor Graphics and
COSSAP from Synopsys provide block diagram editors that support hierar-
chical system specification. The blocks represent various functions and their
interconnections represent the data flow. These systems support both the syn-
chronous dataflow (SDF) and the dynamic dataflow (DDF) models [10, 79]
for capturing the specifications of DSP algorithms. These environments also
provide a rich library of commonly used DSP functions such as filters, FFT,
linear transforms, matrix multiplication etc. This can significantly reduce the
time required to specify the functionality of an embedded DSP system.
Area, delay (performance) and power constitute the three important design
constraints for most systems.
The area constraint is driven primarily by considerations of cost. Area ef-
ficient implementation results in a sm aller die size and hence is more cost
effective. It also enables integrating more functionality on a single chip.
1 Product and company names appearing here and elsewhere in the book are trademarks owned by the
respective companies.
Introduction 5
The perfonnance requirements of a system are driven by its data process-

ing needs. For real-time embedded DSP systems, throughput is the primary
perfonnance criterion (although latency is also important for a two-way com-
munication system). The perfonnance constraint is thus dependent on the rate
at which the input signals are sampled and on the complexity of processing to
be perfonned.
Low power dissipation is a key requirement for portable, battery operated
systems as it extends battery life. It also enables operation of the system with
smaller batteries thus reducing the weight of the handheld device. Low power
dissipation also helps reduce the packaging cost (plastic instead of ceramic) and
eliminate/reduce cooling (heat sinks) overhead. Finally, lower power dissipa-
tion increases the reliability of the device. The generated heat can be a limiting
factor for integrating a large number of devices on a single chip.
Low area, high perfonnance and low power dissipation are often conflicting
requirements. The system design process hence involves perfonning appropri-
ate trade-offs so as to meet the desired constraints. In DSP systems, perfonnance
is dictated by the Nyquist sampling rate; the tradeoff is then between area and
power.
In addition to the area, delay and power constraints there are other design
considerations that significantly influence the choice of the target implementa-
tion. These include:
• Shorter design cycle time for faster time-to-market.
• F1exibility of making changes to cater to changes in an evolving standard.
• Field upgradability for longer time to obsolescence.
• Re-use and customizability for amortization of design cost over several

variants of a product.
All the above requirements imply some sort of programmability. The down-
side of a programmable implementation, however, is penalty in tenns of either
area or power or both.
Fortunately, there is an interesting reason why programmable implementa-
tion are becoming increasingly feasible for DSP systems. With the advances
in semiconductor technology, digital circuits can be fabricated with increasing
chip density and can operate at increasing speeds. However, some of the fun-
damental properties of signals and systems in nature have remained more or
less the same. These include parameters such as the frequency range of signals
audible to the human ear, the frequency range of light visible to the human eye,
the duration of persistence of human vision etc. Thus more and more real-time
DSP functions which earlier required dedicated hardwired solutions can now
be implemented using a programmable processor for the same cost. Increasing
speeds means that power can be reduced by dropping the clock rate and the
operating voltage.
For any technology, however, a hardwired implementation is always more
efficient in area and power than a software one. The system design method-
ology must trade programmability for area and power by considering imple-
mentation technologies with varying degree of programmability. These range
from software programmable solutions offered by programmable processors
and hardware programmable solutions offered by FPGAs to dedicated hard-
wired functions implemented as standard cell or gate-array based ASICs.
An important step in the system design process, therefore, is to partition the
system into various components and decide on the implementation approach
for each component. This decision process involves determining whether to
implement a component in hardware or software, and also assigning area, delay
and power budgets so as to meet the system level design constraints. For a given
function to be implemented in hardware or software, multiple alternatives exist,
each representing a different data point in the area-delay-power space.
Most approaches to system partitioning [9, 21, 31] model it as a combinato-
rial optimization problem and use integer programming or other heuristic tech-
niques to arrive at a solution. However, these approaches assume the availabil-
ity of area, delay and power estimates for different implementation alternatives.
The quality of partitioning thus depends on the accuracy of these estimates. A
serious barrier to accurate estimation is that the deeper we move into submi-
cron geometries, the lesser the correlation between high-level descriptions (or
even optimized logic equations) and the size and speed ofthe circuit [69]. One
approach to address this limitation is to actually perform hardware/software
synthesis [71] and extract the design parameters. However, this method is time
consuming and can limit the search space that can be explored. The other ap-
proach is to partition the system in terms of pre-characterized library functions.
While it is virtually impossible to build a comprehensive library of functions
which can realize any system behavior, such an approach can successfully be
used for designing systems belonging to specific application domains such as
DSP. Most DSP systems can be characterized in terms of the core algorithms
or kerneIs they use. These include functions such as filtering (both Finite Im-
pulse Response (FIR) and Infinite Impulse Response (UR) filtering), correlation
and linear transforms (matrix multiplication). All these perform weighted-sum
(2:: A[iJX[i]) as the core computation. This class of DSP algorithms forms the
focus of this book.
The optimality of a system partition can be greatly influenced by providing
a rich set of implementation alternatives. This book covers the entire solution
space as shown in figure 1.4 for realizing weighted-sum based DSP kerneIs. It
represents implementation styles that offer varying degrees of programmability
and perform weighted-sum computation with or without a hardware multiplier.
Introduction 7
Implementation
With Using Hardware Programmable
Hardware Multiplier(s) Digital Signal
Multiplier and Adder(s) Processors
Residue Processors
Distributed
Without Implementation Number WithNo
Arithmetic (DA
Hardware Using Adders System (RNS) Dedicated
Based
Multiplier and S hifters Based Implementation Hardware
Implementation Multiplier
Low Medium High
<11---- Degree of Programmability
Figure J.4. Solution Space for Weighted-Sum Computation
The implementation style marked 'X' (hardware multiplier with no programma-

bility) has not been considered, as such an implementation does not offer any
advantage and hence is not useful.
The synthesis process involves taking the system through various levels of
design abstraction. These include behavioral/algorithmic level, architecturall
register transfer level, logic level, circuit level and layout level abstractions
for the hardwired functions and behaviorallalgorithmic level, high-Ievellan-
guage program level, assembly language program level and object code level
abstractions for the functions implemented in software. Various studies have
shown [60] that design decisions taken at higher levels of abstraction have a
much bigger impact on the area-delay-power characteristics of a system. The
book therefore focuses on behavioral level transformations to achieve the de-
sired area-delay-power tradeoffs.
1.3. Organization of the Book

The rest of the book is organized as folIows.
Chapter 2 presents a comprehensive framework that encapsulates techniques
for low power implementation of DSP algorithms on programmable DSPs. The
sources of power dissipation in various components of programmable DSPs
(e.g. data paths, busses) are identified and their measures established. Mutually
complementary techniques for power reduction in these components are then
described. The techniques are then specialized for weighted sum computations
and then for FIR filters.
Chapter3 presents implementations using hardware multiplier(s) and adder(s).

It evaluates the effectiveness of various DFG transformations such as parallel
processing, pipelining, re-timing and loop-unrolling with respect to weighted
sum computation. It shows that the parallel processing technique, while reduc-
ing energy dissipation, may result in increased peak power dissipation. Most of
these transforms do not impact computational complexity and achieve power
reduction at the expense of increased area. On the other hand, multirate ar-
chitectures, which are presented next, achieve power reduction by reducing
the computational complexity ofFIR filters. A detailed analysis is presented to
show how multirate architectures can realize low power FIR filters with minimal
overhead in datapath area.
Distributed Arithmetic (DA) has been presented in the literature as one of
the approaches for multiplier-Iess implementation of weighted-sum computa-
tion. In Chapter 4, various techniques are discussed far deriving multiple DA
based structures that represent different data-points in the area-delay space. The
chapter looks at improving area-efficiency of DA based implementations and
specifically shows how the flexibility in coefficient partitioning can be exploited
to reduce the area of a DA structure using two look-up-tables. The chapter also
addresses the problem of reducing power dissipation in the input data shift-
registers of DA based FIR filters. This is achieved using a technique that is
based on a generic nega-binary representation scheme which is customized for
a given distribution profile of input data values so as to minimize toggles in the
shift -registers.
For non-adaptive signal processing applications in which the weight values
are constant and known at design time, an area-efficient realization can be
achieved by implementing the weighted sum computation using shift and add
operations. Chapter 5 presents a technique based on common subexpression
precomputation for reducing the number of additions. The chapter also presents
techniques that use different number representation schemes (such as Canonical
Signed Digit (CSD)) and transform signal flow graphs to further minimize the
number of additions. Finally, the chapter discusses high-level synthesis of
multi-precision DFGs - a useful technique to minimize computation at the
bit-precision level.
Chapter 6 focuses on optimized code generation of multiplication-free lin-
ear transforms. These transforms perform weighted-sum computation with the
weights being restricted to {O, 1,-1} and can be realized using add and sub-
tract operations. Both one dimensional and two dimensional transforms are
considered. The code generation is targeted to both a register-rich architec-
ture and a single register, accumulator based architecture, and aims at reducing
power dissipation. The chapter also presents DAG transformations to further
improve performance and reduce power dissipation of the multiplication-free
linear transforms.
Introduction 9
Residue Number Systems (RNS) have been proposed for high-speed parallel
implementation of addition, subtraction and multiplication operations. Chapter
7 describes RNS based implementation of the weighted-sum computation and
presents transformations that aim at reducing area, delay and power dissipation
of the implementation. Chapter 7 also presents RNS as a transformation to
improve performance and reduce power dissipation of DSP algorithms which
need the data and the coefficients to have a higher bit precision than what is
supported by the target DSP architecture.
To tie up all these techniques, a methodology is presented in Chapter 8
to systematically identify transformations that exploit the characteristics of a
given DSP algorithm and of the implementation style to achieve tradeoffs in
the area-delay-power space.
Chapter 9 summarizes the key topics covered in this book to address the
VLSI synthesis and optimization of DSP kemeIs - primarily the weighted-sum
based kemeis.
1.4. For the Reader

This book is meant for practicing DSP system designers, who understand
that optimal design can never be a push-button activity. In summary, designers
will find the following in this book:
• Description and analysis of several algorithmic and architectural transfor-

mations that help achieve the desired tradeoffs in the area-delay-power space
for various implementation styles.
• Automated and semi-automated techniques for applying these transforma-

tions.
• Classification ofthe transformations based on the properties that they exploit
and their encapsulation in a design framework. A methodology that uses
the framework to systematically explore the application of these transforms
depending on the characteristics of the algorithm and the target implemen-
tation style.
As a caveat, the designer needs to use his expertise and judgement to make
a proper selection from these techniques and implement them properly in the
context of the specific system being designed.
Chapter 2
PROGRAMMABLE DSP BASED

IMPLEMENTATION
A programmable DSP is a processor customized to implement digital signal

processing algorithms efficiently. The customization is based on the following
characteristics of DSP algorithms:
• Compute Intensive: Most DSP kemeis are compute intensive with weighted-
sum being the core computation. A programmable DSP hence incorpo-
rates a dedicated hardwired multiplier and its datapath supports single cycle
multiply-accumulate (MAC) operation.
• Data Intensive: In most DSP kemeis, each multiply operation ofthe weight-
sum computation is performed on a new set of coefficient and data values.
A programmable DSP is hence pipelined with an operand read stage before
the execute stage, has an address generator unit that operates in parallel with
the execute datapath and uses a Harvard architecture with multiple busses
to program and data memory.
• Repetitive: DSP algorithms are repetitive both at micro-Ievel (e.g. Multiply-

accumulate operation repeated N times during an N-term weighted-sum
computation) and at macro-Ievel (e.g. kemeIs such as filtering repeated
every time a new data sampIe is read). A DSP architecture hence uses
special instructions (such as RPT MAC) and control mechanisms to support
zero overhead looping.
This chapter focuses on low power realization of DSP algorithms on a pro-

grammable DSP. The techniques are targeted to a generic DSP architecture
shown in figure 2. I.
The architecture has two separate memory spaces (program and data) which
can be accessed simultaneously. This is similar to the Harvard architecture
II

Program Datu Read

Address
Counter Register Data
ProgramJ
Memory
Coefficient Data Write
Memory Address
Register
CPU
Figure 2.1. Generic DSP Architecture
employed in most of the programmable DSPs [95, 96]. During weighted-sum

computation, one of the memories can be used to store coefficients (weights)
and the other to store input data sampies. The arithmetic unit performs fixed
point computation on numbers represented in 2's complement form. It consists
of a dedicated hardware multiplier and an adderlsubtracter connected to the ac-
cumulator so as to be able to efficiently execute the multiply-accumulate (MAC)
operation. This again is similar to the single-cycle MAC instruction supported
by most of the programmab\e DSPs [34, 35, 95, 96]. An N-term weighted-
sum computation can be performed using this architecture as a sequence of N
multiply-accumulate operations.
The sources of power dissipation in CMOS circuits can be classified into
dynamic power, short circuit current and leakage current [15]. For most CMOS
designs, dynamic power is the main source of power dissipation and is given
by equation 2.1.
Pdynamic = Cswitch. V 2 .1 (2.1 )
where V is the supply voltage, 1 is the operating frequency, Cswitch is the

switching capacitance given by the product of the physical capacitance being
charged/discharged and the corresponding switching activity. The low power
transformations described in this chapter aim at reducing one or more of these
factors while maintaining the throughput of the algorithm.
Programmahle DSP hased Implementation 13
The rest of this chapter is organized as folIows. Section 2.1, identifies the
main sources of power dissipation and deveJops measures for estimating power
dissipated in each of the sources. Various techniques for low power realization
of DSP algorithms are discussed in section 2.2. Section 2.3, presents algo-
rithmic and architectural transformations which are specific to weighted-sum
computation. Section 2.4 presents additional transformations for low power
realization of FIR (Finite Impulse Response) filters - a DSP kernel which is
based on weighted-sum computation. Finally, section 2.5 integrates various
transformations into a comprehensive framework for low power realization of
FIR filters on programmable DSPs.
2.1. Power Dissipation - Sources and Measures

2.1.1. Components Contributing to Power Dissipation
Each step in a weighted-sum computation involves getting the appropriate
coefficient and data values and performing a multiply-accumulate computation.
Thus address and data busses of both the memories and the multiplier-adder
datapath experience the highest signal activity during a weighted-sum compu-
tation. These hardware components therefore form the main sources of power
dissipation.
2.1.2. Measures of Power Dissipation in Busses

For a typical embedded processor, address and data busses are networks with
a large capacitive loading [89]. Hence signal switching in these networks has
a significant impact on power consumption. In addition to the net capacitance
of each signal(bit) of the bus, inter-signal cross-coupling capacitance also con-
tributes to the bus power dissipation. The amount of this capacitance depends
on the processing technology and on the spacing between adjacent metal lines
carrying the signals. With the deep submicron technology, there is a trend to-
wards increasing contribution of the cross-coupling capacitance to the total bus
capacitance. The data presented in [11] indicates that the power dissipation
due to the cross-coupling capacitance varies depending on the adjacent signal
values. The current required for signals to switch between 5's (OlOlb) and A's
(lOlOb) is about 25% more than the current required for the signals to switch
between O's (OOOOb) and F's (111Ib) [li].
The Hamming distance between consecutive signal values and the number
of adjacent signals toggling in opposite direction thus form the measures of
power dissipation in the busses.
2.1.3. Measures of Power Dissipation in the Multiplier

Due to high speed requirements, parallel array architectures are used far
implementing dedicated multipliers in programmable DSPs [42, 91]. The logic
A380 A280 Al 80 AO 80
P7 P6 PS P4 P3 P2 PI PO
Figure 2.2. 4x4 Array Multiplier
diagram of a 4x4 bit parallel array multiplier is shown in figure 2.2. The
multiplier consists of AND gates to compute partial inner products and an
array of adders to compute the complete product. The power dissipation of a
multiplier is direct1y proportional to the number of switchings at all the internal
nodes of the multiplier. These are the outputs of the AND gates and of the 1
bit adders. The number of internal node switchings depend on the multiplier
input values. This dependence can be analyzed using the 'Transition Density'
measure of circuit activity.
'Transition Density' [68] of a signal is the average number of transitions/
toggles of the signal per unit time. Consider a combinational logic block with
inputs Xl, X2, ... , X n and the output Y. Let T XI , T x2 , ••• ,Txn be the transition
densities at the inputs and let PXI , PX2 ' ••• ,PXn be the probabilities ofthe input
signal values being I. Assuming the input values to be mutually independent,
the transition density at the output Y is given by
n
Ty = L P(Bd(Y, xd) . T Xi (2.2)
i=l
where Bd(Y, Xi), is the Boolean difference of Y w.r.t. Xi.

Using equation 2.2, it can be shown that the transition density at the output
of a two input AND gate (y = a&b) is given by T y = (Ta· Pb + n.
Pa).
Programmable DSP based Implementation 15
The probability P y of the AND gate output being 1 is given by (Pa' H). The
transition density at the output of a two input XOR gate (y = a EB b) is given
by (Ty = Ta + Tb). These relationships indicate that the multiplier power
is directly dependent on the transition densities and the probabilities of the
multiplier inputs.
The transition densities of the multiplier inputs depend on the Hamming
distance between successive input values. The input signal probabilities depend
on the number of I s in the input signal values of the multiplier. These two thus
form the measures of multiplier power dissipation.
It can also be noted in figure 2.2, that the transitions in input bits BO and
BI affect more internal nodes than the transitions in input bit B3. In general,
transitions in lower order bits of the input signal contribute more to the multiplier
power dissipation than the higher order bits. Thus while minimizing transition
densities of all the input bits is important, higher gains can be achieved by
focusing on lower order bits of the input signals.
These measures have been experimentally verified by simulating an 8x8
parallel array multiplier. One input of the multiplier was kept constant and
1000 random numbers were fed to the other input. The total toggle count at all
the internal nodes and the inputs of the multiplier was measured. The toggle
count measurement was carried out for all 256 (0 to 255) values ofthe constant.
This data was then used to compute the average toggle count as a function of
the number of 1s in the constant input. This relationship, shown in figure 2.3,
confirms the analysis that the multiplier power is a direct function of the number
of 1s in its inputs.
The second experiment used sets of 1000 random numbers such that the
Hamming distance between consecutive numbers within a set was constant.
Seven such sets of numbers were generated corresponding to seven Hamming
distance values (l to 7). The total toggle count was measured by applying these
random numbers to one input of the multiplier while keeping the other input
constant. The toggle count vs Hamming distance relationship for four different
constants is shown in figure 2.4. It confirms the analysis that the multiplier
power is a direct function of the Hamming distance between its successive
inputs.
In addition to the array multiplier, other multiplier topologies based on Booth
encoding and Wall ace tree are also common in programmable DSPs [42]. While
the measure of Hamming distance between successive inputs applies to all these
topologies, the measure based on input data pattern may vary across topologies.
For example, power analysis of a Booth multiplier [36] shows that the power
dissipation is directly dependent on the number of 1s in the Booth encoded
input. This chapter focuses on techniques for reducing the Hamming distance
in the successive inputs of the multiplier. These techniques are hence applicable
to all multiplier topologies.
16 VLSI SYNTIlESlS OF DSP KERNELS
140
120
100
8'
0
x 80
§
0
u
j
e-
60
40
20
0
0 4
Numbet of ones
Figure 2.3. Toggle Count as a Function of Number of Ones in the Multiplier Inputs
160
x143
140
x77
120 x88
x36
0' 100
8
x
~
u
80
m
g:
'"
60
40
20
0
0 4
Hamming dislance
Figure 2.4. Toggle Count as a Function of Hamming Distance between Successive Inputs
2.2. Low Power Realization of DSP Algorithms

This section presents techniques and architectural extensions that can be
applied to most DSP applications for reducing power dissipation.
2.2.1. Allocation of Program, Coefficient and Data Memory

Since in most DSP kerneIs the code within the loops executes sequentially,
the transitions in the program memory address bus can be minimized by appro-
70 . - - - - - - - - - - , - - - - - - - - - - - , - - - - - - - - - - , - - - - - - - - - - , , - - - - - ,
<J)
~
'"
'"0
f- 65
roc:
'"
üi
C
Q)
"
«'"
TI
+ 60
Q)
"c:
*'"
(5
E
c:
E
'"
I 55
Ei0
f-
50L---------~-----------L----------L---------~------~
o 5 10 15 20
Start Address
Figure 2.5. Address Bus Power Dissipation as a Function of Start Address
priately selecting the start locations of the code segments. The same technique
can also be applied for storing coefficients and data values in the memory for
weighted-sum computation in which these are accessed sequentially. Figure 2.5
shows the sum of total Hamming distance and total number of adjacent signals
toggling in opposite directions in the consecutive addresses as a function of start
location for a 24 word memory block. The analysis shows that start address of
Ox 14 results in 14% more power dissipation in the address busses compared to
the start address of OxOO. The power dissipation in the addresses busses can
thus be reduced by aligning the start address far the program, coefficient and
data blocks with the beginning of a memory page. The capacitive loading for
the address bus transitions and hence the power dissipation can also be reduced
by storing the most frequently accessed program segments, coefficients and
data on the on-chip memory.
2.2.2. Bus Coding

2.2.2.1 Gray Coded Addressing
The property of sequential memory access by most DSP algorithms can be
further exploited by using gray coded addressing [61, 89, 54] to reduce power
dissipation in the address busses. This can result in apower saving of close to
MSB Binary Address LSB
Bn Bn-I Bn-2 B2 BI BO
Gn Gn-I Gn-2 G2 GI GO
MSB Gray Coded Address LSB
Figure 2.6. Binary to Gray Code Conversion
50% compared to binary sequential access. A binary address generator can be

modified to perform gray coded addressing using a binary-to-gray converter as
shown figure 2.6
Since the gray coded addressing changes the order in which memory loca-
tions are accessed, the memory contents such as program, coefficients and data
need to be suitably re-arranged to ensure correct functionality. Figure 2.7 shows
a simple scheme that can be used to reorganize the memory contents at design
(compile) time.
Since the data memory access is sequential only during certain functions such
as weighted-sum computation, the gray coded addressing on the data memory
address bus needs to be selectively activated at run-time. Figure 2.8 shows a
scheme that supports such programmable binary-to-gray code conversion.
It can be noted that for certain functions such as finding an average or finding
a range (minimum and maximum) of N data values, the sequence of accessing
the data memory need not be preserved and hence the data memory need not
be reorganized.
2.2.2.2 TO coding
The power dissipation in the address busses during sequential access can be
further reduced by using the asymptotic zero-transition encoding referred to as
TO coding in [8]. Figure 2.9 shows the memory access scheme based on TO
coding.
Binary to Gray
r--1> Converter
~
Binary Gray
Memory Memory
Binary 00 instr 1 00 instr 1

{> ---{)
Counter 01 instr 2 01 instr 2
10 instr 3 10 instr 4
11 instr 4 11 instr 3
I data I
Figure 2.7. Memory Reorganization to Support Gray Coded Addressing
MSB Binary Address LSB
Bn Bn-I Bn-2 B2 BI BO
Gn Gn-I Gn-2 G2 GI GO
MSB Programmable Gray Coded Address LSB
Figure 2.8. Programmable Binary to Gray Code Converter
At the beginning of the series of sequential accesses, the processor sends the
start location on the address bus and the same is loaded in the counter which
Mem Addr Program
(~
Program/
Coefficient
Memory
! Counter
Counter
'I increment
CPU
Figure 2.9. TO Coding Scheme
is part of the memory wrapper supporting TO coding. During the sequential

accesses, the processor holds the address bus at the same initial value (hence
o toggles) and sets the 'increment' signal high to indicate sequential access.
During each cycle the counter in the memory wrapper is incremented to generate
the next sequential access. While the TO coding scheme enables zero toggles
in the address bus, it requires an additional signal and also results in marginal
area, delay and power overhead due to the counter and the multiplexer logic in
the memory wrapper.
2.2.2.3 Bus Invert Coding

The power dissipation in the data busses depends on the Hamming distance
between successive values. Thus for a B bit bus, there can be upto B number
of toggles. Bus invert coding [88] technique selectively inverts the data value
so as to limit the maximum number of toggles to B/2, thus reducing the power
dissipation. As shown in figure 2.10, in this scheme, if the Hamming distance
between the new and the current data values is higher than B/2 (for a B bit bus),
the new data value is inverted be fore sending to the destination. The source
also asserts the 'invert' signal which is used by the destination to get the correct
value by inverting the received data. This technique has the overhead of an
extra signal and the encoding and the decoding circuitry at the source and the
destination respectively.
invert
Source Destination
Figure 2.10. TO Coding Sehe me
2.2.3. Instruction Buffering

DSP app1ications typically involve repeated execution ofDSP kemeIs. These
kemeIs are smaller in code size but take up significant portion of the run time.
Instruction buffering has been proposed [7] as an architectural extension to
reduce power dissipation during the iterative execution of DSP kemeIs. The
scheme, as shown in figure 2.11, involves adding an instruction buffer to the
processor cpu. During the first iteration, as the code is executed, the fetched
instructions are also stored in the instruction buffer. For the following iterations,
the instructions are fetched from the instruction buffer thus requiring no access
to the program memory and thereby reducing power dissipation.
This technique can be extended further to eliminate power dissipation in the
instruction decoder. As has been shown in [7], since DSPs have many complex
instructions that are executed in single cycle, the decoder power dissipation
forms a significant portion of the total power dissipation of the CPU. In this
scheme (figure 2.12), during the first iteration, the output of the instruction
decoder is stored in the decoded instruction buffer which is part of the CPU.
During the following iterations the decoder output is fetched from the decoded
instruction buffer thus eliminating power dissipated for program fetch and de-
code. Since the decoder output is typically much wider than the instruction
width, the area penalty due to the decoded instruction buffer is much higher
than the instruction buffer.
Iteration 1 Iterations 2 -> N
v
Instruction - Instruction
Program Butler Program Buffer
Memory Memory
Decode
Logic
~'od'
Logic
I
--
epu epu
Figure 2.11. Instruction Buffering
Iteration 1 Iterations 2 -> N
7
Decode Decode
Program Logic Pro gram Logic
Memory Memory
--[;: Decoded ,-- Decoded
Instruction Instruction
Buffer Butler
Execute Execute
Logic Logic
epu epu
Figure 2.12. Decoded Instruction Buffering
2.2.4. Memory Architectures for Low Power

The sequential nature of memory accesses by DSP algorithms can be used
to organize the memory for low power without resulting in any perfonnance
penalty. One approach is to partition the memory into two halves corresponding
to data at odd and even addresses respectively. During sequential access no two
consecutive addresses can be both odd or both even. The two memory halves
thus get accessed on every alternate cycle. The two memory halves can hence
Even
o
~ CPU
Memory CPU
Odd
Memory
T-FF
/\ /\
I I CLK'/2 CLK
CLK CLK
CLK
Figure 2.13. Memory Partitioning for Low Power
16 32
... 16
~::l
Memory CPU Memory ~ a:l
.c
/
/ "
CPU
8
~
...d.>
0...
A ~ /\
I I I
CLK CLK CLKJ2 CLK CLK
Figure 2.14. Prefetch ButTer
be clocked at half the CPU clock, resulting in power reduction. Figure 2.13
shows such a memory architecture.
The property of sequential access can also be exploited by using a wider
memory and reading two words per memory access. The data can be stored in
a pre-fetch buffer such that while the memory is accessed at half the CPU clock
rate, the CPU gets the data on evcry cycle during sequential access. Figure 2.14
shows such a memory architecture.
It can be noted that this scheme can be generalized such that for a B bit
data, the memory width can be set to N*B to read N words per memory access
and consequently clock the memory at I/N times the CPU clock. The prefetch
AO - -A2- - AO
AI - -AO
-- AI
A4
A2 ---- A2
_ _ A.L_
PROGRAMI A3 A3
COEFFICIENT CPU
MEMORY A4 - -A3
-- A4
A5 - -A5- - A5
A7
A6 ---- A6
A6
A7 ---- A7
Figure 2.15. Bus Reordering Scheme for Power Reduction in PD bus
buffer scheme can also be used in conjunction with memory partitioning to

further reduce power dissipation.
2.2.5. Bus Bit Reordering

In case of a system-on-a-chip design, the designer integrates the DSP CPD
core with the memories and the application specific glue logic on a single chip.
In such a design methodology the designer has control over the placement of
various memories and the routing of memory-CPU busses. This flexibility can
be exploited to develop a layout level technique for reducing power dissipation
in the program/coefficient memory data bus. The technique aims at reducing
power dissipation due to cross-coupling capacitance. One approach to achieve
this is to increase the spacing between the bus lines. This however results in
increased area. A better approach is to reorder the bus bits in such a way that
the number of adjacent signals toggling in opposite direction are minimized.
Figure 2.15 illustrates this approach and shows how the bus signals AO to A 7
can be reordered in the sequence of A2-AO-A4-AI-A3-A5-A7-A6.
For a given DSP application the program execution can be traced (using
simulation) to get signal transition information on the program memory data
bus. This data can then be used to arrive at optimum bus order so as to minimize
the number of adjacent signal toggles.
For an N bit bus, an N node fully connected graph is constructed. The
edges are assigned weights Wi,j given by the number of times bit 'i' and bit 'j'
transition in the opposite direction. The problem of finding an optimum bit order
Table 2.1. Adjacent Signal Transitions in Opposite Direction as a Function of the Bus-
reordering Span
#taps Initial ±I ±2 ±3 ±4 ±5 ±6 ±7 ±8
24 38 32 18 18 16 12 8 8 8
27 38 30 24 16 4 4 4 4 4
32 26 14 12 10 10 10 10 10 10
36 32 20 14 10 8 8 8 8 8
40 40 24 24 18 18 18 18 18 16
64 62 38 38 34 34 28 22 22 22
72 68 54 40 40 40 40 40 40 40
96 84 74 64 60 58 54 54 54 54
128 112 94 84 84 78 78 78 78 78
can then be mapped onto the problem of finding the lowest cost Hamiltonian
Path in an edge-weighted graph or the traveling salesman problem.
As can be noted from figure 2.15, the bus bit reordering scheme has the
downside of increasing the bus netlength and hence the interconnect capaci-
tance. This overhead can be minimized if the reordering span for each bus
bit is kept within a limit. For example, the bus reordering scheme shown in
figure 2.15, uses the reordering span of ±2. The optimum bit order thus needs
to satisfy the constraint in terms of the maximum reordering span. This is
achieved by suitably modifying edge-weights such that all edge weights Wi,j
are made infinite if li - jl > M axSpan.
The algorithm starts with the normal order as the initial order. It uses hill-
c1imbing based iterative improvement approach to arrive at the optimum bit
ordering. During each iteration a new feasible order is derived and is accepted
as a new solution if it results in a lower cost function (i.e. lower number of
adjacent signal transitions in opposite direction)
The impact of bit reordering on the power reduction was analyzed in the
context of a DSP code that performs FIR filtering. Nine filters with the number
of taps ranging from 24 to 128 were used. For each case, the algorithm was
applied with the reordering span constraint ranging from ± 1 to ±8.
The results shown in table 2.1 show significant reduction in the number of
adjacent signal transitions in opposite directions. It can also be noted that
the reduction increases with the increase in the bus reordering span which is
expected. However as mentioned earlier, higher reordering span implies higher
interconnect length.
Figure 2.16 plots the average percentage reduction as a function of bus re-
ordering span. As can be seen from the plot the incremental saving in the
70
65
60
55
c
0 50
U
::>
-0 45
Q)
a: 40
#-
35
30
25
20
1 2 3 4 5 6 7 8
Bus Reordering Span
Figure 2.16. %Reduction in the Number of Adjacent Signal Transitions in Opposite Directions
as a Function 01' the Bus Reordering Span
number of adjacent signal transitions gets smaller beyond the reordering span
of ±4. For the span of ±4, the cross-coupling related power dissipation in the
program memory data bus reduces on the average by 54%. This hence is the
optimal reordering span to get the most power reduction.
2.2.6. Generic Techniques for Power Reduction

In addition to the techniques discussed above, there are a few other low power
techniques which can be applied in the context of programmable DSPs. These
are listed below in bullet forms with appropriate references which give more
detai led description.
• Cold scheduling [89]

• Opcode assignment [102]
• Clock gating [80]
• Guarded evaluation [94]
• Pre-computation logic [65]
• Instruction subsetting [20]
• Code Compression [37]
2.3. Low Power Realization of Weighted-sum Computation

Whi Je the techniques described in the earlier section can be applied to the low
power realization of weighted-sum computation, this section presents additional
low power techniques specific to weighted-sum implementation.
0.5
0.4
Q)
:;)
0; 0.3
>
C
Q) 0.2
'(3
~0 0.1
Ü
-0.1
5 10 15 20 25 30
Coefficient Number
Figure 2.17. Coefficients of a 32 Tap Linear Phase Low Pass FIR Filter
2.3.1. Selective Coefficient Negation

During a weighted-sum computation, the coefficients (weights) are stored in
the coefficient memory in 2's complement form. For a given number N and the
number of bits B used to represent it, the number of 1s in the 2's complement
representation of +N and -N can differ significantly. For example, both 8 and 16
bit representations of 9 (00001001 b, 00000000 00001001 b) have two 1s, the 8
bit representation of -9 (11110111 b) has seven 1sand the 16 bit representation
of -9 (11111111 11110111b) has fifteen Is. Both 8 and 16 bit representations
of 63 (00111111 b) has six 1s, the 8 bit representation of -63 (11000001 b) has
three Is and the 16 bit representation (11111111 11000001b) has eleven Is.
For each coefficient A[i], either A[i] or -A[i] can be stored in the coefficient
memory, depending on the value that has lesser number of 1s in its 2's comple-
ment binary representation. If -A[i] is stored in the memory, the corresponding
product (- A[i] . X[n - i]) needs to be subtracted from the accumulator so as
to get the correct weighted-sum result. This technique of selective coefficient
negation reduces the number of 1s in the coefficient input of the multiplier. It
also reduces Hamming distance between consecutive coefficients, especially in
cases where a small positive coefficient is followed by a small negative coeffi-
cient. This technique thus can result in significant reduction in the multiplier
power and also the coefficient data bus power.
If the coefficients to be negated follow a regular pattern, a modified MAC
instruction that alternates between multiply-add and multiply-subtract can sup-
port selective coefficient negation. For example, analysis of many low pass fi Iter
coefficients shows that the coefficients to be negated follow a regular alternating
pattern. As an example consider the coefficients of a 32 tap linear phase FIR
filter shown in figure 2.17. It can be noted that coefficients follow a repetitive
pattern of two positive coefficients followed by two negative coefficients.
Table 2.2. Impact of Selective Coefficient Negation on Total Number of I s in the Coefficients
#taps #ls # negated % negated #1 s after %

coeffs coeffs selcctive reduction
Negation
16 120 6 37.5% 88 26.7%

24 212 12 50.0% 144 32.1%
32 230 12 37.5% 160 30.4%
36 282 16 44.4% 142 49.7%
40 312 18 45.0% 202 35.3%
48 366 20 41.7% 200 45.4%
64 498 30 46.8% 276 44.6%
72 550 34 47.2% 336 38.9%
96 782 48 50.0% 358 54.2%
128 984 66 51.6% 426 56.7%
In cases where the coefficients to be negated folIowarandom paUem, all

such coefficients can be grouped together and the filtering performed using
two loops - first one performing repeated multiply-add and the second one
performing repeated multiply-subtract.
Table 2.2 presents the impact of selective coefficient negation in case of 10
low pass FIR filters synthesized using Parks-McClelIan algorithm [73]. The
results show that selective coefficient negation selects 37% to 51 % of coeffi-
cients for negation, and results in 26% to 56% reduction in the total number of
1s in the coefficient values. As mentioned earlier, this reduction translates into
power reduction in the multiplier.
2.3.2. Coefficient Ordering

Since the summation operation is both commutative and associative, the
weighted-sum output is independent of the order of computing the coefficient
products. Thus for a four term weighted-sum computation, the output can be
computed as
Y[n] = A[O]- X[O] + A[l] . X[l] + A[2]· X[2] + A[3]· X[3] (2.3)
or as
Y[n] = A[l]- X[l] + A[3] - X[3] + A[O]- X[O] + A[2]· X[2] (2.4)
Since the weighted sum computation also does not impose any restriction on
how the coefficient and data values are stored_ The address generator needs to
comprehend the locations and generate the correct pair of addresses (to access
the coefficient and corresponding data sampie value) for each product compu-
tation. The order of coefficient-data product computation directly affects the
sequence of coefficients appearing on the coefficient memory data bus. Thus
this order determines the power dissipation in the bus.
The following subsection formulates the problem of finding an optimum or-
der of the coefficients such that the total Hamming distance between consecutive
coefficients is minimized.
2.3.2.1 Coefficient Ordering Problem Formulation

For an N term weighted-sum computation, N! different coefficient orders
are possible. The problem of finding the optimum order can be reduced to the
problem of finding the lowest cost Hamiltonian Circuit in an edge-weighted
graph or the traveling salesman problem. Since this problem is NP complete,
heuristics need to be developed to obtain a near-optimal solution in polynomial
time.
The coefficient ordering problem can thus be formulated as a traveling sales-
man problem. The coefficients map onto the cities and the Hamming distances
between the coefficients map onto the distances between the cities. The optimal
coefficient order thus becomes the optimal tour of the cities, where each city is
visited only once and the total distance traveled is minimum.
Several heuristics have been proposed [33] to solve the general traveling
salesman problem and also a special case of the problem where distances be-
tween the cities satisfy the triangular inequality. For any three cities Ci, Cj and
Cb the triangular inequality property requires that
The coefficient ordering problem satisfies the triangular inequality as folIows:

Let Ai, AjandA k be any three coefficients and let Hij, H jk and H ik be the
Hamming distances between (Ai - A j ), (A j - A k ) and (Ai - A k ) pairs of
the coefficients respectively. Let B ij , Bjk and B ik be sets of bit locations in
which the coefficients (Ai - A j ), (A j - A k ) and (Ai - A k ) differ respectively.
The Hamming distances H ij , H jk and Hik thus give the cardinality of the sets
B ij , Bjk and Bik respective\y. These sets satisfy the following relationship :
(2.6)
The cardinality of B ik is maximum when the set (B ij n B jk ) is empty and

is given by H ij + H jk . Thus the Hamming distances satisfy the following
relationship :
H ik :::; (Hij + H jk ) (2.7)
which is the triangular inequality.
The algorithms proposed [33] to solve this class oftraveling salesman prob-
lems include nearest neighbor, nearest insertion, farthest insertion, cheapest
insertion, nearest merger etc. Experiments with various low pass FIR filters
show that in alm ost all cases, the nearest neighbor algorithm performs the best.
2.3.2.2 Coefficient Ordering Aigorithm

Here is the algorithm for finding the optimum coefficient order.
Procedure Order-Coefficients-for-Low-Power
Inputs: N coefficients A[O] to A[N-l]
Output: A coefficient order which results in minimum total Hamming distance
between successive coefficient values
/* build hamming distance matrix */

for each coefficient A[i] (i=O to N-l) {
for each coefficient A[j] (j=O to N-l ) {
Hd [i][j] = CounLno_oLOnes (A[i] EB AU]) } }
/* initialization */
Coefficient-Order-List = {A[O]}
Latest-Coefficient-Index = 0
/* build the coefficient order */
for ( i = 1 to N -1) {
Find A[j] such that ( A[j] tf- Coefficient-Order-List ) and
Hd [jHLatest-Coefficient-Index] is minimum
Coefficient-Order-List = Coefficient-Order-List + A[j]
Latest-Coefficient-Index = j
}
The 'Coefficient-Order-List' gives the desired sequence of coefficient-data

product computations. Once the sequence of accessing the coefficients is identi-
fied, the coefficients and the corresponding data values can be reordered and ap-
propriately stored in the memory, such that the desired sequence of coefficient-
data product computations is achieved when the memory is accessed sequen-
tially.
Table 2.3 shows the impact of the coefficient ordering technique on the total
Hamming distance and the total number of adjacent signals toggling in opposite
direction, between successive coefficient values. This reduction directly trans-
lates into the power savings in both the coefficient data bus and the multiplier.
The results show that the total Hamming distance can be reduced by 54%
to 83% using coefficient ordering. This directly translates into 54% to 83%
saving in the coefficient memory data bus power. The total number of adjacent
Programmable DSP based Implementatioll 31
Table 2.3. Impact of Coefficient Ordering on Hamming Distance and Adjacent ToggIes
#taps H.d. H.d. % Adj. Adj %

Initial Optimized reduction Toggle Toggle reduction
Initial Optimized
16 102 46 54.9% 8 3 62.5%
24 158 64 59.5% 20 4 80.0%
32 204 68 66.7% 22 7 68.2%
36 242 82 66.1% 28 7 75.0%
40 280 94 66.4% 32 8 75.0%
48 350 108 69.1% 50 12 76.0%
64 452 118 73.9% 54 8 85.2%
72 510 110 78.4% 52 6 88.5%
96 700 138 80.3% 64 11 82.8%
128 952 156 83.6% 84 12 85.7%
toggles in opposite direction is also reduced by 62% to 85% using coefficient

ordering.
Since selective coefficient negation also hel ps in reducing the total Ham-
ming distance between the successive coefficient values, it can be applied in
conjunction with coefficient ordering to achieve further power reduction.
2.3.3. Adder Input Bit Swapping

The bit-wise commutativity property ofthe ADD operation can be exploited
to develop a technique that reduces the number of toggles in the busses that
feed the inputs to the adder. This not only reduces power dissipation in these
busses, it also reduces the power dissipated in the adder and the accumulator
that drives one of the busses.
Bitwise commutativity implies that the result of an ADD operation is not
affected even when one or more bits from one input are swapped with the corre-
sponding bits in the second input. Consider two 4 bit numbers A =(a3,a2,al ,aO)
and B = (b3,b2,b 1,bO) , where a3 and b3 are the MSBs of A and B respectively.
It can be easily shown that
a3,a2,al,aO + b3,b2,bl,bO b3,a2,al,bO + a3,b2,bl,aO

a3,b2,al,aO + b3,a2,bl,bO
a3,a2,bl,bO + b3,b2,al,aO
and so on
This property can be used as folIows:

Consider the following 4 bit input data sequence for addition
(inl) (in2)
Yl 0011 + 1100
Y2 0100 + 1011
Computation of Y2 as shown above results in three signal toggles in the databus

(in I) (bit a2:0 ---+ 1, bit al: 1 ---+ 0 and bit aO: 1 ---+ 0) and one pair of adjacent
signals (a2-a 1) toggling in opposite direction. It also results in three signal
toggles in the databus (in2) (bit b2: 1 ---+ 0, bit bl:O ---+ 1 and bit bO:O ---+ 1)
and one pair of adjacent signals (b2-b 1) toggling in opposite direction. Thus
totally six signal toggles and two adjacent signal toggles in opposite directions
contribute to the power dissipation during Y2 computation.
By using bitwise commutativity, Y2 can be calculated after swapping bits
a2,a 1,aO with b2,b 1,bO respectively. With this bit swapping the computation
sequence looks as folIows:
(in!) (in2)
Yl 0011 + 1100
Y2 0011 + 1100
This computation results in zero toggles in both the databusses and consequently
has no pairs of adjacent signals toggling in opposite direction. As can be seen
from this example, appropriate bit swapping can significantly reduce power
dissipation.
Figure 2.18 shows a scheme to perform the bit swapping so as to minimize
the toggle count. The scheme compares for every bit, the new value with the
current value, and performs bit swapping if the two values are different.
As can be seen from figure 2.18, the reduction in the toggles in the adder
inputs is achieved at the expense of additional logic i.e. the multiplexers and
the exclusive-or gates. The power dissipated in this logic offsets power savings
in the adder and its input busses. The final savings depend on the data values
being accumulated and also on the relative capacitance of the adder input busses
and the multiplexer inputs.
To evaluate the effectiveness of the input bit swapping technique for power
reduction in the adder and its input busses, 1000 random number pairs were
generated with bit widths of 8, 12 and 16. Table 2.4 gives the results in terms
of total Hamming distance between consecutive data values and total number
of adjacent signals toggling in opposite direction, in both the busses. As can
be seen from the results the proposed scheme saves more than 25% power in
the two input data busses of the adder and also results in power savings in the
adder itself.
Programmable DSP based Implementatiofl 33
Data Read
Address
Register
Pro gram
Counter Data Write
Program/ Address I---+---N Data
Register Memory
Coefficicnt
Memory
CPU
Figure 2.18. Sehe me far Reducing Power in the Adder Input Busses
Table 2.4. Power Optimization Results Using Input Bit Swapping for 1000 Random Number
Pairs
Hamming Distanee Adjaeent Signal Toggles

Initial Final %reduetion Initial Final %reduetion
8 bit da ta 7953 5937 25.3% 1836 1090 40.6%
12 bit data 11979 8925 25.5% 2766 1791 35.2%
16 bit data 15945 11865 25.6% 3545 2170 38.8%
2.3.4. Swapping Multiplier Inputs

Since the power dissipation in a Booth multiplier depends on the number
of I s in the Booth encoded input, the coefficient and data inputs to the mul-
tiplier can be appropriately swapped so as to reduce power dissipation in the
multiplier. The results presented in [36] indicate that the amount of reduction
is dependent on the data values, and in worst case can result in increase in the
power dissipation. If the coefficients to be swapped follow a regular pattern,
such selective input swapping can be supported as an enhancement to the repeat
MAC instruction.
X[O] X[I] X[2] X [N/2-2] X[N/2-1]

X[N-2] X[N-3]
Figure 2.19. Data Flow Graph of a Weighted-sum Computation with Coefficient Symmetry
2.3.5. Exploiting Coefficient Symmetry

In case of some DSP kemels such as linear phase FIR filters, the coefficients
of the weighted-sum computation are symmetrie. This property can be used to
reduce by half the number of multiplications per output. For N being even, the
weighted-sum equation given by:
N-l
Y = L A[i] . Xli] (2.8)
i=ü
can be written as:

N/2-1
Y = L A[i]· (X[i] + X[N - i-I]) (2.9)
i=O
The corresponding data f10w graph is shown in figure 2.19. While the core
computation in equation 2.9 is also multiply-accumulate, the coefficient is mul-
tiplied with the sum of two input sampies. The architectures such as shown in
figure 2.1, do not support single cycle execution of this computation. While
it is possible to compute data sum and use it to perforrn MAC, the resultant
code would require more number of cycles and more number of data memory
accesses than the direct implementation of equation 2.8, that ignores coefficient
symmetry.
Figure 2.20 shows a suitable abstraction of the datapath of the TMS320C54X
DSP [97] that supports single-cycle execution (FIRS instruction) ofthe multiply-
accumulate computation of equation 2.9.
This architecture has an additional data read bus which enables fetehing the
coefficient and the two data values in a single cycle. It's datapath has an adder
Data Read
Address
Program Register I Data
Counter
Program/ Data Write
Address l~====~-4
~ Memory
Register
Coefficient Data Read
Memory Address
Register 2
CPU
Figure 2.20. Suitable Abstraction of TMS320C54x Architecture for Exploiting Coefficient

Symmetry
and a MAC unit, so that the sum of the input data sam pIes and the multiply-
accumulate operation can be performed simultaneously in a single cycle. Since
the computational complexity of equation 2.9 is lesser than that of equation 2.8,
the corresponding implementation of equation 2.9 is significantly more power
efficient.
2.4. Techniques for Low Power Realization of FIR Filters

FIR filtering is achieved by convolving the input data sampIes with the desired
unit impulse response of the filter. The output Y[n] of an N tap FIR filter is
given by the weighted sum of latest N input data sampIes (equation 2.10).
N-l
Y[n] = L A[i] . X[n - i] (2.10)
i=ü
The corresponding signal ftow graph is shown in figure 2.21.

X(Z)
Y(Z)
Figure 2.21. Signal Flow Graph of a Direct Form FIR Filter
The weights (A[i» in the expression are the filter eoeffieients. The number
of taps (N) and theeoeffieient values are derived so as to satisfy the desired filter
response in terms of passband ripple and stopband attenuation. Unlike UR fil-
ters, FIR filters are all-zero filters and are inherently stable [73]. FIR filters with
symmetrie eoeffieients (A[i] = A[N-I-i» have a linear phase response [73] and
are henee an ideal ehoiee for applieations requiring minimal phase distortion.
While the teehniques deseribed in the earlier two seetions ean be applied in
the eontext ofFIR filters, this seetion deseribes additionallow powertechniques
specifie to FIR filters.
2.4.1. Circular Buffer

In addition to the weighted sum computation, FIR filtering also involves
updating the input data sampIes. For an N tap filter, the latest N data sampIes
are required. Hence the latest N sampIes need to be stored in the data memory.
After every output computation, a new data sampIe is read and stored in the
data memory and the oldest data sampIe is removed. A data sampIe X[k] for the
eurrent eomputation becomes data sampIe X[k-l] for the next FIR computation.
Thus in addition to accepting the new data sampIe, the existing data sampIes
need to be shifted by one position, for every output. One way of achieving this
is to read each data sampIe in the order X[n-N+2] to X[n] and write it in the
next memory loeation. Thus for an N tap filter, this approach of data movement
requires N-I memory writes.
The power related to data movement ean be minimized by eliminating these
memory writes. This can be aehieved by eonfiguring the data memory as a
circular buffer [96] where instead of moving the data, the pointer to the data is
moved.
2.4.2. Multirate Architectures

Multirate architectures involve implementing an FIR filter in terms of its dec-
imated sub-filters [66]. These architectures can be derived using Winograd's
algorithms for reducing computational complexity of polynomial multiplica-
tions. Consider the Z-domain representation of the FIR filtering algorithm :
Y (Z) = H (Z) . X (Z), where H(Z) is the filter transfer function, X(Z) is the
input and Y(Z) is the output.
Far an N tap filter, H(Z) = 2:~ol A[i] . Z-i, where A[i]'s are the filter
coefficients. H(Z) can be decimated into two interleaved sequences by grouping
odd and even coefficients as folIows:
7- 1 7- 1
H(Z) L A[2k]· Z-2k + L A[2k + 1]· Z-(2k+1)
k=O k=O
7- 1 7- 1
L A[2k]· (Z2)-k + Z-l . L A[2k + 1]· (Z2)-k
k=O k=O
H O(Z2) + H l (Z2) . Z-l (2.1 I)
The input X(Z) can be similarly decimated into two sequences X o (all even
sampIes) and Xl (all odd sampIes) such that X(Z) = X O(Z2) + Xl (Z2). Z-l
The filtering operation can now be represented as :
(Ho + H l . Z-l) . (X o + Xl· Z-l) =} Co + Cl· Z-l + C 2 · Z-2 (2.12)
Using Winograd's polynomial multiplication algorithm, Co, Cl and C 2 can be

computed as
Co Ho· X o,
H 1 ·Xl ,
(Ho + Hd . (X o + Xd - Co - C 2 (2.13)
Cl gives the output sub-sequence Y l . Since C 2 sam pIes 'overlap' with Co, they
need to be added, with apprapriate shift, to Co to get the output sub-sequence
Y o.
It can be noted that Co, Cl, C 2 computation involves filtering of the decimated
input sequences using the decimated sub-filters. An N tap FIR filtering is thus
achieved using three (N/2) tap FIR filters. The signal ftow graph ofthe resultant
FIR architecture is shown in figure 2.22. The architecture processes two input
sampIes simultaneously to produce the corresponding two outputs.
2.4.2.1 Computational Complexity of Multirate Architectures

Fram the signal ftow graph shown in figure 2.21, an N tap direct form FIR
filter requires N multiplications and (N-I) additions per output.
Figure 2.22. One Level Decimated Multirate Architecture
Table 2.5. TMS320C2x Code for Direct Form Architecture
NXTPT IN XN,PAO ; bring in the new sampIe

LRLK AR1,XNLAST ; point to X(n-(N- I»
LARP ARI
MPYK 0 ; cJear product register
ZAC ; cJear accumulator
RPTK NMl ; loop N times
MACD HNMl,*- ; multiply, accumulate
APAC
SACH YN, I
OUT YN, PAI ; output the filter response y(n)
B NXTPT ; branch to get the next sampie
Consider the multi rate architecture shown in figure 2.22. Assuming even
number of taps, each of the sub-filters is of length (N/2) and hence requires N/2
multiplications and (N/2)-1 additions. There are four more additions required
to compute the two outputs YO and Yl. This architecture hence requires 3N/4
multiplications per output wh ich is less than the direct form architecture for all
values of N and requires (3N+2)/4 additions per output which is less than the
direct form architecture for ((N - 1) > (3N + 2)/4) i.e. (N > 6).
2.4.2.2 Multirate Architecture on a Programmable DSP

While the multi rate architecture in not as regular as the direct form structure,
it has significantly reduced computational complexity and partial regularity in
terms of the decimated sub-filters.
Table 2.5 shows the implementation of a direct form FIR filter on TMS320C2x
[40, 95]. The coefficients are stored in the program memory and the data is
stored in the data memory. This imp\ementation requires N+ 19 cycles to com-

pute one output of an N tap FlR filter.
The TMS320C5x [96] which is the next generation fixed point DSP has some
additional instructions that help optimize the FlR implementation further. Its
features such as RPTZ instruction that clears P register and Accumulator before
RPTing, delayed-branch instruction and memory mapped output can be used
to implement the FlR filtering algorithm that requires N+14 cycles per output.
Table 2.6 shows the implementation ofthe multi rate architecture using 'C2x.
It can be noted that this code is actually a 'C5x code that uses 'C2x compatible
instructions. The only exception is the accumulator buffer related instructions
SACB and SBB, which are not supported in 'C2x. This implementation re-
quires (3N+82)/4 cycles per output computation of an N tap FlR filter. The
corresponding C5x implementation requires (3N+70)/4 cycles per output com-
putation.
Thus in case of TMS320C2x based implementation, the multi rate architec-
ture needs lesser cycles for FIR filters with (N > 6). In case of TMS320C5x
based implementation the multirate architecture needs lesser cycles for FlR
filters with (N > 14).
The power reduction due to multi rate architecture based FIR filter imp\e-
mentation can be analyzed as folIows. Since the multirate architecture requires
fewer cycles, the frequency can be lowered using the following relationship :
fmultirate/ fdirect = (3N + 82)/(4 x (N + 19)) (2.14)
With the lowered frequency, the processor gets more time to execute the instruc-
tion. This time-slack can be used to appropriately lower the supply voltage,
using the relationship given in the following equation :
delay a V dd/(V dd - VT)2 (2.15)
Since most programmable DSPs [95, 96, 97] are implemented using a fully
static CMOS technology, such a voltage scaling is indeed possible.
In terms of capacitance, the main computation loop in the direct form re-
alization requires N multiplications, N additions and N memory reads. The
multirate implementation has three computation loops corresponding to the
three sub-filters. These loops require 3N/4 multiplications, 3N/4 additions and
3N/4 memory reads per output. Based on this observation,
CtotaLmultirate / CtotaLdirect ~ 0.75
Cmultirate/Cdirect ~ (0.75 x 4 x (N + 19))/(3N + 82)
Based on this analysis, für a 32 tap FIR filter,
fmultirate/fdirect = (3 x 32 + 82)/4 x (32 + 19) = 0.87
Table 2.6. TMS320C2x Code for the Multirate Architecture
NXTPT IN XON,PAO ; read new even and odd sampies

IN XIN, PAO
LACL XON
ADD XIN
SACL XOIN ; compute XO+XI
LRLK AR I ,XONLAST ; point to XO(n-(N/2-1))
LARP ARI
MPYK 0 ; clear product register
LACL YHI ; load X I *H I from previous iteration
RPTK NBY2 ; loop (N/2+ I) times
MACD HON, *- ; compute XO*HO
SACH YON, I
OUT YON, PAI ; output YO sampie 01' filter response
SUB YHI ; YH I stores X I *H 1 output from earlier iteration
SACB ; accumulator --+ accumulator buffer
ZAC
RPTK NBY2Ml ; loop N/2 times
MACD HIN,*- ; multiply, accumulatc to compute X I *H I
SACH YHI ; result stored in YH I for use in the next iteration
NEG ; negate the accumulator
RPTK NBY2M2 ; loop (N/2-1) times
MACD H01,*- ; compute (XO+XI)*(HO+HI)
APAC
SBB ; subtract accumulator butTer from accumulator
SACH YIN, I
OUT YIN,PAI ; output Y I sampie 01' filter response
B NXTPT
For this lowering of frequency, based on equation 2.15, the voltage can be
reduced from 5 volts to 4.55 volts.
Pmultirate Cmultirate X ( V mu l tirate)2 X fmultirate

Pdirect Cdirect Vdirect f direct
= 0.75/0.87 X (0.91)2 x 0.87 = 0.62
Thus using the multirate architecture, the power dissipation of a 32 tap FIR filter
implemented on the TMS320C2x processor can be reduced by 38%. Similar
analysis for TMS320C5x processor based implementation shows the power
reduction by 35%.
Figure 2.23 shows power dissipation as a function of number of taps for the
multirate FIR filters implemented on TMS320C2x. The power dissipation is
normalized with respect to the direct form FIR structure. As can be seen from
0.8 ,------,------,------r----,---.---------,
0.75
.Q 0.7
.,~
."
0
0
~ 0.65
0-
.,<U
~
"
0.6
Z
0.55
0.5 ' - - - - - - - - ' - - - - - - - - ' - - - - - - - ' - - - - - - ' - - - ' - - - - - - - - '

10 20 30 40 50 60 70
Number of taps
Figure 2.23. Normalized Power Dissipation as a Function of Number of Taps for the Multirate
FIR Filters Implemented on TMS320C2x
the figure, the power dissipation reduces with increasing order of the filter. The
power savings can be as much as 40% for filters with >42 taps.
2.4.3. Architecture to Support Transposed FIR Structure

The signal flow graph shown in figure 2.21 can be transformed using the
transposition theorem [73] into a signal flow graph shown in figure 2.24. While
the computational complexity of the transposed structure is same as the direct
form structure (figure 2.21), it involves multiplying all the coefficients by the
same input data (X[n». This structure is hence called the Multiple Constant
Multiplication (MCM) structure. Thus throughout the filter computation, one
of the multiplier inputs is fixed, resulting in significant power saving in the
multiplier compared to the direct form implementation. While the transposed
form does not require movement of old data sampIes, it needs to store the results
of the intermediate computations for future use. Since the intermediate results
have higher precision (precision of the coefficient + precision of the data), wider
busses are required to support single-cycle multiply-add operations.
It can be noted that the architecture shown in figure 2.1 cannot support
efficient implementation ofthe MCM based FIR filters. An architecture shown
in figure 2.25 has been proposed to support N cycJe implementation of an N tap
transposed filter. The increased capacitance due to the wider readlwrite busses
and an extra write performed every cycJe result in increased power dissipation
compared to the direct form implementation. This increase typically outweighs
the saving in the multiplier power, thus making such an architecture overallless
power efficient.
X(Z) -,----------,---------,-----------,---------~
A[N-I] A[N-2] A[N-3] A[l] A[O]
-I
Z
Y(Z)
Figure 2.24. Signal Flow Graph of the Transposed FIR Filter
J)ata
Pn)gramJ
MCI11()ry
CoclTicicnt
Memory
I",,, -----------"
Figure 2.25. Architecture to Support Efficient Implementation ofTransposed FIR Filter
All the approaches presented so far assume a given set of coefficient values
that meet the desired filter response. The following two techniques optimally
modify the filter coefficients such that they result in power reduction while still
meeting the desired filter response.
2.4.4. Coefficient Scaling

Far an N tap filter the output Y[n] is given by equation 2.10. Scaling the out-
put preserves the fi Iter characteristics in terms of passband ripple and stopband
attenuation, but results in an overall magnitude gain equal to the sc ale factor.
For a scale factor K, equation 2.10 translates to the following:
N-l N-l
K· Y[n] = K· L (A[i]· X[n - i]) = L ((K· A[i]) . X[n - i]) (2.16)
i=ü i=ü
Thus the coefficients of the scaled filter are given by (K . A[i]). Given the
allowable range of scaling (e.g. ±3db), an optimal scaling factor K can be
found such that the total Hamming distance between consecutive coefficient
values is minimized. This technique thus reduces the power dissipation in the
coefficient memory data bus and also the multiplier.
Due to the finite precision effects, the scaled coefficients may in some cases
violate the filter characteristics. This can be avoided by scaling the full pre-
cision coefficients and then quantizing them to the desired number of bits. It
is verified that the scaled coefficients satisfy the desired filter characteristics
before accepting them.
2.4.5. Coefficient Optimization

Coefficient optimization has been discussed in the literature primarily in
the context of designing finite wordlength FIR filters [32]. The algorithms
presented in [105] and [83] address the design of FIR filters with powers-of-
two coefficients. The algorithm presented in [28] minimizes the total number
of 1s in the 2's complement representation of the coefficients. These techniques
aim at efficient multiplierless implementation of FIR filters.
2.4.5.1 Coefficient Optimization - Problem Definition

The coefficient optimization problem can be stated as folIows: Given an
N tap FIR filter with coefficients (A[i], i=O,N-l) that satisfy the response in
terms of passband ripple, stopband attenuation and phase characteristics (linear
or non-linear), find a new set of coefficients (LA[i], i=O, N-l) such that the
total Hamming distance between successive coefficients is minimized while
still satisfying the desired filter characteristics in terms passband ripple and
stopband attenuation. Also retain the linear phase characteristics if such an
additional constraint is specified.
2.4.5.2 Coefficient Optimization - Problem Formulation

The coefficient optimization problem can be formulated as a local search
problem, where the optimum coefficient values are searched in their neighbor-
hood. This is done via an iterative improvement process. During each iteration
one or more coefficients are suitably modified so as reduce the total Hamming
distance while still satisfying the desired filter characteristics. The optimization
process continues till no further reduction is possible.
The coefficient optimization can be performed either on the initial coefficient
values or on the coefficient values that are uniformly scaled using the approach
mentioned earlier. One coefficient is perturbed in each iteration of the opti-
mization process. In case of an additional requirement to retain the linear phase
characteristics, the coefficients are perturbed in pairs (A[i] and A[N-I-i]) so as
to preserve the coefficient symmetry.
The selection of a coefficicnt for perturbation and the amount of perturbation

has direct impact on the overall optimization quality. Various strategies can be
adopted far coefficient perturbation [32]. These include 'steepest descent' and
'first improvement' strategics. While in the steepest descent approach, the
best coefficient perturbation is selected at every stage, in the first improvement
approach, the first coefficient perturbation that minimizes the cost is selected.
The quality of the first improvement approach depends on the order in which
the coefficients are selected.
The following subsections present the details of various components of the
algorithm and follow it up with the overall algorithm.
2.4.5.3 Coefficient Optimization Aigorithm - Components

Computing filter response
During each iteration, the filter characteristics need to be computed for every
coefficient perturbation. Thc overall computational efficiency of the algorithm
thus depends on how fast thc filter characteristics (passband ripple and stopband
attenuation) can be derived. The frequency response of a filter can be computed
by taking Fourier Transform of its unit impulse response [73]. In case of
FIR filters, the filter coefficient values form the unit impulse response. The
frequency response of an FIR filter can thus be computed by taking Fourier
transform of a sequence cünsisting of the filter coefficients padded with Os.
The Fourier transform can be efficiently computed using the radix-2 FFT (Fast
Fourier Transform) technique. The number of Os to be padded are decided
based on the desired resolution (no. of frequency points) and also to make the
total number of points apower of 2. As a thumb rule, the number of coefficients
can be multiplied by 8 and the nearest higher power of 2 picked to decide on
the total number of points für FFT computation.
Given the passband and stopband frequency ranges, the pass band ripple and
stopband attenuation can be computed by analyzing the frequency response.
Coefficient perturbation
During coefficient perturbation, the selected coefficient (A[i]) is modified
in such a way that the Hamming distance between the new (A[i]) value and
its adjacent coefficients (Ali-I], A(i+l]) is less than the Hamming distance
between the current A[i] value and the adjacent coefficients. The change in the
coefficient value needs to be as small as possible, so as to minimally impact
the filter characteristics. Thc neighborhood of the selected coefficient (A[i]) is
searched so as to find the ncarest higher and the nearest lower coefficients that
reduce the Hamming distance. The maximum allowable difference between
the perturbed coefficient and the original coefficient can be controlled so as to
focus on lower significant bits during the optimization.
Coefficient selection
In case of the steepest descent strategy, for every coefficient its nearest higher
and nearest lower coefficient values are identified. A new set of coefficients
can be fonned by replacing one of the coefficients with its nearest higher or
nearest lower value. This approach is used to generate 2N sets of coefficients
for an N tap filter, during each iteration of the optimization process. From the
2N sets of coefficients, the coefficient set that maximizes the gain function is
selected and is used as the current set of coefficients for the next iteration. The
gain function , is computed as follows
, = Tolerance . H D red
Tolerance = (Pdb req - Pdb)/ Pdb req + (Sdb - Sdbreq )/ Sdbreq
where
H D red is the reduction in the total Hamming distance for the new set of coeffi-
eients compared to the total Hamming distance for the current set of coefficients,
Pdb req is the desired passband ripple,
Sdb req is the desired stopband attenuation,
Pdb is the passband ripple of the new set of coefficients, and
Sdb is the stopband attenuation for the new set of coefficients.
In case of the additional requirement of retaining the linear phase response the
filter coefficients are perturbed in pairs (A[i], A[N-i-l]) to maintain symmetry.
Thus for an N tap filter, N differen( sets of coefficients are generated during
each iteration and the set that maximizes the gain function , is selected.
In case of first improvement strategy, the optimization quality depends on
the order in which the coefficients are perturbed. The coefficient order is ran-
domized and for aselected coefficient whether to search of nearest higher or
nearest lower value is also selected randomly. During each iteration, the first
perturbation that reduces the Hamming distance and satisfies the filter charac-
teristics is accepted and is used to form the current set of coefficient for the next
iteration. The dependence on the coefficient order is minimized by generating
5 or 10 different coefficient orders and selecting the one that results in least
total Hamming distance.
2.4.5.4 Coefficient Optimization Algorithm

Here is the overall coefficient optimization algorithm that uses the steepest
descent strategy and assumes no linear phase constraint.
Procedure Optimize-CoeJficientsJor-Low-Power
Inputs: Low pass filter characteristics in tenns of passband ripple Pdb req and
stopband attenuation Sdb req . An initial set of N fi Iter coefficients A[O] to A[N-
I] that meet the specified filter response.
Output: An updated set of filter coefficients A[O] to A[N-I] which minimize
total Hamming distance between successive coefficient values and still meet
the desired filter characteristics.
repeat {
for each coefficient A[i] (i=O,N-l) {
Find a coefficient value A[i+] such that :
(HD(A[i],A[i-l])+HD(A[i],A[i+ 1])) > (HD(A[i+],A[i-l])+
HD(A[i+], A[i+ 1])) and (A[i+] - A[i]) is minimum
Generate a new set of coefficients by replacing A[i] with A[i+]
Compute the passband ripple (Pdb i+) and the stopband attenuation (Sdbi+)
if (Pdb i+ < Pdb req ) and (Sdb i+ > Sdbreq ) {
Find the tolerance given by
Toli+ = (Pdbreq - Pdbi+)/Pdbreq + (Sdbi+ - Sdbreq)/Sdbreq
} else { Toli+ = 0; }
Find a coefficient value A[i-] such that :
(HD (A[i], A[i-l]) + HD (A[i], A[i+ 1])) > (HD (A[i-], A[i-l]) +
HD(A[i-], A[i+ 1])) and (A[i] - A[i-]) is minimum
Generate a new set of coefficients by replacing A[i] with A[i-]
Compute the passband ripple (Pdb i -) and the stopband attenuation (Sdb i -)
if (Pdb i - < Pdb req ) and (Sdb i - > Sdbreq ) {
Find the tolerance given by
Tol i _ = (Pdbreq - Pdbi-)/Pdbreq + (Sdbi- - Sdbreq)/Sdbreq
} else { Tol i _ = 0; }
}
Find the coefficient value among A[i+]'s and A[i-]'s for which
the gain function "( given by (Tolerance· H Dreduciion) is maximum.
if (, > 0) {
Replace the original coefficient with the new value
} else { Optimization_possible = FALSE }
}until (! Optimization_possible)
The above algorithm can be easily modified to handle the additional require-
ment of retaining the linear phase characteristics. This can be achieved by
modifying both A[i] and A[N-l-i] with A[i+] (and later with A[i-]) to generate
the new set of coefficients, and searching only the first (N+ 1)/2 coefficients
during each iteration.
The 'first improvement' approach based version of the algorithm uses a ran-
dom number generator to pick a coefficient (A[i]) for perturbation and also to
decide whether A[i+] or A[i-] value needs to be considered. The new coefficient
value is accepted if the new values of passband ripple and stopband attenua-
tion are within the allowable limits. The optimization process stops when no
coefficient is perturbed for the specified number of iterations.
The techniques of coefficient scaling and coefficient optimization were ap-
plied to the following six low pass FIR filters.
Table 2.7. Harnrning Oistance and Adjacent Signal Toggles After Coefficient Scaling Followed
by Steepest Oescent and First Irnprovernent Optirnization with No Linear Phase Constraint
No linear phase constraint

Initial Scaling + Scaling +
FIR Filter Steepest descent First irnprovernent %red
best of 10 randorn W.f.t. initial
HO Togs HO Togs HO Togs HO Togs
FrRI 372 58 298 19 298 19 19.9% 67.2%
FIR2 180 50 118 12 114 8 36.7% 84.0%
FIR3 292 44 258 25 256 24 12.3% 45.5%
FIR4 214 44 138 6 138 5 35.5% 88.6%
FIR5 258 36 168 14 168 16 34.9% 61.1%
FIR6 220 16 156 12 156 12 29.1% 25.0%
FIRI: Ip_16K_3KAK_. L62_50

FIR2: Ip_16L3KA.5K_.2A2_24
FIR3: Ip_l OK_1.8K_2.5K_.I 5_60A I
FIR4: Ip_12K_2K_3K_.12A5_28
FIR5: Ip_l2K_2.2K_3.1 K_.16A9_34
FIR6: Ip_lOK_2K_3K_0.05AO_29
These filters vary in terms of the desired filter characteristics and conse-
quently in the number of coefficients. These filters have been synthesized using
the Park-McCJellan's algorithm for minimum number of taps. The coefficient
values quantized to 16 bit 2's complement fixed point representation form the
initial sets of coefficients for optimization. Tables 2.7, 2.8 and 2.9 give results
in terms of total Hamming distance (HD) and total number of adjacent signals
toggling in opposite direction (Togs) for different optimization strategies. The
names of the filters indicate the filter characteristics. For example, the FIR2
filter Ip_16L3KA.5K_.2AL24 is a low pass filter with the following charac-
teristics :
Sampling frequency = 16000
Passband frequency = 3000
Stopband frequency = 4500
Pass band ripple = 0.2db
Stopband attenuation = 42db
Number of coefficients = 24
Figure 2.26 shows the frequency domain characteristics of the 24 tap FIR
fi lter for three sets of coefficients corresponding to the initial solution, optimized
with no linear phase constraint and optimization with linear phase constraint.
Table 2.8. Hamming Oistance and Adjacent Signal Toggles After Coefficient Scaling Followed
by Steepest Oescent and First Improvement Optimization with Linear Phase Constraint
With linear phase constraint

Scaling + Scaling +
FIR Filter Steepest descent First improvement %red
best of 10 random w.r.l. initial
HO Togs HO Togs HO Togs
Ip_16K3KAK_. L62_50 302 20 302 20 18.8% 65.5%
Ip_16L3KA.5L.2A2_24 118 14 114 10 36.7% 80.0%
Ip_IOK_I.8K_2.5K_.15_60AI 264 28 264 20 9.6% 54.5%
Ip_12K_2K_3K_.12A5_28 140 8 136 6 36.4% 86.4%
Ip_12K_2.2K3.1 K_.16A9_34 178 16 170 14 34.1% 61.1%
Ip_IOK_2K_3K_0.05AO_29 154 10 154 14 30.0% 37.5%
Table 2.9. Hamming Oistance and Adjacent Signal Toggles for Steepest Oescent and First Im-
provement Optimization with and without Linear Phase Constraint (with No Coefficient Scaling)
No linear phase constraint

No Scaling + No Scaling +
FIR Filter S tee pest descent First improvement %red
best of 10 random W.f.l. initial
HO Togs HO Togs HO Togs
Ip_16K_3KAL 1_62_50 308 21 312 33 17.2% 63.8%
Ip_16L3KA.5L2A2_24 126 14 126 12 30.0% 76.0%
Ip_IOK_I.8K_2.5K_.15_60A1 252 28 252 30 13.7% 36.4%
Ip_12K_2K_3K_.12A5_28 154 20 152 19 29.0% 56.8%
Ip_12K_2.2K3.1 K_.16A9_34 204 26 198 24 23.6% 33.3%
Ip_lOL2K_3K_0.05AO_29 170 7 170 5 22.7% 68.8%
With linear phase constraint
Ip_16L3KAL 1_62_50 316 30 312 26 16.1% 55.2%
Ip_16L3KA.5L2A2_24 136 20 136 22 24.4% 60.0%
Ip_IOK_I.8L2.5K_.15_60AI 260 30 260 30 10.9% 31.8%
Ip_12L2K3L 12A5_28 154 24 154 22 28.0% 50.0%
Ip_12K_2.2K_3.1 L.16A9_34 210 30 210 28 18.6% 22.2%
Ip_10K_2K_3K_0.05AO_29 180 6 176 6 20.0% 62.5%
The results show that the algorithm using both scaling and coefficient opti-
mization with no linear phase constraint results in upto 36% reduction in the
total Hamming distance and upto 88% reduction the total number of adjacent
signal toggles. Similar savings are achieved even with the linear phase con-
straints.
-20
D
"0
-40
"
"<ij
"
-60
1000 2000 3000 4000 5000 6000 7000 8000

Frequency (Hz)
-40 . - - - - , - - - - - - - - , - - - - , - - - - - - - , - - - - - , - - - - ,
-45
-50
-55
-60
-65
-70 L-J._"----_ _ _---"-_ _ _ _-'---_ _ _---"-_ _ _ _...LL--'---'

4600 4800 5000 5200 5400
Frequency (Hz)
Figure 2_26 Frequency Domain Characteristics of a 24 Tap F1R Filter Before and After Opti-
mization
The coefficient optimization algorithm alone (i.e. with no coefficient scal-

ing) with no linear phase constraints results in upto 29% reduction in the total
Hamming distance and upto 68% reduction in the total number of adjacent
signal toggles. Similar savings are achieved with the linear phase constraints.
Further analysis of the results show that the combination of scaling and
coefficient optimization results in higher reduction (except in one case) in the
power dissipation measures than using coefficient optimization alone. The
l+Öl~----'
1- Ö1 1 - - - - - - - :
------------------------ -------------- ~------~
ffi
Figure 2.27. Low Pass Filter Specifications
additional constraint of retaining linear phase marginally impacts the power

savings. The impact however is noticeable in case of coefficient optimization
with no scaling.
In terms of optimization strategies, the first improvement approach with best
of 10 random solutions performs as weIl or marginally better in most of the
cases. Experiments also indicate that the number of coefficient perturbations
evaluated in first improvement approach are significantly lesser than in steepest
descent approach. This is reftected in faster runtimes for the first improvement
optimization strategy.
2.4.5.5 Coefficient Optimization Using 0-1 Programming

Instead of the two phase approach of arriving at an initial solution and refin-
ing it for low power, the coefficient optimization problem can also be solved as
a coefficient synthesis problem with the additional constraint for power reduc-
tion. Here is a 0-1 programming formulation for synthesizing coefficients that
meet the desired characteristics of a linear phase FIR filter and have minimum
total Hamming distance in the 2's complement fixed point representation of
successive coefficients.
Given
Filter characteristics (figure 2.27) in terms of
Passband frequency: wp
Stopband frequency: Ws
Passband ripple: 28 1 between (1 - ( 1 ) to (1 + 8d
Stopband attenuation: 82
N the number of coefficients
B the number of bits of precision for the fixed point representation of the
coefficients
H n the value of the n'th coefficient
Variables
The coefficient bit values form the variables, where Cn,b is the value of the b'th
bit of the n'th coefficient and Cn,b E {O, I}
Objective Function
Let H Dn,b be the Hamming distance between bits Cn,b and C n+1 ,b given by
H Dn,b = Cn,b EB Cn+ 1 ,b
The same can be represented as - H Dn,b :S (Cn,b - C n+ 1 ,b) :S H Dn,b
The objective function then is to Minimize ~;:;;02 ~f==(/ H Dn,b
Constraints
The coefficient values can be computed as folIows:
°
H n = -Cn, + ~f==ll Cn ,b(2)-b
Given the filter coefficients, the magnitude response at a given frequency w can
be computed using the following equations:
ForN odd,
Fw = ~i~~1)/2 a[k]cos(wk)
where a[O] = H[(N-l)/2] and
a[k] = 2H[(N-l)/2 - k], k = 1,2, ... ,(N-l)/2
ForN even,
Fw = ~:~i a[k]cos(w(k - 1/2))
where b[k] = 2H[N/2 - k], k=1,2, .. . ,N/2
°
The frequency response should meet the following constraints:
For all frequencies w : :S w :S wp -+ (1 - 8d :S Fw :S (1 + 8d
For all frequencies w : Ws :S w :S 7r -+ F w :S 82
Any ofthe available 0-1 programming packages can be used to arrive at Cn,b
values that satisfy the filter characteristics and minimize the total Hamming
distance between successive coefficients.
2.5. Framework for Low Power Realization of FIR Filters

on a Programmable DSP
Various transformations presented in this chapter complement each other
and together operate at algorithmic, architectural, logic and layout levels of
design abstraction. Figure 2.28 presents a framework that encapsulates these
transformations in the form of a flow-chart. The figure also shows the amount of
power savings that each ofthe transformations achieves. As has been presented
in this chapter, the multirate architecture results in 40% overall reduction in

the power dissipation. Further reduction of 50% in the address busses, upto
80% reduction in the coefficient data bus and upto 25% reduction in the adder
input busses can be achieved using these transformations. The reduction in
the power dissipated in the busses, has also direct impact on the reduction
in the power dissipated in the multiplier and the adder. The framework thus
presents a comprehensive approach to low power realization of FIR filters on
programmable DSPs.
FIR Filter Specifications )

--------------~------------~
Filter Order higher (> 14) ?

Coefficient Synthesis
Increase in pro gram and data
Park-McClellan's algorithm
memory acceptable ?
No I Y=
t Reduced Computational
+/-3db gain acceptable ? Multirate Architecture - - - Complexity -> 40%
overall power reduction
No I Yes
t
Coefficient Scaling ,
,, Upto 35% reduction in the
t
coefficient data bus power ->
Architecture support for single ~ ,
, power savings in the multiplie
cycle multiply-add and multipIy-
Coefficient Optimization
subtract in a repeat loop ?
No
I y,
~
Architecture support for non-
sequential data addressing ?
Selective Coefficient Negation ,,
,,
No I Ye, Upto 88% reduction in the
Architecture extensions for bus-

t , coefficient data bus power ->
power savings in the multiplier
coding + area/delay overhead
acceptable ? ~ Coefficient Ordering f,'
No Yes
t Upto 50% reduction in address
Pipelined architecture + high GrayrrO coded addressing,
- - -
busses + power reduction in the
cap, busses feeding the ALU? Bus-invert coding
data busses
No I Ye,
t Upto 25% reduction in the adder
Embedded DSP with contra I ove
Adder input bits swapping --- input busses -> power reduction
routing of memory-cpu busses?
in the addder
No I Ye,
t Avg. 54% reduction in the cross

Bus bit re-ordering -- - coupling related power in the
coefficient data bus
t
Low Power FIR Filter on a Programmable DSP
Figure 2.28. Framework for Low Power Realization of FIR Filters on a Programmable DSP
Chapter 3
IMPLEMENTATION USING HARDWARE

MULTIPLIER(S) AND ADDER(S)
If an application's performance requirements cannot be met by a programm-

able DSP, one option is to use multiple processors. A more area-delay-power
efficient solution is to build a processor that is customized for the application.
Such a solution typical offers limited capabilities and limited programmability.
However, if these are adequate for the application and are of lesser importance
than the area-delay-power requirements, such a dedicated solution is the pre-
ferred option. The problem ofhardware realization from an algorithmic descrip-
tion of an application, has been extensively discussed in the literature. Many
silicon-compilers such as HYPER [14], CATHEDRAL [46] and MARS [103]
address this problem of high level synthesis. The domain of high level trans-
formations has also been weil studied and many such techniques have been
presented in the literature [14, 60]. The problem of high level synthesis for low
power has been addressed in [81] and the technique of using multiple supply
voltages for minimizing energy has been discussed in [16]. This chapter first
presents architectural transforms from high level synthesis domain and evalu-
ates the applicability of some of these techniques in the context of FIR filters
(weighted-sum computation in general). It then presents a detailed analysis of
multi rate architectures as a transform that reduces the computational complexity
of FIR fi ltering, thereby providing performance/power advantage.
3.1. Architectural Transformations

The architectural transforms can be classified into two categories - one that
alter the Data Flow Graph(DFG) but do not impact the computational complex-
ity and the other that are focussed on reducing the computational complexity.
The first type of transformations include pipelining, parallel processing, loop
unrolling and retiming [15, 76]. These techniques improve the performance of
the DFG. Altematively they enable reducing the supply voltage while main-
55

taining the same throughput, thus reducing the power which is proportional to
the square of the supply voltage. This however is achieved at the expense of
significant silicon area overhead. Thus for an implementation with fixed and
limited number of hardware resources, these techniques do not offer significant
advantage.
The architectures that reduce the computational complexity of FIR filters in-
clude block FIR implementations [41, 77] and multirate architectures [66]. The
algorithm for block FIR filters presented in [77] performs transformations on
the direct form state space structure. It reduces the number of multiplications
at the expense of increased number of additions. Since the multiplier area and
delay are significantly higher than the adder area and delay, these transforma-
tions result in low power FIR implementation. Block FIR filters are typically
used for filters with lower order. Their structures are not as regular as the direct
form structure. This results in controllogic overhead in their implementations.
The multirate architectures [66] reduce the computational complexity of the
FIR filter while partially retaining the direct form structure. These architec-
tures can hence enable low power FIR realization on a programmable DSP and
also as a dedicated ASIC implementation. The basic two level decimated mul-
tirate architecture was presented in the previous chapter, this chapter provides
a more detailed analysis of the computational complexity of various multi rate-
architectures and also evaluate their effectiveness in reducing power dissipation
of linear phase FIR filters.
Differential Coefficients Method [84] is another approach for reducing com-
putational complexity and hence the power dissipation in hardwired FIR filters.
The filter structure transformed using this method requires multiplication with
coefficient differences having lesser precision than the coefficients themselves.
Since the coefficient differences are stored for use in future iterations, this
method results in significant memory overhead.
3.2. Evaluating the Effectiveness of DFG Transformations

This section evaluates the effectiveness of various DFG transformations in
the context of a 4 tap FIR filter shown in figure 3.1.
Figure 3.2 shows the data flow graph implementing the filter using one mul-
tiplier and one adder. The delay of the multiplier is assumed to be 2T and the
delay of the adder is assumed to be T. As can be seen from the data flow graph,
the filter requires the delay of 9T per output computation.
The delay per output computation can be reduced by replacing the single
stage multiplier (with delay of 2T) by a two stage pipelined multiplier, with
each stage requiring the delay of T. The scheduled data flow graph using one
pipelined multiplier and one adder is shown in figure 3.3
As can be seen from the figure, with the pipelined multiplier the delay per
output reduces to 6T. If the throughput requirement is 9T per output, the clock
Implementation Using Hardware Multiplier( s) and Adder( s) 57
X[n]
A[O]
Y[n]
Figure 3.1. Direct Form Structure of a 4 Tap FIR Filter
_ ~!~]-! _~~\I_]- __~!:]

X[n] X[n-I] X[n-2] X[n-3]
_____ ~~] --- --__

CI
--- -*----- -- ---- -- --- --- ----
C2
----- ------- --------
C3
C4
CS ---r;t
C6 _________ ------~--~------~~--
C7
---------------- -------- -* -----
C8
C9
Y[n]
Figure 3.2. Scheduled DFG Using One Multiplier and One Adder
frequency of this implementation can be reduced by a factor ] .5. With the

increased clock delay, the supply voltage can be lowered resulting in power
reduction.
The loop-unrolling transform unrolls the FIR computation loop so as to
compute multiple outputs in one iteration. The effect of loop unrolling is similar
to the parallel processing FIR structure proposed in [60]. Figure 3.4 shows the
scheduled data-flow graph of the filter that has been unrolled once.
Y[n]
Figure 3.3. Scheduled DFG Using One Pipelined Multiplier and One Adder
As can be seen from the figure, with one level loop unrolling the delay per
output computation reduces to 5T, thus enabling further lowering of supply
voltage and hence further power reduction to achieve the throughput of 9T per
output.
Retiming has been presented in the literature as a transform that reduces the
critical path delay and hence the power dissipation. The direct form structure
shown in figure 3.1 has a critical path delay of 5T (three adders and one mul-
tiplier). In general, a direct form structure of an N tap filter has a critical path
delay of one multiplier and (N-l) adders. The re-timing transform has the same
effect as applying transposition theorem and results in the multiple constant
multiplication(MCM) structure shown in figure 3.5.
As can be seen from the figure this structure has a critical path delay of
one multiplier and one adder. While this critical path is significantly smaller
than the direct form structure, it can be truly exploited only if the filter is to be
implemented using many multipliers and adders.
Figure 3.6 shows the scheduled data flow graph of the re-timed filter using
one pipelined multiplier and one adder. As can be seen from the figure, this
structure has a delay of 5T which is marginally lesser than the delay of 6T for
the direct form structure shown in figure 3.2.
Implementation Using Hardware Multiplier( s) and Adder( s) 59
X[n-I] X[n-2] X[n-3] X[n-4]

A[I] A[2] A[3)
C2
C3
C4
CS
C9
CIO
Y[n] Y[n-I]
Figure 3.4. Loop Unrolled DFG Using I Pipelined Multiplier and I Adder
X[n] ----------------~------------~----------~
A[3] A[2] A[I] A[O]
Y[n)
Figure 3.5. Retimed 4 Tap FIR Filter
The delay per FIR filter output computation can also be reduced by using
multiple functional units. This can be considered as parallel processing at a
micro level. Figure 3.7 shows the scheduled data f10w graph of the direct form
structure that uses two pipelined multipliers and one adder.
CI---
A[O]
01'
V- -- --\- ----- ---

X[n] X[n]
-*- -- --- - --
I
A[ 1]
02'
[n]
A[2]
03'
X n]
A[3]
C2 __ \.8
C3 ________+_______ ---8-
C4
C5
________ + ----7----
---------+------>ts--
-------- --------- --------- ------ ----
Y[n] 01 02 03
Figure 3.6. MCM DFG Using One Pipelined Multiplier and One Adder
I J
---
X[n] X[n-l] X[n-2] X[n-3]
A[O] A[l] A[2] A[3]
---8---e-~-r
e- - -
Cl
C2 - - - - - - -1.------ *- -- -
C3 ________ ~__ _ ____/ _________ _
C4
------------------ --- -----------
CS ____________________ ~------------
Y[n]
Figure 3.7. Direct Form DFG Using Two Pipelined Multipliers and One Adder
Implementation Using Hardware Multiplier( s) and Adder(s) 61
01' 03'
A[O] A[2] A[3]
Y[n] 01 02 03
Figure 3.8. MCM DFG Using Two Pipelined Multipliers and Two Adders
As can be seen from the figure, with one more multiplier the delay per output
does reduce to 5T. It is also interesting to note that for the re-timed, MCM
based structure, the delay continues to be 5T even if two pipelined multipliers
are available. The parallelism inherent to this structure can be truly exploited
by using multiple multiplier-adder pairs. Figure 3.8 shows the schedule data
f10w graph for the MCM based structure using two pipeline multipliers and two
adders.
As can be seen from the figure, using two multiplier-adder pairs reduces the
delay to 4T. This analysis shows that the delay per output can be reduced by
using multiple functional units. This can be used to lower the supply voItage
and hence reduce the power dissipation, if the throughput requirement is same
as that achieved using one multiplier and one adder.
3.3. Low Energy vs Low Peak Power Tradeoff

Many of the transforms discussed above do not impact the computational
complexity of the filter. Thus the total capacitance switched per computation
is same across all the transformations. While the impact of these transforms
on energy per output and average power dissipation has been presented in the
literature, the impact on peak-power has not been considered. Since peak
power dissipation is also an important design consideration, the impact of these
transforms on this factor needs be looked at carefully.
This section specifically looks at the parallel processing transform. With a
degree of parallelism N, the number clock cycles required to perform a com-
.- 1.4
11
Z
"-
1.2 Peak P wer
.2
Cf)
co
"0
Q) 0.8
N
eil
0.6
E
0
c 0.4
Cf)
Q)
:J
eil
0.2 Energ
> o ~~ __ ~~L-~ _ _~~_ _~_ _L-~
o 2 345 6 7 8 9
Degree of Parallelism (N)
Figure 3.9. Energy and Peak Power Dissipation as a Function of Degree of Parallelism
putation is reduced by a factor of N. To achieve the same delay per output as

with degree one, the clock period can be increased by a factor of N. This can
be used to lower the supply voItage using the relationship
delay Cl< VDD/(VDD - VT)2 (3.1 )
With the degree of parallelism N, the amount of capacitance switched per cycle
goes up by a factor of N. Since the power is proportional to V 2 , the peak power
dissipation can be reduced only if the supply voltage is reduced by a factor of
VN. Figure 3.9 plots both the energy (or average power) and the peak power as
a function of degree ofparallelism N for VD D =3V and VT =0.7Y. As can be seen
from the figure, while the energy per output or the average power dissipation
reduces with increasing degree of the parallelism, the peak power dissipation
increases beyond N=4.
For a given degree of parallelism N, the following condition should be sat-
isfied for the peak power dissipation to be less than with degree one.
VDD
~------~ .N > VDD/VN
----~~------- (3.2)
(VDD - VT)2 ((VDD/VN) - VT)2
This gives the following relationship between VDD, VT and N.
(3.3)
Figure 3.10 plots this relationship as the lower limit on VDD/VT for no
increase in the peak power dissipation with the given degree of parallelism N.
lmplementation Using Hardware Multiplier(s) andAdder(s) 63
7 r---,--------,--------,--------,--------,----,
2 L -_ _L -_______ L_ _ _ _ _ _ ~ _ _ _ _ _ _ _ _L -_______ L_ _ ~
2 468 10
Degree of Parallelism (N)
Figure 3.10. Lower Limit of VDD/VT for Reduced Peak Power Dissipation as a Function of
Degree of Parallelism
Figure 3.//. One Level Decimated Multirate Architecture: Topology-l
3.4. Multirate Architectures

Section 2.4.2 presented the derivation of a multirate architecture using Wino-
grad's algorithm for reducing computational complexity of polynomial multi-
plications. Figure 3.11 shows the signal flow graph ofthe multirate architecture
that uses a decimation factor of two. The architecture processes two input sam-
pies simultaneously to produce the corresponding two outputs.
Since the input sequence is decimated by two, each of the sub-filters operate
at the half the frequency of the input sampies. Since different sections of the
XO(Z) ~ Y(Z)
X(Z)~
Yl(Z)
Xl(Z)
Figure 3.12. One Level Decimated Multirate Architecture: Topology - 11
filter operate at different frequencies, these architectures are called multirate

architectures.
The multirate architecture shown in figure 3.11 decimates the input by a factor
oftwo and also decimates the filterby a factoroftwo. The decimation factors for
the input and the filter can be same or different and can be integers higher than
two [66]. Each such combination results in a different multirate architecture.
The discussion in this chapter is restricted to the multirate architectures with
decimation factor oftwo for both the input and the filter. For a given decimation
factor, different architecture topologies are possible. Figure 3.12 shows one
such architecture that has the same decimation factor of two as the topology-I
architecture shown in figure 3.11. Itcan be seen that the topology-II architecture
(figure 3.12) requires two more additions than the topology-I architecture but
has sub-fi Iters with different transfer functions.
3.4.1. Computational Complexity of Multirate Architectures

3.4.1.1 Non-linear Phase FIR Filters
Figure 3.13 shows the signal flow graph(SFG) of a direct form FIR filter
with non-linear phase. As can be seen from the SFG an N tap filter requires N
multiplications and (N-l) additions per output.
Consider the topology-I multirate architecture. In case of even number of
taps, each of the sub-filters is of length (N/2) and hence requires N/2 multi-
plications and (N/2)-1 additions. There are four more additions required to
compute the two outputs YO and Y I. This architecture hence requires 3N/4
multiplications per output which is less than the direct form architecture for aB
values of N and requires (3N+2)/4 additions per output which is less than the
direct form architecture for ((N - 1) > (3N + 2)/4) i.e. (N > 6).
In case of odd numberoftaps, the filtercan be converted to an equivalent even
tap fi Iter by adding a coefficient of value O. This coefficient can then be dropped
from the decimated sub-filters. This results in two sub-filters (Ho and Ho + H 1 )
Implementation Using Hardware Multiplier(s) andAdder(s) 65
X(Z)
Y(Z)
Figure 3.13. Signal Flow Graph of a Direct Form FIR Structure with Non-linear Phase
X(Z)
Figure 3.14. Signal Flow Graph of a Direct Form FIR Structure with Linear Phase
of length (N+ 1)/2 and the third sub-filter (H 1 ) of length (N-I )/2. The multirate
architecture thus requires (3N+ 1)/4 muItiplications which is less than the direct
form architecture for all values of N and requires (3N+ 3)/4 additions per output
which is less than the direct form architecture for ((N -1) > (3N + 3)/4) i.e.
(N > 7)
3.4.1.2 Linear Phase FIR Filters

For linear phase FlR filters, the coefficient symmetry can be exploited to
reduce the number of multiplications in the direct form structure (figure 3.14)
by 50%. The direct form structure for linear phase FIR filter requires N/2
multiplications «N+I)/2 ifN is odd) and N-I additions.
Phase Characteristics of decimated sub-filters
This subsection analyzes the phase characteristics ofthe decimated sub-filters

of the topology-I multirate architecture for the linear phase FIR filter with even
number of taps.
For the filter H(Z) = L~Ol A[i] . Z-i , linear phase characteristics imply
A[i] = A[N - 1 - i] (3.4)
IfN is even, the three decimated sub-filters have the following Z-domain transfer
functions
~-l
Ho(Z) L A[2k] . (Z2)-k (3.5)
k=O
~-l
L A[2k + 1] . (Z2)-k (3.6)
k=O
~-l
(Ho + H1)(Z) = L (A[2k] + A[2k + 1]) . (Z2)-k (3.7)
k=O
The coefficient symmetry of the sub-filters can be analyzed using the relation-
ship in equation 3.4 to show that the sub-filters Ho abd H 1 do not have linear
phase and the sub-filter (Ho + Hd does have linear phase characteristics.
Computational Complexity - linear phase FIR filters with even number of taps
Since Ho and H 1 have non-linear phase, they require (N/2) multiplications
and (N/2)-l additions each. Since Ho + H 1 sub-filter has a linear phase, it
requires N/4 multiplications and (N/2)-l additions, if N/2 is even, and requires
(N+2)/4 multiplications and (N/2)-l additions, if N/2 is odd.
Thus the topology-I multi rate architecture requires per output 5N/8 multi-
plications and (3N+2)/4 additions if N/2 is even, and (5N+2)/8 multiplications
and (3N+2)/4 additions if N/2 is odd. In both the cases, the number of multi-
plications required are more than the direct form structure. The primary reason
for the multirate architecture requiring higher number of multiplications is the
fact that two of the three sub-filters have non-linear phase characteristics.
The topology-II multirate architecture has sub-filters with transfer functions
(Ho + H 1)/2, (Ho - Hd/2 and H 1. Since Ho + H 1 has linear phase, the sub-
filter (Ho + H 1)/2 also has linear phase characteristics. It can be shown that the
coefficients of (Ho - Hd/2 are anti-symmetrie (i.e. Ai = -A N - l - i ). This
sub-filter has hence the same computational complexity as (Ho + Hd/2. This
multirate architecture hence requires N/2 multiplications and (3N+6)/4 addi-
tions ifN/2 is even and needs (N+ 1/2) multiplications and (3N+6)/4 additions if
NI2 is odd. While this multirate architecture requires fewer multiplications than
the topology-I architecture, it is still not less than the number of multiplications
required by the direct form structure.
Table 3.1. Computational Complexity of Multirate Architectures
Filter Implementation Mults per o/p Adds per o/p

Non-linear phase
Direct Form N N-l
Multirate-I level N even 3N/4 (3N+2)/4
N odd (3N+ 1)/4 (3N+3)/4
Multirate-2 level N even N/2 even 9N/16 (9N+44)/l6
N/20dd (9N+6)/l6 (9N+50)/l6
N odd (N+ 1)/2 even (9N+5)/16 (9N+49)/l6
(N+l)/2 odd (9N+7)/l6 (9N+51 )/16
Linear phase
Direct Form N even N/2 N-I
N odd (N+ 1)/2 N-l
Multirate-I level N even N/2 even N/2 (3N+6)/4
N/20dd (N+ 1)/2 (3N+6)/4
N odd (N+ 1)/2 (3N+3)/4
Multirate-2 level N even N/2 even N/4 even 7N/l6 (9N+76)/l6
N/40dd (7N+8)/16 (9N+76)/l6
N/20dd (7N+I0)/16 (9N+ 70)/16
N odd (N+l)/2 even (N+l)/4 even (7N+7)/l6 (9N+57)/l6
(N+l)/40dd (7N+8)/16 (9N+57)/l6
(N+l)/20dd (N-I)/40dd (7N+13)/l6 (9N+59)/l6
(N-I )/4 even (7N+9)/l6 (9N+59)/l6
Thus in case of linear phase FIR filters, one level decimated multirate ar-
chitectures can at best require the same number of multiplications as the direct
form structure when N/2 is even. They require fewer number of additions for
((N - 1) > (3N + 6)/4) i.e. (N) 10).
Computational Complexity - linear phase FIR with odd number of taps
In case of linear phase filter with odd number of taps, it can be shown that
the sub-filters Ho and H 1 both have linear phase but the sub-filter Ho + H 1
has non-linear phase characteristics. Since Ho is of length (N+ 1)/2 and H 1 is
of length (N-I)/2, the two sub-filters together require (N+ 1)/2 multiplications.
The topology-I multi rate architecture hence require (N+ 1)/2 multiplications
and (3N+3)14 additions per output. Thus for the linear phase FIR filter with
odd number of taps, the one level decimated multi rate architecture can at best
require the same number of multiplications as the direct form structure. It
requires fewernumberofadditions for ((N -1) > (3N +3)/4) i.e. (N > 7).
The above analysis (summarized in table 3.1) demonstrates how the multirate
architectures can reduce the computational complexity of FIR filters. Each of
the sub-filters in the one level decimated architectures (shown in figure 3.11)
Figure 3.15. Signal Flow Graph of a Two Level Decimated Multirate Architecture
can be further decimated to further reduce the computational complexity of

FIR filters. Figure 3.15 shows the signal flow graph of a two level decimated
multirate architecture. Table 3.1 also presents the computational complexity
for two level decimated multirate architectures. It can be noted that the doubly-
decimated multirate architectures further reduce the computational complexity
of FIR filters.
3.5. Power Analysis of Multirate Architectures

3.5.1. Power Analysis for One Level Decimated Multirate
Architectures
The reduction in the computational complexity of multirate architectures can
be exploited to achieve low power realization of FIR filters while maintaining
the same throughput. In case of CMOS designs, capacitance switching forms
the main source of power dissipation.
The switching power is given by Pswitching = C· V 2 . f, where C is the
capacitance charged/discharged per c10ck cyc1e, V is the supply voltage and f
is the c10ck frequency [15].
The throughput of an FIR filter depends on the product of the c10ck period
and the number of cyc1es of computation per output. Since the multirate archi-
tectures require !esser number of cycles of computation per output, they can run
at higher c10ck period (lower frequency) while maintaining the same through-
put. The frequency ratio between the multirate and the direct form architectures
is given by the ratio of the total delay of the multiplications and the additions
per output for the respective architectures. Let 15 m be the delay of a multiplier
and let t5 a be the delay of an adder. The frequency ratio for an N tap non-linear
phase FIR filter is given by
f multi rate 3N/4 x 15m + (3N + 2)/4 x t5 a

(3.8)
f direct N x 15m + (N - 1) x t5 a
Implementation Using Hardware Multiplier(s) andAdder(s) 69
14
12
10
1.5 2.5 3.5 4.5

VDD
Figure 3./6. Normalized Delay vs Supply Voltage Relationship
The reduced frequency for the multirate architecture directly translates into its
lower power dissipation.
The lowering of the frequency has another important advantage. Since the
clock period is increased, the logic delays can be correspondingly higher without
affecting the overall throughput. In CMOS logic, supply voItage is one of the
factors that affects the delays. The delay dependence on supply voItage is given
by the following relationship
(3.9)
where VDD is the supply voItage and VT is the threshold voltage of the
transistor.
Figure 3.16 shows this delay vs V dd relationship for VT = O.8V. The delay
values are normalized with respect to the delay at V dd=Sy' Since the multirate
architectures allow higher logic delays, the supply voltage can be appropriately
lowered. This reduces the power proportional to the square of the reduction in
the supply voltage.
The analysis shown below assumes that the total capacitance charged/ dis-
charged per output is proportional to the total area of the multipliers and the
adders required to compute each output. Let Am be the area of a multiplier and
Aa be the area of an adder. For an N tap FIR filter with non-linear phase, the
total capacitance for the direct form structure is given by :
CtotaLdirect cx: (N x Am + (N - 1) x Aa ) (3.10)

The total capacitance for the multi rate architecture is given by :
CtotaLmultirate cx: (3N /4 x Am + (3N + 2)/4 X A a) (3.11)
The capacitance per cycle Cdirect for the direct form realization is hence given
by
Cdirect cx: (N x Am + (N - 1) X Aa )/ fdirect (3.12)
The capacitance per cycle Cmultirate for the multi rate architecture is given by
Cmultirate cx: (3N /4 x Am + (3N + 2)/4 X A a )/ fmultirate (3.13)
It can be noted that if the area ratio Am / A a is same as the delay ratio Om/Oa,
and f multirate is appropriately scaled to maintain the same throughput, the two
capacitance values Cdirect and Cmultirate are same.
3.5.1.1 Power Analysis - an Example

This subsection demonstrates the above analysis for a 32 tap FIR filter with
non-linear phase. Assuming Am / A a = Om/Oa = 8 and the direct form FIR
filter running at V dd=5V, the frequency ratio can be calculated as shown below:
fmultirate 32 x 3/4 x 8 + (3 x 32 + 2)/4
"----- = = 0.75 (3.14)
f direct 32 x 8 + 32 - 1
This implies that the delay can be increased by a factor of 1.33, which translates
into lowering ofvoltage by a factor ofO.82 (from 5V to 4.1 V). Since the area and
delay ratios between the multiplier and the adder are same, the Cmultir'ate re-
mains the same. Thus the power reduction using one-level decimated multirate
architecture is given by:
Pmultirate Cmultiro.te X ( VmUltirate)2 X fmultirate
Pdirect Cdirect Vdirect f direct
1 X (0.82)2 x 0.75 = 0.5 (3.15)
The above analysis shows that for a non-linear phase 32 tap FIR filter, the one-
level decimated multirate architecture (figure 3.11) results in 50% reduction in
the power dissipation.
The amount of power reduction using multirate architecture is mainly de-
pendent on the amount by which the frequency can be lowered. The lowered
frequency not only reduced power directly, but also enables reducing the voltage
which has a bigger impact on power reduction. The frequency ratio relationship
presented above indicates that the amount of frequency reduction is dependent
on the number oftaps and also on the delay ratio om/ 00.' Using this relationship,
it can be shown that frequency lowering is possible if (Om /Oa) > (6/ N - 1).
This relationship indicates that for N > 6 the frequency of the multi rate archi-
tecture can always be lowered independent of the (Om/oa) ratio.
0.9
0.8
c
0
~
0.
·iii 0.7
U)
Ci L_20
(j;
;: 0.6
0
(L
"0
<l.l
.~
co 0.5 NL 10
E
0
z
0.4
0.3
NL_20
0.2
0 5 10 15 20 25 30 35 40 45 50
Number 01 taps
Figure 3./7. Normalized Power Dissipation vs Number of Taps
3.5.1.2 Power Reduction Using MuItirate Architectures

This section presents results in tenns of power reduction using the multirate
architectures for dedicated ASIC implementation using a hardware multiplier
andan adder. Thepoweranalysis uses (Ami A a) = (Jm/J a) = 8. Thearearatio
has been selected to be same as the delay ratio so as to have Cdirect = Cmultimte
during power analysis. The actual values of the ratios vary depending on the
implementation styles used for the multiplier (e.g. Array, Booth, Wallace Tree
etc.) and the adder (e.g. Ripple Carry, Carry Look Ahead, Carry Select etc.).
While the amount of power saving differs with different values of the area and
delay ratios, the overall trend of reduced power dissipation very much holds
good.
Figure 3.17 shows the power dissipation as a function of number of taps for
the following implementations : (i). non-linear phase FIR fi Iter implementation
using one level decimated multirate architecture (graph labeled NL-I 0), (ii).
non-linear phase FIR filter implementation using two level decimated muItirate
architecture (graph labeled NL-2D and (iii). linear phase FIR filterimplemented
using two level decimated multirate architecture (graph labeled L-2D). The
power dissipation is nonnalized with respect to the power dissipation of direct
fonn realization.
72 VLSI SYNTHESIS OF DS? KERNELS
Table 3.2. Comparison with Direct Form and Block FIR Implementations
Direct Form Block FIR impl. Multirate-2 level

N *s +s ops area *s +s ops area *s +s ops area
4 4 3 7 35 2.5 6 8.5 26 2.25 5 7.25 23
5 5 4 9 44 3 8 11 32 3.25 6 9.25 32
6 6 5 11 53 3.5 10 13.5 38 3.75 6.5 10.25 36.5
7 7 6 13 62 4 12 16 44 4.25 7 11.25 41
The results show that the power dissipation reduces with increasing number
of taps in all the 3 cases.
For non-linear phase FIR implementation, one level decimated multi rate ar-
chitecture results in the power saving of upto 50%. The two level decimated
multirate architecture results in the power saving of upto 73%. This reduction
is more than the 64% power reduction achieved using the parallel processing
technique. The significant point to note is that the power reduction using mul-
tirate architecture requires no datapath area overhead compared to 240% area
overhead [15] in the parallel processing approach. It can be noted that the
multirate architectures however do result in the coefficient and data storage
overhead. For a non-linear phase N tap FIR filter, the multirate architecture
shown in figure 3.11 requires (N/2) more number of coefficient memory loca-
tions and (N/2) more number of data memory locations, compared to the direct
form implementation.
In case of linear phase FIR filters, since the one level decimated multi rate
architectures do not reduce the number of multipliers per output, the power
saving is primarily due to the reduction in the number of additions. For the
filter with N (number of taps)=20, the frequency can be lowered by 1.03 which
translates into the 7% power reduction. For higher values ofN power reduction
of upto 9% is achieved. Depending on the number of taps, the two level multi-
rate architectures use an appropriate combination oftopology-I and topology-II
to minimize the number of multiplications. The two level decimated multi rate
architectures result in upto 35% power reduction.
Power Reduction - Comparison with Other Architectures
Table 3.2 compares the doubly-decimated multirate architecture with the

direct form and the block FIR implementation presented in [77], in terms of
number of operations and area in equivalent number of adders. The area has
been computed assuming Ami Aa = 8.
Implementatioll Using Hardware Multiplier(s) and Adder(s) 73
As can be seen from the results, the doubly-decimated multi rate architecture
requires lesser number of operations and lesser area than the block FIR imple-
mentation. This shows the effectiveness of multirate architectures in reducing
power dissipation with minimal area-overhead.
Chapter 4
DISTRIBUTED ARITHMETIC BASED

IMPLEMENTATION
High speed digital filtering applications generally require dedicated hard-

wired implementation of the filters. Programmable processors can provide
high sampIe rates only with excessive amount of parallelism which may not be
cost effecti ve.
One approach to meet the performance requirements is to use dedicated hard-
wired solutions based on hardware multipliers, such as those discussed in the
previous chapter. The other approach is to use Distributed Arithmetic [13, 70,
104] based structures which enable high-speed multiplier-Iess implementations
of FIR filters. Distributed Arithmetic (DA) is a bit-serial computational op-
eration that forms an inner(dot) product of a pair of vectors in a single direct
step. This is achieved by storing all possible intermediate computations in a
look-up-table memory and using it to select an appropriate value for a specified
bit vector. The DA structure supports coefficient programmability i.e. the same
filter structure can be used for a different set of coefficients by appropriately
programming the look-up-table memory.
In a typical system level design process the overall area and performance
goals are partitioned to arrive at area-delay constraints for the various compo-
nents. If a component can be realized in a multiple ways, each representing a
different point on the area-delay curve, the partitioning process can explore a
much wider search space resulting in an efficient system implementation.
The first part of this chapter focuses on algorithmic transforms that generate
different structures for DA based implementation of FIR filters. It evaluates
the area-delay tradeoff associated with these transforms in the context of single
Adder-Shifter-Accumulator (minimum area) implementations.
This is followed by a look at area-efficient implementations of DA based
structures for filters whose coefficient values are known at design time. An
implementation using two memory modules is considered. Since the coefficient
75

values are known at design time, the look-up-tables stored in these memory
modules can be implemented as hardwired logic blocks. The chapter proposes
a coefficient partitioning technique so as to minimize the area of these logic
blocks.
The chapter also presents techniques for reducing power dissipation of the
DA based structure. With the primary focus on the power dissipated in the input
data shift registers, it proposes a data coding technique to minimize the number
of toggles in these registers. For a given profile of input data distribution an
optimum coding scheme can be derived so to minimize power dissipation.
4.1. DA Structures for Area-Delay Tradeoff

Consider the following sum of products
N
Y = L A[n] . X[n] (4.1 )
n=l
where A[n]'s are the fixed coefficients and the X [n]'s are K-bit input data words.
If each X[n] is a 2's-complement binary number scaled such that IX[n]1 < 1,
then X [n] can be represented as
K-1
X[n] = -bno + L (b nk . T k ) (4.2)
k=l
where the bnk are the bits 0 or 1, bno is the sign bit and bn,K -1 is the LSB.
Combining equations 4.1 and 4.2 gives
K-1 K-1
Y = L A[n]· (-bno + L bnk . T k ) (4.3)
k=l k=l
Interchanging the order of the summations gives

K-l N N
L (L A[n]· bnk ) . 2- k + L A[n]· -bno (4.4)
k=l n=l n=l
Consider the bracketed term in the above expression

N
L A[n]· bnk (4.5)
n=l
Since each bnk may take on values of 0 and 1 only, expression 4.5 may have 2N
possible values. Instead of computing these values on-line, they can be precom-
pu ted and stored in a look-up-table memory. The input data can then be used
Distributed Arithmetic Based Implenzentation 77
16 Word Memory
0000 0
0001 A3
0010 A2
00 I I A2+A3
X(n-I) 0100 AI
0101 AI+A3
o I 10 AI+A2
oI I I AI+A2+A3
1000 AO
X(n-2) 1001 AO+A3
10 I 0 AO+A2
101 I AO+A2+A3
I 100 AO+AI
I 101 AO+AI+A3
X(n-3) I I 10 AO+AI+A2
1111 AO+AI+A2+A3
Figure 4./. DA Based 4 Tap FIR Filter
to directly access the memory and the result can be added to the accumulator.
Y can thus be obtained after K such cycles using K-1 additions.
Figure 4.1 shows DA based implementation of a 4 tap FIR filter. The input
data values X[n] to X[n-3] are stored in input shift registers. During each cycle
the last bits of the registers are used as an address to look-up into the coefficient
memory and the read value is added to the right shifted accumulator. The shift
register chain is then right shifted. Since the input values are stored in 2's
complement form, the value read from the coefficient memory during the K'th
iteration is subtracted from the right shifted accumulator. The output Y[n] is
thus available in the accumulator after every K cycles. Figure 4.1 also shows
the coefficient memory map for a 4 tap filter with coefficients A[O] to A[3].
4.1.1. DA Based Implementation of Linear Phase FIR Filters

Linear phase FIR filters have symmetrie coefficients [73], i.e. for an N tap
filter A[i] = A[N - 1 - i] for i = 0,1, ... ,N-1. The coefficient symmetry can
be exploited to reduce the number of multiplications by half in the direct form
realization ofFIR filters. This is achieved by adding first the input data values to
be multiplied by the same fi Iter coefficient. This mechanism can also be used to
reduce the coefficient memory in the DA based implementation of linear phase
FIR filters. Since the input data is accessed bit serially, the addition of the data
values before coefficient memory look-up can be achieved using a bit-serial
adder.
Figure 4.2 shows DA based implementation of a 4 tap linear phase FIR filter.
Linear phase FIR filters can thus be implemented with a minimal performance
overhead of a l-bit-serial-addition but with a significantly smaller coefficient
memory given by 2N / 2 ifN is even and 2(N+l)/2 ifN is odd.
4 WonJ Memory f------,
Oll 0
01 AI
J() All
11 AII+AI
Clk
Figure 4.2. 4 Tap Linear Phase FIR Filter
The following subsections present transfonnations [104] to ac hieve the area-

delay tradeoff. These transfonns can be applied to both the non-linear phase and
the linear phase FIR filters. Since area-delay tradeoff is evaluated in the context
of single Adder-Shifter-Accumulator based realizations, coefficient memory
size is used as the measure of area. With the assumption that the memory
output is latched so that memory read and addition can be perfonned in parallel,
number of additions is used as the measure of delay.
4.1.2. I-Bit-At-A-Time vs 2-Bits-At-A-Time Access

The DA based implementation shown in figure 4.1 uses one bit a time
(1 BAAT) from each input data. The number of additions can be reduced to
K /2 - 1 if two bits at a time (2BAAT) are are used to access the coefficient
memory. This however results in exponential increase in the coefficient memory
requirement given by memory(2BAAT) = memory(lBAAT)2. Figure 4.3
shows DA based implementation of a 2 tap FIR filter, which uses 2BAAT co-
efficient memory look-up. Figure 4.3 also shows the corresponding coefficient
memory map.
This implementation assurnes the two most significant bits of the input data
to be sign bits. Input data in 2's-complement fonn can be easily converted
to such a representation by sign-extending the data to odd number of bits and
prepending a 0 as MSB.
The coefficient memory requirement ofthe DA based implementations shown
in figures 4.1 and 4.2 grows exponentially (2 N for non-linear phase and 2N /2
for linear phase) with the filter order. For example, a 16 tap FIR filter needs
64K (2 16 ) words of coefficient memory making it highly inefficient in terms of
area and almost impractical to implement. The coefficient memory requirement
gets even worse with the 2BAAT scheme shown in figure 4.3. The coefficient
Distributed Arithmetic Based Implementation 79
16 Ward Memory
0000 0
000 I AI
0010 2*AI
0011 3*AI
0100 AO
0101 AO+AI
o I 10 AO+2*AI
oI I I AO+3*AI
1000 2*AO
1001 2*AO+AI
1010 2*AO+2*AI
101 I 2*AO+3*AI
I I 00 3*AO
I 101 3*AO+AI
I I 10 3*AO+2*AI
I I I I 3*AO+3*AI
Figure 4.3. 2 Tap FIR Filter with 2BAAT
memory requirements can be reduced by using a technique that uses multiple

memory banks.
4.1.3. Multiple Coefficient Memory Banks

For an N tap filter, instead of using all N-bits to address a single coefficient
memory, the N address bits can be partitioned into two or more groups each
addressing its own coefficient memory.
Letthe N address bits be partitioned into M disjoint groups such that 'Lf!1 Ni =
N, where Ni is the number of address bits in group 'i'. The expression 4.5 can
then be written as:
Nl+N2 N
L (A[n]· bnk) + ... + L (A[n] . bnk ) (4.6)
n=l n=N-N}.f+ 1
The expression has M terms corresponding to the M partitions of the address

bits. A term corresponding to the i th group can take 2 Ni different values and
can hence be implemented using a memory with 2 Ni words. The resultant DA
based implementation, shown in figure 4.4, has M coefficient memory banks
with total memory size of 'Lt~l 2 Ni . This number is less than 2 N for all values
of M ~ 2. This implementation however takes (M . K - 1) additions per
computation, which is higher than (K - 1) required for M= 1.
For a given number of address bits N and the number of groups M into which
they need to be partitioned, the optimum partition that results in minimum total
memor~ is given by assigning equal number of address bits (N/M) to all the
groups. The total memory required by the resultant implementation is given by
M·2 N / M .
NI NI
2 ward
ROM
N2 N2
2 word
ROM
Nm Nm
2 ward
ROM
Figure 4.4. Using Multiple Memory Banks
The above partitioning scheme can be applied when N is an integer multiple

of M. If N/M is a non-integer, the optimum partitioning can be achieved by
assigning to each group Ni number of bits such that IN/ M - Ni I < 1
4.1.4. Multiple Memory Bank Implementation with 2BAAT

Access
Since the use of multiple memory banks increases the number of additions,
it is worthwhile to look at the area-delay tradeoff achievable using 2BAAT
access. Consider a two memory bank implementation of an N tap FIR filter
using lBAAT access. It requires (2K - 1) additions (where K is the number
bits in the input data) and 2N /2+ 1 words of coefficient memory. With 2BAAT
data access, the number of additions can be reduced to (K - 1) which is same
as the number of additions required with a single memory bank. This however
increases the total coefficient memory to 2 N + 1 which is in fact more than the
memory required for the DA implementation with a single memory bank. Thus
the two memory bank implementation with 2BAAT data access does not provide
a useful area-delay tradeoff.
Figure 4.5. Multirate Architecture
However, for a three memory bank implementation with 2BAAT data ac-
cess, the number of additions required (3K /2 - 1) is less than the two bank
implementation with lBAAT, and the coefficient memory required (3· 2 2N / 3 )
is less than the single bank implementation with I BAAT data access. The three
bank implementation with 2BAAT data access thus represents a data point on
the area-delay curve between the single bank lBAAT and the two bank I BAAT
DA implementations.
4.1.5. DA Based Implementation of Multirate Architectures

In section 2.4.2 multi rate architectures have been presented as computation-
ally efficient structures for implementing FIR filters. Figure 4.5 shows the
signal ftow graph of a multirate architecture that uses the decimation factor of
two.
The multirate architecture (figure 4.5) for an N tap filter uses 3 sub-filters of
length N/2. Each of these sub-filters can be implemented using the DA scheme
shown in figure 4.1. With such an implementation, each sub-filter requires
coefficient memory of 2N /2 words. The total memory requirement is hence
3·2 N / 2 . This requirement is 50% more than the memory required (2 .2 N / 2 )
for the DA based implementation using 2 memory banks.
In terms of number of additions, each sub-filter computation requires (K-l)
additions and there are four more additions required to compute the outputs
YO and Yl. Out of these four additions, the input addition (XO+Xl) can be
performed bit-serially using the same approach as used in case of linear phase
FIR filters. Thus the multirate architecture requires (3· (K -1) + 3) /2 additions
per output. These numberof additions are less than (2· K -1) additions required
for the two memory bank implementation, for all values of K > 2. Figure 4.6
shows the DA based implementation of a 4 tap multirate FIR filter.
The number of additions can be further reduced (at the expense of increased
coefficient memory) by using 2BAAT access. For an N tap filter, such an
Y(n)
1-+--1> Y(n-Il
Figure 4.6. DA Based 4 Tap Multirate FlR Filter
implementation requires (3 . 2N ) words of coefficient memory and (3K/4)

number of additions. It can be noted that this memory requirement is higher
than the single memory DA based implementation with I-bit-at-a-time access
for all values of N but the number of additions is less for (K > 4). Thus
the DA implementations of the multirate architecture with both lBAAT and
2BAAT access result in a meaningful area-delay tradeoff.
Here is a look at the DA based implementation ofthe multirate FIR filter with
linear phase and I-bit-at-a-time data access. Consider an 8 tap filter with coef-
ficients AO,A I ,A2,A3,A3,A2,A 1 and AO. In the corresponding multirate archi-
tecture the coefficients of the three sub-filters are given by HO: [AO,A2,A3,A 1]
, Hl:[Al,A3,A2,AO] and HO+Hl:[AO+Al, A2+A3, A2+A3, AO+Al]. It can
be noted that both HO and Hl has the same set of coefficients and can hence
share the same coefficient memory of size (2 4 ). The coefficients ofHO+Hl are
symmetrie and hence need the coefficient memory of size (2 2 ).
In general, it can be shown that for an N tap filter (with N even), the sub-
filters HO and HI can share the same coefficient memory of size (2 N / 2 ) and
the sub-filter HO+Hl requires the coefficient memory of size (2 N / 4 ). The total
coefficient memory is thus more than the memory required for the single bank
implementation. Since the number of additions is also higher, the DA based
implementation of a multirate linear phase FIR filter with lBAAT data access
does not result in a useful area-delay tradeoff.
4.1.6. Multirate Architecture with a Decimation Factor of Three

The multirate architecture described in the earlier section decimates the in-
puts and the filter coefficients by two. The decimation factors of higher than
two can also be used to derive other multi rate architectures, such as a multi rate
Distributed Arithmetic Based lmplementation 83
architecture [66] that uses a decimation factor of three. In this architecture, the
decimated sub-filters HO,HI and H2 are derived by grouping every third filter
coefficient as shown below :
't- 1 't- 1 't- 1

H(Z) = L A[3k]·Z-3k+ L A[3k+1]-Z-(3k+l) + L A[3k+2]·Z-(3k+ 2)
k=O k=O k=O
(4.7)
The input data is also decimated into XO, Xl and X2 in a similar way. The
multirate architecture takes three inputs a time and computes three outputs at
time using the following computations. .
aO X2-X1 bo HO
al (XO - X2 . Z-3) - (Xl - XO) b1 H1
a2 -aO. Z-3 b2 H2
a3 (Xl - XO) b3 HO+H1
a4 (XO - X2 . Z-3) b4 H1+H2
as XO bs HO+H1 +H2
mi ai * bi , i = 0,1,2,3,4,5
YO m2 + (m4 + ms)
Y1 ml + m3 + (m4 + ms)
Y2 mo + m3 + ms
This multirate architecture has six sub-filters of length N/3. Each of these
filters can be implemented using DA based approach, thus requiring total
coefficient memory of 6 . 2N /3. These sub-filters require 6(K - 1) addi-
tions. There are 10 more additions required, four out of which are at the input
and can be implemented bit-serially. Thus this architecture requires total of
(6(K - 1) + 6)/3 = 2K additions per output.
The area-delay tradeoff of this architecture with 2BAAT data access can be
analyzed much the same way as the earlier multirate architecture. It can be
shown that with 2BAAT data access this architecture requires K additions per
output and 6 . 22N/3 words of coefficient memory.
For an N tap filter, where N is an integer multiple ofthree, it can be shown that
the sub-filters HO and H2 have the same set of coefficients and can hence share
the same coefficient memory of size 2 N / 3 . Similarly the sub-filters HO+HI
and H I +H2 have the same set of coefficients and can hence share the same
coefficient memory of size 2 N / 3 . The sub-filters HI and HO+HI+H2 have
symmetric coefficients and hence require total of 2 * 2N / 6 words of coefficient
memory. Thus the total coefficient memory required for the linear phase filter
is given by (2(N/3+ 1) + 2(N/6+1)).
4.1.7. Multirate Architectures with Two Level Decimation

Each of the sub-filters in the multi rate architectures discussed above can be
further decimated to realize multirate architectures with two level decimation.
For example, sub-filters of the architecture shown in figure 4.5 can be further
decimated by a factor of two. The resultant architecture has nine sub-filters
with N/4 number of taps. Each of these sub-filters can be implemented using
DA based approach. The architecture reads in four inputs and computes four
outputs. It needs 15 additions in addition to those required to implement the
sub-filters. Thus this architecture requires total coefficient memory given by
9· 2N / 4 and total number of additions per output given by (9(K - 1) + 15)/4.
Some other two level decimated multi rate architectures can also be derived
and analyzed far the associated area-delay tradeoff. One such architecture can
be obtained by decimating by two the sub-filters of the multirate architecture
obtained with first level decimation factor of three. The resultant multirate ar-
chitecture has 18 sub-filters with N/6 number of taps. Each of these sub-filters
can be implemented using DA based approach. This architecture reads in 6 in-
puts and computes 6 outputs. It needs 21 additions in addition to those required
to implement the sub-filters. Thus this architecture requires total coefficient
memory given by 18· 2N / 6 and total number of additions per output given by
(18(K - 1) + 21/6.
Further area-delay tradeoff can be achieved by implementing the above two
level decimated architectures using 2BAAT data access.
4.1.8. Coefficient Memory vs Number of Additions Tradeoff

Table 4.1 gives coefficient memory size for 3 non-linear phase FIR filters
(with 8, 12 and 18 taps) implemented using 16 different DA based approaches
discussed in this chapter. The table also gives the number of additions required
by these 16 approaches for two values of input data precision (K= 12 and 16).
As can be seen from the results, the techniques discussed in this chapter
enable achieving different points in the area-delay space for the DA based
implementation of FIR filters. For a given filter, some of these points can be
eliminated as their memory requirements are very high or they require higher
memory for the same number of additions compared to another implementation.
Even with these eliminations, as many as eight meaningful data points can be
achieved on the area-delay curve. Figure 4.7 shows these memory-vs-number
of addition plots for the 8 tap and 12 tap FIR filters with 16 bits of input data
precision.
The following section looks at DA based implementation ofFIR filters whose
coefficients are known at design time. It presents a technique to improve the
area efficiency of a DA structure that uses two LUTs. It can be noted that
Distributed Arithmetic Based Jmplementation 85
Table 4.1. Coefficient Memory and Number of Additions for DA based Implementations
Impleme- Memory size Number of +s

ntation N=8 N=12 N=18 K=12 K=16
IB,IMem 256 2 'L 2 Hl II 15
2B,IMem 2 16 224 236 5 7
IB,2Mem 32 128 1024 23 31
2B,2Mem 512 2 13 2 19 II 15
IB,3Mem 20 48 192 35 47
2B,3Mem 144 768 3 x 2 12 17 23
IB,4Mem 16 32 96 47 63
2B,4Mem 64 256 3072 23 31
I B,/2 48 192 1536 18 24
2B,/2 768 3 x 2 12 3 X 2 18 9 12
I B,/3 44 96 384 24 32
2B,/3 336 1536 6 x 2 12 12 16
I B,/2/2 36 72 240 28.5 37.5
2B,/2/2 144 576 6912 15 19.5
I B,/3/2 56 72 144 36.5 48.5
2B,/3/2 192 288 1152 18.5 24.5
1600
CD
1400
N
i:Jj 1200
~
0
E 1000
CD
:2 800
C
CD
'0
600
~0 400
Ü
200
0
0 10 20 30 40 50 60
Number 01 Additions
Figure 4.7. Area-Delay Curves for F1R Filters
the proposed technique is generic and can be extended to other DA structures

discussed earlier.
4.2. Improving Area Efficiency of Two LUT Based DA

Structures
For a two bank implementation, equal partitions are the most efficient in
terms of the total number of rows of the two look-up-tables. The totallook-up-
table size for such an implementation has an upper bound given by 2 . 2 N / 2 .
rm + log2(N/2)l, where N is the number of taps and m is the number of bits
X(n) 4 word Memory

00 0
01 AI
X(n-I) 10 AO
II AO+AI
X(n-2) 4 word Memory

00 0
01 A3
X(n-3) 10 A2
II A2+A3
Figure 4.8. Two Bank Implementation - Simple Coefficient Split
X(n) 4 word Memory

00 0
01 A2
X(n-I) 10 AO
II AO+A2
X(n-2) 4 word Memory

00 0
01 A3
X(n-3) 10 AI
II AI+A3
Figure 4.9. Two Bank Implementation - Generic Coefficient Split
of precision of the coefficients. The number of columns/outputs of the memory

modules is a worst-case upper limit and can be reduced depending upon the
values of the coefficients.
Coefficient partitioning need not be restricted to a simple split as shown in
Figure 4.8 but can be extended to a more generic case as shown in Figure 4.9.
If the dynamic range of filter coefficients in any given partition is smalI, the
Distributed Arithmetic Based Implemenfation 87
overall precision required in the LUTs is less and the implementation area can
be reduced.
For filters with fixed coefficient values the required area could be drastically
reduced by removing the redundancy inherent in a memory structure by using a
two level PLA implementation or the more efficient multi-level logic optimiza-
tion. In a two LUT implementation, the functionality of the LUTs depends
on the coefficient partitioning. Experiments indicate [86] that 20% to 25%
swings in implementation area occur based on the type of partition. Hence this
ftexibility needs to be explored.
In general, a 2N tap filter could be partitioned in (2N CN) /2 ways. Clearly,
even for a modestly sized 16 tap filter this implies a search space with 6435
partitions. The unfeasibility of an optimized area mapping for the exhaustive
set, and then choosing the most efficient partition is at once apparent. A set of
heuristics is hence required for estimating the area of different partitions so as
to speed up the search of the most area efficient partition.
4.2.1. Minimum Area Partitions for Two ROM Implementation

Once the number of taps is given, the number of words in the look-up ROM
gets fixed. Area optimization must therefore seek to reduce the number of
columns (i.e. the number of outputs) in it. The ROM size is direCtly propor-
tional to the number of output columns it the truth-table. Therefore, minimum
area results from that partition pair where the total number of output columns
is minimum. It can be observed that the maximum bit precision required is
determined by the magnitude of the largest positive or negative sum, whichever
is larger. The largest positive number to be stored in the look-up ROM is given
by the sum of aII positive numbers in the partition; the same applying for the
largest negative number. The problem to find the most column-efficient parti-
tion, hence, can be formulated as folIows:
CoefficienLset = {partitionli,partition2i } I
sizeof( partitionli) = sizeof( partition2 i )
Define,
psumli = L AlLPositives(partitionli);
psum2 i = L AlLPositives(partition2d;
nsumli = L AlLN egatives(partition2 i );
nsv,m2 i = L AlLN egatives(partition2i);
precisionli = M AX (rlog2lpsum1i Il, rlog21nsurnli Il)
precision2i = MAXWog2lpsum2ill, rlog2lnsum2 i ll)
M inimize (precisionli + precision2i ) Vi (i runs over all possible partitions)
For smaII coefficient sets it is possible to exhaustively run the algorithm and
obtain the most efficient partition. However for larger sets a few heuristic rules
need to be used to choose a close to most efficient partition.
88 VLSf SYNTHESfS OF DSP KERNELS
Table 4.2. A Few Functions and Their Corresponding Correlations with Actual Area
No. Function Used Corre-

lation
I Number of 'l's in the truth-table 27%
2. I: I: (Number of 'l's common to
i th , lh rows in the truth-table)
x ( Hamming distance between
i th , lh min-terms ) -4%
3. I: MIN ( Hamming Distance between
i th , lh columns IV U # i) ) 28%
4. Modified Row Hamming Distance
based cost function (ROF) 81%
5. Modified Column Hamming Distance
based cost function (COF) 66%
Based on the analysis of the coefficients of various low pass filters with taps
ranging from 16 to 40, the following heuristic rute [86] can be used to choose
an efficient partition:
Stepl : Separate the coefficients into positive and negative sets.
Step2 : Sort each set by magnitude.
Step3 : Group the top half of each set as the first partition and the remaining
as the second partition.
4.2.2. Minimum Area Partitions for Hardwired Logic

The unfeasibility of an exhaustive search for the best case partition by syn-
thesizing all ofthem was mentioned earlier. Most synthesis tools [12] like SIS
take hours to ruggedly optimize even a ten coefficient truth-table. Estimating
the area of a multi-level logic implementation from the truth-table description
is a non-trivial problem. A technique for estimating the complexity of synthe-
sized designs from FSM specifications has been proposed in [64]. The variables
used by this scheme, such as the number ofinputs (which is fixed in this case),
number of state transitions etc. are not applicable in this context. Experiments
were conducted [86] with many cost functions using 40 random partitions each
of 8 to 20 tap filters. Table 4.2 lists some of the functions and the average
correlation between the expected and actual areas.
Functions 4 and 5 form the basis of the area comparison procedure and will
be explained in detail later. Function 1 gives a very naive estimate, assuming
that number of ones is a measure of the number of min-terms that needs to
implemented. It does not consider the minimizations that occur because of
particular groupings of 1'so However, it could be used effectively for filters with
sparsely populated truth-tables. Function 2 is similar to a fan-in type algorithm
used for FSMs [19, 63]. It reflects the fact that additional area results when two
particular outputs have more Hamming distance between their corresponding
min-tenns. However, the fact that it sums up over a11 possible combinations
of rows results in favorable pairs being overshadowed by area expensive ones.
Function 3 tries to group outputs with maximum overlap between them and adds
the extra non-overlap cost. However, it does not account for simplifications that
could arise from row overlaps. Further, it pairwise sums up a11 best case column
groupings without accounting for the fact that one favorable grouping might
exclude the possibility of another one.
4.2.2.1 CF2 : Estimating Area from the Actual Truth-Table

CF2 extracts area infonnation from the truth-table itself. It comprises oftwo
factors: the row overlap factor and the column overlap factor.
Row Overlap Factor(ROF)
For an mbit, n tap filter truth-table, any particular input combination can have
a maximum of m + fZog2n 1outputs. ROF accounts for the area optimization
that results if two column entries in a truth-table can be combined. The ROF
computation is as folIows:
Step1 : Arrange the truth-table with inputs in gray code fonnat (i.e. where
the successive inputs differ from previous and subsequent on es in only one bit
position).
Step2 : Assuming an n-coefficient partition, its N = 2n truth-table entries
are sorted in gray fomlat and labeled from [0 .. N-l]. The symmetric Hamming
distance is then computed as folIows:
n-12,,-i_12 i -1
ROF = L L L hd(j.2 i + k,j.2 i - (k + 1)) (4.8)
i=O j=1 k=O
where, hd(p, q) represents the Hamming distance between the pth and the
qth row entries in the modified truth-table. 1t can be observed that when the
Hamming distance between two output rows is more, the number of input min-
tenns that could be combined is less and hence the added cost. ROF gives a very
high correlation with ac tu al implementation area but its correlation deteriorates
as the column overlap factor begins to dominate. Consider the following simple
example,
1111111111111111 1010101010101010
0000000000000000 0101010101010101
Case 1, ROF = 16 Case 2, ROF = 16

Clearly, Case 1 would require 1esser area, because of the greater column
overlap which in turn implies that only one min-term needs to be implemented.
To account for this, the Column Overlap Factor (COF) is computed.
Column Overlap Factor(COF)
COF computation is based on the minimum-spanning-tree algorithm [18].
It begins with one output column, tries to locate another one which is c10sest
to it (in terms of maximum' 1' overlap), then for a third one which is c10sest to
either of them and so on. In each case it adds to the cost function the amount
of non-overlap. Assuming m outputs in the truth-table, COF is computed using
the Prim's technique [18] for minimum-spanning-tree computation as follows:
The graph G consists of columns as nodes. The edge-weight (ew) is the
extra non-overlap cost between a pair of columns and COF is sum of the edge-
weights of the minimum-spanning-tree.
Define, G = {Ck ICk --t k th output column, k=[O,m-l] }
eWij = ones (Cj) - overlap (Ci, Cj)
where overlap( Ci, Cj) gives the number of positions where both Ci and Ck have
'I' entries in corresponding rows.
Stepl : Initialize count=O; COF=ones( Co ); and the span set
as Spantree = { Co }.
Step2 : Repeat Steps 2-4 while count::; m - 1.
Step3 : Find Ck such that for all Ci E Spantree,
Ck E Spantree/\ (eWik --t MIN)V(i,k i- i)
Step4 : Increment count; Add Ck to Spantree and edgeweight
(the extra non-overlap cost) to COF.
COF = COF + eWik
CF2 is computed using a linearcombination ofCOF and ROF. It was observed
that CF2 values had as much as 90% correlation with actual areas.
Computation of CF2
A linear weighted combination of normalized COF and ROF (cost function
CF2) was tested on truth-tables of fi lter coefficients generated using the Parks
McClellan algorithm, with taps ranging from 16 to 36.
G F2 = k 3 . RO F + k 4 . GOF (4.9)
where ROF, GOF represent the normalized values of ROF, COF computed
for the different partitions. The values of k 3 and k 4 far maximum correlation
were found out as 0.92 and 0.08 respectively. 80% to 90% correlation to actual
area were observed. The values of CF2 obtained for the various partitions can
therefore be used to obtain a minimized search space of desired size. Figure 4.1 0
shows a typical correlation trend obtained using CF2 for 25 different 16 tap
filter partitions. As can be seen, the two most area efficient partitions could
be isolated. The 'kinks' in between correspond to the 'transition zone' where
0.95
/~
TI 0.9
,~/-6 0
Q)
.~ 0,./'"
(ij
//0
E 0.85 o 9"
0 o /
N
c
lL
o 'I/i
ü 0.8 ~~
o}o
0.75 //f~
/'{,
0.7 "----'-----''----'-----'----'-----'
200 300 400 500 600 700 800
Actual Area (equivalent NA210 NANO gates)
Figure 4.10. Area vs Normalized CF2 Plot for 25 Ditlerent Partitions of a 16 Tap Filter
neither a row nor a column overlap factor dominates; there occurs a complex
interdependency in row/column simplifications.
4.2.2.2 CFl: Estimating Area from the Coefficients in Each Partition

Computation of CF2 involves an additional overhead of generating the truth-
table of all possible sum combinations for a given partition. For large coefficient
sets it is desirable to be able to predict area efficiencies from the coefficient
values themselves. The cost function (CFI) described below uses coefficient
values themselves, to get those partitions from the exhaustive set that are highly
Iikely to give optimum area. Hence, a hierarchical reduction in the search space
can be performed by using CF 1 on the exhausti ve set and CF2 on those partitions
screened by CF 1.
CFl is based on the Hamming distance between pairs of coefficients and the
total number of ones in all the coefficients. Statistical data between actual area
after SIS simulation and the corresponding truth-table has shown correlation
as high as 50% to 60% between the number of ones in the truth-table and the
corresponding area. CFl exploits this basis for estimation. Since, the entries
in the truth-table are the sum combinations of the coefficients, the Hamming
distance between any pair gives the number of ones that will result from their
addition. It was observed that a similar correlation exists between the number
of ones in the coefficients themselves and the area. Further, the Hamming
distance estimate and the number-of-ones estimate were complementary. A
linear combination of the two produced good primary estimate for reducing the
search space to a manageable size. Therefore,
CFl = k1 . L L hd(Ci, Cj) + k L ones(ci) 2 . (4.10)

i Ni
Table 4.3. ROM Areas as a % of Maximum Theoretical Area
No.of No.of Best Worst Max. Base

taps Coetfs Area Area Saving Case
24 12 61% 74% 13% 63%
28 14 61% 74% 13% 66%
32 16 66% 76% 10% 68%
32 16 66% 74% 8% 66%
32 16 63% 74% 11% 66%
36 18 65% 70% 5% 65%
40 20 67% 72% 5% 67%
where k 1 and k 2 are the corresponding weights and ci represents the i th coeffi-
cient in the partition; hd is a simple Hamming distance between the two input
vectors and ones is the number of '1' entries in Ci.
The function was implernented on all possible uniform twin partitions of
filters with number of coefficients ranging from 8 to 20. The values of k 1
and k 2 that resulted in the highest correlation between the value of CFl and the
actual area were found out as 0.83 and 0.17 respectively. Experiments indicated
that the correlation values remained almost the same after CFl values obtained
for individual coefficient sets in a give partition were added up and compared
with the sum of the individual implementation areas.
4.2.3. Evaluating the Effectiveness of the Coefficient

Partitioning Technique
Here are the results that highlight the effectiveness of the coefficient parti-
tioning technique. Table 4.3 compares the best case, worst case and base case
( partitioning by simple split of coefficient set) areas for some linear phase
filters.
As can be seen, 8% to 10% area saving can easily be obtained from a good
partition. Table 4.4 compares the area required for a ROM implementation
with that of hard wired implementation (using SIS) for different numbers of
16-bit precision coefficients. The area mapping was performed using a library
of TSC4000 0.351Lm CMOS Standard gates from Texas Instruments [100]. It
can be seen that more savings result for smaller number of coefficients as the
decoder overhead does not decrease proportionately for the ROM even though
the memory size decreases. Table 4.5 illustrates the kind of area variations that
occur depending on the partitioning of coefficients for some typical filters.
Table 4.4. ROM vs Hardwired Area (Equivalent NA210 NANO Gates) Comparison
NO.of ROM Hardwired %

Coeffs. Area Area Saving
4 310 64.25 79%
5 355 96.00 73%
6 447 214.00 52%
7 626 345.50 45%
8 985 650.25 34%
Table 4.5. Area (Equivalent NA21 0 NANO Gates) Statistics for All Possible Coefficient Par-
titions
NO.of Best Worst Mean Range Std.

Coeffs Area Area Area (%) Oev.
8 91.25 115.25 100.49 23.88 26.41
14 340.00 420.75 382.58 21.11 118.64
16 759.00 934.00 844.01 20.73 200.00
16 895.50 1127.50 1000.75 23.18 291.88
16 884.50 1041.00 970.37 16.13 174.19
16 689.00 1005.00 795.50 39.72 332.02
18 1285.50 1619.25 1393.12 23.96 369.20
Clearly the correct choice of partitions results in 20% to 25% area saving
and so a proper algorithm for choosing the ideal partition is altogether justified.
CFI and CF2 were implemented on filters with taps ranging from 8 to 40.
For filters with number of coefficients less than 20, all possible partitions were
generated while for larger ones a comparable number of random partitions were
generated. In each case the actual area mapping of the simplified circuit was
obtained through SIS simulation.
The following results were obtained [86]:
• 82% to 90% probability of choosing the most area optimal partition using
CFI and CF2.
• Over 95% probability of having the most optimal partition in the search
space reduced to 2% of its original size.
• All cases yielded partitions close to minimal area in the reduced search
space.
Table 4.6. Toggle and No-toggle Power Dissipation in Some D FFs
D Flip-Flop (TI standard cell [100]) Cpd,toggle (pF) Cpd,no-toggle (pF) %extra
DTPIO 0.157 0.070 124%
DTP20 0.180 0.071 154%
DTPIA 0.155 0.069 138%
DTPIO 0.167 0.070 139%
• The CF1, CF2 estimates had greater correlation as the size of the search
space increased; and a larger sized domain is where CF2 and CF2 have their
real application.
• For an 8 input truth-table with 256 rows and 16 output columns, SIS required
a CPU time of 350.53s on a Sun SPARC 5 station while CF2 computation
required only 0.15s, a speed advantage of around 2400. Further, this speed
advantage increased sharply with filters of higher order.
The next section presents techniques for reducing power dissipation of DA

based FIR filters. While these techniques have been discussed primarily in the
context of the basic DA based filter structure shown in figure 4.1, they can be
extended to other DA based filter structures as weIl.
4.3. Techniques for Low Power Implementation of DA

Based FIR Filters
For the DA based structure shown in figure 4.1, the rightmost bits in the
shift registers constitute the address for the LUT. Data is shifted every clock
cycle and the LUT outputs are shifted and accumulated. This is done N times
where N is the precision of the input data (and hence the length of the shift
registers). At the end of every N clock cycles, the output is tapped at Y. For a
2's complement representation, the Sign Control is always positive except for
the MSB i.e. for the N th clock cycle.
Substantial power consumption occurs as a result of toggles occurring in the
shift registers every clock cycle. Table 4.6 compares the power dissipation for
the toggle and the non-toggle (in the data values) cases for four D flip-flops
based on a 0.35p,m CMOS technology [100]. From the table it is clear that
a technique which reduces the number of toggles in the shift registers would
significantly reduce the power dissipation in the design.
For applications where the distribution of data values is known, a data coding
scheme can be derived which for a given distribution profile of data values,
results in lesser number of toggles in the shift registers. The main constraint
is to have a scheme which results in toggle reduction with minimal hardware

overhead (implying power dissipation overhead as weil). The second important
constraint is that the coding scheme should be programmable so that the same
hardware can be used for different distribution profiles of data values. The nega-
binary scheme discussed here satisfies these two constraints and can be directly
incorporated into the structure of the DA based FIR shown in Figure 4.1.
4.3.1. Toggle Reduction Using Data Coding

Any coding scheme that seeks to reduce togg1ing must meet the following
criteria:
1. It should add minimum additional hardware overhead.
2. It should represent the entire range of values of the source data being coded.
The generic nega-binary scheme proposed here meets the above two re-
quirements. It has the added ftexibility of choosing one of the several possible
nega-binary schemes that meet the above criteria and also resuIts in maximum
toggle reduction.
4.3.1.1 Nega-binary Coding

Nega-binary numbers [106, 107] are a more generic case of a 2's complement
representation. Consider an N bit 2's complement number. Only the most
significant bit (MSB) has a weight of -1 while all others have a weight of + 1.
An N-bit nega-binary number is a weighted sum of ±2i . As a special case
consider a weighted (_2)i series where nbi denotes the i th bit in the nega-
binary representation of the number.
N-l
number = L nbi . (_2)i (4.11 )
i=ü
In the above case, powers of 2 alternate in signs. While the 2's complement
representation has the range of [ - 2 N -1 , 2 N -1 - 1 ], this nega-binary scheme
has the range of [ _(4 LN/ 2J - 1)/3, (4 fN / 21 - 1)/3]. It can be noted that in
general the nega-binary scheme results in a different range of numbers than the
2's complement representation. Thus there can be a number that has an N bit 2's
complement representation but does not have an N bit nega-binary representa-
tion. This issue has been addressed in section 4.3.1.2. Here is a simple example
that demonstrates how the nega-binary scheme can result in reduced number
of toggles. Consider the 2's complement number 010101018. Using a nega-
binary scheme with alternating positive and negative signs (weights - (- 2)i ),
the corresponding representation will be ll111111 NR . Clearly while the first
case has maximum possible toggles the second one has minimum toggles. If
instead the number was 10101010B, this nega-binary scheme would result in
a representation with same number of toggles as the 2's complement. How-
_T~_al range for 32 _(21\(4+1» different Nega-Binary representations
.2's Complement Range.
-31 -8 o 7 31
Range ror . - - - - Range ror + + + + +
Figure 4. JJ. Range of Represented Values for N=4, 2's Complement and N+ 1=5, Nega-binary
ever, a different nega-binary scheme (weights ( - 2)i ) will have a representation

11111110N B withjust 1 toggle. Thus it can be noted that different nega-binary
schemes have different 'regions' in their entire range which have fewer toggles
and hence depending on the data distribution the flexibility exists of choosing
a scheme which minimizes toggling without altering the basic DA based FIR
filter structure.
In existing literature [106, 107], the term nega-binary is used specifically for
binary representations with radix -2. In this chapter, the definition of the term
has been extended to encompass all possible representations obtained by using
±2i as weights for the i th bit. Hence for an N-bit precision there will exist 2N
different nega-binary schemes.
4.3.1.2 2's Complement vs Nega-binary Representation

Since the range üf values für the twü representatiüns are different, the bit
precision for the nega-binary scheme needs to be increased tü N+ 1. With N+ 1
bits of precision, when all sign bits are negative, the corresponding nega-binary
range is [ _2 N + 1 + 1 ,0] and likewise when all the sign bits are positive, the
range is [ 0 , 2N +1 - 1]. All intermediate sign combinations have a range
lying between [ _2 N + 1 + 1 , 2 N + 1 - 1 ] and each combination represents
2N + 1 consecutive numbers. The N-bit 2's complement range being [ _2 N - 1 ,
2 N - 1 - 1] overlaps and completely lies within the N+l bit nega-binary range
for exactly 2 N + 1 different nega-binary representations out of the possible
2N +1 total cases. Figure 4.11 illustrates this point for an N=4, 2's complement
and N+ 1=5, nega-binary representation. From a total of 32 different 5-bit nega-
binary schemes, 17 schemes cover the 4-bit 2's complement range.
The advantage of using such a scheme is that it enables selecting a nega-
binary representation that minimizes the number of toggles in the data values
while covering the entire range spanned by its 2's complement counterpart. For
a given profile of input data distribution a nega-binary scheme can be selected,
out of the ones which overlap with the 2's complement representation, such
that it minimizes the total weighted toggles i.e. the product of the number of
toggles in a data value and the corresponding probability of its occurrence.
0.03 r-----..,.-------r----r------r----~---___,
0.025
0.02
f
n"
0
":>
lL
0.015
~
.ö;
"'"
0
0.01
0.005
o L-_ _ ..g,:;~_
·600 ·400 -200 o 200 400 600

VALUE _••• >
Figure 4.12. Typical Audio Data Distribution for 25000 Sampies Extracted from an Audio File
Figure 4.12 iIlustrates the distribution profile of a typica1 audio data extracted
from an audio file. The non-uniform nature of the distribution is at once appar-
ent. A nega-binary scheme which has the minimum number of toggles in data
values with very high probability of occurrence will substantially reduce power
consumption. Further, each ofthe 2N + 1 overlap cases have different 'regions'
of minimum toggle over the range, which implies that there exists a nega-binary
representation which minimizes total weighted toggles corresponding to a data
distribution peaking at a different 'region' in the range. While the relative data
distribution of a typical audio data is similar to that shown in figure 4.12, its
mean can shift depending on factors such as volume control. The flexibility
of selecting a coding scheme depending on the 'mean' values is hence very
critical for such applications. Section 4.3.1.4 shows that the binary to nega-
binary conversion can be made programmable so that the desired nega-binary
representation can be selected (even at run-time) by simply programming a
register.
It can be noted that the toggle reduction using the nega-binary coding comes
at the cost of an extra bit of precision. The amount of saving hence reduces as
the distribution becomes more and more uniform. This is to be expected, as
any exhaustive N-bit code (i.e. one that comprises of all possible combinations
3 ,
2.5
l'
cn
1.5
Q)
"öl
Cl
0
f-
.S: ~ ~ "-'
Q)
'-'
c
e?
Q)
:e 0.5
0
0 I- ..... , .......
-
-0.5
I
-1
I,
-40 -30 -20 -10 0 10 20 30 40
VALUE --->
Figure 4.13. Difference in Toggles far N=6, 2's Complement and Nega-binary Scheme : + - -
+-++
of 1sand Os) will necessarily have the same total number of toggles (summed
over all its representations) as any other similar code. Therefore, as the data
distribution becomes more and more uniform i.e. all possible values tend to
occur with equal probability, toggle reduction decreases.
Figures 4.13 and 4.14 ilIustrate the difference in number of toggles for a 6-bit,
2's complement representation and two different 7-bit, nega-binary represen-
tations for each data value. Figures 4.15 and 4.16 show two profi les for 6-bit
Gaussian distributed data. As can be seen the nega-binary scheme of figure 4.13
can be used effectively for a distribution like the one shown in Figure 4.15, re-
sulting in 34.2% toggle reduction. Similarly nega-binary scheme of figure 4.14
can be used for a distribution like the one shown in figure 4.16, resulting in
34.6% toggle reduction. Figures 4.13 and 4.14 depict two out of a total of 65
possibilities. Each of these peaks (i.e. the corresponding nega-binary scheme
has fewer toggles compared to the 2's complement case) differently, and hence,
for a given distribution, a nega-binary scheme can be selected to reduce power
dissipation.
4r-----.---~h-----,,-----._,----,_----,,----_.----_.
l'
cn
2 ~ <>i
Q)
c;,
'"0
I-
.~
Q)
u
c
~
Q)
::=
(5
0 - .
~ )~ ..
'ti
r
-1 W ~
I
,
-2
-40 -30 -20 -10 0 10 20 30 40
VALUE ---->
Figure 4.14. Differenee in Toggles for N=6, 2's Complement and Nega-binary Sehe me : - + +
-+-+
4.3.1.3 Deriving an Optimum Nega-binary Scheme for a Given Data

Distribution
The first step is to find a data value which contributes most to the power
dissipation, using (Number ofToggles x Probability of Occurrence) as the cost
function. The analysis of various Gaussian data distributions indicates that a
nega-binary scheme that minimizes the toggles in such a data value, is the opti-
mum nega-binary scheme for the given data distribution. For example, for the
profile shown in figure 4.15, data value 22 (6 bit binary representation - 0] 0 I] 0)
contributes most to the power dissipation. The nega-binary representation of
22 using the 7 bit scheme (+ - - + - + +) shown in figure 4.13 is (I] ] ll ] 0) which
has least number of toggles. As has been presented earlier, this nega-binary
scheme is the most power efficient for the distribution shown in figure 4.15.
Here is an algorithm for deriving an (N+ I) bit nega-binary representation that
minimizes toggles (i.e. the number of adjacent bits with different values) for a
given N bit 2's complement number.
Procedure Optimum-Nega-binary-Scheme
Input: Array Bit[N] - N bit 2's complement representation of a number, with
0.07 !
0.06
0.05
1
c 0.04
0
t5c •
::J
LL
:c
'iii 0.03
c
QJ
0
0.02
0.01
1111I
0
-40 -30 -20 -10 0 10 20 30 40
VALUE ---->
Figure 4./5. Gaussian Oistributed Oata with N=6, Mean=22, SO=6
Bit[O] being the LSB and Bit[N-l] being the MSB

Output: Array Sign[N+ 1) - N+ 1 bit nega-binary representatiün that minimizes
tüggles in the number.
für (i =0 tü N-2) {
== 1) {
if (Bit[i+ 1)
Sign[i) = '+'
} else {
Sign[i] = '-'
}
}
if (Bit[N-l) == 1) {
Sign[N-l} = '+'
Sign[N} = '-'
} else {
Sign[N-l] = '-'
Sign[N} = '+'
}
0,07 ,
0,06
0,05
c
~ 0,04
.Q
ti
c
::J
LL
C:- 0,03
'Vi
c
<l>
0
0,02
0,01
[ 1II1
0
-40 -30 -20 -10 0 10 20 30 40
VALUE ---->
Figure 4,16, Gaussian Oistributed Oata with N=6, Mean=-22, SO=6
4.3.1.4 Incorporating a Nega-binary Scheme into the DA Based FIR

Filter Implementation
Conversion of a 2's complement number to a nega-binary representation can
be done bit-serially. Here is an algorithm for such a bit-serial conversion.
Procedure Binary-to-Nega-binary
Inputs: Array Bit[N] - N bit 2's complement representation of a number with
Bit[O] being the LSB and Bit[N-I] being the MSB.
Array Sign[N] - N bit nega-binary representation, with Sign[O] being the sign
for the LSB.
Output: Array NegaBit[N] - N bit nega-binary representation for the number.
C[O] = 0; /* Array C[N] is an array of intermediate variables */

for (i=O to N-l) {
NegaBit[i] = Bit[i] XOR. C[i]
if (Sign[i] == '+') { C[i+ 1] = Bit[i] ,AND. C[i]
} else { C[i+ 1] = Bit[i] .OR. C[i] }
}
2's Complement to Nega-Binary Converter
bi---+-,r-------------------------~\ Serial
Data-in
o
DA based
FIR structure
c1k Sign Control

N+ 1 bit Sign Register
Figure 4.17. DA Based FIR Architecture Incorporating the Nega-binary Scheme
The above algorithm can be directly implemented in hardware resulting in

a small area overhead. Data values can be bit serially converted from a 2's
complement representation to a nega-binary representation and loaded into the
shift registers. The sign bits can be directly coupled to the Sign Contral of the
adder shown in Figure 4.1. Figure 4.17 illustrates the complete nega-binary
DA based FIR architecture. The sign register is a pragrammable register which
holds the sign combination for the chosen nega-binary scheme. The bit serial
nega-binary computation logic requires just 5 gates and can have worse case
power dissipation equivalent to 3 flip-flops, which is negligible compared to the
number of flip-flops in the shift register chain.
It is important to note that a simple difference of weighted toggle sums
obtained for the 2's complement and the nega-binary representation does not
give the actual toggle reduction occurring in the concatenated registers. Since
the nega-binary registers have N+ 1 bits of precision, each data value contributes
its toggles (N+ I)/N times more than the corresponding 2's complement value.
Therefore, the nega-binary weighted toggle sum needs to be multiplied by a
factor equal to (N+I)/N.
Hence, the power saving can be estimated as folIows:
(4.12)
where p(i) is the probability of occurrence of a data with value 'i', N is the
2's complement bit-precision used, togs(i) and negatogs(i) are the number of
toggles in the representation ofi forthe 2's complement case and the nega-binary
case respectively.
The above saving computation does not account for 'inter-data' toggles that
result from two data values being placed adjacent to each other in the shift
register sequence. It may be observed that for a T tap filter with N-bit precision
registers an architecture similar to Figure 4.1 would imply a virtual shift register
(obtained through concatenating all the individual registers) of length TxN.
Actual shift simulations were performed sampie by sampie for different data
profiles and different number of sampies to find out the nega-binary scheme
that resuIts in maximum saving. These simulations showed that in all cases, the
nega-binary scheme that resulted in the best saving was the same as the scheme
that resulted in maximum estimate of power saving. This can be attributed to
the observation (based on the simulations) that the contribution due to inter-
data toggle is almost identical across various nega-binary schemes. Hence the
power saving estimate, given in equation 4.12, can be used to arrive at the
optimum nega-binary scheme. There are two advantages of choosing a nega-
binary scheme this way. One, it does not require actual sampie by sampIe
data, only an overall distribution profile is sufficient. Two, the run times for
computing the best nega-binary scheme are orders of magnitude smaller.
4.3.1.5 A Few Observations

• It was observed that for a given type of distribution (e.g. Gaussian, bimodal
etc.) there was a fixed trend in the best nega-binary representation for
different precisions (i.e. N values). In fact, from a knowledge of the best
nega-binary representation for lower values on N, the scheme for higher
values could be inductively obtained. Table 4.7 shows the best nega-binary
schemes for 5 to 10 bit precision data having a non-zero mean Gaussian
distribution. The strong trend in the nega-binary scheme is at once apparent.
A similar exercise for different distributions showed such a predictable trend
in every case.
• The resuIts given in table 4.7 also indicate that the amount of power saving
reduces with increasing bit precision. This trend can be explained as fol-
lows. The dynamic range of data with B bits of precision is given by 2 B .
With one extra bit of precision (i.e. B+ I), the dynamic range doubles (i.e.
2B + 1 ). Since the mean and the standard deviation ofthe data distribution are
specified as a fraction of max, these also scale (double) with an additional
bit of precision. Thus a data value D with B bits of precision, gets mapped
onto two data values (2D) and (2D+ 1) with (B+ 1) bits of precision. As can
be seen from table 4.7, the nega-binary representation for (B+ 1) bits of pre-
cision is derived by appending '+' to the nega-binary representation for B
bits of precision. Thus the nega-binary representation of (2D) and (2D+ I) is
104 VLSJ SYNTHESJS OF DSP KERNELS
given by appending 0 and 1 respectively to the B bit nega-binary representa-

tion of D. Depending upon the LSB of the B bit nega-binary number, either
(2D) or (2D+ I) results in an extra toggle. By the same argument, depending
on the LSB of the corresponding (B-l) bit 2's complement number, either
(2D) or (2D+ 1) results in an extra toggle. Thus with one additional bit of
precision, the absolute value of reduction in the toggles remains more or less
the same. However, since the total number of toggles in the 2's complement
representation increases, the amount of % power reduction decreases with
each additional bit of precision.
• In an actual simulation of the shifting sequence, each data value contributes
its total toggle count equal to the number of shifts for which the entire data
remains in the virtual shift register. This explains the need for the (N+ 1)/N
factor in saving. This, however, does not account for the toggle contribution
during the stage when a data value is partially in or when it is partially out
of the shift sequence.
• If the number of taps, T, is increased, the 'inter-data' (i.e. between two
consecutive data values in the virtual shift register) toggling contribution
increases as a fraction of the 'intra-data' toggling. As a result, the actual
saving obtained through a shift register simulation is less than that computed
by the saving formula. Similarly, as the bit-precision is reduced, keeping the
number of taps constant, once again the 'inter-data' toggling contribution
increases as a fraction of its 'intra-data' counterpart. This is consistent
with the experimental result for a 16 tap filter, which shows that the toggle
reduction of 61 % for 4-bit precision compared to the reduction of 73% for
8-bit precision.
• As pointed out before, the nega-binary scheme performs weil only with
peaked distributions. For a symmetrical, uniform distribution the 2's com-
plement scheme is better. This is apparent since the nega-binary scheme
is implemented with N+ 1 bits of precision to take care of the entire range.
lt is only in a few regions that the toggle count is lesser compared to its
2's complement counterpart. A uniform distribution nullifies this effect.
Figure 4.18 shows a plot of saving versus the Standard Deviation (SD) ex-
pressed as a percentage of the entire span (256 in this case), for an N=8,
Gaussian distributed data with mean=max!2 (max being the largest positive
number represented, which is 127 in this case). The Gaussian distributions
for SD=8 and SD=44 are shown in figures 4.19 and 4.20 respectively.
4.3.1.6 Additional Power Saving with Nega-binary Architecture

In addition to the toggle reduction in the shift register and consequently in
the address lines driving the LUT, the nega-binary architecture results in toggle
Distributed Arithmetic Based Implemenfafion 105
Table 4.7. Best Nega-binary Schemes for Gaussian Data Distribution ( mean = max/2; SD =
0.17 max)
N Best Nega-binary Scheme (N+ I bit precision) saving

(precision) 21U 2" 2/\ 2 2" 2" 24 2s 2" 2' 2u
5 + - - - + + 25.41 %
6 + - - - + + + 17.87%
7 + - - - + + + + 13.73%
8 + - - - + + + + + 11.16%
9 + - - - + + + + + + 9.42%
10 + - - - + + + + + + + 8.15%
18
16
14
12
10
1
l
Cl
8
c
.:;
'"
(J)
6
-2
2 4 6 8 10 12 14 16 18
SO (% 01 range) --->
Figure 4.18. Saving vs SD Plot for N=8, Gaussian Distributed Data with Mean = max/2
reduction in the LUT outputs as weIl. Such a reduction apart from saving power
in the adder also results in substantial power savings in the LUT itself. Table 4.8
shows the number of Repeated Consecutive Addresses (RCAs) to the LUT for
the 2's complement and the nega-binary case. It is easy to observe that the
number of repeated consecutive addresses in the shift register outputs gives the
number of times no toggles occur in the LUT outputs (since the same contents
0.05
0,045
0,04
0,035
c
f 0.03
nc
0
:> 0025
u..
2:-
'0;
c: 0,02
0'"
0.015
0.0 1
0.005
0
-150 -100 -50 o 50 100 150
VALUE -.-.>
Figure 4.19. Narrow (SD=8) Gaussian Distribution
Table 4.8. Toggle Rcduction in LUT (far 10,000 Sampies; Gaussian Distributed Data)
TAPS N 2's Complement

Nega-binary Nega-binary RCAs Toggle Reduction
Scheme RCAs (% of total) (% of total) (% of 2's Comp.)
4 +--++ 1.94% 41.41% 25.32%
8 6 +--+--- 11.60% 38.55% 18.90%
8 +--+----- 2.80% 24.04% 12.08%
4 +--++ 7,21% 45.62% 26.75%
4 6 +--+--- 9.15% 36.09% 17.93%
8 +--+----- 8.89% 29.65% 13.14%
Average 12% to 25% additional toggle reduction in LUT with the Nega-binary architecture.
are being read). This toggle reduction is, therefore, independent of the filter
coefficients.
A few comments need to be made about these numbers .
• 2's complement RCAs were obtained by counting the number of cases (out
of a possible of 1OOOOxN times the LUT is addressed) where two consecutive
0.01 , - - -- . . . , - - - - - , - - - - - - - , - - - - - - - r - - - - - - , - - - - - - ,
0.009
0.008
0,007
i 0.006
,§
Ü
§ 0.005
LL
~
' ij;
:ii 0.004
o
0.003
0.002
0.001
oL----~----~ww
·150 ·100 -50 o 50 100 150
VALUE •••• >
Figure 4.20. Broad (SD=44) Gaussian Distribution
addresses were identical. A similar computation was performed for the best
nega-binary scheme obtained using the techniques presented in the previous
sections (the total number of cases in this case is obviously 1OOOOx(N+ 1) ).
• Toggle reduction was computed by finding the difference between the num-
ber of times at least one toggle occurred at the LUT output for the two
schemes.
• For all the three different precisions a Gaussian distribution with mean =
max/2 and an SD = 0.2 max was used (max being the largest positive 2's
complement number represented).
4.3.2. Toggle Reduction in Memory Based Implementations by

Gray Sequencing and Sequence Reordering
Techniques have been proposed [70, 93] to eliminate shifting in registers to
reduce power by storing data as memory bits and using a column decoder to
access bits at the desired location. In other words, instead of shifting the data,
the pointer is moved. While such a technique reduces power dissipation due to
data shifting, it results in additional power dissipation in the column decoder.
The following techniques can be used for reducing power in such a shift-less
COUNTER
(Gray Sequence - same
as Routing Sequence)
Figure 4.2/. Shiftless lmplementation of DA Based FIR with Fixed Gray Sequencing
DA implementation.
1. Using a gray sequence in the counter (column decoder) for selecting subse-
quent bits - this would reduce the toggling in the counter outputs which drive
the multiplexers to the theoretical minimum.
2. Using the flexibility of having several gray schemes to choose a data distri-
bution dependent scheme which minimizes toggles in the multiplexer outputs.
Gray coded addressing has been presented in the literature [61] as a technique
for significantly reducing power dissipation in an address bus, especially in
case of a sequential access. Figure 4.21 illustrates a DA based FIR with a
fixed gray sequencing scheme. This results in theoretically minimum possible
toggles occurring in the counter output. As can be seen such an implementation
requires no additional hardware in the basic DA structure.
An N bit gray code can be obtained in N! ways (in a gray code any two
columns can be swapped to obtain another a gray code). This freedom can be
exploited to obtain a data specific gray code which minimizes the toggle count
as successive bits are selected within the register. This gives us dual power
saving : one, in the counter output lines themselves and two, in the multiplexer
output which drives the LUT (i.e. the LUT address bus). There is an additional
overhead of course. Since the register is not scanned sequentially, a simple shift
b
f---+------X[n~lI
~x[n-l]=====1~
LUT
t
X[n-k]
Shirt Count
COUNTER
Figure 4.22. Shiftless Implementation 01' DA Based FIR with Any Sequencing Possible
and accumulate cannot be used_ Instead, a barrel shifter is required, as shown

in Figure 4_22, to shift and accumulate as per the counter sequence_ As with the
best nega-binary scherne, an optimal gray code can be chosen which minimizes
the weighted toggle sum using a saving formula very similar to the one used in
section 4_3.1 A. However, no extra bit precisions are required_
The limitation of the above scheme is that a complete gray code sequencing
is not possible with bit precisions which are not powers of 2_ In such cases, a
partial gray code can be used, i_e. where llog2N J bits are gray coded, N being
the data precision_ This is true because if a complete gray code is used there
will occur values greater that N in the counter sequence which are meaningless_
Partial gray coding obviously reduces the number of different codes that are
available and therefore the address bus power saving, in this case, is lesser.
Table 4_9 shows the weighted toggle data for a 3-bit gray sequencing of an
N=8 Gaussian distributed data with mean = -max/2 and SD = 0_16 max. As can
be seen, the best case toggle reduction is 9.51 %_ The toggle reduction varies
depending on the type of distribution. One important observation is that savings
are obtained for symmetrie Gaussian as weil as symmetrie bimodal data, where
the nega-binary performance deteriorates substantially. A nega-binary scheme
could also be used in conjunction with gray sequencing to obtain additional
power savmgs_ It can be noted that inter-data toggles need not be considered
in this case_
Table 4.9. Comparison of Weighted Toggle Data for Din'erent Gray Sequences
Weighted Toggles for Simple Consecutive Sequencing = 3.786 (Base Case)

No. Gray Sequence Used Wt. Toggles % Saving
1. 01326754 3.439 9.17%
2. 02315764 3.652 3.54%
3. 01546732 3.556 6.08%
4. 02645731 3.619 4.14%
5. 04513762 3.426 9.5%
6. 04623751 3.702 2.22%
Best Case Gray Sequence Saving = 9.51 %
Tables 4.7, 4.8 and 4.9 highlight the effectiveness of the proposed tech-
niques in reducing power dissipation in the DA based implementation of FIR
filter. Here are some more results on power savings obtained for different num-
ber of bits of data precision and different distribution profiles of data values.
Table 4.10 shows the percentage reduction in the number of toggles for two
different Gaussian distributions.
Table 4.10. Toggle Reduction as a Percentage of 2's Complement Case for Two Different
Göussian Distributions
Best Nega-binary Scheme N TRI (%) TR2 (%)

+--++ 4 49.75 % 41.96 %
+--+++ 5 35.63 % 28.95 %
1 +--+++- 6 28.34 % 24.04 %
+--++++-- 8 20.88 % 16.94%
+--++++---- 10 16.59 % 12.80 %
+--++++------ 12 13.52 % 7.17 %
-++-- 4 42.41 % 32.51 %
-++--+ 5 34.07 % 26.74 %
2 -++--++ 6 27.71 % 22.06 %
-++--++++ 8 19.85 % 16.12 '',0
-++--++++++ 10 15.39 % 12.63 %
-++--++++++++ 12 12.55 % 8.68 %
TRI is the weighted toggle reduction as computed using the saving formula;
TR2 is the percentage toggle reduction obtained by using 25000 actual sampies
(i.e. it accounts for the 'inter-data' toggles as weil as the other factors mentioned
in section 4.3.1.5) in an 8 tap filter. The predictable trend in the best case nega-
binary scheme for different precisions is at once apparent. Further, it can be
Table 4./1. Toggle Reduction with Gray Sequencing for N = 8 and Some Typical Distributions
Data Distribution Used ( range = [-128,127] ) Best Gray Toggle

Type Mean SD Sequence Reduction (%)
Gaussian 64 20 04513762 9.46 %
Gaussian -64 20 04513762 9.52 %
Gaussian 0 20 02315764 3.46 %
Gaussian 64 56 01326754 1.92 %
Gaussian Bimodal -64, +64 20 04513762 9.49 %
observed that as the precision increases TR 1 and TR2 values approach each
other, for the reasons mentioned in section 4.3.1.5.
Table 4.11 shows the best case gray sequencing toggle reduction, in the LUT
address bus, obtained for 8-bit precision data with five different distributions.
The first four are Gaussian, and the last one is a Gaussian Bimodal distribu-
tion. As was pointed before, the toggle reduction decreases as the distribution
becomes more and more uniform (i.e. as SD increases).
With gray sequencing, toggle reductions are obtained even with Bimodal
distributions. Nega-binary representations with low toggle regions symmet-
rically distributed about the origin do not exist and therefore in case of such
distributions the nega-binary architecture does not give good results.
Chapter 5
MULTIPLIER-LESS IMPLEMENTATION
Many DSP applications involve linear transfonns whose coefficients are fixed
at design time. Examples of such transforms include DCT, IDCT and color
space conversion kemels such RGB-to-YUv. Since the coefficients are fixed,
the flexibility of a multiplier is not necessary and an efficient implementation
of such transfonns can be obtained using adders and shifters. This chapter
presents techniques for area efficient implementation of fixed coefficient l-D
and 2-D linear transforms.
A 2-D linear transfonn that transfonns N inputs to generate M outputs can
be performed using matrix multiplication as shown below:
Y[l] A[l,l] A[I,2] A[l,N] X[l]
Y[2]
[
A[2,1] A[2,2] A[2,N] X[2]
Y[M] A[M,l] A[M,2] .... A[M,N] X[N]

A l-D transfonn such as an FIR filter can be treated as a special case of a
2-D transform with M = 1.
It can be noted that a 2-D linear transfonn involves two types of computation.
Weighted-sum:An MxN 2-D transform can be computed as M l-D trans-

forms each of length N. A l-D transform is computed as the weighted-sum
ofN inputs with the weights given by the rows ofthe transformation matrix.
2 Multiple Constant Multiplication (MCM): An MxN 2-D transfonn involves
each ofthe N inputs being multiplied by M constants (columns oft he matrix).
It can be noted that the transposed FIR filter structure also performs MCM
type computation where during each output computation, the latest data
sampie is multiplied by all the filter coefficients.
113

X3 X2 Xl XO
AO Al
Figure 5. J. Data Flow Graph for a 4-term Weighted-sum Computation
This chapterpresents techniques forminimizing additions in both these forms

of computation using a common subexpression elimination approach. Such
a technique for MCM based computation has been proposed in [78]. The
technique presented in this chapter is different in that it extracts only 2-bit
common subexpressions during each iteration. Such an approach provides
higher ftexibility enabling more area efficient implementation.
While this chapter focuses on the two types of computations separately, the
two techniques can be combined in the context of 2-D transforms to achieve an
area efficient implementation. Such a combined optimization strategy has been
presented in [75].
The output from the common subexpression elimination based optimization
is adata ftow graph with add and shift operators. The precision ofthe variables in
the data ftow graph varies significantly, especially for higher number of inputs.
This chapter also looks at the high level synthesis of such multi-precision data
ftow graphs.
5.1. Minimizing Additions in the Weighted-sum

Computation
5.1.1. Minimizing Additions - an Example
Consider a 4 term weighted-sum computation (figure 5.1) with the coef-
ficients AO (= 0.0111011), Al (= 0.0101011), A2 (= 1.0110011) and A3 (=
I. 1001 0 I 0) represented in 2's complement 8 bit fixed point format.
The output Y can be computed as
Y = AO . X3 + Al . X2 + A2 . Xl + A3 . XO (5.1)
Replacing the multiplications by additions and shifts, gives
Y = X3 + X3 « I + X3 « 3 + X3 « 4 + X3 « 5 +
Multiplier-Iess Implementation 115
X2 + X2 « 1 + X2 « 3 + X2 « 5 +
Xl + Xl « 1 + Xl « 4 + Xl « 5 - Xl « 7
XO « 1 + XO « 3 + XO « 6 - XO « 7 (5.2)
The above computation requires 15 additions, 2 subtractions and 15 shifts.

However, ifY is computed in terms ofX23 (= X2 + X3) and XOI (= XO + XI)
as shown below :
Y = X3« 4 + Xl + Xl « 4 + Xl « 5 + XO « 3 + XO « 6 +
X23 + X23 « 1 + X23 « 3 + X23 « 5 +
X01 « 1 - X01 « 7 (5.3)
The above computation requires 12 additions (including 2 required to compute

X23 and XO I), 1 subtraction and 10 shifts. The number of additions can be
further reduced by precomputing X 123 = X 1 + X23. Y can then be computed
as
Y = X3«4+X1«4+XO«3+XO«6+
X23 « 1 + X23 « 3 + X01 « 1 - XOl « 7+
X123 + X123 « 5 (5.4)

XOI, X23 and X123), 1 subtraction and 9 shifts. Y can also be computed in
terms of X13 (=XI + X3) and X02 (=XO + X2) as folIows:
Y = X3« 3 + X2 + X2 «5 - Xl «7 + XO« 6 - XO« 7 +

X13 + X13 « 1 + X13 « 4 + X13 « 5 +
X02 « 1 + X02 « 3 (5.5)

XI3 and X02), 2 subtractions and 10 shifts. It can be noted that (X13 «
4 + X13 « 5) can be computed as ((X13 + X13 « 1) « 4). The number
of additions can thus be further reduced by precomputing X13_01 = X13 +
X13 « 1. Y can then be computed as
Y = X3« 3 + X2 + X2 « « 7 + XO «
5 - Xl 6 - XO « 7+
X13_01 + X13_01 « 4 + X02 « 1 + X02 « 3 (5.6)
The above computation requires 10 additions (inc\uding 3 required to compute
X13, X02 and X13_01), 2 subtractions and 9 shifts (including 1 required to
compute X 13_0 I).
The above example shows techniques for reducing the number of additions+
subtractions (17 to 12 - 29% reduction in this case) for implementing weighted-
sum computation.
5.1.2. 2 Bit Common Subexpressions

The minimization techniques presented in the above example are based on
finding common subexpressions (2 bit patterns), computing the subexpressions
first and using the result to compute the filter output. XOl, X23, X02, X13,
X 123, X 13_01 in the equations above are examples ofthese 2 bit subexpressions.
The common subexpressions used in the minimization of additions are of two
types.
(i). Common Subexpressions Across Coefficients (CSACs) which are
identified between 2 coefficients both having 1s in more than one bit locations.
For example, the coefficients AO and AI both have 1s in the bit locations 0,
1, 3 and 5 (bit 10cation 0 being the LSB), resulting in the common subex-
pression X23. It can be noted that the number of bit locations in which the
common subexpression appears, directly decides the reduction in the number
of additions/subtractions and shifts. For example, the subexpression X23 which
appears at four bit 10cations, results in reducing the number of additions by three
and also reducing the number of shifts by three.
(ii). Common Subexpressions Within a Coefficient (CSWCs) which are
identified in a coefficient bit representation having multiple instances of a 2 bit
pattern. For example, the 2 bit pattern' 11 ' appears twice in the coefficient A2
at locations 0,1 and 4,5 (bit location 0 being the LSB) resulting in a common
subexpression. The multiplication (A2· Xl) given by (X 1 + Xl« 1 + Xl«
4+X1« 5-X1« 7),canbeimplementedusingXL01(= X1+X1« 1)
as (XL01 + XLOl « 4 - Xl « 7), resulting in the reduction in the number
of additions by one and the number of shifts by one. It can be noted that the
amount of reduction in the number of additions and shifts depends directly on
the number of instances of the 2 bit pattern in the coefficient bit representation.
The above mentioned subexpression types are further divided into sub-types
so as to handle different representation schemes, such as CSD, in which the
coefficient bits can take values 0, land -1. The CSACs have two subtypes:
(i). CSAC++ in which the two coefficients have non-zeros values in more than
one bit locations, and the values are either both I s or both -1 s, and
(ii). CSAC+- in which the two coefficients have non-zero values in more than
one bit locations, and the values are either 1 and -1 or -1 and I.
The CSWC subtypes CSWC++ and CSWC+- can be defined similarly.
5.1.3. Problem Formulation

Each coefficient multiplication in the weighted-sum computation is repre-
sented as a row of an NxB matrix, where N is the number of coefficients and
B is the number of bits used to represent the coefficients. The total ntlmber of
operations (additions + subtractions) required to compute the output is given by
Multiplier-less lrnplernentation 117
the total number of non-zero bits in the matrix less one. The coefficient matrix
for the 4 term weighed-sum mentioned above is shown below.
7 6 5 4 3 2 1 0
AO 0 0 1 1 1 0 1 1
A1 0 0 1 0 1 0 1 1
A2 -1 0 1 1 o0 1 1
A3 -1 1 o 0 1 0 1 0
The iterative 2 bit common subexpression elimination algorithm works on

this matrix in two phases. In the first phase, common subexpressions across
coefficients(CSACs) are searched. The matrix is updated at the end of every
iteration to reflect the common subexpression elimination in that iteration. The
matrix is updated by adding a new row to the matrix representing the 2 bit
common subexpression. The bit values in the new rows are set depending on
the locations in which the subexpression was identified and the coefficient bit
values in these locations. The two coefficient rows are also updated so as to set
to zero the bit values at the locations in which the common subexpression was
identified.
For example, consider the common subexpression XA23 between coeffi-
cients A2 and A3 at bit locations 7 and 1 (0 being the LSB). This subexpression
can be eliminated so as to reduce number of additions required to compute
the multiplication of coefficients A2 and A3 with the corresponding input data
values. The updated coefficient matrix after eliminating this subexpression is
shown below.
7 6 5 4 3 2 1 0
AO 0 0 1 1 1 0 1 1
A1 0 0 1 0 1 0 1 1
A2 0 0 1 1 000 1
A3 0 1 0 0 1 0 0 0
XA23 -1 0 0 0 001 0
Considering that each 2 bit common subexpression requires 1 addition, the

total number of additions + subtractions for a coefficient matrix at any stage in
the iterative minimization process is given by total number of non-zero bits +
number of subexpressions less 1. For example, the computation represented by
the above coefficient matrix requires 16 additions + subtractions (16 + 1 - 1).
The first phase of the minimization process terminates when no more com-
mon subexpressions across coefficients are possible. The updated coefficient
matrix is then searched for common subexpressions (CSWC) within each row.
These subexpressions are then eliminated to complete the second phase of the
minimization algorithm.
AG
') 4
4
2
A3 AI
2 3
A2
Figure 5.2. Coefficient Subexpression Graph for the 4-term Weighted-sum Computation
5.1.4. Common Subexpression Elimination

The coefficient matrix at any stage in the iterative minimization process
typically has more than ] common subexpressions. For example, the coefficient
matrix (shown in section 5.1.3) has a common subexpression across every pair
of coefficients. Elimination of each of these subexpressions resuIts in different
amount of reduction in the number of additions+subtractions and also affects
the common subexpressions for the subsequent iterations. For example, while
the number of additions+subtractions are reduced by one when XA23 common
subexpression is eliminated, the number of additions+subtractions are reduced
by three when XAOI or XA02 common subexpressions are eliminated. The
choice of a common subexpression for elimination during each iteration thus
affects the overall reduction in the number of additions+subtractions.
The steepest descent approach can be adopted to select the common subex-
pression for elimination. In this approach, during every iteration the com-
mon subexpression that resuIts in maximum reduction in the number of addi-
tions+subtractions is selected. Such a common subexpression is identified by
first constructing a fully connected graph with its nodes representing the rows
in the coefficient matrix. Each pair of nodes in the graph has two edges (E++
and E+-) representing CSAC++ and CSAC+- between the rows represented by
the nodes. These edges are assigned weights to indicate the number of times the
subexpression (represented by the edge) appears between the two coefficient
rows represented by the two end nodes of the edge. Figure 5.2 shows such a
common subexpression graph for the coefficient matrix of the 4 tap filter ex-
ample. Since all E+- edges have 0 weight, only E++ type of edges are shown
in the graph.
The subexpression corresponding to the edge with the highest weight is
selected for elimination during each iteration. In case there are more than one
edges with the same highest weight, one levellookahead is used to decide on the
subexpression to be eliminated. The lookahead mechanism works as follows:
Multiplier-less Implementation 119
for each edge Eij, its end nodes i and j and all the other edges connecting to
the end nodes are deleted from the graph. The modified graph is searched to
find the edge with the highest weight. This weight is assigned as the one level
lookahead weight for the edge Eij. The subexpression corresponding to the
edge with highest one level lookahead weight is selected for elimination.
5.1.5. The Algorithm

Here is the algorithm for minimizing number of additions using the common
subexpression elimination technique.
Procedure M inimize-Additions-jor- Weighted-sum

Input: N filter coefficients represented using B bits of precision
Output: Data flow graph representation of the weighted-sum computation,
with the nodes of the flow graph restricted to add, subtract and shift operations.
/* Process the given set of coefficients */

Eliminate coefficients with 0 non-zero bits (i.e. value 0)
Merge the coefficients with same value. This applies to transforms such as
linear phase FIR filters with symmetric coefficients.
/* Phase I - */
Construct the initial coefficient matrix of size NxB, where N is the number of
coefficients after the processing and B is the number of bits used to represent
the coefficients.
repeat {
Construct subexpression graph
Find the edge with highest weight and highest one level lookahead weight
Update the coefficient matrix so as to eliminate the common subexpression
} until (highest weight < 2)
/* phase 11 : */
for the first B bits of each row {
Extract all subexpressions of type CSWC++ and CSWC+-
for each subexpression {
Find the bit distance - given by the distance between
the bit locations of the subexpression
}
Find pairs of subexpressions with same type and equal bit distances
Eliminate the common subexpressions by representing one of the
subexpressions as a shifted version of the other subexpression
Update the Coefficient matrix to reflect this elimination
}
Output the data flow graph in terms of shifts, additions and subtractions.
Table 5.1. Number of Additions+Subtractions (Initial and After Minimization)
Filter Initial Weighted-sum

# taps # +/-s # +/-s %reduction
16 51 37 27.5%
24 83 58 30.1%
32 95 65 31.6%
36 88 67 23.9%
40 120 84 30.0%
48 123 92 25.2%
64 169 116 31.4%
72 201 130 35.3%
96 225 157 30.2%
128 270 191 29.6%
Table 5.1 gives the results in terms of number of additions + subtractions

for weighted-sum computation performed in the context of 10 low pass FIR
filters. The number of taps of these filters range from 16 to 128. These filters
have been synthesized using Park-McClellan's algorithm and the coefficients
are quantized to 16 bit fixed-point format. The initial number of operations
for these filters correspond to a coefficient representation scheme in which
the non-zero bits of a coefficient are either all 1s or all -1 s. Table 5.1 shows
that the common subexpression elimination algorithm reduces the number of
additions+subtractions by as much as 35%.
5.2. Minimizing additions in MCM Computation

5.2.1. Minimizing Additions - an Example
Consider a 4-term MCM computation shown in figure 5.3 with four coef-
ficients AO (= 0.0111011), Al (= 0.0101011), A2 (= 1.0110011) and A3 (=
1.1001010) represented in 2's complement 8 bit fixed point format.
The computation of outputs YO, Y 1, Y2 and Y3 is given by
YO AO·X (5.7)
Yl Al·X (5.8)
Y2 A2·X (5.9)
Y3 A3·X (5.10)
Replacing the multiplications by additions and shifts, gives
YO X+X«1+X«3+X«4+X«5 (5.1l)
Yl X+X«1+X«3+X«5 (5.12)
x----~-----------.-----------,-----------.
AO Al A2 A3
YO YI Y2 Y3
Figure 5.3. Data Flow Graph for 4 term MCM Computation
Y2 X+X«1+X«4+X«5-X«7 (5.13)
Y3 X«1+X«3+X«6-X«7 (5.14)
The above computation requires 12 additions, 2 subtractions and 15 shifts. As

can be seen from figure 5.3, these intermediate values are fed to the 3 adders
to complete the MCM computation. Thus the output computation requires
15 additions, 2 subtractions and 15 shifts, which is the same computational
complexity as required for the output computation shown in 5.2.
The number of additions/subtractions in the MCM computation can be mini-
mized using the techniques given below. This can be achieved by precomputing
X _01 (= X + X « 1) and using it to compute the 4 intermediate outputs as
folIows:
YO X _01 + X « 3 + X « 4 + X « 5 (5.15)
Y1 X _0 1 + X « 3 + X « 5 (5.16)
Y2 X _0 1 + X « 4 + X « 5 - X « 7 (5.17)
Y3 X«1+X«3+X«6-X«7 (5.18)
The above computation requires 10 additions ( including 1 addition required

to compute X_O 1 ), 2 subtractions and 13 shifts (including I shift required to
compute X_O 1).
The number of additions can be further reduced by precomputing X A5( =
X « 4 + X « 5), and computing YO to Y3 as folIows:
YO X _01 + X« 3 + X A5 (5.19)
Y1 X _0 1 + X « 3 + X « 5 (5.20)
Y2 X_01 + XA5 - X« 7 (5.21)
Y3 X«1+X«3+X«6-X«7 (5.22)
The above computation requires 9 additions ( inc\uding 2 additions required

to compute X_OI and XA5), 2 subtractions and 11 shifts (inc\uding I shift
required to compute X_O land 2 shifts required to compute XA5).
The number of additions can be further reduced by precomputing X _013
X _01 + X « 3, and computing YO to Y3 as folIows:
YO X_013 + XA5 (5.23)

Yl X_013 + X« 5 (5.24)
Y2 X_Ol + XA5 - X« 7 (5.25)
Y3 X«1+X«3+X«6-X«7 (5.26)
The above computation requires 8 additions (inc\uding 3 additions required

to compute X_OI, XA5 and X_OI3), 2 subtractions and 10 shifts (inc\uding I
shift required to compute X_O I, 2 shifts required to compute XA5 and 1 shift
required to compute X_Ol3).
It can be noted that XA5 can be computed using X_OI as X A5 = X _01 « 4
thus further reducing the number of additions by 1 and also reducing the number
of shifts by 1.
The above example shows techniques for reducing the number of additions+
subtractions (17 to 12 - 29% reduction in this case) for implementing MCM
based structures. This reduction is similar (same in this case) to that achieved
for the weighted-sum computation, and uses similar techniques of finding 2
bit common subexpressions, computing the subexpressions first and using the
result to compute the intennediate filter outputs. X_OI, XA5, X_Ol3 in the
equations above are examples of these 2 bit subexpressions.
5.2.2. 2 Bit Common Subexpressions

In addition to CSWC type of subexpression, the minimization process for
MCM based structures uses Common Subexpressions across Bit Locations
(CSABs). The CSABs are identified between two bit locations both having
I s for more than one coefficients. For example, the bit values at bit locations
° and 1 (0 being the LSB) are both 1 in case of coefficients AO, Aland A2,
resulting in the common subexpression X_OI. It can also be noted that the
number of common coefficients directly decides the reduction in the number of
additions/subtractions and shifts. For example, the subexpression X_OI which
is common for three coefficients, results in reducing the number of additions
by two and the number of shifts also by two.
The CSABs can be easily generalized to handle different representation
schemes, such as CSD, in which the coefficient bits can take values 0, 1 and -1.
This is achieved by defining two subtypes :
i. CSAB++ in which the bit values at two bit locations are non-zero for more
than I coefficients, and the values are either both I s or both -1 s, and
ii. CSAB+- in which the bit values at the two bit locations are non-zero for
more than one coefficients, and the values are either 1 and -1 or -1 and 1.
5.2.3. Problem Formulation

Each coefficient multiplication in the MCM computation is represented as
a row of an N x B matrix, where N is the number of coefficients and B is
the number of bits used to represent the coefficients. The total number of
additions+subtractions required to compute the intermediate outputs is given
by (the total number of non-zero bits in the matrix - N). The coefficient matrix
for the 4 tap filter mentioned above is shown in section 5.1.3.
The iterative 2 bit common subexpression elimination algorithm works on
this matrix in four phases. In the first phase, CSABs are searched. The
coefficient matrix is updated at the end of every iteration to reftect the common
subexpression elimination in that iteration. This is done by adding a new column
to the coefficient matrix, which represents the 2 bit common subexpression. The
bit values in the new column are set depending on the locations in which the
subexpression was identified and the coefficient bit va1ues in these locations.
The two bit columns are also updated so as to set to zero the bit values at the
locations in which the common subexpression was identified.
For example, consider the common subexpression XA5 between coefficients
AO and A2 at bit locations 4 and 5. The updated coefficient matrix after elimi-
nating this subexpression is shown below.
7 6 5 4 3 2 0 X35
1
AO o0 o 0 1 0 1 11
Al o0 1 0 1 0 1 01
A2 -1 o 0000 1 11
A3 -1 1 0 010 1 0 0
Considering that each 2 bit common subexpression requires 1 addition, the

total number of additions+subtractions to compute the intermediate outputs at
any stage in the iterative minimization process is given by (the total number of
non-zero bits + number of subexpressions - N). For example, the computation
represented by the above coefficient matrix requires 13 additions+subtractions
(16 + 1 - 4). The first phase of the minimization process terminates when no
more common subexpressions across coefficients are possible.
In the second phase of the minimization process, CSWCs are searched in
the updated coefficient matrix and are eliminated.
In the third phase, all the identified subexpressions (CSABs and CSWCs)
are searched to check for 'shift' relationships among them i.e. whether a subex-
pression can be realized by left shifting another subexpression by some amount.
In the example shown above, the subexpressions X_OI and XA5 share such a
'shift' relationship given by (X A5 = X _01 « 4). XA5 can hence be realized
in tenns of X_O I so as to reduce the number of additions by one and reduce the
number of shifts by one.
In the fourth and the final phase of the optimization process, the coefficient
matrix (first B columns) is searched for two 2 bit subexpressions with 'shift'
relationship among them. Such expressions can also be eliminated so as to
reduce the number of additions. For example, consider two coefficients AO =
0.0101010 and AI = 0.1000101, with corresponding YO and YI computations
given by :
YO X«1+X«3+X«5 (5.27)
Yl X+X«2+X«6 (5.28)
While no CSABs can be found for these coefficients, there exist subexpressions
X_13 (in AO) and X_02 (in A 1) that are related by 'shift' relationship. YO and
Y I can hence be recomputed in tenns of X _02 (= X + X « 2) as follows
YO X _02« 1+X « 5 (5.29)

Yl X_02 + X« 6 (5.30)
thus reducing the number of additions by one.
5.2.4. Common Subexpression Elimination

Just as in case of I: A[i]X[n - i] based structure, the coefficient matrix at
every stage in the iterative minimization process typicaJly has more than one
common subexpressions. The choice of a common subexpression for elimina-
tion during each iteration thus affects the overall reduction in the number of
additions+subtractions.
The steepest descent approach, similar to that used for I: A[i]X[n - i] based
structure, is adopted to select a common subexpression for elimination. How-
ever since CSABs are to be searched the subexpression graph is constructed
with the columns of the coefficient matrix being its nodes. Each pair of nodes in
the graph has two edges (E++ and E+-) and weights are assigned to these edges
to indicate the number of times CSAB++ and CSAB+- appear respectively
between the columns represented by these nodes.
The subexpression corresponding to the edge with the highest weight is
selected for elimination during each iteration. In case there are more than one
edges with the same highest weight, one level lookahead is used to decide on
the subexpression to be eliminated. The one levellookahead weight for an edge
is computed in the same way as presented in section 5.1.4.
5.2.5. The Algorithm

Here is an algorithm for minimizing the number of operations (additions +
subtractions) using the common subexpression precomputation technique.
Procedure M inimize-Additions-Jor-M CM
Input: N filter coefficients represented using B bits of precision
Output: Data ftow graph representation of the MCM computation, with the
nodes of the ftow graph restricted to add, subtract and shift operations.
/* Process the given set of coefficients */

Eliminate coefficients with less than 2 non-zero bits
Merge the coefficients with same value. This applies to transforms such as
linear phase FIR filters with symmetrie coefficients.
/* Phase I: */
Construct the initial coefficient matrix of size N x B, where N is the number
of coefficients after the processing and B is the number ofbits used to represent
the coefficients.
repeat {
----Construct subexpression graph
Assign weights to the edges based on the number of CSABs
Find the edge with the highest weight and the highest one level lookahead
weight.
Update the coefficient matrix so as to eliminate the common subexpression
} until (highest weight < 2)
/* phase 11 : */
.. same as in section 5.1.5
/* phase 111 : */
Find bit distances for all the common subexpressions
Find pairs of subexpressions of the same type and with equal bit distances
for each pair {
Eliminate one of the subexpressions by representing it
as a shifted version of the other subexpression.
}
/* phase IV: */
Extract all 2 bit patterns in the first B columns
Find bit distances between these 2 bit subexpressions
Find pairs of subexpressions with equal bit distances
Replace one of the subexpressions as a shifted version of the other subexpres-
slon
Output the signal ftow graph in terms of shifts, additions and subtractions.
Table 5.2 gives the results in terms of number of additions+subtractions re-

quired for MCM computation as part of the transposed FIR filter structure. The
same (as used in table 5.1) 10 FIR filters with number oftaps ranging from 16 to
128 have been used. The results show that the number additions+subtractions
Table 5.2. Number of Additions+Subtractions for Computing MCM Intermediate Outputs
# taps initial +/-s final +/-s initial/final

16 36 21 1.7
24 60 29 2.1
32 64 29 2.2
36 53 27 2.0
40 81 38 2.1
48 76 34 2.2
64 106 44 2.4
72 134 57 2.4
96 132 51 2.6
128 155 60 2.6
can be reduced by an average factor of 2.2. This is much higher than the factor
of 1.43 (avg.) for the FIR filters presented in [78].
5.2.6. An Upper Bound on the Number of Additions for MCM

Computation
Consider an MCM computation where a variable X is being multiplied by N
constants using B bits of precision. Since using B bits of precision, 2 B distinct
constants can be represented, the N constants can have atmost 2B unique values.
Since aB bit constant can have atmost BIs and multiplication by aB bit constant
can be perfonned using atmost B-l additions. Thus the number of additions
for MCM computation has an upper bound given by (2 B . (B - 1)) [78]. This
upper bound however is pessimistic due to the following reasons:
One of the 2B constants is a '0' and multiplication by a '0' does not require
any addition.
2 There are B constants whose binary representations have only one '1'. Mul-
tiplication by such a constant can be perfonned using a shift operation and
hence does not require any addition.
3 Only one of the 2B constants has B number of 1s. The average number of
1s per constant can be shown to be B /2.
4 Consider constants NI and N2 such that N2 = NI < < K, where K is the
amount of left shift. N2· X can then be computed as ( (NI· X) < < K), thus
requiring no addition. For example, once the multiplication (00000111· X)
is computed, multiplications by 00001110, 00011100,00111000, 01110000
and 11100000 can be computed by appropriately left shifting (00000111 .
X).
Multiplier-less ImplementGtion 127
5 The above mentioned upper bound does not comprehend the reduction
achieved using the common subexpression precomputation technique. This
point has also been highlighted in [78].
Based on the above observations, a tighter upper bound on the number of
additions can be obtained by first coming up with a subset of constants which
have more than one 1s. This subset can be further reduced by eliminating those
constants that can be obtained by left-shifting other constants in the subset. In
other words, in the reduced subset no two constants are related by just a shift
operation.
For a given constant NI (with more than one number of Is in its binary
representation), another constant N2 can always be found such that N2 has one
less number of 1sand the Hamming distance between NI and N2 is one. The
multiplication NI . X can hence be computed as one addition of N2 . X with
appropriately left shifted X.
Based on the above analysis, the multiplication by each member of the re-
duced subset can be computed using just one addition. The upper bound on the
number of additions is thus given by the cardinality of the reduced subset.
It can be noted that no two constants with '1' as their LSBs can be related
by just a shift operation. It can also be noted that for a constant NI whose
LSB is '0', there always exists a constant N2 with '1' as its LSB, such that
NI = N2 < < K, where the amount of left shift K is given by the number of
continuous Os as the LSBs ofNl. For example, for NI = 00101000, there exists
N2 = 00000101, such that NI = N2 < < 3. Based on these observations the
reduced subset consists of those constants wh ich have more than one 1sand
have 'I' as their LSBs. It can easily be shown that for a B bit number, the
cardinality of such a reduced subset is given by 2 8 - 1 - 1. This hence is an
upper bound on the number of additions for MCM computation.
The analysis presented above assumes constants with B bit unsigned repre-
sentation. A similar analysis can be performed in case of constants with B bit
2's complement representation. These constants (except for - 2 8 - 1 ) can also
be represented using a B bit signed-magnitude representation. It can be noted
that multiplying a variable X with a negative constant can be achieved by multi-
plying -X with the corresponding positive constant. Thus once -X is computed,
multiplication by all negative constants which have only one '1' in their magni-
tude part, can be implemented using only a shift operation. The multiplication
by the constant - 2 B -1 can also be handled the same way. It can also be noted
that a multiplication of a negative constant (say -NI) with a variable X, can be
computed using one subtraction as (- NI . X) = 0 - (NI· X).
The B bit constants can be divided into a set of (B-I) bit positive constants
and a set of (B-I) bit negative constants. Since the reduced subset of (B-I) bit
constants has the cardinality of (2 8 - 2 - 1), the MCM computation using the
positive constants can be achieved using (2 8 - 2 - 1) additions. Similarly, the
MCM computation using the negative constants can be achieved using (2 B - 2 _

1) additions. Considering an extra subtraction required to compute -X, the
upper bound on the number of additions for MCM computation with B bit 2's
complement constants is given by (2 B - 2 -1) + (2 B - 2 -1) + 1 = (2 B -- 1 -1).
It can be noted that this upper bound is same as that for unsigned constants.
5.3. Transformations for Minimizing Number of Additions

While the common subexpression precomputation technique helps in reduc-
ing the number of additions, the optimality ofthe final solution also depends on
the initial coefficient representation. This section presents two types of coef-
ficient transforms to generate different coefficient representations as the initial
solution. It also shows how these transforms along with the common subex-
pression precomputation technique result in an area efficient implementation.
5.3.1. Number Theoretic Transforms

Consider an N bit binary representation in terms of bits bo to bN - 1 , bit bo
being the LSB. Within this generic framework, various representation schemes
are possible that differ in terms of weights associated with each bit and the
value that each bit can take. For example, the weight associated with an i th
bit can be 2i or -2 i and the bit values can be either from set {0,1} or from
set {O, 1, -I}. The bit pattern for a number can thus vary depending on the
representation scheme used. Since the number of non-zero bits directly impact
the number of additions/subtractions, the choice of the representation scheme
can significantly impact the total number of additions required to compute the
filter output.
This section presents three binary representation schemes that result in dif-
ferent number of non-zero elements and consequently impact the number of
additions.
5.3.1.1 2's Complement Representation

2's complement is the most common representation scheme for signed num-
bers. In this scherne, the bits can take values 0 or 1. Far an N bit number, the
weight associated with the MSB is - 2N -1. The weight associated with any
other bit location i is 2i . The value of a number represented in 2's complement
is hence given by (-b N -1 . 2 N -1) + 2:[':0 2 bi ·2 i
Consider 8 bit 2's complement representations of3 and -3 given by 00000011
and 11111101 respectively. It can be noted that in terms of number of non-zero
bits, 2's complement is not the best representation for -3. In general, 2's comple-
ment representations of small negative numbers have higher number of non-zero
(I s) entries. Thus for transforms having coefficients with small negative values,
2's complement may not be the optimal coefficient representation scherne.
Multiplier-less Implementatiofl 129
5.3.1.2 Uni-sign Representation

The limitation of 2's complement representation requiring higher number
of 1s to represent small negative numbers can be overcome by employing a
scheme in which the bit values can be either from set {O, I} or from set {0,-1 }.
Since all non-zero bits in such a representation have the same sign, it is called
the uni-sign representation. It can be noted that this representation is similar to
sign-magnitude representation, except that the overhead of sign bit is eliminated
by embedding the sign information in the non-zero bits.
In the uni-sign representation scheme 8 bit representation of 3 and -3 is
given by 00000011 and OOOOOONN (N indicates bit value -1) respectively. This
scheme thus enables representation of small negative numbers with fewer num-
ber of non-zero bits. It can be noted that the number of non-zero bits in the
uni-sign representation are same as that obtained by the selective coefficient
negation technique presented in 2.3.1. As per the results in 2.2, using this
scheme the number of non-zero bits can be reduced by as much as 56% com-
pared to 2's complement representation of FIR filter coefficients.
Since the range of numbers that can be represented using this scheme ( - (2 N -
1) to (2 N - 1)) is more than the 2's complement representation (_(2 N - 1 ) to
(2 N - 1 - 1), any coefficient value (except the most negative value) in N bit 2's
complement representation can be converted to this scheme using one less bit
(i.e. N-l) of precision.
5.3.1.3 Canonical Signed Digit (CSD) Representation

The schemes discussed in 3.1 and 3.2 are not efficient in terms ofrepresenting
numbers such as 15,-15,31,-31 which result in continuous streams of 1s or
Ns (-ls). Consider 8 bit representation of 31, which in both the schemes is
given by 00011111. The number of non-zero bits can be reduced to 2 by
representing 31 as (32-1). This can be achieved in two ways, (i) using nega-
binary representation [106] scheme in which the bit bi has the weight of - (- 2)i.
Number 31 can then be represented as 00100001. (ii) using a scheme such as
CSD [10 I] which allows the bit values to be 0, 1 or -1. Number 31 can then
be represented as OOIOOOON. It can be noted that the CSD representation is
more flexible than the nega-binary representation in terms of values that can
be represented at each bit location. The bit-value f1exibility enables CSD to
provide a representation that is guaranteed to have the least number of non-zero
bits.
While CSD gives locally optimal solution resulting in minimum number of
additions to implement a coefficient multiplication, it does not always result in
a globally optimal solution for implementing MCM computation. Here are two
examples that demonstrate this. Consider the following computation using 8
bit 2's complement or uni-sign representation scheme :
Yo = 17 . X = 00010001 . X
YI = 19 . X = 00010011 . X
Using the common subexpression precomputation technique the above compu-
tation can be performed using two additions as follows
Y o =X+X«4
YI =Yo + X « l
Using 8 bit CSD scheme no common subexpression exists. This computation
thus requires two additions and one subtraction (which is one extra compared
to Uni-sign representation) as shown below
Yo = 00010001 . X = X + X < < 4
YI = 0001010N . X = X < < 4 + X < < 2 - X
Here is another example, where CSD reduces total number of non-zero bits
in coefficients but does not minimize total number of additions+subtractions
across coefficient multiplications. Consider the following computation with
coefficients represented in 8 bit 2's complement form
Y = 00010101 . Xl + 10011101 . X 2 + 10011001 . X 3
Using the techniques presented in section 5.1, the above computation can be
performed using six additions and one subtraction as folIows:
Tl = Xl + X 2
T2 = X 2 + X 3
Y = X 3 + Tl + Tl < < 2 + T 2 < < 3 + T 2 < < 4 - T 2 < < 7
Using CSD representation the computation to be performed is
Y = 00010101 . Xl + N0100N01 . X 2 + N010N001 . X 3
Using the techniques presented in section 5.1, this computation can be per-
formed using five additions and three subtractions as folIows:
T2 = X 2 + X 3
Y = X I + X I < < 2 + Xl< < 4 + T 2 + T 2 < < 5 - T 2 < < 7 - X 2 < <
2-X3 « 3
While the total number of non-zero bits is reduced by 1 (12 to 11) using CSD
representation, it results in an extra computation.
These examples highlight the role of different number theoretic transforms in

reducing number of additions required to implement multiplier-Iess FIR filters.
5.3.2. Signal Flow Graph Transformations
The transformations discussed in section 5.3.1 do not alter the coefficient

values. This section presents two transformations which are applicable specif-
ically to FIR filters. These transforms used in conjunction with the number
theoretic transforms help in further reducing the number of computations.
The FIR signal flow graph can be transformed so as to compute the output
Y[n] in terms of input data values and the previously computed output Y[n-I].
Multiplier-less Implementatioll 131
-I -I -I -I -I
Z Z Z Z Z
X[nl---+----,-~--,----+-----,-----+-.-----7--,
A[O]
f---+--'----7 Y[ n 1
Figure 5.4. SFG Transformation - Computing Y[n] in Terms of Y[n-l]
This can be done as folIows:

N-l
Y [n - 1] = L A [i] . X [n - 1 - i] (5.31)
i=O
N-l
Y[n] = L A[i] . X[n - i] (5.32)
i=O
By adding the LHS of equation 5.3 t and subtracting the RHS of equation 5.31
to the RHS of equation 5.32 gives :
N-l N-l
Y[n] = L A[i]· X[n - i]- L A[i]· X[n - 1 - i] + Y[n - 1] (5.33)
i=O i=O
N-l N-2
Y[n] A[O] . X[n] + L A[i] . X[n - i]- L A[i] . X[n - 1 - i]
i=l i=O
-A[N - 1] . X[n - N] + Y[n - 1] (5.34)
N-l
Y[n] A[O]· X[n] + L (A[k] - A[k - 1]) . X[n - k]
k=l
- A[ N - 1] . X [n - N] + Y [n - 1] (5.35)
Figure 5.4 shows the signal flow graph of a 4 tap FIR filter transformed using
the above mentioned approach.
The direct form structure of an N tap FIR filter requires N multiplications
and N-I additions. With the above mentioned SFG transformation, the resultant
structure (figure 5.4) requires (N+ 1) multiplications and (N+ 1) additions. While
this transform results in more computation, it also modi fies the filter coefficients.
If the saving in the number of additions due to the modified filter coefficients is
more than the overhead of the additional computation, this transformation can
result in an area-efficient multiplier-Iess FIR implementation. Such a possibility
is higher in case of linear phase FIR filters because for such filters this SFG
transformation retains the number of multiplications required to compute the
output. This can be proved by analyzing coefficient symmetry property of the
transformed SFG.
The coefficient symmetry property (stated below) of the linear phase filters
A[i] = A[N - 1 - i] (5.36)
ean be used to reduee the number of multiplieations by half in the direet form
FIR implementation.
For an N tap FIR filter, the eorresponding transformed strueture (equa-
tion 5.35) has N+ I eoeffieients C[O] to C[N). Ifthe original filter has symmetrie
eoeffieients (linear phase) the eoeffieients of the transformed strueture are anti-
symmetrie as shown below. i.e. C[i] = -C[N - i].
From 5.35, C[O] = A[O] and C[N] = -A[N - 1]
From 5.36, A[O] = A[N - 1]
From the above two equations, C[O] = -C[N] ... proved for i=O
From 5.35, C[j] = A[j]- A[j -1] and C[N - j] = A[N - j]- A[N - j -1]
From 5.36, A[N - j] = A[j - 1] and A[N - j - 1] = A[j]
From the above two equations, C[j] = A[j] - A[j - 1] and C[N - j] =
A[j - 1] - A[j]
henee C[j] = -C[N - j] ... proved.
An N tap linear phase filter requires N/2 multiplieations if the number of
coefficients is even and requires (N+ 1)/2 multiplications if the number of eo-
efficients is odd. If N is odd, the transformed filter has even number (N+ 1) of
eoeffieients whieh are anti-symmetrie and henee require (N+ 1)/2 multiplica-
tions. For N even, the transformed filter has odd number (N+ 1) of eoeffieients
and henee requires (N+2)/2 number of multiplieations. However, sinee from
(5.36) (A[N/2] = A[N/2-l D, the eoeffieient C[N/2] = A[N/2] - A[N/2-1] = O.
Thus for N even, the transformed filter requires N/2 number of multiplieations.
For example eonsider the SFG shown in figure 5.4. If the original fi Iter has lin-
ear phase, the eoeffieient values A[ 1] and A[2] are same, henee the eoeffieient
(A[2]-A[ 1]) in this SFG is O. This SFG thus requires two multiplieations and
four additions, as against two multiplieations and three additions required by
the direet form 4 tap linear phase filter.
The above analysis shows that this signal f10w graph transformation retains
the number of multiplieations required in ease of linear phase FIR filters, and
provides an opportunity to reduce the number of additions by altering the eo-
effieient values. As an example eonsider the ease of A[2] = 19 = 00010011
and A[3] = -13 = OOOONNON. The transformed strueture will have a eoeffieient
C[3] = A[3] - A[2] = -32 = OOONOOOOO whieh has just one non-zero bit.
The eomputation of Y[n] in terms of Y[n-l] can also be aehieved by sub-
traeting the LHS of equation 5.31 and adding the RHS of equation 5.31 to the
-I -I -I -I -I
Z Z Z Z Z
X[ n]----).---;--~-,---~--,------+--r---+---,
[0]
r----)--'------7 Y [n]
Figure 5.5. SFG Transformation - Computing Y[n] in Terms of Y[n-l]
RHS of equation 5.32. Y[n] is thus computed as :

N-l
Y[n] A[O]· X[n] + L (A[k] + A[k - 1]) . X[n - k]
k=l
+A[N - 1] . X [n - N] - Y [n - 1] (5.37)
The resultant signal f10w graph is shown in figure 5.5.
It can be shown that for a linear phase FIR filter, the coefficients ofthe modi-
fied SFG are also symmetric. Thus for odd number of taps, this transformation
requires the same number of multiplications (N + 1)/2 as the original struc-
ture. In case of even number of taps, unlike the above mentioned transform,
no coefficient cancellation is possible. This transformation hence results in a
signal f10w graph that requires (N + 2)/2 (i.e. one more) multiplications.
It can be noted from figures 5.4 and 5.5 that this SFG transformation coverts
the FIR structure to an UR structure. The resultant IIR structure has a pole and a
zero on the unit circle in the Z plane. The pole-zero cancellation is essential for
the filter response to be stable. This can be achieved by retaining full numerical
precision while performing the UR computations.
5.3.3. Evaluating Effectiveness of the Transformations

This subsection presents results that evaluate the effectiveness of various
coefficient transforms in minimizing the number of additions required for the
MCM computation as part of the transposed form FIR filter structure. For each
filter, the initial number of additions (without common subexpression elimina-
tion) and the final (minimized) number of additions (after common subexpres-
sion elimination) for the following coefficient transforms is looked at:
I. 2's: Direct form structure with coefficients in 2's complement form.
11. allp: Direct form structure with coefficients in uni-sign form.
III. diff-allp: Transformed SFG (as in figure 5.4) with coefficient differences
represented in uni-sign form.
IV. sum-allp: Transformed SFG (as in figure 5.5) with coefficient sums repre-
sented in uni-sign form.
V. csd: Direct form structure with coefficients in CSD form.
2.5
B - - -
<J
co
LL
C
0
U 1.5
::l r- ,-
U
Ql ,-
er:
0.5
2's allp diff-allp sum-allp csd diff-csd sum-csd
Figure 5.6. Average Reduction Factor Using Common Subexpression Elimination
VI. diff-csd: Transfonned SFG (as in figure 5.4) with coefficient differences
represented in CSD fonn.
VII. sum-csd: Transfonned SFG (as in figure 5.5) with coefficient sums rep-
resented in CSD form.
The bar chart in figure 5.6 shows the average reduction factor for various
transforms. This can be used to analyze the impact of coefficient transfonns
on the amount of minimization achieved using common subexpression elimi-
nation. As can be seen from the bar chart, the common subexpression elimina-
tion results in maximum reduction when the coefficients are represented in 2's
complement fonn. It can be noted that among all the coefficient transforms, 2's
complement coefficient representation results in maximum number of additions
in the initial solution (i.e. without common subexpression elimination). The
trend in figure 5.6 shows that for a given filter, if a transform results in higher
total number of non-zero bits in the coefficient representations, the higher is
the reduction achieved using common-subexpression elimination. This also
indicates that if a coefficient transform results in the best initial solution, it may
not always give the best final solution. Hence it is important to explore across
coefficient transforms to get the most optimal implementation.
5 dilf-csd
4
csd
csd
3 dilf-csd dilf-csd
dilf-csd csd csd csd
dilf-csd
csdsum-csd csd
2 diff-csd
LP16 LP19 LP24 LP27 LP32 LP36 LP40 LP48 LP53 LP61 LP64 LP72 LP96 LP128
Figure 5.7. Best Reduction Factors Using Coefficient Transforms Without Common Sub-
expression Elimination
The bar chart in figure 5.7 gives the best reduction factors w.r.t. initial so-
lutions (i.e. 2's complement representation) using coefficient transforms alone
(i.e. without applying common subexpression elimination). It can be noted that
reduction by a factor as high as 4.9 can be achieved. Figure 5.7 also shows for
each filter the coefficient transform that results in most reduction. As expected
CSD representation results in maximum reduction for all the filters. It can also
be noted that CSD used in conjunction with SFG transformations that compute
Y[n] in terms of Y[n-l] result in maximum reduction in 50% of the cases.
The bar chart in figure 5.8 gives the best reduction factors using the coeffi-
cient transforms in conjunction with common subexpression elimination. The
reduction factors are computed with respect to an initial solution given by ap-
plying common subexpression elimination on coefficients represented in 2's
complement form. It can be noted that reduction by a factor of as high as 2 can
be achieved. Figure 5.8 also shows for each filter the coefficient transforms that
result in most reduction. It can be noted that for a filter, multiple coefficient
transforms can result in the best final solution. As an example, for LP64 three
coefficient transforms (Il. allp, V. csd and VI. diff-csd) result in the best final
solution. It can also be noted that a coefficient transform that results in the best
csd
2
diff-csd csd
1.8
csd
diff-csd diff-allp
1.6 allp diff-csd
diff-csd allp csd
csd csd
diff-csd csd
1A csd sum-csd allp csd
sum-csd
~ csd
'c"
l.L. 1.2
.f50
::J
"0
Q)
er:
0.8
0.6
OA
0.2
LP16 LP19 LP24 LP27 LP32 LP36 LP40 LP48 LP53 LP61 LP64 LP72 LP96 LP128
Figure 5.8. Best Reduction Factors Using Coefficient Transforms with Common Sub-
expression Elimination
initial solution may not in all the cases result in the best final solution. As an
example consider LP48 for which CSD (V) transform results in the best initial
solution but allp (II) transform results in the best final solution.
The bar chart in figure 5.9 gives the number of times each coefficient trans-
form results in the best final solution. As can be seen from the figure while
CSD gives the best solution in most of the cases, the uni-sign representation
can also in some cases perform better than CSD. The figure also highlights the
role of SFG transformations shown in figures 5.4 and 5.5, which result in the
best final solution in 8 out of the 14 filters.
This data can also be used to compare the two signal flow graph transfor-
mations shown in figures 5.4 and 5.5, which result in SFG structures with
coefficient-differences and coefficient-sums respectively. As discussed earlier,
the transformation in figure 5.5 always results in one more number of coeffi-
cient multiplications in case of even order filters. 1t hence results in a relatively
higher number of additions for even order filter. Overall, the SFG transforma-
tion based on coefficient differences (figure 5.4) provides higher reduction than
the transform based on coefficient sums (figure 5.5).
Multiplier-less lmplementation 137
10 .--
.--
4
2 ,--
o ,,~
äliP
n
Ulll-allp ~Urll-äll~ es ul -G:iU SU - ~U
Figure 5.9. Frequency of Various Coefficient Transforms Resulting in the Best Reduction
Factor with Common Sub-expression Elimination
The resuIts presented in this section thus demonstrate the role of various
transforms in minimizing number of computations in the muItiplier-less imple-
mentations of FIR filters. The various coefficient transforms discussed in this
chapter enable exploration of a wider search space resulting in the best final
solution.
5.3.4. Transformations for Optimal Initial Solution

In addition to the SFG restructuring transformations presented in the earlier
section, the following two algorithmic transformations also enable obtaining an
optimal initial solution for the multiplier-Iess implementation of FIR filters.
5.3.4.1 Coefficient Optimization

The coefficients of an FIR filter can be suitably modified so as to reduce the
number of I s in the coefficients while satisfying the desired filter characteristics
such as passband ripple and stopband attenuation. These techniques target both
the 2's complement binary representation [28] and the CSD representation [83,
105] for the coefficients. The coefficient optimization algorithm presented in
section 2.4.5 can be adapted by appropriately modifying the cost function to

perform this transformation.
5.3.4.2 Efficient Pre-Filter Structures

Instead of realizing the filter in the direct form as weighted sum computation,
it can be implemented as a cascade of a 'pre-filter' with an 'equalizer'. The
pre-filter structures are computationally efficient as they use coefficients with
values 1 or -1. Use of Cyc\otomic Polynomial Filters as pre-filter structures
has been proposed in [72]. With the correct choice of apre-filter structure, the
equalizer filter can be implemented with fewer number of taps than required
for the direct form realization [I]. The cascade of 'pre-filter' - 'equalizer' thus
requires fewer number of multiplications and hence fewer number additions in
case of a multiplier-Iess implementation.
5.4. High Level Synthesis of Multiprecision DFGs

The output from the common subexpression elimination based optimization
is a data f10w graph with add and shift operators. The precision of these oper-
ations and their input/output variables varies significantly across the data f10w
graph. As an example, consider a 32 term weighted-sum computation with 12
bit data and 12 bit coefficients. The output of this computation needs 29 bits of
precision to guarantee no overflow. Thus the data f10w graph has variables of
precision ranging from 12 bits to 29 bits.
While the conventional high level synthesis techniques can be applied to
implement such data f10w graphs, the implementation is not optimal as the
variable precision is not exploited during the synthesis process. The following
subsections show, with the help of examples, how the variable precision can
be comprehended during the three key components of high level synthesis -
register allocation, functional unit binding and scheduling.
5.4.1. Precision Sensitive Register Allocation

The inputs to the register allocation problem is the variable Iifetime graph
which is derived from the scheduled data f10w graph. During register alloca-
tion [22], variables with non-overlapping lifetimes are assigned to the same
register, so as to minimize the total number of registers. The conventional al-
gorithms do not aim at minimizing the total number of register bits. Consider
the example shown in figure 5.10.
While both the register allocations require the same number of registers, the
precision sensitive allocation requires significantly lesser number of register
bits and is hence area efficient.
Multiplier-less Implementatioll 139
8 15 CI 8 15
Vl-> Reg I V2 -> Reg 2 VI-> Reg I V2 -> Reg 2
16 9 C2 9 16
V3 -> Reg I V4 -> Reg 2 V4 -> Reg I V3 -> Reg 2
precision insensitive allocation precision sensitive allocation

Reg I: 16 bits, Reg 2: 15 bits Reg I: 9 bits, Reg 2: 16 bits
Figure 5.10. Precision Sensitive Register Allocation
~
Op 1 -> Adder I
Gf0
Op2 -> Adder 2
CI
~
Op I -> Adder I
~
Op2 -> Adder 2
G-~
Op3 -> Adder I
~
Op4 -> Adder 2
C2
~
Op4 -> Adder I
@
Op3 -> Adder 2
precision insensitive binding precision sensitive binding

Addcr I: 19 bits, Adder 2: 18 bi ts Addcr I: 13 bits, Adder 2: 19 bits
Figure 5.11. Precision Sensitive Register Allocation
5.4.2. Precision Sensitive Functional Unit Binding

Functional unit binding [22] aims at assigning functional units to the oper-
ators scheduled in each control step. This can be viewed as a special case of
register allocation problem where the lifetime of the computation is always one
control step (can be extended to handle multi-cycle operations). Consider the
four add operations scheduled across two control steps as shown in figure 5.11.
While both the bindings require the same number of adders, the precision
sensitive binding requires adders with significantly smaller precision and is
hence area efficient.
Cl
+16
C2
precision insensitive scheduling precision sensitive scheduling

Adder I: 16 bits, Adder 2: 17 bits Adder 1: 12 bits, Adder 2: 17 bits
Figure 5.12. Precision Sensitive Scheduling
5.4.3. Precision Sensitive Scheduling

Scheduling [22] aims at assigning the operations of the data ftow graph to
various control steps with the aim of minimizing area for the given number
of control steps (time constrained scheduling) or minimizing the number of
control steps for the fixed number of resources (resource constrained schedul-
ing). Consider the data ftow graph (figure 5.12 with four add operations to be
scheduled over two control steps.
While both the schedules require the same number of adders, the precision
sensitive schedule can be implemented using adders with significantly smaller
precision and is hence area efficient.
Since register allocation, functional unit binding and scheduling are inter-
dependent, an approach that uni fies these three steps in one synthesis algorithm
is necessary to get an optimal implementation. Such an integrated and precision
sensitive approach to the synthesis of multi-precision data ftow graphs has been
presented in [3].
Chapter 6
IMPLEMENTATION OF MULTIPLICATION-FREE
LINEAR TRANSFORMS ON A PROGRAMMABLE
PROCESSOR
Many signal processing applications such as image transfonns [27], error

correction/detection involve matrix multiplication of the fonn Y = A * X,
where X and Y are the input and the output vectors and A is the transfonna-
tion matrix whose elements are 1,-1 and O. This chapter presents optimized
code generation of these transfonns targeted to both register rich RISC archi-
tectures such as ARM7TDMI [6] and single register, accumulator based DSP
architectures such as TMS320C2x [95] and TMS320C5x [96].
Code optimization techniques discussed in the literature [4, 5, 38, 39, 90]
address the problems of instruction selection and scheduling, register allocation
and storage assignment to minimize code size and/or number of cycles. These
techniques operate on a Directed Acyclic Graph (DAG) representation of the
code being optimized and can be applied to implement the multiplication-free
linear transfonns. However, the amount of optimization achieved using these
techniques is limited by the initial DAG representation. Much higher gains
are possible by optimizing the DAG itself. The DAG optimization techniques
presented in this chapter are targeted to both the register-rich and the single
register architectures.
With the increasing trend towards portable computing and wireless com-
munication, low power has become an important design consideration. These
systems are typically built around programmable processors. The amount of
code running on such embedded processors has been growing exponentially
over the last few years. Low power is thus becoming an increasingly important
consideration for code generation.
Techniques for low power memory mapping for data intensive algorithms
have been presented in [67, 74, 36]. Instruction scheduling for low power has
been presented in [89, 36] (calIed cold scheduling in [89]). Both these ap-
proaches [89, 36] perfonn instruction selection and register assignment first
141

and then use list scheduling based algorithm for instruction scheduling. In case
of multiplication-free linear transforms since primarily ADD and SUB instruc-
tions are used, the operand part of the instructions dominates the overall power
dissipation. Instead of assigning registers to the variables before instruction
scheduling, the approach presented in this chapter first performs instruction
scheduling using DAG variables and then does register assignment for low
power code generation. The chapter also presents a technique that reorders the
nodes of the DAG so as to minimize the power dissipation in case of single
register architectures.
This chapter is organized as two sections, 6.1 and 6.2, which present code
generation techniques targeted to register-rich and single register architectures,
respectively. Each section also gives resuIts that highlight the effectiveness of
these techniques in terms of reducing the number of cycles and also the power
dissipation for various multiplication-free linear transforms.
6.1. Optimum Code Generation for Register-rieh

Architectures
6.1.1. Generie Register-rieb Arebiteeture Model
The code generation techniques presented in this section are targeted to the
architecture model shown in figure 6.1. This model is a suitable abstraction of
a generic RISC architecture. The datapath of this architecture has a register file
connected to an ADD/SUBTRACT module which supports the following two
instructions:
1 ADD Srl Sr2,Shift Dr: (Srl) + (Sr2)< <Shift -+ (Dr)

2 SUB Srl Sr2,Shift Dr: (Srl) - (Sr2)< <Shift -+ (Dr)
where Srl and Sr2 are the source registers, Dr is the destination register and
'Shift' is the amount by which Sr2 is left shifted before being added to/subtracted
from Srl. In addition to these instructions, the architecture also supports the
load and store instructions for movement of data between the registers and the
data memory.
It is assumed that a transform is implemented as a function and all the input
data values are loaded in the registers before calling the function. Within the
main body of the function, the transform is performed as aseries of ADD and
SUB instructions that operate on the data stored in the registers and produce
outputs which are stored in the registers.
The code generation phase takes the DAG as the input and performs instruc-
tion scheduling and register assignment aimed at having a code that is smallest
in terms of program size, executes in minimum number of cycles, uses minimum
number of registers and dissipates least amount of power.
Implementation of Multiplication-free Linear Transforms 143
J INSTRUCTION RE(,ISTER
l
PR<X}RAM j 1 tJ
MEMORY
DECODE
t ~ ~ ~
I
EXECUTE CONTROL
1 1 1 SHIT
REG. ADDRESSES Sr2
«
DATA
Sr!
MEMORY RHilSTLR
I~ ~
FIl.E +/-
Dr
Figure 6. J. Generic Register-rich Architecture
6.1.2. Sources and Measures of Power Dissipation

Consider the architecture shown in figure 6.1 and three stages of the pipeline
perfonning instruction fetch, decode and execute respectively. During each
cycle, a new instruction In is fetched from the program memory, the previous
instruction I n - 1 is decoded and the instruction previous to that, I n - 2 is executed.
Thus during each cycle, there is an activity on the program memory busses, in
the decoder, in the register file and in the ADD/SUB function unit. These thus
fonn the sources of power dissipation.
The power dissipated in the program memory data bus depends on the total
Hamming distance between successive instructions fetched from the memory
and also on the number of adjacent signals of the bus toggling in opposite di-
rection. Since the decoder is a combinational block, the power dissipation is
dependent on the switching activity of its inputs. Since the fetched instruction
fonns the main input to the decoder block, the power dissipated in the decoder
is dependent on the Hamming distance between the successive instructions be-
ing decoded. The power dissipated in the ADD/SUB function unit primarily
depends on the data values being addedlsubtracted. Since these data values
are inputs to the program, the program generator has minimal control over this
component of power dissipation. The execute pipeline stage also sees activity in
the register file. The power dissipated in the register file decoders is dependent
on the sequence of register addresses (Srl, Sr2 and Dr) decoded every cycle.
WI W2 W3 XI X2 X3
W4 WS W6 X4 XS X6
W7 W8 W9 X7 X8 X9
Figure 6.2. 3x3 Pixel Window Transform
Since this sequence is primarily decided by the sequence in which the instruc-
tions are fetched, this component of the power dissipation is also dependent on
the Hamming distance between successive instructions. Thus it can be noted
that minimizing Hamming distance between consecutive instructions reduces
power dissipation in all the three stages of the pipeline.
6.1.3. Optimum Code Generation for I-D Transforms

A multiplication-free one dimensional transform can be represented as
N
Y = L A[i]· Xli], where A[i] E {O, 1, -I} for i = 1,2, ... , N (6.1)
n=l
Examples of such transforms include the 3x3 pixel window transforms [27]
used in image processing. Consider a 3x3 pixel window of an image (figure 6.2)
with values X[ I] to X[9] and the corresponding transform window with weights
W[l] to W[9]. The transform is then computed as:
9
Y = L W[i] . Xli] (6.2)
i=l
Figure 6.3 shows the Prewitt window transform [27] which is used for edge
detection. The corresponding DAG is also shown in figure 6.3.
This subsection looks at low power code generation for such transforms.
As discussed earlier, the code generation should aim at reducing the Hamming
distance between successive instructions. An instruction has two parts - the first
part gives the operator (ADD or SUB) and the second part gives the operands
(source and destination registers). The Hamming distance in the operator part
of the instruction can be reduced during instruction scheduling by maximizing
the sequences of consecutive ADD and consecutive SUB operations. The other
technique is to modify the DAG itself so as to maximize nodes/operations of
the same type. For example, the DAG in figure 6.3 can be transformed to
Implementation of M ultiplication-free Linear Transforms 145
Xl X3 X4 X6 X7 X9
\1 \1 \1
1 0 -1
8 8
1 0 -1 ~Gf
~
1 0 -1
Figure 6.3. Prewitt Window Transform
Xl X3 X6 X4 X9 X7
\1 \1 \1
8 8
~8/
~ y
Figure 6.4. Transformed DAG with All SUB Nodes
the DAG shown in figure 6.4. While the initial DAG in figure 6.3 has three
SUB and two ADD nodes, the transfonned graph has all five nodes of SUB
type. Consequently, the code generated from the transfonned DAG has zero
Hamming distance in the operator part of the instructions.
The reduction in the Hamming distance in the operands part of the instruc-
tions results in power reduction in the register file also and thus has a bigger
impact on the overall power reduction. Here is a technique that suitably trans-
forms the DAG and generates code with minimum total Hamming distance
between successive instructions.
Xl ~~~E)-ß-~~ Y
t t t t t
X4 X7 X3 X6 X9
Figure 6.5. Chain-type DAG for Prewitt Window Transform
ADD Xl X4 Tl ADD RO Rl,O RO

ADD Tl X7 T2 ADD RO R3,O RO
SUB T2 X3 T3 SUB RO R2,O RO
SUB T3 X6 T4 SUB RO R6,O RO
SUB T4 X9 Y SUB RO R7,O RO
Figure 6.6. Optimized Code for Prewitt Window Transform
Step 1: Convert the DAG to a 'Chain' structure and reorder the nodes so
as to group all ADD nodes together. The reordering minimizes the Hamming
distance in the operator part of the instructions. Figure 6.5 shows such a DAG
for the Prewitt Window transform.
Step 2: The chain structure fixes the scheduling of the operations and is
given by the order of nodes in the DAG. Generate the instruction sequence
using variables as operands. It can be noted that in such a sequence, the first
source and the destination variables of all instructions can be assigned to the
same register. Such an assignment results in zero Hamming distance in the
Srl and Dr operands of the instructions. For the variables in Sr2, the registers
are assigned in a gray code sequence so as to minimize the Hamming distance
between successive Sr2 operands of the instruction. Figure 6.6 shows the code
for Prewitt Window transform before and after the register assignment.
It can be noted that this two step algorithm generates code that is optimal in
terms of minimum program size, minimum number of cycles, minimum number
of registers and also minimum power dissipation.
6.1.4. Minimizing Number of Operations in Two Dimensional

Transforms
A two dimensional transform with an MxN transformation matrix can be
implemented as M one dimensional transforms whose transformation matri-
ces (size I xN) are given by the rows of the MxN matrix. Such an approach
however cannot exploit the computational redundancy across these one dimen-
sional transforms and hence results in increased number of AddJSub operations.

The number of Add/Sub operations required to perform an MxN transform can
be minimized by extracting common subexpressions and precomputing them.
This technique is identical to the technique discussed in the earlier chapter for
minimizing the number of additions in the multiplier-less implementation of
the MCM computation.
Consider the 4x4 Haar transform [27] shown below.
1 1
o
j 1[ ~~ 1
-1
1 -1
o 1 -1 X4
From the transformation matrix, it can be observed that the subexpression

(Xl +X2) is used to compute both Yl and Y3. Similarly, the subexpression
(X3+X4) is used to compute both Y2 and Y4. The total number of additions
can be reduced by precomputing such common subexpressions. For example,
the number of additions+ subtractions to compute the above equations can be
reduced by two, if (Xl +X2) and (X3+X4) are precomputed. Thus the number
of additions can be minimized by iteratively identifying and precomputing such
common subexpressions.
Two types of common subexpressions are used
CS++ in which the elements in two columns of the matrix are both 1 or both
-1 for more than one rows (e.g. X 12+, columns 1, 2 for rows 1 and 3).
2 CS+- in which the elements in two columns of the matrix are (+ 1,-1) or
(-1, + 1) for more that one rows.
Every iteration involves selecting a common subexpression that results in maxi-
mum reduction in the number of operations used to perform the transform. This
is the same heuristic that is used for minimizing additions in the multiplier-less
implementation of weighted-sum and MCM computations.
Once the subexpression is identified, the transformation matrix is updated
to reflect the precomputation. This is done by adding a new column to the
transformation matrix and suitably updating the matrix elements. For example,
consider X12+ common subexpression. The modified transformation matrix is
shown below :
=~ ~ 1[Jt I
o 1
-1 o
o -1
o 1
Tl
XI~+ YI
X2 ~ - }------'d----~ Y2
X3~+ Y3
X4 ~81------->~ Y4
Figure 6.7. Optimized DAG für 4x4 Haar Transfürm
Figure 6.7 shows the optimized DAG for 4x4 the Haar transform. It requires
six computations compared to eight computations required if the transform is
computed as four I x4 transforms.
6.1.5. Low Power Code Generation

The DAG optimization technique discussed in the earlier section results in a
code that is optimal in terms of number of cycles. This subsection presents an
approach for instruction scheduling and register assignment to generate code
with minimum total Hamming distance between successive instructions.
Step 1: Instruction Scheduling
Generate an initial list of ready-to-be-scheduled nodes by selecting nodes
for which both the inputs are primary inputs. Schedule anode Ni from this list
as the first node.
For examp\e, for the DAG shown in figure 6.7 the ready-to-be-scheduled list
is given by {Tl,T2,Y2,Y4}. Tl is scheduled as the first node from this list.
Repeat {
Include the node Ni in the already-scheduled node list.
Update the ready-to-be-scheduled list which has nodes whose inputs are
either primary inputs or are in the already-scheduled list
For the example being considered, during the first iteration, the updated
already-scheduled list will be {Tl} and the updated ready-to-be-scheduled list
will be {T2, Y2, Y4}
Select anode from the ready-to-be-scheduled list with minimum difference
from the latest scheduled node. The difference is computed by comparing the
operator and the variables assigned to the operand fields.
For the example being considered, during the first iteration, node Y2 will be
selected as it differs with Tl in two fields (operator and destination) as against
Implementation of Multiplication-free Linear Transfarms 149
ADD Xl X2 Tl
SUB Xl X2 Y2
SUB X3 X4 Y4
ADD X3 X4 T2
ADD Tl T2 YI
SUB Tl T2 Y3
Figure 6.8. Scheduled lnstructions far 4x4 Haar Transform
T2 which differs in three fields (all the three operands) and Y4 which differs in
all the four fields.
While computing difference between the current node and the latest sched-
uled node, if the current node is of type ADD, use commutativity to swap
the source operands and check whether the difference reduces with swapped
operands.
For example, if the latest scheduled node corresponds to SUB X2 Xl Yl,
then for anode with operation ADD Xl X2 Y2 the difference will be 4, however,
with inputs swapped the same operation ADD X2 Xl Y2 will have a difference
oftwo.
} Until ready-to-be-scheduled list is empty
Figure 6.8 gives the output of instruction scheduling for the DAG shown in
figure 6.7.
Step 2: Register Assignment
From the schedule derived in step I, find lifetimes of all the variables.
Figure 6.9 shows the data ftow graph for the scheduled DAG and the lifetime
spans for all the variables.
Construct a register-conjlict graph as follows. Each node in the graph rep-
resents a variable in the data ftow graph. Connect two nodes of the graph if the
lifetimes of the corresponding variables overlap.
Figure 6.10 shows the register-conjlict graph for the data ftow graph shown
in figure 6.9.
The register assignment approaches discussed in the literature [4] solve the
problem as a graph coloring problem where no two nodes which are connected
by an edge are assigned the same color and the graph is thus colored using
minimum number of colors.
In this approach, the number of registers are minimized only to the extent of
eliminating register-spills and the focus is more on low power considerations.
The instruction schedule is analyzed to build a consecutive-variables graph in
which each node represents a variable in the data ftow graph. Two nodes of
the graph are connected if the corresponding variables appear in the consec-
XI X2 X3 X4 XI X2 X3 X4 TI T2 YI Y2 Y3 Y4
CI
C2
C3
C4
C5
C6
I
YI Y2 Y4 Y3
Figure 6.9. Data Flow Graph and Variable Lifetimes for 4x4 Haar Transform
Y3 Y2 Y3 Y2
Y4
XI TZ XI T2
X4 X4
Figure 6.10. Register-Conflict Graph Figure 6.11. Consecutive-Variables Graph
utive eyc\es at the same operand loeation. Eaeh edge E[i, j] in the graph is
assigned a weight W[i, j] given by the number of times variables i and j appear
eonseeutively in the instruetion sequenee.
Figure 6.11 shows the consecutive-variables graph for the DFG shown in
figure 6.9. It ean be noted that for this graph, all the edges have the same
weight( =1).
lmplementation of Multiplication-free Linear Transforms 151
HD
YI
R5(OIOI) ADD RO RI5,O R3
2
SUB RO RI5,O R2
3
XI SUB RI R7,O R6
2
ADD RI R7,O R7
2
TI ADD R3 R7,O R5
2
SUB R3 R7,O R4
X}
X4-T2
11
RI(OOOI) R7(OIII)
Figure 6.13. Code Optimized for Low Power

Figure 6.12. Register Assignment for Low
Power
The low power register assignment can be formulated as a graph coloring

problem, with the cost function to be minimized given by:
CF = L L HD[i,j].W[i,j] (6.3)
j
This cost function is same as that used for FSM state encoding. Many tech-
niques have been proposed to solve this problem and include approaches such
as simulated annealing [63] and stochastic evolution [47] based optimization.
Since the objective is to minimize the Hamming distance, register sharing

is performed only if it helps in reducing the Hamming distance. In general,
if two variables are connected in the conseclltive-variables graph but are not
connected in the register-conjlict graph, they are assigned to the same register.
From the graphs shown in figures 6.10 and 6.11, it can be noted that variables
X4 and T2 satisfy this criterion and hence can be assigned the same register.
Figure 6.12 shows the modified consecutive-variables graph and the corre-
sponding register assignment. The code thus generated in shown in figure 6.13.
The total Hamming distance between successive instructions for this code is II
(assuming Hamming distance of one between the opcodes of ADD and SUB
operations).
The code generated by this algorithm was compared with that generated using
theoptimizingC compiler for TMS470Rl x [99] wh ich is based onARM7TDMI
core of Advanced Risc Machines Ltd.
Table 6.1 gives the total Hamming distance between successive instructions
for 6 different transforms (figures 6.3, 6.14, 6.7 and 6.18). The second column
gives the measure for the code generated from C code that directly implements
I 2 1 -I -I -I 1 1 1
0 0 0 -I 8 -I I I I
-I -2 -I -I -I -I 1 I 1
Sobel Window Transfonn Spatial High-Pass Filter Spatial Low-Pass (Averaging) Filter
Figure 6.14. 3x3 Window Transforms
Table 6.1. Total Hamming Distance Between Successive Instructions
Transform TMS470Rlx Instruction Low Power %red. w.r.t.

C Compiler Scheduling + Code TMS470Rlx
TMS470Rlx Generator C Compiler
C Compiler
Prewitt Window 17 10 5 71%
Sobel Window 13 13 7 46%
Spatial High Pass Filter 22 17 10 55%
Spatial Low Pass Filter 14 14 7 50%
4x4 Haar 29 20 11 62%
4x4 Walsh-Hadamard 42 35 15 64%
these DAGs. The third column presents results for the C code that represents
the reordered DAG and consequently re-scheduled instructions. The fourth
column gives the Hamming distance measure for the code generated by the low
power code generator. The results assurne that the Hamming distance between
the ADD and the SUB opcodes is one.
As can be seen from the results, significant power reduction can be achieved
by using a low power driven code generation approach. To compare this ap-
proach with the approach that first does register assignment and then performs
cold scheduling, the register assignment done by the TMS470Rlx C Compiler
was used to cold schedule the Prewitt Window Transform. The total Hamming
distance for the resultant code was eight compared to the measure of five for
the low power code generator. This justifies the approach of first scheduling
the instructions and then performing low power register assignment.
Implementation oj Multiplication-free Linear Transjorms 153
PDB I lNSTRUCTlON
I
REGISTER
I
DRAß
PAß I ADDRESS
I
GI'NERATOR I DWAB
DRDß
PROGRAM DATA
MEMORY MEMORY
«
+/-
r
~
I Ace
I «
DWDB
Figure 6.15. Single Register, Accumulator Based Architecture
6.2. Optimum Code Generation for Single Register,

Accumulator Based Architectures
6.2.1. Single Register, Accumulator Based Architecture Model
The eode generation teehniques presented in this seetion are targeted to the
arehiteeture model shown in figure 6.15. This model is a suitable abstraetion
of TMS320C2x [95] and TMS320C5x [96] proeessors and shows the datapath
of interest to the multiplieation-free linear transforms.
The arehiteeture has six busses. The program memory address bus (PAB)
gives the loeation of the instruetion to be fetehed. The instruetion is fetehed
on the pro gram memory data bus (PDB). The loeation of the data to be read
is speeified on the data memory read address bus (DRAB) and the data is read
from the memory on data memory read data bus (DRDB). The loeation for
weiting a data is speeified on the data memory weite address bus (DWAB) and
the data to be weitten in put on data memory write data bus (DWDB).
The arehiteeture supports the foJIowing instruetions:
ADD Mem,Shift: (Aee) + (Mern)< <Shift -t (Aee)

Forexample, the instruetion 'ADD X, l' gets the value from the data mem-
ory loeation a~dressed by X, left shifts it by 1 and adds to the Aeeumulator
in one cloek eycle (assuming 0 wait-state memory aeeess).
2 SUB Mem,Shift: (Ace) - (Mern)< <Shift -t (Ace)

Y2
Xl X2 X3 X4
Figure 6./6. Example DAG
3 LAC Mem,Shift: (Mem)< <Shift -+ (Acc)

For example, the instruction 'LAG X, l' gets the value from the data mem-
ory location addressed by X, left shifts it by I and loads into the Accumulator
in one c10ck cycle (assuming 0 wait-state memory access).
4 SAC Mem,Shift: (Acc)< <Shift -+ (Mem)
5 NEG -(Ace) -+ (Ace)
The code generator uses these instructions to implement a multiplication-free

linear transform.
6.2.2. Code Generation Rules

One of the inputs to the code generator is a DAG representation of the desired
computation. Figure 6.16 shows a DAG representation of a five input, two
output eomputation. A DAG has three types of nodes - input, output and the
intermediate nodes. For example, nodes Xl, X2, X3, X4, X5 in figure 6.16
are the input nodes, Tl, T2, T3 are the intermediate nodes and Yl, Y2 are the
output nodes. The output and the intermediate nodes represent either an ADD
or a SUBTRACT operation and have fanin of two.
The other input to the code generator is a sequence in which the nodes of the
DAG need to be evaluated. Given the sequenee and the DAG, following rules
are used to generate the code:
Let 'current' node be the latest evaluated node and 'new' node be the new
node for which the code is being generated.
Table 6.2. Code Dependance on the Scheduling of DAG Nodes
TI,T2,T3,YI,Y2 T2,T3,TI,YI,Y2 T2,TI,YI,T3,Y2

LAC XI rule I LAC X3 rule I LAC X3 rule I
ADD X2 rule I ADD X4 rule I ADD X4 rule I
SAC TI rule I SAC T2 rule 4 SAC T2 rule 4
LAC X3 rule I ADD X5 rule 2 LAC XI rule I
ADD X4 rule I SAC T3 rule I ADD X2 rule I
SAC T2 rule 4 LAC XI rule I SUB T2 rule 2
ADD X5 rule 2 ADD X2 rule I SAC Yl rule 4
SAC T3 rule I SUB T2 rule 2 LAC T2 rule I
LAC TI rule I SAC YI rule 4 ADD X5 rule I
SUB T2 rule I ADD T3 rule 2 ADD YI rule 2
SAC YI rule 4 SAC Y2 rule 4 SAC Y2 rule 4
ADD T3 rule 2
SAC Y2 rule 4
13 cycles 11 cycles II cycles
If the 'current' node is not one of the fanin nodes of the 'new' node, save
the 'current' node (SAC instruction), load the left fanin node of the 'new'
node (LAC instruction) and ADD/SUBTRACT the right fanin node of the
'new' node.
2 Ifthe 'current' nodeisaleftfaninnodeofthe 'new' node, ADD/SUBTRACT

the right fanin node of the 'new' node.
3 If the 'current' node is a right fanin node of the 'new' node and the 'new'
node function is SUBTRACT, negate the 'current' node (NEG) instruction
and ADD the left fanin node of the 'new' node.
4 If the 'new' node is an output node or an interrnediate node with fanout of

two or more, store the new node (SAC instruction) before proceeding with
the next node.
Considersequences {TI, T2, T3, YI, Y2}, {T2, T3, Tl, YI, Y2} and {T2,
TI, Y I, T3, Y2} for the DAG shown in figure 6.16. The corresponding code is
shown in table 6.2.
As can be seen from this example, for a given DAG, the code size and
consequently the number of cycles depend on the sequence in which the nodes
are evaluated. The code optimization problem thus maps onto the problem of
finding an optimum sequence of DAG node evaluations.
6.2.3. Computation Scheduling Algorithm

This subsection presents an algorithm for scheduling the DAG computations
for minimum number of cycles. The algorithm uses the following knowledge-
base derived from the code generation mIes presented earl ier.
Anode can be scheduled for computation only if both its fanin nodes are
al ready computed or are input nodes.
2 The computation of output nodes and the intermediate nodes with fanout of
two or more, always needs to be stored irrespective of the next computation
node.
3 If the 'current' node is one of the fanin nodes of the 'new' node, it avoids
accumulator spill and hence reduces the 'store' and 'load' overhead.
The algorithm perforrns the optimization in two phases. The first phase uses
the above mentioned knowledgebase to generate an initial schedule. The second
phase performs iterative refinement to further optimize the schedule.
Procedure DAG-Schedule
Input: DAG representation of the computation to be implemented on a single
register, accumulator based machine
Output: An order in which the DAG nodes need to be scheduled so as to
generate a code with minimum accumulator spills
scheduled-node-list = {}
current-node = 0
while (no.of-scheduled-nodes < total-no.of-intermediate+output-nodes) {
/* build candidate-node-list */
candidate-node-list = {}
for all (nodei tf- scheduled-node-list) {
if «nodei.left-fanin E (input-node-list + scheduled-node-list)) .and.
(nodei.right-fanin E (input-node-Iist + scheduled-node-list)))
candidate-node-list += nodei
} /* assign weights to the candidate-nodes */
for all (nodei E candidate-node-list) {
nodei.weight = 1
if «nodei E output-node-list) .or. (nodei.fanout 2:: 2))
nodei. weight++
if «nodei.left-fanin = current-node) .or.
«(nodei.right-fanin = current-node) .and. (nodei.op = ADD)))
nodei. weight += 2
if (nodei.fanout-node.right-fanin E scheduled-node-list)
nodei.weight += 2
}
/* select the node with the highest weight for scheduling */
Find (nodem E candidate-node-list) such that nodem.weight is maximum
scheduled-node-list += node m
current-node = node m
}
In the above algorithm, at each stage of node selection, there can be more
than one nodes with the same weight. During the first phase of the algorithm a
node is selected randomly. In the iterative refinement phase anode selected at
each stage is replaced by other node (if available) with the same weight. The
resultant schedule is compared with the initial schedule and accepted if it results
in fewer number of cycles.
The scheduling algorithm when applied to the DAG shown in figure 6.16,
generates the schedule T2, T3, Tl, Yl, Y2 with no further improvement possible
in the iterative refinement phase. During the first iteration of the algorithm,
the candidate-node-list consists of nodes Tl and T2, with weights 1 and 4
respectively. Node T2 is hence scheduled first. In the second iteration, the
candidate list consists of nodes Tl and T3 both having weight 3. Selecting T3
results in the schedule T2, T3, Tl, Yl, Y2 which requires 11 cycles. During
the iterative refinement phase of the algorithm, Tl is selected instead of T3
resulting in the schedule T2, Tl, Y 1, T3, Y2 which also requires 11 cycles.
The code generated by the algorithm presented in this chapter was compared
with that generated using optimizing C compiler for TMS320C5x. The DAGs
for 4x4 Walsh-Hadamard transform shown in figures 6.17, 6.18 and 6.23 were
converted to an equivalent C program and compiled with highest optimization
level. The generated code which used indirect addressing, was converted to
use direct addressing thus reducing number of cycles. Table 6.3 shows the
comparison in terms of number of cycles assuming that the program and data
are available in on-chip memories.
Table 6.3. Comparison ofCode Generator with 'C5x C Compiler
DAG 'C5x C compiler Code Generator

no. of cycles no. of cycles
Fig.6.17 20 20
Fig.6.18 22 22
Fig.6.23 19 14
The results show that the code generator generates as compact code as the
'C5x C compiler for the first 2 DAGs. It does better in case of the DAG in fig-
Xl YI Xl Y3
X2 X3 X4 X2 X3 X4
Xl Y2 Xl Y4
X2 X3 X4 X2 X3 X4
Figure 6.17. DAG far 4x4 Walsh-Hadamard Transfarm
ure 6.23. The main reason for this is that the C compiler during its optimization
phase modi fies the DAG and in the process generates code with more number
of cycles.
6.2.4. Impact of DAG Structure on the Optimality of Generated

Code
This section shows how the structure of a DAG impacts the optimality of
the code. For a one dimensional transform, the DAG structure can be either a
tree-type or a chain-type. It can be shown that a chain-type structure results
in 0 accumulator spills and is hence results in a more optimal code than the
tree-type structure.
Consider the 4x4 Walsh-Hadamard transform as an example to analyze the
relationship between the structure of a DAG and the optimality 01' the generated
code.
The 4x4 Walsh-Hadamard transform [27] is shown below.
1 1
-1
1 -1
-1 -1
1
-~ 1r ;~ 1
-1
1
X3
X4
One approach to realize this transform is to implement it as four I x4 one

dimensional transforms. The DAG for such an implementation, shown in fig-
ure 6.17, has 12 nodes (i.e. 12 additions + subtractions).
The numberof nodes can be minimized using the technique, discussed earlier,
of precomputing the common subexpressions. Figure 6.18 shows the DAG
thus optimized for minimum number of nodes. It has 8 nodes (8 additions +
subtractions) compared to 12 nodes of the DAG shown in figure 6.17.
The scheduling algorithm discussed earlier was used to schedule the DAGs
shown in figures 6.17 and 6.18. While the DAG shown in figure 6.17 requircs
Xl Y1
X2 Y2
X3 Y3
X4 Y4
Figure 6.18. Optimized DAG für 4x4 Walsh-Hadamard Transform
20 cycles, the DAG in figure 6.18 requires 22 cycles to compute the transform,
eventhough it has four less nodes. Clearly, fewer number of nodes does not
always translate into fewer number of cycles. The main reason for the DAG in
figure 6.18 requiring more cycles, is that all its intermediate nodes have fanout
of two. For single register, accumulator based architectures, such intermedi-
ate nodes result in accumulator spills, and consequently in 'store' and 'load'
overhead.
6.2.5. DAG Optimizing Transformations

This section presents four DAG transformations that minimize the accumu-
lator spill and hence the number of execution cycles.
6.2.5.1 Transformation I - Tree to Chain Conversion

This transforms converts a 'tree' structure in a DAG to a 'chain' structure.
This eliminates the need to store the intermediate computations and hence re-
duce the number of cycles. Figure 6.19 shows an example of this transform.
While the DAG with a 'tree' structure requires seven cycles to compute the
output, and the transformed 'chain' structure performs the computation in five
cycles.
6.2.5.2 Transformation 11 - Serializing a Butterfly

Many image transform DAGs have 'butterfly' structures that perform the
computations of the type (YI = XI + X2, Y2 = Xl - X2). Such butterfly
structures can be serialized by computing one of the butterfly outputs in terms
of the other output, and using a SHIFT operation which when performed along
with ADD or SUBTRACT does not require in an additional cycle. Figure 6.20
LAC XI
ADD X2 XI~OTI LAC XI
SAC TI ADD X2
X2....:7 ~
LAC X3 ~Yl X3
ADD X3
X3~
c)
ADD X4 ADD X4
ADD TI X4....:7 T2 SAC Yl
X4 YI
SAC YI
TREE STRUCTURE CHAIN STRUCTURE
Figure 6.19. Transformation I - Tree to Chain Conversion
LAC Xl LAC Xl LAC Xl

ADD X2 Xl2S:Yl Xlzt
Yl ADD X2 Xl~Yl
+2
SUB X2
SAC Yl SAC Y2
c)
SAC Yl OR
LAC Xl X2 -1 Y2 X2 -2 Y2 SUB X2.1 X2 -1 Y2 ADD X2.1
SUB X2 SAC Y2 SAC Yl
SAC Y2
Figure 6.20. Transformation Il - Serializing a Butterfly
shows the serialized DAGs which require five cycles compared to six cycles
required for the butterfly computation.
As can be seen from the figure, there are two ways of serializing a butterfly
depending on whether Yl is computed in tenns of Y2 (Yl = Y2 + 2*X2) or
Y2 is computed in terms of Yl (Y2 = Yl - 2*X2). The choice of the transfonn
depends on the context in which the butterfly appears in the overall DAG.
6.2.5.3 Transformation III - Fanout Reduction

Since the intennediate nodes with fanout of two or more, result in accumu-
tator spilling, this transfonnation reduces fanout of an intennediate node in a
DAG. Unlike the first two transfonns, this transfonn increases the number of
nodes in the DAG by one. Figure 6.21 shows an example of this transfonna-
tion applied to a four input, two output DAG. It can be noted the fanout of
the intennediate node TI in the transfonned DAG is one (i.e. one less than in
the original DAG). While the original DAG has three nodes and requires eight
cycles, the transfonned DAG has four nodes but requires seven cycles.
For the DAG shown in figure 6.21, the fanout reduction transfonnation can
also be applied to eliminate TI 's fanout to Y2, instead of eliminating TI's fanout
to YI. In general, the choice of which fanout edge to eliminate depends on the
context in which the node appears in the overall DAG.
"1"
LAC XI
LAC XI
ADD X2 LAC XI X3;fYl
ADD X2 ADD X2
SAC TI XI~
ADD X3 XI ~ I c:::) X4 Y2 ADD X4

ADD X3
:::t
X2--'" SAC YI
X2--'" SAC Y2 OR
SAC YI
SUB X4 LAC XI
LAC TI
--'" Y2 ADD X2
ADD X4 ADD X3
X4 ADD X4
SAC Y2 SAC YI X4 Y2
X3 YI SAC Y2
FANOUT REDUCTION MERGING
Figure 6.21. Transformations 111 and IV
XI XI XI YI
X2 X2 Y2
X4 Y3
X3 X3
XI Y2
X4 X4 Y4
Xl
X3
Y4
SERIALIZING A BUlTERFL Y MERGING TREE TO CHAIN CONVERSION
Figure 6.22. Optimizing DAG Using Transformations
6.2.5.4 Transformation IV - Merging

Merging is another transfonn that reduces the fanout of intennediate nodes.
Unlike the earlier transfonns, this transfonn does not reduce the number of
cycIes. However it transfonns the DAG so that other transfonnations can be
applied to the modified DAG. Figure 6.21 also shows an example ofthe 'merg-
ing' transfonnation applied to the four input, two output DAG.
Figure 6.22 shows how these transfonnations can be applied to the DAG in
figure 6.18. The resultant DAG requires 16 cycles (six cycIes \ess) to compute
the 4x4 Walsh-Hadamard transfonn.
The amount of optimization possible using these transfonns depends on the
sequence in which the nodes are selected and the choice of transfonnations ap-
plied. One approach is to search the DAG for potential nodes for transfonnation
and for a selected node, apply the transformation that results in most saving.
This greedy approach does not often give the most optimum solution. Instead
of applying transfonnations to eliminate accumulator spills, a spill-free DAG
can be directly synthesized from the transformation matrix. The following sub-
section presents a technique for synthesis of spill-free DAGs that are optimal
in terms of number of cycles.
6.2.6. Synthesis of Spill-free DAGs

A DAG that can be scheduled without any accumulator spi lls provides certain
advantages. Firstly, it simplifies code generation. Secondly, since there are
no accumulator spills, no intermediate storage is required, thus reducing the
memory requirements to implement the transform. The DAG in figure 6.17
is an example of such a DAG. It however requires 20 cycles to compute the
transform. It can be noted that the final optimized DAG in figure 6.22, is also
a spill-free DAG. This DAG however requires just 16 cycles. The main reason
for the reduced cycles is the fact that this DAG uses precomputed outputs along
with the inputs. For example, Y3 is computed in terms of Y 1 and the primary
outputs, and Y 4 is computed in terms of Y2 and the primary outputs. Instead
of generating a DAG with minimum number of additions and then applying
transformations, this DAG can be directly generated from the transformation
matrix, if the sequence of output computation is known.
Here is an algorithm that arrives at an optimum sequence of output compu-
tations such that the resultant spill-free DAG requires fewest number of cycles.
The algorithm operates on a graph whose nodes represent the outputs. The
nodes are of three types - corresponding to
1. The 'most-recently-computed' output
2. Other outputs that are 'already-computed'
3. Outputs that are 'yet-to-be-computed'
Each node in the graph has an edge (self loop) that starts and ends in itself. These
self-Ioops are assigned costs which are given by the number of cycles required
to compute the output independently (i.e. without using any of the precomputed
outputs). There are also edges between every 'already-computed' output node
to all the 'yet-to-be-computed' nodes. Each edge is assigned a cost given by
the number of cycles required to compute the 'yet-to-be-computed' output in
terms of the 'already-computed' output. The algorithm uses the steepest de-
scent approach, which at every stage selects an output that results in minimum
incremental cost. In case of more than one outputs having the same lowest
incremental cost, one output is selected randomly. Once an output is selected,
it is marked as the 'most-recently-computed' output. All the edges between this
node and the 'already-computed' nodes are deleted, and new edges are added
between this node and the other 'yet-to-be-computed' nodes. The newly added
edges are then assigned appropriate costs. This process is repeated to cover all
the outputs.
Procedure Synthesize-Spill-Jree-DAG
Input: Two dimensional matrix representing the multiplication-free linear
transform
Output: Spill-free DAG representation of the computation with the DAG hav-
ing minimum number nodes
already-computed-output-list = { }
most-recently-computed-output = (j)
/* Construct initial graph and compute edge costs */
for (i=ü,i <no-of-outputs;i++) {
edge[i,i].cost = number of non-zero entries in row 'i' + 1
}
repeat {
------pfnd the edge E(M,N) with the lowest cost.
if (M == N) { /* self loop */
Generate the DAG to compute output(N) in terms of only the inputs
} else {
Generate DAG to compute output(N) in terms of inputs and output(M)
}
/* Update the graph */
Delete edge E(N,N)
for each node (i E already-computed-output-list) {
Delete edge E(i,N)
}
already-computed-output-list += N
for each node (i E yet-to-be-computed-output-list) {
E(most-recently-computed-output,i).cost++
}
most-recently-computed-output = N
for each node i E yet-to-be-computed-output-list {
Add edge E(N,i)
E(N,i).cost = number of mismatches between row N and row 'i'
of the transformation matrix
}
} until (yet-to-be-computed-output-list == {})
Figure 6.23 shows each iteration of the algorithm applied to the 4x4 Walsh-
Hadamard transform matrix, and the resultant DAG. It can be noted that the
resultant DAG is spill-free and requires just 14 cycles to compute the transform.
Here are results of applying these transformations on 8x8 Walsh-Hadamard
transform, 8x8 Haar transform and 4x4 Slant transform.
4
4
XI YI Y3 Y2 Y4
X2 X3 X4 X3 X2 X3 X4
Figure 6.23. SpilI-free DAG Synthesis
The 8x8 Walsh-Hadamard transfonn [27] is given by:

Y1 1 1 1 1 1 1 1 1 Xl
Y2 1 -1 1 -1 1 -1 1 -1 X2
Y3 1 1 -1 -1 1 1 -1 -1 X3
Y4 1 -1 -1 1 1 -1 -1 1 X4
Y5 1 1 1 1 -1 -1 -1 -1 X5
Y6 1 -1 1 -1 -1 1 -1 1 X6
Y7 1 1 -1 -1 -1 -1 1 1 X7
Y8 1 -1 -1 1 -1 1 1 -1 X8
The direct computation of this transfonn requires 56 additions+ subtractions
and the corresponding code executes in 72 cycles. The number of additions+
subtractions can be minimized to 24 using the common subexpression precom-
putation algorithm. The resultant DAG is shown in figure 6.24. The code
corresponding to this DAG requires 64 cycles.The DAG can be optimized by
applying transfonnations to serialize all the butterflies. The resultant DAG is
also shown in figure 6.24. This DAG also has 24 nodes but the corresponding
code requires 52 cycles.
Figure 6.25 shows a spill-free DAG synthesized for the 8x8 Walsh-Hadamard
transfonn. The DAG has 35 nodes and the corresponding code requires 44
cycles. The results so far indicate that for both 4x4 and 8x8 Walsh-Hadamard
transforms, the spill-free DAGs result in most efficient code. To take this
analysis further the DAG for 8x8 Walsh-Hadamard transfonn was modified to
XI YI XI YI
X2 Y2 X2 Y2
X3 Y3 X3 Y3
X4 Y4 X4 Y4
X5 YS XS YS
X6 Y6 X6 Y6
X7 Y7 X7 Y7
X8 Y8 X8 Y8
Figure 6.24. DAGs far 8x8 Walsh-Hadamard Transform
extract acommon sub-computation (X5 + X6 + X7 + X8). The resultant DAG

is also shown in figure 6.25. This DAG has 32 nodes and it does result in one
accumulator spill. The code corresponding to this DAG requires 42 cycles (2
less than the spill-free DAG).
The 8x8 Haar transform [27] is given by :
Y1 1 1 1 1 1 1 1 1 X1/v'S
Y2 111 1 -1 -1 -1 -1 X2/v'S
Y3 1 1 -1 -1 o 0 o 0 X3/2
Y4 o 0 0 0 1 1 -1 -1 X4/2
Y5 1 -1 0 0 o 0 o 0 X5/v'2
Y6 o 0 1-1 o 0 o 0 X6/v'2
Y7 000 0 1 -1 o 0 X7/v'2
Y8 000 0 o 0 1 -1 X8/v'2
The direct computation of this transform requires 24 additions + subtractions
and the corresponding code executes in 40 cycles. The number of additions +
subtractions can be minimized to 14 using the common subexpression precom-
putation algorithm. The resultant DAG is shown in figure 6.26. The code
corresponding to this DAG requires 39 cycles. The DAG was optimized by
applying transformations to serialize all the butterflies. The resultant DAG is
also shown in figure 6.26. This DAG also has 14 nodes but the corresponding
code requires 30 cycles.
The spilI-free DAG for the 8x8 Haar transform has 20 nodes and the corre-
sponding code requires 32 cycles.
Yl Y2 Y3
X8
-2
X7 X6 X5 X4 X3 X2 Xl X2 X4 X6 X8 X2 X3 X6 X7
X7 X6 X3 X2 X8 X6 X8 X5 X3 X2 X8 X6 X4
Y7
X2 X4 X6 X8
Yl Y5 D Y2 Y3
X8
Y8
Y7 Y6 Y4
Figure 6.25. Spill-free DAGs für 8x8 Walsh-Hadamard Transform
XI YI XI YI
X2 Y5 X2 Y5
X3 Y3 X3 Y3
X4 Y6 X4 Y6
Y2 X5 Y2
Y7 X6 Y7
: Y4
Y8
X7
X8
Y4
Y8
Figure 6.26. DAGs for 8x8 Haar Transform

The 4x4 Slant transfonn [27] can be transfonned into a 4x8 multiplication-
free transfonn as shown below :
-~ -; 1 [
X1/2
X2/2v1s
1
-1 1 X3/2
3 -1 X4/2v1s
X1/2
X2/2v1s
-n
1 0 0 0 X3/2
Y2
[ Y1
Y3
Y4
1 [ J 1
-1
-1
-1
-1
-1
1
-1
0
0
0
0
-1
0
0
X4/2V5
Xl
X2/vIs
X3
X4/vIs
It can be noted that the left half of the 4x8 matrix is same as the Walsh-
Hadamard transform. The direct computation of the 4x8 transfonn requires
16 additions+ subtractions and the corresponding code executes in 24 cyc\es.
The number of additions + subtractions can be minimized to 12 using the com-
mon subexpression precomputation algorithm. The code corresponding to the
resultant DAG requires 26 cyc\es.
Interestingly the spill-free DAG can be synthesized directly from the 4x4
matrix with elements 1,-1,3 and -3. The 4 outputs can be computed as
YI = Xl + X2 + X3 + X4, Y2 = YI + Xl «1 - X3«1 - X4«2,
Y3 = Y2 - Xl «1 - X2«1 + X4«2, Y4 = Y3 - X2«I + X3«2 - X4«1
The DAG for the above computation has 12 nodes and requires 17 cyc\es.
The results presented so far are summarized in table 6.4.
Table 6.4. Number of Nodes (Ns) and Cycles(Cs) for Various DAG Transforms
Original Minimum Serialized spiII-free

Transform adds+subs butterft ies DAGs
Ns Cs Ns Cs Ns Cs Ns Cs
4x4 Walsh-Hadamard 12 20 8 22 8 19 9 14
8x8 Walsh-Hadamard 56 72 24 64 24 52 35 44
8x8 Haar 24 40 14 39 14 30 20 32
4x4 Slant 16 24 12 26 12 23 12 17
The following subsections present a technique that reorders the nodes of the
DAG to achieve low power realization of multiplication-free linear transfonns
on a single register, accumulator based architecture.
6.2.7. Sources and Measures of Power Dissipation

The six busses of the architecture shown in figure 6.15 are networks with a
large capacitive loading. Hence signal switching in these networks has a sig-
nificant impact on power consumption. Since during each cycle, an instruction
is fetched from the program memory, the power dissipated in the PDB bus is
directly dependent on the total Hamming distance between successive instruc-
tions. The power dissipation in the PAB bus is dependent on the addresses of
the instructions being fetched. Techniques such as gray coded addressing have
been proposed to reduce power dissipation in the PAB bus. The DRAB and
DRDB busses see the activity during the execution of LAC, ADD and SUB
instructions. While the amount of switching and hence the power dissipation
in the DRDB bus is dependent on the data values, the power dissipation in the
DRAB bus is dependent on the total Hamming distance between successive data
read addresses. Since the code for the multiplication-free linear transforms is
executed sequentially, static analysis of the code is adequate to compute these
measures of power dissipation. The following measure is used for power dis-
sipated due to the execution of i th instruction:
Pli] = HO (Opeode part of Instruction[i-l], Opeode part of instruetion [iD +

2 * HO (Memory address part of Instruetion [i-I], Memory address part of lnstruetion[i])
6.2.8. Low Power Code Generation

For one dimensional transforms, the chain structure results in a code that
causes zero accumulator spills and is hence optimal in terms of number of cy-
cles. The nodes of the Chain structured DAG can be reordered so as to reduce
the power dissipation in the PDB and DRAB busses. To arrive at an optimal
order, a fully connected graph is constructed with the nodes of the graph rep-
resenting the nodes of the DAG (and thus the corresponding instruction). The
edges of the graph are assigned weights in the following way:
W[i,j] = HO (Opeode part of Instruetion[i], Opeode part of instruetion [j]) +

2 * HO (Memory address part of Instruetion [i], Memory address part of Instruetion[j])
The problem of finding an optimum node order can be reduced to the problem
of finding the lowest cost Hamiltonian path in an edge-weighted graph or the
traveling salesman problem. Since the last instruction has to be SAC (i.e. store
the accumulatorcontents into the specified data memory location), the algorithm
uses the corresponding node as the starting point and works backwards to get
the lowest cost Hamiltonian path. It also comprehends the constraint that the
first instruction has to be LAC (Load the accumulator with the contents of
Table 6.5. Hamming Distance Measure for Accumulator based Architectures
Transform Initial DAG Reordered DAG %reduction

Prewitt Window 36 28 22%
Sobel Window 40 24 40%
Spatial High Pass Filter 42 28 33%
Spatial Low Pass Filter 34 25 26%
4x4 Haar 68 62 9%
4x4 Walsh-Hadamard 64 58 9%
the specified data memory loeation) and only those variables that are to be
multiplied by the weight of +1 ean be used for the LAC instruetion.
For a two dimensional transform, the spill-free DAG strueture is used as
the starting point and is partitioned into sub-DAGs bounded by primary output
eomputations (i.e. the SAC instruetions). For example, for the spill-free DAG
ofthe Walsh-Hadamard transform shown in figure 6.23, it is partitioned into four
sub-DAGs bounded by SAC Yl, SAC Y3, SAC Y2 and SAC Y4 instruetions.
The nodes within eaeh of the sub-DAGs ean then be reordered without affeeting
the overall funetionality, thus resulting in a code with redueed power dissipation.
Table 6.5 gives the Hamming distanee based measure (deseribed in see-
tion 6.2.7) for six multiplieation-free linear transforms. For the 3x3 window
transforms, it is assumed that the variables Xl to X9 are stored at loeations
OxOO to Ox08 and the output Y is stored at loeation OxOF. For the 4x4 Haar and
Wall transforms, it is assumed that the inputs Xl to X4 are stored at loeations
OxOO to Ox03 and the outputs YI to Y4 are stored at loeations Ox08 to OxOB.
1t is also assumed that the opeodes for LAC, ADD, SUB and SAC are I-hot
eneoded and henee have a Hamming distanee of two between any two of them.
As can be seen by the results the proposed node reordering technique results in
significant power reduetion.
Chapter 7
RESIDUE NUMBER SYSTEM BASED

IMPLEMENTATION
Residue Number System (RNS) based implementation of DSP algorithms

have been presented in the literature [29, 30, 92] as a technique for high speed
realization. In a Residue Number System (RNS), an integer is represented
as a set of residues with respect to a set of integers called the Moduli. Let
(mI, m2, m3, ... , m n ) be a set of relatively prime integers called the Moduli
set. An integer X can be represented as X = (Xl, X 2, X 3, ... , X n ) where
Xi = (X) modulo mi for i = 1,2, .. , n (7.1 )

we use notation Xi to represent IXlmi the residue of X w.r.t mi. Given the
moduli set, the dynamic range(M) is given by the LCM of all the moduli. If
the elements are pair-wise relatively prime, the dynamic range is equal to the
product of all the moduli [92]. The bit-precision of a given moduli set is
(7.2)
where M is the dynamic range of the given moduli set. So, the moduli
set is determined based on the bit-precision needed for the computation. Für
example, für 19-bit precisiün the modul i set 5,7,9,11,13,16 can be used [87].
Let X,Y and Z have the residue representations X = (Xl, X 2, X 3, ... , X n ),
Y = (YI , Y 2, Y 3, ... , Y n ) and Z = (Zl, Z2, Z3, ... , Zn) respectively and Z =
(X Op Y) where Op is any operation in addition, multiplication or subtractiün.
Thus we have in RNS,
Zi = lXi op Yilmi for i = 1,2, .. , n (7.3)
Since, Xi's and Yi 's require lesser precisiün than X and Y, the computation
üf Zi 's can be performed faster than the computation of Z. Müreüver, since the
171

computations are independent of each other, they can be executed in parallel

resulting in significant performance improvement.
For example, Consider the moduli set (5,7,11)
Let X = 47 = (14715,14717,147111) = (2,5,3)
Y = 31 = (13115,13117',131111) = (1,3,9)
and Z = X + Y = 47 + 31 = 78 = (17815,17817, 178hd = (3,1,1)
Zi'S can be computed, using equation 7.3, as
Z = (12 + 115, 15 + 317, 13 + 9111) = (3,1,1)
While in RNS, basic operations like multiplication, addition, and subtrac-
tion can be performed with high speed, operations like division and magnitude
comparison require several basic operations and hence are slower and more
complicated to implement. Since most DSP kemels (such as FIR and UR filter-
ing, FFT, correlation, DCT etc) do not need these operations to be performed,
this limitation does not apply.
Figure 7.1 shows the RNS based implementation of an N-term weighted-sum
computation. The implementation is generic and assumes K moduli (MI to
MK) selected so as to meet the desired precision requirements. The coefficient
memory stores the pre-computed residues of the coefficients A[O] to A[N-]]
for each of the modul i M 1 to MK. The data memory stores the residues of the
data values for each of the moduli Ml to MK. The weighted-sum computation
is performed as aseries of modulo MAC operations. During each cycle the
Coefficient Address Generator and the Data Address Generator provide the
appropriate coefficient-data pairs to the K modulo MAC units which perform
the modulo computations in parallel.
Figure 7.2 shows an implementation of a modulo MAC unit. It consists of a
modulo multiplier and a modulo adder connected via an accumulator so as to
perform repeated modulo MAC and make the result available to the RNS-to-
Binary converter every N cycles.
The modulo multiplication and modulo addition is typically implemented as
a look-up-table (LUT) for small moduli. Figure 7.2 also shows the look-up-
table based implementation of a modulo 3 multiplier. The modulo MAC unit
as shown in figure 7.2 requires two look-ups. The performance can be further
improved by merging the two LUTs into a modulo Multiply-Accumulate (MAC)
LUT as shown in figure 7.3.
7.1. Optimizing RNS based Implementation of the

Weighted-sum Computation
This section presents techniques to optimize the RNS based implementation
of weighted-sum computations in the area-delay-power space.
Residue Number Systembased Implementation 173
A[O] modMI A[O] mod M2 A[O] modMK

Coefficient ---------
Address ~ A[I] modMI A[I] mod M2 A[I] modMK
Generator ---------
---------
A[N-I] mod MI A[N-I] mod M2 A[N-I] mod MK
MAC RNS
modMI MAC to r--- y
modM2 Binary
~
MAC
modMK
X[O] modMI X[O] mod M2 X[O] modMK

Oata ---------
Address ~ X[I]modMI X[I] modM2 X[I] modMK
Generator ---------
---------
X[N-I] mod MI X[N-I] mod M2 X[N-I] mod MK
f t
Binary 10 RNS
t
Xli]
Figure 7.1. RNS Based ImpJementation of FIR Filters
* moduJo 3
A[i]
A X Y
*
L + ~
,---
A
C - f-t> Y
A
00 00
00 01
00 JO
01 00
OJ 01
00
00
00
00
OJ
Z
moduJo M moduJo M
f----t' C
X
OJ 10
JO 00
JO
00
JO OJ 10
~ ~ 1010 01
I I
Xli] CJk
Figure 7.2. ModuJo MAC using Jook-up-tabJes

A[i]
L MAC
modulo M
-
A
C
"-
f---!> C I-~
f-C> y
T
I
Xli] Clk
Figure 7.3. Modulo MAC using a single LUT
7.1.1. Parallel Processing

The RNS based implementation shown in figure 7.1 can be modified so as to
read, for each of the moduli, two data-coefficient pairs during every cycle. The
RNS structure can be modified, as shown in figure 7.4, so as to have two modulo
MAC units per modulus . With such a parallel processing of degree two, an N
term weighted-sum can be computed in NI2 cycles. If the same throughput is
to be maintained, the clock can be slowed down appropriately and the supply
voltage lowered resulting in power savings.
7.1.2. Residue Encoding for Low Power

The modulo multiplier of the modulo MAC unit is typically implemented as
a look-up-table. The look-up-table for modulo 3 multiplier assumes that the
coefficient residues are binary coded (i.e. residue 0 as 00, residue 1 as 01 and
residue 2 as 10). However, since the coefficient residues are fed to the modulo
MAC units only, they need not necessarily be binary coded. Secondly, since
there is a separate MAC unit for each modulus, the coefficient residue coding
can be different across moduli. For example, residue 0 for modulus 5 may be
coded as 010 and for modulus 7 may be coded as 100. This ftexibility in the
coefficient coding can be exploited to minimize the switching activity in the
coefficient memory output busses that feed the modulo MAC units. Such a
coding can thus reduce the power dissipation in the coefficient memory and the
modulo MAC units.
Here is the formulation of the coefficient residue encoding problem. For
a given modul i Mi, a fully connected graph is constructed with the nodes of
the graph representing the residues. The edges (E[i,j]) of the graph are then
assigned weights to indicate the number of times the corresponding residues
Ci' and 'j') appear sequentially in the coefficient memory. The residues can
Residue Number System based lmplementation 175
A[O] modMI A[I]modMI

Coefficient ----------
Address ~ A[2] modMI A[3] mod MI
Generator ---------
---------
A[N-2] mod MI A[N-I] mod MI
1 1
+
~
MAC
~ RNS
modMI modMI
MAC ----;0
to -.... y
----;0
mod MI Binary
----;0
X[O] mod MI
I
X[I] modMI
Data ----------
Address ~ X[2] mod MI X[3] mod MI
Generator ---------
---------
X[N-2] mod MI X[N-I]modMI
t t t
Binary to RNS
1
Xli]
Figure 7.4. RNS Based Implementation of FIR Filters with Parallel Processing Transformation
then be coded so as to minimize the following cost function:
CF = LLHD[i,j]. W[i,j] (7.4)

j
where HD[i,j] is the Hamming distance between the coding of residues 'i'
and 'j', and W[i, j] is the edge weight. It can be noted that the cost function
CF is similar to that used for FSM state encoding. The stochastic evolution
based optimization strategy described in [47] can thus be used to perform the
coefficient residue encoding.
7.1.3. Coefficient Ordering

Since modulo addition is both commutative and associative, the order in
which the coefficient-data pairs are fed to the modulo MAC units can be changed
without impacting the functionality. The residues in the coefficient memory can
be reordered so as to minimize the total Hamming distance between successive
values of the coefficient residues. Such a minimization reduces the switch-

ing activity in the coefficient memory output busses and hence reduces power
dissipation in the coefficient memory and the modulo MAC units.
The problem of finding an optimum coefficient order can be mapped onto
the problem of finding the lowest cost Hamiltonian Circuit in an edge-weighted
graph or the traveling salesman problem. The coefficients map onto the cities
and the Hamming distances between the residues of the coefficients map onto
the distances between cities. The optimal coefficient order thus becomes the
optimal tour of the cities, where each city is visited only once and the total
distance traveled is minimized.
It can be noted that this technique is similar to the technique discussed in
chapter 2 (section 2.3.2) except that the coefficient residues are used for calcu-
lating the Hamming distance. The coefficient order can either be kept consistent
across moduli or optimized for each modulus. The later results in higher savings
but increases the complexity of address generation.
7.1.4. Exploiting Redundancy

The modulo multiplication and the module addition for small moduli is typ-
ically performed using look-up-tables. One approach to implement a look-up-
table is to use a memory such as a ROM or a RAM. However, since the number
of entries in a look-up-table are typically less than the number of rows of the
memory block, such as implementation is not area efficient. For example, a
look -up-table for modulo 5 multiplication wh ich has 25 entries, needs a memory
that has 64 rows (6 address lines). Thus a more area-efficient way to implement
the look-up-tables is to realize them as PLAs.
The look-up-table area can be further reduced by exploiting the commutativ-
ity of modulo multiplication and modulo addition. For example, a look-up-table
for modulo 5 multiplication has an entry corresponding to inputs 2 and 3 and
also has an entry corresponding to inputs 3 and 2. This redundancy can be ex-
ploited to reduce the number of entries in the look-up-table and correspondingly
reduce the PLA area. For example, for a modulo 5 multiplication the number
of entries in the look-up-table can be reduced from 25 to 15. In general, for a
modulo Mi multiplication (or addition), the number of entries in the look-up-
table can be reduced from Ml to Mi (Mi + 1)/2. While the look-up-table area
is reduced, there is an associated overhead of the logic that needs to optionally
swap the inputs to the look-up-table. Figure 7.5 shows a scheme (SCHEME I)
that swaps the inputs if the first input is greater than the second input. It can be
noted that the reduction in the look-up-table area is thus achieved at the expense
of multiplexers and a comparator.
The comparator overhead can be eliminated by using one of the bits (say
LSB) of one of the inputs as the control signal to the multiplexers for swapping
the inputs. Figure 7.5 also shows an implementation ofthis scheme (SCHEME
AII)···(M-lil All) .. (M-lil
*
moduln M *
modulo M
XII) (M-lil XII) ... (M-1i1
SCHEME I SCHEME 11
Figure 7.5. Minimizing Look Up Table Area by Exploiting Redundancy
11). For example, consider the look-up-table for modulo 3 multiplication and an
input swapping scheme in which the inputs to the look-up-table are swapped
if the LSB of the first input is 1. It can be noted that with such ascheme, the
input pairs 01 00 and 01 10 will never be fed to the LUT. Thus the number of
entries of the look-up-table can be reduced from 9 to 7. While this reduction is
Iess than the reduction from 9 to 6 entries using the scheme mentioned earlier,
overall it may be more area efficient and faster (no comparator delay).
7.1.5. Residue Encoding for minimizing LUT area

Coefficient residue encoding was presented in section 7.1.2 as a technique to
minimize switching in the coefficient memory output busses. The coefficient
residue coding alters the modulo multiplication truth-table and hence impacts
the area of the PLA. The area of the PLA implementing the modulo addition
can be impact by coding the residue values of the input data. Such a coding
also impacts the area of the PLA implementing the modulo multiplication. The
binary-to-RNS and the RNS-to-Binary converters can be suitably modified to
comprehend the residue coding of the data. Techniques based on symbolic
input-output coding can be used so as to appropriately encode the coefficient
residues and the data residues so as to minimize the PLAs implementing the
modulo multiplication and the modulo addition across all moduli. If this tech-
nique is to be applied in conjunction with the technique showed in figure 7.5
(SCHEME I), the impact of coding on the compare logic area also needs to be
comprehended. Such a coding algorithm has been presented in [43].
The area improvements obtained by exploiting the redundancy in the com-
putation and encoding the residues are presented in tables 7.1, 7.2, and 7.3 for
modulo adder, modulo multiplier, and modulo MAC respectively. The results
are presented for RNS based implementations based on the moduli set {5, 7,
9, 11, 13}. The metric chosen for PLA area is the product of the number of
rows and the number of columns in the PLA. The number of columns in the
Table 7.1. Area estimates für PLA based müdulü adder implementatiün
Xfürm M=5 M=7 M=9 M=11 M=13

Cünv 270 540 1060 1580 2200
Xform1 217(19%) 363(32%) 736(30%) 1050(33%) 1364(38%)
Xform2 225( 16%) 420(22%) 900(15%) 1320(17%) 1820(17%)
Xfürm3 240(11%) 420(22%) 1220 1340(15%) 1740(21%)
Xfürm1+3 172(36%) 368(31 %) 673(36%) 1172(25%) 1443(34%)
Xfürm3+1 217(19%) 355(34%) 804(24%) 1066(32%) 1198(45%)
Xform3+2 195(27%) 315(42%) 980(7.5%) 1280(19%) 1480(33%)
PLA is equal to 2*(no. of input bits) + no. of output bits. The number of rows
in PLA are obtained from the residue encoded truth table minimized using
Espresso [62].
It can be noted that the redundancy elimination technique can be used in
conjuction with residue encoding. Tables 7.1, 7.2 and 7.3 show area reduction
for such combination of transformations as weIl. Here are the cases for which
the results are presented:
Conv: Area with conventional implementation.

Xform 1: Area improvement with redundancy elimination - Scheme I
Xform2: Area improvement with redundancy elimination - Scheme 11
Xform3: Area improvement with coefficient residue encoding
Xform 1+3: Area improvement with Xform 1 followed by Xform3.
Xform3+ 1: Area improvement with Xform3 followed by Xform 1.
Xform3+2: Area improvement with Xform3 followed by Xform2.
As can be seen from the results, the PLA area can be reduced by as much as
45% (corresponding to modulo-13 addition using Xform3+ 1). The results also
show that no one combination of techniques results in the most area reduction
across all moduli. While (Xform3+ 1) combination gives maximum area reduc-
tion in most cases, it has an associated delay overhead. For modulo-7 addition,
the (Xform3+2) combination gives minimum area and it comes with minimal
delay overhead.
In modulo multiplication implementation, the PLA area can be reduced by as

much as 52% (corresponding to modulo-13 multiplication using Xform3+ 1).
The results also show that for modulo-5 multiplication, just the Xform 11 gives
the most area efficient PLA.
Table 7.2. Area estimates for PLA based modulo multiplier implementation
Xform M=5 M=7 M=9 M=II M=13

Conv 225 285 960 1440 1860
Xform 1 202(10%) 258(9%) 656(31%) 990(31%) 1324(28%)
Xform2 165(27%) 225(21%) 760(21%) 1200(16%) 1540(17%)
Xform3 135(40%) 330 540(43%) 1000(31%) 920(51%)
Xform1+3 170(24%) 325 644(32%) 1046(27%) 1448(22%)
Xform3+1 157(30%) 314 547(43%) 884(38%) 886(52%)
Xform3+2 165(27%) 330 720(25%) 980(32%) 1160(38%)
Table 7.3. Area estimates for PLA based modu10 MAC implementation
Xform M=5 M=7 M=9 M=11 M=13

Conv 1428 4284 10696 20636 31892
Xform1 892(38%) 2346(45%) 6240(42%) 11454(45%) 17670(45%)
Xform2 1197(16%) 3381(21%) 8876(17%) 16744(18%) 26320(17%)
Xform 3 1113(22%) 2604(39%) 5124(52%) 17388(16%) 32088
Xform1+3 884(38%) 2133(50%) 3838(64%) 10748(48%) 18415(42%)
Xform3+1 766(47%) 1781(58%) 3648(66%) 10150(51 %) 17264(46%)
Xfonn3+2 1029(27%) 2667(37%) 7224(32%) 14728(28%) 25228(20%)
In case of MAC computation, the results show that the PLA area can be
reduced by as much as 66%. In this case Xform3+ I gives the minimum area
for all the moduli (i.e. 5, 7, 9, 11 and 13).
7.2. Optimizing RNS based Implementation of FIR Filters

While the techniques described in the earlier section can be applied to FIR
filters, this section presents additional transformations specific to FIR filters.
7.2.1. Coefficient Scaling

For ascale factor K, the FIR filtering equation 2.10 translates to:
N-l N-l
Y[n] = (K· L A[i]·X[n-i]) ·1/ K = (L (K·A[i]) ·X[n-i]) ·1/ K (7.5)
i=ü i=ü
Thus the FIR filter output can be calculated using coefficients scaled by a
factor K and the weighted-sum result scaled by 1/K.
Since the residue values of the scaled coefficients are different than the
residue values of the original coefficients, scaling can be used as a transfor-
]80 VLSI SYNTHESIS OF DSP KERNELS
mation to minimize switching in the coefficient memory output busses. For a

given set offilter coefficients, an optimum scale factor within the specified range
(e.g. ±3db) can be found such that the total Hamming distance between residue
values of successive coefficients, across all moduli, is minimized. Such a trans-
formation can thus reduce the power dissipation in the coefficient memory and
the modulo MAC units.
The RNS based implementation shown in figure 7.] can be modified to in-
clude a 'modulo multiply by l/K' LUTs in front ofthe RNS-to-Binary converter.
It can be noted that instead of using a common scale factor across all moduli,
a different scale factor which is optimal for each of the moduli can be chosen so
as to further reduce the power dissipation. In such a case, the 'modulo multiply
by 1/K' LUTs can be replaced by LUTs that use different scale factors for
different moduli.
7.2.2. Coefficient Optimization for Low Power

Coefficient optimization to minimize the total Hamming distance between
successive coefficient values has been presented in section 2.4.5. The algo-
rithm can be suitably adapted to minimize the total Hamming distance between
residues of successive coefficient values across all moduli. Coefficients can
thus be optimized so as to minimize the switching activity in the coefficient
memory output busses and hence reduce power dissipation in the coefficient
memory and the modulo MAC units.
7.2.3. RNS based Implementation of Transposed FIR Filter

Structure
The implementation shown in figure 7.1 corresponds to the direct form FIR
filter structure. The RNS based implementation based on the transposed form
FIR filter has been presented in [87] and shown to be efficient for very high
order filters and sm aller moduli. Figure 7.6 shows the implementation for a
modulus.
7.2.4. Coefficient Optimization for Area Reduction

As shown in figures 7.2 and 7.3, the filter coefficients and data form the
inputs to the modulo multiplier and the modulo MAC units. The area of the
PLA based modulo multiplier and the modulo MAC depends on the number of
combinations ofthe coefficients and data for which the PLA needs to be realized.
For area efficient implementation, the coefficients can be optimized such that the
number of unique residues across the moduli set are reduced thereby reducing
the entries in PLA based modulo multiplier and the modulo MAC unit. Such
a coefficient optimization can be performed by suitably adapting the algorithm
presented in chapter 2 (section 2.4.5).
X[i] ------,,--------,-------,-----------.
*2 *3 * (M-I)
modM modM modM
CONNECTION ARRA Y
-I + +
Z
-I +
'----_Z_-I/I mod ,---_,/1 mod mod f-------I> Y[i]
M M M
Figure 7.6. Modulo MAC structure for Transposed Form FIR Filter
It can be noted that the reduction in the number of unique residues across
the moduli set results in an area efficient implementation of the transposed FIR
filter structure shown in Figure 7.6 because of the reduction in the number of
modulo multipliers needed in the structure.
The impact of this transformation can be appreciated from the results for
6 low pass filters shown below. These filters vary in terms of desired filter
characteristics and consequently in the number of coefficients. These filters
have been synthesized using the Park-McClellan's algorithm [73] for minimum
number of taps. The optimization has been performed using first improvement
and the steepest descent strategies. The strategies differ in the optimization
approaches. While in the steepest descent approach, in each iteration of the
optimization, the move which gives the maximum gain is selected, in first
improvement approach, the first move which gives gain is selected in each
iteration.
The coefficient values quantized to 16-bit 2's complement fixed point repre-
sentation form the initial set of coefficients for optimization. coefficient opti-
mization algorithm has been applied across the moduli set {5,7,9, 11,13, 17} for
PLA based implementation of modulo multiplier and modulo MAC. The area
improvements and total number of unique residues for different optimization
strategies for modulo multiplier and modulo MAC modules relative to the con-
ventional implementation are shown in Table 7.5. The metric chosen for PLA
area is the product of the number of rows and columns in the PLA. The number
of columns in a PLA is equal to { 2*(number of input bits) + the number of

output bits}. The number of rows are obtained by minimizing the resulting
truth table using Espresso [62]. The values under Mult and MAC columns in
table 7.5 give the sum of areas of the modules across the whole moduli set.
Table 7.4. Distribution of Residues across the Moduli Set
Filter 5 7 9 11 13 17 Total
LPI -lp_16L3KA.5L2A2_24
Conventional 5 6 6 9 7 7 40
Steepest 4 4 4 6 5 5 28
Ist Impr. 4 6 3 6 5 3 27
LP2 -lp_12L2L3K.12A5_28
Steepest 4 5 5 5 6 3 28
I st Impf. 5 5 4 5 3 5 27
LP3 -lp_IOL2K_3K_0.05AO_29
Steepest 5 5 5 4 4 4 27
Ist Impr. 4 6 5 3 7 4 29
LP4 -lp_12K_2.2K_3.1 K_.16A9_34
Steepest 3 6 6 6 6 5 32
I st Impf. 4 5 7 4 7 4 31
LP5 -lp_IOK_1.8K_2.5L.15_60AI
Steepest 5 7 8 8 8 9 45
I st Impf. 5 7 7 8 6 9 42
LP6 -lp_IOK_I.8L2.5L.03_70_55
Steepest 5 7 9 10 9 1I 51
I st Impf. 5 7 8 9 8 13 50
Results show that with coefficient optimization, area improvement of upto

52% in modulo multiplier block and 54% in the modulo MAC block. Ta-
ble 7.4 shows the distribution of residues across the moduli set for different
optimization strategies compared to the distribution in the conventional case.
For example, for filter LPI and modulo 17 the conventional implementation
has value 7 and the first improvement strategy has a value 3. This means that
out the possible 17 residues for modulo 17, the coefficients map to 7 residues
in the case of conventional implementation, while the coefficients map to only
3 residues after optimization by first improvement strategy.
The names of the filters mentioned in the tables indicate the filter character-
istics. For example, the FIR filter Ip_16L3KA.5K_.2A2_24 is a low-pass filter
Residue Number System based Implementation 183
with the following characteristics:

Sampling freq. = 16KHz, Passband freq. = 3KHz,
Stopband freq. = 4.5KHz, Passband ripple = O.2db,
Stopband atten. = 42db and Number of filter coeffs = 24
Table 7.5. Impact of Coefficient Optimization on the Area of Modulo Multiplier and Modulo
MAC
Conventional Steepest Ist Impr. Best % Impr.

Filter Mult MAC Mult MAC Mult MAC Mult MAC
LPI 5565 62060 4100 44780 3500 35600 37% 43%
LP2 6775 75195 3375 36215 3215 34880 52% 54%
LP3 6485 71260 3365 33515 3610 38060 48% 53%
LP4 6230 69910 3805 40295 3645 36005 42% 49%
LP5 7465 85840 5880 64965 5330 59890 29% 30%
LP6 7940 90470 7150 81065 7070 80115 11% 11%
In terms of the optimization strategies, the first improvement approach per-

forms marginally better in most of the cases. It is to be noted that the runtimes
for first improvement strategy is far less than that of steepest descent strategy
because of the exhaustive search involved in selecting the best move in each
iteration with steepest descent approach.
It can be noted that this transformation can be applied in conjunction with
residue encoding and redundancy elimination schemes to achieve further area
reduction.
7.3. RNS as an Optimizing Transformation for High

Precision Signal Processing
In applications, such as high quality audio, that need more than 16 bits of
precision, the processing of signals on a 16-bit fixed point DSP requires double
precision computation and is hence time consuming. Since in Residue Number
System (RNS), the high precision data is decomposed to lower precision for its
processing, RNS can be used as an optimizing transformation [44] for improving
the performance of such applications implemented on a single processor.
The selection of moduli is an important consideration in the RNS based
implementation as it determines the computation time. The issues involved
are:
The moduli set should be pairwise relatively prime to enable high dynamic
range.
2 The modul i set has to be selected such that residue computations are easy
(e.g. 2n ,2 n - 1) [92].
3 Since the implementation is on a single processor and computations w.r.t.

to each moduli has to be done sequentially, it is desirable to have as few
moduli as possible.
4 The moduli should have simple multiplicative inverses. This ensures con-
version from residue to binary domain with less computations.
Based on the above mentioned considerations, for a programmable DSP

based implementation a moduli set ofthe form (2 n , 2n -1, 2n - 1 -1, 2n - 2 -1)
can be selected. It has been shown by Szabo and Tanaka [92) that the four
moduli are pairwise relatively prime if and only if n is odd. So, for n = 15,
this moduli set provides a high dynamic range of more than 56 bits and also
offers an advantage of simplicity in determining the additive and multiplicative
inverse. The additive inverse of a number w.r.t. modulo of form 2k - 1 is just
the 1's complement representation of the number.
The results of RNS based implementation of an N-tap FIR filter with 24 bit
data and coefficient precision on TMS320C5x DSP are shown in tables 7.6 and
7.7.
Table 7.6. RNS based FIR filter with 24-bit precision on C5x
Pgm. Mem. Data Mem. Exec. time

size( words) size(words) (cycles)
BIN_FIR 86 4N+8 32N+20
RNS_FIR N+269 7N+19 16N+261
Table 7.7. Number of Operations for RNS based FIR filter with 24-bit precision on C5x
Loads & Stores Adds Mults

BIN_FIR IIN+4 ION 4N
RNS_FIR 4N+48 3N+60 4N+I
As can be seen from table 7.6, the program and data memory requirements of
the RNS based implementation (RNS--FIR) are much higher than the implemen-
tation in the binary domain (RNS~IN). In terms of execution time however,
the RNS based implementation requires fewer cycles for filters with more than
Residue Number System based Jmplementation 185
15 taps. As can be seen from table 7.7, the RNS based implementation requires
fewer loads and stores and performs fewer number of additions than the binary
domain implementation for filters with more than 8 taps. Thus the RNS based
implementation is also power efficient for higher filter orders.
It can be noted that the transformation based on the multi rate architectures
presented in chapter 2 (section 2.4.2) can be applied in conjunction with the
RNS based implementation [44] to further improve performance and reduce
power dissipation.
Chapter 8
A FRAMEWORK FOR ALGORITHMIC AND

ARCHITECTURAL TRANSFORMATIONS
Chapters 2 to 7 have presented many algorithmic and architectural trans-

formations targeted to the weighted-sum and the MCM computations realized
using different implementation styles. This chapter proposes a framework that
encapsulates these transformations and enables a systematic exploration of the
area-delay-power solution space. The framework is based on a classification
of the transformations into seven categories which exploit unique properties of
the DSP algorithms and the implementation styles.
8.1. Classification of Algorithmic and Architectural

Transformations
Implementing data movement by moving the pointer to the data
Most DSP algorithms are data intensive and perform sequential shifting of
data between two output computation. The power dissipation for such a data
shift can be minimized by moving the pointer to the data instead of moving
the data itself. The power reduction is thus achieved at the expense of the
decoder overhead and increased control complexity. The transformations
of this type incIude:
• Circular Buffers used in programmable DSPs to achieve data movement

after every FIR filter computation (section 2.4.1).
• Shiftless Implementation of DA based FIR Filters described in sec-
tion 4.3.2.
2 Data Flow Graph Restructuring Transformations

Most DSP algorithms are repetitive in nature. This property can be exploited
to re-structure the DFGs so as to achieve the desired area-delay or area-power
tradeoffs. The transformations of this type incIude:
187

• Parallel Processing [15, 76]

• Pipelining [15]
• Re-timing [60]: An application ofthis transform is the MCM based FIR
filter structure (figure 3.5) derived from the direct form structure.
• Loop Unrolling [22,60]
3 Data Coding
The area, delay, power parameters of most implementation styles are im-
pacted by the bit pattern of the processed data. Data coding techniques use
different number representation schemes so as to appropriately alter the bit
pattern and hence impact area, delay and power parameters. Data coding
has the associated overhead of encoding and decoding. The transformations
of this type include:
• Gray Coding: used to minimize power dissipation in the busses with

sequential data values (sec 2.2.2.1). It is also used to minimize the
decoder power in the shiftless DA based implementation of FIR filters
(section 4.3.2).
• TO coding: also used to minimize power dissipation in the busses with
sequential data values (section 2.2.2.2).
• Bus Invert Coding: used to minimize power dissipation in the busses
with data values having no specific pattern (section 2.2.2.3).
• Nega-binary Coding: used to minimize power dissipation in the input
data shift register of the DA based implementation of FIR filters (sec-
ti on 4.3.1).
• Uni-sign representation: used to improve area efficiency of multiplier-
less implementation ofweighted-sumlMCM computations (section 5.3.1.2).
• CSD representation: also used to improve area efficiency of multiplier-
less implementation ofweight-sumlMCM computations (section 5.3.1.3).
• Coefficient Residue Encoding: for low power realization of RNS based
weighted-sum computation (section 7.1.2). Also used to minimize look-
up-table area in the RNS based implementation of weighted-sum com-
putation (section 7.1.5)
4 Transformations that Exploit Redundancy in the Computation

The transformations of this type reduce the computational complexity of an
algorithm by exploiting redundancy in the computation. This is achieved by
either reducing the number of operations or replacing complex operations
(such as multiplication) by simpler operations (such as addition and shift).
These transformations typically destroy the regularity of the computation
A Frameworkfor Algorithmic and Architectural Transformations 189
and also increase the control complexity. The transformations of this type
include:
• Multirate Architectures for FIR filter implementation [66].

• Block FIR Filters [77]
• Common Subexpression Precomputation: used to improve the area effi-
ciency of a multiplier-less implementation of a weighted-sum computa-
tion (sections 5.1 and 5.2) and also to reduce the number of operations
required to perform two dimensional multiplication-free linear trans-
forms (section 6.1.4).
• Fast Fourier Transform (FFT) [73]
• Fast Discrete Cosine Transform (DCT) [82]
• Redundancy elimination schemes (section 7.1.4) to reduce the areas of
PLAs implementing modulo-add, modulo-multiply and modulo-MAC
look-up-tables in the RNS based implementation ofweighted-sum com-
putation.
5 DFG Transformations based on Mathematical Properties

These transformations exploit the properties such as commutativity, associa-
tivity and distributivity of mathematical operators so as to suitably restruc-
ture the data flow graph. Here are some examples of the transformations of
this type:
• Linear Phase FIR Filters which exploit the coefficient symmetry and
use the distributivity property to reduce by half the number of multipli-
cations (section 3.4.1.2).
• Coefficient Scaling discussed in sections 2.4.4 and 7.2.1.
• Selective Coefficient Negation discussed in section 2.3.1.
• Coefficient Ordering discussed in sections 2.3.2 and 7.1.3.
• Se\ective Bit Swapping of ALU Inputs in section 2.3.3.
• Computing output Y[n] in terms of Y[n-l] of an FIR Filter discussed in
section 5.3.2.
• DAG Node Reordering for Low Power discussed in section 6.1.3.
• DAG Transformation - Tree to Chain Conversion discussed in sec-
tion 6.2.5.1.
• DAG Transformation - Serializing a Butterfly discussed in section 6.2.5.2.
• DAG Transformation - Fanout Reduction discussed in section 6.2.5.3.
• DAG Transformation - Merging discussed in section 6.2.5.4.
6 Exploiting Relationship between the Real Value Domain and the Binary
Domain
While the frequency response of an FIR filter depends on the real values of
the coefficients, the area-delay-power parameters are affected by the prop-
erties of the binary representation of the coefficients. A small change in the
value of a coefficient has minimal impact on the filter characteristics but
can impact the binary representation significantly. For example, numbers
31 and 32 have a small difference in the real value domain, but in the binary
domain 31 has five 1s while 32 has just one 1. This relationship can be ex-
ploited to suitably alter the filter coefficients while still meeting the desired
filter characteristics in terms of passband ripple and stopband attenuation.
The transformations of this type include:
• Coefficient Optimization to reduce the power dissipation in the coef-

ficient memory data bus for the FIR filters implemented on a pro-
grammable DSP (section 2.4.5) and also for RNS based implementation
(section 7.2.2).
• Coefficient Optimization to minimize the number of non-zero bits in
the CSD representation of filter coefficients. This helps to reduce the
number of additions in the multiplier-less implementation of FIR fil-
ters [83, 105].
• Coefficient Optimization to reduce the area of the look-up-tables in RNS
based implementation of FIR filters (section 7.2.4).
7 Transformations that Exploit Available Degrees of Freedom

For a given implementation style, multiple alternatives may exist to realize
some of its functionality. As an example, consider the loading of coefficient
and data values in the program and data memories respectively, for a pro-
grammable DSP based implementation of an FIR filter. Typically multiple
options exist for the start address of these data blocks. If the parameters
such as memory size and number of cyc\es are not affected by the start
address, any of the available options can be arbitrarily chosen for loading
the data. However, if a new design constraint such as power dissipation
is to be comprehended and is affected by the start address, this degree of
freedom presents an opportunity to optimize the design constraint. The
transformations that exploit such degrees of freedom include:
• Allocation of Coefficient and Data Memory to reduce power dissipation

in the address busses for the programmable DSP based implementation
of FIR filters (section 2.2.1).
• Bus Bit Reordering to reduce cross-coupling related power dissipation
in the busses (section 2.2.5).
• Coefficient Partitioning to improve area efficiency of a DA based im-

plementation of FIR filters, that uses two LUTs (section 4.2).
• Register Assignment for low power code generation of multiplication-
free linear transforms (section 6.1.5).
8.2. A Snapshot of the Framework

This section proposes a framework for systematically applying the transfor-
mations of each type. The framework captures for each of the transforms the
parameters it optimizes and the desired characteristics of the algorithm and of
the implementation style for the transform to be effective. For example, con-
sider the gray coding transform. This transform can be used to minimize power
dissipation in a bus. The transform is most effective if the algorithm results in
sequential data values on the bus and if the bus represents a large capacitive load
in the target implementation style. The framework also captures the overheads
associated with the transformation. For example, the gray coding transform
results in the area and delay overhead due to the encoder and the decoder logic
at the source and the destination of the bus respectively.
Figures 8.1 and 8.2 show a snapshot of the framework that can be used to
achieve the desired area-power tradeoff for a DSP algorithm.
It can be noted that figures 8.1 and 8.2 depict a representative sampie of
the complete framework. The complete framework not only comprehends all
existing transformations, but also suggests a methodology for identifying new
transformations that could be specific to a new algorithm or a new implemen-
tation style. This is achieved by systematically exploring the different classes
of transformations.
The framework can be further enhanced by incorporating estimators that can
quantify the gains and overheads associated with each of the transformations.
Get the 'baseline' mapping of the algorithm on the target implementation style
Identify main sources of power dissipation
Does the algorithm Is the decoder overhead Implement data move

involve sequential
~ and increased control ~ as movement of the
data movement ? complexity acceptable ? pointer to the da ta
N NI
I
~ +
Is the algorithm Can the supply voltage
repetitive in ~ be scaled ? and
nature? Is increase in the Y
f----- Use parallel processing
area acceptable ?
N Is increase in the area y
f----- Use pipelining
and latency acceptable ?
..... _-_._-_ .............. __ ......... . ..................... __ ...........
Is increase in the
control complexity ~ Use loop unrolling
acceptable ?
NI
I
~ +
Is there any Is the potential loss in Use common sub-
redundancy in regularity and the expression pre-
the computation
that can be
exploited?
--
y corresponding increase
in control complexity
acceptable ?
~
computation or other
transforms to reduce
computational
complexity
N NI
I
tt
Continued on
* the next page *
Figure 8.1. A Framework for Area-Power Tradeoff

* Continued from
*
the previous page
!
Are there busses Is the encoder-
y Use Gray coding
(e.g. address bus)
~ decoder overhead ~ or TO coding
that see sequential acceptable ?
data access ?
NI
N 1~ t I
Are there busses Is the encoder- Use bus-invert
y
that see values that ~ decoder overhead ~ coding
are da ta dependent ? acceptable ?
NI
N 1~ ~ I
Are representative Is there a layout-level Use bus bit-reordeing
traces of data values
~ t1exibility in routing ~ during routing
availab1e for the a bus?
busses?
NI
Nl J t I
Does the impl-
Is the area overhead of Use selective bit-
mentation havc ALU Y
with large capacitive ~ a mux and an xor gate ~ swapping for the
per input bit acceptable ? ALU inputs
input busses?
NI
N 1~ t I
Implementation with the desired area-power tradeoff
Figure 8.2. A Framework for Area-Power Tradeoff - continued

Chapter 9
SUMMARY
As the seemingly simple and straightforward problem ofweighted-sum com-

putation was approached, its multiple facets started unfolding. Several opti-
mization opportunities were identified in the area-delay-power space targeted
to technologies ranging from programmable processors to hardwired implemen-
tations with or without a hardware multiplier. The algorithmic and architectural
transformations presented in this book cover this entire solution space.
Programmable DSP based Implementation

For programmable DSP based implementations, the book has focussed on
the issue of low power realization. Generic extensions to DSP architectures for
low power were presented, followed by techniques specific to weighted-sum
computation and finally transformations which are specific to FIR filters. These
were captured in a comprehensive framework for low power realization of FIR
filters on a programmable DSP.
Implementation using Hardware Multiplier(s) and Adder(s)

The effectiveness of many high level synthesis transformations such as par-
allel processing, pipelining, loop unrolling and re-timing were evaluated in the
context of weighted-sum computation. This led to an important observation
that the actual gains from these transformations are heavily dependent on the
hardware resources available for implementation. The effect of these transfor-
mations - specifically parallel processing - on the peak power dissipation
was also analyzed. While the average power dissipation continues to reduce
with increasing degree of parallelism, the peak power dissipation can, in fact,
start increasing beyond a point. The relationship between the supply voltage
Voo, the threshold voltage VT and the degree ofparallelism N can be analyzed
such that for a given value of the VOO/VT ratio, a limit on N can be derived
195

beyond which the peak power dissipation starts increasing. The DFG transfor-
mations mentioned above do not impact the computational complexity; they
achieve power reduction at the expense of increased area. In the context of FIR
filters, multi rate architectures were presented as structures which reduce com-
putational complexity and thus achieve power reduction with minimal datapath
area overhead.
Distributed Arithmetic Based Implementation

For Distributed Arithmetic based implementations, the transformations pre-
sen ted in this book focused on three important aspects. Firstly, various DA
based architectures were analyzed as techniques to arrive at multiple solutions
in the area-delay space for weighted-sum computation. Secondly, the flexibil-
ity in coefficient partitioning was exploited to improve the area efficiency of
DA structures which use two look-up-tables. Finally, a data encoding scheme
based on the nega-binary number representation was presented to reduce power
dissipation in DA based FIR filters.
Multiplier-Iess Implementation
The problem of minimizing the number of additions in the multiplier-less
implementations of 1-D and 2-D linear transforms was presented. A common
subexpression precomputation technique that can be applied to both weighted-
sum computation and the multiple constant multiplication (MCM) computation
was presented. In the context of FIR filters, coefficient transformations were
identified which, in conjunction with the common subexpression precompu-
tation technique, realize area-efficient multiplier-less FIR filters. Since the
resultant data flow graphs have operators and variables with varying bit preci-
sion, an approach to precision-sensitive high level synthesis was discussed.
Code Generation of Multiplication-free Linear Transforms

Many image processing algorithms perform weighted-sum computation where
the weight values are restricted to {O, 1, -I}. The common subexpression pre-
computation technique can be applied to minimize the number of additions
required to implement such two dimensional multiplication-free linear trans-
forms. Depending on the target architecture, (register-rich or single register,
accumulator based), appropriate code generation strategy and DAG optimiza-
tion transformations can be used to improve performance and reduce power
dissipation of the implementation of these transforms.
Residue Number System Based Implementation

In RNS based implementation, the weighted-sum computation problem can
be solved as multiple weighted-sum computations of smaller precision exe-
cuted in parallel. For smaller moduli, the weighted-sum is performed using
Summary 197
look-up tables (LUTs) thus enabling a multiplier-Iess implementation. Tech-

niques such as residue encoding and LUT redundancy elimination enable better
area efficiency and reduce power dissipation of such implementations. These
parameters can further be optimized in the context ofFIR filters using additional
transformations such as coefficient optimization.
A Framework for Aigorithmic and Architectural Transformations

The various transformations discussed in the book were classified into seven
categories. The classification is based on the properties that the transformations
exploit to achieve optimization in the area-delay-power space.
A framework was proposed for a systematic application ofthese transforma-
tions. For each of the transformations, the framework captures the parameters
it optimizes, the desired characteristics of the algorithm and of the implemen-
tation style for the transformation to be effective.
There are two more types of transformations that can be incorporated in

the framework so as to make it more comprehensive. The first type includes
techniques that perform area-delay-power tradeoffs dynamically. Techniques
for dynamic power management and system-level adaptive voltage scaling [23]
are examples of such transformations. The second type of transformations use
'quality' as a variable to trade area, delay and power with one another. One
example of such a transformation is the setting of word length for storing the
data and coefficient values so as to achieve the desired area vs quality trade-off
for FIR filtering.
References
[I] J. W. Adams and A. N. Willson, "Some Efficient Digital Prefilter Structures", IEEE
Transactions on Circuits and Systems, May 1984, pp. 260-265
[2] M. Agarwala and P. T. Balsara, "An Architecture for a DSP Field Programmable
Gate Array", IEEE Transactions on VLSI Systems, March 1995, pp. 136-141
[3] Vikas Agrawal, Anand Pande, Mahesh Mehendale, "High Level Synthesis of
Multi-precision Data Flow Graphs", 14th International Conference on VLSI De-
sign, January 2001, pp. 411-416
[4] A. Aho, R. Sethi and J. Ullman, Compilers Principles, Techniques and Tools,
Addison- Wesley, 1986
[5] G. Araujo, S. Malik, M. T-C. Lee, "Using Register-Transfer Paths in Code Genera-
tion for Heterogeneous Memory-Register Architectures", ACMIIEEE 33rd Design
Automation Conference, 1996, pp. 591-596
[6] ARM7TDMI Data Sheet, Advanced RISC Machines Ltd.(ARM), 1995
[7] R. S. Bajwa, M. Hiraki, H. Kojima, D. J. Gorny, K. Nitta, A. Sridhar, K. Seki

and K. Sasaki, "Instruction Buffering to Reduce Power in Processors for Signal
Processing", IEEE Transactions on VLSI Systems, December 1997, pp. 417-424
[8] Luca Benini, G. D. Micheli, E. Macii, D. Sciuto, C. Silvano, "Asymptotic Zero-

Transition Activity Encoding for Address Busses in Low-Power Microprocessor-
Based Systems", GLS- VLSI'97, 7th Great Lakes Symposium on VLSI, 1997, pp.
77-82
[9] N. Binh, M. Imai, A. Shiomi and N. Hikichi, "A Hardware/ Software Partition-
ing Algorithm for Designing Pipelined ASIPs with Least Gate Counts", 33rd
ACMIIEEE Design Automation Conference, DAC-1996, pp. 527-532
[10] Ivo Bolsens, Hugo J. Oe Man, Bill Lin, Kar! Van rompaey, Steven Vercautcren and
Dicderik Verkest, "Hardware/Software Co-Design of Digital Telecommunication
Systems", Proceedings of the IEEE, March 1997, pp. 391-418
[I I] Jon Bradley, Ca/cu/ation «f TMS320C5x Power Dissipation Application Report,

Texas Instruments, J 993
199
200 VLSI SYNTHESIS OF DS? KERNELS
[12] R. K. Brayton, C. T. McMullen, G. D. Hachtel and A. Sanglovanni-Vincentelli,

Logic Minimization Algorithmsfor VLSI Synthesis, New York: Kluwer Academic,
1984
[13] C. S. Burrus, "Digital Filter Structures Described by Distributed Arithmetic", IEEE

Transactions on Circuits and Systems, December 1977, pp. 674-680
[14] A. Chandrakasan, M. Potkonjak, J. Rabaey and R. Brodersen, "HYPER-LP: A

System for Power Minimization using Architectural Transformations", ICCAD-
1992, pp. 300-303
[15] A. Chandrakasan and R. Brodersen, "Minimizing Power Consumption in Digital

CMOS Circuits", Proceedings of the IEEE, April 1995, pp. 498-523
[16] Jui-Ming Chang and Massoud Pedram, "Energy Minimization Using Multiple
Supply Voltages", IEEE Transactions on VLSI Systems, December 1997, pp. 436-
443
[17] Pali ab Chatterjee and Graydon Larrabee, "Gigabit Age Microelectronics and Their
Manufacture", IEEE Transactions on VLSI Systems, March 1993, pp. 7-21
[18] N. Deo, Graph Theory with Applications to Engineering and Computer Science,
Prentice Hall India, 1989
[19] S. Devadas, H. T. Ma, A. R. Newton and Sanglovanni-Vincentelli, "MUSTANG:

State Assignments of Finite State Machines Targeting Multi-Level Logic Imple-
mentations", IEEE Transactions on Computer-Aided Design, vol. CAD-7, Dec
1988, pp. 1290-1300,
[20] W.E. Dougherty, D.J. Pursley, D.E. Thomas, "Instruction subsetting: Trading
power for programmability", IEEE Computer Society Workshop on VLSJ'98,
1998 , pp. 42-47
[21] R. Ernst, J. Henkel and T. Benner, "Hardware-Software Cosynthesis for Micro-

controllers", IEEE Design and Test of Computers, Vol. 12, 1993, pp. 64-75
[22] Daniei Gajski, Nikil Dutt, A C-H Wu, S Y-L Lin, High,level Synthesis -Introduc-
tion to Chip and System Design, Kluwer Academic Publishers, 1992
[23] V. Gutnik and A. P. Chandrakasan, "Embedded Power Supply for Low-Power

DSP", IEEE Transactions on VLSI Systems, December 1997, pp. 425-435
[24] N. Halbwachs, P. Capsi, P. Raymond, and D. Pilaud, 'The Synchronous dataftow

programming language Lustre", Proceedings of the IEEE, September 1991, pp.
1305-1320
[25] P. N. Hilfinger, J. Rabaey, D. Genin, C. Scheers and H. De Man, "DSP Specification

using the SILAGE Language", International Conference on Acoustics, Speech and
Signal Processing, ICASSP-1990, pp. 1057-1060
[26] K. Illgner, H-G. Gruber, P. Gelabert, J. Liang, Y. Yoo, W. Rabadi and Raj Talluri,
"Programmable DSP Platform for Digital Still Cameras", International Conference
on Acoustics, Speech and Signal Processing, ICASSP-1999, pp. 2235-2238
REFERENCES 201
[27) Anil K. Jain, Fundamentals 0/ Digital Image Processing, Prentice Hall Inc. 1989
[28) R. Jain, et.al, "Efficient CAD Tools for Coefficient Optimization of Arbitrary
Integrated Digital Filters", IEEE International Conference on Accoustics, Speech
and Signal Processing, 1984
[29) W. K. Jenkins and B. Leon, "The Use of Residue Number System in the Design
of Finite Impulse Response Filters", IEEE Transactions on Circuits and Systems,
April 1977, pp. 191-201
[30) W. K. Jenkins, "A Highly Efficient Residue-combinatorial Architecture far Digital

Filters", Proceedings of the IEEE, June 1978, pp. 700-702
[31) I. Karkowski and R.H.J .M. Otten, "An Automatie Hardware-Software Partitioner
Based on the Possibilistic Programming", European Design and Test Conference,
1996, pp. 467-472
[32) D. Kodek and K. Steiglitz, "Comparison of Optimal and Local Search Methods for
Designing Finite Wordlength FIR Digital Filters", IEEE Transactions on Circuits
and Systems, January 1981, pp. 28-32
[33) E.L. Lawler, J.K. Lenstra, A.H.G. Rinnooy Kan, and D.B. Shmoys (eds.), The
Travelling Salesman Problem, lohn Wiley & Sons Ltd, 1985
[34) Edward A. Lee, "Programmable DSP Architectures: Part I", IEEE ASSP Maga-
zine, October 1988, pp. 4-19
[35) Edward A. Lee, "Programmable DSP Architectures: Part 11", IEEE ASSP Maga-
zine, lanuary 1989, pp. 4-14
[36) M. T.-c. Lee, V. Tiwari, S. Malik and M. Fujita, "Power Analysis and Minimization
Techniques for Embedded DSP Software", IEEE Transactions on VLSI Systems,
March 1997, pp. 123-135
[37) H. Lekatsas, l. HenkaL W. Wolf, "Code Compression for Low Power Embedded
System Design", ACM/IEEE Design Automation Conference, DAC-2000, pp.
294-299
[38) Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, "Instruction Selection
Using Binate Covering for Code Size Optimization", IEEE International Confer-
ence on CAD, ICCAD-1995, pp. 393-399
[39) Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, Albert wang, "Storage
Assignment to Decrease Code Size", ACM conference on Programming Language
Design and Implementation, 1995
[40) Kun-Shan Lin (ed.), Digital Signal Processing Applications with the TMS320
Family - Theory, Algorithms alld Implementations - Vol I, Texas Instruments, 1989
[41) E. Lueder, "Generation of Equivalent Block Parallel Digital Filters and Algo-
rithms by a Linear Transformation", IEEE International Symposium on Circuits
and Systems, 1993, pp. 495-498
[42) Gin-Kou Ma and Fred J. Taylor, "Multiplier Policies For Digital Signal Process-
ing", IEEE ASSP Magazine. January 1990, pp. 6-19
[43] M. N. Mahesh, Satrajit Gupta and Mahesh Mehendale, "Improving Area Effi-
ciency of Residue Number System based Implementation of DSP Algorithms",
12th International conference on VLSI design, January 1999, pp. 340-345
[44] M. N. Mahesh, Mahesh Mehendale, "Improving Performance of High Precision

Signal Processing Algorithms on Programmable DSPs", ISCAS-99, 1999 pp. 488-
491
[45] M. N. Mahesh, Mahesh Mehendale, "Low Power Realization of Residue Number

System based Implementation of FIR Filters", 13th International Conference on
VLSI Design, January 2000, pp. 30-33
[46] H. De Man, J. Rabaey, P. Six and L. Claesen, "Cathedral-II: A Silicon Compiler

for Digital Signal Processing", IEEE Design and Test of Computers Magazine,
December 1986, pp. 13-25
[47] Mahesh Mehendale, B. Mitra, "An Integrated Approach to State Assignment and
Sequential Element Selection for FSM Synthesis", 7th International Conference
on VLSI Design, 1994, pp. 369-372
[48] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Techniques for Low Power

Realization of FIR Filters", Asia and South Pacific Design Automation Confer-
ence, ASP-DAC'95, pp. 447-450
[49] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Coefficient Optimization for

Low Power Realization of FIR Filters", IEEE workshop on VLSI Signal Process-
ing, 1995, pp. 352-361
[50] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Synthesis of Multiplier-Iess

FIR Filters with Minimum Number of Additions", IEEE International Conference
on Computer Aided Design, ICCAD'95, pp. 668-671
[51] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Low Power ReaIization ofFIR

Filters using Multirate Architectures", International Conference on VLSI Design,
VLSI Design'96, pp. 370-375
[52] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Optimized Code Genera-

tion of Multiplication-free Linear Transforms", ACM/IEEE Design Automation
Conference, DAC'96, pp. 41-46
[53] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Area-Delay Tradeoff in Dis-

tributed Arithmetic based Implementation of FIR Filters", International Confer-
ence on VLSI Design, VLSI Design'97, pp. 124-129
[54] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Extensions to Programmable

DSP Architectures for Reduced Power Dissipation", International Conference on
VLSI Design, VLSI Design'98, pp. 37-42
[55] Mahesh Mehendale, Somdipta Basu Roy, S. D. Sherlekar, G. Venkatesh, "Coef-

ficient Transformations for Area-Efficient Implementation of Multiplier-Iess FIR
Filters", International Conference on VLSI Design, VLSI Design'98, pp. 110-115
REFERENCES 203
[56] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Algorithmic and Architec-

tural Transformations for Low Power Realization of FIR Filters, International
Conference on VLSI Design, VLSI Design'98, pp. 12-17
[57] Mahesh Mehendale, Amit Sinha, S. D. Sherlekar, "Low Power Realization of FIR
Filters Implemented Using Distributed Arithmetic", Asia and South Pacific Design
Automation Conference, ASP-DAC'98, pp. 151-156
[58] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Low Power Realization of

FIR Filters on Programmable DSPs", IEEE Transactions on VLSI Systems, Special
Issue on Low Power Design, December 1998, pp. 546-553
[59] Mahesh Mehendale, S. D. Sherlekar, "Low Power Code Generation of

Multiplication-free Linear Transforms", 12th International Conference on VLSI
Design, January 1999, pp. 42-47
[60] R. Mehra, D. B. Lidsky, A Abnous, P. E. Landman and J. M. Rabaey, "Algorithms

and Architectural Level Methodologies for Low Power", Chapter 11 in Low Power
Design Methodologies, Jan Rabaey and Massoud Pedram, Eds., Kluwer Academic
Publishers, September 1995
[61] Huzefa Mehta, R. M. Owens, M. J. Irwin, "Some Issues in Gray Code Addressing",
GLS-VLSI'96, 6th Great Lakes Symposium on VLSI, 1996, pp. 178-181
[62] G.De Micheli, Synthesis and optimization of digital circuits, McGraw-HiII, 1994.
[63] Biswadip Mitra, Shantanu Jha and P. Pal Chaudhuri, "A Simulated Annealing
Based State Assignment Approach for Control Synthesis", 4th CSIIIEEE Interna-
tional Symposium on VLSI Design, 1991, pp. 45-50
[64] Biswadip Mitra, P. R. Panda and P. Pal Chaudhuri, "Estimating the Complexity of
Synthesized Designs from FSM Specifications", 5th International Conference on
VLSI Design, 1992, pp. 175-180
[65] J. Monteiro, S. Devadas, A. Ghosh, "Sequential logic optimization for low

power using input-disabling precomputation architectures", IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, Volume: 17 3 ,
March 1998, pp. 279-284
[66] Z.J. Mou and P. Duhamel, "Short-Length FIR Filters and Their Use in Fast Nonre-
cursive Filtering". IEEE Transactions on Signal Processing, June 1991, pp. 1322-
1332
[67] L. Nachtergaele, F. Catthoor, B. Kapoor, S. Janssens, "Low Power Storage Explo-

ration for H.263 Video Decoder", IEEE International Workshop on VLSI Signal
Processing, 1996, pp. 115-124
[68] Farid Najm, 'Transition Density: A New Measure of Activity in Digital Circuits",
IEEE Transactions on CAD, Feb 1993, pp. 310-323
[69] The National Technology Roadmap for Semiconductors, SIA - Semiconductor

Industry Association, 1994
[70] B. New, "A Distributed Arithmetic Approach to Designing Scalable DSP Chips",
Electronic Design News, August 17, 1995
[71] Ralf Niemann, Peter Marwedel, "Hardware/Software Partitioning using Integer

Programming", Eurpean Design and Test Conference, 1996, pp. 473-479
[72] W. J. Oh and Y. H. Lee, "Cascade/Parallel Form FIR Filters with Powers-of- Two
Coefficients", IEEE International Symposium on Circuits and Systems, ISCAS-
1994, Vol. H, pp. 545-548
[73] A.v. Oppenheim and R.W. Schaffer, Discrete Time Signal Processing, Prentice
Hall, 1989
[74J Preeti Panda and Nikil Dutt, "Reducing Address Bus Transitions for Low Power
Memory Mapping", European Design and Test Conference, 1996, pp. 63-68
[75] Anand Pande, Sunil Kashide, Hardware Software Codesign of DSP Algorithms,
ME Thesis, Centre for Electronics Design and Technology, Indian Institute of
Science, Bangalore, India, January 2000
[76] K.K. Parhi, "Algorithms and Architectures for High-Speed and Low-Power Digital
Signal Processing", 4th International Conference on Advances in Communications
and Control, 1993, pp. 259-270
[77] D.N. Pearson and K.K. Parhi, "Low-PowerFiR Digital Filter Architectures", IEEE
International Symposium on Circuits and Systems, Vol I, 1995, pp. 231-234
[78] M. Potkonjak, Mani Srivastava and Anantha Chandrakasan, "Multiple Constant

Multiplications: Efficient and Versatile Framework and Algorithms for Exploring
Common Subexpression Elimination", IEEE Transactions on Computer Aided
Design, February 1996, pp. 151-165
[79] Ptolemy Project Homepage, Departmcnt of Electrical En-

gineering and Computer Sciences, University of California at Berkeley, URL -
http://ptolemy.eecs.berkeley.edu
[80] Wu Qing, M. Pedram, Wu Xunwci, "Clock-gating and its application to low power
design of sequential circuits", IEEE Transactions on Circuits and Systems I: Fun-
damental Theory and Applications, Volume: 47 3, March 2000, pp. 415-420
[81] Anand Raghunathan and Niraj Jha, "Behavioral Synthesis for Low Power", Pro-
ceedings of International Conference on Computer Design, ICCD-1994, pp. 318-
322
[82] K. R. Rao and P. Yip, Discrete Cosine Trans:form: Algorithms, Advantages and
Applications, Academic Press, 1990
[83] H. Samueli, "An Improved Search Algorithm for the Design of Multiplierless
FIR Filters with Powers-of-Two Coefficients", IEEE Transactions on Circuits and
Systems, July 1989, pp. 1044-1047
[84] N. Sankarayya and K. Roy, "Algorithms for Low Power FIR Filter Realization
Using Differential Coefficients", International Conference on VLSI Design, 1997,
pp. 174-178
REFERENCES 205
[85] H. Schroder, "High Word-Rate Digital Filters with Programmablc Table Look-
Up", IEEE Transactions on Circuits and Systems, May 1977, pp. 277-279
[86] Amit Sinha, Mahesh Mehendale, "Improving Area Efficicncy ofFiR Filters Imple-
mented Using Distributed Arithmetic", International Conference on VLSI Design,
VLSI Design'98, pp. 104-109
[87] M. A. Soderstrand and K. AI-Marayati, "VLSI Implementation of Very-High-

Order FIR Filters", IEEE International Symposium on Circuits and Systems, 1995,
pp. 1436-1439
[88] M. R. Stan and W. P. Burleson, "Bus Invert Coding for Low Power 1/0", IEEE
Transactions on VLSI Systems, March 1995, pp. 49-58
[89] Ching-Long Su, Chi-Ying Tsui and Alvin M. Despain, "Saving Power in the Con-
trol Path of Embedded Proeessors", IEEE Design and Test of Computers, Winter
1994, pp. 24-30
[90] Ashok Sudarsanam, Sharad Malik, "Memory Bank and Register AlIoeation in
Software Synthesis for ASIPs", IEEE International Conferenee on CAD, ICCAD-
1995, pp. 388-392
[91] Earl Swartzlander Jr., VLSI Signal Processing Systems, Kluwer Academie Pub-
lishers, 1985
[92] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer
Technology, Me-Graw Hili, 1967
[93] N. Tan, S. Eriksson and L. Wanhammar, "A Power-Saving Teehnique for Bit-Serial
DSP ASICs", ISCAS 94, Vol. IV, pp. 51-54
[94] Y. Tiwari, S. Malik, P. Ashar, "Guarded evaluation: pushing power management

to logie synthesis/design", IEEE Transaetions 0 Computer-Aided Design of Inte-
grated Cireuits and Systems, Volume: 17 10, Oel. 1998 , pp. 1051- 1060
[95] TMS320C2x User's Guide, Texas Instruments, 1993
[99] TMS470R I x Code Generation Tools Guide, Texas Instruments, 1997
[100] TSC4000ULV 0.35pm CMOS Standard Cell, Maero Library Summary, Appliea-
tion Specifie Integrated Cireuits, Texas Instruments, 1996
[101] Y. Tsividis and P. Antognetti, editors, Design of MOS VLSI Circuits.f(Jr Te/ecom-
munications, Prentiee-Hall, 1985
[102] c.- Y.Wang and K. Roy, "Control Unit Synthesis Targeting Low-Power Proces-
sors", IEEE International Conferenee on Computer Design, Oetoher 1995, pp.
454-459
[103] Ching-Yi Wang and K. K. Parhi, "High-Level DSP Synthesis Using Concurrent
Transformations, Scheduling and Allocation", IEEE Transactions on Computer-
Aided Design 01' Integrated Circuits and Systems, March 1995, pp. 274-295
[104] Stanley White, "Applications of Distributed Arithmetic to Digital Signal Process-

ing: A Tutorial Review", IEEE ASSP Magazine, July 1989, pp. 4-19
[105] Q. Zhao and Y. Tadokoro, "A Simple Design of F1R Filters with Powers-01'-Two
Coefficients", IEEE Transactions of Circuits and Systems, May 1988, pp. 566-570
[106] S. Zohar, "Negative Radix Conversion", IEEE Transactions on Computer, Vol

C-19, March 1970, pp. 222-226
[107] S. Zohar, "A VLSI Implementation of a Correlator/Digital Filter Based on Dis-

tributed Arithmetic", IEEE Transactions on Acoustics, Speech and Signal Pro-
cessing, Vol. 37, No. I, January 1989, pp. 156-160
Topic Index
Accumulator based Architecture, 153 Low Power Code Generation, 148, 168
Adder Input Bit Swapping, 31 MAC (Multiply-Accumulate) Instruction, 12
Array Multiplier, 14 Memory Architectures for Low Power, 22
Binary to Nega-Binary Conversion, 10 I Memory Partitioning for Low Power, 23
Bitwise Commutativity of ADD Operation, 31 Memory Prefetch ButTer, 23
Block FlR Filters, 56 Modulo MAC, 173
Bus Bit Reordering, 24 Multiple Constant Multiplication (MCM), 113, 120
Bus Coding, 17 Multiplication-free Linear Transforrn, 141
Bus Invert Coding, 20 Multirate Architectures, 37,63,81
CSD Representation, 129 Nega-binary Coding, 95
Characteristics of DSP Algorithms, 1I Optimization using 0-1 Programming, 50
Circular Buffer, 36 Parallel Processing, 59, 174
Classification of Transformations, 187 Pixel Window Transform, 144
Code Generation of I-D Transform, 144 Power Analysis of Multirate Architectures, 68
Coefficient Optimization, 43, 137, 180 Power Dissipation due to Cross Coupling, 13
Coefficient Ordering, 28, 175 Power Dissipation in CMOS, 12
Coefficient Partitioning, 86 Power Dissipation in a Bus, 13
Coefficient Scaling, 42, 179 Power Dissipation in a Multiplier, 13
Color Space Conversion, 2, 113 Pre-Filter Structures, 138
Common Subexperssion Elimination, 116, 122, 147 Precision Sensitive Binding, 139
Consecutive-Variables Graph, 150 Precision Sensitive Register Allocation, 138
DAG Optimizing Transformations, 159 Precision Sensitive Scheduling, 140
DAG Transform - Fanout Reduction, 160 Prewitt Window Transform, 144
DAG Transform - Merging, 161 RNS Moduli Selection, 183
DAG Transform - Serializing a Butterfly, 159 Register Assignment, 149
DAG Transform - Tree to Chain Conversion, 159 Register-Conflict Graph, 150
DCT, 2, 113, 172, 189 Register-rich Architecture, 143
DFG Transformations, 56 Residue Encoding, 174, 177
DSC Image Pipeline, I Residue Number System, 171
DSP Architecture, II Retiming,59
Decoded Instruction Buffer, 21 Selective Coefficient Negation, 27
Digital Still Camera, I Shiftless DA Implementation, 107
Distributed Arithmetic, 76 Signal Flow Graph Transformations, 130
Slant Transform, 167
Dynamic/Switching Power in CMOS, 12
SoC Design Methodology, 4
Energy vs Peak Power Tradeoff, 61
Sobel Window Transform, 152
FFT,44, 172, 189
Solution Space for DSP Implementation, 6
FIR Filter, 35
Spatial High Pass Filter, 152
Gaussian Data Distribution, 104
Spatial Low Pass Filter, 152
Generic Techniques for Low Power, 26
Spill-free DAG, 162
Graph Coloring Problem, 151
Stochastic Evolution, 151
Gray Coded Addressing, 17, 108
TO coding, 18
Haar Transforrn, 147, 165
TMS320C2x/C5x, 39, 153
Hardware-Software Partitioning, 6
TMS320C54x - FIRS Instruction, 34
High Level Synthesis of Multiprecision DFGs, 138
Transformation Framework, 51, 191
High Precision Signal Processing, 183
Transition Density, 14
Instruction ButTering, 21
Transposed FIR Filter, 41, 180
Instruction Scheduling, 148
Traveling Salesman Problem, 29
LUT Redundancy Elimination, 176
Truth Table Area Estimation, 88
Linear Phase FIR Filters, 65
Two Dimensional Linear Transform, 113
Loop Unrolling, 59 Uni-sign Representation, 129
Walsh-Hadamard Transform, 158, 164
207
About the Authors
Mahesh Mehendale is a Distinguished Member of Technical Staff and the

Director, DSP Design at Texas Instruments (India) Ltd. He leads the DSP design
group which recently designed TI's next generation TMS320C27x processor.
This design received EDN Asia magazine's component of the year award for
1999. Mahesh received his B.Tech (EE), the M. Tech (CS&E) and the Ph.D. de-
grees from IIT Bombay in 1984, 1986 and 1999 respectively. He has published
more than 35 papers in refereed international conferences and has received two
best paper awards at EDIF World 1991 and the VLSI Design 1996 conferences.
He was a co-author of the paper that received the best student paper award at
the VLSI Design 1999 conference. Mahesh has co-presented fuH day and half
day tutorials in the areas of FPGAs, low power DSP, VLSI signal processing
and programmable DSP verification at many national and international confer-
ences. He has served as a member of the technical program committee for many
conferences and also as the Tutorial Chair for the VLSI Design 1996 conference
and the Design Contest Chair for the VLSI Design 2000 conference. He is the
program co-chair for the joint VLSI Design-ASPDAC 2002 conference. Ma-
hesh holds five US patents with one more application pending. He is a senior
member of IEEE. He can be reached at m-mehendale@ti.com .
Sunil D. Sherlekar is currently the Vice-President of R&D at Sasken Com-

munication Technologies Ltd. (formerly known as Silicon Automation Systems
Ltd.), a company engaged in developing leading edge solutions for the telecom
access market. Prior to joining this company, he was on the facuIty of Computer
Science and Engg. at IIT Bombay for about 12 years. He has published several
papers in refereed conferences and journals and has supervised several doctoral
and masters' students. He has served as a member and chair for the techni-
cal program committee of several conferences and also as the General Chair
for the Asian Test Symposium 1995 and Asia-Pacific Conference on HDLs
(APCHDL) 1996. He is the General Chair for the joint VLSI Design-ASPDAC
2002 conference. He is a member of the editorial board of JETTA and a mem-
ber of the Steering Committee of ASPDAC. For APCHDL 1994, he gave the
keynote speech at Toyohashi in Japan. In 1996, his paper with Mahesh Mehen-
dale received the best paper award at the VLSI Design Conference. In 1997,
he was awarded by IEEE for his contribution to the Asian Test Subcommittee
activities. He can be reached at sds@sasken.com .
209

VLSI Synthesis of DSP Kernels - Algorithmic and Architectural Transformations

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

VLSI Synthesis of DSP Kernels - Algorithmic and Architectural Transformations

Uploaded by

Copyright:

Available Formats

Algorithmic and Architectural Transformations

Springer Science+Business Media, LLC

ISBN 978-1-4419-4904-2 ISBN 978-1-4757-3355-6 (eBook)

Printed on acid-free paper

All Rights Reserved

2.4 Techniques for Low Power Realization of FIR Filters 35

4. DISTRIBUTED ARITHMETIC BASED IMPLEMENTATION 75

4.2.2.1 CF2: Estimating Area from the Actual Truth-Table 89

5.4.1 Precision Sensitive Register Allocation 138

8. A FRAMEWORK FOR ALGORITHMIC AND ARCHITECTURAL

1.1 Digital Still Camera System 2

2.23 Normalized Power Dissipation as a Function ofNumber

4.8 Two Bank Implementation - Simple Coefficient Split 86

5.11 Precision Sensitive Register Allocation 139

2.1 Adjacent Signal Transitions in Opposite Direction as a

4.7 Best Nega-binary Schemes for Gaussian Data Distribu-

Technology is a driving force in society. At times it seems to be driving us

useful. In the 1970s, Digital Signal Processing became a military advantage

gorithm and of the implementation style, to achieve tradeoffs in the area-delay-

M. Mehendale et al., VLSI Synthesis of DSP Kernels

Driver Correlated Automatie AID

Figure 1.1. Digital Still Camera System

• Edge enhancement: CFA interpolation introduces low-pass filtering in the

• Compression: To reduce memory requirement, images are compressed typi-

~------------------I Auto exposure

Figure 1.2. DSC Image Pipeline

The common characteristics of each of these blocks is that they require

1.2. The Design Process: Constraints and Alternatives

Figure 1.3. Hardware-Software Codesign Methodology far a System-on-a-chip

The perfonnance requirements of a system are driven by its data process-

• Shorter design cycle time for faster time-to-market.

• F1exibility of making changes to cater to changes in an evolving standard.

• Field upgradability for longer time to obsolescence.

• Re-use and customizability for amortization of design cost over several

Low Medium High

<11---- Degree of Programmability

Figure J.4. Solution Space for Weighted-Sum Computation

The implementation style marked 'X' (hardware multiplier with no programma-

1.3. Organization of the Book

Chapter3 presents implementations using hardware multiplier(s) and adder(s).

1.4. For the Reader

• Description and analysis of several algorithmic and architectural transfor-

• Automated and semi-automated techniques for applying these transforma-

PROGRAMMABLE DSP BASED

A programmable DSP is a processor customized to implement digital signal

• Repetitive: DSP algorithms are repetitive both at micro-Ievel (e.g. Multiply-

This chapter focuses on low power realization of DSP algorithms on a pro-

M. Mehendale et al., VLSI Synthesis of DSP Kernels

Program Datu Read

Figure 2.1. Generic DSP Architecture

employed in most of the programmable DSPs [95, 96]. During weighted-sum

Pdynamic = Cswitch. V 2 .1 (2.1 )

where V is the supply voltage, 1 is the operating frequency, Cswitch is the

2.1. Power Dissipation - Sources and Measures

2.1.2. Measures of Power Dissipation in Busses

2.1.3. Measures of Power Dissipation in the Multiplier

Figure 2.2. 4x4 Array Multiplier

where Bd(Y, Xi), is the Boolean difference of Y w.r.t. Xi.

2.2. Low Power Realization of DSP Algorithms

2.2.1. Allocation of Program, Coefficient and Data Memory

Figure 2.5. Address Bus Power Dissipation as a Function of Start Address

2.2.2. Bus Coding