Kouretas 2013
Abstract—This paper presents techniques for low-power addition/subtraction in the Logarithmic Number System (LNS) and
quantifies their impact on digital filter VLSI implementation. The impact of partitioning the look-up tables (LUT) required for LNS
addition/subtraction on complexity, performance, and power dissipation of the corresponding circuits is quantified. Two design
parameters are exploited to minimize complexity, namely the LNS base and the organization of the LNS word. A roundoff noise
model is used to demonstrate the impact of base and wordlength on the signal-to-noise ratio (SNR) of the output of finite impulse
response (FIR) filters. In addition, techniques for the low-power implementation of an LNS multiply-accumulate (MAC) unit are
investigated. Furthermore, it is shown that the proposed techniques can be extended to cotransformation-based circuits that
employ interpolators. The results are demonstrated by evaluating the power dissipation, complexity and performance of several
FIR filter configurations comprising one, two or four MAC units. Simulations of placed and routed VLSI LNS-based digital filters
using a 90nm 1.0V CMOS standard-cell library reveal that significant power dissipation savings are possible by using optimized
LNS circuits at no performance penalty, when compared to linear fixed-point two’s-complement equivalents.
1 INTRODUCTION
system by using look-up tables [11]. Very recently, Ismail and Coleman present a co-transformation procedure and an improved interpolation method that reduce the size of look-up tables to an extent that allows their easy synthesis in logic [12]. Fu et al. deal with LNS arithmetic optimizations on FPGAs [13]. Arnold and Collange propose Complex LNS (CLNS) as a generalization of LNS, which represents complex values in log-polar form [14].
For several practical applications, the benefits of LNS are found to be more important than its inherent disadvantages. In particular, several authors have shown that LNS reduces power dissipation in signal-processing-related applications, ranging from hearing-aid devices [15] and subband coding [16], to video processing [17] and error control [18]. Moreover, logarithmic techniques have been employed in turbo code decoding for wireless communications applications. In particular, logarithmic representation has proved to be suited for the implementation of the symbol-by-symbol Log-MAP (Logarithmic Maximum A Posteriori) algorithm used for iterative decoding [19, 20]. Peng and Chen have adopted LNS for the implementation of an FFT-based Log-Sum-Product-decoding-Algorithm (Log-SPA) used in decoding of nonbinary Low-Density Parity Check (LDPC) codes [21].
The properties of logarithmic representation that render it efficient for reducing power dissipation have been studied [22–25] and it has been demonstrated that the proper choice of the parameters of the representation can reduce the switching activity, while guaranteeing the quality of the output signal. In this context, the quality of the output is evaluated in terms of signal-to-noise ratio (SNR). In particular, the impact of the selection of the base b of the logarithm has been investigated as a means to explore trade-offs between precision and dynamic range given a particular word length. Paliouras and Stouraitis [22, 23] address the low-power LNS properties from a representational viewpoint and do not focus on power dissipation estimation data obtained by circuit simulations. The low-power characteristics of LNS addition/subtraction and multiplication have been quantitatively studied and compared to equivalent linear binary two’s-complement fixed-point operations [24], where it has been demonstrated that there are practical cases in which an appropriately optimized LNS representation can replace a linear representation of longer word length, without imposing any degradation on the signal quality.
LNS low-power design framework, the area-time-power design space of a low-pass FIR filter is explored for several configurations of MAC units. A similar study has recently been performed by Galal and Horowitz in the different context of floating-point arithmetic [26]. The proposed study focuses on the use of partitioning as a technique to limit the exponential growth of the size of LUTs with the wordlength. The technique is simple and leads to fast circuits. Departing from direct look-up table organization, a variety of LNS architectures for addition, subtraction, and multi-operand operations have been proposed in the literature, employing interpolation and linear or polynomial approximation, aiming at reducing the memory requirements, particularly for larger word lengths, such as 32 bits [27, 28]. These ideas have been combined with mathematical decompositions and transformations of the basic operations, exploiting the particular characteristics of the functions to further simplify approximation [12, 27–31].
Beyond the representational properties of LNS that have an impact on switching activity, LNS arithmetic units have structural characteristics that can be exploited to reduce the power dissipated. In particular, they comprise
• mutually exclusive sub-units, which can be used selectively, and
• imbalanced delay paths.
Therefore simple low-power design techniques are found to suit an LNS adder/subtractor organization very well; the impact is quantified for the case of lookup-based architectures, but other LNS architectures may benefit as well, in terms of reducing power dissipation. Extension to other architectures is demonstrated using an interpolation-based subtractor as an illustrative example. Partitioned LUT circuits provide high speed; more sophisticated techniques can be used to reduce size [12, 13, 28–30, 32, 33].
In summary, the contributions of this paper are:
• a low-power design framework for LNS systems,
• the quantification of power dissipation reduction and performance improvement made possible by using LNS, compared to equivalent binary implementations in a contemporary 90nm technology,
• the design space exploration using the number of LUTs for addition/subtraction as a parameter, for the case of using combinatorial logic for LUT implementation,
• the extension of SNR models in LNS for the case of
Finally, conclusions are discussed in Section 6.
2 LNS BASICS
The basic idea in LNS is to use logarithms to represent data. Since the logarithm of a negative number is not real, in order to represent signed numbers in LNS, the sign information is stored as a separate bit $s_X$, and used in combination with the logarithm of the magnitude of the number. Furthermore, since the logarithm of zero is not a finite number, an additional single-bit flag $z_X$ is used to denote that a number is zero. Summarizing, $X$ denotes the original number, $x$ denotes the logarithm of the absolute value $|X|$, and $X_{LNS}$ is a triplet containing the sign bit, the zero bit and $x$. Formally in LNS, a number $X$ is represented as the triplet
$X_{LNS} = (z_X, s_X, x)$, (1)
where $z_X$ is asserted in the case that $X$ is zero, $s_X$ is the sign of $X$ and $x = \log_b(|X|)$, if $X$ is not zero, with $b$ being the base of the logarithm, also called base of the representation. The choice of $b$ plays a crucial role in the representational capabilities of the triplet in (1), as well as the computational complexity of the processing and forward and inverse conversion circuitry.
Due to the basic properties of the logarithm, the multiplication of $X_{LNS}$ and $Y_{LNS}$ is reduced to the computation of the triplet $Z_{LNS}$,
$Z_{LNS} = (z_Z, s_Z, z)$, (2)
where $z_Z = z_X \lor z_Y$, $s_Z = s_X \oplus s_Y$, and $z = x + y$. Similarly the case of division reduces to binary subtraction. The derivation of the logarithm $a$ of the sum $A$ of two triplets is more involved, as it relies on the computation of
$a = \max\{x, y\} + \log_b\left(1 + b^{-|x-y|}\right)$ (3)
$\;\; = \max\{x, y\} + \phi_a(d)$, (4)
where $\phi_a(d) = \log_b(1 + b^{-d})$ and
$d = |x - y|$. (5)
Similarly the derivation of the difference of two numbers requires the computation of
$c = \max\{x, y\} + \log_b\left(1 - b^{-|x-y|}\right)$ (6)
$\;\; = \max\{x, y\} + \phi_s(d)$, (7)
where $\phi_s(d) = \log_b(1 - b^{-d})$. Assume that a two’s-complement (TC) word is used to represent the logarithm $x$, composed of a $k$-bit integral part and an $l$-bit fractional part. The range $D_{LNS}$ spanned by $x$ is given by
$D_{LNS} = \left[-b^{2^{k-1}-2^{-l}}, -b^{2^{-l}}\right] \cup \{0\} \cup \left[b^{2^{-l}}, b^{2^{k-1}-2^{-l}}\right]$, (8)
to be compared with the range of $(-2^{i-1}, 2^{i-1} - 2^{-f})$ of a linear TC representation of $i$ integral bits and $f$ fractional bits. In general, LNS offers a superior range over the linear two’s-complement representation. This is achieved using comparable word lengths, by departing from the strategy of equispaced representable values and thus resorting to a scheme that resembles floating-point arithmetic. The basic organization of an LNS adder/subtractor is shown in Fig. 2. The parallel subtractions
$s_1 = x - y$ (9)
$s_2 = y - x$ (10)
are implemented, followed by a multiplexer, which computes $d$ according to the rule
$d = |x - y| = \begin{cases} s_1, & s_1 > 0 \\ s_2, & \text{otherwise.} \end{cases}$ (11)
The choice exploits the sign of either (9) or (10) as a select signal for the multiplexer. The same signal is used to select the maximum of $x$ and $y$, required for the computation of (4) and (7).
The complexity of LNS circuitry arises from the fact that the values of functions $\phi_a$ and $\phi_s$ should be computed by the LNS addition/subtraction circuit hardware for all required values of $d$. There are two main approaches to implement the evaluation of functions, namely the hardware implementation of an approximation algorithm or the off-line precomputation and storage of all required values in a look-up table [34]. The former approach is generally adopted for high-precision applications, while the latter approach is generally preferable for smaller word lengths, i.e., in relatively low-precision applications where the size of the required lookup tables is moderate. Both approaches have been extensively studied in the context of elementary function approximation. Let $x$ denote the base-$b$ logarithm of $X$ and $x_2$ denote the base-2 logarithm of $X$. Since $x = \log_b |X| = \log_b 2^{x_2} = x_2 (\log_b 2)$, the conversion between a base-$b$ LNS and a base-2 LNS requires scaling by a constant factor. Several authors have studied hardware implementations of converters to/from base-2 LNS [11, 35, 36]. In this paper, although conversion is neglected, the conclusions about power consumption are valid for the complete application. To better clarify this, assume an FIR filter of order $N$, requiring about $N$ multiply-accumulate operations for each input conversion and each output conversion. If $E_{in}$ is the average energy for one input conversion, $E_{out}$ is the average energy for one output conversion, and $E_y$ is the average energy for one multiply-accumulate, the total energy (after initialization) for the FIR filter to produce each result is $E_{FIR} = E_{in} + E_{out} + N \cdot E_y$. For sufficiently large values of $N$, the percentage of energy consumed in the multiply-add units may approach 100% of the total, as $\lim_{N \to \infty} \frac{N \cdot E_y}{E_{FIR}} = 1.0$.
3 LOW-POWER DESIGN OF LNS CIRCUITS
In this Section low-power LNS architectures for addition and subtraction are presented. The memory structure is organized as a collection of LUTs and is the most complex part of the LNS adder/subtractor. Several designs were investigated, distinguished by two choices, i.e., first, the choice of using either latches or D flip-flops (DFFs) to freeze the addresses of inactive sub-LUTs, and, second, the choice to select the active sub-LUT either based on the most significant bits (MSB) or on the least significant bits (LSB) of $d$ in (11).
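The LNS operations of Section 2 can be illustrated with a short software sketch: multiplication as in (2), and addition/subtraction as in (3)–(7) with d selected by the sign of s1 as in (11). This is a floating-point behavioural model only, not the hardware described in the paper; the names LNS, to_lns, from_lns, lns_mul and lns_add are hypothetical, and φa and φs are evaluated directly instead of being read from look-up tables.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LNS:
    """Triplet (z, s, x) of (1); names are illustrative assumptions."""
    z: bool   # zero flag z_X
    s: bool   # sign bit s_X (True means negative)
    x: float  # log_b |X|, ignored when z is set


def to_lns(value, b=2.0):
    """Forward conversion of a real number to the triplet of (1)."""
    if value == 0:
        return LNS(True, False, 0.0)
    return LNS(False, value < 0, math.log(abs(value), b))


def from_lns(v, b=2.0):
    """Inverse conversion back to a real number."""
    if v.z:
        return 0.0
    magnitude = b ** v.x
    return -magnitude if v.s else magnitude


def lns_mul(u, v):
    """Multiplication as in (2): OR of zero flags, XOR of signs, x + y."""
    return LNS(u.z or v.z, u.s != v.s, u.x + v.x)


def lns_add(u, v, b=2.0):
    """Signed addition using (3)-(7); d is chosen as in (9)-(11)."""
    if u.z:
        return v
    if v.z:
        return u
    s1 = u.x - v.x              # (9)
    s2 = v.x - u.x              # (10)
    d = s1 if s1 > 0 else s2    # (11): multiplexer driven by the sign of s1
    big = u if s1 > 0 else v    # same select signal picks max{x, y}
    if u.s == v.s:
        # Equal signs: magnitudes add, phi_a(d) = log_b(1 + b^-d)
        return LNS(False, big.s, big.x + math.log(1.0 + b ** (-d), b))
    if d == 0:
        # Opposite signs and equal magnitudes: the result is zero
        return LNS(True, False, 0.0)
    # Opposite signs: magnitudes subtract, phi_s(d) = log_b(1 - b^-d)
    return LNS(False, big.s, big.x + math.log(1.0 - b ** (-d), b))
```

For instance, from_lns(lns_add(to_lns(3.0), to_lns(-5.0))) evaluates to approximately -2, up to floating-point rounding. In a hardware realization the two math.log calls in lns_add would be replaced by the φa and φs tables addressed by d.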
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON COMPUTERS
tative power, area and delay results are estimated through simulating placed and routed VLSI circuits.
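The MSB-based selection of the active sub-LUT described in this Section can be modelled behaviourally as follows. The sketch is an assumption-laden illustration, not the authors' circuit: the class name PartitionedLUT, its parameters (addr_bits, frac_bits, sel_bits) and the choice of tabulating φa are all hypothetical. Each sub-LUT keeps its own address register, and only the register of the selected sub-LUT is updated on an access, which mimics freezing the addresses of inactive sub-LUTs.

```python
import math


def phi_a(d, b=2.0):
    """phi_a(d) = log_b(1 + b^-d), the addition function of (4)."""
    return math.log(1.0 + b ** (-d), b)


class PartitionedLUT:
    """phi_a table split into 2**sel_bits sub-LUTs chosen by the MSBs of d.

    Hypothetical model: d is an unsigned fixed-point address of
    addr_bits bits with frac_bits fractional bits.
    """

    def __init__(self, addr_bits=8, frac_bits=4, sel_bits=1, b=2.0):
        self.low_bits = addr_bits - sel_bits
        step = 2.0 ** (-frac_bits)
        n_sub = 1 << sel_bits
        size = 1 << self.low_bits
        # One table per sub-LUT; entry a of sub-LUT s holds phi_a at the
        # full address (s concatenated with a).
        self.tables = [
            [phi_a(((s << self.low_bits) | a) * step, b) for a in range(size)]
            for s in range(n_sub)
        ]
        self.latched = [0] * n_sub   # frozen address of each sub-LUT
        self.updates = [0] * n_sub   # how often each sub-LUT was activated

    def lookup(self, addr):
        sel = addr >> self.low_bits               # MSBs select the sub-LUT
        low = addr & ((1 << self.low_bits) - 1)   # LSBs address inside it
        self.latched[sel] = low                   # only this register changes
        self.updates[sel] += 1
        return self.tables[sel][self.latched[sel]]
```

The partitioned table returns the same values as a single monolithic table; the updates counters give a rough indication of how activity is distributed among the sub-LUTs for a given sequence of d values.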
[Figure: (a) Area×Delay for the LNS adders in case of 12-bit LNS wordlength. Point A marks a design of high area-delay efficiency. (b) Power×Delay for the LNS adders in case of 12-bit LNS wordlength. Point A marks a design of good power-delay efficiency. Horizontal axis: Delay (ns). Curves: Latch 1-bit LSB, Latch 1-bit MSB, DFF 1-bit LSB, DFF 1-bit MSB, 1-bit LSB, 1-bit MSB.]
Fig. 6. Area×Delay and Power×Delay plots for the LNS adder in case of 12-bit LNS wordlength.
4 LNS MAC AND FIR IMPLEMENTATION ISSUES
The design of low-power low-complexity FIR filters has been studied by several researchers [40, 41]. This Section discusses the impact of the proposed LNS circuits on the implementation of FIR filters. Initially, the impact of roundoff onto SNR is briefly discussed and a procedure to determine the wordlength organization is detailed. Subsequently, an LNS MAC architecture is proposed. The performance of FIR filter structures is finally quantitatively studied, to demonstrate the benefits of employing the proposed LNS circuits over binary fixed-point implementations.
4.1 Word length determination
In order to compare the performance of LNS-based hardware to the widely-used TC-based hardware, it is necessary to define in a meaningful measurable way the concept of equivalence of behavior between the two systems. For the case of FIR filters, the SNR is used as such a measure. By optimizing LNS representation parameters with the objective to achieve a particular SNR, low-power operation can be achieved.
The output SNR of an LNS FIR filter has been both theoretically and experimentally studied in conjunction with the LNS word organization in [42] and [36]. Chandra provides an expression for the ratio of output error variance to output signal variance of a logarithmic FIR filter implementation due to roundoff [42].
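The dependence of output SNR on the base b and the fractional wordlength l can be explored empirically with a small sketch. It is a simplified assumption, not the roundoff model of [42]: only the representation error of inputs and coefficients is modelled (the logarithm is rounded to l fractional bits), while products and accumulation are kept in floating point. The function names lns_quantize, fir, snr_db and lns_fir_snr are hypothetical.

```python
import math
import random


def lns_quantize(value, b, l):
    """Round log_b|value| to l fractional bits and convert back."""
    if value == 0.0:
        return 0.0
    x = math.log(abs(value), b)
    xq = round(x * 2 ** l) / 2 ** l
    return math.copysign(b ** xq, value)


def fir(coeffs, xs):
    """Direct-form FIR output, with zero initial conditions."""
    n_taps = len(coeffs)
    return [
        sum(coeffs[i] * xs[n - i] for i in range(n_taps) if n - i >= 0)
        for n in range(len(xs))
    ]


def snr_db(reference, test):
    """Ratio of signal energy to error energy, in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - t) ** 2 for r, t in zip(reference, test))
    return 10.0 * math.log10(signal / error)


def lns_fir_snr(coeffs, xs, b, l):
    """Output SNR when inputs and coefficients are stored in LNS."""
    reference = fir(coeffs, xs)
    quantized = fir(
        [lns_quantize(c, b, l) for c in coeffs],
        [lns_quantize(x, b, l) for x in xs],
    )
    return snr_db(reference, quantized)
```

Sweeping b and l with such a model and comparing against a fixed-point reference of a given wordlength is one way to look for representations of comparable SNR; a bit-true model of the adder/subtractor would be needed to reproduce hardware results.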
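filter coefficients, the base b, and the fractional wordlength l, since $\tau = 2^{-l-1}$. It is noted that (34) does not take into
[Figure: block diagram with inputs Xi, coefficients Ci, delay units D and output Y.]
Fig. 8. Organization of single-MAC architecture.
[Figure: block diagram with inputs Xi, Xi+1, coefficients Ci, Ci+1, delay units D and output Y.]
Fig. 9. Organization of two-MAC architecture.
[Figure: block diagram with inputs Xi to Xi+3, coefficients Ci to Ci+3, delay units D and output Y.]
Fig. 10. Organization of four-MAC architecture.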
them operations, in a way that numerical results obtained by the hardware and the model are exactly the same; i.e., the models are bit-true.
The result of the procedure is the determination of the equivalent data representations for both fixed-point and the LNS-based systems. Subsequently, circuits that use the derived representations are simulated in order to perform quantitative comparisons in terms of area, power and delay.
4.2 Proposed LNS MAC architectures
The area, delay and power dissipation of an LNS-based digital filter implemented using a single-MAC unit are studied in comparison to binary counterparts.
Following the procedure suggested by the EDA tool vendor, we initially optimize for area and subsequently optimize for speed. Synthesis is run for several values of maximum delay constraint, thus implementing a design space exploration. Six possible implementations were studied that span the design space between low delay (e.g., area is relaxed) and low area (e.g., delay is relaxed).
Synthesized and placed-and-routed circuits have been derived by using Synopsys’ Design Compiler and IC Compiler. Placed and routed circuits are simulated to generate switching activity. Extracted parasitics and switching activity have been used as input to Synopsys’ PrimeTime in order to produce the power consumption. Power estimation includes both dynamic and leakage power.
The basic structure of the single-MAC unit is shown in Fig. 8. The two operator symbols in Fig. 8 denote a multiplier and an adder, respectively, while D denotes a delay unit, implemented as a register. The study is extended to the performance of a two-MAC and a four-MAC architecture, shown in Figs. 9 and 10, respectively.
An FIR filter is described by
$Y(n) = \sum_{i=0}^{N-1} C_i X(n - i)$, (36)
where $C_i$ are the filter coefficients, $X(n)$ is the input sequence and $Y(n)$ is the output sequence. In order to map the computation into a two-MAC architecture, computation (36) is partitioned into the computation of two terms, subsequently added, as follows
$Y(n) = \sum_{j=0}^{\frac{N}{2}-1} X(n - 2j) C_{2j} + \sum_{j=0}^{\frac{N}{2}-1} X(n - 2j - 1) C_{2j+1}$. (37)
Each term is allocated to a MAC of a two-MAC architecture. In general, for a P-MAC architecture, the computation is decomposed as
$Y(n) = \sum_{j=0}^{\frac{N}{P}-1} X(n - Pj) C_{Pj} + \ldots + \sum_{j=0}^{\frac{N}{P}-1} X(n - Pj - P + 1) C_{Pj+P-1}$ (38)
$\;\; = \sum_{p=0}^{P-1} \underbrace{\sum_{j=0}^{\frac{N}{P}-1} X(n - Pj - p) C_{Pj+p}}_{\text{allocated to } p\text{th MAC}}$. (39)
The pth MAC unit computes $S_p = \sum_{j=0}^{\frac{N}{P}-1} X(n - Pj - p) C_{Pj+p}$. Subsequently, the summation of all $S_p$ is computed. The LNS equivalent of a single-MAC architecture is depicted in Fig. 11, where the binary multiplier has been replaced by an adder, and the binary adder is mapped to an LNS adder/subtractor. The LNS adder/subtractor is augmented with saturation circuitry and exploits a zero flag to avoid unnecessary activation of LUT partitions and further reduce power dissipation. In the implementation of Fig. 11, it is evident that the paths to the inputs of the final adder are not balanced, thus leading to excessive switching activity at the adder following the memory structure. The amount of the switching activity depends on the logic depth of the LUT implementation. A solution to this problem is to retime the circuit so that the register located at the feedback path is replaced by registers placed at the inputs of the final
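The partitioning of (36) into P partial sums, as in (37)–(39), can be checked numerically with a short sketch. The function names fir_direct, fir_p_mac and x_at are hypothetical, the filter length N is assumed to be a multiple of P, and samples before the start of the input sequence are assumed to be zero.

```python
def x_at(xs, k):
    """Input sample X(k), taken as zero outside the recorded sequence."""
    return xs[k] if 0 <= k < len(xs) else 0


def fir_direct(coeffs, xs, n):
    """Y(n) computed directly from (36)."""
    return sum(coeffs[i] * x_at(xs, n - i) for i in range(len(coeffs)))


def fir_p_mac(coeffs, xs, n, num_macs):
    """Y(n) computed as the sum of the partial sums S_p of (39)."""
    n_taps = len(coeffs)
    assert n_taps % num_macs == 0, "N is assumed to be a multiple of P"
    partial_sums = []
    for p in range(num_macs):
        # Work allocated to the p-th MAC: taps P*j + p for j = 0 .. N/P - 1
        s_p = sum(
            x_at(xs, n - num_macs * j - p) * coeffs[num_macs * j + p]
            for j in range(n_taps // num_macs)
        )
        partial_sums.append(s_p)
    # Final adder tree combining the outputs of all MAC units
    return sum(partial_sums)
```

With integer data the two functions agree exactly for any P that divides N, which reflects that (38) and (39) only reorder the terms of (36).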
[Figure: block diagram with inputs xi and ci, output yi and clock clk.]
Fig. 11. LNS MAC unit.
[Figure: block diagram with inputs xi and ci, output yi and clock clk.]
Fig. 12. Retimed LNS MAC unit.
[Figure: panels (a) 1-MAC FIR implementation, (b) 2-MAC FIR implementation, (c) 4-MAC FIR implementation. Axes: Time/sample (ns) versus Power (μW). Curves: 2, 4, 6 and 10 sub-LUTs, Wallace-tree, CSA.]
Fig. 13. Power×Delay in case of 1-, 2-, and 4-MAC FIR filter implementations in case of 12-bit LNS and 13-bit binary wordlength, respectively.
particular for the 12- and 14-bit LNS FIR implementations the LNS bases used are 1.8 and 1.9, respectively. Binary structures of 13- and 15-bit wordlength, equivalent in terms of SNR, are compared to 12- and 14-bit LNS-based
the wordlength does not include the zero and values sign bit as described in (1).
As a test case, a 50th-order FIR low-pass filter is used, with a cut-off frequency of 0.3 rad/sec. A zero-mean uncorrelated Gaussian random sequence is used as stimulus.
In this subsection the impact of the proposed LNS MAC units on the implementation of FIR structures is detailed. The employed LNS MACs adopt the MSB-based architectures for the LUT partition and use DFFs for address latching. Area-delay and power-delay complexity of the
proposed LNS filter designs is quantified for practical word lengths. The choice of the number of sub-LUTs emerges as a major design choice, which defines the area, delay and power limits of the proposed circuits that can be achieved by synthesis tools. By studying the achieved performance for various word lengths and sub-LUT organizations, it
where $P_{average}$ denotes the average power consumed during the simulation, $N_{samples} = 1000$ denotes the number of samples used for the experiment and $P_y$ is the power depicted on the y axis.
Figs. 13 and 14 demonstrate that the Time/sample values
[Figure: panels (a) 1-MAC, (b) 2-MAC FIR implementation, (c) 4-MAC FIR implementation. Axes: Time/sample (ns) versus Power (μW). Curves: 4, 6, 12 and 24 sub-LUTs, Wallace-tree, CSA.]
Fig. 14. Power×Delay in case of 1-, 2-, and 4-MAC FIR filter implementations in case of 14-bit LNS and 15-bit binary wordlength, respectively.
[Figure: panels (a) 1-MAC, (b) 2-MAC FIR implementation, (c) 4-MAC FIR implementation. Axes: delay (ns) versus Area (μm2). Curves: 2, 4, 6 and 10 sub-LUTs, Wallace-tree, CSA.]
Fig. 15. Area×Delay in case of 1-, 2-, and 4-MAC FIR filter implementations in the cases of 12-bit LNS and 13-bit binary wordlength, respectively.
namely the 2-sub-LUT LNS, achieves 88% power savings when compared to both corresponding Wallace-tree and CSA structures. The 4-sub-LUT architecture is the next most power-efficient circuit with 82% power savings. In case of 2-MAC implementations, the 2-sub-LUT lowest-power circuit demonstrates 80% and 55% savings with respect to the Wallace-tree and CSA binary structures respectively, while the slowest one exhibits 63% and 59% savings when compared to the corresponding slowest binary implementations. For the 4-MAC implementations, in case of the most power-efficient 2-sub-LUT implementations the savings range from 51% for the slowest circuit up to 80% for the fastest one, compared to the corresponding Wallace-tree circuits.
Figs. 13 and 14 demonstrate that there is a trade-off among the wordlength, the number of sub-LUTs and the number of MACs. Hence Figs. 13 and 14 show that the best choices are the 2-sub-LUT and the 4-sub-LUT LNS, respectively. This is reasonable because as wordlength increases, gains achieved by partitioning the larger LUTs (Fig. 14) become larger than in the case of smaller LUTs (Fig. 13).
Furthermore when compared to equivalent binary structures, it is shown that for the wordlengths investigated the best LNS-based implementations exhibit lower power dissipation.
Area vs. Delay results are depicted in Figs. 15 and 16, for word lengths of 12 to 14 bits. The x axis represents the circuit latency. Several instances are synthesized under increasing values of delay constraint. As expected, for larger values of allowed delay, area is decreased.
In general, the area of the 2-MAC LNS implementation is roughly 3 times the area of the 1-MAC, and the 4-MAC is 7 times larger than the 1-MAC. This is expected, as it matches the number of LNS adders which account for most of the area. However, power scales linearly with the number of multipliers. This is because power dissipation is dominated by the multiply-add units, which in turn are dominated by the adder/subtractor. During normal operation, the final adder tree is inactive and its inputs remain latched, therefore practically no dynamic power is dissipated.
5 EXTENSION TO OTHER LNS SCHEMES
Low-power techniques have been quantitatively discussed here from the viewpoint of their application to LNS circuits. While the simple partitioned LUT based LNS adder/subtractor has been used here to illustrate the impact of these techniques, results are applicable to other approaches as well.
Specifically, most LNS adder/subtractor techniques employ separate sub-units at least for the computation of addition and subtraction, which can be activated selectively, thus resembling the discussed low-power partitioning and selective use of sub-LUTs. Furthermore, LNS adder/subtractors comprising paths of substantially different delay (imbalanced delay paths) may also benefit from the retimed MAC approach. Imbalanced delay paths occur due to the basic concept of implementing an approximation followed by some post-processing, which stems from the definition of the fundamental addition operation in (3) (similarly for subtraction). Such paths introduce glitches leading to excessive power dissipation. This problem can be addressed by appropriately retiming LNS multiply-accumulate units, closely resembling the techniques described here.
As LUTs are implemented by combinational logic, efficiency of the derived circuits is affected by the logic minimization procedures applied during synthesis.
LNS adders/subtractors comprise several different sub-systems and unbalanced paths, i.e., paths of unequal length. Coleman describes a technique that simplifies the structure of tables involved in logarithmic arithmetic. To demonstrate that the concepts discussed in this paper can be applied to co-transformation based arithmetic circuits, we have implemented a co-transformation based subtractor, modified to implement the selective activation of required-only sub-units. As described in [6], co-transformation is based on the tabulation of values of a function
$F(r) = \log_2(1 - 2^r)$,
with $r = j - i$, $j \le i$. The computation derives the intermediate result $r_2$, as
$r_2 = j - i + F(k_1) - F(r_1)$, (41)
where $k_1$ and $r_1$ are computed as follows
$r_1 = (((j - i) \;\mathrm{div}\; m_1) - 1) m_1 = j + k_1 - i$ (42)
$k_1 = i - j + r_1 = -((j - i) \bmod m_1) + m_1$. (43)
The result $r_2$ is approximated as $j - i$ in the case $r < -1$, while in the case $-1 \le r < -m_1$, $r_2$ is computed as in (41). In both cases, the final result is subsequently obtained through a sub-unit which performs interpolation. When $-m_1 < r < 0$, the final result is obtained as $k_2 = F(k_1)$. Details of the method are discussed in [6].
As a demonstrator, a 12-bit logarithmic-base-2 interpolation-based subtractor using cotransformation is studied. The use of the value of r to decide which of the sub-units should be activated per subtraction, i.e., the use of selective activation, is found to decrease power dissipation from 0.55mW to 0.45mW. Power dissipation is further decreased to 0.38mW by partitioning the largest of the tables used by the interpolator, using the DFF-based sub-unit selection. For the test case we assume $m_1 = 2^{-3}$, five fractional bits and an interpolation step size of $m = 2^{-1}$.
6 CONCLUSIONS
This paper quantitatively shows that the adoption of LNS can lead to very efficient circuits for digital filtering applications when appropriately selecting the logarithmic base and the wordlength, in a contemporary 90nm technology outperforming circuits based on two’s-complement arithmetic. An LNS-based system using the proposed adder/subtractors offers substantial power dissipation savings at no performance penalty. Partitioning of the LUTs is employed to create parts in the circuit that can be independently activated, thus reducing power dissipation. Power has been reduced by latching the inputs to the LUTs. Furthermore the gated
clock technique has been used to further reduce power [12] R. C. Ismail and J. N. Coleman, “ROM-less LNS,” IEEE Symposium
consumption performed to the lathed inputs due to the clock on Computer Arithmetic, pp. 43–51, 2011.
[13] H. Fu, O. Mencer, and W. Luk, “FPGA designs with optimized
signal. It has been shown that the choice of number of logarithmic arithmetic,” IEEE Transactions on Computers, vol. 59,
sub-LUTs is an important design parameter that can be no. 7, pp. 1000 –1006, Jul. 2010.
employed for exploration of the area, time, power design [14] M. Arnold and S. Collange, “A Real/Complex logarithmic number
system ALU,” IEEE Transactions on Computers, vol. 60, no. 2, pp.
space. Furthermore, the application of retiming is particu- 202 –213, Feb. 2011.
larly useful in avoiding unnecessary switching activity, due [15] R. E. Morley, Jr., G. L. Engel, T. J. Sullivan, and S. M. Natarajan,
to unbalanced delay paths in LNS arithmetic circuits. “VLSI based design of a battery-operated digital hearing aid,”
Proceedings of the IEEE International Conference on Acoustics,
Furthermore the paper extends base-2 LNS filter SNR Speech and Signal Processing, pp. 2512–2515, 1988.
models for the case of logarithmic base b = 2, to facilitate [16] J. R. Sacha and M. J. Irwin, “Number representation for reducing
design space exploration. By properly defining wordlength, switched capacitance in subband coding,” Proceedings of IEEE In-
ternational Conference on Acoustics, Speech and Signal Processing
base, circuit architecture and LUT organization it has (ICASSP), pp. 3125–3128, 1998.
been shown that the LNS-based MACs can outperform [17] M. G. Arnold, “Reduced power consumption for MPEG decoding
the corresponding TC ones in both power and delay with LNS,” Proceedings of the IEEE International Conference on
Application-Specific Systems, Architectures and Processors, (ASAP
complexities, for specific practical wordlengths. 02), pp. 65 – 67, 2002.
The design techniques and quantitative performance [18] B. Kang, N. Vijaykrishnan, M. J. Irwin, and T. Theocharides,
analysis of LNS MAC units and filter implementations “Power-efficient implementation of turbo decoder in SDR system,”
Proceedings of the IEEE International SOC Conference, pp. 119 –
presented in this paper, show that LNS can offer a viable 122, 2004.
solution for low-power signal processing systems with [19] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of
moderate word length requirements. optimal and sub-optimal MAP decoding algorithms operating in
the log domain,” Proceedings IEEE International Conference on
Communications, pp. 1009–1013, June 1995.
ACKNOWLEDGEMENTS
We thank the reviewers for their comments, which helped improve the presentation of this work.

REFERENCES
[1] T. Stouraitis and V. Paliouras, “Considering the alternatives in low-power design,” IEEE Circuits and Devices, vol. 17, no. 4, pp. 23–29, July 2001.
[2] P. E. Landman and J. M. Rabaey, “Architectural power analysis: The dual bit type method,” IEEE Transactions on VLSI Systems, vol. 3, no. 2, pp. 173–187, Jun. 1995.
[3] K.-H. Chen and T.-D. Chiueh, “A low-power digit-based reconfigurable FIR filter,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 8, pp. 617–621, Aug. 2006.
[4] E. Swartzlander and A. Alexopoulos, “The sign/logarithm number system,” IEEE Transactions on Computers, vol. 24, no. 12, pp. 1238–1242, Dec. 1975.
[5] M. G. Arnold, T. A. Bailey, J. R. Cowles, and M. D. Winkel, “Applying features of the IEEE 754 to sign/logarithm arithmetic,” IEEE Transactions on Computers, vol. 41, pp. 1040–1050, Aug. 1992.
[6] J. Coleman, C. Softley, J. Kadlec, R. Matousek, M. Tichy, Z. Pohl, A. Hermanek, and N. Benschop, “The European Logarithmic Microprocessor,” IEEE Transactions on Computers, vol. 57, no. 4, pp. 532–546, April 2008.
[7] V. Mahalingam and N. Ranganathan, “Improving accuracy in Mitchell’s logarithmic multiplication using operand decomposition,” IEEE Transactions on Computers, vol. 55, no. 12, pp. 1523–1535, Dec. 2006.
[8] K. Johansson, O. Gustafsson, and L. Wanhammar, “Implementation of elementary functions for logarithmic number systems,” IET Computers & Digital Techniques, vol. 2, no. 4, pp. 295–304, 2008. [Online]. Available: http://link.aip.org/link/?CDT/2/295/1
[9] M. G. Arnold, T. A. Bailey, J. R. Cowles, and M. D. Winkel, “Arithmetic co-transformations in the real and complex Logarithmic Number Systems,” IEEE Transactions on Computers, vol. 47, no. 7, pp. 777–786, Jul. 1998.
[10] V. S. Dimitrov, G. A. Jullien, and W. C. Miller, “Theory and applications of the double-base number system,” IEEE Transactions on Computers, vol. 48, no. 10, pp. 1098–1106, 1999.
[11] R. Muscedere, V. Dimitrov, G. Jullien, and W. Miller, “Efficient techniques for binary-to-multidigit multidimensional logarithmic number system conversion using range-addressable look-up tables,” IEEE Transactions on Computers, vol. 54, no. 3, pp. 257–271, March 2005.
[20] H. Wang, H. Yang, and D. Yang, “Improved log-MAP decoding algorithm for turbo-like codes,” Communications Letters, IEEE, vol. 10, no. 3, pp. 186–188, Mar. 2006.
[21] R. Peng and R.-R. Chen, “Application of nonbinary LDPC codes for communication over fading channels using higher order modulations,” IEEE Global Telecommunications Conference, GLOBECOM ’06, pp. 1–5, Dec. 2006.
[22] V. Paliouras and T. Stouraitis, “Low-power properties of the Logarithmic Number System,” Proceedings of 15th Symposium on Computer Arithmetic (ARITH15), pp. 229–236, Jun. 2001.
[23] V. Paliouras and T. Stouraitis, “Logarithmic number system for low-power arithmetic,” Proceedings of International Workshop - Power and Timing Modeling, Optimization and Simulation (PATMOS 2000), vol. LNCS 1918, pp. 285–294, 2000.
[24] C. Basetas, I. Kouretas, and V. Paliouras, “Low-power digital filtering based on the logarithmic number system,” Proc. of 17th Workshop on Power and Timing Modeling, Optimization and Simulation, LNCS 4644, pp. 546–555, 2007.
[25] I. Kouretas, C. Basetas, and V. Paliouras, “Low-power Logarithmic Number System addition/subtraction and their impact on digital filters,” Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS’08), pp. 692–695, 2008.
[26] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE Transactions on Computers, vol. 60, no. 7, pp. 913–922, Jul. 2011.
[27] H. Henkel, “Improved addition for the logarithmic number system,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 2, pp. 301–303, Feb. 1989.
[28] D. Lewis and L. Yu, “Algorithm design for a 30 bit integrated logarithmic processor,” Proc. of 9th Symp. on Computer Arithmetic, pp. 192–199, 1989.
[29] J. Coleman, “Simplification of table structure in logarithmic arithmetic,” Electronics Letters, vol. 31, no. 22, pp. 1905–1906, Oct. 1995.
[30] V. Paliouras and T. Stouraitis, “A novel algorithm for accurate logarithmic number system subtraction,” Proceedings of the 1996 IEEE Symposium on Circuits and Systems (ISCAS’96), vol. 4, pp. 268–271, May 1996.
[31] I. Orginos, V. Paliouras, and T. Stouraitis, “A novel algorithm for multi-operand Logarithmic Number System addition and subtraction using polynomial approximation,” Proceedings of the 1995 IEEE International Symposium on Circuits and Systems (ISCAS’95), pp. III.1992–III.1995, 1995.
[32] S. Collange, J. Detrey, and F. de Dinechin, “Floating-point or LNS: Choosing the right arithmetic on an application basis,” Proceedings of the 9th Euromicro Conference on Digital System Design (DSD’06), pp. 197–203, 2006.
[33] P. D. Vouzis, S. Collange, and M. G. Arnold, “Cotransformation provides area and accuracy improvement in an HDL library for
LNS subtraction,” Proceedings of the 10th Euromicro Conference on Digital System Design (DSD’07), pp. 85–93, 2007.
[34] J.-M. Muller, Elementary Functions – Algorithms and Implementation. New York: Hamilton Printing, 1997.
[35] S. Paul, N. Jayakumar, and S. Khatri, “A fast hardware approach for approximate, efficient logarithm and antilogarithm computations,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 2, pp. 269–277, Feb. 2009.
[36] J. Kurokawa, T. Payne, and S. Lee, “Error analysis of recursive digital filters implemented with logarithmic number systems,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 6, pp. 706–715, Dec. 1980.
[37] F. Taylor, R. Gill, J. Joseph, and J. Radke, “A 20 bit Logarithmic Number System processor,” IEEE Transactions on Computers, vol. 37, no. 5, pp. 190–199, Feb. 1988.
[38] http://www.synopsys.com.
[39] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power Methodology Manual: For System-on-Chip Design. Springer Publishing Company, Incorporated, 2007.
[40] C.-H. Chang, J. Chen, and A. Vinod, “Information theoretic approach to complexity reduction of FIR filter design,” IEEE Transactions on Circuits and Systems – Part I, vol. 55, no. 8, pp. 2310–2321, Sept. 2008.
[41] M. Aktan, A. Yurdakul, and G. Dundar, “An algorithm for the design of low-power hardware-efficient FIR filters,” IEEE Transactions on Circuits and Systems – Part I, vol. 55, no. 6, pp. 1536–1545, July 2008.
[42] D. Chandra, “Error analysis of FIR filters implemented using logarithmic arithmetic,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 6, pp. 744–747, Jun. 1998.
[43] A. Papoulis, Probability, random variables, and stochastic processes, 3rd ed. McGraw-Hill, 1991.
[44] T. K. Callaway and E. E. Swartzlander, Jr., “Power-delay characteristics of CMOS multipliers,” Proceedings of the 13th Symposium on Computer Arithmetic (ARITH13), pp. 26–32, Jul. 1997.

Vassilis Paliouras is currently an assistant professor at the Electrical and Computer Engineering Department, University of Patras, Greece. His research interests are in the areas of VLSI architectures for signal processing and communications, low-power digital design and computer arithmetic. He leads research projects in these areas, funded by European and national resources, as well as contracts with the industry. He has published 92 research articles in international journals, conferences, and book chapters and has edited three books. He is advisor to five PhD students and has supervised 20 masters’ and 18 diploma theses. Dr. Paliouras received the Guillemin-Cauer best-paper award from the IEEE CASS for the year 2000. He has served as general co-chair of the International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS 2004). He has also served as technical program chair of PATMOS 2005 and of the IEEE Workshop on Signal Processing Systems Implementation (SiPS) 2005, and as technical program co-chair of the IEEE International Conference on Electronics Circuits and Systems (ICECS) 2010. He is currently the European liaison for IEEE ISCAS 2012, Korea. He has also served on editorial boards of journals and on technical program committees of numerous conferences in the areas of circuits, systems, signal processing and communications, including ISCAS, ICASSP, ACM/IEEE GLS-VLSI, ISWPC, IFIP IEEE VLSI-SoC, NEWCAS, DATE. Dr. Paliouras is a member of the IEEE CASS Technical Committee on Circuits for Communications and a member of the IEEE SPS Technical Committee on the Design and Implementation of Signal Processing Systems.