
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


IEEE TRANSACTIONS ON COMPUTERS

Low-power Logarithmic Number System Addition/Subtraction and their Impact on Digital Filters

I. Kouretas, Member, IEEE, Ch. Basetas, and V. Paliouras, Member, IEEE

Abstract—This paper presents techniques for low-power addition/subtraction in the Logarithmic Number System (LNS) and
quantifies their impact on digital filter VLSI implementation. The impact of partitioning the look-up tables (LUT) required for LNS
addition/subtraction on complexity, performance, and power dissipation of the corresponding circuits is quantified. Two design
parameters are exploited to minimize complexity, namely the LNS base and the organization of the LNS word. A roundoff noise
model is used to demonstrate the impact of base and wordlength on the signal-to-noise ratio (SNR) of the output of finite impulse
response (FIR) filters. In addition, techniques for the low-power implementation of an LNS multiply-accumulate (MAC) unit are
investigated. Furthermore, it is shown that the proposed techniques can be extended to cotransformation-based circuits that
employ interpolators. The results are demonstrated by evaluating the power dissipation, complexity and performance of several
FIR filter configurations comprising one, two or four MAC units. Simulations of placed and routed VLSI LNS-based digital filters
using a 90nm 1.0V CMOS standard-cell library reveal that significant power dissipation savings are possible by using optimized
LNS circuits at no performance penalty, when compared to linear fixed-point two’s-complement equivalents.

Index Terms—Logarithmic Number System, Computer Arithmetic, Digital Filter, FIR.

1 INTRODUCTION

DATA representation is an important parameter in the design of low-power processors since it affects both the switching activity and hardware complexity [1–3]. The Logarithmic Number System (LNS) has been investigated as an efficient way to represent data in special-purpose VLSI processors, since it allows for simple arithmetic circuits under certain conditions. In particular, LNS exploits the properties of the logarithm to reduce the basic arithmetic operations of multiplication, division, roots, and powers to binary addition, subtraction, and right and left shifts, respectively [4].

In addition to simplifying several operations, LNS provides efficient data representation because its roundoff error behavior resembles that of floating-point arithmetic. In fact, LNS-based systems have been proposed that exhibit characteristics similar to 32-bit single-precision floating-point representation [5, 6]. Coleman et al. report that the European Logarithmic Microprocessor (ELM) compares favorably to 32-bit floating-point implementations [6]. Furthermore, when LNS is used to represent data in hardware systems, it provides an additional degree of freedom in the exploration of VLSI design space, namely the choice of logarithmic base. In particular, proper choice of the LNS base allows the designer to fine-tune resolution vs. dynamic-range trade-offs.

[Fig. 1. Proposed design methodology. Flowchart boxes, in order: LNS low-power design; Definition of LNS word organization; Determination of b, k, l; Definition of LNS MAC architecture; LUT partitioning; Definition of number and contents of sub-LUTs; Timing and address latching scheme.]

• The authors are with the Department of Electrical and Computer Engineering, University of Patras, Patras GR 26500, Greece. (e-mail: {kouretas,paliuras}@ece.upatras.gr, chbasetas@gmail.com.)

Digital Object Identifier 10.1109/TC.2012.111 0018-9340/12/$31.00 © 2012 IEEE

On the other hand, LNS benefits come at a cost: the operations of addition and subtraction are rather awkward to perform in LNS as complex look-up tables or other approximation circuitry are needed. While for short wordlengths simple techniques based on LUTs suffice, more elaborate approximation techniques are required for longer wordlengths. Several authors have proposed solutions to reduce the complexity of awkward LNS operations. Mahalingam and Ranganathan improve Mitchell’s Algorithm (MA) in terms of the accuracy of the logarithmic operations [7], while Johansson et al. use a method based on sums of bit-products to implement the basic logarithmic functions [8]. Arnold et al. suggest the use of cotransformations for the reduction of the look-up tables [9]. Dimitrov et al. have proposed an extension of LNS in which several bases are used [10]. In this context and to address the complicated required conversion to LNS, Muscedere et al. have studied techniques for converting binary to a multidigit-multidimensional logarithmic number



system by using look-up tables [11]. Very recently, Ismail and Coleman presented a co-transformation procedure and an improved interpolation method that reduce the size of look-up tables to an extent that allows their easy synthesis in logic [12]. Fu et al. deal with LNS arithmetic optimizations on FPGAs [13]. Arnold and Collange propose Complex LNS (CLNS) as a generalization of LNS, which represents complex values in log-polar form [14].

For several practical applications, the benefits of LNS are found to be more important than its inherent disadvantages. In particular, several authors have shown that LNS reduces power dissipation in signal-processing-related applications, ranging from hearing-aid devices [15] and subband coding [16], to video processing [17] and error control [18]. Moreover, logarithmic techniques have been employed in turbo code decoding for wireless communications applications. In particular, logarithmic representation has proved to be well suited to the implementation of the symbol-by-symbol Log-MAP (Logarithmic Maximum A Posteriori) algorithm used for iterative decoding [19, 20]. Peng and Chen have adopted LNS for the implementation of an FFT-based Log-Sum-Product-decoding-Algorithm (Log-SPA) used in decoding of nonbinary Low-Density Parity Check (LDPC) codes [21].

The properties of logarithmic representation that render it efficient for reducing power dissipation have been studied [22–25] and it has been demonstrated that the proper choice of the parameters of the representation can reduce the switching activity, while guaranteeing the quality of the output signal. In this context, the quality of the output is evaluated in terms of signal-to-noise ratio (SNR). In particular, the impact of the selection of the base $b$ of the logarithm has been investigated as a means to explore trade-offs between precision and dynamic range given a particular word length. Paliouras and Stouraitis [22, 23] address the low-power LNS properties from a representational viewpoint and do not focus on power dissipation estimation data obtained by circuit simulations. The low-power characteristics of LNS addition/subtraction and multiplication have been quantitatively studied and compared to equivalent linear binary two’s-complement fixed-point operations [24], where it has been demonstrated that there are practical cases in which an appropriately optimized LNS representation can replace a linear representation of longer word length, without imposing any degradation on the signal quality.

This paper presents a low-power design framework for LNS-based systems, composed of several techniques. We quantify the impact of the constituent design techniques using detailed simulations of the derived LNS circuits. The proposed design framework is depicted in Fig. 1. Initially, extending [25] and [24], optimal selection of LNS design parameters is sought, including wordlength and base, assuming a simple partitioned look-up table architecture. Subsequently, in the second stage of the proposed framework, design techniques and the derived architectures are presented, targeting LNS multiply-accumulate (MAC) units in 90nm technology. To illustrate the use of the proposed framework, the area-time-power design space of a low-pass FIR filter is explored for several configurations of MAC units. A similar study has recently been performed by Galal and Horowitz in the different context of floating-point arithmetic [26]. The proposed study focuses on the use of partitioning as a technique to limit the exponential growth of the size of LUTs with the wordlength. The technique is simple and leads to fast circuits. Departing from direct look-up table organization, a variety of LNS architectures for addition, subtraction, and multi-operand operations have been proposed in the literature, employing interpolation and linear or polynomial approximation, aiming at reducing the memory requirements, particularly for larger word lengths, such as 32 bits [27, 28]. These ideas have been combined with mathematical decompositions and transformations of the basic operations, exploiting the particular characteristics of the functions to further simplify approximation [12, 27–31].

Beyond the representational properties of LNS that have an impact on switching activity, LNS arithmetic units have structural characteristics that can be exploited to reduce power dissipation. In particular, they comprise
• mutually exclusive sub-units, which can be used selectively, and
• imbalanced delay paths.
Therefore simple low-power design techniques are found to suit an LNS adder/subtractor organization very well; the impact is quantified for the case of lookup-based architectures, but other LNS architectures may benefit as well, in terms of reducing power dissipation. Extension to other architectures is demonstrated using an interpolation-based subtractor as an illustrative example. Partitioned LUT circuits provide high speed; more sophisticated techniques can be used to reduce size [12, 13, 28–30, 32, 33].

In summary, the contributions of this paper are:
• a low-power design framework for LNS systems,
• the quantification of power dissipation reduction and performance improvement made possible by using LNS, compared to equivalent binary implementations in a contemporary 90nm technology,
• the design space exploration using the number of LUTs for addition/subtraction as a parameter, for the case of using combinational logic for LUT implementation,
• the extension of SNR models in LNS for the case of $b \neq 2$.

It is noted that we focus on applications that do not require very high precision (i.e., less than 16 bits are required), since the proposed techniques are demonstrated here for look-up-table-based implementations of the LNS adder/subtractor.

The remainder of the paper is organized as follows: LNS basics are discussed in Section 2, while Section 3 presents techniques for the low-power implementation of LNS addition/subtraction. In Section 4 various filter structures are discussed. Section 5 demonstrates the impact of the proposed design methodology on other LNS schemes.

Finally, conclusions are discussed in Section 6.

2 LNS BASICS

The basic idea in LNS is to use logarithms to represent data. Since the logarithm of a negative number is not real, in order to represent signed numbers in LNS, the sign information is stored as a separate bit $s_X$, and used in combination with the logarithm of the magnitude of the number. Furthermore, since the logarithm of zero is not a finite number, an additional single-bit flag $z_X$ is used to denote that a number is zero. Summarizing, $X$ denotes the original number, $x$ denotes the logarithm of the absolute value $|X|$, and $X_{LNS}$ is a triplet containing the sign bit, the zero bit and $x$. Formally in LNS, a number $X$ is represented as the triplet
$$ X_{LNS} = (z_X, s_X, x), \tag{1} $$
where $z_X$ is asserted in the case that $X$ is zero, $s_X$ is the sign of $X$ and $x = \log_b(|X|)$, if $X$ is not zero, with $b$ being the base of the logarithm, also called base of the representation. The choice of $b$ plays a crucial role in the representational capabilities of the triplet in (1), as well as the computational complexity of the processing and forward and inverse conversion circuitry.

Due to the basic properties of the logarithm, the multiplication of $X_{LNS}$ and $Y_{LNS}$ is reduced to the computation of the triplet $Z_{LNS}$,
$$ Z_{LNS} = (z_Z, s_Z, z), \tag{2} $$
where $z_Z = z_X \lor z_Y$, $s_Z = s_X \oplus s_Y$, and $z = x + y$. Similarly the case of division reduces to binary subtraction. The derivation of the logarithm $a$ of the sum $A$ of two triplets is more involved, as it relies on the computation of
$$ a = \max\{x, y\} + \log_b\left(1 + b^{-|x-y|}\right) \tag{3} $$
$$ \phantom{a} = \max\{x, y\} + \phi_a(d), \tag{4} $$
where $\phi_a(d) = \log_b(1 + b^{-d})$ and
$$ d = |x - y|. \tag{5} $$
Similarly the derivation of the difference of two numbers requires the computation of
$$ c = \max\{x, y\} + \log_b\left(1 - b^{-|x-y|}\right) \tag{6} $$
$$ \phantom{c} = \max\{x, y\} + \phi_s(d), \tag{7} $$
where $\phi_s(d) = \log_b(1 - b^{-d})$. Assume that a two’s-complement (TC) word is used to represent the logarithm $x$, composed of a $k$-bit integral part and an $l$-bit fractional part. The range $D_{LNS}$ spanned by $x$ is given by
$$ D_{LNS} = \left[-b^{2^{k-1}-2^{-l}}, -b^{2^{-l}}\right] \cup \{0\} \cup \left[b^{2^{-l}}, b^{2^{k-1}-2^{-l}}\right], \tag{8} $$
to be compared with the range $(-2^{i-1}, 2^{i-1} - 2^{-f})$ of a linear TC representation of $i$ integral bits and $f$ fractional bits. In general, LNS offers a superior range over the linear two’s-complement representation. This is achieved using comparable word lengths, by departing from the strategy of equispaced representable values and thus resorting to a scheme that resembles floating-point arithmetic. The basic organization of an LNS adder/subtractor is shown in Fig. 2. The parallel subtractions
$$ s_1 = x - y \tag{9} $$
$$ s_2 = y - x \tag{10} $$
are implemented, followed by a multiplexer, which computes $d$ according to the rule
$$ d = |x - y| = \begin{cases} s_1, & s_1 > 0 \\ s_2, & \text{otherwise.} \end{cases} \tag{11} $$
The choice exploits the sign of either (9) or (10) as a select signal for the multiplexer. The same signal is used to select the maximum of $x$ and $y$, required for the computation of (4) and (7).

The complexity of LNS circuitry arises from the fact that the values of functions $\phi_a$ and $\phi_s$ should be computed by the LNS addition/subtraction circuit hardware for all required values of $d$. There are two main approaches to implement the evaluation of functions, namely the hardware implementation of an approximation algorithm or the off-line precomputation and storage of all required values in a look-up table [34]. The former approach is generally adopted for high-precision applications, while the latter approach is generally preferable for smaller word lengths, i.e., in relatively low-precision applications where the size of the required lookup tables is moderate. Both approaches have been extensively studied in the context of elementary function approximation.

Let $x$ denote the base-$b$ logarithm of $X$ and $x_2$ denote the base-2 logarithm of $X$. Since $x = \log_b |X| = \log_b 2^{x_2} = x_2 \log_b 2$, the conversion between a base-$b$ LNS and a base-2 LNS requires scaling by a constant factor. Several authors have studied hardware implementations of converters to/from base-2 LNS [11, 35, 36]. In this paper, although conversion is neglected, the conclusions about power consumption are valid for the complete application. To clarify this, assume an FIR filter of order $N$, requiring about $N$ multiply-accumulate operations for each input conversion and each output conversion. If $E_{in}$ is the average energy for one input conversion, $E_{out}$ is the average energy for one output conversion, and $E_y$ is the average energy for one multiply-accumulate, the total energy (after initialization) for the FIR filter to produce each result is $E_{FIR} = E_{in} + E_{out} + N \cdot E_y$. For sufficiently large values of $N$, the fraction of energy consumed in the multiply-add units approaches 100% of the total, as
$$ \lim_{N \to \infty} \frac{N \cdot E_y}{E_{FIR}} = 1.0. $$

3 LOW-POWER DESIGN OF LNS CIRCUITS

In this Section low-power LNS architectures for addition and subtraction are presented. The memory structure is organized as a collection of LUTs and is the most complex part of the LNS adder/subtractor. Several designs were investigated, distinguished by two choices: first, the choice of using either latches or D flip-flops (DFFs) to freeze the addresses of inactive sub-LUTs, and, second, the choice to select the active sub-LUT based either on the most significant bits (MSB) or on the least significant bits (LSB) of $d$ in (11).

[Fig. 2. The organization of an LNS adder/subtractor. Block diagram: operands x and y feed two subtractors; the add sub-LUTs (add LUT1 … add LUTN) and the subtract sub-LUTs (subtract LUT1 … subtract LUTN) are selected according to the sign s.]

In the proposed design framework, power dissipation reduction is sought by partitioning the particular LUTs into smaller LUTs, called sub-LUTs, only one of which is active per operation. This organization is shown in Fig. 2. To guarantee that no dynamic power is dissipated in the inactive sub-LUTs, the corresponding sub-LUT addresses are latched and remain constant throughout a particular operation. Further power dissipation reduction is sought at the implementation of multiply-accumulate units, by using retiming techniques, as well as at the algorithmic optimization level by judiciously selecting the parameters of the LNS representation.

Complexity reduction in LNS processors by partitioning of the LUTs has been successfully applied in the past [37]. Here we focus on combinational logic implementation of LUTs, instead of memory-based implementation. The organization of the LNS adder/subtractor comprises $N$ sub-LUTs per operation, as shown in Fig. 2. The upper sub-LUT system corresponds to function $\phi_a(d)$ required for LNS addition, i.e., addition of operands having the same sign, while the lower sub-LUT system is used for LNS subtraction, i.e., addition of operands of different signs.

3.1 Organization and complexity of lookup table subsystem in LNS adders/subtractors

Assume that $b$ denotes the logarithmic base and $l$ is the number of the fractional bits employed in the representation of the logarithms. The range of values of the functions $\phi_a(d)$ and $\phi_s(d)$ required to be stored is defined by the essential zero, $X_{eff}$, beyond which the absolute value of $\phi_a(d)$ or $\phi_s(d)$ is lower than the resolution of the representation. In particular, due to the monotonic behavior of $\phi_a(x)$, and by demanding $\phi_a(X_{eff}) \le 2^{-l}$, $X_{eff}$ is
$$ X_{eff} = -\log_b\left(b^{2^{-l}} - 1\right). \tag{12} $$
Therefore the total number of words, i.e., values of $\phi_a(x)$, required to be stored in the memory sub-system is
$$ W(b, l) = \left\lceil \frac{X_{eff}}{2^{-l}} \right\rceil. \tag{13} $$
The number of bits required for addressing the memory subsystem is therefore given by
$$ \mathrm{addressbits}(b, l) = \lfloor \log_2 W(b, l) \rfloor + 1. \tag{14} $$
Assume that the $m$ most significant bits of the memory subsystem address are used for selecting the sub-LUT to be activated for a particular addition. Then the number of words stored in each sub-LUT is given by
$$ w(b, l, m) = 2^{\mathrm{addressbits}(b,l)-m}. \tag{15} $$
Since $W(b, l)$ is not necessarily an integral multiple of $w$, the number of words $w_l$ stored in the last sub-LUT is given by
$$ w_l(b, l, m) = W(b, l) - (N(b, l, m) - 1)\,w(b, l, m), \tag{16} $$
where $N(b, l, m)$ is the number of sub-LUTs. It is interesting to note that using the $m$ most significant bits of the address to select the sub-LUT to activate does not imply that the number of sub-LUTs is $2^m$. Due to $\phi_a(d)$ being essentially zero for $d > X_{eff}$, the number $N$ of required sub-LUTs is given by
$$ N(b, l, m) = \left\lfloor \frac{X_{eff}}{2^{\mathrm{addressbits}(b,l)-m}\, 2^{-l}} \right\rfloor + 1. \tag{17} $$
The first value of $\phi_a$ stored in the $k$th sub-LUT, assuming the selection of sub-LUTs based on MSB of $d$, corresponds to the input value $x(k)$
$$ x(k) = k\, 2^{\mathrm{addressbits}(b,l)-m}\, 2^{-l}. \tag{18} $$
Therefore the word length of values stored in the $k$th sub-LUT is
$$ \mathrm{bpwLUT}(k) = l + \lfloor \log_2 \phi_a(x(k)) \rfloor + 1. \tag{19} $$
It is noted that in the case of using LSB of $d$ in (11) to select the sub-LUT, since the monotonicity of the stored values cannot be exploited, the wordlength of values is essentially constant; i.e.,
$$ \mathrm{bpwLUT}_{LSB}(k) = \mathrm{bpwLUT}(1). \tag{20} $$

As an illustrative example consider Fig. 3. The $k$th sub-LUT, $1 \le k \le 2^m$, stores values $\phi_a(d)$, with $(k-1)2^{n-m} \le d \cdot 2^l < k \cdot 2^{n-m}$. Bits $d_{n-1} \ldots d_{n-1-m}$ are used to select the requested value between the outputs of the $p$ sub-LUTs. Since $\phi_a(d)$ is monotonically decreasing, it holds that
$$ \phi_a(d) \le \phi_a(2^{n-1-l}), \tag{21} $$
for $d \ge 2^{n-1-l}$. Therefore differences among the values stored in LUT2 are limited to their less significant part, so only the less significant part needs to be stored for each value. Hence fewer bits per entry are required to be stored in LUT2 than in LUT1. Sub-LUTs that correspond to the upper parts of the interval need to store data words of reduced length, since stored values share a common most significant part, as shown in Fig. 3(a).

The possibility to determine the active sub-LUT using the LSBs of $d$ is of interest, as LSBs are available early in the computation of $d$, thus allowing the fast generation of selection signals. However, a partitioning scheme based on LSBs does not facilitate memory compression, since consecutive function samples are stored in different sub-LUTs, as shown in Fig. 3(b). Therefore no common most significant part of the sub-LUT contents can be omitted, resulting in increased storage with respect to MSB-based sub-LUT selection.

Hence the overall storage requirements, measured in number of bits, are given by
$$ B_{tot} = \sum_{k=0}^{N-2} \mathrm{bpwLUT}(k)\, w(b, l, m) + \mathrm{bpwLUT}(N-1)\, w_l(b, l, m). \tag{22} $$
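The sizing relations (12)–(19) and (22) for MSB-based partitioning of the $\phi_a$ table can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, the ceiling in (13) and the floors in (14), (17) and (19) are assumed rounding conventions, and the lower clamp of one bit per word is an added safeguard against floating-point rounding, not part of the model.

```python
import math

def phi_a(d, b):
    """phi_a(d) = log_b(1 + b**(-d))."""
    return math.log(1.0 + b ** (-d), b)

def x_eff(b, l):
    """Essential zero of Eq. (12): phi_a(x_eff) equals the resolution 2**-l."""
    return -math.log(b ** (2.0 ** -l) - 1.0, b)

def lut_partition(b, l, m):
    """Word counts for MSB-based partitioning, Eqs. (13)-(17); m <= address bits."""
    res = 2.0 ** -l
    xe = x_eff(b, l)
    W = math.ceil(xe / res)                      # (13), ceiling assumed
    abits = math.floor(math.log2(W)) + 1         # (14), floor assumed
    w = 2 ** (abits - m)                         # (15) words per full sub-LUT
    N = math.floor(xe / (w * res)) + 1           # (17) number of sub-LUTs
    w_last = W - (N - 1) * w                     # (16) words in the last sub-LUT
    return {"W": W, "abits": abits, "w": w, "N": N, "w_last": w_last}

def bits_per_word(k, b, l, w):
    """Eq. (19) evaluated at x(k) of Eq. (18); clamped to at least one bit."""
    first_value = phi_a(k * w * 2.0 ** -l, b)
    return max(1, l + math.floor(math.log2(first_value)) + 1)

def total_bits(b, l, m):
    """Overall storage B_tot of Eq. (22)."""
    p = lut_partition(b, l, m)
    full = sum(bits_per_word(k, b, l, p["w"]) * p["w"] for k in range(p["N"] - 1))
    return full + bits_per_word(p["N"] - 1, b, l, p["w"]) * p["w_last"]
```

Calling `total_bits(b, l, m)` for several values of `m` gives a quick view of how much storage the shared most significant part saves relative to a single table of `W` words, each `bits_per_word(0, ...)` bits wide.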
The complexity imposed by latching the sub-LUT addresses depends on the number of sub-LUTs $N$, as follows:
$$ A_{in} = N\,(\mathrm{addressbits}(b, l) - m). \tag{23} $$
Finally, to select the output of $N$ sub-LUTs, a network of $(N-1)$ multiplexers is required. An upper bound on the complexity of this network is
$$ A_{mux} = (N-1)\,(l + \lfloor \log_2 \phi_a(0) \rfloor + 1). \tag{24} $$
Further complexity reduction is possible when MSB-based selection is chosen, by exploiting the fact that the actual word length of the data in the $k$th sub-LUT is given by (19), which is upper-bounded by $l + \lfloor \log_2 \phi_a(0) \rfloor + 1$, due to the monotonicity of $\phi_a(x)$. A similar analysis can be carried out for the case of $\phi_s(x)$.

[Fig. 3. Partitioning of the addition function in sub-LUTs. (a) Partitioning of $\phi_a(d)$ in two sub-LUTs, based on MSB bit. (b) Partitioning of the addition function in sub-LUTs, based on LSB bit. Both panels plot $\phi(d)$ against $d \cdot 2^l$.]

The above analysis reveals that the partitioning of the storage into sub-LUTs with latched inputs introduces an area cost for latching the sub-LUT addresses and multiplexing the sub-LUT outputs. This complexity increases linearly with the number $N$ of sub-LUTs. Assuming that the multiplexer network is organized as a tree, a delay cost that increases logarithmically with $N$ is also imposed. Therefore, a detailed design-space exploration is required to determine the proper number of sub-LUTs required to achieve given design specifications. To facilitate such an exploration, an objective function
$$ A(b, l, m) = B_{tot}\, a_{mem} + A_{in}\, a_{lat} + A_{mux}\, a_{mux} \tag{25} $$
is formed, where $a_{mem}$, $a_{lat}$, and $a_{mux}$ denote the area cost of storing a bit in a look-up table, of a latch, and of a one-bit two-to-one multiplexer, respectively. Such a model provides a starting point for the determination of $m$, i.e., how many of the address bits are used for sub-LUT selection, given the base and the word organization. While $a_{lat}$ and $a_{mux}$ can be found in the corresponding standard-cell library databook, $a_{mem}$ depends on the results and effectiveness of the logic synthesis algorithms used by the EDA tool. Clearly the selection strategy also affects tool efficiency.
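The objective function (25) is a weighted sum of the storage, latch and multiplexer terms, so it is straightforward to tabulate for candidate partitionings. A minimal sketch follows, assuming the floor convention for the output word length in (24); the function name, its argument list and the cost coefficients used in the usage lines are hypothetical placeholders, not library data.

```python
import math

def area_estimate(b_tot, n_sub, address_bits, m, l, b, a_mem, a_lat, a_mux):
    """Area model of Eq. (25) for one LUT system.

    b_tot        -- total stored bits, Eq. (22)
    n_sub        -- number of sub-LUTs N
    address_bits -- address bits of the unpartitioned table, Eq. (14)
    m            -- address bits used for sub-LUT selection
    a_mem, a_lat, a_mux -- per-bit area costs (placeholder values, see below)
    """
    a_in = n_sub * (address_bits - m)                  # Eq. (23): latched address bits
    phi_a_0 = math.log(2.0, b)                         # phi_a(0) = log_b(2)
    out_bits = l + math.floor(math.log2(phi_a_0)) + 1  # widest sub-LUT output word
    a_mux_bits = (n_sub - 1) * out_bits                # Eq. (24): upper bound
    return b_tot * a_mem + a_in * a_lat + a_mux_bits * a_mux

# Usage with made-up costs: compare two hypothetical partitionings of one table.
candidates = {
    1: area_estimate(100_000, 2, 14, 1, 10, 2.0, 1.0, 6.0, 3.0),
    2: area_estimate(96_000, 3, 14, 2, 10, 2.0, 1.0, 6.0, 3.0),
}
best_m = min(candidates, key=candidates.get)
```

Because $a_{mem}$ depends on how well the synthesis tool compresses each sub-LUT, any ranking produced this way is only a starting point for the exploration, as the text notes.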
3.2 Implementation of LNS adder/subtractor

Fig. 4 depicts an architecture using a one-bit LSB selection for LUT partitioning. Latches are connected to the inputs of the sub-LUTs. Sign $s$ and $d_0$ are used to generate a signal that enables the latches at the input of the sub-LUT required to be activated for a particular computation. The LSB should reach the latches fast enough, considering the additional delay of computing $|x - y|$, to avoid the violation of timing constraints defined by the standard-cell library.

[Fig. 4. Four-latch organization using LSB $d_0$ to generate the lookup table selection signals. Block diagram: inputs $s_x$, $s_y$ (giving $s$), $x_0$, $y_0$ (giving $d_0$); address bits $d_n \ldots d_1$ are latched as $d_{add1}$, $d_{add2}$, $d_{sub1}$, $d_{sub2}$ and drive LUT add1, LUT add2, LUT sub1 and LUT sub2.]

The LUTs have been implemented as combinational logic, synthesized in a UMC 90nm 1.0V CMOS standard-cell library, by using the Synopsys Design Compiler, IC Compiler and PrimeTime [38]. Power dissipation results assume the maximum possible clock frequency as dictated by circuitry delay, in case of DFFs and the correspondingly

maximum possible data rate in all other cases. A two-step procedure has been adopted during synthesis and optimization: initially circuits are optimized for lowest area and subsequently are optimized under delay constraints.

Fig. 5 depicts a DFF-based architecture. It is noted that a latch-based gated clock is used for the DFFs, since additional signals are used to enable the corresponding flip-flops. A significant advantage is obtained, since gated clocks achieve further power savings and the problem of setup and hold time violations is easier to resolve, since glitches are avoided [39].

[Fig. 5. D-Flip-Flop (DFF) organization using the MSB $d_n$ to generate the lookup table selection signals. Clock and reset signals are omitted for clarity. Block diagram: inputs $s_x$, $s_y$ (giving $s$), $x$, $y$ and an adder; address bits $d_{n-1} \ldots d_0$ are registered as $d_{add1}$, $d_{add2}$, $d_{sub1}$, $d_{sub2}$ and drive LUT add1, LUT add2, LUT sub1 and LUT sub2.]

Fig. 6(b) reveals that the MSB selection of $d$ in (11) combined with the use of DFFs leads to LNS circuits of lower delay and lower power×delay product than the corresponding combinational or latch-based LNS, for wordlength of 12 bits. Besides, Fig. 6(a) reveals that MSB-based architectures achieve lower area complexity than their LSB-based counterparts. This is because the required LUTs become simpler. Since the utilization of the MSB for LUT selection is not efficient for a latch-based design, due to additional hardware used to introduce the required delay to fast paths of the circuit, a solution based on D flip-flops is preferable. Moreover, the flip-flop based selection is better supported by commercial EDA design flows.

[Fig. 6. Area×Delay and Power×Delay plots for the LNS adder in case of 12-bit LNS wordlength. (a) Area×Delay for the LNS adders in case of 12-bit LNS wordlength; area (μm², ×10^4) against delay (ns); regions marked "fast" and "low-area"; point A marks a design of high area-delay efficiency. (b) Power×Delay for the LNS adders in case of 12-bit LNS wordlength; power (×10^-3) against delay (ns); point A marks a design of good power-delay efficiency. Curves in both panels: Latch 1-bit LSB, Latch 1-bit MSB, DFF 1-bit LSB, DFF 1-bit MSB, 1-bit LSB, 1-bit MSB.]

In the following, LNS MAC and FIR implementation architectures are presented using MSB DFFs, and quantitative power, area and delay results are estimated through simulating placed and routed VLSI circuits.

4 LNS MAC AND FIR IMPLEMENTATION ISSUES

The design of low-power low-complexity FIR filters has been studied by several researchers [40, 41]. This Section discusses the impact of the proposed LNS circuits on the implementation of FIR filters. Initially, the impact of roundoff on SNR is briefly discussed and a procedure to determine the wordlength organization is detailed. Subsequently, an LNS MAC architecture is proposed. Finally, the performance of FIR filter structures is quantitatively studied, to demonstrate the benefits of employing the proposed LNS circuits over binary fixed-point implementations.

4.1 Word length determination

In order to compare the performance of LNS-based hardware to the widely-used TC-based hardware, it is necessary to define in a meaningful measurable way the concept of equivalence of behavior between the two systems. For the case of FIR filters, the SNR is used as such a measure. By optimizing LNS representation parameters with the objective to achieve a particular SNR, low-power operation can be achieved.

The output SNR of an LNS FIR filter has been both theoretically and experimentally studied in conjunction with the LNS word organization in [42] and [36]. Chandra provides an expression for the ratio of output error variance to output signal variance of a logarithmic FIR filter implementation due to roundoff [42].
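The roundoff model developed in the remainder of this subsection, Eqs. (31)–(35), lends itself to direct numerical evaluation for a given base $b$, fractional wordlength $l$ and coefficient set. The sketch below is a hedged illustration: the function names are hypothetical, `log` in the formulas is taken as the natural logarithm, and $N$ is taken to be the number of coefficients supplied, which is an assumption about how the filter order enters (34). The example coefficients are arbitrary and do not reproduce the filter used in the paper.

```python
import math

def roundoff_variance(b, l):
    """Linear-domain roundoff noise variance sigma_e^2, Eqs. (31)-(33)."""
    tau = 2.0 ** (-l - 1)                 # rounding error is uniform in [-tau, tau]
    ln_b = math.log(b)                    # natural logarithm of the base
    up, dn = b ** tau, b ** (-tau)
    mean = (up - dn) / (2.0 * tau * ln_b) - 1.0                  # Eq. (31)
    mse = (up - dn) * (up + dn - 4.0) / (4.0 * tau * ln_b) + 1.0  # Eq. (32)
    return mse - mean ** 2                                        # Eq. (33)

def lns_fir_snr_db(h, b, l):
    """Output SNR in dB from Eqs. (34)-(35); needs at least two coefficients."""
    n = len(h)                            # assumption: N = number of coefficients
    p = [n - 1] + [n - k for k in range(1, n)]   # P_00 = N-1, P_kk = N-k for k > 0
    weighted = sum(pk * hk * hk for pk, hk in zip(p, h))
    energy = sum(hk * hk for hk in h)
    ratio = roundoff_variance(b, l) * weighted / energy           # Eq. (34)
    return -10.0 * math.log10(ratio)                              # Eq. (35)

# Usage: an arbitrary 51-tap moving-average filter, three candidate bases.
taps = [1.0 / 51] * 51
snr_by_base = {b: lns_fir_snr_db(taps, b, 10) for b in (1.6, 1.8, 2.0)}
```

Since the variance scales approximately as $(\tau \ln b)^2 / 3$, the model predicts roughly 6 dB of SNR per additional fractional bit and a higher SNR for smaller bases at a fixed $l$, at the expense of dynamic range.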

[Fig. 7. SNR as a function of filter order for various values of $b$ and $l = 10$. Plot of SNR (dB) against filter order (0 to 300) for $b = 1.6$, $b = 1.8$ and $b = 2$.]

The analysis by Chandra assumes the use of the logarithmic base $b = 2$. In the following this approach is extended to $b \neq 2$. The extension is required as the base in the proposed framework is treated as a design parameter, the value of which should be optimally determined. The error $\epsilon$ due to roundoff in the logarithmic domain, assuming rounding of the addition output, is given by [42]:
$$ \epsilon = K_x - L_x, \tag{26} $$
where $L_x = \log_b |x|$ and $K_x$ is the quantized value of $L_x$, rounded to $l$ fractional bits. Since a base $b$ is assumed, the relative error in the linear domain is
$$ e = \frac{b^{K_x} - b^{L_x}}{b^{L_x}} = b^{\epsilon} - 1, \tag{27} $$
hence,
$$ \epsilon = \log_b(1 + e). \tag{28} $$
Since $\epsilon$ is due to rounding, it can be assumed that it follows a uniform distribution in the range $[-\tau, \tau]$, where $\tau = 2^{-l-1}$, i.e., the probability density function of $\epsilon$ is
$$ f(\epsilon) = \begin{cases} \frac{1}{2\tau}, & \epsilon \in [-\tau, \tau] \\ 0, & \text{else.} \end{cases} \tag{29} $$
Using (28) and (29) and the technique of transformation of variables [43], the probability density function (pdf) of $e$ is derived as follows
$$ f_e(e) = \frac{1}{2\tau \log b}\, \frac{1}{1 + e}, \tag{30} $$
where $\log$ denotes the natural logarithm. By applying the definition of the mean value and the pdf of (30), the expected value $m_e = E[e]$ is obtained as
$$ m_e = E[e] = \frac{b^{\tau} - b^{-\tau}}{2\tau \log b} - 1. \tag{31} $$
The MSE of the error $e$ in the linear domain is given by
$$ E[e^2] = \frac{(b^{\tau} - b^{-\tau})(b^{\tau} + b^{-\tau} - 4)}{4\tau \log b} + 1. \tag{32} $$
Finally, the linear roundoff noise variance $\sigma_e^2$ is given by
$$ \sigma_e^2 = E[e^2] - m_e^2. \tag{33} $$

Resembling [42] for the case of successive additions, the ratio of error variance to signal variance at the output of the filter is given by
$$ \frac{\sigma_f^2}{\sigma_y^2} = \frac{\sigma_e^2 \sum_{k=0}^{N-1} P_{kk} h_k^2}{\sum_{k=0}^{N-1} h_k^2}, \tag{34} $$
where $\sigma_e^2$ is given by (33), $h_k$ are the coefficients of the filter, $N$ is the order of the filter, and $P_{00} = N - 1$, $P_{kk} = N - k$, for $k > 0$. Eq. (34) relates SNR with the filter coefficients, the base $b$, and the fractional wordlength $l$, since $\tau = 2^{-l-1}$. It is noted that (34) does not take into account coefficient quantization and overflow/underflow errors. The latter implies no limit on the number of integral bits. Fig. 7 depicts the behavior of SNR, computed as
$$ \mathrm{SNR} = -10 \log_{10} \frac{\sigma_f^2}{\sigma_y^2} \tag{35} $$
as a function of the filter order $N$, for three different values of the base and for a fractional wordlength of $l = 10$.

The minimum number of the integral bits needed to achieve the theoretical SNR for a given number of fractional bits is discussed in [36]. In particular, Kurokawa et al. explain that when the number of integral bits is too small, there is a disagreement between theoretical and experimental results due to overflow or underflow [36]. They also reported that the number of integral and fractional bits can be changed with no effect on output SNR, provided that the total wordlength remains unchanged and an appropriate base is used. Specifically, base $b^2$ is used if the fractional bits increase by one, and base $\sqrt{b}$ if the fractional bits decrease by one. For our test case (50th-order FIR low-pass filter and 12–14 fractional bits) it was determined by simulation that 8 or 9 integral bits are needed, depending on the number of fractional bits $l$ and the selection of the logarithm base $b$. Basetas et al. [24] have demonstrated that a base $b$ of value different from the common choice of two (i.e., $b \neq 2$) may provide better balance between accuracy and dynamic range, thus offering a higher output SNR. The experimentally determined SNR and corresponding LNS word organization are found to be in close agreement with anticipated results.

To achieve a fair comparison of LNS and fixed-point implementations, initially LNS word organizations that provide better SNR than their fixed-point counterparts are required. The organization of the LNS word, i.e., the total wordlength and the partitioning of the LNS word into $k$ integer and $l$ fractional bits, is determined using the following experimental methodology. Firstly, an FIR filter model is simulated utilizing an $n$-bit fixed-point two’s-complement (TC) representation and the corresponding output SNR is calculated. The output of a double-precision floating-point FIR model is used as the reference signal. Then an FIR filter model is implemented and its behavior is studied for several LNS word organizations, i.e., the numbers of $k$ integer and $l$ fractional bits required and the corresponding base are determined. The objective of this procedure is to derive an LNS word organization which achieves an output SNR that is better than that of the fixed-point FIR implementation. In order to guarantee an unbiased comparison between fixed-point and LNS FIR, the employed models
σ2
roundoff-noise variance to signal variance ratio, σf2 , at the use identical data types with the hardware and perform on
y
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
IEEE TRANSACTIONS ON COMPUTERS

Xi Y Ci+3
D

D Xi+3 D
Ci
Fig. 8. Organization of single-MAC architecture.
Xi+2 D
Ci+1
D D
Ci+2
Xi+1 Y
D
Ci+1
D
Y
Xi+1 D
Xi D

D Xi D
Ci
Fig. 9. Organization of two-MAC architecture. D
Ci
Fig. 10. Organization of four-MAC architecture.
them operations, in a way that numerical results obtained by the hardware and the model are exactly the same; i.e., the models are bit-true.

The result of the procedure is the determination of the equivalent data representations for both the fixed-point and the LNS-based systems. Subsequently, circuits that use the derived representations are simulated in order to perform quantitative comparisons in terms of area, power and delay.

4.2 Proposed LNS MAC architectures

The area, delay and power dissipation of an LNS-based digital filter implemented using a single-MAC unit are studied in comparison to binary counterparts.

Following the procedure suggested by the EDA tool vendor, we initially optimize for area and subsequently optimize for speed. Synthesis is run for several values of the maximum delay constraint, thus implementing a design space exploration. Six possible implementations were studied that span the design space between low delay (i.e., area is relaxed) and low area (i.e., delay is relaxed).

Synthesized and placed-and-routed circuits have been derived by using Synopsys' Design Compiler and IC Compiler. Placed and routed circuits are simulated to generate switching activity. Extracted parasitics and switching activity have been used as input to Synopsys' PrimeTime in order to produce the power consumption. Power estimation includes both dynamic and leakage power.

The basic structure of the single-MAC unit is shown in Fig. 8. In the figures, the multiplier and the adder are drawn with their respective operator symbols, while D denotes a delay unit, implemented as a register. The study is extended to the performance of a two-MAC and a four-MAC architecture, shown in Figs. 9 and 10, respectively.

An FIR filter is described by

$$Y(n) = \sum_{i=0}^{N-1} C_i X(n-i), \tag{36}$$

where $C_i$ are the filter coefficients, $X(n)$ is the input sequence and $Y(n)$ is the output sequence. In order to map the computation into a two-MAC architecture, computation (36) is partitioned into the computation of two terms, subsequently added, as follows

$$Y(n) = \sum_{j=0}^{N/2-1} X(n-2j)\,C_{2j} + \sum_{j=0}^{N/2-1} X(n-2j-1)\,C_{2j+1}. \tag{37}$$

Each term is allocated to a MAC of a two-MAC architecture. In general, for a $P$-MAC architecture, the computation is decomposed as

$$Y(n) = \sum_{j=0}^{N/P-1} X(n-Pj)\,C_{Pj} + \ldots + \sum_{j=0}^{N/P-1} X(n-Pj-P+1)\,C_{Pj+P-1} \tag{38}$$

$$= \sum_{p=0}^{P-1} \underbrace{\sum_{j=0}^{N/P-1} X(n-Pj-p)\,C_{Pj+p}}_{\text{allocated to $p$th MAC}}. \tag{39}$$

The $p$th MAC unit computes $S_p = \sum_{j=0}^{N/P-1} X(n-Pj-p)\,C_{Pj+p}$. Subsequently, the summation of all $S_p$ is computed. The LNS equivalent of a single-MAC architecture is depicted in Fig. 11, where the binary multiplier has been replaced by an adder, and the binary adder is mapped to an LNS adder/subtractor. The LNS adder/subtractor is augmented with saturation circuitry and exploits a zero flag to avoid unnecessary activation of LUT partitions and further reduce power dissipation. In the implementation of Fig. 11, it is evident that the paths to the inputs of the final adder are not balanced, thus leading to excessive switching activity at the adder following the memory structure. The amount of the switching activity depends on the logic depth of the LUT implementation. A solution to this problem is to retime the circuit so that the register located at the feedback path is replaced by registers placed at the inputs of the final adder, as shown in Fig. 12. Power dissipation simulations in [25] have shown that the retimed circuit is more efficient. Hence, the retimed circuit is adopted for the LNS MAC implementations.

Fig. 11. LNS MAC unit.

Fig. 12. Retimed LNS MAC unit.
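The decomposition of (36) into the per-MAC partial sums $S_p$ of (39) can be checked with a short behavioral sketch. This is only an illustration in ordinary integer arithmetic, not the LNS hardware: the function names `fir_direct` and `fir_p_mac` are hypothetical, the input is assumed to be zero before the first sample, and $P$ is assumed to divide the number of taps $N$.

```python
def fir_direct(x, c, n):
    """Direct form of (36): Y(n) = sum_i C_i * X(n - i), taking X(m) = 0 for m < 0."""
    return sum(c[i] * x[n - i] for i in range(len(c)) if n - i >= 0)


def fir_p_mac(x, c, n, num_macs):
    """P-MAC form of (39): MAC p accumulates S_p over the taps P*j + p."""
    num_taps = len(c)
    if num_taps % num_macs != 0:
        raise ValueError("this sketch assumes P divides the number of taps N")
    partial_sums = []
    for p in range(num_macs):
        s_p = 0
        for j in range(num_taps // num_macs):
            k = num_macs * j + p          # tap index handled by MAC p
            if n - k >= 0:                # samples before the start are zero
                s_p += c[k] * x[n - k]
        partial_sums.append(s_p)
    return sum(partial_sums)              # final adder combining all S_p


# Integer data keeps the comparison exact (no rounding-order effects).
coeffs = [3, -1, 4, 1, -5, 9, 2, -6]
samples = list(range(1, 21))
for num_macs in (1, 2, 4):
    assert all(
        fir_p_mac(samples, coeffs, n, num_macs) == fir_direct(samples, coeffs, n)
        for n in range(len(samples))
    )
```

Each inner loop runs for $N/P$ iterations, which is consistent with the time per sample shrinking as MAC units are added, while the final `sum` plays the role of the adder that combines the partial results.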
Using the results of the analysis in Section 4.1 as a starting point for determining the base, simulations were performed to fine-tune the selection of i, k, b. The parameters of the LNS representation (base of the logarithm and word organization) have been selected to achieve at least the same SNR as their linear equivalent [24]. In particular, for the 12- and 14-bit LNS FIR implementations the LNS bases used are 1.8 and 1.9, respectively. Binary structures of 13- and 15-bit wordlength, equivalent in terms of SNR, are compared to the 12- and 14-bit LNS-based systems. It is noted that in the case of LNS-based structures the wordlength does not include the zero flag and the sign bit of the value, as described in (1).
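As a rough aid for the base selection discussed above, the closed-form roundoff statistics (31)–(33) of Section 4.1 can be evaluated numerically. The following is a minimal sketch under two assumptions: $\log$ in (31) and (32) is taken to be the natural logarithm, which is what the change of variables from (28) and (29) yields, and the helper name `lns_roundoff_stats` is hypothetical. It covers only the per-operation error $e$, not the full filter-level SNR.

```python
import math


def lns_roundoff_stats(base, frac_bits):
    """Mean, mean square and variance of the relative error e, per (31)-(33).

    Assumes the logarithmic-domain error is uniform in [-tau, tau] with
    tau = 2**(-frac_bits - 1), and that log in (31)-(32) is the natural log.
    """
    tau = 2.0 ** (-frac_bits - 1)
    ln_b = math.log(base)
    up, down = base ** tau, base ** (-tau)
    mean = (up - down) / (2.0 * tau * ln_b) - 1.0                            # (31)
    mean_sq = (up - down) * (up + down - 4.0) / (4.0 * tau * ln_b) + 1.0     # (32)
    variance = mean_sq - mean ** 2                                           # (33)
    return mean, mean_sq, variance


# Compare two candidate bases at the same fractional wordlength l = 10.
for base in (1.8, 2.0):
    _, _, var = lns_roundoff_stats(base, 10)
    print(f"b = {base}: sigma_e^2 = {var:.3e}")
```

For small $\tau$ the variance behaves approximately as $(\tau \ln b)^2 / 3$, so at a fixed number of fractional bits a smaller base gives a smaller per-operation error variance, at the expense of dynamic range for a fixed number of integer bits. Because (31) and (32) subtract nearly equal quantities, double precision loses accuracy as the number of fractional bits grows; the sketch is intended for moderate wordlengths such as those considered here.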
As a test case, a 50th-order FIR low-pass filter is used, with a cut-off frequency of 0.3 rad/sec. A zero-mean uncorrelated Gaussian random sequence is used as stimulus. The clock frequency is set to the maximum possible for each simulation, as dictated by the circuit delay. The experiment assumes 1000 input data samples.

4.3 Low-power LNS FIR filter implementations

In this subsection the impact of the proposed LNS MAC units on the implementation of FIR structures is detailed. The employed LNS MACs adopt the MSB-based architectures for the LUT partitioning and use DFFs for address latching. Area-delay and power-delay complexity of the proposed LNS filter designs is quantified for practical word lengths. The choice of the number of sub-LUTs emerges as a major design choice, which defines the area, delay and power limits of the proposed circuits that can be achieved by synthesis tools. By studying the achieved performance for various word lengths and sub-LUT organizations, it follows that, depending on area-time constraints, a different number of sub-LUTs provides minimum area requirements for a given delay constraint.

Fig. 13. Power×Delay in case of 1-, 2-, and 4-MAC FIR filter implementations in case of 12-bit LNS and 13-bit binary wordlength, respectively. (Panels: (a) 1-MAC, (b) 2-MAC, (c) 4-MAC FIR implementation. Axes: Power (μW) versus Time/sample (ns). Curves: 2, 4, 6 and 10 sub-LUTs, Wallace-tree, CSA.)

Fig. 14. Power×Delay in case of 1-, 2-, and 4-MAC FIR filter implementations in case of 14-bit LNS and 15-bit binary wordlength, respectively. (Panels: (a) 1-MAC, (b) 2-MAC, (c) 4-MAC FIR implementation. Axes: Power (μW) versus Time/sample (ns). Curves: 4, 6, 12 and 24 sub-LUTs, Wallace-tree, CSA.)

Fig. 15. Area×Delay in case of 1-, 2-, and 4-MAC FIR filter implementations in the cases of 12-bit LNS and 13-bit binary wordlength, respectively. (Panels: (a) 1-MAC, (b) 2-MAC, (c) 4-MAC FIR implementation. Axes: Area (μm²) versus delay (ns). Curves: 2, 4, 6 and 10 sub-LUTs, Wallace-tree, CSA.)

Furthermore the power-delay performance for various filter configurations is studied. The quantitative results are shown in Figs. 13 and 14, where the Time/sample on the horizontal axis refers to the time required to process one sample of the input sequence with the filter described in Section 4, while the vertical axis denotes the power that has been consumed for processing per sample, according to the following equation

$$P_y = P_{\text{average}} / N_{\text{samples}}, \tag{40}$$

where $P_{\text{average}}$ denotes the average power consumed during the simulation, $N_{\text{samples}} = 1000$ denotes the number of samples used for the experiment and $P_y$ is the power depicted on the y axis.
Figs. 13 and 14 demonstrate that the Time/sample values decrease in proportion to the number of MAC units. This is because the computational load of processing a particular input sequence is distributed to several parallel MAC units.

Fig. 13 depicts the power requirements of several filter implementations that comprise one, two or four MAC units assuming 12-bit word length. For each of the three cases, LNS adder/subtractors are designed using different numbers of sub-LUTs for the implementation of the addition and subtraction functions $\phi_a(x)$ and $\phi_s(x)$. Several circuits are derived in every case and optimized using different delay constraints. The experimental results depicted in Figs. 13 and 14 reveal that for lower speed requirements, the most effective solution is a single-MAC architecture. To achieve higher speeds, the number of MAC units should be increased. Depending on the architecture, it is found that a 2-sub-LUT LNS adder/subtractor achieves lower power requirements for a given delay. Further increasing the number of sub-LUTs increases the power required by the supporting circuitry, canceling the benefits due to smaller sub-LUTs. Moreover, in the case of 14-bit LNS wordlength the EDA synthesis tool requires significantly increased computation time, which makes synthesis impractical.
Fig. 16. Area×Delay in case of 1-, 2-, and 4-MAC FIR filter implementations in the cases of 14-bit LNS and 15-bit binary wordlength, respectively. (Panels: (a) 1-MAC, (b) 2-MAC, (c) 4-MAC FIR implementation. Axes: Area (μm²) versus delay (ns). Curves: 4, 6, 12 and 24 sub-LUTs, Wallace-tree, CSA.)

In Fig. 13 power-delay complexity points of equivalent binary structures, namely Wallace-tree area optimized and carry-save (CSA) structures, are also included. Using two different types of binary multipliers ensures that the proposed LNS structures are compared fairly against the most competitive corresponding binary structures [44]. It can be seen that in case of four-MAC implementations the LNS-based architectures exhibit lower power consumption. In particular, let the savings factor $S$ be defined as $S = \frac{P_{c_1} - P_{c_2}}{P_{c_1}}$, which compares the $P$ metric for the circuits $c_1$ and $c_2$ with respect to $c_1$. Then the lowest-power circuit, namely the 2-sub-LUT LNS, achieves 88% power savings when compared to both corresponding Wallace-tree and CSA structures. The 4-sub-LUT architecture is the next most power-efficient circuit with 82% power savings. In case of 2-MAC implementations, the 2-sub-LUT lowest-power circuit demonstrates 80% and 55% savings with respect to the Wallace-tree and CSA binary structures respectively, while the slowest one exhibits 63% and 59% savings when compared to the corresponding slowest binary implementations. For the 4-MAC implementations, in case of the most power-efficient 2-sub-LUT implementations the savings range from 51% for the slowest circuit up to 80% for the fastest one, compared to the corresponding Wallace-tree circuits.

Figs. 13 and 14 demonstrate that there is a trade-off among the wordlength, the number of sub-LUTs and the number of MACs. Hence Figs. 13 and 14 show that the best choices are the 2-sub-LUT and the 4-sub-LUT LNS, respectively. This is reasonable because as wordlength increases, gains achieved by partitioning the larger LUTs (Fig. 14) become larger than in the case of smaller LUTs (Fig. 13).

Furthermore, when compared to equivalent binary structures, it is shown that for the wordlengths investigated the best LNS-based implementations exhibit lower power dissipation.

Area vs. Delay results are depicted in Figs. 15 and 16, for word lengths of 12 to 14 bits. The x axis represents the circuit latency. Several instances are synthesized under increasing values of delay constraint. As expected, for larger values of allowed delay, area is decreased.

In general, the area of the 2-MAC LNS implementation is roughly 3 times the area of the 1-MAC, and the 4-MAC is 7 times larger than the 1-MAC. This is expected, as it matches the number of LNS adders which account for most of the area. However, power scales linearly with the number of multipliers. This is because power dissipation is dominated by the multiply-add units, which in turn are dominated by the adder/subtractor. During normal operation, the final adder tree is inactive and its inputs remain latched, therefore practically no dynamic power is dissipated.

5 EXTENSION TO OTHER LNS SCHEMES

Low-power techniques have been quantitatively discussed here from the viewpoint of their application to LNS circuits. While the simple partitioned LUT based LNS adder/subtractor has been used here to illustrate the impact of these techniques, results are applicable to other approaches as well.

Specifically, most LNS adder/subtractor techniques employ separate sub-units at least for the computation of addition and subtraction, which can be activated selectively, thus resembling the discussed low-power partitioning and selective use of sub-LUTs. Furthermore, LNS adder/subtractors comprising paths of substantially different delay (imbalanced delay paths) may also benefit from the retimed MAC approach. Imbalanced delay paths occur due to the basic concept of implementing an approximation followed by some post-processing, which stems from the definition of the fundamental addition operation in (3) (similarly for subtraction). Such paths introduce glitches leading to excessive power dissipation. This problem can be addressed by appropriately retiming LNS multiply-accumulate units, closely resembling the techniques described here.

As LUTs are implemented by combinational logic, efficiency of the derived circuits is affected by the logic minimization procedures applied during synthesis.

LNS adders/subtractors comprise several different subsystems and unbalanced paths, i.e., paths of unequal length. Coleman describes a technique that simplifies the structure of tables involved in logarithmic arithmetic. To demonstrate that the concepts discussed in this paper can be applied to co-transformation based arithmetic circuits, we have implemented a co-transformation based subtractor, modified to implement the selective activation of only the required sub-units. As described in [6], co-transformation is based on the tabulation of values of a function

$$F(r) = \log_2 (1 - 2^r),$$

with $r = j - i$, $j \le i$. The computation derives the intermediate result $r_2$, as

$$r_2 = j - i + F(k_1) - F(r_1), \tag{41}$$

where $k_1$ and $r_1$ are computed as follows

$$r_1 = (((j - i) \;\mathrm{div}\; m_1) - 1)\,m_1 = j + k_1 - i \tag{42}$$

$$k_1 = i - j + r_1 = -((j - i) \bmod m_1) + m_1. \tag{43}$$

The result $r_2$ is approximated as $j - i$ in the case $r < -1$, while in the case $-1 \le r < -m_1$, $r_2$ is computed as in (41). In both cases, the final result is subsequently obtained through a sub-unit which performs interpolation. When $-m_1 < r < 0$, the final result is obtained as $k_2 = F(k_1)$. Details of the method are discussed in [6].

As a demonstrator, a 12-bit logarithmic-base-2 interpolation-based subtractor using cotransformation is studied. The use of the value of $r$ to decide which of the sub-units should be activated per subtraction, i.e., the use of selective activation, is found to decrease power dissipation from 0.55mW to 0.45mW. Power dissipation is further decreased to 0.38mW by partitioning the largest of the tables used by the interpolator, using the DFF-based sub-unit selection. For the test case we assume $m_1 = 2^{-3}$, five fractional bits and an interpolation step size of $m = 2^{-1}$.

6 CONCLUSIONS

This paper quantitatively shows that the adoption of LNS can lead to very efficient circuits for digital filtering applications when the logarithmic base and the wordlength are appropriately selected, outperforming circuits based on two's-complement arithmetic in a contemporary 90nm technology. An LNS-based system using the proposed adder/subtractors offers substantial power dissipation savings at no performance penalty. Partitioning of the LUTs is employed to create parts in the circuit that can be independently activated, thus reducing power dissipation. Power has been reduced by latching the inputs to the LUTs. Furthermore the gated clock technique has been used to further reduce the power consumed at the latched inputs due to the clock signal. It has been shown that the choice of the number of sub-LUTs is an important design parameter that can be employed for exploration of the area, time, power design space. Furthermore, the application of retiming is particularly useful in avoiding unnecessary switching activity due to unbalanced delay paths in LNS arithmetic circuits.

Furthermore the paper extends base-2 LNS filter SNR models for the case of logarithmic base $b \neq 2$, to facilitate design space exploration. By properly defining wordlength, base, circuit architecture and LUT organization it has been shown that the LNS-based MACs can outperform the corresponding TC ones in both power and delay complexities, for specific practical wordlengths.

The design techniques and quantitative performance analysis of LNS MAC units and filter implementations presented in this paper show that LNS can offer a viable solution for low-power signal processing systems with moderate word length requirements.

ACKNOWLEDGEMENTS

We thank the reviewers for their comments which helped improve the presentation of this work.

REFERENCES

[1] T. Stouraitis and V. Paliouras, “Considering the alternatives in low-power design,” IEEE Circuits and Devices, vol. 17, no. 4, pp. 23–29, July 2001.
[2] P. E. Landman and J. M. Rabaey, “Architectural power analysis: The dual bit type method,” IEEE Transactions on VLSI Systems, vol. 3, no. 2, pp. 173–187, Jun. 1995.
[3] K.-H. Chen and T.-D. Chiueh, “A low-power digit-based reconfigurable FIR filter,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 8, pp. 617–621, Aug. 2006.
[4] E. Swartzlander and A. Alexopoulos, “The sign/logarithm number system,” IEEE Transactions on Computers, vol. 24, no. 12, pp. 1238–1242, Dec. 1975.
[5] M. G. Arnold, T. A. Bailey, J. R. Cowles, and M. D. Winkel, “Applying features of the IEEE 754 to sign/logarithm arithmetic,” IEEE Transactions on Computers, vol. 41, pp. 1040–1050, Aug. 1992.
[6] J. Coleman, C. Softley, J. Kadlec, R. Matousek, M. Tichy, Z. Pohl, A. Hermanek, and N. Benschop, “The European Logarithmic Microprocesor,” IEEE Transactions on Computers, vol. 57, no. 4, pp. 532–546, April 2008.
[7] V. Mahalingam and N. Ranganathan, “Improving accuracy in Mitchell's logarithmic multiplication using operand decomposition,” IEEE Transactions on Computers, vol. 55, no. 12, pp. 1523–1535, Dec. 2006.
[8] K. Johansson, O. Gustafsson, and L. Wanhammar, “Implementation of elementary functions for logarithmic number systems,” IET Computers & Digital Techniques, vol. 2, no. 4, pp. 295–304, 2008. [Online]. Available: http://link.aip.org/link/?CDT/2/295/1
[9] M. G. Arnold, T. A. Bailey, J. R. Cowles, and M. D. Winkel, “Arithmetic co-transformations in the real and complex Logarithmic Number Systems,” IEEE Transactions on Computers, vol. 47, no. 7, pp. 777–786, Jul. 1998.
[10] V. S. Dimitrov, G. A. Jullien, and W. C. Miller, “Theory and applications of the double-base number system,” IEEE Transactions on Computers, vol. 48, no. 10, pp. 1098–1106, 1999.
[11] R. Muscedere, V. Dimitrov, G. Jullien, and W. Miller, “Efficient techniques for binary-to-multidigit multidimensional logarithmic number system conversion using range-addressable look-up tables,” IEEE Transactions on Computers, vol. 54, no. 3, pp. 257–271, March 2005.
[12] R. C. Ismail and J. N. Coleman, “ROM-less LNS,” IEEE Symposium on Computer Arithmetic, pp. 43–51, 2011.
[13] H. Fu, O. Mencer, and W. Luk, “FPGA designs with optimized logarithmic arithmetic,” IEEE Transactions on Computers, vol. 59, no. 7, pp. 1000–1006, Jul. 2010.
[14] M. Arnold and S. Collange, “A Real/Complex logarithmic number system ALU,” IEEE Transactions on Computers, vol. 60, no. 2, pp. 202–213, Feb. 2011.
[15] R. E. Morley, Jr., G. L. Engel, T. J. Sullivan, and S. M. Natarajan, “VLSI based design of a battery-operated digital hearing aid,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2512–2515, 1988.
[16] J. R. Sacha and M. J. Irwin, “Number representation for reducing switched capacitance in subband coding,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3125–3128, 1998.
[17] M. G. Arnold, “Reduced power consumption for MPEG decoding with LNS,” Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP 02), pp. 65–67, 2002.
[18] B. Kang, N. Vijaykrishnan, M. J. Irwin, and T. Theocharides, “Power-efficient implementation of turbo decoder in SDR system,” Proceedings of the IEEE International SOC Conference, pp. 119–122, 2004.
[19] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain,” Proceedings IEEE International Conference on Communications, pp. 1009–1013, June 1995.
[20] H. Wang, H. Yang, and D. Yang, “Improved log-MAP decoding algorithm for turbo-like codes,” IEEE Communications Letters, vol. 10, no. 3, pp. 186–188, Mar. 2006.
[21] R. Peng and R.-R. Chen, “Application of nonbinary LDPC codes for communication over fading channels using higher order modulations,” IEEE Global Telecommunications Conference, GLOBECOM '06, pp. 1–5, Dec. 2006.
[22] V. Paliouras and T. Stouraitis, “Low-power properties of the Logarithmic Number System,” Proceedings of 15th Symposium on Computer Arithmetic (ARITH15), pp. 229–236, Jun. 2001.
[23] V. Paliouras and T. Stouraitis, “Logarithmic number system for low-power arithmetic,” Proceedings of International Workshop - Power and Timing Modeling, Optimization and Simulation (PATMOS 2000), vol. LNCS 1918, pp. 285–294, 2000.
[24] C. Basetas, I. Kouretas, and V. Paliouras, “Low-power digital filtering based on the logarithmic number system,” Proc. of 17th Workshop on Power and Timing Modeling, Optimization and Simulation, LNCS 4644, pp. 546–555, 2007.
[25] I. Kouretas, C. Basetas, and V. Paliouras, “Low-power Logarithmic Number System addition/subtraction and their impact on digital filters,” Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS'08), pp. 692–695, 2008.
[26] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE Transactions on Computers, vol. 60, no. 7, pp. 913–922, Jul. 2011.
[27] H. Henkel, “Improved addition for the logarithmic number system,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 2, pp. 301–303, Feb. 1989.
[28] D. Lewis and L. Yu, “Algorithm design for a 30 bit integrated logarithmic processor,” Proc. of 9th Symp. on Computer Arithmetic, pp. 192–199, 1989.
[29] J. Coleman, “Simplification of table structure in logarithmic arithmetic,” Electronics Letters, vol. 31, no. 22, pp. 1905–1906, Oct. 1995.
[30] V. Paliouras and T. Stouraitis, “A novel algorithm for accurate logarithmic number system subtraction,” Proceedings of the 1996 IEEE Symposium on Circuits and Systems (ISCAS'96), vol. 4, pp. 268–271, May 1996.
[31] I. Orginos, V. Paliouras, and T. Stouraitis, “A novel algorithm for multi-operand Logarithmic Number System addition and subtraction using polynomial approximation,” Proceedings of the 1995 IEEE International Symposium on Circuits and Systems (ISCAS'95), pp. III.1992–III.1995, 1995.
[32] S. Collange, J. Detrey, and F. de Dinechin, “Floating-point or LNS: Choosing the right arithmetic on an application basis,” Proceedings of the 9th Euromicro Conference on Digital System Design (DSD'06), pp. 197–203, 2006.
[33] P. D. Vouzis, S. Collange, and M. G. Arnold, “Cotransformation provides area and accuracy improvement in an HDL library for LNS subtraction,” Proceedings of the 10th Euromicro Conference on Digital System Design (DSD'07), pp. 85–93, 2007.
[34] J.-M. Muller, Elementary Functions – Algorithms and Implementation. New York: Hamilton Printing, 1997.
[35] S. Paul, N. Jayakumar, and S. Khatri, “A fast hardware approach for approximate, efficient logarithm and antilogarithm computations,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 2, pp. 269–277, Feb. 2009.
[36] J. Kurokawa, T. Payne and S. Lee, “Error analysis of recursive digital filters implemented with logarithmic number systems,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 6, pp. 706–715, Dec. 1980.
[37] F. Taylor, R. Gill, J. Joseph, and J. Radke, “A 20 bit Logarithmic Number System processor,” IEEE Transactions on Computers, vol. 37, no. 5, pp. 190–199, Feb. 1988.
[38] http://www.synopsys.com.
[39] M. Keating, D. Flynn, R. Aitken, A. Gibbons, and K. Shi, Low Power Methodology Manual: For System-on-Chip Design. Springer Publishing Company, Incorporated, 2007.
[40] C.-H. Chang, J. Chen, and A. Vinod, “Information theoretic approach to complexity reduction of FIR filter design,” IEEE Transactions on Circuits and Systems – Part I, vol. 55, no. 8, pp. 2310–2321, Sept. 2008.
[41] M. Aktan, A. Yurdakul, and G. Dundar, “An algorithm for the design of low-power hardware-efficient FIR filters,” IEEE Transactions on Circuits and Systems – Part I, vol. 55, no. 6, pp. 1536–1545, July 2008.
[42] D. Chandra, “Error analysis of FIR filters implemented using logarithmic arithmetic,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 6, pp. 744–747, Jun. 1998.
[43] A. Papoulis, Probability, random variables, and stochastic processes, 3rd ed. McGraw-Hill, 1991.
[44] T. K. Callaway and E. E. Swartzlander, Jr., “Power-delay characteristics of CMOS multipliers,” Proceedings of the 13th Symposium on Computer Arithmetic (ARITH13), pp. 26–32, Jul. 1997.

Ioannis Kouretas received the Diploma and the M.Sc. degrees in computer engineering and informatics in 2001 and 2003 respectively, from the Computer Engineering and Informatics Department, University of Patras, Patras, Greece. Since 2003, he has been a Research Assistant at the VLSI Laboratory of the Electrical and Computer Engineering Department, University of Patras. His research interests include computer arithmetic, low-power digital design and VLSI signal processing architectures.

Charalambos Basetas has completed his master's degree at the Electrical and Computer Engineering Department, University of Patras, Greece. His research interests are digital signal processing and mixed analog/digital circuits design. He has published 2 research articles in international conferences. He is currently working as a freelance electrical engineer for various companies in Greece. He is a member of the Technical Chamber of Greece.

Vassilis Paliouras is currently an assistant professor at the Electrical and Computer Engineering Department, University of Patras, Greece. His research interests are in the areas of VLSI architectures for signal processing and communications, low-power digital design and computer arithmetic. He leads research projects in these areas, funded by European and national resources, as well as contracts with the industry. He has published 92 research articles in international journals, conferences, and book chapters and has edited three books. He is advisor to five PhD students, and has supervised 20 masters' and 18 diploma theses. Dr. Paliouras has received the Guillemin-Cauer best-paper award from the IEEE CASS for the year 2000. Dr. Paliouras has served as the general co-chair of the international workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS 2004). He has also served as technical program chair of PATMOS 2005, the IEEE Workshop on Signal Processing Systems Implementation (SiPS) 2005, and technical program co-chair of the IEEE International Conference on Electronics Circuits and Systems (ICECS) 2010. He is currently the European liaison for IEEE ISCAS 2012, Korea. He has also served in editorial boards of journals and technical program committees of numerous conferences in the areas of circuits, systems, signal processing and communications, including ISCAS, ICASSP, ACM/IEEE GLS-VLSI, ISWPC, IFIP IEEE VLSI-SoC, NEWCAS, DATE. Dr. Paliouras is a member of the IEEE CASS Technical Committee on Circuits for Communications and a member of the IEEE SPS Technical Committee on the Design and Implementation of Signal Processing Systems.
