Design-Efficient Approximate Multiplication Circuits Through Partial Product Perforation
Abstract: Approximate computing has received significant attention as a promising strategy to decrease power consumption of inherently error tolerant applications. In this paper, we focus on hardware-level approximation by introducing the partial product perforation technique for designing approximate multiplication circuits. We prove in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution. Through extensive experimental evaluation, we apply the partial product perforation method on different multiplier architectures and expose the optimal architecture–perforation configuration pairs for different error constraints. We show that, compared with the respective exact design, the partial product perforation delivers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay. In addition, the product perforation method is compared with the state-of-the-art approximation techniques, i.e., truncation, voltage overscaling, and logic approximation, showing that it outperforms them in terms of power dissipation and error.

Index Terms: Approximate arithmetic circuits, approximate computing, approximate multiplier, error analysis, low power.

Manuscript received September 17, 2015; revised January 4, 2016; accepted February 9, 2016. Date of publication March 15, 2016; date of current version September 23, 2016. This work has been partially supported by the E.C. program AEGLE under H2020 Grant Agreement No: 644906.
The authors are with the Department of Electrical and Computer Engineering, National Technical University of Athens, Athens 15780, Greece (e-mail: zervakis@microlab.ntua.gr; kostastsoumanis@gmail.com; sxydis@microlab.ntua.gr; dsoudris@microlab.ntua.gr; pekmes@microlab.ntua.gr).
Digital Object Identifier 10.1109/TVLSI.2016.2535398
1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on September 03,2021 at 08:43:06 UTC from IEEE Xplore. Restrictions apply.
3106 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 10, OCTOBER 2016

I. INTRODUCTION

IN MODERN embedded electronic devices, power consumption is a first-class design concern. Considering that a large number of application domains are inherently tolerant to imprecise calculations, e.g., digital signal processing (DSP), data analytics, and data-mining [1], approximate computing appears as a promising solution to reduce their power dissipation. Such applications process large redundant data sets or noisy input data derived from the real world, do not have a golden result, perform statistical/probabilistic computations, and/or demand human interaction, thus their exactness is relaxed due to limited human perception [2], [3]. Approximate computing can be applied at both software and hardware levels.

Hardware-level approximation mainly targets arithmetic units, such as adders and multipliers, widely used in portable devices to implement multimedia algorithms, e.g., image and video processing. The most commonly used techniques for the generation of approximate arithmetic circuits are truncation [4], [5], voltage overscaling (VOS) [2], [6], and simplification of logic complexity (i.e., alteration of the truth table) [7]–[9]. Extensive research has been conducted on approximate adders [6], [7], [10], [11], providing significant gains in terms of area and power while exposing small error. However, research activities on approximate multipliers are limited. Efficient approximate multipliers introduced in [8], [9], [12], and [13] target the approximation of the partial product accumulation but do not examine approximations on the partial product generation.

Approximate hardware circuits, contrary to software approximations, offer transistor reduction, lower dynamic and leakage power, lower circuit delay, and opportunity for downsizing. Motivated by the limited research on approximate multipliers, compared with the extensive research on approximate adders, and explicitly the lack of approximate techniques targeting the partial product generation, we introduce the partial product perforation method for creating approximate multipliers. Inspired by [14], we omit the generation of some partial products; by thus reducing the number of partial products that have to be accumulated, we decrease the area, power, and depth of the accumulation tree. The major contributions of this paper are summarized as follows.
1) We adopt and apply, for the first time, the software-based perforation technique [14] on the design of hardware circuits, obtaining the optimized design solutions regarding the power–area–error tradeoffs.
2) We analyze in a mathematically rigorous manner the arithmetic accuracy of partial product perforation and prove that it delivers a bounded and predictable output error. Our error analysis is not bound to a specific multiplier architecture and can be applied with error guarantees to every multiplication circuit regardless of its architecture. Such a rigorous analysis enables precise error estimation over input data distributions.
3) We explore and characterize the efficiency of the product perforation method on several multiplier schemes, exposing its power–area impact on different architectures. This is the first time that such an exploratory analysis over different approximate multiplier architectures is offered to the designer, enabling also the selection of the optimum architecture–perforation configuration for given error constraints.
4) We show that the partial product perforation outperforms the related state-of-the-art works in terms of power consumption and error, as well as output quality, when applied to image processing and data analytics algorithms.

More specifically, we apply the partial product perforation on 16 different multiplier architectures using industrial strength tools, i.e., Synopsys Design Compiler and PrimeTime. Through extensive experimental evaluation, we present the optimal approximate multiplier configurations for various error constraints. We show that, compared with the accurate multiplier, the product perforation offers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay for 0.1% normalized mean error distance (NMED) [15]. Moreover, it is compared with the state-of-the-art approximate computing works that use either VOS [6], logic approximation [9], or truncation [4], outperforming them significantly in terms of power dissipation and error. Finally, we examine the scalability of our technique by applying it on different bit-width multipliers and show that the delivered savings increase as the width increases.

The rest of this paper is organized as follows. In Section II, we discuss the related literature with an emphasis on circuit-level approximation. Section III introduces the partial product perforation technique, providing the corresponding error analysis and error correction methods. In Section IV, we examine the product perforation on different multiplier architectures, exposing the optimal architecture–perforation configuration pairs under differing error constraints. Section V evaluates the product perforation method by comparing it with the related state-of-the-art works. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK

In this section, the related research in the field of hardware approximate computing is discussed. Both general-purpose approximation techniques [4], [6], [16] applied to any arithmetic circuit and circuit-specific approximation either to adder [7], [10], [11] or multiplier designs [8], [9], [13], [17], [18] have been presented.

Regarding the general approximation techniques, VOS [2], [6] and truncation [4], [5], [12] have been proposed. VOS is applied in any circuit by lowering the supply voltage below its nominal value. Decreasing the supply voltage reduces the circuit's power consumption, but produces errors caused by the number of paths that fail to meet the delay constraints [2]. Banescu et al. [12] proposed an automated generation of large precision floating-point multipliers in field-programmable gate arrays using sophisticated truncation over underutilized DSPs. In [5], a truncated multiplier with a constant correction term is proposed, significantly decreasing the error imposed by typical truncation. King and Swartzlander [4] proposed a truncated multiplier with variable correction that outperforms [5] in terms of error. Probabilistic pruning and logic minimization techniques have been presented in [16] using a greedy approach to generate approximate circuits. These techniques systematically eliminate circuit components and simplify logic complexity according to the circuit's activity profile and output significance. Both techniques heavily depend on the application's characteristics, and in addition, the induced approximation error is not rigorously bounded.

Extensive research has been conducted targeting the implementation of approximate adders [7], [10], [11]. Verma et al. [11] developed a probability proof, estimating that the longest carry chain in an n-bit adder is log n, and produced a fast inexact adder limiting the carry propagation. In [10], approximation is performed by decomposing the addition circuit into an accurate and an approximate (inaccurate) part. Gupta et al. [7] build imprecise full adder cells, requiring fewer transistors, by approximating their logic function and then use them to build imprecise adders. Although it is proposed to use such adders to build approximate multipliers, it is not clear how they can be used in different tree architectures and how their error scales in the case of multioperand addition. Targeting the creation of approximate multipliers, Kulkarni et al. [8] proposed a simplified imprecise 2 × 2 multiplier cell used as the basic block for constructing larger multiplier architectures. Momeni et al. [9] presented two approximate 4:2 compressors by modifying the respective accurate truth table, which were then used to build two approximate multipliers outperforming [8]. The approximate compressors of [9] are used in a Dadda tree with 4:2 reduction. However, different multiplier architectures were not explored. Based on an approximate adder that limits the carry propagation, Liu et al. [13] presented a fast and low-power multiplier scheme with higher error than [9]. However, in all the aforementioned approaches, the imposed error cannot be predicted, as it depends on carry propagation and the circuits' implementation, and requires simulations over all possible inputs in order to be calculated.

Recently, Narayanamoorthy et al. [17] and Hashemi et al. [18] proposed the use of m × m multipliers to perform an n × n multiplication (with m < n). Narayanamoorthy et al. [17] statically split the multiplicand in three m-bit segments and perform the multiplication utilizing the segment containing the most significant 1 (leading one). However, as stated in [18], m needs to be at least n/2 to attain acceptable accuracy, thus limiting the energy savings and the scalability of this approach. Hashemi et al. [18] extended the idea of leading-one segments to enable dynamic range multiplication and added a correction term. Although [18] delivers higher accuracy designs than [17] using smaller values for m, its approach requires the allocation of extra complex circuitry, i.e., two leading-one detectors, two complex multiplexers for segment selection, one log(n)-bit comparator, a log(n)-bit adder, and one 2n-bit barrel shifter. These extra components are expected to highly increase the circuit's complexity, introducing nontrivial delay, area, and energy overheads that may considerably decrease the approximation benefits [17]. This is expected to be more evident in designs targeting very small error values, which require larger m values.

In this paper, we target the design of power–error efficient multiplication circuits. We differ from the previous works by exploring approximation on the generation of the partial products. The proposed method can be easily applied in any multiplier architecture without the need for a special design,
ZERVAKIS et al.: DESIGN EFFICIENT APPROXIMATE MULTIPLICATION CIRCUITS 3107
in contrast to related works. In addition, the error imposed by perforation depends only on the configuration parameters and, in contrast to existing work, can be analytically calculated without the need for exhaustive simulations. The latter is critical, as, given the application's inputs, a precise estimation of the output quality can be extracted. Finally, the knowledge of the induced error permits the selection of the configuration that maximizes the power savings for a specific error bound.

III. ANALYZING PARTIAL PRODUCT PERFORATION

A. Method Analysis

In this section, the partial product perforation method for the design of approximate hardware multipliers is described. Consider two n-bit numbers A and B. The result of their multiplication A × B is obtained after summing all the partial products $A b_i$, where $b_i$ is the ith bit of B. Thus

\[ A \times B = \sum_{i=0}^{n-1} A b_i 2^i, \quad b_i \in \{0, 1\}. \tag{1} \]

The partial product perforation technique omits the generation of k successive partial products starting from the jth one. A perforated partial product is not inserted in the accumulation tree, and hence n full adders can be eliminated. Applying the product perforation with configuration values j and k to the multiplication A × B produces the approximate result

\[ A \times B\big|_{j,k} = \sum_{\substack{i=0 \\ i \notin [j,\, j+k)}}^{n-1} A b_i 2^i, \quad b_i \in \{0, 1\}. \tag{2} \]

Note that $j \in [0, n-1]$ and $k \in [1, \min(n-j, n-1)]$.

Similarly, when modified Booth encoding (MBE) [19] is used for generating the partial products, the result of the approximate multiplication is given by

\[ A \times B\big|_{j,k} = \sum_{\substack{i=0 \\ i \notin [j,\, j+k)}}^{n/2-1} A b_i^{MB} 4^i, \quad b_i^{MB} \in \{0, \pm 1, \pm 2\}. \tag{3} \]

Fig. 1 shows an example of applying the partial product perforation method on different 8-bit multipliers with j = 2 and k = 2 configuration values. For each architecture, the dot diagrams [19] of the accurate and the respective perforated tree are presented. The dots represent the bits of the partial products that have to be accumulated, while the stages represent the delay of the reduction process followed by each tree. The dashed boxes with four dots are 4:2 compressors, those with three are full adders and those with two are either full- or half-adders. Through the proposed approximation technique, the power, area, and delay of the multiplication circuit are decreased, at the cost of making the computation imprecise. The higher the order of a perforated partial product, the greater the error imposed at the final result. In addition, since addition is an associative and commutative operation, when more than one partial product is perforated, the total error is the sum of the errors produced by perforating each partial product separately.

Fig. 1. Partial product reduction process for 8 × 8 multiplication with (a) accurate array, (b) approximate array, (c) accurate Wallace, (d) approximate Wallace, (e) accurate compressor 4:2, (f) approximate compressor 4:2, (g) accurate Dadda 4:2, and (h) approximate Dadda 4:2. Approximation is performed by perforating the third and fourth partial products. The boxes with four dots are 4:2 compressors, those with three are full adders and those with two are full- or half-adders.

We use the notation D[j,k,c] to label the different approximate multiplier architectural configurations. The parameter D refers to the tree architecture, j is the order of the first perforated partial product, and k is the number of the perforated partial products. If no j and k are specified, the respective notation refers to the exact design. Finally, c corresponds to the partial product generation technique and takes the value s for simple partial products (SPPs) or m for MBE. For example, Fig. 1(a) shows the array[s] configuration, while Fig. 1(b) shows the array[2,2,s] configuration.

The partial product perforation should not be confused with the truncation technique. Truncation eliminates the circuit that produces specific least significant bits (LSBs) of the accumulation tree, while the perforation skips the generation of partial products and thus decreases the number of operands to be accumulated. For example, in an 8-bit array multiplier, perforating a partial product removes eight full adders from the accumulation tree and reduces its delay. In order to attain similar circuit reduction using truncation, 6 LSBs have to be truncated. However, truncating 6 LSBs does not offer any delay reduction. Moreover, in this example, the truncation delivers, in all the cases, incorrect results, whereas the outputs of perforation are 50% correct. Finally, perforating one partial product (out of eight) results in a 12.5% loss of information while truncating 6 LSBs (out of 16) results in a 37.5% information loss. In Section V, the perforation and truncation techniques are quantitatively compared in greater detail regarding error and power metrics, in order to further expose their differences.

B. Error Analysis

A critical issue for approximate computing is the error imposed during computations and how it affects the final result. In this section, an error evaluation analysis of the partial product perforation technique is presented. We evaluate the induced error metrics proposed in [15], i.e., ED, MED, and NMED, as effective metrics for quantifying the accuracy of approximate arithmetic circuits. ED is defined as the absolute distance of the fully accurate product $P$ and the approximate one $P'$, $\mathrm{ED} = |P - P'|$. The MED is the average of EDs for all inputs and $\mathrm{NMED} = \mathrm{MED}/P_{\max}$, where $P_{\max} = (2^n - 1)^2$ in the case of an n-bit multiplier [13]. The relative error distance (RED) is defined as $\mathrm{RED} = \mathrm{ED}/P$, and the mean RED (MRED) is similarly obtained [13].

1) Error Evaluation: When applying the product perforation on an n-bit multiplier using SPP generation, the ED of multiplying two numbers A and B is calculated as follows:

\[ \mathrm{ED}(A, B) = |P - P'| = \left| A \sum_{i=0}^{n-1} b_i 2^i - A \sum_{\substack{i=0 \\ i \notin [j,\, j+k)}}^{n-1} b_i 2^i \right| = A \sum_{i=j}^{j+k-1} 2^i b_i = A 2^j x_B \tag{4} \]

where $x_B \in [0, 2^k)$ and

\[ x_B = \sum_{i=0}^{k-1} 2^i b_{j+i} = \lfloor B/2^j \rfloor \bmod 2^k. \tag{5} \]

If $p_A$ and $p_B$ are the probability density functions (PDFs) of A and B, respectively, then the MED is calculated from

\[ \mathrm{MED} = \sum_{\forall A,B} p_A(A)\, p_B(B)\, \mathrm{ED}(A, B). \tag{6} \]

Without loss of generality, the rest of our analysis considers a uniform distribution over the overall n-bit numbers, i.e., $(A, B) \in [0, 2^n)^2$. Hence, $p_A(A) = 1/2^n\ \forall A$ and $p_B(B) = 1/2^n\ \forall B$. Therefore, MED is given from

\[ \mathrm{MED} = \sum_{\forall A,B} \frac{\mathrm{ED}(A, B)}{2^n 2^n} = \frac{1}{2^{2n}} \sum_{\forall A} \sum_{\forall B} \mathrm{ED}(A, B). \tag{7} \]

Assuming that $\mathrm{ED}_A$ is the sum of EDs $\forall B$ for a given A, we have

\[ \mathrm{ED}_A = \sum_{\forall B} \mathrm{ED}(A, B) = 2^{n-k} \sum_{\forall x_B} x_B 2^j A = \frac{2^n 2^j (2^k - 1) A}{2} \tag{8} \]

and the sum of all EDs is

\[ \sum_{\forall A} \mathrm{ED}_A = \sum_{\forall A} \frac{2^n 2^j (2^k - 1) A}{2} = \frac{2^n 2^j (2^k - 1)}{2} \left( \sum_{A=0}^{2^n - 1} A \right) = \frac{2^j 2^{2n} (2^k - 1)(2^n - 1)}{4}. \tag{9} \]

Using (9), (7) equals

\[ \mathrm{MED} = \frac{1}{2^{2n}} \cdot \frac{2^j 2^{2n} (2^k - 1)(2^n - 1)}{4} = \frac{2^j (2^k - 1)(2^n - 1)}{4}. \tag{10} \]

Thus

\[ \mathrm{NMED} = \frac{\mathrm{MED}}{(2^n - 1)^2} = \frac{2^j (2^k - 1)}{4 (2^n - 1)}. \tag{11} \]

Similarly

\[ \mathrm{RED}(A, B) = \frac{\mathrm{ED}(A, B)}{A \times B} = \frac{x_B 2^j}{B} \tag{12} \]

and

\[ \mathrm{MRED} = \frac{2^n}{2^{2n}} \sum_{\forall B} \frac{x_B 2^j}{B} = \frac{1}{2^n} \sum_{\forall B} \frac{x_B 2^j}{B}. \tag{13} \]

The previous analysis provides rigorous expressions of error metrics, enabling a fast error analysis of differing product perforation configurations. As shown in Section IV, these analytical error expressions are used in an exploration loop for deriving optimized approximate design solutions. The analytical equations (11) and (13) consider uniform distribution; thus in the case of differing distributions,¹ they should be adjusted according to the new PDFs, since the power–error efficiency of approximate designs highly depends on the distribution of the multiplier's operands. In most applications, e.g., multimedia, the inputs are highly correlated [16]. As an intuitive example, Fig. 2(a) shows the power–NMED Pareto graph for a 16-bit Dadda 4:2 multiplier when A and B follow the uniform distribution over the overall range of n-bit numbers, while Fig. 2(b) shows the same graph with inputs derived from the GSM 06.10 audio benchmark [20]. As shown, increasing the k-values results in lower power consumption but increased error values, while the selection of the j-value mostly depends on the input distribution. Intuitively, for a uniform distribution over all possible n-bit numbers [Fig. 2(a)], where all the bits have equal probability of being one or zero, j should be kept small to minimize the error. This is also confirmed from Fig. 2(a), where 58% of the Pareto configurations feature j = 0 and 42% of the Pareto configurations feature j = 1. However, as shown in Fig. 2(b), when the inputs are correlated

¹ In the case of different input distributions, starting from (6), we apply the same steps given the respective PDFs of the input operands.
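As a hedged illustration of the analysis above, the following Python sketch emulates the perforated product of (2) in software and checks the closed forms (10) and (11) by exhaustive enumeration for a small bit width. The function names (`perforated_product`, `x_b`, `med_from_pdfs`, `med_closed_form`, `nmed_closed_form`) and the choice n = 5 are assumptions made for this example and do not come from the paper. `med_from_pdfs` evaluates (6) directly for two independent operand PDFs, which is one possible way to follow the recipe of footnote 1.

```python
from fractions import Fraction


def perforated_product(a: int, b: int, n: int, j: int, k: int) -> int:
    """Eq. (2): accumulate A*b_i*2^i for i in [0, n), skipping i in [j, j+k)."""
    return sum(a << i for i in range(n)
               if (b >> i) & 1 and not j <= i < j + k)


def x_b(b: int, j: int, k: int) -> int:
    """Eq. (5): the k bits of B that select the perforated partial products."""
    return (b >> j) % (1 << k)


def med_from_pdfs(p_a, p_b, n: int, j: int, k: int) -> Fraction:
    """Eq. (6): MED for independent operands with PDFs p_a[A] and p_b[B]."""
    size = 1 << n
    total = Fraction(0)
    for a in range(size):
        for b in range(size):
            ed = a * b - perforated_product(a, b, n, j, k)
            # Eq. (4): the error distance equals A * 2^j * x_B.
            assert ed == (a << j) * x_b(b, j, k)
            total += p_a[a] * p_b[b] * ed
    return total


def med_closed_form(n: int, j: int, k: int) -> Fraction:
    """Eq. (10), valid for uniformly distributed operands."""
    return Fraction((1 << j) * ((1 << k) - 1) * ((1 << n) - 1), 4)


def nmed_closed_form(n: int, j: int, k: int) -> Fraction:
    """Eq. (11), valid for uniformly distributed operands."""
    return Fraction((1 << j) * ((1 << k) - 1), 4 * ((1 << n) - 1))


if __name__ == "__main__":
    # The configuration of Fig. 1: 8-bit operands, j = 2, k = 2.
    a, b = 200, 0b10101101
    print("exact:", a * b, "perforated:", perforated_product(a, b, 8, 2, 2))

    # Exhaustive check of (10) and (11) for a small width (n = 5 is arbitrary).
    n = 5
    uniform = [Fraction(1, 1 << n)] * (1 << n)
    for j in range(n):
        for k in range(1, min(n - j, n - 1) + 1):
            med = med_from_pdfs(uniform, uniform, n, j, k)
            assert med == med_closed_form(n, j, k)
            assert med / ((1 << n) - 1) ** 2 == nmed_closed_form(n, j, k)
    print("Eqs. (10) and (11) match exhaustive enumeration for n =", n)
```

Passing any other pair of probability lists (each of length 2^n and summing to one) to `med_from_pdfs` gives the MED for that input distribution, under the assumption that the two operands are independent. Exact rational arithmetic is used so that the comparison with the closed forms is not affected by rounding.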
\[ \sum_{\substack{\forall A,B: \\ x_A > x_B}} \frac{x_B}{B} = \sum_{\forall B} \sum_{\substack{\forall A: \\ x_A > x_B}} \frac{x_B}{B} = 2^{n-k} \sum_{\forall B} (C_I - x_B) \frac{x_B}{B} \tag{24} \]

and

\[ \sum_{\substack{\forall A,B: \\ x_A = x_B}} \frac{x_B}{B} = \sum_{\forall B} \sum_{\substack{\forall A: \\ x_A = x_B}} \frac{x_B}{B} = 2^{n-k} \sum_{\forall B} \frac{x_B}{B} \tag{25} \]

(23) is equal to

\[ \sum_{\forall A,B} \mathrm{RED}(A, B) = 2^j 2^{n-k} \sum_{\forall B} \frac{x_B}{B} \bigl(1 + 2(C_I - x_B)\bigr) = 2^j 2^{n-k} \sum_{B=1}^{2^n - 1} \frac{x_B}{B} \bigl(1 + 2(C_I - x_B)\bigr) \tag{26} \]

and MRED is calculated as a function of j and k from

\[ \mathrm{MRED} = \frac{2^j}{2^{n+k}} \sum_{B=1}^{2^n - 1} \frac{x_B}{B} \bigl(1 + 2(C_I - x_B)\bigr). \tag{27} \]

Method 2 (Comparing A and B): In this method, A and B are compared before the multiplication, and if A > B, A and B are swapped. As a result, the induced error is $\mathrm{ED}(A, B) = A 2^j x_B$ when $A \le B$ and $\mathrm{ED}(A, B) = B 2^j x_A$ when $A > B$. Similar to Method 1

\[ \begin{aligned} \mathrm{MED} &= \frac{2^j}{2^{2n}} \Biggl( \sum_{\substack{\forall A,B: \\ A \le B}} x_B A + \sum_{\substack{\forall A,B: \\ A > B}} x_A B \Biggr) \\ &= \frac{2^j}{2^{2n}} \Biggl( \sum_{\substack{\forall A,B: \\ A = B}} x_A A + 2 \sum_{\forall A} \sum_{\substack{\forall B: \\ B < A}} x_A B \Biggr) \\ &= \frac{2^j}{2^{2n}} \sum_{A=1}^{2^n - 1} x_A A^2 \end{aligned} \tag{28} \]

\[ \mathrm{NMED} = \frac{2^j \sum_{A=1}^{2^n - 1} x_A A^2}{2^{2n} (2^n - 1)^2} \tag{29} \]

and

\[ \mathrm{MRED} = \frac{2^j}{2^{2n}} \sum_{B=1}^{2^n - 1} \left( \frac{x_B}{B} + 2 x_B \right). \tag{30} \]

Fig. 3 shows the error improvement achieved by Methods 1 and 2, for a 16-bit (n = 16) multiplier and all the product perforation configurations (j, k). Fig. 3(a) shows the NMED reduction attained by the correction methods with respect to the NMED of product perforation without an error correction method. Fig. 3(b) shows the respective graph for the MRED metric. The proposed corrective methods offer both NMED and MRED reduction. Method 1 offers higher NMED reduction, while Method 2 achieves higher MRED reduction. On average, Method 1 offers 30% NMED reduction and 24% MRED reduction, while Method 2 offers 26% NMED reduction and 50% MRED reduction. As a result, the selection of a corrective method depends on the application in which the perforated multiplier will be used. If the relative magnitude of the error is more important than its absolute distance from the accurate result, then Method 2 should be preferred; if not, then Method 1 should be selected. However, the implementation of Method 1 requires a k-bit comparator, while Method 2 requires an n-bit one, and thus Method 1 induces smaller area and power overheads. As a result, since both methods offer significant NMED and MRED reductions and Method 1 induces less power overhead, it should be preferred when the application is unknown.

Methods 1 and 2 decrease the error metrics, but their implementation requires an additional comparator. Fig. 4 shows the impact of correction Method 1 or Method 2 on the delay, power, and area of the Dadda 4:2 multiplier, with respect to the accurate design. Since the complexity of the comparator is mainly affected by the perforation variable k, Fig. 4 shows the perforation configurations that feature j = 1 and k = 1 to 8 (similar results are obtained for other j and for MBE designs). As expected, using Method 1 with perforation induces 13% overhead on critical delay, but also retains 26% and 20%, on average, power saving and area saving, respectively. The respective values for Method 2 are 20%, 26%, and 17%.

The NMED and MRED analytical relations show that the error imposed by the product perforation method is bounded and predictable. Therefore, when the application's input data set is determined, it can be used to calculate the optimal

Fig. 3. Percentage reduction of (a) NMED and (b) MRED achieved by the correction Methods 1 and 2 with respect to the NMED and MRED values obtained by product perforation without correction. The x-axis contains all the [j, k] configurations.
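As a hedged sanity check of the Method 2 expressions, the sketch below enumerates all operand pairs of a small multiplier, applies the swap rule described for Method 2 (the smaller operand acts as the multiplicand A), and compares the resulting MED with the closed form (28). The function names and the bit widths used are assumptions made for this example. Only (28) and (29) are checked, because an exhaustive MRED would additionally need a convention for operand pairs whose exact product is zero.

```python
from fractions import Fraction


def x_bits(v: int, j: int, k: int) -> int:
    """Bits j .. j+k-1 of v, i.e. floor(v / 2^j) mod 2^k."""
    return (v >> j) % (1 << k)


def ed_method2(a: int, b: int, j: int, k: int) -> int:
    """Error distance of Method 2: ED = min(A, B) * 2^j * x_max(A, B)."""
    small, large = min(a, b), max(a, b)
    return (small << j) * x_bits(large, j, k)


def med_method2_brute(n: int, j: int, k: int) -> Fraction:
    """MED of Method 2 by exhaustive enumeration over uniform inputs."""
    size = 1 << n
    total = sum(ed_method2(a, b, j, k)
                for a in range(size) for b in range(size))
    return Fraction(total, size * size)


def med_method2_closed(n: int, j: int, k: int) -> Fraction:
    """Eq. (28): MED = 2^j / 2^(2n) * sum_{A=1}^{2^n - 1} x_A * A^2."""
    size = 1 << n
    s = sum(x_bits(a, j, k) * a * a for a in range(1, size))
    return Fraction((1 << j) * s, size * size)


def nmed_method2_closed(n: int, j: int, k: int) -> Fraction:
    """Eq. (29): NMED = MED / (2^n - 1)^2."""
    return med_method2_closed(n, j, k) / ((1 << n) - 1) ** 2


def med_uncorrected(n: int, j: int, k: int) -> Fraction:
    """Uncorrected perforation: sum of A * 2^j * x_B over all pairs."""
    size = 1 << n
    total = sum((a << j) * x_bits(b, j, k)
                for a in range(size) for b in range(size))
    return Fraction(total, size * size)


if __name__ == "__main__":
    n = 6  # small width chosen only to keep the enumeration fast
    for j in range(n):
        for k in range(1, min(n - j, n - 1) + 1):
            assert med_method2_brute(n, j, k) == med_method2_closed(n, j, k)
    print("Eq. (28) matches exhaustive enumeration for n =", n)

    # MED reduction of Method 2 for one illustrative configuration.
    j, k = 1, 3
    reduction = 1 - med_method2_closed(n, j, k) / med_uncorrected(n, j, k)
    print(f"MED reduction for [j={j}, k={k}]: {float(reduction):.1%}")
```

The printed reduction applies only to the small width used here and should not be read as a reproduction of the 16-bit averages quoted in the text.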
Fig. 10. (a) 16-bit input image and the result of the geometric mean filter using (b) accurate multiplier Dadda4:2[s], (c) Dadda4:2[1,5,s] without correction,
(d) Dadda4:2[1,5,s] with correction Method 1, (e) Dadda4:2[3,4,s] without correction, (f) Dadda4:2[3,4,s] with correction Method 1, and (g) ACM2.
Fig. 11. (a) 16-bit input image and the result of the Canny edge detection using (b) accurate multiplier Dadda4:2[s], (c) Dadda4:2[1,5,s] without correction,
(d) Dadda4:2[1,5,s] with correction Method 1, (e) Dadda4:2[3,4,s] without correction, (f) Dadda4:2[3,4,s] with correction Method 1, and (g) ACM2.
algorithms are implemented in C++, while for the image processing ones, the OpenCV library is used.

Geometric mean filter removes noise from images, offering better results than the arithmetic mean filter for Gaussian-type noise. The geometric mean filter with parameter r filters an image by replacing each pixel's value by the geometric mean of the values of all the neighboring pixels that are inside a (2r + 1) × (2r + 1) block centered on that pixel. For our evaluation, the r parameter is set to 3. We approximate the geometric mean by replacing the multiplication between the pixels with an approximate 16 × 16 multiplier. We used as input the 16-bit (16 bits/pixel) grayscale image, as shown in Fig. 10(a). To evaluate the accuracy of the output images of the geometric mean, we use the peak signal–noise ratio (PSNR).

Canny edge detection [27] filter is considered to be an optimal edge detector. In particular, it masks the image by applying a Gaussian filter to remove the noise, it calculates the gradient of the image to find the edge strength, it applies a nonmaximum suppression to keep only the local maxima, it determines the potential edges by thresholding, and it tracks edges by hysteresis, i.e., suppresses all the edges that are weak and not connected to strong edges. The Gaussian kernel is 7 × 7 with a standard deviation of 1.1 and uses 16-bit fixed-point arithmetic. We approximate Canny edge detection by replacing the multiplication in the Gaussian filter with an approximate 16 × 16 multiplier. We used as input the 16-bit grayscale image shown in Fig. 11(a). The percentage of the edges detected using the approximate multiplier over those detected using the accurate one is used as our quality metric.

K-means is a popular algorithm for clustering data points from a multidimensional space into k clusters. It uses a two-phase iterative method and aims to partition the data points into sets, so as to minimize the within-cluster sum of distance functions of each point in the cluster to the center. We use the Euclidean distance as a distance function. We approximate the K-means algorithm by replacing the multiplications in the calculation of the Euclidean distance with an approximate 16 × 16 multiplier. We use a randomly generated input data set of 100 000 4-D points with 16 bits/dimension. The input data set is clustered in 100 clusters. To evaluate the accuracy of the K-means algorithm, we use the average relative L2-norm, i.e., $|x_{\mathrm{acc}} - x_{\mathrm{approx}}|_2 / |x_{\mathrm{acc}}|_2$.

Similar to [9] and [10], the approximate multiplier is considered as part of a general processing system that implements the aforementioned algorithms. The rest of the hardware components (except the multiplier) are considered to deliver accurate results, and thus any application inaccuracy and energy savings result from the usage of the approximate multiplier. The energy values of each multiplication operation are delivered by postsynthesis simulations of the approximate multipliers on the input data traces extracted from the application's execution. Note that in the Canny edge detection and geometric mean algorithms, the number of multiplications depends only on the image size, and thus it is the same for the accurate as well as the approximate version of the algorithm. On the other hand, the iterations performed by the K-means algorithm are not constant, and as a result, the number of multiplications in the accurate version may differ from that in the approximate version.

Fig. 10 shows both the input image and the output image of the geometric mean filter when using the accurate multiplier Dadda4:2[s], the perforated multipliers Dadda4:2[1,5,s] and Dadda4:2[3,4,s] with and without any correction method and the approximate multiplier ACM2. Fig. 11 shows the same images for the Canny edge detection. Table II summarizes the values of the energy savings and quality metrics of each application when using the aforementioned multipliers.

The use of the Dadda4:2[1,5,s] multiplier results in 85.95-dB PSNR for the geometric mean and 91.04% edges detected for the Canny edge detection. The application of the corrective Method 1 with the Dadda4:2[1,5,s] results in a small decrease of the energy savings (7.41%), but delivers better outputs as the PSNR increases by 2.9% and the edges detected by 7.6%. The Dadda4:2[3,4,s] multiplier detects 84.79% of the edges, and its PSNR is 89.93 dB. The use of correction Method 1 with the Dadda4:2[3,4,s] decreases the energy reduction by 10%, detects 16.6% more edges, and increases its PSNR by 3.1%. When ACM2[s] [9] is used, the output
TABLE II
EVALUATION OF PARTIAL PRODUCT PERFORATION IN IMAGE PROCESSING AND DATA ANALYTICS ALGORITHMS
Sotirios Xydis received the Diploma and Ph.D. degrees in electrical and computer engineering from the National Technical University of Athens, Athens, Greece, in 2005 and 2011, respectively.
He was a Post-Doctoral Research Fellow with the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy, for two years. He is currently a Research Associate with the National Technical University of Athens. He has authored over 60 technical and research papers in scientific books, international journals, and conferences. His current research interests include design space exploration for system level and datapath synthesis, and design and optimization of arithmetic VLSI circuits and power management multi/many-core and reconfigurable architectures.
Dr. Xydis was a recipient of the two best paper awards from the NASA/ESA/IEEE International Conference on Adaptive Hardware and Systems and the Fourth Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures, in 2007 and 2013, respectively.

Kiamal Pekmestzi received the Diploma degree in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1975, and the Ph.D. degree in electrical engineering from the University of Patras, Patras, Greece, in 1981.
He was a Research Fellow with the Electronics Department, Nuclear Research Center Demokritos, Athens, from 1975 to 1981. From 1983 to 1985, he was a Professor with the Higher School of Electronics, Athens. Since 1985, he has been with the National Technical University of Athens, where he is currently a Professor with the Department of Electrical and Computer Engineering. His current research interests include efficient implementation of arithmetic operations, design of embedded and microprocessor-based systems, architectures for reconfigurable computing, VLSI implementation of cryptography, and digital signal processing algorithms.