Approximate Multipliers For Optimal Utilization of FPGA Resources

2021 24th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS) | 978-1-6654-3595-6/20/$31.00 ©2021 IEEE | DOI: 10.1109/DDECS52668.2021.9417027

Christoph Niemann, Michael Rethfeldt, Dirk Timmermann


Institute of Applied Microelectronics and Computer Engineering
University of Rostock, Germany
Email: christoph.niemann@uni-rostock.de

Abstract: Approximate or inexact arithmetic is a promising approach towards lower power consumption for applications that can tolerate a certain amount of imprecision. As human perception is limited in its precision, this applies to image and audio processing. Beyond that, other applications such as neural networks or AI processing can benefit from such arithmetic as well, as they are inherently tolerant to a certain amount of inaccuracy. Multipliers are among the most critical components of arithmetic circuits regarding power, delay, and area. Various sophisticated approaches towards approximate multipliers have already been published for ASICs. However, such ASIC approaches underperform in conjunction with the specific lookup-table (LUT)-based design of FPGAs. As FPGAs gain in importance for applications like signal processing, there is a substantial lack of approximate design methodology for FPGAs. We propose an approach towards approximate signal processing that is specifically tailored towards the LUT-based hardware of FPGAs. It allows for significant performance improvements while lowering the energy demands. While introducing an insignificant average relative error of just 0.14%, we achieve a 45.9% area reduction in terms of LUTs while decreasing the delay by 30.6% compared to the Xilinx Vivado multiplier IP core. Our proposed design is open source and available at https://github.com/niemann-c/approx-mult-for-fpga.

I. INTRODUCTION

Approximate computing is of growing interest in today's arithmetic systems. One approach towards approximate computing is the design of approximate compressors. Usually, they are designed to save a number of transistors and/or logic gates compared to an exact compressor. FPGAs are becoming more and more popular for arithmetic processing, particularly in the fields of interest for approximate computing. Therefore, applying the approximate computing paradigm to FPGAs is important, as it promises substantial performance benefits. However, only very few approaches have been proposed so far.

FPGAs are built of programmable logic blocks (PLBs). Those PLBs are typically implemented as static random-access memory (SRAM) based lookup tables (LUTs). An n-input LUT can implement arbitrary n-input Boolean functions. Nowadays, six-input LUTs are the standard logic resource. In this work we use a Xilinx Virtex-7 FPGA as reference. As in most state-of-the-art FPGAs, its six-input LUTs can alternatively be configured as two five-input LUTs. However, this is only possible if both five-input LUTs share the same inputs [1]. Our contributions are:

• We discuss the required properties of approximate 4:2 compressors to fit the specific hardware characteristics of FPGAs.

• We propose a partial product generation considering the specific hardware properties of FPGAs and derive novel building blocks for partial product generation.

• We present multiplier designs based on these compressors and the improved partial product generation that save energy while increasing the throughput.

• We realize approximate image processing cores based on different approximation approaches. Thereby, we are able to compare how well these approaches perform in a realistic application scenario.

The remainder of this paper is organized as follows: In Section II we discuss requirements and objectives for the design of approximate circuits. These criteria are applied to evaluate related approaches in Section III. We introduce an approach towards approximate compressors in Section IV. In Section V we discuss the composition of optimized multipliers and propose a partial product generation tailored towards FPGA hardware resources. We evaluate our approach in Section VI and show one possible application in Section VII. Section VIII concludes this paper.

II. REQUIREMENTS AND FIGURES OF MERIT FOR APPROXIMATE COMPUTING APPROACHES

Approximate computing is a promising approach to reduce power and resource demands of computations. However, it can only be applied to applications that are able to tolerate a certain amount of inaccuracy. The most prominent reason for this "allowed tolerance" is human perception. The use of this property for lossy data compression methods such as MP3 or JPEG is widely known. Approximate computing applies the same principle to digital signal processing. Metrics to evaluate approximate computing approaches should therefore consider the characteristics of human perception, the so-called psychophysics. One important characteristic is that the amount of inaccuracy tolerated by human perception is not an absolute value, but relative to the signal amplitudes. This is reflected by Weber's law [2]:

$$k = \frac{\Delta S}{S} \tag{1}$$

It states that the just noticeable difference $\Delta S$ of a stimulus $S$ is proportional to the stimulus itself. Applied to metrics for approximate computing, this means that the error introduced should be set in relation to the actual (correct) signal value. However, many proposed and frequently used metrics for approximate computing neglect this principle. Examples are the Mean Absolute Error (MAE), also called Mean Error

Distance (MED), and the Normalized Mean Error Distance (NMED) (both discussed in [3]). The MAE is defined in a straightforward way:

$$MAE = \sum_{\forall i} \frac{|O^{(i)}_{correct} - O^{(i)}_{approx}|}{2^n} \tag{2}$$

with the correct output $O^{(i)}_{correct}$ and the actual output $O^{(i)}_{approx}$ of the approximate circuit for the i-th input pattern, respectively. The NMED, in contrast, sets the error in relation to the maximum possible signal value to compensate for the effects of differing bit widths. Hence, when comparing designs with the same bit width, both metrics are equivalent up to a constant factor. Therefore, we will solely use the NMED in the remainder of this paper.

$$NMED = \frac{MAE}{\max(O_{correct})} \tag{3}$$

Obviously, the NMED behaves just like the Peak Signal to Noise Ratio (PSNR), whereas metrics related to the actual Signal to Noise Ratio (SNR) would be more adequate. That holds true not only for human perception but also for many technical applications, as the wide use of the SNR metric reflects. However, the SNR has the problem that it is highly dependent on the signal processed. Because of that, we choose other metrics closely related to the metrics used by [4]: we divide the introduced error by the correct output value instead of the maximum one. Obviously, this metric becomes infinite for a correct output value of zero. This makes some sense, as for example in acoustics, an error in a surrounding of absolute silence would be extremely disturbing. However, to have a usable metric we normalize errors that occur at the correct output of zero with the smallest possible non-zero output value, i.e., one for non-fractional number systems. We call this metric the Relative Error (RE).

$$RE_i = \frac{|O^{(i)}_{correct} - O^{(i)}_{approx}|}{\max(1, O^{(i)}_{correct})} \tag{4}$$

Two important metrics can be derived from the relative error: the mean relative error (MRE) and the worst-case relative error (WCRE).

$$MRE = \frac{\sum_{\forall i} RE_i}{2^n} \tag{5}$$

defines the mean relative error with n being the number of input bits. The worst-case relative error is defined as

$$WCRE = \max_{\forall i}(RE_i) \tag{6}$$

As discussed, these metrics reflect the characteristic of the error tolerance better than, e.g., an absolute error. The fact that the relative error is more relevant than the absolute error value is not only important for proper evaluation of approximate circuits. Rather, it is the major motivation for the design decisions and is the basis for our novel approach towards approximate arithmetic structures on FPGAs.

Fig. 1. Critical path in a 4:2 compressor. [Figure: for the bit positions n+1, n, and n-1, the inputs a, b, c, d feed two rows of full adders that produce the outputs out_h and out_l; labels "4:2 Compressor critical path" and "Internal carry".]

III. RELATED WORK

Mody et al. propose an approximate compressor for multipliers on FPGAs [5]. They propose to neglect the internal carry paths within 4:2 compressors, which we find an inspiring idea. However, they aim to reduce logic gates, which does not yield optimal results for LUT based FPGAs. As a result, their design shows inexact outputs in 4 of 16 input vectors, which is sub-optimal, as we discuss in Section IV. As the authors do not specify any metric that evaluates the quality of their approximation, we re-implemented their solutions exactly as described in the publication. The approximation causes huge errors that are reflected in an NMED of about 44%.

In [6], the authors propose an approximate multiplier based on approximate adders. Their key innovation is to extract the sign and then round to the nearest number of the form $2^n$ to execute a multiplication as a simple shift. Precise resource savings in terms of LUTs are not specified. However, the worst-case relative error for an 8 bit multiplier is stated to be more than 8%. The authors of [7] compose multipliers from a variety of approximate compressors. Their area savings are most impressive. However, they introduce a mean relative error that is at least one order of magnitude higher than that of our solution. In [8], the authors propose different 8x8 bit multipliers recursively constructed from 4x4 bit multipliers. The mean relative error of their most precise multiplier is in the same order of magnitude as, but slightly worse than, that of our approach.

In [4], the authors propose a very interesting approach towards the automated generation of approximate multiplier designs by means of multi-objective Cartesian genetic programming. The impressive results are published as a library of approximate multipliers, too. As this approach targets ASIC technologies, the resulting designs are outperformed by dedicated FPGA approaches, as shown in [9]. In [10], the authors argue that a choice of suitable designs from this library is indeed competitive on FPGAs. They propose a machine learning (ML) supported methodology to pick suitable implementations for FPGAs from huge libraries of approximate designs. Ullah et al. propose small (4x2) approximate multipliers [9]. Based on these they compose bigger multipliers in two different ways, introducing further approximation. We found this approach the most technically sound. It provides very good results in terms of quality of the approximation and saved resources. Therefore, we use it as the baseline to compare our approach to.

We provide a quantitative comparison of related approaches with this work in Section VI.
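
The metrics of Section II (Eqs. 2 to 6) lend themselves to an exhaustive evaluation over all input patterns. The following Python sketch shows one way this could be done for a two-operand circuit. It is not taken from the authors' repository: the function names are illustrative, and `truncating_mul` is a hypothetical approximation chosen only to produce non-zero errors, not a design discussed in this paper.

```python
from itertools import product

def error_metrics(exact, approx, bits):
    """Exhaustively evaluate an approximate two-operand circuit.

    exact, approx: callables taking two unsigned `bits`-wide operands.
    Returns MAE (Eq. 2), NMED (Eq. 3), MRE (Eq. 5) and WCRE (Eq. 6).
    """
    operands = range(2 ** bits)
    patterns = 2 ** (2 * bits)            # 2^n with n = 2 * bits input bits
    abs_errors, rel_errors, max_correct = [], [], 0
    for a, b in product(operands, operands):
        correct = exact(a, b)
        err = abs(correct - approx(a, b))
        abs_errors.append(err)
        # Eq. 4: normalise by the correct value, using 1 when it is zero
        rel_errors.append(err / max(1, correct))
        max_correct = max(max_correct, correct)
    mae = sum(abs_errors) / patterns
    return {
        "MAE": mae,
        "NMED": mae / max_correct,
        "MRE": sum(rel_errors) / patterns,
        "WCRE": max(rel_errors),
    }

def exact_mul(a, b):
    return a * b

def truncating_mul(a, b):
    # Hypothetical approximation (assumption for illustration only):
    # clear the two least significant bits of the product.
    return (a * b) & ~0b11

metrics = error_metrics(exact_mul, truncating_mul, bits=4)
print(metrics)
```

For this toy approximation the small absolute errors barely register in the NMED, while the relative metrics expose the cases in which a small product is rounded down to zero. This is the effect that motivates the use of MRE and WCRE above.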

IV. APPROXIMATE COMPRESSORS

Approximate computing aims to reduce the demand for resources by reducing the complexity of the circuitry, allowing some amount of error. Typically, this means reducing the number of logic gates within a structure. However, as FPGAs do not map the logic to gates, but to SRAM based LUTs, a different approach is necessary to fit FPGA properties in an optimal manner. As LUTs can represent arbitrary logical functions, the specific logical function of any LUT has no influence on the demanded resources. Any FPGA specific approach towards approximate computing needs to consider this fact.

Compressors are among the most frequently used basic building blocks for arithmetic structures. As a basic element of the reduction tree, they are part of typical multiplier designs such as Wallace or Dadda tree structures. In signal processing, they can be used as the accumulator stage in multiply-accumulate operations as well. Typically, designers use 3:2 or 4:2 compressors. The latter tend to result in a more regular structure of the design, which is beneficial for reducing routing congestion and delay. Therefore, we focus on 4:2 compressors.

In ASIC implementations, such 4:2 cells are typically composed of two full adders. The carry path includes two full adders, as shown in Fig. 1. In a five-input-LUT based FPGA implementation, the carry path dominates the timing behavior. Hence, an approximation that omits the carry may yield significant advantages in terms of speed in FPGA based designs. Therefore, it makes sense to omit the carry path as proposed for 4:2 compressors by [11] and for various other basic adder blocks by [12]. We adopted the logical mapping proposed by [11] for ASICs for our FPGA design, as it produces fewer errors and maps nicely to LUTs.

The four inputs of a 4:2 compressor all carry the same weight, each representing a value of one. Hence, the four inputs together can represent numbers from zero to four. As we want to omit the carry path, we need to perform an approximate compression of four inputs to two outputs. With just two output bits remaining, it is only possible to represent numbers with a numerical value of zero to three using a standard unsigned number representation. Therefore, the case of all inputs being one cannot be mapped correctly to outputs without a carry. In contrast, we are able to assign the correct output to all other inputs due to the freely configurable LUTs. Therefore, an error is introduced in only one of the sixteen possible input configurations [11]. Furthermore, this error is introduced in the case producing the highest correct output value. Therefore, the relative error is rather small compared to the relative errors of other approaches (e.g., [5]). Table I represents the truth table of the approximate 4:2 compressor.

TABLE I. TRUTH TABLE FOR APPROXIMATE 4:2 COMPRESSOR.

a b c d | out_h out_l | Relative error
0 0 0 0 | 0 0 | 0%
0 0 0 1 | 0 1 | 0%
0 0 1 0 | 0 1 | 0%
0 0 1 1 | 1 0 | 0%
0 1 0 0 | 0 1 | 0%
0 1 0 1 | 1 0 | 0%
0 1 1 0 | 1 0 | 0%
0 1 1 1 | 1 1 | 0%
1 0 0 0 | 0 1 | 0%
1 0 0 1 | 1 0 | 0%
1 0 1 0 | 1 0 | 0%
1 0 1 1 | 1 1 | 0%
1 1 0 0 | 1 0 | 0%
1 1 0 1 | 1 1 | 0%
1 1 1 0 | 1 1 | 0%
1 1 1 1 | 1 1 | 25%

V. APPROXIMATE MULTIPLIERS

A. Multiplier implementations

Many approaches towards approximate multiplications aim at optimized compressors or small basic multiplication cells (e.g., [5], [9]). However, the actual multiplier implementation (i.e., the way of composing a multiplier from these basic cells) has an enormous impact on the results. This holds true for resource savings, as well as for the amount of inaccuracy caused by the approximate basic elements. This fact complicates the comparison of different approximate multiplication results, as it is not trivial to distinguish between benefits of the approximation and those of the chosen implementation of the multiplier. Therefore, we implemented some interesting approaches from literature to achieve a realistic quantification of advantages and disadvantages of each approach.

B. Optimized partial product generation for FPGAs

The straightforward way of generating the necessary partial products for a multiplication is

$$PP_{i,j} = A_i \land B_j \tag{7}$$

This can be mapped to ASICs very efficiently in the form of an AND gate. In contrast, applying this approach to FPGAs leads to inefficient utilization of resources. Only two AND gates can be mapped to a single LUT when it is reconfigured into two independent two-input LUTs. However, a LUT can represent arbitrary (complex) logical functions. In our approach, we take advantage of this fact to save resources.

Fig. 2. Proposed arrangement of PPUs. [Figure: partial product array spanned by Factor a (bits 7 to 0) and Factor b (bits 0 to 7), with product bit positions 15 to 0; legend: PPU_A, PPU_B, "AND".]

We propose a combination of creating partial products with some amount of compression, both joined together and mapped efficiently to LUT based logic. Such a partial product unit (PPU) needs to generate multiple partial products and compress them. The partial product bits generated by one PPU can depend on a maximum of six input bits, as this is the number of input pins of a LUT. However, in this case the LUT can generate only a single output bit. We found it beneficial to configure LUTs into two five-input LUTs with shared inputs. Therefore, a PPU should generate a maximum number of partial product bits out of five input bits. Furthermore, it should

need a minimum (optimally even) number of output bits to represent the pre-compressed partial product.

We found that a combination of two different types of PPUs is optimal regarding these requirements. We call them PPU type A and B. Fig. 2 depicts this combination of different PPUs. In our design, each PPU has five inputs. For example, the highlighted PPU A has the inputs $a_0$, $a_1$, $b_0$, $b_1$, and $b_2$. The highlighted PPU B has the inputs $a_0$, $a_1$, $b_2$, $b_3$, and $b_4$, with $a_i$ and $b_i$ being the i-th bit of the respective input factors a and b for the multiplication. Each PPU derives five 'conventional' partial product bits from these inputs. These internal 'conventional' partial product bits have different numeric values, with some of them appearing multiple times. Hence, they form a redundant number representation. The numeric value of this internal redundant partial product is compressed to a four bit non-redundant number at the output of the PPU. The specific configuration of each type of PPU concerning this matter is documented in Tables III and IV. Note that the internal 'conventional' partial products do not exist as real signals within the FPGA. In fact, they are just a theoretical concept to design and understand the logical function mapped to the LUTs. We decided to use just five internal 'conventional' partial products in each PPU even though it would be possible to derive six of them from the five input bits. However, the representation of their numeric range would demand more PPU output bits, resulting in a worse compression and higher resource consumption in terms of LUTs.

TABLE III. OUTPUT BITS OF PPU A COMPARED WITH 'CONVENTIONAL' PARTIAL PRODUCT GENERATION FOR THE SAME INPUT BITS.

Numeric value | Number of 'conventional' partial product bits | Number of PPU A output bits
2^0 | 1 | 1
2^1 | 2 | 1
2^2 | 2 | 1
2^3 | 0 | 1
Numeric range | [0,13] | [0,15]

TABLE IV. OUTPUT BITS OF PPU B COMPARED WITH 'CONVENTIONAL' PARTIAL PRODUCT GENERATION FOR THE SAME INPUT BITS.

Numeric value | Number of 'conventional' partial product bits | Number of PPU B output bits
2^0 | 2 | 1
2^1 | 2 | 1
2^2 | 1 | 1
2^3 | 0 | 1
Numeric range | [0,10] | [0,15]

One of the most important characteristics of an approach towards optimized partial product generation is its asymptotic behavior with respect to delay, as it determines its suitability for multipliers with larger bit widths. The different PPUs are arranged in a regular structure. Therefore, bigger multipliers can be constructed by just extending the pattern shown in Fig. 2. As there is no data-path interconnection between the individual PPUs, the asymptotic delay is O(1), which makes our approach well suited for the construction of bigger multipliers. Furthermore, Fig. 2 shows that we need to create some partial product bits individually, i.e., outside the PPUs, depending on the specific bit width of the multiplier. These individual partial product bits are generated by feeding both input bits into a LUT configured as an AND gate. The small overhead added this way is almost negligible. For example, in our 8-bit case it adds two additional LUTs, each configured as two individual two-input AND gates. As shown in Table II, our approach saves 19% of the resources compared to a traditional AND-gate approach. It not only saves resources during the partial product creation itself; due to the integrated partial compression it also demands only 81% of the partial product bits of the traditional AND-gate approach. Therefore, it allows for further resource savings in the compressor tree following the partial product generation.

TABLE II. REQUIRED RESOURCES FOR PARTIAL PRODUCT GENERATION (8 BIT).

Approach | Number of AND gates | LUTs for AND gates | PPUs | LUTs for PPUs | Overall LUTs | Number of partial product bits
'Conventional' approach | 64 | 32 | 0 | 0 | 32 | 64
Proposed PPU approach | 4 | 2 | 12 | 24 | 26 | 52

C. Statistical optimization of the compression tree

As described in Section IV, our approximate compressor works accurately in 15 out of 16 cases. As only the case with all inputs being 'one' introduces an error, it should be avoided if possible. In other words, the probability of that case should be minimized. As shown in Fig. 3, there is some degree of freedom in the choice of which PPU outputs should be fed into which compressor. The PPU output signals have different probabilities of having the value 'one'. Hence, the assignment of PPU outputs to the compressors should be made in a way that minimizes the probability of all inputs of a compressor being 'one'. The precise PPU output probability depends on the multiplier inputs, too. However, the statistics concerning this cannot be determined without domain knowledge and exponential simulation times. Therefore, we determined the output probabilities of the PPUs with a behavioral simulation assuming uniformly distributed PPU inputs. Results are shown in Tables V and VI. Based on this data, we assigned the PPU outputs to the compressor inputs in such a way that the probability of all inputs being 'one' is minimized for the approximate compressors. With this method we achieved a significant improvement of the quality of the results of our approximate multiplier.

TABLE V. PPU A OUTPUT PROBABILITY.

Output bit O_i | P(O_i = 1)
0 | 25%
1 | 37.5%
2 | 37.5%
3 | 9.375%

TABLE VI. PPU B OUTPUT PROBABILITY.

Output bit O_i | P(O_i = 1)
0 | 37.5%
1 | 37.5%
2 | 21.875%
3 | 6.25%

VI. ANALYSIS AND RESULTS

To evaluate our approach we compared it to the state of the art. As image processing is one of the most important applications of approximate computing, we chose a bit width of 8 bit, typically used in such applications, for

our comparison. To achieve a meaningful comparison of the quality of the approximation, we determined the metrics of each approach with an exhaustive simulation of all possible input patterns. For that purpose, we used the open source data of [9] and [10] and re-implemented the designs T1-T8 from [8] as described in the paper. For comparable results regarding resource utilization and delay we synthesized all designs of these three publications on our tool chain. For other referenced designs we used data from the literature, as there is no open source implementation available. Results are listed in Table VII. Obviously, there are approaches that save more resources. However, the delivered quality of the approximation varies over multiple orders of magnitude. When compared to state-of-the-art approaches that allow for a similar saving in terms of resources, our approach delivers the best results in terms of the mean relative error. Therefore, our design seems to be most suitable for applications that have a limited tolerance regarding the introduced error. Fig. 4 depicts the Pareto comparison regarding this metric. Note that this comparison also includes designs that are not meant to be used on FPGAs, but for ASICs. Besides averaging metrics, the worst-case behavior of an approximate circuit is of great interest as well. Unfortunately, some authors do not specify it in their publications. Compared to the known data, our approach outperforms the state of the art by a factor of two (a WCRE of 0.0816 compared to 0.1633 for the next best design, T1 of [8], see Table VII). In terms of hardware characteristics the delay is of huge importance, too. Fig. 5 illustrates that our design is outperforming the state of the art in this regard.

Fig. 3. Choice of PPU outputs for compression. [Figure: partial product bits over bit index 15 to 0; legend: PPU_A output bits, PPU_B output bits, "AND" output bit, compressor assignment.]

Fig. 4. Comparison of this work with state-of-the-art regarding logic resources. Data for [3], [7], [13]–[15] taken from [7]. [Scatter plot: mean relative error (logarithmic axis, 10^-3 to 10^0) over number of LUTs (0 to 100); series: [7] I 1-II 4, [9] Cc, Ca, [8] T1-T8, [13], [14], [3], [15], [10], This work.]

Fig. 5. Comparison of this work with state-of-the-art regarding its delay. Data for [3], [7], [13]–[15] taken from [7]. [Scatter plot: mean relative error (logarithmic axis) over delay in ns (0 to 8); series: [7] I 1-II 4, [9] Cc, Ca, [8] T1-T8, [13], [14], [3], [15], [10], This work.]

Our approach is well suited for the construction of multipliers with larger bit widths. As discussed in Section V-B, our partial product generation has a constant asymptotic delay, while the compression tree scales with O(log(n)). Therefore, larger designs exhibit low latencies as well. For example, a 16x16 bit multiplier design following our approach has a delay of just 3.249 ns and consumes 220 LUTs on a Virtex-7 FPGA. In comparison, the 16 bit extension of [9] (design Ca) consumes 245 LUTs and causes a delay of 4.547 ns when synthesized on the same tool chain for the same target platform. However, a comprehensive discussion and comparison of 16 bit designs is out of scope due to limited space and is planned as future work.

VII. APPLICATION EXAMPLE: APPROXIMATE IMAGE PROCESSING CORE

To illustrate the usability of our approximate multipliers and compressors, we implemented a convolutional image processing core, as it is used for edge detection, blurring, and many other image processing tasks. More specifically, we chose a filter kernel that processes a 3x3 pixel window per clock cycle. However, results scale to bigger kernels very well. We applied different approximate computing approaches to the multipliers and compressors within the core to compare their impact on image quality, required logic resources, and delay. As baseline, we chose a Wallace tree multiplier utilizing exact 4:2 compressors and a carry save (CS) compression tree for the accumulation. The results shown in Table VIII show that our approach works not only in theory but also in real applications. Note the resource savings of about 40% as well as a 16% speedup when our approximate compressors are applied not only to the multipliers but in the accumulate stages of the filter kernel, too. Fig. 6 shows

results of a Gaussian smoothing filter. The multipliers within the filter kernel are an exact multiplier and our approach. For comparison we also determined results for multiplier Design Cc [9], when used in our filter kernel.

Fig. 6. Resulting images: exact multiplier, this approach, Ullah et al. [9] Design Cc (from left to right).

TABLE VII. COMPARISON WITH STATE-OF-THE-ART 8-BIT MULTIPLIERS. RESULTS ARE DETERMINED BY EXHAUSTIVE SIMULATION OF ALL POSSIBLE INPUT VECTORS OR FROM LITERATURE.

Reference | Approach | Area [LUTs] | Delay [ns] | MRE | WCRE | NMED [10^2]
Xilinx Coregen | exact multiplier | 98 | 3.447 | - | - | -
Ullah [9] Design Ca | elementary multiplier blocks | 57 | 3.167 | 0.0029 | 0.1905 | 0.080
Ullah [9] Design Cc | elementary multiplier blocks | 56 | 2.372 | 0.1294 | 0.9675 | 2.430
[7] AM II 4 (1) | approximate compressors | 48 | 4.330 | 0.0140 | N/A | 0.190
[7] AM I 4 (1) | approximate compressors | 55 | 4.440 | 0.0117 | N/A | 0.160
[8] T1 | elementary multiplier blocks | 57 | 3.132 | 0.0017 | 0.1633 | 0.055
[10] mul8u 158B | ML supported choice from ASIC-Library | 73 | 3.967 | 0.0036 | 0.4000 | 0.023
[10] mul8u 18G6 | ML supported choice from ASIC-Library | 23 | 2.732 | 0.1191 | 3.0000 | 0.665
This Work | LUT-optimized compressors | 53 | 2.391 | 0.0014 | 0.0816 | 0.068
(1) Data for Spartan-6 FPGA

TABLE VIII. RESULTS FOR APPLICATION OF DIFFERENT APPROXIMATE TECHNIQUES TO A 3X3 CONVOLUTIONAL FILTER KERNEL.

Approach | Number of LUTs | Relative savings | Delay [ns] | Relative speedup
Wallace Tree multipliers, CS compression tree | 872 | - | 6.046 | -
Ullah [9] Design Cc | 630 | 27.75% | 6.634 | -9.73%
Our approach, CS compression tree with exact 4:2 compressors | 636 | 27.06% | 5.788 | 4.27%
Our approach, CS compression tree with approximate 4:2 compressors | 520 | 40.37% | 5.082 | 15.94%

VIII. CONCLUSION

We propose a new approximate compressor design for efficient arithmetic in error tolerant applications on modern FPGAs. Furthermore, we discuss the optimal composition of a multiplier from this compressor design. Our novel approach is specifically tailored towards the characteristics of human perception and the resources of modern FPGAs. Thereby, we achieve a significant improvement in terms of quality of the approximation while being highly resource efficient. We are able to save 46% of the FPGA resources while introducing a mean relative error of just 0.14%. Additionally, our approach decreases the delay by 31% compared to the Xilinx Coregen multiplier IP core. To fuel further research, our proposed design is open source and available at https://github.com/niemann-c/approx-mult-for-fpga.

IX. ACKNOWLEDGEMENT

We want to thank Hooma Amjad for her practical support. This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) SFB 1270/1 - 299150580.

REFERENCES

[1] Xilinx, "7 Series FPGAs Configurable Logic Block," 2016. [Online]. Available: https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
[2] M. Spering and T. Schmidt, Allgemeine Psychologie kompakt: Wahrnehmung, Aufmerksamkeit, Denken, Sprache; mit Add-on, 1st ed. Weinheim [u.a.]: Beltz, 2009.
[3] C. Liu, J. Han, and F. Lombardi, "A low-power, high-performance approximate multiplier with configurable partial error recovery," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014.
[4] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," Proceedings of the 2017 Design, Automation and Test in Europe, DATE 2017, pp. 258–261, 2017.
[5] J. Mody, R. Lawand, R. Priyanka, S. Sivanantham, and K. Sivasankaran, "Study of approximate compressors for multiplication using FPGA," IC-GET 2015 - Proceedings of 2015 Online International Conference on Green Engineering and Technologies, pp. 5–8, 2016.
[6] M. Osta, A. Ibrahim, H. Chible, and M. Valle, "Approximate multipliers based on inexact adders for energy efficient data processing," Proceedings - New Generation of CAS, NGCAS 2017, pp. 125–128, 2017.
[7] N. V. Toan and J. Lee, "Energy-Area-Efficient Approximate Multipliers for Error-Tolerant Applications on FPGAs," in 2019 32nd IEEE International System-on-Chip Conference (SOCC), 2019, pp. 336–341.
[8] Y. Guo, H. Sun, and S. Kimura, "Small-area and low-power FPGA-based multipliers using approximate elementary modules," in 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), 2020.
[9] S. Ullah, S. Rehman, B. S. Prabakaran, F. Kriebel, M. A. Hanif, M. Shafique, and A. Kumar, "Area-Optimized Low-Latency Approximate Multipliers for FPGA-based Hardware Accelerators," in 2018 55th ACM / ESDA / IEEE Design Automation Conference (DAC).
[10] B. S. Prabakaran, V. Mrazek, Z. Vasicek, L. Sekanina, and M. Shafique, "ApproxFPGAs: Embracing ASIC-based approximate arithmetic components for FPGA-based systems," in 2020 57th ACM/IEEE Design Automation Conference (DAC), 2020, pp. 1–6.
[11] Z. Yang, J. Han, and F. Lombardi, "Approximate compressors for error-resilient multiplier design," Proceedings of the 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFTS 2015, pp. 183–186, 2015.
[12] B. S. Prabakaran, S. Rehman, M. A. Hanif, S. Ullah, G. Mazaheri, A. Kumar, and M. Shafique, "DeMAS: An efficient design methodology for building approximate adders for FPGA-based systems," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[13] M. Ha and S. Lee, "Multipliers with Approximate 4-2 Compressors and Error Recovery Modules," IEEE Embedded Systems Letters, vol. 10, no. 1, pp. 6–9, 2018.
[14] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, "Low-Power Approximate Multipliers Using Encoded Partial Products and Approximate Compressors," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 404–416, 2018.
[15] T. Yang, T. Ukezono, and T. Sato, "Low-power and high-speed approximate multiplier design with a tree compressor," Proceedings - 35th IEEE International Conference on Computer Design, ICCD 2017, pp. 89–96, 2017.
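
As a supplement to Section IV, the truth table of the approximate 4:2 compressor (Table I) can be regenerated with a few lines of Python. This is a behavioral sketch, not the authors' HDL: the closed form `min(a + b + c + d, 3)` is one reading of Table I (the input count, saturated at three because no carry is produced), and the function names are illustrative.

```python
from itertools import product

def approx_compressor_4_2(a, b, c, d):
    """Carry-free 4:2 compression: the input count, saturated at 3."""
    value = min(a + b + c + d, 3)
    return value >> 1, value & 1          # (out_h, out_l)

def relative_error(inputs):
    """Relative error of one input pattern, normalised as in Eq. 4."""
    exact = sum(inputs)
    out_h, out_l = approx_compressor_4_2(*inputs)
    approx = 2 * out_h + out_l
    return abs(exact - approx) / max(1, exact)

# One row per input pattern: inputs, outputs, relative error
rows = [(bits, approx_compressor_4_2(*bits), relative_error(bits))
        for bits in product((0, 1), repeat=4)]
for bits, (out_h, out_l), err in rows:
    print(*bits, out_h, out_l, f"{err:.0%}")

# Input patterns for which the compressor output is inexact
inexact = [bits for bits, _, err in rows if err > 0]
```

Under this reading, the only inexact pattern is the one with all four inputs set, where the exact value of four is reported as three.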

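
The output probabilities in Tables V and VI (Section V-C) can be reproduced with a small behavioral model. The sketch below is an assumption-laden reconstruction rather than the authors' implementation: the paper gives only the inputs of the highlighted PPUs and the number of 'conventional' partial product bits per weight (Tables III and IV), so the concrete choice of which products are grouped in `ppu_a` and `ppu_b` is an inference that happens to be consistent with those tables.

```python
from fractions import Fraction
from itertools import product

def ppu_a(a0, a1, b0, b1, b2):
    # Assumed grouping: one product of weight 2^0, two of 2^1, two of 2^2.
    return (a0 & b0) + 2 * ((a0 & b1) + (a1 & b0)) + 4 * ((a1 & b1) + (a0 & b2))

def ppu_b(a0, a1, b2, b3, b4):
    # Assumed grouping: two products of weight 2^0, two of 2^1, one of 2^2
    # (weights relative to the least significant output bit of this PPU).
    return ((a1 & b2) + (a0 & b3)) + 2 * ((a1 & b3) + (a0 & b4)) + 4 * (a1 & b4)

def output_values(ppu):
    """PPU output value for each of the 32 input patterns."""
    return [ppu(*bits) for bits in product((0, 1), repeat=5)]

def output_bit_probabilities(ppu):
    """P(O_i = 1) for the four output bits under uniformly distributed inputs."""
    values = output_values(ppu)
    return [Fraction(sum((v >> i) & 1 for v in values), len(values))
            for i in range(4)]

prob_a = output_bit_probabilities(ppu_a)
prob_b = output_bit_probabilities(ppu_b)
print([float(p) for p in prob_a])
print([float(p) for p in prob_b])
```

With uniformly distributed inputs, the most significant output bit of either PPU is rarely set. This is the kind of information Section V-C uses when choosing which PPU outputs share a compressor.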

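
The percentages quoted in the abstract, in Section V-B, and in the conclusion follow directly from the LUT counts and delays in Tables II, VII, and VIII. The short Python sketch below recomputes them from the table values; the helper name `relative_saving` is illustrative and not part of the paper.

```python
def relative_saving(baseline, value):
    """Fraction by which `value` undercuts `baseline`."""
    return (baseline - value) / baseline

# 8-bit multiplier versus the exact Xilinx core (Table VII)
lut_saving = relative_saving(98, 53)
delay_saving = relative_saving(3.447, 2.391)

# Partial product generation, PPU approach versus AND gates (Table II)
ppg_lut_saving = relative_saving(32, 26)
ppg_bit_ratio = 52 / 64

# 3x3 filter kernel with approximate 4:2 compressors (Table VIII)
kernel_lut_saving = relative_saving(872, 520)
kernel_speedup = relative_saving(6.046, 5.082)

print(f"multiplier: {lut_saving:.1%} fewer LUTs, {delay_saving:.1%} less delay")
print(f"partial products: {ppg_lut_saving:.0%} fewer LUTs, {ppg_bit_ratio:.0%} of the bits")
print(f"filter kernel: {kernel_lut_saving:.2%} fewer LUTs, {kernel_speedup:.2%} speedup")
```

Note that the relative speedup in Table VIII is computed here as a relative reduction in delay, which matches the tabulated values.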
