Approximate Multipliers For Optimal Utilization of FPGA Resources
Authorized licensed use limited to: INDIAN INSTITUTE OF INFORMATION TECHNOLOGY. Downloaded on April 13,2022 at 07:21:27 UTC from IEEE Xplore. Restrictions apply.
Distance (MED), and the Normalized Mean Error Distance (NMED) (both discussed in [3]). The MAE is defined straightforward:

$$\mathrm{MAE} = \frac{\sum_{\forall i} \left| O^{(i)}_{\mathrm{correct}} - O^{(i)}_{\mathrm{approx}} \right|}{2^n} \tag{2}$$

with the correct output $O^{(i)}_{\mathrm{correct}}$ and the actual output $O^{(i)}_{\mathrm{approx}}$ of the approximate circuit for the i-th input pattern, respectively. Whereas, the NMED sets the error in relation to the maximum possible signal value to compensate for the effects of differing bit widths. Hence, when comparing designs with the same bit-width, both metrics are equivalent. Therefore, we will solely use the NMED in the remainder of this paper.

$$\mathrm{NMED} = \frac{\mathrm{MAE}}{\max(O_{\mathrm{correct}})} \tag{3}$$

Fig. 1. Critical path in a 4:2 compressor. (Three neighboring 4:2 compressor slices at positions n+1, n, and n-1, each with inputs a, b, c, d and outputs out_h, out_l, built from two full adders linked by the internal carry.)
Obviously, the NMED behaves just like the Peak Signal to Noise Ratio (PSNR), whereas metrics related to the actual Signal to Noise Ratio (SNR) would be more adequate. That holds true not only for human perception but also for many technical applications, as the wide use of the SNR metric reflects. However, the SNR has the problem that it is highly dependent on the signal processed. Because of that, we choose other metrics closely related to the metrics used by [4]. Therefore, we divide the introduced error by the correct output value instead of the maximum one. Obviously, this metric becomes infinite for a correct output value of zero. This makes some sense: in acoustics, for example, an error in surroundings of absolute silence would be extremely disturbing. However, to have a usable metric, we normalize errors that occur at a correct output of zero with the smallest possible non-zero output value, i.e., one for non-fractional number systems. We call this metric the Relative Error (RE).

$$\mathrm{RE}_i = \frac{\left| O^{(i)}_{\mathrm{correct}} - O^{(i)}_{\mathrm{approx}} \right|}{\max\left(1, O^{(i)}_{\mathrm{correct}}\right)} \tag{4}$$

Two important metrics can be derived from the relative error: the mean relative error (MRE) and the worst-case relative error (WCRE).

$$\mathrm{MRE} = \frac{\sum_{\forall i} \mathrm{RE}_i}{2^n} \tag{5}$$

defines the mean relative error with n being the number of input bits. The worst-case relative error is defined as

$$\mathrm{WCRE} = \max_{\forall i}(\mathrm{RE}_i) \tag{6}$$

As discussed, these metrics reflect the characteristic of the error tolerance better than, e.g., an absolute error. The fact that the relative error is more relevant than the absolute error value is not only important for the proper evaluation of approximate circuits. Rather, it is the major motivation for the design decisions and is the basis for our novel approach towards approximate arithmetic structures on FPGAs.

III. RELATED WORK

Mody et al. propose an approximate compressor for multipliers on FPGAs [5]. They propose to neglect the internal carry paths within 4:2 compressors, which we find an inspiring idea. However, they aim to reduce logic gates, which does not yield optimal results for LUT based FPGAs. Therefore, their design shows inexact outputs in 4 of 16 input vectors, which is sub-optimal, as we discuss in Section IV. As the authors do not specify any metric that evaluates the quality of their approximation, we re-implemented their solutions exactly as described in the publication. The approximation causes huge errors that are reflected in an NMED of about 44%.

In [6], the authors propose an approximate multiplier based on approximate adders. Their key innovation is to extract the sign and then round to the nearest number of the form $2^n$ to execute a multiplication as a simple shift. Precise resource savings in terms of LUTs are not specified. However, the worst-case relative error for an 8 bit multiplier is stated to be more than 8%. The authors of [7] compose multipliers from a variety of approximate compressors. Their area savings are most impressive. However, they introduce a mean relative error that is at least one order of magnitude higher than that of our solution. In [8], the authors propose different 8x8 bit multipliers recursively constructed from 4x4 bit multipliers. The mean relative error of their most precise multiplier is in the same order of magnitude as, but slightly worse than, that of our approach.

In [4], the authors propose a very interesting approach towards the automated generation of approximate multiplier designs by means of multi-objective Cartesian genetic programming. The impressive results are published as a library of approximate multipliers, too. As this approach targets ASIC technologies, its designs are outperformed by dedicated FPGA approaches, as shown in [9]. In [10], the authors argue that a choice of suitable designs from this library is indeed competitive on FPGAs. They propose a machine learning (ML) supported methodology to pick suitable implementations for FPGAs from huge libraries of approximate designs. Ullah et al. propose small (4x2) approximate multipliers [9]. Based on these, they compose bigger multipliers in two different ways, introducing further approximation. We found this approach the most technically sound. It provides very good results in terms of quality of the approximation and saved resources. Therefore, we use it as the baseline to compare our approach to.

We provide a quantitative comparison of related approaches with this work in Section VI.
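The error metrics of Eqs. (2) to (6) can be evaluated by exhaustive simulation over all input patterns. The following Python sketch illustrates this under stated assumptions: the function names are illustrative, and `toy_approx_mul` is a hypothetical 2x2 bit multiplier (it returns 7 instead of 9 for 3 x 3) that serves only as a small test case, not as a design from this paper.

```python
from itertools import product


def error_metrics(exact, approx, bits_a, bits_b):
    """Exhaustively evaluate MAE, NMED, MRE and WCRE following Eqs. (2)-(6)."""
    patterns = list(product(range(1 << bits_a), range(1 << bits_b)))
    correct = [exact(a, b) for a, b in patterns]
    errors = [abs(c - approx(a, b)) for c, (a, b) in zip(correct, patterns)]
    n_patterns = len(patterns)  # 2^n for n input bits in total

    mae = sum(errors) / n_patterns                           # Eq. (2)
    nmed = mae / max(correct)                                # Eq. (3)
    rel = [e / max(1, c) for e, c in zip(errors, correct)]   # Eq. (4)
    mre = sum(rel) / n_patterns                              # Eq. (5)
    wcre = max(rel)                                          # Eq. (6)
    return {"MAE": mae, "NMED": nmed, "MRE": mre, "WCRE": wcre}


def toy_approx_mul(a, b):
    # Hypothetical approximate 2x2 bit multiplier: only 3 * 3 is inexact.
    return 7 if (a, b) == (3, 3) else a * b


metrics = error_metrics(lambda a, b: a * b, toy_approx_mul, 2, 2)
print(metrics)
```

In this toy case the single error occurs at the largest correct output, so the relative error stays small even though the absolute error is not zero, which is the effect the RE-based metrics are meant to capture.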
IV. APPROXIMATE COMPRESSORS

Approximate computing aims to reduce the demand for resources by reducing the complexity of the circuitry, allowing some amount of error. Typically, that is reducing the amount of logic gates within a structure. However, as FPGAs do not map the logic to gates, but to SRAM based LUTs, a different approach is necessary to fit FPGA properties in an optimal manner. As LUTs can represent arbitrary logical functions, the specific logical function of any LUT has no influence on the demanded resources. Any FPGA specific approach towards approximate computing needs to consider this fact.

One of the most frequently used basic building blocks for arithmetic structures are compressors. As a basic element of the reduction tree, they are part of typical multiplier designs such as Wallace or Dadda-Tree structures. In signal processing, they can be used as accumulator-stage in multiply-accumulate operations as well. Typically, designers use 3:2 or 4:2 compressors.

In ASIC implementations, such 4:2 cells are typically composed from two full adders. The carry path includes two full adders, as shown in Fig. 1. In a five-input-LUT based FPGA implementation, the carry path dominates the timing behavior. Hence, an approximation that omits the carry may yield significant advantages in terms of speed in FPGA based designs. Therefore, it makes sense to omit the carry path as proposed for 4:2 compressors by [11] and for various other basic adder blocks by [12]. However, we adopted the logical mapping proposed by [11] for ASICs for our FPGA design, as it produces less errors and maps nicely to LUTs.

Fig. 2. Proposed arrangement of PPUs. (Axes: bits 7 to 0 of Factor a and of Factor b; product bit positions 15 to 0.)

TABLE I. TRUTH TABLE FOR APPROXIMATE 4:2 COMPRESSOR.

a  b  c  d    out_h  out_l    Relative Error
0  0  0  0    0      0        0%
0  0  0  1    0      1        0%
0  0  1  0    0      1        0%
0  0  1  1    1      0        0%
0  1  0  0    0      1        0%
0  1  0  1    1      0        0%
0  1  1  0    1      0        0%
0  1  1  1    1      1        0%
1  0  0  0    0      1        0%
1  0  0  1    1      0        0%
1  0  1  0    1      0        0%
1  0  1  1    1      1        0%
1  1  0  0    1      0        0%
1  1  0  1    1      1        0%
1  1  1  0    1      1        0%
1  1  1  1    1      1        25%
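The truth table of the approximate 4:2 compressor (Table I) amounts to counting the ones at the four inputs and saturating the count at three. A minimal Python sketch under that reading follows; the function name is illustrative and the code is a behavioral model, not the LUT configuration itself.

```python
from itertools import product


def approx_compressor_4to2(a, b, c, d):
    """Carry-free approximate 4:2 compressor: input count saturated at 3."""
    total = min(a + b + c + d, 3)
    return (total >> 1) & 1, total & 1  # (out_h, out_l)


rows = []
for a, b, c, d in product((0, 1), repeat=4):
    out_h, out_l = approx_compressor_4to2(a, b, c, d)
    exact = a + b + c + d
    approx = 2 * out_h + out_l
    # Relative error as in Eq. (4): normalize by the correct value (at least 1).
    rel_err = abs(exact - approx) / max(1, exact)
    rows.append(((a, b, c, d), (out_h, out_l), rel_err))

wrong = [row for row in rows if row[2] > 0]
print(len(wrong), wrong)
```

Enumerating all sixteen input patterns this way yields exactly one inexact row, the all-ones input, with a relative error of 25%.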
The four inputs of a 4:2 compressor all have equal weight, each representing a value of one. Hence, the four inputs together can represent numbers from zero to four. As we want to omit the carry path, we need to perform an approximate compression of four inputs to two outputs. With just two output bits remaining, it is only possible to represent numbers with a numerical value of zero to three using a standard unsigned number representation. Therefore, the scenario of all inputs being one cannot be mapped correctly to the outputs without a carry. In contrast, we are able to assign the correct output to all other inputs due to the freely configurable LUTs. Therefore, an error is introduced in only one of the sixteen possible input configurations [11]. Furthermore, this error is introduced in the case producing the highest correct output value. Therefore, the relative error is rather small compared to the relative errors of other approaches (e.g., [5]). Table I represents the truth table of the approximate 4:2 compressor.

V. APPROXIMATE MULTIPLIERS

A. Multiplier implementations

Many approaches towards approximate multiplications aim at optimized compressors or small basic multiplication cells (e.g., [5], [9]). However, the actual multiplier implementation (i.e., the way of composing a multiplier from these basic cells) has an enormous impact on the results. This holds true for resource savings, as well as for the amount of inaccuracy caused by the approximate basic elements. This fact complicates the comparison of different approximate multiplication results, as it is not trivial to distinguish between benefits of the approximation and those of the chosen implementation of the multiplier. Therefore, we implemented some interesting approaches from literature to achieve a realistic quantification of advantages and disadvantages of each approach.

B. Optimized partial product generation for FPGAs

The straightforward way of generating the necessary partial products for a multiplication is

$$PP_{i,j} = A_i \wedge B_j \tag{7}$$

This can be mapped to ASICs very efficiently in the form of an AND gate. In contrast, applying this approach to FPGAs leads to inefficient utilization of resources. Just two AND gates can be mapped to a single LUT when it is reconfigured into two independent two-input LUTs. However, a LUT can represent arbitrary (complex) logical functions. In our approach, we take advantage of this fact to save resources.

We propose a combination of creating partial products with some amount of compression, both joined together and mapped efficiently to LUT based logic. Such a partial product unit (PPU) needs to generate multiple partial products and compress them. The partial product bits generated by one PPU can depend on a maximum of six input bits, as this is the number of input pins of a LUT. However, in this case the LUT can generate only a single output bit. We found it beneficial to configure LUTs into two five-input LUTs with shared inputs. Therefore, a PPU should generate a maximum number of partial product bits out of five input bits. Furthermore, it should
need a minimum (optimally even) number of output bits to represent the pre-compressed partial product.

We found that a combination of two different types of PPUs is optimal regarding these requirements. We call them PPU type A and B. Fig. 2 depicts this combination of different PPUs. In our design, each PPU has five inputs. For example, the highlighted PPU A has the inputs a0, a1, b0, b1, and b2. The highlighted PPU B has the inputs a0, a1, b2, b3, and b4. Here, ai and bi denote the i-th bit of the respective input factors a and b of the multiplication. Each PPU derives five 'conventional' partial product bits from these inputs. These internal 'conventional' partial product bits have different numeric values, with some of them appearing multiple times. Hence, they form a redundant number representation. The numeric value of this internal redundant partial product is compressed to a four bit non-redundant number at the output of the PPU. The specific configuration of each type of PPU concerning this matter is documented in Tables III and IV. Note that the internal 'conventional' partial products do not exist as real signals within the FPGA. In fact, they are just a theoretical concept to design and understand the logical function mapped to the LUTs. We decided to use just five internal 'conventional' partial products in each PPU even though it would be possible to derive six of them from the five input bits. However, the representation of their numeric range would demand more PPU output bits, resulting in a worse compression and higher resource consumption in terms of LUTs.

TABLE III. OUTPUT BITS OF PPU A COMPARED WITH 'CONVENTIONAL' PARTIAL PRODUCT GENERATION FOR THE SAME INPUT BITS.

Numeric value    Number of 'conventional'    Number of PPU A
                 partial product bits        output bits
2^0              1                           1
2^1              2                           1
2^2              2                           1
2^3              0                           1
Numeric range    [0,13]                      [0,15]

TABLE IV. OUTPUT BITS OF PPU B COMPARED WITH 'CONVENTIONAL' PARTIAL PRODUCT GENERATION FOR THE SAME INPUT BITS.

Numeric value    Number of 'conventional'    Number of PPU B
                 partial product bits        output bits
2^0              2                           1
2^1              2                           1
2^2              1                           1
2^3              0                           1
Numeric range    [0,10]                      [0,15]

One of the most important characteristics of an approach towards optimized partial product generation is its asymptotic behavior with respect to delay, as it determines its suitability for longer bit-width multipliers. The different PPUs are arranged in a regular structure. Therefore, bigger multipliers can be constructed by just extending the pattern shown in Fig. 2. As there is no data-path interconnection between the individual PPUs, the asymptotic delay is O(1), which makes our approach perfectly suitable for the construction of bigger multipliers.

Furthermore, Fig. 2 depicts that we need to create some partial product bits individually, i.e., outside the PPUs, depending on the specific bit-width of the multiplier. These individual partial product bits are generated by feeding both input bits into a LUT configured as an AND gate. The small overhead added this way is almost negligible. For example, in our 8-bit case it adds two additional LUTs, each configured as two individual two-input AND gates. As shown in Table II, our approach saves 19% of the resources (26 instead of 32 LUTs) compared to a traditional AND-gate approach. It not only saves resources during the partial product creation itself; due to the integrated partial compression, it also demands only 81% of the partial product bits (52 instead of 64) of the traditional AND-gate approach. Therefore, it allows for further resource savings in the compressor tree following the partial product generation.

TABLE II. REQUIRED RESOURCES FOR PARTIAL PRODUCT GENERATION (8 BIT).

                           Number of    LUTs for     PPUs    LUTs for    Overall    Number of partial
                           AND gates    AND gates            PPUs        LUTs       product bits
'Conventional' approach    64           32           0       0           32         64
Proposed PPU approach      4            2            12      24          26         52

C. Statistical optimization of the compression tree

As described in Section IV, our approximate compressor works accurately in 15 out of 16 cases. As only the case with all inputs being 'one' introduces an error, it should be avoided if possible. In other words, the probability of that case should be minimized. As shown in Fig. 3, there is some degree of freedom in the choice of which PPU outputs are fed into which compressor. The PPU output signals have different probabilities of having the value 'one'. Hence, the assignment of PPU outputs to the compressors should be made in a way that minimizes the probability of all inputs of a compressor being 'one'. The precise PPU output probability depends on the multiplier inputs, too. However, the statistics concerning this cannot be determined without domain knowledge and exponential simulation times. Therefore, we determined the output probabilities of the PPUs with a behavioral simulation assuming uniformly distributed PPU inputs. Results are shown in Tables V and VI. Based on this data, we assigned the PPU outputs to the compressor inputs in such a way that the probability of all inputs being 'one' is minimized for the approximate compressors. With this method we achieved a significant improvement of the quality of the results of our approximate multiplier.

VI. ANALYSIS AND RESULTS

To evaluate our approach we compared it to the state of the art. As image processing is one of the most important applications of approximate computing, we chose a bit width of 8 bit, typically used in such applications, for

TABLE V. P(Oi = 1) FOR THE OUTPUT BITS OF PPU A.

Output bit Oi    P(Oi = 1)
0                25%
1                37.5%
2                37.5%
3                9.375%

TABLE VI. P(Oi = 1) FOR THE OUTPUT BITS OF PPU B.

Output bit Oi    P(Oi = 1)
0                37.5%
1                37.5%
2                21.875%
3                6.25%
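The PPU behavior summarized in Tables III to VI can be modeled in a few lines of Python. The text does not state which five of the six possible partial products each PPU uses, so the selections `PPU_A` and `PPU_B` below are assumptions, chosen to match the per-weight counts of Tables III and IV. Under these assumptions, an exhaustive enumeration of the 32 input patterns gives the numeric ranges [0,13] and [0,15]-bounded [0,10] of Tables III and IV and the output bit probabilities of Tables V and VI.

```python
from itertools import product

# Assumed partial product selection per PPU: (index into x, index into y,
# relative weight). x holds the two bits of factor a, y the three bits of
# factor b that feed the PPU. These selections are not given in the text.
PPU_A = [(0, 0, 1), (0, 1, 2), (1, 0, 2), (0, 2, 4), (1, 1, 4)]
PPU_B = [(0, 1, 1), (1, 0, 1), (0, 2, 2), (1, 1, 2), (1, 2, 4)]


def ppu_value(terms, x, y):
    """Pre-compressed PPU output: weighted sum of AND terms, cf. Eq. (7)."""
    return sum(weight * (x[i] & y[j]) for i, j, weight in terms)


def all_values(terms):
    """PPU output for each of the 2^5 input patterns."""
    return [ppu_value(terms, bits[:2], bits[2:])
            for bits in product((0, 1), repeat=5)]


def value_range(terms):
    values = all_values(terms)
    return min(values), max(values)


def output_bit_probabilities(terms):
    """P(O_k = 1) for k = 0..3, assuming uniformly distributed inputs."""
    values = all_values(terms)
    return [sum((v >> k) & 1 for v in values) / len(values) for k in range(4)]


for name, terms in (("PPU A", PPU_A), ("PPU B", PPU_B)):
    print(name, value_range(terms), output_bit_probabilities(terms))
```

Because every output fits into four bits, the whole function of one PPU can be stored as a 32-entry lookup table with a four bit result, which is the property the LUT mapping relies on.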
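The statistical assignment of Section V-C can be illustrated with a small brute-force search. This is only a sketch under strong assumptions: the signals are treated as statistically independent, so the probability of all four compressor inputs being one is the product of the individual probabilities, and the `column` list simply reuses the eight values of Tables V and VI as sample data. Which signals actually share a column follows from Fig. 3 and is not modeled here.

```python
from itertools import combinations
from math import prod


def all_ones_probability(probs):
    """P(all compressor inputs are one), assuming independent inputs."""
    return prod(probs)


def best_split(probs):
    """Split eight signals between two 4:2 compressors so that the summed
    all-ones probability is minimal. Returns (cost, group_0, group_1)."""
    indices = range(len(probs))
    best = None
    for group_0 in combinations(indices, 4):
        if 0 not in group_0:  # skip mirrored duplicates of the same split
            continue
        group_1 = tuple(i for i in indices if i not in group_0)
        cost = (all_ones_probability([probs[i] for i in group_0])
                + all_ones_probability([probs[i] for i in group_1]))
        if best is None or cost < best[0]:
            best = (cost, group_0, group_1)
    return best


# Hypothetical sample column built from the probabilities of Tables V and VI.
column = [0.25, 0.375, 0.375, 0.09375, 0.375, 0.375, 0.21875, 0.0625]
naive = all_ones_probability(column[:4]) + all_ones_probability(column[4:])
cost, group_0, group_1 = best_split(column)
print(naive, cost, group_0, group_1)
```

The search is exhaustive and therefore only practical for a handful of signals per column; it is meant to show the objective, not the method used for the actual design.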
[Figure: legend entries "PPU_A output bits", "PPU_B output bits", "AND" output bit, and "Compressor assignment"; x-axis: bit index 15 to 0.]

compared to state of the art approaches that allow for a similar saving in terms of resources, our approach delivers the best results in terms of the mean relative error. Therefore, our design seems to be most suitable for applications that have a limited tolerance regarding the introduced error. Fig. 4 depicts the Pareto comparison regarding this metric. Note that this comparison also includes designs that are not meant to be used on FPGAs, but on ASICs. Besides averaging metrics, the worst-case behavior of an approximate circuit is of great interest as well. Unfortunately, some authors do not specify it in their publications. Compared to the known data, our approach outperforms the state of the art by a factor of two. In terms of hardware characteristics, the delay is of huge importance, too. Fig. 5 illustrates that our design outperforms the state of the art in this regard.

Fig. 5. Comparison of this work with state-of-the-art regarding its delay. Data for [3], [7], [13]–[15] taken from [7]. (x-axis: Delay [ns], 0 to 8; logarithmic y-axis, 10^-3 to 10^0; legend: [7] I 1-II 4, [9] Cc, Ca, [8] T1-T8, [13], [14], [3], [15], [10], This work.)

Our approach is perfectly suitable for the construction of multipliers with larger bit widths. As discussed in Section V-B, our partial product generation has a constant asymptotic delay, while the compression tree scales with O(log(n)). Therefore, larger designs exhibit low latencies as well. For example, a 16x16 bit multiplier design following our approach has a delay of just 3.249 ns and consumes 220 LUTs on a Virtex-7 FPGA. In contrast, the 16 bit extension of [9] (design Ca) consumes 245 LUTs and causes a delay of 4.547 ns when synthesized with the same tool chain for the same target platform. However, a comprehensive discussion and comparison of 16 bit designs is out of scope due to limited space and is planned as future work.

VII. APPLICATION EXAMPLE: APPROXIMATE IMAGE PROCESSING CORE

To illustrate the usability of our approximate multipliers and compressors, we implemented a convolutional image processing core, as it is used for edge detection, blurring, and many other image processing tasks. More specifically, we chose a filter kernel that processes a 3x3 pixel window per clock cycle. However, results scale to bigger kernels very well. We applied different approximate computing approaches to the multipliers and compressors within the core to compare their impact on image quality, required logic resources, and delay. As baseline, we chose a Wallace tree multiplier utilizing exact 4:2 compressors and a carry save (CS) compression tree for the accumulation. The results shown in Table VIII prove that our approach works not only in theory but also in real applications. Note the tremendous resource savings of about 40% as well as a 16% speedup when our approximate compressors are applied not only to the multipliers but in the accumulate stages of the filter kernel, too. Fig. 6 shows
TABLE VII. COMPARISON WITH STATE-OF-THE-ART 8-BIT MULTIPLIERS. RESULTS ARE DETERMINED BY EXHAUSTIVE SIMULATION OF ALL POSSIBLE INPUT VECTORS OR FROM LITERATURE.

Reference              Approach                                 Area [LUTs]   Delay [ns]   MRE      WCRE     NMED [10^2]
Xilinx Coregen         exact multiplier                         98            3.447        -        -        -
Ullah [9] Design Ca    elementary multiplier blocks             57            3.167        0.0029   0.1905   0.080
Ullah [9] Design Cc    elementary multiplier blocks             56            2.372        0.1294   0.9675   2.430
[7] AM II 4 (1)        approximate compressors                  48            4.330        0.0140   N/A      0.190
[7] AM I 4 (1)         approximate compressors                  55            4.440        0.0117   N/A      0.160
[8] T1                 elementary multiplier blocks             57            3.132        0.0017   0.1633   0.055
[10] mul8u 158B        ML supported choice from ASIC-Library    73            3.967        0.0036   0.4000   0.023
[10] mul8u 18G6        ML supported choice from ASIC-Library    23            2.732        0.1191   3.0000   0.665
This Work              LUT-optimized compressors                53            2.391        0.0014   0.0816   0.068

(1) Data for Spartan-6 FPGA
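The comparison in Table VII can be checked with a short Pareto filter. The sketch below copies the area and MRE columns of the table and assumes that area in LUTs and MRE are the two objectives to be minimized; the exact axes of Fig. 4 are not recoverable from the text, and the helper name `pareto_front` is illustrative. Note that the two entries from [7] were obtained on a Spartan-6 FPGA, so the comparison across rows is only indicative.

```python
# (design, area in LUTs, MRE) copied from Table VII. The exact Coregen
# multiplier is omitted because it has no error entry.
designs = [
    ("Ullah [9] Ca", 57, 0.0029),
    ("Ullah [9] Cc", 56, 0.1294),
    ("[7] AM II 4", 48, 0.0140),
    ("[7] AM I 4", 55, 0.0117),
    ("[8] T1", 57, 0.0017),
    ("[10] mul8u 158B", 73, 0.0036),
    ("[10] mul8u 18G6", 23, 0.1191),
    ("This Work", 53, 0.0014),
]


def pareto_front(points):
    """Names of all designs not dominated in (area, MRE), both minimized."""
    front = []
    for name, area, mre in points:
        dominated = any(
            a <= area and m <= mre and (a < area or m < mre)
            for _, a, m in points
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(designs)
print(front)

# Worst-case comparison: lowest WCRE reported for another design in
# Table VII ([8] T1) relative to the WCRE of this work.
wcre_ratio = 0.1633 / 0.0816
print(round(wcre_ratio, 2))
```

With these numbers, the proposed design is on the front as the entry with the lowest MRE, and the WCRE ratio agrees with the stated factor of two.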
TABLE VIII. RESULTS FOR APPLICATION OF DIFFERENT APPROXIMATE TECHNIQUES TO A 3X3 CONVOLUTIONAL FILTER KERNEL.
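The structure of such a 3x3 filter kernel can be sketched in software with a pluggable multiplier, which makes it easy to compare an exact and an approximate multiplication on the same image. This is a behavioral illustration only: `truncating_multiply` is a hypothetical stand-in that clears the low product bits, not the multiplier proposed in this paper, and the accumulation here is exact.

```python
def convolve3x3(image, kernel, multiply=lambda p, k: p * k):
    """Valid-mode 3x3 filter (no kernel flip) with a pluggable multiplier."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = 0
            # Nine multiplications per window, followed by accumulation.
            for dr in range(3):
                for dc in range(3):
                    acc += multiply(image[r + dr][c + dc], kernel[dr][dc])
            out_row.append(acc)
        result.append(out_row)
    return result


def truncating_multiply(p, k, dropped_bits=2):
    # Hypothetical approximate multiplier: clears the lowest product bits.
    return ((p * k) >> dropped_bits) << dropped_bits


image = [[10] * 4 for _ in range(4)]
box_kernel = [[1] * 3 for _ in range(3)]

exact_out = convolve3x3(image, box_kernel)
approx_out = convolve3x3(image, box_kernel, truncating_multiply)
print(exact_out, approx_out)
```

Swapping in a bit-accurate model of an approximate multiplier for `multiply` allows the image quality impact to be estimated before synthesis.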