Approximate Arithmetic Circuits A Survey Characterization and Recent Applications

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 28

Approximate Arithmetic

Circuits: A Survey,
Characterization, and
Recent Applications
This article provides a comprehensive evaluation of recently proposed approximate
arithmetic circuits mainly including adders, multipliers, and dividers; they are
compared under different design constraints and applied to image processing
and deep learning applications.
By H ONGLAN J IANG , Member IEEE, F RANCISCO JAVIER H ERNANDEZ S ANTIAGO, H AI M O,
L EIBO L IU , Senior Member IEEE, AND J IE H AN , Senior Member IEEE

ABSTRACT | Approximate computing has emerged as a new tions for performance and area. The error and circuit charac-
paradigm for high-performance and energy-efficient design of teristics are then generalized for different classes of designs.
circuits and systems. For the many approximate arithmetic The applications of these circuits in image processing and
circuits proposed, it has become critical to understand a deep neural networks indicate that the circuits with lower error
design or approximation technique for a specific application rates or error biases perform better in simple computations,
to improve performance and energy efficiency with a minimal such as the sum of products, whereas more complex accumula-
loss in accuracy. This article aims to provide a comprehen- tive computations that involve multiple matrix multiplications
sive survey and a comparative evaluation of recently devel- and convolutions are vulnerable to single-sided errors that lead
oped approximate arithmetic circuits under different design to a large error bias in the computed result. Such complex com-
constraints. Specifically, approximate adders, multipliers, and putations are more sensitive to errors in addition than those in
dividers are synthesized and characterized under optimiza- multiplication, so a larger approximation can be tolerated in
multipliers than in adders. The use of approximate arithmetic
circuits can improve the quality of image processing and deep
Manuscript received January 30, 2020; revised May 18, 2020; accepted June 17,
2020. Date of publication August 12, 2020; date of current version learning in addition to the benefits in performance and power
November 20, 2020. This work was supported by the China Postdoctoral Science
consumption for these applications.
Foundation under Grant 2019M650679, the National Natural Science Foundation
of China (Grant No. 61834002), the National Key R&D Program of China (Grant
No. 2018YFB2202101), the National Science and Technology Major Project of the
KEYWORDS | Adders; approximate computing (AC); arith-
Ministry of Science and Technology of China (Grant No. 2018ZX01027101-002), metic circuits; deep neural networks (DNNs); dividers; image
the National Council of Science and Technology (CONACYT) and Mexican
Foundation for Education, Technology and Science (FUNED) and the Natural
processing; multipliers.
Sciences and Engineering Research Council (NSERC) of Canada under Project
RES0025211. (Corresponding authors: Leibo Liu; Jie Han.) I. I N T R O D U C T I O N
Honglan Jiang and Hai Mo are with the Institute of Microelectronics, Tsinghua With increasing importance of big data processing and
University, Beijing 100084, China (e-mail: jianghonglan@mails.tsinghua.edu.cn;
moh19@mails.tsinghua.edu.cn). artificial intelligence, an unprecedented challenge has
Francisco Javier Hernandez Santiago was with the Department of Electrical arisen due to the massive amounts of data and com-
and Computer Engineering, University of Alberta, Edmonton, AB T6G 1H9,
Canada. He is now with Intel, Zapopan 45017, Mexico (e-mail: fh@ualberta.ca).
plex computations required in these applications. Energy-
Leibo Liu is with the Institute of Microelectronics, Beijing National Research efficient and high-performance general-purpose compute
Center for Information Science and Technology, Tsinghua University, Beijing
100084, China (e-mail: liulb@tsinghua.edu.cn).
engines, as well as application-specific integrated circuits,
Jie Han is with the Department of Electrical and Computer Engineering, are highly demanded to facilitate the development of these
University of Alberta, Edmonton, AB T6G 1H9, Canada (e-mail:
new technologies. Meanwhile, exact or high-precision
jhan8@ualberta.ca).
computation is not always necessary. Instead, some small
Digital Object Identifier 10.1109/JPROC.2020.3006451 errors can compensate each other or will not have a signif-

0018-9219 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

2108 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

icant effect in the computed results. Hence, approximate The approximation is obtained by accumulating some most
computing (AC) has emerged as a new approach to energy- significant partial products (PPs), along with a correction
efficient design, as well as to increasing the performance of constant obtained by a statistical analysis as an approxima-
a computing system, at a limited loss in accuracy [1]. tion for the sum of the least significant PPs [12], [13].
In 2004, the concept of approximation was applied to
A. Motivation
adders and Booth multipliers in a superscalar processor
In the past few decades, the feature size of transis- to increase the clock frequency of a microprocessor [14].
tors has decreased exponentially, as governed by Moore’s The approximate adder is designed by observing that
law [2], which has resulted in a continuous improvement the effective carry chain of an adder is much shorter
in the performance and power efficiency of integrated cir- than the full carry chain for most inputs. On average,
cuits. However, at the nanometer scale, the supply voltage the longest carry chain in an n-bit addition is not longer
cannot be further reduced, which has led to a significant than the binary logarithm of n, or log (n), as discussed by
increase in power density. Thus, a percentage of transistors Burks et al. [15]. Thus, a carry in an n-bit adder is obtained
in an integrated circuit must be powered off to alleviate the from its previous k input bits rather than from all of the
thermal issues; the powered-off transistors are called “dark previous bits (so, k < n). Compared to an accurate adder
silicon” [3]. A study has shown that the area of “dark sili- (AccuA), the critical path of this approximate adder is
con” may reach up to more than 50% for an 8-nm technol- significantly shorter. The approximate adder was suggested
ogy [4]. This indicates an increasing challenge to improve for use in the generation of the 3× multiplicand required
circuit performance and power efficiency by using con- in the radix-8 encoding algorithm for a Booth multiplier.
ventional technologies. New design methodologies have Since around 2008, approximate adders and multi-
been investigated to address this issue, including multicore pliers have received significant attention, resulting in
architectures, heterogeneous integration, and AC [5]. various designs; the early ones include the almost cor-
AC is driven by the observation that many applica- rect adder (ACA) [16], the error-tolerant adder [17],
tions, such as multimedia, recognition, classification, and the lower-part-OR adder (LOA) [18], the equal segmenta-
machine learning, can tolerate the occurrence of some tion adder (ESA) [19], the approximate mirror adder [20],
errors. Due to the perceptual limitations of humans, some the broken-array multiplier (BAM) [18], the error-tolerant
errors do not impose noticeable degradation in the output multiplier (ETM) [21], and the underdesigned multiplier
quality of image, audio, and video processing. Moreover, (UDM) [22]. In addition, logic synthesis methods have
the external input data to a digital system are usually been developed to reduce the circuit area and power
noisy and quantized, so there is already a limit in the dissipation for a given error constraint [23], [24]. Auto-
precision or accuracy in representing useful information. mated processes have also been considered to generate
Probability-based computing, such as stochastic comput- approximate adders and multipliers [25], [26]. Moreover,
ing, performs arithmetic functions on random binary bit various computing and memory architectures have been
streams using simple logic gates [6], where trivial errors proposed to support AC applications [27], [28]. In par-
do not result in a significantly different result. Lastly, many ticular, a programming language can support approxi-
applications including machine learning are based on iter- mate data types for low-power computing [29]. Recently,
ative refinement. This process can attenuate or compensate approximate designs include those for dividers [30]–[33],
the effects of insignificant errors [7]. AC has thus become multiply-and-accumulate (MAC) units [34], squaring cir-
a potentially promising technique to benefit a variety of cuits [35]–[38], square root circuits [39], and a coordinate
error-tolerant applications. rotation digital computer (CORDIC) [40].
B. Development History of Approximate
Arithmetic Circuits C. Applications of AC
Since the 1960s, the Newton–Raphson algorithm has AC has been considered for many applications with error
been utilized for computing an approximate quotient to resilience, such as image processing and machine learning,
speed up division [8], followed by many other functional for a higher performance and energy efficiency [41]–[48].
iteration-based algorithms such as Goldschmidt [9]. Mul- The approximation techniques at algorithm,
tiple precision dividers can, therefore, be obtained by architecture, and circuit levels have been synergistically
terminating the computing process at different stages [10]. applied in the design of an energy-efficient programmable
Also in the early 1960s, Mitchell [11] proposed a vector processor for recognition and data mining [41].
logarithm-based algorithm for multiplication and divi- This design achieves an energy reduction of 16.7%–56.5%
sion. Although specific approximation techniques aimed compared to a conventional one without any quality
at arithmetic circuits were not significantly developed in loss, and 50.0%–95.0% when the output quality is
the following few decades, some straightforward approx- insignificantly reduced.
imation (or rounding) techniques have gradually been As basic image processing applications, sharpening,
considered, for example, in truncation-based multipliers to smoothing, and multiplication have been used to assess
obtain an output with the same bit width as the inputs. This the quality of approximate adders and unsigned multipli-
type of multipliers is referred to as a fixed-width multiplier. ers [49]–[51]. Image compression algorithms have been

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2109
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

considered for evaluating approximate signed multipli- The approximate adders, multipliers, and dividers are
ers [43], [44]. then presented, synthesized, and comparatively evaluated
Approximate adders and multipliers have been inte- in Sections III–V, respectively. Section VI presents the
grated in deep learning accelerators for reducing delay and applications. Finally, Section VII concludes this article and
saving energy [42], [45]–[47], [52]. In [42], truncated 16- discusses current challenges and future prospects.
bit multipliers with constant error compensation are used
in lieu of 32-bit floating-point multipliers in an accelera- II. B A C K G R O U N D
tor for large-scale convolutional neural networks (CNNs) A. Design Methodologies
and deep neural networks (DNNs). Up to 83.6% and An approximate arithmetic circuit can be obtained by
86.4% reductions in area and power consumption have, using the voltage overscaling (VOS) technique [58]–[60],
respectively, been achieved. Designs with various error and redesigning a logic circuit into an approximate one [51],
circuit characteristics have also been exploited in reconfig- and using a simplification algorithm [11].
urable systems to enhance the reconfiguration flexibility. Using VOS, a lower supply voltage is provided to effi-
In [45] and [46], approximate adders and multipliers ciently reduce the power consumption of a circuit, without
with various levels of accuracy are integrated in a coarse- having to change the circuit structure. However, a reduced
grained reconfigurable array for different configurations voltage increases the critical path delay, possibly result-
determined on-the-fly by the application requirements. ing in timing errors [58]. Thus, the output can be erro-
In this way, different performance and energy improve- neous due to the violated timing constraint. Moreover,
ments can be obtained by trading off various levels of the error characteristics of such an approximate operation
processing quality. are nondeterministic, as affected by parametric variations
In the implementation of a state-of-the-art wireline [61]. When the most significant bits (MSBs) are affected,
transceiver, an approximate multiplier is used for low- the output error can be large [62].
power digital signal processing [48]. Compared to the More commonly, an approximate design is derived from
accurate design, power is reduced by 40% and the max- an accurate circuit by modifying, removing, or adding
imum performance is improved by 25%. some elements. For instance, some transistors in a mir-
ror adder are removed to implement a low-power and
D. Scope of This Article high-speed full adder (FA) [20]. In addition, an approx-
Recent research on AC has spanned from algorithms to imate circuit can be obtained by simplifying the truth
circuits and programming languages [51], [53]–[56]. This table or Karnaugh map (K-map) [22], [63]. This method
article aims to provide an overview of approximate arith- results in circuits with deterministic error characteristics.
metic circuits, various design methodologies, an evaluation Due to the same structure and design principles, however,
and characterization of approximate adders, multipliers, the hardware improvements are hardly significant espe-
and dividers with respect to accuracy and circuit measure- cially when a high accuracy is required.
ments. Three image processing applications and a CNN Compared to addition and subtraction, multiplication,
are implemented to show the capability and performance division, and square root computation are more complex.
advantage of approximate arithmetic circuits. Therefore, their functions can be converted into some
Some preliminary results have been presented in [51] simpler operations. Mitchell’s binary logarithm-based algo-
and [57]; however, this article presents the following new, rithm enables the utilization of adders and subtractors
distinctive contributions. Instead of undergoing a generic to implement multipliers and dividers, respectively [11].
synthesis process, approximate circuits are synthesized It is the origin of most current simplification algorithms
and optimized for delay and area, respectively. The for approximate multiplier and divider designs [30],
results can be used to guide the selection of appropriate [64], in parallel with the functional iteration-based algo-
designs for an application-specific requirement (e.g., rithms for divider designs [10]. By using algorithmic
high performance or low power). In addition, hardware simplification, the performance and energy efficiency
efficiency and accuracy are jointly considered to show of an arithmetic circuit can be significantly improved
the hardware improvements at the cost of a certain loss because of the simplification in the basic circuit struc-
of accuracy. Furthermore, a larger class of approximate ture. Nevertheless, the accuracy of such a design is rel-
adders, multipliers, and dividers including many recent atively low, and many peripheral circuits are required to
designs are evaluated; in particular, approximate dividers achieve a high accuracy, which may limit the hardware
are extensively analyzed and characterized in detail. efficiency.
Finally, image compression and a DNN are implemented Practically, several approximation techniques are
for assessing the quality of approximate adders and often simultaneously utilized in a hybrid approximate
signed multipliers to obtain insights into the application of circuit [65].
approximate arithmetic circuits in image processing and
artificial intelligence systems. B. Evaluation Metrics
This article is organized as follows. Section II briefly Both error characteristics and circuit measurements
reviews the design methodologies and evaluation metrics. need to be considered for approximate circuits.

2110 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

1) Error Characteristics: Various design metrics and the average error divided by the maximum output of the
analytical approaches are useful for the evaluation of accurate design.
approximate arithmetic circuits [49], [66]–[73]. Monte Last but not least, the worst case error of an approximate
Carlo simulation is widely employed to acquire data for circuit reflects the largest ED possible. Generally, it is
analysis. The following metrics have been used to assess normalized by the maximum accurate result.
the error characteristics. 2) Circuit Measurements: The basic circuit metrics
Two basic error metrics are the error rate (ER) and error include the critical path delay, power dissipation, and area.
distance (ED). The ER indicates the probability that an Some compound metrics include the power-delay product
erroneous result is produced. The ED shows the arithmetic (PDP), area-delay product (ADP), and energy-delay prod-
difference between the approximate and accurate results. uct (EDP).
Given the approximate and accurate results M  and M, Electronic design automation (EDA) tools are indis-
respectively, the ED is calculated by ED = |M  − M|. pensable for circuit implementation and measurement.
Additionally, the relative ED (RED) shows the relative In general, the circuit is measured based on different
difference with respect to the accurate result, given by process technologies and component libraries, for exam-
RED = |ED/M|. ED and RED reveal two important features ple, the 45-nm open source NanGate [74], the 45-nm pre-
of an approximate design. For two input combinations dictive technology model (PTM) [75], 28-nm CMOS, and
leading to the same ED, the one that produces a smaller 15-nm FinFET models. The configurations for the supply
accurate result, M, would result in a larger RED. As the voltage, temperature, and optimization options also affect
average values of all obtained EDs and REDs, the mean the simulation results. For a fair comparison, the same
ED (MED) and mean relative ED (MRED) are often used configuration should be used for different designs.
to assess the accuracy of an approximate design. They are Conventionally, high performance and power efficiency
given by are, respectively, pursued as independent design consid-
erations. For instance, to cope with aging-induced timing

N errors, approximate adders and multipliers are developed
MED = EDi · P(EDi ) (1) for a high speed [76]. High-performance arithmetic cir-
i=1 cuits are also preferred in real-time machine learning
systems [77]. For mobile and embedded devices, however,
and power efficiency is key to the extended use of a limited
battery life.

N
In this article, approximate circuits are evaluated
MRED = REDi · P(REDi ) (2)
for maximizing performance (through delay) or mini-
i=1
mizing power (through area). Specifically, approximate
designs are implemented in hardware description lan-
where N is the total number of considered input combi-
guages (HDLs) and synthesized using the Synopsys Design
nations for a circuit, and EDi and REDi are the ED and
Compiler (DC, 2011.09 release) in ST’s 28-nm CMOS
RED for the i th input combination, respectively. P(EDi )
technology, with a supply voltage of 1.0 V at a temperature
and P(REDi ) are the probabilities that EDi and REDi occur,
of 25 ◦ C. To compare speed and power, the approximate
which are also the probability of the i th input combination.
circuits are synthesized under different constraints. The
The NMED is defined as the normalization of MED by
critical path delay of a design is set to the smallest
the maximum output of the accurate design, useful in
value without incurring a timing violation for the delay-
comparing the error magnitudes of approximate designs
optimized synthesis, whereas the area is minimized for
of different sizes.
the area-optimized synthesis. The DesignWare library and
The mean squared error (MSE) and root-mean-square
“ultra compile” are used in the synthesis for optimization.
error (RMSE) are also widely used to measure the arith-
The critical path delay and area are reported by the Synop-
metic error magnitude. They are computed by
sys DC. Power dissipation is measured by the PrimeTime-
PX tool with ten million random input combinations.

N
As widely used EDA tools in industry and academia, Syn-
MSE = ED2i · P(EDi ) (3)
opsys DC and PrimeTime-PX provide estimations of timing,
i=1
area, and power dissipation with a prediction error of less
than 10% compared with physical implementations [78].
and
3) Comprehensive Measurements: To provide an overall
√ evaluation of an approximate circuit, both error and circuit
RMSE = MSE. (4) characteristics must be considered. Several figures of merit
(FOMs) have been developed by combining some error and
In addition, the error bias is given by the average error circuit metrics in an analytical form [79], [80]. However,
that is the mean value of all possible errors (M  − M). these FOMs are heuristic-based and lead to different com-
The normalized average error is commonly considered as parison results. In this work, therefore, the delay, power,

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2111
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

and PDP of approximate circuits are directly compared


with respect to their ERs, NMEDs, and MREDs.

III. A P P R O X I M A T E A D D E R S
A. Preliminaries
An adder performs the addition of two binary numbers
and is one of the fundamental arithmetic circuits in a
computer. Two basic designs are the ripple-carry adder
(RCA) and the carry lookahead adder (CLA). An n-bit RCA
consists of n FAs connected in series, each of which gen-
erates a sum (si ) and a carry-out (ci+1 ) by implementing
si = ai ⊕ bi ⊕ ci and ci+1 = ai bi + ai ci + bi ci , where ai
and bi are the i th least significant bits (LSBs) of the two
inputs, and ci is the carry-in, i = 0, 1, . . . , n − 1. In an Fig. 1. Approximate speculative adder.

n-bit RCA, the carry of each FA propagates to the next


FA, thus the delay and circuit size increase proportionally
with n, denoted by O (n). An n-bit CLA consists of n units 1) Speculative Adders: As an early scheme, a speculative
in parallel; each unit produces the signals of a generate design leverages the fact that the effective carry chain of
(gi = ai bi ), a propagate ( pi = ai ⊕ bi ), and a sum, an n-bit adder is much shorter than n in most cases [14].
where the former two signals are used for generating the Thus, an n-bit speculative adder uses the previous k bits
lookahead carries. In a CLA, the carry is computed by (k < n) to predict the carry for computing each sum bit,
ci+1 = gi + pi ci = gi + pi (gi−1 + pi−1 ci−1 ) = · · · = as shown in Fig. 1. In this way, the critical path delay is
 i i
i
j =0 g j k= j +1 pk + c0 k=0 pk . The delay of a CLA is reduced to O (log (k)) (for a parallel implementation, such
approximately logarithmic in n or O (log (n)), which is as a CLA, the same below). Compared to the design in [14],
significantly shorter than the delay of an RCA. However, the hardware overhead is reduced in the ACA by sharing
a CLA requires a larger circuit area [in O (n log (n))], so it some components among the subcarry generators [16].
incurs a higher power dissipation.
2) Segmented Adders: A segmented adder is imple-
For an adder with a width equal to or larger than
mented by several parallel subadder blocks with an inde-
32 bits, the simple carry lookahead structure of CLA is
pendent carry-in [17], [19], [91], [92]. Hence, the carry
not very efficient due to the large fan-in and fan-out of
propagation chain is truncated into shorter segments.
the constituent gates that lower the speed and increase the
Fig. 2 shows a basic structure for many segmented adders.
circuit area and power consumption. Thus, multiple levels
As the simplest design, an n-bit ESA uses n/k k-bit
of lookahead structures have been proposed to construct a
subadders without any carry-in [19]. Different from ACA,
large-width adder, which is usually referred to as a parallel-
the input bits used for carry computation do not overlap in
prefix adder. The parallel-prefix adders exploit the fact that
ESA; thus, for the same k, its hardware cost is significantly
the carry signals in a CLA can be generated by grouping
lower than ACA.
gi and pi in various ways. By varying the group size and
In an n-bit accuracy-configurable approximate adder
the connection pattern, many parallel-prefix adders have
(ACAA), n/k − 1 2k-bit subadders are utilized to add
been designed to improve the speed or reduce the circuit
2k consecutive bits without carry inputs and k bits are
area, including the Kogge–Stone adder [81], the Ladner–
overlapped between two neighboring subadders [91]. The
Fischer adder [82], the Ling adder [83], the Brent–Kung
accuracy of ACAA can be configured at runtime by chang-
adder [84], and the Han–Carlson adder [85]. Another type
ing the bit width of the subadders. The generic accuracy
of adders uses addition blocks of variable sizes, including
configurable adder (GeAr) generalizes the structure of
the carry-select adder [86], the carry-skip adder [87],
ACAA by varying the number of overlapped bits used for
the conditional-sum adder [88], and the carry-increment
carry prediction [93], whereas the quality-area optimal
adder [89]. The architectures and characteristics of these
adders are discussed in [90].

B. Review
Conventional design methodology to accelerate an
adder often comes with a cost in circuit area and power
dissipation. However, approximate adders tradeoff accu-
racy for an overall improvement in hardware efficiency.
Based on the approximation schemes to reduce the critical Fig. 2. Basic structure of a segmented adder. ai,h−1:0 and
path and hardware complexity, approximate adders are bi,h−1:0 are the h-bit inputs for the segment i. The inputs can be
classified into four categories. overlapped between neighboring segments.

2112 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

low-latency approximate adder (QuAd) further utilizes


subadders of variable width [94].
An n-bit error-tolerant adder type II (ETAII) consists of
n/k k-bit carry generators that are connected in parallel
with the k-bit sum generators [17]. For the same k, ETAII
utilizes a carry generator to predict the carry for the next
sum generator, so it is more accurate than ESA and ACA.
Fig. 4. Approximate carry-select adder with carry-in selection.
However, the circuit of ETAII is more complex than that of
ESA, and its delay is larger due to the longer 2k-bit critical
path. For a fixed k, ETAII uses the same carry propagation
path as ACAA for each sum, so they share the same error
characteristics. block (see Fig. 3). The SCSA, ETAII, and ACAA achieve
The dithering adder uses a more significant (accurate) the same accuracy for the same value of k due to the
subadder and a less significant subadder with upper and identical carry predict function. In the CCA, the Seli of
lower bounding modules [68]. An additional control signal the multiplexer is determined by the propagate signals
is used as the carry-in of the more significant (accurate) in the current and previous blocks. The carry prediction
subadder, which is also used to select the sum output of of the CCA depends not only on its LSBs, but also on the
the less significant subadder. To reduce the error due to the more significant bits.
ignored carry inputs, an error control and compensation In Fig. 4, each block consists of a carry generator
method is developed to tradeoff computing efficiency for and a sum generator. The approximate carry skip adder
an improved accuracy of a segmented adder in [92]. (CSA) [96], the generate signal-exploited carry specu-
Generally, the critical paths of the segmented adders lation adder (GCSA) [100], and the block-based carry
are in O (log (k)) due to the carry-ignored segmenta- speculative approximate adder (BCSA) [103] use different
tion. The circuit complexities are in O (n log (k)) for selection schemes for the carry-in of a carry generator.
ESA and ETAII, in O ((n − k) log (k)) for ACAA, and in In the CSA, the carry-in of the (i + 1)th block is determined
O (((n − L)/k + 1) L log (L)) for GeAr. by the propagate signals of the i th block: it is the carry-
out of the (i − 1)th subcarry generator when all propagate
3) Approximate Carry-Select Adders: The structure used
signals are true (Pi = 1); otherwise, it is the carry-out
in the classic carry-select adder [86] is employed in [95]–
of the i th subcarry generator. The generate signals are
[102] to introduce approximation in the selection of the
used in the GCSA for the carry speculation; the carry-in
carry-in and sum for each subadder. This type of adders
for the (i + 1)th block is selected by its own propagate
is referred to as an approximate carry-select adder with
signals rather than its previous block. The carry-in is the
either sum or carry-in selection, as shown in Figs. 3 and
most significant generate signal gi,k−1 of the i th block if
4, respectively. An n-bit approximate carry-select adder
Pi = 1, or else it is the carry-out of the i th carry generator.
consists of m = n/k blocks and uses several common sig-
The carry-in of the (i + 1)th block in BCSA is selected
nals. For the i th block, generate gi, j = ai, j bi, j , propagate
 between the most significant generate signal gi,k−1 and
pi, j = ai, j ⊕ bi, j , and Pi = k−1 j =0 pi, j are defined, where the carry-out of the i th block, that is, Ci+1,in = Si Ci,out +
ai, j and bi, j are the j th LSBs of the inputs in block i , where
Si gi,k−1 , where Si = ai+1,0 + bi+1,0 + gi,k−1 [103]. An error
j = 0, 1, . . . , k − 1. Pi = 1 indicates that all k propagate
detection-and-recovery scheme is further proposed to par-
signals in the i th block are logic “1.”
tially compensate the errors by modifying the LSB of the
In the speculative carry selection adder (SCSA) [95]
sum output in each block.
and the consistent carry approximate adder (CCA) [99],
In the carry speculative adder (CSPA), each block con-
a sum is selected from adder0 (with carry-in “0”) and
tains one sum generator, two internal carry generators
adder1 (with carry-in “1”) by using a multiplexer. In the
with carry-0 and carry-1, respectively, and a simple carry
SCSA, the carry-out of adder0 in the (i − 1)th block is
predictor [98]. The carry-out of the i th carry predictor
connected to the Seli signal of the multiplexer in the i th
selects a carry-in for the (i + 1)th sum generator. The carry
predictor uses kl rather than k input bits (kl < k), so it leads
to a simpler circuit than SCSA for the same block size k.
Some control signals are added to the gracefully degrad-
ing accuracy-configurable adder (GDA) to configure the
accuracy by selecting an accurate or approximate carry-
in for each subadder [97]. Thus, the delay of GDA varies
with the carry propagation path determined by the control
signals.
In the carry cut-back adder (CCBA), the full carry prop-
agation is prevented by a multiplexer or an OR gate [102].
Fig. 3. Approximate carry-select adder with sum selection. The carry-in for a segment is determined by a cut signal

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2113
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

To obtain the error characteristics, the functions of the


16-bit approximate adders are implemented in MATLAB
and simulated with ten million uniformly distributed ran-
dom input combinations. The simulation results show
similar trends in MRED and NMED for the approximate
adders [51], so only the MRED is reported here. For
circuit measurements, the 16-bit approximate adders are
implemented in HDLs and synthesized as described in
Section II-B2. The clock period used in the power estima-
tion is 1 ns.
Fig. 5. n-bit adder using AFAs. Fig. 6 shows the delay comparison of approximate
adders with respect to MRED and ER (for delay-optimized
from a carry propagate block at a higher position, a carry synthesis), whereas Fig. 7 shows the power comparison
speculated from a short chain at a lower position, and the (for area-optimized synthesis). The overall comparison in
carry-out of the previous segment. The delay and accuracy PDP and MRED is shown in Fig. 8 (for both delay- and
of the CCBA depend on the distance between the propagate area-optimized syntheses). A Pareto front is delineated in
block and the multiplexer or OR gate. each figure to show the designs with the highest efficiency.
The critical path delay of the approximate carry-select As ETAII, ACAA, and SCSA share the same error character-
adders can be given by O (log (k)), when the bit width of istics for a certain k, only ETAII is shown in the figures due
the input operands in each block is k. The circuit area to its lower hardware overhead.
varies with the complexity of the carry prediction and As can be seen in Fig. 6, most approximate adders have
selection schemes. very close ERs between 0.5% and 35% except for GeAr
(R4_P8) and CSA (for k > 3) with ERs smaller than
4) Approximate Full Adders: Another method for reduc- 0.5%, and CCBA, ESA, LOA, and TruA with very high ERs,
ing the critical path and power dissipation of an adder is to although CCBA, LOA, and TruA can have relatively small
approximate an FA. The approximate FA (AFA) is then used MREDs. CSA can be very accurate, whereas ESA, BCSA,
to implement l LSBs in an n-bit adder (l < n), whereas and CSPA show a low accuracy with relatively large MREDs
the (n − l) MSBs are computed by an AccuA, as shown in and ERs.
Fig. 5. In the LOA, an OR gate is used as a simple AFA,
and an AND gate is used to generate the carry-in for the 1) Performance/Power Versus Accuracy: As also shown in
AccuA [18]. Fig. 6, LOA, CCBA, and GeAr can be faster than the other
Other AFA designs include the mirror adder [20], designs for a relatively small MRED, so they exhibit a
the approximate XOR/XNOR-based FAs [104], the inex- balanced tradeoff in performance and accuracy (in MRED).
act adder cells proposed in [105], and the approximate ESA and CSPA are the fastest at a large MRED and ER.
reverse carry propagate FA [106] (specific for the RCA For a similar ER, some configurations of CSA, GeAr, and
structure). Additionally, emerging technologies such as ETAII can be faster than others (i.e., in the Pareto front).
magnetic tunnel junctions have been considered for the As shown in Fig. 7, CCBA, LOA, and TruA achieve the
design of AFAs for a shorter delay, a smaller area, and a best power and accuracy tradeoffs (in terms of MRED).
lower power consumption [107], [108]. Finally, a simply GeAr, BCSA, and CCBA are in the Pareto set for power
truncated adder (TruA) that works with a lower precision consumption and ER.
is considered as a baseline design. 2) Energy Versus Accuracy: To consider both error and
The critical path of this type of adders is typically circuit characteristics, the MRED and PDP are selected
in O (log (n − l)) when there is no carry propagation for as representative metrics. As shown in Fig. 8, a similar
the AFAs. LOA is selected as the reference design in the trend is obtained for both the delay- and area-optimized
evaluation due to its logic-level implementation while most syntheses in the PDP and MRED of the approximate
other AFAs are designed at the transistor level. adders. Overall, CCBA, LOA, and TruA achieve the best
In addition to the above four categories, a library tradeoffs between accuracy (in MRED) and energy (in
of 430 approximate 8-bit adders has been automatically PDP); however, they have the highest ERs. Nevertheless,
generated by using Cartesian genetic programming (CGP) these approximate adders show a decent tradeoff in error
and a multiobjective genetic algorithm [109]. Due to the magnitude and hardware efficiency. In particular, they are
restricted bit width of 8 bits, however, this group of designs suitable for applications in which a high ER is not as
is not considered in the evaluation. detrimental as a large error magnitude.
In summary, truncation is an effective approach to a
C. Evaluation hardware-efficient design, albeit resulting in a high ER.
In this evaluation, we consider 16-bit approximate On the other hand, the carry select scheme can be very
adders. In the circuit implementations, all subadders in the effective in highly accurate designs such as the CSA.
designs are implemented by CLAs for a high efficiency. A speculative adder results in a very high power dissipation

2114 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 6. Optimized delay versus accuracy for the approximate 16-bit adders using different error metrics. (a) Optimized delay versus MRED.
(b) Optimized delay versus ER.
Note: The number of approximate or truncated LSBs for LOA and TruA ranges from 3 to 9, the subadder width of ESA is from 8 down to 3,
the number of bits used for carry speculation for ACA is from 8 down to 3 from left to right. The block width for CSA is from 5 down to 3, and
it is from 6 down to 3 for the other adders from left to right. In CSPA, the size of the carry predictor is k/2. The global speculative carry for
CCA is “0,” which leads to a more accurate result than using “1.” For GeAr, the configurations from left to right are R4_P8, R6_P4, R4_P4, and
R2_P4 (Rm_Pk means k previous bits are used for generating m bits of sum results). For CCBA, the configurations with the smallest PDPs are
chosen for a similar MRED.

and a large error magnitude (e.g., ACA). The advantages IV. A P P R O X I M A T E M U L T I P L I E R S


and disadvantages of the approximate adders with at least A. Preliminaries
one prominent property are summarized in Table 1. In this
table, ED stands for both MRED and NMED. As can be Typically, a combinational multiplier consists of three
seen, the ESA is very hardware-efficient for applications processing stages: PP generation, PP accumulation, and
with high error tolerance, whereas CSA is suited for appli- a final carry propagate addition, as shown in Fig. 9. Let
cations that require a high accuracy. When a high ER is the two input operands of an n × n unsigned multiplier
n−1 n−1
not an issue, CCBA, LOA, and TruA are the most efficient be A = i=0 Ai 2 and B =
i i
i=0 Bi 2 , where Ai and
designs. Bi are the i th least significant bits of inputs A and B,

Fig. 7. Power consumption versus accuracy for the area-optimized approximate 16-bit adders using different error metrics. (a) Power
(area-optimized) versus MRED. (b) Power (area-optimized) versus ER.

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2115
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 8. Comparison of PDP and MRED for the approximate 16-bit adders. (a) PDP (delay-optimized) versus MRED. (b) PDP (area-optimized)
versus MRED.

Table 1 Summary of Approximate Adders

respectively, and i starts from 0. A PP is often generated Fig. 9. Basic arithmetic process of a 4 × 4 unsigned multiplication.
by an AND gate, that is, PPi, j = A j Bi . To accumulate •: an input, a PP, or an output bit; : an HA, an FA, or a
4:2 compressor.
the PPs, three structures are widely used: the carry-save
adder array [110], the Wallace tree [111], and the Dadda
tree [112].
Fig. 10 shows a carry-save adder array for a 4 × 4 possible in a Wallace tree. Compared to the array structure,
unsigned multiplier, where the carry and sum signals gen- a tree-based PP accumulation is faster; however, a tree
erated by the adders in a row are passed on to the adders structure requires longer and more complex wiring, which
in the next row. The carry signals propagate through the can result in a larger circuit area [113].
adders in a diagonal direction. Hence, the critical path Signed multiplication uses two’s complement represen-
for an n × n multiplier is approximately in O (n). Due to tation. The input operands are given as A = − An−1 2n−1 +
n−2 n−2
i=0 Ai 2 and B = −Bn−1 2
its regular layout, the array structure in Fig. 10 requires i n−1 + i
i=0 Bi 2 . The Booth
mostly short wires and is easy to scale to large arrays. algorithm is then used to recode the multiplier for gener-
A Wallace tree utilizes FAs, half adders (HAs), and ating PPs [114]. MacSorley modified the Booth algorithm
4:2 compressors for a fast accumulation of the PPs, to the radix-4 Booth algorithm [115], which reduces the
as shown in the dotted box in Fig. 9 for a 4 × 4 unsigned number of PPs by half. The radix-2r Booth algorithm can
multiplier. The adders in each stage operate in parallel be obtained by using the same principles of the radix-
without carry propagation, and the same operation repeats 4 scheme. In addition, the Baugh–Wooley algorithm [116]
until two rows of the PPs are left. For an n × n multiplier, and the modified Baugh–Wooley algorithm [117] can sim-
about log1.5 (n/2) stages are required in a Wallace tree plify the signed multiplication by adding the two’s com-
[110]. Therefore, the delay is in O (log (n)), which is plements of the PP rows and preprocessing the constant
shorter than that of the array structure. The Dadda tree additions. The modified Baugh–Wooley algorithm is also
has a similar structure as the Wallace tree, but it uses as widely used to eliminate the sign extension in Booth
few adders as possible rather than reducing PPs as early as multipliers [118].

2116 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

without any approximation and 2) otherwise, the accurate


multiplier (AccuM) is used to multiply the MSBs, while
the nonmultiplication section is used to approximately
process the LSBs. The static segment multiplier (SSM) uses
a similar partition scheme, but the approximation section
is omitted for not processing the LSBs [133]. If the MSBs of
one input are all zeros, its LSBs are multiplied by either the
MSBs or the LSBs of the other input depending on whether
MSBs are all zeros.
Similarly, an exact k × k submultiplier is used in the
design of an n × n dynamic range unbiased multiplier
(DRUM) [120]. However, the k-bit inputs of the reduced-
width multiplier are dynamically selected starting from
Fig. 10. 4 × 4 unsigned multiplier using a carry-save adder array. the leading “1”s or the most significant “1”s in the two
n-bit input operands. If the leading “1” position is higher
than k, the redundant LSBs are truncated and the LSB of
the k selected bits is set to “1.” Otherwise, the leading “1”
To approximate an unsigned multiplier, five methodolo-
position is ignored, and the k LSBs of the input operands
gies have been considered: 1) approximation in generating
are selected as the inputs of the submultiplier. The final
the PPs [22]; 2) approximation (including truncation) in
output is then obtained by using a barrel shifter to restore
the PP tree [18], [21], [119], [120]; 3) using approximate
the computed result. As the input bits are more effectively
adders [121], counters [63], or compressors [79], [80],
processed, DRUM is more accurate than ETM and SSM.
[122]–[126] in the PP reduction; 4) using logarithmic
Moreover, it produces unbiased errors, so it is suited for
approximation [11], [64], [127]–[130]; and 5) using
accumulative operations. However, it uses a more complex
an automated process such as a genetic programming
circuit for the dynamic selection of inputs.
method [109], [131]. For signed multiplication, approx-
An approximate Wallace tree multiplier (AWTM) utilizes
imate Booth multipliers have been designed for its fast
a bit-width-aware approximate multiplication and a carry-
operation on a reduced number of PPs.
in prediction [119]. An n × n AWTM is implemented by
Therefore, approximate multipliers are classified into
four n/2 × n/2 submultipliers, where the most significant
five unsigned categories and signed Booth multipliers.
submultiplier A H B H is further divided into four n/4 × n/4
submultipliers. By using different numbers of approximate
B. Approximate Unsigned Multipliers n/4×n/4 submultipliers in A H B H , the AWTM is configured
into four modes. The three less significant n/2 × n/2 sub-
1) Approximation in Generating PPs: As an early design,
multipliers ( A H B L , A L B H , and A L B L ) are approximate.
the UDM utilizes an approximate 2 × 2 multiplier to con-
struct larger multipliers [22]. The 2 × 2 multiplier approxi-
3) Using Approximate Counters or Compressors in the PP
mates the product “1001” with “111” when both the inputs
Reduction: An inaccurate 4 × 4 Wallace multiplier uses
are “11,” so saving one output bit and simplifying the logic
a 4:2 counter that approximates the output of the carry
circuit. The ER of this 2 × 2 multiplier is 0.54 = 6.25% if
and sum, “100,” with “10” when all four inputs are
each input bit is equally likely to be “0” or “1.”
“1” [63]. For uniformly distributed inputs, the probability
2) Approximation in the PP Tree: A BAM omits some of one bit being “1” is 0.5, so the probability that a PP
carry-save adders in an array multiplier in both the hori- is “1” is 0.25. The ER of the approximate 4:2 counter
zontal and vertical directions [18]. A more straightforward is, therefore, only 0.254 = 0.39%. Larger multipliers
approximation is to truncate some LSBs on the input can be constructed using the inaccurate counter-based
operands so that a smaller multiplier is sufficient for 4 × 4 multiplier (ICM, in general). In the approximate
the remaining MSBs. This truncated multiplier (TruM) is counters in [134], the more significant output bits are
considered as a baseline design for comparison. Different ignored for an efficient implementation of several signed
from BAM and TruM, several consecutive rows of PPs that multipliers.
do not necessarily start from the LSB are ignored for the Two approximate designs that implement simplified
PP reduction in [132]. This design is referred to as a PP functions of 4:2 compressors are considered for Dadda
perforation-based multiplier (PPAM). multipliers [123]. To lower the error probability, the PPs
The ETM consists of a multiplication section, a non- are encoded by propagate (i.e., PPi, j + PP j,i ) and generate
multiplication section, and a control block [21]. The NOR (i.e., PPi, j · PP j,i ) signals, which enable the design of sev-
gates-based control block determines: 1) if all k MSBs in eral approximate compressors with a relatively low error
at least one of the two n-bit input operands are zeros, probability [124]. Similarly, an approximate 4:2 compres-
the multiplication section (using an accurate k × k mul- sor with encoded inputs is proposed for 4 × 4 multipliers,
tiplier, where k < n) is activated to multiply the LSBs which is then used to construct larger multipliers [125].

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2117
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

The dual-quality 4:2 compressors in [79] can switch The multiplication is then completed by performing an
between exact and approximate operating modes using antilogarithmic approximation, that is,
power gating techniques. These compressors are then used

in the PP accumulation of a Dadda multiplier and the accu- 2k1 +k2 (x1 + x2 + 1) , if x1 + x2 < 1
racy can be dynamically configured. Using a three-input M≈ (10)
2k1 +k2 +1 (x1 + x2 ) , if x1 + x2 ≥ 1.
majority gate, a FinFET-based imprecise 4:2 compressor is
designed for an approximate 8 × 8 Dadda multiplier with
truncation in the PP array [80]. To improve the accuracy of an LM, the ALM-SOA uses a
In the high-order compressor-based multiplier (HOCM), truncated binary-logarithm converter and a set-one-adder
each column of PPs is accumulated by only one compres- (SOA) for the addition [127]. The SOA simply sets the
sor [126]. An allocation algorithm is then developed to LSBs to constant “1”s and generates a carry-in for the MSBs
determine the use of exact and approximate compressors using an AND gate. Moreover, an improved algorithm
at different stages of the accumulation with the truncation using exact and approximate adders (ILM-EA and ILM-AA)
of the lower half PPs. is proposed in [64] and [128]. In [129], the input operands
In [121], a novel approximate adder uses two adjacent between two consecutive powers of two are partitioned
inputs to generate a sum and an error bit for accumulating into several segments. An error reduction factor is then
the PPs. To alleviate the error due to the approximate analytically determined for each segment and compen-
adder, two error recovery schemes are considered to use sated to the result of the basic LM. A two-stage design that
either OR gates to accumulate the error bits in the so- uses two truncated LMs for error correction achieves a low
called approximate multiplier 1 (AM1) or both OR gates and unbiased average error [130].
and the approximate adders in the approximate multiplier 5) Using an Automated Process: In [109], 471 8 × 8
2 (AM2). Moreover, TAM1 and TAM2 are obtained by approximate unsigned multipliers are automatically gener-
truncating the lower half of the PPs in AM1 and AM2, ated by using CGP and a multiobjective genetic algorithm.
respectively [50], [135]. As CGP can provide better implementations of a circuit
than conventional synthesis tools, it is used to denote a
4) Using Logarithmic Approximation: Mitchell’s algo- circuit using this design method. An approximate circuit
rithm leverages the logarithmic and anti-logarithmic is generated by randomly removing some connections
approximations of a binary number. It serves as the basis of several accurate designs. A genetic algorithm is then
of logarithmic multipliers (LMs) [11]. In this algorithm, applied for design space exploration to obtain the optimal
the two unsigned binary input operands A and B of a approximate circuits with respect to MRED. These 8 × 8
multiplier are expressed as multipliers are then used to construct 16 × 16 approximate
multipliers, referred to as CGPM1–CGPM6 [131].
A = 2k1 (1 + x1 ) (5)

and C. Evaluation of Approximate Unsigned


Multipliers
B = 2k2 (1 + x2 ) (6) The error and circuit characteristics of 16 × 16 approx-
imate unsigned multipliers are obtained with the same
where k1 and k2 indicate the leading “1” positions of A and experimental setup as in the evaluation of approximate
B, respectively, and x1 and x2 are the fractional numbers adders, except that the clock period used in the power
that represent the bits to the right of the leading “1”s estimation is 4 ns.
normalized by 2k1 and 2k2 , respectively. The product of A As shown in [51], most of the approximate multipliers
and B is then given by result in large ERs close to 100%. However, ICM has a low
ER of 5.45% because only one approximate counter with
an ER of 0.39% is used in a 4 × 4 multiplier block. Some
M = A × B = 2k1 +k2 (1 + x1 ) (1 + x2 ) . (7) configurations of CGPM1, CGPM2, and CGPM3 also show
lower ERs than the other designs. Additionally, the ER of
Thus UDM is 80.99%, which is lower than most of the other
designs. Hence, the accuracy is only compared here in
log2 M = k1 + k2 + log2 (1 + x1 ) + log2 (1 + x2 ) . (8) MRED and NMED. For circuit measurements, Fig. 11 shows
the critical path delay of the multipliers synthesized under
delay-optimized constraints with respect to MRED and
As 0 ≤ x1 , x2 < 1, log2 (1 + x1 ) ≈ x1 , and log2 (1 + x2 ) ≈ NMED, while Fig. 12 shows the power consumption for
x2 . Hence, (8) is approximated by area-optimized synthesis. Based on the preliminary results
in [57], the designs with good tradeoffs are selected from
log2 M ≈ k1 + k2 + x1 + x2 . (9) each category for comparison here. The array and Wallace

2118 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 11. Optimized delay versus accuracy for the approximate 16 × 16 unsigned multipliers using different error metrics. (a) Optimized
delay versus MRED. (b) Optimized delay versus NMED.
Note: The number of truncated LSBs for TruMA and TruMW is from 2 to 8 from left to right, and from 11 to 22 for BAM. The number of MSBs
used for error compensation is from 16 to 10 for TAM1. The size of the accurate submultiplier is from 10 to 8 for SSM, and 10 to 6 for DRUM.
The configurations for HOCM are 1StepFull (with approximate compressors in the first accumulation stage), 1StepTrunc (1StepFull with a
truncation of LSBs), 2StepFull (with approximate compressors in both the first and second stages), and 2StepTrunc (2StepFull with a
truncation of LSBs) from left to right. For CGPM1 and CGPM3, which, respectively, use one and three 8 × 8 approximate multipliers for
constructing a 16 × 16 multiplier, the configurations with the smallest PDPs are shown for a specific MRED, selected from 500 configurations
for each design.

architectures are considered for TruM, which are denoted TAM1, ILM-AA, and ALM-SOA show the best tradeoffs
as TruMA and TruMW, respectively. between performance and accuracy. PPAM can have the
shortest delay but largest error.
1) Performance Versus Accuracy: Fig. 11 shows that most
approximate unsigned multipliers exhibit a similar per- 2) Power Versus Accuracy: As shown in Fig. 12, LMs
formance trend versus both MRED and NMED except for are very power-efficient albeit with a very low accu-
ICM and DRUM. CGPM1 is the most accurate design with racy, whereas CGPM1 is relatively power-hungry but with
very small values of MRED and NMED, and a reasonable a high accuracy. At a medium accuracy, BAM consis-
performance. As expected, the LMs (ALM-SOA and ILM- tently consumes a low power, followed by CGPM3. Some
AA) are good in performance, but poor in accuracy. HOCM, configurations of HOCM also show good tradeoffs in power

Fig. 12. Power consumption versus accuracy in MRED and NMED for the area-optimized approximate 16 × 16 unsigned multipliers.
(a) Power (area-optimized) versus MRED. (b) Power (area-optimized) versus NMED.

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2119
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 13. Comparison of PDP and MRED for the approximate 16 × 16 unsigned multipliers. (a) PDP (delay-optimized) versus MRED. (b) PDP
(area-optimized) versus MRED.

Table 2 Summary of Approximate Unsigned Multipliers

Fig. 14. PP partition for an 8 × 8 fixed-width modified Booth


multiplier [136]. PPi,j is the jth PP in the ith PP vector, PPi,j is the
inverted PPi,j , and ni is the sign of the ith encoded digit.

efficiency and accuracy. HOCM (1StepTrunc) shows the D. Approximate Booth Multipliers
best tradeoff in power and accuracy.
The modified (or radix-4) Booth algorithm is commonly
3) Energy Versus Accuracy: The PDPs of the unsigned used in the design of approximate Booth multipliers [118],
multipliers with respect to MRED are shown in Fig. 13. [136]–[140]. Initially aimed at a fixed-width signed mul-
The overall trend is slightly different between the tiplier, a widely used method is to truncate the lower
delay-optimized [see Fig. 13(a)] and area-optimized half of the PPs to generate an output with the same
[see Fig. 13(b)] synthesis results. As shown in width as the input. This truncation saves the circuits for
Fig. 13(a), the 1StepFull in HOCM, TAM1, CGPM1, PP accumulation, but it introduces a large error. Hence,
and CGPM3 exhibit the best energy-accuracy tradeoffs, many error compensation schemes have been proposed to
located in the center of the plot. In Fig. 13(b), CGPM3, increase accuracy [118], [136], [137], [139].
HOCM, TAM1, and TruMA show slightly better tradeoffs Inspired by the BAM, the broken Booth multiplier (BBM)
than the other designs. At a very low accuracy, ALM-SOA omits the adder cells to the right of a vertical line [138],
and PPAM are the most hardware-efficient with the whereas directly truncating k LSBs of the input operands
smallest PDP values. leads to a truncated Booth multiplier (TBM-k). The TBM
In summary, truncation is effective in reducing the delay is considered as a baseline design for comparing the Booth
and energy consumption of unsigned multipliers. An LM multipliers.
tends to be hardware-efficient but with a rather low accu- Generally, a fixed-width Booth multiplier is based on
racy. The automatically generated multipliers can be highly a partition of the PP array, as shown in Fig. 14. For an
accurate with a reasonable hardware consumption. A brief 8 × 8 fixed-width modified Booth multiplier, the PP array is
summary of the error and circuit characteristics is shown divided into two parts: the upper half denoted as the main
in Table 2 for the approximate unsigned multiplier in the part (MP) and the lower half truncation part (TP). The
Pareto fronts. TP is further divided into TPmajor and TPminor . The final

2120 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Table 3 Summary of the Approximate Booth Multipliers multipliers, the most efficient approximate radix-8 Booth
multiplier, ABM2_R15 (with the truncation of 15 bits,
resulting in a fixed-width multiplier), is considered and
denoted as ABM2.
To speed up the PP generation, two approximate radix-
4 Booth encoders are proposed by simplifying the K-Map
to generate k least significant PP columns for an n × n
multiplier (k = 1, 2, . . . , 2n) [140]. By changing the value
of k, different tradeoffs can be achieved between accuracy
and hardware efficiency.
product of a fixed-width multiplier is the addition of the
MP and the carry signals generated from the TP.
In [136], the carry signals include the exact carry from E. Evaluation of Approximate Booth Multipliers
the TPmajor and an approximate carry from the TPminor (see In this evaluation, we consider 16 × 16 approximate (or
Fig. 14). The approximate carry is generated by the output fixed-width) Booth multipliers for signed multiplication.
of the modified Booth encoders. This multiplier is referred The clock period for the power estimation is 4 ns. Fig. 15
to as BM04. Using a similar partition scheme, BM11 relies shows the optimized delay with respect to NMED and
on a simplified sorting network to generate the carries MRED, while the power is shown in Fig. 16 for area-
for error compensation [118]. This error compensation optimized synthesis. Fig. 17 shows the tradeoff between
makes the errors symmetrical and centered around zero, PDP (for both delay- and area-optimized syntheses) and
which reduces the error bias and MSE. In BM15, the error MRED for the approximate Booth multipliers.
due to truncation is compensated by the outputs of the
1) Performance/Power Versus Accuracy: As revealed in
Booth encoders and the multiplicand [141]. In BM07,
Figs. 15 and 16, most fixed-width Booth multipliers show
the number of PP columns in TPmajor is adaptively variable
similar NMEDs except for BBM and BM15 with relatively
to compensate for the quantization error in a fixed-width
large values. Compared to the fixed-width Booth multipli-
multiplier [137]. Another design is based on a probabilistic
ers, TBM can have a similar MRED and higher NMED, with
estimation-based bias [139], referred to as PEBM. In this
a higher speed and power dissipation. With a moderate
design, an error compensation formula is derived from
accuracy, ABM2 is the fastest and the most power-efficient.
a probability analysis, where the number of PP columns
With a very high accuracy, BM07 is the slowest design with
in TPmajor varies in accordance with the desired tradeoff
a relatively high power consumption, followed by BM11.
between hardware and accuracy.
PEBM shows a moderate speed and power dissipation, with
To reduce the additional delay due to the radix-8 Booth
a relatively high accuracy.
algorithm, an approximate recoding adder is proposed for
calculating the triple multiplicands in [142]. A Wallace tree 2) Energy Versus Accuracy: Fig. 17 shows that BM07,
and a truncation technique are then utilized for the PP BM11, and PEBM exhibit the best tradeoffs between
accumulation. To be consistent with the fixed-width Booth accuracy and PDP. ABM2 and BBM stand out too for

Fig. 15. Delay versus accuracy for the approximate 16 × 16 Booth multipliers. The number of truncated LSBs for TBM is from 2 to 6 from
left to right. (a) Optimized delay versus MRED. (b) Optimized delay versus NMED.

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2121
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 16. Power consumption versus accuracy for the approximate 16 × 16 Booth multipliers. (a) Power (area-optimized) versus MRED.
(b) Power (area-optimized) versus NMED.

power-optimized synthesis. A summary of the error approach to divider design is to follow the pencil-and-
and circuit characteristics is shown in Table 3. Overall, paper algorithm. In general, the quotient of a division
BM07 and BM11 are relatively accurate but slow. PEBM is computed by iteratively subtracting a multiple of the
shows small values of NMED and PDP, as well as a high divisor from the partial remainder that is initially set to
speed. ABM2 is efficient in both power and performance the dividend. In a restoring divider, the partial remainder
with a moderate accuracy. is corrected or reserved when the subtraction yields a
negative number, while it is not corrected in a nonrestoring
V. A P P R O X I M A T E D I V I D E R S divider [110]. Fig. 18 shows an 8/4 unsigned restoring
A. Preliminaries array divider that uses a multiplexer and the borrow
signal in the subtractor cell to retain the partial remainder.
Although the divider is not as frequently used as adders Generally, n 2 subtractor cells are required
  in a 2n/n array
and multipliers, the system performance can be signifi- divider. The critical path is in O n 2 due to the ripple
cantly degraded if it is not appropriately implemented, borrow propagations among subtractor cells, whereas it is
whereas it is hard to reduce the latency of dividers without in O (n) for an n × n array multiplier. Therefore, an array
significant overhead in area [143]. A straightforward divider is much slower than an array multiplier. However,

Fig. 17. Comparison of PDP and MRED for the approximate 16 × 16 Booth multipliers. (a) PDP (delay-optimized) versus MRED. (b) PDP
(area-optimized) versus MRED.

2122 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

truncation, and error compensation in an array structure.


For an array divider, a high-radix design can be faster but
consumes more power compared to a radix-2 design [152].

2) Using a Reduced-Width Exact Divider: A dynamic


approximate divider (DAXD) selects the inputs and uses
a reduced-width restoring array divider [153], in a way
 similar to the design of DRUM [120]. This selection scheme
b bi8/4
Fig. 18. unsigned restoring array divider. A = 7 ai 2i and
i=0
2i are the input dividend and divisor, respectively.
could lead to overflows that cause a low accuracy in divi-
 
B=
i=0 sion. To implement a DAXD, two leading-1 detectors, two
Q = 3 qi 2i and R = 3 ri 2i are the output quotient and
i=0 i=0 multiplexers, a reduced-width array divider, a subtractor,
remainder, respectively. The OR gate is a simplified subtractor with
and a barrel shifter are needed.
0 as the subtrahend for dealing with overflow.
Depending on the positions of the leading “1”s, an adap-
tively approximate divider (AAXD) employs two pruning
the delay of an array divider can be reduced by using schemes to determine the inputs for a reduced-width exact
carry-save reduction and carry-lookahead principles, with divider [39]. Different from the DAXD, zeros are appended
an increased cost in area and power consumption [144]. to the LSBs for selecting k input bits when the leading
To reduce the critical path, Sweeney [145], Robertson “1” is within the k LSBs. In addition, an error-correction
[146], and Tocher [147] (SRT) algorithm speculates a unit is used to ensure a high accuracy with a very low
quotient bit based on a few MSBs of the divisor and the maximum ED.
partial remainder. In an SRT divider, therefore, the bit
width of the subtractors is smaller than that in an array 3) Approximate Dividers Based on Functional Approxima-
divider, thus resulting in a faster operation. The perfor- tion: In the high-speed and energy-efficient approximate
mance can be further improved by using a high-radix divider (SEERAD) [154], the division is implemented by a
divider that generates several quotient bits rather than one simple multiplication by rounding the divisor B to a form
at each iteration [148]. The quotient in an SRT division is of 2 K +L /D, where K indicates the leading “1” position
usually in a redundant form, so an on-the-fly conversion of B, and L and D are constant integers estimated via
algorithm [149] is required to convert the quotient into a an exhaustive simulation for achieving the lowest mean
nonredundant representation. Notably, the performance of relative error. For a division of A/B, it is then sufficient to
these dividers is improved by trading off circuit area and use a multiplier for computing AD, a barrel shifter, and
power consumption. some lookup tables for storing L and D. Different accuracy
Several iterative algorithms, including the Newton– levels are obtained by varying D and L.
Raphson [10] and Goldschmidt [9] algorithms, have been A binary logarithm-based functional approximation per-
developed for large division using multiplication and addi- forms division by computing the antilogarithm of the dif-
tion. The performance of this type of dividers is signifi- ference between the logarithmic values of the dividend A
cantly affected by the presumed initial parameter values. given by (5), and divisor B given by (6). It leads to
Several approximate subtractor/adder cells
have recently been proposed for an array divider
log2 Q = log2 (A/B) ≈ k1 − k2 + x1 − x2 (11)
[31], [150]–[152]. Another type of approximate dividers
uses a reduced-width exact divider for large division [39],
[153], while several others are based on functional where k1 and k2 specify the leading one positions of A and
approximation (e.g., using a logarithmic algorithm) [30], B, respectively, x1 = A/2k1 − 1, and x2 = B/2k2 − 1.
[33], [154]–[156] and curve fitting [157]. Using Mitchell’s algorithm, an approximate division can
be implemented by performing the antilogarithm of (11),
that is
B. Review
1) Approximation in Subtractor/Adder Cells: In [31],

2k1 −k2 (x1 − x2 + 1), if x1 − x2 ≥ 0
three approximate subtractors obtained by simplifying the Q≈ (12)
2k1 −k2 −1 (x1 − x2 + 2), if x1 − x2 < 0.
circuit of an exact cell are used for processing some LSBs
in a vertical, horizontal, square, or triangle region in an
array divider. This design is referred to as an approximate Error correction is considered in [33] to compen-
unsigned nonrestoring divider (AXDnr). Compared to the sate an offset to the computed quotient. Additionally,
AXDnrs, similarly designed approximate restoring dividers a number of LSBs are truncated in the subtractors for
(AXDrs) show better tradeoffs with slightly higher accu- implementing (12). These techniques enable the design
racy and lower power dissipation [150]. of approximate integer and floating-point dividers with
In [151], an approximate signed-digit adder is proposed near-zero error bias, denoted as INZeD and FaNZeD,
for use in high-radix dividers, together with replacement, respectively.

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2123
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 19. Optimized delay and power consumption versus MRED for the approximate 16/8 unsigned integer dividers. (a) Delay
(delay-optimized) versus MRED. (b) Power (area-optimized) versus MRED.
Note: The replacement depths of AXDr1, AXDr2, and AXDr3 are from 8 to 11 from left to right. The accuracy levels of SEERAD are from
4 down to 1 from left to right. The dividend width after pruning is from 12 down to 8 with a decrement of 2 for DAXD and AAXD from left to
right. In INZeD, the number of LSBs truncated in the subtractor is 0, 2, 3, and 4, from left to right.

In a high-speed divider (HSD) [30], a piecewise linear AAXD, SEERAD, and INZeD. The high-radix and floating-
approximation is utilized to implement the antilogarithm point dividers are not considered. Three designs of AXDr
directly on the two input operands, thus only lookup tables with the triangle replacement (that shows the best trade-
and multiplications are required. Compared to a divider off [150]) are selected for evaluation, i.e., AXDr1, AXDr2,
implemented using Mitchell’s algorithm, the HSD is more and AXDr3 using three different approximate subtractors.
accurate and faster with a larger area. For DAXD and AAXD, the reduced-width exact dividers
An approximate hybrid divider (AXHD) is based on are implemented using an array structure. An exhaustive
a restoring array structure and logarithmic approxima- simulation is performed for the error evaluation, that is, all
tion [155]. In this design, the p MSBs in a 2n/n divider valid combinations in the range of [0, 65535] and (0, 255]
is accurately implemented as a restoring array divider, that do not cause overflow in an accurate 16/8 divider, are
while the (2n − p) LSBs are approximately processed using used as the input dividends and divisors. The same tools,
Mitchell’s algorithm as per (12). technologies, and configurations as for the evaluation of
In [156], the mantissa in floating-point division is adders and multipliers are applied here. The clock period
approximately computed by a subtractor; the approxima- for the power estimation is set to 5 ns. As MRED and
tion is then tuned by using an error compensation lookup NMED show a similar pattern in the simulation results,
table (storing precomputed values) and a subtractor. This only MRED is plotted in Fig. 19(a) and (b), respectively,
design is denoted as the configurable approximate divider against which the optimized delay and power consumption
for energy efficiency (CADE) as its accuracy varies with the for area-optimized synthesis are shown. Fig. 20 presents
size of the lookup table. the comparison in PDP versus MRED for both delay- and
area-optimized syntheses.
4) Curve Fitting-Based Approximate Dividers: In the
design of a floating-point divider, the curved surfaces of 1) Hardware Versus Accuracy: As can be seen in Fig. 19,
the quotient are partitioned into several square or tri- AXDr1 and AXDr3 can be very accurate with a moderate
angular regions that are linearly approximated by curve power consumption, but quite slow, whereas DAXD is the
fitting [157]. The mantissa division is then implemented least accurate in general. For a medium-low MRED, AAXD,
by a comparison module, a lookup table, shifters, and and INZeD show the highest performance and the lowest
adders. With a similar circuit structure to the HSD, this power consumption; thus, they achieve the best accuracy-
approximate divider achieves a higher accuracy. hardware tradeoff. This is also evident in the PDP and
MRED figure in Fig. 20. SEERAD is the fastest at a low
accuracy (see Fig. 19) and with the lowest PDP for delay-
C. Evaluation optimized synthesis (see Fig. 20).
In this evaluation, we consider approximate In summary, AAXD is an efficient design for applica-
16/8 unsigned integer dividers, including AXDr, DAXD, tions that require a high accuracy and high performance.

2124 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 20. Comparison of the PDP and MRED for the approximate 16/8 unsigned integer dividers. (a) PDP (delay-optimized) versus MRED.
(b) PDP (area-optimized) versus MRED.

Table 4 Summary of Approximate 16/8 Unsigned Integer Divider where G is a 5 × 5 convolution matrix given by
Designs
⎡ ⎤
1 4 7 4 1
⎢ ⎥
⎢4 16 26 16 4⎥
⎢ ⎥
⎢ ⎥
G = ⎢7 26 41 26 7⎥ . (14)
⎢ ⎥
⎢ ⎥
⎣4 16 26 16 4⎦
1 4 7 4 1

Although some configurations of AXDr1 and AXDr3 are


Equation (13) shows that 25 multiplications, 24 addi-
very accurate, they are generally slow with high energy
tions, and a division are required for computing S(x, y).
consumption. INZeD is the most efficient design for a
In this simulation, the inputs are normalized by the maxi-
moderate accuracy. For an application that can tolerate a
mum numbers and scaled to a number in 16-bit unsigned
high level of inaccuracies, SEERAD-1 is suitable with a low
integer representation, so 16 × 16 approximate unsigned
hardware cost. A qualitative summary of these features is
multipliers and 16-bit approximate adders are used to
shown in Table 4.
implement the sum of products in (13). Note that the
32-bit products are rounded to 16 bits as inputs to the
VI. A P P L I C A T I O N S
adders. The division by 273 is implemented by a multi-
A. Image Processing
plication of a constant input 1/273.
To assess the capabilities of the approximate designs, Among the approximate adders and multipliers in each
we consider three image-processing applications: image category, one or two designs with the best accuracy and
sharpening using unsigned multipliers and adders, image energy tradeoffs are selected for this application. Hence,
compression using signed multipliers and adders, and ACA, GeAr, ETAII, CCBA, LOA, and TruA are considered for
change detection using unsigned dividers. addition; UDM, BAM, DRUM, HOCM, TAM1, ICM, ALM-
1) Image Sharpening: This technique enhances the SOA, CGPM3, and TruMW are selected for multiplication.
edges in an image to obtain a clearer view. An image The configurations that lead to similar PDPs compared
with a pixel matrix I is sharpened by using R(x, y) = to other designs are considered for designs with variable
2I(x, y) − S(x, y) [158], where R(x, y) is a resultant image parameters.
pixel, and S(x, y) is obtained by a convolution Fig. 21 shows the PSNRs of the sharpened images, which
are infinite for the combinations of the AccuA and AccuM,
and AccuA and ICM. The input image is a blurred “Lena”
1  
2 2
S(x, y) = G(m + 3, n + 3)I(x − m, y − n) with 512 × 512 pixels. Because ICM has a very low ER
273 (5.45%), and the error seldom occurs in this application,
m=−2 n=−2
(13) the ICM is as effective as AccuM for image sharpening.

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2125
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

of LOA-7 and BAM-16 results in an image with a higher


PSNR than BAM-16 and AccuA, and the combination of
CCBA-3 and DRUM-6 outperforms the duo of CCBA-3 and
AccuM.
The circuit designs for image sharpening are synthesized
for the optimized area (the same for the other applica-
tions). The clock period for the power estimation is 10 ns.
In the simulation, the CLA and Wallace multiplier are
utilized for the AccuA and AccuM, respectively. Fig. 22(a)
shows that the implementations using TAM1-16 are the
fastest, followed by HOCM, whereas the ones using DRUM-
6 are the slowest, followed by ICM. The delay values are
not as consistent as the area results for different adder
and multiplier combinations because the syntheses are
optimized for area.
As shown in Fig. 22(b) and (c), using different adder
designs does not significantly affect the area or power dis-
sipation when a specific multiplier is used. Thus, the mul-
tiplier dominates the area and power dissipation for this
application. On the contrary, the adder plays a more
significant role on the critical path delay, as shown in
Fig. 22(a), because the 25 multipliers work in parallel,
whereas the 24 adders work in a tree structure, resulting
in a critical path of one multiplier and five adders. Among
the multipliers, ALM-SOA and TAM1-16 are very energy-
efficient, as shown in the PDP values in Fig. 22(d).
Fig. 23 shows the comparison of PDP reductions of
different implementations compared with the accurate

Fig. 21. PSNR comparison of image sharpening results.


Note: AccuA and AccuM denote the accurate adder and multiplier,
respectively. The numerical value in the name of each design
indicates the parameter value. The design 1StepTrunc is considered
for HOCM; the configuration of 280 is selected for CGPM3.

For the same reason, the images sharpened by using the


UDM have very close PSNRs to the ones processed by an
AccuM. These results illustrate the advantages of designs
with low ERs in certain applications.
Although DRUM-7 and DRUM-6 show larger values of
MRED and NMED than the other designs, they lead to
higher PSNRs due to their unbiased errors. Also, HOCM
results in images with a higher quality than many other
multipliers, when used with the same adder. With a larger
MRED and NMED, the performance of ALM-SOA is similar
to BAM-16 and TAM1-16, because BAM and TAM1 gener-
ate single-sided errors that can be accumulated in the sum
of products.
When an approximate adder (or multiplier) has a very
low accuracy (e.g., BAM-18, TruM-4, ACA-6, GeAr-R4P4,
and ETAII-4), increasing the accuracy of the collaborating
multiplier (or adder) does not improve the quality of the
processed image. It is worth noting that using some com-
binations of approximate adders and multipliers can lead
to a higher image quality than using solely an approximate Fig. 22. Circuit measurements of image sharpening. (a) Delay. (b)
adder or an approximate multiplier. For example, the use Area. (c) Power. (d) PDP.

2126 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 23. Comparison of PDPs in the descending order of PSNRs for various adder-multiplier implementations of image sharpening.

design, in the descending order of PSNRs larger than The high-frequency information is then discarded by the
30 dB. For an implementation that produces sharpened quantization
images with a PSNR higher than 35 dB, ETAII-6 and HOCM
 
lead to the most significant saving in PDP (by about 50%). D(x, y)
C(x, y) = round (17)
For a PSNR between 30 and 35 dB, LOA-9 and ALM-SOA Q(x, y)
are the most efficient with 69% reduction in PDP. DRUM-
6 achieves the largest reductions in PDP for a high image where Q is a quantization matrix of unsigned integers
quality (with a PSNR larger than 40 dB), whereas ALM- determined by the required quality level. The quality level
SOA is the most efficient for a relatively low image quality can be from 1 to 100, where 1 corresponds to the highest
(with a PSNR lower than 35 dB). compression ratio and thus the poorest image quality.
To reconstruct the image, the above operations are
inverted by dequantization and inverse DCT (IDCT)
2) JPEG Compression: Based on the discrete cosine
transform (DCT), JPEG is a widely used lossy compression
algorithm for digital images [159]. The image pixels in R(x, y) = C(x, y) × Q(x, y) (18)
the spatial domain are first converted into the frequency
domain via a DCT. In the DCT, the pixel matrix of an image and
in 16-bit two’s complement is divided into 8 × 8 blocks.
Each block B is converted to the frequency domain by
I = T’RT. (19)

D = TBT’ (15)
In this simulation, the signed coefficients in matrix T are
scaled to 16-bit two’s complement format, and the image
where T is an 8 × 8 DCT coefficient matrix given by pixels are normalized by the maximum number followed
⎧ by a subtraction of 0.5 and scaled to 16-bit two’s comple-

⎪ 1 ment format. The signed multiplications in the DCT and
⎨√ , if x = 0
T(x, y) = 1 8  (2y + 1)xπ  (16) IDCT are implemented by approximate Booth multipliers,

⎪ cos including the 16 × 16 designs with good tradeoffs in accu-
⎩ , if x > 0
2 16 racy and hardware, BM07, PEBM, ABM2, BBM, and TBM.
The same 16-bit adder designs as in the image sharpening
where x = 0, 1, . . . , 7 and y = 0, 1, . . . , 7. are used. The quality level for the compression is 50.

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2127
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

lesser extent in JPEG compression than in image sharp-


ening for an acceptable accuracy, e.g., LOA-3 is required
for JPEG compression while LOA-8 is sufficient for image
sharpening.
Fig. 25 shows the resulting PDP reductions of the
DCT implementations using different multiplier and adder
combinations compared with the accurate design. The
clock period used for the power estimation is 10 ns. The
accurate DCT design utilizes a 16-bit post-truncated fixed-
width Booth multiplier and a 16-bit CLA. As shown in
Fig. 25, a combination of AccuA and PEBM shows the
best tradeoff in this implementation, achieving the highest
PDP reduction (about 20%) with a relatively high PSNR
(nearly 30 dB). Using an approximate adder with PEBM
rather significantly degrades the image quality. Among
the approximate Booth multipliers, PEBM, TBM-3, and
ABM2 lead to higher energy efficiency than the other
designs for this application.

3) Change Detection: The changes in two images can be


detected by finding the ratios between the corresponding
pixels. Thus, change detection can be used to assess the
approximate dividers. In each design, one configuration
Fig. 24. Comparison of JPEG compression and decompression is selected to ensure that a similar PDP (e.g., AXDr1-11,
quality using different adder and multiplier designs.
AXDr2-11, AXDr3-11, DAXD-8, AAXD-8, SEERAD-1, and
INZeD-2), or MRED (e.g., AXDr1-9, AXDr2-8, AXDr3-10,
DAXD-12, AAXD-10, SEERAD-4, and INZeD-0) occurs
for the considered designs. The 16/8 unsigned integer
The qualities of the decompressed images using dif-
dividers are utilized to obtain the pixel ratios. As shown
ferent adder and multiplier combinations are shown in
in Table 5, AXDr1-9, AXDr3-10, AXDr3-11, AAXD-10,
Fig. 24. As can be seen, ACA, GeAr, and ETAII are not
INZeD-0, INZeD-2, and SEERAD-4 perform similarly well
suitable for this application although they have very low
as an accurate divider, whereas AXDr2-11, DAXD-12,
ERs. Note that the errors of these approximate adders are
DAXD-8, and SEERAD-1 produce results with a lower
single-sided (or negative) due to the dropping of some
quality. AXDr2-8 and AAXD-8 produce images that are
carry bits; thus, the error biases for these types of adders
acceptable for a visual inspection. As the implementa-
are very large [51]. As a result, errors are accumulated in
tion of change detection mainly consists of dividers, the
the multiple matrix multiplications and cannot be tolerated
circuit measurements are similar to those for approximate
in DCT or IDCT. Similarly, BBM has a larger error bias
dividers; thus, they are not shown here. Fig. 20 shows that
than the other approximate Booth multipliers due to the
truncation of the PPs. Thus, the use of BBM also produces
images with a low quality in most cases. Among the
approximate adders, LOA-3 performs the best, followed
by TruA-1, while for the approximate multipliers, BM07,
PEBM, and TBM-2 outperform the other designs when the
same adder is used.
It is worth noting that using some approximate Booth
multipliers along with some approximate adders, e.g.,
PEBM and TruA-1, PEBM and TruA-2, generate a signifi-
cantly higher quality than the other designs (even when an
AccuM is used). Except for these special cases, the image
quality in PSNRs increases with the decrease in the MREDs
of the utilized approximate Booth multipliers.
Additionally, the results in Fig. 24 shows that a more
complex computation involving multiple matrix multipli-
cations is more sensitive to errors in addition than those
in multiplication with the same bit width. Thus, a larger
approximation can be tolerated in multiplication than in Fig. 25. Comparison of PDPs in the descending order of PSNRs for
addition. The tolerable approximation in addition is to a DCT implementations using different adder and multiplier designs.

2128 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Table 5 Change Detection Results Using Different Dividers

Fig. 26. Architecture of the implemented MTCNN.

to achieve a result with a PSNR higher than 33 dB, INZeD- approximate adder with single-sided errors (i.e., ETAII),
2 consumes the smallest energy followed by INZeD-0 and one with a small error bias (i.e., LOA), and the TruA are
AAXD-10. selected for additions.
Fig. 27 shows some face detection (in the bottom row)
and face alignment (in the top row) results using different
B. Deep Neural Networks adders and multipliers. For the face detection, a square
Face detection and alignment are two common tasks in is drawn to show the detected area. To align a face, five
machine learning. Using DNNs, the accuracy of face detec- landmarks are used to mark the eyes, nose, and mouth.
tion and alignment have been significantly improved since Compared to the accurate implementation using AccuM
the early 2010s. Due to the correlation of these two tasks, and AccuA, BM07 and ETAII-7 perform poorly in face
a multitask CNN (MTCNN) has been proposed for joint detection and alignment, as indicated by the squares and
face detection and alignment [160]. An accelerator specif- landmarks far away from the target positions, whereas
ically designed for this MTCNN achieves a high energy BM07 and LOA-4 result in a better quality.
efficiency and throughput [161]. In this MTCNN, three To quantitatively assess the accuracy of the face detec-
CNNs cascade as the proposal network (P-Net), the refine tion, the true positive rate (TPR) is measured for each
network (R-Net), and the output network (O-Net). The implementation on the FDDB data set [162], as shown in
basic operation in a CNN is the convolution based on Table 6. The TPRs in Table 6 show that the approximate
multiplications and additions. Booth multipliers (except for BBM), when working with
To assess the viability of approximate circuits in DNNs, LOA-3, LOA-4 and TruA-1, perform well in face detection,
the 16 × 16 approximate Booth multipliers and 16-bit resulting in close TPRs to the accurate implementation.
adders are integrated into the architecture of an MTCNN ETAII-7 leads to very small TPRs due to its large error
for face detection and alignment, as shown in Fig. 26. bias. Similarly, TruA-2 (except for the combination with
Here, the convolutional (CONV) layers account for the PEBM) and BBM (except for the combinations with LOA-
most computations. The max pooling is used for all pooling 3 and LOA-4) result in relatively low TPRs. Interestingly,
layers. Two fully connected (FC) layers to the end of ABM2 and TBM-4 working with the AccuA result in higher
the R-Net and P-Net are implemented by vector multi- TPRs than the accurate design. Similar results have been
plications. The approximate Booth multipliers with good observed in [47].
tradeoffs in accuracy and hardware, as those used in the In addition, the number of MACs required to detect the
JPEG compression, are considered in this application. One faces in one image averaged over the FDDB data set is

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2129
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Fig. 27. Some face detection (at bottom) and alignment (on top) results using different adders and multipliers. (a) AccuM and AccuA.
(b) BM07 and ETAII-7. (c) BM07 and LOA-4. (d) TBM-4 and TruA-1.

Table 6 TPRs of the Face Detection on FDDB Data Set (%). The Values Table 8 NMEs for the Face Alignment on AFLW Data Set (%). The Errors
Higher Than 90% are Highlighted in Bold, and the Ones Larger Than That Smaller Than That of the Accurate Design are Highlighted in Bold
of the Accurate Design are in Red

LOA-4 consistently achieve smaller NMEs than those using


Table 7 Number of MACs Required to Detect the Faces in One Image
AccuAs. Although TruA-1 performs well in face detection,
Averaged Over the FDDB Data Set (in Billions). The Numbers Smaller
Than That of the Accurate Design are in Bold it results in large NMEs in face alignment, LOA-3 per-
forms the best among the approximate adders. Among
the approximate multipliers, BM07 and PEBM are effective
designs producing smaller NMEs than the accurate design.
Note that the approximate adders and multipliers that
result in low accuracy in face detection and alignment
(BBM, ETAII-7, TruA-1, and TruA-2) share the same fea-
ture, single-sided errors. Similar to the JPEG compression,
the face detection and alignment are more sensitive to
errors in additions than in multiplications (with the same
bit width), so a deeper approximation can be tolerated in
reported in Table 7 as an indicator for energy efficiency. approximate multipliers.
Overall, ABM2 and LOA-3, AccuM and TruA-1, BM07 and
TruA-1, PEBM and TruA-1, and PEBM and TruA-2, are VII. C O N C L U S I O N , C H A L L E N G E S ,
effective combinations for face detection, which result in AND PROSPECTS
high TPRs and require a smaller number of MACs (thus, In this article, approximate arithmetic circuits are
a higher energy efficiency) than the accurate implemen- reviewed, characterized, and comparatively evaluated,
tation. Hence, the energy efficiency of a DNN can be using functional simulation, circuit synthesis optimized for
improved by using approximate arithmetic circuits while performance and area, and image processing and machine
achieving a similar or even higher detection accuracy. learning applications.
Finally, the normalized mean errors (NMEs) for the
face alignment are obtained by comparing the coordinate
values of the five landmarks with their standard values A. Characterization
for the AFLW data set [163], as shown in Table 8. The 1) Approximate Adders: Most of the approximate adders
combinations of designs that result in small TPRs are have been designed for high performance and low error (in
omitted. It is interesting that the MTCNNs using LOA-3 and ER) by reducing the critical path delay. Most speculative

2130 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

adders, segmented adders, and carry-select adders show approximation might be promising for hardware-efficient
low ERs. Due to the reduction of some carries, single- approximate arithmetic circuits, though leaving the chal-
sided errors are prevalent in these designs that result in lenge of enhancing its accuracy with low hardware
large error biases, especially in applications that require overhead.
iterative or repetitive additions. However, the designs
using approximate FAs in the LSBs often have high ERs,
low MREDs, and low power dissipation. With a reduced B. Applications
precision, a TruA produces a biased error with a high For image processing, the approximate adders, mul-
ER close to 100%, but it consumes a very low power. tipliers, and dividers with smaller error magnitudes (in
Considering practically effective error and circuit metrics, MREDs) generally produce results with a higher qual-
such as MRED and PDP, LOA, CCBA, and the TruAs, achieve ity. For simple operations such as the sum of products,
the best accuracy-energy tradeoffs. the approximate multipliers with lower ERs outperform
those with higher ERs. Although with very low ERs,
2) Approximate Multipliers: For unsigned designs, trun-
the approximate adders with large error biases (e.g., ACA,
cating part of the PPs or some LSBs of the input operands
ETAII, and TruA) do not work well for more complex com-
is an effective scheme to reduce circuit area, while pre-
putations such as cascaded matrix multiplications. In an
serving a moderate and variable accuracy in terms of
accumulative operation, the approximate designs with low
NMED and MRED, depending on the number of bits
error biases or double-sided errors consistently perform
truncated; examples include the BAM, TAM1, and, even,
better than those with single-sided errors.
the TruMs. LMs are relatively inaccurate, but they can
The more complex computation that involves multi-
be very efficient in performance and power consump-
ple matrix multiplications is more vulnerable to errors
tion. The truncated Wallace multiplier, compressor-based
in addition than those in multiplication. In other words,
HOCM and TAM1 are among the designs with a high
a larger approximation can be tolerated in multipliers than
performance, while the truncated array multiplier is more
in adders to achieve a reasonably accurate result. In such
power efficient at a moderate accuracy. The CGPMs can
applications, the multiplier dominates the area and power
be very accurate with a moderate performance. TAM1,
dissipation of the circuit, whereas more adders are in the
HOCM, and the logarithmic design ALM-SOA show the
critical path, so the adder plays a more important role on
best tradeoffs between energy and accuracy. For signed
the delay.
multipliers, most fixed-width Booth multipliers provide a
By using approximate adders and multipliers, an
better design tradeoff than the TBMs due to the efficient
MTCNN can achieve a comparable face detection quality
error compensation.
to the accurate design. The accuracy of the face detection
3) Approximate Dividers: For the fewer divider designs, generally decreases with the increase of the MREDs for
those approximated in the subtractor/adder cells are the approximate multipliers, so the MRED is an indicator
slow, and their accuracy varies with the approximate of the quality of an approximate multiplier. For this more
subtractor/adder design. The dividers based on func- complex application, the approximate adders and multi-
tional approximation are relatively fast. Among the pliers with large error biases result in significantly poor
considered designs, the logarithmic INZeD and input- accuracy in face detection and alignment. Compared to the
adaptive AAXD provide balanced tradeoffs with both low accurate design, a smaller number of MACs is required for
PDPs and MREDs. face detection when some approximate designs are used
The above observations are based on the investigation in an MTCNN. Hence, the use of approximate arithmetic
of 16-bit designs, so the circuit and error characteristics circuits can reduce the power consumption and improve
may vary for adders of different sizes, although some the energy efficiency of an MTCNN.
adders are based on regular structures. Multipliers and Interestingly, some combinations of approximate
dividers of different sizes may exhibit more significantly multipliers and adders lead to higher accuracy than the
different characteristics as some designs are tailored and accurate implementation, even though these circuits,
optimized for a specific bit-width; thus, the performance by themselves, do not show advantages over the others.
may degrade even though the approximation scheme is Hence, it might be more effective to design approximate
scalable. arithmetic circuits from a system’s or application’s
In general, rather limited improvements in circuit mea- perspective. A rigorous evaluation framework with
surements are observed for the approximate arithmetic trustworthy error metrics would be imperative to
circuits simplified from an accurate design. Many ad hoc ensure the reliability and robustness of the system
designs underperform simply truncated circuits. A func- with respect to the effect due to the propagation and
tional approximation algorithm such as the binary loga- statistical distributions of errors. Approximate arithmetic
rithm can lead to designs with significant savings in circuit circuits could also be integrated with other approximate
area and power consumption, albeit at the cost of a low components in a system hierarchy, such as memory and
accuracy. With the potential of breaking away from the interconnects, for a more significant improvement in
original (limiting) architecture, nevertheless, functional hardware efficiency as well as processing quality.

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2131
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

REFERENCES
[1] J. Han and M. Orshansky, “Approximate tolerant applications,” in Proc. DATE, 2010, [43] F. S. Snigdha, D. Sengupta, J. Hu, and
computing: An emerging paradigm for pp. 957–960. S. S. Sapatnekar, “Optimal design of JPEG
energy-efficient design,” in Proc. ETS, May 2013, [24] S. Venkataramani, A. Sabne, V. Kozhikkottu, K. hardware under the approximate computing
pp. 1–6. Roy, and A. Raghunathan, “SALSA: Systematic paradigm,” in Proc. DAC, 2016, p. 106.
[2] G. E. Moore, “Cramming more components onto logic synthesis of approximate circuits,” in Proc. [44] H. A. F. Almurib, T. N. Kumar, and F. Lombardi,
integrated circuits,” Electronics, vol. 38, no. 8, DAC, 2012, pp. 796–801. “Approximate DCT image compression using
p. 114, Apr. 1965. [25] Z. Vasicek and L. Sekanina, “Evolutionary inexact computing,” IEEE Trans. Comput., vol. 67,
[3] N. Hardavellas, M. Ferdman, B. Falsafi, and approach to approximate digital circuits design,” no. 2, pp. 149–159, Feb. 2018.
A. Ailamaki, “Toward dark silicon in servers,” IEEE IEEE Trans. Evol. Comput., vol. 19, no. 3, [45] M. Brandalero, A. C. S. Beck, L. Carro, and
Micro, vol. 31, no. 4, pp. 6–15, Jul. 2011. pp. 432–444, Jun. 2015. M. Shafique, “Approximate on-the-fly
[4] H. Esmaeilzadeh, E. Blem, R. St. Amant, [26] V. Mrazek, S. S. Sarwar, L. Sekanina, Z. Vasicek, coarse-grained reconfigurable acceleration for
K. Sankaralingam, and D. Burger, “Dark silicon and K. Roy, “Design of power-efficient general-purpose applications,” in Proc. DAC, 2018,
and the end of multicore scaling,” in Proc. ISCA, approximate multipliers for approximate artificial pp. 1–6.
2011, pp. 365–376. neural networks,” in Proc. ICCAD, 2016, p. 7. [46] O. Akbari, M. Kamal, A. Afzali-Kusha, M. Pedram,
[5] M. Shafique, S. Garg, J. Henkel, and D. [27] H. Esmaeilzadeh, A. Sampson, L. Ceze, and and M. Shafique, “PX-CGRA: Polymorphic
Marculescu, “The EDA challenges in the dark D. Burger, “Architecture support for disciplined approximate coarse-grained reconfigurable
silicon era: Temperature, reliability, and variability approximate programming,” ACM SIGARCH architecture,” in Proc. DATE, 2018, pp. 413–418.
perspectives,” in Proc. DAC, 2014, pp. 1–6. Comput. Archit. News, vol. 40, no. 1, pp. 301–312, [47] M. S. Ansari, V. Mrazek, B. F. Cockburn,
[6] A. Alaghi, W. Qian, and J. P. Hayes, “The promise Apr. 2012. L. Sekanina, Z. Vasicek, and J. Han, “Improving
and challenge of stochastic computing,” IEEE [28] J. S. Miguel, J. Albericio, A. Moshovos, and the accuracy and hardware efficiency of neural
Trans. Comput.-Aided Design Integr. Circuits Syst., N. E. Jerger, “Doppelgánger: A cache for networks using approximate multipliers,” IEEE
vol. 37, no. 8, pp. 1515–1531, Aug. 2018. approximate computing,” in Proc. MICRO, 2015, Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28,
[7] S. Venkataramani, S. T. Chakradhar, K. Roy, and pp. 50–61. no. 2, pp. 317–328, Feb. 2020.
A. Raghunathan, “Approximate computing and the [29] A. Sampson, W. Dietl, E. Fortuna, [48] B.-J. Yoo et al., “A 56 Gb/s 7.7 mW/Gb/s PAM-4
quest for computing efficiency,” in Proc. DAC, D. Gnanapragasam, L. Ceze, and D. Grossman, wireline transceiver in 10 nm FinFET using
2015, p. 120. “EnerJ: Approximate data types for safe and MM-CDR-based ADC timing skew control and
[8] P. Rabinowitz, “Multiple-precision division,” general low-power computation,” ACM SIGPLAN low-power DSP with approximate multiplier,” in
Commun. ACM, vol. 4, no. 2, p. 98, Feb. 1961. Notices, vol. 46, no. 6, pp. 164–174, Jun. 2011. IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig.
[9] R. E. Goldschmidt, “Applications of division by [30] J. Y. L. Low and C. C. Jong, “Non-iterative high Tech. Papers, Feb. 2020, pp. 122–124.
convergence,” Ph.D. dissertation, Massachusetts speed division computation based on Mitchell [49] C. Liu, J. Han, and F. Lombardi, “An analytical
Inst. Technol., Cambridge, MA, USA, 1964. logarithmic method,” in Proc. ISCAS, May 2013, framework for evaluating the error characteristics
[10] M. J. Flynn, “On division by functional iteration,” pp. 2219–2222. of approximate adders,” IEEE Trans. Comput.,
IEEE Trans. Comput., vol. C-19, no. 8, [31] L. Chen, J. Han, W. Liu, and F. Lombardi, “Design vol. 64, no. 5, pp. 1268–1281, May 2015.
pp. 702–706, Aug. 1970. of approximate unsigned integer non-restoring [50] H. Jiang, C. Liu, F. Lombardi, and J. Han,
[11] J. N. Mitchell, “Computer multiplication and divider for inexact computing,” in Proc. GLSVLSI, “Low-power approximate unsigned multipliers
division using binary logarithms,” IEEE Trans. 2015, pp. 51–56. with configurable error recovery,” IEEE Trans.
Electron. Comput., vol. EC-11, no. 4, pp. 512–517, [32] H. Jiang, L. Liu, F. Lombardi, and J. Han, Circuits Syst. I, Reg. Papers, vol. 66, no. 1,
Aug. 1962. “Adaptive approximation in arithmetic circuits: A pp. 189–202, Jan. 2019.
[12] Y. C. Lim, “Single-precision multiplier with low-power unsigned divider design,” in Proc. [51] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han,
reduced circuit complexity for signal processing DATE, Mar. 2018, pp. 1411–1416. “A review, classification, and comparative
applications,” IEEE Trans. Comput., vol. 41, [33] H. Saadat, H. Javaid, and S. Parameswaran, evaluation of approximate arithmetic circuits,”
no. 10, pp. 1333–1336, Oct. 1992. “Approximate integer and floating-point dividers ACM J. Emerg. Technol. Comput. Syst., vol. 13,
[13] M. J. Schulte and E. E. Swartzlander, “Truncated with near-zero error bias,” in Proc. DAC, 2019, no. 4, pp. 60:1–60:34, Aug. 2017.
multiplication with correction constant [for p. 161. [52] Q. Zhang, T. Wang, Y. Tian, F. Yuan, and Q. Xu,
DSP],” in Proc. IEEE Workshop VLSI Signal [34] D. Esposito, A. G. M. Strollo, and M. Alioto, “ApproxANN: An approximate computing
Process., Oct. 1993, pp. 388–396. “Low-power approximate MAC unit,” in Proc. framework for artificial neural network,” in Proc.
[14] S.-L. Lu, “Speeding up processing with PRIME, 2017, pp. 81–84. DATE, 2015, pp. 701–706.
approximation circuits,” Computer, vol. 37, no. 3, [35] M.-H. Sheu and S.-H. Lin, “Fast compensative [53] R. W. Allred, “Circuits, systems, and methods
pp. 67–73, Mar. 2004. design approach for the approximate squaring implementing approximations for logarithm,
[15] A. W. Burks, H. H. Goldstine, and J. von Neumann, function,” IEEE J. Solid-State Circuits, vol. 37, inverse logarithm, and reciprocal,” U.S. Patent
Preliminary Discussion of the Logical Design of an no. 1, pp. 95–97, Jan. 2002. 7 171 435, Jan. 30, 2007.
Electronic Computing Instrument. Springer, 1947. [36] Y.-H. Chen, “Area-efficient fixed-width squarer [54] S. R. Datla, M. A. Thornton, and D. W. Matula,
[16] A. K. Verma, P. Brisk, and P. Ienne, “Variable with dynamic error-compensation circuit,” IEEE “A low power high performance radix-4
latency speculative addition: A new paradigm for Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 9, approximate squaring circuit,” in Proc. ASAP,
arithmetic circuit design,” in Proc. DATE, pp. 851–855, Sep. 2015. Jul. 2009, pp. 91–97.
Mar. 2008, pp. 1250–1255. [37] M. S. Ansari, B. F. Cockburn, and J. Han, [55] J. Han, “Introduction to approximate computing,”
[17] N. Zhu, W. L. Goh, G. Wang, and K. S. Yeo, “Low-power approximate logarithmic squaring in Proc. VTS, 2016, p. 1.
“Enhanced low-power high-speed adder for circuit design for DSP applications,” IEEE Trans. [56] W. Liu, F. Lombardi, and M. Shulte, “A
error-tolerant application,” in Proc. ISIC, Emerg. Topics Comput., early access, Apr. 24, 2020, retrospective and prospective view of approximate
Nov. 2010, pp. 69–72. doi: 10.1109/TETC.2020.2989699. computing [Point of View],” Proc. IEEE, vol. 108,
[18] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and [38] K. Manikantta Reddy, M. H. Vasantha, Y. B. Nithin no. 3, pp. 394–399, Mar. 2020.
C. Lucas, “Bio-inspired imprecise computational Kumar, and D. Dwivedi, “Design of approximate [57] H. Jiang et al., “Characterizing approximate
blocks for efficient VLSI implementation of booth squarer for error-tolerant computing,” IEEE adders and multipliers optimized under different
soft-computing applications,” IEEE Trans. Circuits Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, design constraints,” in Proc. GLSVLSI, 2019,
Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, no. 5, pp. 1230–1241, May 2020. pp. 393–398.
Apr. 2010. [39] H. Jiang, L. Liu, F. Lombardi, and J. Han, [58] Y. Liu, T. Zhang, and K. K. Parhi, “Computation
[19] D. Mohapatra, V. K. Chippa, A. Raghunathan, and “Low-power unsigned divider and square root error analysis in digital signal processing systems
K. Roy, “Design of voltage-scalable meta-functions circuit designs using adaptive approximation,” with overscaled supply voltage,” IEEE Trans. Very
for approximate computing,” in Proc. DATE, IEEE Trans. Comput., vol. 68, no. 11, Large Scale Integr. (VLSI) Syst., vol. 18, no. 4,
Mar. 2011, pp. 1–6. pp. 1635–1646, Nov. 2019. pp. 517–526, Apr. 2010.
[20] V. Gupta, D. Mohapatra, A. Raghunathan, and [40] L. Chen, J. Han, W. Liu, and F. Lombardi, [59] D. Mohapatra, V. K. Chippa, A. Raghunathan, and
K. Roy, “Low-power digital signal processing using “Algorithm and design of a fully parallel K. Roy, “Design of voltage-scalable meta-functions
approximate adders,” IEEE Trans. Comput.-Aided approximate coordinate rotation digital computer for approximate computing,” in Proc. DATE, 2011,
Design Integr. Circuits Syst., vol. 32, no. 1, (CORDIC),” IEEE Trans. Multi-Scale Comput. Syst., pp. 1–6.
pp. 124–137, Jan. 2013. vol. 3, no. 3, pp. 139–151, Jul. 2017. [60] J. Chen and J. Hu, “Energy-efficient digital signal
[21] K. Y. Kyaw, W. L. Goh, and K. S. Yeo, “Low-power [41] V. K. Chippa, H. Jayakumar, D. Mohapatra, K. Roy, processing via voltage-overscaling-based residue
high-speed multiplier for error-tolerant and A. Raghunathan, “Energy-efficient recognition number system,” IEEE Trans. Very Large Scale
application,” in Proc. EDSSC, Dec. 2010, pp. 1–4. and mining processor using scalable effort Integr. (VLSI) Syst., vol. 21, no. 7, pp. 1322–1332,
[22] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading design,” in Proc. CICC, 2013. Jul. 2013.
accuracy for power with an underdesigned [42] T. Chen et al., “DianNao: A small-footprint [61] S. Ghosh and K. Roy, “Parameter variation
multiplier architecture,” in Proc. VLSID, Jan. 2011, high-throughput accelerator for ubiquitous tolerance and error resiliency: New design
pp. 346–351. machine-learning,” ACM SIGPLAN Notices, vol. 49, paradigm for the nanoscale era,” Proc. IEEE,
[23] D. Shin, “Approximate logic synthesis for error no. 4, pp. 269–284, Apr. 2014. vol. 98, no. 10, pp. 1718–1751, Oct. 2010.

2132 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

[62] V. K. Chippa, D. Mohapatra, K. Roy, S. T. no. 3, pp. 260–264, Mar. 1982. MOS/spintronic non-volatile full-adder,” in Proc.
Chakradhar, and A. Raghunathan, “Scalable effort [85] T. Han and D. A. Carlson, “Fast area-efficient VLSI NANOARCH, 2016, pp. 203–208.
hardware design,” IEEE Trans. Very Large Scale adders,” in Proc. ARITH, 1987, pp. 49–56. [108] S. Angizi, H. Jiang, R. F. DeMara, J. Han, and
Integr. (VLSI) Syst., vol. 22, no. 9, pp. 2004–2016, [86] O. J. Bedrij, “Carry-select adder,” IRE Trans. D. Fan, “Majority-based spin-CMOS primitives for
Sep. 2014. Electron. Comput., vol. EC-11, no. 3, pp. 340–346, approximate computing,” IEEE Trans.
[63] C.-H. Lin and I.-C. Lin, “High accuracy Jun. 1962. Nanotechnol., vol. 17, no. 4, pp. 795–806,
approximate multiplier with error correction,” in [87] A. Guyot, B. Hochet, and J. M. Muller, “A way to Jul. 2018.
Proc. ICCD, Oct. 2013, pp. 33–38. build efficient carry-skip adders,” IEEE Trans. [109] V. Mrazek, R. Hrbacek, Z. Vasicek, and
[64] M. S. Ansari, B. F. Cockburn, and J. Han, Comput., vol. C-36, no. 10, pp. 1144–1152, L. Sekanina, “EvoApprox8b: Library of
“A hardware-efficient logarithmic multiplier with Oct. 1987. approximate adders and multipliers for circuit
improved accuracy,” in Proc. DATE, 2019, [88] J. Sklansky, “Conditional-sum addition logic,” design and benchmarking of approximation
pp. 928–931. IEEE Trans. Electron. Comput., vol. EC-9, no. 2, methods,” in Proc. DATE, 2017, pp. 258–261.
[65] G. Zervakis, S. Xydis, K. Tsoumanis, D. Soudris, pp. 226–231, Jun. 1960. [110] P. Behrooz, Computer Arithmetic: Algorithms and
and K. Pekmestzi, “Hybrid approximate multiplier [89] A. Tyagi, “A reduced-area scheme for carry-select Hardware Designs, 2nd ed. New York, NY, USA:
architectures for improved power-accuracy adders,” IEEE Trans. Comput., vol. 42, no. 10, Oxford Univ. Press, 2010.
trade-offs,” in Proc. ISLPED, Jul. 2015, pp. 79–84. pp. 1163–1170, Oct. 1993. [111] D. Jacobsohn, “A suggestion for a fast multiplier,”
[66] J. Liang, J. Han, and F. Lombardi, “New metrics [90] M. D. Ercegovac and T. Lang, Digital Arithmetic. IEEE Trans. Electron. Comput., vol. EC-13, no. 1,
for the reliability of approximate and probabilistic Amsterdam, The Netherlands: Elsevier, 2003. p. 754, Dec. 1964.
adders,” IEEE Trans. Comput., vol. 62, no. 9, [91] A. B. Kahng and S. Kang, “Accuracy-configurable [112] L. Dadda, “Some schemes for parallel multipliers,”
pp. 1760–1771, Sep. 2013. adder for approximate arithmetic designs,” in Alta Frequenza, vol. 34, no. 5, pp. 349–356,
[67] J. Huang, J. Lach, and G. Robins, “A methodology Proc. DAC, 2012, pp. 820–825. Mar. 1965.
for energy-quality tradeoff using imprecise [92] X. Yang, Y. Xing, F. Qiao, Q. Wei, and H. Yang, [113] N. H. E. Weste and D. Harris, CMOS VLSI Design:
hardware,” in Proc. DAC, 2012, pp. 504–509. “Approximate adder with hybrid prediction and A Circuits and Systems Perspective. New Delhi,
[68] J. Miao, K. He, A. Gerstlauer, and M. Orshansky, error compensation technique,” in Proc. ISVLSI, India; Pearson, 2015.
“Modeling and synthesis of quality-energy optimal Jul. 2016, pp. 373–378. [114] A. D. Booth, “A signed binary multiplication
approximate adders,” in Proc. ICCAD, 2012, [93] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, technique,” Quart. J. Mech. Appl. Math., vol. 4,
pp. 728–735. “A low latency generic accuracy configurable no. 2, pp. 236–240, 1951.
[69] R. Venkatesan, A. Agarwal, K. Roy, and adder,” in Proc. DAC, 2015, pp. 1–6. [115] O. Macsorley, “High-speed arithmetic in binary
A. Raghunathan, “MACACO: Modeling and [94] M. A. Hanif, R. Hafiz, O. Hasan, and M. Shafique, computers,” Proc. IRE, vol. 49, no. 1, pp. 67–91,
analysis of circuits for approximate computing,” in “QuAd: Design and analysis of quality-area Jan. 1961.
Proc. ICCAD, 2010, pp. 667–673. optimal low-latency approximate adders,” in Proc. [116] C. R. Baugh and B. A. Wooley, “A two’s
[70] S. Mazahir, O. Hasan, R. Hafiz, M. Shafique, and DAC, Jun. 2017, p. 42. complement parallel array multiplication
J. Henkel, “Probabilistic error modeling for [95] K. Du, P. Varman, and K. Mohanram, “High algorithm,” IEEE Trans. Comput., vol. C-22, no. 12,
approximate adders,” IEEE Trans. Comput., performance reliable variable latency carry select pp. 1045–1047, Dec. 1973.
vol. 66, no. 3, pp. 515–530, Mar. 2017. addition,” in Proc. DATE, Mar. 2012, [117] M. Hatamian and G. L. Cash, “A 70-MHz
[71] M. K. Ayub, O. Hasan, and M. Shafique, pp. 1257–1262. 8-bit×8-bit parallel pipelined multiplier in 2.5-μm
“Statistical error analysis for low power [96] Y. Kim, Y. Zhang, and P. Li, “An energy efficient CMOS,” IEEE J. Solid-State Circuits, vol. 21, no. 4,
approximate adders,” in Proc. DAC, 2017, pp. 1–6. approximate adder with carry skip for error pp. 505–513, Aug. 1986.
[72] A. Qureshi and O. Hasan, “Formal probabilistic resilient neuromorphic VLSI systems,” in Proc. [118] J.-P. Wang, S.-R. Kuang, and S.-C. Liang,
analysis of low latency approximate adders,” IEEE IEEE/ACM Int. Conf. Comput.-Aided Design “High-accuracy fixed-width modified booth
Trans. Comput.-Aided Design Integr. Circuits Syst., (ICCAD), Nov. 2013, pp. 130–137. multipliers for lossy applications,” IEEE Trans. Very
vol. 38, no. 1, pp. 177–189, Jan. 2019. [97] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, “On Large Scale Integr. (VLSI) Syst., vol. 19, no. 1,
[73] M. A. Hanif, R. Hafiz, O. Hasan, and M. Shafique, reconfiguration-oriented approximate adder pp. 52–60, Jan. 2011.
“PEMACx: A probabilistic error analysis design and its application,” in Proc. ICCAD, 2013, [119] K. Bhardwaj, P. S. Mane, and J. Henkel, “Power-
methodology for adders with cascaded pp. 48–54. and area-efficient approximate wallace tree
approximate units,” in Proc. DAC, 2020, pp. 1–6. [98] I. Lin, Y. Yang, and C. Lin, “High-performance multiplier for error-resilient systems,” in Proc.
[74] Silvaco, Inc. Nangate, Open Cell Library. Accessed: low-power carry speculative addition with ISQED, Mar. 2014, pp. 263–269.
Sep. 20, 2019. [Online]. Available: variable latency,” IEEE Trans. Very Large Scale [120] S. Hashemi, R. Bahar, and S. Reda, “DRUM:
https://www.silvaco.com/products/ Integr. (VLSI) Syst., vol. 23, no. 9, pp. 1591–1603, A dynamic range unbiased multiplier for
nangate/Library_Design_Services/index.html Sep. 2015. approximate applications,” in Proc. ICCAD, 2015,
[75] W. Zhao and Y. Cao, “New generation of predictive [99] L. Li and H. Zhou, “On error modeling and pp. 418–425.
technology model for sub-45 nm early design analysis of approximate adders,” in Proc. ICCAD, [121] C. Liu, J. Han, and F. Lombardi, “A low-power,
exploration,” IEEE Trans. Electron Devices, vol. 53, Nov. 2014, pp. 511–518. high-performance approximate multiplier with
no. 11, pp. 2816–2823, Nov. 2006. [100] J. Hu and W. Qian, “A new approximate adder configurable partial error recovery,” in Proc. DATE,
[76] H. Amrouch, B. Khaleghi, and A. Gerstlauer, with low relative error and correct sign 2014, pp. 1–4.
“Towards aging-induced approximations,” in Proc. calculation,” in Proc. DATE, 2015, pp. 1449–1454. [122] J. Ma, K. L. Man, N. Zhang, S. Guan, and
DAC, 2017, pp. 1–6. [101] V. Camus, J. Schlachter, and C. Enz, T. T. Jeong, “High-speed area-efficient and
[77] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Real-time “Energy-efficient inexact speculative adder with power-aware multiplier design using approximate
learning capability of neural networks,” IEEE high performance and accuracy control,” in Proc. compressors along with bottom-up tree topology,”
Trans. Neural Netw., vol. 17, no. 4, pp. 863–878, ISCAS, 2015, pp. 45–48. Proc. SPIE, vol. 8784, Mar. 2013, Art. no. 87841Z.
Jul. 2006. [102] V. Camus, J. Schlachter, and C. Enz, “A low-power [123] A. Momeni, J. Han, P. Montuschi, and F. Lombardi,
[78] DC Ultra, Synopsys, Mountain View, CA, USA, carry cut-back approximate adder with fixed-point “Design and analysis of approximate compressors
2018. implementation and floating-point precision,” in for multiplication,” IEEE Trans. Comput., vol. 64,
[79] O. Akbari, M. Kamal, A. Afzali-Kusha, and Proc. DAC, 2016, p. 127. no. 4, pp. 984–994, Apr. 2015.
M. Pedram, “Dual-quality 4:2 Compressors for [103] F. Ebrahimi-Azandaryani, O. Akbari, M. Kamal, [124] S. Venkatachalam and S.-B. Ko, “Design of power
utilizing in dynamic accuracy configurable A. Afzali-Kusha, and M. Pedram, “Block-based and area efficient approximate multipliers,” IEEE
multipliers,” IEEE Trans. Very Large Scale Integr. carry speculative approximate adder for Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25,
(VLSI) Syst., vol. 25, no. 4, pp. 1352–1361, energy-efficient applications,” IEEE Trans. Circuits no. 5, pp. 1782–1786, May 2017.
Apr. 2017. Syst. II, Exp. Briefs, vol. 67, no. 1, pp. 137–141, [125] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han,
[80] F. Sabetzadeh, M. H. Moaiyeri, and Jan. 2019. “Low-power approximate multipliers using
M. Ahmadinejad, “A majority-based imprecise [104] Z. Yang, A. Jain, J. Liang, J. Han, and F. Lombardi, encoded partial products and approximate
multiplier for ultra-efficient approximate image “Approximate XOR/XNOR-based adders for compressors,” IEEE J. Emerg. Sel. Topics Circuits
multiplication,” IEEE Trans. Circuits Syst. I, Reg. inexact computing,” in Proc. IEEE-NANO, Syst., vol. 8, no. 3, pp. 404–416, Sep. 2018.
Papers, vol. 66, no. 11, pp. 4200–4208, Nov. 2019. Aug. 2013, pp. 690–693. [126] D. Esposito, A. G. M. Strollo, E. Napoli,
[81] P. M. Kogge and H. S. Stone, “A parallel algorithm [105] H. A. F. Almurib, T. N. Kumar, and F. Lombardi, D. De Caro, and N. Petra, “Approximate
for the efficient solution of a general class of “Inexact designs for approximate low power multipliers based on new approximate
recurrence equations,” IEEE Trans. Comput., addition by cell replacement,” in Proc. DATE, compressors,” IEEE Trans. Circuits Syst. I, Reg.
vol. C-22, no. 8, pp. 786–793, Aug. 1973. 2016, pp. 660–665. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
[82] R. E. Ladner and M. J. Fischer, “Parallel prefix [106] M. Pashaeifar, M. Kamal, A. Afzali-Kusha, and [127] W. Liu, J. Xu, D. Wang, C. Wang, P. Montuschi, and
computation,” J. ACM, vol. 27, no. 4, M. Pedram, “Approximate reverse carry propagate F. Lombardi, “Design and evaluation of
pp. 831–838, Oct. 1980. adder for energy-efficient DSP applications,” IEEE approximate logarithmic multipliers for low
[83] H. Ling, “High-speed binary adder,” IBM J. Res. Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, power error-tolerant applications,” IEEE Trans.
Develop., vol. 25, no. 3, pp. 156–166, Mar. 1981. no. 11, pp. 2530–2541, Nov. 2018. Circuits Syst. I, Reg. Papers, vol. 65, no. 9,
[84] R. B. Brent and H. T. Kung, “A regular layout for [107] H. Cai, Y. Wang, L. A. B. Naviner, Z. Wang, and pp. 2856–2868, Sep. 2018.
parallel adders,” IEEE Trans. Comput., vol. C-31, W. Zhao, “Approximate computing in [128] M. S. Ansari, B. F. Cockburn, and J. Han, “An

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2133
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

improved logarithmic multiplier for [139] Y.-H. Chen and T.-Y. Chang, “A high-accuracy by inexact binary signed-digit addition,” in Proc.
energy-efficient neural computing,” IEEE Trans. adaptive conditional-probability estimator for GLSVLSI, May 2017, pp. 293–298.
Comput., early access, May 6, 2020, doi: fixed-width booth multipliers,” IEEE Trans. Circuits [152] L. Chen, J. Han, W. Liu, P. Montuschi, and
10.1109/TC.2020.2992113. Syst. I, Reg. Papers, vol. 59, no. 3, pp. 594–603, F. Lombardi, “Design, evaluation and application
[129] H. Saadat, H. Javaid, A. Ignjatovic, and Mar. 2012. of approximate high-radix dividers,” IEEE Trans.
S. Parameswaran, “REALM: Reduced-error [140] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and Multi-Scale Comput. Syst., vol. 4, no. 3,
approximate log-based integer multiplier,” in Proc. F. Lombardi, “Design of approximate radix-4 pp. 299–312, Jul. 2018.
DATE, Mar. 2020, pp. 1366–1371. booth multipliers for error-tolerant computing,” [153] S. Hashemi, R. Bahar, and S. Reda, “A low-power
[130] H. Kim, M. S. Kim, A. A. Del Barrio, and IEEE Trans. Comput., vol. 66, no. 8, dynamic divider for approximate applications,” in
N. Bagherzadeh, “A cost-efficient iterative pp. 1435–1441, Aug. 2017. Proc. DAC, 2016, p. 105.
truncated logarithmic multiplication for [141] B. Shao and P. Li, “Array-based approximate [154] R. Zendegani, M. Kamal, A. Fayyazi,
convolutional neural networks,” in Proc. ARITH, arithmetic computing: A general model and A. Afzali-Kusha, S. Safari, and M. Pedram,
Jun. 2019, pp. 108–111. applications to multiplier and squarer design,” “SEERAD: A high speed yet energy-efficient
[131] V. Mrazek, Z. Vasicek, L. Sekanina, H. Jiang, and IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, rounding-based approximate divider,” in Proc.
J. Han, “Scalable construction of approximate no. 4, pp. 1081–1090, Apr. 2015. DATE, 2016, pp. 1481–1484.
multipliers with formally guaranteed worst case [142] H. Jiang, J. Han, F. Qiao, and F. Lombardi, [155] W. Liu, J. Li, T. Xu, C. Wang, P. Montuschi, and
error,” IEEE Trans. Very Large Scale Integr. (VLSI) “Approximate radix-8 booth multipliers for F. Lombardi, “Combining restoring array and
Syst., vol. 26, no. 11, pp. 2572–2576, Nov. 2018. low-power and high-performance operation,” IEEE logarithmic dividers into an approximate
[132] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, Trans. Comput., vol. 65, no. 8, pp. 2638–2644, hybrid design,” in Proc. ARITH, Jun. 2018,
and K. Pekmestzi, “Design-efficient approximate Aug. 2016. pp. 92–98.
multiplication circuits through partial product [143] S. F. Oberman and M. J. Flynn, “Design issues in [156] M. Imani, R. Garcia, A. Huang, and T. Rosing,
perforation,” IEEE Trans. Very Large Scale Integr. division and other floating-point operations,” IEEE “CADE: Configurable approximate divider for
(VLSI) Syst., vol. 24, no. 10, pp. 3105–3117, Trans. Comput., vol. 46, no. 2, pp. 154–161, energy efficiency,” in Proc. DATE, Mar. 2019,
Oct. 2016. Feb. 1997. pp. 586–589.
[133] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, [144] M. Cappa and V. C. Hamacher, “An augmented [157] L. Wu and C. C. Jong, “A curve fitting approach for
T. Park, and N. S. Kim, “Energy-efficient iterative array for high-speed binary division,” non-iterative divider design with accuracy and
approximate multiplication for digital signal IEEE Trans. Comput., vol. C-22, no. 2, performance trade-off,” in Proc. NEWCAS,
processing and classification applications,” IEEE pp. 172–175, Feb. 1973. Jun. 2015, pp. 1–4.
Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, [145] D. W. Sweeney, “Divider device for skipping a [158] M. S. K. Lau, K.-V. Ling, and Y.-C. Chu,
no. 6, pp. 1180–1184, Jun. 2015. string of zeros or radix-minus-one digits,” U.S. “Energy-aware probabilistic multiplier: Design
[134] D. Kelly, B. Phillips, and S. Al-Sarawi, Patent 3 145 296, Aug. 18, 1964. and analysis,” in Proc. CASES, 2009,
“Approximate signed binary integer multipliers for [146] J. E. Robertson, “A new class of digital division pp. 281–290.
arithmetic data value speculation,” in Proc. ECSI, methods,” IRE Trans. Electron. Comput., vol. EC-7, [159] R. C. Gonzalez and R. E. Woods, Digital Image
2009, pp. 97–104. no. 3, pp. 218–222, Sep. 1958. Processing, 3rd ed. Upper Saddle River, NJ, USA:
[135] C. Liu, “Design and analysis of approximate [147] K. D. Tocher, “Techniques of multiplication and Prentice-Hall, 2007.
adders and multipliers,” M.S. thesis, Dept. Elect. division for automatic binary computers,” Quart. [160] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face
Comput. Eng., Univ. Alberta, Edmonton, AB, J. Mech. Appl. Math., vol. 11, no. 3, pp. 364–384, detection and alignment using multitask cascaded
Canada, 2014. 1958. convolutional networks,” IEEE Signal Process.
[136] K.-J. Cho, K.-C. Lee, J.-G. Chung, and K. K. Parhi, [148] W. Liu and A. Nannarelli, “Power efficient division Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
“Design of low-error fixed-width modified booth and square root unit,” IEEE Trans. Comput., [161] H. Mo et al., “A 1.17 TOPS/W, 150fps accelerator
multiplier,” IEEE Trans. Very Large Scale Integr. vol. 61, no. 8, pp. 1059–1070, Aug. 2012. for multi-face detection and alignment,” in Proc.
(VLSI) Syst., vol. 12, no. 5, pp. 522–531, [149] M. D. Ercegovac and T. Lang, “On-the-fly DAC, Jun. 2019, p. 80.
May 2004. conversion of redundant into conventional [162] V. Jain and E. Learned-Miller, “FDDB: A
[137] M.-A. Song, L.-D. Van, and S.-Y. Kuo, “Adaptive representations,” IEEE Trans. Comput., vol. C-36, benchmark for face detection in unconstrained
low-error fixed-width booth multipliers,” IEICE no. 7, pp. 895–897, Jul. 1987. settings,” Univ. Massachusetts, Amherst, Amherst,
Trans. Fundamentals Electron., Commun. Comput. [150] L. Chen, J. Han, W. Liu, and F. Lombardi, “On the MA, USA, Tech. Rep. UM-CS-2010-009, 2010.
Sci., vol. E90-A, no. 6, pp. 1180–1187, Jun. 2007. design of approximate restoring dividers for [163] M. Kostinger, P. Wohlhart, P. M. Roth, and
[138] F. Farshchi, M. S. Abrishami, and S. M. Fakhraie, error-tolerant applications,” IEEE Trans. Comput., H. Bischof, “Annotated facial landmarks in the
“New approximate multiplier for low power vol. 65, no. 8, pp. 2522–2533, Aug. 2016. wild: A large-scale, real-world database for facial
digital signal processing,” in Proc. CADS, [151] L. Chen, F. Lombardi, P. Montuschi, J. Han, and landmark localization,” in Proc. ICCV Workshops,
Oct. 2013, pp. 25–30. W. Liu, “Design of approximate high-radix dividers Nov. 2011, pp. 2144–2151.

ABOUT THE AUTHORS

Honglan Jiang (Member, IEEE) received Hai Mo received the B.S. degree in com-
the B.Sc. and master’s degrees in instru- puter science from the Huazhong Univer-
ment science and technology from the sity of Science and Technology, Wuhan,
Harbin Institute of Technology, Harbin, China, in 2019. He is working toward
China, in 2011 and 2013, respectively, and the M.S. degree at the Institute of Micro-
the Ph.D. degree in integrated circuits and electronics, Tsinghua University, Beijing,
systems from the University of Alberta, China, on approximation algorithm and high-
Edmonton, AB, Canada, in 2018. performance computing.
She is currently a Postdoctoral Fellow with His research interests include architecture
the Institute of Microelectronics, Tsinghua University, Beijing, design, in-memory computing, and hardware implementation for
China. Her research interests include approximate computing, deep learning accelerators.
reconfigurable computing, and stochastic computing.

Francisco Javier Hernandez Santiago Leibo Liu (Senior Member, IEEE) received
received the B.Eng. degree in electronics the B.S. degree in electronic engineer-
and communications from the University of ing and the Ph.D. degree from the
Guadalajara, Guadalajara, Mexico, in 2012, Institute of Microelectronics, Tsinghua
and the M.Sc. degree in computer engineer- University, Beijing, China, in 1999 and 2004,
ing from the University of Alberta, Edmon- respectively.
ton, AB, Canada, in 2020. He is currently a Full Professor with
He is currently a Security Engineer with the Institute of Microelectronics, Tsinghua
Intel, Zapopan, Mexico. His research inter- University. His current research interests
ests include approximate computing, computer architecture, and include reconfigurable computing, mobile computing, and very
computer security. large-scale integration digital signal processing.

2134 P ROCEEDINGS OF THE IEEE | Vol. 108, No. 12, December 2020
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.
Jiang et al.: Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications

Jie Han (Senior Member, IEEE) received the Symposium on VLSI (GLSVLSI) in 2015, the NanoArch 2016, and
B.Sc. degree in electronic engineering from the 19th International Symposium on Quality Electronic Design
Tsinghua University, Beijing, China, in 1999, (ISQED) in 2018. He was nominated for the 2006 Christiaan Huy-
and the Ph.D. degree from the Delft Univer- gens Prize of Science by the Royal Dutch Academy of Science.
sity of Technology, Delft, The Netherlands, His work was recognized by Science, for developing a theory
in 2004. of fault-tolerant nanocircuits in 2005. He served as the Gen-
He is currently a Professor with the Depart- eral Chair for GLSVLSI 2017 and the IEEE International Sympo-
ment of Electrical and Computer Engineer- sium on Defect and Fault Tolerance in VLSI and Nanotechnology
ing, University of Alberta, Edmonton, AB, Systems (DFT) in 2013 and the Technical Program Committee
Canada. His research interests include approximate computing, Chair for GLSVLSI 2016, DFT 2012, and the Symposium on Sto-
stochastic computing, reliability and fault tolerance, nanoelec- chastic and Approximate Computing for Signal Processing and
tronic circuits and systems, and novel computational models for Machine Learning in 2017. He is an Associate Editor of the IEEE
nanoscale and biological applications. TRANSACTIONS ON EMERGING TOPICS IN COMPUTING (TETC), the IEEE
Dr. Han was a recipient of the Best Paper Award at the TRANSACTIONS ON NANOTECHNOLOGY , the IEEE Circuits and Sys-
International Symposium on Nanoscale Architectures (NanoArch) tems Magazine, the IEEE OPEN JOURNAL OF THE COMPUTER SOCIETY,
in 2015 and the Best Paper Nominations at the 25th Great Lakes and Microelectronics Reliability (Elsevier Journal).

Vol. 108, No. 12, December 2020 | P ROCEEDINGS OF THE IEEE 2135
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 17,2023 at 05:29:40 UTC from IEEE Xplore. Restrictions apply.

You might also like