
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 29, NO. 11, NOVEMBER 2021

Transactions Briefs
Energy-Efficient Logarithmic Square Rooter for Error-Resilient Applications
Neelam Arya, Manisha Pattanaik, and G. K. Sharma

Abstract: Approximate computing is an emerging computing technique for designing energy- and resource-efficient arithmetic circuits for error-resilient applications. Square root (SQR) computation is a fundamental and complex operation in various signal/image processing tasks. It demands high resource and energy consumption, making the square rooter a crucial design element. This brief proposes a low-complexity logarithmic-based energy-efficient approximate square rooter (LESQ) for computing integer SQR using simple addition and shift operations. A partial error compensation scheme is also suggested for improved accuracy. The proposed approximate square rooter also enables various accuracy-configurable modes to trade off error against hardware efficiency for targeted application requirements. LESQ achieves energy and area-delay savings of up to 80% and 60%, respectively, compared to an accurate array-based square-rooter design. The proposed approximate design is tested on error-tolerant applications, such as image processing and an amplitude modulation (AM) communication system.

Index Terms: Accuracy-energy tradeoff, approximate computing, approximate square rooter, error-resilient applications, restoring array.

I. INTRODUCTION

In approximate computing, accuracy can be considered an extra design dimension that could be altered to achieve significant energy/area/performance improvements for error-resilient applications. Approximate arithmetic blocks, such as adders, multipliers, and dividers, are widely targeted due to their omnipresence and ease of approximation.

Square root (SQR) is a crucial elementary arithmetic operation with long latency and high complexity [1]. Often the results of such complex arithmetic units are utilized in multipliers and adders, which affects the system's overall performance and hence demands efficient implementation of such elementary arithmetic blocks [2]. SQR operations are highly useful in computing applications like lower–upper (LU) factorization [3], Cholesky decomposition [4], and other compute-intensive applications [1]. Like division, the SQR operation can also be performed in hardware using shift and subtract operations in a restoring array-based structure [5]. Iterative algorithms, like Newton–Raphson (NR) multiplicative algorithms [6], can also be used for SQR computation. Floating-point square rooters and dividers were proposed in [1] using the Taylor-series expansion method, where a fused add/subtract/multiplication unit with division and SQR operation was proposed with minimal area overhead. The square-rooter design proposed in [7] uses a non-restoring iterative approach to design a floating-point SQR circuit for low power consumption. The work in [8] proposes an approximate unsigned square-rooter circuit using adaptive approximations. A bit-significance-based approximate square rooter using an array-based combinational architecture is proposed in [9], which saves energy by progressively eliminating least significant bits (LSBs) from the square rooter, thereby reducing switching activity at subtractor cell nodes.

This brief proposes a new approximate scheme and hardware design called LESQ to compute SQR using the logarithmic-based approach. The integer radicand $A$ is first expressed as the nearest power of 2 plus a positive residue. This expression is then expanded using the binomial series. Some terms of the series are retained, while the remaining terms are ignored to get an approximate result. An error-compensation scheme is proposed after meticulous error analysis to obtain near-accurate results. LESQ exhibits features such as minimal error bias, improved circuit parameters, easy scalability, and optimum energy-quality tradeoff. LESQ is compared with accurate and recent approximate SQR designs and is evaluated for its efficacy in error-resilient applications.

The rest of this brief is organized as follows. Section II presents the proposed approximate square-rooter design. Section III discusses the performance evaluation of the considered designs. Section IV evaluates the proposed approximate designs in the context of error-resilient applications. The conclusion is presented subsequently in Section V.

II. PROPOSED APPROXIMATE SQUARE ROOTER (LESQ)

A. Algorithmic Description

The $2n$-bit input radicand $A$ is first converted to the nearest power of 2 (smaller than the input) plus an additional residual term $x$:

$$A = 2^k + x \quad (0 \le k \le 2n - 1)$$

$$\sqrt{A} = \sqrt{2^k + x}. \tag{1}$$

Converting to the standard form and applying the binomial theorem

$$
\begin{aligned}
\sqrt{A} &= \sqrt{2^k\left(1 + \frac{x}{2^k}\right)} \\
&= 2^{k/2}\left(1 + \frac{x}{2^k}\right)^{1/2} \\
&\approx 2^{k/2}\left(1 + \frac{x}{2^{k+1}} - \frac{x^2}{2^{2k+3}} + \cdots\right), \quad \text{where } \frac{x}{2^{k+1}} < 1
\end{aligned} \tag{2}
$$

$$\sqrt{A} \approx 2^{k/2} + \frac{2^{k/2}\,x}{2^{k+1}} \tag{3}$$

$$\sqrt{A} \approx 2^{k/2} + \frac{x}{2^{k/2+1}}. \tag{4}$$
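As a rough numerical check of (4), the following Python sketch evaluates the two-term approximation in floating point and compares it with `math.sqrt`. The function name `lesq_real` and the use of a real-valued $k/2$ are illustrative assumptions; the hardware described in this brief works on integers and is not modeled here.

```python
import math


def lesq_real(a: int) -> float:
    """Evaluate (4) in real arithmetic: sqrt(A) ~ 2^(k/2) + x / 2^(k/2 + 1)."""
    if a < 1:
        raise ValueError("radicand must be a positive integer")
    k = a.bit_length() - 1   # index of the leading one, so 2^k <= A < 2^(k+1)
    x = a - (1 << k)         # residue x = A - 2^k
    return 2.0 ** (k / 2) + x / 2.0 ** (k / 2 + 1)


# Compare the approximation with the exact square root for a few radicands.
for a in (16, 100, 44100):
    exact = math.sqrt(a)
    approx = lesq_real(a)
    rel = (approx - exact) / exact
    print(f"A={a:6d}  exact={exact:9.3f}  approx={approx:9.3f}  rel.err={rel:+.4f}")
```

For $A = 100$, for instance, $k = 6$ and $x = 36$, so (4) gives $8 + 36/16 = 10.25$ against an exact value of 10. Because the first neglected series term is negative, the two-term form never falls below the exact root.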
From (2), only the first two terms of the expanded expression are retained, with the other terms being neglected as they contribute less toward the output formation. The final output $Q$ given in (4) consists of only two terms, thereby simplifying the SQR operation.

B. Hardware Implementation

The detailed hardware architecture of LESQ is shown in Fig. 1. The leading one detector (LOD) circuit finds the nearest power of 2 ($2^k$)

Manuscript received May 9, 2021; revised August 13, 2021; accepted September 16, 2021. Date of publication October 8, 2021; date of current version November 3, 2021. (Corresponding author: Neelam Arya.)
The authors are with the Department of VLSI and Embedded Systems, Atal Bihari Vajpayee Indian Institute of Information Technology and Management (ABV-IIITM), Gwalior 474015, India (e-mail: neelam@iiitm.ac.in; manishpattanaik@iiitm.ac.in; gksharma@iiitm.ac.in).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2021.3114616.
Digital Object Identifier 10.1109/TVLSI.2021.3114616
1063-8210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Dr B C Roy Engineering College. Downloaded on January 11,2022 at 10:12:39 UTC from IEEE Xplore. Restrictions apply.

to a given input $A$ ($2n$-bit). The priority encoder (PE), based on an OR-tree structure [10], is used to extract the value of $k$ from the term ($2^k$). A right shifter is used to shift $k$ to obtain $k/2$. An adder is also integrated with this shifter to evaluate the term ($k/2 + 1$). For odd values of $k$, the term ($k/2$) will give fractional values; for such cases, $k$ is changed to $k - 1$ by setting the LSB to 0 (this is not shown in Fig. 1). The residual term $x$ is obtained by subtracting $2^k$ from $A$, and the shifted term ($x/2^{(k/2)+1}$) is obtained by right shifting $x$. The two terms $2^{k/2}$ and ($x/2^{(k/2)+1}$) are added together to get the final approximate result $Q$ ($n$-bit) as given in (4). An approximate lower-part OR adder (LOA) [11] is integrated into the final addition step for summing up the last $t$ LSBs, as the LOA shows the best error versus hardware cost tradeoff. As one of the terms in the final addition stage is represented in one-hot encoding format, the LOA can be utilized without incurring any significant accuracy loss. Accurate addition is performed for the most significant bits. The error metrics for ($t < 4$) are competitive and are evaluated in this brief. The error compensator (EC) adds the mean error, derived in (9), to the final output to get a near-accurate result.

Fig. 1. Hardware implementation of LESQ.

C. Error Analysis

In LESQ designs, errors occur only for odd values of $k$, as $k$ is replaced by $k - 1$ to find the integer term $k/2$. The error distance (ED) expression for odd values of $k$ is thus given as

$$
\begin{aligned}
\mathrm{ED} &= \text{Accurate Output} - \text{Approximate Output} \\
&= \left(2^{k/2} + \frac{x}{2^{k/2+1}}\right) - \left(2^{(k-1)/2} + \frac{x}{2^{(k-1)/2+1}}\right).
\end{aligned} \tag{5}
$$

Substituting $k = 2m - 1$ for odd $k$, where $m$ is a positive integer, and rearranging the terms

$$
\begin{aligned}
\mathrm{ED} &= \frac{2^m}{2^{1/2}} + \frac{x}{2^m\,2^{1/2}} - \frac{2^m}{2} - \frac{x}{2^m} \\
&= 2^m(0.707 - 0.5) + \frac{x}{2^m}(0.707 - 1) \\
&\approx 2^m(0.2) + \frac{x}{2^m}(-0.3) \\
&= \frac{2^m \cdot 2}{10} + \frac{x}{2^m}\left(\frac{-3}{10}\right).
\end{aligned} \tag{6}
$$

After rearranging the terms

$$\mathrm{ED} = \frac{1}{10}\left(2^{m+1} - \frac{3x}{2^m}\right). \tag{7}$$

The error is always positive since $2^{m+1} > 3x/2^m$. The mean error (the mean and median errors are almost the same in this context) can also be derived from (7) by substituting the minimum and maximum values of $x$ for the $2n$-bit input

$$\text{Mean Error (ME)} = \frac{\text{Minimum Error} + \text{Maximum Error}}{2}. \tag{8}$$

For an input interval ($2^k$–$2^{k+1}$ = $2^{2m-1}$–$2^{2m}$), there are $2^{2m-1}$ distinct values, where $x$ can assume any value from 1 (max. error) to $(2^{2m-1} - 1) \approx 2^{2m-1}$ (min. error). Substituting these values in (8) and rewriting the equations, with the small term $3/2^{m+2}$ neglected in the fourth line,

$$
\begin{aligned}
\mathrm{ME} &= \frac{\frac{1}{10}\left(2^{m+1} - \frac{3}{2^m}\right) + \frac{1}{10}\left(2^{m+1} - \frac{3\cdot 2^{2m-1}}{2^m}\right)}{2} \\
&= \frac{1}{20}\left(2^{m+1} - \frac{3}{2^m} + 2^{m+1} - 3\cdot 2^{m-1}\right) \\
&= \frac{1}{5}\left(\frac{2^{2m+2}}{2^{m+2}} - \frac{3}{2^{m+2}} - \frac{3\cdot 2^{2m-1}}{2^{m+2}}\right) \\
&\approx \frac{1}{5}\left(2^m - 3\cdot 2^{m-3}\right) \\
&= \frac{1}{5}\left(\frac{8\cdot 2^m}{2^3} - 3\cdot 2^{m-3}\right) \\
&= 2^{m-3}.
\end{aligned} \tag{9}
$$

From (9), the mean error value can be calculated for odd values of $k$; the EC term can then be added to the approximated result to get an improved output. It is also evident from (9) that the mean error is noticeable only for $m \ge 3$ (that is, for $k \ge 5$). A detailed example that is compatible with Fig. 1 is shown below. In this example, the input radicand is 44 100, which has an accurate SQR of 210:

$$
\begin{aligned}
\text{Let } A &= 44\,100 = 32\,768 + 11\,332 && (2^k + x) \\
\text{LOD} &= 2^{15} && (2^k) \\
\text{PE} &= 15 && (k) \\
\text{Shifter1} &= 7 && (k/2,\ k \text{ replaced by } k-1 \text{ for odd } k) \\
\text{Subtractor} &= 11\,332 && (x = A - 2^k = 44\,100 - 32\,768) \\
\text{Shifter2} &= \frac{11\,332}{2^{7+1}} = 44 && \left(\frac{x}{2^{k/2+1}}\right) \\
\text{Decoder} &= 128 && (2^{k/2},\ k \text{ replaced by } k-1) \\
\text{Result } (Q) &= 128 + 44 = 172 && \left(2^{k/2} + \frac{x}{2^{k/2+1}}\right) \\
\text{Error (simulation)} &= \sqrt{44\,100} - 172 = 210 - 172 = 38 \\
\text{Error (analysis)} &= \frac{1}{10}\left(2^{m+1} - \frac{3x}{2^m}\right) = \frac{1}{10}\left(2^{9} - \frac{3 \cdot 11\,332}{2^{8}}\right) \approx 38 && \left(m = \frac{k+1}{2} = 8\right) \\
\text{EC term} &= 2^{8-3} = 2^5 = 32 && (2^{m-3}) \\
\bar{Q} &= 172 + 32 = 204 && (Q + 32).
\end{aligned} \tag{10}
$$

The derived error in (7) agrees with the simulation results. Partial error correction is performed in this design, as complete accuracy is not the goal of approximate designs. The error becomes substantial when the input operands are close to the maximum power of two intervals. For example, an $n$-bit input can be divided into intervals ($2^k$–$2^{k+1}$, $2^{k+1}$–$2^{k+2}$, ..., $0 \le k < n$), thereby forming $n$ intervals with ($k = 0, 1, \ldots, n - 1$). It can be easily found that 65% of the total inputs are in the intervals where $k$ is odd. Thus, the error rate (ER) of LESQ is around 65%, which is also evident from the error simulation results in Table I. It is also observed that the error accumulates in the subsequent intervals ($2^k$–$2^{k+1}$, $2^{k+2}$–$2^{k+3}$, ..., $0 \le k < n$; for odd $k$). The error is primarily accumulated in the last intervals ($2^{n-3}$–$2^{n-2}$) and ($2^{n-1}$–$2^n$), as these intervals contain the majority of the input operands. For the intervals where $k$ is even, a negligible error is observed. The same error behavior is noticeable for higher input bit-widths in LESQ. The mean error values can be added to the final output to obtain better output accuracy.

TABLE I
PERFORMANCE COMPARISON OF 16-BIT SQUARE ROOTERS

Fig. 2. MRED versus energy savings (%).
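The datapath and error compensation of Section II can be sketched as a small integer model in Python. This is a minimal sketch under stated assumptions, not the authors' RTL: the final addition is exact (the LOA is not modeled), right shifts truncate, and the EC term $2^{m-3}$ is applied only for odd $k$ with $m \ge 3$. The names `lesq` and `odd_k_fraction` are hypothetical.

```python
def lesq(a: int, compensate: bool = False) -> int:
    """Integer model of the LESQ datapath for a positive radicand `a`."""
    if a < 1:
        raise ValueError("radicand must be a positive integer")
    k = a.bit_length() - 1               # LOD + PE: index of the leading one
    x = a - (1 << k)                     # subtractor: residue x = A - 2^k
    k_odd = k & 1
    half = (k - k_odd) >> 1              # shifter 1: clear the LSB of k, then halve
    q = (1 << half) + (x >> (half + 1))  # decoder term 2^(k/2) plus shifter 2 term
    if compensate and k_odd:
        m = (k + 1) // 2                 # k = 2m - 1
        if m >= 3:                       # assumption: skip EC where 2^(m-3) < 1
            q += 1 << (m - 3)            # EC term from (9)
    return q


def odd_k_fraction(bits: int) -> float:
    """Share of nonzero `bits`-wide inputs whose leading-one index k is odd."""
    total = (1 << bits) - 1
    # The interval [2^k, 2^(k+1)) contains 2^k values.
    odd = sum(1 << k for k in range(1, bits, 2))
    return odd / total


print(lesq(44100), lesq(44100, compensate=True))
print(f"{odd_k_fraction(16):.3f}")
```

For the radicand 44 100 the model reproduces $Q = 172$ and the compensated value 204 from (10). The share of nonzero 16-bit inputs with odd $k$ evaluates to two-thirds, close to the roughly 65% quoted in the text.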

III. PERFORMANCE EVALUATION

The proposed, existing approximate, and exact SQR designs were described in Verilog-HDL and synthesized using the Cadence RTL Compiler for Nangate 45-nm CMOS [12] technology at a typical process corner. The accurate square-rooter circuit is designed using array-based hardware [8] and the NR multiplicative algorithm [6]. The error bias (EB) [13], ER, and the mean relative error distance (MRED) [14] error metrics were calculated by performing an exhaustive simulation of 16-bit SQR designs.

The proposed designs were compared with the AXSR3-t and AASR-t approximate square-rooter circuits [8], where t indicates the replacement depth for AXSR3 designs and the bit-width of the input radicand in AASR, respectively. For the proposed approximate design, LESQ-ECt indicates LESQ with error compensation (LESQ-EC) using the LOA approximate adder for processing the least significant t bits. The energy-quality tradeoff of each considered design is also evaluated using two figures of merit (FoM1 and FoM2), defined as

$$\mathrm{FoM1} = \mathrm{MRED} \times \mathrm{Energy}; \quad \mathrm{FoM2} = \mathrm{ER} \times \mathrm{Energy}. \tag{11}$$

A lower FoM value indicates a better compromise between hardware efficiency and accuracy.

Table I demonstrates that LESQ-EC achieves minimum EB, ER, and MRED values compared to other state-of-the-art designs. With an increase in t, these values show a modest increase as approximate adders process the LSBs. ER remains unaltered when using the EC unit, as the error-correction strategy only partially corrects the error, thereby improving the MRED and EB values. All designs are compared in terms of area, delay, power, energy (product of power and delay), and area–delay product (ADP). Amongst all the implemented designs, the proposed LESQ architectures are faster and also exhibit area and energy efficiency. LESQ-EC is almost 50% faster and 85% more energy-efficient than an accurate array-based counterpart. LESQ-EC4 shows minimum ADP and energy values, implying that the proposed design accomplishes significant area and energy efficiency. AASR-8 shows good performance in terms of hardware parameters amongst the other considered approximate designs, although LESQ-EC is 22% more energy-efficient than AASR-8. The FoM values show that LESQ achieves a balanced tradeoff between energy and quality, making it attractive to use in error-tolerant applications with varying accuracy and energy requirements. Both FoM values for the LESQ cluster are lower than those of the other approximate designs. Fig. 2 shows the MRED versus energy savings (%) for the considered approximate designs. LESQ-EC4 shows maximum energy savings with a moderate MRED value, closely followed by the other LESQ configurations. AASR-8 also shows high energy savings with moderate MRED values, as the design utilizes a 50% smaller square rooter for computing the output.

IV. CASE STUDY

A. Edge Detection

The application evaluation was performed in MATLAB, where a MATLAB script was used to extract the pixel values of an 8-bit grayscale image (scaled to 16-bit values). The data was then processed by exact and approximate SQR designs in Xilinx Vivado 2019.1. The outputs obtained from the simulation tool were converted back to image form using MATLAB. The Sobel edge detection method is generally used to detect edges in both the horizontal ($G_x$) and vertical ($G_y$) directions. The pixel values of the image were convolved with the Sobel kernel to detect the edges. The gradient magnitude ($G$) is given by

$$G = \sqrt{G_x^2 + G_y^2} \tag{12}$$

which is the SQR of the sum of the squared horizontal and vertical gradients. The peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) metrics were calculated to show the quality of the output images.

B. Analog Communication System

AM is a modulation technique used extensively for analog communication [15]. A MATLAB program was used to generate a 250-Hz random message signal $m(t)$ and a 100-kHz carrier signal $A_c \cos(\omega_c t)$. At the receiver, additive white Gaussian noise (AWGN) with a signal-to-noise ratio (SNR) of 5 dB was added to $S_{AM}(t)$ for real-time experimentation. The received signal is then passed through a fifth-order low-pass Butterworth filter with a cut-off frequency of 200 Hz. An envelope detector (which uses the accurate/approximate SQR operation) is used for generating the demodulated signal $Y(t)$:

$$Y(t) = \sqrt{\left[A_0 + m(t) + n(t)\right]^2}. \tag{13}$$

Both 16-bit and 32-bit designs were used to evaluate the AM results. The results for AM signals are numerically evaluated using two quality metrics, SNR and EUD. The quality metrics evaluation for both applications is shown in Table II. The average PSNR and SSIM values indicate that LESQ-EC performs best, followed by AASR-12 and AXSR3-10. For demodulated waveforms, the quality metrics for 32-bit designs are superior to those for 16-bit designs due to higher-precision input values. The quality metrics further improve for LESQ when using LESQ-EC, due to partial error correction. AASR designs also exhibit noteworthy quality metrics for both 32-bit and 16-bit circuit evaluation.

TABLE II
QUALITY METRICS OF ERROR-RESILIENT APPLICATIONS

V. CONCLUSION

This brief proposed a new logarithmic-based square rooter for improved energy efficiency using approximations in the binomial series expansion. The proposed approximate design has desirable features, such as energy-quality scalability and minimal error bias with significant energy savings. An EC is designed to achieve better accuracy metrics with minimal area and power overhead. The proposed designs have been investigated thoroughly and compared with recent and accurate designs for performance evaluation. LESQ is more than 80% more energy-efficient and occupies up to 35% less area than the accurate array-based counterpart. The approximate SQR designs are also tested on relevant image processing and analog communication benchmarks. The quality metrics indicate the viability of the proposed designs for energy-efficient data processing in portable and smart electronics.

REFERENCES

[1] T.-J. Kwon and J. Draper, “Floating-point division and square root using a Taylor-series expansion algorithm,” Microelectron. J., vol. 40, no. 11, pp. 1601–1605, Nov. 2009.
[2] W. Liu and A. Nannarelli, “Power efficient division and square root unit,” IEEE Trans. Comput., vol. 61, no. 8, pp. 1059–1070, Apr. 2012.
[3] X. Wang and S. G. Ziavras, “A configurable multiprocessor and dynamic load balancing for parallel LU factorization,” in Proc. 18th Int. Parallel Distrib. Process. Symp., 2004, p. 234.
[4] S. G. Haridas and S. G. Ziavras, “FPGA implementation of a Cholesky algorithm for a shared-memory multiprocessor architecture,” Parallel Algorithms Appl., vol. 19, no. 4, pp. 211–226, Dec. 2004.
[5] B. Parhami, Computer Arithmetic. New York, NY, USA: Oxford Univ. Press, 2010.
[6] M. D. Ercegovac and T. Lang, Digital Arithmetic. Amsterdam, The Netherlands: Elsevier, 2004.
[7] S. Suresh, S. F. Beldianu, and S. G. Ziavras, “FPGA and ASIC square root designs for high performance and power efficiency,” in Proc. IEEE 24th Int. Conf. Appl.-Specific Syst., Archit. Processors, Jun. 2013, pp. 269–272.
[8] H. Jiang, L. Liu, F. Lombardi, and J. Han, “Low-power unsigned divider and square root circuit designs using adaptive approximation,” IEEE Trans. Comput., vol. 68, no. 11, pp. 1635–1646, Nov. 2019.
[9] N. Arya, T. Soni, M. Pattanaik, and G. K. Sharma, “Bit significance based reconfigurable approximate restoring dividers and square rooters,” Microelectron. J., vol. 104, Oct. 2020, Art. no. 104861.
[10] W. Liu, J. Xu, D. Wang, C. Wang, P. Montuschi, and F. Lombardi, “Design and evaluation of approximate logarithmic multipliers for low power error-tolerant applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 9, pp. 2856–2868, Sep. 2018.
[11] A. Dalloo, A. Najafi, and A. Garcia-Ortiz, “Systematic design of an approximate adder: The optimized lower part constant-OR adder,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1595–1599, Aug. 2018.
[12] J. Knudsen, Nangate 45 nm Open Cell Library. Scherpenzeel, The Netherlands: EMEA, 2008.
[13] H. Saadat, H. Javaid, and S. Parameswaran, “Approximate integer and floating-point dividers with near-zero error bias,” in Proc. 56th ACM/IEEE Design Automat. Conf., Jun. 2019, pp. 1–6.
[14] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of approximate and probabilistic adders,” IEEE Trans. Comput., vol. 62, no. 9, pp. 1760–1771, Sep. 2013.
[15] H. Taub and D. L. Schilling, Principles of Communication Systems. New York, NY, USA: McGraw-Hill, 1986.
