An Efficient Architecture For Signed Carry Save Multiplication

IEEE LETTERS OF THE COMPUTER SOCIETY, VOL. 3, NO. 1, JANUARY–JUNE 2020

Pramod Patali and Shahana Thottathikkulam Kassim

The authors are with the Division of Electronics, School of Engineering, Cochin University of Science and Technology, Kochi, Kerala 682022, India. E-mail: pramodp2006@gmail.com, shahanatk@cusat.ac.in.
Manuscript received 1 Nov. 2019; revised 1 Jan. 2020; accepted 25 Jan. 2020. Date of publication 3 Feb. 2020; date of current version 25 Feb. 2020.
(Corresponding author: Pramod Patali.)
Recommended for acceptance by I. Iliadis.
Digital Object Identifier no. 10.1109/LOCS.2020.2971443

Abstract—The performance of a digital signal processing (DSP) system is greatly affected by the performance of its multiplication operations. Simultaneous improvement in performance metrics such as delay, power, area, and energy efficiency is difficult to achieve and is a challenge to be addressed. To this end, an efficient carry save multiplier (CSM) that employs a modified square root carry select adder (MSCA) for the vector-merging addition and an improved full adder (IFA) in place of the conventional full adder is proposed. Among 16 × 16 multipliers, the critical path delay (CPD), power, area, power delay product (PDP), and area delay product (ADP) of the proposed CSM are improved by 27.74, 19.4, 46.2, 41.4, and 60.87 percent respectively in comparison with the improved Booth multiplier, and by 46.43, 31.46, 36.9, 63.05, and 65.96 percent respectively in comparison with the low PDP Booth multiplier. Cadence software with the gpdk 45 nm standard cell library is used for the design and implementation.

Index Terms—Computer arithmetic, low-power design, processors, VLSI systems

1 INTRODUCTION
Real-time digital signal processing (DSP) architectures require a low-complexity, delay- and energy-efficient multiplier in order to sustain high-speed processing of input data [1]. Various multiplication schemes have been proposed over the years [2]–[5]. A radix-4 8 × 8 bit multiplier using an improved binary to two's complement converter (BTC) is introduced in [6]. The improvement in delay through the use of the improved BTC was negated by the serial processing of data through the stages. A conventional carry save multiplier (CSM) has a simple and regular structure. In a CSM, the carry bits are not immediately added, but saved to pass diagonally downwards. Though the speed is improved through the carry save operation, the delay performance is affected by the final vector-merging adder. A delay and energy efficient modular hybrid adder is discussed in [7]. An efficient CSM that uses the high-speed and energy-efficient MSCA [8] for vector-merging addition and an improved full adder in place of the conventional one is proposed here.

The rest of this paper is organized as follows. The conventional CSM is discussed in Section 2 and the proposed CSM is introduced in Section 3. The performance comparison of various multipliers is done in Section 4. The conclusion is given in Section 5.

2 CONVENTIONAL CARRY SAVE MULTIPLIERS
An unsigned 4 × 4 CSM is shown in Fig. 1. It consists of 3 rows of half and full adders for the addition of partial products and a vector-merging adder for the generation of the final multiplication result. Here G_{ij} (i = 0 to 3; j = 0 to 3) represents the partial product and P_k (k = 0 to 7) represents the bit-wise multiplication result.

    G_{ij} = A_i B_j,    (1)

where A_i and B_j (i = 0 to 3; j = 0 to 3) respectively represent the multiplicand and multiplier bits.

A gate-level implementation of a low-power and small-area approximate multiplier is discussed in [9]. A significance-driven logic compression based energy-efficient approximate multiplier is presented in [10]. The modified Baugh-Wooley scheme for 8 × 8 signed multiplication is shown in Fig. 2. Here i = 0 to 7, j = 0 to 7 and k = 0 to 15.

The CPD of the conventional 8 × 8 signed CSM using the modified Baugh-Wooley scheme may be given by

    T_{S8} = T_{and2} + T_{has} + 5 T_{fas} + T_{mer8}    (2)

    T_{S8} = T_{and2} + T_{has} + 5 T_{fas} + T_{fadc} + 6 T_{fac} + T_{has}    (3)

where T_{and2} is the delay incurred by a two-input AND gate, T_{fas} is the delay incurred by a full adder for the generation of sum, T_{fac} is the delay from carry input to carry output of a full adder, T_{fadc} is the delay from data input to carry output of a full adder and T_{has} is the delay incurred by a half adder for the generation of sum. Equation (3) expands the vector-merging adder term T_{mer8} of equation (2).

For a conventional full adder,

    T_{fas} = 2 T_{xor2}    (4)
    T_{fadc} = T_{xor2} + T_{and2} + T_{or2}    (5)
    T_{fac} = T_{and2} + T_{or2}    (6)
    T_{has} = T_{xor2}.    (7)

From equations (3), (4), (5), (6) and (7), the CPD is obtained as

    T_{S8} = 14 T_{xor2} + 8 T_{and2} + 7 T_{or2}.    (8)

3 MODIFIED CARRY SAVE MULTIPLIER
The modified CSM is developed by incorporating the following strategies.

1. The conventional full adder (FA) structure is replaced by the improved full adder (IFA).
2. The conventional vector-merging adder (CVMA) is replaced by the delay and energy efficient MSCA.

3.1 Improved Full Adder
The sum (S) and the carry (C_o) outputs of a conventional full adder shown in Fig. 3a may be represented by

    S = (A \oplus B) \oplus C_{in}    (9)

    C_o = AB + (A \oplus B) C_{in}.    (10)

An improved full adder (IFA) using logic decomposition and Boolean term sharing is shown in Fig. 3b. The sum (S_i) and the carry (C_{io}) outputs of the IFA may be represented as

    S_i = (((A + B)(AB)')' \oplus C_{in})'    (11)

    C_{io} = ((AB)' ((A + B) C_{in})')'.    (12)

The carry propagation delays are improved as follows.

    T_{ifadc} = T_{or2} + 2 T_{nand2}    (13)
    T_{ifac} = 2 T_{nand2},    (14)

where T_{ifadc} represents the propagation delay from data inputs to carry output and T_{ifac} represents the propagation delay from carry input to carry output of the IFA.

The performance comparison of conventional and improved full adders at 45 nm is shown in Table 1.
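The equivalence of the two adder formulations can be checked exhaustively. The sketch below is an illustration only, not the authors' implementation: it evaluates equations (9)-(10) and (11)-(12) over all eight input combinations, and the function names (`conventional_fa`, `improved_fa`, `adders_agree`) are hypothetical.

```python
from itertools import product


def inv(x):
    """Complement of a single bit (0 or 1)."""
    return x ^ 1


def conventional_fa(a, b, cin):
    """Conventional full adder, equations (9) and (10)."""
    s = (a ^ b) ^ cin
    co = (a & b) | ((a ^ b) & cin)
    return s, co


def improved_fa(a, b, cin):
    """Improved full adder, equations (11) and (12).

    The terms (AB)' and (A + B) are computed once and shared
    between the sum and carry expressions.
    """
    nand_ab = inv(a & b)                   # (AB)'
    or_ab = a | b                          # A + B
    xnor_ab = inv(or_ab & nand_ab)         # ((A + B)(AB)')'
    s = inv(xnor_ab ^ cin)                 # equation (11)
    co = inv(nand_ab & inv(or_ab & cin))   # equation (12)
    return s, co


def adders_agree():
    """True if both formulations match on all 8 input combinations."""
    return all(
        conventional_fa(*bits) == improved_fa(*bits)
        for bits in product((0, 1), repeat=3)
    )
```

Calling `adders_agree()` returns `True`, since (A + B)(AB)' is A XOR B and the carry expression is the NAND-NAND form of AB + (A + B)C_{in}.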
2573-9689 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Saveetha Engineering College. Downloaded on February 04,2021 at 04:06:17 UTC from IEEE Xplore. Restrictions apply.

TABLE 1
Comparison of Conventional Full Adder (FA) and Improved Full Adder (IFA) at 45 nm After Digital Synthesis Using RTL Compiler v11.10

Full adder   Sum generation delay (ps)   Carry generation delay (ps)   Area (sq. mm)   Power (nW)
FA           72                          79                            10              230
IFA          64                          44                            9               166

Carry generation delay: T_{fadc} for FA, T_{ifadc} for IFA.
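The relative gains implied by Table 1 can be worked out directly. The following is a minimal sketch that only restates the table values; the dictionary layout and the helper names (`percent_reduction`, `ifa_reductions`) are assumptions made for illustration.

```python
# Values copied from Table 1 (delays in ps, area in the table's units, power in nW).
TABLE_1 = {
    "FA": {"sum_delay_ps": 72, "carry_delay_ps": 79, "area": 10, "power_nw": 230},
    "IFA": {"sum_delay_ps": 64, "carry_delay_ps": 44, "area": 9, "power_nw": 166},
}


def percent_reduction(baseline, improved):
    """Reduction of `improved` relative to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


def ifa_reductions():
    """Percentage reduction of each IFA metric relative to the FA, to one decimal."""
    fa, ifa = TABLE_1["FA"], TABLE_1["IFA"]
    return {key: round(percent_reduction(fa[key], ifa[key]), 1) for key in fa}
```

With these inputs the largest gain is in the carry generation delay (about 44.3 percent), which is the delay that accumulates along a carry chain.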

Fig. 1. An unsigned 4 x 4 carry save multiplier.


TABLE 2
Comparison of Performance of Conventional Vector Merging Adder (CVMA) and MSCA of 8 × 8 Multiplier at 45 nm

Adder   Delay (ps)   Area (sq. mm)   Power (nW)
CVMA    394          76              2198
MSCA    174          94              2353
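Table 2 lists a higher area and power for the modified square root carry select adder but a much lower delay. One way to weigh these against each other is to form the power delay product (PDP) and area delay product (ADP) used elsewhere in the paper. The sketch below does this arithmetic on the Table 2 values; the derived figures are not reported by the authors, and the names `pdp_fj` and `adp` are hypothetical.

```python
# Values copied from Table 2 (delay in ps, area in the table's units, power in nW).
ADDERS = {
    "CVMA": {"delay_ps": 394, "area": 76, "power_nw": 2198},
    "MSCA": {"delay_ps": 174, "area": 94, "power_nw": 2353},
}


def pdp_fj(power_nw, delay_ps):
    """Power delay product in fJ: 1 nW x 1 ps = 1e-21 J = 1e-6 fJ."""
    return power_nw * delay_ps * 1e-6


def adp(area, delay_ps):
    """Area delay product in (table area unit) x ps."""
    return area * delay_ps


def adder_products():
    """PDP and ADP for each adder in Table 2."""
    return {
        name: {
            "pdp_fj": pdp_fj(v["power_nw"], v["delay_ps"]),
            "adp": adp(v["area"], v["delay_ps"]),
        }
        for name, v in ADDERS.items()
    }
```

Under this arithmetic both products come out lower for the MSCA than for the CVMA, because the delay reduction outweighs the area and power increase.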

Fig. 2. Modified Baugh-Wooley scheme for 8 × 8 signed multiplication.

Fig. 3. (a) Conventional full adder. (b) Improved full adder.

It can be found that the carry propagation delay, area and power of the IFA are improved in comparison with the FA. Table 2 shows the comparison of performance of the MSCA and the conventional vector-merging adder (CVMA, a ripple carry adder). An 8 × 8 modified CSM showing the critical path is depicted in Fig. 4.

3.2 Critical Path Delay of the Modified Carry Save Multiplier
The CPD of the proposed 8 × 8 CSM is given by

    T_{M8} = T_{nand2} + T_{has1} + 5 T_{ifas} + T_{ifadc} + T_{mcsa7},    (15)

where T_{nand2} is the delay incurred by a two-input NAND gate, T_{has1} is the delay incurred by the half adder (HA1) for the generation of sum, T_{ifas} is the delay incurred by the improved full adder for the generation of sum, T_{ifadc} is the delay incurred by the improved full adder for the generation of carry from its data inputs, T_{mcsa7} is the delay incurred by the 7-bit MSCA and T_{xnor2} is a two-input XNOR gate delay.

    T_{has1} = T_{xnor2}    (16)

    T_{ifas} = T_{or2} + T_{nand2} + T_{xnor2}    (17)

    T_{mcsa7} = T_{cpg0} + T_{mcg0} + T_{mcg1} + T_{cs2} + T_{sg2},    (18)

where

    T_{cpg0} = T_{or2}    (19)

    T_{mcg0} = 2 T_{and2} + T_{nor2}    (20)

    T_{mcg1} = T_{nor2} + T_{or2}    (21)

    T_{cs2} = T_{and2} + T_{nor2}    (22)

    T_{sg2} = 2 T_{nand2}.    (23)

From equations (18), (19), (20), (21), (22) and (23),

    T_{mcsa7} = 3 T_{and2} + 3 T_{nor2} + 2 T_{or2} + 2 T_{nand2}.    (24)

Combining equations (13), (16), (17) and (24) with equation (15), the CPD of the proposed 8 × 8 CSM may be obtained as

    T_{M8} = 5 T_{xnor2} + T_{xor2} + 10 T_{nand2} + 8 T_{or2} + 3 T_{and2} + 3 T_{nor2}.    (25)

Comparing equations (8) and (25), it can be found that the conventional CSM has 14 complex logic (XOR) gates along the critical path whereas the proposed modified CSM has only 6 complex logic (XOR and XNOR) gates. Of the logic gates along the critical path of the proposed multiplier, 18 are inverting gates (5 XNOR, 10 NAND and 3 NOR). This results in a reduction in critical path length. It may be noted that, in order to improve the performance further, the AND gates of the two initial rows (except A_0 B_0) are replaced by NAND gates. The design of the half adder (HA0) is changed correspondingly to accommodate this. The half adder (HA1) along the critical path is designed so as to ensure a NAND-XNOR combination in place of an AND-XOR combination. The modified square root CSLA used for vector-merging addition consists of 3 adder segments. Each of the adder segments of the MSCA consists of the following sections.

1. CPG: The carry propagate and generate (CPG) block generates the propagate and generate functions.
2. NCC: The NAND carry chain (NCC) generates 2 carry rows, one for input carry C_{in} = 0 and the other for C_{in} = 1.
3. CS: The carry select (CS) block selects one among two possible carries depending upon the input carry.
4. SG: The sum generation (SG) block generates the final sum.
5. MCG: The module carry generation (MCG) block generates the module end carry.

The MSB of the multiplication result, P_{15}, is obtained by adding bit '1' to the carry output C_{14} of the previous (14th) bit addition and may be expressed by (26)


    P_{15} = 1 \oplus C_{14} = C_{14}'.    (26)

Thus the need for a half adder is avoided and the vector-merging addition is further optimized. The CPD of the proposed 16 × 16 carry save multiplier is given by

    T_{M16} = T_{nand2} + T_{has1} + 13 T_{ifas} + T_{ifadc} + T_{mcsa15}.    (27)

For N × N multiplication, the CPD of the proposed carry save multiplier is given by

    T_{MN} = T_{nand2} + T_{has1} + (N - 3) T_{ifas} + T_{ifadc} + T_{mcsa(N-1)},    (28)

where T_{mcsa15} is the delay incurred by a 15-bit MSCA and T_{mcsa(N-1)} is the delay incurred by an (N-1)-bit MSCA.

Fig. 4. An 8 × 8 modified carry save multiplier showing the critical path. The logic elements along the critical path are highlighted.

4 PERFORMANCE COMPARISON
The performances of the multipliers are compared in terms of CPD, area, power, PDP, and ADP. All the designs have been developed using VHDL (structural modeling) and synthesized in Cadence RTL Compiler v11.10 using a 45 nm standard cell library. For ease of comparison, the multipliers are represented as follows.

1. Prop. CSM represents the proposed signed CSM that uses the IFA in place of the FA and the MSCA for vector-merging addition.
2. Imp. BM represents the improved Booth multiplier that uses the MSCA for addition of partial products in [8].
3. LCBM represents the low-cost Booth multiplier in [4].
4. LP BM represents the low PDP Booth multiplier in [6].
5. C. WTM represents the conventional Wallace tree multiplier in [2].
6. VM represents the signed Vedic multiplier in [5].
7. CSM represents the conventional carry save multiplier in [2].
8. C. BM represents the conventional Booth multiplier in [6].
9. AM represents the array multiplier in [1].
10. SDCM represents the significance-driven logic compression based (2-bit SDLC) approximate multiplier in [10].

The results of the various signed multipliers in terms of CPD, power, area, PDP and ADP at 45 nm (with Vdd = 1.1 V) after digital synthesis using RTL Compiler v11.10 are presented in Table 3. The multipliers (2 to 10) are re-implemented in 45 nm. The CPD of the proposed 8 × 8 multiplier is reduced by 1.44, 5.23, 23.95, 40.77, 43.49, 30, 42.96 and 50.72 percent respectively in comparison with multipliers 2, 3, 4, 5, 6, 7, 8 and 9. The CPD of the proposed 16 × 16 multiplier is reduced by 27.74, 10.9, 46.43, 42.48, 46.33, 32.32, 44.7 and 53.57 percent respectively in comparison with multipliers 2, 3, 4, 5, 6, 7, 8 and 9. The total power of the proposed 8 × 8 multiplier is reduced by 21.96, 26.23, 21.75, 17.54, 43.06, 18.51, 61.24 and 7.5 percent respectively in comparison with multipliers 2, 3, 4, 5, 6, 7, 8 and 9. Among the 16 × 16 multipliers, the total power of the proposed 16 × 16 multiplier is reduced
by 19.4, 31.9, 31.46, 17.64, 24.84, 14.99, 50.79 and 13.17 percent respectively in comparison with multipliers 2, 3, 4, 5, 6, 7, 8 and 9. The area of the proposed 8 × 8 multiplier is reduced by 26.5, 25.3, 30.5, 19.4, 48.5, 14.7, 57.2 and 18.1 percent in comparison with multipliers 2, 3, 4, 5, 6, 7, 8 and 9 respectively, whereas among 16 × 16 multipliers it is reduced by 46.2, 33.8, 36.9, 13.2, 58.1, 12.3, 55 and 14.4 percent respectively.

TABLE 3
Results of Various Signed Multipliers in Terms of Critical Path Delay (CPD), Power, Area, PDP, and ADP at 45 nm After Digital Synthesis Using RTL Compiler v11.10

Size      Multiplier   CPD (ns)   Power (nW)   Area (sq. mm)   PDP (fJ)   ADP (sq. mm x ns)
8 x 8     Prop. CSM    0.616      19412        529             12         326
8 x 8     Imp. BM [8]  0.625      24874        720             16         450
8 x 8     LCBM [4]     0.65       26314        708             17         460
8 x 8     LP BM [6]    0.81       24807        761             20         616
8 x 8     C. WTM [2]   1.04       23542        656             24         682
8 x 8     VM [5]       1.09       34092        1028            37         1121
8 x 8     CSM [2]      0.88       23821        620             21         546
8 x 8     C. BM [6]    1.08       50087        1237            54         1336
8 x 8     AM [1]       1.25       20985        646             26         808
8 x 8     SDCM [10]    0.64       14977        405             10         259
16 x 16   Prop. CSM    1.19       120757       2245            144        2680
16 x 16   Imp. BM [8]  1.64       149829       4176            246        6849
16 x 16   LCBM [4]     1.33       177314       3390            236        4509
16 x 16   LP BM [6]    2.21       176185       3559            390        7873
16 x 16   C. WTM [2]   2.06       146627       2587            302        5329
16 x 16   VM [5]       2.21       160675       5361            355        11837
16 x 16   CSM [2]      1.75       142048       2559            249        4481
16 x 16   C. BM [6]    2.14       245396       4985            526        10683
16 x 16   AM [1]       2.55       139067       2623            355        6694
16 x 16   SDCM [10]    1.4        100927       1918            141        2685

TABLE 4
Comparison of Performance of the Proposed 20 × 20 Multiplier and SDCM at 45 nm

Multiplier   Delay (ns)   Power (nW)   Area (sq. mm)   PDP (fJ)   ADP (sq. mm x ns)
Prop. CSM    1.45         226334       3545            327        5123
SDCM [10]    2.83         169039       2944            479        8340

Among the 8 × 8 multipliers, the PDP of the proposed multiplier is reduced by 23.08, 30.09, 40.49, 51.16, 67.82, 42.96, 77.89 and 54.41 percent respectively in comparison with multipliers 2, 3, 4, 5, 6, 7, 8 and 9, whereas among 16 × 16 multipliers it is reduced by 41.4, 30.09, 63.05, 52.33, 59.41, 42.11, 72.62 and 59.43 percent respectively. The ADP of the proposed 8 × 8 multiplier is reduced by 27.59, 29.19, 47.14, 52.24, 70.92, 40.27, 75.61 and 59.64 percent respectively in comparison with multipliers 2, 3, 4, 5, 6, 7, 8 and 9, whereas among 16 × 16 multipliers it is reduced by 60.87, 40.56, 65.96, 49.71, 77.36, 40.19, 74.91 and 59.96 percent respectively. It is found that the proposed 8 × 8 and 16 × 16 multipliers outperform all the accurate multipliers (2 to 9) in all 5 performance metrics. The number of bits in the partial product matrix of the significance-driven logic compression based approximate multiplier (SDCM) is reduced by performing lossy logic compression. Its improvement in PDP and ADP is achieved at the cost of an increased percentage of inaccuracy. The CPDs of the proposed 8 × 8 and 16 × 16 multipliers are reduced by 3.75 and 15.36 percent respectively in comparison with SDCM. With the increase in multiplier size, the CPD, PDP and ADP of the proposed multiplier improve further relative to SDCM, as shown in Table 4, primarily due to the high-speed vector-merging addition.

5 CONCLUSION
A delay, power, area and energy efficient carry save multiplier is presented. A precise critical path analysis of the carry save multiplier was carried out, and the critical path delay was derived as a function of the number of full adders and logic gates. Remarkably improved performance in the majority of the performance metrics is achieved through the use of the improved full adder and the modified square root CSLA. The structural simplicity and regularity of the full adder/half adder array of the proposed CSM are improved through the use of the IFA. The carry generation delay, area and power of the IFA are improved through the use of a NAND-NAND chain in place of an AND-OR chain along the carry generation and propagation path. The final vector-merging addition using the modified square root carry select adder (MSCA) further improves the speed.

REFERENCES
[1] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd ed. New York, NY, USA: Oxford Univ. Press, 2010.
[2] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed. London, U.K.: Pearson Education, 2017.
[3] R. S. Waters and E. E. Swartzlander, "A reduced complexity Wallace multiplier reduction," IEEE Trans. Comput., vol. 59, no. 8, pp. 1134–1137, Aug. 2010.
[4] N. V. V. K. Boppana, J. Kommareddy, and S. Ren, "Low-cost and high-performance 8 × 8 Booth multiplier," Circuits Syst. Signal Process., vol. 38, no. 9, pp. 4357–4368, 2019.
[5] K. Paldurai and K. Hariharan, "Implementation of signed Vedic multiplier targeted at FPGA architectures," ARPN J. Eng. Appl. Sci., vol. 10, no. 5, pp. 2193–2197, 2015.
[6] H. Xue, R. Patel, N. V. V. K. Boppana, and S. Ren, "Low-power-delay-product radix-4 8 × 8 Booth multiplier in CMOS," Electron. Lett., vol. 54, no. 6, pp. 344–346, 2018.
[7] P. Pramod and T. K. Shahana, "Delay and energy efficient modular hybrid adder for signal processor architectures," IETE J. Res., Jun. 2, 2019. [Online]. Available: https://doi.org/10.1080/03772063.2019.1627917
[8] P. Pramod and T. K. Shahana, "High throughput FIR filter architectures using retiming and modified CSLA based adders," IET Circuits Devices Syst., vol. 13, no. 7, pp. 1007–1017, 2019.
[9] H. Baba, T. Yang, M. Inoue, K. Tajima, T. Ukezono, and T. A. Sato, "A low-power and small-area multiplier for accuracy-scalable approximate computing," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2018, pp. 569–574.
[10] I. Qiqieh, R. Shafik, G. Tarawneh, D. Sokolov, S. Das, and A. Yakovlev, "Significance-driven logic compression for energy-efficient multiplier design," IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 417–430, Sep. 2018.
