Fast Binary Counters and Compressors Generated by Sorting Network
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on September 03,2021 at 08:43:02 UTC from IEEE Xplore. Restrictions apply.
GUO AND LI: FAST BINARY COUNTERS AND COMPRESSORS GENERATED BY SORTING NETWORK 1221
1222 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 29, NO. 6, JUNE 2021
TABLE I
TRUTH TABLE OF (7,3) COUNTER OUTPUTS
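$$H_i = H_i \mid H_{i+j}, \quad (i = 0, 1, \ldots, 7, 8;\ j \ge 0;\ i + j \le 9) \tag{18}$$

$$\bigvee_{n=i}^{j} Q_n = I_i \,\&\, \overline{I_{j+1}}, \quad (0 \le i \le j \le 7)$$

$$\bigvee_{n=i}^{j} P_n = H_i \,\&\, \overline{H_{j+1}}, \quad (0 \le i \le j \le 8) \tag{19}$$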
Fig. 8. Eight-way sorting network.
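As a minimal sketch of how a sorting network operates on single bits, the Python below sorts eight bits into descending order. It rests on two assumptions that are not taken from the paper: the two-input sorter is modeled as (OR, AND), which is the max and min of two bits, and the wiring is a plain odd-even transposition network rather than the network drawn in Fig. 8. The names `compare_exchange` and `sort_bits` are hypothetical.

```python
from itertools import product

def compare_exchange(a, b):
    """Two-input sorter on bits: (max, min) = (a OR b, a AND b)."""
    return a | b, a & b

def sort_bits(bits):
    """Sort bits into descending order with an odd-even transposition network."""
    w = list(bits)
    n = len(w)
    for rnd in range(n):                    # n rounds are enough for n wires
        for k in range(rnd % 2, n - 1, 2):  # alternate even and odd wire pairs
            w[k], w[k + 1] = compare_exchange(w[k], w[k + 1])
    return w

# Exhaustive check over all 256 input patterns of an eight-way network.
for x in product((0, 1), repeat=8):
    assert sort_bits(x) == sorted(x, reverse=True)
```

The sorted output is a thermometer code: all "1"s first, then all "0"s, so the position of the last "1" equals the number of "1"s in the input.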
increases the parallelism of the circuit. However, as will be discussed later, the area of the proposed design does not increase with the parallelism. In fact, it decreases. The main reason for the area reduction is that (2), (13), and (15) significantly simplify the output Boolean expressions.

Table II compares the proposed structure with the structure in [11]. "2IBLG" and "3IBLG" stand for "two-input basic logic gate" and "three-input basic logic gate," respectively. The critical path runs from the input to C1 in both structures. The proposed structure has the shorter critical path, composed of five layers of 2IBLGs and a MUX, while the structure in [11] has three 2IBLGs, three 3IBLGs, and a MUX.

IV. HIGHER COMPRESSION RATIO COUNTERS

We have implemented an efficient (7,3) saturated counter, but in some applications, such as high-radix NTT [6] and large-number multiplication [5], a (15,4) or even a (31,5) saturated counter will be useful. In this section, we use the above methods to construct (15,4) and (31,5) saturated counters and describe them briefly.

The outputs of the eight-way and seven-way sorting networks are denoted as sequence H (H1–H8, extended to H0–H9) and sequence I (I1–I7, extended to I0–I8), respectively. By applying $A \,\&\, \overline{B}$ logic, we get the one-hot code sequences P (P0–P8) and Q (Q0–Q7). These sequences are similar to the sequences in the (7,3) counter, so we give the key equations directly, as shown in (17)–(19). Note that (17) is just an example; the same regularity holds for every such separation. The symbol "⋁" in (19) represents a continuous "OR."

2) Boolean Expressions of (15,4) Counter: The 4-bit output of the (15,4) counter is denoted as C3C2C1S. Table III shows the output corresponding to the number of "1"s in the input sequence.

As with the (7,3) counter, we first establish logic equations between C3C2C1S and the sequences P and Q through the sum of the subscripts, and then use (17)–(19) to optimize them. This yields (20)–(23). Because the original Boolean expressions are too long, we express them in Verilog syntax (in particular, the ternary operator "a?b:c").
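The relation between the sorted sequences, the one-hot sequences, and the "sum of the subscripts" can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' circuit: the extension values H0 = 1 and H9 = 0 are assumed, the one-hot code is taken as P_i = H_i AND NOT H_(i+1), Python's `sorted` stands in for the sorting networks, and the function names are hypothetical.

```python
from itertools import product

def thermometer(bits):
    """Sorted outputs H1..Hn, extended with H0 = 1 and H(n+1) = 0 (assumed)."""
    return [1] + sorted(bits, reverse=True) + [0]

def one_hot(h):
    """P_i = H_i AND NOT H_(i+1); exactly one P_i is 1."""
    return [h[i] & (1 - h[i + 1]) for i in range(len(h) - 1)]

def counter_15_4(bits):
    """(15,4) count as the sum of the subscripts of the active P and Q."""
    p = one_hot(thermometer(bits[:8]))   # P0..P8 from the eight-way group
    q = one_hot(thermometer(bits[8:]))   # Q0..Q7 from the seven-way group
    return p.index(1) + q.index(1)

# Check the OR-of-a-run identity for the eight-way group on every input.
for x in product((0, 1), repeat=8):
    h = thermometer(x)
    p = one_hot(h)
    assert sum(p) == 1
    for i in range(9):
        for j in range(i, 9):
            assert max(p[i:j + 1]) == h[i] & (1 - h[j + 1])
```

For example, `counter_15_4([1] * 15)` returns 15. The final loop confirms that OR-ing a run P_i..P_j collapses to a two-input expression on H, which is the kind of simplification the text attributes to (17)–(19).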
C4 = (Q0 & P16) | (Q1 & P15) | (Q2 & P14) | (Q3 & P13) | (Q4 & P12) | (Q5 & P11) | (Q6 & P10) | (Q7 & P9) |
     (Q8 & P8) | (Q9 & P7) | (Q10 & P6) | (Q11 & P5) | (Q12 & P4) | (Q13 & P3) | (Q14 & P2) | (Q15 & P1)   (24)

C3 = ((H8 & ~H16) ? Q0 : Q8) | ((H7 & ~H15) ? Q1 : Q9) | ((H6 & ~H14) ? Q2 : Q10) | ((H5 & ~H13) ? Q3 : Q11) |
     ((H4 & ~H12) ? Q4 : Q12) | ((H3 & ~H11) ? Q5 : Q13) | ((H2 & ~H10) ? Q6 : Q14) | ((H1 & ~H9) ? Q7 : Q15)   (25)

C2 = (((H4 & ~H8) | (H12 & ~H16)) ? (Q0 | Q8) : (Q4 | Q12)) | (((H3 & ~H7) | (H11 & ~H15)) ? (Q1 | Q9) : (Q5 | Q13)) |
     (((H2 & ~H6) | (H10 & ~H14)) ? (Q2 | Q10) : (Q6 | Q14)) | (((H1 & ~H5) | (H9 & ~H13)) ? (Q3 | Q11) : (Q7 | Q15))   (26)

C1 = (((H2 & ~H4) | (H6 & ~H8) | (H10 & ~H12) | (H14 & ~H16)) ? (Q0 | Q4 | Q8 | Q12) : (Q2 | Q6 | Q10 | Q14)) |
     (((H1 & ~H3) | (H5 & ~H7) | (H9 & ~H11) | (H13 & ~H15)) ? (Q1 | Q5 | Q9 | Q13) : (Q3 | Q7 | Q11 | Q15))   (27)

S = (P1 | P3 | P5 | P7 | P9 | P11 | P13 | P15) ^ (Q1 | Q3 | Q5 | Q7 | Q9 | Q11 | Q13 | Q15)   (28)
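The following Python sketch evaluates the lower four output bits of the (31,5) counter, (25)–(28), against plain addition. It relies on a reading that is an assumption: H1–H16 is the sorted (thermometer) output of the 16-way group, P is its one-hot code, Q is the one-hot code of the 15-way group, each selector pair is read as H_a AND NOT H_b, and the ternary clauses are combined with OR. All function names are hypothetical.

```python
def thermometer(count, width):
    """H0..H(width+1) for a group holding `count` ones (H_k = 1 iff count >= k)."""
    return [1 if count >= k else 0 for k in range(width + 2)]

def one_hot(count, width):
    """One-hot code with positions 0..width for a group holding `count` ones."""
    return [1 if count == k else 0 for k in range(width + 1)]

def band(h, lo, hi):
    """H_lo AND NOT H_hi: 1 when lo <= count < hi."""
    return h[lo] & (1 - h[hi])

def c3(h, q):   # reading of (25)
    return max(q[k] if band(h, 8 - k, 16 - k) else q[k + 8] for k in range(8))

def c2(h, q):   # reading of (26)
    out = 0
    for k in range(4):
        sel = band(h, 4 - k, 8 - k) | band(h, 12 - k, 16 - k)
        out |= (q[k] | q[k + 8]) if sel else (q[k + 4] | q[k + 12])
    return out

def c1(h, q):   # reading of (27)
    out = 0
    for k in range(2):
        sel = max(band(h, base - k, base + 2 - k) for base in (2, 6, 10, 14))
        on = q[k] | q[k + 4] | q[k + 8] | q[k + 12]
        off = q[k + 2] | q[k + 6] | q[k + 10] | q[k + 14]
        out |= on if sel else off
    return out

def s_bit(p, q):   # (28): XOR of the two groups' parities
    return max(p[1::2]) ^ max(q[1::2])

for i in range(17):        # ones in the 16-way group
    for j in range(16):    # ones in the 15-way group
        h, p, q = thermometer(i, 16), one_hot(i, 16), one_hot(j, 15)
        total = i + j
        assert c3(h, q) == (total >> 3) & 1
        assert c2(h, q) == (total >> 2) & 1
        assert c1(h, q) == (total >> 1) & 1
        assert s_bit(p, q) == total & 1
```

Iterating over the two group counts instead of all 2^31 input patterns is sufficient here because every expression depends on the inputs only through those counts.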
TABLE IV
TRUTH TABLES OF THE APPROXIMATE (4:2) COMPRESSORS
Fig. 11. Proposed exact (4:2) compressor.
Fig. 12. Proposed approximate (4:2) compressor with one error.
Fig. 13. Proposed approximate (4:2) compressor with two errors.

following equation:

s0 = (A & ~B) | D
Cout = B. (29)

The summation of s0, Cin, and C is calculated with a modified "Full Adder" (as shown in Fig. 11). Equation (30) describes the "Full Adder"

h1 = C | Cin
h2 = C & Cin
Carry = (s0 & h1) | h2
Sum = s0 ? (~h1 | h2) : (h1 & ~h2). (30)

B. Approximate (4:2) Compressors

Following [18], we use the names "Yang1" and "Yang2" for the approximate (4:2) compressors with one and two errors, respectively, proposed in [19]. We name the approximate (4:2) compressors proposed in [18] with one and two errors "Strollo1" and "Strollo2," respectively.

In Fig. 12, we construct an approximate (4:2) compressor based on a sorting network. D is one of the outputs of the 4SN, and it is the minimum of the inputs. By simply discarding D, the structure is constructed, and it has the same logical function as "Yang1" and "Strollo1"

Carry = A & h1
Sum = h2 | (A ⊕ h2). (31)

To construct a faster approximate (4:2) compressor, a sorter is discarded from the 4SN, as shown in Fig. 13. Although it is uncertain whether the sequence [A, h1, h2] is completely sorted, we assume that it is. To correct the deviation introduced by incomplete sorting, the output expressions are modified to (31).

Table IV shows the truth tables of [18] and [19] and Figs. 12 and 13.
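The exact compressor of (29) and (30) can be checked exhaustively with a short Python sketch. It assumes that the complement bars were lost from the printed equations, so s0 is read as (A AND NOT B) OR D and the Sum branches as (NOT h1 OR h2) and (h1 AND NOT h2); Python's `sorted` stands in for the 4SN. The second function only illustrates the idea of discarding D: its output encoding, Carry = B and Sum = (A AND NOT B) OR C, is a hypothetical choice and is not taken from Fig. 12.

```python
from itertools import product

def exact_4_2(x, cin):
    """Exact (4:2) compressor following (29) and (30), bars restored."""
    a, b, c, d = sorted(x, reverse=True)   # 4SN outputs, A >= B >= C >= D
    s0 = (a & (1 - b)) | d                 # (29)
    cout = b
    h1, h2 = c | cin, c & cin              # (30)
    carry = (s0 & h1) | h2
    total = ((1 - h1) | h2) if s0 else (h1 & (1 - h2))
    return cout, carry, total

def approx_4_2_drop_d(x):
    """Discard D and encode A + B + C (hypothetical encoding)."""
    a, b, c, _ = sorted(x, reverse=True)
    return b, (a & (1 - b)) | c            # (Carry, Sum)

# The exact compressor preserves the arithmetic sum for all 32 cases.
for x in product((0, 1), repeat=4):
    for cin in (0, 1):
        cout, carry, s = exact_4_2(x, cin)
        assert 2 * (cout + carry) + s == sum(x) + cin

# Dropping D is wrong for exactly one of the 16 input patterns.
errors = [x for x in product((0, 1), repeat=4)
          if 2 * approx_4_2_drop_d(x)[0] + approx_4_2_drop_d(x)[1] != sum(x)]
assert errors == [(1, 1, 1, 1)]
```

The single failing pattern is the all-ones input, where D is the only output that distinguishes a count of four from a count of three.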
TABLE V
SYNTHESIS RESULTS OF (7,3) COUNTER

TABLE VI
SYNTHESIS RESULTS OF (15,4) COUNTER

TABLE VII
SYNTHESIS RESULTS OF (31,5) COUNTER AND IMPROVEMENT IN DELAY

a comparison, and they are also constructed with Verilog HDL. The (7,3) counters are synthesized by the Synopsys Design Compiler (DC) with the TSMC 90-nm/65-nm process, and the synthesis results are listed in Table V. Three sets of data belong to the proposed design; they have different delay constraints. According to Table V, the proposed design has more than 10% better performance than other designs, while it has significantly less area and power consumption
TABLE VIII
SYNTHESIS RESULTS OF 16 × 16 BIT MULTIPLIER
a (7,3) counter, we can also get a (7,2) compressor, which has the same function as the structure in [16], by directly introducing a full adder. The three outputs of a (7,3) counter are denoted as C2, C1, and S, as shown in Fig. 7. A (7,2) compressor can be constructed with (32). In this way, we compare the proposed design with the (7,2) compressor in [16]. The synthesis results show that the proposed design has approximately 10% better performance and consumes significantly less area and power.

B. Results of (15,4) Counter

To evaluate the performance of the (15,4) counter, we compare it with the (15,4) counter in [12] and with a (15,4) counter built from two (7,3) counters of [11]. The synthesis results are shown in Table VI.

The new (15,4) counter has more than 15% less delay, but its area advantage is not as large as that of the (7,3) counter. However, if the delay constraints are relaxed slightly, the area of the proposed design drops rapidly. When the proposed counter is synthesized with the same delay constraints as the structure in [12], it has less area than [12]. Thus, the new design is more flexible: it can be used in high-performance or compact scenarios, and in both scenarios it has better performance. The proposed design also has more than 40% less power consumption.

C. Results of (31,5) Counter

We compare the (31,5) counter with the Verilog code "Output[4:0] = x[0] + x[1] + x[2] + · · · + x[30]" (autosynthesized by Synopsys DC). The results are listed in Table VII. The delay is approximately 20% shorter, but the area is much larger because the 16- and 15-way sorting networks need a large area. Thus, the proposed (31,5) design is the better choice in high-performance scenarios.

D. Typical Application

1) 16 × 16 Bit Multiplier: To show the performance of the proposed counters, we construct 16 × 16 bit multipliers as application platforms. The proposed (31,5) counter has poor area efficiency, so we do not use it in this subsection. The architectures of the 16 × 16 bit multipliers are illustrated in Fig. 14(a) and (b) for the (7,3) counter and the (15,4) counter, respectively. We also implement the 16 × 16 bit multiplier mentioned in [16] for comparison.

Table VIII shows the synthesis results. Multipliers with the proposed counters have better ADP and PDP than other designs, and they can also reach lower delays than
Fig. 14. Architecture of 16 × 16 bit multipliers. (a) For (7,3) counter. (b) For (15,4) counter.
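How counters are used inside a multiplier can be sketched with a behavioral column-compression model in Python. This is a generic Wallace-style reduction written for illustration only; it is not the architecture of Fig. 14, the counters are modeled by their arithmetic function rather than by sorting networks, and the names `counter` and `multiply_with_counters` are hypothetical.

```python
def counter(bits):
    """Behavioral counter: binary count of the ones in `bits`, LSB first."""
    count = sum(bits)
    return [(count >> b) & 1 for b in range(len(bits).bit_length())]

def multiply_with_counters(x, y, width=16):
    """Multiply two unsigned `width`-bit integers by column compression."""
    # Column k holds the partial-product bits of weight 2**k.
    cols = [[] for _ in range(2 * width)]
    for i in range(width):
        for j in range(width):
            cols[i + j].append((x >> i) & (y >> j) & 1)
    # Compress until every column holds at most two bits.
    while any(len(col) > 2 for col in cols):
        nxt = [[] for _ in range(len(cols) + 2)]   # room for carries
        for k, col in enumerate(cols):
            pos = 0
            while len(col) - pos >= 3:
                take = 7 if len(col) - pos >= 7 else 3   # (7,3) or (3,2)
                for b, bit in enumerate(counter(col[pos:pos + take])):
                    nxt[k + b].append(bit)               # weight 2**(k + b)
                pos += take
            nxt[k].extend(col[pos:])   # leftover bits pass through unchanged
        cols = nxt
    # The two remaining rows go to a final adder, modeled by a weighted sum.
    return sum(bit << k for k, col in enumerate(cols) for bit in col)
```

For example, `multiply_with_counters(0xFFFF, 0xFFFF)` equals `0xFFFF * 0xFFFF`. Each counter replaces seven (or three) bits of one column with three (or two) bits spread over neighboring columns, so the total number of bits shrinks at every stage while the weighted sum stays the same.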
TABLE IX
SYNTHESIS RESULTS OF 8 × 8 BIT APPROXIMATE MULTIPLIER
other designs. Thus, they are very suitable for some high-performance cases, and they also perform well in low-power and area-efficient cases.

2) 8 × 8 Bit Approximate Multiplier: To compare the proposed compressors with the other exact/approximate (4:2) compressors, we construct the 8 × 8 bit approximate multiplier proposed in [18]. The results are listed in Table IX. The proposed design performs better in ADP and PDP.

VII. CONCLUSION

In this article, a new counter design method based on a sorting network is proposed, and we construct (7,3), (15,4), and (31,5) counters based on this method. The (7,3) counter has 8.1%–27.0% less delay than other designs and also consumes less area and power. The (15,4) counter is more flexible than existing designs because it achieves 14.9%–35.2% less delay when speed is critical and performs 14.7%–49.0% and 41.2%–72.7% better in ADP and PDP when area or power is critical. The (31,5) counter has more than 22.0% shorter delay than other existing designs. When the counters are embedded in a 16-bit multiplier, the multiplier achieves up to 31.8% and 32.2% better ADP and PDP than multipliers built with other counter designs. Exact/approximate (4:2) compressors are also proposed based on a sorting network. They perform approximately 10.2%–37.4% better in ADP and 22.3%–48.0%
better in PDP when they are embedded in an 8-bit approximate multiplier.

REFERENCES

[1] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964, doi: 10.1109/PGEC.1964.263830.
[2] R. S. Waters and E. E. Swartzlander, "A reduced complexity wallace multiplier reduction," IEEE Trans. Comput., vol. 59, no. 8, pp. 1134–1137, Aug. 2010, doi: 10.1109/TC.2010.103.
[3] P. L. Montgomery, "Five, six, and seven-term karatsuba-like formulae," IEEE Trans. Comput., vol. 54, no. 3, pp. 362–369, Mar. 2005, doi: 10.1109/TC.2005.49.
[4] J. Ding, S. Li, and Z. Gu, "High-speed ECC processor over NIST prime fields applied with Toom–Cook multiplication," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 3, pp. 1003–1016, Mar. 2019, doi: 10.1109/TCSI.2018.2878598.
[5] R. Liu and S. Li, "A design and implementation of montgomery modular multiplier," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Sapporo, Japan, May 2019, pp. 1–4, doi: 10.1109/ISCAS.2019.8702684.
[6] W. Wang, X. Huang, N. Emmart, and C. Weems, "VLSI design of a large-number multiplier for fully homomorphic encryption," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 9, pp. 1879–1887, Sep. 2014, doi: 10.1109/TVLSI.2013.2281786.
[7] S. Asif and Y. Kong, "Analysis of different architectures of counter based wallace multipliers," in Proc. 10th Int. Conf. Comput. Eng. Syst. (ICCES), Cairo, Egypt, Dec. 2015, pp. 139–144, doi: 10.1109/ICCES.2015.7393034.
[8] A. Najafi, B. Mazloom-nezhad, and A. Najafi, "Low-power and high-speed 4-2 compressor," in Proc. 36th Int. Conv. Inf. Commun. Technol., Electron. Microelectron. (MIPRO), Opatija, Croatia, May 2013, pp. 66–69.
[9] A. Najafi, S. Timarchi, and A. Najafi, "High-speed energy-efficient 5:2 compressor," in Proc. 37th Int. Conv. Inf. Commun. Technol., Electron. Microelectron. (MIPRO), Opatija, Croatia, May 2014, pp. 80–84, doi: 10.1109/MIPRO.2014.6859537.
[10] S. Asif and Y. Kong, "Design of an algorithmic wallace multiplier using high speed counters," in Proc. 10th Int. Conf. Comput. Eng. Syst. (ICCES), Cairo, Egypt, Dec. 2015, pp. 133–138, doi: 10.1109/ICCES.2015.7393033.
[11] C. Fritz and A. T. Fam, "Fast binary counters based on symmetric stacking," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2971–2975, Oct. 2017, doi: 10.1109/TVLSI.2017.2723475.
[12] Q. Jiang and S. Li, "A design of manually optimized (15, 4) parallel counter," in Proc. Int. Conf. Electron Devices Solid-State Circuits (EDSSC), Hsinchu, Taiwan, Oct. 2017, pp. 1–2, doi: 10.1109/EDSSC.2017.8126527.
[13] M. H. Najafi, D. J. Lilja, M. D. Riedel, and K. Bazargan, "Low-cost sorting network circuits using unary processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1471–1480, Aug. 2018, doi: 10.1109/TVLSI.2018.2822300.
[14] D. E. Knuth, The Art of Computer Programming: Sorting and Searching, vol. 3. Reading, MA, USA: Addison-Wesley, 1973.
[15] M. Mehta, V. Parmar, and E. Swartzlander, "High-speed multiplier design using multi-input counter and compressor circuits," in Proc. 10th IEEE Symp. Comput. Arithmetic, Grenoble, France, Jun. 1991, pp. 43–50, doi: 10.1109/ARITH.1991.145532.
[16] A. Fathi, B. Mashoufi, and S. Azizian, "Very fast, high-performance 5-2 and 7-2 compressors in CMOS process for rapid parallel accumulations," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 6, pp. 1403–1412, Jun. 2020, doi: 10.1109/TVLSI.2020.2983458.
[17] T. Satish and K. S. Pande, "Multiplier using NAND based compressors," in Proc. 3rd Int. Conf. Electron., Mater. Eng. Nano-Technol. (IEMENTech), Kolkata, India, Aug. 2019, pp. 1–6, doi: 10.1109/IEMENTech48150.2019.8981067.
[18] A. G. M. Strollo, E. Napoli, D. De Caro, N. Petra, and G. D. Meo, "Comparison and extension of approximate 4-2 compressors for low-power approximate multipliers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3021–3034, Sep. 2020, doi: 10.1109/TCSI.2020.2988353.
[19] Z. Yang, J. Han, and F. Lombardi, "Approximate compressors for error-resilient multiplier design," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst. (DFTS), Amherst, MA, USA, Oct. 2015, pp. 183–186, doi: 10.1109/DFT.2015.7315159.
[20] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1352–1361, Apr. 2017, doi: 10.1109/TVLSI.2016.2643003.
[21] K. Manikantta Reddy, M. H. Vasantha, Y. B. Nithin Kumar, and D. Dwivedi, "Design of approximate booth squarer for error-tolerant computing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 5, pp. 1230–1241, May 2020, doi: 10.1109/TVLSI.2020.2976131.
[22] S. Venkatachalam and S.-B. Ko, "Design of power and area efficient approximate multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1782–1786, May 2017, doi: 10.1109/TVLSI.2016.2643639.
[23] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, "Design-efficient approximate multiplication circuits through partial product perforation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 10, pp. 3105–3117, Oct. 2016, doi: 10.1109/TVLSI.2016.2535398.
[24] W. Liu et al., "Design and analysis of approximate redundant binary multipliers," IEEE Trans. Comput., vol. 68, no. 6, pp. 804–819, Jun. 2019, doi: 10.1109/TC.2018.2890222.

Wenbo Guo received the B.S. degree in microelectronics from Northwestern Polytechnical University, Xi'an, China, in 2019. He is currently working toward the M.S. degree at the Institute of Microelectronics, Tsinghua University, Beijing, China.
His current research interests include VLSI architectures for postquantum cryptography algorithms and fully homomorphic encryption.

Shuguo Li (Member, IEEE) received the bachelor's degree from Xidian University, Xi'an, China, in 1986, the M.S. degree from Shandong University, Shandong, China, in 1993, and the Ph.D. degree in computer science from Northwestern Polytechnical University, Xi'an, in 1999.
He is currently a Professor with the Institute of Microelectronics, Tsinghua University, Beijing, China. His current research interests include the algorithm, design, and VLSI implementation for cryptography.