Kwon2002 Article A16 BitBy16 BitMACDesignUsingF
© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.
OHSANG KWON∗
SUN Microsystems Inc., Palo Alto, CA 94303, USA
KEVIN NOWKA
IBM Austin Research Lab, Texas 78758, USA
Abstract. 3:2 counters and 4:2 compressors have been widely used for multiplier implementations. In this paper,
a fast 5:3 compressor is derived for high-speed multiplier implementations. The fast 5:3 compression is obtained
by applying two rows of fast 2-bit adder cells to five rows in a partial product matrix. As a design example, a 16-bit
by 16-bit MAC (Multiply and Accumulate) design is investigated both in a purely logical gate implementation and
in a highly customized design. For the partial product reduction, the use of the new 5:3 compression leads to a 14.3%
speed improvement in terms of XOR gate delay. In a dynamic CMOS circuit implementation using 0.225 µm bulk
CMOS technology, an 11.7% speed improvement is observed with 8.1% less power consumption for the reduction
tree.
Keywords: 3:2 counter, 4:2 compressor, 5:3 compressor, 5:2 compressor, multiplier, MAC
directly to the next cycle in a redundant form (carry-save). In this case, a new method can be used to eliminate one row of the partial product matrix before the actual reduction process begins, which can eliminate one 3:2 counter delay for some specific operand sizes. Overall, the new design is 14.3% faster in terms of XOR delays for the partial product reduction. In a highly customized dynamic CMOS circuit implementation, it is observed that the new method provides 11.7% speed improvement while it consumes 8.1% less power in the reduction tree.

In this paper, the new 5:3 compressor design is presented in Section 2. Section 3 presents a 16-bit by 16-bit two’s complement MAC implementation example and compares the performance to a conventional design. Section 4 discusses possible extensions of the new method. Finally, a conclusion is given in Section 5.

2. New Compressors

2.1. 3:2 Counters and 4:2 Compressors

Figure 1 shows two block diagrams of a 3:2 counter. 3:2 counters have been widely used in fast multiplier implementations from the earliest efforts [1, 2] to more recent works in CMOS circuits [8–10]. A 3:2 counter has three input bits of equal weight and two output bits, sum and carry, of the same and one greater binary bit weight, respectively. The sum output of a 3:2 counter takes two XOR delays (an XOR delay is denoted xor hereafter).

Figure 2 shows two logical decompositions of a 4:2 compressor [7, 10–12]. The 4:2 compressor has five input bits including a carry-in from the neighboring less significant cell and three outputs including a carry-out to the more significant neighboring cell. Both designs shown in Fig. 2 have 3 xor on their longest path.

2.2. A 5:3 Compressor

A fast 5:3 compressor can be constructed with 2-bit adder cells. Figure 3 shows the block diagrams of two fast 2-bit adder cells. To minimize the delay, two different logical decompositions of a full adder are combined to form the 2-bit adder. The first full adder (Type-I) has three input bits in the less significant bit position and calculates an output carry in a fast manner. The second full adder (Type-II) accepts the carry input in the middle of its calculation. Through combining the different full adder styles for the 2-bit adder configuration, the overall delay of the 2-bit adder is only 2 xor which is the same as that of a 1-bit full adder’s sum bit delay.

Figure 4 shows a fast 5:3 compression that is implemented using the fast 2-bit adder cells. By applying two rows of 2-bit adder cells in parallel, five rows are reduced to three rows as shown in the figure.

Using the new fast 5:3 compression, the speed of multipliers and fused multiply-adders can be improved as will be explained in Sections 3 and 4. Here, XOR delay is used as the speed measure of 3:2 counters, 4:2 compressors and the new 5:3 compressors since all of their logical decompositions employ a series of XORs. In this way, the speed comparison can be made independent of specific technologies though actual relative delays may be slightly different depending on specific circuit families and fabrication technologies employed. Many multiplier designs use specially designed counters and compressors depending on their processing technology. Section 3 gives an example of the fast 2-bit adder cell design using a fully customized dynamic CMOS circuit.
Figure 3. Two logical decompositions of a fast 2-bit adder. (a) type-I; (b) type-II.
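As a behavioural illustration of the cells described in Section 2, the following Python sketch models a 3:2 counter, a 4:2 compressor and a 2-bit adder cell, and checks their arithmetic exhaustively. It is only a sketch under stated assumptions: the function names are hypothetical, the 4:2 compressor is built from two chained 3:2 counters (one common decomposition, not necessarily either of those in Fig. 2), and the 2-bit adder is modelled as two ordinary full adders in series. It therefore checks what the cells compute, not the type-I/type-II decomposition that yields the 2 xor delay.

```python
from itertools import product


def counter_3_2(a, b, c):
    """3:2 counter (full adder). Returns (carry, sum)."""
    s = a ^ b ^ c                       # two XORs in series
    carry = (a & b) | (c & (a ^ b))     # majority of the three inputs
    return carry, s


def compressor_4_2(a, b, c, d, cin):
    """4:2 compressor assumed to be two chained 3:2 counters.

    Returns (cout, carry, sum); cout does not depend on cin.
    """
    cout, s1 = counter_3_2(a, b, c)
    carry, s = counter_3_2(s1, d, cin)
    return cout, carry, s


def adder_2bit(a0, b0, c0, a1, b1):
    """2-bit adder cell: three bits of weight 1 and two bits of weight 2.

    Returns (carry, s1, s0) with weights 4, 2 and 1 (assumed output naming).
    """
    c_mid, s0 = counter_3_2(a0, b0, c0)     # less significant position
    carry, s1 = counter_3_2(a1, b1, c_mid)  # more significant position
    return carry, s1, s0


def check_all():
    """Exhaustively check the arithmetic identity of each cell."""
    for a, b, c in product((0, 1), repeat=3):
        carry, s = counter_3_2(a, b, c)
        assert a + b + c == 2 * carry + s
    for a, b, c, d, cin in product((0, 1), repeat=5):
        cout, carry, s = compressor_4_2(a, b, c, d, cin)
        assert a + b + c + d + cin == 2 * (cout + carry) + s
    for a0, b0, c0, a1, b1 in product((0, 1), repeat=5):
        carry, s1, s0 = adder_2bit(a0, b0, c0, a1, b1)
        assert a0 + b0 + c0 + 2 * (a1 + b1) == 4 * carry + 2 * s1 + s0
    return True
```

Calling `check_all()` runs through every input combination of the three cells. Because the 2-bit adder cell consumes five bits and produces three, it is the building block from which the 5:3 compression is assembled.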
cell and one carry bit (carry) of one-greater significance and a sum bit (sum). It has a sum delay of 4 xor on the critical path.

Figure 6 shows another 5:2 compressor cell which also has a sum delay of 4 xor [13]. Its carry output has the same delay as the sum output while the carry output is faster than the sum in the block diagram shown in Fig. 5.

The following theorem proves the correctness of the 5:2 compressor in Fig. 6.

Theorem 1. $a, b, c, d, e, x1_{in}, x2_{in}$ are input bits of the 5:2 compressors in Fig. 6 and $x1_{out}, x2_{out}, carry, sum$ are output bits in Fig. 6. Then,

$$a + b + c + d + e + x1_{in} + x2_{in} = (x1_{out} + x2_{out} + carry) \cdot 2 + sum \tag{1}$$

Proof: From the definition in Table 1, the following holds true.

$$x1_{out} = (a \vee b) \wedge (c \vee d) \tag{2}$$

$$s_1 = (a \wedge b) \vee (c \wedge d) \tag{3}$$

$$(s_1 \oplus n) = (a \oplus b) \oplus (c \oplus d) \tag{4}$$

$$a + b + c + d = x1_{out} \cdot 2 + s_1 + n \tag{5}$$
80 Kwon, Nowka and Swartzlander
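Let $x2_{out}$ and $s_2$ represent the carry and save terms of $(s_1 + n + x1_{in})$, respectively. Then,

$$\begin{aligned} x2_{out} &\equiv (s_1 \wedge n) \vee (s_1 \wedge x1_{in}) \vee (n \wedge x1_{in}) \\ &= (s_1 \wedge n) \vee (x1_{in} \wedge (s_1 \vee n)) \\ &= (\overline{(s_1 \oplus n)} \wedge s_1) \vee ((s_1 \oplus n) \wedge x1_{in}) \end{aligned} \tag{6}$$

In a similar manner, let $carry$ and $sum$ be carry and sum of $(s_2 + e + x2_{in})$, respectively. Then,

$$carry \equiv (\overline{(s_2 \oplus e)} \wedge e) \vee ((s_2 \oplus e) \wedge x2_{in}) \tag{9}$$

$$sum \equiv (s_2 \oplus e) \oplus x2_{in} \tag{10}$$

$$s_2 + e + x2_{in} = carry \cdot 2 + sum \tag{11}$$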
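The identity of Theorem 1 can also be checked by brute force over all 128 input combinations. The Python sketch below does this under a few assumptions that are not taken from the paper: the function names are hypothetical, the intermediate bit `n` is obtained by solving Eq. (4) for it (Table 1 is not reproduced here), and `s2` is taken to be the XOR of `s1`, `n` and `x1_in` by analogy with Eq. (10). It also confirms that a multiplexer-style carry, in which the complement of the XOR selects the first operand, equals the three-input majority function.

```python
from itertools import product


def majority(x, y, z):
    """Carry of a one-bit addition x + y + z."""
    return (x & y) | (x & z) | (y & z)


def mux_carry(x, y, z):
    """Multiplexer form of the carry: pick x when x == y, otherwise z."""
    sel = x ^ y
    return ((1 ^ sel) & x) | (sel & z)


def compressor_5_2(a, b, c, d, e, x1_in, x2_in):
    """Behavioural model of the 5:2 compressor cell of Theorem 1.

    Returns (x1_out, x2_out, carry, total), where total is the sum bit.
    """
    x1_out = (a | b) & (c | d)          # Eq. (2)
    s1 = (a & b) | (c & d)              # Eq. (3)
    n = s1 ^ (a ^ b ^ c ^ d)            # assumption: solved from Eq. (4)
    x2_out = mux_carry(s1, n, x1_in)    # carry of s1 + n + x1_in
    s2 = s1 ^ n ^ x1_in                 # assumption: by analogy with Eq. (10)
    carry = mux_carry(e, s2, x2_in)     # carry of s2 + e + x2_in
    total = (s2 ^ e) ^ x2_in            # Eq. (10)
    return x1_out, x2_out, carry, total


def mux_equals_majority():
    """The multiplexer form must agree with the majority function."""
    return all(mux_carry(x, y, z) == majority(x, y, z)
               for x, y, z in product((0, 1), repeat=3))


def theorem_1_holds():
    """Check Eq. (1) for every combination of the seven input bits."""
    for bits in product((0, 1), repeat=7):
        x1_out, x2_out, carry, total = compressor_5_2(*bits)
        if sum(bits) != 2 * (x1_out + x2_out + carry) + total:
            return False
    return True
```

Both `mux_equals_majority()` and `theorem_1_holds()` are expected to return `True`; a failure of either would point to an error in the assumed intermediate terms rather than in the theorem itself.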
A 16-Bit by 16-Bit MAC Design 81
Figure 7. A block diagram of 16-bit by 16-bit MAC.
Figure 8. A reduction tree for 11:2 compression.
in the partial product matrix can be reduced to two rows using a row of full adders simultaneously during the partial product generation. The partial product generation requires one Booth encoding delay and one Booth decoding (5:1 MUX) delay which is longer than one full adder delay. With the pre-reduction row of full adders, there are only ten rows to be reduced in the main reduction process.

Figure 9 shows the detailed view of the reduction process using the new reduction tree. Overall, the design using the new 5:3 compressor takes 7 xor compared to the 8 xor required by a conventional design using 4:2 compressors and 3:2 counters, which is 14.3% speed improvement.

3.2. Circuit Implementation

The radix-4 Booth recoding technique has been widely used since it reduces the number of partial products by approximately half [3–6, 16]. There are well-known Booth encoding methods and the corresponding static CMOS circuits in previous works [6, 16]. However, for dynamic CMOS circuit implementation, dual-rail signals are required at the output of the Booth decoders. This entails at least five control signals for single-stage Booth decoder implementation using direct implementation of previous encoding methods.

In this paper, a new encoding method is used and it will be shown that four control signals are enough with a carefully designed Booth decoder circuit. This will be helpful in large multiplier designs where interconnection complexity is one of design bottlenecks. Table 2 shows the new encoding method. Figure 10 shows its dynamic CMOS implementation and Fig. 11 shows the accompanying Booth decoder circuits. In the Booth decoder circuit, for the complementary signal generation, a control signal is needed to indicate a recoded value of zero. Using the new encoding method, this event can be represented when both X and 2X are high and the circuit can be configured to take advantage of it. Note that some of the NMOS devices are shared between adjacent Booth decoder cells to minimize the cell area of the Booth decoder.

Figure 12 shows a dynamic implementation of the fast 2-bit adder cells. To minimize the number of

Table 2. A new Booth encoding.

X_{2i+1}  X_{2i}  X_{2i−1}  |  P_i  M_i  X_i  2X_i
   0        0        0      |   0    0    1    1
   0        0        1      |   1    0    1    0
   0        1        0      |   1    0    1    0
   0        1        1      |   1    0    0    1
   1        0        0      |   0    1    0    1
   1        0        1      |   0    1    1    0
   1        1        0      |   0    1    1    0
   1        1        1      |   0    0    1    1
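To make the encoding of Table 2 concrete, the Python sketch below stores the table, decodes the four control signals (P, M, X, 2X) into a signed radix-4 digit, and uses the digits to form a product. It is a behavioural illustration only and says nothing about the dynamic CMOS circuits of Figs. 10 and 11. The function names, the decoding rule written in `booth_digit` and the default 16-bit width are assumptions made for this example; the only convention taken from the text is that X and 2X both high marks a recoded value of zero.

```python
# Table 2: (X_{2i+1}, X_{2i}, X_{2i-1}) -> (P, M, X, 2X)
TABLE_2 = {
    (0, 0, 0): (0, 0, 1, 1),
    (0, 0, 1): (1, 0, 1, 0),
    (0, 1, 0): (1, 0, 1, 0),
    (0, 1, 1): (1, 0, 0, 1),
    (1, 0, 0): (0, 1, 0, 1),
    (1, 0, 1): (0, 1, 1, 0),
    (1, 1, 0): (0, 1, 1, 0),
    (1, 1, 1): (0, 0, 1, 1),
}


def booth_digit(p, m, x, x2):
    """Decode the four control signals into a digit in {-2, -1, 0, 1, 2}."""
    if x and x2:                 # both high: recoded value of zero
        return 0
    magnitude = 1 if x else 2    # X selects 1x, 2X selects 2x
    return magnitude if p else -magnitude   # P is plus, M is minus


def table_matches_radix4_booth():
    """Each row must decode to the usual digit -2*x[2i+1] + x[2i] + x[2i-1]."""
    for (hi, mid, lo), signals in TABLE_2.items():
        if booth_digit(*signals) != -2 * hi + mid + lo:
            return False
    return True


def booth_multiply(multiplicand, multiplier, width=16):
    """Multiply by summing radix-4 Booth partial products.

    `multiplier` must fit in a signed two's complement word of `width`
    bits, and `width` must be even.
    """
    # Python's >> on negative integers is arithmetic, so this yields
    # the two's complement bits of the multiplier.
    bits = [(multiplier >> k) & 1 for k in range(width)]
    total = 0
    prev = 0                     # x_{-1} is defined as 0
    for i in range(width // 2):
        triplet = (bits[2 * i + 1], bits[2 * i], prev)
        digit = booth_digit(*TABLE_2[triplet])
        total += digit * multiplicand * 4 ** i
        prev = bits[2 * i + 1]
    return total
```

For a 16-bit multiplier the loop produces eight partial products instead of sixteen, which is the halving that motivates radix-4 recoding.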
Figure 14. Delay comparison: Tree architectures with dynamic CMOS circuit implementation. (Panels: new scheme; conventional scheme [10].)
cell as in Fig. 1. However, in fully customized circuit implementations, compressors and/or counters are designed to minimize overall delay and transistor counts. The distinct delay difference is not usually found in many customized circuit implementations. In addition, the algorithmic approach leads to different interconnections on different bit slices, which is a major design bottleneck in large multipliers where interconnection complexity is a dominant design factor as well as area and power consumption.

The 5:3 compressor is similar to the algorithmic approach in that it utilizes the delay difference on paths from inputs to outputs during logic derivation. However, in the 5:3 compressor, connections are optimized within a local area (the 5:3 compressor itself, which spans only 2 consecutive bit positions) and the resulting cell has equal delay on outputs from the viewpoint of logical analysis. Thus, the conventional partial product reduction methods can be applied in a similar manner and the benefits of interconnection regularity and design modularity of conventional reduction methods are preserved. Simultaneously, delay and area can be improved by circuit optimization in the cell itself as shown in the circuit implementation example of Section 3.2.

4.2. Booth Encoding and Pre-Reduction Technique

In multiplier designs, the radix-4 Booth recoding technique has been widely used for partial product generation though the recoding entails Booth encoders and Booth decoders to compress the original partial products into half. In fact, a similar compression can be done using one 4:2 compression stage. The advantage of the Booth recoding technique comes from the following aspects: speed, area and power consumption.

In terms of delay, the recoding involves encoder and decoder delay while non-Booth recoding takes one 2-input AND gate and a 4:2 compressor delay. It should be noted that the delay comparison depends on specific implementation styles and its investigation is beyond the scope of this paper. Nonetheless, radix-4 recoding is preferred especially in many custom circuit implementations since encoding and decoding can each be implemented in one custom logic gate. Specifically, in dynamic CMOS circuit implementation, radix-4 recoding involves only two domino stages while non-Booth recoding takes 2 domino stages for the 4:2 compressor delay and another for the 2-input AND gate. In terms of area, one Booth decoder performs the role of one 4:2 compressor if the relatively small Booth encoding logic is ignored. In custom circuit designs, the Booth decoder is easily implemented as one custom logic gate and its area can be minimized by sharing some transistors with adjacent cells, just as in the dynamic CMOS circuit implementation of Fig. 11, while a 4:2 compressor requires two 3:2 counters. The smaller area and fewer logic stages lead to less power consumption, too. Radix-4 Booth recoding is used in the implementation example of this paper to reflect these advantages.

In addition, in the MAC design example, a row of full adders was used for 3:2 reduction for three rows not driven by the Booth recoder in the partial product
Figure 15. Partial product matrix for a 13-bit by 13-bit 2’s complement MAC.
matrix prior to the main reduction process. The delay of the full adder row is smaller than that of partial product generation when radix-4 or higher radix recoding techniques are used. In these recoding techniques, Booth encoding and Booth multiplexing are needed for the partial product generation.

If the delay of partial product generation is greater than that of a 4:2 compressor, 4:2 compression can be used for the early available 4 rows in the partial product matrix. Figure 15 shows an example of this case, which is useful to reduce one more row in the partial product matrix before the partial product reduction process begins. The example is for a 13-bit by 13-bit 2’s complement MAC design. Using this method, the delay becomes 6 xor for 8:2 reduction instead of 7 xor for 10:2 reduction. This method is applicable when the operand width of the MAC is an odd number.

5. Conclusions

In this paper, a new fast 5:3 compression method is derived from a fast 2-bit adder cell. The 2-bit adder cell has a delay of 2 xor when a new logical decomposition is used. In addition, its one-stage dynamic CMOS circuit is proposed for highly customized design methodology. For the partial product reduction of a 16-bit by 16-bit MAC, the use of the new 5:3 compressor cell leads to a 14.3% speed improvement in terms of XOR delay. In a highly customized dynamic CMOS circuit implementation, an 11.7% speed improvement is observed with 8.1% less power consumption in 0.225 µm bulk CMOS technology.

References

1. C.S. Wallace, “A Suggestion For a Fast Multiplier,” IEEE Trans. on Electronic Computers, vol. EC-13, 1964, pp. 14–17.
2. L. Dadda, “Some Schemes for Parallel Multiplier,” Alta Freq., vol. 34, 1965, pp. 349–356.
3. O.L. MacSorley, “High Speed Arithmetic in Binary Computers,” Proc. IRE, vol. 49, 1961, pp. 67–91.
4. H. Sam and A. Gupta, “A Generalized Multibit Recoding of Two’s Complement Binary Numbers and Its Proof with Application in Multiplier Implementation,” IEEE Trans. on Computers, vol. 39, 1990, pp. 1006–1015.
5. S. Vassiliadis, E.M. Schwarz, and D.J. Hanrahan, “A General Proof for Overlapped Multiple-Bit Scanning Multiplications,” IEEE Trans. on Computers, vol. 38, 1989, pp. 172–183.
6. P.J. Song and G.D. Micheli, “Circuit and Architecture Trade-offs for High-Speed Multiplication,” IEEE Journal of Solid-State Circuits, vol. 26, 1991, pp. 1184–1198.
7. V.G. Oklobdzija, D. Villeger, and S.S. Liu, “A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach,” IEEE Trans. on Computers, vol. 45, 1996, pp. 294–305.
8. K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, “A 3.8-ns CMOS 16 × 16-b Multiplier Using Complementary Pass-Transistor Logic,” IEEE Journal of Solid-State Circuits, vol. 25, 1990, pp. 388–395.
9. A. Parameswar, H. Hara, and T. Sakurai, “A Swing Restored Pass-Transistor Logic-Based Multiply and Accumulator Circuit for Multimedia Applications,” IEEE Journal of Solid-State Circuits, vol. 31, 1994, pp. 804–809.
10. M. Izumikawa, H. Igura, K. Furuta, H. Ito, H. Wakabayashi, K. Nakajima, T. Mogami, T. Horiuchi, and M. Yamashina, “A 0.25-µm CMOS 0.9-V 100-MHz DSP Core,” IEEE Journal of Solid-State Circuits, vol. 32, 1997, pp. 52–61.
11. N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, “A 4.4 ns CMOS 54 × 54-b Multiplier Using Pass-Transistor Multiplexer,” IEEE Journal of Solid-State Circuits, vol. 30, 1993, pp. 251–257.
12. C.F. Law, S.S. Rofail, and K.S. Yeo, “A Low-Power 16 × 16-b Parallel Multiplier Utilizing Pass-Transistor Logic,” IEEE Journal of Solid-State Circuits, vol. 34, 1999, pp. 1395–1399.
13. O. Kwon, K. Nowka, and E. Swartzlander, Jr., “A 16-bit × 16-bit MAC Design Using Fast 5:2 Compressors,” in International Conference on Application-Specific Systems, Architectures, and Processors, 2000, pp. 235–243.
14. C.R. Baugh and B.A. Wooley, “A Two’s Complement Parallel Array Multiplication Algorithm,” IEEE Trans. Computers, vol. C-22, 1973, pp. 1045–1047.
15. S. Vassiliadis, E.M. Schwarz, and B.M. Sung, “Hard-Wired Multipliers with Encoded Partial Products,” IEEE Trans. on Computers, vol. 40, 1991, pp. 1181–1197.
16. G. Goto, A. Inoue, R. Ohe, S. Kashiwakura, S. Mitarai, T. Tsuru, and T. Izawa, “A 4.1-ns Compact 54 × 54-b Multiplier Utilizing Sign-Select Booth Encoders,” IEEE Journal of Solid-State Circuits, vol. 32, 1997, pp. 1676–1682.
17. L. Su, R. Schulz, J. Adkisson, K. Beyer, G. Biery, W. Cote, E. Crabbe, D. Edelstein, J. Ellis-Monaghan, E. Eld, D. Foster, R. Gehres, R. Goldblatt, N. Greco, C. Guenther, J. Heidenreich, J. Herman, D. Kiesling, L. Lin, S.-H. Lo, J. McKenna, C. Megivern, H. Ng, J. Oberschmidt, A. Ray, N. Rohrer, K. Tallman, T. Wagner, and B. Davari, “A High-Performance Sub-0.25 µm CMOS Technology with Multiple Thresholds and Copper Interconnects,” in 1998 Symposium on VLSI Technology Digest of Technical Papers, 1998, pp. 18–19.
18. Neil H.E. Weste and Kamran Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed., Santa Clara, CA: Addison-Wesley Publishing Company, 1993.
19. Kevin J. Nowka and Tibi Galambos, “Circuit Design Techniques for a Gigahertz Integer Microprocessor,” in International Conference on Computer Design, 1998, pp. 11–16.
20. J. Silberman, N. Aoki, D. Boerstler, J. Burns, S. Dhong, A. Essbaum, U. Ghoshal, D. Heidel, P. Hofstee, K. Lee, D. Meltzer, H. Ngo, K. Nowka, S. Posluszny, O. Takahashi, I. Vo, and B. Zoric, “A 1.0 GHz Single-Issue 64-Bit PowerPC Integer Processor,” IEEE Journal of Solid-State Circuits, vol. 33, 1998, pp. 1600–1608.

Ultra-Sparc III, IV microprocessors for high-end servers. His research interests are in computer arithmetic, high-speed CMOS circuit implementation. He obtained Ph.D. in electrical and computer engineering at The University of Texas at Austin in 2000. His dissertation research was in the field of computer arithmetic and its high-speed CMOS circuit implementation. He received M.S. and B.S. in electronics engineering at Seoul National University, Seoul, Korea in 1992 and 1990, respectively. From 1992 to 1997, he worked for Daewoo Electronics Co. Ltd., Seoul, Korea as research engineer. During this period, he developed image decoder, demodulator and channel decoder for a prototype digital HDTV and worked on their VLSI implementations. From 1998 to 2000, he worked for IBM Austin Research Laboratory, where he was engaged in 1 GHz PowerPC prototype development and then explored high-speed CMOS circuit implementation of 64-bit multipliers and adders. He has been with Microelectronics Division at SUN Microsystems since 2000. He published 7 papers in several international conferences and is an inventor or co-inventor of 10 U.S. patents.
ohsang.kwon@eng.sun.com

Kevin J. Nowka received his B.S. degree in computer engineering from Iowa State University in 1986 and his M.S. and Ph.D. degrees in electrical engineering from Stanford University in 1988 and 1995, respectively. He joined the IBM Austin Research Laboratory in 1996 where he has conducted research on CMOS VLSI circuits and arithmetic functions for application to the design of high-frequency and low-power CMOS processors. He developed circuits for two gigahertz microprocessors and for an ultralow-power embedded PowerPC processor. He holds fourteen patents related to processor design.
nowka@us.ibm.com
between computer architecture and VLSI technology. This involves computer arithmetic, VLSI development and digital signal processor implementation. He is currently the hardware area editor for ACM Computing Reviews, the computer arithmetic editor for the Journal of Systems Architecture. He has been the Editor-in-Chief of the IEEE Transactions on Computers, the IEEE Transactions on Signal Processing and was the founding Editor-in-Chief of the Journal of VLSI Signal Processing. He has been a member of the Board of Governors of the IEEE Computer Society (1987 to 1991), the ADCOM/Board of Governors of the IEEE Signal Processing Society (1992 to 1994) and a member of the IEEE Solid-State Circuits Council/Society (1986 to 1991). He was the Secretary of the IEEE Solid-State Circuits Council (1992 to 1993) and Treasurer of the IEEE Solid-State Circuits Council/Society (1994 to 1998). He has been a member of the IEEE History Committee since 1996. He has chaired or co-chaired many conferences including the following: the IEEE Workshop on Signal Processing Systems (SiPS) Design and Implementation (Lafayette, LA, October 11–13, 2000), the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (Boston, MA, July 10–12, 2000), the 31st Asilomar Conference on Signals, Systems & Computers (Monterey, CA, 1997), the 1994 International Conference on Application Specific Array Processors (San Francisco), the 1993 International Conference on Parallel and Distributed Systems, Taiwan, and the 11th Symposium on Computer Arithmetic (Windsor, Canada, 1993). He obtained his doctorate in computer design with the support of a Howard Hughes Doctoral Fellowship. He is a Fellow of the IEEE and is a registered professional engineer in four states. He has been recognized as an Outstanding Electrical Engineer and a Distinguished Engineering Alumnus of Purdue University and has received a Distinguished Engineering Alumnus Award from the University of Colorado. He is also a member of the IEEE Computer Society Golden Core and the recipient of an IEEE Third Millennium Medal.
eswartzla@aol.com