
Journal of VLSI Signal Processing 31, 77–89, 2002


© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

A 16-Bit by 16-Bit MAC Design Using Fast 5:3 Compressor Cells

OHSANG KWON∗
SUN Microsystems Inc., Palo Alto, CA 94303, USA

KEVIN NOWKA
IBM Austin Research Laboratory, Austin, Texas 78758, USA

EARL E. SWARTZLANDER, JR.


Department of Electrical and Computer Engineering, University of Texas at Austin, Texas 78712, USA

Received October 17, 2000; Revised October 8, 2001

Abstract. 3:2 counters and 4:2 compressors have been widely used for multiplier implementations. In this paper,
a fast 5:3 compressor is derived for high-speed multiplier implementations. The fast 5:3 compression is obtained
by applying two rows of fast 2-bit adder cells to five rows in a partial product matrix. As a design example, a 16-bit
by 16-bit MAC (Multiply and Accumulate) design is investigated both in a purely logical gate implementation and
in a highly customized design. For the partial product reduction, the use of the new 5:3 compression leads to 14.3%
speed improvement in terms of XOR gate delay. In a dynamic CMOS circuit implementation using 0.225 µm bulk
CMOS technology, 11.7% speed improvement is observed with 8.1% less power consumption for the reduction
tree.

Keywords: 3:2 counter, 4:2 compressor, 5:3 compressor, 5:2 compressor, multiplier, MAC

1. Introduction

Multiplication is one of the most frequently encountered arithmetic operations in microprocessors and signal processors. Due to the delay and complexity of multipliers, efficient designs have been pursued over the last several decades. In this paper, a new design method is proposed for multiplier, multiple adder and fused MAC (Multiply and Accumulate) designs.

The fast multiplication process consists of three steps [1, 2]: partial product generation, partial product reduction and final carry-propagating addition. To reduce the number of partial products, recoding techniques have been widely used [3–5]. In the partial product reduction step, 3:2 counters and 4:2 compressors are basic elements that are frequently used to reduce a partial product matrix into two rows [1, 2, 6, 7]. The reduction process is performed with a CSA (Carry-Save Adder) array using the basic elements. A high-speed carry-propagating adder is used to produce the final result from the two rows.

In this paper, a new fast 5:3 compressor is proposed for high-performance multiplier implementation. The new fast 5:3 compressor is obtained by applying 2-bit adder cells in parallel, and a new logical decomposition is used for fast implementation of the 2-bit adder. In the new logical decomposition, the 2-bit adder cell has 2 XOR delays on the critical path. A 16-bit by 16-bit MAC (Multiply and Accumulate) design is presented using the fast 5:3 compressors and a radix-4 Booth recoding technique.

∗This work was done when the author was with the Department of Electrical and Computer Engineering, University of Texas at Austin.

In the MAC design, for high throughput operation, the result of partial product reduction can be fed back directly to the next cycle in a redundant form (carry-save). In this case, a new method can be used to eliminate one row of the partial product matrix before the actual reduction process begins, which can eliminate one 3:2 counter delay for some specific operand sizes. Overall, the new design is 14.3% faster in terms of XOR delays for the partial product reduction. In a highly customized dynamic CMOS circuit implementation, it is observed that the new method provides 11.7% speed improvement while it consumes 8.1% less power in the reduction tree.

In this paper, the new 5:3 compressor design is presented in Section 2. Section 3 presents a 16-bit by 16-bit two's complement MAC implementation example and compares the performance to a conventional design. Section 4 discusses possible extensions of the new method. Finally, a conclusion is given in Section 5.

2. New Compressors

2.1. 3:2 Counters and 4:2 Compressors

Figure 1 shows two block diagrams of a 3:2 counter. 3:2 counters have been widely used in fast multiplier implementations from the earliest efforts [1, 2] to more recent works in CMOS circuits [8–10]. A 3:2 counter has three input bits of equal weight and two output bits, sum and carry, of the same and one greater binary bit weight, respectively. The sum output of a 3:2 counter takes two XOR delays (an XOR delay is denoted Δxor hereafter).

Figure 1. 3:2 counters.

Figure 2 shows two logical decompositions of a 4:2 compressor [7, 10–12]. The 4:2 compressor has five input bits including a carry-in from the neighboring less significant cell and three outputs including a carry-out to the more significant neighboring cell. Both designs shown in Fig. 2 have 3Δxor on their longest path.

Figure 2. 4:2 compressors.

2.2. A 5:3 Compressor

A fast 5:3 compressor can be constructed with 2-bit adder cells. Figure 3 shows the block diagrams of two fast 2-bit adder cells. To minimize the delay, two different logical decompositions of a full adder are combined to form the 2-bit adder. The first full adder (Type-I) has three input bits in the less significant bit position and calculates an output carry in a fast manner. The second full adder (Type-II) accepts the carry input in the middle of its calculation. Through combining the different full adder styles for the 2-bit adder configuration, the overall delay of the 2-bit adder is only 2Δxor, which is the same as that of a 1-bit full adder's sum bit delay.

Figure 3. Two logical decompositions of a fast 2-bit adder. (a) Type-I; (b) Type-II.

Figure 4 shows a fast 5:3 compression that is implemented using the fast 2-bit adder cells. By applying two rows of 2-bit adder cells in parallel, five rows are reduced to three rows as shown in the figure.

Figure 4. 5:3 compression using 2-bit adder cells.

Using the new fast 5:3 compression, the speed of multipliers and fused multiply-adders can be improved, as will be explained in Sections 3 and 4. Here, XOR delay is used as the speed measure of 3:2 counters, 4:2 compressors and the new 5:3 compressors since all of their logical decompositions employ a series of XORs. In this way, the speed comparison can be made independent of specific technologies, though actual relative delays may be slightly different depending on specific circuit families and fabrication technologies employed. Many multiplier designs use specially designed counters and compressors depending on their processing technology. Section 3 gives an example of the fast 2-bit adder cell design using a fully customized dynamic CMOS circuit.
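As an aside for readers who want to experiment with the cells described above, the following short Python model is an editorial sketch added here, not part of the original paper; the names full_adder and two_bit_adder are introduced for illustration. It captures the behavior, though not the gate delays, of a 3:2 counter and of the fast 2-bit adder cell built from a Type-I and a Type-II full adder, and checks the weight-preserving identity behind the 5:3 compression.

# Editorial sketch in Python (not from the paper): bit-level behavior of the basic
# cells, with no modeling of gate delay.
from itertools import product

def full_adder(a, b, c):
    # 3:2 counter: three bits of equal weight in, (sum, carry) out.
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def two_bit_adder(a0, b0, c0, a1, b1):
    # Three bits at weight 1 and two bits at weight 2 in; (s0, s1, cout) out.
    s0, c1 = full_adder(a0, b0, c0)    # Type-I position: generates the internal carry
    s1, cout = full_adder(a1, b1, c1)  # Type-II position: absorbs that carry
    return s0, s1, cout

# The cell is value-preserving: five input bits become three output bits of the right weights.
for a0, b0, c0, a1, b1 in product((0, 1), repeat=5):
    s0, s1, cout = two_bit_adder(a0, b0, c0, a1, b1)
    assert (a0 + b0 + c0) + 2 * (a1 + b1) == s0 + 2 * s1 + 4 * cout

Because every cell consumes five bits and emits three while preserving the weighted sum, the parallel rows of such cells in Fig. 4 reduce five rows of the matrix to three.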

2.3. 5:2 Compressors

The 5:3 compressor can be used to build other compressors. Figure 5 shows a 5:2 compressor cell using the fast 5:3 compressor and a 3:2 counter. It has five direct input bits (a, b, c, d, e) and two additional carry-in bits (x1_in, x2_in) from a neighboring less significant cell. Among the four output bits, two are carry-out bits (x1_out, x2_out) to the neighboring more significant cell, and the others are a carry bit (carry) of one-greater significance and a sum bit (sum). It has a sum delay of 4Δxor on the critical path.

Figure 5. A 5:2 compressor cell using a 5:3 compressor and a 3:2 counter.

Figure 6 shows another 5:2 compressor cell which also has a sum delay of 4Δxor [13]. Its carry output has the same delay as the sum output, while the carry output is faster than the sum in the block diagram shown in Fig. 5.

Figure 6. A direct 5:2 compressor logical decomposition.

The following theorem proves the correctness of the 5:2 compressor in Fig. 6.

Theorem 1. a, b, c, d, e, x1_in, x2_in are input bits of the 5:2 compressor in Fig. 6 and x1_out, x2_out, carry, sum are output bits in Fig. 6. Then,

a + b + c + d + e + x1_in + x2_in = (x1_out + x2_out + carry) · 2 + sum.   (1)

Proof: From the definition in Table 1, the following holds true.

x1_out = ((a ∨ b) ∧ (c ∨ d))   (2)
s1 = ((a ∧ b) ∨ (c ∧ d))   (3)
(s1 ⊕ n) = ((a ⊕ b) ⊕ (c ⊕ d))   (4)
a + b + c + d = x1_out · 2 + s1 + n.   (5)

Table 1. A redundant representation of (a + b + c + d).

a+b+c+d   Other conditions       x1_out   n   s1
0                                 0       0   0
1                                 0       1   0
2         ((a ∧ b) ∨ (c ∧ d))     0       1   1
2         ((a ∨ b) ∧ (c ∨ d))     1       0   0
3                                 1       0   1
4                                 1       1   1

Let x2_out and s2 represent the carry and save terms of (s1 + n + x1_in), respectively. Then,

x2_out ≡ (s1 ∧ n) ∨ (s1 ∧ x1_in) ∨ (n ∧ x1_in)
       = (s1 ∧ n) ∨ (x1_in ∧ (s1 ∨ n))
       = ((¬(s1 ⊕ n) ∧ s1) ∨ ((s1 ⊕ n) ∧ x1_in))   (6)
s2 ≡ ((s1 ⊕ n) ⊕ x1_in)   (7)
s1 + n + x1_in = x2_out · 2 + s2.   (8)

In a similar manner, let carry and sum be the carry and sum of (s2 + e + x2_in), respectively. Then,

carry ≡ ((¬(s2 ⊕ e) ∧ e) ∨ ((s2 ⊕ e) ∧ x2_in))   (9)
sum ≡ ((s2 ⊕ e) ⊕ x2_in)   (10)
s2 + e + x2_in = carry · 2 + sum.   (11)

From Eqs. (5), (8) and (11),

a + b + c + d + e + x1_in + x2_in
  = x1_out · 2 + s1 + n + e + x1_in + x2_in
  = x1_out · 2 + x2_out · 2 + s2 + e + x2_in
  = (x1_out + x2_out + carry) · 2 + sum.   (12)

It is readily seen that the block diagram of Fig. 6 generates x1_out, x2_out, carry and sum according to their definitions in Eqs. (2), (6), (9) and (10), respectively.
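The proof can also be checked mechanically. The following Python sketch is an illustration added here, not the authors' code; it transcribes Eqs. (2)–(4), (6), (7), (9) and (10) as Boolean expressions on 0/1 integers and verifies Eq. (1) for all 128 combinations of the seven inputs.

# Editorial sketch: the direct 5:2 decomposition of Fig. 6, checked against Eq. (1).
from itertools import product

def direct_5_2_compressor(a, b, c, d, e, x1_in, x2_in):
    x1_out = (a | b) & (c | d)                            # Eq. (2)
    s1 = (a & b) | (c & d)                                # Eq. (3)
    n = s1 ^ ((a ^ b) ^ (c ^ d))                          # Eq. (4): s1 XOR n is the parity of a, b, c, d
    x2_out = (((s1 ^ n) ^ 1) & s1) | ((s1 ^ n) & x1_in)   # Eq. (6), first term complemented
    s2 = (s1 ^ n) ^ x1_in                                 # Eq. (7)
    carry = (((s2 ^ e) ^ 1) & e) | ((s2 ^ e) & x2_in)     # Eq. (9), first term complemented
    s = (s2 ^ e) ^ x2_in                                  # Eq. (10)
    return x1_out, x2_out, carry, s

for bits in product((0, 1), repeat=7):
    x1_out, x2_out, carry, s = direct_5_2_compressor(*bits)
    assert sum(bits) == 2 * (x1_out + x2_out + carry) + s  # Eq. (1)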
3. A Design Example: 16-bit × 16-bit MAC Design

3.1. A Logical Architecture of a MAC Using the New 5:3 Compressor

Figure 7 shows a block diagram of a 16-bit by 16-bit 2's complement MAC design. To reduce the number of partial products, radix-4 Booth recoding is used. For high-throughput operation, the accumulator stores an intermediate result in a carry-save form and the accumulator output is fed back for the next accumulation operation. In this manner, carry-propagating addition is separated from the critical path in MAC operation. The partial products and accumulator output are combined to form a partial product matrix and undergo an 11:2 reduction process. The final two rows after the reduction process are added in the carry-propagating adder.

Figure 7. A block diagram of the 16-bit by 16-bit MAC.

Figure 8 shows the necessary 11:2 reduction tree obtained using two 5:3 compressors, three 3:2 counters and a 4:2 compressor. Figure 9 shows the reduction process starting from the partial product matrix. In the partial product matrix, the sign bits of partial products are shown as white dots and the carry-ins of partial products are shown as grey dots. A bar above a dot means the logical inverse of the dot. When a recoded digit is negative, the corresponding partial product has a carry-in. The method of handling two's complement partial products in the partial product matrix is well discussed in [14, 15].

Figure 8. A reduction tree for 11:2 compression.

The partial product matrix consists of eight partial products obtained through radix-4 Booth recoding and two rows from the accumulator output. The maximum height of a bit slice is eleven since there are carry-ins with the partial products due to the possible negative recoded digits in the Booth recoding process. However, the accumulator output bits are available at the beginning of the cycle. The carry-in bit of the partial product and the constant ones are also available without any delay in the cycle. Therefore, the bottom three rows in the partial product matrix can be reduced to two rows using a row of full adders simultaneously during the partial product generation. The partial product generation requires one Booth encoding delay and one Booth decoding (5:1 MUX) delay, which is longer than one full adder delay. With the pre-reduction row of full adders, there are only ten rows to be reduced in the main reduction process.
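For illustration, a toy Python model of the carry-save accumulation loop described above is given below. It is a hedged sketch with names introduced here; it uses plain unsigned integers rather than the paper's 2's complement, Booth-recoded datapath. The accumulator is held as a redundant (save, carry) pair, each MAC step performs only a 3:2 reduction, and a single carry-propagating addition produces the final result.

# Editorial sketch: carry-save accumulation with a redundant accumulator.
def csa(x, y, z):
    # Word-level 3:2 reduction: x + y + z == save + carry.
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def mac(operand_pairs):
    save, carry = 0, 0
    for a, b in operand_pairs:
        # The product stands in for the reduced partial product rows of one cycle.
        save, carry = csa(save, carry, a * b)   # no carry propagation inside the loop
    return save + carry                         # single carry-propagating add at the end

assert mac([(3, 5), (7, 2), (4, 4)]) == 3 * 5 + 7 * 2 + 4 * 4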
Figure 9 shows the detailed view of the reduction process using the new reduction tree. Overall, the design using the new 5:3 compressor takes 7Δxor compared to the 8Δxor required by a conventional design using 4:2 compressors and 3:2 counters, which is a 14.3% speed improvement.

Figure 9. A dot diagram of the 11:2 reduction process.

3.2. Circuit Implementation

The radix-4 Booth recoding technique has been widely used since it reduces the number of partial products by approximately half [3–6, 16]. There are well-known Booth encoding methods and corresponding static CMOS circuits in previous works [6, 16]. However, for dynamic CMOS circuit implementation, dual-rail signals are required at the output of the Booth decoders. This entails at least five control signals for a single-stage Booth decoder implementation using a direct implementation of previous encoding methods.

In this paper, a new encoding method is used and it will be shown that four control signals are enough with a carefully designed Booth decoder circuit. This will be helpful in large multiplier designs where interconnection complexity is one of the design bottlenecks. Table 2 shows the new encoding method. Figure 10 shows its dynamic CMOS implementation and Fig. 11 shows the accompanying Booth decoder circuits. In the Booth decoder circuit, for the complementary signal generation, a control signal is needed to indicate a recoded value of zero. Using the new encoding method, this event can be represented when both X and 2X are high, and the circuit can be configured to take advantage of it. Note that some of the NMOS devices are shared between adjacent Booth decoder cells to minimize the cell area of the Booth decoder.

Table 2. A new Booth encoding.

X2i+1   X2i   X2i−1     Pi   Mi   Xi   2Xi
  0      0      0        0    0    1    1
  0      0      1        1    0    1    0
  0      1      0        1    0    1    0
  0      1      1        1    0    0    1
  1      0      0        0    1    0    1
  1      0      1        0    1    1    0
  1      1      0        0    1    1    0
  1      1      1        0    0    1    1

Figure 10. A Booth encoding circuit.
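The encoding of Table 2 can be summarized behaviorally as follows. The Python sketch below is illustrative only: ENCODE and booth_digit are names introduced here, and the digit interpretation is one consistent reading of the control signals, not a description of the circuits in Figs. 10 and 11. It maps the three scanned multiplier bits to the four control signals P, M, X and 2X and checks that they reproduce the standard radix-4 Booth digit.

# Editorial sketch of Table 2: (x2i+1, x2i, x2i-1) -> (P, M, X, 2X).
ENCODE = {
    (0, 0, 0): (0, 0, 1, 1),
    (0, 0, 1): (1, 0, 1, 0),
    (0, 1, 0): (1, 0, 1, 0),
    (0, 1, 1): (1, 0, 0, 1),
    (1, 0, 0): (0, 1, 0, 1),
    (1, 0, 1): (0, 1, 1, 0),
    (1, 1, 0): (0, 1, 1, 0),
    (1, 1, 1): (0, 0, 1, 1),
}

def booth_digit(p, m, x, x2):
    # X and 2X both high mark a recoded zero; otherwise X selects the 1x multiple,
    # 2X the 2x multiple, and P/M give the sign.
    if x and x2:
        return 0
    magnitude = 1 if x else 2
    return magnitude if p else -magnitude

# The four control signals reproduce the standard radix-4 Booth digit.
for (x2i1, x2i, x2im1), controls in ENCODE.items():
    assert booth_digit(*controls) == -2 * x2i1 + x2i + x2im1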

Figure 12 shows a dynamic implementation of the fast 2-bit adder cells. To minimize the number of devices and to get fast operation, the 2-bit adder cell is designed using multiple-output domino logic (MODL), where some intermediate nodes in the evaluation paths are used as outputs as well. In Fig. 3, two XOR stages are needed for a fast implementation of the 2-bit adder cell. The two stages are logically combined in a one-stage domino implementation instead of a separate two-stage domino implementation for faster operation.

Figure 11. A Booth decoding circuit.

Figure 12. A fast 2-bit adder implementation using MODL.

3.3. Performance Comparison

In the 16-bit by 16-bit MAC design, the new 5:3 compressor was applied and it was shown that it effectively eliminates one Δxor in the partial product reduction. That is, it takes 7Δxor instead of 8Δxor in a conventional design, which is a 14.3% speed improvement. Table 3 shows a circuit simulation result using a highly customized dynamic CMOS circuit implementation in 0.225 µm bulk CMOS technology [17]. The conventional design uses only 4:2 compressors and/or 3:2 counters in the reduction tree [10]. The new design uses the new 5:3 compressors as well as 4:2 compressors and 3:2 counters, as seen in Fig. 8. In fast dynamic CMOS implementations, a 3:2 counter is implemented with one-stage domino logic and a 4:2 compressor is implemented with two cascaded domino full adders. The new 5:3 compressor in Fig. 12 is implemented using only one-stage domino logic and thus is much faster than the 4:2 compressor. In the reduction tree, the new design requires only four domino stages while the conventional scheme has five domino stages. In domino logic implementation, it is important to reduce the number of stages since each domino stage has an output inverter

as well as an NMOS pull-down path. Overall, the new design has an 11.7% speed improvement over the conventional design with fewer devices and less power consumption, as seen in Table 3. The advantage mainly comes from the use of the new efficient 5:3 compression using the fast 2-bit adder cell.

Table 3. Performance comparison of reduction trees.

                               New scheme       Conventional scheme [10]
No. of domino stages           4                5
Delay                          149 ps (5:3)     209 ps (4:2-I)
                               118 ps (3:2)     112 ps (3:2)
                               187 ps (4:2)     186 ps (4:2-II)
Total                          454 ps           507 ps
No. of TRs used/bit slice      382              426
Power consumption/bit slice    4.19 mW          4.56 mW

In the circuit implementation, dynamic CMOS circuits are exclusively used for maximum performance. In general, CMOS circuits are classified into static CMOS and dynamic CMOS circuits. Static CMOS circuits are good for moderate performance and low power consumption. For high performance operation, dynamic circuits are often used at the cost of overhead for precharge clock distribution and relatively large power consumption. Nonetheless, the use of dynamic CMOS circuits has been reported in many high-end processor designs where circuit speed is more important than other design factors [18–20]. For static circuit implementation, it should be noted that the dynamic CMOS circuit topology in this paper can be directly applied to DCVS (Differential Cascode Voltage Switch) circuits since their topologies are quite similar. The logical decompositions in Section 2 can be implemented in the popular CPL (Complementary Pass-gate Logic), DPL (Differential Pass-gate Logic) and static CMOS circuit styles as well.

4. Discussion

4.1. 5:3 Compressor

In this paper, a new 5:3 compressor is applied to a 16-bit by 16-bit MAC design. In general, the 5:3 compressor can be used for other multiplication or multiply-addition applications.

Figure 13 shows a 5:3 compressor tree example for 21:2 reduction. For maximum performance, 3:2 counters are used whenever 3 or 4 rows are left even after the application of 5:3 compressors. The delay is given as follows:

delay = ⌈log5/3(n/3)⌉ · Δ5:3 + ΔFA   (13)

Figure 13. 5:3 compressor tree architecture for 21:2 reduction.

The maximum number of rows which the 5:3 compressor tree architecture can handle is

5, 8, 13, 21, 35, . . .

starting from a tree height of 2. Since the delay is O(log n), as in a Wallace/Dadda tree, the relative performance is mainly determined by the compressor cell design. In addition, the performance can be different depending on specific operand sizes because the use of a specific compressor leads to a specific sequence of maximum numbers of rows to be handled. Figure 14 shows a delay comparison for various numbers of input rows in a dynamic CMOS circuit implementation. The Wallace tree performs especially well when the number of rows is 6, 9, 37–42 and 59–63. The binary tree is good when the number of rows is 4, 14–16 and 59–64. The 5:3 compressor is best when the number of rows is 5, 10–13, 17–21, 33–35, 43–58 and 65–70.
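The capacity sequence and Eq. (13) can be reproduced with a few lines of Python. This is an added sketch; the recurrence below assumes a final full-adder (3:2) stage that absorbs three rows, which matches the quoted sequence but is our reading of the architecture.

# Editorial sketch of the tree capacity and of the stage count in Eq. (13).
import math

def max_rows(levels, base=3):
    # Each additional 5:3 level multiplies the acceptable row count by 5/3 (rounded down).
    rows = base
    for _ in range(levels):
        rows = rows * 5 // 3
    return rows

print([max_rows(k) for k in range(1, 6)])   # [5, 8, 13, 21, 35], the sequence quoted above

def num_53_stages(n):
    # Number of 5:3 stages in Eq. (13): ceil(log base 5/3 of n/3).
    return math.ceil(math.log(n / 3, 5 / 3))

print(num_53_stages(21))   # 4 stages of 5:3 compression for the 21:2 example of Fig. 13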
Figure 14. Delay comparison: tree architectures with dynamic CMOS circuit implementation.

So far, the use of the new 5:3 compressor has been compared with conventional partial product reduction methods using 3:2 counters and 4:2 compressors. In these methods, the reduction architecture and its delay are determined by the tallest vertical bit slice in a PPM (partial product matrix). The logic elements and their interconnections in the vertical slice are replicated to different bit positions. This approach has been widely used in fully customized circuit implementations of large multipliers, where interconnection complexity and design modularity are major design bottlenecks. One of the recent works on partial product reduction uses an algorithmic approach utilizing the delay difference of carry and sum outputs to attain maximal performance [7]. The basic idea of this method, called TDM (Three Dimensional Minimization), is to make proper connections globally over entire vertical bit slices so that the delay throughout each path is approximately the same. The long delay path originating from the previous compressor should be connected to the short delay path of the next one, and so on. This method is effective and known to be optimal when there is a distinct delay difference among the logic paths inside a compressor cell, as in Fig. 1. However, in fully customized circuit implementations, compressors and/or counters are designed to minimize overall delay and transistor counts. The distinct delay difference is not usually found in many customized circuit implementations. In addition, the algorithmic approach leads to different interconnections on different bit slices, which is a major design bottleneck in large multipliers where interconnection complexity is a dominant design factor as well as area and power consumption.

The 5:3 compressor has a similarity with the algorithmic approach in that it utilizes the delay difference on paths from inputs to outputs during logic derivation. However, in the 5:3 compressor, connections are optimized within a local area (the 5:3 compressor itself, which spans only 2 consecutive bit positions) and the resulting cell has equal delay on its outputs from the viewpoint of logical analysis. Thus, the conventional partial product reduction methods can be applied in a similar manner and the benefits of interconnection regularity and design modularity of conventional reduction methods are preserved. Simultaneously, delay and area can be improved by circuit optimization in the cell itself, as shown in the circuit implementation example of Section 3.2.

4.2. Booth Encoding and Pre-Reduction Technique

In multiplier designs, the radix-4 Booth recoding technique has been widely used for partial product generation, though the recoding entails Booth encoders and Booth decoders to compress the original partial products into half. In fact, a similar compression can be done using one 4:2 compression stage. The advantage of the Booth recoding technique comes from the following aspects: speed, area and power consumption.

In terms of delay, the recoding involves encoder and decoder delay while non-Booth recoding takes one 2-input AND gate and a 4:2 compressor delay. It should be noted that the delay comparison depends on specific implementation styles and its investigation is beyond the scope of this paper. Nonetheless, it is seen that radix-4 recoding is preferred especially in many custom circuit implementations since the encoding and the decoding can each be implemented in one custom logic gate. Specifically, in a dynamic CMOS circuit implementation, radix-4 recoding involves only two domino stages while non-Booth recoding takes two domino stages for the 4:2 compressor delay and another for the 2-input AND gate. In terms of area, one Booth decoder performs the role of one 4:2 compressor if the relatively small Booth encoding logic is ignored. In custom circuit designs, the Booth decoder is easily implemented as one custom logic gate and its area can be minimized by sharing some transistors with adjacent cells, just as in the dynamic CMOS circuit implementation of Fig. 11, while a 4:2 compressor requires two 3:2 counters. The smaller area and fewer logic stages lead to less power consumption, too. Radix-4 Booth recoding is used in the implementation example of this paper to reflect these advantages.

In addition, in the MAC design example, a row of full adders was used for 3:2 reduction of the three rows not driven by the Booth recoder in the partial product
Figure 15. Partial product matrix for a 13-bit by 13-bit 2's complement MAC.

matrix prior to the main reduction process. The delay of the full adder row is smaller than that of the partial product generation when radix-4 or higher radix recoding techniques are used. In these recoding techniques, Booth encoding and Booth multiplexing are needed for the partial product generation.

If the delay of partial product generation is greater than that of a 4:2 compressor, 4:2 compression can be used for the four rows that are available early in the partial product matrix. Figure 15 shows an example of this case, which is useful to reduce one more row in the partial product matrix before the partial product reduction process begins. The example is for a 13-bit by 13-bit 2's complement MAC design. Using this method, the delay becomes 6Δxor for an 8:2 reduction instead of 7Δxor for a 10:2 reduction. This method is applicable when the operand width of the MAC is an odd number.

5. Conclusions

In this paper, a new fast 5:3 compression method is derived from a fast 2-bit adder cell. The 2-bit adder cell has a delay of 2Δxor when the new logical decomposition is used. In addition, its one-stage dynamic CMOS circuit is proposed for a highly customized design methodology. For the partial product reduction of a 16-bit by 16-bit MAC, the use of the new 5:3 compressor cell leads to a 14.3% speed improvement in terms of XOR delay. In a highly customized dynamic CMOS circuit implementation, an 11.7% speed improvement is observed with 8.1% less power consumption in 0.225 µm bulk CMOS technology.

References

1. C.S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Trans. on Electronic Computers, vol. EC-13, 1964, pp. 14–17.
2. L. Dadda, "Some Schemes for Parallel Multiplier," Alta Freq., vol. 34, 1965, pp. 349–356.
3. O.L. MacSorley, "High Speed Arithmetic in Binary Computers," Proc. IRE, vol. 49, 1961, pp. 67–91.
4. H. Sam and A. Gupta, "A Generalized Multibit Recoding of Two's Complement Binary Numbers and Its Proof with Application in Multiplier Implementation," IEEE Trans. on Computers, vol. 39, 1990, pp. 1006–1015.
5. S. Vassiliadis, E.M. Swartz, and D.J. Hanrahan, "A General Proof for Overlapped Multiple-Bit Scanning Multiplications," IEEE Trans. on Computers, vol. 38, 1989, pp. 172–183.
6. P.J. Song and G.D. Micheli, "Circuit and Architecture Trade-offs for High-Speed Multiplication," IEEE Journal of Solid-State Circuits, vol. 26, 1991, pp. 1184–1198.
7. V.G. Oklobdzija, D. Villeger, and S.S. Liu, "A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach," IEEE Trans. on Computers, vol. 45, 1996, pp. 294–305.
8. K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, "A 3.8-ns CMOS 16 × 16-b Multiplier Using Complementary Pass-Transistor Logic," IEEE Journal of Solid-State Circuits, vol. 25, 1990, pp. 388–395.
9. A. Parameswar, H. Hara, and T. Sakurai, "A Swing Restored Pass-Transistor Logic-Based Multiply and Accumulator Circuit for Multimedia Applications," IEEE Journal of Solid-State Circuits, vol. 31, 1994, pp. 804–809.
10. M. Izumikawa, H. Igura, K. Furuta, H. Ito, H. Wakabayashi, K. Nakajima, T. Mogami, T. Horiuchi, and M. Yamashina, "A 0.25-µm CMOS 0.9-V 100-MHz DSP Core," IEEE Journal of Solid-State Circuits, vol. 32, 1997, pp. 52–61.
11. N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 4.4 ns CMOS 54 × 54-b Multiplier Using Pass-Transistor Multiplexer," IEEE Journal of Solid-State Circuits, vol. 30, 1993, pp. 251–257.
12. C.F. Law, S.S. Rofail, and K.S. Yeo, "A Low-Power 16 × 16-b Parallel Multiplier Utilizing Pass-Transistor Logic," IEEE Journal of Solid-State Circuits, vol. 34, 1999, pp. 1395–1399.
13. O. Kwon, K. Nowka, and E. Swartzlander, Jr., "A 16-bit × 16-bit MAC Design Using Fast 5:2 Compressors," in International Conference on Application-Specific Systems, Architectures, and Processors, 2000, pp. 235–243.
14. C.R. Baugh and B.A. Wooley, "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Trans. on Computers, vol. C-22, 1973, pp. 1045–1047.
15. S. Vassiliadis, E.M. Swartz, and B.M. Sung, "Hard-Wired Multipliers with Encoded Partial Products," IEEE Trans. on Computers, vol. 40, 1991, pp. 1181–1197.
16. G. Goto, A. Inoue, R. Ohe, S. Kashiwakura, S. Mitarai, T. Tsuru, and T. Izawa, "A 4.1-ns Compact 54 × 54-b Multiplier Utilizing Sign-Select Booth Encoders," IEEE Journal of Solid-State Circuits, vol. 32, 1997, pp. 1676–1682.
17. L. Su, R. Schulz, J. Adkisson, K. Beyer, G. Biery, W. Cote, E. Crabbe, D. Edelstein, J. Ellis-Monaghan, E. Eld, D. Foster, R. Gehres, R. Goldblatt, N. Greco, C. Guenther, J. Heidenreich, J. Herman, D. Kiesling, L. Lin, S.-H. Lo, J. McKenna, C. Megivern, H. Ng, J. Oberschmidt, A. Ray, N. Rohrer, K. Tallman, T. Wagner, and B. Davari, "A High-Performance Sub-0.25 µm CMOS Technology with Multiple Thresholds and Copper Interconnects," in 1998 Symposium on VLSI Technology Digest of Technical Papers, 1998, pp. 18–19.
18. Neil H.E. Weste and Kamran Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd ed., Santa Clara, CA: Addison-Wesley Publishing Company, 1993.
19. Kevin J. Nowka and Tibi Galambos, "Circuit Design Techniques for a Gigahertz Integer Microprocessor," in International Conference on Computer Design, 1998, pp. 11–16.
20. J. Silberman, N. Aoki, D. Boerstler, J. Burns, S. Dhong, A. Essbaum, U. Ghoshal, D. Heidel, P. Hofstee, K. Lee, D. Meltzer, H. Ngo, K. Nowka, S. Posluszny, O. Takahashi, I. Vo, and B. Zoric, "A 1.0 GHz Single-Issue 64-Bit PowerPC Integer Processor," IEEE Journal of Solid-State Circuits, vol. 33, 1998, pp. 1600–1608.

Ohsang Kwon is a member of technical staff at SUN Microsystems, Inc., Palo Alto, California, where he is involved in developing Ultra-Sparc III, IV microprocessors for high-end servers. His research interests are in computer arithmetic and high-speed CMOS circuit implementation. He obtained his Ph.D. in electrical and computer engineering at The University of Texas at Austin in 2000. His dissertation research was in the field of computer arithmetic and its high-speed CMOS circuit implementation. He received M.S. and B.S. degrees in electronics engineering at Seoul National University, Seoul, Korea in 1992 and 1990, respectively. From 1992 to 1997, he worked for Daewoo Electronics Co. Ltd., Seoul, Korea as a research engineer. During this period, he developed an image decoder, demodulator and channel decoder for a prototype digital HDTV and worked on their VLSI implementations. From 1998 to 2000, he worked for the IBM Austin Research Laboratory, where he was engaged in 1 GHz PowerPC prototype development and then explored high-speed CMOS circuit implementation of 64-bit multipliers and adders. He has been with the Microelectronics Division at SUN Microsystems since 2000. He has published 7 papers in several international conferences and is an inventor or co-inventor of 10 U.S. patents.
ohsang.kwon@eng.sun.com

Kevin J. Nowka received his B.S. degree in computer engineering from Iowa State University in 1986 and his M.S. and Ph.D. degrees in electrical engineering from Stanford University in 1988 and 1995, respectively. He joined the IBM Austin Research Laboratory in 1996, where he has conducted research on CMOS VLSI circuits and arithmetic functions for application to the design of high-frequency and low-power CMOS processors. He developed circuits for two gigahertz microprocessors and for an ultralow-power embedded PowerPC processor. He holds fourteen patents related to processor design.
nowka@us.ibm.com

Earl E. Swartzlander, Jr. is a Professor of Electrical and Computer Engineering at the University of Texas at Austin, where he holds the Schlumberger Centennial Chair in Engineering. Previously he held a variety of engineering and technical management positions with TRW Defense and Space Systems from 1975 to 1990. His research interests are in the interaction between computer architecture and VLSI technology. This involves computer arithmetic, VLSI development and digital signal processor implementation. He is currently the hardware area editor for ACM Computing Reviews and the computer arithmetic editor for the Journal of Systems Architecture. He has been the Editor-in-Chief of the IEEE Transactions on Computers and the IEEE Transactions on Signal Processing, and was the founding Editor-in-Chief of the Journal of VLSI Signal Processing. He has been a member of the Board of Governors of the IEEE Computer Society (1987 to 1991), the ADCOM/Board of Governors of the IEEE Signal Processing Society (1992 to 1994) and a member of the IEEE Solid-State Circuits Council/Society (1986 to 1991). He was the Secretary of the IEEE Solid-State Circuits Council (1992 to 1993) and Treasurer of the IEEE Solid-State Circuits Council/Society (1994 to 1998). He has been a member of the IEEE History Committee since 1996. He has chaired or co-chaired many conferences including the following: the IEEE Workshop on Signal Processing Systems (SiPS) Design and Implementation (Lafayette, LA, October 11–13, 2000), the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (Boston, MA, July 10–12, 2000), the 31st Asilomar Conference on Signals, Systems & Computers (Monterey, CA, 1997), the 1994 International Conference on Application Specific Array Processors (San Francisco), the 1993 International Conference on Parallel and Distributed Systems, Taiwan, and the 11th Symposium on Computer Arithmetic (Windsor, Canada, 1993). He obtained his doctorate in computer design with the support of a Howard Hughes Doctoral Fellowship. He is a Fellow of the IEEE and is a registered professional engineer in four states. He has been recognized as an Outstanding Electrical Engineer and a Distinguished Engineering Alumnus of Purdue University and has received a Distinguished Engineering Alumnus Award from the University of Colorado. He is also a member of the IEEE Computer Society Golden Core and the recipient of an IEEE Third Millennium Medal.
eswartzla@aol.com
