
A Fast and Well-Structured Multiplier

Jung-Yup Kang
Department of Electrical Engineering, University of Southern California

Jean-Luc Gaudiot
EECS Department, University of California at Irvine

Abstract

The performance of multiplication is crucial for multimedia applications such as 3D graphics and signal processing systems, which depend on extensive numbers of multiplications. Previously reported multiplication algorithms mainly focus on rapidly reducing the partial product rows down to the final sums and carries used for the final accumulation. These techniques mostly rely on circuit optimization and minimization of the critical paths.

In this paper, an algorithm to achieve fast multiplication in two's complement representation is presented. Indeed, our approach focuses on reducing the number of partial product rows. In turn, this directly influences the speed of the multiplication, even before any partial product reduction technique is applied. Fewer partial product rows are produced, thereby lowering the overall operation time. This results in a true diamond shape for the partial product tree, which is more efficient in terms of implementation.

1 Introduction

The performance of 3D graphics and signal processing systems strongly depends on the performance of multiplication, because these applications must support highly multiplication-intensive operations. Therefore, there has been much work on advanced multiplication algorithms and designs [1, 22, 3, 23, 18, 14, 13, 6, 7, 16, 20, 24, 4, 12].

There are three major steps to any multiplication. In the first step, the partial products are generated. In the second step, the partial products are reduced to one row of final sums and one row of carries. In the third step, the final sums and carries are added to generate the result. Most of the above-mentioned approaches employ the Modified Booth Encoding (MBE) approach [6, 7, 13, 24, 4] for the first step because of its ability to cut the number of partial product rows in half. They then select some variation of one of the partial product reduction schemes, such as Wallace trees [22, 6] or compressor trees [16, 13, 18, 14], for the second step, to rapidly reduce the number of partial product rows to the final two (sums and carries). In the third step, they use some kind of advanced adder, such as a carry-lookahead or carry-select adder [5, 17, 11], to add the final two rows and produce the final product.

The main focus of recent multiplier papers [7, 16, 20, 24, 4, 12] has been on rapidly reducing the partial product rows by using some kind of circuit optimization and by identifying the critical paths and signal races. In other words, the goal has been to optimize the second step of the multiplication described above.

In this paper, however, we concentrate on the first step, which consists in forming the partial product array, and we strive to design a multiplication algorithm that produces fewer partial product rows. With fewer partial product rows, the reduction tree can be smaller in size and faster in speed. It should also be noted that 8- and 16-bit words are the most commonly used word sizes in the kernels of most multimedia applications [19] and that the implementation of our overall algorithm is particularly well suited to such word sizes.

In the next section, the conventional multiplication method is described in detail, with an emphasis on its weaknesses. In Section 3, a step-by-step procedure to prevent the adverse effects of some conventional multiplication algorithms is presented. In Section 4, the effectiveness and usage of our method are demonstrated through a detailed evaluation.

2 Multiplication Algorithms

There is no doubt that MBE is efficient when it comes to reducing the partial products. However, it is important to note that there are two unavoidable consequences of using MBE: sign extension prevention and negative encoding. The combination of these two consequences results in the formation of one additional partial product row, and this additional row requires more hardware but, more importantly, more time (to add this one extra row of partial products).

2.1 Modified Booth Encoding and the Overhead of Negative Encodings

The grouping of the multiplier bits in MBE is shown in Figure 1; it is based on a window size of 3 bits and a stride of 2. The multiplier (Y) is segmented into groups of three bits (y2i+1, y2i, y2i-1), and each such group is associated with its own partial product row according to Table 1 [15]. In this grouping, y-1 = 0. By applying this encoding, the number of partial product rows to be accumulated is reduced from n to n/2.

Proceedings of the EUROMICRO Systems on Digital System Design (DSD’04)


0-7695-2203-3/04 $ 20.00 IEEE
Figure 1. Multiplier Bits Grouping According to Modified Booth Encoding for 8-bit Input

Table 1. Modified Booth Encoding

For example, for an 8 × 8 multiplication, a multiplier without MBE will generate eight partial product rows (because there is one partial product row for each bit of the multiplier). However, with MBE, only n/2 (= 4) partial product rows are generated, as shown in the example of Figure 2. In fact, there are actually n/2 + 1 partial product rows rather than n/2, because of the last neg signal (neg3 in Figure 2). The neg signals (neg0, neg1, neg2, and neg3) are needed because MBE may generate a negative encoding ((-1) times the multiplicand or (-2) times the multiplicand). Consequently, one additional carry save adding stage is needed to perform the reduction. This is the overhead of implementing the negative encodings of MBE.

2.2 Sign Extension and its Prevention

In signed multiplication, the sign bit of a partial product row would have to be extended all the way to the MSB position, which would require the sign bit to drive that many output loads (each bit position up to the MSB must have the same value as the sign). This makes the partial product rows unequal in length, as shown in Figure 2: the first row spans 16 bits (pp00 to the leftmost pp80), the second row 14 bits (pp01 to the leftmost pp81), the third row 12 bits (pp02 to the leftmost pp82), and the fourth row 10 bits (pp03 to the leftmost pp83). Ercegovac [5] has presented the sign extension prevention method of Agrawal [2], which arrives at the newly formed partial product rows of Figure 3, where the sign extension has been removed. We use this structure as the basis of our multiplier architecture.

Figure 3. Sign Extension Prevented Partial Products in an MBE Multiplication

2.3 One Additional Partial Product Row

However, there is still the problem of the last neg forming one additional partial product row (neg3 in Figure 3), which causes another carry save adder delay in generating the sums and carries before the final accumulation. This is because, in any case, one more partial product row means one additional 3-2 reduction. For example, Figure 4(a) shows an 8-input reduction (a 16-bit × 16-bit multiplication in our architecture) using 4-2 compressors. If we have to reduce 9 inputs (a 16-bit × 16-bit multiplication in the conventional architecture), one additional carry save adder is required, as in Figure 4(b). It is known that one 4-2 compressor costs 3 EXOR delays and one carry save adder costs 2 EXOR delays [16, 10]. Therefore, Figure 4(a) incurs 6 EXOR delays, whereas the model in Figure 4(b) incurs 8 EXOR delays. If we design the multiplier using a Wallace tree instead of 4-2 compressors (because a 9-input reduction fits better into a Wallace tree scheme), as shown in Figure 4(c), it still takes 8 EXOR delays. If we arrange the signals as in Figure 4(d), so that they are grouped intelligently to save some delay (in a carry save adder, the sum arrives after 2 EXOR delays and the carry after approximately 1 EXOR delay), the delay becomes 7 EXOR delays. As can be seen from the multiplier models in Figure 4, having one more partial product row adds at least one more EXOR delay to the time needed to reduce the partial products.

The fact that one additional partial product row brings delay is even more critical for multiplications of smaller words (8 ∼ 16 bits) than for longer operands, because of the relatively larger delay effect that this additional row has. There is also an extra hardware cost, since one more carry save adder stage is needed. A more quantitative evaluation of this effect will be provided in Section 4.

Figure 2. The Array of Partial Products for Signed Multiplication with MBE

Figure 4. Delay Effects in Reduction due to One More Partial Product Row: (a) 8-input reduction using 4-2 compressors; (b) 9-input reduction using 4-2 compressors; (c) 9-input reduction using a Wallace tree; (d) 9-input reduction using an optimized Wallace tree

3 Preventing the Additional Partial Product Row

Therefore, our goal is to remove the last neg signal. This would prevent the extra partial product row and thus save both the time of one additional carry save adding stage and the hardware that stage requires. We noticed that if we could somehow produce the two's complement of the multiplicand while the other partial products were being produced, there would be no need for the last neg, because this neg signal would already have been applied when generating the two's complement of the multiplicand. Therefore, we "only" need to find a faster method of calculating the two's complement of a binary number.

3.1 A Fast Method to Find Two's Complements

Figure 5. Two's Complement Conversion Example

Our method is an extension of the well-known algorithm in which two's complementation complements all the bits after the rightmost "1" in the word but keeps the other bits as they are. For example, the two's complement of the binary number 001010 (decimal 10) is 110110 (decimal -10), as shown in Figure 5. For this number, the rightmost "1" occurs in bit position 1 (the check mark position in Figure 5). Therefore, the values in bit positions 2 to 5 can simply be complemented, while the values in bit positions 0 and 1 are kept as they are.

Two's complementation therefore comes down to finding the conversion signals that are used to selectively complement some of the input bits. If the conversion signal at any position is "0" (the crosses in Figure 5), the value is kept as it is; if the conversion signal is "1" (the checks in Figure 5), the value is complemented. The conversion signals after the rightmost "1" are always 1; they are 0 otherwise. Once a lower-order bit has been detected to be a "1," the conversion signals for all the higher-order bits to the left of that bit position must be "1."

However, this search for the rightmost "1" could be as time consuming as rippling a carry through to the MSB, since the information from the lower bits must be transferred to the MSB. Therefore, we must find a method to expedite the detection of the rightmost "1."

As we will see, this search for the rightmost "1" can be achieved in logarithmic time using a binary search tree-like structure. We first find the conversion signals for 2-bit groups by grouping two consecutive bits of the input (the grouping always starts from the LSB), as shown in Figure 6(a). Then we find the conversion signals for 4-bit groups (each formed by two consecutive 2-bit groups), and then for 8-bit groups (each formed by two consecutive 4-bit groups). This divide-and-conquer approach is pursued until the whole input has been covered.

When two 2^n-bit groups are combined, the leftmost conversion signal of the right group carries the accumulated information about whether a "1" has appeared in any bit position of that group; if it is itself a "1," it must force all the conversion signals of the left group to "1." For instance, as shown in Figure 6(b), if CS1 (the leftmost conversion signal of the right group) is "1," the conversion signals of the left group (CS2 and CS3) are forced to "1," regardless of their previous values. If CS1 is "0," nothing happens to the conversion signals of the left group. This conditional control is shown with a dashed arrow. Likewise, CS5 may affect the conversion signals CS6 and CS7, and the same goes for CS3', which may affect the conversion signals CS7', CS6', CS5', and CS4'.

The inputs to the 2-bit groups are bits of the original binary number. However, the inputs to the next-level groups are the conversion signals of the previous level. For instance, the inputs to a 4-bit group are the conversion signals generated by two 2-bit groups. Therefore, from the second level (4-bit grouping) on, the conversion signals are scanned in order to find the rightmost "1." One possible implementation of our algorithm is shown in Figure 7(a); Figure 7(b) shows another version of the design using NAND, NOR, and inverter gates.

Once we have the complete conversion signals, they are shifted left by 1 bit and EXOR-ed with the input to create the two's complement of the input.

Figure 6. Finding Two's Complement Conversion Signals: (a) for two bits; (b) for eight bits

Figure 7. A Gate-Level Diagram of 8-bit Two's Complementation Logic Using Our Approach: (a) general model; (b) using NAND, NOR, and inverter gates

Figure 8. 8-bit Example of Two's Complementation Using Our Approach

A complete example of the two's complementation of 00101000 is shown in Figure 8. Our method is a logarithmic version of Hwang's linear method [11], while Hashemian [9] has focused on circuit optimization to improve its performance. Our approach is more general and adapts better to any word size.

3.2 Putting it all together

By applying the method we just described for two's complementation, the last partial product row (in Figure 3) is correctly generated without the last neg (neg3 in Figure 3). The multiplication now has a smaller critical path: one extra carry save adding stage is avoided, which both reduces the time needed to find the product and saves the hardware corresponding to that stage.

Forming a truly parallelogram-shaped partial product array after removing the last neg requires the following steps:

Step 1: Replace the last partial product row and neg3 in Figure 3 with the signals s9 ∼ s0, as shown in Figure 9.

Step 2: Replace the second-to-last partial product row as in Figure 9. (This is necessary because we now have newly replaced partial products for the last row, and the row above the last partial product row has to be restored to the original sign-extended format.)

Step 3: Finally, the MSB of the last row can be complemented (s9) and the "1" directly above it can be removed, as shown in Figure 10.

Figure 9. Replacing the Last Row and the Last neg with the Signals s9 ∼ s0

Figure 10. Proposed Partial Products After Removing the Last neg

As can be seen, the critical path column, which had n/2 + 1 elements (the 6th bit position of Figure 3; the (n - 2)nd bit position in general), now has only n/2 elements, as shown in Figure 10 (the neg3 is no longer there). This directly improves the speed of the multiplication.

The multiplier architecture used to generate the partial products is shown in Figure 11. The only difference between our architecture and conventional multiplier architectures is that, for the last partial product row, our architecture performs no partial product generation but rather partial product selection with a two's complement unit. The 3-5 decoder selects the correct value from 5 possible inputs (2×X, 1×X, 0, -1×X, -2×X), which come either from the two's complement logic or from the multiplicand itself, and feeds it into the row of 5-1 selectors. Unlike the other rows, which use a PPRG (Partial Product Row Generator), the two's complement logic does not have to wait for MBE to finish.

Two's complementation is performed in parallel with MBE and the 3-5 decoder.

Table 2. Normalized Gate Delay and Hardware Cost

Figure 11. Proposed Multiplier Architecture

Figure 12. Gate Level Diagram of MBE and Partial Product Signals Generation: (a) MBE signals (onei, twoi, negi); (b) ppij generation

4 Performance Evaluation and Hardware Requirements

The performance of our multiplier architecture clearly depends on the speed of the two's complementation step. If we can generate the last partial product row of our multiplier architecture within exactly the time in which the other partial product rows are generated, the performance will improve as predicted, because of the removal of the additional partial product row. Therefore, in this section, we evaluate the performance of our two's complement logic by comparing it to the delay of generating the other partial products. We then investigate the overall impact (in terms of speed and hardware) of using our multiplier architecture as compared to previous methods.

4.1 Critical Path

In order to estimate the delays to obtain the last partial product row (s9 ∼ s0 in Figure 10) using our selection method, as well as the others (ppij, i = 8..0, j = 2..0, in Figure 10) using the PPRG, we have used the concept of normalized gate delays. Although normalized gate delays vary with process technology and design, this approach has often been used [24, 21, 16, 20] to estimate the performance of logic units. There have been several reports on estimated normalized gate delays [24, 8, 11]. Gajski [8] has comprehensively examined these delays and the hardware costs for many different gates and combinations; we have therefore selected his analysis as a guideline for estimating performance. Table 2 shows the normalized gate delays derived from Gajski's analysis. They have been normalized to an inverter delay and scaled in nanoseconds (ns).

First, let us examine the delay of generating the other partial product rows using the PPRG. Figure 12 shows the gate-level diagram of the MBE signals (onei, twoi, negi) and of the partial product signals (ppij). According to the normalized gate delays in Table 2, it takes 11.2 ns to generate any partial product (ppij): 4.2 ns for MBE signal generation and 7.0 ns for the ppij generation.

The delay to select the value for the last row using our scheme comprises two major parts:

1. Two's complementation: the delay to find the two's complement of the multiplicand
2. Selection: the delay to select the correct value for the last partial product row

The delay for two's complementation varies with the size of the multiplicand. Therefore, we expanded the diagram of our proposed 8-bit two's complement logic in Figure 7(b) to larger sizes (16, 32, 64, and 128 bits) and measured the delay. The delays are shown in Table 3.

Table 3. Normalized Gate Delay to Find the Two's Complement of Binary Numbers of Various Sizes

After two's complementing the multiplicand, we select the correct partial product by using a 5-1 selector. Figure 13 shows the gate-level diagrams of the 3-5 decoders and of the 5-1 selectors used in the design of Figure 11. Figure 13(b) shows the normalized delay for a 5-1 selection to be a constant 3.6 ns.
Figure 13. Gate Level Diagrams of the 3-5 Decoder and the 5-1 Selector

Once all the inputs to the 5-1 selector are available, it is a mere matter of selecting the correct value. Note that we have not counted the delay of the 3-5 decoder together with the delay of MBE, because it is incurred in parallel with the delay of two's complementation (and is shorter), and is therefore masked by it.

Therefore, it takes 11.0 ns to find the two's complement of a 9-bit multiplicand and to select the correct partial products (7.4 ns for two's complementation + 3.6 ns for the selection process). It takes 12.4 ns for up to 17 bits (8.8 ns + 3.6 ns). Therefore, starting with 10 bits, our critical path for selecting the correct value of the last partial product row takes more time than the conventional generation of the partial products.

4.2 Performance Improvement

As expected, the performance of our multiplier hinges on the performance of the two's complementation. Multiplications of longer words (10 or more bits) would appear to suffer some performance degradation due to the now-dominating delay of the two's complementation.

However, this is not the case, because our approach generates one fewer partial product row. This additional partial product row would cause 4.2 ns (or up to 8.4 ns) of delay in the total execution (due to the EXOR delay(s), as shown earlier in Figure 4). If we use some of this slack to calculate the two's complement of longer words (16 or 32 bits), we end up spending a little more time to fully generate the partial products, but the overall time to produce the final product is still shortened, because of the additional EXOR delay(s) incurred when our method is not used.

We can also take advantage of the characteristics of current reduction tree modules to hide the delay of two's complementation. 4-2 compressors [16, 13, 18, 14], which are very popular in reduction tree schemes for multiplication, have different delay requirements for each input. Figure 14 [16] shows that one specific input ("In1") can arrive later than the others by as much as one EXOR delay without causing any logic error.

Figure 14. A 4-2 Compressor [16]

Therefore, we can exploit this EXOR-gate delay to our benefit when generating the two's complement of larger operands. Only the last partial product row of our method arrives late compared to the other rows, because for longer words two's complementation takes longer to complete. For words of up to 17 bits, it takes 12.4 ns to select the correct value for the last row with our method, whereas the other partial product generations take 11.2 ns. However, one input of the 4-2 compressor may arrive within a delay of 4.2 ns (the normalized gate delay of one EXOR gate), and the last partial product row is estimated to select its correct value only 1.2 ns after the other partial products are generated. Therefore, by efficiently exploiting the signal race of the 4-2 compressors, our method can be applied without any negative consequences.

Now let us evaluate the time needed to reduce the partial product rows to the final two. We have calculated the EXOR delays for reducing the partial products of variously sized multiplications using our architecture in Table 4. We have also measured how much delay the additional partial product row causes during the reduction. The difference between these two is the relative improvement that our multiplier architecture can yield compared to the reduction model of conventional multiplier architectures, such as those shown in Figure 4. The reduction scheme in Figure 4(b) shows that, because of the additional partial product row (due to the last neg), one more 3-2 compressor (a carry save adding stage, hence 2 EXOR-gate delays) is necessary. The reduction in Figure 4(d) also has one more EXOR-gate delay than our architecture.

As we can see from Table 4, we may gain up to 25% in performance for 16-bit multiplications when there is one fewer partial product row. We gain less performance as the word length increases. This is intuitive, because of the relatively diminishing weight of one partial product row in the total number of partial product rows. Our multiplier architecture is particularly efficient at executing the kernels of most multimedia applications, which typically use small operands (8 or 16 bits) [19].

Besides the speed advantage, our architecture has one other handy advantage: it is easily expandable to multipliers whose sizes are powers of 2. This is because our multipliers, by using 4-2 compressors, naturally accept partial product rows in numbers that are powers of 2 (4, 8, 16, 32, 64, and so on).
carry save adding stage required for the conventional meth-
Table 4. Delay For Reducing Partial Product Tree (EXOR- ods with one extra partial product row (for all cases in Fig-
Delay) ure 4(b), 4(c), 4(d)), we have to add this to the total hard-
ware requirement for the conventional methods. (For hard-
ware calculations, we assume one 4-2 compressor consumes
the same amount of hardware of two carry save adders.)
It turns out that our method always consumes less hard-
ware than that used in conventional methods using partial
product generation for the last partial product row (and there-
fore using one more carry save adding). As the word size
increases, the hardware saving decreases. However, the
point we are making here is that our architecture does not
Table 5. Hardware Cost (In Number of Transistors) For bring more hardware usage. On the contrary, indeed, the
Generating the Last Partial Product Row fact that having one more partial product row not only brings
performance degradation but also more hardware usage.
Multiplications Size 8 × 8 16 × 16 32 × 32 64 × 64 128 × 128
P Pij generation 234 442 858 1690 3354
Hardware to reduce one 506 874 1610 3082 6026 4.4 Related Work
extra partial product row
Total Transistors for 740 1316 2468 4772 9380
Conventional Methods Previous work [6, 16, 20, 24] has shown that the critical
path of the algorithm was the reduction of the partial prod-
5-1 Selector
3-5 Decoder
216
22
408
22
792
22
1560
22
3096
22
ucts from the (n − 2)nd to the nth bit positions because
Two’s Complementation 156 360 816 1824 4032
these bit positions have the highest number of partial prod-
Total Transistors for 394 790 1630 3406 7150 ucts. Ercegovac [5] has pointed out that the presence of
Our Method an extra partial product row was due to the last neg signal.
Huang [10] endeavored to overcome the presence of this
additional row by devising a 9-4 compressor whose goal it
schemes [10, 16, 13] (which have one extra partial prod- is to reduce 9 partial product rows within the same time that
uct row), in order to obtain better performance, complex two 4-2 compressors are reducing 8 partial product rows.
combinations of different reductions cells (using 4-2 com- This has the effect of masking the latency due to the re-
pressors, half adders, and full adders) and different wiring duction of one additional partial product row. However,
schemes for different sized data words are used. This fact this approach is only specific to his design of 9-4 reduc-
shows that our multiplier architecture is regular and that it tion (and his assumption about inputs being constants) and
can be easily modularized and therefore, our architecture is has no flexibility. On the other hand, our scheme just plugs
more efficient for multiplier module generations for multi- two’s complementation logic before selecting the last par-
pliers based on power of 2 (which are the natural sizes for the computing data words).

4.3 Hardware Costs

Counting the number of transistors required helps estimate the hardware requirements before going into an actual physical implementation. We have estimated the hardware requirements in terms of the number of transistors in Table 5. (The hardware cost comparison covers only the generation of the last partial product row, because the other partial products are generated in the same way.) We again used Gajski's analysis, shown in Table 2, as a guideline to estimate the hardware cost. We obtained Table 5 by calculating the number of gates for such units as full adders, partial product generators, and 5-1 selectors, and then extrapolating, since we know the number of such units for variously sized multiplications. The hardware required for two's complementation is also calculated by simply expanding the 8-bit two's complementation logic shown in Figure 7(b).

Each PPij consumes 26 transistors (trs), and there are n + 1 PPij in the last row. Since there is one additional partial product row, it is easier to implement and can be applied to multiplications of any bit length.

Again, all of this related work is based on some kind of efficient circuit optimization technique to reduce the n/2 + 1 partial product rows (the second step), whereas our algorithm concentrates on the first step, which consists in forming the partial product array. By having fewer partial product rows, the reduction tree can be smaller in size and faster in speed. Furthermore, since our algorithm targets forming the tree rather than reducing it, recent circuit optimization techniques [7, 16, 20, 24, 4, 12] for the reduction can easily be incorporated with our method to further improve performance.

5 Conclusions

In this paper, we have presented an algorithm to reduce the number of partial product rows generated during the first step of a multiplication algorithm from n/2 + 1 to n/2. By doing so, the structure of the partial product array becomes more regular and easier to implement. Even more importantly, the product is found faster. As we have shown, we can achieve this using less hardware.
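The row-count claim can be checked at word level. The sketch below is our illustration, not the authors' hardware: it recodes an n-bit two's-complement multiplier into radix-4 (modified Booth) digits, so every n-bit operand yields exactly n/2 digits, i.e., n/2 partial product rows; the conventional scheme must append one correction row (hence n/2 + 1) to handle negatively weighted digits.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement integer y into n/2 radix-4
    Booth digits, each in {-2, -1, 0, 1, 2}. n must be even."""
    y &= (1 << n) - 1              # work on the raw n-bit pattern
    digits, y_prev = [], 0         # y_{-1} = 0 by convention
    for i in range(0, n, 2):
        y_mid = (y >> i) & 1       # bit 2i
        y_hi = (y >> (i + 1)) & 1  # bit 2i+1 (sign of the digit group)
        digits.append(y_mid + y_prev - 2 * y_hi)
        y_prev = y_hi
    return digits                  # exactly n // 2 digits -> n/2 rows

def signed(v, n):
    """Interpret an n-bit pattern as a two's-complement value."""
    return v - (1 << n) if v & (1 << (n - 1)) else v

# Every 8-bit multiplier recodes into 4 digits, and the digits
# reproduce the signed value of the operand: sum(d_i * 4^i) == y.
for y in range(256):
    d = booth_radix4_digits(y, 8)
    assert len(d) == 4
    assert sum(di * 4**i for i, di in enumerate(d)) == signed(y, 8)
```

Each digit selects 0, ±X, or ±2X as one partial product row; the negative selections are where the conventional extra correction row originates.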
Proceedings of the EUROMICRO Systems on Digital System Design (DSD'04)
0-7695-2203-3/04 $20.00 © 2004 IEEE
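For the two's complementation step, the speedup comes from a well-known identity: negating a binary number leaves the trailing zeros and the least significant 1 unchanged and inverts every higher bit, so the carry chain of "invert and add 1" can be replaced by a parallel-prefix OR of logarithmic depth. The word-level sketch below is our reading of that identity, not necessarily the authors' exact circuit.

```python
def twos_complement(x, n):
    """n-bit two's complement of x via the invert-above-the-
    trailing-one rule; trailing-one detection is a log2(n)-step
    parallel-prefix OR instead of a ripple increment."""
    mask = (1 << n) - 1
    x &= mask
    p, shift = x, 1
    while shift < n:               # log2(n) doubling steps
        p |= (p << shift) & mask   # bit i of p = OR of bits 0..i of x
        shift <<= 1
    flip = (p << 1) & mask         # positions strictly above lowest set bit
    return (x ^ flip) & mask       # invert only those positions

# Matches the usual invert-and-add-one definition for all 8-bit values.
for x in range(256):
    assert twos_complement(x, 8) == (-x) & 0xFF
```

In hardware, the prefix OR is the part that costs O(log n) gate levels; the final step is a single layer of XORs.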
We have shown a detailed, step-by-step approach to prevent the occurrence of the additional row, and we have also shown a generalized way of achieving two's complementation in logarithmic time. We have also shown how our algorithm can exploit the characteristics of current 4-2 compressor designs in order to be efficiently implemented. Finally, we have shown that the proposed multiplication method is particularly efficient when executing the multiplications in the kernels of the most common multimedia applications, which are based on 8 to 16-bit operands. Our estimates point to up to a 25% performance improvement (for reducing the partial products) for 16×16 multiplications.

The proposed multiplier architecture has been modeled at the behavioral level in C. In order to completely validate this method, future work would be to implement it in VLSI and simulate it using appropriate circuit simulation tools, in order to accurately evaluate its performance and compare it with other commercial approaches.

Acknowledgements

This work is partly supported by the National Science Foundation (NSF) under Grants No. CCR-0234444 and INT-0223647.

References

[1] A. D. Booth. A Signed Binary Multiplication Technique. Quarterly J. Mechanical and Applied Math., 4:236–240, 1951.

[2] D. P. Agrawal and T. R. N. Rao. On Multiple Operand Addition of Signed Binary Numbers. IEEE Transactions on Computers, C-27:1068–1070, November 1978.

[3] L. Dadda. Some Schemes for Parallel Multipliers. Alta Frequenza, 34:349–356, 1965.

[4] F. Elguibaly. A Fast Parallel Multiplier-Accumulator Using the Modified Booth Algorithm. IEEE Transactions on Circuits and Systems, 47(9):902–908, 2000.

[5] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, Los Altos, CA, 2003.

[6] J. Fadavi-Ardekani. M x N Booth Encoded Multiplier Generator Using Optimized Wallace Trees. IEEE Transactions on Very Large Scale Integration, 1(2):120–125, 1993.

[7] A. Farooqui and V. Oklobdzija. General Data-Path Organization of a MAC Unit for VLSI Implementation of DSP Processors. In Proc. 1998 IEEE Int'l Symp. Circuits and Systems, volume 2, pages 260–263, 1998.

[8] D. Gajski. Principles of Digital Design. Prentice Hall, 1997.

[9] R. Hashemian and C. P. Chen. A New Parallel Technique for Design of Decrement/Increment and Two's Complement Circuits. In Proceedings of the 34th Midwest Symposium on Circuits and Systems, volume 2, pages 887–890, 1991.

[10] Z. Huang and M. Ercegovac. High-Performance Left-to-Right Array Multiplier Design. In Proceedings of the 16th Symposium on Computer Arithmetic, pages 4–11, June 2003.

[11] K. Hwang. Computer Arithmetic: Principles, Architecture and Design. Wiley, New York, 1979.

[12] N. Itoh, Y. Naemura, H. Makino, Y. Nakase, T. Yoshihara, and Y. Horiba. A 600-MHz 54x54-bit Multiplier with Rectangular-Styled Wallace Tree. IEEE Journal of Solid-State Circuits, 36(2):249–257, 2001.

[13] J.-Y. Kang, W.-H. Lee, and T.-D. Han. A Design of a Multiplier Module Generator Using 4-2 Compressors. In Proceedings of KITE Fall Conference, Korea, volume 16, pages 388–392, 1993.

[14] M. Nagamatsu, S. Tanaka, J. Mori, T. Noguchi, and K. Hatanaka. A 15ns 32x32-bit CMOS Multiplier with an Improved Parallel Structure. In Digest of Technical Papers, IEEE Custom Integrated Circuits Conference, 1989.

[15] O. L. MacSorley. High-Speed Arithmetic in Binary Computers. Proc. IRE, 1961.

[16] V. G. Oklobdzija, D. Villeger, and S. S. Liu. A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach. IEEE Transactions on Computers, 45(3):294–306, 1996.

[17] D. A. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, CA, 1996.

[18] M. R. Santoro and M. Horowitz. SPIM: A Pipelined 64x64-bit Iterative Multiplier. IEEE Journal of Solid-State Circuits, 24(2):487–493, 1989.

[19] N. Slingerland and A. J. Smith. Measuring the Performance of Multimedia Instruction Sets. IEEE Transactions on Computers, 51(11):1317–1332, 2002.

[20] P. F. Stelling, C. U. Martel, V. G. Oklobdzija, and R. Ravi. Optimal Circuits for Parallel Multipliers. IEEE Transactions on Computers, 47(3):273–285, 1998.

[21] D. Villeger and V. Oklobdzija. Analysis of Booth Encoding Efficiency in Parallel Multipliers Using Compressors for Reduction of Partial Products. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, volume 1, pages 781–784, 1993.

[22] C. S. Wallace. A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers, EC-13(1):14–17, 1964.

[23] A. Weinberger. 4:2 Carry-Save Adder Module. IBM Technical Disclosure Bulletin, 23, 1981.

[24] W.-C. Yeh and C.-W. Jen. High-Speed Booth Encoded Parallel Multiplier Design. IEEE Transactions on Computers, 49(7):692–701, 2000.