Professional Documents
Culture Documents
San To SH
San To SH
The main focus of recent multiplier papers [7, 16, 20, 24,
4, 12] has been on rapidly reducing the partial product rows
Abstract
by using some kind of circuit optimization and identifying
the critical paths and signal races. In other words, the goals
The performance of multiplication is crucial for multime- have been to optimize the second step of the multiplication
dia applications such as 3D graphics and signal process- described above.
ing systems which depend on extensive numbers of multi- However, in this paper, we will concentrate on the first
plications. Previously reported multiplication algorithms step which consists in forming the partial product array and
mainly focus on rapidly reducing the partial products rows we will strive to design a multiplication algorithm which
down to final sums and carries used for the final accumu- will produce fewer partial product rows. By having fewer
lation. These techniques mostly rely on circuit optimization partial product rows, the reduction tree can be smaller in
and minimization of the critical paths. size and faster in speed. It should also be noted that 8 or
16-bit words are the most commonly used word sizes in
In this paper, an algorithm to achieve fast multiplica-
the kernels of most multimedia applications [19] and that
tion in two’s complement representation is presented. In-
the implementation of our overall algorithm is particularly
deed, our approach focuses on reducing the number of par-
well suited to such word sizes.
tial product rows. In turn, this directly influences the speed
of the multiplication, even before applying partial products In the next section, the conventional multiplication meth-
reduction techniques. Fewer partial products rows are pro- od is described in detail with an emphasis on its weak-
duced, thereby lowering the overall operation time. This nesses. In section 3, a step-by-step procedure to prevent
results in a true diamond-shape for the partial product tree the adverse effects of some conventional multiplication al-
which is more efficient in terms of implementation. gorithms is presented. In section 4, the effectiveness and
usage of our method is presented by showing a detailed
evaluation.
1 Introduction
2 Multiplication Algorithms
The performance of 3D graphics and signal processing sys- There is no doubt that MBE is efficient when it comes to
tems strongly depends on the performance of multiplica- reducing the partial products. However, it is important to
tions because these applications need to support highly mul- note that there are two unavoidable consequences when us-
tiplication intensive operations. Therefore, there has been ing MBE: sign extension prevention and negative encoding.
much work on advanced multiplication algorithms and de- The combination of these two unavoidable consequences
signs [1, 22, 3, 23, 18, 14, 13, 6, 7, 16, 20, 24, 4, 12]. results in the formation of one additional partial product
There are three major steps to any multiplication. In the row and of course, this additional partial product row re-
first step, the partial products are generated. In the second quires more hardware but more importantly time (to add
step, the partial products are reduced to one row of final this one more row of partial products).
sums and one row of carries. In the third step, the final sums
and carries are added to generate the result. Most of the 2.1 Modified Booth Encoding and the Overhead of Neg-
above mentioned approaches employ the Modified Booth ative Encodings
Encoding (MBE) approach [6, 7, 13, 24, 4] for the first step
because of its ability to cut the number of partial products
rows in half. They then select some variation of any one This grouping of the multiplier bits of MBE is shown in
of partial products reduction schemes such as the Wallace Figure 1 and it is based on a window size of 3 bits and a
trees [22, 6] or the compressor trees [16, 13, 18, 14] in the stride of 2. The multiplier (Y) is segmented into groups
second step to rapidly reduce the number of partial product of three bits (y2i+1 , y2i , y2i−1 ) and each such group of bits
rows to the final two (sums and carries). In the third step, is associated with its own partial product row by using Ta-
they use some kind of advanced adder approach such as ble 1 [15]. In this grouping, y−1 = 0.
carry-lookahead or carry-select adders [5, 17, 11] to add By applying this encoding, the number of partial prod-
the final two rows, resulting in the final product. uct rows to be accumulated is reduced from n to n2 . For
[
[
[ 2.3 One Additional Partial Product Row
[
[
[
[
[ However, there is still the problem of having the last neg
forming one additional partial product row (neg3 in Fig-
ure 3) and this causes another carry save adder delay in
order to generate the sums and carries before the final ac-
example, for an 8 × 8 multiplication, a multiplier with- cumulation. This is because in any case, one more partial
out MBE will generate eight partial product rows (because product means one additional 3-2 reduction. For example,
there is one partial product row for each bit of the multi- Figure 4(a) shows a 8-input reduction (16 bit × 16 bit mul-
plier). However, with MBE, only n2 (= 4) partial products tiplication for our architecture) using 4-2 compressors. If
rows are generated as shown in the example of Figure 2. we have to reduce 9 inputs (16 bit × 16 bit multiplication
However, there are actually n2 + 1 partial product rows for the conventional architecture), one additional carry save
rather than n2 , because of the last neg signal (neg3 in Fig- adder is required as in Figure 4(b). It is known that one 4-2
ure 2). The neg signals (neg0 , neg1 , neg2 , and neg3 ) compressor costs 3 EXOR delay and one carry save adder
are needed because MBE may generate a negative encod- consumes 2 EXOR delay [16, 10]. Therefore, Figure 4(a)
ing ((-1) times the multiplicand or (-2) times the multi- causes 6 EXOR delay whereas the model in Figure 4(b)
plicand). Consequently, one additional carry save adding will cause 8 EXOR delay. If we design the multiplier using
stage is needed to perform the reduction. This is the over- wallace tree instead of 4-2 compressors (because 9 input
head of implementing the negative encodings of MBE. reduction fit better in a wallace tree scheme), as shown in
Figure 4(c), it still takes 8 EXOR delay. If we arrange the
signals as in Figure 4(d) so that signals are grouped intelli-
2.2 Sign Extension and its Prevention gently to save some delay (in a carry save adder, the Sum
arrives after 2 EXOR delay and the carry arrives after ap-
proximately 1 EXOR delay), the delay becomes 7 EXOR
In signed multiplication, the sign bit of a partial product delay. As can be seen from the multiplier models in Fig-
row would have to be extended all the way to the MSB po- ure 4, having one more partial product row adds at least one
sition which would require the sign bit to drive that many more EXOR delay to the time to reduce the partial products.
output loads (each bit position until the MSB should have
the same value as the sign). This makes the partial prod- This fact that one additional partial product row brings
uct rows unequal in length as shown in Figure 2: the first delay is even more critical for multiplications of smaller
row spans 16 bits (pp00 to the leftmost pp80 ), the second words (8 ∼ 16) than with longer operands because of the
row 14 bits (pp01 to the leftmost pp81 ), the third row 12 relatively higher delay effect that this additional row brings.
bits (pp02 to the leftmost pp82 ), and the fourth row 10 bits There is also extra hardware cost since one more carry sav-
(pp03 to the leftmost pp83 ). Ercegovac [5] has shown sign ing adder stage hardware is necessary. A more quantitative
extension prevention method of Agrawal [2] and arrives a evaluation of this effect will be provided in Section 4.
3 Preventing the Additional Par-
tial Product Row
$3DUWLDO3URGXFW5RZ
$3DUWLDO3URGXFW
$3DUWLDO3URGXFW&ROXPQ
E : F ? G H E F B ? B D G ; : > ?
E : F ? G H E F B ? B D G ; : > ?
E : F ? G H E F B ? B D G ; : > ?
;
;
;
;
;
;
;
;
) )
! " # % & ' & ( ' & " # ! ! * + ( #
)
) )
& , - % & ' & ( ' & " # ! ! * + ( #
" # # $ % & "
) )
2 3 ! " - % & ' & ( ' & " # ! ! * + ( #
;
a;
Two Bits
1 # 2 * 3 5 1 2 - * - / 3 $ ( /
Figure 9. Replacing the last row and the Last neg with
! # $ & ( * + - / 1 # ( *
1 # 2 * 3 5 1 2 - * - / 3 $ ( /
125
8 ! # $ & ( * + - / 1 # ( *
signals s9 ∼ s0
1 # 2 * 3 5 1 2 - * - / 3 $ ( /
(a) General Model (b) Using NAND, NOR and In- newly replaced partial products for the last row and
verter
the row above the last partial product row has to be
restored back to the original sign extended format.)
Figure 7. A Gate-Level Diagram of 8-bit Two’s Comple- Step 3 : Finally, the MSB of the last row can be comple-
mentation Logic Using Our Approach mented (s9 ) and the “1” directly above it can be re-
moved as shown in Figure 10.
ray after removing the last neg requires undergoing the fol-
lowing steps:
Step 1 : Replace the last partial product row and neg3 in Figure 10. Proposed Partial Products After Removing the
Figure 3 with signals s9 ∼ s0 as shown in Figure 9. last neg
Step 2 : Replace the second to the last partial product row
as in Figure 9. (This is so because we now have a
\L
RQHL
\L
\L
\L
\L
WZRL
RQHL
[M
\L WZRL
QHJL
[M
\L
SSLM
\L
QHJL
4 Performance Evaluation and Hard- Figure 12. Gate Level Diagram of MBE and Partial Prod-
ware Requirements uct Signals Generation
4.1 Critical Path The delay for two’s complementation varies with the
size of the multiplicand. Therefore, we expanded the dia-
gram of our proposed 8-bit two’s complement logic in Fig-
In order to estimate the delays to obtain the last partial prod- ure 7(b) to larger sizes (such as 16-, 32-, 64-, and 128-bits)
uct row (s9 ∼ s0 in Figure 10) using our selection method, and measured the delay. The delays are shown in Table 3.
as well as the others (ppij i=8..0, j=2..0 in Figure 10) using After two’s complementing the multiplicand, we select
the PPRG, we have used the concept of normalized gate the correct partial product by using a 5-1 selector. Figure 13
delays. Although normalized gate delays vary with pro- shows the gate-level diagram of the 3-5 decoders and of the
cess technology and designs, this approach has been often 5-1 selectors used in the design of Figure 11. Figure 13(b)
used [24, 21, 16, 20] to estimate the performance of logic shows the normalized delay for a 5-1 selection to be a con-
units. There have been several reports on estimated normal-
ized gate delays [24, 8, 11]. We observe that Gajski [8] has
comprehensively examined these delays and the hardware Table 3. Normalized Gate Delay to Find Two’s Comple-
cost for many different gates and combinations. Therefore, ment of Binary Number of Various Sizes
we have selected his analysis as a guideline to estimate per-
formance. Table 2 shows the normalized gate delays de-
! ! !
!
125
125
125
125
(a) Gate Level Diagram of 3-5 Decoder (b) Gate Level Diagram of 5-1 Se-
lector
allowed to arrive later than the others by as much as one
EXOR-delay without causing any logic error.
Figure 13. Gate Level Diagram of 3-5 Coder and 5-1 Se-
lector Therefore, we could exploit to our benefit this EXOR-
gate delay when generating the two’s complement for larger
operands. Only the last partial product row using our method
arrive late compared to the other rows because for longer
stant 3.6 ns. Once all the inputs to the 5-1 selector are words, two’s complementation takes a longer time to com-
available, it is a mere matter of selecting the correct value. plete. For words up to 17 bits long, it takes 12.4 ns to se-
Note that we have not considered the delay for a 3-5 de- lect the correct value for the last row if our method is used.
coder together with the delay for MBE, because it is in- Other partial product generations took 11.2 ns. However,
curred in parallel with the delay of two’s complementation one input of the 4-2 compressor may arrive within a delay
(and is faster than it), and therefore masked by it. of 4.2 ns (which is the normalized gate delay of one EXOR
gate) and the last partial product row is estimated to select
Therefore, it takes 11.0 ns to find the two’s complement the correct value 1.2 ns after the other partial products are
of a 9-bit multiplicand and to select the correct partial prod- generated. Therefore, by efficiently exploiting the signal
ucts (7.4 ns for two’s complementation + 3.6 ns for the se- race of the 4-2 compressors, our method can be applied
lection process). It takes 12.4 ns for up to 17 bits (8.8 ns without any negative consequences.
+ 3.6 ns). Therefore, starting with 10 bits, our critical path
in the selection of the correct value for the last partial prod- Now, let us evaluate the time to reduce the partial prod-
uct row takes more time than the conventional generation uct rows into final two. We have calculated the EXOR-
of the partial products. delay for reducing the partial products for variously sized
multiplications using our architecture in Table 4. We have
also measured how much delay the additional partial prod-
4.2 Performance Improvement uct row causes when reducing partial products. The differ-
ence between these two is the relative improvement which
our multiplier architecture can yield compared to the reduc-
As expected, the performance of our multiplier hinges on
tion model of conventional multiplier architectures such as
the performance of the two’s complementation. Multiplica-
shown in Figure 4. The reduction scheme in Figure 4(b)
tions of longer words (10 or more bits) would suffer from
shows that because of the additional partial product row
some performance degradation due to the now dominating
(because of the last neg), one 3-2 compressor (Carry Save
delay of the two’s complementation.
Adding, therefore 2 EXOR-gate delay) is necessary. The
However, this is not the case because our approach gen- reduction in Figure 4(d) also has one more EXOR-gate de-
erates one fewer partial product row. This one additional lay compared to our architecture.
partial product row may cause 4.2 ns (or up to 8.4 ns)
As we can see from Table 4, we may gain up to 25% in
worth of delay for the total execution (due to the EXOR-
performance when there is one less partial product row for
dealy(s) as shown earlier in Figure 4). If we use some of
16-bit multiplications. We gain less performance with in-
this extra delay to calculate the two’s complement of longer
creases in word length. This is intuitive because of the rel-
words (16 or 32 bit long), we end up spending a little more
atively diminishing importance of one partial product row
time to fully generate the partial products. However, the
over the total number of partial product rows. Our multi-
overall processing time to produce the final product will be
plier architecture is particularly efficient at executing the
shortened because of the additional EXOR-delay(s) which
kernels of most multimedia applications which typically
takes place when our method is not used.
use smaller operands (8 or 16 bits) [19].
We can also take advantage of the characteristics of cur-
Besides the speed advantage, our architecture has one
rent reduction tree modules to hide the delay of two’s com-
other handy advantage. Our architecture is easily expland-
plementation. 4-2 Compressors [16, 13, 18, 14], which
able for multipliers that are sizes of power of 2. This is so
are very popular in reduction tree schemes for multiplica-
because our multipliers by using 4-2 compressors can natu-
tions, have different delay requirements for each input. Fig-
rally receive partial product rows in numbers of power of 2
ure 14 [16] shows that one specific input (“In 1”) could be
(4, 8, 16, 32, 64, and so on). However, in the conventional