(Alan v. Oppenheim, Ronald W. Schafer, John R. Buc

You might also like

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 94

Part III

Multiplication
Parts Chapters
1. Numbers and Arithmetic
2. Representing Signed Numbers
I. Number Representation 3. Redundant Number Systems
4. Residue Number Systems

5. Basic Addition and Counting


6. Carry-Look ahead Adders

Elementary Operations
II. Addition / Subtraction 7. Variations in Fast Adders
8. Multioperand Addition

9. Basic Multiplication Schemes


10. High-Radix Multipliers
III. Multiplication 11. Tree and Array Multipliers
12. Variations in Multipliers

13. Basic Division Schemes


14. High-Radix Dividers
IV. Division 15. Variations in Dividers
16. Division by Convergence

17. Floating-Point Reperesentations


18. Floating-Point Operations
V. Real Arithmetic 19. Errors and Error Control
20. Precise and Certifiable Arithmetic

21. Square-Rooting Methods


22. The CORDIC Algorithms
VI. Function Evaluation 23. Variations in Function Evaluation
24. Arithmetic by Table Lookup

25. High-Throughput Arithmetic


26. Low-Power Arithmetic
VII. Implementation Topics 27. Fault-Tolerant Arithmetic
28. Past, Present, and
Reconfigurable Future
Arithmetic

Appendix: Past, Present, and Future

Apr. 2015 Computer Arithmetic, Multiplication Slide 1


About This Presentation
This presentation is intended to support the use of the textbook
Computer Arithmetic: Algorithms and Hardware Designs (Oxford
U. Press, 2nd ed., 2010, ISBN 978-0-19-532848-6). It is updated
regularly by the author as part of his teaching of the graduate
course ECE 252B, Computer Arithmetic, at the University of
California, Santa Barbara. Instructors can use these slides freely
in classroom teaching and for other educational purposes.
Unauthorized uses are strictly prohibited. © Behrooz Parhami

Edition Released Revised Revised Revised Revised


First Jan. 2000 Sep. 2001 Sep. 2003 Oct. 2005 May 2007
Apr. 2008 Apr. 2009
Second Apr. 2010 Apr. 2011 Apr. 2012 Apr. 2015

Apr. 2015 Computer Arithmetic, Multiplication Slide 2


III Multiplication
Review multiplication schemes and various speedup methods
• Multiplication is heavily used (in arith & array indexing)
• Division = reciprocation + multiplication
• Multiplication speedup: high-radix, tree, recursive
• Bit-serial, modular, and array multipliers

Topics in This Part


Chapter 9 Basic Multiplication Schemes
Chapter 10 High-Radix Multipliers
Chapter 11 Tree and Array Multipliers
Chapter 12 Variations in Multipliers

Apr. 2015 Computer Arithmetic, Multiplication Slide 3


“Well, well, for a rabbit, you’re not
very good at multiplying, are you?”

Apr. 2015 Computer Arithmetic, Multiplication Slide 4


9 Basic Multiplication Schemes

Chapter Goals
Study shift/add or bit-at-a-time multipliers
and set the stage for faster methods and
variations to be covered in Chapters 10-12

Chapter Highlights
Multiplication = multioperand addition
Hardware, firmware, software algorithms
Multiplying 2’s-complement numbers
The special case of one constant operand

Apr. 2015 Computer Arithmetic, Multiplication Slide 5


Basic Multiplication Schemes: Topics

Topics in This Chapter


9.1 Shift/Add Multiplication Algorithms

9.2 Programmed Multiplication

9.3 Basic Hardware Multipliers

9.4 Multiplication of Signed Numbers

9.5 Multiplication by Constants

9.6 Preview of Fast Multipliers

Apr. 2015 Computer Arithmetic, Multiplication Slide 6


9.1 Shift/Add Multiplication Algorithms
Notation for our discussion of multiplication algorithms:
a Multiplicand ak–1ak–2 . . . a1a0
x Multiplier xk–1xk–2 . . . x1x0
p Product (a  x) p2k–1p2k–2 . . . p3 p2 p1 p0
Initially, we assume unsigned operands

a Multiplicand
 x Multiplier
x 0 a 20
Partial
x 1 a 21
products
x 2 a 22 bit-matrix
x 3 a 23
p Product
Fig. 9.1 Multiplication of two 4-bit unsigned binary numbers in dot notation.
Apr. 2015 Computer Arithmetic, Multiplication Slide 7
Multiplication Recurrence
a Multiplicand
Fig. 9.1  x Multiplier
x 0 a 20
Partial
x 1 a 21
products
x 2 a 22 bit-matrix
x 3 a 23
p Product Preferred

Multiplication with right shifts: top-to-bottom accumulation


p(j+1) = (p(j) + xj a 2k) 2–1 with p(0) = 0 and
|–––add–––| p(k) = p = ax + p(0)2–k
|––shift right––|

Multiplication with left shifts: bottom-to-top accumulation


p(j+1) = 2 p(j) + xk–j–1a with p(0) = 0 and
|shift| p(k) = p = ax + p(0)2k
|––––add––––|
Apr. 2015 Computer Arithmetic, Multiplication Slide 8
Examples of Basic Multiplication
Right-shift algorithm Left-shift algorithm
======================== ======================= Fig. 9.2
a
x
1 0 1 0 1 0 1 0
1 0 1 1
a
x
1 0 1 0
1 0 1 1
Examples
======================== ======================= of
p(0) 0 0 0 0 p(0) 0 0 0 0 sequential
+x0a 1 0 1 0 2p (0) 0 0 0 0 0
––––––––––––––––––––––––– +x3a 1 0 1 0 multipli-
2p(1) 0 1 0 1 0 –––––––––––––––––––––––– cation with
p (1) 0 1 0 1 0 p(1) 0 1 0 1 0
(1)
right and
+x1a 1 0 1 0 2p 0 1 0 1 0 0
––––––––––––––––––––––––– +x2a 0 0 0 0 left shifts.
2p(2) 0 1 1 1 1 0 ––––––––––––––––––––––––
p (2) 0 1 1 1 1 0 p(2) 0 1 0 1 0 0
+x2a 0 0 0 0 2p (2)
p (j+1) = (p(j) 0+ 1
x 0 2k1) 20–10 0
a
––––––––––––––––––––––––– +x1a j 1 0 1 0
2p(3) 0 0 1 1 1 1 0 |–––add–––|
–––––––––––––––––––––––– Check:
p(3) 0 0 1 1 1 1 0 p(3) |––shift
0 1 right––|
1 0 0 1 0 10  11
+x3a 1 0 1 0 2p (3) 0 1 1 0 0 1 0 0 = 110
––––––––––––––––––––––––– +x0a 1 0 1 0
2p(4) 0 1 1 0 1 1 1 0 –––––––––––––––––––––––– = 64 + 32 +
p (4) 0 1 1 0 1 1 1 0 p(4) 0 1 1 0 1 1 1 0 8+4+2
======================== =======================
Apr. 2015 Computer Arithmetic, Multiplication Slide 9
Examples of Basic Multiplication (Continued)
Right-shift algorithm Left-shift algorithm
======================== ======================= Fig. 9.2
a 1 0 1 0 a 1 0 1 0 Examples
x 1 0 1 1 x 1 0 1 1
======================== ======================= of
p(0) 0 0 0 0 p(0) 0 0 0 0 sequential
+x0a 1 0 1 0 2p (0) 0 0 0 0 0
––––––––––––––––––––––––– +x3a 1 0 1 0 multipli-
2p(1) 0 1 0 1 0 –––––––––––––––––––––––– cation with
p (1) 0 1 0 1 0 p(1) 0 1 0 1 0
(1)
right and
+x1a 1 0 1 0 2p 0 1 0 1 0 0
––––––––––––––––––––––––– +x2a 0 0 0 0 left shifts.
2p(2) 0 1 1 1 1 0 ––––––––––––––––––––––––
p (2) 0 1 1 1 1 0 p(2) 0 1 0 1 0 0
+x2ap(j+1) =0 20p(j)
0 0+ xk–j–1a 2p(2) 0 1 0 1 0 0 0
–––––––––––––––––––––––––
|shift| +x1a 1 0 1 0
2p(3) 0 0 1 1 1 1 0 –––––––––––––––––––––––– Check:
p(3) 0|––––add––––|
0 1 1 1 1 0 p(3) 0 1 1 0 0 1 0 10  11
+x3a 1 0 1 0 2p (3) 0 1 1 0 0 1 0 0 = 110
––––––––––––––––––––––––– +x0a 1 0 1 0
2p(4) 0 1 1 0 1 1 1 0 –––––––––––––––––––––––– = 64 + 32 +
p (4) 0 1 1 0 1 1 1 0 p(4) 0 1 1 0 1 1 1 0 8+4+2
======================== =======================
Apr. 2015 Computer Arithmetic, Multiplication Slide 10
9.2 Programmed Multiplication
{Using right shifts, multiply unsigned m_cand and m_ier,
storing the resultant 2k-bit product in p_high and p_low.
Registers: R0 holds 0 Rc for counter
Ra for m_cand Rx for m_ier
R0 0 Rc Counter
Rp for p_high Rq for p_low} Ra Multiplicand Rx Multiplier
{Load operands into registers Ra and Rx} Rp Product, high Rq Product, low
mult: load Ra with m_cand
load Rx with m_ier
{Initialize partial product and counter}
copy R0 into Rp
copy R0 into Rq
load k into Rc
{Begin multiplication loop}
m_loop: shift Rx right 1 {LSB moves to carry flag}
branch no_add if carry = 0
add Ra to Rp {carry flag is set to cout}
no_add: rotate Rp right 1 {carry to MSB, LSB to carry}
rotate Rq right 1 {carry to MSB, LSB to carry}
decr Rc {decrement counter by 1}
branch m_loop if Rc  0
{Store the product} Fig. 9.3 Programmed
store Rp into p_high
store Rq into p_low
multiplication (right-shift
m_done: ... algorithm).
Apr. 2015 Computer Arithmetic, Multiplication Slide 11
Time Complexity of Programmed Multiplication

Assume k-bit words

k iterations of the main loop


6-7 instructions per iteration, depending on the multiplier bit

Thus, 6k + 3 to 7k + 3 machine instructions,


ignoring operand loads and result store

k = 32 implies 200+ instructions on average

This is too slow for many modern applications!

Microprogrammed multiply would be somewhat better

Apr. 2015 Computer Arithmetic, Multiplication Slide 12


9.3 Basic Hardware Multipliers
Shift

Multiplier x

Doublewidth partial product p (j)

Shift

Multiplicand a
0
k xj
0 1
Mux
xj a k p(j+1) = (p(j) + xj a 2k) 2–1
Adder |–––add–––|
cout |––shift right––|
k

Fig. 9.4 Hardware realization of the sequential multiplication


algorithm with additions and right shifts.
Apr. 2015 Computer Arithmetic, Multiplication Slide 13
Example of Hardware Multiplication

1 1
0 0
1 1
0 (11)ten

1 1
0 0 0
1 1
0 1
0 1
0 0
1 0 (110)ten

1 0 1 0 (10)ten

p(j+1) = (p(j) + xj a 2k) 2–1


|–––add–––|
|––shift right––|

Fig. 9.4a Hardware realization of the sequential multiplication


algorithm with additions and right shifts.
Apr. 2015 Computer Arithmetic, Multiplication Slide 14
Performing Add and Shift in One Clock Cycle

Adder’s Adder’s sum


Unused
carry-out
part of the
k k–1 multiplier x

Partial product p(j)

k k–1
To adder To mux control

Fig. 9.5 Combining the loading and shifting of the


double-width register holding the partial product and
the partially used multiplier.

Apr. 2015 Computer Arithmetic, Multiplication Slide 15


Sequential Multiplication with Left Shifts
Shift

Multiplier x

Doublewidth partial product p (j)

Shift

xk-j-1 2k Multiplicand a
0
0
Mux 1
xk-j-1a k

2k-bit adder
cout
2k

Fig. 9.4b Hardware realization of the sequential multiplication


algorithm with left shifts and additions.
Apr. 2015 Computer Arithmetic, Multiplication Slide 16
============================
a 1 0 1 1 0
9.4 Multiplication of x 0 1 0 1 1
============================
Signed Numbers p(0) 0 0 0 0 0 Check:
+x0a 1 0 1 1 0 –10  11
––––––––––––––––––––––––––––––
2p(1) 1 1 0 1 1 0 = 110
Fig. 9.6 Sequential p(1) 1 1 0 1 1 0 = –512 +
multiplication of +x1a 1 0 1 1 0 256 +
2’s-complement –––––––––––––––––––––––––––––
2p(2) 1 1 0 0 0 1 0 128 +
numbers with right p(2) 1 1 0 0 0 1 0 16 + 2
shifts (positive +x2a 0 0 0 0 0
multiplier). –––––––––––––––––––––––––––––
2p(3) 1 1 1 0 0 0 1 0
p (3) 1 1 1 0 0 0 1 0
Negative multiplicand, +x3a 1 0 1 1 0
–––––––––––––––––––––––––––––
positive multiplier: 2p(4) 1 1 0 0 1 0 0 1 0
p(4) 1 1 0 0 1 0 0 1 0
No change, other than +x4a 0 0 0 0 0
looking out for proper –––––––––––––––––––––––––––––
sign extension 2p(5) 1 1 1 0 0 1 0 0 1 0
p (5) 1 1 1 0 0 1 0 0 1 0
============================
Apr. 2015 Computer Arithmetic, Multiplication Slide 17
============================
a 1 0 1 1 0
The Case of a x 1 0 1 0 1
============================
Negative Multiplier p(0) 0 0 0 0 0 Check:
+x0a 1 0 1 1 0 –10  –11
–––––––––––––––––––––––––––––
2p(1) 1 1 0 1 1 0 = 110
Fig. 9.7 Sequential p(1) 1 1 0 1 1 0 = 64 + 32 +
multiplication of +x1a 0 0 0 0 0 8+4+2
2’s-complement –––––––––––––––––––––––––––––
2p(2) 1 1 1 0 1 1 0
numbers with right p (2) 1 1 1 0 1 1 0
shifts (negative +x2a 1 0 1 1 0
multiplier). –––––––––––––––––––––––––––––
2p(3) 1 1 0 0 1 1 1 0
p (3) 1 1 0 0 1 1 1 0
Negative multiplicand, +x3a 0 0 0 0 0
–––––––––––––––––––––––––––––
negative multiplier: 2p(4) 1 1 1 0 0 1 1 1 0
p(4) 1 1 1 0 0 1 1 1 0
In last step (the sign bit), +(-x4a) 0 1 0 1 0
subtract rather than add –––––––––––––––––––––––––––––
2p(5) 0 0 0 1 1 0 1 1 1 0
p (5) 0 0 0 1 1 0 1 1 1 0
============================
Apr. 2015 Computer Arithmetic, Multiplication Slide 18
Signed 2’s-Complement Hardware Multiplier
Multiplier
Partial product
k+1

Multiplicand

1 0 Enable
Mux
Select
k+1

cout Adder cin


0, except in
k+1 last cycle
Fig. 9.8 The 2’s-complement sequential hardware multiplier.

Apr. 2015 Computer Arithmetic, Multiplication Slide 19


Booth’s Recoding
Table 9.1 Radix-2 Booth’s recoding
–––––––––––––––––––––––––––––––––––––
xi xi–1 yi Explanation
–––––––––––––––––––––––––––––––––––––
0 0 0 No string of 1s in sight
0 1 1 End of string of 1s in x
1 0 -1 Beginning of string of 1s in x
1 1 0 Continuation of string of 1s in x
–––––––––––––––––––––––––––––––––––––
Example
1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0 Operand x
(1) -1 0 1 0 0 -1 1 0 -1 1 -1 1 0 0 -1 0 Recoded version y

Justification
2j + 2j–1 + . . . + 2i+1 + 2i = 2j+1 – 2i

Apr. 2015 Computer Arithmetic, Multiplication Slide 20


============================
a 1 0 1 1 0
Example Multiplication x
y -
1 0 1 0 1 Multiplier
1 1 -1 1 -1 Booth-recoded
with Booth’s Recoding ============================
p(0) 0 0 0 0 0 Check:
+y0a 0 1 0 1 0 –10  –11
–––––––––––––––––––––––––––––
Fig. 9.9 Sequential 2p(1) 0 0 1 0 1 0 = 110
multiplication of p(1) 0 0 1 0 1 0 = 64 + 32 +
2’s-complement +y1a 1 0 1 1 0 8+4+2
–––––––––––––––––––––––––––––
numbers with right 2p(2) 1 1 1 0 1 1 0
shifts by means of p (2) 1 1 1 0 1 1 0
Booth’s recoding. +y2a 0 1 0 1 0
–––––––––––––––––––––––––––––
–––––––––– 2p(3)
(3)
0 0 0 1 1 1 1 0
xi xi–1 yi p 0 0 0 1 1 1 1 0
+y3a 1 0 1 1 0
–––––––––– –––––––––––––––––––––––––––––
0 0 0 2p(4) 1 1 1 0 0 1 1 1 0
p (4) 1 1 1 0 0 1 1 1 0
0 1 1
-1
y4a 0 1 0 1 0
1 0 –––––––––––––––––––––––––––––
1 1 0 2p(5) 0 0 0 1 1 0 1 1 1 0
p (5) 0 0 0 1 1 0 1 1 1 0
––––––––––
============================
Apr. 2015 Computer Arithmetic, Multiplication Slide 21
9.5 Multiplication by Constants
0 1 2 . . . n–1
Explicit, e.g. y := 12 * x + 1 0
1
Implicit, e.g. A[i, j] := A[i, j] + B[i, j] 2
.
Address of A[i, j] = base + n * i + j . Row i
.
m–1

Column j
Software aspects:
Optimizing compilers replace multiplications by shifts/adds/subs
Produce efficient code using as few registers as possible
Find the best code by a time/space-efficient algorithm

Hardware aspects:
Synthesize special-purpose units such as filters
y[t] = a0x[t] + a1x[t – 1] + a2x[t – 2] + b1y[t – 1] + b2y[t – 2]

Apr. 2015 Computer Arithmetic, Multiplication Slide 22


Multiplication Using Binary Expansion

Example: Multiply R1 by the constant 113 = (1 1 1 0 0 0 1)two


R2  R1 shift-left 1
R3  R2 + R1 Shift, add Shift
R6  R3 shift-left 1
R7  R6 + R1 Ri: Register that contains i times (R1)
R112  R7 shift-left 4 This notation is for clarity; only one
R113  R112 + R1 register other than R1 is needed

Shorter sequence using shift-and-add instructions


R3  R1 shift-left 1 + R1
R7  R3 shift-left 1 + R1
R113  R7 shift-left 4 + R1

Apr. 2015 Computer Arithmetic, Multiplication Slide 23


Multiplication via Recoding

Example: Multiply R1 by 113 = (1 1 1 0 0 0 1)two = (1 0 0-1 0 0 0 1)two


R8  R1 shift-left 3
R7  R8 – R1
Shift
R112  R7 shift-left 4
R113  R112 + R1 Shift, subtract Shift, add

Shorter sequence using shift-and-add/subtract instructions


R7  R1 shift-left 3 – R1
R113  R7 shift-left 4 + R1

6 shift or add (3 shift-and-add) instructions needed without recoding

The canonic signed-digit representation of a number contains no


consecutive nonzero digits: average number of shift-adds is O(k/3)

Apr. 2015 Computer Arithmetic, Multiplication Slide 24


Multiplication via Factorization
Example: Multiply R1 by 119 = 7  17 128x 8x x
= (8 – 1)  (16 + 1)
R8  R1 shift-left 3 CSA 1

R7  R8 – R1
R112  R7 shift-left 4 CPA 1
R119  R112 + R7
119x

Shorter sequence using shift-and-add/subtract instructions


R7  R1 shift-left 3 – R1 Requires a scratch register
R119  R7 shift-left 4 + R7 for holding the 7 multiple

119 = (1 1 1 0 1 1 1)two = (1 0 0 0-1 0 0-1)two


More instructions may be needed without factorization

Apr. 2015 Computer Arithmetic, Multiplication Slide 25


Multiplication by Multiple Constants
Example: Multiplying a number by 45, 49, and 65
R9  R1 shift-left 3 + R1
R45  R9 shift-left 2 + R9
Separate solutions:
R7  R1 shift-left 3 – R1 5 shift-add/subtract
R49  R7 shift-left 3 – R7 operations

R65  R1 shift-left 6 + R1

A combined solution for all three constants


A programmable
R65  R1 shift-left 6 + R1
block can perform
R49  R65 – R1 left-shift 4
any of the three
R45  R49 – R1 left-shift 2
multiplications

Apr. 2015 Computer Arithmetic, Multiplication Slide 26


9.6 Preview of Fast Multipliers
Viewing multiplication as a multioperand addition problem,
there are but two ways to speed it up
a. Reducing the number of operands to be added:
Handling more than one multiplier bit at a time
(high-radix multipliers, Chapter 10)
b. Adding the operands faster:
Parallel/pipelined multioperand addition
(tree and array multipliers, Chapter 11)

In Chapter 12, we cover all remaining multiplication topics:


Bit-serial multipliers
Modular multipliers
Multiply-add units
Squaring as a special case

Apr. 2015 Computer Arithmetic, Multiplication Slide 27


10 High-Radix Multipliers
Chapter Goals
Study techniques that allow us to handle
more than one multiplier bit in each cycle
(two bits in radix 4, three in radix 8, . . .)

Chapter Highlights
High radix gives rise to “difficult” multiples
Recoding (change of digit-set) as remedy
Carry-save addition reduces cycle time
Implementation and optimization methods

Apr. 2015 Computer Arithmetic, Multiplication Slide 28


High-Radix Multipliers: Topics

Topics in This Chapter


10.1 Radix-4 Multiplication

10.2 Modified Booth’s Recoding

10.3 Using Carry-Save Adders

10.4 Radix-8 and Radix-16 Multipliers

10.5 Multibeat Multipliers

10.6 VLSI Complexity Issues

Apr. 2015 Computer Arithmetic, Multiplication Slide 29


10.1 Radix-4 Multiplication
a Multiplicand
Fig. 9.1  x Multiplier
(modified) x 00 a r200
Partial
x 11 a r211
products
x2 a r222
2 bit-matrix
x 33 a r233
p Product Preferred

Multiplication with right shifts in radix r : top-to-bottom accumulation


p(j+1) = (p(j) + xj a r k) r –1 with p(0) = 0 and
|–––add–––| p(k) = p = ax + p(0)r –k
|––shift right––|

Multiplication with left shifts in radix r : bottom-to-top accumulation


p(j+1) = r p(j) + xk–j–1a with p(0) = 0 and
|shift| p(k) = p = ax + p(0)r k
|––––add––––|
Apr. 2015 Computer Arithmetic, Multiplication Slide 30
Radix-4 Multiplication in Dot Notation
a Multiplicand
 x Multiplier
x 0 a 20
Partial
x 1 a 21
products
x 2 a 22 bit-matrix
x 3 a 23
p Product Fig. 10.1 Radix-4,
or two-bit-at-a-time,
Fig. 9.1 multiplication in dot
notation
a Multiplicand
Number of cycles is  x Multiplier
halved, but now the (x x ) a 4 0
1 0 two
“difficult” multiple 3a (x 3x 2) twoa 4 1
must be dealt with
p Product

Apr. 2015 Computer Arithmetic, Multiplication Slide 31


A Possible Design for a Radix-4 Multiplier

Precomputed via Multiplier


shift-and-add 3a
(3a = 2a + a)
0 a 2a 2-bit s hifts
x i+1 xi
00 01 10 11
Mux
k/2 + 1 cycles, rather than k
To the adder
One extra cycle over k/2
not too bad, but we would like Fig. 10.2 The multiple
to avoid it if possible generation part of a radix-4
Solving this problem for radix 4 multiplier with
may also help when dealing precomputation of 3a.
with even higher radices

Apr. 2015 Computer Arithmetic, Multiplication Slide 32


Example Radix-4 Multiplication Using 3a
================================ a
a 0 1 1 0  x
3a 0 1 0 0 1 0 (x x )
1 0
x 1 1 1 0 (x 3x 2)
================================
p(0) 0 0 0 0 p
+(x1x0)twoa 0 0 1 1 0 0
–––––––––––––––––––––––––––––––––
4p(1) 0 0 1 1 0 0
p(1) 0 0 1 1 0 0
+(x3x2)twoa 0 1 0 0 1 0
–––––––––––––––––––––––––––––––––
Fig. 10.3 Example of
4p(2) 0 1 0 1 0 1 0 0
radix-4 multiplication
p(2) 0 1 0 1 0 1 0 0
using the 3a multiple.
================================

Apr. 2015 Computer Arithmetic, Multiplication Slide 33


A Second Design for a Radix-4 Multiplier
Fig. 10.4 The multiple generation
Multiplier part of a radix-4 multiplier based on
replacing 3a with 4a (carry into next
2-bit s hifts higher radix-4 multiplier digit) and –a.
x i+1 xi
0 a 2a –a +c c Carry
mod 4 FF xSet (xif xi+1
i+1 if i x c) = xi = 1
xi+1  xi c c or i+1 = c = 1
00 01 10 11
xi  c
Mux
xi+1 xi c Mux control Set carry
---- --- --- ---------------- ------------
To the adder 0 0 0 0 0 0
0 0 1 0 1 0
0 1 0 0 1 0
0 1 1 1 0 0
1 0 0 1 0 0
1 0 1 1 1 1
1 1 0 1 1 1
1 1 1 0 0 1
Apr. 2015 Computer Arithmetic, Multiplication Slide 34
10.2 Modified Booth’s Recoding
Table 10.1 Radix-4 Booth’s recoding yielding (zk/2 . . . z1z0)four
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
xi+1 xi xi–1 yi+1 yi zi/2 Explanation
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
0 0 0 0 0 0 No string of 1s in sight
0 0 1 0 1 1 End of string of 1s
0 1 0 0 1 1 Isolated 1
0 1 1 1 0 2 End of string of 1s
1 0 0 - 1 0 - 2 Beginning of string of 1s
1 0 1 - 1 1 - 1 End a string, begin new one
1 1 0 0 -1 - 1 Beginning of string of 1s
1 1 1 0 0 0 Continuation of string of 1s
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Recoded
Context Radix-4 digit
radix-2 digits
Example
1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 0 Operand x
(1) -1 0 1 0 0 -1 1 0 -1 1 -1 1 0 0 -1 0 Recoded version y
(1) -2 2 -1 2 -1 -1 0 -2 Radix-4 version z
Apr. 2015 Computer Arithmetic, Multiplication Slide 35
Example Multiplication via Modified Booth’s Recoding
================================ a
a 0 1 1 0  x
x 1 0 1 0 (x x )
-1 -2 1 0
z Radix-4 (x 3x 2)
================================
p(0) 0 0 0 0 0 0 p
+z0a 1 1 0 1 0 0
–––––––––––––––––––––––––––––––––
4p(1) 1 1 0 1 0 0
p(1) 1 1 1 1 0 1 0 0
+z1a 1 1 1 0 1 0 Fig. 10.5 Example of
––––––––––––––––––––––––––––––––– radix-4 multiplication
4p(2) 1 1 0 1 1 1 0 0 with modified Booth’s
p(2) 1 1 0 1 1 1 0 0 recoding of the 2’s-
================================ complement multiplier.

Apr. 2015 Computer Arithmetic, Multiplication Slide 36


Multiple Generation with Radix-4 Booth’s Recoding
Multiplier Init. 0 Multiplicand

2-bit shift ---- Encoding ----


x i+1 xi x i–1 k Digit neg two non0
Sign –2 1 1 1
Recoding Logic of a
–1 1 0 1
neg two non0
0 0 0 0
0 0
a 2a 1 0 0 1
Could have named 2 0 1 1
Enable 0 1
this signal one/two Mux
Select
0, a, or 2a
k+1

Add/subtract z i/2 a
control To adder input
Fig. 10.6 The multiple generation part of a radix-4
multiplier based on Booth’s recoding.
Apr. 2015 Computer Arithmetic, Multiplication Slide 37
10.3 Using Carry-Save Adders
Multiplier

0 2a
M ux
0 a x i+1 xi
Old Cumulative M ux
Partial Product

CSA

Adder
New Cumulative Partial Product

Fig. 10.7 Radix-4 multiplication with a carry-save adder used to


combine the cumulative partial product, xia, and 2xi+1a into two numbers.

Apr. 2015 Computer Arithmetic, Multiplication Slide 38


Keeping the Partial Product in Carry-Save Form
Sum
Mux Carry
Multiplier

Partial Product
k Sum
k
Right
Carry shift
Multiplican d
0
Mux
Upper half of PP Lower half of PP
k-Bit CSA
(a) Multiplier
Carry k k block diagram (b) Operation in a typical cycle
Sum Fig. 10.8 Radix-2 multiplication with
the upper half of the cumulative partial
k-Bit Ad der product kept in stored-carry form.

Apr. 2015 Computer Arithmetic, Multiplication Slide 39


Carry-Save Multiplier with Radix-4 Booth’s Recoding
a Multiplier

x x x
i+1 i i-1
Booth recoder
and selector

Old cumulative
z a
partial product i/2

CSA
New cumulative
partial product 2-bit
Adder
Adder FF Extra “dot”

To the lower half


of partial product
Fig. 10.9 Radix-4 multiplication with a CSA used to combine the
stored-carry cumulative partial product and zi/2a into two numbers.
Apr. 2015 Computer Arithmetic, Multiplication Slide 40
Radix-4 Booth’s Recoding for Parallel Multiplication
x i+2 x i+1 xi x i–1 x i–2

Recoding Logic

neg two non0


a 2a
Fig. 10.10 Booth 0 1
Enable
Mux
recoding and multiple Select
selection logic for 0, a, or 2a
k+1
high-radix or parallel
multiplication.
Selective Complement
0, a, –a, 2a, or –2a
k+2

Extra "Dot" z i/2 a


for Column i
Apr. 2015 Computer Arithmetic, Multiplication Slide 41
Yet Another Design for Radix-4 Multiplication
Multiplier

0 2a
M ux
0 a x i+1 xi
Old Cumulative M ux
Partial Product
(4; 2)-counter Fig. 10.11 Radix-4 multiplication,
with the cumulative partial product,
CSA
xia, and 2xi+1a combined into two
numbers by two CSAs.
CSA
New Cumulative
Partial Product

2-Bit
Adder FF Adder

To the Lower Half


of Partial Product
Apr. 2015 Computer Arithmetic, Multiplication Slide 42
10.4 Radix-8 and Radix-16 Multipliers
Multiplier
0 8a
Mux
0 4a x i+3
Mux
2a x i+2
4-bit 4-Bit 0
right Shift Mux
a x i+1
0
shift Mux xi

CSA CSA

CSA
Fig. 10.12 Radix-16
multiplication with the
upper half of the CSA
cumulative partial Sum
4
product in carry-save 4-Bit
3 FF Ad der
Carry
form. Partial Product 4
To th e Lower Half
(Upper Half) of Partial Produ ct
Apr. 2015 Computer Arithmetic, Multiplication Slide 43
Other High-Radix Multipliers Multiplier
0 8a
Remove this mux & CSA and Mux
x i+3
0 4a
replace the 4-bit shift (adder)
Mux
with a 3-bit shift (adder) to get 2a x i+2
4-Bit 0
a radix-8 multiplier (cycle time Shift Mux
x i+1
0 a
will remain the same, though) Mux xi

CSA CSA

A radix-16 multiplier design


becomes a radix-256 CSA Fig. 10.12
multiplier if radix-4 Booth’s
recoding is applied first CSA
(the muxes are replaced by 4
Sum
Booth recoding and multiple 3 FF
4-Bit
Carry Adder
selection logic) 4
Partial Product To the Lower Half
(Upper Half) of Partial Product

Apr. 2015 Computer Arithmetic, Multiplication Slide 44


A Spectrum of Multiplier Design Choices
Several
Next multiples All multiples
multiple ... .. .
Small CSA
tree Full CSA
Adder
tree

Partial product Partial product

Adder Adder

High-radix
Basic or Full
binary Speed up partial tree Economize tree

Fig. 10.13 High-radix multipliers as intermediate


between sequential radix-2 and full-tree multipliers.
Apr. 2015 Computer Arithmetic, Multiplication Slide 45
10.5 Multibeat Multipliers
Inputs Inputs PH1 State
Next-state
Next-state excitation Next-state latches
Present logic logic
state
Next-state
State State logic
flip-flops CLK latches PH2 Inputs
(a) Sequential machine with FFs (b) Sequential machine with latches and 2-phase clock

Fig. 10.15 Two-phase clocking for sequential logic.

Begin changing FF contents


Change becomes visible at FF output

Observation: Half of the


clock cycle goes to waste
Once cycle
Apr. 2015 Computer Arithmetic, Multiplication Slide 46
Twin-Beat and Three-Beat Multipliers
Twin Multiplier
Registers This radix-64 multiplier
runs at the clock rate of a
3a a 3a a radix-8 design (2X speed)
4 4

Pipelined Pipelined
Radix-8 Radix-8
Booth Booth
Reco der Reco der
& Selector & Selector

CSA CSA

Sum Sum
Carry Carry
5 6
6-Bit
FF Ad der
Adder 6
To th e Lower Half
of Partial Produ ct
Fig. 10.14 Twin-beat multiplier Fig. 10.16 Conceptual view
with radix-8 Booth’s recoding. of a three-beat multiplier.
Apr. 2015 Computer Arithmetic, Multiplication Slide 47
10.6 VLSI Complexity Issues
A radix-2b multiplier requires:
bk two-input AND gates to form the partial products bit-matrix
O(bk) area for the CSA tree
At least Q(k) area for the final carry-propagate adder
Total area: A = O(bk)
Latency: T = O((k/b) log b + log k)

Any VLSI circuit computing the product of two k-bit integers must
satisfy the following constraints:
AT grows at least as fast as k3/2
AT2 is at least proportional to k2
The preceding radix-2b implementations are suboptimal, because:
AT = O(k2 log b + bk log k)
AT2 = O((k3/b) log2b)
Apr. 2015 Computer Arithmetic, Multiplication Slide 48
Comparing High- and Low-Radix Multipliers
AT = O(k2 log b + bk log k) AT 2 = O((k3/b) log2b)
Low-Cost High Speed AT- or AT 2-
b = O(1) b = O(k) Optimal
AT O(k 2) O(k 2 log k) O(k 3/2)
AT 2 O(k 3) O(k 2 log2 k) O(k 2)

Intermediate designs do not yield better AT or AT 2 values;


The multipliers remain asymptotically suboptimal for any b
By the AT measure (indicator of cost-effectiveness), slower radix-2
multipliers are better than high-radix or tree multipliers
Thus, when an application requires many independent multiplications,
it is more cost-effective to use a large number of slower multipliers

High-radix multiplier latency can be reduced from O((k/b) log b + log k)


to O(k/b + log k) through more effective pipelining (Chapter 11)

Apr. 2015 Computer Arithmetic, Multiplication Slide 49


11 Tree and Array Multipliers
Chapter Goals
Study the design of multipliers for highest
possible performance (speed, throughput)

Chapter Highlights
Tree multiplier = reduction tree
+ redundant-to-binary converter
Avoiding full sign extension in multiplying
signed numbers
Array multiplier = one-sided reduction tree
+ ripple-carry adder
Apr. 2015 Computer Arithmetic, Multiplication Slide 50
Tree and Array Multipliers: Topics

Topics in This Chapter


11.1. Full-Tree Multipliers

11.2. Alternative Reduction Trees

11.3. Tree Multipliers for Signed Numbers

11.4. Partial-Tree and Truncated Multipliers

11.5. Array Multipliers

11.6. Pipelined Tree and Array Multipliers

Apr. 2015 Computer Arithmetic, Multiplication Slide 51


11.1 Full-Tree Multipliers
Multiplier
a Multiplicand
 x Multiplier
a ...
(x x ) a 4 0
1 0 two
(x 3x 2) twoa 4 1 Multiple- a
Forming
p Product Circuits a
Several
Next multiples All multiples a
multiple ... .. . . . .
Small CSA
tree Full CSA
Adder
tree Partial-Products
Partial product Partial product Reduction Tree
(Multi-Operand
Addition Tree)
Adder Adder

High-radix
Basic or Full
binary Speed up partial tree Economize tree Redundant result
Redundant-to-Binary
Fig. 10.13 High-radix multipliers Converter
as intermediate between sequential Some lower-order
Higher-order product bits are
radix-2 and full-tree multipliers. product bits generated directly

Fig. 11.1 General structure of a full-tree multiplier.


Apr. 2015 Computer Arithmetic, Multiplication Slide 52
Full-Tree versus Partial-Tree Multiplier
All partial products Several partial products

... ...
Small tree of
Large tree of carry-save
carry-save Log- adders
adders depth

Log-
Adder depth Adder

Product Product

Schematic diagrams for full-tree and partial-tree multipliers.


Apr. 2015 Computer Arithmetic, Multiplication Slide 53
Variations in Full-Tree Multiplier Design
Multiplier
Designs are distinguished by a ...
variations in three elements:
Multiple- a
Forming
Circuits a
1. Multiple-forming circuits a
. . .

Partial-Products
Reduction Tree
2. Partial products reduction tree (Multi-Operand
Addition Tree)

Redundant result
Fig. 11.1
3. Redundant-to-binary converter Redundant-to-Binary
Converter
Some lower-order
Higher-order product bits are
product bits generated directly

Apr. 2015 Computer Arithmetic, Multiplication Slide 54


Example of Variations in CSA Tree Design
Corrections
shown in red
a Wallace Tree
Multiplicand Dadda Tree
 x (5 FAs + 3 HAs + 4-Bit Adder)
Multiplier (4 FAs + 2 HAs + 6-Bit Adder)
x 0 a 20
Partial
x 1 a 21
1
x 2 a 22 23 4 3 2 1
products 1 2
3 4 3 2 1
bit-matrix
x 3 a 23 FA FA FA HA FA HA
HA FA
p
--------------------
Product --------------------
1 3 2 3 2 1 1 1 3 2 3 2
3 3 2 1
FA HA FA HA FA FA
HA FA
HA HA
FA
---------------------- ----------------------
2 2 2 2 1 1 1 2 2 2 2 1 2 2 1
4-Bit Adder 6-Bit Adder
---------------------- ----------------------
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Fig. 11.2 Two different binary 4  4 tree multipliers.

Apr. 2015 Computer Arithmetic, Multiplication Slide 55


[0, 6] [2, 8] [3, 9] [5, 11]
Details of a CSA Tree [1, 7] [4, 10] [6, 12]

[1, 6]
Fig. 11.3 Possible 7-bit CSA 7-bit CSA
CSA tree for a 7  7 [2, 8] [1,8] [5, 11] [3, 11]
tree multiplier.
7-bit CSA
[6, 12] [3, 12]
[2, 8]
CSA trees are quite 7-bit CSA
irregular, causing [3,9] [2,12]
some difficulties in [3,12]
VLSI realization
The ind ex pair 10-bit CSA
[i, j] means that [3,12]
Thus, our motivation bit pos ition s
[4,13] [4,12]
to examine alternate from i up to j
are involved.
methods for partial 10-bit CPA
products reduction
Ignore [4, 13] 3 2 1 0

Apr. 2015 Computer Arithmetic, Multiplication Slide 56


11.2 Alternative Reduction Trees FA

Level
Inputs FA
1
FA FA FA

Level-1 FA
carries

11 + y1 = 2y1 + 3 FA FA FA
Level-2 Level
Therefore, y1 = 8 carries 2
FA
carries are needed FA FA
Level-3
carries FA
Level
Fig. 11.4 3
FA
A slice of a Level-4
FA
balanced-delay carry
tree for 11 Level
FA
FA 4
inputs.
Outputs FA
Level
5
Apr. 2015 Computer Arithmetic, Multiplication Slide 57
Binary Tree of 4-to-2 Reduction Modules
4-to-2 compressor
4-to-2 4-to-2 4-to-2 4-to-2

FA
0 1
4-to-2 4-to-2

0 1
FA
4-to-2
c s c s

(a) Binary tree of (4; 2)-counters (b) Realization with FAs (c) A faster realization

Fig. 11.5 Tree multiplier with a more regular


structure based on 4-to-2 reduction modules.
Due to its recursive structure, a binary tree is more regular
than a 3-to-2 reduction tree when laid out in VLSI
Apr. 2015 Computer Arithmetic, Multiplication Slide 58
Example Multiplier with 4-to-2 Reduction Tree
Multiple M u l t i p l i c a n d
Even if 4-to-2 reduction
generation ...

M ul t ipl e s elect i on si gnals


is implemented using circuits
two CSA levels, design
regularity potentially
makes up for the larger
number of logic levels

Similarly,
using Booth’s
recoding may
not yield any
advantage,
because it
introduces
Redundant-to-binary converter
irregularity

Fig. 11.6 Layout of a partial-products reduction tree composed of


4-to-2 reduction modules. Each solid arrow represents two numbers.
Apr. 2015 Computer Arithmetic, Multiplication Slide 59
11.3 Tree Multipliers for Signed Numbers
---------- Extended positions ---------- Sign Magnitude positions ---------
xk–1 xk–1 xk–1 xk–1 xk–1 xk–1 xk–2 xk–3 xk–4 ...
yk–1 yk–1 yk–1 yk–1 yk–1 yk–1 yk–2 yk–3 yk–4 ...
zk–1 zk–1 zk–1 zk–1 zk–1 zk–1 zk–2 zk–3 zk–4 ...
From Fig. 8.19a Sign extension in multioperand addition.
Signs
Sign extensions
        x x x x x x x The difference in
       x x x x x x x x multiplication is the
      x x x x x x x x x shifting sign positions
Five redundant copies 
removed
FA FA FA FA FA FA
Fig. 11.7 Sharing of full adders to reduce
the CSA width in a signed tree multiplier.

Apr. 2015 Computer Arithmetic, Multiplication Slide 60


a4 a3 a2 a1 a0

Using the Negative-Weight a. Unsigned x4 x3 x2 x1 x0
----------------------------
a4 x0 a3 x0 a2 x0 a1 x0 a0 x0
a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
Property of the Sign Bit a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
a4 x3 a3 x3 a2 x3 a1 x3 a0 x3
a4 x4 a3 x4 a2 x4 a1 x4 a0 x4
--------------------------------------------------------
-
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0

Sign extension is a way of b. 2's-complement 


a4
x4
a3
x3
a2
x2
a1
x1
a0
x0
----------------------------
converting negatively weighted bits -a4 x0 a3 x0 a2 x0 a1 x0 a0 x0
-a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
(negabits) to positively weighted -a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
-a4 x3 a3 x3 a2 x3 a1 x3 a0 x3
bits (posibits) to facilitate reduction, a4 x4 -a3 x4 -a2 x4 -a1 x4 -a0 x4
--------------------------------------------------------
-
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0
but there are other methods of
a4 a3 a2 a1 a0
accomplishing the same without c. Baugh-Wooley  x4 x3 x2 x1 x0
----------------------------
_
introducing a lot of extra bits _ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0
_ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
_ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
a4 x3 a
_ _3 x3 _
a2 x3 a
_1 x3 a0 x3
a4 x4 a3 x4 a2 x4 a1 x4 a0 x4
a4 a4
Baugh and Wooley have 1 x4 x4
--------------------------------------------------------
contributed two such methods -
p
9
p
8
p
7
p
6
p
5
p
4
p
3
p
2
p
1
p
0
a4 a3 a2 a1 a0
d. Modified B-W  x
4
x
3
x
2
x
1
x
0
---------------------------
__
__ a x a x a x a x a x
4 0 3 0 2 0 1 0 0 0
Fig. 11.8 Baugh-Wooley __ a
__
x
4 2
a x
a4 x1
3 2
a x
3
a x
2 2
1
a x
2
a x
1 2
1
a x
1
a x
0 2
1
a x
0 1
a__x a x a x a x a x
2’s-complement multiplication. a x
4 4
4 3 __
a x
3 4
3 3 __
a x
2 4
2 3
a x
1 4
__
1 3
a x
0 4
0 3

1 1
--------------------------------------------------------
-
p p p p p p p p p p
9 8 7 6 5 4 3 2 1 0

Apr. 2015 Computer Arithmetic, Multiplication Slide 61


a4 x4 -a3 x4 -a2 x4 -a1 x4 -a0 x4
--------------------------------------------------------
The Baugh-Wooley Method and Its Modified Form
-
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0

a4 a3 a2 a1 a0
Fig. 11.8 c. Baugh-Wooley  x4 x3 x2 x1 x0
----------------------------
_
_ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0
–a4x0 = a4(1 – x0) – a4 _ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
= a4 x 0  – a 4 _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
a4 x3 a
_ _3 x3 _
a2 x3 a
_1 x3 a0 x3
a4 x4 a3 x4 a2 x4 a1 x4 a0 x4
–a4 a4x0 a4 a4
a4 1 x4 x4
--------------------------------------------------------
In next column -
p p p p p p p p p p
9 8 7 6 5 4 3 2 1 0
a4 a3 a2 a1 a0
d. Modified B-W  x
4
x
3
x
2
x
1
x
0
---------------------------
__
__ a x a x a x a x a x
–a4x0 = (1 – a4x0) – 1 a x
__
4 0
a x
3 0
a x
2 0
a x
1 0
a x
0 0
= (a4x0) – 1 __ a x
4 2
a4 x1
3 2
a3 x1 a2 x1 a1 x1
2 2 1 2 0 2
0 1
a__x a x a x a x a x
4 3 __3 3 __2 3 __
1 3 0 3
–1 (a4x0) a x
4 4
a x
3 4
a x
2 4
a x
1 4
a x
0 4
1 1
1 --------------------------------------------------------
-
p p p p p p p p p p
In next column 9 8 7 6 5 4 3 2 1 0
Apr. 2015 Computer Arithmetic, Multiplication Slide 62
a4 a3 a2 a1 a0

Alternate Views of the a. Unsigned  x4 x3 x2 x1 x0


----------------------------
a4 x0 a3 x0 a2 x0 a1 x0 a0 x0
a4 x1 a3 x1 a2 x1 a1 x1 a0 x1

Baugh-Wooley Methods a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
a4 x3 a3 x3 a2 x3 a1 x3 a0 x3
a4 x4 a3 x4 a2 x4 a1 x4 a0 x4
--------------------------------------------------------
-
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0

+ 0 0 –a4x3 –a4x2 –a4x1 –a4x0 a4 a3 a2 a1 a0


b. 2's-complement  x4 x3 x2 x1 x0
+ 0 0 –a3x4 –a2x4 –a1x4 –a0x4 ----------------------------
-a4 x0 a3 x0 a2 x0 a1 x0 a0 x0
-------------------------------------------- -a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
-a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
– 0 0 a4 x 3 a4 x 2 a4 x 1 a4 x 0 -a4 x3 a3 x3 a2 x3 a1 x3 a0 x3
a4 x4 -a3 x4 -a2 x4 -a1 x4 -a0 x4
– 0 0 a3 x 4 a2 x 4 a1 x 4 a0 x 4 --------------------------------------------------------
-
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0
-------------------------------------------- a4 a3 a2 a1 a0
c. Baugh-Wooley 
+ 1 1 a4 x 3 a4 x 2 a4 x 1 a4 x 0 x4 x3 x2 x1 x0
----------------------------
_
_ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0
+ 1 1 a3 x 4 a2 x 4 a1 x 4 a0 x 4 _ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
_ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
1 a4 x3 a
_ _3 x3 _
a2 x3 a
_1 x3 a0 x3
a4 x4 a3 x4 a2 x4 a1 x4 a0 x4

1 1
a4
x4
a4
x4
-------------------------------------------- --------------------------------------------------------
-
p p p p p p p p p p
9 8 7 6 5 4 3 2 1 0
+ a4 a4 a4 x 3 a4 x 2 a4 x 1 a4 x 0 a4 a3 a2 a1 a0
d. Modified B-W  x x x x x
+ x4 x4 a3 x 4 a2 x 4 a1 x 4 a0 x 4 4 3 2 1 0
---------------------------
__
a x a x a x a x a x
__ 4 0 3 0 2 0 1 0 0 0
a4 a4 __ a
__
x
a x
a4 x1
a x
3
a x 1
a x
2
a x 1
a x
1
a x 1
a x
0 1
4 2 3 2 2 2 1 2 0 2
a__x a x a x a x a x
1 x4 x4 a x
4 4
4 3 __
a x
3 4
3 3 __
a x
2 4
2 3
a x
1 4
__
1 3
a x
0 4
0 3

-------------------------------------------- 1 1
--------------------------------------------------------
-
p p p p p p p p p p
9 8 7 6 5 4 3 2 1 0

Apr. 2015 Computer Arithmetic, Multiplication Slide 63


11.4 Partial-Tree and Truncated Multipliers
h inputs
High-radix versus partial-tree ... Upper part of
the cumulative
multipliers: The difference is partial product
quantitative, not qualitative (stored-carry)
CSA Tree
For small h, say  8 bits,
we view the multiplier of
Fig. 11.9 as high-radix
Sum
When h is a significant
fraction of k, say k/2 or k/4, Carry
Lower part of
then we tend to view it as FF the cumulative
a partial-tree multiplier Adder partial product
h-Bit
Adder

Better design through pipelining


Fig. 11.9 General structure
to be covered in Section 11.6
of a partial-tree multiplier.
Apr. 2015 Computer Arithmetic, Multiplication Slide 64
Why Truncated Multipliers?
Nearly half of the hardware in array/tree multipliers is there to get the
last bit right (1 dot = one FPGA cell)
ulp
.         k-by-k fractional
 .         multiplication
---------------------------------
.       | Max error = 8/2 + 7/4
.       | 
+ 6/8 + 5/16 + 4/32
.     |  
.    |   
+ 3/64 + 2/128
.    |     + 1/256 = 7.004 ulp
.   |     
. |       Mean error =
. |        1.751 ulp
---------------------------------
.         |       
Fig. 11.10 The idea of a truncated multiplier with 8-bit fractional operands.

Apr. 2015 Computer Arithmetic, Multiplication Slide 65


Truncated Multipliers with Error Compensation
We can introduce additional “dots” on the left-hand side to compensate
for the removal of dots from the right-hand side

Constant compensation Variable compensation


. o o o o o o o| . o o o o o o o|
. o o o o o o| . o o o o o o|
. o o o o o| . o o o o o|
. o o o o| . o o o o|
. o o o| . o o o|
. 1 o o| . o o|
. o| . x-1o|
. | . y-1 |
Constant and variable error compensation for truncated multipliers.

Max error = +4 ulp Max error = +? ulp


Max error  -3 ulp Max error  -? ulp
Mean error = ? ulp Mean error = ? ulp
Apr. 2015 Computer Arithmetic, Multiplication Slide 66
11.5 Array Multipliers
x2 a x1 a x0 a 0 a4 x0 0 a3 x0 0 a2 x0 0 a1 x0 0 a0 x0

p0
x3 a CSA a3 x1
a4 x1 a2 x1 a1 x1 a0 x1

CSA p1
x4 a a3 x2
a2 x2 a1 x2 a0 x2
a4 x2
CSA p2
a3 x3
a2 x3 a1 x3 a0 x3
a4 x3
CSA
p3
a3 x4
a2 x4 a1 x4 a0 x4
Ripple-Carry Adder a4 x4
p4
ax 0
p9 p8 p7 p6 p5
Fig. 11.11 A basic array
multiplier uses a one-sided CSA Fig. 11.12 Details of a 5  5
tree and a ripple-carry adder. array multiplier using FA blocks.

Apr. 2015 Computer Arithmetic, Multiplication Slide 67


Signed (2’s-complement) Array Multiplier
_
Fig. 11.13 a4 x0 0 a3 x0 0 a2 x0 0 a1 x0 0 a0 x0
Modifications in a 5  5
p0
array multiplier to deal a3 x1
_
with 2’s-complement a4 x1 a2 x1 a1 x1 a0 x1
inputs using the p1
Baugh-Wooley a3 x2
_ a2 x2 a1 x2 a0 x2
method or to shorten a4 x2
the critical path. p2
a3 x3
a4 x4 _ a2 x3 a1 x3 a0 x3
a4 x3
_
a_4 p3
x4 _ _ _ _
a3 x4 a2 x4 a1 x4 a0 x4 a4
1
x4

p9 p8 p7 p6 p5 p4

Apr. 2015 Computer Arithmetic, Multiplication Slide 68


Array Multiplier Built of Modified Full-Adder Cells
a4 a3 a2 a1 a0
Fig. 11.14 Design of
a 5  5 array multiplier
x0
with two additive
inputs and full-adder p0
blocks that include x1
AND gates.
p1
x2

p2
x3

p3
x4

FA p4

p5
p9 p8 p7 p6

Apr. 2015 Computer Arithmetic, Multiplication Slide 69


Array Multiplier without a Final Carry-Propagate Adder
Fig. 11.15 Conceptual
view of a modified
array multiplier that ..
does not need a final .
carry-propagate adder. Mux

i i Bi Level i
Fig. 11.16 Carry-save Mux
addition, performed in i i
Bi+1
level i, extends the i+1 i+1

conditionally computed Mux


bits of the final product. ..
k
. k
i Conditional bits Bi
Dots in row i
Mux
k

Dots in row i + 1 [k, 2k–1] k–1 i+1 i i–1 1 0

i + 1 Conditional bits All remaining bits of the final product


of the final product Bi+1 produced only 2 gate levels after pk–1
Apr. 2015 Computer Arithmetic, Multiplication Slide 70
11.6 Pipelined Tree and Array Multipliers
h inputs
...
h inputs
... Upper part of
Latches
the cumulative Pipelined
partial product CSA Tree (h + 2)-input
Latches
(stored-carry) CSA tree
Latches
CSA Tree

CSA Latch

Sum CSA
Carry
Lower part of Sum
FF the cumulative
Carry
Adder partial product
h-Bit Lower part of
Adder FF the cumulative
partial product
Adder h-Bit
Adder

Fig. 11.9 General structure Fig. 11.17 Efficiently pipelined


of a partial-tree multiplier. partial-tree multiplier.
Apr. 2015 Computer Arithmetic, Multiplication Slide 71
Pipelined Array Multipliers
a4 a3 a2 a1 a0 x0 x1 x2 x3 x4

With latches after every


FA level, the maximum
throughput is achieved
Latches may be inserted
after every h FA levels for
an intermediate design

Example: 3-stage pipeline

Latched FA
Fig. 11.18 Pipelined 5 5 FA with
AND gate FA
array multiplier using
latched FA blocks. Latch
FA

The small shaded boxes FA

are latches.
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0

Apr. 2015 Computer Arithmetic, Multiplication Slide 72


12 Variations in Multipliers

Chapter Goals
Learn additional methods for synthesizing
fast multipliers as well as other types
of multipliers (bit-serial, modular, etc.)

Chapter Highlights
Building a multiplier from smaller units
Performing multiply-add as one operation
Bit-serial and (semi)systolic multipliers
Using a multiplier for squaring is wasteful

Apr. 2015 Computer Arithmetic, Multiplication Slide 73


Variations in Multipliers: Topics

Topics in This Chapter


12.1 Divide-and-Conquer Designs

12.2 Additive Multiply Modules

12.3 Bit-Serial Multipliers

12.4 Modular Multipliers

12.5 The Special Case of Squaring

12.6 Combined Multiply-Add Units

Apr. 2015 Computer Arithmetic, Multiplication Slide 74


12.1 Divide-and-Conquer Designs
Building wide multiplier from narrower ones

aH aL Rearranged partial products


in 2b-by-2b multiplication
 xH xL
2b bits b bits
a L xL
a H xL
a L xH
a H xH a L xL
a H xL
a L xH
a H xH
p 3b bits

Fig. 12.1 Divide-and-conquer (recursive) strategy for


synthesizing a 2b  2b multiplier from b  b multipliers.

Apr. 2015 Computer Arithmetic, Multiplication Slide 75


General Structure of a Recursive Multiplier

2b  2b use (3; 2)-counters


3b  3b use (5; 2)-counters 4b  4b
4b  4b use (7; 2)-counters 3b  3b
2b  2b
b b

Fig. 12.2 Using b  b multipliers to synthesize


2b  2b, 3b  3b, and 4b  4b multipliers.
Apr. 2015 Computer Arithmetic, Multiplication Slide 76
Using b  c, rather than b  b Building Blocks
4b  4b
3b  3b
2b  2b
b b

2b  2c use b  c multipliers and (3; 2)-counters


2b  4c use b  c multipliers and (5?; 2)-counters
gb  hc use b  c multipliers and (?; 2)-counters

Apr. 2015 Computer Arithmetic, Multiplication Slide 77


Wide Multiplier Built of Narrow Multipliers and Adders
aH xH aL xH aH x aL xL
Fig. 12.3 Using 4  4 L

multipliers and 4-bit [4, 7] [4, 7] [0, 3] [4, 7] [4, 7] [0, 3] [0, 3] [0, 3]

adders to synthesize Multiply Multiply Multiply Multiply


an 8  8 multiplier.
[12,15] [8,11] [8,11] [4, 7] [8,11] [4, 7] [4, 7] [0, 3]
8
Add
[4, 7]

12 8
Add Add
[8,11] [4, 7]

12
Add
000 [8,11]

Add
[12,15]

p [12,15] p[8,11] p [4, 7] p[0, 3]

Apr. 2015 Computer Arithmetic, Multiplication Slide 78


Karatsuba Multiplication

2b  2b multiplication requires four b  b multiplications:

(2baH + aL)  (2bxH + xL) = 22baHxH + 2b (aHxL + aLxH) + aLxL

Karatsuba noted that one of the four multiplications can be removed


at the expense of introducing a few additions:

(2baH + aL)  (2bxH + xL) = b bits


22baHxH + 2b [(aH + aL)  (xH + xL) – aHxH – aLxL] + aLxL aH aL

xH xL
Mult 1 Mult 3 Mult 2
Benefit is quite significant for extremely wide operands
(4/3)5 = 4.2 (4/3)10 = 17.8 (4/3)20 = 315.3 (4/3)50 = 1,765,781
Apr. 2015 Computer Arithmetic, Multiplication Slide 79
12.2 Additive Multiply Modules
y z y
ax
x z
a

4-bit adder cin


p = ax + y + z p
(a) Block diagram (b) Dot notation
v v
Fig. 12.4 Additive multiply module with 2  4 multiplier (ax)
plus 4-bit and 2-bit additive inputs (y and z).

b-bit and c-bit multiplicative inputs


b  c AMM b-bit and c-bit additive inputs
(b + c)-bit output
(2b – 1)  (2c – 1) + (2b – 1) + (2c – 1) = 2b+c – 1
Apr. 2015 Computer Arithmetic, Multiplication Slide 80
Multiplier Built of AMMs
Legend: * *
2 bits x [0, 1]
4 bits a [0, 3] Understanding
[2, 5]
[0, 1]
an 8  8 multiplier
* *
x [0, 1] * built of 4  2
a [4, 7] x [2, 3] AMMs using dot
a [0, 3]
[6, 9] [4, 5]
[2, 3] notation
* x [2, 3] [4,7]
x [4, 5]
a [4, 7] a [0, 3]
[4,5]
[8, 11] [6, 7]
[6, 9]
x [6, 7]
a [0, 3]
[8, 9] [8, 11]

x [4, 5] [6, 7]
a [4, 7]
[8, 9]
[10,13]
[10,11]

x [6, 7] p[0, 1]
a [4, 7] p[2, 3]
p[4,5]
p[12,15] p [10,11] p [8, 9] p [6, 7]

Fig. 12.5 An 8  8 multiplier built of 4  2 AMMs.


Inputs marked with an asterisk carry 0s.
Apr. 2015 Computer Arithmetic, Multiplication Slide 81
Multiplier Built of AMMs: Alternate Design
a [4, 7] a [0,3]
* * *
This design is more regular
x [0, 1]
than that in Fig. 12.5 and is
easily expandable to larger
* configurations; its latency,
x [2, 3] however, is greater

*
x [4, 5]

*
x [6, 7]

p[0, 1] Fig. 12.6 Alternate 8  8


Legend: p[2, 3] multiplier design based on
2 bits p[4,5] 4  2 AMMs. Inputs marked
4 bits p [6, 7] with an asterisk carry 0s.
p[12,15] p [10,11] p [8, 9]
Apr. 2015 Computer Arithmetic, Multiplication Slide 82
12.3 Bit-Serial Multipliers
Bit-serial adder
(LSB first) …x x x FF
…s s s
2 1 0 2 1 0
FA
…y y y
2 1 0

Bit-serial multiplier
…a a a …p p p
(Must follow the k-bit 2 1 0
inputs with k 0s;
alternatively, view …x x x ? 2 1 0

the product as being 2 1 0


only k bits wide)

What goes inside the box to make a bit-serial multiplier?


Can the circuit be designed to support a high clock rate?

Apr. 2015 Computer Arithmetic, Multiplication Slide 83


Semisystolic Serial-Parallel Multiplier
Multiplicand (parallel in) x0 x1 x2 x3
a3 a2 a1 a0
Multiplier
(serial in)
LSB-first

Sum
FA FA FA FA
Product
Carry (serial out)

Fig. 12.7 Semi-systolic circuit for 4  4 multiplication in 8 clock cycles.

This is called “semisystolic” because it has a large signal fan-out of k


(k-way broadcasting) and a long wire spanning all k positions
Apr. 2015 Computer Arithmetic, Multiplication Slide 84
Systolic Retiming as a Design Tool
A semisystolic circuit can be converted to a systolic circuit
via retiming, which involves advancing and retarding signals
by means of delay removal and delay insertion in such a
way that the relative timings of various parts are unaffected
Cut +d –d

e e+d
f f+d
CL CR CL CR
g g–d
h h–d
–d +d
Original delays Adjusted delays
Fig. 12.8 Example of retiming by delaying the inputs to CL
and advancing the outputs from CL by d units
Apr. 2015 Computer Arithmetic, Multiplication Slide 85
Alternate Explanation of Systolic Retiming

t t t+d1 d1 t

t+a d1 d2 t+a+d1+d2 t+a+d1 d2 t+a+d1+d2

Transferring delay from the outputs of a subsystem to its


inputs does not change the behavior of the overall system

Apr. 2015 Computer Arithmetic, Multiplication Slide 86


Multiplicand (parallel in) x0 x1 x2 x3
a3 a2 a1 a0
A First Attempt Multiplier
(serial in)
at Retiming LSB-first

Su m
FA FA FA FA
Product
Carry (serial out)

Fig. 12.7
Multiplicand (parallel in) x0 x1 x2 x3
a3 a2 a1 a0
Multiplier
(serial in)
LSB-first

Sum
FA FA FA FA
Product
Carry (serial out)

Fig. 12.9 A retimed version


of our semi-systolic multiplier.
Cut 3 Cut 2 Cut 1
Apr. 2015 Computer Arithmetic, Multiplication Slide 87
Multiplicand (parallel in) x0 x1 x2 x3
a3 a2 a1 a0
Deriving a Fully Multiplier
(serial in)
Systolic Multiplier LSB-first

Su m
FA FA FA FA
Product
Carry (serial out)

Fig. 12.7
Multiplicand (parallel in) x0 x1 x2 x3
a3 a2 a1 a0
Multiplier
(serial in)
LSB-first

Sum
FA FA FA FA
Product
Carry (serial out)
Fig. 12.10 Systolic circuit for
4  4 multiplication in 15 cycles.
Apr. 2015 Computer Arithmetic, Multiplication Slide 88
A Direct Design for a Bit-Serial Multiplier
p (i–1) ai xi ai a(i - 1)
a
x
t out t in xi x(i - 1)

ai x(i - 1) Already
accumulated
cout ai xi c in into three
ai xi
numbers
(5; 3)-counter
2 1 0 xi a(i - 1)
1
sin s out
0
M ux
p
Already output
Fig. 12.11 Building block for a (a) Structure of the bit-matrix
latency-free bit-serial multiplier.
... ai p(i - 1)
... xi ai xi
ai x(i - 1)
... t out t in LSB xi a(i - 1)
... cout c in 0
2p (i ) Shift right to
obtain p(i )
... sin s out pi

(b) Reduction after each input bit


Fig. 12.12 The cellular structure
of the bit-serial multiplier based on Fig. 12.13 Bit-serial multiplier
the cell in Fig. 12.11. design in dot notation.
Apr. 2015 Computer Arithmetic, Multiplication Slide 89
12.4 Modular Multipliers

FA FA ... FA FA FA

Fig. 12.14 Modulo-(2b – 1)


carry-save adder.

4 4 4 4

Mod-15 CSA

Divide by 16
Mod-15 CSA

Fig. 12.15 Design of a Mod-15 CPA


4  4 modulo-15 multiplier.
4

Apr. 2015 Computer Arithmetic, Multiplication Slide 90


Other Examples of Modular Multiplication

Address n inputs
...
Fig. 12.16 One way
to design of a 4  4 .
modulo-13 multiplier. Table . CSA Tree
.

Data

3-input
Fig. 12.17 A method for Modulo-m
Adder
modular multioperand addition.
sum mod m
Apr. 2015 Computer Arithmetic, Multiplication Slide 91
12.5 The Special Case of Squaring
Multiply x by x x4 x3 x2 x1 x0
x4 x3 x2 x1 x0
x4 x0 x3 x0 x2 x0 x1 x0 x0 x0
x4 x1 x3 x1 x2 x1 x1 x1 x0 x1
x4 x2 x3 x2 x2 x2 x1 x2 x0 x2
x4 x3 x3 x3 x2 x3 x1 x3 x0 x3
x4 x4 x3 x4 x2 x4 x1 x4 x0 x4
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0
Simplify

_
x4 x3 x4 x2 x4 x1 x4 x0 x3 x0 x2 x0 x1 x0 x0
x4 x3 x2 x3 x1 x2 x1 x1
x3 x2 x1x0 –x1x0
p9 p8 p7 p6 p5 p4 p3 p2 0 x0
Fig. 12.18 Design of a 5-bit squarer.

Apr. 2015 Computer Arithmetic, Multiplication Slide 92


Divide-and-Conquer Squarers

Building wide squarers from narrower ones

axH axLL Rearranged partial products


in 2b-by-2b multiplication
 xH xL
2b bits b bits
axL x L
axHH x L
axL x H
axHH x H xaLL x L
axH
H xL
xaHH x H axL x H

p 3b bits

Divide-and-conquer (recursive) strategy for synthesizing a


2b  2b squarer from b  b squarers and multiplier.

Apr. 2015 Computer Arithmetic, Multiplication Slide 93


12.6 Combined Multiply-Add Units
Additive input Multiply-add
(a) CSA tree output versus
multiply-accumulate

Carry-save additive input Multiply-accumulate


units often have wider
(b) CSA tree output additive inputs

Additive input
Fig. 12.19
Dot matrix for the Dot-notation
4  4 multiplication representations
(c)
of various methods
for performing
Carry-save additive input a multiply-add
operation
Dot matrix for the
4  4 multiplication in hardware.
(d)
Apr. 2015 Computer Arithmetic, Multiplication Slide 94

You might also like