Professional Documents
Culture Documents
IIEST VLSI Multiplication
IIEST VLSI Multiplication
Multiplication
Review multiplication schemes and various speedup methods
• Multiplication is heavily used (in arith & array indexing)
• Division = reciprocation + multiplication
• Multiplication speedup: high-radix, tree, recursive
• Bit-serial, modular, and array multipliers
Chapter Highlights
Multiplication = multioperand addition
Hardware, firmware, software algorithms
Multiplying 2’s-complement numbers
The special case of one constant operand
a Multiplicand
x Multiplier
x0a 20
Partial
x1a 21
products
x2a 22 bit-matrix
x3a 23
p Product
Fig. 9.1 Multiplication of two 4-bit unsigned binary numbers in dot
notation.
12/02/23 SUBIR KUMAR SARKAR
Examples of Basic Multiplication
Right-shift algorithm Left-shift algorithm
======================== ======================= Fig. 9.2
a
x
1 0 1 0 1 0 1 0
1 0 1 1
a
x
1 0 1 0
1 0 1 1
Examples
======================== ======================= of
p(0) 0 0 0 0 p(0) 0 0 0 0 sequential
+x0a 1 0 1 0 2p(0) 0 0 0 0 0
multipli-
––––––––––––––––––––––––– +x3a 1 0 1 0
2p(1) 0 1 0 1 0 –––––––––––––––––––––––– cation with
p (1)
0 1 0 1 0 p(1) 0 1 0 1 0 right and
+x1a 1 0 1 0 2p(1) 0 1 0 1 0 0 left shifts.
––––––––––––––––––––––––– +x2a 0 0 0 0
2p(2) 0 1 1 1 1 0 ––––––––––––––––––––––––
p(2) 0 1 1 1 1 0 p(2) 0 1 0 1 0 0
+x2a 0 0 0 0 p
2p(2) = (p 0+ x
(j+1) (j)
1j a
0 2k1) 20–1 0 0
––––––––––––––––––––––––– +x1a |–––add–––|1 0 1 0 Check:
2p(3) 0 0 1 1 1 1 0 ––––––––––––––––––––––––
|––shift right––| 10 11
p (3)
0 0 1 1 1 1 0 p (3)
0 1 1 0 0 1 0
+x3a 1 0 1 0 2p (3)
0 1 1 0 0 1 0 0
= 110
––––––––––––––––––––––––– +x0a 1 0 1 0 = 64 + 32 +
2p(4) 0 1 1 0 1 1 1 0 –––––––––––––––––––––––– 8+4+2
p (4)
0 1 1 0 1 1 1 0 p(4) 0 1 1 0 1 1 1 0
======================== =======================
12/02/23 SUBIR KUMAR SARKAR
Examples of Basic Multiplication (Continued)
Right-shift algorithm Left-shift algorithm
======================== ======================= Fig. 9.2
a 1 0 1 0 a 1 0 1 0 Examples
x 1 0 1 1 x 1 0 1 1
======================== ======================= of
p(0) 0 0 0 0 p(0) 0 0 0 0 sequential
+x0a 1 0 1 0 2p(0) 0 0 0 0 0
multipli-
––––––––––––––––––––––––– +x3a 1 0 1 0
2p(1) 0 1 0 1 0 –––––––––––––––––––––––– cation with
p (1)
0 1 0 1 0 p(1) 0 1 0 1 0 right and
+x1a 1 0 1 0 2p(1) 0 1 0 1 0 0 left shifts.
––––––––––––––––––––––––– +x2a 0 0 0 0
2p(2) 0 1 1 1 1 0 ––––––––––––––––––––––––
p(2) (j+1) 0 1 1 (j) 1 1 0 p(2) 0 1 0 1 0 0
+x2a p =0 0 0 0 k–j–1a
2 p + x 2p(2) 0 1 0 1 0 0 0
|shift|
––––––––––––––––––––––––– +x1a 1 0 1 0 Check:
2p(3) 0 0|––––add––––|
1 1 1 1 0 –––––––––––––––––––––––– 10 11
p (3)
0 0 1 1 1 1 0 p(3) 0 1 1 0 0 1 0
+x3a 1 0 1 0 2p (3)
0 1 1 0 0 1 0 0
= 110
––––––––––––––––––––––––– +x0a 1 0 1 0 = 64 + 32 +
2p(4) 0 1 1 0 1 1 1 0 –––––––––––––––––––––––– 8+4+2
p (4)
0 1 1 0 1 1 1 0 p(4) 0 1 1 0 1 1 1 0
======================== =======================
12/02/23 SUBIR KUMAR SARKAR
9.2 Programmed Multiplication
{Using right shifts, multiply unsigned m_cand and m_ier,
storing the resultant 2k-bit product in p_high and p_low.
Registers: R0 holds 0 Rc for counter R0 0 Rc Counter
Ra for m_cand Rx for m_ier
Rp for p_high Rq for p_low} Ra Multiplicand Rx Multiplier
{Load operands into registers Ra and Rx} Rp Product, high Rq Product, low
mult: load Ra with m_cand
load Rx with m_ier
{Initialize partial product and counter}
copy R0 into Rp
copy R0 into Rq
load k into Rc
{Begin multiplication loop}
m_loop: shift Rx right 1 {LSB moves to carry flag}
branch no_add if carry = 0
add Ra to Rp {carry flag is set to cout}
no_add: rotate Rp right 1 {carry to MSB, LSB to carry}
rotate Rq right 1 {carry to MSB, LSB to carry}
decr Rc {decrement counter by 1}
branch m_loop if Rc 0
{Store the product} Fig. 9.3 Programmed
store Rp into p_high
store Rq into p_low
multiplication (right-shift
m_done: ... algorithm).
Multiplier x
(j)
Doublewidth partial product p
Shift
Multiplicand a
0
k xj
0 1
Mux
xj a k p(j+1) = (p(j) + xj a 2k) 2–1
Adder |–––add–––|
cout |––shift right––|
k
1 1
0 1
0 1
0 (11)ten
1 1
0 0 0
1 1
0 1
0 1
0 0
1 0 (110)ten
1 0 1 0 (10)ten
k k–1
To adder To mux control
Multiplier x
(j)
Doublewidth partial product p
Shift
xk-j-1 2k Multiplicand a
0
0 1
Mux
xk-j-1a k
2k-bit adder
cout
2k
Multiplicand
1 0 Enable
Mux
Select
k+1
Justification
2j + 2j–1 + . . . + 2i+1 + 2i = 2j+1 – 2i
Column j
Software aspects:
Optimizing compilers replace multiplications by shifts/adds/subs
Produce efficient code using as few registers as possible
Find the best code by a time/space-efficient algorithm
Hardware aspects:
Synthesize special-purpose units such as filters
y[t] = a0x[t] + a1x[t – 1] + a2x[t – 2] + b1y[t – 1] + b2y[t – 2]
R2 R1 shift-left 1
R3 R2 + R1 Shift, add Shift
R6 R3 shift-left 1 Ri: Register that contains i times (R1)
R7 R6 + R1
R112 R7 shift-left 4 This notation is for clarity; only one
register other than R1 is needed
R113 R112 + R1
R7 R8 – R1
R112 R7 shift-left 4 CPA 1
R119 R112 + R7
119x
R65 R1 shift-left 6 + R1
Chapter Highlights
High radix gives rise to “difficult” multiples
Recoding (change of digit-set) as remedy
Carry-save addition reduces cycle time
Implementation and optimization methods
Add/subtract z i/2 a
control To adder input
Fig. 10.6 The multiple generation part of a radix-4
multiplier based on Booth’s recoding.
0 2a
Mux
0 a x i+1 xi
Old Cumulative Mux
Partial Product
CSA
Adder
Partial Product
k Sum
k
Right
Carry shift
Multiplicand
0
Mux
Upper half of PP Lower half of PP
k-Bit CSA
(a) Multiplier
Carry k k block diagram (b) Operation in a typical cycle
Sum
Fig. 10.8 Radix-2 multiplication with
the upper half of the cumulative partial
k-Bit Adder product kept in stored-carry form.
x x x
i+1 i i-1
Booth recoder
and selector
Old cumulative
z a
partial product i/2
CSA
New cumulative
partial product 2-bit
Adder
Adder FF Extra “dot”
Recoding Logic
0 2a
Mux
0 a x i+1 xi
Old Cumulative Mux
Partial Product
(4; 2)-counter Fig. 10.11 Radix-4 multiplication,
with the cumulative partial product,
CSA
xia, and 2xi+1a combined into two
numbers by two CSAs.
CSA
New Cumulative
Partial Product
2-Bit
Adder FF Adder
CSA CSA
CSA
Fig. 10.12 Radix-16
multiplication with the
upper half of the CSA
cumulative partial Sum
4
product in carry-save Carry 3 FF
4-Bit
Adder
form. Partial Product 4
To the Lower Half
(Upper Half) of Partial Product
CSA CSA
Small CSA
tree Full CSA
Adder
tree
Adder Adder
High-radix
Basic or Full
binary Speed up partial tree Economize tree
Pipelined Pipelined
Radix-8 Radix-8
Booth Booth Beat-1
Recoder Recoder Node 3 Input
& Selector & Selector
Node 1
CSA CSA Beat-2
Input
Sum Sum
Carry Carry
5 6
6-Bit
FF Adder
Node 2 Beat-3
Adder 6
Input
To the Lower Half
of Partial Product
Fig. 10.14 Twin-beat multiplier Fig. 10.16 Conceptual view
with radix-8 Booth’s recoding. of a three-beat multiplier.
Any VLSI circuit computing the product of two k-bit integers must
satisfy the following constraints:
AT grows at least as fast as k3/2
AT2 is at least proportional to k2
The preceding radix-2b implementations are suboptimal, because:
AT = O(k2 log b + bk log k)
AT2 = O((k3/b) log2b)
Chapter Highlights
Tree multiplier = reduction tree
+ redundant-to-binary converter
Avoiding full sign extension in multiplying
signed numbers
Array multiplier = one-sided reduction tree
+ ripple-carry adder
High-radix
Basic or Full
binary Speed up partial tree Economize tree Redundant result
Redundant-to-Binary
Fig. 10.13 High-radix multipliers Converter
as intermediate between sequential Some lower-order
Higher-order product bits are
radix-2 and full-tree multipliers. product bits generated directly
... ...
Small tree of
Large tree of carry-save
carry-save Log- adders
adders depth
Log-
Adder depth Adder
Product Product
Partial-Products
Reduction Tree
2. Partial products reduction tree (Multi-Operand
Addition Tree)
Redundant result
Fig. 11.1
3. Redundant-to-binary converter Redundant-to-Binary
Converter
Some lower-order
Higher-order product bits are
product bits generated directly
Level
Inputs FA
1
FA FA FA
Level-1 FA
carries
11 + 1 = 21 + 3 FA FA FA
Level-2 Level
Therefore, 1 = 8 carries 2
FA
carries are needed FA FA
Level-3
carries FA
Level
Fig. 11.4 3
FA
A slice of a FA
Level-4
balanced-delay carry
tree for 11 Level
FA
FA 4
inputs.
Outputs FA
Level
5
FA
0 1
4-to-2 4-to-2
0 1
FA
4-to-2
c s c s
(a) Binary tree of (4; 2)-counters (b) Realization with FAs (c) A faster realization
Similarly,
using Booth’s
recoding may
not yield any
advantage,
because it
introduces Redundant-to-binary converter
irregularity
1 1
--------------------------------------------------------
-
p p p p p p p p p p
9 8 7 6 5 4 3 2 1 0
a a a a a
Fig. 11.8 c. Baugh-Wooley x4
4
x3
3
x2
2
x1
1
x0
0
_
----------------------------
_ a 4 x0 a3 x0 a2 x0 a1 x0 a0 x0
–a4x0 = a4(1 – x0) – a4 _ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
= a4x 0 – a4 _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
_4 x3 a_3 x3 a_2 x3 a_1 x3 a0 x3
a
a4 x4 a3 x4 a2 x4 a1 x4 a0 x4
–a4 a4x 0 a4 a4
1 x4 x4
a4 --------------------------------------------------------
In next column -
p p p p p p p p p p
9 8 7 6 5 4 3 2 1 0
a4 a3 a2 a1 a0
d. Modified B-W x
4
x
3
x
2
x
1
x
0
__
---------------------------
–a4x0 = (1 – a4x0) – 1 __ a x
4 0
a x
3 0
a x
2 0
a x
1 0
a x
0 0
__ a x a x a x a x a x
= (a4x0) – 1 __ a x a4 x1 a3 x1 a2 x1 a1 x1 0 1
4 2 3 2 2 2 1 2 0 2
a__x a__x a__x a__x a x
4 3 3 3 2 3 1 3 0 3
–1 (a4x0) a x
4 4
a x
3 4
a x
2 4
a x
1 4
a x
0 4
1 1
1 --------------------------------------------------------
-
p p p p p p p p p p
In next column 9 8 7 6 5 4 3 2 1 0
Baugh-Wooley Methods
a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
a4 x3 a3 x3 a2 x3 a1 x3 a0 x3
a4 x4 a3 x4 a2 x4 a1 x4 a0 x4
--------------------------------------------------------
-
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0
-------------------------------------------- -a4 x1 a3 x1 a2 x1 a1 x1 a0 x1
-a4 x2 a3 x2 a2 x2 a1 x2 a0 x2
-------------------------------------------- c. Baugh-Wooley
a
x4
4
a
x3
3
a
x2
2
a
x1
1
a
x0
0
1 --------------------------------------------------------
-
p p p p p p p p p p
-------------------------------------------- 9 8 7 6 5 4
a4
3
a3
2
a2
1
a1
0
a0
__ a x a x a x a x a x
+ x4 xa44 a3x4 a2x4 a1x4 a0x4 __
__
a x
a x
a4 x1
4 0
a x
3
a x 1
3 0
a x
2
a x 1
2 0
a x
1
a x 1
1 0
a x
0 1
0 0
4 2 3 2 2 2 1 2 0 2
a4
a__x a__x a__x a__x a x
1 x4 a x
4 4
4 3
a x
3 4
3 3
a x
2 4
2 3
a x
1 4
1 3
a x
0 4
0 3
1 1
x4 --------------------------------------------------------
-
p
9
p
8
p
7
p
6
p
5
p
4
p
3
p
2
p
1
p
0
--------------------------------------------
12/02/23 SUBIR KUMAR SARKAR
11.4 Partial-Tree and Truncated Multipliers
h inputs
High-radix versus partial- ... Upper part of
the cumulative
tree multipliers: The partial product
difference is quantitative, not (stored-carry)
qualitative CSA Tree
p0
x3 a CSA a3 x1
a2 x1 a1 x1 a0 x1
a4 x1
CSA p1
x4 a a3 x2
a2 x2 a1 x2 a0 x2
a4 x2
CSA p2
a3 x3
a2 x3 a1 x3 a0 x3
a4 x3
CSA
p3
a3 x4
a2 x4 a1 x4 a0 x4
Ripple-Carry Adder a4 x4
p4
ax 0
p9 p8 p7 p6 p5
Fig. 11.11 A basic array
multiplier uses a one-sided CSA Fig. 11.12 Details of a 5 5
tree and a ripple-carry adder. array multiplier using FA blocks.
p9 p8 p7 p6 p5 p4
p2
x3
p3
x4
FA p4
p5
p9 p8 p7 p6
i i Bi Level i
Fig. 11.16 Carry-save Mux
addition, performed in i i
Bi+1
level i, extends the i+1 i+1
CSA Latch
Sum CSA
Carry
Lower part of Sum
FF the cumulative
Carry
Adder partial product
h-Bit Lower part of
Adder FF the cumulative
partial product
Adder h-Bit
Adder
Latched FA
Fig. 11.18 Pipelined 5 5
FA with
AND gate FA
array multiplier using
latched FA blocks. Latch
FA
are latches.
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0
Chapter Highlights
Building a multiplier from smaller units
Performing multiply-add as one operation
Bit-serial and (semi)systolic multipliers
Using a multiplier for squaring is wasteful
adders to synthesize
Multiply Multiply Multiply Multiply
an 88 multiplier.
[12,15] [8,11] [8,11] [4, 7] [8,11] [4, 7] [4, 7] [0, 3]
8
Add
[4, 7]
12 8
Add Add
[8,11] [4, 7]
12
Add
000 [8,11]
Add
[12,15]
xH xL
Mult 1 Mult 3 Mult 2
In 2007, Martin Furer managed to replace the log log k term with
an asymptotically smaller term (for astronomically large numbers)
x [4, 5] [6, 7]
a [4, 7]
[8, 9]
[10,13]
[10,11]
x [6, 7] p[0, 1]
a [4, 7] p[2, 3]
p[4,5]
p[12,15] p [10,11] p [8, 9] p [6, 7]
*
x [4, 5]
*
x [6, 7]
Bit-serial multiplier
…a a a …p p p
?
(Must follow the k-bit 2 1 0
2 1 0
inputs with k 0s;
alternatively, view …x x x
2 1 0
the product as being
only k bits wide)
Sum
FA FA FA FA
Product
Carry (serial out)
e e+d
f f+d
CL CR CL CR
g g–d
h h–d
–d +d
Original delays Adjusted delays
Fig. 12.8 Example of retiming by delaying the inputs to C L
and advancing the outputs from CL by d units
t t t+d1 d1 t
Sum
FA FA FA FA
Product
Carry (serial out)
Fig. 12.7
Multiplicand (parallel in) x0 x1 x2 x3
a3 a2 a1 a0
Multiplier
(serial in)
LSB-first
Sum
FA FA FA FA
Product
Carry (serial out)
Sum
FA FA FA FA
Product
Carry (serial out)
Fig. 12.7
Multiplicand (parallel in) x0 x1 x2 x3
a3 a2 a1 a0
Multiplier
(serial in)
LSB-first
Sum
FA FA FA FA
Product
Carry (serial out)
Fig. 12.10 Systolic circuit for
4 4 multiplication in 15 cycles.
ai x(i - 1) Already
accumulated
cout ai xi c in into three
a i xi
numbers
(5; 3)-counter
2 1 0 xi a(i - 1)
1
s in s out
0
Mux
p
Already output
Fig. 12.11 Building block for a (a) Structure of the bit-matrix
latency-free bit-serial multiplier.
... ai p(i - 1)
... xi a i xi
ai x(i - 1)
... t out t in LSB xi a(i - 1)
... cout c in 0
2p (i ) Shift
right to
... s in s out pi obtain p(i )
FA FA ... FA FA FA
4 4 4 4
Mod-15 CSA
Divide by 16
Mod-15 CSA
Address n inputs
...
Fig. 12.16 One way
to design of a 4 4 .
modulo-13 multiplier. Table . CSA Tree
.
Data
3-input
Fig. 12.17 A method for Modulo-m
Adder
modular multioperand addition.
sum mod m
x4 x 0 x3 x 0 x2 x 0 x1 x 0 x0 x 0
x4 x 1 x3 x 1 x2 x 1 x1 x 1 x0 x 1
x4 x2 x3 x 2 x2 x 2 x1 x 2 x0 x 2
x4 x3 x3 x3 x2 x 3 x1 x 3 x0 x 3
x4 x4 x3 x4 x2 x4 x1 x 4 x0 x 4
p9 p8 p7 p6 p5 p4 p3 p2 p1 p0
Simplify
_
x4 x3 x4 x2 x4 x1 x4 x 0 x3 x 0 x2 x 0 x1 x 0 x0
x4 x3 x2 x3 x 1 x2 x 1 x1
x3 x2 x1x0 –x1x0
p9 p8 p7 p6 p5 p4 p3 p2 0 x0
Fig. 12.18 Design of a 5-bit squarer.
Additive input
Fig. 12.19
Dot matrix for the Dot-notation
4 4 multiplication representations
(c)
of various methods
for performing
Carry-save additive input a multiply-add
Dot matrix for the
operation
4 4 multiplication in hardware.
(d)