Design OfHigh Performance 64 Bit MAC Unit

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

2013 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2013]

Design ofHigh Performance 64 bit MAC Unit


P.Jagadeesh Mr.S.Ravi Dr. Kittur Harish Mallikarjun
M Tech VLSI JJnd year Assistant Professor (senior), SENSE Professor, SENSE
VIT University VIT University VIT University
Vellore,India Vellore,India Vellore,India
pjagadeesh89@gmail.com msravi@vit.ac.in kittur@vit.ac.in

Abstract - A design of high performance 64 bit block. The design consists of 64 bit modified Wallace
Multiplier-and-Accumulator ( MAC) is implemented in multiplier, 128 bit carry save adder and a register.
this paper. MAC unit performs important operation in
many of the digital signal processing (DSP)
This paper is divided into six sections. In the first
applications. The multiplier is designed using modified
section the introduction about MAC unit is discussed.
Wallace multiplier and the adder is done with carry
save adder. The total design is coded with verilog-HDL In the second section discuss about the detailed
and the synthesis is done using Cadence RTL complier operation of MAC unit. The third and fourth section
using typical libraries of TSMC O.18um technology. deals with the operation of modified Wallace
The total MAC unit operates at 217 MHz. The total multiplier and carry save adder respectively. In the
power dissipation is 177.732 mW. fifth section, the obtained result for the 64 bit MAC
unit is discussed and finally the conclusion is made in
Keywords - Modified Wallace multiplier, Carry save the sixth section.
adder, multiplier and accumulator (MAC).

I. INTRODUCnON 11. MAC OPERAnON


MAC unit is an inevitable component in many
The Multiplier-Accumulator (MAC) operation is
digital signal processing (DSP) applications
the key operation not only in DSP applications but
involving multiplications and/or accumulations.
also in multimedia information processing and
MAC unit is used for high performance digital signal
various other applications. As mentioned above,
processing systems. The DSP applications include
MAC unit consist of multiplier, adder and
filtering, convolution, and inner products. Most of
register/accumulator. In this paper, we used 64 bit
digital signal processing methods use nonlinear
modified Wallace multiplier. The MAC inputs are
functions such as discrete cosine transform (DCT) or
obtained from the memory location and given to the
discrete wavelet transforms (DWT). Because they are
multiplier block. This will be useful in 64 bit digital
basically accomplished by repetitive application of
signal processor. The input which is being fed from
multiplication and addition, the speed of the
the memory location is 64 bit. When the input is
multiplication and addition arithmetic determines the
given to the multiplier it starts computing value for
execution speed and performance of the entire
the given 64 bit input and hence the output will be
calculation [1]. Multiplication-and-accumulate
128 bits. The multiplier output is given as the input to
operations are typical for digital filters. Therefore, the
carry save adder which performs addition.
functionality of the MAC unit enables high-speed
filtering and other processing typical for DSP
The function of the MAC unit is given by the
applications. Since the MAC unit operates
following equation [4]:
completely independent of the CPU, it can process F= IPjQj (1)
data separately and thereby reduce CPU load. The
application like optical communication systems
The output of carry save adder is 129 bit i.e. one
which is based on DSP , require extremely fast
bit is for the carry (128bits+1 bit). Then, the output is
processing of huge amount of digital data. The Fast
given to the accumulator register. The accumulator
Fourier Transform (FFT) also requires addition and
register used in this design is Parallel In Parallel Out
multiplication. 64 bit can handle larger bits and have
(PIPO). Since the bits are huge and also carry save
more memory.
adder produces all the output values in parallel, PIPO
register is used where the input bits are taken in
A MAC unit consists of a multiplier and an
parallel and output is taken in parallel. The output of
accumulator containing the sum of the previous
the accumulator register is taken out or fed back as
successive products. The MAC inputs are obtained
one of the input to the carry save adder. The figure 1
from the memory location and given to the multiplier
shows the basic architecture of MAC unit.

978-1-4673-4922-2/13/$31.00 ©2013 IEEE 782


2013 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2013]

the in each stage of the reduction phase is calculated


64 BIT 64 BIT
by the formula

rj+ 1
=
2[ri/3]+rjmod3 (2 )
MULTIPLIER If rj mod3 0, then rj+ 1 = 2r/3
=
(3 )

If the value calculated from the above equation for


number of rows in each stage in the second phase and
the number of row that are fonned in each stage of
the second phase does not match, only then the half
adder will be used. The final product of the second
stage will be in the height of two bits and passed on
to the third stage. During the third stage the output of
the second stage is given to the carry propagation
adder to generate the final output.

ACCUMULATOR • • • • • • • • • • • • • • • • • • •

• • • • • • • • • • • • • • • • •

• • • • • • • • • • • • • • •

• • • • • • • • • • • • •

• • • • • • • • • • •

• • • • • • • • •

• • • • • • •

• • • • •

• • •

·�//////////////:.
Figure I [4]: Basic architecture of MAC unit
• · Z,777777? : ·

III.MODIFIED WALLACE MULTIPLIER


• •
A modified Wallace multiplier is an efficient
hardware implementation of digital circuit
��////////////.:.
multiplying two integers. Generally in conventional • • •
Wallace multipliers many full adders and half adders • 11 .11 .
are used in their reduction phase. Half adders do not • • •

reduce the number of partial product bits. Therefore, :��////////.� • • • •

minirnizing the number of half adders used in a •

multiplier reduction will reduce the complexity [2]. • • • • • • • •

• • •
Hence, a modification to the Wallace reduction is
done in which the delay is the same as for the
• �/�////.
• • • • • •

• • •
conventional Wallace reduction. The modified • • • •
reduction method greatly reduces the number of half • • • • • • • • • •
adders with a very slight increase in the number of • • • • •

full adders [2].


Figure 2[2]: Modified Wallace IO-bit by IO-bit reduction

Reduced complexity Wallace multiplier reduction


consists of three stages [2]. First stage the N x N Thus 64 bit modified Wallace multiplier is
product matrix is formed and before the passing on to constructed and the total number of stages in the
the second phase the product matrix is rearranged to second phase is 10. As per the equation the number
take the shape of inverted pyramid. During the of row in each of the 10 stages was calculated and the
second phase the rearranged product matrix is use of half adders was restricted only to the 10th
grouped into non-overlapping group of three as stage. The total number of half adders used in the
shown in the figure 2, single bit and two bits in the second phase is 8 and the total number of full adders
group will be passed on to the next stage and three that was used during the second phase is slightly
bits are given to a full adder. The number of rows in increased that in the conventional Wallace multiplier.

783
2013 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2013]

Since the 64 bit modified Wallace multiplier is


difficult to represent, a typical lO-bit by 10-bit
reduction shown in figure 2 for understanding. The
modified Wallace tree shows better performance
when carry save adder is used in final stage instead of
ripple carry adder. The carry save adder which is
used is considered to be the critical part in the
multiplier because it is responsible for the largest
amount of computation.

IV. CARRY SAVE ADDER


In this design 128 bit carry save adder [6] is used
since the output of the multiplier is 128 bits (2N).
The carry save adder minirnize the addition from 3 Figure 3 [6]: 8 bit carry save adder
numbers to 2 numbers. The propagation delay is 3
produced, and hence the delay will be equal to that of
gates despite of the number of bits. The carry save
n full adders. However a carry-save adder produces
adder contains n full adders, computing a single sum
all the output values in parallel, resulting in the total
and carries bit based mainly on the respective bits of
computation time less than ripple carry adders. So,
the three input numbers. The entire sum can be
Parallel In Parallel Out (PIPO) is used as an
calculated by shifting the carry sequence left by one
place and then appending a 0 to most significant bit accumulator in the final stage.
of the partial sum sequence. Now the partial sum
sequence is added with ripple carry unit resulting in V. RESULT
n + 1 bit value. The ripple carry unit refers to the

process where the carryout of one stage is fed directly The design is developed using Verilog-HDL and
to the carry in of the next stage. This process is synthesized in Encounter RTL compiler using typical
continued without adding any intermediate carry libraries of TSMC 180nm technology.
propagation.
As a previous work, 8 bit MAC unit is designed
Since the representation of 128 bit carry save using different multipliers and adders. The
adder is infeasible , hence a typical 8 bit carry save multipliers used for comparative study are: (i)
adder is shown in the figure 3[6].Here we are Modified Booth Aigorithm (ii) Dadda Multiplier (iii)
computing the sum of two 128 bit binary numbers, Wallace multiplier. The different adders used in the
then 128 half adders at the first stage instead of 128 study are: (i) Carry Look Ahead (ii) Carry Select
full adder. Therefore , carry save unit comprises of Adder (iii) Carry Save adder. The area, delay and
128 half adders, each of which computes single sum power dissipation comparative table are shown in the
and carry bit based only on the corresponding bits of table 1 and table 2 respectively. The graphs are
the two input numbers. plotted against the different type of 8 bit MAC unit.

If x and y are supposed to be two 128 bit numbers


then it produces the partial products and carry as S 9000
and C respectively.
.,.. 8000 +--....
[ 7000 .., .....-.-..
-. �
Ei 6000
Si xi 1\ yi (4) '3 5000 '- .....
=

<. 4000 ....,


.
Ci xi & yi (5) '-
......: . 3 000 ...
=

� 2000
During the addition of two numbers using a half 1000
o
adder, two ripple carry adder is used. This is due the
u u u u u u u u
fact that ripple carry adder cannot compute a sum bit -« -« -« -«
u
-« -« -« -« -«

without waiting for the previous carry bit to be 2: 2: 2: 2: 2: 2: 2: 2: 2:


--' l.LJ --' --' I UJ
U U � U
l.LJ
U � U u �
2: 2: 2: 0 ,0 0
3: 3: 3:
2
Figure 4: Graph Comparison of Cell Area in J.lm

784
2013 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2013]

Table 1: Area and Delay Report of Different MAC unit is also less when compared other MAC unit which
are done either by modified booth or dadda
N MAC NAME Area Report Delay
multiplier. Hence we have designed a 64 bit MAC
0 Cells Cell (ps) unit using Wallace multiplier and carry save adder.
Area
Puwe:r Dissip a ti o n ' ( n \V)
( 11m2) S 2 000000
'";i
'-'
1 Multiplier Modified Booth 221 7677 4995

Algorithm
.! 1500000
s..
... 1000000
Adder Carry Look Ahead

(MCLMAC)
ä 500000

2 Multiplier Dadda Multiplier 202 7904 4213
� 0
Adder Carry Look Ahead u u u u u u u u u
.q: .q: .q: .q: .q: .q: .q: .q: .q:
:;§; :;§; :;§; :;§; :;§; :;§; :;§; :;§; :;§;
(DCLMAC) --' IW --' UJ --'
u u t3 u u t3 u t3 IW
u
3 Multiplier Wallace Multiplier 90 4398 3084 :;§; :;§; :;§; , 0 ,0 ,0
3 3 3
Adder Carry Look Ahead Figure 5: Graph Comparison of Total Power Dissipation in nW

(WCLMAC)

4 Multiplier Modified Booth 202 7418 4556 The simulation waveform result of 64 bit MAC
unit in shown in the figure 7. In the simulation result
Algorithm
it is observed that there is a delay by one clock cycle,
Adder Carry Select this is because it takes some time to compute since
(MCEMAC) the multiplier used here is 64 bit and the adder used
here is 128 bits.
5 Multiplier Dadda Multiplier 183 7637 4125

Adder Carry Select Table 2: Power Dissipation Report of Different MAC Unit

(DCEMAC)
SI.No MAC Power Report
6 Multiplier Wallace Multiplier 108 5013 1890
NAME
Adder Carry Select
Leakage Dynamic Total
(WCEMAC)
Power Power Power
7 Multiplier Modified Booth 185 6819 4545
(nW) (nW) (nW)
Algorithm

Adder Carry Save 1 MCLMAC 252 1812909 1813161


(MCSMAC)
2 DCLMAC 270 1849037 1849307
8 Multiplier Dadda Multiplier 166 7099 3890
3 WCLMAC 132 903720 903852
Adder Carry Save

(DCSMAC) 4 MCEMAC 242 1713506 1713749


9 Multiplier Wallace Multiplier 61 2947 1026
5 DCEMAC 260 1721111 1721371
Adder Carry Save
6 WCEMAC 156 657640 657797
(WCSMAC)

7 MCSMAC 218 1517795 1518013

8 DCSMAC 237 1508588 1508825


The figure 4 and 5 shows the cell area and total
power dissipation of different MAC unit. The figure 9 WCSMAC 79 435399 435478
6 indicates the graphical comparison of delay. These
figures clearly indicates that the MAC unit which are
developed using Wallace multiplier requires less The table 3 shows the result of area report and
area, dissipate less power and also delay caused by it delay calculation. The cells required for 64 bit MAC

785
2013 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2013]

is 11916 and the cell area occupied by MAC unit is VI. CONCLUSION
542177 11m2. For 64 bit MAC unit, the delay is only Hence a design of high performance 64 bit
4900 ps. Multiplier-and-Accumulator (MAC) is implemented
De1ay(ils) in this paper. The total MAC unit operates at a
frequency of 217 MHz. The total power dissipated
6000 by 64 bit MAC unit is 177.732 mW. The total area
5000 occupied by it is 542177 11m2. Since the delay of 64
bit is less, this design can be used in the system
0.. 4000
which requires high performance in processors
.S 3000 involving large number of bits of the operation. The
i!'
;! 2000 MAC unit is designed using Verilog-HDL and
synthesized in Cadence 180nm RTL Complier.
1000

o REFERENCES
u u u u u u u u (....) [1].Young-Ho Seo and Dong-Wook Kim, "New VLSI
�-:c �-:( �c( �c( « �-:( .0:0-:( .0:.;1; �-:(
Architecture of Parallel Multiplier-Accumulator Based on Radix-2
2: 2: 2: 2: 2: 2: 2: 2:
."...
.ii!
--' w --' --' w Modified Booth Algorithm," IEEE Transactions on very large
u u tJ u
w
(J tJ u U tJ
."...
� 2: 2: Cl Cl Cl
:;: $ $ scale integration (vlsi) systems, vol. 18, no. 2,february 20 10

Figure 6: Graph Comparison of Delay in ps (2). Ron S. Waters and Earl E. Swartzlander, Jr., "A Reduced
Complexity Wallace Multiplier Reduction, " IEEE Transactions
On Computers, vol. 59, no. 8, Aug 20 10
From the figure 6, we can infer less delay caused
[3]. C. S. Wallace, "A suggestion for a fast multiplier," iEEE
by Wallace - Carry Save MAC unit. Hence the
Trans. ElectronComput., vol. EC-13, no. I, pp. 14-17, Feb. 1964
performance of 64 bit MAC unit is high.
[4]. Shanthala S, Cyril Prasanna Raj, Dr.S.Y.Kulkarni, "Design
and VLST Implementation of Pipelined Multiply Accumulate
Unit," IEEE International Conference on Emerging Trends in
Engineering and Technology, ICETET-09

[5]. B.Ramkumar, Harish M Kittur and P.Mahesh Kannan, "ASIC


Implementation of Modified Faster Carry Save Adder ", European
Journal of Scientific Research, Vol. 42, Issue 1, 2010.

[6]. R.UMA, Vidya Vijayan, M. Mohanapriya and Sharon Paul,


"Area, Delay and Power Comparison of Adder Topologies",
International Journal of VLSI design & Communication Systems
(VLSICSj Vo1.3, No.1, February 2012

[7]. V. G. Oklobdzija, "High-Speed VLSI Arithmetic Units:

Figure 7: Simulation Waveform of 64 bit MAC uni!. Adders and Multipliers", in "Design of High-Performance
Microprocessor Circuits", Book edited by A.Chandrakasan,IEEE
Press,2000
Table 3: Area and Delay Report of 64 Bit MAC Unit
[8]. Dadda, "Some Schemes for Parallel Multipliers," Alta
Frequenza, vol. 34, pp. 349-356, 1965
Instance Cells Cell Delay
area(J.1m2) (ps) [9]. C.S. Wallace "A Suggestion for a fast multipliers," IEEE

MAC unit 11916 542177 4900 Trans. Electronic Computers, vol. 13, no.l,pp 14-17, Feb. 1967

[ 10]. L.Dadda, "On Parallel Digital Multiplier", Alta Frequenza,


vol. 45, pp. 574-580, 1976
The table 4 shows the various types of power
dissipation in the MAC unit design. [ 1 1]. WJ. Townsend, E.E. Swartzlander Jr., and J.A. Abraham,
"A Comparison of Dadda and Wallace Multiplier Delays," Proc.
SPIE, Advanced Signal Processing Algorithms, Architectures, and
Table 4: Power Dissipation Report of 64 Bit MAC Unit
Implementations XIII, pp. 552-560, 2003

[ 12]. Fabrizio Lamberti and Nikos Andrikos, " Reducing the


Leakage Dynamic Total Computation Time in (Short Bit-Width) Two's Complement
power(nW) power(nW) power(nW) Multipliers", IEEE transactions on computers, Vol. 60, NO. 2,
1063 177731712 177732776 FEBRUARY 20 1 1

786

You might also like