Feedforward-Cutset-Free Pipelined Multiply–Accumulate Unit for the Machine Learning Accelerator
Sungju Ryu, Student Member, IEEE, Naebeom Park, and Jae-Joon Kim, Member, IEEE
Abstract— Multiply–accumulate (MAC) computations account for a large part of machine learning accelerator operations. The pipelined structure is usually adopted to improve the performance by reducing the length of critical paths. An increase in the number of flip-flops due to pipelining, however, generally results in significant area and power increase. A large number of flip-flops are often required to meet the feedforward-cutset rule. Based on the observation that this rule can be relaxed in machine learning applications, we propose a pipelining method that eliminates some of the flip-flops selectively. The simulation results show that the proposed MAC unit achieved a 20% energy saving and a 20% area reduction compared with the conventional pipelined MAC.

Index Terms— Hardware accelerator, machine learning, multiply–accumulate (MAC) unit, pipelining.

I. INTRODUCTION

[…] well-known examples for the achievement of a fast column addition, and the carry-lookahead (CLA) adder is often used to reduce the critical path in the accumulator or the final addition stage of the multiplier.

Meanwhile, a MAC operation is performed in the machine learning algorithm to compute a partial sum that is the accumulation of the input multiplied by the weight. In a MAC unit, the multiply and accumulate operations are usually merged to reduce the number of carry-propagation steps from two to one [10]. Such a structure, however, still comprises a long critical path delay that is approximately equal to the critical path delay of a multiplier.

It is well known that pipelining is one of the most popular approaches for increasing the operation clock frequency. Although pipelining is an efficient way to reduce the critical path delays, it results in an increase in the area and the […]
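The partial-sum computation described above is, in software terms, a simple multiply–accumulate loop. The following Python sketch is illustrative only (the function name is ours, not from the paper); it shows the operation that the merged MAC hardware performs with a single carry-propagating addition per step.

```python
# Illustrative sketch (not from the paper): a MAC unit accumulates
# input * weight products into a partial sum, as in a neural-network
# dot product. In the merged hardware structure [10], the multiply and
# the accumulate share a single carry-propagating addition per step.
def mac_partial_sum(inputs, weights, acc=0):
    """Return acc plus the sum of element-wise products."""
    for x, w in zip(inputs, weights):
        acc += x * w  # one multiply-accumulate step
    return acc

print(mac_partial_sum([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

In hardware, one iteration of this loop corresponds to one MAC cycle; the question addressed by the paper is where flip-flops must be placed inside that step when it is pipelined.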
RYU et al.: FCF PIPELINED MAC UNIT FOR THE MACHINE LEARNING ACCELERATOR 3
Fig. 8. Block diagrams of pipelined MAC architectures. (a) State-of-the-art merged MAC [12] with pipelining. (b) Proposed FCF-MAC with MFCF-PA.
Dotted box: FCF area where the flip-flops are removed from the conventional pipelined MAC.
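As a rough software analogy for the column addition discussed in Section IV, the sketch below models one Dadda reduction stage (the function and the simplified adder model are ours, not the authors' design). It shows the property the FCF pipelining exploits: a column that already meets the target height requires no operation in that stage.

```python
# Illustrative model (not the authors' RTL): one Dadda column-addition
# stage reduces each column's bit count toward a target height.
# A full adder compresses 3 bits into 1 sum bit (net height -2) and
# sends 1 carry into the next column; a half adder compresses 2 bits
# into 1 sum bit (net height -1) plus a carry. Columns already at or
# below the target are passed through untouched.
def dadda_stage(heights, target):
    """Reduce column heights to <= target; return the new heights and
    a per-column flag telling whether that column needed any adders."""
    n = len(heights)
    new = list(heights)
    active = [False] * n
    for i in range(n):  # LSB column first, so carries land ahead of us
        while new[i] > target:
            active[i] = True
            if new[i] - target >= 2:
                new[i] -= 2            # full adder
            else:
                new[i] -= 1            # half adder
            if i + 1 < n:
                new[i + 1] += 1        # carry feeds the next column
    return new, active

heights = [1, 2, 3, 4, 3, 2, 1]        # e.g. a 4x4 partial-product array
reduced, active = dadda_stage(heights, target=3)
print(reduced)   # every column now has height <= 3
print(active)    # columns already at the target needed no adders
```

The inactive columns are exactly the nodes that, in the proposed FCF scheme, can be excluded from pipelining because they carry enough timing slack.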
and therefore, the chance of the undesired data transitions in the other RCAs is reduced. Please note that the accuracy of computation is not affected even if the rare case of undesired data transition occurs.

IV. PIPELINED MAC UNIT

The column addition in the MAC operation is for the calculation of binary numbers in each addition stage using the half-adders and/or full adders and then for the passing of the results to the next addition stage. Since MAC computations are based on such additions, the proposed pipelining method can also be applied to the machine learning-specific MAC structure. In this section, the proposed pipelining method is applied to the MAC architecture by using the unique characteristic of the Dadda multiplier. The Dadda multiplier performs the column addition in a similar fashion to the widely used Wallace multiplier, and it has less area and a shorter critical path delay than the Wallace multiplier [13].

Fig. 7 shows the pipelined column addition structures in the Dadda multiplier. The Dadda multiplier performs the column addition to reduce the height of each stage. If a particular column already satisfies the target height for the next column addition stage, then no operation is performed during the stage [13]. Using this property, the proposed pipelining method can be applied to the MAC structure as well. Fig. 7(a) is an example of pipelining where the conventional method is used. All of the edges in the feedforward-cutset are subject to pipelining. On the other hand, in the proposed FCF pipelining case [Fig. 7(b)], if a node in the column addition stage does not need to participate in the height reduction, it can be excluded from the pipelining [the group in the dotted box of Fig. 7(b)]. In other words, in the conventional pipelining method, all the edges in the feedforward-cutset must be pipelined to ensure functional equality regardless of the timing slack of each edge [Fig. 7(a)]. However, in the FCF pipelining method, some edges in the cutset do not necessarily have to be pipelined if the edges have enough timing slack [Fig. 7(b)]. As a result, a smaller number of flip-flops is required compared with the conventional pipelining case. On the other hand, in the Wallace multiplier, as many partial products as possible are involved in the calculation for each column addition stage. Because the partial products do not have enough timing slack to be excluded from pipelining, the effectiveness of the proposed FCF pipelining method is smaller in the Wallace multiplier case than in the Dadda multiplier case.

Fig. 8 shows the block diagrams of pipelined MAC architectures. The proposed MAC architecture [Fig. 8(b)] combines the FCF-MAC (MAC with the proposed FCF pipelining) for the column addition and the MFCF-PA for the accumulation. Instead of pipelining all of the final nodes in the column addition stage as in Fig. 8(a), the proposed FCF-MAC architecture selectively chooses the nodes for the pipelining. For the proposed architecture, the merged multiply–accumulation style is adopted [12]. The final adder is placed in the last stage of the MAC operation. In general, the final adder is designed using the CLA to achieve a short critical path delay. In contrast, the proposed design uses the MFCF-PA style in the accumulation stage in consideration of the greater power and area efficiency of the MFCF-PA, as described in Section III.

V. RESULTS

A. Experimental Setup

We evaluate the proposed FCF pipelining method in this section. First, for application of the accumulator-only case, […]
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 11. Comparison of power dissipation among the CSA + VMA, CondSA, and CLA-based accumulators and the proposed FCF-PA and MFCF-PA. Numbers above the bars indicate power normalized to the CLA-based accumulator.
TABLE I
Extra Clock Cycles for the Proposed Accumulators. Both Convolutional and Fully Connected Layers of AlexNet Are Considered. Baseline Is the CLA-Based Accumulator

TABLE II
Extra Clock Cycles for the Proposed MAC Units. Both Convolutional and Fully Connected Layers of AlexNet Are Considered. Baseline Is MAC + CLA
Fig. 13. Comparison of MAC power consumption between the conventional and proposed pipelining methods. A label (a, b, c) along the x-axis means that the MAC unit consists of an "a"-stage column addition and a "b"-stage final addition, for "c" pipeline stages in total. Numbers above the bars indicate power normalized to the "MAC + CLA" configuration.
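To see why removing cutset flip-flops matters for the power and area numbers, consider a hypothetical register count (the figures below are invented for illustration and are not the paper's measurements): conventional pipelining registers every feedforward-cutset edge, whereas FCF pipelining registers only the edges that lack timing slack.

```python
# Back-of-the-envelope sketch with made-up numbers (not measured data):
# conventional pipelining inserts a flip-flop on every edge crossing
# the feedforward cutset; FCF pipelining inserts one only on edges
# whose nodes must participate in the next height-reduction step.
def flip_flop_counts(edge_needs_pipelining):
    conventional = len(edge_needs_pipelining)   # all cutset edges
    fcf = sum(edge_needs_pipelining)            # only the required ones
    return conventional, fcf

# 1 = edge on the critical height-reduction path, 0 = enough timing slack
edges = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
conv, fcf = flip_flop_counts(edges)
print(conv, fcf)  # 10 vs 6 flip-flops in this invented example
```

The actual savings depend on how many columns in each stage already satisfy the Dadda target height, which is why the benefit is larger for the Dadda multiplier than for the Wallace multiplier.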
VI. CONCLUSION

We introduced the FCF pipelining method in this paper. In the proposed scheme, the number of flip-flops in a pipeline can be reduced by relaxing the feedforward-cutset constraint, thanks to the unique characteristic of the machine learning algorithm. We applied the FCF pipelining method to the accumulator (FCF-PA) design, and then optimized the power dissipation of the FCF-PA by reducing the chance of undesired data transitions (MFCF-PA). The proposed scheme was also expanded and applied to the MAC unit (FCF-MAC). For the evaluation, the conventional and proposed MAC architectures were synthesized in a 65-nm CMOS technology. The proposed accumulator reduced the area and the power consumption by 17% and 19%, respectively, compared with the accumulator based on the conventional CLA adder. In the case of the MAC architecture, the proposed scheme reduced both the area and power by 20%. We believe that the proposed idea to utilize the unique characteristic of machine learning computation for more efficient MAC design can be adopted in many neural network hardware accelerator designs in the future.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[3] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[4] A. Graves, A.-R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2013, pp. 6645–6649.
[5] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 577–585.
[6] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[7] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "Envision: A 0.26-to-10-TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28-nm FDSOI," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2017, pp. 246–247.
[8] C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964.
[9] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza, vol. 34, no. 5, pp. 349–356, Mar. 1965.
[10] P. F. Stelling and V. G. Oklobdzija, "Implementing multiply-accumulate operation in multiplication time," in Proc. 13th IEEE Symp. Comput. Arithmetic, Jul. 1997, pp. 99–106.
[11] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New Delhi, India: Wiley, 1999.
[12] T. T. Hoang, M. Själander, and P. Larsson-Edefors, "A high-speed, energy-efficient two-cycle multiply-accumulate (MAC) architecture and its application to a double-throughput MAC unit," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 12, pp. 3073–3081, Dec. 2010.
[13] W. J. Townsend, E. E. Swartzlander, and J. A. Abraham, "A comparison of Dadda and Wallace multiplier delays," Proc. SPIE, Adv. Signal Process. Algorithms, Archit., Implement. XIII, vol. 5205, pp. 552–560, Dec. 2003, doi: 10.1117/12.507012.
[14] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 3123–3131.
[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 525–542.
[16] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in Proc. 22nd Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2017, pp. 751–764.
[17] A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proc. 44th Annu. Int. Symp. Comput. Archit., Jun. 2017, pp. 27–40.

Sungju Ryu (S'16) received the B.S. degree in electrical engineering from Pusan National University, Busan, South Korea, in 2015. He is currently working toward the Ph.D. degree at the Department of Creative IT Engineering, Pohang University of Science and Technology, Pohang, South Korea. His current research interests include low-power machine learning hardware accelerators, adaptive/resilient circuits, and near-threshold voltage circuit design.
Naebeom Park received the B.S. degree in electrical engineering from the Pohang University of Science and Technology, Pohang, South Korea, in 2016, where he is currently working toward the Ph.D. degree in creative IT engineering. His current research interests include neural network data flow and accelerators.

Jae-Joon Kim (M'04) received the B.S. and M.S. degrees from Seoul National University, Seoul, South Korea, in 1994 and 1998, respectively, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 2004. From 2004 to 2013, he was a Research Staff Member with the IBM T. J. Watson Research Center, contributing to POWER6 and POWER7 microprocessor design. He is currently an Associate Professor at the Department of Creative IT Engineering and Electrical Engineering, Pohang University of Science and Technology, Pohang, South Korea, where he is involved in neuromorphic circuits, machine learning hardware accelerators, and flexible devices/circuits. His research interests include 3-D VLSI integration, robust SRAM cache memory design, and low-voltage circuit designs.