A High-Speed and Low-Complexity Architecture For Softmax Function in Deep Learning
Abstract—Recently, significant improvement has been achieved for hardware architecture design of deep neural networks (DNNs). However, the hardware implementation of the widely used softmax function in DNNs, which involves expensive division and exponentiation units, has not been much investigated. This paper presents an efficient hardware implementation of the softmax function. Mathematical transformations and linear fitting are used to simplify the function. Multiple algorithmic strength reduction strategies and fast addition methods are employed to optimize the architecture. By using these techniques, complicated logic units such as multipliers are eliminated and the memory consumption is largely reduced, while the accuracy loss is negligible. The proposed design is coded in a hardware description language (HDL) and synthesized under the TSMC 28-nm CMOS technology. Synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data. A power efficiency of 463.04 Gb/(mm2 · mW) is achieved while occupying only 0.015 mm2 of area. To the best of our knowledge, this is the first work on efficient hardware implementation of softmax in the open literature.

Index Terms—softmax, hardware implementation, efficient hardware architecture, deep neural network, VLSI

I. INTRODUCTION

Deep neural networks (DNNs) have achieved great success in the big data area since their resurgence in 2006 [1]. Efficient hardware architectures for DNNs are one of the hot topics in both academia and industry [2]-[6]. The softmax layer is widely used in different DNNs [7]-[9]. However, most previous works focus on optimizing matrix multiplications in DNNs, and efficient hardware implementation of the softmax function has not been much investigated. Efficient hardware architectures for the softmax layer are desired for high-speed deep learning systems. The exponentiation and division calculations in the softmax function are quite expensive, especially in embedded systems. A hardware architecture for softmax has been proposed in [10]. However, no detailed discussion of the implementation of the exponential and logarithmic units, which have very high complexity, was presented in [10]. In this paper, we perform an efficient hardware implementation of the softmax function. Our main contributions are as follows:

• Mathematical transformations are used to transform the exponential and logarithmic (exp-log) functions into exp-log operations with a limited input range, together with some other simple operations including constant multiplication, shift and addition. Linear fitting is used to calculate the exp-log function within a specific range. After the optimization, the main calculations in the exponential unit require only a shift unit, two adders and a constant multiplier. Calculations in the logarithmic unit require only a shift unit, a leading one detector (LOD) unit and three adders.

• The constant multiplier is optimized using the algorithmic strength reduction strategy and is replaced by multiple adders. The adders are further optimized using the carry-save method [11]-[12].

The rest of this paper is organized as follows. Section II gives a brief review of softmax and the domain transformation of this function. Section III and Section IV propose designs for the exponential unit and the logarithmic unit, respectively. The proposed architecture of the softmax function is evaluated in Section V. Section VI concludes this paper.

II. BACKGROUND

The softmax function can be expressed as follows:

f(x_i) = e^{x_i} / \sum_{j=1}^{N} e^{x_j}   (i = 1, 2, ..., N),   (1)

where x_1, x_2, ..., x_N are the input values of the softmax layer and the output value f(x_i) represents the probability that the sample belongs to the i-th category.

A mathematical transformation of softmax involving exponential and logarithmic operations is adopted to perform the hardware implementation, as shown in (2):

f(x_i) = exp( ln( e^{x_i} / \sum_{j=1}^{N} e^{x_j} ) )
       = exp( x_i − ln( \sum_{j=1}^{N} e^{x_j} ) )   (i = 1, 2, ..., N).   (2)

III. THE ARCHITECTURE OF THE EXPONENTIAL UNIT

The overall architecture of the softmax function is shown in Fig. 1. It consists of multiple exponential units (EXPU), one adder tree, one natural logarithmic unit (LNU) and some other simple units. The most expensive calculations are in the EXPU and the LNU. In Section III and Section IV, efficient architectures for the two units are introduced.
Carry-select adders [13]-[14] could be employed to replace the Q-bit carry-ripple adder in our design and reduce the critical path further. Considering the balance of area and speed, this is not adopted.

Fig. 3. The architecture for the constant multiplier.

IV. THE ARCHITECTURE OF THE LOGARITHMIC UNIT

A. Mathematical Transformation of Logarithmic Calculation

The LNU computes ln F, where F is the sum of exponentials produced by the adder tree. Writing F = 2^w · k with k ∈ [1, 2), where w is the position of the leading one of F, linear fitting gives

log_2 k ≈ k − 1,   k ∈ [1, 2).   (7)

At last, the calculation of ln F can be expressed as:

ln F = ln 2 · (k − 1 + w).   (8)

B. Proposed Design for the LNU

According to (8), once k and w are known, the value of ln F is easy to obtain. Notice that the value of k is limited within [1, 2), so the calculation of (k − 1 + w) does not require any adder: the fractional part of (k − 1 + w) is equal to that of k, and w is the integer part of (k − 1 + w).

The hardware architecture of the LNU is described in Fig. 4. Note that the output of the LOD is w, and k = F >> w. The binary number 0.1011 is adopted as the approximation of ln 2. The constant multiplier related to 0.1011 involves "ADD-ADD-ADD" operations and is implemented with the same carry-save method as in the EXPU.

Fig. 4. The proposed design for the LNU.
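The following C sketch illustrates the LNU computation of (7) and (8) at the bit level, under the assumption of a 16-bit unsigned integer input F (a hypothetical format; the word length and fixed-point encoding of the real design may differ): a leading-one detector finds w, k = F >> w then lies in [1, 2), and ln F is approximated as ln 2 · (k − 1 + w), with ln 2 replaced by the binary constant 0.1011 as in the text. The arithmetic is done in double precision here for readability.

```c
#include <math.h>
#include <stdio.h>
#include <stdint.h>

/* Bit-level sketch of the LNU approximation ln F ~= ln2 * (k - 1 + w), per (7)-(8).
 * Assumes F is a 16-bit unsigned integer with F >= 1 (hypothetical format). */
static double lnu_approx(uint16_t F)
{
    /* leading-one detector (LOD): w is the position of the most significant set bit */
    int w = 15;
    while (w > 0 && !((F >> w) & 1))
        w--;

    /* k = F >> w with the binary point placed just below the leading one, so k is in [1, 2) */
    double k = (double)F / (double)(1u << w);

    /* ln2 approximated by the binary constant 0.1011 = 11/16, as in the text */
    const double LN2_APPROX = 11.0 / 16.0;

    /* (k - 1 + w) needs no adder in hardware: its fraction comes from k, its integer part is w */
    return LN2_APPROX * ((k - 1.0) + (double)w);
}

int main(void)
{
    /* compare the approximation against the exact natural logarithm */
    for (uint16_t F = 1; F < 1024; F = (uint16_t)(F * 3 + 1))
        printf("F = %4u  approx = %7.4f  exact = %7.4f\n",
               F, lnu_approx(F), log((double)F));
    return 0;
}
```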
The hardware part gives the number N_out to help the software calculate the final results. With this method, the calculation accuracy can be greatly improved with a small increase in computational complexity and without affecting the speed.

The calculation accuracy under an error threshold of 8.00e-5 is shown in Fig. 5. The accuracy is defined as follows:

accuracy = n_correct_results / n_input_numbers,   (9)

where n_correct_results is the number of correct results and n_input_numbers is the total number of input data. Correctness is determined by comparing the results calculated by the proposed architecture using fixed-point numbers with those calculated by C code using floating-point values. If the difference between the two corresponding results is less than the error threshold, the result is counted as correct. As the figure shows, our architecture achieves very high accuracy at a low error threshold.

Fig. 5. The calculation accuracy under the error threshold 8.00e-5 for four sets of data.
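As a small illustration of the metric in (9), the following C routine counts how many fixed-point results fall within the error threshold of their floating-point references. The function name and array-based interface are hypothetical, not the authors' test bench; it is only a sketch of how the comparison described above can be carried out.

```c
#include <math.h>
#include <stddef.h>

/* accuracy = n_correct_results / n_input_numbers, per (9):
 * a result is "correct" when it differs from the floating-point
 * reference by less than the error threshold (e.g. 8.00e-5). */
double softmax_accuracy(const double *hw_results,   /* fixed-point architecture outputs */
                        const double *ref_results,  /* floating-point C reference outputs */
                        size_t n, double threshold)
{
    size_t correct = 0;
    for (size_t i = 0; i < n; i++)
        if (fabs(hw_results[i] - ref_results[i]) < threshold)
            correct++;
    return (double)correct / (double)n;
}
```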
B. Synthesis Results

The architecture is evaluated on two platforms, the Xilinx Zynq-7000 ZC706 development board and the TSMC 28-nm CMOS technology. The synthesis results are presented in Table II. The throughput (T) in the table is defined as the number of outputs per second:

T = (128 bit · f_clk) / (data width).   (10)

The efficiency (E) is defined as:

E = T / (area · power).   (11)

TABLE II
KEY FEATURES OF THE PROPOSED HARDWARE ARCHITECTURE FOR SOFTMAX

Characteristic     FPGA                         TSMC 28nm
Clock Frequency    436 MHz *                    2.8 GHz
Area               2564 LUTs, 2794 Registers    0.015 mm2
Power              0.751 W                      51.60 mW
Throughput         6.976 G/s (8-bit)            44.8 G/s (8-bit)
                   3.488 G/s (16-bit)           22.4 G/s (16-bit)
Efficiency         -                            57.88 G/(mm2 · mW) (8-bit)
                                                28.94 G/(mm2 · mW) (16-bit)

* This frequency is a representation of how fast the architecture can run; the frequency in practical applications will be limited by the speed of data transmission and the clock speed of the development board.

As shown in the table, the data processing speed is sufficient for real-time applications.
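As a quick illustrative check of (10) against the numbers in Table II: with 8-bit data and the 436 MHz FPGA clock, T = (128/8) × 436 MHz = 16 × 436 MHz = 6.976 G outputs per second, and with 16-bit data T = 8 × 436 MHz = 3.488 G/s, matching the FPGA throughput entries in the table.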
VI. CONCLUSION

In this paper, an efficient hardware implementation of the softmax function is presented for deep neural networks. Algorithm-hardware co-optimizations, including multiple algorithmic strength reduction methods, linear fitting and component reuse, are performed to reduce hardware complexity and datapath delay. In addition, regular multipliers and lookup tables are eliminated in this design. With all these optimizations, the proposed architecture achieves a very high efficiency of 463.04 Gb/(mm2 · mW). It therefore demonstrates great potential for many real-time applications on embedded systems.

REFERENCES

[1] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[2] D. Kim, J. Ahn, and S. Yoo, "A novel zero weight/activation-aware hardware architecture of convolutional neural network," in Design, Automation & Test in Europe Conference & Exhibition, 2017, pp. 1466-1471.
[3] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and B. Dally, "Deep compression and EIE: Efficient inference engine on compressed deep neural network," in Hot Chips 28 Symposium, 2017, pp. 1-6.
[4] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, and Y. Wang, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA," 2017.
[5] Z. Wang, J. Lin, and Z. Wang, "Accelerating recurrent neural networks: A memory-efficient approach," IEEE Transactions on Very Large Scale Integration Systems, vol. PP, no. 99, pp. 1-13, 2017.
[6] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, "C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2018, pp. 11-20.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, Nov. 1998.
[8] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764-1772.
[9] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945-4949.
[10] B. Yuan, "Efficient hardware architecture of softmax layer in deep neural network," in System-On-Chip Conference, 2017, pp. 323-326.
[11] T. G. Noll, "Carry-save architectures for high-speed digital signal processing," Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, vol. 3, no. 1-2, pp. 121-140, 1991.
[12] C. Nagendra, M. J. Irwin, and R. M. Owens, "Area-time-power tradeoffs in parallel adders," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, no. 10, pp. 689-702, 2002.
[13] O. J. Bedrij, "Carry-select adder," IRE Transactions on Electronic Computers, vol. EC-11, no. 3, pp. 340-346, 1962.
[14] B. Ramkumar and H. M. Kittur, "Low-power and area-efficient carry select adder," IEEE Transactions on Very Large Scale Integration Systems, vol. 20, no. 2, pp. 371-375, 2012.