
2018 IEEE Asia Pacific Conference on Circuits and Systems

A High-Speed and Low-Complexity Architecture for Softmax Function in Deep Learning

Meiqi Wang, Siyuan Lu, Danyang Zhu, Jun Lin, Member, IEEE, and Zhongfeng Wang, Fellow, IEEE
School of Electronic Science and Engineering, Nanjing University, Nanjing, China
Email: mqwang99@gmail.com, sylu@smail.nju.edu.cn, zhudanyang10@foxmail.com, {jlin, zfwang}@nju.edu.cn

Abstract—Recently, significant improvement has been achieved for hardware architecture design of deep neural networks (DNNs). However, the hardware implementation of the widely used softmax function in DNNs has not been much investigated; it involves expensive division and exponentiation units. This paper presents an efficient hardware implementation of the softmax function. Mathematical transformations and linear fitting are used to simplify this function. Multiple algorithmic strength reduction strategies and fast addition methods are employed to optimize the architecture. By using these techniques, complicated logic units such as multipliers are eliminated and the memory consumption is largely reduced, while the accuracy loss is negligible. The proposed design is coded in a hardware description language (HDL) and synthesized under the TSMC 28-nm CMOS technology. Synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data. A power efficiency of 463.04 Gb/(mm² · mW) is achieved, and the design costs only 0.015 mm² of area. To the best of our knowledge, this is the first work on efficient hardware implementation for softmax in the open literature.

Index Terms—softmax, hardware implementation, efficient hardware architecture, deep neural network, VLSI
I. INTRODUCTION

Deep neural networks (DNNs) have achieved great success in the big data area since their resurgence in 2006 [1]. Efficient hardware architectures for DNNs are one of the hot topics in both academia and industry [2]-[6]. The softmax layer is widely used in different DNNs [7]-[9]. However, most previous works focus on optimizing matrix multiplications in DNNs, and efficient hardware implementation of the softmax function has not been much investigated. Efficient hardware architectures for the softmax layer are desired for high-speed deep learning systems. The exponentiation and division calculations in the softmax function are quite expensive, especially in embedded systems. A hardware architecture for softmax has been proposed in [10]. However, no detailed discussion about the implementation of the exponential and logarithmic units, which have very high complexity, was presented in [10]. In this paper, we present an efficient hardware implementation of the softmax function. Our main contributions are as follows:
• Mathematical transformations are used to transform the exponential and logarithmic (exp-log) functions into exp-log operations with a limited input range, together with some other simple operations including constant multiplication, shift, and addition. Linear fitting is used to calculate the exp-log functions within this specific range. After the optimization, the main calculations in the exponential unit require only a shift unit, two adders, and a constant multiplier. Calculations in the logarithmic unit require only a shift unit, a leading one detector (LOD) unit, and three adders.
• The constant multiplier is optimized using the algorithmic strength reduction strategy and is replaced by multiple adders. The adders are further optimized using the carry-save method [11], [12].

The rest of this paper is organized as follows. Section II gives a brief review of softmax and the domain transformation of this function. Sections III and IV propose designs for the exponential unit and the logarithmic unit, respectively. The proposed architecture for the softmax function is evaluated in Section V. Section VI concludes this paper.

II. BACKGROUND

The softmax function can be expressed as follows:

    f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}, \quad i = 1, 2, \ldots, N,    (1)

where x_1, x_2, ..., x_N are the input values of the softmax layer and the output value f(x_i) represents the probability that the sample belongs to the i-th category.

A mathematical transformation of softmax involving exponential and logarithmic operations is adopted to perform the hardware implementation, as shown in (2):

    f(x_i) = \exp\left(\ln\left(\frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}\right)\right) = \exp\left(x_i - \ln\sum_{j=1}^{N} e^{x_j}\right), \quad i = 1, 2, \ldots, N.    (2)
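Before turning to the hardware, the transformation in (2) can be illustrated with a short floating-point routine of the kind the authors later use as their software baseline in Section V; the function name and interface below are ours, not from the paper.

```c
#include <math.h>

/* Floating-point reference for (2): f(x_i) = exp(x_i - ln(sum_j e^{x_j})).
 * Illustrative sketch of the computation, not the hardware datapath. */
void softmax_ref(const double *x, double *f, int n)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)     /* first pass: accumulate e^{x_j}   */
        sum += exp(x[j]);
    double lnF = log(sum);          /* the ln F produced by the LNU     */
    for (int i = 0; i < n; i++)     /* second pass: exp(x_i - ln F)     */
        f[i] = exp(x[i] - lnF);
}
```

Note that the division in (1) disappears: only exponentials, one accumulation, and one logarithm remain, which is exactly what the EXPU, the adder tree, and the LNU of the proposed architecture compute.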
III. THE ARCHITECTURE OF THE EXPONENTIAL UNIT

The overall architecture for the softmax function is shown in Fig. 1. It consists of multiple exponential units (EXPUs), one adder tree, one natural logarithmic unit (LNU), and some other simple units. The most expensive calculations are in the EXPU and the LNU. In Section III and Section IV, efficient architectures for these two units are introduced.



Fig. 1. The overall architecture for the softmax function.

A. Mathematical Transformation of Exponential Calculation

Exponential operations are composed of a number of multiplications. Multiple multiplications will consume plenty of resources and cause long datapath delays in hardware implementation. Also, if the input cannot be limited to a specific range, the value of the exponent will be difficult to calculate. Considering these difficulties, a mathematical transformation is applied to the exponential function.

First, we use the base number 2 in place of the base number e, and the exponential function is changed to:

    e^{y_i} = 2^{y_i \cdot \log_2 e}.    (3)

Then the value y_i · log2 e is split into an integer number u_i and a decimal number v_i. Note that u_i + v_i = y_i · log2 e and u_i = \lfloor y_i \cdot \log_2 e \rfloor. In the hardware implementation, the value y_i · log2 e is a fixed-point number, which makes it easy to split the integer and the decimal parts. The calculation can be simplified as follows:

    e^{y_i} = 2^{u_i + v_i} = 2^{u_i} \cdot 2^{v_i} = \begin{cases} 2^{v_i} \ll u_i, & u_i > 0 \\ 2^{v_i} \gg (-u_i), & u_i \le 0 \end{cases}    (4)

Through this transformation, we replace the original operation f(y_i) = e^{y_i}, -∞ < y_i < +∞, with the operation f(v_i) = 2^{v_i}, 0 < v_i ≤ 1, and a simple shift operation. The operation f(v_i) = 2^{v_i}, 0 < v_i ≤ 1, can be implemented with a lookup table or a linear fitting function, which can greatly reduce the hardware complexity.
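As a concrete illustration of (3), (4) and the linear fit, the following C sketch mimics the exponential datapath in fixed point. The Q4.12 format, the function name, and the bias d = 1 (the endpoint fit of 2^v on [0, 1]) are illustrative assumptions: the design's tuned biases d_1 and d_2 (introduced below) are not given numerically in the text, and the product here stands in for the shift-and-add constant multiplier of Section III-C.

```c
#include <stdint.h>

#define FRAC_BITS 12                                 /* assumed Q4.12 format */
#define ONE       (1 << FRAC_BITS)
#define LOG2E_FX  ((int32_t)(1.442695 * ONE + 0.5))  /* log2(e) in Q4.12 */
#define BIAS_D    ONE   /* endpoint fit 2^v ~ v + 1; the paper tunes d1, d2 */

/* e^y per (3)-(4): t = y*log2(e), split into integer u and fraction v,
 * approximate 2^v by the linear fit v + d, then shift by u. Arithmetic
 * right shift of negative values is assumed (floor behavior). */
static int32_t expu_fx(int32_t y)
{
    int32_t t = (int32_t)(((int64_t)y * LOG2E_FX) >> FRAC_BITS);
    int32_t u = t >> FRAC_BITS;        /* u = floor(y * log2 e)          */
    int32_t v = t & (ONE - 1);         /* fractional part, here in [0,1) */
    int32_t two_v = v + BIAS_D;        /* linear fit of 2^v              */
    return (u >= 0) ? (two_v << u) : (two_v >> -u);
}
```

With the non-positive inputs used in Section V, u ≤ 0 always holds, so the final step is a pure right shift and cannot overflow.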
B. Proposed Design for the EXPU

The hardware architecture of the EXPU is shown in Fig. 2. It consists of two adders, a constant multiplier, and a shift operation. This hardware architecture is designed to support both exponential operations in (2), and the two operations are slightly different. In the second exponential calculation, ln F (the output of the LNU in Fig. 1) needs to be subtracted from the input value. Moreover, the results of the first exponential calculation are accumulated and stored, while those of the second one are taken as the final outputs.

Fig. 2. The proposed design for the EXP Unit.

The range of v_i in f(v_i) = 2^{v_i} is limited to (0, 1], so the function f(v_i) = 2^{v_i} can be approximated by the functions f(x) = x + d_1 or f(x) = x + d_2 within this specific range. The two bias values d_1 and d_2 correspond to the first and second exponential operations, respectively. In this way, no multiplier or lookup table is needed in these approximated functions, and the calculation is simplified with negligible accuracy loss.

C. Improved Architecture for the Constant Multiplier

The constant multiplication related to log2 e appears many times in the proposed architecture, and it is the critical path of the system before optimization. In this subsection, optimizations for this calculation at both the algorithmic level and the architectural level are developed.

First, we use the algorithmic strength reduction strategy to reduce the total number of operations in the constant multiplication. The approximate value of log2 e is 1.0111 (binary), and it can be expressed as 1.0111 = 1 + 0.1 - 0.0001, called the "ADD-ADD-SUB" operation. In this way, we can simplify the three additions and three shifts in the constant multiplier into one addition, one subtraction, and two shifts.

In the hardware implementation, we first transform the "ADD-ADD-SUB" operation into an "ALL-ADD" operation, then adopt the carry-save technique to simplify these additions. The overall architecture of the constant multiplier is shown in Fig. 3.

Fig. 3. The architecture for the constant multiplier.

Suppose E = A · 1.0111. In our implementation, A is an unsigned number, so we have B = A >> 1, D = A >> 4, and E = A + B - D. The subtraction of D is equivalent to adding its two's complement. Defining F = ~D, we have E = A + B + F + 1.

In our architecture, both A and E have Q bits. The critical path of the traditional method is the operation time of 3Q FAs (full adders). If we use the carry-save method, the critical path can be reduced to the operation time of Q + 1 FAs. Also, the total number of FAs can be reduced from 3Q to 2Q.

Compared with the results without optimization, we find that not only is the critical path reduced, but the area and power consumption are also significantly reduced due to this simplification. The carry-select method [13], [14] could be used to replace the Q-bit carry-ripple adder in our design and further reduce the critical path; considering the balance of area and speed, it is not adopted.
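Under these assumptions (unsigned A, "ALL-ADD" rewriting), the constant multiplier can be sketched in C as below. The 32-bit width and the function name are illustrative; the XOR/majority pair mirrors one carry-save (3:2) stage of Fig. 3.

```c
#include <stdint.h>

/* E = A * 1.0111b via "ADD-ADD-SUB" rewritten as "ALL-ADD":
 * E = A + B - D = A + B + (~D) + 1, with B = A>>1 and D = A>>4. */
static uint32_t const_mult_log2e(uint32_t A)
{
    uint32_t B = A >> 1;                    /* +0.1b contribution        */
    uint32_t F = ~(A >> 4);                 /* -(0.0001b) as ~D          */
    /* Carry-save (3:2) stage: compress A + B + F into sum and carry
     * with no carry propagation; one carry-propagate add finishes it,
     * and the trailing +1 completes the two's complement of D.       */
    uint32_t sum   = A ^ B ^ F;
    uint32_t carry = ((A & B) | (A & F) | (B & F)) << 1;
    return sum + carry + 1;                 /* = A + B - D (mod 2^32)    */
}
```

For example, A = 16 gives B = 8, D = 1, and the function returns 23 = 16 + 8 - 1, matching E = A + B - D.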

IV. THE ARCHITECTURE OF THE NATURAL LOGARITHMIC UNIT

The LNU has a positive input F = \sum_{j=1}^{N} e^{x_j}, and the output is ln F. By limiting the input range of the logarithmic function, the calculation in the LNU can be simplified.

A. Mathematical Transformation of Logarithmic Calculation

We define k ∈ [1, 2) and w as an integer. The real number pair (k, w) satisfies the following condition:

    \forall F > 0, \exists! (k, w) : F = 2^w \cdot k.    (5)

A leading one detector (LOD) is required to compute k and w in the proposed architecture. In the LNU, the input of the LOD is F, and the output is the position of the highest bit "1" of F. It is easy to calculate the values of k and w if the decimal point and the highest one's position are given. An example of the relationship between F and the LOD's output (k and w) is shown in TABLE I, assuming that the width of F is 8 and the decimal point is between the fifth bit and the sixth bit.

TABLE I
AN EXAMPLE OF THE RELATIONSHIP BETWEEN F AND (k, w)

F (Binary)   Highest 1 in F   k (Binary)   w
11011.111    0                1.1011111    4
01010.101    1                1.0101010    3
00001.101    4                1.1010000    0
00000.011    6                1.1000000    -2

With this decomposition,

    \ln F = \ln 2 \cdot \log_2 F = \ln 2 \cdot (w + \log_2 k).    (6)

According to (5) and (6), the range of the input of the logarithmic operation is limited within [1, 2). To compute log2 k, a linear fitting is used:

    \log_2 k \approx k - 1, \quad k \in [1, 2).    (7)

At last, the calculation of ln F can be expressed as:

    \ln F = \ln 2 \cdot (k - 1 + w).    (8)

B. Proposed Design for the LNU

According to (8), once k and w are known, the value of ln F is easy to get. Notice that the value of k is limited within [1, 2), so the calculation of (k - 1 + w) does not require any adder: the fractional part of (k - 1 + w) is equal to that of k, and w is the integer part of (k - 1 + w).

The hardware architecture for the LNU is described in Fig. 4. Note that the output of the LOD is w, and k = F >> w. The binary number 0.1011 is adopted as the approximation of ln 2. The constant multiplier related to 0.1011 involves "ADD-ADD-ADD" operations. It is implemented with the same carry-save method as in the EXPU.

Fig. 4. The proposed design for the LNU.
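A fixed-point sketch of (5)-(8) is given below. The Q.12 format, the 32-bit width, and the function name are assumptions; with FRAC_BITS set to 3 it reproduces the rows of TABLE I, except that k is truncated to the working format here, whereas the hardware keeps k at full width. The product again stands in for the "ADD-ADD-ADD" shift-and-add multiplier.

```c
#include <stdint.h>

#define FRAC_BITS 12                            /* assumed Q.12 format */
#define ONE       (1 << FRAC_BITS)
#define LN2_FX    ((int32_t)(0.6875 * ONE))     /* ln 2 ~ 0.1011b (Sec. IV-B) */

/* ln F per (5)-(8): a leading-one detector gives w, k = F >> w with
 * k in [1, 2), and ln F ~ ln2 * (k - 1 + w). Requires F > 0. */
static int32_t lnu_fx(uint32_t F)
{
    int m = 31;
    while (!(F & (1u << m))) m--;               /* LOD: highest '1' bit  */
    int w = m - FRAC_BITS;                      /* F = 2^w * k           */
    uint32_t k = (w >= 0) ? (F >> w) : (F << -w);   /* k = F >> w        */
    int32_t s = (int32_t)(k - ONE) + w * ONE;   /* (k - 1 + w): fraction
                                                   from k, integer is w  */
    return (int32_t)(((int64_t)LN2_FX * s) >> FRAC_BITS);
}
```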


V. EXPERIMENTAL RESULTS

A. Simulation Results and Discussion on Accuracy

We choose four sets of random values with different ranges, denoted rand0.1, rand1, rand5, and rand10, as the input data for the experiment. Each set has 4096 numbers within the corresponding range; for instance, the numbers in set rand10 are random values lying in the range [-10, +10]. In order to meet the accuracy needs of different application scenarios, our architecture is designed to support inputs and outputs of both 8-bit and 16-bit widths. We take 128 bits of input data at a time and use 16 EXPUs to process the data concurrently. These units can be reconfigured to support sixteen 8-bit inputs or eight 16-bit inputs.

The values x_i - x_max (i = 1, 2, ..., N) are used as the input of the hardware architecture, which brings two advantages. On the one hand, all input data are non-positive, ensuring that the results of the exponential operation are not greater than 1 and thus avoiding numerical overflow. On the other hand, the calculations related to the input values don't involve any sign bit, which saves storage space and simplifies the calculation.

In the experiment, we preprocess the input data with software to transform the floating-point numbers into fixed-point numbers. The software also gives the value N_in, which indicates where the decimal point of these fixed-point numbers is. The hardware architecture uses the input data together with this value N_in to get the output of the softmax function.

Similarly, the hardware part gives the number N_out to help the software calculate the final results. With this method, the calculation accuracy can be greatly improved with a small increase in computational complexity and without affecting the speed.

The calculation accuracy under an error threshold of 8.00e-5 is shown in Fig. 5. The accuracy is defined as follows:

    \text{accuracy} = \frac{n_{\text{correct results}}}{n_{\text{input numbers}}},    (9)

where n_correct_results is the number of correct results and n_input_numbers is the total number of input data. Here, correctness is determined by comparing the results calculated by the proposed architecture using fixed-point numbers with those calculated by C code using floating-point values. If the difference between the two corresponding results is less than the error threshold, the result is counted as correct. As we can see, our architecture achieves very high accuracy at a low error threshold.

Fig. 5. The calculation accuracy under the error threshold 8.00e-5 for four sets of data.
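The metric in (9) amounts to a simple thresholded comparison; a minimal sketch, with the interface and names assumed rather than taken from the paper:

```c
#include <math.h>

/* Accuracy per (9): the fraction of fixed-point results that fall within
 * the error threshold of the floating-point reference. */
double accuracy(const double *hw_out, const double *ref, int n, double eps)
{
    int correct = 0;
    for (int i = 0; i < n; i++)
        if (fabs(hw_out[i] - ref[i]) < eps)
            correct++;                 /* a "correct result" per the text */
    return (double)correct / n;        /* n = 4096 per input set here     */
}
```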
B. Synthesis Results

The architecture is evaluated on two platforms, the Xilinx Zynq-7000 ZC706 development board and the TSMC 28-nm CMOS technology. The synthesis results are presented in TABLE II. The throughput (T) in the table is defined as the number of outputs per second:

    T = \frac{128\,\text{bit} \cdot f_{clk}}{\text{data width}}.    (10)

The efficiency (E) is defined as:

    E = \frac{T}{\text{area} \cdot \text{power}}.    (11)

TABLE II
KEY FEATURES OF THE PROPOSED HARDWARE ARCHITECTURE FOR SOFTMAX

Characteristic     FPGA                        TSMC 28nm
Clock Frequency    436 MHz¹                    2.8 GHz
Area               2564 LUTs, 2794 Registers   0.015 mm²
Power              0.751 W                     51.60 mW
Throughput         6.976 G/s (8-bit)           44.8 G/s (8-bit)
                   3.488 G/s (16-bit)          22.4 G/s (16-bit)
Efficiency         -                           57.88 G/(mm² · mW) (8-bit)
                                               28.94 G/(mm² · mW) (16-bit)

¹ This frequency is a representation of how fast the architecture can run. The frequency in practical applications will be limited by the speed of data transmission and the clock speed of the development board.

As is shown in the table, the data processing speed is sufficient for real-time applications.
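As a quick sanity check, (10) and (11) reproduce the 8-bit TSMC 28-nm column of TABLE II:

```c
#include <stdio.h>

/* Cross-checking (10) and (11) against TABLE II: 128-bit words at
 * f_clk = 2.8 GHz, 8-bit data, 0.015 mm^2, 51.60 mW. */
int main(void)
{
    double T = 128.0 / 8.0 * 2.8e9;     /* (10): 44.8e9 outputs/s        */
    double E = T / (0.015 * 51.60);     /* (11): ~57.88e9 /(mm^2 * mW)   */
    printf("T = %.1f G/s, E = %.2f G/(mm2.mW)\n", T / 1e9, E / 1e9);
    return 0;   /* 57.88 * 8 bits matches the 463.04 Gb/(mm^2 * mW) figure */
}
```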
& Digital Signal Processing, vol. 43, no. 10, pp. 689702, 2002.
VI. C ONCLUSION [13] O. J. Bedrij, “Carry-select adder,” Electronic Computers Ire Transactions
on, vol. EC-11, no. 3, pp. 340346, 1962.
In this paper, an efficient hardware implementation of [14] B. Ramkumar and H. M. Kittur, “Low-power and area-efficient carry se-
softmax function is presented for deep neural networks. lect adder,” IEEE Transactions on Very Large Scale Integration Systems,
Algorithm-hardware co-optimizations, including multiple al- vol. 20, no. 2, pp. 371375, 2012.
gorithm strength reduction methods, linear fitting and compo-
nent reuse are performed to reduce hardware complexity and
