
A Time-Domain Computing-in-Memory Based Processor Using Predictable Decomposed Convolution for Arbitrary Quantized DNNs
Jianxun Yang∗, Yuyao Kong†, Zhao Zhang∗, Zhuangzhi Liu∗, Jing Zhou∗, Yiqi Wang∗, Yonggang Liu∗,
Chenfu Guo∗, Te Hu∗, Congcong Li∗, Leibo Liu∗, Jin Zhang‡, Shaojun Wei∗, Jun Yang†, and Shouyi Yin∗§
∗Institute of Microelectronics, Beijing Innovation Center for Future Chips, Tsinghua University, Beijing, China
∗Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
†School of Electronic Science and Engineering, Southeast University, Nanjing, China
‡Ingenic Semiconductor Co., Beijing, China    §Corresponding author: yinsy@tsinghua.edu.cn

Abstract—Time-domain computing-in-memory (TD-CIM) architectures offer superior flexibility, accuracy and scalability for deep neural networks (DNNs). However, three challenges remain for accelerating multi-bit DNNs, caused by their inferior bit-split convolution computation: poor network adaptability, massive redundant computations, and high quantization energy and error. This work first proposes a unique-weight kernel decomposition based convolution computation method for CIMs to accelerate multi-bit DNNs. A TD-CIM based processor is then designed with three techniques to further address the above challenges: 1) a cross-flipping based fast kernel decomposer to reduce memory accesses for various DNNs, 2) a dual-mode complementary predictor to remove redundant computations, and 3) an activation-weight-adaptive pulse quantizer to reduce quantization energy and error. Fabricated in 28nm and evaluated with 1-to-8bit AlexNet and VGG16, the processor achieves a peak energy efficiency of 2.4-to-152.7TOPS/W.
I. INTRODUCTION
Computing-in-memory (CIM) is a promising architecture for accelerating deep neural networks (DNNs), since it avoids massive data movement and performs computations in the analog domain. Compared to voltage- [1, 2] and frequency-domain [3, 4] CIM, time-domain CIM (TD-CIM) [5, 6] processes multi-bit data as pulse-width-modulated signals, requiring low toggle activity and voltage headroom. Therefore, TD-CIM is well suited to accelerating DNNs.

Fig. 1. (a) Bit-split convolution and (b)-(d) Challenges on TD-CIM.

Multi-bit DNNs are imperative for applications requiring high accuracy. Like voltage-domain CIMs, previous TD-CIMs accelerate multi-bit DNNs using bit-split convolution computation (BSC), as shown in Fig. 1(a). By splitting one k-bit vector into multiple h-bit vectors (h ≤ k), the convolution is decomposed into multiple inner-products, each of which is multiplied with one scaling factor. Although BSC makes it possible for CIMs to accelerate multi-bit DNNs, three challenges remain, as shown in Fig. 1(b)-(d): 1) Poor network adaptability: Unlike linear quantized DNNs (LQ-DNNs) with equal quantization intervals, non-linear quantized DNNs (NLQ-DNNs), which offer higher accuracy, cannot be computed by BSC using quantization indexes because their quantization intervals are unequal. 2) Massive redundant computations: Since ReLU sets negative values to zero, if an output activation can be predicted to be negative before the convolution finishes, the remaining computations are redundant and can be skipped. Such a prediction holds when the partial sums of both the calculated and the remaining computations are below zero. However, each term in BSC (ei,j × (Wi · Xj)) is always positive, so the partial sum of the remaining computations is always positive, making the prediction invalid. 3) High quantization energy and error: The value ranges of all inner-products (Wi · Xj) in BSC are identical, so the pulses of all results have to be quantized using the same sample frequency, leading to high quantization energy or error when a wide pulse is quantized with a high frequency or a narrow pulse with a low frequency.

To address these issues, we first propose a unique-weight kernel decomposition based convolution computation method (UWKDC), making TD-CIMs feasible for accelerating arbitrary quantized DNNs with less computation overhead. A TD-CIM based processor is then designed with three techniques: 1) a cross-flipping based fast kernel decomposer (CFFKD) to reduce memory accesses by 1.91-to-3.83×, 2) a dual-mode complementary predictor (DMCPP) to remove 2.32-to-5.46× redundant computations, and 3) an activation-weight-adaptive pulse quantizer (AWAPQ) to reduce quantization energy by 1.41-to-1.92× and accuracy loss by 0.37%-to-3.49%.

II. UNIQUE-WEIGHT KERNEL DECOMPOSITION BASED CONVOLUTION COMPUTATION

Fig. 2(a) presents the UWKDC principle, which reorders and aggregates computations with the same weight. It decomposes one k-bit quantized kernel into 2^k pairs of a unique weight (wi) and a bitmap (Bi). The standard convolution is decomposed into 2^k sparse convolutions of bitmaps, requiring only 2^k multiplications and at most n additions. Each sparse convolution computes the sum of the activations multiplied with the same weight.
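As a concrete, software-only illustration of the decomposition, the sketch below splits a toy quantized kernel into unique-weight/bitmap pairs and checks that summing the 2^k sparse bitmap convolutions reproduces the standard convolution. The helper names (`decompose_kernel`, `uwkdc_convolve`) are ours, not from the paper.

```python
import numpy as np

def decompose_kernel(idx_kernel, codebook):
    """Split a k-bit quantized kernel (stored as quantization indexes) into
    2**k pairs of (unique weight w_i, binary bitmap B_i)."""
    return [(w, (idx_kernel == i).astype(np.int32))
            for i, w in enumerate(codebook)]

def uwkdc_convolve(window, idx_kernel, codebook):
    """One sliding-window convolution as 2**k sparse bitmap convolutions:
    only 2**k multiplications and at most n additions (n = window size)."""
    return sum(w * float((window * bitmap).sum())
               for w, bitmap in decompose_kernel(idx_kernel, codebook))

# Toy 2-bit example; the codebook may be non-uniform, i.e. an NLQ-DNN.
codebook = np.array([-1.0, -0.25, 0.5, 1.0])            # 2**k = 4 unique weights
idx_kernel = np.random.randint(0, 4, size=(3, 3))       # index kernel
window = np.random.rand(3, 3)                           # one sliding window

reference = float((window * codebook[idx_kernel]).sum())  # standard convolution
assert np.isclose(uwkdc_convolve(window, idx_kernel, codebook), reference)
```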

Fig. 2. (a) Principle and (b) three features of proposed unique-weight kernel decomposition based convolution computation.

Fig. 3. (a) Overall architecture. (b) Architecture of CIM macro.


Three algorithmic features of UWKDC can be utilized by TD-CIMs to accelerate DNNs. 1) Bitmap-based convolution to accelerate arbitrary quantized DNNs: Feature maps are convolved with bitmaps rather than actual weight kernels or index kernels, so LQ-DNNs and NLQ-DNNs with arbitrary precision can be computed on the same TD-CIM using bitmaps. 2) Predictable output activations to remove redundant computations: Due to the commutative law of addition, the convolutions of the bitmaps of one kernel in UWKDC can be reordered. Utilizing ReLU, if the convolution is computed from positive to negative weights, the computation of negative output activations can be early-stopped by prediction. 3) Shrinking value range of convolution results to reduce pulse quantization energy and error: The sum of the convolution results of all bitmaps of one kernel is a constant that equals the sum of all activations in the sliding window. So with every convolution computed, the upper limit of the next convolution result gradually decreases, offering a chance to dynamically tune the sample frequency according to the shrinking pulse width to minimize quantization energy and error.

III. TIME-DOMAIN CIM BASED PROCESSOR

The designed processor is shown in Fig. 3(a), including a main controller, a 72KB weight index SRAM, a 128KB activation SRAM, and three featured modules: 1) a CFFKD to generate bitmaps, 2) a DMCPP to predict output activations for early-stopping computations, and 3) a TD-CIM with AWAPQ to compute the convolution of bitmaps with feature maps.

The TD-CIM is improved on the basis of [5], consisting of two tiles with eight CIM macros. As shown in Fig. 3(b), each macro includes a pulse generator, a pulse quantizer, an activation memory, a latch array for bitmaps and 9×128 pulse delay cells (PDCs). Each PDC computes the multiplication of an 8b activation and a 1b bitmap element using 2 delay chains. Each delay chain employs 2 pulse width modulation units (PWMUs) to compute the multiplications of two 2bit activations with one 1-bit weight by scaling the pulse width, controlled by voltage via a 2:4 decoder. The bypass pulse path is designed to exploit the high sparsity of bitmaps to reduce computation energy.

Fig. 4. Computation dataflow.

The processor reuses bitmaps by convolving 2 batches with the same kernel on 2 tiles. Since UWKDC convolves one sliding window repetitively with each bitmap of all kernels, we adopt an input-stationary dataflow to maximize activation reuse, as shown in Fig. 4. The bitmaps of one kernel are serially generated by CFFKD and sent to the TD-CIMs for convolution computation, and the results are fed into AWAPQ to perform prediction, which decides the next bitmap to compute. Once a prediction succeeds, the bitmaps of the next kernel are processed. To utilize all macros, the kernel parallelism is adjusted according to the kernel sizes.
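For illustration only, this prediction-driven, input-stationary loop can be modeled in software roughly as follows. The function and variable names are ours, the prediction threshold is assumed to be zero (ReLU), and the real scheduling runs in hardware across two tiles and eight macros rather than in a Python loop.

```python
def process_window(window, kernels, decompose, cim_convolve, predict):
    """Software model of the dataflow: one sliding window (input-stationary)
    is reused against every bitmap of every kernel; a successful prediction
    skips the remaining bitmaps of the current kernel."""
    outputs = []
    for kernel in kernels:
        acc = 0.0
        for weight, bitmap in decompose(kernel):    # bitmaps generated serially
            acc += weight * cim_convolve(window, bitmap)
            if predict(acc, weight, kernel):        # output foreseen to be <= 0
                acc = 0.0                           # ReLU would zero it anyway
                break                               # move on to the next kernel
        outputs.append(max(acc, 0.0))               # ReLU
    return outputs
```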

A. Cross-Flipping based Fast Kernel Decomposer

Since storing all bitmaps would require tremendous memory capacity, we serially generate bitmaps on-line to alleviate memory pressure by comparing the index kernel for equality with each quantization index, as shown in Fig. 5(a). To support multi-bit DNNs, we use bit-serial word-parallel equality comparison (XNOR). Conventionally, k-bit indexes are encoded in ascending order. To generate one bitmap Bi of one index di, all bits of each element Di in the index kernel need to be read from memory and compared with the bits of the index di using XNOR operations. Since the indexes are unrelated to each other, generating all bitmaps requires the index kernel to be read from memory repetitively and compared with all bits of all indexes, consuming massive memory accesses and latency.

Fig. 5. Cross-flipping based fast kernel decomposer: (a) Principle, (b) Architecture.

Fortunately, the bitmaps are decided only by the actual weight kernel, so actual weights can be arbitrarily encoded and matched with quantization indexes. Besides, we observe that for data a, b and ∼b (flipping all bits of b), the equality of both (a, b) and (a, ∼b) can be obtained from the bitwise XNOR results of a and b by logic AND (CA) and NOR (∼CO). Based on that, we propose a cross-flipping order to encode indexes, reducing comparison overhead by reusing intermediate results. Every four successive indexes form one group. The latter three indexes in a group are generated by flipping the high ([∼H, L]), low ([H, ∼L]) and all bits ([∼H, ∼L]) of the first index ([H, L]), respectively. Accordingly, we only need to compute the logic AND and NOR of the equality results of the high (H0 CA, ∼H0 CO) and low (L0 CA, ∼L0 CO) bits of the first index against the index kernel once; the four intermediate results can then be reused to generate the bitmaps of the four indexes in the group (B0, B1, B2, B3). Consequently, both memory access and latency are reduced by 1.91-to-3.83× for 1-to-8bit DNNs. The architecture of CFFKD consists of one index bit loader, which computes the SRAM address of the next compared bits of the index kernel, and 4608 parallel cross-flipping equality comparators. Each comparator has a comparing module, reused to generate the comparison results for the high and low bits, and a flipping module that reuses the comparison results to generate bitmaps. CFFKD has no impact on DNN accuracy.
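A rough functional model of the cross-flipping reuse is sketched below (array-level Python; the hardware uses XNOR gates with AND/NOR reduction trees, and the helper name `bitmap_group` is ours). One pass of bitwise XNOR against the group's first index yields the bitmaps of all four indexes in the group.

```python
import numpy as np

def bitmap_group(idx_kernel_bits, first_index_bits, k):
    """Generate the four bitmaps of one cross-flipping group from a single
    bitwise-XNOR pass against the group's first index.
    idx_kernel_bits:  (n, k) array of 0/1, one row per kernel element.
    first_index_bits: (k,) array of 0/1, the group's first index [H, L]."""
    h = k // 2                                    # split into high / low halves
    xnor = 1 - (idx_kernel_bits ^ first_index_bits)
    hi, lo = xnor[:, :h], xnor[:, h:]
    h_eq   = hi.all(axis=1)    # high half equals  H  (logic AND reduction, CA)
    h_flip = ~hi.any(axis=1)   # high half equals ~H  (logic NOR reduction, ~CO)
    l_eq   = lo.all(axis=1)
    l_flip = ~lo.any(axis=1)
    return [
        (h_eq   & l_eq  ).astype(int),   # bitmap of [ H,  L]
        (h_flip & l_eq  ).astype(int),   # bitmap of [~H,  L]
        (h_eq   & l_flip).astype(int),   # bitmap of [ H, ~L]
        (h_flip & l_flip).astype(int),   # bitmap of [~H, ~L]
    ]
```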
B. Dual-Mode Complementary Predictor

Fig. 6. Dual-mode complementary predictor.

The computation sequence of the unique-weights (bitmaps) of one kernel in UWKDC can be arbitrary. To remove the maximum number of redundant computations for all DNNs, we propose DMCPP, which realizes two kinds of prediction by reordering the computing sequence of unique-weights: present-accumulated-value prediction (PAVP) and prospective-hypothetic-extremum prediction (PHEP), as shown in Fig. 6. Let SUM1 and SUM2 denote the sums of the convolution results of the calculated and the remaining bitmaps, respectively. PAVP computes the bitmaps from positive to negative weights and predicts successfully if SUM1 is less than a threshold (Th), since SUM2 must be negative. Conversely, PHEP computes the bitmaps from negative to positive weights and computes the maximum of the remaining computations, max(SUM2), by multiplying the sum of all remaining activations (SXre) with the largest remaining positive weight (wux). If SUM1 + max(SUM2) ≤ Th, the final activation must be less than Th, so the prediction succeeds. The architecture of DMCPP is designed to reuse hardware resources for the two predictions. The unique-weight scheduling unit controls the computation sequence of the bitmaps. The predicting unit takes the actual weight decoded from the weight codebook and the convolution result from the TD-CIMs, and performs prediction through two datapaths. The prediction method with the fewest computations is assigned off-line for each kernel based on its weight distribution. Evaluated with 1-to-8bit VGG16, DMCPP reduces computations by 2.32-to-5.46×. Convolution results are not changed by DMCPP, so network accuracy is not impacted.

C. Activation-Weight-Adaptive Pulse Quantizer

Fig. 7. Activation-weight-adaptive pulse quantizer: (a) Principle, (b) Optimal frequency search, (c) Reduction of energy and accuracy loss, (d) Architecture.

In UWKDC, quantization energy and error are affected by the pulse width of the convolution results and the scaling factor of the actual weights, as shown in Fig. 7. It is therefore imperative to adaptively select the sample frequency according to the convolution results and actual weights before pulse computation. Fortunately, the convolution results in UWKDC keep shrinking, so the width range of the next pulse can be speculated before computation. Besides, the computation order of the unique-weights is fixed by PAVP and PHEP, so the weight value is also known in advance. Based on that, we propose AWAPQ, which tunes the quantizer using four levels of sample frequencies for each DNN. The optimal frequencies are searched off-line to maximally reduce energy and accuracy loss by running an energy model and the DNN model over 1000 batches. For 1-to-8bit VGG16, the energy is reduced by 1.41-to-1.92× and accuracy is improved by 0.37%-to-3.49%, compared to a quantizer with a fixed frequency.
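To make the two DMCPP prediction modes above concrete, the following sketch models PAVP and PHEP in software. The naming is ours; `pairs` is the unique-weight/bitmap list from the earlier decomposition sketch, activations are assumed non-negative (post-ReLU inputs), and a successful prediction is treated as a zero output.

```python
def pavp(window, pairs, th=0.0):
    """Present-accumulated-value prediction: process unique weights from most
    positive to most negative; once the running sum SUM1 falls below Th while
    only negative weights remain, SUM2 <= 0 and the output cannot recover."""
    acc = 0.0
    for w, bitmap in sorted(pairs, key=lambda p: -p[0]):
        acc += w * float((window * bitmap).sum())
        if w <= 0 and acc < th:
            return 0.0            # prediction succeeded: skip remaining bitmaps
    return acc

def phep(window, pairs, th=0.0):
    """Prospective-hypothetic-extremum prediction: process from most negative to
    most positive weight; bound SUM2 by (sum of remaining activations, SXre) x
    (largest remaining positive weight, wux)."""
    order = sorted(pairs, key=lambda p: p[0])
    acc = 0.0
    for i, (w, bitmap) in enumerate(order):
        acc += w * float((window * bitmap).sum())
        rest = order[i + 1:]
        sx_re = sum(float((window * b).sum()) for _, b in rest)
        w_ux = max((wr for wr, _ in rest if wr > 0), default=0.0)
        if acc + w_ux * sx_re <= th:   # SUM1 + max(SUM2) <= Th
            return 0.0                 # prediction succeeded: early stop
    return acc
```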
TABLE I
COMPARISON WITH STATE-OF-THE-ART CIM-BASED PROCESSORS
Reference                | [1] ISSCC 2018 | [2] ISSCC 2019 | [3] CICC 2017 | [4] ISSCC 2018 | [5] ISSCC 2019 | [6] ISSCC 2019 | This work
Tech. [nm]               | 65 | 55 | 65 | 55 | 65 | 40 | 28
CIM Domain               | Voltage | Voltage | Frequency | Frequency | Time | Time | Time
Die Area [mm2]           | 0.067 | 0.037 | 0.24 | 3.4 | 0.22 | 0.124 | 7.21
Supply Voltage [V]       | 0.9-1.2 | - | 1.2 | 0.4-1.0 | 0.6-0.9 | 0.375-1.1 | 0.55-1.05
Weight Precision         | 1b | 1-8b | 8b | 6b | 1-2b | 1-8b | 1-8b
Activation Precision     | 7b | 1-8b | 3b | 6b | 8b | 4b | 8b
Peak Throughput [GOPS]   | 10.7 | 0.365 | 0.396 | 2.152 | - | 0.365 | 246.3
Benchmark                | LeNet-5 | MNIST | MLP | RL | AlexNet | LeNet-5 | VGG16, AlexNet

Energy Efficiency [TOPS/W]:
LQ-DNNs, Reported(1)     | 28.1@(1,7) | 72.1@(2,1), 18.37@(5,2) | 0.019@(3,8) | 3.12@(6,6) | 46.6@(1,8) | 12.08@(1,4) | 63.64/56.71@(1,8), 27.17/32.34@(2,8), 9.65/8.96@(4,8)
LQ-DNNs, Normalized(2) @(1,8) | 24.6 | 18.03 | 0.057 | 14.04 | 46.6 | 6.04 | -
LQ-DNNs, Normalized(2) @(4,8) | 6.15 | 4.51 | 0.014 | 3.51 | 11.7 | 1.51 | -
NLQ-DNNs(3)(4)           | 3.07@(*,8) | 2.87@(*,8) | 0.007@(*,8) | 1.76@(*,8) | 5.85@(*,8) | 0.76@(*,8) | 65.89/58.37@(1,8), 11.32/9.98@(4,8)

(1) a@(b,c) means the energy efficiency is a when the quantization bitwidths of the weight (index) and the activation are b and c, respectively.
(2) Energy efficiency of each work scaled by linearly multiplying by the ratio of weight or activation bitwidth.
(3) The energy efficiency of previous works for NLQ-DNNs equals that for LQ-DNNs with 8b weights, since they have to use actual weights for computation.
(4) (*,8) means the energy efficiency is identical for NLQ-DNNs with different quantization bitwidths in previous works.
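As a reading of footnote 2 (our reconstruction, not stated explicitly in the paper), the normalization appears to scale each reported figure linearly by the weight and activation bitwidth ratios:

```python
# Our reading of footnote 2: scale a reported energy efficiency linearly by the
# weight and activation bitwidth ratios to an equivalent operating point.
def normalize(ee_tops_w, w_rep, a_rep, w_tgt, a_tgt):
    return ee_tops_w * (w_rep / w_tgt) * (a_rep / a_tgt)

print(round(normalize(28.1, 1, 7, 1, 8), 2))   # [1]: ~24.59, table lists 24.6
print(round(normalize(28.1, 1, 7, 4, 8), 2))   # [1]: ~6.15,  table lists 6.15
print(round(normalize(72.1, 2, 1, 1, 8), 2))   # [2]: ~18.03, table lists 18.03
```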

Fig. 10. (a) Improvement of energy efficiency by different techniques on VGG-16, (b) Area and power breakdown.

Two off-line generated thresholds (SXth and wth) are used to determine on-line whether the convolution result and the actual weight are large or small, so as to select the sample frequency. A high frequency is assigned for a large weight and a small CIM result, and vice versa. Fig. 7(d) presents the architecture. Based on the two threshold comparisons, the sample frequency selector chooses one frequency and sends configuration bits to the pulse quantizer, which generates the required frequency with a voltage controlled oscillator (VCO) that configures 4 distinct numbers of delay cells and 4 levels of control voltage to produce 16 frequency levels. The delay cell is the same as that in the PWMU, which alleviates the quantization fluctuation caused by PVT variation affecting the pulse modulation.
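A minimal sketch of the two-threshold frequency selection described above is given below; the exact mapping of the two comparisons onto the four levels is our guess at the intent, not the paper's encoding.

```python
def select_sample_frequency(weight, sx_rest, w_th, sx_th, freqs):
    """Pick one of four sample-frequency levels from two threshold comparisons.
    weight:  the actual weight about to scale the pulse (known in advance).
    sx_rest: sum of the remaining activations, i.e. the shrinking upper bound
             on the next pulse width.
    freqs:   the four available frequency levels, lowest to highest."""
    big_weight = abs(weight) >= w_th    # large scaling factor -> finer sampling
    narrow_pulse = sx_rest <= sx_th     # small expected pulse -> finer sampling
    if big_weight and narrow_pulse:
        return freqs[3]                 # highest frequency
    if big_weight:
        return freqs[2]
    if narrow_pulse:
        return freqs[1]
    return freqs[0]                     # wide pulse, small weight -> lowest
```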
IV. MEASUREMENT RESULTS

Fig. 8. Chip photograph and specifications.

Fig. 9. (a) Frequency and power scaling with voltage, (b) Peak energy efficiency scaling with quantization bitwidth and voltage.

This processor is fabricated in 28nm CMOS with a 7.2mm2 area. Fig. 8 shows the die photograph and chip specifications. The processor can accelerate multi-bit LQ- and NLQ-DNNs with less than 1.5% accuracy degradation. When the supply is scaled from 1.05V down to 0.55V, the chip achieves a peak energy efficiency of 2.4-to-152.7TOPS/W for 1-to-8bit DNNs, as shown in Fig. 9. Evaluated with VGG16, CFFKD, DMCPP and AWAPQ improve energy efficiency by 2.48×, 1.28× and 1.31× on average, respectively, as shown in Fig. 10(a). The area and power breakdown are shown in Fig. 10(b). CFFKD and DMCPP consume only 13.8% and 12.1% of the total area and power, while making TD-CIMs efficient for accelerating arbitrary quantized DNNs.

Table I compares this processor with previous CIMs working in different domains. For 1-bit LQ-VGG16 and AlexNet, this processor achieves 60.2 TOPS/W energy efficiency on average, which is 2.45×, 4.29× and 1.29× higher than the state-of-the-art voltage- [1], frequency- [4], and time-domain [5] CIMs. This mainly benefits from the computation energy reduced by UWKDC and DMCPP. When accelerating NLQ-DNNs, previous works have to use high-bitwidth actual weights for computation, leading to low and identical energy efficiency for different quantization bitwidths. Thanks to UWKDC and AWAPQ, our processor still achieves 62.1 and 10.7 TOPS/W energy efficiency for 1-bit and 4-bit NLQ-VGG16 and AlexNet, which is 10.6× and 1.83× higher than [5]. The processor achieves nearly identical energy efficiency for LQ- and NLQ-DNNs of the same precision, since UWKDC and CFFKD process them consistently.

V. CONCLUSION

This work aims to accelerate arbitrary quantized DNNs on CIMs. We first propose UWKDC to accelerate convolution computation for various DNNs. Then, we design a TD-CIM based processor with three techniques: CFFKD to reduce memory access, DMCPP to remove redundant computations and AWAPQ to reduce quantization energy and error. Implemented in 28nm, this processor achieves 60.2 and 62.1 TOPS/W energy efficiency for LQ- and NLQ-DNNs, which is 1.29× and 10.6× higher than the state-of-the-art work.

ACKNOWLEDGMENT

This work was supported in part by the National Key R&D Project (2018YFB2202600), the NSFC (61774094 and U19B2041), the China Major S&T Project (2018ZX01031101-002 and 2018ZX01028101-004) and the Beijing S&T Project (Z191100007519016).

REFERENCES

[1] A. Biswas, et al., "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in ISSCC Dig. Tech. Papers, 2018, pp. 488–490.
[2] X. Si, et al., "A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning," in ISSCC Dig. Tech. Papers, 2019, pp. 396–398.
[3] M. Liu, et al., "A scalable time-based integrate-and-fire neuromorphic core with brain-inspired leak and local lateral inhibition capabilities," in CICC Dig. Tech. Papers, 2017, pp. 1–4.
[4] A. Amravati, et al., "A 55nm time-domain mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement learning for autonomous micro-robots," in ISSCC Dig. Tech. Papers, 2018, pp. 124–126.
[5] J. Yang, et al., "Sandwich-RAM: An energy-efficient in-memory BWN architecture with pulse-width modulation," in ISSCC Dig. Tech. Papers, 2019, pp. 394–396.
[6] A. Sayal, et al., "All-digital time-domain CNN engine using bidirectional memory delay lines for energy-efficient edge computing," in ISSCC Dig. Tech. Papers, 2019, pp. 228–230.