06) A Time-Domain Computing-In-Memory Based Processor Using Predictable Decomposed Convolution For Arbitrary Quantized DNNs
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on June 20,2022 at 10:21:05 UTC from IEEE Xplore. Restrictions apply.
IEEE Asian Solid-State Circuits Conference
S6-1 November 9-11, 2020 ONLINE
Fig. 2. (a) Principle and (b) three features of proposed unique-weight kernel decomposition based convolution computation.
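Fig. 2 itself is not reproduced here, but the caption's idea admits a compact sketch: for a b-bit quantized kernel there are at most 2^b distinct weight values, so a convolution partial sum can be computed by accumulating the activations that share each unique weight and multiplying only once per unique value. The code below is one plausible reading of the decomposition principle, with illustrative names (`decomposed_dot` is not from the paper); the actual UWKDC hardware mapping is defined by the figure.

```python
import numpy as np

def decomposed_dot(weights, acts):
    """Convolution partial sum via unique-weight decomposition:
    sum the activations sharing each quantized weight value,
    then perform one multiply per unique weight."""
    out = 0.0
    for w in np.unique(weights):
        # additions dominate; multiplies shrink to the number of unique weights
        out += w * acts[weights == w].sum()
    return out

# For b-bit weights a kernel has at most 2**b unique values, so the
# multiply count is bounded regardless of kernel size.
w = np.array([1, -1, 1, 0, -1, 1, 0, 1, -1])   # e.g. a ternary 3x3 kernel
x = np.arange(9, dtype=float)
assert decomposed_dot(w, x) == float(np.dot(w, x))
```

The decomposition trades per-weight multiplies for per-unique-value multiplies, which is what makes the cost of the convolution predictable from the weight bitwidth alone.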
TABLE I
COMPARISON WITH STATE-OF-THE-ART CIM-BASED PROCESSORS.

Reference               [1] ISSCC 2018  [2] ISSCC 2019  [3] CICC 2017  [4] ISSCC 2018  [5] ISSCC 2019  [6] ISSCC 2019  This work
Tech. [nm]              65              55              65             55              65              40              28
CIM domain              Voltage         Voltage         Frequency      Frequency       Time            Time            Time
Die area [mm^2]         0.067           0.037           0.24           3.4             0.22            0.124           7.21
Supply voltage [V]      0.9-1.2         -               1.2            0.4-1.0         0.6-0.9         0.375-1.1       0.55-1.05
Weight precision        1b              1-8b            8b             6b              1-2b            1-8b            1-8b
Activation precision    7b              1-8b            3b             6b              8b              4b              8b
Peak throughput [GOPS]  10.7            0.365           0.396          2.152           -               0.365           246.3
Benchmark               LeNet-5         MNIST           MLP            RL              AlexNet         LeNet-5         VGG16 / AlexNet

Energy efficiency [TOPS/W] (This-work entries are VGG16 / AlexNet):
  LQ-DNNs, reported^1     28.1@(1,7)    72.1@(2,1),     0.019@(3,8)    3.12@(6,6)      46.6@(1,8)      12.08@(1,4)     63.64/56.71@(1,8),
                                        18.37@(5,2)                                                                    27.17/32.34@(2,8),
                                                                                                                       9.65/8.96@(4,8)
  LQ-DNNs, normalized^2
    @(1,8)                24.6          18.03           0.057          14.04           46.6            6.04            63.64 / 56.71
    @(4,8)                6.15          4.51            0.014          3.51            11.7            1.51            9.65 / 8.96
  NLQ-DNNs^3              3.07@(*,8)^4  2.87@(*,8)      0.007@(*,8)    1.76@(*,8)      5.85@(*,8)      0.76@(*,8)      65.89/58.37@(1,8),
                                                                                                                       11.32/9.98@(4,8)

^1 a@(b,c) means the energy efficiency is a when the quantization bitwidth of the weight (index) is b and that of the activation is c.
^2 Each work's reported energy efficiency rescaled linearly by the ratio of weight and activation bitwidths.
^3 For NLQ-DNNs, previous works must compute with the actual 8b weights, so their NLQ efficiency equals their LQ efficiency at 8b weight precision.
^4 (*,8) means that, in previous works, the energy efficiency is identical for NLQ-DNNs of any quantization bitwidth.
Fig. 9. (a) Frequency and power scaling with voltage. (b) Peak energy efficiency scaling with quantization bitwidth (1/2/4/8-bit) and voltage.

for determining on-line whether the convolution result or the actual weight is large or small, so as to select the sample frequency. A high frequency is assigned for a large weight and a small CIM result, and vice versa. Fig. 7(d) presents the architecture. Through two threshold comparisons, the sample frequency selector chooses one frequency and sends configuration bits to the pulse quantizer, which generates the required frequency with a voltage-controlled oscillator (VCO) that configures 4 distinct numbers of delay cells and 4 levels of control voltage to produce 16 frequency levels. The delay cell is the same as that in the PWMU, which prevents quantization fluctuation caused by PVT variation from impacting pulse modulation.

IV. MEASUREMENT RESULTS

This processor is fabricated in 28nm CMOS with 7.2mm2 area. Fig. 8 shows the die photograph and chip specifications. The processor accelerates multi-bit LQ- and NLQ-DNNs with less than 1.5% accuracy degradation. When the supply scales down from 1.05V to 0.55V, the chip achieves peak energy efficiency of 2.4-to-152.7 TOPS/W for 1-to-8-bit DNNs, as shown in Fig. 9. Evaluated on VGG16, CFFKD, DMCPP and AWAPQ improve energy efficiency by 2.48x, 1.28x and 1.31x on average, respectively, as shown in Fig. 10(a). The area and power breakdowns are shown in Fig. 10(b): CFFKD and DMCPP consume only 13.8% and 12.1% of total area and power, while making TD-CIMs efficient for accelerating arbitrary quantized DNNs.

Table I compares this processor with previous CIMs operating in different domains. For 1-bit LQ-VGG16 and AlexNet, this processor achieves 60.2 TOPS/W energy efficiency on average, which is 2.45x, 4.29x and 1.29x higher than the state-of-the-art voltage- [1], frequency- [4] and time-domain [5] CIMs. The gain mainly comes from reduced computation energy at different quantization bitwidths. Thanks to UWKDC and AWAPQ, the processor still achieves 62.1 and 10.7 TOPS/W energy efficiency for 1-bit and 4-bit NLQ-VGG16 and AlexNet, which are 10.6x and 1.83x higher than [5]. The processor achieves nearly identical energy efficiency for LQ- and NLQ-DNNs of the same precision, since UWKDC and CFFKD process them consistently.

V. CONCLUSION

This work aims to accelerate arbitrary quantized DNNs on CIMs. We first propose UWKDC to accelerate convolution computation for various DNNs. We then design a TD-CIM-based processor with three techniques: CFFKD to reduce memory access, DMCPP to remove redundant computation, and AWAPQ to reduce quantization energy and error. Implemented in 28nm, the processor achieves 60.2 and 62.1 TOPS/W energy efficiency for LQ- and NLQ-DNNs, 1.29x and 10.6x higher than the state-of-the-art work.

ACKNOWLEDGMENT

This work was supported in part by the National Key R&D Project (2018YFB2202600), the NSFC (61774094 and U19B2041), the China Major S&T Project (2018ZX01031101-002 and 2018ZX01028101-004) and the Beijing S&T Project (Z191100007519016).

REFERENCES

[1] A. Biswas, et al., "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in ISSCC Dig. Tech. Papers, 2018, pp. 488-490.
[2] X. Si, et al., "A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning," in ISSCC Dig. Tech. Papers, 2019, pp. 396-398.
[3] M. Liu, et al., "A scalable time-based integrate-and-fire neuromorphic core with brain-inspired leak and local lateral inhibition capabilities," in CICC Dig. Tech. Papers, 2017, pp. 1-4.
[4] A. Amravati, et al., "A 55nm time-domain mixed-signal neuromorphic accelerator with stochastic synapses and embedded reinforcement learning for autonomous micro-robots," in ISSCC Dig. Tech. Papers, 2018, pp. 124-126.
[5] J. Yang, et al., "Sandwich-RAM: An energy-efficient in-memory BWN architecture with pulse-width modulation," in ISSCC Dig. Tech. Papers, 2019, pp. 394-396.
[6] A. Sayal, et al., "All-digital time-domain CNN engine using bidirectional memory delay lines for energy-efficient edge computing," in ISSCC Dig. Tech. Papers, 2019, pp. 228-230.