Lattice: An ADC/DAC-less ReRAM-based Processing-In-Memory
Architecture for Accelerating Deep Convolution Neural Networks

Qilin Zheng1*, Zongwei Wang1*, Zishun Feng5, Bonan Yan2, Yimao Cai1, Ru Huang1,
Yiran Chen2, Chia-Lin Yang4, and Hai (Helen) Li2

1 Inst. of Microelectronics, Peking University; 3 Frontiers Science Center for Nano-optoelectronics, Peking University;
2 Dept. of ECE, Duke University; 4 Dept. of ECE, National Taiwan University; 5 Dept. of BME, UNC at Chapel Hill
* Equal contribution.
1,3 {wangzongwei,caiyimao}@pku.edu.cn, 2 {bonan.yan,yiran.chen,hai.li}@duke.edu.cn, 4 yangc@csie.ntu.edu.tw

Abstract—Nonvolatile Processing-In-Memory (NVPIM) has demonstrated its great potential in accelerating Deep Convolution Neural Networks (DCNN). However, most existing NVPIM designs require costly analog-digital conversions and often rely on excessive data copies or writes to achieve performance speedup. In this paper, we propose a new NVPIM architecture, namely, Lattice, which calculates the partial sums of the dot products between the feature maps and weights of network layers in a CMOS peripheral circuit to eliminate the analog-digital conversions. Lattice also naturally offers an efficient data mapping scheme to align the data of the feature maps and the weights, thus avoiding the excessive data copies or writes of previous NVPIM designs. Finally, we develop a zero-flag encoding scheme to save the energy of processing zero values in sparse DCNNs. Our experimental results show that Lattice improves the system energy efficiency by 4× ∼ 13.22× compared to three state-of-the-art NVPIM designs: ISAAC, PipeLayer, and FloatPIM.

I. INTRODUCTION

Ultra-low power machine learning processors are essential to performing cognitive tasks on embedded systems whose power budgets are limited, e.g., by batteries or energy harvesting sources. However, the data generated by Deep Convolution Neural Networks (DCNN) incurs heavy traffic between memory and computing units in conventional von Neumann architectures, which adversely affects the energy efficiency of these systems. Resistive Random Access Memory (ReRAM) based Nonvolatile Processing-In-Memory (NVPIM) has recently emerged as a promising solution for accelerating DCNN executions [1]–[5]. The high cell density of ReRAM allows large on-chip ReRAM arrays to store the parameters of the DCNN, while proper functions, e.g., vector-matrix multiplications (VMM), can be directly performed in the ReRAM arrays and their peripheral circuits.

Existing ReRAM-based NVPIM designs are mainly implemented in a mixed-signal manner: ISAAC [1] and PipeLayer [2], for instance, include a large number of digital/analog and analog/digital converters (DAC/ADC) to transform digital inputs into analog signals for PIM operations, and then convert the computing results back to digital format, respectively. The ADC/DACs, however, dominate the area and power consumption of these PIM designs. Digital PIM designs were proposed to improve the energy efficiency by eliminating A/D conversions and to improve the overall design resilience to the stochasticity of analog-domain computation. FloatPIM [6], for example, attempts to implement VMM using in-memory 'NOR' logic operations. However, the in-memory 'NOR' logic operations require initialization of the memory cells to the Low Resistance State (LRS) at the beginning of each 'NOR' operation. Additional memory space is also required to store every intermediate 'NOR' result when combining 'NOR' logic to compute multiplication or accumulation.

We note that the high performance of the existing NVPIM designs is achieved at the cost of excessive data copies or writes. As we shall show in Section II, in PipeLayer, an element in a feature map needs to be copied into ReRAM arrays n×n times when performing an n×n convolution layer; in FloatPIM, a 1-bit add involves six intermediate results, which cause six expensive ReRAM writes.

In this paper, we propose an ADC/DAC-less ReRAM-based NVPIM architecture named Lattice. Lattice moves the computation between the feature maps and weights of network layers to a CMOS peripheral circuit in order to eliminate the expensive ADC/DACs. Lattice also naturally offers an efficient data mapping scheme to align the data of the feature maps and the weights, thus avoiding the excessive data copies or writes of the previous NVPIM designs. Our experimental results show that Lattice achieves 7.6×, 13.22×, and 4.0× higher energy efficiency compared to ISAAC, PipeLayer, and FloatPIM, respectively.

The rest of this paper is organized as follows: Section II presents the background of this work; Section III illustrates the Lattice architecture; Section IV gives our experimental setup; Section V presents the evaluation results and the related discussions; Section VI concludes our work.

II. BACKGROUND

A. ReRAM-Based NVPIM Accelerators with ADC/DACs

ReRAM is an emerging NVM technology with very high cell density [7]. The resistance of a ReRAM device can be switched between two or more levels [8] by applying electrical excitation with different amplitudes and durations. ReRAM is normally organized in a crossbar array, which has been widely used to perform VMM [9] [10].

Fig. 1. (a) Overview of a ReRAM-based NVPIM accelerator. (b) The design of a PIM core. (c) A 1-bit add in the FloatPIM design [6]. (d) Mapping scheme of feature maps in PipeLayer [2].

As shown in Fig. 1(b), the elements of a matrix can be represented by the conductances of the ReRAM cells located at the cross-points of a ReRAM crossbar, and the vector can be represented by the voltages on the crossbar's inputs. The bitline currents collected at the outputs of the crossbar then form the VMM result. In this design, DAC/ADCs are needed to convert the signals between analog and digital formats. However, these DAC/ADCs contribute the majority of the chip area and power consumption [1] [11].
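To make the crossbar VMM of Fig. 1(b) concrete, the following Python sketch (an idealized, noise-free model added for illustration; the sizes and values are hypothetical) computes each bitline current as the dot product of the input voltage vector with one column of cell conductances, i.e., the analog quantity that the ADCs must subsequently digitize.

# Ideal (noise-free) crossbar model: I[j] = sum_i V[i] * G[i][j]
def crossbar_vmm(voltages, conductances):
    rows, cols = len(conductances), len(conductances[0])
    assert len(voltages) == rows
    currents = [0.0] * cols
    for j in range(cols):               # one bitline per matrix column
        for i in range(rows):           # wordline voltages drive every cell
            currents[j] += voltages[i] * conductances[i][j]
    return currents                     # an ADC would digitize each current

# Illustrative 3x2 example (hypothetical conductances in S, voltages in V)
G = [[1e-4, 1e-5],
     [5e-5, 1e-4],
     [1e-5, 5e-5]]
V = [0.2, 0.0, 0.1]
print(crossbar_vmm(V, G))               # bitline currents (A)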
B. Redundancy in Memory Writes of FloatPIM

To simplify the design and improve the computational reliability, FloatPIM [6] computes VMM by performing a combination of 'NOR' operations on binary inputs. This design, however, introduces significant redundancy in memory writes: Fig. 1(c) shows a 'NOR' logic [12] that is composed of two input devices and one output device. When performing the 'NOR' operation, the output device, which is connected to the bitline BL[2], is first initialized to '1' (i.e., to LRS). Then the two bitlines of the input devices, BL[0] and BL[1], are connected to Vdd while the bitline BL[2] is grounded. If the values of the two input devices are both '0' (i.e., at the High Resistance State, or HRS), the current passing through the output device is small and the output device stays at LRS. If one of the two input devices is at LRS, a large current will pass through the output device and switch the output device to '0'. As also shown in Fig. 1(c), computing a 1-bit add involves six intermediate results in total, each of which introduces one write as illustrated above.
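The write overhead can be seen with a small bookkeeping model (our own simplification of a MAGIC-style NOR, not FloatPIM's actual controller): every in-memory 'NOR' must first pre-set its output cell to '1' and may then switch it to '0', so each intermediate result costs at least one ReRAM write; with six intermediate results per 1-bit add, at least six writes follow.

# Simplified bookkeeping model of a MAGIC-style in-memory NOR [12]; the exact
# gate netlist of FloatPIM's 1-bit add is the one shown in Fig. 1(c).
class NorCrossbar:
    def __init__(self):
        self.writes = 0

    def nor(self, a, b):
        self.writes += 1                     # pre-set output cell to '1' (LRS)
        out = int(a == 0 and b == 0)
        if out == 0:
            self.writes += 1                 # inputs switch the cell to '0' (HRS)
        return out

xbar = NorCrossbar()
print(xbar.nor(0, 0), xbar.nor(0, 1), xbar.nor(1, 1))   # -> 1 0 0
print(xbar.writes)   # 5 writes for just three NORs: each result costs >= 1 write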
in some Comp.blocks and the feature maps can be stored
C. Redundancy in memory copies of PipeLayer in some Mem.blocks.1 Since the weight/feature maps are
high dimension tensors, they should be first unrolled into a
In PipeLayer, all the data in the feature maps correspond- numbers of vectors, and then stored in the memory array
ing to a K × K filter, e.g., K × K × Inch data in total, column by column. As we shall show later, such column-
are copied into a column of the ReRAM array, as shown in wise arrangement will give us some convenience to support
Fig. 1(d). Here Inch is the number of the input feature maps. reconfigurable data precision.
When the stride is smaller than K, e.g., 1, a feature map data
will be stored into different locations on different columns 1 In some cases that the size of the feature maps is small, e.g., for FC
of the ReRAM array for K × K times (except for the data layers, the feature maps may be directly stored in the column buffer of the
on the boundary of the feature maps without padding). Comp.blocks.
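The duplication factor follows directly from an im2col-style unrolling; the sketch below (illustrative only, not PipeLayer's exact layout) counts how many unrolled patch columns contain each feature-map element when the stride is 1.

# Count how many K x K patches (i.e., ReRAM columns in a PipeLayer-style
# layout) contain each feature-map pixel, stride = 1, no padding.
def copy_counts(H, W, K):
    counts = [[0] * W for _ in range(H)]
    for r in range(H - K + 1):           # top-left corner of each patch
        for c in range(W - K + 1):
            for dr in range(K):
                for dc in range(K):
                    counts[r + dr][c + dc] += 1
    return counts

cnt = copy_counts(H=32, W=32, K=3)
print(cnt[16][16])   # interior pixel -> 9 copies (K*K)
print(cnt[0][0])     # corner pixel   -> 1 copy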

Fig. 2. Overview of the Lattice architecture.

III. LATTICE ARCHITECTURE

A. System Overview

Fig. 2 depicts the overview of the proposed Lattice architecture. Lattice is composed of multiple PIM banks, and each bank includes a 2D array of PIM blocks, a global functional unit, a bank controller, and a global accumulator. PIM blocks are the basic operation units that can operate in either computing mode or memory mode. When a PIM block is in the memory mode (namely, a Mem.block), it works as a memory that supports read and write accesses; when a PIM block is in the computing mode (namely, a Comp.block), it can perform Vector-Vector Multiplication (VVM) to compute partial sums (Psums) of output feature maps in DCNN executions with reconfigurable data precision. The global functional unit performs miscellaneous functions such as pooling, activation, and our proposed zero-flag encoding scheme. The bank controller orchestrates the data mapping and computation flow of Convolution (CONV) layers and Fully Connected (FC) layers. It is also in charge of reconfiguring the data precision of every VVM operation. The global accumulator accumulates the generated output feature maps and sends them to adjacent PIM banks.

In inference, the weights of a network layer are stored in some Comp.blocks and the feature maps can be stored in some Mem.blocks.¹ Since the weights/feature maps are high-dimension tensors, they are first unrolled into a number of vectors and then stored in the memory array column by column. As we shall show later, such a column-wise arrangement gives us some convenience in supporting reconfigurable data precision.

¹ In some cases where the size of the feature maps is small, e.g., for FC layers, the feature maps may be directly stored in the column buffer of the Comp.blocks.

Fig. 3. Illustration of the mapping scheme. (a) Kernel mapping scheme for a CONV layer. (b) Feature map mapping scheme. (c) Computation flow of a CONV layer. (d) Data movement between different blocks.

B. Data Mapping of Weights/Feature Maps

We assume the configuration of the kernels is represented by [Inch, Outch, K, K], where Inch is the number of input channels, Outch is the number of output channels, and K is the dimension of the kernel. The configuration of the feature maps is represented by [Inch, H, W], where H and W are the height and width of the feature maps. The ReRAM crossbar size is [Row, Col], where Row and Col are the numbers of rows and columns. The bitwidths of the kernel weights and the feature map parameters are N and M, respectively. We also assume each ReRAM device can only reliably store a binary value, i.e., '0' or '1'.

For the Inch kernels that have a size of K×K and correspond to the same output channel, we group the elements at the same position of the Inch kernels into a vector and map the vector into a column of a PIM block; the elements at different positions of the kernels are mapped into different PIM blocks. Hence, we map the kernels into K×K PIM blocks in total, as shown in Fig. 3(a). If the bitwidth of the kernel weights N > 1, then we need N ReRAM devices at different columns to represent one kernel element. Here we assume Inch ≤ Row and Outch×N ≤ Col. If Inch > Row or Outch×N > Col, we may need to extend the mapping to more PIM blocks.

Fig. 3(b) illustrates the mapping of feature maps. We group the elements at the same position of the same row of the Inch different feature maps into a vector, and map the vector into a column of a PIM block. We then slide the element position horizontally and map the grouped vector into a different column in the same PIM block. There are W such vectors in total that cover the elements on the same row of all the feature maps and are mapped into the PIM block. If the bitwidth of the feature maps M > 1, we need M ReRAM devices at different columns to represent one feature map parameter. After that, we move to the next row in the feature maps and redo everything above to map this new row of the feature maps into another PIM block. Note that each time we only need to store K rows of the feature maps to facilitate the computation: when the (K+1)th row is needed, it can overwrite the locations of the 1st row in the PIM blocks [1]. Again, here we assume Inch ≤ Row and W×M ≤ Col. If Inch > Row or W×M > Col, we may need to extend the mapping to more PIM blocks.

For example, if we want to compute a CONV3×3 layer in VGG-Net with a kernel of [256,32,3,3] and feature maps of [256,32,32] using PIM blocks with a ReRAM array size of 256×256, we need to map the kernels and the feature maps into 9 and 3 PIM blocks, respectively, as shown in Fig. 3(d). Here the Comp.block block(i,j) (i, j = 0,1,2) stores the elements (i,j) of the Inch×Outch kernels, and the Mem.block block(3,j) (j = 0,1,2) stores the jth row of the feature maps that participates in the computation.

Mapping the weights of an FC layer into PIM blocks is trivial: the columns of the weight matrix are directly mapped into the columns of the PIM blocks, and multiple columns of the PIM blocks may be combined to support a higher precision of the weights. Again, if the size of the weight matrix is larger than that of the PIM blocks, the weight matrix will be mapped to multiple PIM blocks.
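As an illustration of the kernel-side mapping (the data layout and function below are our own sketch under the stated assumptions, with N = 1, i.e., binary weights), kernel element (i, j) of input channel ic and output channel oc lands in PIM block (i, j), row ic, column oc:

# Sketch of the kernel mapping of Fig. 3(a) for N = 1 (binary weights).
# weights[ic][oc][i][j] -> block (i, j), row ic, column oc.
def map_kernels(weights, In_ch, Out_ch, K, Row=256, Col=256):
    assert In_ch <= Row and Out_ch <= Col, "would need more PIM blocks"
    blocks = [[[[0] * Col for _ in range(Row)] for _ in range(K)] for _ in range(K)]
    for ic in range(In_ch):
        for oc in range(Out_ch):
            for i in range(K):
                for j in range(K):
                    blocks[i][j][ic][oc] = weights[ic][oc][i][j]
    return blocks   # K x K PIM blocks, each a Row x Col binary array

# Toy usage: [In_ch=2][Out_ch=4] all-ones 3x3 kernels
W = [[[[1] * 3 for _ in range(3)] for _ in range(4)] for _ in range(2)]
blks = map_kernels(W, In_ch=2, Out_ch=4, K=3)
print(blks[0][0][1][2])   # kernel element (0,0) of ic=1, oc=2 -> 1

With N > 1, columns oc·N to oc·N+N−1 of block (i, j) would hold the N bit-planes of the same kernel element; the feature-map mapping of Fig. 3(b) is analogous, with each row of the feature maps going to one Mem.block, column by column.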
C. Computation Flow

The computation of a convolution in Lattice can be divided into three phases – preparation, computation, and shifting – as shown in Fig. 3(c).

In the preparation phase, the feature maps are first read out from the Mem.blocks column by column and sent into the inputs of each row of the Comp.blocks. The data transfer direction is determined by the row index Rout of the output feature maps to be computed. For example, Rout = 0 at the start of the convolution.

The column i (i = 0,1,2) of block(3,j) (j = 0,1,2) will be read out and sent into the inputs of block(i,j) (i, j = 0,1,2) to compute the row Rout = 0. After the row Rout = 0 is obtained, Rout moves to 1. The locations that store the 0th row of the input feature maps in block(3,0) will be overwritten by the 3rd row. The column i (i = 0,1,2) of block(3,j) (j = 0,1,2) will be read out and sent into the inputs of block(i,(j+2)%3) (i, j = 0,1,2) to compute the row Rout = 1, as shown in Fig. 3(d).

In general, the column i (i = 0,1,2) of block(3,j) (j = 0,1,2) will be read out and sent into the inputs of block(i,f(j,Rout)) (i, j = 0,1,2) to compute the row Rout. Here the function f(j, Rout) can be expressed as:

f(j, Rout) = (j + K − Rout%K)%K.   (1)

K is 3 in the presented example.
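A minimal sketch of this block rotation (the helper name and the loop are ours; the schedule itself is Eq. (1)):

# Which Comp.block column index receives Mem.block block(3, j)'s data when
# computing output row Rout, for a K x K kernel (Eq. (1)).
def f(j, Rout, K=3):
    return (j + K - Rout % K) % K

for Rout in range(4):
    print(Rout, [f(j, Rout) for j in range(3)])
# Rout 0 -> [0, 1, 2], Rout 1 -> [2, 0, 1], Rout 2 -> [1, 2, 0], Rout 3 -> [0, 1, 2]

Because only the routing changes from one output row to the next, the Mem.block holding the oldest feature-map row is simply re-targeted and overwritten rather than re-copied, which is how Lattice avoids the K×K duplication of PipeLayer.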
In the computing phase, the Comp.blocks compute the VVM results between the feature maps and the weights as:

Psum_{i,j,oc} = Σ_{ic=0}^{InCh−1} X_{i,j,ic} · W_{oc,ic,i,j}.   (2)

Here i, j are the indexes of the PIM block in the array, and ic, oc are the indexes of the input and output channels, respectively. X_{i,j,ic} is the icth input channel of the feature maps stored in block(i,j), and W_{oc,ic,i,j} is the icth input channel of the octh output channel of the weights stored in block(i,j). In the presented example, 3×3 = 9 Psums in total are generated in block(0,0)–block(2,2) at the same time. All these 9 Psums must be accumulated by the global accumulator to generate the corresponding element of the output feature maps. Note that we need to repeat the above computation Outch times to generate the elements at the same position of all Outch output feature maps. In the above presented example, Outch = 32. During the computation phase, the (Rout+K)th row of the input feature maps will be stored into the Mem.blocks and overwrite the location of the Routth row, as aforementioned.

In the shifting phase, the column Cout+K−1 of block(3,j) (j = 0,1,2) will be read out, where Cout is the column index of the output feature maps to be computed. Then the inputs of block(i,j) are shifted to block(i−1,j) (i, j = 0,1,2) and the column Cout+K−1 of block(3,j) is sent to block(2,f(j)), as discussed above. Note that such a subtle shifting design avoids making multiple copies of the feature map elements as in PipeLayer [2] and hence reduces the energy consumption associated with ReRAM writes.

D. PIM Block Design and Operations

Fig. 4. Illustration of the PIM block design. (a) Structure of a PIM block. (b) Datapath of the memory mode. (c) Datapath of the computing mode. (d) A simple example of computing a VVM in the computing mode.

Fig. 4(a) depicts the structure of a PIM block, which consists of a ReRAM crossbar, a standard memory IO, row/column decoders, a column buffer (Col.Buffer) for intermediate data storage, a VVM computing engine (Col.VVM), a partial sum buffer (Psum), and a control circuit (Ctrl).

In the memory mode, when accessing the feature maps in a PIM block, only the Col.Buffer is activated (the memory array and Ctrl are always on during any operation), as shown in Fig. 4(b). Here the Col.Buffer is used to buffer the input data that are being accessed on each row of a memory column.

In the computing mode, the Psum buffer and the Col.VVM engine are activated, as shown in Fig. 4(c). The Col.VVM consists of a number of 'AND' gates, a bit counter, and a shift-accumulator. The computation of the performed VVM can be expressed as:

Psum = Σ_{r=0}^{Row−1} Σ_{i=0}^{N−1} Σ_{j=0}^{M−1} a_{i,r} · b_{j,r} · 2^{i+j}.   (3)

Here a and b are a column of the feature maps originally stored in the Mem.block and a column of the weights stored in the Comp.block, respectively. Note that a has been loaded into the Col.Buffer of the Comp.block during the preparation phase. N and M are the bitwidths of a and b, respectively. After b is read out from a column of the ReRAM array of the Comp.block, it is sent into the Col.VVM to perform the element-wise product between a and b, i.e., a·b = [a_0 b_0, ..., a_{Row−1} b_{Row−1}], using the AND gates. a·b is then sent into a bit counter to calculate the sum of the elements of a·b. If the bitwidths of the feature maps and the weights are not 1 (i.e., not binary precision), the above process repeats N×M times to compute the dot products between the higher-order bits of a and b. A shift-accumulation operation is needed to compute the sum of these dot products with different weights, and the Psum buffer is used to store the intermediate partial sums. Fig. 4(d) shows a simple illustration of the computing process, where the precisions of the weights and feature maps are Int2 and Int1, respectively.

To handle negative values, we can duplicate the Col.VVM to process both positive and negative accumulations in parallel. The sign bit of every element in the vector is read first from the ReRAM array and determines which Col.VVM will execute the accumulation of the positive or negative data. The negative Psum is then subtracted from the positive Psum during the Psum accumulation.
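A behavioral model of this bit-serial VVM (a software sketch for unsigned values, not the RTL of the Col.VVM) shows how the AND gates, the bit counter (a popcount), and the shift-accumulator of Eq. (3) reproduce an integer dot product:

# Behavioral model of Eq. (3): bit-plane AND + popcount + shift-accumulate.
def col_vvm(a, b, N, M):
    """a: feature-map column, b: weight column (unsigned ints), N/M bitwidths."""
    assert len(a) == len(b)
    psum = 0
    for i in range(N):                     # bit-plane of a
        for j in range(M):                 # bit-plane of b
            a_bits = [(x >> i) & 1 for x in a]
            b_bits = [(w >> j) & 1 for w in b]
            ones = sum(ab & bb for ab, bb in zip(a_bits, b_bits))  # AND + bit count
            psum += ones << (i + j)        # shift-accumulate by 2^(i+j)
    return psum

a = [1, 0, 1, 1]            # Int1 feature-map column (illustrative)
b = [2, 3, 1, 0]            # Int2 weight column (illustrative)
assert col_vvm(a, b, N=1, M=2) == sum(x * w for x, w in zip(a, b))
print(col_vvm(a, b, N=1, M=2))   # 3

Signed values would use two such accumulations selected by the sign bits, with the negative Psum subtracted at the end, as described above.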

E. Zero-flag Encoding Scheme

It has been widely observed that the feature maps of DCNNs can be sparse, e.g., due to the use of ReLU [13] [14]. If the feature map has a precision higher than 1 bit, we can use a simple zero-flag encoding scheme to save the unnecessary writes of zero values into the ReRAM array: an additional column is added to the ReRAM array to store a zero-flag corresponding to the high-precision data that are stored in multiple columns of the ReRAM array. The default value of the zero-flag is '0', indicating that the data is non-zero. If a data element is zero, we simply set its zero-flag to '1' but do not perform the writes to the ReRAM cells, saving the write energy. During the preparation phase, the loaded bits of a data element that is marked as zero are all set to '0' if its zero-flag is '1'.
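A behavioral sketch of the encoding (our own illustration; the flag-per-element layout follows the description above) that counts the ReRAM bit writes saved for an M-bit feature-map column:

# Write an M-bit feature-map column with a zero-flag column; zero elements
# cost one flag write instead of M data-bit writes.
def write_with_zero_flags(values, M):
    flags, data_bit_writes, flag_writes = [], 0, 0
    for v in values:
        if v == 0:
            flags.append(1)          # set zero-flag, skip the data columns
            flag_writes += 1
        else:
            flags.append(0)          # default flag value, data written normally
            data_bit_writes += M
    return flags, data_bit_writes, flag_writes

col = [7, 0, 3, 0, 0, 1, 0, 5]       # 50% sparse, illustrative values
flags, data_w, flag_w = write_with_zero_flags(col, M=8)
print(flags)            # [0, 1, 0, 1, 1, 0, 1, 0]
print(data_w, flag_w)   # 32 data-bit writes instead of 64, plus 4 flag writes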
IV. EXPERIMENTAL SETUP

ReRAM devices: The parameters of the ReRAM cells adopted in our simulations are extracted from the experimental results in [8]. The LRS/HRS are set to 10K/100K ohm with 2/1.5V SET/RESET pulse voltages and a 10ns pulse width. Only the binary states of the ReRAM cells are used in our design, i.e., '0' (HRS) and '1' (LRS).

Peripheral circuit designs: We use Cadence Virtuoso to implement the memory array and sense amplifiers based on a commercial 40nm technology to obtain area, energy, and latency. The buffer parameters are extracted from ISAAC [1]. The digital circuits are synthesized and evaluated with Synopsys Design Compiler. The clock frequency is 1GHz.

Benchmarks: We evaluate the Lattice architecture using some typical neural network configurations, which are summarized in TABLE I. We then customize a cycle-accurate simulator to emulate the computation and data movement of Lattice and to collect the system-level energy, area, and performance results.

TABLE I
NEURAL NETWORKS CONFIGURATION
Network Name | Configuration | In-Channels | Out-Channels
CONV-A       | CONV 3×3      | 64          | 64
CONV-B       | CONV 3×3      | 128         | 128
CONV-C       | CONV 3×3      | 256         | 256
CONV-D       | CONV 3×3      | 512         | 512
FC-A         | FC            | 256         | 256
FC-B         | FC            | 512         | 512
FC-C         | FC            | 1024        | 1024

V. EVALUATION RESULTS

A. Design Metrics of PIM Blocks in Lattice

TABLE II summarizes the design metrics of a PIM block in Lattice, including the breakdown of each component.

TABLE II
COMPONENT EVALUATION RESULTS
Component            | Area/bit (um2)   | Energy/bit (fJ)
ReRAM cell           | 0.0064           | –
SA                   | 2.3              | 5.3
Decoder              | 0.35             | 0.23
Array Summary        | Total Area (um2) | S/R Energy (pJ)
64Kb Memory Array    | 3148             | 0.4/2.3
Component            | Area (um2)       | Energy/Op (fJ)
Col.VVM Engine       | 1297.2           | 414
Col.Buffer           | 1203.1           | 230
Global Accumulator   | 228              | 74
Global Function Unit | 840              | 920
MUX                  | 1152             | 26

Fig. 5. Block MAC energy consumption at different precisions.

Fig. 5 compares the energy consumption of Lattice with some prior ReRAM PIM designs. Here N/M are the bitwidths of the feature map/kernel weight parameters. Lattice always achieves the lowest energy consumption among all the designs, thanks to the ADC/DAC-less design. When the bitwidth reduces, the energy per MAC of all the designs decreases, while the energy per MAC of Lattice reduces much more significantly than that of all the prior designs. This is because the accumulation operation of each ReRAM crossbar column in the prior designs still requires high-precision (e.g., 8-bit) ADCs, which dominate the energy consumption of those designs. The ADC/DAC-less Lattice design demonstrates great scalability in handling low-precision computation.

B. CONV Inference

Fig. 6. Comparison of energy consumption between Lattice and PipeLayer at different bitwidths.

Effectiveness of the proposed data mapping scheme: As discussed in Section III-C, in Lattice, feature maps are carefully mapped into Mem.blocks and Comp.blocks with a subtle shifting mechanism to avoid making multiple copies of the feature maps as in PipeLayer [2]. Fig. 6 shows the energy comparison between Lattice and PipeLayer at Int8 and Int4, where we assume the bitwidths of the weights and feature maps are the same. The results show that Lattice improves the energy efficiency by 4.1× (Int8) and 4.7× (Int4) on average w.r.t. PipeLayer.

Effectiveness of zero-flag encoding: The energy savings achieved by zero-flag encoding on various CONV layer configurations are also depicted in Fig. 6. Here we assume the overall feature map sparsity is 50% (though the actual sparsity varies for different tradeoffs between model sparsity and accuracy). On average, zero-flag encoding further improves the energy efficiency by 1.2× (Int8) and 1.3× (Int4).

C. FC Inference

Fig. 7 compares the energy efficiency of several designs that handle feature maps differently in the computation of FC layers.
It includes the scenarios where 1) the feature maps are all stored in the Mem.blocks without zero-flag encoding (Mem Mode); 2) zero-flag encoding is applied (Zero-Flag); and 3) the feature maps are small enough that they can be buffered in the Col.Buffer of the Comp.blocks (Buffer mode). The results show that Buffer mode can almost double the energy efficiency by eliminating the data transfer between the Mem.blocks and the Comp.blocks.

Fig. 7. Energy efficiency comparison between different data mapping modes of FC layers.

D. Comparison with Other State-of-the-Art Designs

Fig. 8. Area efficiency and energy efficiency comparisons on VGG-16 between Lattice and other state-of-the-art designs.

Lattice vs. other mixed-signal NVPIM designs: As shown in Fig. 8, the area efficiency of Lattice is 1351.6 GOPS/mm2, which is slightly lower than that of PipeLayer but higher than that of ISAAC. The utilization of ReRAM as a feature map buffer increases the area efficiency of both Lattice and PipeLayer. In ISAAC, the large eDRAM buffer degrades its area efficiency. Lattice achieves an energy efficiency of 11.14 TOPS/W, which is 13.22× and 7.68× higher than that of PipeLayer and ISAAC, respectively.

Lattice vs. in-memory logic design (FloatPIM): Lattice demonstrates both better area-normalized efficiency and better energy efficiency compared to FloatPIM at the 40nm technology node with Int8.² The high complexity of the logic operations and the incurred frequent writes greatly harm the performance of FloatPIM when the write energy/latency of the ReRAM cells is high/long.

Lattice vs. non-PIM design (UNPU): The comparison between Lattice and a state-of-the-art non-PIM accelerator design, UNPU [15], shows that Lattice achieves about 10× higher area efficiency thanks to the high cell density of ReRAM. The energy efficiency of Lattice is comparable to that of UNPU at the same bitwidth and technology node.

² We normalized FloatPIM with our device configuration for fairness.

VI. CONCLUSION

In this paper, we propose a new NVPIM architecture, namely, Lattice, which moves some operations between the feature maps and weights of network layers to CMOS peripheral circuits to eliminate the analog-digital conversions. Lattice also naturally offers an efficient data mapping scheme to align the data of feature maps and weights, thus avoiding the excessive data copies or writes of the prior NVPIM designs. Our experimental results show that Lattice improves the system energy efficiency by 4× ∼ 13.22× compared to three state-of-the-art NVPIM designs: ISAAC, PipeLayer, and FloatPIM.

ACKNOWLEDGEMENT

This work was supported by the National Key Research and Development Program of China under grant No. 2019YFB2205400 and in part by the National Natural Science Foundation of China under grants No. 61834001 and No. 61851404.

REFERENCES

[1] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
[2] L. Song et al., "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in HPCA, 2017.
[3] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, 2016.
[4] Z. Zhu et al., "A configurable multi-precision CNN computing framework based on single bit RRAM," in DAC.
[5] X. Sun et al., "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks," in DATE, 2018.
[6] M. Imani et al., "FloatPIM: In-memory acceleration of deep neural network training with high precision," in ISCA, 2019.
[7] M. Chang et al., "19.4 Embedded 1Mb ReRAM in 28nm CMOS with 0.27-to-1V read using swing-sample-and-couple sense amplifier and self-boost-write-termination scheme," in ISSCC, 2014.
[8] Z. Wang et al., "Modulation of nonlinear resistive switching behavior of a TaOx-based resistive device through interface engineering," Nanotechnology, 2016.
[9] M. Hu et al., "Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication," in DAC, 2016.
[10] B. Liu et al., "Reduction and IR-drop compensations techniques for reliable neuromorphic computing systems," in ICCAD, 2014.
[11] B. Li et al., "Merging the interface: Power, area and accuracy co-optimization for RRAM crossbar-based mixed-signal computing system," in DAC, 2015.
[12] S. Kvatinsky et al., "MAGIC—Memristor-aided logic," IEEE TCAS-II, 2014.
[13] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in ISCA, 2016.
[14] W. Wen et al., "Learning structured sparsity in deep neural networks," in NIPS, 2016.
[15] J. Lee et al., "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE JSSC, 2019.

