
2022 IEEE 14th International Conference on Computer Research and Development (ICCRD)
978-1-7281-7721-2/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICCRD54409.2022.9730467

A Survey for Realizing In-Memory Computing

†WenKun Xiao
International College of CQUPT
Chongqing University of Posts and Telecommunications
Chongqing, China
*wx35@nau.edu

†YangSong Shi
Glasgow College
University of Electronic Science and Technology of China
Chengdu, China
*2019190503030@std.uestc.edu.cn

†These authors contributed equally

Abstract—To resolve the high energy consumption and low efficiency of memory in the traditional Von Neumann architecture, in-memory computing (IMC) was proposed. It performs computing and storage in the same place. A variety of IMC accelerators have been invented. This article introduces IMC technology and describes three typical examples: DRISA, C3SRAM, and Voltage-Controlled Magnetic Tunnel Junctions. The first two models are introduced in detail. All of these models significantly reduce energy consumption and increase data-handling speed, which is especially evident in AI processing. The limitations that require further research are also discussed in this article.

Keywords—In-memory computing (IMC), C3SRAM, DRISA, Voltage-Controlled Magnetic Tunnel Junctions (VCMTJ)

I. INTRODUCTION

Nowadays, computers generally depend on the Von Neumann architecture, in which computing and storage take place in two different locations. Data is taken from the storage unit and transferred to the processing unit, where the calculation takes place; the result is then moved back to the memory unit for storage. In-memory computing (IMC) is a paradigm that computes within the memory itself: the IMC technique embeds the algorithms in the memory cells. This greatly reduces the time and energy required for calculation. In addition to reducing the latency and energy costs of data movement, IMC has the potential to substantially reduce both the calculation time and the complexity of some computing tasks. This is mainly due to the massive parallelism provided by a dense array of millions of memory devices performing the computations. By introducing physical coupling between storage devices, the computational time complexity can be reduced further. By blurring the boundary between processing and memory units, IMC shares this property with highly energy-efficient mammalian brains, in which memory and processing are deeply intertwined [1]. However, this sacrifices the generality of conventional methods, in which the two functions are kept distinct. At present, IMC implementations are based either on volatile memory, including the more mature SRAM and DRAM constructions, or on newer non-volatile storage devices and materials, such as Voltage-Controlled Magnetic Tunnel Junctions.

In this article, we introduce two PIM models for achieving IMC, and we also explain their working principles, including analogue and stochastic computing schemes. Finally, we compare the advantages and disadvantages of the two models and conclude our review.

II. TYPES OF IMC ACCELERATOR

In recent years, several categories of PIM have emerged, such as SRAM-based, DRAM-based, and Voltage-Controlled Magnetic Tunnel Junction (Voltage-Controlled MTJ) designs. This paper introduces two of them in detail.

A. C3SRAM

The C3SRAM is an SRAM module whose bit cells and peripherals embed circuits for hardware acceleration of neural networks with binarized weights and activations [2]. It can assert all of its rows at the same time and uses an analog voltage at the read bitline node [2]. In addition, the C3SRAM includes an analog-to-digital converter (ADC) in each column, enabling the PIM to perform a parallel vector-matrix multiplication in every cycle [2]. Tests show that the C3SRAM saves 3975 times the energy of traditional digital baseline operations [2].

B. DRAM

DRAM has a hierarchical bank-level design, which helps reduce bitline length [3]. Typically, it uses the voltages on capacitors to represent binary data [3,4]. A variety of advanced PIMs are based on DRAM, such as the Newton DRAM accelerator [5], DrAcc [6], and the DRAM-based Reconfigurable In-Situ Accelerator (DRISA) [7]. This paper will focus on the DRISA accelerator. It can perform several functions based on different combinations of simple Boolean logic gates [7]. DRISA achieves an 8.8x speedup with 1.2x better energy efficiency than ASICs, and a 7.7x speedup with 15x better energy efficiency than GPUs [7].

C. Voltage-Controlled Magnetic Tunnel Junctions

The voltage-controlled MTJ uses voltage and current to represent input signals, based on voltage-controlled magnetic anisotropy (VCMA) and the spin Hall effect (SHE), respectively [8]. It can realize various complex functions by combining multiple logic cells [9]. The write and read delays and energy consumption of VCMA-MTJ devices are low [9], which could be profitable for massive integration and application in the future.
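The binarized vector-matrix multiplication that SRAM-based macros such as C3SRAM evaluate in the analog domain reduces, in the digital view, to an XNOR-and-accumulate: with inputs and weights restricted to {-1, +1}, each product is an XNOR and the accumulation a signed count. The following is a minimal behavioral sketch of that operation (an illustrative model, not the circuit):

```python
# Behavioral sketch of a binary multiply-accumulate (bMAC), the core
# operation of BNN in-memory accelerators. In the {-1, +1} encoding,
# XNOR of two operands is simply their product.

def bmac(inputs, weights):
    """Binary MAC: sum of in_i * w_i with in_i, w_i in {-1, +1}."""
    assert len(inputs) == len(weights)
    return sum(i * w for i, w in zip(inputs, weights))

inputs  = [+1, -1, +1, +1]
weights = [+1, -1, -1, +1]
print(bmac(inputs, weights))  # -> 2  (three matches minus one mismatch)
```

A full macro computes one such bMAC per column, over all rows at once, which is where the parallelism and energy savings come from.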

III. C3SRAM

A. The Superiority of C3SRAM

C3SRAM is an IMC SRAM macro that depends on capacitive-coupling computing (C3) [10]. The prototype macro is 256×64 in size and computes 64 256-input binary MAC operations (bMAC) in parallel. The macro can support networks of various sizes when used in a modular fashion. The 65-nm prototype chip demonstrates 671.5-TOPS/W energy efficiency and 1-638 GOPS throughput, a 3975× improvement in energy-delay product (EDP) compared to the digital baseline. The accuracy is about 98.3% for the MNIST data set and 85.5% for the CIFAR-10 data set [2]. The architecture of the C3SRAM IMC macro is shown in Figure 1.

Fig. 1. Architecture of C3SRAM IMC macro [2].

B. Memory Array Operation

Fig. 2 shows the 8T1C bit cell layout of the proposed design, the circuit diagram of two bit cells in one column, and the table of XNOR operands. Under the same logic rule, the bit cell is 80% larger than the traditional 6T bit cell, because the capacitor and the two additional pass transistors occupy 27% of the bit cell area. The capacitor is implemented as a MOSCAP for high capacitance density; a MOMCAP covering the same area as a C3SRAM bit cell would have 80% lower capacitance than Cc (~5 fF) [2].

Fig. 2. C3SRAM bit cell design and in-cell bMAC operand table [2].

C. The bMAC Processing of C3SRAM

Fig. 3 shows the bMAC processing of C3SRAM. Two steps are involved, and each step finishes in half a cycle. In step 1, every column's MBL is precharged via the footer transistor to VRST. To narrow the voltage swing on the MBL nodes, VRST is set close to the voltage corresponding to a bMAC output of 0 (0.4 V in practice), because typical bMAC outputs in BNNs have a narrow distribution close to 0. In this step, the MWL and MWLB of each row are also reset to VRST, so that the voltage across the bit cell capacitors is near 0 V, and the capacitors are connected to the circuit in parallel so that the two nodes share the same reset voltage, as shown in Fig. 3 (bottom left). In step 2, the footer is turned off. The 256 input activations, denoted Ini, are applied to the 256 rows in parallel. If Ini = +1 (-1), MWL is driven from VRST to VDR (VSS), whereas MWLB is driven to VSS (VDR). If Ini = 0, both MWL and MWLB remain at VRST without consuming dynamic power [2]. When the weight is +1 (-1), the voltage ramping via T7 (T8) induces a displacement current through the capacitor in the bit cell, whose magnitude is:

    Ic = Cc · dV_MWL(B)/dt    (1)

Fig. 3. Capacitance-coupling-based in-memory computation of bMAC [2].

The charge transferred from the bit cell to the MBL is:

    Q_Ci = ∫[0, t1] Ic dt = (1/2) · Cc · VDR    (2)

where t1 is the time it takes V_MWL to reach VDR.
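As a quick numeric check of Eq. (2): the (1/2)·Cc·VDR result follows if the word line ramps from VRST to VDR with VRST at half of VDR. The sketch below assumes VRST = VDR/2 with VDR = 0.8 V (illustrative values chosen to match the stated 0.4 V reset level; they are not claimed as the measured chip parameters apart from Cc ≈ 5 fF):

```python
# Numeric check of Eq. (2): charge transferred per active bit cell.
# Assumptions (labeled, not from the source): VDR = 0.8 V, VRST = VDR/2.

C_c  = 5e-15    # coupling capacitance per cell, ~5 fF (from the text)
VDR  = 0.8      # assumed drive voltage; VRST ~ 0.4 V implies VDR ~ 0.8 V
VRST = VDR / 2  # reset / precharge level

# Integrating Ic = Cc * dV/dt over the MWL ramp from VRST to VDR gives
# Q = Cc * (VDR - VRST) = 0.5 * Cc * VDR, i.e. ~2 fC per matching cell.
Q = C_c * (VDR - VRST)
print(Q == 0.5 * C_c * VDR)  # -> True
```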

The shared MBL voltage then settles to:

    V_MBL = Cc · VDR · Σ_{i=1}^{256} XNOR_i / (256·Cc + Cp)    (3)

Through these formulas, the MOSCAP's high capacitance density provides a bMAC transfer curve with a wider full-scale range (FSR) than the less capacitance-dense MOMCAP does [2]. Given the same level of MBL parasitics, the FSR loss with a MOMCAP would be about 80% higher than with a MOSCAP.

D. Advantages and Limitations of C3SRAM

For the speed and power of this method to be fully exploited, techniques such as those used in Vesti [11] are also necessary to stop the memory macros from waiting for the next input; otherwise a lot of time is lost in activation data movement. Therefore, rather than as a direct substitute for SRAM in the traditional von Neumann architecture, C3SRAM is best used as a stand-alone module.

TABLE I. ACCURACY COMPARISON

Dataset  | Neural Network | Network Topology                                                           | Baseline Accuracy | Test Chip
MNIST    | MLP            | 784FC-512FC-512FC-512FC-10FC                                               | 98.7%             | 98.3%
CIFAR-10 | VGG-like CNN   | 128C3-128C3-MP2-256C3-256C3-MP2-512C3-512C3-MP2-1024FC-1024FC-10FC         | 88.6%             | 85.5%

Table I compares the C3SRAM's accuracy on MNIST and CIFAR-10. The accuracy results are obtained by directly measuring the whole process. The table shows that C3SRAM achieves high energy efficiency and throughput, and that the C3SRAM mapping could be used as a computational primitive for larger BNNs on harder machine learning tasks [2].

IV. A DRAM-BASED PIM ACCELERATOR: DRISA

A. DRISA

Of the several DRAM-based PIMs mentioned in Section II, this section focuses on a novel PIM named the DRAM-based Reconfigurable In-Situ Accelerator (DRISA) [7]. The Ambit DRAM structure is shown in Figure 4.

Fig. 4. The hierarchical structure of Ambit [6].

In Ambit, three DRAM cells are used to accomplish simple logic computing. In Figure 4, when 'C' is 0, the output is A AND B; when 'C' is 1, the output is A OR B [6]. DRISA, in turn, consists of a mass of DRAM memory arrays that can perform Boolean logic operations and realize various functions by combining them in different ways [7].

B. Architecture of DRISA for Computing

There are two computing solutions: the 3T1C solution and the 1T1C solution [7]. Compared to traditional DRAM, the former uses the cells themselves for computing; each cell is composed of two read/write transistors and a decoupling transistor for the capacitor [7,12]. The latter adds extra computing circuits attached to the SAs [7]. However, both methods share the same shift design, and both solutions enable in-situ computing [7].

C. Architecture of DRISA for Data Movement

For data movement, inter-lane SHF circuits, shown in Figure 5, are used within subarrays; they shift a row to and from its adjacent lane. Lane-FWD circuits, shown in Figure 6, are used for moving data between any two lanes [7].

Fig. 5. The inter-lane SHF circuit [7].

This configuration helps save space, since all lanes share one data-shifting wire. Data can be moved both left and right based on the ping/pong phases, which decide whether the odd or even shift occurs [7]. As Figure 5 shows, the four parallel blue lines are even, ping, odd, and pong, respectively.

Fig. 6. The lane-FWD circuit [7].

The configuration shown in Figure 6 is the lane-FWD circuit. It supports random reads and writes from or to any arbitrary lane [7].
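The triple-row activation used by Ambit (Section IV.A) amounts to a bitwise majority vote across the three rows: MAJ(A, B, C) = AB + BC + CA, so fixing the control row C to all-zeros yields A AND B, and fixing it to all-ones yields A OR B. A minimal behavioral sketch (a logic-level model, not the charge-sharing circuit itself):

```python
# Behavioral model of Ambit's triple-row activation: the sense amplifier
# resolves each bit position to the majority value of the three rows.

def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width row values."""
    return (a & b) | (b & c) | (c & a)

A, B = 0b1100, 0b1010
print(bin(maj3(A, B, 0b0000)))  # control row C = 0 -> A AND B = 0b1000
print(bin(maj3(A, B, 0b1111)))  # control row C = 1 -> A OR  B = 0b1110
```

Because the control row selects the function, the same array hardware reconfigures between AND and OR, which is the building block DRISA composes into larger Boolean functions.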

D. Architecture of DRISA for Controllers

The controllers of DRISA have four levels: chip-, group-, bank-, and subarray-level controllers [4,7]. The first two levels are responsible for decoding and data movement, and the bank level converts the instructions and µ-operations into addresses, vector lengths, and control codes [7]. The subarray level has address latches, decoders, and counters. In addition, the cell array is separated into data and compute regions, which helps reduce the overhead to 19.02% [7]. However, a previous article introduced three kinds of bank scheduling: all-bank, per-bank, and bank-group; the matrix sizes would also influence the performance scalability [6].

Since DRISA is a co-processor, its integration methods are the same as for a GPU or FPGA [7]. On the software side, however, it requires a special framework programming language as well as a corresponding compiler, like CUDA or an API and SDK. For hardware support, both PCIe and DIMM integration are feasible: PCIe offers consistent power delivery and a sophisticated control system, while DIMM integration requires DRISA to look like DDR at the interface while functioning as an accelerator [7].

E. Advantages and Limitations of DRISA

DRISA does not face the charge-leak challenge, as every bitwise logic operation is preceded by copying the data to the source and target rows [7,13]. In addition, on the acceleration side, DRISA shows an 8.8x speedup and 1.2x better energy efficiency compared to ASICs, and a 7.7x speedup and 15x better energy efficiency than GPUs [7].

However, DRISA has limitations in floating-point calculation. Since all lanes in a subarray share the same controller but floating-point calculation is data-dependent in the controller [4,7], the accelerator gains little on floating-point workloads [7].

V. CONCLUSION

In this paper, we have discussed two different PIM models for achieving in-memory computing, along with their advantages and disadvantages. Our study also explains how they work; therefore, this paper can help people select an appropriate model for achieving IMC in the future and gain a deeper understanding of the process of realizing IMC.

REFERENCES

[1] D. Ielmini and H.-S. P. Wong, "In-memory computing with resistive switching devices," Nat. Electron., vol. 1, pp. 333-343, 2018, doi: 10.1038/s41928-018-0092-2.
[2] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, "C3SRAM: In-memory-computing SRAM macro based on capacitive-coupling computing," IEEE Solid-State Circuits Lett., vol. 2, no. 9, pp. 131-134, Sep. 2019.
[3] W. J. Lee, C. H. Kim, Y. Paik, J. Park, I. Park, and S. W. Kim, "Design of processing-'inside'-memory optimized for DRAM behaviors," IEEE Access, vol. 7, pp. 82633-82648, 2019, doi: 10.1109/ACCESS.2019.2924240.
[4] W. J. Lee, C. H. Kim, Y. Paik, J. Park, I. Park, and S. W. Kim, "Design of processing-'inside'-memory optimized for DRAM behaviors," IEEE Access, vol. 7, pp. 82633-82648, 2019, doi: 10.1109/ACCESS.2019.2924240.
[5] M. He et al., "Newton: A DRAM-maker's accelerator-in-memory (AiM) architecture for machine learning," in Proc. 53rd Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2020, pp. 372-385, doi: 10.1109/MICRO50266.2020.00040.
[6] Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, "DrAcc: A DRAM based accelerator for accurate CNN inference," in Proc. 55th ACM/ESDA/IEEE Design Automation Conf. (DAC), 2018, pp. 1-6, doi: 10.1109/DAC.2018.8465866.
[7] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "DRISA: A DRAM-based reconfigurable in-situ accelerator," in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2017, pp. 288-301.
[8] H. Zhang, W. Kang, L. Wang, K. L. Wang, and W. Zhao, "Stateful reconfigurable logic via a single-voltage-gated spin Hall-effect driven magnetic tunnel junction in a spintronic memory," IEEE Trans. Electron Devices, vol. 64, no. 10, pp. 4295-4301, Oct. 2017, doi: 10.1109/TED.2017.2726544.
[9] L. Wang et al., "Voltage-controlled magnetic tunnel junctions for processing-in-memory implementation," IEEE Electron Device Lett., vol. 39, no. 3, pp. 440-443, Mar. 2018, doi: 10.1109/LED.2018.2791510.
[10] J. Singh, S. P. Mohanty, and D. K. Pradhan, "Introduction to SRAM," in Robust SRAM Designs and Analysis. New York, NY: Springer, 2013, doi: 10.1007/978-1-4614-0818-5_1.
[11] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement," in Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 141-142.
[12] A. Raha et al., "Quality configurable approximate DRAM," IEEE Trans. Comput., vol. 66, no. 7, pp. 1172-1187, 2017.
[13] M. Son et al., "Enhancement of DRAM performance by adopting metal-interlayer-semiconductor source/drain contact structure on DRAM cell," IEEE Trans. Electron Devices, vol. 68, no. 5, pp. 2275-2280, 2021.
