ReRAM-Based In-Memory Computing For Search Engine and Neural Network Applications
HALAWANI et al.: ReRAM-BASED IN-MEMORY COMPUTING FOR SEARCH ENGINE AND NN APPLICATIONS 389
390 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 9, NO. 2, JUNE 2019
The design checks the input voltage that represents the data entry against the conductance of one memristor, while the second memristor stores its complement. The matching operation is performed according to the XNOR Boolean expression V_A R_B + V̄_A R̄_B. The truth table of the XNOR cell is presented in Table I. The input voltage levels V_A and V̄_A can be selected based on the usage model. For example, for BCAM and TCAM, V_A = 0.6 V and V̄_A = 0 V, or vice versa. In the case of similarity search, where the comparison result is written into R_OUT, V_A = 1 V and V̄_A = 0 V, or vice versa; this enables R_OUT to be written within the given time. It is
worth noting that the voltages used in these simulations were
chosen to achieve fast switching with good noise margin based
on reported real devices [18]. The 1 V was selected because it
allows a full switching of the output memristor when all inputs
match with the databases/filters. On the other hand, 0.6 V was
chosen because it provides an acceptable distinction between
matching and mismatching cells in case of BCAM/TCAM
where an output memristor is not used. The output memristor
is initialized to ROFF . When both the applied input voltage
and the stored data are matching, the voltage drop across the
accumulation memristor is high. Only when a match occurs will it be pushed to RON. It was chosen this way since, most of the time, there will be more mismatches than matches, and hence the energy will be minimized. Otherwise, in the case of a mismatch, the voltage drop will be small and the output memristance will stay at ROFF. This results in better overall power efficiency. The output voltage can be calculated through the following equation:

$$V_{OUT} = V_A\,\frac{\bar{R}_B}{\bar{R}_B + R_B\left(1+\frac{\bar{R}_B}{R_{OUT}}\right)} + \bar{V}_A\,\frac{R_B}{R_B + \bar{R}_B\left(1+\frac{R_B}{R_{OUT}}\right)} \qquad (1)$$

Fig. 3. Conventional CMOS-CAM array where each cell consists of a 6T SRAM cell and a dynamic comparator to compute the match/mismatch signal.

B. Read Operation

In order to read the patterns stored in the memristors, the corresponding voltage can be applied to the stored data only, as there is no need to read its complement. If the output voltage is high, then the stored bit is RON; otherwise, the stored bit is ROFF. Read operation of the stored patterns is rarely performed for CAM-type applications. If it is to be done for the whole CAM structure, it must follow a bit-wise read operation. This is a limitation of the proposed CAM structure, as it requires multiple cycles to read a single database. Hence, a memristor-based memory crossbar holding the same stored database can be used in parallel to fetch the database with the highest similarity score.

III. SEARCH ENGINE SYSTEM ARCHITECTURES

A. State-of-the-Art Systems

Fast search engines are required for real-time decision making in various fields, including computer vision, machine learning, and object recognition. For IoT and similarly resource-constrained devices that need to implement fast search engines, it is of paramount importance to keep both area and energy costs at a minimum. Conventional CMOS-based search engines suffer from density and power limitations.

Most systems use software-based search engines, such as associative lookup tables (LUTs). Such implementations are restricted to the size of the available physical memory, which leads to higher latency due to the need for external memory access [19]. These algorithmic software-based approaches are usually suitable for general-purpose applications, but cannot be utilized for IoT devices that need to process data in real time. Application-specific integrated circuit (ASIC) hardware implementations are much faster than software approaches, and hence more appropriate for real-time search operations [20], [21]. ASICs also utilize a special type of CAM that can perform a lookup operation in a single clock cycle.

Conventional CMOS-based CAM, depicted in Fig. 3, is composed of a memory element combined with comparison circuitry [20] that facilitates parallel search operation. The data is stored in the memory using a large number of transistors (10+) per bit. Once the input data is fed into the CAM, the search operation is performed using precharge-evaluate operations, similar to an SRAM read operation [22]. After that, the address of the matching data is returned (or the matching data itself in the case of associative memory).

BCAM and TCAM are two widely used search memories. In the former, exact-match searches are performed, while in the latter, don't cares are further utilized to improve matching performance by performing partial or range matches [23].

The major challenges for conventional CMOS-based CAM and TCAM designs are their low density and high power consumption. Implementations utilizing a pre-charge phase increase power consumption and complexity [24]. Further research into lower-power and higher-density cells, to align with the continuing rapid growth in data, has been sought. Moreover,
for media applications that deal with 2D images, circuit complexity increases as the image size increases. Traditional CMOS implementations continue to be more expensive in terms of area, power, and processing time. For example, in [25] a 2D pattern is divided into sub-patterns whose size must be kept small to be manageable for processing; the architecture becomes impractical as the pattern size increases.

To address the highlighted challenges, there is growing interest in utilizing emerging non-volatile nano-devices for search and match operations. In [26], the memory core is composed of a CMOS transistor and a phase-change memory (PCM) in a 1T1R structure, where the PCM is used to store the data. This is followed by a differential sense amplifier to change the stored value into a corresponding voltage value (i.e., GND or VDD). Then, a comparator is used to compare the stored data with the input search voltage. The non-volatile element is not part of the computation and is only used as a storage element. Researchers in [27] have proposed the use of STT-RAM in CAM/TCAM. The design uses 1T1MTJ per cell on each bitline and one reference resistor. The architecture suffered from long delay. Hence, in [28] 14 transistors and a pentaMTJ are utilized to obtain a sub-1 ns TCAM cell. However, PCM is temperature sensitive [29], and the MTJ has a low noise margin between the two resistance states and a long search delay [30], which negatively affect the performance. In addition to the aforementioned devices, researchers are looking into utilizing the memristor for CAM applications. Some of the major resistive-based CAM cell implementation studies utilize stateful logic, while others use complementary resistive switches (CRS). Moreover, the designs vary in the numbers of both nano-devices and transistors used to build the bit-cell memory. Other implementations focus more on the endurance issue and speed while relaxing the energy consumption.

A memristor-based TCAM (mTCAM) cell consisting of five transistors and two memristors is proposed in [3]. The stored data is represented by the resistance values of the memristive devices as the data and its complement. Four search lines are used per cell, and only three are effective during search mode. Initially, match lines are precharged high. Delays from the four search lines, the TCAM cells with 5T2R structure, and the ML contribute to an increase in the search latency as well as the energy consumption.

In [2], a transistor-free memristor TCAM structure for comparing input data against a bank of stored words is presented. This architecture can be used for intrusion detection applications, where packets are scanned to determine whether they are malicious. For N bits, the number of memristors required is 4N+2. The design has shown a ∼360× reduction in area with competitive timing and energy compared to CMOS-based CAM. In that structure, each bit requires 4 memristors: one connected to the input voltage, the second to the input voltage complement, and the third and fourth connected to two bias voltages to account for the variations of the memristance value. This is in addition to two bias voltages per column.

Another approach for search using implication logic was proposed in [19]. In this approach, eleven steps per CAM cell are required to generate a cell match signal. In [31], CRS-based stateful logic operations using material implication are presented. For a two-input XOR implementation, the design requires 4 CRS (8 memristor devices) and 6 steps to compute the operation.

In [32], the search operation is performed by first programming the CRS array with a template pattern. Then, each match line (ML) is pre-charged by a global reset signal. After that, the ML is discharged by the result of a correlation based on the Hamming distance between the two input images. Moreover, in [33] the search engine consists of two CRS and cross-coupled switches with an analog correlator that flags the Hamming distance for the image-recognition and classification process.

Approximate TCAM (ACAM) is proposed for online learning in [30]. The authors use MTJs to tackle the endurance issue associated with the wear-out of memories for online learning. The architecture consists of a 5T-4MTJ TCAM followed by an STT-RAM memory. Once the searched input data is found, a clock gating signal is activated to stop the processor computation, and finally the corresponding line of the STT-RAM memory is read to retrieve the precomputed output data.

B. Proposed Search Engine System Architecture

The proposed VR-XNOR cell is used as the main building block for the memristor-based CAM array architectures. The crossbar architecture shown in Fig. 4 enables parallel lookups of multiple inputs in multiple CAM banks. For N input bits, the system requires 2N memristor devices to perform a comparison, plus an analog memristor accumulator at the end of each column. Each data entry thus requires a representation of its value and its complementary value. Input data is represented as applied voltages, while the stored data is represented by the conductance values of the memristors. Each column corresponds to a different database.

Multiple bank lengths can be traded off between the number of bits per shared match line and the output voltage difference (in mV), or output resistance, between matching and mismatching cells. As the bank length increases, the voltage difference between the all-matching case and a single mismatch starts to decrease, until it reaches almost zero when all bits mismatch and hence no switching occurs in the output memristor. This is due to the fact that, when there is a mismatch, the high-resistance bits still contribute to the output voltage, and the combination of matching and mismatching cells results in a reduced voltage difference.

The output bank reading circuitry is illustrated in Fig. 4(c). Depending on the desired application, as explained previously in Section II: if the architecture is used for BCAM/TCAM, then the output voltage is fed directly into the comparator, as shown in part (a) of the figure. If a similarity search is performed instead, the switch is initially closed to write the output into the memristor. The next step is to evaluate R_OUT, so the switch is opened to read its value without disturbing the rest of the cells and feed it into the comparator, as demonstrated in part (b) of the figure.

Logic '1' corresponds to the memristor ON state, RON, while logic '0' corresponds to the memristor OFF state, ROFF.
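The match/mismatch behavior of a VR-XNOR column, where a matching input leaves a large voltage drop that switches the accumulation memristor, can be sanity-checked numerically. The sketch below solves the output node by superposition, which is a conductance-form rearrangement of Eq. (1); the resistance values are illustrative placeholders, not the paper's device parameters.

```python
# Numeric sanity check of the VR-XNOR column node of Eq. (1). V_A drives
# R_B, the complement input drives R_B', and the accumulation memristor
# R_OUT ties the node to ground. Device values below are illustrative
# placeholders, not the paper's fitted parameters.

def vout(va, va_bar, rb, rb_bar, rout):
    """Output node voltage by superposition (conductance form of Eq. (1))."""
    g_total = 1 / rb + 1 / rb_bar + 1 / rout
    return (va / rb + va_bar / rb_bar) / g_total

RON, ROFF = 1e3, 1e6  # illustrative ON/OFF memristances

# Stored '1' (R_B = RON, R_B' = ROFF) probed with a matching '1' input
v_match = vout(1.0, 0.0, RON, ROFF, ROFF)
# Same stored bit probed with a mismatching '0' input
v_mismatch = vout(0.0, 1.0, RON, ROFF, ROFF)
# A match pulls the node close to 1 V, pushing R_OUT toward RON;
# a mismatch leaves only a small voltage, so R_OUT stays at ROFF.
```

With these placeholder values the matching case yields a node voltage near 1 V and the mismatching case a voltage near 0 V, reproducing the wide switching margin described above.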
Fig. 4. Memristor-based multiple-bank CAM architecture shown in (a) and its digital equivalence shown in (b). The readout circuitry schematics for the BCAM/TCAM configuration and similarity search are shown in (c) (en stands for the enable signal and Vref is the reference voltage). The switch SW configures whether the output is a voltage for BCAM and TCAM (SW=OFF) or a resistance in the case of similarity search (SW=ON).
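The digital equivalence of Fig. 4(b), a per-bit XNOR whose results are accumulated per column, can be sketched in a few lines; a popcount stands in for the analog accumulation memristor, and the stored words below are made-up examples.

```python
# Digital equivalence of the multi-bank CAM lookup in Fig. 4(b): bitwise
# XNOR per cell, with a popcount standing in for the analog accumulation
# memristor. The stored words are made-up examples.

def xnor_score(query, stored):
    """Number of matching bit positions between two equal-length words."""
    return sum(1 for q, s in zip(query, stored) if q == s)

def search(query, banks):
    """Return (bank index, score) of the stored word most similar to query."""
    scores = [xnor_score(query, bank) for bank in banks]
    best = max(range(len(banks)), key=scores.__getitem__)
    return best, scores[best]

banks = [(1, 0, 1, 1), (0, 0, 1, 0), (1, 1, 1, 1)]  # one word per column

# BCAM-style exact match: score == word length flags a hit; any lower
# score is the similarity measure used for nearest-neighbour search.
hit, score = search((0, 0, 1, 0), banks)
```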
IV. CONVOLUTIONAL NEURAL NETWORKS APPLICATION

A. Basics of CNN

Convolutional neural networks (CNNs) are found to be superior to other standard feedforward neural networks in terms of accuracy. Such networks mimic the locality found in the human visual system (HVS) and have much fewer connections and parameters; as a result, they are easier to train [37]. CNN structures are divided into two main stages: feature extraction followed by classification. The following 4 main layers are used in the implementation of a CNN:
• Convolution (ConV) layer: gives the filters the ability to capture (detect) as many generic features as possible from the training dataset that would help in classification (for example) of the testing dataset. It consists of sliding N filters, each of size k×k with randomly initialized weights, over a set of training images. This step produces feature

flattened to produce a 1D array to check the effect or significance of each feature on the final classification stage for the different classes. All neurons are connected to all output neurons in the output layer.

The last step is to calculate the classification error and update the weights using the backpropagation algorithm according to the chain rule.

Convolutions are the computationally intensive operations, as they nominally account for nearly 90% of the total processing task [38], with heavy reliance on floating-point matrix multiplication. This causes an issue with general-purpose processors, especially given their limited cache size. Hence, graphics processing units (GPUs) have been widely used to accelerate the convolution operation due to their high parallelism and floating-point performance [39]. Nonetheless, GPUs suffer from high power consumption, which makes them expensive to deploy during the inference/testing phase
of CNNs, especially in mobile systems that require real-time operation and low power consumption [40].

Thus, field-programmable gate array (FPGA)- and application-specific integrated circuit (ASIC)-based CNN accelerators are increasingly being used to achieve high accuracy with low power consumption and greatly reduced computational time [39]. On the other hand, several schemes and versions of CNNs, such as neuron pruning and weight quantization, are being exploited in order to align with the rise of IoT and mobile devices [41].

Approximating the full-precision floating-point weights through quantization has been studied lately. Lowering the precision and utilizing an extremely compact data representation, such as the binarized CNN (BNN), has shown the ability to reduce memory resources and decrease computational time [42]. In [41], the authors have shown a 58× improvement in convolution operations and 32× memory savings when both the weights and inputs are binarized, with a 12.5% accuracy loss compared to the full-precision AlexNet. Moreover, in [43] the network biases were removed, as most of them had magnitudes less than 1 after quantization, which did not affect the accuracy. As a consequence of the binarization, the convolution operation (multiply and add modules) is reduced to XNOR gates and bitcount adders, realizing the convolution efficiently, especially for systems with limited resources [44]. Nonetheless, the neural weights still need to be moved from the memory to the computational unit, which creates timing and power overhead. It is appreciated that the binarization process causes accuracy loss, and hence there are efforts to reduce this impact. For example, the authors in [40] proposed the use of the sign() function with a shift parameter inside in order to approximate the full precision using binary bases. Thus, accuracy loss was reduced to around 5% compared to state-of-the-art implementations.

The proposed method reduces training time significantly, since the movement of weights has been eliminated in addition to the binarization process [41]. Moreover, it reduces the area by performing an unsigned bitwise operation instead of using two crossbars to store negative and positive weights with a subtract operation for every complementary bitline output [4].

Memristor crossbars have been widely used as neural network accelerators, as they perform in-memory computations on the stored weights and hence eliminate their movement. For example, in [45] the authors presented a memristor-based architecture for on-chip training of deep neural networks (DNNs). The architecture utilizes two memristors per synapse for higher weight precision. Moreover, the backpropagation algorithm was used to train the autoencoders and to fine-tune the pre-trained weights. In addition, in [46] the authors proposed the use of memristor-based CNNs to parallelize the recognition phase. They assumed that the crossbars are already trained and stored as conductance values. The crossbars were used for analog vector-matrix multiplication acceleration, with each weight represented by two memristors to account for both the positive and negative values. The filters were expanded into large sparse matrices. In [47], the authors presented the ISAAC architecture, an analog CNN accelerator where tiles containing memristor crossbars are partitioned across the layers. The dot-product operations involved in the convolution and classifier layers are performed on crossbar arrays; those results are then converted to a digital representation and aggregated in output registers after any necessary shift-and-adds.

Moreover, Ni et al. presented in [48] a binary vector-matrix multiplication where two device pairs represent AND and OR operations, which translate to binary multiply and add operations, respectively. This is followed by a voltage comparator with a reconfigurable threshold voltage. The convolution part requires a 2N×N ReRAM crossbar per filter. The authors compared their RRAM-based BNN system to other hardware implementations (GPU, FPGA, and CMOS), showing four orders of magnitude, 4155×, and 62× less power, respectively, under similar accuracy. A digital RRAM-based convolutional block capable of performing dot-product operations in a single cycle is proposed in [49]. The main cell consists of 4 transistors and 1 RRAM (4T1MR) component. The cell employs two pairs of NMOS and PMOS transistors in order to trigger a set or reset process to program the memristor. The RRAM crossbar is followed by an XNOR sensing circuit and then a combinational bitcount circuit.

Increasing the number of layers increases the accuracy, as it helps to represent complex features from local low-level ones. Nonetheless, this also increases the number of multiply-and-accumulate (MAC) operations, as each connection to a neuron represents a MAC operation. It has been reported in [38] that AlexNet, which consists of 5 convolutional layers, requires 666 million MACs per 227×227-pixel image, while the VGG16 architecture, which uses 13 convolutional layers, requires 15.3 billion MACs per 224×224-pixel image.

B. Proposed CNN Architecture

One of the drawbacks of using analog memristors to carry out computations is that the stored data will likely have less precision compared to a typical 32-bit floating-point representation [50]. Variations in memristors do impose a challenge for analog computation. On the other hand, utilizing a memristor for digital operation (high resistance, low resistance), as is the case for BNNs, provides more robust computation in the face of device variability. As for the endurance of the devices, the read-to-write ratio is large, as reported in the literature [18]. Moreover, the devices are programmed once after training in the case of CNN applications, and once for the databases in the case of CAM. Hence, the endurance effect is minimal in the presented case.

To the best of our knowledge, this is the first paper to present a VR-XNOR memristor-based template matching circuit for BCNN with 2N memristor devices per bit, in addition to the ability to perform the pooling operation together with the convolution process.

The storage memristor, which is used to store the feature map values, is initialized to ROFF. This is because the feature filters do not exist in all patches of the input image set, and consequently there will be more mismatches than matches. So, from an energy point of view, initializing the output device to ROFF minimizes the energy and results in better overall power
TABLE III
VTEAM MEMRISTOR MODEL PARAMETERS FOR COMPUTATIONAL DEVICES AND STORAGE DEVICES
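To give a feel for how a VTEAM-style device of the kind parameterized in Table III behaves, the sketch below Euler-integrates its threshold-type state equation. The window functions f_on/f_off are omitted for brevity, and every parameter value is an illustrative placeholder, not one of the fitted values in Table III.

```python
# Euler-integration sketch of a VTEAM-style threshold device (the model
# behind Table III). The window functions f_on/f_off are omitted and every
# parameter value is an illustrative placeholder, not a fitted value.

def vteam_dwdt(v, k_on=-10.0, k_off=5e-4, v_on=-0.2, v_off=0.02,
               a_on=3, a_off=3):
    """State derivative dw/dt; the state moves only beyond a voltage threshold."""
    if v > v_off:
        return k_off * (v / v_off - 1) ** a_off
    if v < v_on:
        return k_on * (v / v_on - 1) ** a_on
    return 0.0

def simulate(v, steps=1000, dt=1e-9, w=0.0, w_max=3e-9):
    """Integrate the state under a constant applied voltage, clamped to the window."""
    for _ in range(steps):
        w += vteam_dwdt(v) * dt
        w = min(max(w, 0.0), w_max)
    return w

w_set = simulate(1.0)    # well above v_off: state saturates at w_max (ON)
w_idle = simulate(0.01)  # inside the thresholds: state never moves
```

The threshold behavior is what makes the read/compute distinction in the paper possible: sub-threshold read voltages leave the stored state untouched, while the 1 V writing level drives the output device fully ON.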
TABLE IV
TRADE-OFFS BETWEEN DIFFERENT FILTER SIZES AND MATCHING MARGIN FOR 2×2 AND 3×3 SIZES
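The filter-size versus matching-margin trade-off summarized in Table IV can be reproduced with a simple resistive-divider model of a column: matching cells pull the shared node toward the 1 V rail through RON, while mismatching cells pull it toward ground. The values below are illustrative, not the simulated device parameters.

```python
# Divider-model sketch of the Table IV trade-off: in a column of n bit-cells,
# each matching cell pulls the shared node toward the 1 V input through RON,
# while each mismatching cell pulls it toward 0 V. Values are illustrative.

RON, ROFF, ROUT = 1e3, 1e6, 1e6

def column_voltage(n, m):
    """Shared-node voltage for n bit-cells with m matches (1 V inputs)."""
    g_to_one = m / RON + (n - m) / ROFF   # conductance toward the 1 V rail
    g_to_zero = m / ROFF + (n - m) / RON  # conductance toward ground
    return g_to_one / (g_to_one + g_to_zero + 1 / ROUT)

def matching_margin(n):
    """Voltage gap between an all-match column and a single-bit mismatch."""
    return column_voltage(n, n) - column_voltage(n, n - 1)

# Longer columns (bigger filters) shrink the margin: 2x2 -> 4 bits, 3x3 -> 9.
m_2x2, m_3x3 = matching_margin(4), matching_margin(9)
```

Under this toy model the 9-bit (3×3) column shows a visibly smaller full-match/single-mismatch margin than the 4-bit (2×2) column, matching the trend the text describes.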
It can be noted that the energy decreases as the number of bits per column increases (larger filter size). This is because having more voltage sources causes the output memristor to switch slightly faster than with the smaller filter size. Moreover, the energy decreases as the number of matching cells within the same filter increases, since the current is limited by the memristor storing the feature map values.

Fig. 6. Illustration of a convolution where the input is a binary image patch compared against two filters stored in the memristor crossbar.

Fig. 7. Illustration of the output from 2 filter banks, where one matches the input patch 100% and the other matches with 55.5%.

Different 2D filter sizes were investigated, as shown in Table IV, after reshaping them into a 1D array. The sizes of 2×2 and 3×3 used in these simulations were selected because they are widely used for CNNs; nonetheless, other sizes such as 5×5 and 11×11 have been reported in the literature [55]. The more parallel pairs added to the structure, the lower the total resistance becomes, reducing the differentiable voltage margin between matching and mismatching cases. Hence, there is a trade-off between the filter size and the corresponding output resistance during matching and mismatching operation. As the filter size increases, the column length increases, and the voltage difference between the all-matching case and a single mismatch starts to decrease until it reaches almost zero when all bits mismatch. This is captured in the output resistance value. It means that the learnt feature stored in the filter does not exist in that part of the input image, while the high-resistance bits still contribute to the output voltage. With each mismatching bit between the filter and the image patch, the voltage drops by a step whose value depends on the length of the column. So the final value of the output memristor, which stores the feature map value, will correspond to the number of matching bits. In other words, the amount of current produced in each column represents the similarity between the input image patch and the feature in the filter.

An example of 2 different filters with two different features of size 3×3, along with the feature map output, is demonstrated in Fig. 6. The input is a binarized image patch, converted to 1 V if the pixel is white and 0 V if it is black. The outputs from the filters represent how much the input resembles the learned features and are stored directly into the accumulative memristor. When the feature learned by the second filter matches the image patch 100%, the memristor is turned fully ON. When the learned feature in the first filter differs from the one presented in the image patch, the conductance does not change much, as illustrated in Fig. 7.

In order to account for the savings in the number of operations, let us assume the following scenario: an input image has a dimension of 249×249 and a filter of size 3×3 with a stride of 3. If implemented in an ASIC, 9 multiplications and 9 additions, including the bias, are required for each patch convolution. Hence, when convolving one image with a single filter, it would require (9×83×83) multiplications + (9×83×83) additions. If a memristor is used to implement the filter bank instead, only 83×83 multiply-and-add operations are required for the same scenario. This shows a reduction in the number of operations by 18× for one filter. So if 96 filters were used, as in AlexNet [38], a significant reduction of 4 orders of magnitude is observed when the proposed architecture is utilized. This is because the filter bank is capable of calculating the results for N feature maps in parallel.

V. CONCLUSIONS AND FUTURE WORK

In this paper, a reconfigurable CAM and a feature extraction architecture for filter learning have been proposed. The main building block is a memristor-based VR-XNOR gate that is followed by an inherent summation, which results in a percentage reflecting the existence of the matching/similar feature. For CNNs, the filter is followed by a pooling layer that downsamples the feature map for a reduction in computational time. The proposed approach provides significant area savings and fast computational time as compared to conventional approaches. In the proposed work, we have focused on the convolution and pooling parts; as a result, a direct comparison
with other implementations in the literature that use full-system implementations is out of the scope of this paper.

Future work will extend the proposed architecture to build a full memristor-based BCNN system that implements all layers. Moreover, the architecture will be expanded to account for binary weights and ternary inputs as a trade-off between complexity and performance. Co-optimization of learning algorithms as well as memristor-based in-memory computing hardware architectures is the key to tackling the accuracy and performance challenges, especially for resource-constrained applications. Besides, multiple stages of the reconfigurable CAM will be implemented for a wider data path.

REFERENCES

[1] A. Basu et al., "Low-power, adaptive neuromorphic systems: Recent progress and future directions," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 1, pp. 6–27, Mar. 2018.
[2] C. Yakopcic, V. Bontupalli, R. Hasan, D. Mountain, and T. Taha, "Self-biasing memristor crossbar used for string matching and ternary content-addressable memory implementation," Electron. Lett., vol. 53, no. 7, pp. 463–465, 2017.
[3] L. Zheng, S. Shin, and S.-M. S. Kang, "Memristors-based ternary content addressable memory (mTCAM)," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2014, pp. 2253–2256.
[4] S. Mittal, "A survey of ReRAM-based architectures for processing-in-memory and neural networks," Mach. Learn. Knowl. Extraction, vol. 1, no. 1, pp. 75–114, 2018.
[5] Y. Halawani, M. A. Lebdeh, B. Mohammad, M. Al-Qutayri, and S. Al-Sarawi, "Stateful memristor-based search architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 12, pp. 2773–2780, Dec. 2018.
[6] D. Soudry, D. Di Castro, A. Gal, A. Kolodny, and S. Kvatinsky, "Memristor-based multilayer neural networks with online gradient descent training," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2408–2421, Oct. 2015.
[7] H. D. Nguyen, J. Yu, L. Xie, M. Taouil, S. Hamdioui, and D. Fey, "Memristive devices for computing: Beyond CMOS and beyond von Neumann," in Proc. 25th IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Oct. 2017, pp. 1–10.
[8] Y. Levy et al., "Logic operations in memory using a memristive Akers array," Microelectron. J., vol. 45, no. 11, pp. 1429–1437, 2014.
[9] L. Xie et al., "Scouting logic: A novel memristor-based logic design for resistive computing," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2017, pp. 176–181.
[10] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "Memristor-based material implication (IMPLY) logic: Design principles and methodologies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 10, pp. 2054–2066, Oct. 2014.
[11] S. Gupta, M. Imani, and T. Rosing, "FELIX: Fast and energy-efficient logic in memory," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 2018, pp. 1–7.
[12] S. Kvatinsky et al., "MAGIC—Memristor-aided logic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 11, pp. 895–899, Nov. 2014.
[13] M. A. Lebdeh, H. Abunahla, B. Mohammad, and M. Al-Qutayri, "An efficient heterogeneous memristive XNOR for in-memory computing," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 9, pp. 2427–2437, Sep. 2017.
[19] Y. Liu, C. Dwyer, and A. R. Lebeck. (2016). "Combined compute and storage: Configurable memristor arrays to accelerate search." [Online]. Available: https://arxiv.org/abs/1601.05273
[20] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: A tutorial and survey," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[21] A. T. Do, C. Yin, K. Velayudhan, Z. C. Lee, K. S. Yeo, and T. T.-H. Kim, "0.77 fJ/bit/search content addressable memory using small match line swing and automated background checking scheme for variation tolerance," IEEE J. Solid-State Circuits, vol. 49, no. 7, pp. 1487–1498, Jul. 2014.
[22] B. Mohammad, P. Bassett, J. Abraham, and A. Aziz, "Cache organization for embedded processors: CAM-vs-SRAM," in Proc. IEEE Int. SOC Conf., Sep. 2006, pp. 299–302.
[23] R. Karam, R. Puri, S. Ghosh, and S. Bhunia, "Emerging trends in design and applications of memory-based computing and content-addressable memories," Proc. IEEE, vol. 103, no. 8, pp. 1311–1330, Aug. 2015.
[24] T. V. Mahendra, S. Mishra, and A. Dandapat, "Self-controlled high-performance precharge-free content-addressable memory," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2388–2392, Aug. 2017.
[25] S.-I. Chae, J. T. Walker, C.-C. Fu, and R. F. Pease, "Content-addressable memory for VLSI pattern inspection," IEEE J. Solid-State Circuits, vol. 23, no. 1, pp. 74–78, Feb. 1988.
[26] P. Junsangsri, J. Han, and F. Lombardi, "Design and comparative evaluation of a PCM-based CAM (content addressable memory) cell," IEEE Trans. Nanotechnol., vol. 16, no. 2, pp. 359–363, Mar. 2017.
[27] W. Xu, T. Zhang, and Y. Chen, "Design of spin-torque transfer magnetoresistive RAM and CAM/TCAM with high sensing and search speed," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 1, pp. 66–74, Jan. 2010.
[28] M. K. Gupta and M. Hasan, "Design of high-speed energy-efficient masking error immune pentaMTJ-based TCAM," IEEE Trans. Magn., vol. 51, no. 2, pp. 1–9, Feb. 2015.
[29] N. Ciocchini et al., "Bipolar switching in chalcogenide phase change memory," Sci. Rep., vol. 6, Jul. 2016, Art. no. 29162.
[30] M. Imani, Y. Kim, A. Rahimi, and T. Rosing, "ACAM: Approximate computing based on adaptive associative memory with online learning," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), Jun. 2016, pp. 162–167.
[31] Y. Yang, J. Mathew, S. Pontarelli, M. Ottavi, and D. K. Pradhan, "Complementary resistive switch-based arithmetic logic implementations using material implication," IEEE Trans. Nanotechnol., vol. 15, no. 1, pp. 94–108, Jan. 2016.
[32] K. Cho, S. J. Lee, K. S. Oh, C. R. Han, O. Kavehei, and K. Eshraghian, "Pattern matching and classification based on an associative memory architecture using CRS," in Proc. 13th Int. Workshop Cellular Nanosc. Netw. Appl., Aug. 2012, pp. 1–5.
[33] S.-J. Lee, S.-J. Kim, K. Cho, and K. Eshraghian, "Implementation of complementary resistive switch for image matching through back-to-back connection of ITO/ITO2−x/TiO2/ITO memristors," Phys. Status Solidi A, vol. 211, no. 8, pp. 1933–1940, 2014.
[34] L. Zheng, S. Shin, S. Lloyd, M. Gokhale, K. Kim, and S. M. Kang, "RRAM-based TCAMs for pattern search," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2016, pp. 1382–1385.
[35] S. Smaili and Y. Massoud, "Memristor state to logic mapping for optimal noise margin in memristor memories," in Proc. 14th IEEE Int. Conf. Nanotechnol., Aug. 2014, pp. 291–295.
[36] P. M. Sheridan, C. Du, and W. D. Lu, "Feature extraction using memristor networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 27,
[14] S. Shin, K. Kim, and S.-M. Kang, “Resistive computing: Memristors- no. 11, pp. 2327–2336, Nov. 2016.
enabled signal multiplication,” IEEE Trans. Circuits Syst. I, Reg. Papers,
vol. 60, no. 5, pp. 1241–1249, May 2013. [37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
[15] I. Vourkas and G. C. Sirakoulis, “Nano-crossbar memories comprising with deep convolutional neural networks,” in Proc. Adv. Neural Inf.
parallel/serial complementary memristive switches,” BioNanoScience, Process. Syst., Jun. 2012, pp. 1097–1105.
vol. 4, no. 2, pp. 166–179, 2014. [38] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
[16] S. Kvatinsky, N. Wald, G. Satat, A. Kolodny, U. C. Weiser, and efficient reconfigurable accelerator for deep convolutional neural net-
E. G. Friedman, “MRL—Memristor ratioed logic,” in Proc. 13th Int. works,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138,
Workshop Cellular Nanosc. Netw. Appl., Aug. 2012, pp. 1–6. Jan. 2017.
[17] Y. Sun, X. Yan, X. Zheng, Y. Liu, Y. Shen, and Y. Zhang, “Influence [39] E. Nurvitadhi et al., “Can FPGAs beat GPUs in accelerating next-
of carrier concentration on the resistive switching characteristics of a generation deep neural networks?” in Proc. ACM/SIGDA Int. Symp.
ZnO-based memristor,” Nano Res., vol. 9, no. 4, pp. 1116–1124, 2016. Field-Program. Gate Arrays, Feb. 2017, pp. 5–14.
[18] S. Srivastava, P. Dey, S. Asapu, and T. Maiti, “Role of GO and r- [40] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary convolutional
GO in resistance switching behavior of bilayer TiO2 based RRAM,” neural network,” in Proc. Adv. Neural Inf. Process. Syst., Nov. 2017,
Nanotechnology, vol. 29, no. 50, Oct. 2018, Art. no. 505702. pp. 344–352.
[41] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis., Sep. 2016, pp. 525–542.
[42] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proc. Adv. Neural Inf. Process. Syst., Jun. 2016, pp. 4107–4115.
[43] R. Zhao et al., “Accelerating binarized convolutional neural networks with software-programmable FPGAs,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 15–24.
[44] H. Nakahara, H. Yonekawa, T. Sasao, H. Iwamoto, and M. Motomura, “A memory-based realization of a binarized deep convolutional neural network,” in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 277–280.
[45] R. Hasan, T. M. Taha, and C. Yakopcic, “On-chip training of memristor based deep neural networks,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3527–3534.
[46] C. Yakopcic, M. Z. Alom, and T. M. Taha, “Extremely parallel memristor crossbar architecture for convolutional neural network implementation,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 1696–1703.
[47] A. Shafiee et al., “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 14–26, 2016.
[48] L. Ni, Z. Liu, H. Yu, and R. V. Joshi, “An energy-efficient digital ReRAM-crossbar-based CNN with bitwise parallelism,” IEEE J. Explor. Solid-State Comput. Devices Circuits, vol. 3, no. 4, pp. 37–46, May 2017.
[49] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and P.-E. Gaillardon, “A robust digital RRAM-based convolutional block for low-power image processing and learning applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 2, pp. 643–654, Feb. 2019.
[50] C. Yakopcic, M. Z. Alom, and T. M. Taha, “Memristor crossbar deep network implementation based on a convolutional neural network,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 963–970.
[51] S. N. Truong, S. Shin, S.-D. Byeon, J. Song, H.-S. Mo, and K.-S. Min, “Comparative study on statistical-variation tolerance between complementary crossbar and twin crossbar of binary nano-scale memristors for pattern recognition,” Nanosc. Res. Lett., vol. 10, no. 1, pp. 1–9, 2015.
[52] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, “VTEAM: A general model for voltage-controlled memristors,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 8, pp. 786–790, Aug. 2015.
[53] I. Vourkas and G. C. Sirakoulis, “A novel design and modeling paradigm for memristor-based crossbar circuits,” IEEE Trans. Nanotechnol., vol. 11, no. 6, pp. 1151–1159, Nov. 2012.
[54] R. S. Williams. (2012). Finding the Missing Memristor—Keynote Talk at UC San Diego CNS Winter Research Review. [Online]. Available: http://cns.ucsd.edu/files_2010/january_2010/agenda2010winterreivew.pdf
[55] W. Yu, K. Yang, Y. Bai, T. Xiao, H. Yao, and Y. Rui, “Visualizing and comparing AlexNet and VGG using deconvolutional layers,” in Proc. 33rd Int. Conf. Mach. Learn., Jun. 2016, pp. 1–7.

Yasmin Halawani (S’14) received the B.S. degree in electrical and electronics engineering from the University of Sharjah (UOS), United Arab Emirates, in 2012, and the M.S. degree in electrical and electronics engineering from Khalifa University, United Arab Emirates, in 2014, where she is currently pursuing the Ph.D. degree in the area of memristor-based in-memory computing architectures. Her research project focused on investigating the suitability of emerging memory technologies, such as memristors and STT-RAM, for low-power applications.

Baker Mohammad (M’04–SM’13) received the B.S. degree from the University of New Mexico, Albuquerque, the M.S. degree from Arizona State University, Tempe, and the Ph.D. degree from The University of Texas at Austin in 2008, all in ECE. He was a Senior Staff Engineer/Manager at Qualcomm, Austin, TX, USA, for six years, where he was engaged in designing high-performance, low-power DSP processors used for communication and multimedia applications. Before joining Qualcomm, he was with Intel Corporation for 10 years, where he was involved in a wide range of microprocessor designs, from high-performance server chips (>100 W, IA-64) to low-power mobile embedded processors (sub-1 W, XScale). He has over 16 years of industrial experience in microprocessor design, with an emphasis on memory, low-power circuits, and physical design. He is currently an Associate Professor of ECE and the Director of the System-on-Chip Research Center at Khalifa University. His research interests include VLSI, power-efficient computing, high-yield embedded memory, emerging technologies such as memristors, STT-RAM, and in-memory computing, and hardware accelerators for cyber-physical systems. In addition, he is engaged in microwatt-range computing platforms for wearable electronics and WSNs, focusing on energy harvesting, power management, and power conversion, including efficient dc/dc and ac/dc converters.

Muath Abu Lebdeh received the B.Sc. degree in electrical and electronic engineering and the M.Sc. degree in electrical and computer engineering from Khalifa University, in 2015 and 2017, respectively. He is currently pursuing the Ph.D. degree with the Computer Engineering Laboratory, Delft University of Technology. His research interests include CIM circuits and architectures.

Mahmoud Al-Qutayri (M’86–SM’04) received the B.Eng. degree from Concordia University, Montreal, Canada, in 1984, the M.Sc. degree from the University of Manchester, U.K., in 1987, and the Ph.D. degree from the University of Bath, U.K., in 1992, all in electrical and electronic engineering. He was with De Montfort University, U.K., and the University of Bath, U.K. He is currently a Full Professor with the Department of Electrical and Computer Engineering and the Associate Dean for Graduate Studies with the College of Engineering, Khalifa University, United Arab Emirates. He has authored/coauthored numerous technical papers in peer-reviewed international journals and conferences. He also coauthored a book entitled Digital Phase Lock Loops: Architectures and Applications and edited a book entitled Smart Home Systems, in addition to a number of book chapters and four patents. His current research interests include embedded systems design and applications, design and test of mixed-signal integrated circuits, wireless sensor networks, cognitive radio, and hardware security. His professional service includes membership of the steering, organizing, and technical program committees of many international conferences, and serving as a reviewer for a number of journals.

Said F. Al-Sarawi (S’92–M’96) received the general certificate in marine radio communication and the B.Eng. degree (Hons.) in marine electronics and communication from the Arab Academy for Science and Technology (AAST), Egypt, in 1987 and 1990, respectively, and the Ph.D. degree in mixed analog and digital circuit design techniques for smart wireless systems with special commendation in electrical and electronic engineering and the Graduate Certificate in Education (Higher Education) from The University of Adelaide, Australia, in 2003 and 2006, respectively, where he is currently the Director of the Centre for Biomedical Engineering. His research interests include design techniques for mixed-signal systems in micro- and nano-electronic and optoelectronic technologies for high-performance radio transceivers, low-power and low-voltage radio frequency identification (RFID) systems, data converters, and microelectromechanical systems (MEMS) for biomedical applications. He received the University of Adelaide Alumni Postgraduate Medal (formerly Culross) for outstanding academic merit at the postgraduate level and the Commonwealth Postgraduate Research Award (Industry) while pursuing his Ph.D.