
388 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 9, NO. 2, JUNE 2019

ReRAM-Based In-Memory Computing for Search Engine and Neural Network Applications

Yasmin Halawani, Student Member, IEEE, Baker Mohammad, Senior Member, IEEE, Muath Abu Lebdeh, Mahmoud Al-Qutayri, Senior Member, IEEE, and Said F. Al-Sarawi, Member, IEEE

Abstract— Resource-constrained computing devices such as those used in IoTs require low power, high performance, and small size to operate efficiently. Resistive random access memory (ReRAM) is a promising technology for building novel in-memory computing architectures, due to its ability to perform storage and computation using the same physical element with low energy and high density. ReRAM-based search engines and neural network (NN) accelerators have grown significantly, especially for IoT devices. In this paper, we propose a memristor-based voltage-resistance XNOR (VR-XNOR) cell. The advantages of this cell are demonstrated through building a reconfigurable content-addressable memory (CAM) architecture that can support binary CAM (BCAM) and ternary CAM (TCAM) and enable approximate search operations. Moreover, the memristor-based VR-XNOR cell is utilized for binarized convolutional neural networks (CNN), with focus on the convolution operation. This is achieved by replacing the convolution module with XNOR-based filter banks. Simulations of the proposed architectures for search engine and feature extraction were carried out using the VTEAM model in the Cadence Virtuoso Analog Design Environment. The proposed filter bank architecture achieves a 1-ns extraction cycle time over N filters and produces multiple output feature maps in a single processing cycle. The filter uses two memristor devices to realize each XNOR gate and has shown a significant reduction in the number of multiply-add operations.

Index Terms— Memristor, CNN, accelerator, in-memory computing, template matching, XNOR.

Manuscript received November 30, 2018; revised February 5, 2019; accepted March 22, 2019. Date of publication April 4, 2019; date of current version June 11, 2019. This publication is based upon work supported by the Khalifa University of Science and Technology under Award No. [RC2-2018-020]. This paper was recommended by Guest Editor J.-S. Seo. (Corresponding author: Yasmin Halawani.)
Y. Halawani, B. Mohammad, M. Abu Lebdeh, and M. Al-Qutayri are with the System-on-Chip (SoC) Center, Khalifa University, Abu Dhabi 127788, United Arab Emirates (e-mail: yasmin.halawani@kustar.ac.ae).
S. F. Al-Sarawi is with the Centre for Biomedical Engineering, The University of Adelaide, Adelaide SA 5005, Australia.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JETCAS.2019.2909317
2156-3357 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. An extension of the classification of memristor-based logic design found in [7] with multiple input data representations. The first letter represents Input 1, the second letter represents Input 2, while the third letter represents the processing element. V: voltage, R: resistance, B: both voltage and resistance, M: memristor-only, and H: hybrid CMOS-memristor. ([∗ [8], [9]], [∗∗ [10]–[13]], [† [14]], [‡ [15], [16]], [$ [2]], [§ Proposed].)

I. INTRODUCTION

MEMRISTOR, a type of Resistive RAM (ReRAM) device technology, promises to extend the trend of low power while maintaining low cost with high density [1]. The two-terminal device consists of a thin oxide film that has the ability to store information with zero leakage current, high endurance, relatively fast write time and small cell size. The device has shown both storage and information processing capabilities, which makes it a potential building block for in-memory computing (IMC) [2], [3]. With such capabilities, ReRAM-based computing systems have the ability to mitigate the key bottlenecks of conventional CMOS-based computing designs that are based on the von Neumann architecture. For example, data movement through input/output pins in a CMOS-based architecture is estimated to consume 100× more energy than a floating-point operation and can limit device performance [4]. Furthermore, the continuous response of the memristor state variable to an applied input voltage makes it an ideal candidate for the multiply-and-accumulate operations commonly needed in many digital image processing tasks. When these devices are built in a crossbar architecture, they result in inherently parallel operations and can naturally realize the vector-matrix operation with potential savings in energy, area and execution time. Hence, ReRAM-based systems are ideal candidates for real-time search engines and neural network accelerators, especially for resource-constrained IoT nodes [5], [6]. ReRAM can be configured to support analog matrix multiplication or bitwise search operations within memory. In this paper, we focus on implementing an XNOR gate to enable a bitwise comparison for a CAM search engine. Moreover, the same gate is utilized for feature extraction in convolutional neural networks (CNN).
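The vector-matrix operation that a crossbar naturally realizes, as described above, can be sketched numerically. This is a functional model with illustrative values, not the paper's circuit: each column current is the sum of conductance-weighted input voltages (Ohm's law plus Kirchhoff's current law).

```python
# Functional sketch of a crossbar vector-matrix product: column current
# I_j = sum_i V[i] * G[i][j]. Values are illustrative, not from the paper.

def crossbar_vmm(G, V):
    """Return the list of column currents for conductance matrix G (rows x cols)
    driven by the row voltage vector V."""
    rows, cols = len(G), len(G[0])
    return [sum(V[i] * G[i][j] for i in range(rows)) for j in range(cols)]

# 2x2 example: conductances in siemens (ON ~ 1/10 kOhm, OFF ~ 1/10 MOhm),
# input voltages in volts.
G = [[1e-4, 1e-7],
     [1e-7, 1e-4]]
V = [0.6, 0.0]
I = crossbar_vmm(G, V)   # column currents in amperes
```

The parallelism comes for free: every column current is formed simultaneously by the physics of the array, which is what the loop above only emulates.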


There are eight classes of resistive-memory-based logic designs, depending on the input and output data representations and the processing elements used [7]. We have extended the classification to include a second input, as illustrated in Fig. 1, that can be either resistance, voltage or both. This is in contrast to [7], where the inputs are restricted to either voltage or resistance. Each architecture family has its advantages and disadvantages depending on the application. For example, resistive-resistive logic is appropriate for search applications where both operands are stored in the memory, as opposed to reading the data first and then performing the search operation. In the case of a real-time search operation, however, it is better to have the inputs as voltages from the sensor and compare them against the values stored in the memristor. In [5], we proposed the use of stateful inputs and stateful outputs for a memristor-based search engine built on a novel heterogeneous memristive XOR gate. In that case, control voltages are applied to the XOR array to perform a comparison between two memristor states, and the output of the comparison is also stored as a memristor conductance value. It is worth noting that this type of representation is ideal when both inputs (operands) are stored in memory.

In this paper, the proposed scheme has one input as voltage and the second as resistance. The new cell/architecture enables multiple-database matching. To the best of our knowledge, this is the first reporting of a reconfigurable cell that can output either a resistance value or a voltage. The contributions of this paper can be summarized as follows:
• A reconfigurable voltage-resistance XNOR (VR-XNOR) cell that has one operand in resistance representation and the second operand in voltage representation. The output can be in voltage or resistance representation.
• A CAM architecture that can be configured as BCAM, TCAM or approximate search for 2D input.
• A memristor-based filter bank architecture that uses the XNOR cell to extract features for convolutional neural networks (CNNs), demonstrating a significant reduction in the number of multiply-and-accumulate operations compared to conventional methods.
• A parallelized and pipelined average pooling operation to allow simultaneous pooling of sub-feature maps.

In the next section, the VR-XNOR cell design is presented and discussed. The rest of the paper is organized as follows: the proposed memristor-based search engine architecture is described and demonstrated in Section III. After that, Section IV provides a brief background on convolutional layers, and the simulated experimental results for the multiple-filter architecture for CNNs are reported and discussed. Then, Section V concludes the paper and discusses future research.

II. PROPOSED VR-XNOR CELL DESIGN

Fig. 2. Two-input memristor VR-XNOR cell, where one operand is a voltage (V_A, V̄_A) that is compared against a second operand that is a resistance (R_B, R̄_B). The output can be either a voltage V_OUT or a stateful R_OUT. The switch SW configures whether the output is a voltage, for BCAM and TCAM, or a resistance, in the case of similarity search.

TABLE I. XNOR truth table (logic '0' → 0 V and 10 MΩ, logic '1' → 0.6 V and 10 kΩ).

Bitwise XNOR operation has been widely used to efficiently implement data encryption, error detection/correction, binary-to-Gray encoding and search engines [13]. In this work, we propose to use a VR-XNOR cell for CAM design and feature extraction for CNNs. In the proposed voltage-resistance (VR)-CAM cell demonstrated in Fig. 2, one operand is a voltage that is compared against a second operand that is a resistance. The output can be a voltage or a resistance. This is useful for real-time applications where the input voltage, e.g., an image from a sensor, can be fed directly to the non-volatile CAM structure for a similarity check. The cell consists of three memristive devices. One memristor stores the data while the other stores its complement, as R_ON and R_OFF or vice versa; these two memristors perform the search operation, while the third device is for the output.

The cell consists of computing and storage bipolar memristor devices with two different turn-ON voltages (switching times). The switching time of the computational devices is very slow so that they are not disturbed during computation, while the switching time of the output memristor is fast. Fabrication of such structures has been reported in the literature by using different oxygen vacancy concentrations for the same material [17]. The fabrication process implies that an extra mask is needed to enable device selection. The sub-sections below explain the main operations, namely the compare and the read.

A. Search Operation

The proposed cell can be used to build either a BCAM, a TCAM or a similarity search. In the first, all cells contribute to the output. In the second case, depending on the variable length of the input stream, if certain bits are masked or are don't cares during the comparison operation, then the corresponding inputs to the cells are left floating. The output column is fed directly to a comparator to identify a match/mismatch in a BCAM or TCAM structure. In the case of a similarity search, the accumulation memristor is connected at the end of the column. Hence, a switch can be used with the output accumulative memristor: if the switch is ON, the CAM structure is used for similarity search, while if it is OFF, a BCAM/TCAM comparison is performed by the CAM structure.
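The search behavior just described can be captured in a small logic-level model. The encodings follow Table I; the helper names are ours, and no device dynamics are modeled — a path simply "conducts" when a high voltage is applied across the ON (low-resistance) device.

```python
# Logic-level model of the VR-XNOR cell of Fig. 2. Operand A arrives as a
# voltage pair (V_A, V_A_bar); operand B is stored as a resistance pair
# (R_B, R_B_bar). Encodings follow Table I of the paper.
V0, V1 = 0.0, 0.6        # volts for logic '0' / '1' (BCAM/TCAM mode)
R1, R0 = 10e3, 10e6      # ohms: logic '1' -> 10 kOhm (ON), '0' -> 10 MOhm (OFF)

def encode_a(a):
    return (V1, V0) if a else (V0, V1)      # (V_A, V_A_bar)

def encode_b(b):
    return (R1, R0) if b else (R0, R1)      # (R_B, R_B_bar)

def vr_xnor(a, b):
    """XNOR via the expression V_A.R_B + V_A_bar.R_B_bar: the cell output is
    high only when the driven voltage sees the low-resistance device."""
    va, va_bar = encode_a(a)
    rb, rb_bar = encode_b(b)
    return int((va == V1 and rb == R1) or (va_bar == V1 and rb_bar == R1))

truth = {(a, b): vr_xnor(a, b) for a in (0, 1) for b in (0, 1)}
# -> {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```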


The design checks the input voltage that represents the data entry against the conductance of one memristor, as the second one will be its complement. The matching operation is performed according to the XNOR Boolean expression V_A·R_B + V̄_A·R̄_B. The truth table of the XNOR cell is presented in Table I. The input voltage levels V_A and V̄_A can be selected based on the usage model. For example, for BCAM and TCAM, V_A = 0.6 V and V̄_A = 0 V, or vice versa. In the case of similarity search, where the comparison result is written into R_OUT, V_A = 1 V and V̄_A = 0 V, or vice versa; this enables writing R_OUT in the given time. It is worth noting that the voltages used in these simulations were chosen to achieve fast switching with good noise margin based on reported real devices [18]. The 1 V level was selected because it allows full switching of the output memristor when all inputs match the databases/filters. On the other hand, 0.6 V was chosen because it provides an acceptable distinction between matching and mismatching cells in the BCAM/TCAM case, where an output memristor is not used. The output memristor is initialized to R_OFF. When the applied input voltage and the stored data match, the voltage drop across the accumulation memristor is high; only when a match occurs is it pushed to R_ON. This polarity was chosen because most of the time there will be more mismatches than matches, and hence the energy will be minimized. Otherwise, in the case of a mismatch, the voltage drop is small and the output memristance stays at R_OFF. This results in overall better power efficiency. The output voltage can be calculated through the following equation:

V_OUT = V_A · R̄_B / (R_B + R̄_B(1 + R_B/R_OUT)) + V̄_A · R_B / (R̄_B + R_B(1 + R̄_B/R_OUT))    (1)

Fig. 3. Conventional CMOS CAM array, where each cell consists of a 6T SRAM cell and a dynamic comparator to compute the match/mismatch signal.

B. Read Operation

In order to read the patterns stored in the memristors, the corresponding voltage can be applied only to the stored data, as there is no need to read its complement. If the output voltage is high, the stored bit is R_ON; otherwise, the stored bit is R_OFF. Read operations of the stored patterns are rarely performed in CAM-type applications, and if the whole CAM structure is to be read, a bit-wise read operation must be followed. This is a limitation of the proposed CAM structure, as it requires multiple cycles to read a single database. Hence, a memristor-based memory crossbar holding the same stored databases can be used in parallel to fetch the database with the highest similarity score.

III. SEARCH ENGINE SYSTEM ARCHITECTURES

A. State-of-the-Art Systems

Fast search engines are required for real-time decision making in various fields including computer vision, machine learning and object recognition. For IoTs and similarly resource-constrained devices that need to implement fast search engines, it is of paramount importance to keep both area and energy costs minimal. Conventional CMOS-based search engines suffer from density and power limitations.

Most systems use software-based search engines, such as associative lookup tables (LUT). Such implementations are restricted to the size of the available physical memory, which leads to higher latency due to the need for external memory access [19]. Usually, these algorithmic software-based approaches are suitable for general-purpose applications, but cannot be utilized for IoTs that need to process data in real time. Application-specific integrated circuit (ASIC) hardware implementations are much faster than software approaches and hence more appropriate for real-time search operations [20], [21]. ASICs also utilize a special type of CAM that can perform a lookup operation in a single clock cycle.

A conventional CMOS-based CAM, depicted in Fig. 3, is composed of a memory element combined with comparison circuitry [20] that facilitates parallel search operations. The data is stored in the memory using a large number of transistors (10+) per bit. Once the input data is fed into the CAM, the search operation is performed using precharge-evaluate operations, similar to an SRAM read operation [22]. After that, the address of the matching data is returned (or the matching data itself in the case of associative memory). BCAM and TCAM are two widely used search memories. In the former, exact-match searches are performed, while in the latter, don't cares are further utilized to improve matching performance by performing partial or range matches [23].

The major challenges for conventional CMOS-based CAM and TCAM designs are their low density and high power consumption. Implementations utilizing a pre-charge phase increase power consumption and complexity [24]. Further research into lower-power and higher-density cells, to align with the continuing rapid growth in data, has been sought.
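Eq. (1) can be sanity-checked numerically with the Table I values. This is a behavioral sketch, with R_OUT held at its initial R_OFF and helper names of our own choosing; it shows the large match/mismatch separation that the comparator relies on.

```python
# Numerical check of Eq. (1) for the VR-XNOR cell. Device values follow the
# text: R_ON = 10 kOhm, R_OFF = 10 MOhm, V_A = 0.6 V in BCAM/TCAM mode, and
# the output memristor starts at R_OFF.
R_ON, R_OFF = 10e3, 10e6

def v_out(va, va_bar, rb, rb_bar, r_out=R_OFF):
    """Eq. (1): superposition of the two divider paths of the cell."""
    t1 = va * rb_bar / (rb + rb_bar * (1 + rb / r_out))
    t2 = va_bar * rb / (rb_bar + rb * (1 + rb_bar / r_out))
    return t1 + t2

# Match: input '1' (V_A = 0.6 V) against stored '1' (R_B = R_ON).
v_match = v_out(0.6, 0.0, R_ON, R_OFF)       # close to 0.6 V
# Mismatch: input '0' (V_A_bar = 0.6 V) against stored '1'.
v_mismatch = v_out(0.0, 0.6, R_ON, R_OFF)    # well under 1 mV
```

The roughly three-orders-of-magnitude gap between the two cases is what lets the accumulation memristor switch only on a match.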


Moreover, for media applications that deal with 2D images, circuit complexity increases as the image size increases, and traditional CMOS implementations continue to be more expensive in terms of area, power and processing time. For example, in [25] a 2D pattern is divided into sub-patterns whose size must be small enough to be manageable for processing; the architecture becomes impractical as the pattern size increases.

To address the highlighted challenges, there is growing interest in utilizing emerging non-volatile nano-devices for search and match operations. In [26], the memory core is composed of a CMOS transistor and a phase-change memory (PCM) in a 1T1R structure, where the PCM is used to store the data. This is followed by a differential sense amplifier to change the stored value into a corresponding voltage value (i.e., GND or VDD). Then, a comparator is used to compare the stored data with the input search voltage. The non-volatile element is not part of the computation and is only used as a storage element. Researchers in [27] have proposed the use of STT-RAM in CAM/TCAM. The design uses 1T1MTJ per cell on each bitline and one reference resistor; the architecture suffered from long delay. Hence, in [28] 14 transistors and a penta-MTJ are utilized to obtain a sub-1-ns TCAM cell. However, PCM is temperature sensitive [29], and the MTJ has a low noise margin between the two resistance states and a long search delay [30], which negatively affect performance. In addition to the aforementioned devices, researchers are looking into utilizing the memristor for CAM applications. Some of the major resistive-based CAM cell implementation studies utilize stateful logic, while others use complementary resistive switches (CRS). Moreover, the designs vary in the numbers of both nano-devices and transistors used to build the bit-cell memory. Other implementations focus more on the endurance issue and speed while relaxing the energy consumption.

A memristor-based TCAM (mTCAM) cell consisting of five transistors and two memristors is proposed in [3]. The stored data is represented by the resistance values of the memristive devices as the data and its complement. Four search lines are used per cell and only three are effective during search mode. Initially, match lines are precharged high. Delays from the four search lines, the 5T2R TCAM cells, and the match line (ML) contribute to an increase in the search latency as well as the energy consumption.

In [2], a transistor-free memristor TCAM structure for comparing input data against a bank of stored words is presented. This architecture can be used for intrusion detection applications, where packets are scanned to determine whether they are malicious. For N bits, the number of memristors required is 4N+2. The design has shown a ∼360× reduction in area with competitive timing and energy compared to CMOS-based CAM. In that structure, each bit requires 4 memristors: one connected to the input voltage, the second to the input voltage complement, and the third and fourth connected to two bias voltages to account for variations in the memristance value. This is in addition to two bias voltages per column.

Another approach for search using implication logic was proposed in [19]. In this approach, eleven steps per CAM cell are required to generate a cell match signal. In [31], CRS-based stateful logic operations using material implication are presented. For a two-input XOR implementation, the design requires 4 CRS (8 memristor devices) and 6 steps to compute the operation.

In [32], the search operation is performed by first programming the CRS array with a template pattern. Then each match line (ML) is precharged by a global reset signal. After that, the ML is discharged by the result of a correlation based on the Hamming distance between the two input images. Moreover, in [33] the search engine consists of two CRS and cross-coupled switches with an analog correlator that flags the Hamming distance for the image-recognition and classification process.

An approximate TCAM (ACAM) is proposed for online learning in [30]. The authors use MTJs to tackle the endurance issue associated with the wear-out of memories for online learning. The architecture consists of a 5T-4MTJ TCAM followed by an STT-RAM memory. Once the searched input data is found, a clock-gating signal is activated to stop the processor computation, and finally the corresponding line of the STT-RAM memory is read to obtain the precomputed output data.

B. Proposed Search Engine System Architecture

The proposed VR-XNOR cell is used as the main building block for the memristor-based CAM array architectures. The crossbar architecture shown in Fig. 4 enables parallel lookups of multiple inputs in multiple CAM banks. For N input bits, the system requires 2N memristor devices to perform a comparison, plus an analog memristor accumulator at the end of each column; each data entry requires a representation of both its value and its complementary value. Input data is represented as applied voltages, while the stored data is represented by the conductance values of the memristors. Each column corresponds to a different database.

Multiple bank lengths can trade off the number of bits per shared match line against the output voltage difference Δ (mV), or output resistance, between matching and mismatching cells. As the bank length increases, the voltage difference between all matching cells and a single mismatch starts to decrease, until it reaches almost zero when all bits mismatch and hence no switching occurs in the output memristor. This is due to the fact that, when there is a mismatch, the high-resistance bits still contribute to the output voltage, and the combination of matching and mismatching cells results in a reduced voltage difference.

The output bank reading circuitry is illustrated in Fig. 4(c). Depending on the desired application, as explained previously in Section II, if the architecture is used for BCAM/TCAM then the output voltage is fed directly into the comparator, as shown in part (a) of the figure. For a similarity search, the switch is initially closed to write the output into the memristor. The next step is to evaluate R_OUT, so the switch is opened to read its value without disturbing the rest of the cells and feed it into the comparator, as demonstrated in part (b) of the figure.

Logic '1' corresponds to the memristor ON state, R_ON, while logic '0' corresponds to the memristor OFF state, R_OFF.
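The two operating modes can be modeled behaviorally: exact (optionally masked) matching for BCAM/TCAM versus a per-column similarity score for the accumulative mode. This is a functional sketch with made-up 4-bit databases, not a circuit simulation; the function names are ours.

```python
# Behavioral model of the multi-bank CAM of Fig. 4: each stored database is
# one column; a column's similarity score counts bit positions where the
# applied input XNORs to '1' with the stored word.

def similarity_scores(query, databases):
    return [sum(q == d for q, d in zip(query, db)) for db in databases]

def best_match(query, databases):
    """Similarity-search mode (SW = ON): return the index of the most
    similar database and all column scores."""
    scores = similarity_scores(query, databases)
    return scores.index(max(scores)), scores

def tcam_match(query, stored, mask):
    """TCAM-style compare (SW = OFF): masked ('don't care') positions are
    left floating and excluded from the comparison."""
    return all(m == 0 or q == d for q, d, m in zip(query, stored, mask))

dbs = [[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 0, 1]]
idx, scores = best_match([1, 0, 1, 0], dbs)   # idx -> 0, scores -> [3, 2, 1]
```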


Fig. 4. Memristor-based multiple-bank CAM architecture, shown in (a), and its digital equivalent, shown in (b). The readout circuitry schematics for the BCAM/TCAM configuration and similarity search are shown in (c) (en stands for the enable signal and V_ref is the reference voltage). The switch SW configures whether the output is a voltage, for BCAM and TCAM (SW = OFF), or a resistance, in the case of similarity search (SW = ON).
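The bank-length trade-off discussed above can be illustrated with a deliberately simplified first-order model that treats the shared output node as an equal-weight average of per-cell divider voltages (~0.6 V for a match, ~0 V for a mismatch). It reproduces only the stated trend, the single-mismatch margin shrinking roughly as 1/n, not exact circuit values.

```python
# First-order look at the bank-length trade-off: with n cells sharing one
# output line, approximate the line voltage as the mean of per-cell divider
# outputs. The margin between 'all match' and 'one mismatch' then goes as 1/n.
V_MATCH, V_MISMATCH = 0.6, 0.0

def line_voltage(n_match, n_total):
    return (n_match * V_MATCH + (n_total - n_match) * V_MISMATCH) / n_total

margins = []
for n in (4, 8, 16, 32, 64):
    margins.append(line_voltage(n, n) - line_voltage(n - 1, n))
# margins in volts: 0.6/n -> 0.15, 0.075, 0.0375, ...
```

Shorter banks therefore give the comparator a larger margin at the cost of more match lines per stored word.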

TABLE II. Comparison between the proposed architectures and competitive memristive CAMs.

The ranges can be determined by dividing the whole resistance spectrum in half, equally or unequally, depending on the voltage difference [35].

The estimated area calculations for the memristor crossbar were based on a fabricated full-pitch width of 400 nm from [36]. For the first architecture, which consists of multiple banks, the 256 × 144 structure would be divided into 4 banks, each with 64 memristor devices (32 pairs), i.e., 4 × 64 × 144. It would occupy an area of [4 × (400 nm × 64) × (400 nm × 144)] = 5898.2 μm² and accommodate 128 bits.

In the next section, CNN is briefly explained, and then the XNOR-based CAM cell is utilized for feature extraction.

IV. CONVOLUTIONAL NEURAL NETWORKS APPLICATION

A. Basics of CNN

Convolutional neural networks (CNNs) are found to be superior to other standard feedforward neural networks in terms of accuracy. Such networks mimic the locality found in the human visual system (HVS) and have much fewer connections and parameters; as a result, they are easier to train [37]. CNN structures are divided into two main stages: feature extraction followed by classification. The following 4 main layers are used in the implementation of a CNN:

• Convolution (ConV) layer: captures (detects) as many generic features as possible from the training dataset that would help in classification (for example) of the testing dataset. It consists of sliding N filters, each of size k × k with randomly initialized weights, over a set of training images. This step produces feature maps y_n that are the results of element-wise multiplication and summation between the filter and a specific block of an image. This value represents whether the specific feature exists in that particular part of the image:

y_n = Σ_i w_i x_i + b_i,    (2)

where n is the number of columns (filters), w_i is the weight of the filter, x_i is the input and b_i is the bias.

• Pooling layer: downsamples the feature maps into smaller sizes, which reduces the complexity of the whole architecture. If max pooling is used, then prominent features from a specific block are extracted. On the other hand, average pooling gives a smoothed/blurred output of the features from a specific receptive field.

• Non-linear activation function: responsible for mapping the neuron's output in terms of its input. Rectified linear units (ReLU) have been widely used instead of the sigmoid and hyperbolic tangent, as ReLU is much faster to train than these saturating nonlinearities [37]:

f(x) = max(0, x).    (3)

• Fully connected (FC) layer: the final output stack is flattened to produce a 1D array to check the effect or significance of each feature on the final classification stage for the different classes. All neurons are connected to all output neurons in the output layer.

The last step is to calculate the classification error and update the weights using the backpropagation algorithm according to the chain rule.
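The layer operations above — the sliding-filter sum of Eq. (2), the ReLU of Eq. (3) and average pooling — can be sketched in plain Python, together with the XNOR/bitcount identity that replaces multiply-adds once inputs and weights are binarized. This is a toy, stride-1 model; all values are illustrative.

```python
# Toy sketch of the Section IV-A layer operations for a single k x k filter.

def conv2d(img, w, bias=0.0):
    """Eq. (2): y = sum_i w_i * x_i + b, with the filter slid at stride 1."""
    k = len(w)
    out_h, out_w = len(img) - k + 1, len(img[0]) - k + 1
    return [[sum(w[u][v] * img[r + u][c + v]
                 for u in range(k) for v in range(k)) + bias
             for c in range(out_w)] for r in range(out_h)]

def relu(fmap):
    """Eq. (3): f(x) = max(0, x), applied element-wise."""
    return [[max(0.0, x) for x in row] for row in fmap]

def avg_pool2(fmap):
    """Non-overlapping 2x2 average pooling."""
    return [[(fmap[r][c] + fmap[r][c + 1]
              + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def xnor_popcount_dot(a_bits, w_bits):
    """For {0,1}-encoded +/-1 vectors: dot(a, w) = 2 * popcount(XNOR) - N,
    the identity behind XNOR-based binarized convolution."""
    matches = sum(a == b for a, b in zip(a_bits, w_bits))
    return 2 * matches - len(a_bits)

img = [[1, 2, 0, 1],
       [0, 1, 3, 1],
       [2, 1, 0, 0],
       [1, 0, 1, 2]]
w = [[1, 0], [0, -1]]            # toy 2x2 filter
fmap = relu(conv2d(img, w))      # 3x3 feature map
pooled = avg_pool2(fmap)         # -> [[0.25]]
```

The last helper is why binarization matters for hardware: an entire k × k multiply-accumulate collapses to bitwise comparisons plus one count.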


Convolutions are the computationally intensive operations, as they nominally account for nearly 90% of the total processing task [38], with heavy reliance on floating-point matrix multiplication. This causes an issue for general-purpose processors, especially with their limited cache sizes. Hence, graphics processing units (GPUs) have been widely used to accelerate the convolution operation due to their high parallelism and floating-point performance [39]. Nonetheless, GPUs suffer from high power consumption, which makes them expensive to deploy during the inference/testing phase of CNNs, especially in mobile systems that require real-time operation and low power consumption [40].

Thus, field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC)-based CNN accelerators are increasingly being used to achieve high accuracy with low power consumption and highly reduced computational time [39]. On the other hand, several schemes and versions of CNNs are being exploited to align with the rise of IoT and mobile devices, such as neuron pruning and weight quantization [41].

Approximating the full-precision floating-point weights through quantization has been studied lately. Lowering the precision and utilizing an extremely compact data representation, such as the binarized CNN (BNN), has shown the ability to reduce memory resources and decrease computational time [42]. In [41], the authors have shown a 58× improvement in convolution operations and 32× memory savings when both the weights and inputs are binarized, with a 12.5% loss in accuracy compared to the full-precision AlexNet. Moreover, in [43] the network biases were removed, as most of them had magnitudes less than 1 after quantization, which did not affect the accuracy. As a consequence of the binarization, the convolution operation (multiply and add modules) is reduced to XNOR gates and bitcount adders, realizing the convolution efficiently, especially for systems with limited resources [44]. Nonetheless, the neural weights still need to be moved from the memory to the computational unit, which creates timing and power overhead. It is appreciated that the binarization process causes accuracy loss, and hence there are efforts to reduce this impact. For example, the authors in [40] proposed the use of the sign() function with a shift parameter inside, in order to approximate the full precision using binary bases; thus, the accuracy loss was reduced to around 5% compared to state-of-the-art implementations.

The proposed method reduces training time significantly, since the movement of weights has been eliminated in addition to the binarization process [41]. Moreover, it reduces the area by performing an unsigned bitwise operation instead of using two crossbars to store negative and positive weights with a subtract operation for every complementary bitline output [4].

Memristor crossbars have been widely used as neural network accelerators, as they perform in-memory computations on the stored weights and hence eliminate their movement. For example, in [45] the authors presented a memristor-based architecture for on-chip training of deep neural networks (DNN). The architecture utilizes two memristors per synapse for higher precision of the weights. Moreover, the backpropagation algorithm was used to train the autoencoders and to fine-tune the pre-trained weights. In addition, in [46] the authors proposed the use of a memristor-based CNN to parallelize the recognition phase. They assumed that the crossbars are already trained and stored as conductance values. The crossbars were used for analog vector-matrix multiplication acceleration, with each weight represented by two memristors to account for both the positive and negative values. The filters … across the layers. The dot-product operations involved in convolution and classifier layers are performed on crossbar arrays; those results are then converted to a digital representation and aggregated in output registers after any necessary shift-and-adds.

Moreover, Ni et al. presented in [48] a binary vector-matrix multiplication, where they use two pairs to represent AND and OR operations, which translate to binary multiply and add operations, respectively. This is followed by a voltage comparator with a reconfigurable threshold voltage. The convolution part requires a 2N×N ReRAM crossbar per filter. The authors compared their RRAM-based BNN system to other hardware implementations (GPU, FPGA and CMOS), and it has shown 4 orders of magnitude, 4155× and 62× less power, respectively, under similar accuracy. A digital RRAM-based convolutional block capable of performing dot-product operations in a single cycle is proposed in [49]. The main cell consists of 4 transistors and 1 RRAM (4T1MR). The cell employs two pairs of NMOS and PMOS transistors in order to trigger a set or a reset process to program the memristor. The RRAM crossbar is followed by an XNOR sensing circuit and then a combinational bitcount circuit.

Increasing the number of layers increases the accuracy, as it helps to represent complex features from local low-level ones. Nonetheless, this also increases the number of multiply-and-accumulate (MAC) operations, as each connection to a neuron represents a MAC operation. It has been reported in [38] that AlexNet, which consists of 5 convolutional layers, requires 666 million MACs per 227×227-pixel image, while the VGG16 architecture, which uses 13 convolutional layers, requires 15.3 billion MACs per 224×224-pixel image.

B. Proposed CNN Architecture

One of the drawbacks of using analog memristors to carry out computations is that the stored data will likely have less precision compared to the typical 32-bit floating-point data representation [50]. Variations in memristors do impose a challenge for analog computation. On the other hand, utilizing a memristor for digital operation (high resistance, low resistance), as is the case for BNN, provides more robust computation in the face of device variability. As for the endurance of the devices, the read-to-write ratio is large, as reported in the literature [18]. Moreover, the devices are programmed once after training in the case of CNN applications, and once per database in the CAM case. Hence, the endurance effect is minimal in the presented case.

To the best of our knowledge, this is the first paper to present a VR-XNOR memristor-based template matching circuit for BCNN with 2N memristor devices per bit, in addition to the ability to perform the pooling operation together with the convolution process at the same time.

The storage memristor, which is used to store the feature map values, is initialized to R_OFF. This is because the features do not exist in all patches of the input image set and conse-
were expanded into large sparse matrices. In [47], the authors quently there will be more mismatches than matches. So from
presented ISAAC architecture, an analog CNN accelerator an energy point of view, initializing the output device to
where tiles containing memristor crossbars are partitioned ROFF minimizes the energy and results in better overall power

Authorized licensed use limited to: National Central University. Downloaded on September 19,2023 at 03:20:35 UTC from IEEE Xplore. Restrictions apply.
394 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 9, NO. 2, JUNE 2019
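The binarization discussed above reduces a multiply-accumulate to an XNOR followed by a bitcount. A minimal software sketch of that reduction (the bit-packing convention and function name here are illustrative, not from the paper):

```python
# Sketch of a binarized dot product: with activations and weights
# constrained to {-1, +1} and packed as bits (1 -> +1, 0 -> -1),
# multiply-accumulate collapses to a bitwise XNOR followed by a
# bitcount (popcount). The packing convention is illustrative.

def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors packed as ints."""
    matches = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")  # XNOR + bitcount
    return 2 * matches - n  # each match adds +1, each mismatch adds -1

# x = [+1, -1, +1, +1] -> 0b1011, w = [+1, +1, -1, +1] -> 0b1101
print(binary_dot(0b1011, 0b1101, 4))  # two matching bits -> 2*2 - 4 = 0
```

This is exactly the XNOR-plus-bitcount structure that the VR-XNOR filter bank realizes in the resistive domain.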

TABLE III
VTEAM MEMRISTOR MODEL PARAMETERS FOR COMPUTATIONAL DEVICES AND STORAGE DEVICES

Fig. 5. Memristor-based multiple filter architecture, showing in a) a k×k image patch being flattened to a 1D array, after which a voltage generator translates the intensity of the pixels to voltages prior to applying them as input to the XNOR-based filters. The output is the feature map value that is directly written to the memristor crossbar shown in b). Then voltages are applied to the crossbar in b) to perform average pooling. These average pooling voltages can also be used as row selectors to ground the active rows in order to write the output to the feature map crossbar. The last row in the feature map crossbar corresponds to the output of the image patch (shown in the lower right corner of the input image) being XNORed with N filters.

efficiency. When there is a match between the image patch pixel and the stored feature pixel, the voltage drop across the accumulation memristor is high and pushes the state variable x to RON. Otherwise, in the case of a mismatch, the voltage drop is small and the output memristance stays at ROFF.

It is worth noting that the main differences between the work by Ni et al. [48] and ours are that 1) they use a 2N×N crossbar to represent a single filter, while we use a 2N×M crossbar that can represent M filters, since each column in that crossbar is a different filter; and 2) in our proposed design, the native XNOR output is written directly to the pooling crossbar instead of being sensed and then written.

The crossbar architecture shown in Fig. 5 enables parallel operation over N filters. For each bit, the system requires 2 memristor devices to perform a comparison, and a storage memristor at the end of each column to store the value of the feature map. So each entry of the filter is represented by its value and its complementary value. In addition, the input data from the image patch are represented as applied voltages, while the filter elements are represented by the conductance values of the memristors. Each column corresponds to a different filter.

In order to implement the proposed VR-XNOR for the CNN filtering operation:
• The N filters are first converted to 1D arrays in order to implement the element-wise comparison and addition over all filter members at once.
• The corresponding image patch is converted into a 1D array and applied to the terminals of the filter bank as corresponding voltages.
• Applying the first image patch as voltages to the filter bank shown in Fig. 5(a) produces the extracted feature value from the first bank, the second column produces the value for the second bank, and so on. In the second clock cycle, the second image patch is applied. So, for a 32×32 image and a 2×2 filter with stride = 2, it takes 16×16 ns to perform the convolution. The convolution can be further optimized and completed within 16 ns at the expense of more hardware resources. For example, a stack of filter banks can be used where each bank is responsible for one row of the image. This also requires a second stage for the analog output data, to accumulate the matching from all columns that correspond to the same filter across the stack.
• Once the convolution is performed through the computation operation on the crossbar layer presented in Fig. 5(a), the result is directly written to the pooling layer shown in Fig. 5(b) in a row-by-row fashion. This helps to, first, eliminate additional peripheral circuitry and, second, eliminate the select transistor in the pooling layer, as inactive rows are kept floating. It is expected that conversion from analog to digital and vice versa can consume 85% of the energy [4] in neuromorphic systems. It is worth noting that there is a need to address the sneak-path current when writing to the convolution layer devices. This can be achieved by i) resetting the whole crossbar in a single cycle and then writing logic '1' to the devices in a row-by-row fashion, where the other inactive rows are kept floating, with either half or a third of the supply voltage used for the writing scheme as in [51]; or ii) utilizing a 1T1R structure with the same writing scheme as in the previous point. Writing during the operational phase is rare, since the training is performed offline.

C. Simulation Results and Analysis

The proposed filter bank design has been simulated and verified using the SPICE circuit simulator from the Cadence tools. Two types of memristors have been utilized in the VR-XNOR filter bank, as explained in Section III.A. The VTEAM SPICE model for the memristor is used in the simulations [52], with the values of the used parameters presented in Table III. The storage memristor switches within 1 ns, as has been demonstrated by the HP group for 3 nm × 3 nm device sizes [18], [53], [54]. Since, from a circuit point of view, the CNN and CAM share the same VR-XNOR filter bank architecture, it has been verified on the CNN.
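The patch counts behind the timing estimate above, and the operation savings worked out later in the paper, follow from simple sliding-window arithmetic for a valid (no-padding) convolution. A quick sanity check:

```python
# Output grid size for a square image, square filter, and given
# stride (valid convolution, no padding). Used only to check the
# patch counts quoted in the text.

def out_size(image: int, filt: int, stride: int) -> int:
    return (image - filt) // stride + 1

print(out_size(32, 2, 2))   # 16 -> 16x16 patches, one per ns
print(out_size(249, 3, 3))  # 83 -> 83x83 patches

# ASIC cost: 9 multiplies + 9 adds per 3x3 patch; the filter bank
# replaces each patch with a single crossbar evaluation, giving the
# 18x reduction quoted in the text.
n = out_size(249, 3, 3)
print((9 * n * n + 9 * n * n) // (n * n))  # 18
```

The 16 ns variant with a stack of filter banks simply divides the 16×16 patch grid across 16 banks evaluated in parallel.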

HALAWANI et al.: ReRAM-BASED IN-MEMORY COMPUTING FOR SEARCH ENGINE AND NN APPLICATIONS 395

TABLE IV
TRADE-OFFS BETWEEN DIFFERENT FILTER SIZES AND MATCHING MARGIN FOR 2×2 AND 3×3 SIZES
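The margin trend in Table IV can be reproduced qualitatively with a first-order model: each XNOR pair acts roughly as RON (match) or ROFF (mismatch) in parallel, in series with the storage memristor. The resistance values below are illustrative placeholders, not the paper's VTEAM device parameters:

```python
# First-order voltage-margin estimate for one filter column: the N
# XNOR pairs act as parallel branches (roughly RON for a matching
# bit, ROFF for a mismatching bit) in series with the storage
# memristor. All resistance values are assumed, not the paper's.

R_ON, R_OFF, R_STORE = 1e3, 1e6, 1e3  # ohms (illustrative)

def v_store(n_bits: int, n_match: int, v_in: float = 1.0) -> float:
    """Voltage across the storage memristor for n_match matching bits."""
    g_col = n_match / R_ON + (n_bits - n_match) / R_OFF  # parallel branches
    return v_in * R_STORE / (1.0 / g_col + R_STORE)

# Margin between an all-match column and a single-mismatch column
# shrinks as the column (filter) gets longer, as Table IV reports.
for n in (4, 9):  # flattened 2x2 and 3x3 filters
    print(n, round(v_store(n, n) - v_store(n, n - 1), 4))
```

This captures only the trend (longer columns, smaller margin); the absolute values in Table IV come from the full SPICE simulation.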

Fig. 6. Illustration of a convolution where the input is a binary image patch compared against two filters stored in the memristor crossbar.

Fig. 7. Illustration of the output from 2 filter banks, where one matches the input patch 100% and the other matches 55.5%.

It can be noted that the energy decreases as the number of bits per column increases (larger filter size). This is because, with more voltage sources, the output memristor switches slightly faster than with the smaller filter size. Moreover, the energy decreases as the number of matching cells within the same filter increases, since the current is limited by the memristor storing the feature map values.

Different 2D filter sizes were investigated, as shown in Table IV, after reshaping them into 1D arrays. The sizes of 2×2 and 3×3 were selected for these simulations because they are widely used in CNNs; nonetheless, other sizes such as 5×5 and 11×11 have been reported in the literature [55]. The more parallel pairs added to the structure, the lower the total resistance, which reduces the differentiable voltage margin between the matching and mismatching cases. Hence, there is a trade-off between the filter size and the corresponding output resistance during the matching and mismatching operation. As the filter size increases, the column length increases, and the voltage difference between the all-matching-cells case and a single mismatch starts to decrease, until it reaches almost zero when all bits mismatch. This is captured in the output resistance value, which then indicates that the learnt feature stored in the filter does not exist in that part of the input image, while the high-resistance bits still contribute to the output voltage. With each mismatching bit between the filter and the image patch, the voltage drops by an amount whose value depends on the length of the column. So the final value of the output memristor, which stores the feature map value, corresponds to the number of matching bits. In other words, the amount of current produced in each column represents the similarity between the input image patch and the feature in the filter.

An example of 2 different filters with two different features of size 3×3, along with the feature map output, is demonstrated in Fig. 6. The input is a binarized image patch and is converted to 1 V if the pixel is white and 0 V if it is black. The outputs from the filters represent how much the input resembles the learned features and are stored directly in the accumulative memristor. When the feature learned by the second filter matches the image patch 100%, the memristor is turned fully ON, while when the learned feature in the first filter differs from the one presented in the image patch, the conductance does not change much, as illustrated in Fig. 7.

In order to account for the savings in the number of operations, let us assume the following scenario: an input image of dimension 249×249 and a filter of size 3×3 with a stride of 3. If implemented in an ASIC, 9 multiplications and 9 additions, including the bias, are required for each patch convolution. Hence, convolving one image with a single filter would require (9×83×83) multiplications + (9×83×83) additions. If, instead, a memristor filter bank is used, 83×83 operations are required for the same scenario. This shows an 18× reduction in the number of operations for one filter. So, if 96 filters were used as in AlexNet [38], a significant reduction of 4 orders of magnitude is observed when the proposed architecture is utilized. This is because the filter bank is capable of calculating the results for N feature maps in parallel.

V. CONCLUSIONS AND FUTURE WORK

In this paper, a reconfigurable CAM and a feature extraction architecture for filter learning have been proposed. The main building block is a memristor-based VR-XNOR gate, followed by an inherent summation that yields a percentage reflecting the existence of the matching/similar feature. For the CNN, the filter is followed by a pooling layer that downsamples the feature map for a reduction in computational time. The proposed approach provides significant area savings and fast computation compared to conventional approaches. In the proposed work, we have focused on the convolution and pooling parts; as a result, a direct comparison


with other full-system implementations in the literature is out of the scope of this paper.

Future work will extend the proposed architecture to build a full memristor-based BCNN system that implements all layers. Moreover, the architecture will be expanded to account for binary weights and ternary inputs as a trade-off between complexity and performance. Co-optimization of learning algorithms together with memristor-based in-memory computing hardware architectures is the key to tackling the accuracy and performance challenges, especially for resource-constrained applications. Besides, the reconfigurable CAM will be implemented with multiple stages for a wider data path.

REFERENCES

[1] A. Basu et al., "Low-power, adaptive neuromorphic systems: Recent progress and future directions," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 1, pp. 6–27, Mar. 2018.
[2] C. Yakopcic, V. Bontupalli, R. Hasan, D. Mountain, and T. Taha, "Self-biasing memristor crossbar used for string matching and ternary content-addressable memory implementation," Electron. Lett., vol. 53, no. 7, pp. 463–465, 2017.
[3] L. Zheng, S. Shin, and S.-M. S. Kang, "Memristors-based ternary content addressable memory (mTCAM)," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2014, pp. 2253–2256.
[4] S. Mittal, "A survey of ReRAM-based architectures for processing-in-memory and neural networks," Mach. Learn. Knowl. Extraction, vol. 1, no. 1, pp. 75–114, 2018.
[5] Y. Halawani, M. A. Lebdeh, B. Mohammad, M. Al-Qutayri, and S. Al-Sarawi, "Stateful memristor-based search architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 12, pp. 2773–2780, Dec. 2018.
[6] D. Soudry, D. Di Castro, A. Gal, A. Kolodny, and S. Kvatinsky, "Memristor-based multilayer neural networks with online gradient descent training," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 10, pp. 2408–2421, Oct. 2015.
[7] H. D. Nguyen, J. Yu, L. Xie, M. Taouil, S. Hamdioui, and D. Fey, "Memristive devices for computing: Beyond CMOS and beyond von Neumann," in Proc. 25th IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Oct. 2017, pp. 1–10.
[8] Y. Levy et al., "Logic operations in memory using a memristive Akers array," Microelectron. J., vol. 45, no. 11, pp. 1429–1437, 2014.
[9] L. Xie et al., "Scouting logic: A novel memristor-based logic design for resistive computing," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2017, pp. 176–181.
[10] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "Memristor-based material implication (IMPLY) logic: Design principles and methodologies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 10, pp. 2054–2066, Oct. 2014.
[11] S. Gupta, M. Imani, and T. Rosing, "FELIX: Fast and energy-efficient logic in memory," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 2018, pp. 1–7.
[12] S. Kvatinsky et al., "MAGIC—Memristor-aided logic," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 11, pp. 895–899, Nov. 2014.
[13] M. A. Lebdeh, H. Abunahla, B. Mohammad, and M. Al-Qutayri, "An efficient heterogeneous memristive XNOR for in-memory computing," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 9, pp. 2427–2437, Sep. 2017.
[14] S. Shin, K. Kim, and S.-M. Kang, "Resistive computing: Memristors-enabled signal multiplication," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 5, pp. 1241–1249, May 2013.
[15] I. Vourkas and G. C. Sirakoulis, "Nano-crossbar memories comprising parallel/serial complementary memristive switches," BioNanoScience, vol. 4, no. 2, pp. 166–179, 2014.
[16] S. Kvatinsky, N. Wald, G. Satat, A. Kolodny, U. C. Weiser, and E. G. Friedman, "MRL—Memristor ratioed logic," in Proc. 13th Int. Workshop Cellular Nanosc. Netw. Appl., Aug. 2012, pp. 1–6.
[17] Y. Sun, X. Yan, X. Zheng, Y. Liu, Y. Shen, and Y. Zhang, "Influence of carrier concentration on the resistive switching characteristics of a ZnO-based memristor," Nano Res., vol. 9, no. 4, pp. 1116–1124, 2016.
[18] S. Srivastava, P. Dey, S. Asapu, and T. Maiti, "Role of GO and r-GO in resistance switching behavior of bilayer TiO2 based RRAM," Nanotechnology, vol. 29, no. 50, Oct. 2018, Art. no. 505702.
[19] Y. Liu, C. Dwyer, and A. R. Lebeck. (2016). "Combined compute and storage: Configurable memristor arrays to accelerate search." [Online]. Available: https://arxiv.org/abs/1601.05273
[20] K. Pagiamtzis and A. Sheikholeslami, "Content-addressable memory (CAM) circuits and architectures: A tutorial and survey," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.
[21] A. T. Do, C. Yin, K. Velayudhan, Z. C. Lee, K. S. Yeo, and T. T.-H. Kim, "0.77 fJ/bit/search content addressable memory using small match line swing and automated background checking scheme for variation tolerance," IEEE J. Solid-State Circuits, vol. 49, no. 7, pp. 1487–1498, Jul. 2014.
[22] B. Mohammad, P. Bassett, J. Abraham, and A. Aziz, "Cache organization for embedded processors: CAM-vs-SRAM," in Proc. IEEE Int. SOC Conf., Sep. 2006, pp. 299–302.
[23] R. Karam, R. Puri, S. Ghosh, and S. Bhunia, "Emerging trends in design and applications of memory-based computing and content-addressable memories," Proc. IEEE, vol. 103, no. 8, pp. 1311–1330, Aug. 2015.
[24] T. V. Mahendra, S. Mishra, and A. Dandapat, "Self-controlled high-performance precharge-free content-addressable memory," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2388–2392, Aug. 2017.
[25] S.-I. Chae, J. T. Walker, C.-C. Fu, and R. F. Pease, "Content-addressable memory for VLSI pattern inspection," IEEE J. Solid-State Circuits, vol. 23, no. 1, pp. 74–78, Feb. 1988.
[26] P. Junsangsri, J. Han, and F. Lombardi, "Design and comparative evaluation of a PCM-based CAM (content addressable memory) cell," IEEE Trans. Nanotechnol., vol. 16, no. 2, pp. 359–363, Mar. 2017.
[27] W. Xu, T. Zhang, and Y. Chen, "Design of spin-torque transfer magnetoresistive RAM and CAM/TCAM with high sensing and search speed," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 1, pp. 66–74, Jan. 2010.
[28] M. K. Gupta and M. Hasan, "Design of high-speed energy-efficient masking error immune pentaMTJ-based TCAM," IEEE Trans. Magn., vol. 51, no. 2, pp. 1–9, Feb. 2015.
[29] N. Ciocchini et al., "Bipolar switching in chalcogenide phase change memory," Sci. Rep., vol. 6, Jul. 2016, Art. no. 29162.
[30] M. Imani, Y. Kim, A. Rahimi, and T. Rosing, "ACAM: Approximate computing based on adaptive associative memory with online learning," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), Jun. 2016, pp. 162–167.
[31] Y. Yang, J. Mathew, S. Pontarelli, M. Ottavi, and D. K. Pradhan, "Complementary resistive switch-based arithmetic logic implementations using material implication," IEEE Trans. Nanotechnol., vol. 15, no. 1, pp. 94–108, Jan. 2016.
[32] K. Cho, S. J. Lee, K. S. Oh, C. R. Han, O. Kavehei, and K. Eshraghian, "Pattern matching and classification based on an associative memory architecture using CRS," in Proc. 13th Int. Workshop Cellular Nanosc. Netw. Appl., Aug. 2012, pp. 1–5.
[33] S.-J. Lee, S.-J. Kim, K. Cho, and K. Eshraghian, "Implementation of complementary resistive switch for image matching through back-to-back connection of ITO/ITO2−x/TIO2/ITO memristors," Phys. Status Solidi A, vol. 211, no. 8, pp. 1933–1940, 2014.
[34] L. Zheng, S. Shin, S. Lloyd, M. Gokhale, K. Kim, and S. M. Kang, "RRAM-based TCAMs for pattern search," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2016, pp. 1382–1385.
[35] S. Smaili and Y. Massoud, "Memristor state to logic mapping for optimal noise margin in memristor memories," in Proc. 14th IEEE Int. Conf. Nanotechnol., Aug. 2014, pp. 291–295.
[36] P. M. Sheridan, C. Du, and W. D. Lu, "Feature extraction using memristor networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 11, pp. 2327–2336, Nov. 2016.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., Jun. 2012, pp. 1097–1105.
[38] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[39] E. Nurvitadhi et al., "Can FPGAs beat GPUs in accelerating next-generation deep neural networks?" in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 5–14.
[40] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in Proc. Adv. Neural Inf. Process. Syst., Nov. 2017, pp. 344–352.


[41] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis., Sep. 2016, pp. 525–542.
[42] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. Adv. Neural Inf. Process. Syst., Jun. 2016, pp. 4107–4115.
[43] R. Zhao et al., "Accelerating binarized convolutional neural networks with software-programmable FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2017, pp. 15–24.
[44] H. Nakahara, H. Yonekawa, T. Sasao, H. Iwamoto, and M. Motomura, "A memory-based realization of a binarized deep convolutional neural network," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2016, pp. 277–280.
[45] R. Hasan, T. M. Taha, and C. Yakopcic, "On-chip training of memristor based deep neural networks," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3527–3534.
[46] C. Yakopcic, M. Z. Alom, and T. M. Taha, "Extremely parallel memristor crossbar architecture for convolutional neural network implementation," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 1696–1703.
[47] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 14–26, 2016.
[48] L. Ni, Z. Liu, H. Yu, and R. V. Joshi, "An energy-efficient digital ReRAM-crossbar-based CNN with bitwise parallelism," IEEE J. Explor. Solid-State Comput. Devices Circuits, vol. 3, no. 4, pp. 37–46, May 2017.
[49] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and P.-E. Gaillardon, "A robust digital RRAM-based convolutional block for low-power image processing and learning applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 2, pp. 643–654, Feb. 2019.
[50] C. Yakopcic, M. Z. Alom, and T. M. Taha, "Memristor crossbar deep network implementation based on a convolutional neural network," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 963–970.
[51] S. N. Truong, S. Shin, S.-D. Byeon, J. Song, H.-S. Mo, and K.-S. Min, "Comparative study on statistical-variation tolerance between complementary crossbar and twin crossbar of binary nano-scale memristors for pattern recognition," Nanosc. Res. Lett., vol. 10, no. 1, pp. 1–9, 2015.
[52] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, "VTEAM: A general model for voltage-controlled memristors," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 8, pp. 786–790, Aug. 2015.
[53] I. Vourkas and G. C. Sirakoulis, "A novel design and modeling paradigm for memristor-based crossbar circuits," IEEE Trans. Nanotechnol., vol. 11, no. 6, pp. 1151–1159, Nov. 2012.
[54] R. S. Williams. (2012). Finding the Missing Memristor—Keynote Talk at UC San Diego CNS Winter Research Review. [Online]. Available: http://cns.ucsd.edu/files_2010/january_2010/agenda2010winterreivew.pdf
[55] W. Yu, K. Yang, Y. Bai, T. Xiao, H. Yao, and Y. Rui, "Visualizing and comparing AlexNet and VGG using deconvolutional layers," in Proc. 33rd Int. Conf. Mach. Learn., Jun. 2016, pp. 1–7.

Baker Mohammad (M'04–SM'13) received the B.S. degree from the University of New Mexico, Albuquerque, the M.S. degree from Arizona State University, Tempe, and the Ph.D. degree from The University of Texas at Austin in 2008, all in ECE. He was a Senior Staff Engineer/Manager at Qualcomm, Austin, TX, USA, for six years, where he was engaged in designing high-performance and low-power DSP processors used for communication and multimedia applications. Before joining Qualcomm, he was with the Intel Corporation for 10 years, where he was involved in a wide range of microprocessor designs, from high-performance server chips >100 W (IA-64) to low-power, sub-1-W mobile embedded processors (XScale). He has over 16 years' industrial experience in microprocessor design with emphasis on memory, low-power circuits, and physical design. He is currently an Associate Professor of ECE and the Director of the System-on-Chip Research Center at Khalifa University. His research interests include VLSI, power-efficient computing, high-yield embedded memory, emerging technologies such as memristors, STT-RAM, in-memory computing, and hardware accelerators for cyber-physical systems. In addition, he is engaged in microwatt-range computing platforms for wearable electronics and WSNs, focusing on energy harvesting, power management, and power conversion, including efficient DC/DC and AC/DC converters.

Muath Abu Lebdeh received the B.Sc. degree in electrical and electronic engineering and the M.Sc. degree in electrical and computer engineering from Khalifa University, in 2015 and 2017, respectively. He is currently pursuing the Ph.D. degree with the Computer Engineering Laboratory, Delft University of Technology. His research interests include CIM circuits and architectures.

Mahmoud Al-Qutayri (M'86–SM'04) received the B.Eng. degree from Concordia University, Montreal, Canada, in 1984, the M.Sc. degree from the University of Manchester, U.K., in 1987, and the Ph.D. degree from the University of Bath, U.K., in 1992, all in electrical and electronic engineering. He was with De Montfort University, U.K., and the University of Bath, U.K. He is currently a Full Professor with the Department of Electrical and Computer Engineering and the Associate Dean for Graduate Studies with the College of Engineering, Khalifa University, United Arab Emirates. He has authored/coauthored numerous technical papers in peer-reviewed international journals and conferences. He also coauthored a book entitled Digital Phase Lock Loops: Architectures and Applications and edited a book entitled Smart Home Systems, in addition to a number of book chapters and four patents. His current research interests include embedded systems design and applications, design and test of mixed-signal integrated circuits, wireless sensor networks, cognitive radio, and hardware security. His professional service includes membership of the steering, organizing, and technical program committees of many international conferences, and he is a reviewer for a number of journals.

Said F. Al-Sarawi (S'92–M'96) received the general certificate in marine radio communication and the B.Eng. degree (Hons.) in marine electronics and communication from the Arab Academy for Science and Technology (AAST), Egypt, in 1987 and 1990, respectively, and the Ph.D. degree in mixed analog and digital circuit design techniques for smart wireless systems, with special commendation, in electrical and electronic engineering and the Graduate Certificate in Education (Higher Education) from The University of Adelaide, Australia, in 2003 and 2006, respectively, where he is currently the Director of the Centre for Biomedical Engineering. His research interests include design techniques for mixed-signal systems in micro- and nano-electronics and optoelectronic technologies for high-performance radio transceivers, low-power and low-voltage radio frequency identification (RFID) systems, data converters, and microelectromechanical systems (MEMS) for biomedical applications. He received the University of Adelaide Alumni Postgraduate Medal (formerly Culross) for outstanding academic merit at the postgraduate level and the Commonwealth Postgraduate Research Award (Industry) while pursuing his Ph.D.

Yasmin Halawani (S'14) received the B.S. degree in electrical and electronics engineering from the University of Sharjah (UOS), United Arab Emirates, in 2012, and the M.S. degree in electrical and electronics engineering from Khalifa University, United Arab Emirates, in 2014, where she is currently pursuing the Ph.D. degree in the area of memristor-based in-memory computing architectures. Her research project focuses on investigating the suitability of emerging memory technologies, such as memristor and STT-RAM, for low-power applications.

