
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Fully Parallel Stochastic Computing Hardware Implementation of Convolutional Neural Networks for Edge Computing Applications
Christiam F. Frasser , Pablo Linares-Serrano, Iván Díez de los Ríos , Student Member, IEEE,
Alejandro Morán, Student Member, IEEE, Erik S. Skibinsky-Gitlin , Member, IEEE, Joan Font-Rosselló,
Vincent Canals , Member, IEEE, Miquel Roca , Member, IEEE, Teresa Serrano-Gotarredona,
and Josep L. Rosselló , Member, IEEE

Abstract— Edge artificial intelligence (AI) is receiving a tremendous amount of interest from the machine learning community due to the ever-increasing popularization of the Internet of Things (IoT). Unfortunately, the incorporation of AI characteristics to edge computing devices presents the drawbacks of being power and area hungry for typical deep learning techniques such as convolutional neural networks (CNNs). In this work, we propose a power- and area-efficient architecture based on the exploitation of the correlation phenomenon in stochastic computing (SC) systems. The proposed architecture solves the challenges that a CNN implementation with SC (SC-CNN) may present, such as the high resources used in binary-to-stochastic conversion, the inaccuracy produced by undesired correlation between signals, and the complexity of the stochastic maximum function implementation. To prove that our architecture meets the requirements of edge intelligence realization, we embed a fully parallel CNN in a single field-programmable gate array (FPGA) chip. The results obtained showed a better performance than traditional binary logic and other SC implementations. In addition, we performed a full VLSI synthesis of the proposed design, showing that it presents better overall characteristics than other recently published VLSI architectures.

Index Terms— Convolutional neural networks (CNNs), edge computing (EC), stochastic computing (SC).

Manuscript received October 20, 2020; revised September 28, 2021 and January 23, 2022; accepted April 1, 2022. This work was supported in part by the Ministerio de Ciencia e Innovación; in part by the European Union NextGenerationEU/PRTR; in part by the European Regional Development Fund (ERDF) under Grant TEC2017-84877-R, Grant PID2019-105556GB-C31, Grant PCI2019-111826-2, Grant PID2020-120075RB-I00, and Grant PDC2021-121847-I00; in part by the ERDF A way of making Europe under Grant MCIN/AEI/10.13039/501100011033; and in part by the European Union NextGenerationEU/PRTR. The work of Pablo Linares-Serrano was supported by the CSIC JAE-Intro-ICU 2019 Scholarship, Instituto de Microelectrónica de Sevilla (IMSE). (Corresponding author: Josep L. Rosselló.)

Christiam F. Frasser, Alejandro Morán, and Erik S. Skibinsky-Gitlin are with the Electronics Engineering Group, Industrial Engineering and Construction Department, University of Balearic Islands, 07122 Palma, Spain.

Pablo Linares-Serrano, Iván Díez de los Ríos, and Teresa Serrano-Gotarredona are with the Instituto de Microelectrónica de Sevilla (IMSE-CNM), CSIC, 41092 Seville, Spain.

Joan Font-Rosselló, Vincent Canals, Miquel Roca, and Josep L. Rosselló are with the Electronics Engineering Group, Industrial Engineering and Construction Department, University of Balearic Islands, 07122 Palma, Spain, and also with the Balearic Islands Health Research Institute (IdISBa), 07120 Palma, Spain (e-mail: j.rossello@uib.es).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2022.3166799.

Digital Object Identifier 10.1109/TNNLS.2022.3166799

2162-237X © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

EDGE computing (EC) is characterized by implementing data processing at the edge of the network [1] instead of doing it at the server level. This has brought about great interest in the microelectronic industry due to the proliferation of the Internet of Things (IoT). At the same time, incorporating artificial intelligence (AI) capacity in everyday devices has been in the spotlight in recent times, making the development of new techniques to extend AI to edge applications a must [2], [3]. The idea behind these research efforts is to assist EC devices to further reduce their dependence on cloud processing and to reduce the energy associated with data transmission. However, research on edge intelligence is still in its early days, since edge nodes normally present considerable limitations in terms of area and power consumption, thus limiting the incorporation of typical state-of-the-art deep learning implementations into embedded devices. Therefore, new solutions for efficient hardware implementations of machine learning applications, such as neuromorphic hardware [4], [5] or convolutional neural networks (CNNs) [6], [7], have become a trending topic recently.

Stochastic computing (SC) is an approximate computing technique that has been arousing increasing interest over the last decade. Its capacity to compress complex functions into a low number of logic gates has motivated the development of different proposals for pattern recognition applications [8], to implement ANN accelerators in hardware [9]–[14], to implement random vector functional links (RVFLs) [15], and, more specifically, to implement CNNs [7], [9], [10], [16], [17]. Nonetheless, some realization challenges remain, such as the high resources used to implement independent random number generators (RNGs), the accuracy degradation produced by the lack of full decorrelation between signals, and the compactness of the convolutional and max-pooling (MP) functions. Tackling these issues is not trivial. Lee et al. [18] approached them by implementing only the first convolutional layer


using SC. Sim et al. [19] created a hybrid stochastic-binary architecture, where only the multiplications were implemented in SC. Neither approach is a fully stochastic implementation, and therefore, the benefits are limited.

The development of efficient CNN acceleration systems has recently attracted a lot of attention from the scientific community [20], [21]. Its implementations can be classified under two categories: generic-base accelerators or streaming accelerators. Generic-base accelerators provide architectures that only operate with a specific unit, which is common for all or most of the network blocks. This generic unit must handle every size, type, and channel width input for all layers. The outcomes are stored in and read from memory for every new processing cycle. This approach is low resource consuming, but the throughput is likewise low. Moreover, the repeated memory accesses boost the power drained by the system [22]. On the other hand, streaming accelerators dedicate specific hardware to the computation blocks, making them ideal for parallelizing architectures. They are high resource consuming but have the best throughput. In addition, the power drained by memory access is practically null, as storing intermediate results is needless.

Different classic SC CNN accelerators have been implemented in the literature, but most of them are generic-base (see [9], [10], [17], [23]), having the drawbacks previously described. Furthermore, the results they present are all obtained by software simulations, and no real implementation in hardware is introduced.

In this work, we propose an efficient reduced-area hardware architecture that overcomes the named hurdles. We exploited both the decorrelation and the correlation between SC signals to implement the basic building blocks of a CNN. To the best of our knowledge, this is the first time a classic SC fully parallel CNN is implemented on a single chip. As a real proof of concept, we implemented the proposed architecture on a single field-programmable gate array (FPGA) chip and compared its performance characteristics with different previously published FPGA works. Furthermore, we synthesized our CNN designs in VLSI circuits, demonstrating an improvement over state-of-the-art VLSI designs of SC-CNN circuits.

II. STOCHASTIC COMPUTING

A. Codifications

SC arises as a promising technique for deep learning hardware implementations and is becoming extensively used for neuromorphic applications [24]. In contrast to traditional binary logic, in which the bit weight (2^i) is proportional to 2 raised to the bit's relative position "i" in the bit stream, SC is based on a simple mathematical codification, where all the data bits processed have the same weight. SC is normally related to a probabilistic process, where each signal is serially processed, as usual in SC designs. Every time step provides a bit value representing the probability of finding a TRUE state (logic "1") at any arbitrary position throughout the sequence of bits. For instance, the number 0.75 could be represented by a bit stream in which the probability of finding a logic "1" is 75%: [1, 1, 0, 1] (a weight of 0.25 for each data bit), while for an 8-bit stream, we have the possible sequence [0, 1, 1, 0, 1, 1, 1, 1] with a bit weight of 0.125.

The probabilistic interpretation of SC allows only positive numbers (between 0 and 1) to be represented and is normally known as unipolar codification. It is possible to represent negative values (between −1 and 1, known as bipolar codification) by just using a variable change from the unipolar codification. In bipolar codification, the number of zeros is subtracted from the number of ones and finally divided by the total number of bits in the stream, giving p∗ = (N1 − N0)/(N0 + N1), where N0 is the number of zeros, N1 is the number of ones, and the ∗ symbol denotes the bipolar codification. The variable change from unipolar (p) to bipolar (p∗) is p∗ = 2p − 1.

One of the main advantages of using SC is the low cost in hardware resources when implementing complex functions. Take, for instance, the multiplication operation, implemented in SC using just a single logic gate: an AND gate for unipolar codification and an XNOR gate for bipolar. Fig. 1 shows how the multiplication is obtained by using different logic gates for the two SC codifications.

Fig. 1. Stochastic multiplication using different codification techniques: (a) unipolar multiplication gate, (b) time diagram for unipolar multiplication circuit, (c) bipolar multiplication gate, (d) time diagram for bipolar multiplication circuit.

B. Conversion

In order to operate in the stochastic domain, a converter circuit must be implemented. The most commonly used circuit is based on a pseudorandom number generator, normally a linear feedback shift register (LFSR), and a comparator. Each two's complement magnitude X must be converted to its serialized stochastic counterpart (bit stream) x(t) by using an RNG (generating R(t), as shown in Fig. 2). The full converter circuit (comparator and LFSR) is denoted as a binary-to-stochastic converter (BSC). If the X value is greater than the pseudorandom number R(t), the bit stream output x(t) is set to "1"; otherwise, it is set to "0." If the random number generated R(t) is uniformly distributed in the interval of all possible values of X, the stochastic signal x(t) expressed in bipolar mode (x∗) is then proportional to the converted two's complement magnitude: x∗ ∼ X/2^(N−1). To recover the X value, a digital up/down counter may be used during a fixed evaluation period of time. The stream length over which the evaluation is performed is related to the conversion precision, so that the longer the stream, the higher the precision.
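As a concrete illustration of the codification and conversion just described, the following Python sketch (ours, not the authors' code) behaviorally models a BSC and the XNOR-based bipolar multiplication of Fig. 1. The 8-bit range, seeds, and stream length are illustrative assumptions, and Python's `random` stands in for the LFSRs.

```python
import random

def bsc(X, rng, n_bits=8, length=1 << 15):
    """Binary-to-stochastic converter: compare the two's complement
    magnitude X against a pseudorandom sequence R(t); emit 1 when X > R(t)."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return [1 if X > rng.randrange(lo, hi + 1) else 0 for _ in range(length)]

def bipolar_value(stream):
    """Decode a bit stream in bipolar mode: p* = (N1 - N0) / (N0 + N1)."""
    n1 = sum(stream)
    n0 = len(stream) - n1
    return (n1 - n0) / (n1 + n0)

rng_x, rng_w = random.Random(1), random.Random(2)  # two independent LFSR stand-ins
X, W = 64, -96                                     # x* ~ 64/128 = 0.5, w* ~ -0.75
x = bsc(X, rng_x)
w = bsc(W, rng_w)
z = [1 if xi == wi else 0 for xi, wi in zip(x, w)]  # XNOR: bipolar product
print(bipolar_value(z))  # ≈ x* · w* = -0.375 (up to stream-length noise)
```

Because the two streams come from independent generators, the XNOR output decodes to the product of the bipolar values; the residual error shrinks as the stream length grows, matching the precision trade-off noted above.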


C. Traditional Computing Versus SC

The advantage of traditional binary logic is that the available information (I, which is related to the precision of the operations) is coded with the minimum number of data bits (Nb), so that the ratio I/Nb is always maximum and equal to 1 (where I ≡ −log2(p), with p the probability of any feasible category/number included in the codification, all assumed to be equally probable). In the case of SC, the codification has a poor efficiency rate, since I/Nb = log2(1 + Nb)/Nb. This rate vanishes to zero as Nb is increased. As a counterpart, the poor codification efficiency of SC is compensated by simpler hardware compared with traditional computing (TC). We define the hardware packing efficiency as the ratio of Nb to the expected number of gates used in a specific operation. A comparison between both methodologies is shown in Table I for the special case of the multiplication operation. Given that more gates are needed to construct a TC binary multiplier than an SC one (both in parallel form), the hardware efficiency is always 1 for SC, while for TC it vanishes to zero as Nb is increased. To select the best computing technique for a specific computation, we can define the information processing efficiency [or computing efficiency (CE)] as the ratio between the information to be processed (related to the precision required in the operation) and the number of gates needed (see the last row of Table I). For a better understanding of these results, we can plot the ratio of both efficiencies (CE_SC/CE_TC) as a function of the required precision (see Fig. 3). As can be appreciated, SC provides much better efficiency than TC for low-precision operations (fewer than 9 bits). For the case of pattern recognition, in which a relatively small set of categories must be distinguished, it is well known that the precision used in the intermediate operations is not a critical issue. Therefore, it is expected that SC should provide a greater processing efficiency than TC for deep learning applications.

Fig. 3. Information processing efficiency ratio between SC and TC as a function of the number of bits to be processed for the case of the product operation.

D. Correlation

One of the advantages of TC with respect to SC is that the former is better established and its design process is quite systematic. The design techniques of SC are more complex due to the incorporation of probabilistic laws into the coding. In this sense, the stochastic multipliers shown in Fig. 1 only work correctly when the input signals are completely uncorrelated (with a null covariance between them). In this work, two bit streams are defined to be completely correlated when they are generated by the same RNG.

Most of the errors produced when designing SC systems come from operating on stochastic signals with an unknown level of correlation. For this reason, many have tried to avoid this correlated imprecision by generating all stochastic bit streams with independent LFSRs. Unfortunately, this approach employs a high amount of hardware resources in the conversion circuits, thereby limiting the advantage that SC offers for hardware implementations. Nevertheless, although some operations (such as the multiplier) need uncorrelated signals, there are also some cases where correlation is desirable.

Fig. 2. Correlation impact on stochastic operations. Correlation between signals changes the operation computed by the logic gate. Stochastic signals x(t) and y(t) are said to be completely correlated when they share the same RNG (Rx = Ry), producing the function 1 − |x∗ − y∗|; by contrast, if Rx ≠ Ry, they are said to be uncorrelated and the output function is different: x∗y∗.

Let us consider the circuit shown in Fig. 2. Input signals X and Y are N-bit two's complement binary signals compared with two fluctuating pseudorandom numbers Rx(t) and Ry(t). At the output of the comparators, we have stochastic signals with Boolean values x(t) and y(t) that can be stochastically coded in bipolar mode as x∗ and y∗, respectively. The XNOR gate, in the presence of two uncorrelated signals (Rx(t) ≠ Ry(t) on each comparator), performs the product operation (x∗y∗); but in the presence of two correlated signals (produced by sharing the same random generator, Rx(t) = Ry(t), in the conversion circuit), it performs a different operation (1 − |x∗ − y∗|). In general, the maximum correlation between two SC signals is obtained when connecting the same LFSR output (R(t)) as the reference input of both comparators (see Fig. 2).

To obtain the analytical expression, we can use the diagrams shown in Fig. 4. Two different LFSRs are used to generate bipolar signals x∗ and y∗ (with Rx and Ry) through the diagram shown in Fig. 4(a). Since Rx and Ry oscillate independently, we display them on two independent axes. Then, the possible Boolean values of the stochastic signals [x(t) and y(t)] can be easily delimited. The areas shown in the diagram over each possible [x(t), y(t)] pair are proportional to the probability of the stochastic system generating them. For the case of using an XNOR gate as in Fig. 2, the shadowed area of Fig. 4(a) will be related to z(t) values equal to 1; otherwise, z(t) = 0. Therefore, the bipolar signal

z∗ = (N1 − N0)/(N1 + N0) can be easily obtained by estimating these areas, providing the result z∗ = x∗y∗.

In the case of complete correlation (Rx = Ry), we can follow similar reasoning but using one single axis instead of two, since both pseudorandom numbers are the same. To create a clear diagram without overlapping areas, and without loss of generality, we can order the input signal values, thus defining the boundaries x′ = max(X, Y) and y′ = min(X, Y). Ordering the inputs allows the different areas related to the pair of signals (max{x(t), y(t)}, min{x(t), y(t)}) to be identified, as shown in Fig. 4(b). For the case of the XNOR gate, only the shaded area will be related to a high output (z = 1), which corresponds to the bipolar value z∗ = (N1 − N0)/(N1 + N0) = 1 − (x′ − y′)/2^(N−1) = 1 − |x∗ − y∗|. For the cases of the AND and OR gates, we can follow a similar procedure, obtaining:

    AND(x∗, y∗) = (1/2)(x∗y∗ + x∗ + y∗ − 1)   if Rx ≠ Ry
    AND(x∗, y∗) = min(x∗, y∗)                 if Rx = Ry        (1)

    OR(x∗, y∗) = (1/2)(1 + x∗ + y∗ − x∗y∗)    if Rx ≠ Ry
    OR(x∗, y∗) = max(x∗, y∗)                  if Rx = Ry        (2)

TABLE I
COMPARISON BETWEEN STOCHASTIC AND TC TECHNIQUES FOR THE PRODUCT OPERATION

Fig. 4. Stochastic diagrams for the estimation of the stochastic function of a logic gate depending on whether signals are (a) uncorrelated or (b) completely correlated.

Fig. 5. FPGA outcome difference when operating with the maximum and minimum correlation inputs for the AND and OR gates using 8-bit precision.

Fig. 5 shows the FPGA outcome when operating with maximum correlation (Rx = Ry) and minimum correlation (Rx ≠ Ry) for the AND and OR gates using 8-bit precision. As can be observed, we obtain totally different results depending on the correlation level of the input signals. Notice the differences for the OR gate instance, where the maximum operation is carried out as long as the inputs are totally correlated. This interesting feature can be exploited to implement essential functions for deep learning applications, such as the rectified linear unit (ReLU) and MP operations employed in CNNs, leading to high-performance architectures that reduce the area and power of hardware implementations.
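The correlated identities in (1) and (2) can be reproduced behaviorally. The sketch below (our illustration, not the authors' code) feeds two comparators from one shared pseudorandom sequence, so the resulting streams are completely correlated; the input values, seed, and stream length are arbitrary assumptions.

```python
import random

def bsc_stream(X, R):
    # compare X against a precomputed pseudorandom sequence R
    return [1 if X > r else 0 for r in R]

def bipolar(stream):
    n1 = sum(stream)
    n0 = len(stream) - n1
    return (n1 - n0) / (n1 + n0)

rng = random.Random(7)
R = [rng.randrange(-128, 128) for _ in range(1 << 15)]  # ONE shared sequence
X, Y = 64, -32                                          # x* = 0.5, y* = -0.25
x, y = bsc_stream(X, R), bsc_stream(Y, R)               # fully correlated streams

xnor = bipolar([1 if a == b else 0 for a, b in zip(x, y)])
and_ = bipolar([a & b for a, b in zip(x, y)])
or_ = bipolar([a | b for a, b in zip(x, y)])
print(xnor)  # ≈ 1 - |x* - y*| = 0.25
print(and_)  # ≈ min(x*, y*) = -0.25, per (1) with Rx = Ry
print(or_)   # ≈ max(x*, y*) = 0.5, per (2) with Rx = Ry
```

Swapping in a second independent sequence for Y recovers the uncorrelated cases of (1) and (2) instead, which is exactly the behavior summarized in Fig. 5.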


E. Stochastic Addition

Accurate implementation of the stochastic addition continues to be a challenge. Different circuits have been put forward to achieve the task: a simple OR gate, a multiplexer, and an accumulative parallel counter (APC) [25], [26]. Fig. 6 shows the different stochastic addition circuits, where, for the sake of clarity, stochastic signals are denoted by lowercase letters and without the time-dependent reference (t) hereafter. The use of an OR gate as an adder [Fig. 6(a)] is the smallest circuit in terms of hardware footprint, yet it has several drawbacks: it is inaccurate when the input values are relatively high, and it is very sensitive to correlation. This is why its use as a stochastic addition circuit is ruled out in most applications. The multiplexer [Fig. 6(b)] is one of the most popular circuits to achieve the addition. The circuit is low cost in terms of area, and its precision is not affected by correlation among the inputs. The main disadvantage is that the inaccuracy increases as the number of inputs grows, making it unsuitable for deep learning implementations, where a high number of inputs per neuron is demanded. The last case is the APC [Fig. 6(c)], which counts the number of high pulses at the inputs and accumulates the counted value over a period of time, producing a digital output in two's complement codification. The APC solution is the most accurate of the three circuits presented. In addition, it is correlation-insensitive, thereby being the preferable approach for most implementations.

Fig. 6. Stochastic addition circuits. (a) Stochastic addition using an OR gate, where x · y must be close to zero in order to compute the addition accurately. (b) Stochastic scaled addition using a multiplexer, where the accuracy outcome is dependent on the number of inputs. (c) Stochastic addition using an APC, where the accuracy is not degraded and the output is represented in two's complement codification.

III. PROPOSED DESIGN

A. Stochastic Neuron

CNNs are constructed of several interconnected layers of neurons. The most common core neuron is composed of a scalar product block and a ReLU activation function, which implements the operation max(0, input). Therefore, the common base operations used in CNN implementations are multiplication, addition, and the maximum function, the latter employed for the MP operation and the ReLU activation function. All of them can be easily implemented via SC systems if correlation is properly used. In the literature, different stochastic neuron designs have been put forward [9]–[12], although none of them have properly exploited both signal correlation and decorrelation, an approach that could simplify the CNN hardware considerably.

Fig. 7. (a) Stochastic neuron design exploiting correlation to reduce area cost. Stochastic signal a∗ and zero-bipolar (0∗) are generated with the same LFSR (Rx(t)) to produce total correlation between them, returning the ReLU function on the output with a single OR gate. (b) Stochastic MP circuit for a spatial window size of k = 2 × 2. Stochastic neuron outputs yk∗ are totally correlated, allowing the implementation of the maximum function with a single OR gate. The generalization of this design to any window size is trivial.

Fig. 7(a) shows the proposed stochastic neuron design, which exploits both correlation and decorrelation. The incoming stochastic vector x∗ (formed by n elements) is generated by using the output of one LFSR circuit, Rx(t), whereas the stochastic weight vector wi∗ (where i represents the ith neuron of the current layer) is generated by using the output of a second LFSR circuit, Rw(t), not shown in the diagram. As a result, and considering a bipolar codification, the n-XNOR-gate array calculates the stochastic product between neuron inputs and weights. Then, these product signals are added using the APC circuit, yielding an m-bit two's complement number as the output (where m is the system's bit resolution). Once we have the stochastic inner product in digital representation, it must be converted into the stochastic domain again to feed the following layers. This stage is where the correlation phenomenon is fully exploited to implement the max function.

As previously mentioned, the max function can be easily implemented using an OR gate if the inputs are totally correlated. Therefore, if we generate the zero reference signal (0∗) and the APC stochastic signal (a∗) using the same pseudorandom number Rx(t), then the ReLU activation function is carried out with a single OR gate, performing the operation yi∗ = max(0∗, Σ_{j=1}^{n} x∗j · w∗ij).
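A behavioral model may help fix ideas. The sketch below is our illustration of the neuron of Fig. 7(a), not the authors' RTL: the 8-bit scaling, seeds, and stream length are assumptions, `random` replaces the two LFSRs, and the APC is modeled as a per-cycle popcount that is accumulated in software.

```python
import random

def bipolar(stream):
    """Decode a bit stream in bipolar mode: (N1 - N0) / (N1 + N0)."""
    n1 = sum(stream)
    n0 = len(stream) - n1
    return (n1 - n0) / (n1 + n0)

def neuron(X_in, W, Rx, Rw):
    n, L = len(X_in), len(Rx)
    # XNOR product array: inputs use Rx, weights use Rw (decorrelated);
    # the APC counts the high product bits on every clock cycle
    apc = [sum(1 if (xj > Rx[t]) == (wj > Rw[t]) else 0
               for xj, wj in zip(X_in, W))
           for t in range(L)]
    # accumulated two's complement inner product, rescaled to the bipolar range
    a = sum(2 * c - n for c in apc) / (L * n)
    # regenerate a* and the 0* reference with the SAME Rx (full correlation):
    # a single OR gate then computes max(a*, 0*), i.e., the ReLU
    A = round(a * 128)
    a_st = [1 if A > r else 0 for r in Rx]
    zero_st = [1 if 0 > r else 0 for r in Rx]
    return bipolar([ai | zi for ai, zi in zip(a_st, zero_st)])

rng = random.Random(3)
Rx = [rng.randrange(-128, 128) for _ in range(1 << 15)]
Rw = [rng.randrange(-128, 128) for _ in range(1 << 15)]
print(neuron([64, -64], [96, 96], Rx, Rw))   # products cancel: ReLU(0) ≈ 0
print(neuron([64, 64], [127, 127], Rx, Rw))  # ReLU(≈ 0.5) ≈ 0.5
```

Note how the two roles of randomness are kept apart: Rx versus Rw gives the decorrelation the XNOR multipliers need, while reusing Rx for a∗ and 0∗ gives the full correlation the OR-gate ReLU needs.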


Fig. 8. Fully parallel stochastic CNN architecture. Only two unique pseudorandom number generators are employed. All neurons are working simultaneously
in parallel due to the correlation phenomenon exploitation. 5 × 5 size kernels are used for convolution layers. Shared signals among neurons are depicted by
dashed lines.

[Fig. 7(b)], saving precious area resources, latency time, and CNN architecture is needed. For this reason, an efficient FEB
energy consumption. will reverberate in the whole efficiency of the network.
Table II shows different FEB designs found in the SC
C. Full CNN Architecture literature. Each macro-column represents the main operations
carried out in the FEB: multiplication, addition, activation
Fig. 8 shows how the whole system is connected to
function, and pooling. For each operation, we show the
reproduce a CNN design (the LeNeT-5). As noted, only two
type, the circuit implemented, and the result of the operation
unique pseudorandom number generators [Rx (t) and Rw (t)]
(approximate or exact). As shown, the proposed FEB design
are needed to accomplish the overall calculus, considerably
has three operations providing exact results while occupying
saving area and power in the design. This could be achieved
a low area in the pooling and activation function blocks since
due to the stochastic neuron design, which exploits correlation
only a single OR gate is needed.
and decorrelation for computing. LFSR1 is used for the Rx (t)
In Table III, we compared the frequency, area, power,
number generation and is connected to the input image com-
and energy with respect to reference HEIF [10] results. The
parator, the 0∗ signal reference generator, and every stochastic
neuron in the whole design [for the APC stochastic generator, see Fig. 7(a)]. LFSR2 is used for the Rw(t) number generation, which is only employed to produce the stochastic weights. In this way, each stochastic signal generated by LFSR1 is totally uncorrelated with those generated by LFSR2, allowing neuron inputs to be multiplied by weights with the highest precision.

Moreover, the proposed architecture allows neuron outputs from layer li to be connected to the neuron inputs of the next layer li+1 without any risk of signal degradation. Since the li neuron outputs are generated from a first LFSR block Rx(t) and the li+1 weights from a second LFSR Rw(t), the error induced from layer to layer by the appearance of uncontrolled correlation between signals is totally avoided.

It is important to note that no pruning, weight sharing, or clustering has been carried out; the overall array of weights has been embedded in the design.

As noted by the dashed lines, Rx(t) and 0∗ are shared throughout the whole network, saving plenty of resources and enabling all neurons to work simultaneously in parallel. Power consumption plummets, since no memory accesses for reading and writing intermediate results are necessary.

IV. EXPERIMENTAL RESULTS

A. FEB Evaluation

The feature extraction block (FEB) is defined as the union of convolutional and pooling neurons that generates a single feature point. This block is the base of every single convolutional layer and the minimum block required in case a bigger design must be partitioned. The 64-input FEB is made of four ReLU neurons of 16 inputs each and a 4-to-1 MP block. The design has been synthesized in TSMC 40-nm CMOS technology using the Cadence Genus tool. We have synthesized two FEB designs, with and without pipelining. The pipeline is accomplished by inserting DFFs in the critical paths of the design to improve the clock frequency of the system. Pipelining is essential when a more complex architecture is required: it allows the whole network to be split into smaller processing elements that fit in the device and operate sequentially (the tiling technique).

Comparing the two proposed FEB designs (pipelined versus nonpipelined), the pipeline optimization achieves a 1.6× higher clock speed at the cost of 1.3× the area, 1.9× the power, and 1.2× the energy. Comparing the proposed pipelined design with the proposal taken from HEIF [10], this work presents a 1.8× increase in processing speed and 3.9× more energy efficiency. The advantage comes mainly from the exploitation of the correlation phenomenon, which yields exact ReLU and MP functions while reducing the total circuit path delay. The difference in area comes from the APC design used in [10], an approximate APC (AAPC) developed in [29] that is considerably smaller than an exact APC, although less precise.

B. SC-CNN Evaluation

To evaluate the proposed SC design, we have implemented two different CNN architectures: the LeNet-5 and a 30M-operation CNN capable of processing the CIFAR-10 dataset. For the case of the LeNet-5 architecture, a fully parallel FPGA implementation is carried out.
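The role of signal (de)correlation described above can be reproduced numerically. The sketch below is illustrative only: the LFSR tap choice and the use of Python's Mersenne Twister as the second generator are assumptions, not details taken from the article. AND-ing two uncorrelated unipolar streams multiplies the encoded values, whereas AND/OR applied to two streams generated from the same random sequence compute min and max, which is the property behind the compact ReLU and MP blocks.

```python
import random

def lfsr16(seed=0xACE1):
    # 16-bit Fibonacci LFSR (taps 16, 14, 13, 11); a stand-in for the
    # article's LFSR blocks -- the actual tap configuration is assumed.
    s = seed
    while True:
        bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
        s = (s >> 1) | (bit << 15)
        yield s

def prng16(seed):
    # Independent pseudorandom source, standing in for a second LFSR.
    r = random.Random(seed)
    while True:
        yield r.getrandbits(16)

def bitstream(value, numbers, n):
    # Stochastic number generator: P(bit = 1) = value / 2**16.
    return [1 if value > next(numbers) else 0 for _ in range(n)]

N = 1 << 16
x, w = int(0.75 * N), int(0.50 * N)

# Uncorrelated streams (two independent generators, as with LFSR1/LFSR2):
# a single AND gate multiplies the encoded probabilities.
prod = sum(a & b for a, b in zip(bitstream(x, lfsr16(), N),
                                 bitstream(w, prng16(7), N))) / N

# Correlated streams (both comparators share the same sequence, like the
# shared Rx(t)): AND yields min(x, w) and OR yields max(x, w).
sx = bitstream(x, lfsr16(), N)
sw = bitstream(w, lfsr16(), N)
mn = sum(a & b for a, b in zip(sx, sw)) / N
mx = sum(a | b for a, b in zip(sx, sw)) / N

print(f"AND, uncorrelated ~ x*w: {prod:.3f}")   # close to 0.375
print(f"AND, correlated   ~ min: {mn:.3f}")     # close to 0.500
print(f"OR,  correlated   ~ max: {mx:.3f}")     # close to 0.750
```

The correlated max is what makes a plain OR gate act as the 4-to-1 MP block: all pooled streams are generated against the same Rx(t), so the OR output encodes exactly the largest input.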

Authorized licensed use limited to: MANIPAL INSTITUTE OF TECHNOLOGY. Downloaded on October 08,2022 at 07:26:17 UTC from IEEE Xplore. Restrictions apply.
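The 30M-operation figure quoted for the CIFAR-10 network can be cross-checked from the layer shapes given in Section IV-B2 (3 × 3 kernels, stride 1, no padding, 32/32/64/64 filters, 2 × 2 max-pooling, and FC layers of 512 and 10 neurons). The short Python tally below is an illustration of ours, not part of the original design flow; it counts one multiply plus one add per MAC:

```python
def conv_macs(size, cin, cout, k=3):
    # MACs of a k x k, stride-1, unpadded convolution on a size x size input.
    out = size - k + 1
    return out, out * out * cout * k * k * cin

macs = {}
size, ch = 32, 3                              # CIFAR-10 input: 32 x 32 RGB
size, macs["conv1"] = conv_macs(size, ch, 32)
ch = 32
size, macs["conv2"] = conv_macs(size, ch, 32)
size //= 2                                    # 2 x 2 max-pool, stride 2
size, macs["conv3"] = conv_macs(size, ch, 64)
ch = 64
size, macs["conv4"] = conv_macs(size, ch, 64)
size //= 2
macs["fc1"] = size * size * ch * 512          # 5 * 5 * 64 = 1600 inputs
macs["fc2"] = 512 * 10

total_ops = 2 * sum(macs.values())            # one multiply + one add per MAC
conv2_share = macs["conv2"] / sum(macs.values())
print(f"{total_ops / 1e6:.1f} M ops, conv2 share {conv2_share:.1%}")
```

Running it prints "30.3 M ops, conv2 share 47.6%", matching both the 30M-operation total and the 47.6% workload share of the second convolutional layer reported in Section IV.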
TABLE II
NEURON DESIGN COMPARISON WITH OTHER SC MODELS

TABLE III
FEB PERFORMANCE COMPARISON FOR PIPELINED AND NONPIPELINED ARCHITECTURES

Fig. 10. Feature maps comparison between software floating-point implementation and FPGA measurements.
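The small, highlighted discrepancies between software and FPGA feature maps are consistent with the variance of finite-length stochastic estimates. A toy sketch of this effect (synthetic activation values and 256-bit streams standing in for the 8-bit implementation; none of these numbers come from the article):

```python
import random

def sc_estimate(p, n, rng):
    # Value recovered from an n-bit stochastic stream encoding probability p.
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(0)
acts = [rng.random() for _ in range(1000)]        # mock feature-map activations
est = [sc_estimate(p, 256, rng) for p in acts]    # 2**8-cycle streams
mae = sum(abs(a - e) for a, e in zip(acts, est)) / len(acts)
print(f"mean |float - stochastic| = {mae:.3f}")
```

The deviation is zero-mean noise of a few percent per activation, which perturbs individual feature-map pixels without biasing the classification result.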

Fig. 9. Evaluation FPGA board [33].

1) LeNet-5 Implementation: The LeNet-5 CNN is oriented to processing the MNIST handwriting dataset, composed of 60k training images and 10k testing images [31]. The CNN architecture consists of two convolutional layers and three fully connected (FC) layers, as in the original paper by Lecun et al. [32]. The baseline score of the trained model was 98.6% (no special optimizations were introduced during training), and the stochastic model reached 97.6%, a 1% accuracy degradation with respect to the software version. This is a satisfactory result, considering that no parameter fine-tuning process was applied, just a simple weight normalization.

We tested the full SC CNN implementation on a GIDEL PROC10A board (Fig. 9), which carries an Intel 10AX115H3F34I2SG FPGA running the 8-bit SC implementation at 150 MHz. Communications were done through a PCI Express bus.

Fig. 10 shows the comparison between the software feature maps extracted from the two pooling layers and the feature maps measured on the FPGA board. As shown, the results are practically the same except for some differences that have been highlighted. Such differences were to be expected, considering that the hardware computes at a lower data resolution (8 bits) and the computations are not deterministic. Be that as it may, they do not affect the final outcome.

Table IV shows the comparison between the proposed implementation and some TC and SC FPGA-based CNN accelerators. We compared the methods in terms of throughput, performance, energy efficiency, and computation blocks employed. Each metric is evaluated using the full SC design, including the pseudorandom number generators (the blocks with the greatest impact on SC designs [4]).

As can be appreciated, the proposed method outperforms the other architectures. The results show that the proposed stochastic CNN implementation achieves 28× more throughput and 18× more performance than the VX690T implementation [34]. In terms of energy efficiency, this work presents a 6.3× increase over the Virtex7-485t implementation [35], making it promising for real embedded system applications.

Compared with the other SC FPGA implementation [27], the difference is striking: the proposal is 1730× faster, achieves 692× more performance, and consumes 306× less energy per image. Although both implementations are based on SC, the difference lies mainly in the exploitation of the correlation phenomenon, which permits realizing the ReLU and MP functions in a highly compact way, drastically decreasing the area usage and thus allowing a total parallelization of the model.

It is important to note that the power estimations of these works only consider the FPGA chip consumption. However, the energy invested in memory access, which can be over 80% of the energy consumed in a CNN accelerator [39], is omitted. Thus, considering that the proposed system is the only one that implements the model in a fully parallel way


TABLE IV
COMPARISON WITH OTHER FPGA LENET-5 IMPLEMENTATIONS

TABLE V
COMPARISON WITH OTHER VLSI LENET-5 IMPLEMENTATIONS

and without a permanent memory access, its comparison with the other works can be considered a worst-case scenario. This is one of the drawbacks of layer-accelerator implementations, where the output of the intermediate computations must be saved and then read back to carry out the whole processing. By contrast, parallel pipelined architectures do not suffer from this phenomenon because all the parameters are embedded in the system and the intermediate results are directly connected to the next layer, making RAM transactions needless.

To the best of our knowledge, this is the first time an entire fully parallel SC CNN has been embedded in a single FPGA. This feature is in stark contrast with the studies presented, where the inference operations are realized by using a loop-tiling technique (an optimization approach that reuses the same hardware resources recursively).

In our design, DSP blocks are avoided, since an unconventional computing technique (SC) is used instead of traditional binary logic. At the same time, memory blocks are not required, since the computation is not performed in a tile-loop manner, thereby getting rid of the principal source of power consumption: the memory access operations.

The complete SC-CNN architecture has also been synthesized in TSMC 40-nm CMOS technology and UMC 250-nm technology using the Cadence Genus tool. The implemented design comprises a total of 913 906 combinational elementary cells (NAND, NOR, and inverter gates) and 104 317 sequential cells. The total area of the full design is 10.88 mm2 in the UMC 250-nm technology node. The design synthesized in TSMC 40 nm takes up a total area of 2.01 mm2 and consumes 651 mW operating at 200 MHz.

Table V summarizes and compares the performance of the synthesized SC-CNN LeNet-5 with other implementations published in the literature. Compared with state-of-the-art implementations of the LeNet-5 using nontraditional logic, the proposed system achieves 1.4× more computational density evaluated in MOPS/mm2, 1.28× more throughput measured in TOPS, 1.58× more energy efficiency expressed in TOPS/watt, 3.8× more area efficiency expressed in TOPS/mm2, 10.4× more throughput measured in images/microsecond, 2× more energy efficiency expressed in images/microjoule, and 3.6× more area efficiency expressed in images/(µs · mm2), always against the best reference in each case. This is due to the compact implementation of the ReLU function and the MP operation by adequately exploiting the signal correlations. Furthermore, the use of correlated signals allows the architecture to be implemented with a very reduced number of pseudorandom number generators.

2) CIFAR-10 CNN Implementation: In addition, we also present the VLSI synthesis of a bigger CNN, able to process the CIFAR-10 dataset. CIFAR-10 consists of 60k 32 × 32 RGB images of real objects, which can be categorized among ten different classes. We use 50k images for training and 10k for testing. The CNN architecture is formed by two blocks of two convolutional layers plus one MP, followed by two FC layers. All


filters have a 3 × 3 sliding window with a stride of 1. The number of filters per convolutional layer is 32, 32, 64, and 64. No zero padding is used. The MP uses a 2 × 2 sliding window with a stride of 2. The FC layers are built with 512 and 10 neurons. The accuracy of the SC implementation was 81% using 8-bit precision.

TABLE VI
VLSI SYNTHESIS RESULTS FOR CIFAR-10 CNN ARCHITECTURE

Table VI shows the synthesis results of the CIFAR-10 CNN architecture for each layer using the same CMOS technology. As shown, the second convolutional layer is the most demanding in terms of area and power, because 47.6% of the total operations are executed in this layer. As the table also shows, this relatively complex CNN (which includes a total of 30 million operations) presents an integrated circuit (IC) size that could be integrated in a single chip (thus maintaining the energy-efficiency values provided).

TABLE VII
CIFAR-10 CNN PERFORMANCE COMPARISON

We compared the performance of the proposed system for the two CNN architectures (CIFAR-10 and LeNet-5) in Table VII. Unlike the LeNet-5 implementation, which reached a 200-MHz clock frequency, the CIFAR-10 architecture, on account of its higher complexity, could only reach 166.6 MHz. This leads to a 1.2× longer latency, which consequently affects all metrics that depend on latency. The absolute energy and area efficiencies exhibit a significant drop because these absolute metrics, which are not normalized, depend on the power and area of the model. Nevertheless, the throughput measured in TOPS is 45× higher, which can be explained since 54× more operations are performed in the CIFAR-10 CNN within the same number of cycles. This clearly shows the advantage of operating in SC, where the number of cycles per inference is independent of the number of operations computed (see the absolute throughput of both designs in Table VII).

V. CONCLUSION

Due to the advantages of area shrinkage and low power consumption, SC presents itself as a suitable paradigm for implementing deep learning algorithms in hardware for EC applications. Nevertheless, many difficulties still have to be faced in the quest for good results. In this article, we present an efficient reduced-area architecture that addresses the high area consumed by RNGs, the precision degradation produced by correlation between signals, and the stochastic maximum function implementation. For the first time, a fully parallel SC-based CNN is embedded in a single FPGA chip, obtaining better performance results than traditional binary logic and other SC implementations and showing the compression effectiveness of the architecture, which exploits the correlation features presented by stochastic signals.

The fully parallel SC-CNN has also been synthesized as a VLSI circuit, demonstrating improved efficiency over previously reported SC-CNN VLSI implementations.

In addition, the synthesis result for a 30-million-operation CNN is presented, which could still be implemented in a single chip. This shows the benefits of the proposed design for implementing relatively complex edge-oriented CNN architectures in a compact and efficient way.

REFERENCES

[1] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet Things J., vol. 3, no. 5, pp. 637–646, Oct. 2016.
[2] K. S. Zaman, M. B. I. Reaz, S. H. M. Ali, A. A. A. Bakar, and M. E. H. Chowdhury, "Custom hardware architectures for deep learning on portable devices: A review," IEEE Trans. Neural Netw. Learn. Syst., early access, Jun. 4, 2021, doi: 10.1109/TNNLS.2021.3082304.
[3] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proc. IEEE, vol. 107, no. 8, pp. 1738–1762, Aug. 2019.
[4] Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, "A survey of stochastic computing neural networks for machine learning applications," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 7, pp. 2809–2824, Aug. 2020.
[5] A. Ananthakrishnan and M. G. Allen, "All-passive hardware implementation of multilayer perceptron classifiers," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 9, pp. 4086–4095, Sep. 2021.
[6] H. Fan, S. Liu, Z. Que, X. Niu, and W. Luk, "High-performance acceleration of 2-D and 3-D CNNs on FPGAs using static block floating point," IEEE Trans. Neural Netw. Learn. Syst., early access, Oct. 13, 2021, doi: 10.1109/TNNLS.2021.3116302.
[7] H. Abdellatef, M. Khalil-Hani, N. Shaikh-Husin, and S. O. Ayat, "Accurate and compact convolutional neural network based on stochastic computing," Neurocomputing, vol. 471, pp. 31–47, Jan. 2022.
[8] V. Canals, A. Morro, and J. L. Rosselló, "Stochastic-based pattern-recognition analysis," Pattern Recognit. Lett., vol. 31, no. 15, pp. 2353–2356, Nov. 2010.
[9] A. Ren et al., "SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing," ACM SIGOPS Operating Syst. Rev., vol. 51, no. 2, pp. 405–418, Jul. 2017.
[10] Z. Li et al., "HEIF: Highly efficient stochastic computing-based inference framework for deep neural networks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 8, pp. 1543–1556, Aug. 2019.
[11] V. Canals, A. Morro, A. Oliver, M. L. Alomar, and J. L. Rosselló, "A new stochastic computing methodology for efficient neural network implementation," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 3, pp. 551–564, Mar. 2016.
[12] B. Li, Y. Qin, B. Yuan, and D. J. Lilja, "Neural network classifiers using stochastic computing with a hardware-oriented approximate activation function," in Proc. IEEE Int. Conf. Comput. Design (ICCD), Nov. 2017, pp. 97–104.
[13] J. L. Rosselló, V. Canals, and A. Morro, "Hardware implementation of stochastic-based neural networks," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2010, pp. 1–4.
[14] J. L. Rosselló, V. Canals, and A. Morro, "Probabilistic-based neural network implementation," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jun. 2012, pp. 1–7.


[15] D. Kleyko, M. Kheffache, E. P. Frady, U. Wiklund, and E. Osipov, "Density encoding enables resource-efficient randomly connected neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 8, pp. 3777–3783, Aug. 2021.
[16] H. Sim and J. Lee, "Cost-effective stochastic MAC circuits for deep neural networks," Neural Netw., vol. 117, pp. 152–162, Sep. 2019.
[17] J. Yu, K. Kim, J. Lee, and K. Choi, "Accurate and efficient stochastic computing hardware for convolutional neural networks," in Proc. IEEE Int. Conf. Comput. Design (ICCD), Nov. 2017, pp. 105–112.
[18] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 13–18.
[19] H. Sim, D. Nguyen, J. Lee, and K. Choi, "Scalable stochastic-computing accelerator for convolutional neural networks," in Proc. 22nd Asia South Pacific Design Automat. Conf. (ASP-DAC), Jan. 2017, pp. 696–701.
[20] W. Huang et al., "FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio," IEEE Trans. Neural Netw. Learn. Syst., early access, Feb. 15, 2021, doi: 10.1109/TNNLS.2021.3055814.
[21] S. Liu, H. Fan, M. Ferianc, X. Niu, H. Shi, and W. Luk, "Toward full-stack acceleration of deep convolutional neural networks on FPGAs," IEEE Trans. Neural Netw. Learn. Syst., early access, Feb. 12, 2021, doi: 10.1109/TNNLS.2021.3055240.
[22] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 243–254, 2016.
[23] A. Zhakatayev, S. Lee, H. Sim, and J. Lee, "Sign-magnitude SC: Getting 10X accuracy for free in stochastic computing for deep neural networks," in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[24] A. Morro et al., "A stochastic spiking neural network for virtual screening," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 1371–1375, Apr. 2018.
[25] B. Parhami and C.-H. Yeh, "Accumulative parallel counters," in Proc. Conf. Rec. 29th Asilomar Conf. Signals, Syst. Comput., vol. 2, 1995, pp. 966–970.
[26] E. E. Swartzlander, "Parallel counters," IEEE Trans. Comput., vol. C-22, no. 11, pp. 1021–1024, Nov. 1973.
[27] P. K. Muthappa, F. Neugebauer, I. Polian, and J. P. Hayes, "Hardware-based fast real-time image classification with stochastic computing," in Proc. IEEE 38th Int. Conf. Comput. Design (ICCD), Oct. 2020, pp. 340–347.
[28] Y. Zhang, X. Zhang, J. Song, Y. Wang, R. Huang, and R. Wang, "Parallel convolutional neural network (CNN) accelerators based on stochastic computing," in Proc. IEEE Int. Workshop Signal Process. Syst. (SiPS), Oct. 2019, pp. 19–24.
[29] K. Kim, J. Lee, and K. Choi, "Approximate de-randomizer for stochastic circuits," in Proc. Int. SoC Design Conf. (ISOCC), Nov. 2015, pp. 123–124.
[30] F. Neugebauer, I. Polian, and J. P. Hayes, "On the maximum function in stochastic computing," in Proc. 16th ACM Int. Conf. Comput. Frontiers, Apr. 2019, pp. 59–66.
[31] Y. LeCun. The MNIST Database of Handwritten Digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/ and https://ci.nii.ac.jp/naid/10027939599/en/
[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[33] Gidel Company. Proc10A Board Image. Accessed: Jun. 10, 2020. [Online]. Available: https://www.intel.com/content/dam/altera-www/global/en_US/portal/dsn/3/boardimage-us-dsnbk-3-3405483112768-proc10agxplatform.jpg
[34] Z. Liu et al., "Throughput-optimized FPGA accelerator for deep convolutional neural networks," ACM Trans. Reconfigurable Technol. Syst., vol. 10, no. 3, p. 17, 2017.
[35] Z. Li et al., "Laius: An 8-bit fixed-point CNN hardware inference engine," in Proc. IEEE ISPA/IUCC, Dec. 2017, pp. 143–150.
[36] S.-S. Park, K.-B. Park, and K. Chung, "Implementation of a CNN accelerator on an embedded SoC platform using SDSoC," in Proc. 2nd Int. Conf. Digit. Signal Process., Feb. 2018, pp. 161–165.
[37] A. Sayal, S. S. T. Nibhanupudi, S. Fathima, and J. P. Kulkarni, "A 12.08-TOPS/W all-digital time-domain CNN engine using bi-directional memory delay lines for energy efficient edge computing," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 60–75, Jan. 2020.
[38] H. T. Kung, B. McDanel, and S. Q. Zhang, "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., Apr. 2019, pp. 821–834, doi: 10.1145/3297858.3304028.
[39] M. Dhouibi, A. K. Ben Salem, A. Saidi, and S. Ben Saoud, "Accelerating deep neural networks implementation: A survey," IET Comput. Digit. Techn., vol. 15, no. 2, pp. 79–96, Mar. 2021.

Christiam F. Frasser received the B.Sc. degree in electronics engineering from the University of Libertadores, Bogotá, Colombia, in 2010, and the M.S. degree in electronics systems for smart environments from the University of Málaga, Málaga, Spain, in 2017. He is currently pursuing the Ph.D. degree with the Industrial Engineering and Construction Department, University of the Balearic Islands, Palma, Spain.
His current research interest includes machine learning implementations in embedded devices.

Pablo Linares-Serrano received the B.Sc. and M.Sc. degrees in telecommunication engineering from the University of Seville, Seville, Spain, in 2019 and 2021, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA.
His research interests include analog and mixed-signal circuit designs, switched-capacitor filters, design of bioinspired vision sensors, and signal processing.
Mr. Linares-Serrano served as the Chairman for the IEEE Student Branch at the University of Seville from 2018 to 2019 and from 2020 to 2021.

Iván Díez de los Ríos (Student Member, IEEE) received the B.Sc. degree in telecommunication technologies engineering and the M.Sc. degree in telecommunication engineering from the University of Seville, Seville, Spain, in 2019 and 2022, respectively. He is currently pursuing the Ph.D. degree in physical sciences and technologies with the Institute of Microelectronics of Seville, Spanish National Research Council (IMSE-CNM-CSIC), Seville, and the University of Seville.
His research interests include neural networks, neuromorphic systems, memristors, field-programmable gate arrays (FPGAs), and application-specific integrated circuit (ASIC) design.

Alejandro Morán (Student Member, IEEE) received the B.Sc. degree in physics from the University of the Balearic Islands (UIB), Palma, Spain, in 2016, the M.Sc. degree in physics of complex systems from the Institute for Cross-Disciplinary Physics and Complex Systems (CSIC-UIB), UIB, in 2017, and the Ph.D. degree from UIB in 2022.
He is currently a Teaching Assistant with the Industrial Engineering and Construction Department and a Researcher with the Electronic Engineering Group, UIB. His research interests include machine learning in general and machine learning hardware based on unconventional computing techniques, neuromorphic architectures, embedded systems, and field-programmable gate arrays (FPGAs).


Erik S. Skibinsky-Gitlin (Member, IEEE) received the B.Sc. degree in physics from the University of Salamanca, Salamanca, Spain, in 2016, and the M.Sc. and Ph.D. degrees in physics from the University of Granada, Granada, Spain, in 2017 and 2021, respectively.
He is currently a contracted Research Scientist with the Electronic Engineering Group, University of the Balearic Islands, Palma, Spain. His research interests include machine learning based on unconventional computing techniques, neuromorphic hardware, embedded systems, and field-programmable gate arrays (FPGAs).

Joan Font-Rosselló received the Telecommunication Engineering degree from the Polytechnic University of Catalonia (UPC), Barcelona, Spain, in 1994, and the Ph.D. degree from the University of the Balearic Islands (UIB), Palma, Spain, and UPC in 2009.
He is currently an Associate Professor in electronic technology with the Industrial Engineering and Construction Department and a Researcher with the Electronic Engineering Group, UIB. He has been working on the oscillation-based predictive test and neural networks. His current work focuses on nonconventional neural networks and neuromorphic hardware.

Vincent Canals (Member, IEEE) received the Ph.D. degree in electronics engineering from the University of the Balearic Islands (UIB), Palma, Spain, in 2012.
He has been an Associate Professor of mechanical engineering with the Industrial Engineering and Construction Department, UIB, since 2020, and a member of the Energy Engineering Research Group (Green) and the Electronics Engineering Group (EEG), UIB. He has coauthored 18 international journal articles, more than 40 conference contributions, and three filed patents. His current research topics include hardware acceleration of artificial intelligence systems, machine learning solutions based on unconventional computing methodologies, pattern recognition, and data mining focused on energy price and renewable energy generation forecasting.

Miquel Roca (Member, IEEE) received the B.Sc. degree in physics and the Ph.D. degree from the University of the Balearic Islands, Palma, Spain, in 1990.
After a research period at the Electronic Engineering Department, Polytechnic University of Catalonia, Barcelona, Spain, and a research stage at the Department of Electrical Engineering and Computer Science, INSA, Toulouse, France, he obtained a position as an Associate Professor at the University of the Balearic Islands, where he is currently a Full Professor with the Electronic Engineering Research Group and the Head of the Industrial Engineering and Construction Department. He has been working on microelectronic design and test and on radiation dosimeter design. His current research interests deal with neural network-based systems, neuromorphic hardware based on field-programmable gate arrays (FPGAs), and neural network applications.

Teresa Serrano-Gotarredona received the B.S. degree in electronics physics and the Ph.D. degree in VLSI neural categorizers from the University of Seville, Seville, Spain, in 1992 and 1996, respectively, and the M.Sc. degree from the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA, in 1997.
She is currently a tenured Researcher at the Seville Microelectronics Institute (IMSE-CNM-CSIC), Seville, Spain, and a part-time Professor at the University of Seville. Her research interests include analog circuit design of linear and nonlinear circuits, VLSI neural-based pattern recognition systems, VLSI implementations of neural computing and sensory systems, transistor parameter mismatch characterization, bioinspired circuits, nanoscale memristor-type address event representation (AER), and real-time vision sensing and processing chips.
Dr. Serrano-Gotarredona has served as the Chair for the Sensory Systems Technical Committee of the IEEE Circuits and Systems Society and the IEEE Circuits and Systems Spain Chapter. She was an Academic Editor of PLOS One and an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS. She is serving as an Associate Editor for Frontiers in Neuromorphic Engineering and a Senior Editor of the IEEE JOURNAL ON EMERGING TECHNOLOGIES ON CIRCUITS AND SYSTEMS.

Josep L. Rosselló (Member, IEEE) received the Ph.D. degree in physics from the University of the Balearic Islands (UIB), Palma, Spain, in 2002.
He has been a Full Professor of electronic technology with the Industrial Engineering and Construction Department, UIB, since 2021, where he is currently the Principal Investigator of the Electronic Engineering Group. His current research interests include neuromorphic hardware, edge computing, stochastic computing, and high-performance data mining for drug discovery.
Dr. Rosselló also serves as an AI consultant and developer for different technological companies and is part of the Organizing Committee of several conferences, such as the Power and Timing Modeling, Optimization and Simulation Conference and the International Joint Conference on Neural Networks.
