Electrical Engineering Department

EE 541: Design of Digital Systems

(Term 231)

NEURAL NETWORK ON FPGA

By:

Ikhsan Hidayat
g202303630

Andi Dwiki Yulianto
g202304190
I. Introduction

Deep learning models such as neural networks are rapidly gaining prominence in diverse
fields including computer vision, natural language processing, robotics, and finance. The
demand for high-performance computing has driven interest in mapping neural networks onto
dedicated hardware platforms so that operations run faster and consume less energy. One
such platform is the FPGA, with its reconfigurable fabric and strong data-level parallelism.
Implementing neural networks on FPGAs offers numerous benefits: the hardware architecture
can be configured to match each network's individual needs, which improves overall
efficiency. FPGAs are well suited to both inference and training of neural networks,
delivering lower latency and higher throughput, and the parallelism inherent in FPGAs
yields substantial speedups over software implementations.

In this work we study a hardware-based neural network implementation on FPGA that aims at
fast yet high-quality task execution under realistic conditions. This report discusses the
issues involved in mapping neural-network algorithms onto FPGA architectures, including
capacity allocation, energy consumption, and overall quality of results. Through our
analyses, we aim to show the advantages of FPGA implementations in accelerating
neural-network computation and in deploying efficient, low-latency computing solutions.

II. Surveyed Papers

a. FPGA Architecture for Feed-Forward Sequential Memory Network targeting Long-Term Time-Series Forecasting

1. Summary of Findings:

The motivation for the authors to employ FPGA (Field-Programmable Gate Array) in this
research is rooted in their need for efficient hardware implementation of the feed-forward
sequential memory network (FSMN) for long-term time-series forecasting, particularly to
address computationally intensive aspects of the task. FPGA's suitability lies in its ability to
facilitate parallel processing of computationally demanding operations such as summation
and filtering, enabled by components like adders and multipliers within the architecture. This
parallelism enhances computational efficiency, and FPGAs excel in managing these parallel
computations, making them an ideal choice for high-performance computing environments.
Additionally, FPGAs offer resource scalability, enabling complex forecasting tasks without an
exponential increase in resource requirements.
The paper introduces a novel FPGA architecture for a Feed-Forward Sequential Memory
Network (FSMN) designed for long-term time-series forecasting. The FSMN mitigates the
vanishing gradient problem, which is a challenge in recurrent neural networks (RNNs) and
long short-term memory (LSTM) networks. In terms of computations, the FSMN employs time-
domain filters and a feed-forward neural network structure to perform forward propagation.
The architecture's computational load is evaluated with a specific network topology, which
includes the number of multiplications per iteration. Resource utilization, such as Adaptive
Logic Module (ALM), registers, and Digital Signal Processing (DSP) blocks, is detailed, with
percentages to indicate efficiency. The architecture operates at a frequency of 200 MHz,
showcasing its performance capabilities. The tables in the paper provide a comprehensive
breakdown of these computational and resource-related aspects, facilitating an
understanding of the architecture's computational requirements and efficiency. The network
construction for algorithmic evaluation reveals the specific configurations for both FSMN and
LSTM, along with the number of multiplications per iteration, offering insights into their
computational loads. This detailed description of computations and resource requirements
serves to quantify the computational demands and resource efficiency of the proposed FPGA-
based FSMN architecture, particularly in comparison to LSTM.

The FPGA architecture presented in the paper offers several compelling reasons for its
superiority over software-based implementations on CPUs, GPUs, TPUs, and similar
platforms. Hardware implementations on FPGAs provide inherent parallelism and
customization, allowing for highly efficient and dedicated processing tailored to the specific
demands of the FSMN model. FPGAs can exploit parallelism at a fine-grained level, thus
offering significantly faster execution times and lower latencies. Moreover, they excel in
scenarios where power efficiency and real-time processing are crucial. Additionally, the paper
demonstrates that FPGA implementations can efficiently scale as the network size increases
without the exponential resource demands observed in software-based solutions. This
combination of performance, efficiency, and scalability positions FPGA-based
implementations as a superior choice for the FSMN model, particularly in long-term time-
series forecasting applications.

The paper outlines the architecture for a specific Feed-Forward Sequential Memory Network
(FSMN) topology. In this configuration, the network consists of a 4-3-4 layout, indicating three
layers: the input layer with four neurons, a middle layer with three neurons, and an output
layer with four neurons. The time-domain filter taps are set to 20, which characterizes the
temporal information processing within the network. The total number of neurons, layers,
and filter taps can be considered the essential attributes defining the network's size in this
context. This network size of 4-3-4 with 20 time-domain filter taps is tailored to the specific
application of long-term time-series forecasting and can be adapted or scaled to suit the
demands of different scenarios.
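
As a rough back-of-the-envelope illustration (our own estimate, not a figure quoted from the paper): assuming one 20-tap filter per middle-layer neuron, a single forward step of this 4-3-4 network needs about 4 × 3 = 12 multiplications into the middle layer, 3 × 20 = 60 multiplications for the time-domain filtering, and 3 × 4 = 12 multiplications into the output layer, on the order of 84 multiplications per time step. Counts of this kind are what the paper's per-iteration multiplication tables quantify precisely.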
The paper effectively addresses several of the challenges associated with FPGA-based
hardware implementations for the Feed-Forward Sequential Memory Network (FSMN). It
demonstrates a comprehensive understanding of FPGA resource management, optimizing
resource utilization while preserving computational performance. The extensive evaluation
showcases a thorough testing and verification process, affirming the architecture's reliability
and correctness. However, the paper highlights the need for expert knowledge in FPGA
design, acknowledging the challenge in terms of hardware development expertise. The
adaptation of the architecture for various network sizes remains unaddressed in the paper,
and while the architecture shows substantial improvements over RNNs and LSTMs, the noise
observed in the FSMN's forecast in long-term time-series predictions indicates a potential
area for further refinement. In essence, the paper presents a robust and efficient FPGA-based
FSMN architecture but leaves room for ongoing exploration and enhancement in terms of
noise mitigation and scalability.

2. Proposed Model

Figure a-a: The entire architecture. (Orimo et al.)

The authors have introduced an FPGA architecture designed to implement a specific variant
of the Feed Forward Sequential Memory Network (FSMN) for offline learning. In their
research, they present an overall architecture illustrated above, tailored for networks
comprising five layers with the incorporation of time-domain filters within the 2nd and 4th
layers. Notably, the paper introduces the concept of an "arithmetic module" composed of
product-sum units denoted as "A" and "B," as well as a summation-and-filtering unit. While
this concept holds promise, its clarity and its seamless integration into the FSMN architecture
warrant further scrutiny and refinement. The role of the layer and data controllers in
supplying essential weights and coefficients to this arithmetic module, ultimately integrating
them into the FSMN, is emphasized.

Figure a-b: Feed-forward sequential memory network. (Orimo et al.)

In this part the authors introduce an interesting architecture for the Feed-Forward
Sequential Memory Network (FSMN), but it comes with two key constraints. First, it defines
a minimal FSMN structure as a 4-3-4 network, meaning it has only three layers. Second, the
paper places the time-domain filter in the middle layer of this three-layered network. To
help us understand how this architecture works, the authors use a three-layered model with
layers called Layer I, Layer II, and Layer III. Within this structure, they incorporate N
sets of product-sum units labeled "A" and "B." In a practical example with N set to 2, they
break down the neurons in Layer I into four groups, each directed to its respective
product-sum unit "A." These units facilitate the flow of data to Layer II, where the values
Â1 and Â2 are calculated.

Figure a-c: Summation-and-filtering unit. (Orimo et al.)


In the architecture's calculation procedure with N set to 2, neurons in Layer I are divided into
groups, and they are directed to product-sum units labeled as "A." These "A" units process
the flow of data to three neurons in Layer II, culminating in the calculation of weight matrices
Â1 and Â2 during STEP1. Following this, the outputs of all "A" units are collectively summed
for each neuron in Layer II, after which a rectified linear unit (ReLU) function is applied to
these outputs. Subsequently, the process involves the repetitive direction of three neurons
to product-sum units called "B." Within these units, data accumulation continues until all
outputs from the time-domain filter are processed, corresponding to the computation of
weight matrices B̂1 and B̂2 during STEP1 and STEP2. Once all time-domain filter outputs
have been processed, the architecture concurrently obtains the outputs for neurons in Layer
III by applying the ReLU function to the outputs of the product-sum units. This streamlined
procedure not only enables efficient FSMN calculations with minimal resource utilization but
also supports the creation of deeper networks as needed.
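
To make the dataflow concrete, the following is a minimal NumPy sketch of one forward step of such a 4-3-4 FSMN with the time-domain filter in the middle layer. The shapes, the placement of the ReLU, and the filter form are our assumptions for illustration; the paper's architecture maps these same multiply-accumulate operations onto the shared product-sum units "A"/"B" and the summation-and-filtering unit rather than computing them this way in software.

```python
# Minimal sketch of one FSMN forward step for the 4-3-4 topology described
# above (filter in the middle layer). Shapes and activation placement are
# illustrative assumptions, not the paper's exact dataflow.
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def fsmn_step(x_t, hidden_hist, W_a, W_b, filt):
    """x_t: (4,) input; hidden_hist: list of past (3,) hidden vectors, newest
    first; W_a: (3,4); W_b: (4,3); filt: (taps,) time-domain filter coefficients."""
    h_t = relu(W_a @ x_t)                  # Layer I -> Layer II (unit "A")
    hist = ([h_t] + hidden_hist)[:len(filt)]
    # time-domain filtering: weighted sum over current and past activations
    h_mem = sum(c * h for c, h in zip(filt, hist))
    y_t = relu(W_b @ h_mem)                # Layer II -> Layer III (unit "B")
    return y_t, hist

# toy usage with random weights and a 20-tap filter
rng = np.random.default_rng(0)
W_a, W_b, filt = rng.normal(size=(3, 4)), rng.normal(size=(4, 3)), rng.normal(size=20)
y, hist = fsmn_step(rng.normal(size=4), [], W_a, W_b, filt)
```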

The design and functionality of product-sum units "A" and "B" are further detailed in the
figures above. These units play a pivotal role in the architecture, with "A" being constructed
using a multiplier and adder tree to calculate the output Oj, which involves summing the
products of inputs Ii, weight matrices wij, and spans over four indices. In a similar vein, "B"
employs a comparable structure but over three indices. The paper also introduces the
summation-and-filtering unit, a crucial component for computing the summation of outputs
from the "A" units for all three neurons and for filtering the time-domain data for each
neuron. This unit is depicted in Figure a-c above and employs multipliers and adder trees to carry out
these operations, achieving time-division processing by alternately controlling the input of its
multipliers between summation and time-domain filtering. The strategic sharing of
multipliers and adders is emphasized as a crucial mechanism for preventing excessive
resource demands when scaling up the network. However, the paper would benefit from
providing more comprehensive insights into the practical implications and benefits of these
design elements within the overall architecture.

b. Low-Cost Hardware Design Approach for Long Short-Term Memory (LSTM)

1. Summary of Findings:

The author utilizes FPGA (Field-Programmable Gate Arrays) to address the challenge of low-
cost hardware design for Long Short-Term Memory (LSTM) networks, capitalizing on several
key advantages. FPGAs are programmable, allowing them to adapt to various digital circuit
requirements, including LSTM networks, making them a flexible and versatile platform for
hardware implementation. Their efficiency is harnessed by programming FPGAs to efficiently
implement LSTM networks, taking advantage of the inherent parallelism within the LSTM
architecture. Furthermore, their reconfigurability enables the implementation of different
circuits, making FPGAs ideal for prototyping and testing new LSTM architectures. Cost-
effectiveness is another compelling factor as FPGAs are increasingly affordable, rendering
them a practical choice for low-cost hardware implementation. Moreover, FPGAs excel in real-
time data processing, further solidifying their suitability for implementing LSTM networks.

The computations underlying the hardware implementation of LSTM networks, as outlined in
the paper, encompass a mix of mathematical operations and data management. These
computations involve the utilization of sigmoid and hyperbolic tangent functions for
information flow and memory decisions, necessitating frequent activation and evaluation.
Weight multiplications are also a fundamental part, demanding numerous multiplications
and additions for each LSTM cell. Memory management operations include reading from and
writing to memory cells, with associated summations and multiplications to calculate new cell
states. Furthermore, synchronization and multiplexing techniques are introduced to
streamline input and output layers, reducing hardware units. The computations also address
resource utilization, power consumption, and area reduction. The paper demonstrates these
computations are central to achieving an efficient hardware design for LSTM networks,
especially for lightweight applications.

The hardware implementations, as proposed in the paper, offer several advantages over
software-based implementations on traditional CPUs, GPUs, TPUs, or other processors.
Hardware implementations, particularly on FPGAs, provide programmability and flexibility,
allowing the direct translation of LSTM networks into custom digital circuits. This adaptability
is crucial for tailoring hardware to the specific LSTM network requirements, optimizing
performance, and reducing hardware cost. Additionally, FPGAs harness parallelism inherent
in LSTM networks, achieving efficient computation. They also facilitate reconfiguration for
different circuit implementations, ideal for prototyping and experimenting with new LSTM
architectures. Furthermore, their growing cost-effectiveness makes them a compelling choice
for low-cost hardware designs. Overall, the paper's hardware-based approach capitalizes on
the versatility, efficiency, reconfigurability, and cost-effectiveness of FPGA implementations,
setting it apart from traditional software-based alternatives.

The size of the neural network described in the paper is not explicitly mentioned. However,
the paper mentions that the hardware implementation focuses on reducing the hardware
cost for LSTM networks. While specific details about the network's size, such as the number
of neurons, layers, or parameters, are not provided in the information given, it can be inferred
that the emphasis is on optimizing the hardware architecture for LSTM with a focus on
lightweight applications and efficiency rather than network size or complexity. The paper's
main goal is to achieve a low-cost and power-efficient hardware implementation for LSTM,
making it suitable for applications with hardware constraints.

The papers reviewed in the study have employed various strategies to address the challenges
and issues associated with hardware implementations. They typically discuss architectural
optimizations, such as shared sigmoid units and adder units, and highlight the reduction in
hardware area. While the papers emphasize their approaches' effectiveness in achieving low
hardware cost, critical assessment is required. For instance, the claim of a 16% area reduction
should be critically examined, and power consumption savings should be validated.
Furthermore, ensuring real-time data processing capability and resource management needs
in practical scenarios should be rigorously evaluated. A comprehensive assessment of these
claims would require practical testing and verification of the proposed hardware
implementations' performance and efficiency, moving beyond the theoretical discussions in
the papers.

2. Proposed Model

Figure b-a: The structure of the LSTM. (Khalil, Mohaidat, & Bayoumi, 2023)

Figure b-b: The proposed design of the LSTM. (Khalil, Mohaidat, & Bayoumi, 2023)

The proposed method outlined in the paper focuses on a fundamental objective of reducing
the hardware cost of Long Short-Term Memory (LSTM) networks, which, in turn, makes them
suitable for resource-constrained applications such as the Internet of Things (IoT). To
accomplish this, the architecture of the LSTM model is efficiently restructured to include four
key units: two sigmoid (σ) units and two hyperbolic tangent (tanh) units, a departure from the
conventional LSTM architecture comprising five units. A pivotal innovation in this method is
the unification of the input and output layer functions using a single sigmoid unit. Additionally,
the shared sigmoid unit's computation is facilitated by a common adder for both Xt and ht-1.
This novel design not only significantly reduces the architectural complexity but also results
in a more compact hardware footprint, as validated by experimental results. The shared
sigmoid operation is governed by time division multiplexing, ensuring the efficient sharing of
computational resources between the input and output layers. In the input layer, Xt and ht-1
undergo multiplication with their respective weights (Wxi and Whi), and in the output layer,
they are multiplied with weights (Wxo and Who). A multiplexer, guided by a selection input (s),
determines which computed path to direct to the output, with the resulting data then
proceeding to a shared adder that manages the addition operations for the input and output
layers independently. This detailed methodology highlights the intricate optimization of LSTM
architecture, facilitating hardware efficiency and reduced costs.

In essence, the paper's proposed method strives to make LSTM networks more accessible for
lightweight applications by enhancing their hardware design. The optimization scheme
involves streamlining the architecture to incorporate only four essential units, two sigmoid
and two tanh units, rather than the traditional five. Notably, a pivotal innovation includes
consolidating the functions of the input and output layers using a single sigmoid unit. This
approach introduces an element of resource sharing, with a shared adder overseeing the
computations for both Xt and ht-1 within the shared sigmoid unit. The outcome is a notably
reduced architectural footprint, a fact substantiated by the experimental results. The
approach is governed by time division multiplexing, which enables efficient resource
allocation between the input and output layers. Within the input layer, Xt and ht-1 undergo
multiplication with their respective weights (Wxi and Whi), and a similar operation occurs in
the output layer, with multiplication by weights (Wxo and Who). The subsequent data is
processed by a multiplexer, influenced by a selection input (s), which steers the chosen
computation path toward the output. The data then proceeds to a shared adder, responsible
for managing the independent addition operations of the input and output layers. This
comprehensive approach underscores the paper's intention to optimize LSTM architecture
for enhanced hardware efficiency and cost-effectiveness.

The paper's proposed method seeks to reduce the hardware cost of LSTM networks to cater
to lightweight applications, particularly those within the Internet of Things (IoT). This
optimization strategy revolves around simplifying the LSTM architecture, specifically by
reducing the number of units from the conventional five to four, incorporating two sigmoid
units and two tanh units. A key innovation in this approach involves unifying the
responsibilities of the input and output layers through a single sigmoid unit, supported by a
shared adder that operates on both Xt and ht-1. This design leads to a considerably smaller
architectural footprint, as empirically demonstrated. The time division multiplexing technique
is employed to effectively allocate resources to the shared sigmoid function between the
input and output layers. In the input layer, the inputs (Xt and ht-1) are subjected to
multiplication with their corresponding weights (Wxi and Whi), and the output layer follows a
similar process with the weights (Wxo and Who). A multiplexer, controlled by a selection input
(s), dictates the flow of computed data to the output. Subsequently, the multiplexer output is
channeled to a shared adder that manages the independent addition operations for the input
and output layers. This detailed method serves as a testament to the paper's core objective,
which is the optimization of LSTM architecture to enhance hardware efficiency and cost-
effectiveness, making it suitable for IoT and similar resource-constrained domains.

Figure b-c: The schematic diagram of the proposed model. (Khalil, Mohaidat, & Bayoumi, 2023)

In the proposed method, the operational sequence is elucidated as follows: When the
selection signal is set to '1', the multiplexer directs the multiplication result from the input
layer to the shared adder. The adder's outcome is then subjected to a sigmoid function for
further processing. The result is subsequently directed to a demultiplexer, which determines
the appropriate pathway—either the input layer or the output layer. The selection signal,
specified as '1' (s = 010), guides the demultiplexer to route the result to the input layer.
Conversely, when s = 000, the multiplexer steers the multiplication result from the output
layer to the shared adder, followed by processing through the sigmoid function. For s = 010,
the demultiplexer routes the sigmoid result to the output layer. The input layer's result
undergoes multiplication with the update layer's result, and this product is then added to the
previous memory value Ct-1 to generate the subsequent memory value Ct. Simultaneously,
the output layer yields a tanh result derived from the new memory value Ct to produce the
new output ht. Notably, this innovative approach consolidates two units into one shared
adder, with sigmoid units serving both the input and output layers. As a result, the proposed
method exhibits a smaller architectural footprint compared to the traditional approach. By
employing fewer units, it contributes to reduced power consumption and optimized costs,
rendering it suitable for diverse applications such as the Internet of Things (IoT) and
biomedical systems.
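
To illustrate the time-division multiplexing idea in software terms, the following is a small behavioral Python sketch (not the authors' hardware description) of a single adder and a single sigmoid unit shared between the input-gate and output-gate paths. The signal names (s, Wxi, Whi, Wxo, Who) follow the description above; the numeric values are purely illustrative.

```python
# Behavioral sketch of the shared adder / shared sigmoid path: the select
# signal chooses which gate's products reach the one adder and one sigmoid.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def shared_gate_cycle(s, x_t, h_prev, Wxi, Whi, Wxo, Who):
    """One pass through the shared path; s == 1 selects the input-gate
    products, s == 0 selects the output-gate products."""
    if s == 1:                          # multiplexer picks input-gate products
        a, b = Wxi * x_t, Whi * h_prev
    else:                               # multiplexer picks output-gate products
        a, b = Wxo * x_t, Who * h_prev
    summed = a + b                      # the single shared adder
    return sigmoid(summed)              # the single shared sigmoid unit

# two cycles reuse the same hardware path for i_t and then o_t
x_t, h_prev = 0.5, -0.2
i_t = shared_gate_cycle(1, x_t, h_prev, Wxi=0.7, Whi=0.3, Wxo=0.9, Who=0.1)
o_t = shared_gate_cycle(0, x_t, h_prev, Wxi=0.7, Whi=0.3, Wxo=0.9, Who=0.1)
```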
Figure b-d: Block diagram of the input and output function. (Khalil, Mohaidat, & Bayoumi, 2023)

Figure b-e: Block diagram for the final outputs. (Khalil, Mohaidat, & Bayoumi, 2023)

In this hardware implementation using VHDL and Altera Arria 10 GX FPGA, the proposed LSTM
architecture's resource utilization is substantially more efficient compared to the traditional
LSTM, as demonstrated in the paper's resource-utilization table. The proposed method shows significant
improvements, utilizing fewer resources in terms of the number of slice registers (7040 with
6.82% utilization), slice LUTs (3440 with 7.59% utilization), DSPs (24 with 8.95% utilization),
BUFs (4 with 3.34% utilization), Block RAM (12 with 9.37% utilization), FF (1197 with 1.53%
utilization), and Memory LUTs (383 with 3.08% utilization). In contrast, the traditional LSTM
consumes higher resources, indicated by its usage of 8610 slice registers (8.34% utilization),
4043 slice LUTs (8.92% utilization), 45 DSPs (16.36% utilization), 5 BUFs (4.3% utilization), 16
Block RAM (12.5% utilization), 2552 FF (3.26% utilization), and 496 Memory LUTs (3.98%
utilization). Moreover, the proposed method operates at a frequency of 120 MHz with a power
consumption of 1.546 W, while the traditional method consumes 1.847 W. This reflects a
substantial power consumption advantage of the proposed method. Furthermore, the
proposed method exhibits a 16% reduction in area compared to the traditional approach,
signifying its superior hardware efficiency. These advantages make the proposed method a
compelling choice for various applications with hardware constraints, thanks to its low
hardware cost, power efficiency, and compact architectural footprint.
c. An FPGA Implementation of a Long Short-Term Memory Neural Network

1. Summary of Findings:

The author employs Field-Programmable Gate Arrays (FPGAs) for this hardware
implementation primarily due to their ability to provide a high degree of parallelism, low-
latency processing, and customization for specific neural network applications. FPGAs offer
hardware-level acceleration, which is well-suited for recurrent neural networks like Long
Short-Term Memory (LSTM), enabling faster execution of matrix-vector multiplications and
elementwise operations. Additionally, FPGAs utilize on-chip memory resources, minimizing
data transfer overhead and thus improving overall computational efficiency. Furthermore,
FPGAs are energy-efficient, allowing for the development of hardware accelerators with lower
power consumption compared to general-purpose processors such as CPUs, GPUs, or TPUs.
The paper underscores the advantages of FPGA-based solutions in achieving high throughput
and low power consumption, making them a suitable choice for this hardware
implementation. The paper mentions two different FPGAs that were used in the
implementation. For network sizes N = 4, 8, 16, and 32, the Xilinx XC7Z020 SoC FPGA was
employed. This choice was based on its availability and accessibility for initial testing and
validation. However, for larger network sizes of N = 64 and 128, a Virtex-7 XC7VX485T FPGA
on the VC707 board was utilized. The choice of FPGA depends on the scale of the network
and the available resources on the FPGA. The XC7Z020 SoC FPGA was used for smaller
networks due to its accessibility, while the Virtex-7 FPGA was selected for larger networks as
it provided more available resources, including DSP slices, which are crucial for accelerating
the computations required by the LSTM network. This selection allowed the researchers to
effectively balance available resources and computational needs.

The paper presents a novel hardware architecture for LSTM (Long Short-Term Memory)
neural networks, targeting various applications, and the underlying computations are
extensively detailed. In LSTM networks, critical computations involve matrix-vector dot
products between weight matrices and input/output vectors, requiring parallel processing to
improve performance. Polynomial approximations are utilized for transcendental activation
functions like sigmoid and tanh, reducing memory usage and enabling efficient calculations.
The architecture features resource sharing, with parameter KG controlling the number of
rows that share a multiplier, optimizing the utilization of available resources. This design
employs memory storage (LUTRAM) for weights, and gate modules that perform tasks,
including matrix multiplications and bias addition. Computation times vary with network size
and resource sharing levels, demonstrating substantial speed improvements over software
implementations. While smaller networks are faster and consume less power, flexibility is
maintained for different network sizes through parameterization. The application examined
is binary addition, which requires sequence prediction and memory retention, highlighting
the LSTM's suitability for such tasks.
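
As an illustration of the activation-function strategy, the sketch below fits a low-order polynomial to the sigmoid on a bounded interval. Note that the paper uses a least-maximum (minimax) approximation, whereas np.polyfit performs a least-squares fit; the degree and the interval [-4, 4] are our assumptions, chosen only to show the shape of the approach.

```python
# Illustrative replacement of the transcendental sigmoid with a low-order
# polynomial on a bounded interval (least-squares fit shown; the paper uses
# a least-maximum approximation).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

xs = np.linspace(-4.0, 4.0, 2001)
coeffs = np.polyfit(xs, sigmoid(xs), deg=5)   # degree-5 fit on [-4, 4]
poly_sigmoid = np.poly1d(coeffs)

max_err = np.max(np.abs(poly_sigmoid(xs) - sigmoid(xs)))
print(f"max abs error on [-4, 4]: {max_err:.4f}")
# In hardware the polynomial is evaluated with a handful of multiplies and
# adds (e.g. Horner's rule), avoiding exponentials and large lookup tables.
```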

Hardware implementations of LSTM networks are superior to their software-based
counterparts, including CPUs, GPUs, and TPUs, for several compelling reasons. First, the
proposed architecture leverages the inherent parallelism of FPGAs to perform computations
more efficiently. This parallelism accelerates the critical matrix-vector dot product
calculations that are fundamental to LSTM networks. Second, the use of polynomial
approximations for activation functions reduces memory requirements, resulting in
optimized resource usage. The efficient hardware design significantly outperforms software
implementations in terms of execution time, allowing for a substantial increase in forward
propagation speed. Moreover, this approach offers greater energy efficiency, as it scales with
the size of the network, ensuring that power consumption remains within reasonable limits,
which is particularly beneficial for mobile or embedded applications. Overall, the hardware
implementation of LSTM networks excels in terms of speed, energy efficiency, and resource
utilization, making it a superior choice for various applications compared to traditional
software-based implementations.

The size of the LSTM network described in the paper varies, with different configurations
explored, including network sizes of N = 4, 8, 16, 32, 64, and 128 neurons. The size primarily
refers to the number of LSTM neurons or memory cells in the network, and it significantly
impacts the hardware requirements, processing speed, and power consumption of the
implementation. The paper details the synthesis results and performance metrics for these
various network sizes, offering a comprehensive view of the scalability and efficiency of the
hardware design across a range of network sizes.

The paper presents a comprehensive hardware implementation of LSTM networks and offers
several innovations to address the associated challenges. It successfully introduces a
parameterized architecture that enhances parallelism, leverages FPGA internal resources
efficiently, and provides scalability to various network sizes. However, the design's scalability
is limited for larger networks, which may require external memory solutions. The paper's
performance claims are validated with synthesis results, demonstrating impressive speed
improvements over software implementations. Still, the complex sharing logic for multipliers
in the Matrix-Vector Dot Product Module affects clock frequencies, especially for larger values
of resource sharing (KG). While the power consumption for the FPGA-based design is well-
managed and scalable, its feasibility for more extensive networks requires further
exploration. Overall, the paper presents a highly efficient hardware implementation but faces
limitations in terms of scalability and complex critical paths that affect its full potential.
2. Proposed Model

a. LSTM Structure
The figure below shows the structure of the LSTM neuron used in this paper.

Figure c-a: LSTM block. (Greff, Srivastava, Koutník, Steunebrink, & Schmidhuber, 2017)

As shown in the figure above, the LSTM network has essentially the same connectivity as a
regular RNN, except that it now has multiple entry points for the data flow. The
explanation for each component is summarized as follows:

◦ Input Gate - this is where the relative importance of each feature of the input
vector at time t, x(t), and of the output vector at the previous time step, y(t-1),
is weighed, producing an output i(t). (Ferreira & Fonseca, 2016)

◦ Block Input Gate - as the name implies, this gate controls the flow of
information from the input gate to the memory cell. It also receives the input
vector and the previous output vector, producing an output z(t). The activation
function of this gate can be either the logistic sigmoid, 1/(1 + e^-x), or the
hyperbolic tangent, tanh(x), but the most common choice is the hyperbolic tangent.
(Ferreira & Fonseca, 2016)

◦ Forget Gate - its role is to control the contents of the Memory Cell, either
setting or resetting them, through a Hadamard vector multiplication with the cell
state at the previous time step, c(t-1). The activation function of this gate is
always the sigmoid, and the resulting signal is f(t). (Ferreira & Fonseca, 2016)

◦ Output Block Gate - this gate has a role very similar to that of the Block Input
Gate, but it controls the information flow out of the LSTM neuron, namely
the activated Memory Cell output. The control signal it produces is o(t).
(Ferreira & Fonseca, 2016)

◦ Memory Cell - the cornerstone of the LSTM neuron. This is the memory
element of the neuron, where the previous state is kept, and updated
according to the dynamics of the gates that connect to it. Also, this is where
the peephole connections come from. (Ferreira & Fonseca, 2016)

◦ Output Activation - the output of the Memory Cell goes through this
activation function that, as the gate activation function, is the hyperbolic
tangent. (Ferreira & Fonseca, 2016)

◦ Peepholes - direct connections from the memory cell to the gates, that allow
them to 'peep' at the states of the memory cell. They were added after the
initial 1997 formulation, and their absence was proven to have a minimal
performance impact. For this reason, they were omitted in this architecture.
(Ferreira & Fonseca, 2016)

The operation of each gate is given by the equations in the figure below, where bold
lower-case letters represent vectors and bold upper-case letters represent matrices.

Figure c-b: Gate equations. (Ferreira & Fonseca, 2016)

The ⊙ symbol in the equations represents Hadamard (element-wise) multiplication, W is the
input weight matrix, and R is the recurrent weight matrix.
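
For reference, the gate equations of Figure c-b can be written as a short NumPy sketch of one LSTM time step (with the peephole connections omitted, as noted above). The dictionary-based weight layout and the toy sizes are our own choices for illustration, not the paper's data structures.

```python
# One LSTM time step following the gate equations above; peepholes omitted.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, y_prev, c_prev, W, R, b):
    """W, R, b are dicts keyed by gate: 'z' (block input), 'i', 'f', 'o'."""
    z_t = np.tanh(W['z'] @ x_t + R['z'] @ y_prev + b['z'])   # block input
    i_t = sigmoid(W['i'] @ x_t + R['i'] @ y_prev + b['i'])   # input gate
    f_t = sigmoid(W['f'] @ x_t + R['f'] @ y_prev + b['f'])   # forget gate
    o_t = sigmoid(W['o'] @ x_t + R['o'] @ y_prev + b['o'])   # output gate
    c_t = z_t * i_t + c_prev * f_t       # memory cell update (Hadamard products)
    y_t = np.tanh(c_t) * o_t             # output activation, then output gating
    return y_t, c_t

# toy usage: N = 3 neurons, M = 2 inputs
rng = np.random.default_rng(1)
N, M = 3, 2
W = {g: rng.normal(size=(N, M)) for g in 'zifo'}
R = {g: rng.normal(size=(N, N)) for g in 'zifo'}
b = {g: np.zeros(N) for g in 'zifo'}
y, c = lstm_step(rng.normal(size=M), np.zeros(N), np.zeros(N), W, R, b)
```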

b. Network Architecture

The number representation used in this system is signed fixed-point Q6.11 in two's
complement. There are 18 bits in total, with 7 bits as integer bits (one of which is the
sign bit) and the other 11 as fractional bits; a small numeric sketch of this format follows
the figures at the end of this list. Moreover, the building blocks of the proposed model
are classified as follows:
◦ Activation Function Calculation
In this block, polynomial approximations are used to evaluate the transcendental
activation functions (the sigmoid and tanh(x)). The optimal polynomial is found
using a least-maximum approximation.

◦ Matrix-vector Dot Product Calculation
As shown in Figure c-d, W is multiplied by the input vector x and R is multiplied by
the vector y. Moreover, this network is parameterized, with W of size N x M and R of
size N x N. The algorithm for matrix-vector multiplication is given as:

Figure c-c: Matrix-vector multiplication of a matrix. (Ferreira & Fonseca, 2016)

◦ Weight Storage
The weights are stored in LUTRAM; they are declared as a matrix, so write and read
accesses can be performed in parallel.

Figure c-d: Input matrix. (Ferreira & Fonseca, 2016)

◦ Gate Module
In the gate module, three tasks are performed by each module: multiplying W by the
vector x, multiplying R by the vector y, and adding the bias vector b. In this paper,
the multiplications by W and R are performed in parallel.

Figure c-f: The hardware block of the Gate. (Ferreira & Fonseca, 2016)

Figure c-e: Hardware block diagram of the LSTM. (Ferreira & Fonseca, 2016)
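
As a small numeric illustration of the Q6.11 format and of a single fixed-point multiply of the kind the gate module performs, the sketch below quantizes values to 18-bit signed two's-complement codes with 11 fractional bits. The rounding and saturation behavior shown here are assumptions; the paper may truncate or wrap differently.

```python
# Q6.11 sketch: 18-bit signed fixed point, 11 fractional bits.
FRAC_BITS = 11
TOTAL_BITS = 18
Q_MIN, Q_MAX = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1

def to_q6_11(x: float) -> int:
    """Quantize a real number to a Q6.11 integer code, saturating on overflow."""
    code = round(x * (1 << FRAC_BITS))
    return max(Q_MIN, min(Q_MAX, code))

def from_q6_11(code: int) -> float:
    return code / (1 << FRAC_BITS)

def q_mul(a: int, b: int) -> int:
    """Fixed-point multiply: the product has 22 fractional bits, so it is
    shifted back by 11 to return to Q6.11, then saturated."""
    return max(Q_MIN, min(Q_MAX, (a * b) >> FRAC_BITS))

# e.g. one gate-module term, W * x, carried out in Q6.11
w, x = to_q6_11(0.75), to_q6_11(-1.5)
print(from_q6_11(q_mul(w, x)))   # -1.125 (exact here; in general within one LSB)
```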
III. Conclusion

From our literature study, we have decided to implement an LSTM on FPGA to forecast a time
series of wind-speed data. Our dataset is time-dependent, which is exactly the type of data
that LSTM networks are highly effective at analyzing and forecasting. For the data
preprocessing, we prefer software-based preprocessing because it provides flexibility and
ease of implementation, allowing quick prototyping, algorithm development, and iterative
testing of various preprocessing techniques without the need for hardware modifications or
reconfigurations.

Furthermore, we will utilize the FPGA's memory to store the weights and exploit the FPGA's
parallelism for the LSTM unit operations, because the critical computations in LSTM networks
are matrix-vector dot products between weight matrices and input/output vectors, which
require parallel processing to improve performance. For this reason, we intend to implement
the model from the third paper (An FPGA Implementation of a Long Short-Term Memory Neural
Network) because of its flexibility in terms of network size (it is parameterized). Moreover,
the architecture of this model, as we elaborated previously, can perform LSTM-related
computations for various applications, such as processing real-time data streams, pattern
recognition, and forecasting. The design and optimization is a one-time process; once the
hardware is implemented, it can be used for real-time data analysis.
IV. References

Ferreira, J., & Fonseca, J. (2016). An FPGA implementation of a long short-term memory
neural network. 2016 International Conference on ReConFigurable Computing and FPGAs
(ReConFig).

Greff, K., Srivastava, R., Koutník, J., Steunebrink, B., & Schmidhuber, J. (2017). LSTM: A
Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems.

Khalil, K., Mohaidat, T., & Bayoumi, M. (2023). Low-cost hardware design approach for long
short-term memory (LSTM). 2023 IEEE International Symposium on Circuits and Systems (ISCAS).

Orimo, K., Ando, K., Ueyoshi, K., Ikebe, M., Asai, T., & Motomura, M. (2016). FPGA
architecture for feed-forward sequential memory network targeting long-term time-series
forecasting. 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig).
