
A LAB REPORT

On

Deep Learning Challenges and Solutions with XILINX FPGAs

Submitted by
K.Kavya Sri
20NN1A0431

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING


VIGNAN’S NIRULA INSTITUTE OF TECHNOLOGY AND SCIENCE FOR WOMEN
(Approved by AICTE, New Delhi and Affiliated to JNTUK, Kakinada)
PEDAPALAKALURU, GUNTUR-522009
CERTIFICATE

This is to certify that the report entitled “Deep Learning Challenges


and Solutions with XILINX FPGAs” is the bonafide work carried out
by K. Kavya Sri for the Design Tools Lab in the Department of Electronics and
Communication Engineering, J.N.T.U. Kakinada, during the year 2023-2024.

Lab In-charge                          Head of the Department
Dr. B.M.S. Rani                        Dr. G. Sandhya
Associate Professor                    Associate Professor

External Examiner
Contents

Deep Learning Challenges and Solutions with XILINX FPGAs
    ABSTRACT
CHAPTER-1
    INTRODUCTION
CHAPTER-2
    SOFTWARE
    Xilinx software
CHAPTER-3
    ALGORITHM
    Xilinx FPGA and Deep Learning
    Deep Learning Architectures
    Fixed Point Quantization
    Design Optimization
    FPGA Resource Allocation
    DSP Placement
    ADVANTAGES
        Hardware Acceleration
        Flexibility and Reconfigurability
        Parallel Processing Capabilities
        Optimized IP Blocks
        Memory Bandwidth Management
    DISADVANTAGES
        Learning Curve
        Resource Constraints
        Development Time
        Power Consumption in Continuous Operation
    APPLICATIONS
REFERENCES
LIST OF FIGURES:

Figure 1: Acceleration framework, base platform, reconfiguration
Figure 2: Convolution weight histogram
Figure 3: DSP Placement
Deep Learning Challenges and Solutions with XILINX FPGAs

ABSTRACT

In this paper, we describe the architectural, software, performance,
and implementation challenges and solutions, along with current research, on
the use of programmable logic to enable deep learning applications. First, the
characteristics of building a deep learning system will be described. Next,
architectural choices will be explained for how an FPGA fabric can efficiently
solve deep learning tasks. Finally, specific techniques for how DSPs and
memories are used in high-performance applications will be described.

Deep Learning (DL) has emerged as a transformative technology
across various domains, ranging from computer vision and natural language
processing to healthcare and finance. As DL models continue to grow in
complexity and size, there is an increasing demand for efficient hardware
accelerators to overcome the computational challenges associated with training
and deploying these models. Field-Programmable Gate Arrays (FPGAs) have
gained attention as a promising solution for accelerating deep learning
workloads due to their flexibility, reconfigurability, and potential for high
performance.

This abstract provides an overview of the challenges encountered in
implementing deep learning on FPGAs, with a specific focus on solutions
offered by Xilinx FPGA technology. The challenges encompass issues such as
limited on-chip resources, high memory bandwidth requirements, and the need
for efficient parallel processing. Xilinx FPGAs, equipped with high-level
synthesis tools and optimized libraries, offer a platform for mitigating these
challenges by enabling designers to implement custom hardware architectures
tailored to deep learning tasks.

CHAPTER-1
INTRODUCTION

The foundation of deep learning lies in the field of representation learning.
Classical machine learning algorithms often depend heavily on human-designed
features of the data, followed by statistical techniques such as logistic
regression, naive Bayes, or k-means to find patterns and insights in the data
[1]. These techniques, however, are limited by the quality of the features
chosen. Deep learning provides a solution to this problem by not only finding
a mapping from features to outputs but also identifying features automatically
from a given set of data. In a seminal paper describing an image classifier
commonly known as AlexNet [2], the authors demonstrated an application of deep
learning using a convolutional neural network (CNN) to provide a solution to
image classification which produced significantly better results than
state-of-the-art algorithms based on feature engineering. Since then, various
deep learning models and their combinations have been used to solve a variety
of problems.

Convolutional neural networks for image classification [3][4], recurrent
neural networks (RNNs) and their variants for speech recognition and natural
language processing [6], and language translation [7] are just a few of the
application areas of deep learning algorithms.

What is an FPGA?

An FPGA is made of many Programmable Logic Blocks (PLBs), each of which
contains several logic cells. In the FPGA used in this series (the Lattice
iCE40HX1K on the iCEstick), each PLB contains 8 cells. This number may vary
depending on your particular FPGA.

What is FPGA in Xilinx?

Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are
based around a matrix of configurable logic blocks (CLBs) connected via
programmable interconnects. FPGAs can be reprogrammed to desired
application or functionality requirements after manufacturing.

What is the introduction of FPGA in Xilinx?

At a high level, all Xilinx FPGAs share a common programmable logic
architecture consisting of:

Configurable Logic Blocks (CLBs) – basic building blocks of logic and routing
that implement logic functions.

Input/Output Blocks (IOBs) – periphery blocks supporting external I/O
interfaces.

What is the history of Xilinx FPGAs?

Founded in 1984, Xilinx invented the field-programmable gate array (FPGA)
and was the first fabless semiconductor company. In 2012, the company
introduced the first product based on 3-D stacked silicon using silicon
interposer technology.

Programmable logic in the form of Field-Programmable Gate Arrays (FPGAs)
has now become a widely accepted design approach for low- and medium-volume
computing applications. Low development costs and inherent functional
flexibility have spurred the spectacular growth of this technology. There are
many FPGA types, but a widely used one is the static-RAM-based FPGA
architecture. In such a programmable circuit, an array of logic cells and
interconnection cells can be configured in the field to implement a desired
function; this is usually called In-System Configurability (ISC) [1-2].
Taking into account their flexibility, RAM-based FPGAs are classically tested
using a function-oriented approach, as described by the following procedure:
a logic function is implemented in the FPGA and then the FPGA is tested
according to the implemented function. Then another function is implemented
in the FPGA and the FPGA is re-tested according to the new implemented
function, and so on. In the above function-oriented test procedure, it is
assumed that the set of successively implemented functions and related tests
allows the complete FPGA functionality to be tested.

On the other side, structural testing of these circuits has only recently
been addressed in the literature. Structural testing for conventional digital
ICs is a difficult and important task, but structural testing for FPGAs is an
even more complex problem [3-24]. The main objective of this work is to
report on an original experiment on the structural testing of two XILINX FPGA
families: the 4000 and the 3000 families [25]. It analyses the specific
difficulties encountered in FPGA testing and gives a general description of
the structural approach used in this experiment. It also illustrates how this
approach has been applied to the logic elements, to the interconnect elements
and to the RAM elements. Finally, it summarizes the results obtained on the
4000 and 3000 families and gives some concluding remarks.

What is the general architecture of Xilinx FPGA?

FPGAs from Xilinx are hybrid computation systems with Block RAMs,
programmable fabric, DSP Slices, and PCI Express support. Because all of
these compute resources can be accessed at the same time, they enable
scalability and pipelining of applications throughout the entire platform.

CHAPTER-2
SOFTWARE

Xilinx software:
What is Xilinx software?
The Xilinx Integrated Synthesis Environment (ISE) is a software tool from
Xilinx for the synthesis and analysis of HDL designs, primarily targeting
embedded firmware development for the Xilinx CPLD and FPGA integrated circuit
product families.

Installation Notes for Xilinx 14.7 on Windows 10

1) These instructions are for installing Xilinx ISE version 14.7 on Windows
10 only. If you are using Windows 7/XP, then you can follow the instructions
in Part 0 of the original manual written by Prof. Erik Brunvand from Page 4
onwards. In that case, please use the following link instead of the link
provided in step 1. The next screenshot is for the following link.
https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/designtools/v2012_4---14_7.html

2) For using the tool on Windows 10, there are two options. Go to the
following link:
https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/design-tools.html

The first option is 14.7 (Windows 10), which supports Windows 10 Pro or
Windows 10 Enterprise. Since I don’t have either of those (I have Windows 10
Home), I had some errors while running it after installation. But I can tell
you that it installs a virtual machine (VM) with a Red Hat Linux OS with
Xilinx ISE already installed. Xilinx ISE needs to be run in that VM. If you
have either Windows 10 Pro or Windows 10 Enterprise, you can try installing
this version. We can help you out if there is some problem.

3) The other option (which I used) is to use the 14.7 option highlighted in
the above figure. The problem with this version is that it is meant for
Windows 7/XP and causes problems running the appropriate binaries after the
installation. But fortunately, Rajath and I were able to come up with
workarounds that will be sufficient for this course. I will go through the
process in the next steps.

4) Click the 14.7 link in the figure for step 2. Then find and click the link
as shown in the figure for step 1. You will then be prompted to log in to
your Xilinx account.

If you don’t have an account, then create one using the link in the above
figure. After creating your account, you will need to fill out the name and
address verification questions. Continuing through that, you will get a prompt
to download the Xilinx_ISE_DS_Win_14.7_1015_1.tar file.

5) After downloading the tar file, use file-extraction software like 7-Zip to
extract it into some directory. Next, follow steps 3-5.

6) This is a large piece of software, so it may take anywhere from 30 minutes
to an hour, but keep in mind that you will be prompted a couple of times when
it gets to about 90% of the installation. Accept the third-party
installations offered through those prompts.

7) From this step onwards, I am assuming that you installed the software in the
C:\Xilinx directory. Please make the changes accordingly if you installed in
some other directory.

8) The shortcuts that you will get on the Desktop or Start menu folder (after
the installation is finished) are all linked to the 64-bit binaries. We need to
force all these shortcuts to run the 32-bit version instead so that they run on
this OS.

9) Right-click the shortcut on the Desktop, click Properties, and change the
Target field from
C:\Xilinx\14.7\ISE_DS\settings64.bat C:\Xilinx\14.7\ISE_DS\ISE\bin\nt64\ise.exe
to
C:\Xilinx\14.7\ISE_DS\settings64.bat C:\Xilinx\14.7\ISE_DS\ISE\bin\nt\ise.exe

10) Then we need to add the free WebPACK license to the Xilinx ISE software.
For this, go to the folder C:\Xilinx\14.7\ISE_DS\common\bin\nt, run the
executable xlcm.exe, and follow steps 6 and 7 on Page 5 after the License
Configuration Manager window opens. The figure in step 7 (Page 6) tells you
to select the option Vivado Design Suite (includes ISE): WebPACK license. But
as the Xilinx website has been updated, you will need to select the option
ISE WebPACK License. You will then receive a Xilinx.lic file in your email
(the one you used to make the Xilinx account). Download this file and keep it
somewhere safe (probably in the C:\Xilinx\14.7 folder). In the License
Configuration Manager window, go to the tab Manage Licenses, click Load
License… and navigate to the Xilinx.lic file. This will make the ISE software
use this license when you start your project.

11) This sets up your Xilinx ISE software and the license. There are two more
fixes that we need to perform: one to run the built-in simulator for testing
your code, and one to run the PlanAhead software that maps the inputs/outputs
to the switches and LEDs on the FPGA.

12) Download the zip file sim_planahead_fix.zip from the class website. Unzip
this file. You will find two files in the folder, one for each of the two
fixes mentioned above.

13) Go to the folder C:\Xilinx\14.7\ISE_DS\ISE\bin\nt. There, find a file
called fuse.exe. Rename this file to something like fuse_orig.exe, and then
copy (into this directory) the fuse.exe file from the folder you unzipped in
the last step. I renamed the original file so that we will have a copy of the
original file in case we need it later. This should fix the error which
occurs while using the simulator.

14) Go to the folder C:\Xilinx\14.7\ISE_DS\PlanAhead\bin. There, rename the
file rdiArgs.bat to something like rdiArgs.bat.orig, and then copy (into this
directory) the rdiArgs.bat file from the folder you unzipped in step 12. This
should fix the error while running the PlanAhead utility. These are all the
errors and their fixes while trying to run the 14.7 Xilinx ISE for Windows
7/XP on Windows 10. After following through these steps, go to Part I in the
original manual on Page 7 and start your first project. Start the project
navigator from the Desktop shortcut (so that it runs the binary in the target
C:\Xilinx\14.7\ISE_DS\settings64.bat
C:\Xilinx\14.7\ISE_DS\ISE\bin\nt\ise.exe)

There are other fixes available too. I have mentioned the solution that I
used for the installation on my system. We (TAs) will be happy to help in
case you have any questions about any of the steps or if you still get some
error running the software.
Xilinx FPGAs are heterogeneous compute platforms that include Block
RAMs, DSP Slices, PCI Express support, and programmable fabric. They
enable parallelism and pipelining of applications across the entire platform as
all of these compute resources can be used simultaneously.

CHAPTER-3

ALGORITHM

Creating a step-by-step algorithm for addressing deep learning challenges
with Xilinx FPGAs involves several key stages. The algorithm below outlines a
systematic process for leveraging Xilinx FPGAs for efficient deep learning
implementation:

1. Problem Definition

Identify the specific deep learning challenges to be addressed, such as
computational intensity, memory bandwidth constraints, and the need for
real-time inference.

2. Xilinx FPGA Platform Selection

Evaluate and select the appropriate Xilinx FPGA platform based on the deep
learning requirements.

Consider factors such as available resources, memory bandwidth, and the
flexibility of the FPGA architecture.

3. Model Analysis and Optimization

Analyze the deep learning model architecture to identify opportunities for
optimization.

Explore techniques like model quantization, pruning, and architecture
adjustments to reduce computational requirements.

4. High-Level Synthesis (HLS) Integration

Utilize Xilinx HLS tools to convert high-level algorithmic descriptions into
efficient hardware implementations.

Optimize the design for parallelism and pipeline stages to maximize FPGA
resource utilization.

5. Custom IP Block Integration

Integrate custom Intellectual Property (IP) blocks from Xilinx libraries to
accelerate specific deep learning operations.

Leverage Xilinx hardened IP blocks for tasks like matrix multiplication and
convolutional operations.

6. Memory Hierarchy Optimization

Design an optimized memory hierarchy to meet the high memory bandwidth
demands of deep learning workloads.

Utilize on-chip and off-chip memory efficiently, considering data movement
and storage requirements.

7. Power Efficiency Considerations

Implement power-efficient design strategies, such as clock gating and dynamic
voltage and frequency scaling.

Explore Xilinx power optimization tools to analyze and enhance power
efficiency.

8. Real-Time Inference Support

Ensure the FPGA design supports real-time inference by meeting latency
constraints.

Optimize the pipeline to minimize inference time while maintaining accuracy.

9. Performance Evaluation

Use benchmarking and profiling tools to evaluate the performance of the deep
learning model on the Xilinx FPGA.

Compare the FPGA-accelerated solution with traditional CPU or GPU
implementations.

10. Iterative Refinement

Iterate through the optimization process, refining the design based on
performance evaluations.

Fine-tune parameters and architecture to achieve the best trade-off between
speed, accuracy, and resource utilization.

11. Documentation and Reporting

Document the implemented algorithm, design decisions, and optimizations
made.

Provide a comprehensive report highlighting the improvements achieved with
the Xilinx FPGA solution compared to traditional approaches.

12. Deployment Considerations

Explore considerations for deploying the FPGA-accelerated deep learning
model in real-world applications.

Address integration challenges and potential scalability issues.
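Step 3 above (model analysis and optimization) can be made concrete with a minimal magnitude-pruning sketch. This is an illustrative NumPy example, not part of any Xilinx toolflow; the function name, layer shape, and sparsity target are assumptions chosen for the demonstration:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until the requested
    fraction of zeros (sparsity) is reached."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    # Magnitude of the k-th smallest weight; everything at or below it is pruned.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)  # a hypothetical layer
pruned = magnitude_prune(w, sparsity=0.5)
print(np.mean(pruned == 0))  # prints 0.5 (half the weights are zeroed)
```

A pruned weight matrix of this kind reduces the number of MAC operations an FPGA kernel must schedule, at the cost of bookkeeping for the sparsity pattern.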

Xilinx FPGA and Deep Learning:

The rapid expansion in the area of deep learning has created the following
challenges for deep learning system designers:

1. Application Diversity: Deep learning algorithms are now being applied to
solve a variety of problems such as language translation, game and decision
systems, medical diagnosis, social media applications, robotics, advanced
driver assistance systems, and deeply embedded systems, for example hearing
aids. Satisfying these different use cases requires different computational
networks and different figures of merit (speed, latency, energy, accuracy),
which in turn requires flexibility in a deep learning solution.

2. High compute and memory requirements: Most deep learning algorithms
require billions of MAC operations and large amounts of data to store model
parameters. This requires the availability of a large amount of compute
capability, a flexible memory hierarchy, and high-performance system-level
interconnect.

Xilinx is focused on implementing efficient and high-performance inference
engines for a broad range of applications. The Xilinx FPGA architecture
provides a set of capabilities which are uniquely suited to implementing deep
learning inference engines for a range of problems. The following are some
key architectural features of Xilinx FPGAs [5] that enable deep learning
systems:

1. Silicon: Xilinx’s Stacked Silicon Interconnect technology enables high
compute density and power advantages as well as scalability. The UltraScale+
VU13P device, for example, has 12,288 DSPs. Currently, 8-bit operations are
one of the leading contenders for efficient deep learning datatypes, which
will be discussed further in the context of quantization. The DSPs have the
flexibility to be used for various sizes of data types. Additionally, certain
small bit-width datatypes can use the regular FPGA fabric to compute various
parts of deep learning networks.

2. Compute and Memory: Deep learning requires smart architectural choices
about how tightly to couple memory and compute. In modern FPGAs, there are
multiple levels of memory hierarchy, ranging from small local memories
running at the same speed as the DSPs, through medium and large UltraRAM
(URAM) on-die memories and in-package HBM, all the way to standard DDR
interfaces. Often the first choice in a deep learning system is where to
store network weights and image data, and this will vary depending on the
type of network, whether CNN, RNN or other types, because they all have
different ratios of compute to memory requirements.

Figure 1: Acceleration framework, base platform, reconfiguration

The Reconfigurable Acceleration methodology for FPGAs uses the fundamental
concept of changing the run-time function of an FPGA while it is in
operation. While a block is being reconfigured, the FPGA device maintains the
connection to the external logic and interfaces. The time needed for
reconfiguration is very short compared to re-configuring the whole device. In
a reconfigurable acceleration design, one or more regions are identified as
reconfigurable modules (RM). The connection between the static logic and the
reconfigurable modules' logic is well-defined. When a designer intends to
accelerate one part of their problem statement, that part can be programmed
into the Reconfigurable Modules (RM) of the design. Once the new RM is
implemented and a partial bitstream file is produced, that partial bitstream
gets loaded into the FPGA device. As a result of this process, the newly
programmed FPGA device starts accelerating the user's design. Prior to deep
learning applications, this type of capability was very useful in military
communications and public safety radio, where communications need to be
extremely secure, by accelerating advanced cryptography algorithms. Other
interesting applications are in the automotive, avionics, industrial and
medical markets, where redundancy needs to be supported for critical
services. In a deep learning framework this reconfiguration presents the
opportunity for an FPGA to take on different personas: for example, it may be
configured to run a highly optimized convolution engine for a CNN, then
reconfigure itself to run matrix multiplications for an RNN or LSTM on the
output of the CNN classifications. This is one of the key advantages of FPGAs
over ASIC-based deep learning accelerators, which are inflexible to new types
of networks or computations in a field of rapidly advancing research. The
internal on-die URAM blocks allow previous work to be stored in the FPGA
during reconfiguration, which makes transferring data between deep learning
phases very efficient. However, the designer must pay careful attention to
the memory and address layouts of the various deep learning architectures.
What may work well for image-based networks like CNNs may have access
patterns that change when the problem becomes pure matrix multiplication.

Deep Learning Architectures:

Deep learning algorithms are typically, at their core, a series of
matrix-matrix or matrix-vector multiplications. In a linear algebra system
built on programmable hardware there is a balance between the number of DSP
units, the architecture design for their connectivity, frequency, latency,
routing and placement. Various approaches have been tried, from packet-based
processors with high frequency and small area but high latency, to large
systolic arrays benefiting from small, fast connections between nodes but
requiring careful pipelining and considerations for array efficiency. In the
typical systolic array implementation of deep learning, we can take a
convolutional layer as an example of the typical architecture, which is
composed of a state machine to load weights into the array, a state machine
to load activation (image) data into the array, the systolic array itself,
and the module to take data out of the array. Buffers must be designed to
quickly feed the arrays, with the appropriate latency balance between any
kind of DDR read or data transpose and the speed at which the DSP arrays
operate. A key architectural choice is also the programmability of the
design. One side of the spectrum of choices is a highly optimized design that
can only compute a single deep learning network. It may be the choice for
production servers where the network will not change for a very long time and
the time invested in optimizing a single network pays off in performance. The
other side of the spectrum is a highly programmable architecture that simply
performs matrix multiplication and, while very general, requires extra
processing to fit various deep learning convolution or other operations into
the matrix multiplication.
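As a software analogue of the systolic organisation described above, a blocked matrix multiply makes the tiling explicit. This is a sketch of the dataflow only, with an arbitrary tile size; it does not model the timing or placement of any Xilinx DSP array:

```python
import numpy as np

def tiled_matmul(a, b, tile=4):
    """Blocked matrix multiply: each (tile x tile) block update mirrors the
    work one pass through a systolic sub-array would perform."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, tile):          # rows of output tiles
        for j in range(0, m, tile):      # columns of output tiles
            for p in range(0, k, tile):  # accumulate along the inner dimension
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return c

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 8))
b = rng.standard_normal((8, 8))
assert np.allclose(tiled_matmul(a, b), a @ b)
```

In hardware, the tile size corresponds to the array dimensions, and the buffering the text describes exists to keep each tile of `a` and `b` arriving at the speed the DSP array consumes them.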

Fixed Point Quantization:

A further architectural optimization is to reduce the computational
complexity of the deep learning operations. The first main path is switching
away from single-precision floating point, which is the common datatype in
deep learning frameworks. Floating point, while generally easier to
understand in algorithm design, has an extremely high cost per computation.
At Xilinx we have demonstrated that functionally equivalent FIR filters
implemented in fixed point instead of floating point show an 80% reduction in
dynamic power usage [12]. This type of conversion from floating point to
fixed point has been utilized and researched for decades in the area of DSP
applications. Due to the efficiency of fixed point, Xilinx FPGAs provide only
fixed-point DSPs, with IPs available to provide floating-point operations if
required.

In order to convert from floating point to fixed point, various strategies
have been published, with bit widths ranging from 16 bits down to ternary
networks {-1, 0, 1}, or even binary weight networks. Additionally, conversion
of weights can be considered separately from conversion of activation values.
A further consideration in fixed-point arithmetic is rounding. Bias
introduced during the quantization step may result in a loss of
floating-point accuracy. Various rounding strategies, including round toward
zero, round to even, and stochastic rounding [13], have been used, with
various tradeoffs. Some approaches have used different quantization
strategies per layer or channel. Additionally, non-uniform quantization
strategies have been used, which can trade off increased precision in the
weight distribution against additional compute and area resources.
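Of the rounding strategies listed, stochastic rounding is the least intuitive, so a small sketch may help. The distribution and sample count below are arbitrary choices for illustration:

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each value up with probability equal to its fractional part,
    so the expected value of the result equals the original value."""
    lo = np.floor(x)
    return lo + (rng.random(x.shape) < (x - lo))

rng = np.random.default_rng(42)
x = np.full(100_000, 0.3)
# Round-to-nearest maps every 0.3 to 0.0, a systematic bias of -0.3;
# stochastic rounding is unbiased on average.
print(stochastic_round(x, rng).mean())  # close to 0.3
```

This unbiasedness is why stochastic rounding preserves accuracy when accumulating many low-precision quantized values.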

Figure 2: Convolution weight histogram

Figure 2 shows an actual weight distribution from the convolutional layers of
GoogLeNet, and the immediate question this raises is whether, during
quantization, the full range of floating-point values is required, or whether
the range can be limited, allowing saturation of values above the
quantization range. The actual range of weights that generated Figure 2
extends from -1.10434 to 1.603776. A quantization process that covers such a
range will introduce significant quantization errors for the majority of
values between -0.15 and +0.15. Intelligent approaches to quantization are
the key to achieving accuracy with low bit widths. Additional considerations
must be made for the asymmetric nature of the weights, in that the magnitudes
of the positive and negative ranges can differ.
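The effect of limiting the range can be checked numerically. The sketch below quantizes a synthetic, zero-concentrated weight distribution (an assumption modelled loosely on the histogram, with outliers placed at the quoted extremes) at two clipping ranges:

```python
import numpy as np

def quantize_symmetric(w, clip, bits=8):
    """Symmetric uniform quantization that saturates values beyond +/-clip."""
    qmax = 2 ** (bits - 1) - 1          # 127 codes on each side for 8 bits
    scale = clip / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                    # dequantized reconstruction

# Synthetic weights: most mass near zero, outliers at the quoted extremes.
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0.0, 0.05, 10_000), [-1.10434, 1.603776]])

for clip in (1.603776, 0.15):           # cover the full range vs. saturate
    err = np.abs(quantize_symmetric(w, clip) - w).mean()
    print(f"clip={clip}: mean |error| = {err:.5f}")
```

Under these assumptions, the saturating range gives a markedly lower mean error: the finer resolution near zero outweighs the large error incurred on the two clipped outliers, which is exactly the trade-off the text describes.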

Design Optimization:

When implementing a deep learning architecture on Xilinx FPGAs, there are
several aspects of the hardware architecture that the software tools must take
advantage of. The first factors to consider are:

1. The type of device being targeted
2. The static platform
3. The target frequency
4. The available DSPs
5. The memories (BRAM/URAM)
6. The number of DDR ports

Language choice is also important when planning an implementation. Two
approaches have been used for deep learning kernels. The first is based on
compute kernels written in OpenCL or HLS C/C++; the second is writing kernels
in RTL. Both approaches have proven effective, with different tradeoffs and
optimization strategies. Optimizing OpenCL or HLS C/C++ kernels mainly
involves approaching the design from the point of view of the intended
hardware. Even though the kernel is written in a high-level language, cycle
counts, throughput, and expected resource usage remain very important design
characteristics. In our experience, high-level languages can insulate the
kernel developer from details such as DDR transfer sizes or the AXI protocols.
With proper coding we have seen very effective and efficient implementations of
CNN and RNN networks.
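The loop structure such a kernel is designed around can be sketched as follows. This is plain Python standing in for HLS C/C++, with illustrative names; a real HLS kernel would express the same nest with pragmas directing which loops to pipeline and unroll, which is where the cycle counts and resource usage mentioned above are decided.

```python
import numpy as np

def conv2d_mac(activations, weights):
    """Direct 2-D convolution written as explicit multiply-accumulate
    loops, mirroring the nest an HLS kernel would pipeline.
    activations: (in_ch, H, W); weights: (out_ch, in_ch, K, K)."""
    in_ch, H, W = activations.shape
    out_ch, _, K, _ = weights.shape
    out = np.zeros((out_ch, H - K + 1, W - K + 1))
    for oc in range(out_ch):                 # often unrolled across DSPs
        for y in range(H - K + 1):
            for x in range(W - K + 1):
                acc = 0.0                    # one accumulator per output
                for ic in range(in_ch):      # inner loops: pipeline target
                    for ky in range(K):
                        for kx in range(K):
                            acc += weights[oc, ic, ky, kx] * \
                                   activations[ic, y + ky, x + kx]
                out[oc, y, x] = acc
    return out
```

The innermost three loops are a single long multiply-accumulate chain per output pixel, which is exactly the pattern the DSP hardware discussed later is built to absorb.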

On the other hand, RTL kernels are still where we find some of the highest
performing deep learning applications. Specific features of the Xilinx FPGAs,
such as the packed 8-bit mode of the DSP48E2, are easiest to exploit in RTL.
In RTL, the AXI protocols and DDR access patterns must be hand crafted or
connected to existing IP.
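The arithmetic behind that packed 8-bit mode can be illustrated numerically. The sketch below is a plain-Python model, not RTL, and handles unsigned operands only; a real DSP48E2 design needs a sign-correction term, and the function name and 18-bit guard offset here are illustrative of the general idea of sharing one wide multiplier between two small products.

```python
def packed_int8_multiply(w1, w2, a, guard_bits=18):
    """Model of the packed INT8 trick: two unsigned 8-bit weights share
    one wide multiply by packing w1 into the upper bits of one operand.
    (w1 << 18 | w2) * a == (w1*a) << 18 + w2*a, and since w2*a is at
    most 16 bits the two products never overlap in the wide result."""
    assert 0 <= w1 < 256 and 0 <= w2 < 256 and 0 <= a < 256
    packed = (w1 << guard_bits) | w2        # single wide operand
    product = packed * a                    # one hardware multiply
    p2 = product & ((1 << guard_bits) - 1)  # low field  -> w2 * a
    p1 = product >> guard_bits              # high field -> w1 * a
    return p1, p2
```

Because both small products drop out of one multiplication, the effective multiply throughput per DSP doubles at 8-bit precision, which is why low-bit-width quantization and this hardware feature reinforce each other.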

FPGA Resource Allocation:

Meeting application and performance requirements is more challenging as deep
learning kernels grow in complexity and attempt to fully utilize all FPGA
resources. Designs created for deep learning have grown larger as FPGA devices
offer higher resource counts in the latest-generation UltraScale+ devices. We
have found that users of Xilinx FPGAs in deep learning applications achieve
some of the highest resource utilizations of any application type.

While implementing deep learning kernels, special attention must be paid to
the base or static region that communicates with the DDR or PCIe interfaces.
Except in certain embedded applications, these interfaces are the only way the
FPGA can communicate with the deep learning software environment. The area and
clocks used by these interfaces do not have the same flexibility as the
kernels.

DSP Placement:

For any convolutional neural network, the convolution layer is based on large
numbers of multiply-accumulate operations. FPGAs are often used in signal
processing applications that use finite impulse response (FIR) filters as a
core building block. These filters are, at their core, also
multiply-accumulates, and the hardware designed for FIR filters is likewise
utilized in CNN or RNN applications. Typical DSPs are laid out in vertical
columns with cascaded connections that are direct routes. To build efficient
deep learning computation with minimal placement and routing congestion, one
application of the cascaded connection is to pass the partial accumulation
result along a chain of DSPs, each multiplying a weight value by the network's
activation. A typical UltraScale+ device can chain between 96 and 120 DSPs in
height, which allows the large 48-bit datapath between DSPs to be entirely
removed from the placement and routing process. With this regular layout, very
high utilization of device DSPs as well as very high compute efficiency can be
achieved. This type of layout also resembles a systolic array, a structure
known to be very efficient for many computations.
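The cascade described above can be modeled in a few lines. This is an illustrative Python sketch of the dataflow only (the function name is hypothetical): each loop iteration stands in for one DSP slice that multiplies a weight by an activation and adds the partial sum cascaded in from the slice below.

```python
def dsp_cascade_dot(weights, activations):
    """Model of a DSP column computing a dot product: the partial sum
    flows up the cascade (PCIN -> PCOUT), so the wide accumulation path
    never touches the general-purpose routing fabric."""
    partial = 0
    trace = []
    for w, a in zip(weights, activations):   # one iteration per DSP slice
        partial = partial + w * a            # PCIN + A*B -> PCOUT
        trace.append(partial)                # value on each cascade link
    return partial, trace
```

With 96 to 120 slices chained per column, one column absorbs a correspondingly long stretch of the convolution's multiply-accumulate chain without consuming fabric routing for the 48-bit partial sums.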

Figure 3 DSP Placement

ADVANTAGES:

Hardware Acceleration:

Advantage: Xilinx FPGAs provide customizable hardware acceleration,


allowing for the implementation of specialized hardware tailored to the deep
learning workload.

Flexibility and Reconfigurability:


Advantage: FPGAs are highly flexible and reconfigurable, enabling rapid
prototyping and iterative optimization of deep learning models.
Parallel Processing Capabilities:

Advantage: Xilinx FPGAs inherently support parallel processing, making them
well-suited for the parallelized nature of deep learning computations.

Optimized IP Blocks:

Advantage: Xilinx provides pre-designed intellectual property (IP) blocks
optimized for common deep learning operations.

Memory Bandwidth Management:

Advantage: FPGAs allow for fine-grained control over the memory hierarchy,
addressing the high memory bandwidth demands of deep learning models.

DISADVANTAGES:

Learning Curve:

Disadvantage: Designing and optimizing FPGA-based solutions may have a


steep learning curve for developers unfamiliar with hardware design languages
and FPGA architectures.

Resource Constraints:
Disadvantage: FPGAs have limited on-chip resources (such as logic elements,
memory, and DSP slices).

Development Time:

Disadvantage: Designing and optimizing FPGA-based solutions can be time-


consuming, especially for complex deep learning models.

Power Consumption in Continuous Operation:

Disadvantage: Although FPGAs can be efficient during active processing, their
power consumption may not be as low as specialized low-power processors during
idle or low-activity periods.

Cost:

Disadvantage: FPGAs may have higher upfront costs compared to using GPUs
or CPUs for deep learning tasks.

APPLICATIONS:

Computer Vision:

Application: Object detection, image recognition, and video analytics.

Smart Agriculture:

Application: Crop monitoring, pest detection, and yield prediction.

Healthcare Imaging:

Application: Medical image analysis, including MRI and CT scans.

Robotics:

Application: Visual perception, object manipulation, and path planning for


robots.

CONCLUSION

Deep learning systems have shown tremendous effectiveness in various


applications. The development of high performance flexible hardware is a key
enabler for future research in deep learning and FPGAs do enable this
flexibility. The challenges of deep learning range from architecture decisions
and data science analysis of the numeric computation to understanding of
device capabilities and detailed netlist optimizations. The future of deep
learning will
depend on having the tools to allow deep learning researchers and users to
explore and experiment with new ideas, new algorithms, and new
architectures.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press,
2016.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification
with deep convolutional neural networks," Advances in Neural Information
Processing Systems, ILSVRC 2012.

[3] C. Szegedy, W. Liu, Y. Jia, et al., "Going deeper with convolutions,"
ILSVRC 2014.

[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image
recognition," ILSVRC 2015.

[5] Xilinx UltraScale family/generation architecture,
www.xilinx.com/about/generation-ahead-16nm.html

[6] T. Sainath, "Towards end-to-end speech recognition using deep neural
networks," invited talk, ICML 2015.

[7] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, et al., "Google's
neural machine translation system: Bridging the gap between human and machine
translation," technical report, 2016.

[8] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor
processing unit," 44th International Symposium on Computer Architecture
(ISCA), Toronto, Canada, June 26, 2017.

[9] J. Ouyang et al., "SDA: Software-defined accelerator for large-scale DNN
systems," Hot Chips 2014.

[10] S. Han et al., "ESE: Efficient speech recognition engine for compressed
LSTM on FPGA," International Symposium on Field-Programmable Gate Arrays
(FPGA), 2017.

[11] Y. Umuroglu et al., "FINN: A framework for fast, scalable binarized
neural network inference," 25th International Symposium on Field-Programmable
Gate Arrays, February 2017.

[12] "Reduce power and cost by converting from floating point to fixed
point,"
https://www.xilinx.com/support/documentation/white_papers/wp491-floating-to-fixed-point.pdf

[13] S. Gupta et al., "Deep learning with limited numerical precision,"
https://arxiv.org/abs/1502.02551, 2015.
