Xilinx

Submitted by
K.Kavya Sri
20NN1A0431

External Examiner
CHAPTER-1
INTRODUCTION
What is an FPGA?
Field Programmable Gate Arrays (FPGAs) are semiconductor devices built around a matrix of configurable logic blocks (CLBs) connected via programmable interconnect. FPGAs can be reprogrammed to suit desired application or functionality requirements after manufacturing.
Structural testing of these circuits has only recently been addressed in the literature. Structural testing of conventional digital ICs is a difficult and important task, but structural testing of FPGAs is an even more complex problem [3-24]. The main objective of this paper is to report on an original experiment on the structural testing of two Xilinx FPGA families: the 4000 and the 3000 family [25]. It analyses the specific difficulties encountered with FPGA testing and gives a general description of the structural approach used in this experiment. It also illustrates how this approach has been applied to the logic elements, to the interconnect elements, and to the RAM elements. Finally, it summarizes the results obtained on the 4000 and 3000 families and gives some concluding remarks.
FPGAs from Xilinx are hybrid computation systems with Block RAMs, programmable fabric, DSP slices, and PCI Express support. Because all of these compute resources can be accessed at the same time, they enable scalability and pipelining of applications throughout the entire platform.
CHAPTER-2
SOFTWARE
What is Xilinx software?
Xilinx Integrated Synthesis Environment (ISE) is a software tool from Xilinx for the synthesis and analysis of HDL designs, targeting primarily embedded firmware development for Xilinx CPLD and FPGA integrated circuit product families.
1) These instructions are for installing Xilinx ISE version 14.7 on Windows 10 only. If you are using Windows 7/XP, you can follow the instructions in Part 0 of the original manual written by Prof. Erik Brunvand from Page 4 onwards. In that case, please use the following link instead of the link provided in step 1:
https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/designtools/v2012_4---14_7.html
2) To use the tool on Windows 10, there are two options. Go to the following link:
https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/design-tools.html
The first option is 14.7 (Windows 10), which supports Windows 10 Pro or Windows 10 Enterprise. Since I don’t have either of those (I have Windows 10 Home), I had some errors while running it after installation. But I can tell you that it installs a virtual machine (VM) with a Red Hat Linux OS with Xilinx ISE already installed; Xilinx ISE needs to be run in that VM. If you have either Windows 10 Pro or Windows 10 Enterprise, you can try installing this version. We can help you out if there is some problem.
3) The other option (which I used) is the 14.7 option highlighted in the above figure. The problem with this version is that it is meant for Windows 7/XP and causes problems running the appropriate binaries after the installation. Fortunately, Rajath and I were able to come up with workarounds that are sufficient for this course. I will go through the process in the next steps.
4) Click the 14.7 link in the figure for step 2. Then find and click the link shown in the figure for step 1. You will then be prompted to log in to your Xilinx account.
If you don’t have an account, then create one using the link in the above
figure. After creating your account, you will need to fill out the name and
address verification questions. Continuing through that, you will get a prompt
to download the Xilinx_ISE_DS_Win_14.7_1015_1.tar file.
5) After downloading the tar file, use file extraction software like 7zip to extract it into some directory. Next, follow steps 3-5.
7) From this step onwards, I am assuming that you installed the software in the C:\Xilinx directory. Please adjust the paths accordingly if you installed in some other directory.
8) The shortcuts that you will get on the Desktop or in the Start menu folder (after the installation is finished) are all linked to the 64-bit binaries. We need to force all these shortcuts to run the 32-bit version instead so that they run on this OS.
9) Right-click the shortcut on the Desktop, click Properties, and change the Target field from
C:\Xilinx\14.7\ISE_DS\settings64.bat C:\Xilinx\14.7\ISE_DS\ISE\bin\nt64\ise.exe
to
C:\Xilinx\14.7\ISE_DS\settings64.bat C:\Xilinx\14.7\ISE_DS\ISE\bin\nt\ise.exe
10) Then we need to add the free WebPACK license to the Xilinx ISE software. For this, go to the folder C:\Xilinx\14.7\ISE_DS\common\bin\nt, run the executable xlcm.exe, and follow steps 6 and 7 on Page 5 after the License Configuration Manager window opens. The figure in step 7 (Page 6) tells you to select the option Vivado Design Suite (includes ISE): WebPACK license, but as the Xilinx website has been updated, you will need to select the option ISE WebPACK License. You will then receive a Xilinx.lic file in the email you used to make the Xilinx account. Download this file and keep it somewhere safe (for example, in the C:\Xilinx\14.7 folder). In the License Configuration Manager window, go to the Manage Licenses tab, click Load License… and navigate to the Xilinx.lic file. This will make the ISE software use this license when you start your project.
11) This sets up your Xilinx ISE software and the license. There are two more fixes we need to perform: one to run the built-in simulator for testing your code, and one to run the PlanAhead software that maps the inputs/outputs to the switches and LEDs on the FPGA.
12) Download the zip file sim_planahead_fix.zip from the class website and unzip it. You will find two files in the folder, one for each of the two fixes mentioned above.
file in case we need it later. This should fix the error which occurs while using
the simulator.
There are other fixes available too; I have described the solution that I used for the installation on my system. We (TAs) will be happy to help in case you have any questions about any of the steps or if you still get errors running the software.
CHAPTER-3
ALGORITHM
1. Problem Definition
Evaluate and select the appropriate Xilinx FPGA platform based on the deep
learning requirements.
Optimize the design for parallelism and pipeline stages to maximize FPGA
resource utilization.
Integrate custom Intellectual Property (IP) blocks from Xilinx libraries to
accelerate specific deep learning operations.
Leverage Xilinx hardened IP blocks for tasks like matrix multiplication and
convolutional operations.
9. Performance Evaluation
Use benchmarking and profiling tools to evaluate the performance of the deep
learning model on the Xilinx FPGA.
Document the implemented algorithm, design decisions, and optimizations
made.
applied to solve a variety of problems such as language translation, game and decision systems, medical diagnosis, social media applications, robotics, advanced driver assistance systems, and deeply embedded systems such as hearing aids. Satisfying these different use cases requires different computational networks and different figures of merit (speed, latency, energy, accuracy), which in turn requires flexibility in a deep learning solution. 2. High compute and memory requirements: Most deep learning algorithms require billions of MAC operations and large amounts of data to store model parameters. This requires the availability of large compute capability, a flexible memory hierarchy, and a high-performance system-level interconnect.
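To make the scale of these requirements concrete, here is a rough back-of-the-envelope sketch; the layer shape used (a 3x3, 64-to-64-channel convolution on a 56x56 feature map) is a hypothetical example chosen for illustration, not a figure from this report.

```python
# Rough estimate of the compute and memory demands of a single
# convolutional layer, illustrating the "billions of MACs" claim
# (a full network stacks tens of such layers).

def conv_layer_cost(h, w, c_in, c_out, k):
    """Return (MAC operations, weight parameters) for one conv layer
    with an h x w output, c_in input and c_out output channels, and
    a k x k kernel (stride 1)."""
    macs = h * w * c_out * c_in * k * k   # one MAC per output element per tap
    params = c_out * c_in * k * k
    return macs, params

macs, params = conv_layer_cost(h=56, w=56, c_in=64, c_out=64, k=3)
print(f"MACs per image: {macs:,}")   # about 115.6 million for this one layer
print(f"Weights: {params:,}")        # 36,864 parameters
```

Multiplying by a frame rate target (say 30 images per second) shows why a flexible memory hierarchy and wide compute arrays are needed even for modest networks.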
be discussed further in the context of quantization. The DSPs have the flexibility to be used for various data-type sizes. Additionally, certain small bit-width data types can use the regular FPGA fabric to compute various parts of deep learning networks. 2. Compute and memory: Deep learning requires smart architectural choices about how tightly to couple memory and compute. In modern FPGAs there are multiple levels of memory hierarchy, ranging from small local memories running at the same speed as the DSPs, through medium and large UltraRAM (URAM) on-die memories and in-package HBM, all the way to standard DDR interfaces. Often the first choice in a deep learning system is where to store the network weights and image data, and this will vary with the type of network (CNN, RNN, or other), because they all have different ratios of compute to memory requirements.
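The differing compute-to-memory ratios can be sketched with simple arithmetic; the layer shapes below are hypothetical examples, assuming 1-byte (int8) weights.

```python
# Sketch comparing compute-to-memory ratios. A conv layer reuses each
# weight across many output positions, so its MACs-per-weight-byte is
# high (compute-bound); a fully connected (RNN-like) layer uses each
# weight once per inference, so it is memory-bound.

def macs_per_weight_byte(macs, n_weights, bytes_per_weight=1):
    return macs / (n_weights * bytes_per_weight)

# 3x3 conv, 56x56 output, 64 -> 64 channels (illustrative shape)
conv_macs = 56 * 56 * 64 * 64 * 3 * 3
conv_weights = 64 * 64 * 3 * 3
# Fully connected 4096 -> 4096 layer: exactly one MAC per weight
fc_macs = 4096 * 4096
fc_weights = 4096 * 4096

print(macs_per_weight_byte(conv_macs, conv_weights))  # 3136.0 MACs per byte
print(macs_per_weight_byte(fc_macs, fc_weights))      # 1.0 MAC per byte
```

The conv layer can keep its weights in fast on-die URAM and stream activations, while the FC/RNN layer's throughput is set almost entirely by weight bandwidth, which is why the storage choice differs per network type.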
accelerate one part of their problem statement, that part can be programmed into the Reconfigurable Modules (RM) of the design. Once the new RM is implemented and a partial bitstream file is produced, that partial bitstream is loaded into the FPGA device. As a result of this process, the newly programmed FPGA device starts accelerating the user's design. Prior to deep learning applications, this type of capability was very useful in military communications and public safety radio, where communication needs to be extremely secure, by accelerating advanced cryptography algorithms. Other interesting applications are in the automotive, avionics, industrial, and medical markets, where redundancy needs to be supported for critical services. In a deep learning framework, this reconfiguration presents the opportunity for an FPGA to take on different personas: for example, it may be configured to run a highly optimized convolution engine for a CNN, then reconfigure itself to run matrix multiplications for an RNN or LSTM on the output of the CNN classifications. This is one of the key advantages of FPGAs over ASIC-based deep learning accelerators, which are inflexible to new types of networks or computations in a field of rapidly advancing research. The internal on-die URAM blocks allow previous work to be stored in the FPGA during reconfiguration, which makes transferring data between deep learning phases very efficient. However, the designer must pay careful attention to the memory and address layouts of the various deep learning architectures. What works well for image-based networks like CNNs may have access patterns that change when the problem becomes pure matrix multiplication.
implementation of deep learning, we can take a convolutional layer as an example of the typical architecture, which is composed of a state machine to load weights into the array, a state machine to load activation (image) data into the array, the systolic array itself, and a module to take data out of the array. Buffers must be designed to quickly feed the arrays, with the appropriate latency balance between any kind of DDR read or data transpose and the speed at which the DSP arrays operate. A key architecture choice is also the programmability of the design. One side of the spectrum of choices is a highly optimized design that can only compute a single deep learning network. It may be the choice for production servers, where the network will not change for a very long time and the time invested in optimizing a single network pays off in performance. The other side of the spectrum is a highly programmable architecture that simply performs matrix multiplication: while very general, it requires extra processing to fit various deep learning convolution or other operations into the matrix multiplication.
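That "extra processing" is typically an im2col-style rearrangement. The sketch below shows the idea for a single-channel image with illustrative shapes; it is a behavioral model, not production code.

```python
import numpy as np

# Minimal im2col sketch: reshape a convolution into a plain matrix
# multiplication so a generic matrix-multiply engine can run it.

def im2col(x, k):
    """x: (H, W) single-channel image, k: kernel size. Returns a matrix
    whose rows are the k*k patches (valid padding, stride 1)."""
    h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.arange(16, dtype=np.float32).reshape(4, 4)
w = np.ones((3, 3), dtype=np.float32)      # box-filter kernel
y = im2col(x, 3) @ w.ravel()               # convolution as a matrix multiply
print(y.reshape(2, 2))                     # [[45. 54.] [81. 90.]]
```

The cost of this generality is visible here: each input pixel is duplicated into several patch rows, so the matrix-multiply formulation trades extra memory traffic for a single, reusable compute kernel.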
To convert from floating point to fixed point, various strategies have been published, with bit widths ranging from 16 bits down to ternary networks {-1, 0, 1} or even binary weight networks. Additionally, the conversion of weights can be considered separately from the conversion of
activation values. An additional consideration during fixed-point arithmetic is rounding. Bias introduced during the quantization step may result in a loss of floating-point accuracy. Various rounding schemes, including round to zero, round to even, and stochastic rounding [13], have all been used with various trade-offs. Some approaches have used different quantization strategies per layer or per channel. Additionally, non-uniform quantization strategies have been used, which can trade increased precision in the weight distribution for additional compute and area resources.
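A minimal sketch of one such strategy follows: symmetric int8 quantization with a max-abs scale and round-to-nearest-even. This is one common scheme among the many mentioned above, not the report's specific method, and the weight values are made up for illustration.

```python
import numpy as np

# Symmetric int8 quantization sketch: map the largest-magnitude weight
# to +/-127 and round the rest. np.rint uses round-half-to-even, one of
# the rounding schemes discussed in the text.

def quantize_int8(w):
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.4, -0.2, 0.1, -1.0], dtype=np.float32)
q, s = quantize_int8(w)
print(q.tolist())              # [51, -25, 13, -127]
print(dequantize(q, s))        # approximately the original weights
```

Comparing `dequantize(q, s)` against `w` gives the per-weight quantization error; an asymmetric or per-channel scheme changes only how `scale` is chosen.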
with low bit widths. Additional considerations must be made for the asymmetric nature of the weights, in that the magnitudes of the positive and negative ranges could be different.
Design Optimization:
On the other hand, RTL kernels are still where we find some of the highest-performing deep learning applications. Specific features of the Xilinx FPGAs, such as the packed 8-bit mode of the DSP48E2, are easiest to exploit in RTL. In RTL, the AXI protocols and DDR access patterns must be hand-crafted or connected to existing IP.
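The idea behind the packed 8-bit mode can be sketched in software: two small operands share one wide multiplier by being packed far enough apart that their products never overlap. This is a simplified unsigned model of the concept only; the actual DSP48E2 mode also handles signed operands and correction logic, which are omitted here.

```python
# Behavioral sketch of packed multiplication: compute a*w and b*w with
# a single wide multiply. The 18-bit offset is chosen so that b*w
# (< 2**18) cannot carry into the upper product.

def packed_multiply(a, b, w):
    """a, b: unsigned 8-bit values sharing the weight w, with b*w < 2**18."""
    packed = (a << 18) | b          # a and b side by side, 18 bits apart
    product = packed * w            # one multiplication yields both products
    return product >> 18, product & ((1 << 18) - 1)

high, low = packed_multiply(200, 100, 50)
print(high, low)                    # 10000 5000  (i.e. 200*50 and 100*50)
```

Doubling the effective multiplies per DSP this way is what makes int8 inference so attractive on these devices, and it is far easier to express in RTL than to coax out of a high-level flow.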
FPGA resources. The designs created for deep learning have grown bigger as FPGA devices offer higher resource counts in the latest-generation UltraScale+ devices. We have found that users of Xilinx FPGAs in deep learning applications achieve some of the highest resource utilizations of any application type.
DSP Placement:
For any convolutional neural network, the convolution layer is based on large numbers of multiply-accumulate operations. FPGAs are often used in signal processing applications, which use finite impulse response (FIR) filters as a core building block. These filters are at their core also multiply-accumulates, and the hardware designed for FIR filters is also utilized by CNN or RNN applications. Typical DSPs are laid out in vertical columns with cascaded connections, which are direct routes. To build efficient deep learning computation with minimal placement and routing congestion, one application of the cascaded connection is to pass the partial accumulation result along a chain of DSPs that multiply the weight values with the network’s activations. A typical UltraScale+ device can chain between 96 and 120 DSPs in height, which allows the large 48-bit datapath between DSPs to be entirely removed from the placement and routing process. In this regular layout, very high utilization of device DSPs as well as very high compute efficiency can be achieved. This type of layout also leads to a structure resembling a systolic array, which is known to be very efficient for many computations.
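Functionally, the cascade described above computes a dot product, with each DSP adding its multiply result to the partial sum passed down the column. A behavioral sketch (ignoring the pipeline registers a real cascade would have):

```python
# Behavioral model of a DSP cascade column: each stage performs one
# multiply-accumulate and forwards the partial sum to the next stage
# over the dedicated 48-bit cascade path.

def dsp_cascade(weights, activations):
    partial = 0                        # cascade input of the first DSP
    for w, x in zip(weights, activations):
        partial = partial + w * x      # one DSP48: multiply, then accumulate
    return partial                     # cascade output of the last DSP

print(dsp_cascade([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6 = 32
```

Because the partial sum travels on the dedicated cascade route rather than the general fabric, a 96-to-120-deep chain of these stages places and routes cleanly, which is exactly the congestion benefit the text describes.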
Figure 3: DSP Placement
ADVANTAGES:
Hardware Acceleration:
DISADVANTAGES:
Learning Curve:
Resource Constraints:
Disadvantage: FPGAs have limited on-chip resources (such as logic elements,
memory, and DSP slices).
Development Time:
Cost:
Disadvantage: FPGAs may have higher upfront costs compared to using GPUs
or CPUs for deep learning tasks.
APPLICATIONS:
Computer Vision:
Smart Agriculture:
Healthcare Imaging:
Robotics:
CONCLUSION
REFERENCES
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016
[3] Christian Szegedy, Wei Liu, Yangqing Jia, et al., Going Deeper with Convolutions, ILSVRC 2014
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, ILSVRC 2015
[7] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, and Mohammad Norouzi, Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Technical Report, 2016
[12] Reduce Power and Cost by Converting from Floating Point to Fixed Point, Xilinx White Paper WP491, https://www.xilinx.com/support/documentation/white_papers/wp491-floating-to-fixed-point.pdf
[13] S. Gupta et al., Deep Learning with Limited Numerical Precision, https://arxiv.org/abs/1502.02551, 2015