
Mouser Electronics eBook
FPGA Design
eBook
By Adam Taylor
Liu zishan / shutterstock.com

Contents
• Introduction
• What is a Field Programmable Gate Array (FPGA)?
• Benefits of Programmable Logic (PL)
• FPGA Building Blocks
• FPGA Design
• Device families
• High end solutions
• Toolchains
• Vivado Design Suite
• Application Software Creation
• Acceleration
• AI and Beyond
• Embedded Processing
• Embedded Processors
• Soft-Core Processors
• Big-Little Approach
• Communicating Internally and Externally
• Internal Data Movement
• Programmable Logic Applications
• Test Equipment
• Automotive and ADAS
• Cloud Computing
• Industrial
• Deep Dive Application - Creating Embedded Vision Systems
• Elements of a PL Image Processing System
• Software Defined Image Processing
• Conclusion
• About the author
Introduction

The world we inhabit is analogue. However, digital processing enables us to experience and interact with the world in new ways: from satellite navigation, to autonomous vehicles, augmented reality, and the smartphones we carry in everyday life.

Being able to process this information in real and near real-time requires significant processing capability and, of course, this processing capability has benefited from Moore’s law. Design engineers also have several processing technologies from which to choose when selecting the most appropriate one for the application at hand. These processing technology choices range from traditional processors to Graphics Processing Units (GPU) and Programmable Logic (PL).

Of all processing technologies, programmable logic is probably the least well known and is often considered one of the more challenging to use when implementing solutions.

What is an FPGA?

Since their introduction in the mid-1980s, Field Programmable Gate Array devices have moved from providing the developer with the ability to integrate several discrete logic functions, e.g. glue logic, to becoming truly the heart of the system.

Modern FPGA devices and tools are unrecognizable to those first introduced; they contain not only traditional programmable logic resources but also high-performance embedded processors, dedicated interfaces and memory controllers, and IO structures capable of providing multi-gigabit data rates.

This eBook explores modern FPGAs, how they are programmed and developed, and examines various applications, including a deep-dive into image processing.

Benefits of PL

Programmable logic enables users to implement truly parallel implementations of their algorithms and applications. This parallel implementation enables a more deterministic and responsive solution. As such, programmable logic is used where real-time processing and responses are required, for example in vision, signal processing, and RADAR.

Traditionally programmable logic devices have been available in two device classes: Complex Programmable Logic Devices (CPLD) or Field Programmable Gate Arrays (FPGA). CPLDs offer a simple device structure of registers and logic functions using a sea-of-gate approach.

FPGAs offer a more complex structure than CPLDs and often include dedicated hardware elements such as block memory, digital signal processing, clock management, gigabit serial transceivers, and IO blocks.

FPGA Building Blocks

The basic building blocks of FPGAs are the lookup table (LUT), registers, and the flexible IO cell structures. LUTs enable the implementation of logic equations while registers provide the storage element necessary to implement sequential logic designs. LUTs and registers are combined to provide what is often called a logic slice, a simple example of which is shown in Figure 1. In modern devices, these slices contain many options which allow implementation of combinatorial or sequential logic circuits, including local distributed memory and the ability to use the LUT as a shift register depending upon configuration.
Figure 1 – Simple LUT Structure
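To make the LUT and register description more concrete, the following is a minimal behavioural sketch in Python, not a description of any real Xilinx slice. It assumes a 4-input LUT stored as a 16-entry truth table, a single flip-flop, and an output select; the names `make_lut4` and `LogicSlice` are hypothetical and chosen only for illustration.

```python
def make_lut4(truth_table):
    """Return a 4-input LUT function backed by a 16-entry truth table."""
    if len(truth_table) != 16 or any(bit not in (0, 1) for bit in truth_table):
        raise ValueError("a 4-input LUT needs exactly 16 entries of 0 or 1")

    def lut(a, b, c, d):
        # The four inputs form the address into the table (A is the LSB here).
        index = a | (b << 1) | (c << 2) | (d << 3)
        return truth_table[index]

    return lut


class LogicSlice:
    """Hypothetical slice: one LUT, one flip-flop, and an output select."""

    def __init__(self, truth_table, registered):
        self.lut = make_lut4(truth_table)
        self.registered = registered  # True selects the flip-flop output
        self.q = 0                    # flip-flop state, reset to 0

    def clock(self, a, b, c, d):
        """Return the output seen this cycle, then capture the LUT value."""
        comb = self.lut(a, b, c, d)
        out = self.q if self.registered else comb
        self.q = comb  # the flip-flop stores the LUT output for the next cycle
        return out


# "Program" the LUT as a 4-input XOR: entry i holds the parity of index i.
xor4 = [bin(i).count("1") % 2 for i in range(16)]
comb_slice = LogicSlice(xor4, registered=False)  # combinatorial output
reg_slice = LogicSlice(xor4, registered=True)    # output delayed by one clock
```

Changing only the truth table changes the logic function, which is the sense in which the LUT is programmable; the `registered` flag plays the role of the output select shown in the figure.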
Figure 2 – Configurable routing blocks and interconnection using switch matrixes.
Within the FPGA device, it is common to group together two slices to form a Configurable Logic Block (CLB). These CLBs are interconnected to implement the necessary functionality using routing and switching matrixes as shown in Figure 2.

FPGA Design

FPGAs are normally designed using a hardware description language (HDL) with the two most common being Verilog and VHDL. These languages describe the design at a much lower level than traditional software languages by describing the register level transfer of the design (e.g. implementing state machines, counters etc.). Both VHDL and Verilog inherently support the concept of concurrency, which is needed to model the parallel architecture of the FPGA. It is also increasingly common to develop FPGA IP blocks using high level synthesis (HLS) with languages such as C, C++ or OpenCL. While these languages do not inherently support parallelism, compiler directives can be used by the engineer to indicate parallel structures.

Using a higher level language provides the engineer with a much faster development and verification cycle.

The IO structures of the FPGA devices enable direct interfacing with a range of IO standards. This ranges from single-ended standards such as LVCMOS to differential standards like LVDS and TMDS. This interfacing does not stop there, however. Modern IO structures are able to provide on-chip termination, fine delays, and even serial de-serializer structures. This means FPGAs offer any-to-any interfacing and are able to interface with any standard, bespoke or legacy interface. This flexibility also frees system designers from becoming pin bound, as can happen when using Application Specific Standard Products (ASSP).

Generation of a programmable logic design solution therefore requires the following stages:

• Synthesis – Translates the HDL design into a series of logic equations which are then mapped onto the resources available in the target FPGA.
• Place – The logic resources determined by the synthesis tool are placed at available locations within the target device.
• Routing – The placed logic resources in the design are interconnected using routing and switch matrixes to implement the final application.
• Bit File – The generation of the final programming file for the target FPGA.

Simulation is used to ensure that the implemented design functions in accordance with the design requirements. Engineers create test benches which stimulate the inputs of the Register-Transfer Level (RTL) module and monitor the resulting outputs. The behavior of the module can be verified by viewing the simulation waveform as shown in Figure 3, or alternatively by writing a more complex test bench which can check and verify the outputs.

Figure 3 – RTL Simulation Output

While FPGAs offer significant performance and interfacing benefits, development of FPGA-based solutions could be considered more complex than traditional software development. However, modern design tools, especially high-level synthesis, coupled with the availability of a range of freely available IP and the capabilities of modern devices, mean this is not the case.

Device families

If you are unfamiliar with the history of FPGAs, they were invented by Ross Freeman and Bernard Vonderschmitt in 1985 with the release of the XC2064. This first FPGA had 64 configurable logic blocks. Today’s modern Xilinx devices offer the user 8,938,000 system logic cells, 3840 DSP elements, 76 Mb of Block RAM (BRAM), and 90 Mb of UltraRAM. This is quite a capability step up from the original offering.

Of course, the device presented above is currently the largest Xilinx FPGA, which would be overkill for many applications. To help guide engineers in selecting a suitable FPGA for their application, Xilinx offers a range of FPGA and System on Chip devices capable of supporting a wide array of solutions across several different families.

The cost-optimized portfolio developed around the 28 nm node provides three different device families, each optimized for different user needs.

• Spartan-7 FPGAs – The Spartan-7 FPGA family is the successor to the extremely popular Spartan-6 range of devices and offers developers increased performance and lower power than the older 45 nm technology node. The Spartan-7 devices are I/O optimized, offering the highest pin count within the cost-optimized portfolio.
• Artix-7 FPGAs – A new family for the Xilinx 7 series which is transceiver optimized, offering 6.6 Gbps transceivers.
• Zynq-7000 SoCs – A revolutionary family when first debuted, Zynq-7000 SoCs introduced a new class of devices which combine hard core Arm Cortex-A9 processors with FPGA fabric. This enables a new class of device which can provide integrated system solutions, along with the associated benefits of integration including reduced power consumption, a smaller overall solution, and significantly reduced EMI.

Devices within this portfolio can support a range of applications from sensor fusion, to precision control, image processing, and cloud computing.

High end solutions

For ultra-high performance and more specialized applications, Xilinx provides the Kintex and Virtex families across three technology nodes at 28nm, 20nm and 16nm. This progression of devices provides significant increases in performance and capability with the UltraScale and UltraScale+ family of devices.

Kintex devices offer increasing performance, logic resources, and transceivers across the three technology nodes. From 65,500 logic cells in the Kintex devices to 1,143,000 in Kintex UltraScale+ devices, they offer both GTH and GTY transceivers which support data rates at up to 16.3 Gbps and 32.75 Gbps respectively.

The highest-performance FPGAs are within the Virtex family. These devices provide not only logic resources of up to 8,938,000 system logic cells and transceivers capable of operating at 58 Gbps, but also support for high bandwidth memory (HBM). These devices provide between 4GB and 16GB of on-chip DRAM, with up to 460 GB/s bandwidth or approximately 20 times more than provided by a DDR4 DIMM. Virtex HBM devices are used to accelerate network and storage applications.

Toolchains

All devices from the smallest Spartan-7 to the largest Virtex UltraScale+ are supported by Xilinx development tools. These development tools cover every aspect of the design life cycle from RTL capture, to simulation, and software development for use with processor cores.

• Vivado Design Suite – Vivado enables the capture of the design and RTL simulation, along with the implementation process of synthesis, place and route, and bit file generation.
• Vivado HLS – High-level synthesis which enables IP development using C or C++.
• Vitis Unified Software Platform – Vitis enables software development for embedded processors and also enables acceleration using OpenCL.
• PetaLinux Tools – PetaLinux is an embedded Linux solution for embedded processors.

This technology stack enables us to develop solutions for both traditional FPGAs and heterogeneous systems on chip which combine programmable logic with high-performance Arm processor cores. Interestingly, as we will see, this technology stack enables implementation using traditional hardware description language (HDL) capture or a higher-level system optimizing compiler approach, depending upon the user’s desired entry point.

The inter-relationship between the design tools can be seen below in Figure 4. Each element of the technology stack provides a specific capability.

Vivado Design Suite

The lowest level of the technology stack is Vivado. Vivado enables us to capture designs using VHDL or Verilog as well as synthesise the HDL design to the target device before placing and routing and generating the programming file.

At the heart of Vivado is the IP integrator, which enables designers to capture designs quickly and easily using IP provided by Xilinx, third parties or custom developed. This IP can be defined using HDL or, alternatively, a higher-level approach can be used with Vitis HLS, which enables the development of IP blocks using C and C++.

While implementation is Vivado’s focus, Vivado offers a complete development ecosystem and provides several different capabilities which aid the overall programmable logic development.

One of the key features of any design is being able to guarantee functional performance of the HDL prior to implementation. To verify the HDL functionality, Vivado includes an HDL simulator which enables the developer to stimulate the HDL. Depending upon the stage of implementation, the test bench can be applied against the RTL, the synthesized netlist, or the implemented netlist with associated timing information.

It is through Vivado that we can also debug designs on the hardware thanks to its support for integrated logic analysers (ILA), Virtual IO (VIO), and JTAG to AXI IP. This allows the designer to instrument the programmable logic design and observe behaviour at run time in the actual system.

Both Vivado’s implementation and simulation capabilities are used by higher-level tools in the development stack as we will see.
Figure 4 – Xilinx Technology Stack (layers shown: AI / ML Solution Implementation; Application Prototyping and Rapid Development; Embedded and Accelerated Software Development; Design Capture and System Definition and Implementation)
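The idea of a self-checking test bench, which stimulates a module and verifies its outputs rather than relying on waveform inspection, can be sketched in software. The example below is only an analogy written in Python, not HDL and not a Vivado flow: it assumes a hypothetical 4-bit counter with synchronous reset and enable as the design under test, and the names `Counter` and `run_test_bench` are illustrative.

```python
class Counter:
    """Behavioural model of a 4-bit up-counter (hypothetical design under test)."""

    WIDTH = 4

    def __init__(self):
        self.count = 0

    def clock(self, reset, enable):
        """Advance one clock cycle and return the registered count."""
        if reset:
            self.count = 0
        elif enable:
            # Wrap around at 2**WIDTH, as a fixed-width register would.
            self.count = (self.count + 1) % (1 << self.WIDTH)
        return self.count


def run_test_bench(dut, stimulus, expected):
    """Drive the DUT cycle by cycle and collect (cycle, got, want) mismatches."""
    mismatches = []
    for cycle, ((reset, enable), want) in enumerate(zip(stimulus, expected)):
        got = dut.clock(reset, enable)
        if got != want:
            mismatches.append((cycle, got, want))
    return mismatches


# One reset cycle, seventeen enabled cycles (forcing a wrap), one idle cycle.
stimulus = [(1, 0)] + [(0, 1)] * 17 + [(0, 0)]
expected = [0] + [n % 16 for n in range(1, 18)] + [1]

errors = run_test_bench(Counter(), stimulus, expected)
print("PASS" if not errors else f"FAIL: {errors}")
```

An HDL test bench follows the same pattern: apply stimulus on each clock, compare against expected values, and report any cycle where they differ.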
Application Software Creation

If we are working with a heterogeneous SoC or FPGA which contains a soft processor core such as a MicroBlaze, we will need to work with higher levels of the stack to create the operating system and application software. The relationship between Vivado, PetaLinux and Vitis can be seen below in Figure 5.

Both hard and softcore processor implementations require the generation of operating systems and development of application software. This software can be developed to run on a bare-metal, real-time operating system (e.g. FreeRTOS) or embedded Linux solution.

This application software can be developed and debugged using Xilinx’s Vitis unified software platform. Vitis provides two development flows: embedded and accelerated. With the embedded flow, developers can create software solutions for any supported embedded processor, while the accelerated flow enables users of Xilinx SoCs and acceleration cards to accelerate functionality from the processor system to the programmable logic.

When using the embedded flow, Vitis developers can import Vivado designs to create embedded solutions. To aid developers in the debugging of the embedded systems, Vitis includes a complete debugger which is capable of supporting multi-core operation. As the software application will include operations with elements of the programmable logic design, it is also possible to cross probe and breakpoint with ILAs inside the programmable logic to get a greater understanding of the hardware / software interaction.

If we desire an embedded Linux solution, developers utilize the PetaLinux build tool. PetaLinux is not a Linux distribution but a build system. As such, PetaLinux allows developers to configure an embedded Linux system which includes the drivers and device tree entries to work with the programmable logic design.

Once a PetaLinux operating system is available for a Xilinx SoC, the developer can deploy the PYNQ framework, which enables Python and Jupyter Notebooks to be used with programmable logic. PYNQ is perfect for rapid prototyping because the Python environment contains many drivers to work with IP in the programmable logic, thereby reducing the software development required.

To access higher levels of the technology stack such as the Vitis acceleration capabilities and Vitis AI, a hard-core processor and embedded Linux solution or an Alveo acceleration card is required.

Acceleration

Vitis also provides an entry point for solutions which use both the processing system and the programmable logic, thanks to its ability to accelerate applications into the programmable logic using OpenCL.

To make use of this acceleration capability, an acceleration platform must be available to Vitis. Acceleration platforms can be created in Vivado for custom hardware and, alternatively, they are available for download for many development boards.

Several system resources such as AXI interfaces, processor interrupts, and clocks are available to the Vitis compiler through the acceleration platform. The accelerated elements of the design are connected to these resources to enable access from the processing system.

To perform this acceleration, Vitis uses the OpenCL framework. OpenCL is an open industry standard developed by The Khronos Group to enable parallel computing on heterogeneous systems. The OpenCL model consists of a host, typically an x86 system, and an OpenCL device often called a kernel. While the host manages the overall application execution, the OpenCL device provides the parallel processing and application acceleration. This parallel processing can be provided by an FPGA, DSP, GPU or CPU. Using OpenCL enables the source code that is running on the OpenCL device to be retargeted without any source code changes.

To support this model, OpenCL provides OpenCL APIs which can be compiled into the host application using standard C/C++ compilers such as GCC or G++. When it comes to the OpenCL device, kernels are developed using the OpenCL C language. The OpenCL C language is derived from ISO C99 but it is not C, because the standard libraries are difficult to support across all the potential OpenCL devices.

This means that each OpenCL device has a specific OpenCL compiler provided by the vendor. In the case of Xilinx devices, this is provided by Vitis.

In the Xilinx ecosystem, these heterogeneous systems comprise the following architectures:

• Arm processing cores and the programmable logic connected using AXI interconnects
• x86 processor and Alveo acceleration card connected over PCIe
Figure 5 - relationship between Vivado, PetaLinux and Vitis
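The host and device split of the OpenCL model described above can be illustrated with a small emulation. This is a plain Python sketch of the execution model only; it does not use the OpenCL API or any Xilinx tooling, and the names `vector_add_kernel` and `enqueue_nd_range` are hypothetical. The kernel body describes the work for a single work-item, and the host decides how many work-items to launch.

```python
def vector_add_kernel(gid, a, b, result):
    """Kernel body: each work-item handles the single element at index gid."""
    result[gid] = a[gid] + b[gid]


def enqueue_nd_range(kernel, global_size, *buffers):
    """Host-side dispatch of global_size work-items.

    A real OpenCL device would execute the work-items in parallel; this
    loop only emulates that behaviour sequentially.
    """
    for gid in range(global_size):
        kernel(gid, *buffers)


# Host code: prepare buffers, launch the kernel, then read back the result.
a = list(range(8))
b = [10 * x for x in a]
result = [0] * len(a)
enqueue_nd_range(vector_add_kernel, len(a), a, b, result)
print(result)
```

Because the kernel has no dependency between work-items, the same source could in principle be mapped to a CPU, a GPU, or parallel hardware in programmable logic, which is the retargeting property the text refers to.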
Both architectures enable the processing element to act as the host while the programmable logic, either on the same silicon or in a separate device, is the OpenCL device.

Implementation of the programmable logic design will call Vivado in the background, but of course this can take time to implement. Vitis also offers different build types for use in developing and verifying the application before generating the final bit-stream. These are software emulation, which enables correction of typos and basic errors, and hardware emulation, which leverages QEMU and co-simulation for the programmable logic design. The hardware emulation enables us to optimize performance, interfacing, and resources as shown in Figure 6.

To support application development, Vitis also provides several common libraries which can be accelerated into the programmable logic and several domain-specific libraries for implementation of specific applications. Included in the common libraries are maths, linear algebra, and DSP libraries, while the domain-specific libraries include computer vision, database, quantitative finance, and security. These libraries enable developers to leverage programmable logic performance without needing to start from scratch writing commonly used and freely available software functions to be accelerated.

AI and Beyond

Building on top of the Vitis acceleration capabilities is Vitis AI. Using Vitis, the developer can instantiate a deep learning processor unit (DPU) in the programmable logic. This instantiation then provides significant acceleration for machine learning inference applications.

Of course, instantiating the DPU within the programmable logic provides significant benefits to the developer. However, Vitis AI goes one step further and enables developers to work with common machine learning and artificial intelligence frameworks such as Caffe, TensorFlow and PyTorch.

Rather than being a single tool, Vitis AI is a collection of tools which allow developers to compile, quantize, and optimize their machine learning applications from a floating-point implementation to a fixed-point representation suitable for implementation within the DPU.

To aid the development and optimization of the DPU implementation, Vitis AI also provides a profiler to enable system-level optimization.

Of course, if you do not wish to start from scratch, there are also pre-optimized models available in the Xilinx Model Zoo.

There are also other commercial and open source software tools which can be used as part of the FPGA development flow. These tools range from synthesis to simulation and an increasingly large number of verification tools that support both simulation and formal verification.

Embedded Processing

Programmable logic in FPGAs is ideal for implementing parallel processing structures for functions like finite impulse response filters, image processing pipelines, and motor control algorithms. However, there are times when serial processing is required. Implementing communication protocols, graphical user interfaces, or control, configuration, and status reporting of IP blocks are good examples of this. Of course, serial processing is also critical when we wish to work with higher-level open source frameworks and languages like TensorFlow, OpenCV and Python.

Luckily in the programmable logic world, we have several options which can be used to implement embedded processors with programmable logic devices. At the highest level, these can be classified into two distinct groups:

• Heterogeneous System-on-Chip: These devices combine programmable logic with a processing system. In these heterogeneous system-on-chip solutions, the processing solution is hard in the silicon of the device and as such offers great performance, although the processing solution then provides only limited flexibility for configuration.
• Soft-Core Embedded Processors: Soft-core processors are implemented using the programmable logic resources (e.g. the flip-flops (FF), look-up tables and BRAMs). As these soft-core processors are implemented using the logic resources, they are more configurable but typically offer lower performance.
Figure 6 - Vitis Emulation Flow
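The move from a floating-point model to a fixed-point representation, mentioned in the Vitis AI discussion, can be illustrated with a generic quantization sketch. This is a textbook symmetric linear scheme written in Python, assumed here purely for illustration; it is not the algorithm used by the Vitis AI quantizer, and the function names are hypothetical.

```python
def quantize(values, bits=8):
    """Symmetric linear quantization to signed integers.

    Returns (ints, scale) such that each value is approximately int * scale.
    """
    qmax = (1 << (bits - 1)) - 1               # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0       # avoid dividing by zero
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def dequantize(ints, scale):
    """Map the integers back to approximate floating-point values."""
    return [i * scale for i in ints]


# Hypothetical layer weights, chosen only to exercise the functions.
weights = [0.4, -1.0, 0.25, 0.0]
q_weights, scale = quantize(weights)
restored = dequantize(q_weights, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, worst_error)
```

The integers are what a fixed-point datapath would store and multiply; the rounding error per value is bounded by half of `scale`, which is why the dynamic range of each layer matters when choosing the bit width.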
Figure 7 - Zynq MPSoC Processing System Interfacing with PL Image Processing Chain
Both heterogeneous SoC and soft-core embedded solutions have a range of use cases across several exciting applications as we will see. It is not unusual to implement additional soft-core processors in the programmable logic of heterogeneous SoCs to create a Big-Little approach, enabling off-loading of time-critical tasks.

Embedded Processors

In the Xilinx suite of devices, embedded processors are provided in the Zynq-7000 SoC and Zynq MPSoC product lines. These devices offer true heterogeneous processing systems on the same silicon. Architecturally in these devices, the processor system boots first like a traditional processor and then configures the programmable logic.

The Zynq-7000 SoC was the first to be introduced and offers dual or single-core 32-bit Arm Cortex-A9 processors combined with programmable logic. As would be expected, the processing system provides peripherals used for both volatile and non-volatile memory along with several interfacing peripherals such as Ethernet, UART and CAN.

To support high-performance applications, each Cortex-A9 core also includes a floating-point unit and a NEON engine. The NEON engine allows processing of large data sets in parallel using a single instruction against multiple data (SIMD). This is especially useful for applications like image and audio processing, where algorithms require data sets to be processed using simple instructions (e.g. multiply and add) multiple times with little control code. In applications such as this, leveraging the SIMD unit can result in a significant increase in performance.

Data transfer between the processing system and the programmable logic is implemented using several Advanced eXtensible Interfaces (AXI). Using this interface, either the processor system or the programmable logic can be the initiator of the transaction. This allows easy transfer of data to and from the processor system DDR memory.

This combination of processing system and programmable logic makes the Zynq-7000 series excellent for implementing applications which require both serial and parallel processing (for example image processing, robotics, and augmented reality). To ease connectivity and leverage the large support of frameworks and applications, embedded Linux solutions can be deployed on the processing system, while the programmable logic accelerates key elements or algorithms. This combination of PS and PL provides for a more responsive and deterministic solution. In the table below, you will find a simple demonstration implementing AES encryption.

Operating System                     Linux
Processor System Clocks              36662
PS Clocks with Programmable Logic    15644
Reduction in Processing Time         54.8%

With the evolution of the Zynq-7000 SoC into the next-generation Zynq MPSoC, a significant step change in processing capabilities was introduced along with the latest logic fabric. For the first time, heterogeneous processors were introduced within the processing system, enabling the developer to address several challenges within the same device.

The processing system within the Zynq MPSoC contains the following processor cores:

• Application Processing Unit – Consists of quad or dual 64-bit Arm Cortex-A53 processors
• Real Time Processing Unit – Dual lockstep 32-bit Arm Cortex-R5 processors
• Platform Management Unit – Silicon implementation of a Triple Modular Redundant 32-bit MicroBlaze processor
• Graphics Processor Unit – Arm Mali-400 MP GPU

In addition to the four processing groups which can be programmed by the developer, the MPSoC processing system also contains a configuration security processor to implement safety and security processing and security event responses.

This diverse range of processing solutions enables the creation of single-chip solutions for many applications (e.g. automotive) where both high-level algorithms and user interfaces can be implemented using the APU and GPU, while real-time control and interfacing with the vehicle controls can be handled by the real-time processing unit (RPU), which is designed and certified for ISO 26262 or IEC 61508 applications.

Communication between the PS and PL again uses AXI interfaces. However, in place of 32-bit interfaces, 128-bit interfaces are provided, significantly increasing the throughput between PS and PL. This high-bandwidth capability enables the implementation of high-performance vision-based machine learning applications as often deployed in automotive applications or other edge-based solutions. As such, the embedded processing solution offered by the Zynq-7000 SoC and Zynq MPSoC class of devices provides the highest performance processor systems.

An example image processing application which transfers image data between the processor system and the programmable logic to implement the desired algorithm can be seen in Figure 7.

Soft-Core Processors

The options for soft-core processors in the Xilinx ecosystem are unlimited. Any processor which is described in RTL can be implemented using the FFs, LUTs, and RAMs of Xilinx FPGAs.

However, these are the most deployed soft-core processors within Xilinx FPGAs:

• MicroBlaze – a 32-bit processor which can be used in a range of configurations from controller to full MMU support capable of running embedded Linux.
• Arm Cortex-M1 – a 32-bit FPGA implementation of the popular Cortex-M0 which offers a small logic footprint and great code density thanks to the Thumb instruction set.
• Arm Cortex-M3 – a 32-bit implementation of the Cortex-M3 processors which offer full blown MMU and OS support along with good code density thanks to Thumb/Thumb2 instruction set support. This processor is popular in Internet of Things applications.
• RISC-V – an open source 32/64/128-bit instruction set. RISC-V compliant implementations are available from several IP vendors for implementation in Xilinx FPGAs. Of course, RISC-V is highly customizable and, like MicroBlaze, can also run embedded Linux and other operating systems.

While the processing performance of a soft-core processor is below that of a hard silicon instantiation, the configurability and flexibility of the solution is higher, which allows for a much more customized solution to be implemented. This also allows for the processor to be portable across several devices or even vendors depending upon the processor selection.

Like the hard silicon processors, again AXI is a popular interface for peripheral connection. However, in the soft-core processor world, these peripherals also include DDR memory interfaces, UARTs, and other common processor interfaces such as I2C and SPI. An example of this can be seen in Figure 8, which uses a MicroBlaze processor to configure and control a high-speed image processing pipeline.

So how do we as engineers decide between the implementation of a hard or soft-core processor? Of course, performance is one of the main factors, but others include application, flexibility, security, resource availability, portability, and licensing. Of course, each application will have different weightings on these factors as engineers determine what is best for their solution.

To compare the processor capabilities, we need to be able to compare processing power. We achieve this using a benchmark called Dhrystone MIPS (DMIPS), or Millions of Instructions Per Second. By comparing the potential hard and soft cores in the table below, it is apparent the embedded processors win due to their higher clock speeds.

Processor     DMIPS/MHz      Comment
Cortex-A53    2.3            Quad or dual processors
Cortex-A9     2.3            Dual or single
Cortex-R5     1.67           Dual or lockstep
MicroBlaze    1.04 – 1.31
Cortex-M1     0.8
Cortex-M3     1.25
RISC-V        1.7            Depends on implementation

The application is equally important. If the processor core is required to only configure IP within the processing system or implement serial communication protocols, then a soft-core processor is often a better selection. However, for high performance algorithms which need significant processing capabilities, hard-core processors definitely offer a performance advantage.

Security can also be a major consideration, especially for edge applications, as hard processing solutions like the Zynq MPSoC provide inbuilt security solutions, e.g. the Configuration Security Unit, Secure Boot and Arm TrustZone. When using soft-core processors, the security protections also need to be included in the programmable logic.

One major difference between hard- and soft-core processors is in the configuration. In a hard-core system the processor is the master: it boots first and configures the programmable logic as desired. This enables several different power saving modes to be implemented by the processor, e.g. powering down a processor core, powering off peripherals or even powering down the entire programmable logic.

When a soft-core processor is used, the FPGA must first be configured to instantiate the soft-core processor; only then can the processor begin operation. This limits the ability of the soft-core processor to implement power down / power saving schemes, although there is still much which can be achieved by lowering clock frequencies.
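The point made with the Dhrystone table, that per-MHz efficiency alone does not decide the comparison, can be shown with a short calculation: total DMIPS is DMIPS/MHz multiplied by the clock frequency. The sketch below takes the DMIPS/MHz figures from the table (using the upper MicroBlaze value), while the clock frequencies are hypothetical placeholders chosen only for illustration and are not taken from the source or from any datasheet.

```python
# DMIPS/MHz figures as listed in the comparison table.
DMIPS_PER_MHZ = {
    "Cortex-A53": 2.3,
    "Cortex-A9": 2.3,
    "Cortex-R5": 1.67,
    "MicroBlaze": 1.31,
    "Cortex-M1": 0.8,
    "Cortex-M3": 1.25,
    "RISC-V": 1.7,
}

# ASSUMPTION: illustrative clock rates in MHz, not vendor specifications.
ASSUMED_CLOCK_MHZ = {
    "Cortex-A53": 1200,
    "MicroBlaze": 200,
    "RISC-V": 150,
}


def dmips(core, clock_mhz):
    """Total Dhrystone MIPS for one core at the given clock frequency."""
    return DMIPS_PER_MHZ[core] * clock_mhz


for core, mhz in ASSUMED_CLOCK_MHZ.items():
    print(f"{core}: {dmips(core, mhz):.0f} DMIPS at an assumed {mhz} MHz")
```

With these assumed clocks the hard Cortex-A53 comes out well ahead of the soft cores even though the RISC-V figure per MHz is not far behind, which is the effect of clock speed the text describes.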
Figure 8 - MicroBlaze Image Processing Application
Figure 9 – Big-Little Approach with the Zynq MPSoC and Arm Cortex-M3
Big - Little Approach

However, increasingly you will often see applications which use both hard processors and soft processors in the same solution as demonstrated in Figure 9. Such a Big-Little approach enables the high-performance application processor to focus on the high-level application while real-time applications such as sensor interfacing and motor control are offloaded to the soft-core processor in the programmable logic. This approach provides a more responsive solution than just using an application processor.

Correctly architecting the Big-Little interfacing also enables the main application to be updated as required without the requirement to change the code executing within the little processor. This also means sensor changes and updates can be addressed simply by updating the code running on the soft-core processor.

Communicating Internally and Externally

One of the genuinely great things about FPGAs, outside of their truly parallel nature and the ability to have heterogeneous systems, is their ‘any-to-any’ interfacing capability.

What this means is, with the right PHY, programmable logic can provide interfacing to many different industry standards, legacy interfaces, and bespoke interfaces. And thanks to the flexibility of the different IO cells provided in Xilinx programmable logic, an external PHY is often not needed.

However, as programmable logic designers, it’s not just the external interfacing we need to consider. We also need to understand the efficient methods of moving data around within the programmable logic once it is received from the IO cells.

Let us start our journey of connectivity by examining the different IO structures that exist for the Xilinx UltraScale and UltraScale+ families. They include the Kintex and Virtex UltraScale families as well as the Kintex, Virtex and Zynq UltraScale+ families.

The UltraScale and UltraScale+ devices offer three different IO classes, excluding gigabit transceivers.

• High Performance (HP) – optimised for high-speed interfaces such as memory and chip-to-chip. The maximum operating voltage for a supported IO standard is 1v8.
• High Range (HR) – supports interfaces that operate at up to 3v3.
• High Density (HD) – supporting interfaces that operate at up to 3v3 at data rates of up to 250 Mbps.

UltraScale devices provide a mix of HP and HR IO banks, while the UltraScale+ devices provide a mix of HP and HD IO banks.

Each of these IO classes can support many different IO standards: from simple single-ended CMOS signals to LVDS and HSTL. At the system level, this has a significant impact on integration. The wide range of support for a variety of IO standards reduces the need for external PHYs, thereby reducing component count, crucial PCB area, and power.

However, it is not just the obvious benefits that programmable logic IO provides. The IO structures have many features that ease the implementation at the system level beyond board area and power. The flexible IO structures enable the drive strength and slew rate to be controlled, allowing the signal integrity and EMI emissions of signals on the board to be carefully crafted for optimal performance. They also support on-chip termination schemes implemented by a Digitally Controlled Impedance (DCI).

The IO cells support alignment between high-speed signal traces, offering precision timing adjustments using the IDELAY input and ODELAY output delay elements, while the ISERDES and OSERDES structures support conversion between serial and parallel data, which is especially useful for aligning data patterns on high-speed interfaces as shown in Figure 10.

While the HR, HP, and HD IO classes provide significant interfacing capabilities and enable significant data movement on and off chip, the highest interfacing capacities are provided by the gigabit transceivers.
Figure 10 – Camera Link reception and alignment using ISERDES
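The serial-to-parallel conversion and word alignment performed around an ISERDES can be illustrated with a small behavioural model. This is a software sketch only, not the primitive's actual interface: the function names are made up for this example, the 7-bit word width mirrors the 7:1 serialization used by Camera Link, and the training word is an assumed known pattern. A real design shifts the word boundary inside the deserializer rather than discarding bits from a list.

```python
def deserialize(bits, width, offset=0):
    """Group a serial bit sequence into parallel words (first bit is the MSB)."""
    usable = bits[offset:]
    words = []
    for start in range(0, len(usable) - width + 1, width):
        word = 0
        for bit in usable[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words


def find_alignment(bits, width, training_word):
    """Try each bit offset until every recovered word matches the training word."""
    for offset in range(width):
        words = deserialize(bits, width, offset)
        if words and all(word == training_word for word in words):
            return offset
    return None  # no offset produced the expected pattern


# Assumed training pattern, repeated, with the first 3 bits lost so the
# receiver starts part-way through a word.
PATTERN = [1, 1, 0, 0, 0, 1, 1]
stream = (PATTERN * 6)[3:]

offset = find_alignment(stream, 7, 0b1100011)
aligned_words = deserialize(stream, 7, offset)
print(offset, [format(word, "07b") for word in aligned_words])
```

Once the offset is found, every subsequent word falls on the correct boundary, which is the role bit-slip style alignment plays on a high-speed source-synchronous link.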
These transceivers enable the ultra-fast transfer of data, allowing the programmable logic to work with some of the fastest serial interface standards such as PCIe, SATA, 100G Ethernet, SDI, JESD204A/B, USB 3.0 and DisplayPort.

Xilinx identifies transceivers as GTx, where x indicates the specific standards. UltraScale and UltraScale+ devices provide for data rates between 6 Gbps (GTR) and 58 Gbps (GTM). The specific mix of GTx depends upon the family. Across the UltraScale and UltraScale+ families of devices, this means we get an excellent range of ultrafast IO and, consequently, a significant peak bandwidth.

Family | GTx | Gbps
Virtex UltraScale+ | GTY/GTM | 32.75/58.0
Kintex UltraScale+ | GTH/GTY | 16.3/32.75
Zynq UltraScale+ | GTR/GTH/GTY | 6.0/16.3/32.75
Virtex UltraScale | GTH/GTY | 16.3/30.5
Kintex UltraScale | GTH | 16.3

Of course, being able to interface with such data volumes provided by the HD, HP, HR, and GTx interfacing capabilities means we need to be just as efficient, if not more so, within the device.

Internal Data Movement

A key internal feature of the programmable logic device is the ability to move data: between IP blocks, even between programmable logic and processing systems, if desired. Within the Xilinx environment, the primary protocol used for data movement is the Advanced eXtensible Interface (AXI): a subset of the Arm AMBA bus, developed specifically to support implementation in programmable logic.

To provide scalability for different use cases, AXI itself offers three different interfacing standards.

• AXI Full / Memory Map – A higher-performance memory-mapped interface that supports independent read and write channels. Both channels enable bursts to optimize throughput. In programmable logic designs, AXI Full is often used to implement direct memory transfers between the programmable logic and an external DDR memory, for example.
• AXI Lite – A stripped down version of AXI Full to provide a memory-mapped interface which can be used for configuration and control of IP blocks. AXI Lite does not support burst accesses.
• AXI Stream – A unidirectional stream of data from a producer to a consumer. This stream is point-to-point and contains no addressing information.

Both AXI Full and AXI Lite consist of independent read and write channels. Of course, the complexity of the channel varies between the two flavours. The read channel consists of two sub-channels: the read address and control sub-channel and the read data and response sub-channel. The write channel consists of three sub-channels: the write address, write data, and write response sub-channels.

AXI Stream interfaces are commonly used to transfer information from a single producer to a consumer, typically between IP blocks as part of a processing chain. Example processing chains could be image processing or signal processing, where the signal is received and processed by each IP block before being passed on to another.

When AXI interfaces are implemented in programmable logic, we can benefit from wide data bus widths to increase the bandwidth. Data bus widths can vary between 32 and 256 bits which, assuming a clock of 400 MHz, gives data rates between 12.8 Gbps and roughly 100 Gbps (102.4 Gbps for a 256-bit bus). This arrangement of AXI Full, AXI Lite, and AXI Stream is shown in Figure 11.

AXI interfaces can be locked down using Arm TrustZone software to support design security when working with Zynq MPSoC UltraScale+ devices. This capability is increasingly important both in the cloud and at the edge. Separating the secure and non-secure software worlds in this way prevents lower security, higher-risk applications from being able to access registers and peripherals defined as secure.

When we are working with heterogeneous SoCs like the Zynq MPSoC UltraScale+ device, system-level cache coherency becomes increasingly important. AXI also provides the additional sideband signals to provide IO cache coherence and complete cache coherence with the ACE, ACP and HP(C) ports available on the Zynq MPSoC processing system.
Figure 11 - AXI Connections in Zynq UltraScale+
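The AXI data rates quoted in this section follow directly from multiplying the data bus width by the clock frequency. The short sketch below works through that arithmetic for the 400 MHz clock assumed in the text. The function name is illustrative, and the result is a raw peak figure that ignores protocol overhead and back-pressure.

```python
def axi_peak_rate_gbps(bus_width_bits, clock_mhz):
    """Peak data rate in Gbps when one data beat is transferred every clock."""
    return bus_width_bits * clock_mhz / 1000.0


CLOCK_MHZ = 400  # clock frequency assumed in the text

for width in (32, 64, 128, 256):
    rate = axi_peak_rate_gbps(width, CLOCK_MHZ)
    print(f"{width:>3}-bit bus at {CLOCK_MHZ} MHz: {rate:.1f} Gbps")
```

A 32-bit bus gives 12.8 Gbps and a 256-bit bus gives 102.4 Gbps, which is the "roughly 100 Gbps" upper figure.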
As implemented within programmable logic, AXI is a point-to-point protocol between a producer (master) and a consumer (slave). To enable a single producer to connect with several consumers, the developer can make use of a smart interconnect, which enables multiple consumer (slave) connections to a single producer (master). This is useful when using a single AXI Lite interface to configure several IP blocks, for example. Smart interconnects are also capable of providing clock domain crossing if required.
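Conceptually, an interconnect that fans one memory-mapped master out to several slaves decodes each address against a map of slave regions. The sketch below models only that decode step. The region names, base addresses and sizes are hypothetical values chosen for illustration, and it does not represent the actual Xilinx interconnect implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveRegion:
    name: str
    base: int   # first byte address owned by the slave
    size: int   # size of the region in bytes


def decode(address, regions):
    """Return (slave name, offset within slave), or None for an unmapped address."""
    for region in regions:
        if region.base <= address < region.base + region.size:
            return region.name, address - region.base
    return None  # a real interconnect would answer with a decode error


# Hypothetical address map: one AXI Lite master configuring three IP blocks.
ADDRESS_MAP = [
    SlaveRegion("video_timing_ctrl", 0x4000_0000, 0x1_0000),
    SlaveRegion("demosaic", 0x4001_0000, 0x1_0000),
    SlaveRegion("frame_buffer_write", 0x4002_0000, 0x1_0000),
]

print(decode(0x4001_0010, ADDRESS_MAP))
print(decode(0x5000_0000, ADDRESS_MAP))
```

Keeping each IP block in its own aligned region means software needs only a base address per block, with register offsets counted from zero inside it.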

The use of standard interfaces based on AXI enables a much larger ecosystem of IP cores to be created by both Xilinx and third-party partners. Standard interfaces also help when developing custom IP using either RTL or HLS.
Figure 12 – TBS1052C in action debugging an embedded system
Programmable Logic Applications

With its inherent flexibility, it stands to reason that programmable logic is often deployed across a range of applications, offering a more deterministic, high performance and power-efficient solution than traditional microprocessor and Digital Signal Processor applications.

As you might expect, these applications span a wide range of industries and use cases. In this article, we are going to examine four different and diverse use cases which feature programmable logic at the core.

Test Equipment

Developing complex embedded system solutions can be very difficult, and testing and verification of those systems can be even more complex. This often requires the creation of specialized test equipment that is able to provide real-time stimulus, data capture, analysis, and storage. For this reason, the high performance provided by programmable logic is often used to enable Special-to-Type Test Equipment (STTE) to stimulate and capture output data. Such an approach enables real-time samples to be generated and captured for most applications. Using a heterogeneous SoC as part of the STTE enables the programmable logic element to receive stimulus test patterns and report captured data patterns for additional analysis using Gigabit Ethernet links or USB3.

Of course, programmable logic is not limited to deployment in STTE. Traditional FPGAs and heterogeneous SoCs are increasingly found within common lab equipment such as oscilloscopes, logic analysers, and signal generators. One such example of this is the Tektronix TBS1052C shown in Figure 12. This uses a single-core Zynq-7000 SoC to implement signal capture and provides interfacing with 1 GSa/s ADCs
SAE Level | Name | Examples | Vehicle Control | Monitoring | Fall Back Control | Vehicle Capability
0 | No Automation | N/A | Human Driver | Human Driver | Human Driver | N/A
1 | Driver Assistance | Adaptive Cruise Control/Lane Keeping and Parking Assist | Human Driver and Vehicle | Human Driver | Human Driver | Some Driving Modes
2 | Partial Automation | Traffic Jam Assist | Vehicle | Human Driver | Human Driver | Some Driving Modes
3 | Conditional Automation | Full Stop and Go Highway Driving, Self Parking | Vehicle | Vehicle | Human Driver | Some Driving Modes
4 | High Automation | Automated Driving | Vehicle | Vehicle | Vehicle | Some Driving Modes
5 | Full Automation | Driverless Vehicle Operation | Vehicle | Vehicle | Vehicle | All Driving Modes
Table 1. SAE Levels for assisted and automated vehicles
along with advanced triggering capabilities on the digitized data. The Arm processor core contained within the SoC runs the operating system and scope application and provides the USB interfacing capabilities. Such an approach enables a more tightly integrated solution.

Automotive and ADAS

In line with the Society of Automotive Engineers’ (SAE) five-level capability matrix, the automotive world is on a mission to increase the level of assisted and automated driving capabilities in vehicles.

In order to safely interact with the external world, automotive solutions must use several diverse sensor modalities and communications systems, which include the following:

• Vision Systems – Including Infrared
• 4D RADAR
• LIDAR
• Accelerometers
• Global Positioning Systems
• Vehicle-to-Vehicle and Infrastructure Communications

These sensors and systems generate significant data volumes which need to be aggregated and processed before decisions on vehicle actions can be implemented, presenting several challenges to the system designer. With such a diverse range of sensors comes a diverse range of sensor interfaces, ranging from high performance multi-gigabit serial links (e.g. MIPI) to lower-speed interfaces such as SPI and I2C as used by lower-speed sensors. Traditional system-on-chip solutions provide the user with a limited number of fixed-function interfaces of varying types. Using a programmable logic device enables the system designer to leverage the flexible IO structure to implement the specific number and type of interfaces required, removing previous IO limitations.

The parallel structure of programmable logic enables the implementation of parallel image/signal processing chains. This parallel implementation of algorithms provides a more responsive solution, which is required to implement safe interaction with other vehicles and the environment.

Cloud Computing

One of the hottest topics in the programmable logic world is data centre acceleration. Deploying programmable logic in data centres combines large FPGAs with x86 processors connected using PCIe. Such an approach enables the x86 software application to offload highly parallel functions to the FPGA, accelerating the performance of the system. To be able to deploy accelerators in the cloud or on premises, Xilinx offers a range of accelerator cards called Alveo. These cards are designed to interface over PCIe and are programmed using the Vitis unified software platform and OpenCL. This enables the developer to use C/C++ and OpenCL to accelerate algorithms. Typical data centre applications include quantitative finance, database and data analytics, machine learning, and network acceleration.

Industrial

Programmable logic plays a large part in Industry 4.0, where one of the main challenges is to implement a converged network. Converged networks merge the Information Technology (IT) and Operational Technology (OT) networks. Traditionally the IT network is where the Enterprise Resource Planning (ERP) is located, while the OT network contains the sensors and drives used to manufacture the product. Connections between the two rely on gateways, bridges and protocol convertors, which bring challenges of their own and can limit the scalability of the OT network.

As such, the IT and OT networks have different requirements. The IT network needs to be able to access multiple systems and databases etc., while the OT network needs to be real time and deterministic to be able to control its sensors and drives.

One of the increasingly popular solutions is to implement a solution using Time-Sensitive Networking (TSN). The most popular TSN standard is Ethernet, which is defined by several IEEE 802.1 standards. Time awareness across the network is implemented in TSN by allocating scheduled traffic in time-defined slots, while also supporting cyclic data transmission and providing pre-emption for higher priority packets.

Correctly implementing TSN requires a solution which can provide a low latency and deterministic response at TSN end points and switches. This is where Xilinx Zynq SoC and Zynq MPSoC devices come into play because they enable the implementation of TSN ports with programmable logic.

We can implement TSN ports in the Xilinx ecosystem using their TSN Ethernet Endpoint MAC LogiCORE IP within a Zynq-7000 SoC or Zynq UltraScale+ MPSoC. Each device utilises both the Processing System (PS) and the Programmable Logic (PL). The LogiCORE IP consists of FPGA logic for MAC, TSN Bridge, and TSN Endpoint, along with software components for network synchronization, initialization, and interfacing with network configuration controllers for Stream Reservation as defined in P802.1Qcc. The software is designed to run on PetaLinux and will be published as Yocto patches.

The logic IP core provides deterministic behavior in the PL for synchronization (IEEE 802.1AS), scheduled traffic (IEEE 802.1Qbv), and seamless redundancy (P802.1CB) while helping to offload the processing unit. It is also possible to implement an optional integrated time-aware L2 switch to enable either chain or tree topology.

Once implemented, the TSN can be combined with custom applications like motor control or sensor interfacing within the PL to be able to act under the control of the TSN.

Deep Dive Application - Creating Embedded Vision Systems

There is something special about seeing an image that you have created on a display. That display could be demonstrating a simple, transparent image showcasing an
image sensor’s capability. Alternatively, it could be implementing an advanced image processing solution that identifies and classifies objects or tracks movement in the image. Of course, with the correct sensor selection, we can even extend the range of vision beyond the visible range of the EM spectrum into the infrared or X-ray regions of the EM spectrum.

Implementing image processing algorithms is computationally intensive, especially as image resolutions increase beyond HD and move to 4K. A color HD image of 1920 pixels by 1080 lines using a 30-bit pixel requires 3.73 Gbps to be processed to achieve 60 frames per second. Moving to 4K resolution, which has 3840 pixels and 2160 lines with a 30-bit pixel and 60 frames per second, increases this considerably to approximately 14.93 Gbps. Each stage of the image processing algorithm must, therefore, be able to support this data rate to achieve the desired frame rates, even when doing complex calculations on each pixel.

The truly parallel nature of programmable logic provides an ideal technology to implement image processing pipelines. The parallel nature frees the developer from the sequential software world where each stage of the image processing algorithm must be implemented in sequence. In programmable logic, the algorithm’s elements run in parallel, enabling an increased throughput and a more deterministic performance. This can be critical for many image processing applications that use embedded vision to interact safely with the environment. ADAS or vision-guided robotics are two good examples.

Developers can get the best of both worlds by using a heterogeneous SoC such as the Zynq-7000 SoC or Zynq UltraScale+ MPSoC. These devices combine programmable logic with high-performance Arm processors. This provides significant flexibility because the image processing algorithms can be implemented within the programmable logic. While the processing system can provide the image processing algorithm configuration to allow easy adaption to new image sensors or requirements, it can also implement high-level algorithms that take the output from the image processing system.

Elements of a PL Image Processing System

Implementing an image processing system in programmable logic is not as daunting as it first may seem. The image processing pipeline can be broken down into three distinct elements: image capture, algorithm, and output pipeline.

Image Capture – The image capture pipeline interfaces directly with the image sensor or camera. As such, the image capture pipeline interfaces externally to the programmable logic using interfaces such as HDMI, SD/HD/UHD-SDI, MIPI or Parallel/LVDS. Thanks to the flexible nature of programmable logic IOs, most standards can be implemented using the IO structures without the need for an external PHY. To help capture the image, Xilinx provides a range of IP cores in the Vivado IP library that will enable the image to be captured and made ready for further processing. Once the image has been captured, further processing might be required to obtain a useable image for the image processing algorithm. This additional image processing may require color filtering (debayer) to convert raw pixel values to RGB pixels. The image capture phase may also include gamma correction, noise filtering, and color space conversion. In adaptive systems, the input video timing and resolution will be detected to enable the image processing system to automatically configure itself for the video format received. An example image capture pipeline can be seen in Figure 13 below, which includes a MIPI CSI-2 receiver, demosaic (debayer), and frame buffer write to PS DDR memory.

Algorithm – This is the actual implementation of the image processing algorithm. In many cases it will consist of several stages of image processing algorithms, each one connected to the next stage using an AXI Stream. These IP blocks may be provided by the Xilinx Vivado IP library, which includes IP blocks that can scale images up or down and layer video layers on top of each other as demonstrated in Figure 14.

Alternatively, they can be implemented using a hardware description language or high-level synthesis, which enables higher-level languages such as C/C++ to implement image processing algorithms. Using a higher-level language enables developers to leverage the vision domain Xilinx Vitis accelerated libraries. These libraries provide several advanced image processing functions like filters, bounding boxes, bad pixel correction, warp transformation, and
Figure 13 – Example Image Capture Pipeline
Figure 14 – Image Processing Video Mixer mixing live video with a Head Up Display.
stereo block matching. If the image needs to be made available to the processor system, for higher-level algorithms for example, Video Direct Memory Access (VDMA) can be used to transfer the video stream to the PS DDR. This transfer can also operate in the other direction, transferring data from the processor system to the programmable logic. Such PS-PL transfers can be used to provide an overlay on the image, presenting information on the display if required.

Output Pipeline – Once the image has completed the algorithm pipeline, the processed image needs to be output to the appropriate display. This could be MIPI-DSI, HDMI, SD/HD/UHD-SDI or traditional parallel video with pixel clock, V Sync and H Sync. In the programmable logic, the developer needs to convert the processed image, which is in an AXI Stream format, into the correct output format. Along with converting the AXI Stream into the appropriate format, the video must also be re-timed for output. Just like with the image capture and algorithm pipeline, the Vivado IP library contains the necessary IP cores to generate the output video in the correct format. Figure 15 below shows a typical output pipeline with a frame read from the PS DDR passing data to the AXI Stream to Parallel Video output, operating under the control of a video timing controller.

It is possible to move the image into the processing system DDR memory during the algorithmic processing. This allows the processing system to perform high-level algorithms on the processed image contents. The software can further process the image and output it back into the image processing stream if so desired.

Software Defined Image Processing

There are several approaches which can be undertaken when it comes to working with the image in the processing system. Regardless of the approach taken, the image processing implemented within the programmable logic is highly configurable by the developed application software.
Figure 15 – Example Output Pipeline
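The video data rates quoted earlier for HD and 4K, and the pixel clock an output pipeline like the one in Figure 15 has to generate, both come from simple arithmetic. The sketch below works through it. The function names are illustrative, and the total frame size of 2200 x 1125 (active pixels plus blanking) is the commonly used 1080p60 timing, taken here as an assumption rather than from this eBook.

```python
def active_data_rate_gbps(width, height, bits_per_pixel, fps):
    """Data rate of the active picture only, in Gbps."""
    return width * height * bits_per_pixel * fps / 1e9


def pixel_clock_mhz(h_total, v_total, fps):
    """Pixel clock in MHz for a frame whose totals include blanking."""
    return h_total * v_total * fps / 1e6


hd_rate = active_data_rate_gbps(1920, 1080, 30, 60)
uhd_rate = active_data_rate_gbps(3840, 2160, 30, 60)
hd_clock = pixel_clock_mhz(2200, 1125, 60)  # assumed 1080p60 totals

print(f"1080p60, 30-bit pixels: {hd_rate:.2f} Gbps")
print(f"2160p60, 30-bit pixels: {uhd_rate:.2f} Gbps")
print(f"1080p60 pixel clock: {hd_clock:.1f} MHz")
```

The pixel clock is higher than the active pixel rate alone would suggest because the blanking intervals also consume clock cycles, which is one reason the stream must be re-timed against a video timing controller on output.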
Bare Metal – Bare metal developments are often used as an initial stage in the development of the image processing system. They enable the developers to demonstrate the design quickly and easily in the programmable logic and to confirm that the image sensor can be correctly configured. This allows for the creation of a transparent path which displays the captured image on the selected display. The bare metal application does not include the complexity of an embedded Linux stack. As such, it is very useful for debugging and commissioning the design using the Internal Logic Analysers (ILA) and memory views as well as the debugging capabilities to inspect register contents.

PYNQ – The open source PYNQ framework enables developers to leverage the power of Python to control IP within the programmable logic thanks to several PYNQ APIs, drivers and libraries. Thanks to these PYNQ provisions, developers can focus on the algorithm development because the PYNQ framework includes drivers for most AXI connected IP blocks on the programmable logic. PYNQ runs on an Ubuntu-like distribution and enables developers to start focusing on the image processing algorithms using OpenCV and other popular Python frameworks. Using PYNQ enables us to focus on the algorithm development using the real-world sensors, which includes being able to see the limitations of the sensor under different conditions and the impacts on the implemented algorithm. Once we know what the algorithm is, we can implement the functionality in the programmable logic using Xilinx IP, HLS or the Vitis accelerated library functions.

PetaLinux – An embedded Linux solution may be considered if higher-level algorithms or communication is needed. In this instance, PetaLinux can be used in conjunction with the Video4Linux (V4L) and GStreamer packages to create higher-level image processing algorithms. This framework may also be used if USB3 cameras are connected to the processing system. Using PetaLinux enables the developer to also leverage the Vitis AI flow and the Xilinx Deep Learning Processor Unit (DPU), which aids the acceleration of machine learning inference in the programmable logic using the DPU and supporting frameworks such as TensorFlow, Caffe and PyTorch.

Conclusion

In this eBook we have introduced the basics of what an FPGA is and the tools we use to work with AMD FPGAs, along with an in-depth look at the structures, elements and features of a modern FPGA. We have also demonstrated several different applications which can benefit from FPGA technology.

If you want to know more about FPGAs, a range of resources is available at https://resources.mouser.com/programmable-logic.

About the author

Adam Taylor is a world-recognised expert in design and development of embedded systems and FPGAs for several end applications. Throughout his career, Adam has used FPGAs to implement a wide variety of solutions from RADAR to safety critical control systems (SIL4) and satellite systems. He also had interesting stops in image processing and cryptography along the way.

Adam is a Chartered Engineer, Senior Member of the IEEE, Fellow of the Institute of Engineering and Technology, Arm Innovator and Edge Impulse Ambassador.

He is also the owner of the engineering and consultancy company Adiuvo Engineering and Training, which develops embedded solutions for high reliability, mission critical and space applications. Current projects include ESA Plato, Lunar Gateway, Generic Space Imager, UKSA TreeView and several other clients across the world.

FPGAs are Adam’s first love; he is the author of numerous articles and papers on electronic design and FPGA design, including over 440 blogs and 30 million plus views on how to use the Zynq and Zynq MPSoC for Xilinx.