Chapter 4

 Zynq applications include wired and wireless communications, automotive, image and
video processing, high performance computing, and numerous others.
 An important point to reiterate is that the programmable logic of Zynq is equivalent to
that of an FPGA. The PL of the smaller Zynq devices corresponds to the fabric of
Artix-7 FPGAs, while the larger ones are equivalent to Kintex-7.
 In embedded applications, it is usual to require one or more processors to orchestrate
the system, support software, and co-ordinate exchanges with peripheral components.
 FPGAs have been commonly used with soft processors for more than a decade, and
as these devices are deployed for ever more sophisticated applications, the need for
processor-based systems grows.
 Current standard FPGAs are composed of programmable logic without a hard
processor facility, and therefore it is reasonable to explore the option of soft processors
in greater detail.
 The standard Xilinx soft processor is the MicroBlaze, a 32-bit soft processor with
extensive support within the Xilinx toolflow; there are also some further options.
 MicroBlaze is the principal soft processor type and is supported within both the Xilinx
ISE and Vivado design flows, including the most recent releases.
 Multiple MicroBlaze instances can be deployed on a single device, if desired.
 There is no licensing cost implied in using MicroBlaze processors in system designs,
whether commercially or otherwise.
 One of the benefits of using a soft-processor is that its configuration is flexible.
 The MicroBlaze has a number of different architectural options which can be
included or excluded from the processor implementation depending on the
requirements of the target application. For example, the FPU can be excluded if the
system does not call for floating point computation, thus reducing the footprint of
the processor implementation on the FPGA (i.e. the amount of resources it requires).
 In a more general sense, the configuration of the MicroBlaze can be customized to
optimize for operating frequency, performance, or area; alternatively, it can be
specified such that a suitable balance between these three metrics is achieved. This is
done very easily in Vivado using a configuration wizard.
 MicroBlaze resource utilization varies with configuration, starting at approximately
900 LUTs, 700 FFs, and 2 Block RAMs for the ‘minimum area’ option, and rising to
about 3800 LUTs, 3200 FFs, 6 DSP48E1s and 21 Block RAMs for the ‘maximum
performance’ configuration.
 The maximum frequency attainable by MicroBlaze is dependent on its configuration,
which is customizable, and also other factors such as placement and routing on the
PL.
 To provide a rough indication, a typical MicroBlaze configuration might achieve about
70% of the maximum frequency of the PL, which equates to, at most, two or three
hundred MHz — this compares to the ARM processor’s maximum operating
frequency of 800MHz to 1GHz.
 Processor performance is normally evaluated using a benchmark.
 In order to quantify the performances of the ARM Cortex-A9 and MicroBlaze, and
thus compare them, two widely used benchmarks can be used:
 DMIPs (Dhrystone Millions of Instructions Per Second): - The quantity DMIPs
expresses the number of millions of instructions per second achieved by the processor
when running the Dhrystone standard test application.
 Dhrystone is a synthetic application (i.e. it does not perform useful work),
specifically designed to exercise the processor with a representative set of processor
operations.
 CoreMark score: - CoreMark establishes a simple numerical ‘score’ for processor
performance, which can be directly compared with the scores of other processors.
 The CoreMark application serves the same purpose as Dhrystone, but its content is
tailored to execute a set of operations true to typical embedded processor usage.
 DMIPs and CoreMark are measured (as opposed to calculated) metrics for quantifying
processing capability; both are derived from running specific, freely available test
applications on the processor being evaluated.
 For a range of reasons, CoreMark is generally considered to be a more robust and
realistic benchmark than the older Dhrystone method, and indeed ARM recommends
the use of CoreMark.
 MicroBlaze can achieve no more than 260DMIPs on Zynq in speed grade -3, whereas
the dual-core ARM is projected to reach 5000DMIPs (2500DMIPs per core),
assuming a PS clock frequency of 1GHz. This indicates that the ARM processor offers
approximately 20 times the processing performance of a single MicroBlaze core.
 CoreMark figures can also be obtained to compare the Zynq ARM processor with MicroBlaze.

 There are several important differences between the MicroBlaze and ARM Cortex-A9
processors. Among them: the MicroBlaze is a single-core processor, compared to the
ARM’s dual-core configuration.
 The ARM has a richer instruction set than the MicroBlaze.
 The MicroBlaze FPU implements only single precision floating point, whereas the
ARM also supports double precision.
 The cache configuration of the MicroBlaze provides a single level cache, whereas the
ARM has a two-level cache with greater capacity.
 The ARM Cortex-A9 processor is better equipped than the MicroBlaze. Even so,
MicroBlaze is still a very suitable choice for many applications.
 In the specific context of Zynq, a MicroBlaze can act as a useful ‘subordinate’
processor to the ARM.
 A MicroBlaze could be used to control a subset of the PL system functionality.
 Zynq brings a clear advantage to the implementation of processor-intensive
applications: it offers a level of performance unattainable by a standard FPGA.
 The PicoBlaze is a microcontroller rather than a processor (i.e. it comprises other
facilities besides the processing element, and supports a limited but useful set of
operations).
 The PicoBlaze is an 8-bit soft microcontroller IP core with a very small footprint (a
few tens of slices plus program memory), capable of implementing finite state machines
and other simple control functionality.
 The designs for PicoBlaze can be obtained as a download directly from the Xilinx
website, and the file set includes VHDL and Verilog for the core PicoBlaze
controller, together with optional functionality such as UART and SPI interfacing.
 PicoBlaze functionality is limited, and incomparable to that of a Zynq ARM
processor. However, a PicoBlaze instance can run at over 200MHz in Kintex-7 logic
fabric, in most cases as fast as the logic it may be controlling.
 It is possible that this compact controller may play a useful role within a Zynq
or MicroBlaze based embedded system, controlling lower-level functionality.
 ARM offers a ‘soft-core’ microcontroller, the ARM Cortex-M1, which is optimized
for FPGA implementation. Therefore in Zynq, this core would be implemented in the
PL section of the device to complement the processing undertaken by the ARM
Cortex-A9.
 Like the MicroBlaze, the configuration of the Cortex-M1 can be specified according
to user requirements, meaning that the logic resources required to implement the core
may be minimized.
 MicroBlaze is the most prevalent soft processor in Xilinx FPGA and SoC designs,
due to both the integrated and extensive support provided for it, and its excellent
implementation and performance characteristics.
 MicroBlaze is not the only soft processor available, and third party processor IP is
available as an alternative, or to cater for niche applications.
 Third party processors include LEON4 and OpenRISC.
 LEON4’s documentation states its performance as 1.7 DMIPs/MHz or 2.1
CoreMark/MHz, and that it can reach up to 125MHz on a Virtex-5 device, while the
area requirement is given as 4,000 LUTs.
 OpenRISC is a collaborative open source project hosted by OpenCores;
performance and area statistics are not readily available.
 In both cases, the processor cores are not exclusively for use on FPGAs, but can also
be targeted at ASIC implementation.
 OpenSparc is an open-source, 64-bit Reduced Instruction Set Computer (RISC)
processor developed by Sun Microsystems, one particular version of which specifically
targets FPGA implementation.
 The single hard processor to be discussed is the IBM PowerPC®, which was included
as a hard processor in the Virtex-II Pro; the Virtex-4 and Virtex-5 FPGA families
each include devices with one or two PowerPC (PPC) units.
 In the most advanced PowerPC-equipped FPGAs, the Virtex-5 family, each PowerPC
can achieve up to 1,100 DMIPs (i.e. 2,200 DMIPs in total using one of the larger,
two-unit devices), whereas MicroBlaze performance is around 240 DMIPs.
 MicroBlaze is integrated into Vivado and extensive support is available.
 When using a standard processor (such as a General Purpose Processor (GPP) or a
Digital Signal Processor (DSP)), it must be assumed that there is a need to support
software routines, or an operating system with applications. It might be that the
software implements computationally intensive operations which result in a bottleneck,
and would benefit from hardware acceleration using the additional PL available in the
Zynq.
 The resources of a processor are fixed, and usually limited to one, two or four
processing cores (more in some cases), which are required to operate at a specific
clock frequency.
 The cost of implementing a desired software implementation is measured in terms of
clock (execution) cycles, which will of course require some specific amount of time to
execute at the desired clock frequency; the more complex the required processing, the
longer the execution time will be.
 The efficiency of the implemented algorithm is also important, such that it does not
include redundant operations.
 If we consider the behavior of a generic processor, it has a finite number of timeslots
(clock cycles) that are occupied — or not — by particular operations scheduled onto
them. Some of these operations may take a single cycle; others several cycles.
 More specifically, the occupation of processor cycles can be represented in terms of
program functions or tasks.
 It might be that the processor supports a number of different tasks that repeat
regularly, or occur on an ad hoc basis, and which are scheduled onto the processor
according to their relative priorities.
 Notice that, as the processor is a serial resource, it accommodates only one task
during any one timeslot. Of course, it is also true that processors are increasingly
‘multi-core’, meaning that they have two or four processing cores (for instance),
each of which processes tasks serially.

 When aiming for real-time operation, it should be considered that there is a ‘budget’ of
execution cycles available, within which the desired software applications or
algorithms must be executed.
 For example, a real-time video processing application demands a processing resource
that can keep pace with data arriving at the desired frame rate and resolution;
otherwise, the processor will still be busy with the previous frame(s) as new data
arrives.
 There are effectively processing deadlines by which all processing relating to each
individual frame must be completed.
 Real-time systems can be classified as soft real-time or hard real-time. If soft real-
time, the system performance is likely to degrade, usually temporarily, as a result of a
missed processing deadline; if hard real-time, the system may fail completely.
 When considering the activity of the processor, it is often useful to consider its activity
with respect to the various tasks it must support. For instance, if the processor
spends 80% of its execution cycles on a particular task, this should be investigated to
establish whether any parallelism can be identified. If so, then partitioning this
functionality into hardware can result in a significant speed-up overall.
 One aspect to be aware of when partitioning functionality from software into
hardware is the implied overhead of communicating between the two parts of the
system.
 The time taken to transfer data and instructions between software and hardware
constitutes an additional latency that offsets the processing speed-up; if the overhead
is too large, the benefit of hardware acceleration is lost.
 An implication of using Zynq, as compared to a discrete processor and FPGA, is that
the communication overhead is low; this is due to the tight coupling between the PS
and PL components of the device.
 PL is the ultimate accelerator because it can support arbitrary levels of parallelism,
and therefore flexibly supports a wide range of algorithms, and even multiple co-
processors simultaneously.
 The last architecture to compare with Zynq is that of a processor-FPGA combination,
i.e. where these two elements exist as physically separate components.
 The motivation for this type of system is usually that the system needs to support both
computationally intensive data-flow type processing (ideally suited to the FPGA), and
sophisticated software algorithms or applications (ideally suited to a dedicated
processor).
 It may be that the software element of the system extends beyond the scope of
MicroBlaze implementation, meaning that both an FPGA and a processor are
deemed to be required.
 The Zynq gives the option of a single-chip replacement for this configuration.
 The Zynq solution is advantageous: a single device has the potential to reduce the
BOM, and the board-level system hardware architecture is simplified due to the
reduced number of components, which further contributes to cost savings and
potentially also improves reliability.
 The use of a Zynq device also enables the physical size of the system to be reduced,
while power consumption may be significantly lower.
 The external links of a discrete component system expend more power than the
comparatively more local connections between the PS and PL in the Zynq; this
contributes to the overall power saving associated with the Zynq solution.
 Other power savings are achieved due to the smaller physical size of the Zynq device,
its 28nm device architecture, and its tight memory integration.
 From a design perspective, Zynq also provides the potential for productivity gains,
leading to accelerated development times.
 It is also significant that the Zynq design flow features support for a set of standard
AXI interfaces between the PL and PS, which removes much of the effort of
designing and implementing appropriate interfaces.
 The extensive availability of third party AXI-compatible IP is a further advantage.
 The overhead of communicating between the processor and FPGA over external
connections is avoided; this may be the cause of bandwidth constraints and increased
latency in the two-chip model.
 The internal connections in the Zynq device are inherently more secure than external
links and, in fact, additional security features are integrated to facilitate a secure boot
sequence, and to defeat tampering.
 One particularly powerful aspect of the Vivado design flow is its tool for high-level
synthesis, Vivado HLS, which allows hardware (destined for implementation on the
PL) to be generated from a C-based software description.
 The HLS design method permits rapid creation of designs, as a result of describing
functionality at a higher level of abstraction than traditional RTL design (as used
by HDL and related design entry methods).
 Using HLS, the designer can influence the synthesis of C code to hardware by
applying specific directives and constraints that relate to aspects of the generated
hardware.
 The use of HLS is particularly compelling in the context of Zynq system development,
because its architecture comprises both PS and PL. It means that functional parts of
the system can very easily be ported from software destined for execution on the
ARM, to hardware for implementation on the PL, simply by retargeting C code with
only minor modifications.
 Changing the realization of system elements represents a different
hardware/software partitioning, which may achieve performance or implementation
benefits.
 For example, the functional element F4 has been moved from software to hardware
implementation (potentially via the use of HLS), and the element F1 has been shifted
from a hardware implementation to a software routine. The adapted system
architecture may be found to facilitate increased data throughput.
 When developing complex systems, the implication that software/hardware
partitioning can be very readily accomplished with the support of HLS methods is of
significant benefit.
 As required, the design team can investigate different partitioning before committing
to a final system architecture, with a reduced time overhead in doing so.
 Zynq gives the designer the flexibility to partition the system between software and
hardware elements according to requirements. One of the facilities which helps with
this process is Vivado HLS, a tool which converts C or C++ algorithms into
hardware descriptions suitable for implementation in Zynq’s PL.
