A Guide To Using Field Programmable Gate Arrays (Fpgas) For Application-Specific Digital Signal Processing Performance

A Guide to Using Field Programmable Gate Arrays (FPGAs) for
Application-Specific Digital Signal Processing Performance

Gregory Ray Goslin
Digital Signal Processing Program Manager
Xilinx, Inc.
2100 Logic Dr.
San Jose, CA 95124
Abstract:
FPGAs have become a competitive alternative for high performance DSP applications, previously dominated by
general purpose DSP and ASIC devices. This paper describes the benefits of using an FPGA as a DSP Co-processor, as
well as, a stand-alone DSP Engine. Two Case Studies, a Viterbi Decoder and a 16-Tap FIR Filter are used to illustrate
how the FPGA can radically accelerate system performance and reduce component count in a DSP application.
Finally, different implementation techniques for reducing hardware requirements and increasing performance are
described in detail.
Introduction:
Traditionally, digital signal processing (DSP) algorithms
are implemented using general-purpose (programmable)
DSP chips for low-rate applications, or special-purpose
(fixed function) DSP chip-sets and application-specific
integrated circuits (ASICs) for higher rates.
Advancements in Field Programmable Gate Arrays
(FPGAs) provide new options for DSP design engineers.
The FPGA maintains the advantages of custom
functionality like an ASIC while avoiding the high
development costs and the inability to make design
modifications after production. The FPGA also adds
design flexibility and adaptability with optimal device
utilization while conserving both board space and
system power, which is often not the case with DSP
chips. When a design demands the use of a DSP, or
time-to-market is critical, or design adaptability is
crucial, then the FPGA may offer a better solution.
The SRAM-based FPGA is well suited for arithmetic,
including Multiply & Accumulate (MAC) intensive DSP
functions. A wide range of arithmetic functions (such as
Fast Fourier Transforms (FFTs), convolutions, and
other filtering algorithms) can be integrated with
surrounding peripheral circuitry. The FPGA can also be
reconfigured on-the-fly to perform one of many systemlevel functions.
When building a DSP system in an FPGA, the design
can take advantage of parallel structures and arithmetic
algorithms to minimize resources and exceed the
performance of single or multiple general-purpose DSP
devices.
Distributed Arithmetic[1] for array
multiplication in an FPGA is one way to increase data
bandwidth and throughput by several order of
V.1.0
magnitudes over off-the-shelf DSP solutions. One

example is a 16-Tap, 8-Bit Fixed Point, Finite Impulse
Response (FIR) filter[2].
The FIR design supports more than 8 million samples
per second. This example can also be implemented
using multiple bits, until a Fully Parallel Distributed
Arithmetic algorithm is obtained for higher sample
rates (i.e., 55.89 million samples per second).
Figure 1 compares the 16-Tap FIR filter implemented in
a state-of-the-art fixed-point DSP with that of the Xilinx
FPGA. As published by Forward Concepts[5], For
1995, the state-of-the-art fixed-point DSP is rated at 50
MIPS. Such a device requires 20 nsec per Tap to
implement a 16-Tap FIR filter, which translates to a
theoretical maximum (with zero wait-states) sample rate
of 3.125 million samples per second.
An In-System Programmable (ISP) FPGA can also be
reconfigured on the board during system operation.
Taking advantage of the reconfigurability feature means
a minimal chip solution can be transformed to perform
multiple functions. For example, an FPGA could be the
basis for a system that performs one of several DSP
functions. Suppose, for instance, one function is to
compress a data stream in transmit mode and another
function is to decompress the data in receive mode. The
FPGA can be reconfigured on-the-fly to switch, or
toggle, from one function to another. This capability of
the FPGA adds functionality and processing power to a
minimum-chip DSP system controlled with an internal
or an external controller.
This Reconfigurable
Computing technique is beginning to impact design
methodologies.
1995 XILINX, Inc. All rights reserved.
A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

By Gregory Ray Goslin
Data Sample Rate (MHz)

70
65
60
FPGA Performance
Characteristics for
Distributed
Arithmetic
55
50
45
FPGA
Region
processor becomes less efficient when an algorithm is

dependent on two or more of the past, present, and/or
future state conditions. This is primarily due to the feedback or parallel structure of the data flow being
processed sequential with additional wait-states in a
DSP.
Full Parallel DA
@70-50MHz
Memory-A Memory-B
4-Bit Parallel DA
@35MHz
Control Logic
I/O
40
35
30
25
1-Bit Serial Distributed Arithmetic

w/ XC4000E-3 @CLK=66MHz
20
2-Bit Parallel DA
@16MHz
15
10
5
8-Bit Serial DA
@8MHz
DSP
Region
0
1
8
16 24 32 40 48
# of Multiply & Accumulates per Sample
Figure 1.
Distributed Arithmetic Performance
for 16-Tap, 8-Bit Fixed Point, FIR Filter.
The FPGA design cycle requires less hardware-specific
knowledge than most DSP chips or ASIC design
solutions. Smaller design groups, with less experienced
engineers, can design larger, more complex DSP systems
in less time than larger design groups with more
experienced engineers who are required to know devicespecific programming languages. The FPGA-based DSP
system-level design team can design, test, verify, and
ready a complex DSP system for production in weeks.
FPGA-Based DSP Accelerator:

General-purpose DSP devices have been the enabling
technology for most mid to high-end electronic systems.
Although high-volume applications favor ASIC
implementations, the general-purpose DSP devices are
used to drive the technology forward.
The engine in the DSP system is typically a specialized
time-multiplexed sequential computer system that
performs a continuous mathematical process, attempted
in real-time [see Figure 2]. While the DSP processor
may perform multiple instructions per clock cycle
(Harvard Architecture), the overall process is performed
in a three-step series of 1
memory-read, 2
process,
and 3
memory-write instructions. This process is
adequate for independent sequential algorithms (a
modified Von Neumann Architecture).
The DSP
V.1.0
Dedicated Arithmatic Processing

Unit
DSP Device
Figure 2.
Basic DSP Block Diagram
System performance requirements continue to increase.

The advancement in performance for DSP devices is
lagging behind this growth. Because of this imbalance
the DSP designer is forced to use a systolic array of DSP
processors to boost the overall performance.
A typical DSP algorithm contains many feed-back loops
or parallel structures [see Figure 3]. The software code
for a DSP algorithm of this type is not efficiently
implemented in a general purpose DSP. Typically,
about 10-30% of the DSP code utilizes 60-80% of the
processors power. Analyzing the DSP algorithm and
breaking out any parallel structures or repetitive loops
into multiple data paths, one can enhance the overall
performance of the algorithm. The multi-path parallel
data structures can be processed either through parallel
DSP devices or in a single FPGA-based DSP hardware
accelerator with or without the assistance of a DSP
device.
The FPGA is well suited for many DSP algorithms and
functional routines. The FPGA can be programmed to
perform any number of parallel paths. These operational
data paths can consist of any combination of simple and
complex functions, such as; Adders, Barrel Shifters,
Counters, Multiply and Accumulation, Comparators and
Correlators just to mention a few. The FPGA can also
be partially or completely reconfigured in the system for
a modified or completely different algorithm. (Note that
the Xilinx XC6200[6] family can be partially
reconfigured while the rest of the device is still active in
the system.) The primary concept is to unload the
compute-intensive functions requiring multiple DSP

clock cycles into the FPGA and allow the DSP processor
to concentrate on optimized single-clock functions.
Data_In
Begin
Fetch - A
Fetch - D
Fetch - C
Fetch - B
Fetch - E
g(C,F) = G
about 80% of the overall processing time. Note this did

not include the calculation of the Diff_1 and Diff_2
outputs. These two outputs would have required an
additional seven clock cycles. The seven (2sComplement) I/O Data words were multiplexed on a
common I/O Bus at the maximum I/O rate of 33 MHz.
Old_1
+
+
M
U
X
+
-
f(A,B) = C
f(D,E) = F
Send - G
I/O Bus
New_1
MSB
INC
Diff_1
I/O Bus
Diff_2
Send - C
Adaptive
Changes
Data_Out
+
+
Old_2
MSB
M
U
X
New_2
Prestate Buffer
24-bit
Figure 3.
1 0 Bit
24-bit
Basic DSP Algorithm Process
The FPGA-based DSP accelerator is conceptually

similar to a Coprocessor for the Microprocessor Units
(MPUs) in the computer market. Initially, the MPU
required
a
Math-Coprocessor
to
accelerate
computational algorithms. The functionality of the
Coprocessor became so frequently used that it is now an
integrated part of the latest processor families. Some
DSPs have been optimized to do some dedicated
functions much faster than a MPU (Von Neumann
architecture). For example, some DSPs can Multiply
and Accumulate (MAC) in one clock cycle (DSP @66
MHz = 15 nsec per MAC) compared to eleven clock
cycles (P5@100 MHz = 110 nsec per MAC) in the
Pentium processor.
The combined functionality of the FPGA and the
general-purpose DSP can support several magnitudes
higher data throughput than two or more parallel DSP
devices. The FPGA/DSP implementation is more
flexible and proves to be more cost effective than
multiple DSPs or an ASIC.
Case Study
Figure 4.
Viterbi Decoder Block Diagram
There were two limiting factors for this DSP-based

design. First, the wait state associated with the DSPs
external SRAM memory required two 15 nsec clock
cycles for each memory access. Hence, the data Bus
transfer required 30 nsec for each data transaction. This
forced the I/O Bus speed to a maximum of 30 nsec.
Secondly, each Add/Subtract and MUX stage had to be
performed sequentially with additional wait-states. The
Add/Subtract stages required four additional operations
with multiple instructions.
FPGA & DSP Based Viterbi Decoder:

The Viterbi Decoder is well suited for the FPGA. The
ability to process parallel data paths within the FPGA
takes advantage of the parallel structures of the four
Add/Subtract-blocks in the first stage and the two
Subtract-blocks in the second stage. The two MUXblocks take advantage of the ability to register and hold
the input data until needed with no external memory or
additional clock cycles.
Old_1
Viterbi Decoder
+
+
R
E
G
+
-
A Viterbi Decoder[7] is a good example of how the

FPGA can accelerate a function [Refer to Figure 4]. The
initial design used two 66 MHz programmable DSP
devices to implement the algorithm. The algorithm used
to calculate the New_1 and New_2 outputs, shown
in Figure 4, required 17-computational clock cycles, plus
an additional 7-clock cycles due to the wait-state
associated with the DSPs external SRAM memory.
This memory was required by the programmable DSP
for data storage. Hence, the Viterbi Decoder algorithm
required 360 nsec [(24-clock cycles)(15 nsec)] of
processing time. The Viterbi Decoder algorithm utilized
V.1.0
24-bit
I/O Bus
INC
M
U
X
R
E
G
R
E
G
New_1
MSB
R
E
G
Diff_1
I/O Bus
Old_2
+
+
Diff_2
R
E
G
MSB
M
U
X
R
E
G
R
E
G
Prestate Buffer
Optional
Pipelining
Register's
Figure 5.
Diagram
New_2
R
E
G
24-bit
24-bit
1 0 Bit
24-bit
FPGA-Based Viterbi Decoder Block
The design conversion resulted in the following

performance:
The FPGA based Viterbi Decoder
required 135 nsec [(9-clock cycles)(15 nsec)] of total

200
360 ns
100
62 % Better
Performance with
FPGA Co-processor
135 ns
0
Two 66 MHz DSPs
Six 15 ns RAMs
66 MHz DSP+FPGA
Three 15 ns RAMs
Figure 6.
Performance of two Viterbi decoder
implementations. The DSP+FPGA solution is 2.7
times faster
This design can also be implemented in a multiplexed
version with the original 33 MHz I/O Data transfer rate.
This multiplexed implementation minimizes the CLB
resources used.
This is done by observing the
symmetrical nature of the design. Note that Old_1 and
Old_2 Datum follow the same path with only a minor
difference in the magnitude of the second stage SUBblock. This implementation requires a specific ordering
for data accesses on the I/O Data Bus.
This is one example of how the FPGA can accelerate a
DSP function. To use the FPGA in a DSP design,
identify the parallel data paths and/or the operations
requiring multiple clock cycles in the DSP.
Digital Filter Design:
One example is a Serial Distributed Arithmetic (SDA),

FPGA-Based, 16-Tap 8-Bit Finite Impulse Response
(FIR) filter[2]. The SDA-FIR design supports an on-off
chip I/O-data rate at more than 8 million samples per
second [13.7 nsec x (8-Bits + 1) / 16-Taps = 7.71 nsec
per Tap] while occupying 68 Configurable Logic Blocks
(CLBs)[3] in an XC4000E-3 FPGA.
Note that
increasing the number of Multiply and Accumulation
blocks or Taps has no significant impact on the sample
rate when Serial Distributed Arithmetic is used [see
Figure 1].
20
17.88
18
16
14
16-Tap, 8-Bit FIR Filter

Relative Performance
FPGA
300
When building a Digital Filter in an FPGA, the design

can take advantage of parallel structures and Distributed
Arithmetic algorithms to exceed the performance of
multiple general-purpose DSP devices. While the FPGA
has no dedicated multiplier, the use of Distributed
Arithmetic for array multiplication in an FPGA is one
technique used to implement and increase the functions
data bandwidth and throughput by several order of
magnitudes over off-the-shelf DSP solutions.
12
10
8
6
4.00
2.60
2
0.18
1.00
FPGA
400
FPGA Based Filters:
Relative Performance (@50MHz)
Processing Time (ns)
processing time (this includes all of the outputs)

compared to the 360 nsec required for the partial output
data by the DSP. This enhancement equates to 37.5% of
the original DSP processing time or 62.5% better
processing performance. The I/O Data Bus can also
support a 66 MHz data transfer rate, supporting twice the
original throughput. The FPGA-based implementation
replaced a programmable DSP and three SRAM chips.
The MAC function can be implemented more efficiently

with various Distributed Arithmetic (DA)[1] techniques
then with conventional arithmetic methods. Distributed
Arithmetic can make extensive use of look-up tables
(LUTs), which makes it ideal for implementing DSP
functions in LUT-based FPGAs.
This example can also be implemented using multiple

bits, up to a full Parallel Distributed Arithmetic (PDA)
algorithm to obtain higher I/O-data sample rates which
can exceed 55 million samples per second [ 17.9 nsec /
16-Taps = 1.12 nsec per Tap ] and occupies 400 CLBs in
an XC4000E-3. Note that these two examples will work
V.1.0
PDA
DSP
P-5
Hardware Device
DSPx4
Figure 7.
Relative performance for various
implementations of a 16-Tap, 8-Bit FIR filter
compared to a 50 MHz fixed-point DSP processor
SDA
The processing engine of most filter algorithms is a

Multiply and Accumulate (MAC) function.
Filter
designs can vary over a wide range in the number of
MACs, from one to thousands. As the number of MACs
increase, the algorithm becomes much more complex for
a CPU-based architecture.
Hence, the algorithm
becomes more compute-intensive for any conventional
DSP.

with any 16-Tap configuration with signed or unsigned,

fixed point 8-Bit data and coefficients. Optimizing the
design for a specific configuration can often reduce the
number of CLBs by as much as 20%.
Compare the same 16-Tap FIR filter algorithm
implemented in various state-of-the-art fixed-point DSPs
with that of the Xilinx FPGA. A fixed-point DSP rated
at 50 MIPS requires at least 20 nsec per Tap to
implement a 16-Tap FIR filter, which translates to a
maximum sample rate of 3.125 million samples per
second or 12.5 million samples per second with a 4xDSP
multi-chip module, as shown in Figure 8.
The
performance of general-purpose DSP devices degrades
significantly with each additional MAC (one additional
Tap), as shown in Figure 8. If the system design
requires a systolic array or uses more than one
traditional DSP device, consider the performance and
cost advantages available using an FPGA.
Data Rate (MHz),
Case Study
16-Tap, 8-Bit FIR Filter:
The 16-Tap FIR is a discrete-time filter in which the
output is an explicit function only of the present and
previous inputs to compute the weighted average of the
16-data sample points. Since the FIR response,
mathematically, contains only feed-forward terms, the
FIR is unconditionally stable and can be designed to
exhibit a linear phase response.
REG
Data Input
X[7:0]
REG
REG
PSC
0
C0
15
REG
1
C1
REG
14
REG
2
C2
REG
13
REG
REG
3
C3
12
REG
4
11
10
REG
REG
6
C5
REG
REG
5
C4
DSP Processors
50
REG
C6
C7
-1
REG
Data Output
Y[9:0]
# of DSP Cores
4x-DSP MCM
3x-DSP MCM
2x-DSP MCM
1x-DSP
45
40
35
30
25
FPGA
Region
20
15
10
5
0
4
8 12 16 20 24 28 32 36 40 44 48
# of Multiply & Accumulates per Sample
Figure 8.
FPGA Based 16-Tap FIR Filter:

The FPGA has the ability to implement a FIR filter
function using one of several Distributed Arithmetic
techniques, depending on the performance required.
Note that these techniques can be used to optimize the
implementation of many other types of data processing
or MAC-based algorithms.
Parallel Distributed
Arithmetic techniques are used to achieve the fastest
sample rates, while lower rates can be sustained with a
Serial or Serial-Sequential Distributed Arithmetic
techniques that use less resources (fewer CLBs).
The primary design concern is the performance or the
sample rate of the filter. The design must work at the
sample rate. A design which runs below the sample rate
is of no value, while any additional performance, which
uses more CLBs, is of no added value.
DSP
Region
Figure 9.
16-Tap FIR filter Data Flow Block
Diagram with Symmetrical Coefficients
Programmable DSP Performance
Distributed Arithmetic:
Distributed Arithmetic differs from conventional
arithmetic only in the order in which it performs
operations.
Take for example a four-product MAC function that uses
a conventional sequential shift-and-add technique to
multiply four pairs of numbers and sum the results [see
Figure 10].
V.1.0

A
A
Y0
SHIFT-REG
MAC Block
Ai
+
+
S-REG
MS
P
REG
LS
S-REG
PA
B
S-REG
S-REG
Bi
PB
MAC
C
S-REG
S-REG
Ci
Y2
Di
Y3
PC
PD
SHIFT-REG
Ai
1-Bit Scaling
Accumulator
+
Y2
PC
Ci
+
+
Y3
PD
Di
MS
REG
LS
SUM
+
1/2
Figure 12.
Serial Distributed Arithmetic for a
four-product MAC.
MAC
Each multiplier forms partial products by multiplying the

coefficient-Y by 1-bit of the data-A at a time in an AND
operation. This technique then adds these partial
products into an accumulator that shifts the
accumulators feedback 1-bit position, to the right, to
perform a divide by two (2) each clock cycle, thus
compensating for the bit-weighting of the ingressing
partial product [see Figure 11].
Y
Figure 10.
Conventional four-product MAC
using shift-and-add technique to multiply pairs and
sum the result
PB
MAC
D
S-REG
SUM
Bi
PA
Y1
C
+
1/2
Y1
Y0
Ai
MAC Block
+
+
MS
Using Distributed Arithmetic, as shown in Figure-12, the

operations are reordered. This technique reduces the
number of shift-and-add circuits to one, but does not
change the number of simple adders.
The coefficients in many filtering applications are
constants.
Consequently, the output of the AND
functions and the three adders of Figure 12 depend only
on the four input bits from the shift registers. Replacing
these AND functions and the three adders with a simple
4-bit (16-word) Look-up Table (LUT) gives the final
reduced form of a bit Serial Distributed Arithmetic MAC
[see Figure 13].
A
P
REG
LS
S-REG
Ai
1/2
S-REG
LET A = A 3 A 2 A 1 A 0 and Y = Y 3 Y 2 Y 1 Y 0 then, A*Y = P is,

{ Start }
A0 * ( Y 3 Y 2 Y 1 Y 0 )<1/2>
{ Cycle=0 }
+ A1 * ( Y 3 Y 2 Y 1 Y 0 ) <1/2>
{ Cycle=1 }
+ A 2 * ( Y 3 Y 2 Y1 Y0 )
<1/2>
{ Cycle=2 }
+ A 3 * ( Y3 Y2 Y1 Y0 )
{ Cycle=3 }
{ End }
P 7 P 6 P 5 P4 P3 P 2 P 1 P 0
Figure 11.
Four-Bit MAC using shift-and-add
technique to multiply
The four-multiplications are performed simultaneously
and the results are then summed when the products are
complete. This process requires n-clock cycles for data
sample of n-bits. Hence, the processing clock rate is
equal to the data rate divided by the number of data bits.
During each data clock-cycle, the four-multipliers
simultaneously create four-product terms [i.e., in Figure
10; PA, PB, PC, & PD], that eventually are summed into
the output. Distributed Arithmetic differs from this
process by adding the partial-products before, rather than
after, the bit-weighted accumulation.
V.1.0
C
S-REG
Ci
LOOKUP
TABLE
+
+
D
S-REG
1-Bit Scaling
Accumulator
Bi
Di
MS
P
REG
LS
SUM
1/2
Figure 13.
LUT-Based Serial Distributed
Arithmetic for a four-product MAC.
The LUT data, referenced in Figure 14, is composed of
all partial sums of the coefficients (Y0, Y1, Y2, & Y3).
The least significant bit (the output from each serial shift
register) of the 4-data samples addresses the LUT. If all
four data bits are 1, then the output from the LUT is the
sum of all four coefficients. If any data bit is a zero, then
the corresponding coefficient is eliminated from the sum.
Because the address of the LUT contains all possible
combinations of one or zero, based on the four inputs,
the LUT output contains all 16 possible sums of the
coefficients [see Figure 14].

A
S-REG
Look-Up Table
Ai
ADDR_0
B
S-REG
Bi
C
S-REG
Ci
ADDR_1
LOOKUP
TABLE
ADDR_2
D
S-REG
Di
ADDR_3
O
U
T
P
U
T
LUT Inputs
< D i Ci Bi
<0 0 0
<0 0 0
<0 0 1
<0 0 1
<0 1 0
<0 1 0
<0 1 1
<0 1 1
<1 0 0
<1 0 0
<1 0 1
<1 0 1
<1 1 0
<1 1 0
<1 1 1
<1 1 1
Ai >
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
LUT Output
Partial Sum
0000...0
YA
YB
Y B +Y A
YC
Y C +Y A
Y C +Y B
Y C +Y B +Y A
YD
Y D +Y A
Y D +Y B
Y D +Y B +Y A
Y D +Y C
Y D +Y C +Y A
Y D +Y C +Y B
Y D +Y C +Y B +Y A
D0
S-REG
16-MAC CIRCUIT
D1
Using 4-bit LUT-Based

Distributed Arithmetic
S-REG
D2
LUT
S-REG
D3
S-REG
D4
S-REG
D5
S-REG
D6
LUT
S-REG
D7
Figure 14.
Table, LUT.
Contents of a 16-Word Look-Up
S-REG
D8
S-REG
In the four-MAC example, the width of the LUT is

typically 2-bits greater than the coefficient width. The
additional two-bits allows for word growth from the
addition of the four coefficients. The circuit will require
at most 2-bits because the circuit is summing no more
than four coefficients. Fewer bits may be adequate in
cases in which the combinations of coefficients result in
less word growth. More than four coefficients require
more width for word growth. In general, the number of
additional bits should be at least Log2(number of
Coefficients). You can use fewer bits only when the
specific coefficients are known and the LUT has been
computed.
As the number of coefficients increase, the size of the
LUT grows exponentially. A large LUT can be avoided
by partitioning the circuit into smaller groups and
combining the LUT outputs with adders. The adders are
less costly than the larger LUT.
In the example of the 16-Tap FIR filter, the circuit would
require a 16-bit LUT. If the FPGA architecture uses
four-input function generators, the optimum partition is
four products per LUT. The 16-MAC product and sum
can be partitioned as shown in Figure 15.
Rather than increase the size of the LUT by a growth
factor of 16, four LUTs the same size as the one used in
the example of Figure 13 are used. The additional three
adders, each occupying approximately the same space as
one 4-bit LUT, result in a growth factor of between 6 and
7. In a SDA-based design with multiple four-input
LUTs, each are configured in the same manner as was
discussed with the single four-input LUT.
V.1.0
D9
1-Bit Scaling
Accumulator
+
MS
LS
SUM
1/2
S-REG
DA
P
REG
LUT
S-REG
DB
S-REG
DC
S-REG
DD
S-REG
DE
LUT
S-REG
DF
S-REG
Figure 15.
LUT-Based Serial Distributed
Arithmetic for a four-product MAC.
The only difference between a MAC circuit and a FIR
filter design is how the input data is handled [see Figure
16].
The MAC has a series of parallel-to-serial
converters, while the FIR filter has one long bit-wide
shift register that taps into each word. Only the first
word is parallel loaded. The first shift register operates
by parallel-loading the new data sample into a registerbased shift register. The shift register then serially shifts
out all the data-bits, 1-bit at a time. The output bit from
the shift register addresses 1-bit of a corresponding LUT,
until the most significant bit of n-bits has been clocked
out. When the shifting is complete, the first data shift
register is empty and is ready to be parallel loaded with
the next word and the process repeats.
The output bit from the first shift register (the first filter
Tap) is the input to a serial-in and serial-out shift register
(the second filter Tap). Each additional Tap would also
use a serial-in and serial-out shift register with its input
fed serially from the previous cascaded Tap. Note that

the data flow can be changed to adapt to the input

characteristics of any function.
With the exception of the first parallel-in and serial-out
shift register, the shift registers could be constructed
using registers or more efficiently implemented in a
Synchronous RAM-based shift register. This is a result
of needing access to only one bit of each data sample
(Tap) per clock cycle. The Synchronous RAM-based
shift register offers 16 times the density of a registerbased shift register (each CLB can be configured as two
Registers, two 16x1-bit Synchronous RAMs or one 32x1bit Synchronous RAM for data storage).
The outputs from each LUT and ADDER can be

registered in the same CLB that holds the LUT or
ADDER at no extra cost. By adding pipeline registers
the design can be clocked at a higher rate. For the 16Tap FIR filter example, the pipelining registers result in a
continuous output data stream with a latency of 4-sample
cycles, as shown in Figure 17.
DATA CLK
INPUT_10
INPUT_11
INPUT_12
INPUT_13
INPUT_14
***
***
OUTPUT_06
OUTPUT_07
OUTPUT_08
OUTPUT_09
OUTPUT_10
***
The data in the remaining RAM-based shift registers are

positioned to be multiplied by the appropriate coefficient
by addressing its corresponding LUT during the next
cycle. The oldest samples (the last Tap) data-bit is lost
at the end of each process.
D0
16-Tap FIR Filter

Using Serial
Distributed Arithmetic
S-REG
S-RAM
LUT
REG
S-RAM
***
Figure 17.
Pipelined input/output timing
diagram for a SDA 16-Tap FIR Filter.
If the filter is symmetrical it is possible to further reduce
the resources required for the design. In a symmetrical
filter the number of LUTs (or MACs) can be reduced by
a factor of two. This is done by adding the input data
with a common coefficient prior to the multiplication.
This is done with serial adders. This also reduces the
number of adder stages in the design, as shown in Figure
19.
S-RAM
S-RAM
SUM
A+B
REG
REG
S-RAM
LUT
A + B + C_IN
REG
S-RAM
S-RAM
C_OUT
1-Bit Scaling
Accumulator
S-RAM
MS
LS
REG
C_IN
REG
CLK
P
REG
SUM
1/2
S-RAM
Figure 18.
LUT
S-RAM
S-RAM
When serial adders are used to reduce the number of

MACs in a Serial Distributed Arithmetic algorithm, an
extra clock cycle is required to flush the Carry_Out from
the addition of the two most significant bits of the input
data [see Figure 18]. The processing clock rate can be
calculated by dividing the Sample Rate by the number of
sample bits plus one.
REG
S-RAM
LUT
S-RAM
REG
S-RAM
Figure 16.
16-Tap FIR Filter using LUT-Based
Serial Distributed Arithmetic.
V.1.0
Serial Adder input/output diagram.
REG
S-RAM
A filter design implemented in an FPGA with SDA gives

a significant amount of performance in a modest number
of CLBs (e.g. 4.25-CLBs at 7.7 nsec, per Tap). SDA
uses the smallest number of CLBs while processing all
data samples (Taps) in parallel.

D0
S-REG
S-RAM
S-RAM
S-RAM
T0
T15
SERIAL
ADD
A0
T1
T14
SERIAL
ADD
A1
SERIAL
ADD
A2
SERIAL
ADD
A3
T11
SERIAL
ADD
A0
T5
T10
SERIAL
ADD
A1
LUT
T2
S-RAM
S-RAM
T13
REG
T3
S-RAM
S-RAM
T12
1-Bit Scaling
Accumulator
T4
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
T9
MS
LS
REG
REG
SUM
1/2
SERIAL
ADD
A2
REG
S-REG
S-RAM
T8
BITS[(n-1),...,5,3,1]
Ai
T7
S-RAM
LUT
T6
S-RAM
the addition of a 1-bit scaling adder, required to add the

two partial sums which results from each of the two
parallel sample bits. The scaling accumulators input bus
is expanded to accommodate the larger partial sum and
the final scaling accumulator is changed from a 1- to a 2bit shift (1/4) for scaling. These changes essentially
double the resources required compared to that of the
SDA design.
Symmetrical 16-Tap
FIR Filter Using
Serial Distributed
Arithmetic
SERIAL
ADD
A3
S-REG
Figure 19.
Symmetrical 16-Tap FIR Filter
using LUT-Based Serial Distributed Arithmetic.
S-REG
Ci
LOOKUP
TABLE
D
S-REG
A Serial-Sequential Distributed Arithmetic (SSDA)

technique could be used at lower data rates to conserve
more CLBs. This paper does not include a formal
discussion on Serial-Sequential Distributed Arithmetic.
The process is performed bit-serially, word-sequentially
(rather than bit-serially, word-parallel, compared with
SDA). This technique uses an internal synchronous
Dual-Ported RAM or FIFO to time multiplex each of the
data samples, bits serially, through the logic. The
performance for Serial-Sequential Distributed Arithmetic
is rated at the maximum data rate for a given number of
data-bits with SDA, divided by the number of Taps. This
technique is typically useful for data sample rates below
1 MHz.
PS_BITS_1
Bi
Di
Ai
MS
P
REG
+
LS
SUM
1/4
Bi
C
S-REG
1/2
B
S-REG
R
E
G
LS
A BITS[(n-2),...,4,2,0]
S-REG
2-Bit Scaling
Accumulator
Ci
LOOKUP
TABLE
PS_BITS_0
D
S-REG
Di
Figure 20.
LUT-Based 2-Bit Parallel
Distributed Arithmetic for a four-product MAC.
Parallel Distributed Arithmetic:
The performance for the 16-Tap FIR filter example,

implemented with a 2-Bit PDA algorithm, resulted with a
sample rate of 16 MHz, in 130 CLBs [see Figure 21].
Parallel Distributed Arithmetic (PDA) is used to

increase the overall performance of Serial Distributed
Arithmetic. With PDA, the number of bits being
processed during each clock cycle is increased. Note that
increasing the number of bits sampled has a significant
effect on the number of CLBs used for the design.
Therefore, the number of parallel bits sampled should be
increased only to meet the required performance.
A filter design implemented in an FPGA with 2-Bit PDA

results in twice the performance and about twice the
number of CLBs (e.g. 8.2-CLBs at 3.9 nsec, per Tap)
compared to the same function with SDA. 2-Bit PDA
uses a less modest number of CLBs then that of SDA,
while still processing all data samples (Taps) in parallel,
at twice the SDA data sample rate.
Increasing the number of bits processed from 1-bit, in the

case of SDA, to a 2-bit PDA results in half the number of
processing clock cycles [see Figure 20]. Hence, 2-Bit
PDA results in twice the throughput. With 2-bit PDA,
the serial shift registers, referenced in the discussion on
SDA, are each replaced with two similar 1-bit shift
registers at half the bit depth. The two parallel shift
registers are spilt, such that one stores the even-bits and
the other stores the odd-bits. The 2-bit parallel data
samples require twice the number of LUTs. There is also
V.1.0
The number of bits being processed during each clock

cycle can be increased until an n-Bit PDA
implementation is reached, for n-bit data samples. When
the design is an n-Bit PDA, the sample data rate is at a
maximum. For an XC4000E-3 device this will enable a
single chip solution with a sustained data sample rate
between 50 and 70 MHz.

D 0 Bits[7,5,3,1]
S-REG
S-RAM
S-RAM
S-RAM
T0
T15
SERIAL
ADD
A0
T1
T14
SERIAL
ADD
A1
LUT
T2
S-RAM
S-RAM
T13
Symmetrical 16-Tap
FIR Filter Using
2-Bit Parallel Distributed
Arithmetic
SERIAL
ADD
A2
S-RAM
SERIAL
ADD
A3
T11
SERIAL
ADD
A0
T5
T10
SERIAL
ADD
A1
T9
SERIAL
ADD
A2
T7
T8
SERIAL
ADD
A3
T12
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
REG
S-RAM
SERIAL
ADD
PS_01
S-RAM
T13
A0
S-RAM
T12
REG
MS
REG
+
LSB
LSB
SUM
SERIAL
ADD
A2
S-RAM
S-RAM
S-RAM
T11
T5
T10
SERIAL
ADD
SERIAL
ADD
SERIAL
ADD
S-RAM
S-RAM
S-RAM
T9
REG
D 6,2
REG
SERIAL
ADD
A0
A1
A2
PS_23
+
Sign Extended
MSB
REG
LSB
REG
D 5,2
REG
D 4,2
REG
1/2
+
D 3,2
REG
D 2,2
REG
D 1,2
REG
D 0,2
REG
D 7,1
REG
D 6,1
REG
BITS_2
REG
+
Sign Extended
MSB
D 5,1
REG
D 4,1
REG
D 3,1
REG
D 2,1
REG
REG
REG
D 0,1
REG
D 7,0
REG
D 6,0
REG
Sign Extended
MSB
BITS_1
REG
REG
REG
D 3,0
REG
Sign Extended
MSB
REG
PS_O1
LSB
1/2
+
REG
T7
T8
SERIAL
ADD
A3
D 2,0
REG
D 1,0
REG
D 0,0
REG
BITS_0
LUT
Figure 21.
Symmetrical 16-Tap FIR
using 2-Bit Parallel Distributed Arithmetic.
Filter
With PDA, each additional parallel bit requires an

additional level of scaling (by powers of 2) and
summation for each partial product pair of bits [see
Figure 22].
The LUTs for SDA and PDA can always be the same for
any given 4-MAC block, regardless of the number of bits
in the sample data. This is true for PDA only if common
bit-weighted sample inputs are used to address the LUT,
as shown in Figure 22.
REG
First 4-Bits of 8-Taps

in a 16-Tap, 8-Bit FIR Filter
Using Parallel Distributed Arithmetic
Figure 22.
16-Tap FIR Filter using LUT-Based
Parallel Distributed Arithmetic.
The filter design example implemented in an FPGA with
8-Bit PDA resulted in more than seven times the
performance and about four and a half times the number
of CLBs (e.g. 19-CLBs at 1.1 nsec, per Tap) compared to
the same function with SDA. Full PDA uses a much
larger number of CLBs then that of SDA, while still
processing all data samples (Taps) in parallel, at the data
sample rate.
The performance for the symmetrical 16-Tap FIR filter

example, implemented with an 8-Bit PDA algorithm,
resulted in a sample rate of 58 MHz, in 270 CLBs. A
non symmetrical 16-Tap FIR filter, as shown in Figure
22, resulted in a sample rate of 55 MHz, in 400 CLBs.
V.1.0
PS_07
PS_03
LSB
REG
D 4,0
LSB
1/4
LUT
D 5,0
+
REG
REG
PS_BITS_0
1/16
REG
D 1,1
SUM
REG
LUT
A3
LUT
T6
S-RAM
REG
D 7,2
REG
REG
T4
S-RAM
D 0,3
BITS_3
LUT
1/2
A1
T3
S-RAM
REG
REG
LUT
T2
S-RAM
D 2,3
D 1,3
LUT
2-Bit Scaling
Accumulator
1/4
SERIAL
ADD
REG
T1
T14
REG
REG
S-RAM
REG
D 3,3
LUT
D 0 Bits[6,4,2,0]
S-REG
D 4,3
REG
REG
+
T0
T15
REG
LUT
T6
S-RAM
D 5,3
PS_47
LUT
T4
S-RAM
REG
REG
PS_BITS_1
REG
D 6,3
LUT
REG
T3
S-RAM
D 7,3
10

Other Resources:
Glossary of Terms:
DSP Applications Group
ASICApplication-Specific Integrated Circuit,

commonly called a gate array.
For more assistance in FPGA-Based DSP related

applications, information or for help on implementing an
algorithm in programmable logic, contact the Xilinx DSP
Applications Group. Contact them via E-mail at
dsp@xilinx.com or FAX: 408-879-4442.
CPLDComplex Programmable Logic Device. Also

called EPLD or Erasable Programmable Logic Device.
DADistributed Arithmetic. An alternate approach to
implementing arithmetic functions.
DSPDigital Signal Processing
Xilinx WebLINX DSP Site

Xilinx has a home page on the World-Wide Web that
includes a special section for DSP. The WebLINX
UURL for DSP is
http://www.xilinx.com/appsweb.htm#DSP
Application notes and data sheets are also available via

WebLINX.
FIRFinite-Impulse Response. A type of digital filter.

FPGAField Programmable Gate Array.
HDLHardware Description Language such as VHDL
and Verilog.
IIRInfinite-Impulse Response. A type of digital filter.
MACMultiply/Accumulator. A DSP processor
provides good performance in DSP applications because
it executes a multiply and an addition in a MAC unit in a
single clock cycle.
References:
[1] Goslin, G. R. Using Xilinx FPGAs to Design
Custom Digital Signal Processing Devices, Proceedings
of the DSPX 1995 Technical Proceedings pp. 595-604,
12JAN95.
[2] Goslin, G. R. & Newgard, B. 16-Tap, 8-Bit FIR
Filter Application Guide, Xilinx Publications,
21NOV94.
[3] The programmable Logic Data Book, Xilinx, Inc.
Second Edition, 1994
[4] XC4000E Data Sheet, Xilinx, Inc. Latest
Version.
NRENon-Recurring Engineering. The set-up or mask

charges required to build an ASIC.
PDAParallel Distributed Arithmetic. An alternate
approach to implementing arithmetic functions. Efficient
and high performance in some FPGA architectures.
PSCParallel to Serial Converter.
SDASerial Distributed Arithmetic.
An alternate
approach to implementing arithmetic functions. Very
efficient and a good compromise between speed and
density in some FPGA architectures.
NOTES:
[5] Strauss, W. DSP Strategies for the 90s: The MidDecade Outlook, Forward Concepts, JAN95.
[6]
XC6200 Data Sheet, Xilinx, Inc. Latest Version.
[7] Viterbi, A. J. & Omura, J. K. Principles of Digital

Communication and Coding, McGraw-Hill, New York,
1965.
V.1.0
11

A Guide To Using Field Programmable Gate Arrays (Fpgas) For Application-Specific Digital Signal Processing Performance

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A Guide To Using Field Programmable Gate Arrays (Fpgas) For Application-Specific Digital Signal Processing Performance

Uploaded by

Copyright:

Available Formats

A Guide to Using Field Programmable Gate Arrays (FPGAs) for

Application-Specific Digital Signal Processing Performance

magnitudes over off-the-shelf DSP solutions. One

1995 XILINX, Inc. All rights reserved.

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

Data Sample Rate (MHz)

processor becomes less efficient when an algorithm is

1-Bit Serial Distributed Arithmetic

FPGA-Based DSP Accelerator:

Dedicated Arithmatic Processing

Basic DSP Block Diagram

System performance requirements continue to increase.

1995 XILINX, Inc. All rights reserved.

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

about 80% of the overall processing time. Note this did

Basic DSP Algorithm Process

The FPGA-based DSP accelerator is conceptually

Viterbi Decoder Block Diagram

There were two limiting factors for this DSP-based

FPGA & DSP Based Viterbi Decoder:

A Viterbi Decoder[7] is a good example of how the

FPGA-Based Viterbi Decoder Block

The design conversion resulted in the following

1995 XILINX, Inc. All rights reserved.

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

Digital Filter Design:

One example is a Serial Distributed Arithmetic (SDA),

16-Tap, 8-Bit FIR Filter

When building a Digital Filter in an FPGA, the design

FPGA Based Filters:

Relative Performance (@50MHz)

Processing Time (ns)

processing time (this includes all of the outputs)

The MAC function can be implemented more efficiently

This example can also be implemented using multiple

1995 XILINX, Inc. All rights reserved.

The processing engine of most filter algorithms is a

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

with any 16-Tap configuration with signed or unsigned,

# of Multiply & Accumulates per Sample

FPGA Based 16-Tap FIR Filter:

Programmable DSP Performance

1995 XILINX, Inc. All rights reserved.

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

Each multiplier forms partial products by multiplying the

Using Distributed Arithmetic, as shown in Figure-12, the

LET A = A 3 A 2 A 1 A 0 and Y = Y 3 Y 2 Y 1 Y 0 then, A*Y = P is,

1995 XILINX, Inc. All rights reserved.

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

Using 4-bit LUT-Based

Contents of a 16-Word Look-Up

In the four-MAC example, the width of the LUT is

1995 XILINX, Inc. All rights reserved.

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance

the data flow can be changed to adapt to the input

The outputs from each LUT and ADDER can be

The data in the remaining RAM-based shift registers are

16-Tap FIR Filter

When serial adders are used to reduce the number of

Serial Adder input/output diagram.

A filter design implemented in an FPGA with SDA gives

1995 XILINX, Inc. All rights reserved.

A Guide to Using FPGAs for Application-Specific Digital Signal Processing Performance