Professional Documents
Culture Documents
A Guide To Using Field Programmable Gate Arrays (Fpgas) For Application-Specific Digital Signal Processing Performance
A Guide To Using Field Programmable Gate Arrays (Fpgas) For Application-Specific Digital Signal Processing Performance
Introduction:
Traditionally, digital signal processing (DSP) algorithms
are implemented using general-purpose (programmable)
DSP chips for low-rate applications, or special-purpose
(fixed function) DSP chip-sets and application-specific
integrated circuits (ASICs) for higher rates.
Advancements in Field Programmable Gate Arrays
(FPGAs) provide new options for DSP design engineers.
The FPGA maintains the advantages of custom
functionality like an ASIC while avoiding the high
development costs and the inability to make design
modifications after production. The FPGA also adds
design flexibility and adaptability with optimal device
utilization while conserving both board space and
system power, which is often not the case with DSP
chips. When a design demands the use of a DSP, or
time-to-market is critical, or design adaptability is
crucial, then the FPGA may offer a better solution.
The SRAM-based FPGA is well suited for arithmetic,
including Multiply & Accumulate (MAC) intensive DSP
functions. A wide range of arithmetic functions (such as
Fast Fourier Transforms (FFTs), convolutions, and
other filtering algorithms) can be integrated with
surrounding peripheral circuitry. The FPGA can also be
reconfigured on-the-fly to perform one of many systemlevel functions.
When building a DSP system in an FPGA, the design
can take advantage of parallel structures and arithmetic
algorithms to minimize resources and exceed the
performance of single or multiple general-purpose DSP
devices.
Distributed Arithmetic[1] for array
multiplication in an FPGA is one way to increase data
bandwidth and throughput by several order of
V.1.0
FPGA Performance
Characteristics for
Distributed
Arithmetic
55
50
45
FPGA
Region
Full Parallel DA
@70-50MHz
Memory-A Memory-B
4-Bit Parallel DA
@35MHz
Control Logic
I/O
40
35
30
25
20
2-Bit Parallel DA
@16MHz
15
10
5
8-Bit Serial DA
@8MHz
DSP
Region
0
1
8
16 24 32 40 48
# of Multiply & Accumulates per Sample
Figure 1.
Distributed Arithmetic Performance
for 16-Tap, 8-Bit Fixed Point, FIR Filter.
The FPGA design cycle requires less hardware-specific
knowledge than most DSP chips or ASIC design
solutions. Smaller design groups, with less experienced
engineers, can design larger, more complex DSP systems
in less time than larger design groups with more
experienced engineers who are required to know devicespecific programming languages. The FPGA-based DSP
system-level design team can design, test, verify, and
ready a complex DSP system for production in weeks.
V.1.0
DSP Device
Figure 2.
clock cycles into the FPGA and allow the DSP processor
to concentrate on optimized single-clock functions.
Data_In
Begin
Fetch - A
Fetch - D
Fetch - C
Fetch - B
Fetch - E
g(C,F) = G
+
+
M
U
X
+
-
f(A,B) = C
f(D,E) = F
Send - G
I/O Bus
New_1
MSB
INC
Diff_1
I/O Bus
Diff_2
Send - C
Adaptive
Changes
Data_Out
+
+
Old_2
MSB
M
U
X
New_2
Prestate Buffer
24-bit
Figure 3.
1 0 Bit
24-bit
Case Study
Figure 4.
Viterbi Decoder
+
+
R
E
G
+
-
V.1.0
24-bit
I/O Bus
INC
M
U
X
R
E
G
R
E
G
New_1
MSB
R
E
G
Diff_1
I/O Bus
Old_2
+
+
Diff_2
R
E
G
MSB
M
U
X
R
E
G
R
E
G
Prestate Buffer
Optional
Pipelining
Register's
Figure 5.
Diagram
New_2
R
E
G
24-bit
24-bit
1 0 Bit
24-bit
200
360 ns
100
62 % Better
Performance with
FPGA Co-processor
135 ns
0
Two 66 MHz DSPs
Six 15 ns RAMs
66 MHz DSP+FPGA
Three 15 ns RAMs
Figure 6.
Performance of two Viterbi decoder
implementations. The DSP+FPGA solution is 2.7
times faster
This design can also be implemented in a multiplexed
version with the original 33 MHz I/O Data transfer rate.
This multiplexed implementation minimizes the CLB
resources used.
This is done by observing the
symmetrical nature of the design. Note that Old_1 and
Old_2 Datum follow the same path with only a minor
difference in the magnitude of the second stage SUBblock. This implementation requires a specific ordering
for data accesses on the I/O Data Bus.
This is one example of how the FPGA can accelerate a
DSP function. To use the FPGA in a DSP design,
identify the parallel data paths and/or the operations
requiring multiple clock cycles in the DSP.
18
16
14
FPGA
300
12
10
8
6
4.00
2.60
2
0.18
1.00
FPGA
400
V.1.0
PDA
DSP
P-5
Hardware Device
DSPx4
Figure 7.
Relative performance for various
implementations of a 16-Tap, 8-Bit FIR filter
compared to a 50 MHz fixed-point DSP processor
SDA
Case Study
16-Tap, 8-Bit FIR Filter:
The 16-Tap FIR is a discrete-time filter in which the
output is an explicit function only of the present and
previous inputs to compute the weighted average of the
16-data sample points. Since the FIR response,
mathematically, contains only feed-forward terms, the
FIR is unconditionally stable and can be designed to
exhibit a linear phase response.
REG
Data Input
X[7:0]
REG
REG
PSC
0
C0
15
REG
1
C1
REG
14
REG
2
C2
REG
13
REG
REG
3
C3
12
REG
4
11
10
REG
REG
6
C5
REG
REG
5
C4
DSP Processors
50
REG
C6
C7
-1
REG
Data Output
Y[9:0]
# of DSP Cores
4x-DSP MCM
3x-DSP MCM
2x-DSP MCM
1x-DSP
45
40
35
30
25
FPGA
Region
20
15
10
5
0
4
8 12 16 20 24 28 32 36 40 44 48
Figure 8.
DSP
Region
Figure 9.
16-Tap FIR filter Data Flow Block
Diagram with Symmetrical Coefficients
Distributed Arithmetic:
Distributed Arithmetic differs from conventional
arithmetic only in the order in which it performs
operations.
Take for example a four-product MAC function that uses
a conventional sequential shift-and-add technique to
multiply four pairs of numbers and sum the results [see
Figure 10].
V.1.0
A
Y0
SHIFT-REG
MAC Block
Ai
+
+
S-REG
MS
P
REG
LS
S-REG
PA
B
S-REG
S-REG
Bi
PB
MAC
C
S-REG
S-REG
Ci
Y2
Di
Y3
PC
PD
SHIFT-REG
Ai
1-Bit Scaling
Accumulator
+
Y2
PC
Ci
+
+
Y3
PD
Di
MS
REG
LS
SUM
+
1/2
Figure 12.
Serial Distributed Arithmetic for a
four-product MAC.
MAC
Figure 10.
Conventional four-product MAC
using shift-and-add technique to multiply pairs and
sum the result
PB
MAC
D
S-REG
SUM
Bi
PA
Y1
C
+
1/2
Y1
Y0
Ai
MAC Block
+
+
MS
P
REG
LS
S-REG
Ai
1/2
S-REG
Figure 11.
Four-Bit MAC using shift-and-add
technique to multiply
The four-multiplications are performed simultaneously
and the results are then summed when the products are
complete. This process requires n-clock cycles for data
sample of n-bits. Hence, the processing clock rate is
equal to the data rate divided by the number of data bits.
During each data clock-cycle, the four-multipliers
simultaneously create four-product terms [i.e., in Figure
10; PA, PB, PC, & PD], that eventually are summed into
the output. Distributed Arithmetic differs from this
process by adding the partial-products before, rather than
after, the bit-weighted accumulation.
V.1.0
C
S-REG
Ci
LOOKUP
TABLE
+
+
D
S-REG
1-Bit Scaling
Accumulator
Bi
Di
MS
P
REG
LS
SUM
1/2
Figure 13.
LUT-Based Serial Distributed
Arithmetic for a four-product MAC.
The LUT data, referenced in Figure 14, is composed of
all partial sums of the coefficients (Y0, Y1, Y2, & Y3).
The least significant bit (the output from each serial shift
register) of the 4-data samples addresses the LUT. If all
four data bits are 1, then the output from the LUT is the
sum of all four coefficients. If any data bit is a zero, then
the corresponding coefficient is eliminated from the sum.
Because the address of the LUT contains all possible
combinations of one or zero, based on the four inputs,
the LUT output contains all 16 possible sums of the
coefficients [see Figure 14].
Look-Up Table
Ai
ADDR_0
B
S-REG
Bi
C
S-REG
Ci
ADDR_1
LOOKUP
TABLE
ADDR_2
D
S-REG
Di
ADDR_3
O
U
T
P
U
T
LUT Inputs
< D i Ci Bi
<0 0 0
<0 0 0
<0 0 1
<0 0 1
<0 1 0
<0 1 0
<0 1 1
<0 1 1
<1 0 0
<1 0 0
<1 0 1
<1 0 1
<1 1 0
<1 1 0
<1 1 1
<1 1 1
Ai >
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
0>
1>
LUT Output
Partial Sum
0000...0
YA
YB
Y B +Y A
YC
Y C +Y A
Y C +Y B
Y C +Y B +Y A
YD
Y D +Y A
Y D +Y B
Y D +Y B +Y A
Y D +Y C
Y D +Y C +Y A
Y D +Y C +Y B
Y D +Y C +Y B +Y A
D0
S-REG
16-MAC CIRCUIT
D1
S-REG
D2
LUT
S-REG
D3
S-REG
D4
S-REG
D5
S-REG
D6
LUT
S-REG
D7
Figure 14.
Table, LUT.
S-REG
D8
S-REG
V.1.0
D9
1-Bit Scaling
Accumulator
+
MS
LS
SUM
1/2
S-REG
DA
P
REG
LUT
S-REG
DB
S-REG
DC
S-REG
DD
S-REG
DE
LUT
S-REG
DF
S-REG
Figure 15.
LUT-Based Serial Distributed
Arithmetic for a four-product MAC.
The only difference between a MAC circuit and a FIR
filter design is how the input data is handled [see Figure
16].
The MAC has a series of parallel-to-serial
converters, while the FIR filter has one long bit-wide
shift register that taps into each word. Only the first
word is parallel loaded. The first shift register operates
by parallel-loading the new data sample into a registerbased shift register. The shift register then serially shifts
out all the data-bits, 1-bit at a time. The output bit from
the shift register addresses 1-bit of a corresponding LUT,
until the most significant bit of n-bits has been clocked
out. When the shifting is complete, the first data shift
register is empty and is ready to be parallel loaded with
the next word and the process repeats.
The output bit from the first shift register (the first filter
Tap) is the input to a serial-in and serial-out shift register
(the second filter Tap). Each additional Tap would also
use a serial-in and serial-out shift register with its input
fed serially from the previous cascaded Tap. Note that
INPUT_11
INPUT_12
INPUT_13
INPUT_14
***
***
OUTPUT_06
OUTPUT_07
OUTPUT_08
OUTPUT_09
OUTPUT_10
***
S-REG
S-RAM
LUT
REG
S-RAM
***
Figure 17.
Pipelined input/output timing
diagram for a SDA 16-Tap FIR Filter.
If the filter is symmetrical it is possible to further reduce
the resources required for the design. In a symmetrical
filter the number of LUTs (or MACs) can be reduced by
a factor of two. This is done by adding the input data
with a common coefficient prior to the multiplication.
This is done with serial adders. This also reduces the
number of adder stages in the design, as shown in Figure
19.
S-RAM
S-RAM
SUM
A+B
REG
REG
S-RAM
LUT
A + B + C_IN
REG
S-RAM
S-RAM
C_OUT
1-Bit Scaling
Accumulator
S-RAM
MS
LS
REG
C_IN
REG
CLK
P
REG
SUM
1/2
S-RAM
Figure 18.
LUT
S-RAM
S-RAM
REG
S-RAM
LUT
S-RAM
REG
S-RAM
Figure 16.
16-Tap FIR Filter using LUT-Based
Serial Distributed Arithmetic.
V.1.0
REG
S-RAM
T0
T15
SERIAL
ADD
A0
T1
T14
SERIAL
ADD
A1
SERIAL
ADD
A2
SERIAL
ADD
A3
T11
SERIAL
ADD
A0
T5
T10
SERIAL
ADD
A1
LUT
T2
S-RAM
S-RAM
T13
REG
T3
S-RAM
S-RAM
T12
1-Bit Scaling
Accumulator
T4
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
T9
MS
LS
REG
REG
SUM
1/2
SERIAL
ADD
A2
REG
S-REG
S-RAM
T8
BITS[(n-1),...,5,3,1]
Ai
T7
S-RAM
LUT
T6
S-RAM
Symmetrical 16-Tap
FIR Filter Using
Serial Distributed
Arithmetic
SERIAL
ADD
A3
S-REG
Figure 19.
Symmetrical 16-Tap FIR Filter
using LUT-Based Serial Distributed Arithmetic.
S-REG
Ci
LOOKUP
TABLE
D
S-REG
PS_BITS_1
Bi
Di
Ai
MS
P
REG
+
LS
SUM
1/4
Bi
C
S-REG
1/2
B
S-REG
R
E
G
LS
A BITS[(n-2),...,4,2,0]
S-REG
2-Bit Scaling
Accumulator
Ci
LOOKUP
TABLE
PS_BITS_0
D
S-REG
Di
Figure 20.
LUT-Based 2-Bit Parallel
Distributed Arithmetic for a four-product MAC.
V.1.0
T0
T15
SERIAL
ADD
A0
T1
T14
SERIAL
ADD
A1
LUT
T2
S-RAM
S-RAM
T13
Symmetrical 16-Tap
FIR Filter Using
2-Bit Parallel Distributed
Arithmetic
SERIAL
ADD
A2
S-RAM
SERIAL
ADD
A3
T11
SERIAL
ADD
A0
T5
T10
SERIAL
ADD
A1
T9
SERIAL
ADD
A2
T7
T8
SERIAL
ADD
A3
T12
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
S-RAM
REG
S-RAM
SERIAL
ADD
PS_01
S-RAM
T13
A0
S-RAM
T12
REG
MS
REG
+
LSB
LSB
SUM
SERIAL
ADD
A2
S-RAM
S-RAM
S-RAM
T11
T5
T10
SERIAL
ADD
SERIAL
ADD
SERIAL
ADD
S-RAM
S-RAM
S-RAM
T9
REG
D 6,2
REG
SERIAL
ADD
A0
A1
A2
PS_23
+
Sign Extended
MSB
REG
LSB
REG
D 5,2
REG
D 4,2
REG
1/2
+
D 3,2
REG
D 2,2
REG
D 1,2
REG
D 0,2
REG
D 7,1
REG
D 6,1
REG
BITS_2
REG
+
Sign Extended
MSB
D 5,1
REG
D 4,1
REG
D 3,1
REG
D 2,1
REG
REG
REG
D 0,1
REG
D 7,0
REG
D 6,0
REG
Sign Extended
MSB
BITS_1
REG
REG
REG
D 3,0
REG
Sign Extended
MSB
REG
PS_O1
LSB
1/2
+
REG
T7
T8
SERIAL
ADD
A3
D 2,0
REG
D 1,0
REG
D 0,0
REG
BITS_0
LUT
Figure 21.
Symmetrical 16-Tap FIR
using 2-Bit Parallel Distributed Arithmetic.
Filter
REG
Figure 22.
16-Tap FIR Filter using LUT-Based
Parallel Distributed Arithmetic.
The filter design example implemented in an FPGA with
8-Bit PDA resulted in more than seven times the
performance and about four and a half times the number
of CLBs (e.g. 19-CLBs at 1.1 nsec, per Tap) compared to
the same function with SDA. Full PDA uses a much
larger number of CLBs then that of SDA, while still
processing all data samples (Taps) in parallel, at the data
sample rate.
V.1.0
PS_07
PS_03
LSB
REG
D 4,0
LSB
1/4
LUT
D 5,0
+
REG
REG
PS_BITS_0
1/16
REG
D 1,1
SUM
REG
LUT
A3
LUT
T6
S-RAM
REG
D 7,2
REG
REG
T4
S-RAM
D 0,3
BITS_3
LUT
1/2
A1
T3
S-RAM
REG
REG
LUT
T2
S-RAM
D 2,3
D 1,3
LUT
2-Bit Scaling
Accumulator
1/4
SERIAL
ADD
REG
T1
T14
REG
REG
S-RAM
REG
D 3,3
LUT
D 0 Bits[6,4,2,0]
S-REG
D 4,3
REG
REG
+
T0
T15
REG
LUT
T6
S-RAM
D 5,3
PS_47
LUT
T4
S-RAM
REG
REG
PS_BITS_1
REG
D 6,3
LUT
REG
T3
S-RAM
D 7,3
10
Other Resources:
Glossary of Terms:
References:
[1] Goslin, G. R. Using Xilinx FPGAs to Design
Custom Digital Signal Processing Devices, Proceedings
of the DSPX 1995 Technical Proceedings pp. 595-604,
12JAN95.
[2] Goslin, G. R. & Newgard, B. 16-Tap, 8-Bit FIR
Filter Application Guide, Xilinx Publications,
21NOV94.
[3] The programmable Logic Data Book, Xilinx, Inc.
Second Edition, 1994
[4] XC4000E Data Sheet, Xilinx, Inc. Latest
Version.
NOTES:
[5] Strauss, W. DSP Strategies for the 90s: The MidDecade Outlook, Forward Concepts, JAN95.
[6]
V.1.0
11