
Building Custom FIR Filters Using System Generator

James Hwang and Jonathan Ballagh


Xilinx Inc., 2100 Logic Drive, San Jose, CA 95124 (USA) {Jim.Hwang, Jonathan.Ballagh}@xilinx.com

Abstract. System Generator is a high-level design tool well suited to creating custom DSP data paths in FPGAs. While providing a high-level abstraction of an FPGA circuit, it can be used to build designs comparable to hand-crafted implementations in terms of area and performance. In this paper we use a MAC-based FIR filter design example to demonstrate the interplay between mathematical abstraction and hardware-centric considerations enabled by System Generator. We demonstrate how an algorithm can be efficiently mapped onto FPGA resources and present the hardware results of several System Generator FIR filter implementations.

1 Introduction
There has been considerable recent progress in software tool development to support DSP applications in FPGAs. System Generator is a high-level design tool for Xilinx FPGAs that extends the capabilities of Simulink to include bit- and cycle-accurate modeling of FPGA circuits, and generation of an FPGA circuit from a Simulink model [3, 4]. System Generator provides robust Simulink libraries for arithmetic and logic functions, memories, and DSP functions. By supporting high-level modeling and automatic code generation, System Generator creates new opportunities to explore the interplay between mathematical abstraction and hardware-centric considerations. In this paper we discuss several implementations of FIR filters. We use System Generator to map an algorithm onto FPGA resources, including shift register logic, dedicated multipliers, memory structures, and the logic fabric of Virtex™ and Virtex-II™ FPGAs. We then compare area-performance tradeoffs for different filter structures.

2 Hardware Modeling in System Generator


Virtex family FPGAs provide dedicated circuitry for building fast, compact adders, multipliers, and flexible memory architectures in the logic fabric [1]. System Generator provides abstractions for these resources; the use of a silicon feature is either inferred or available through configuration parameters on the block interfaces. The Addressable Shift Register (ASR) block abstracts the SRL16 memory configuration [5], with the capability of running delay and address inputs at different rates. System Generator Multiplier blocks provide options for combinational and pipelined structures built in the FPGA fabric, and where available, embedded multipliers. The memory blocks can target either distributed or block RAM, the latter in either a single-ported or dual-ported configuration.
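As a rough behavioral sketch of the ASR idea (not the System Generator block or the SRL16 primitive itself), an addressable shift register can be modeled as a fixed-depth delay line with a read address. The class name `AddressableShiftRegister` and its methods are hypothetical and chosen only for illustration; the default depth of 16 mirrors the SRL16.

```python
class AddressableShiftRegister:
    """Behavioral model of a fixed-depth shift register with an addressable read port."""

    def __init__(self, depth=16):
        self.depth = depth
        self.taps = [0] * depth  # taps[0] holds the newest sample

    def shift(self, sample):
        """Shift in a new sample; the oldest sample falls off the end."""
        self.taps = [sample] + self.taps[:-1]

    def read(self, addr):
        """Return the sample delayed by `addr` shifts (0 = newest)."""
        return self.taps[addr]


# Shifting and reading are separate operations, so in a model like this the
# read address is free to change several times between shifts.
asr = AddressableShiftRegister(depth=4)
for x in [10, 20, 30]:
    asr.shift(x)
newest = asr.read(0)
two_shifts_ago = asr.read(2)
```

Keeping `shift` and `read` independent is what lets the delay input and the address input run at different rates.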

3 Building MAC-based FIR Filters


A versatile FIR filter architecture that maps well onto FPGA resources employs multiply-accumulate (MAC) engines. In general, N MAC operations are required to compute an output sample for an N-tap filter. A System Generator model of a single-MAC FIR filter is shown in Figure 1. This architecture is very compact, and is a reasonable choice for low-throughput applications or when N is small.

Fig. 1. Single-MAC FIR Filter

3.1 Configuring the Data Path
The tapped delay line is implemented using an ASR, which provides both a compact design and simple addressing requirements. The delay line runs at the data rate, but the rest of the filter, including the address port of the ASR, runs at N times this rate. The impulse response of the filter is stored in a single-port ROM having user-specifiable output precision (total and fractional bits), arithmetic type (signed or unsigned), and implementation option (distributed or block RAM). The multiply-accumulate (MAC) engine comprises a multiplier block and an accumulator block. The multiplier computes the product of a filter tap and a sample from the data buffer, and the accumulator computes a running sum of these products. The multiplier can be implemented either in the logic fabric or using embedded multipliers. The accumulator is configured to reinitialize upon reset to its current input value to avoid a one-clock-cycle stall at the end of each sum-of-products computation. A register captures the output of the MAC engine before it is reset. The capture register output is down-sampled so the filter output rate matches its input rate.
The filter coefficient ROM is initialized from a MATLAB array bound to the model from the MATLAB workspace. Memory, counter, and down-sampler block parameters are specified in terms of this array. Multiplier and accumulator widths are defined as MATLAB functions of ROM word size, coefficient values, and filter input precision. Consequently, the model requires no modification to accommodate a change in the impulse response. The implications of this ability in System Generator to exploit the MATLAB interpreter during model customization should not be underestimated.
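The paper does not list the width expressions it uses, so the following is only a sketch of one common worst-case sizing rule, written in Python rather than MATLAB. The function names are hypothetical, coefficients are assumed to be already quantized to integers, and the input is assumed to be signed.

```python
def multiplier_width(coef_bits, input_bits):
    """Full-precision product width for two signed operands."""
    return coef_bits + input_bits


def accumulator_width(coefs, input_bits):
    """Smallest signed width that holds the worst-case sum of products.

    `coefs` are integer (already quantized) coefficients. The input is
    assumed signed with `input_bits` bits, so |x| <= 2**(input_bits - 1).
    """
    worst_case = sum(abs(c) for c in coefs) * (1 << (input_bits - 1))
    return worst_case.bit_length() + 1  # +1 for the sign bit


coefs = [3, -5, 7, -5, 3]   # illustrative taps, not taken from the paper
mult_w = multiplier_width(12, 12)
acc_w = accumulator_width(coefs, 12)
```

Because both widths are computed from the coefficient array and the input precision, changing the impulse response changes the widths without any manual edit, which is the property the text describes.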

3.2 Filter Control Logic
A single counter, configured to count from 0 to N-1 repeatedly, generates addresses for both the coefficient ROM and the input data ASR block. The counter's output sample period defines the rates of the downstream data and coefficient buffers using Simulink's sample time propagation rules. Since the filter requires addresses to change at N times the input data rate, the counter's sample period is 1/N of the input sample period. For every new input sample, the accumulator block is reset to its current input, and the capture register latches the MAC engine output. This occurs when the address is zero; a relational block detects the condition.
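The control scheme described above can be sketched as a small cycle-level simulation. This is a simplified software model under stated assumptions, not the System Generator design: block latencies are ignored, arithmetic is unbounded integer, and the function name `single_mac_fir` is hypothetical.

```python
def single_mac_fir(samples, coefs):
    """Simulate a single-MAC FIR filter, one inner iteration per MAC clock."""
    n_taps = len(coefs)
    delay_line = [0] * n_taps   # tapped delay line, index 0 = newest sample
    acc = 0                     # accumulator register
    capture = 0                 # capture register
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]      # delay line shifts at the data rate
        for addr in range(n_taps):              # counter: 0 .. N-1 at N x data rate
            product = coefs[addr] * delay_line[addr]
            if addr == 0:                       # relational block: address == 0
                capture = acc                   # latch the finished sum of products
                acc = product                   # re-initialize to the current input
            else:
                acc += product
        outputs.append(capture)                 # down-sample: one output per input
    return outputs


result = single_mac_fir([1, 2, 3, 4], [1, 10, 100])
```

In this model each output is the sum of products for the previous input sample, because the capture register latches a result only when the next computation begins. Re-initializing the accumulator to the current product, rather than clearing it to zero, means no cycle is spent on a separate reset.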

3.3 Multiple-MAC Architectures
A single-MAC architecture has the drawback that throughput is inversely proportional to the number of filter taps. Throughput can be increased dramatically by exploiting parallelism. In theory the computation can be fully parallelized, e.g., using a direct-form implementation [6], but System Generator and the FPGA allow a preferable solution that matches resource usage and availability to throughput. The tapped delay line of the single-MAC architecture can be partitioned into cascaded sections, each serviced by a separate MAC engine. The outputs of these MAC engines are combined in an adder tree to compute the filter output. With this approach it is possible to increase the throughput over a single-MAC filter while keeping the resource consumption as small as possible. The entire data path can be fully pipelined in System Generator. As an example, suppose a MAC engine can operate at 150 MHz, and a 144-tap filter is required to run at a data rate of 3.125 MHz. Then each MAC can service 150/3.125, or 48, filter taps, which implies a cascade section length of 48, and 144/48, or three, such sections for the filter.
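The sizing arithmetic in the example, and the partitioning of the delay line into sections, can be sketched as follows. The function names are hypothetical, and the plain `sum` over partial results merely stands in for the adder tree.

```python
import math


def plan_sections(n_taps, mac_clock_mhz, data_rate_mhz):
    """Return (taps serviced per MAC engine, number of cascaded sections)."""
    taps_per_mac = int(mac_clock_mhz // data_rate_mhz)
    n_sections = math.ceil(n_taps / taps_per_mac)
    return taps_per_mac, n_sections


def multi_mac_output(delay_line, coefs, section_len):
    """Compute one output sample with one MAC engine per delay-line section."""
    partial_sums = []
    for start in range(0, len(coefs), section_len):
        section_coefs = coefs[start:start + section_len]
        section_data = delay_line[start:start + section_len]
        # Each MAC engine accumulates the products for its own section.
        partial_sums.append(sum(c * x for c, x in zip(section_coefs, section_data)))
    return sum(partial_sums)  # stands in for the adder tree


taps_per_mac, n_sections = plan_sections(144, 150, 3.125)
y = multi_mac_output(list(range(6)), [1, 2, 3, 4, 5, 6], section_len=2)
```

For the figures in the text, `plan_sections(144, 150, 3.125)` gives 48 taps per MAC engine and three sections.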

4 Results
All results were obtained using the Xilinx ISE Series 4.2i software and XST synthesis tool to target an XC2V250-6 part. All models use dedicated multipliers, block memory for coefficient storage, and 12-bit precision for filter input and coefficients.
Table 1. Performance of 16-tap MAC FIR filters with pipelined and non-pipelined dedicated multipliers.

Pipelined Multiplier   Frequency (MHz)   Slices   % Utilization   Data Rate (MHz)
No                     95.21             51       3               5.95
Yes                    209.63            77       5               13.10

The results in Table 1 demonstrate the effectiveness of using the pipeline stages of the dedicated multipliers. If the lower throughput is tolerable, the filter can be implemented in only 51 slices. Table 2 illustrates the tradeoff between filter size and throughput by providing the performance results of three 64-tap FIR filters with a varying number of MAC engines. It can be seen that in this experiment, the operating clock frequency is not particularly sensitive to the number of MAC engines employed.
Table 2. Performance of fully pipelined non-symmetric 64-tap FIR filters with one, two, and four MAC engines.

MAC Engines   Frequency (MHz)   Slices   % Utilization   Data Rate (MHz)
1             194.74            110      7               3.04
2             182.18            187      12              5.69
4             193.98            313      20              12.12
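The last column of each table appears to follow from the clock frequency, the tap count, and the number of MAC engines. The relationship below is inferred from the reported numbers rather than stated in the text, and the function name is hypothetical.

```python
def data_rate_mhz(clock_mhz, n_taps, n_macs=1):
    """Each MAC engine services n_taps / n_macs taps per input sample."""
    return clock_mhz * n_macs / n_taps


# Table 1: 16-tap filters with a single MAC engine.
table1 = [data_rate_mhz(95.21, 16), data_rate_mhz(209.63, 16)]

# Table 2: 64-tap filters with one, two, and four MAC engines.
table2 = [
    data_rate_mhz(194.74, 64, 1),
    data_rate_mhz(182.18, 64, 2),
    data_rate_mhz(193.98, 64, 4),
]
```

Rounded to two decimal places, these values agree with the figures reported in Tables 1 and 2.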

5 Conclusions
Modern FPGAs are capable of implementing high-performance DSP functions, but the relative unfamiliarity of many DSP engineers with hardware design has slowed their wider adoption. Tools like System Generator provide means for both high-level modeling of DSP algorithms and implementing custom high-performance DSP data paths in FPGAs. Through the use of simple examples, we have demonstrated several general techniques for building efficient MAC-based FIR filters from a system-level development environment.

References
1. Xilinx, Inc., Virtex-II Platform FPGA Handbook, 2001.
2. The Mathworks Inc., Matlab, Getting Started with Matlab, Natick, Massachusetts, 1999.
3. The Mathworks Inc., Simulink, Dynamic System Simulation for Matlab, Using Simulink, Natick, Massachusetts, 1999.
4. J. Hwang, B. Milne, N. Shirazi, and J. Stroomer, System Level Tools for DSP in FPGAs, FPL 2001, Lecture Notes in Computer Science, pp. 534-543, 2001.
5. K. Chapman, Saving Costs with the SRL16E, Xilinx techXclusive, 2000.
6. S. Mitra, Digital Signal Processing: A Computer Based Approach Edition, McGraw-Hill, 2001.
