Synthesis of FPGA-Based FFT Implementations
Hojin Kee1, Newton Peterson2, Jacob Kornerup2, Shuvra S. Bhattacharyya1
1Department of Electrical and Computer Engineering, University of Maryland, College Park, 20742, USA.
2National Instruments Corporation, Austin, 78759, USA.
Overview
We propose a systematic approach for synthesizing field-programmable gate array (FPGA) implementations of fast Fourier transform (FFT) computations. The approach combines two orthogonal techniques, FFT inner loop unrolling and outer loop unrolling, to explore the design space in terms of cost and performance. It achieves cost-optimized FFT implementations subject to user-specified performance levels. The proposed techniques can be retargeted to different kinds of FPGA devices.
Introduction
Fast Fourier transform (FFT) computation potentially requires multi-cycle processing blocks, as its computational complexity is O(N*logN), where N is the number of inputs.
Proposed approaches:
Outer loop unrolling: realizing pipelining by instantiating multiple processing cores across FFT butterfly stages.
Inner loop unrolling: realizing parallelism by allocating multiple cores within each stage.
Our synthesis approach is prototyped in National Instruments LabVIEW FPGA 8.5.
Cost metrics:
Usage of FPGA slices
Usage of block RAMs (BRAMs)
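To make the two loops concrete, the following is a minimal software sketch of an iterative radix-2 decimation-in-time FFT. It is not the LabVIEW FPGA implementation described here; the function names and the bit-reversed input ordering are assumptions chosen for illustration. The outer loop walks the log2(N) butterfly stages and the inner loop walks the N/2 butterflies within a stage, which are the two loops that the unrolling techniques target.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    stages = n.bit_length() - 1
    if n == 0 or (1 << stages) != n:
        raise ValueError("input length must be a power of two")
    # Load the input in bit-reversed order so the output is in natural order.
    a = [complex(x[bit_reverse(i, stages)]) for i in range(n)]
    for s in range(1, stages + 1):      # outer loop: one pass per butterfly stage
        m = 1 << s                      # butterfly group size in this stage
        half = m >> 1
        for j in range(n // 2):         # inner loop: N/2 butterflies per stage
            group, pos = divmod(j, half)
            top = group * m + pos
            bot = top + half
            w = cmath.exp(-2j * cmath.pi * pos / m)   # twiddle factor
            t = w * a[bot]
            a[top], a[bot] = a[top] + t, a[top] - t   # one butterfly operation
    return a
```

Each inner-loop iteration is independent of the others within the same stage, which is what allows several cores to work on one stage in parallel; successive outer-loop iterations depend on each other and can therefore only be overlapped as a pipeline.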
Related Works
Ma [2] developed an efficient method for in-place memory management in FFT implementation, but this approach is restricted to a single butterfly unit.
Nordin et al. [4] presented a parameterized soft core generator for the FFT based on the Pease FFT algorithm with the stride permutation approach proposed by Takala et al. [5].
Jackson et al. [6] proposed a systolic structure to provide for high-throughput FFT implementation.
Distinguishing aspects of our approach:
Realization of data parallelism and pipelining with a carefully configured address generator.
No special permutation structures for butterfly operations.
Efficient utilization of FPGA slices subject to user-defined performance.
Unrolling techniques
A basic FFT core (BFC) provides dedicated hardware for one butterfly operation.
A k-times throughput improvement can be obtained in two ways:
Outer loop unrolling: running BFCs simultaneously across stages.
Inner loop unrolling: incorporating parallelism inside the BFC within a given stage.
The two unrolling techniques have different cost functions in terms of usage of FPGA slices or BRAMs.
The two approaches should be considered jointly for cost-efficient FPGA-based FFT implementation.
Outer loop unrolling introduces k identical copies of the sub-FFT core, so hardware cost is expected to increase by a factor of k. The trade-offs associated with outer loop unrolling are complemented by inner loop unrolling.
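The effect of the two unrolling factors on throughput can be sketched with an idealized cycle-count model. This is an illustration under stated assumptions, not the authors' performance model: it assumes a radix-2 FFT, one butterfly per BFC per cycle, and no memory conflicts or pipeline fill latency. The function name `ideal_cycles` is hypothetical.

```python
import math


def ideal_cycles(n, k_inner=1, k_outer=1):
    """Idealized cycles for one N-point radix-2 FFT (assumed model).

    k_inner: butterflies processed in parallel within a stage.
    k_outer: stages processed concurrently by pipelined cores.
    """
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two >= 2")
    butterflies_per_stage = n // 2
    cycles_per_stage = math.ceil(butterflies_per_stage / k_inner)
    stage_passes = math.ceil(stages / k_outer)
    return stage_passes * cycles_per_stage


baseline = ideal_cycles(1024)           # 10 stages * 512 butterflies
unrolled = ideal_cycles(1024, 3, 2)     # 5 stage passes * 171 cycles
speedup = baseline / unrolled
```

Under these assumptions the speedup is close to the product of the two unrolling factors, which is why configurations with the same product can be compared purely on resource cost.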
[Figure: block diagram; labels recovered from the original: BFC, RAM, switch, reg; address bits b_r b_{r-1} ... b_1; example output addresses 1010 and 1110.]
Cost/Performance Analysis
Cost model for outer loop unrolling and inner loop unrolling. We calibrate the model using synthesis results.

$$u_{inner} = s_{inner} \cdot u_{initial} \cdot (k_{inner} - 1) + u_{initial}$$
$$u_{outer} = s_{outer} \cdot u_{initial} \cdot (k_{outer} - 1) + u_{initial}$$

$u_{inner}$ / $u_{outer}$: amount of utilization after inner/outer loop unrolling
$u_{initial}$: amount of utilization without loop unrolling
$k_{inner}$ / $k_{outer}$: unrolling factors
$s_{inner}$ / $s_{outer}$: the slope of the linear plots from synthesis for inner (outer) loop unrolling
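The linear cost model can be exercised with a short sketch. The `unrolled_utilization` function follows the formula above directly. Everything else is an assumption made for illustration: that the achieved speedup is the product of the two unrolling factors, that the combined cost can be approximated by applying the outer model on top of the inner model, and the function names themselves. Slope values must come from synthesis; none are given here.

```python
def unrolled_utilization(u_initial, slope, k):
    """Linear model from the text: u = s * u_initial * (k - 1) + u_initial."""
    return slope * u_initial * (k - 1) + u_initial


def factor_pairs(target_speedup):
    """All (k_inner, k_outer) with k_inner * k_outer == target_speedup.

    Assumes speedup is the product of the two unrolling factors.
    """
    return [(k, target_speedup // k)
            for k in range(1, target_speedup + 1)
            if target_speedup % k == 0]


def rank_configs(u_initial, s_inner, s_outer, target_speedup):
    """Rank candidate configurations by estimated utilization (lowest first).

    Assumption: the combined cost applies the outer-loop model to the
    utilization obtained after inner-loop unrolling.
    """
    ranked = []
    for k_inner, k_outer in factor_pairs(target_speedup):
        u_after_inner = unrolled_utilization(u_initial, s_inner, k_inner)
        u_total = unrolled_utilization(u_after_inner, s_outer, k_outer)
        ranked.append(((k_inner, k_outer), u_total))
    return sorted(ranked, key=lambda item: item[1])
```

For a target speedup of 6, `factor_pairs(6)` yields the candidates (1, 6), (2, 3), (3, 2), and (6, 1); which one ranks first depends entirely on the calibrated slopes for the resource of interest (slices or BRAMs).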
Experimental Results
Figure 3 reports the FPGA resource utilization when the target speedup is 6. The configuration (kinner, kouter) = (3, 2) gives the best resource utilization at that target speedup, which matches the result predicted by the analytical cost function. At the streaming FFT performance level, our approach requires 23% fewer FPGA slices than the Xilinx core, but 140% more BRAMs. At the sequential performance level, our approach requires 30% fewer slices and 17% more BRAMs.
Conclusion
Our approach incorporates efficient FFT address generation and memory management, and applies two orthogonal loop unrolling methods to provide a tunable trade-off between performance and FPGA resource costs. We also develop an analytical approach for high-level design space exploration, which allows one to estimate the most resource-efficient FFT architecture configuration for a given throughput constraint and a given critical target resource. A distinguishing characteristic of our approach, compared to commercially available FFT IP cores, is that we provide a systematic method to generate an FPGA-based FFT architecture while taking into account trade-offs between performance and cost.
References
[1] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, vol. 19, no. 90, pp. 297-301, 1965.
[2] Y. Ma, An Effective Memory Addressing Scheme for FFT Processors, IEEE Transactions on Signal Processing, vol. 47, issue 3, pp. 907-911, March 1999.
[3] W. Wolf, FPGA-Based System Design, Prentice Hall, 2004.
[4] G. Nordin, P. A. Milder, J. C. Hoe, M. Puschel, Automatic Generation of Customized Discrete Fourier Transform IPs, Design Automation Conference, pp. 471-474, 2005.
[5] J. Takala, T. Jarvinen, P. Salmela, and D. Akopian, Multi-port interconnection networks for radix-r algorithms, in Proc. IEEE Intl. Conf. Acoustics, Speech, Signal Processing, 2001.
[6] P. A. Jackson, C. P. Chan, J. E. Scalera, C. M. Rader, and M. M. Vai, A Systolic FFT Architecture for Real Time FPGA Systems, High Performance Embedded Computing Workshop, 2004.