
A Variable-Size FFT Hardware Accelerator

Based on Matrix Transposition


ABSTRACT:

Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain
of digital signal processing, and the FFT sizes of different applications are very different.
Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the
IEEE-754 single-precision floating-point standard and the FFT calculation with a wide size range
from 2 to 2²⁰ points. First, a parallel Cooley–Tukey FFT algorithm based on matrix transposition
(MT) is proposed, which can efficiently divide a large-size FFT into several small-size FFTs that
can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is
designed, and several FFT performance optimization techniques such as hybrid twiddle factor
generation, multibank data memory, block MT, and token-based task scheduling are proposed.
Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with an area of
2.4 mm² and a power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments
are carried out to evaluate the proposal’s performance in terms of FFT execution time, resource
utilization, and power consumption. Comparative experiments show that our FFT hardware
accelerator achieves speedups of up to 18.89× in comparison to two software-only solutions and
two dedicated hardware solutions.

SOFTWARE IMPLEMENTATION:

• Modelsim
• Xilinx 14.2

EXISTING SYSTEM:

FAST Fourier transform (FFT) is a fundamental algorithm in the domain of digital signal
processing (DSP) and is widely used in applications such as digital communication, sensor signal
processing, and synthetic aperture radar (SAR). It is always the most time-consuming part in
these applications. Moreover, the FFT size varies widely across applications. For example, the
FFT size is below the order of thousands (K) in digital or wireless communication applications,
but reaches the order of millions (M) in SAR applications. Therefore, supporting
high-performance variable-size FFT is a basic requirement in the DSP domain.

FFT performance varies dramatically across implementation methods and platforms. In general,
general-purpose CPU platforms deliver poor FFT performance, because the power-of-2 strides of
the FFT algorithm interact badly with set-associative caches, set-associative address translation
mechanisms, and power-of-2-banked memory subsystems. Many researchers have therefore
focused on FFT optimization on GPUs and field-programmable gate arrays (FPGAs). However,
the computational efficiency of FFT on these platforms is still not high, due to their low
utilization of computational units. An alternative is to accelerate FFT with application-specific
integrated circuit (ASIC) hardware so as to achieve high computation performance and energy
efficiency. For instance, the hardware accelerator FFT (HWAFFT) is an FFT hardware
accelerator on the commercial TI TMS320VC55X DSP chips. Compared with flexible cores
such as FPGA, GPU, and CPU, an ASIC FFT hardware accelerator can achieve much higher
performance and improve energy efficiency by one to two orders of magnitude.

A DSP application usually contains multiple batches of FFT computations. A batch of FFT
computation includes several FFTs of the same size. For instance, the chirp scaling algorithm
of SAR requires four batches of FFTs (azimuth FFT, range FFT, range IFFT, and azimuth
IFFT), the cross-correlation algorithm contains three batches of FFTs, and the Cooley–Tukey
scheme for large 1-D FFT includes two batches of FFTs (column FFT and row FFT). Because
there is usually no data dependence among FFTs in the same batch of FFT computations, more
task-level parallelism can be exploited to improve the FFT performance.

In this paper, in order to design a high-performance variable-size FFT hardware accelerator, we
first propose a parallel Cooley–Tukey FFT algorithm based on matrix transposition (MT). The
proposed algorithm can efficiently divide a large-size FFT into several small-size FFTs that can
be executed in parallel. Second, guided by this algorithm, we design the FFT hardware
accelerator supporting 2- to 2²⁰-point FFT computation and propose several FFT performance
optimization techniques such as hybrid twiddle factor generation, multibank data memory
(MBDM), block MT, and token-based task scheduling. Then, its VLSI implementation is
detailed and its performance is evaluated experimentally in terms of FFT execution time,
resource utilization, and power consumption.
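The decomposition described above follows the classic Cooley–Tukey four-step scheme: column FFTs, twiddle-factor multiplication, transposition, and row FFTs. The following plain-Python sketch illustrates only the algorithm's data flow, not the accelerator's implementation; the helper `fft` and the factorization into `n1 × n2` are generic illustrations:

```python
import cmath

def fft(x):
    """Plain recursive radix-2 FFT, used for the small column/row FFTs."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def four_step_fft(x, n1, n2):
    """N-point FFT (N = n1 * n2) via column FFTs, twiddle multiplication,
    transposition, and row FFTs. The small FFTs in each step are mutually
    independent and can therefore be executed in parallel."""
    n = n1 * n2
    # Step 1: view x as an n1 x n2 row-major matrix and FFT each column
    cols = [fft([x[r * n2 + c] for r in range(n1)]) for c in range(n2)]
    # Step 2: multiply by the twiddle factors W_N^(c * k1)
    for c in range(n2):
        for k1 in range(n1):
            cols[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Step 3: transpose, then FFT each row (n2-point FFTs)
    rows = [fft([cols[c][k1] for c in range(n2)]) for k1 in range(n1)]
    # Step 4: read out in transposed order: X[k1 + n1*k2] = rows[k1][k2]
    return [rows[k % n1][k // n1] for k in range(n)]
```

The result matches a direct DFT of the same input; the transposition between steps 3 and 4 is exactly where the MT hardware techniques of the accelerator come into play.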

DISADVANTAGES:

• Execution time is high

• Cannot efficiently divide a large-size FFT into several small-size FFTs

PROPOSED SYSTEM:

VLSI Implementation

Host Chip

The FFT hardware accelerator has been implemented and integrated into our FT-M8 chip, which
is an eight-core version of our FT-MATRIX chip. FT-MATRIX is a high-performance
coordination-aware vector-SIMD architecture for signal processing. Fig. 1(a) shows the overall
architecture of the FT-M8 chip, highlighting the FFT hardware accelerator. The FT-M8 chip
contains eight FT-MATRIX cores and integrates different on-chip peripherals. As shown in Fig. 1(a), a
high-speed data network (HDN) supporting the AXI4 protocol with 128-bit data width is
designed to connect eight FT-MATRIX cores, a DDR3 controller, an on-chip shared memory
(SM), a low-speed data network (LDN), the FFT hardware accelerator, and some high-speed
peripherals such as PCIE, SGMII, and RapidIO [not drawn in Fig. 1(a)]. The HDN works in the
system clock domain (SYS_CLK in Table I), so its bandwidth is 8 GB/s at 500 MHz. The LDN,
supporting the AHB protocol with a 32-bit data width, is designed to connect low-speed
memories such as SRAM and FLASH and other low-speed peripherals such as SPI, IIC, and
UART (not drawn in the figure). The HDN and the LDN are bridged by an AXI2AHB bridge module.
The FFT hardware accelerator is connected with the HDN and, hence, can access the SM or the
DDR3 memory through the DDR3 controller. FT-MATRIX cores configure and start up the FFT
hardware accelerator through the AXI slave command interface. An accomplishment interrupt is
sent from the FFT hardware accelerator to the FT-MATRIX cores, once the FFT computation is
accomplished.

FT-M8 chip is designed and implemented under TSMC 45-nm process and has been successfully
taped out. Fig. 1(b) shows the layout of the FT-M8 chip with block annotations. A hierarchical
physical design strategy is adopted.

Fig. 1. (a) Architecture of the FT-M8 chip integrating the FFT hardware accelerator. (b) Layout
of the FT-M8 chip with block annotations. (c) Layout of the FFT hardware accelerator with
subblock annotations.

First, the FT-MATRIX core, DDR3 controller, SM, PCIE, RapidIO, SGMII, and FFT hardware
accelerator are implemented individually. Second, the entire FT-M8 chip is implemented. In the
FT-M8 chip implementation, they are included as macro blocks [see the dark green blocks in
Fig. 1(b)], and the LDN, HDN, and low-speed peripherals such as SPI, IIC, and UART are
flattened among these
macro blocks. Because the FFT hardware accelerator is an internal module without chip pins, it
is not placed along the chip's sides and corners. Based on the FT-M8 chip, the functionality of
the proposed FFT hardware accelerator was verified by Cadence NC-Verilog simulation.

FFT Implementation

Table I summarizes the FFT implementation results, and Fig. 1(c) shows the layout of the FFT
hardware accelerator with subblock annotations. The implemented FFT hardware accelerator
contains two FFT-PEs, and each FFT-PE has two butterfly units. Each butterfly unit is fully
pipelined and composed of four single-precision floating-point multipliers and six
single-precision floating-point adders. The adders and multipliers are designed according to the
single-precision floating-point data format (1-bit sign, 8-bit biased exponent, and 23-bit
unsigned fraction).
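The four-multiplier/six-adder structure of a radix-2 butterfly can be seen by writing it out in real arithmetic. This is a minimal Python sketch of the arithmetic structure only, not the accelerator's pipelined floating-point RTL:

```python
def butterfly(ar, ai, br, bi, wr, wi):
    """Radix-2 DIT butterfly on complex operands a, b with twiddle w,
    expanded into real operations: 4 multiplies and 6 adds/subtracts."""
    # Complex twiddle multiply t = w * b: 4 real multiplies, 2 real adds
    tr = wr * br - wi * bi
    ti = wr * bi + wi * br
    # Butterfly add/subtract: 4 more real adds
    return (ar + tr, ai + ti), (ar - tr, ai - ti)
```

Counting the operations gives exactly the four multipliers and six adders per butterfly unit stated above; a fully pipelined unit evaluates all of them concurrently across pipeline stages.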

To reduce area and latency and improve precision, these units perform no intermediate
rounding. Besides, exceptions such as overflow, underflow, and divide-by-zero are reported via
an interrupt output. Each FFT-PE is capable of computing at most a 2¹⁰-point FFT. Under the
token-based task scheduling of the FFT controller, the two FFT-PEs can cooperate to compute a
larger FFT (2¹⁰ to 2²⁰ points). Each FFT-PE contains an MBDM consisting of eight
(= 2 × 2 × M) SRAMs with a bit width of 64 and a depth of 256 (= Nm/(2M)). The butterfly
units feature a 10-stage pipeline. In the logic design phase,
a 256×64 SRAM with two read/write ports and a 32 × 64 SRAM with one read port and one
write port were compiled by Synopsys Embed-It! Integrator using TSMC 45-nm M-Bitcell
library. The former was instantiated eight times in each FFT-PE to constitute the MBDM, while
the latter was instantiated four times as the asynchronous data FIFO. The asynchronous
command FIFO was implemented with registers rather than SRAMs, since the amount of
command data is small. Hence, the total memory size in the FFT hardware accelerator is 33 kB.
The FFT hardware accelerator was synthesized by the Synopsys Design Compiler on the basis
of the generated memories and the TSMC 45-nm standard cell library. In the synthesis,
FFT_CLK and SYS_CLK were targeted at 0.65 ns. After synthesis, RTL-to-netlist equivalence
checking was carried out with Synopsys Formality to ensure consistency between the RTL code
and the netlist.
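The 33-kB total quoted above can be reproduced from the stated SRAM configurations. A quick back-of-the-envelope check (the register-based command FIFO is excluded, as in the text):

```python
# Reproduce the accelerator's total on-chip memory size from the SRAM
# configurations stated above.

# Per FFT-PE: eight 256-entry x 64-bit SRAM banks form the MBDM
mbdm_bytes_per_pe = 8 * 256 * 64 // 8        # 16 384 B per FFT-PE
# The accelerator contains two FFT-PEs
mbdm_total = 2 * mbdm_bytes_per_pe           # 32 768 B
# Four 32-entry x 64-bit SRAMs serve as the asynchronous data FIFO
fifo_bytes = 4 * 32 * 64 // 8                # 1 024 B

total_kb = (mbdm_total + fifo_bytes) // 1024
print(total_kb)  # 33
```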

In the physical design phase, Cadence SoC Encounter was used to perform floorplanning,
supply/ground grid design, clock-tree synthesis, placement, and routing. FFT_CLK and
SYS_CLK were targeted at 1 ns. Besides, STA, DRC, and LVS were performed to ensure that
the final GDSII data meet the requirements of the TSMC foundry. The final results show that:
1) the area is 2.4 mm²; 2) the clock frequency reaches 1 GHz over a wide temperature range
from −40 °C to 125 °C, with the critical path located in the butterfly unit of FFT-PE1 and
taking 998 ps; and 3) under the typical corner (25 °C, 0.9 V), according to VCD-based
simulation data of a 1024-point FFT computation, the power consumption is calculated to be
91.3 mW.

ADVANTAGES:

• Execution time is less

• Efficiently divides a large-size FFT into several small-size FFTs

REFERENCE:

Xiaowen Chen, Yuanwu Lei, Zhonghai Lu, and Shuming Chen, "A Variable-Size FFT Hardware
Accelerator Based on Matrix Transposition," IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, 2018.
