Professional Documents
Culture Documents
2020 ISCAS A 128-Point Multi-Path SC FFT Architecture
2020 ISCAS A 128-Point Multi-Path SC FFT Architecture
Shun-Che Hsu & Shen-Jui Huang Sau-Gee Chen & Shin-Che Lin Mario Garrido
Intelligo Technology Inc. Insitute of Electronics Department of Electrical Engineering
Hsinchu City, Taiwan Natioanl Chiao Tung University Universidad Politécnica de Madrid
shenray.ee95g@g2.nctu.edu.tw Hsinchu City, Taiwan Madrid, Spain
sgchen@g2.nctu.edu.tw mario.garrido@upm.es
X [ k ] = x[ n]WN ,
kn
and multipliers in the processing element (PE) of each stage k = 0, 1, …, N-1, (1)
are reduced by half. n=0
− j 2π nk / N
where X[k] is the kth DFT coefficient and WN =e
nk
Although the SC FFT is very hardware-efficient, it .
considers only a single serial input and output data streams. Fast Fourier transform (FFT) algorithms are applied to
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on April 18,2022 at 11:05:58 UTC from IEEE Xplore. Restrictions apply.
Stage1 Stage2 Stage3
x[n] X[8k]
Stage1 Stage2 −
WN4n
x[n+N/8] X[8k+4]
x[n] X[4k]
x[n+2N/8] − WN2n X[8k+2]
x[n+N/4] − WN2n X[4k+2]
x[n+3N/8] − -j − WN6n
WNn X[4k+1] X[8k+6]
−
x[n+2N/4] − WNn
x[n+4N/8] X[8k+1]
x[n+3N/4] − -j − WN3n 1 WN5n
X[4k+3]
x[n+5N/8] − W8 −
(a) X[8k+5]
x[n+6N/8] − W82 − WN3n X[8k+3]
− W8
3
− -j − WN7n X[8k+7]
x[n+7N/8]
(b) Fig. 3. PE with W4 rotator (PE_W4).
Stage1 Stage2 Stage3 Stage4
x[n] X[16k]
WN8n
x[n+N/16] X[16k+8]
WN4n
x[n+2N/16] X[16k+4]
-j WN 12n X[16k+12]
x[n+3N/16]
x[n+4N/16]
WN2n X[16k+2]
x[n+5N/16]
W81 WN 10n
X[16k+10]
x[n+6N/16] W82 WN6n
X[16k+6]
W83 -j WN 14n
x[n+7N/16] X[16k+14]
x[n+8N/16] WNn X[16k+1]
x[n+9N/16]
W161 WN9n X[16k+9]
x[n+10N/16] W162 WN5n X[16k+5] Fig. 4. PE with W8 rotator (PE_W8).
x[n+11N/16]
W163 -j WN 13n X[16k+13]
W164 WN3n S3_1
x[n+12N/16] X[16k+3]
W165 W81 WN 11n RIN S3_1
x[n+13N/16] X[16k+11] 0 S3_5
x[n+14N/16]
W166 W82 WN7n D + ROUT
X[16k+7] 1
0
D
x[n+15N/16]
W167 W83 -j WN 15n X[16k+15] S3_2 S3_3
0
1
(c) IIN 1 0
W16_MUL
Fig. 2. Signal flow graphs for different radices. (a) Radix-22. (b) Radix-23. D 0
– D -1 1
S3_1 IOUT
1
(c) Radix-24. S3_4 + 1
D 1 0
W16_MUL
reduce the computational cost of the DFT, leading to a 0
2
Real W16 rotator
complexity of O(Nlog2N) which is much less than O(N2) for W16_MUL
direct computation of the DFT. During the FFT process, the x
<< 6 S When S = 0, out = 473x
major basic computation involved is the multiplication of a 64x
2
When S = 1, out = 362x
When S = 2, out = 196x
5x
− j 2π m / L 40x
C. Symmetric angle sets Fig.2 shows the flow graphs of radices 22, 23 and 24
FFTs which are based on decimation-in-frequency (DIF)
A symmetric angle set (SAS) [17,18] is a set of angles of decomposition. The proposed architecture is based on these
the form nπ/2 ±α, where n = 0,…,3 and α∈[0, π/4]. Any building blocks. One can observe that the rotation
rotation in a symmetric angle set can be calculated as a operations in the flow graph only appear at the lower edges
rotation by an angle α∈[0, π/4], a trivial rotation and an of the butterfly outputs. This translates into the fact that only
exchange of the real and imaginary part of the rotation the lowest path of the PE in SC FFT needs to be rotated. As
coefficient. a result, the PEs in Figs. 3 to 6 are derived.
Fig. 3 shows the realization of the PE for the stages
D. M-rotator where a W4 rotation is applied. In this case, only a -j rotation
is needed. The S1_2 signal selects a rotation by 1 or –j. Fig.
An M-rotator or M-rot [18] is a rotator that can rotate a 4 shows the realization of the PE for the stages where W8
number of angles in M different symmetric angle sets. For rotation is operated. In this case, the constant 0.707 in
instance, a rotator that rotates by 0º, 45º and 135º is a 2-rot as W81=0.707-0.707j and W83=-0.707-0.707j is moved to the
it rotates angles in the symmetric angle sets nπ/2 and nπ/2 ± input of the rotator. As a result, the rotator only requires one
π/4. Likewise, the twiddle factor W8 is a 2-rot, the twiddle constant multiplier. With proper control signal setting, this
factor W16 is a 3-rot and the twiddle factor W32 is a 5-rot. In PE calculates all the rotations only with W8 twiddle factor.
general, a twiddle factor WL is an L/8+1-rot. This means that The constant multiplication by 0.707 can be implemented as
there are L/8+1 angles in the range [0, π/4] and the rest of the four simple shift-and-add operations for 12-bit precision, as
can be easily shown.
angles can be obtained by utilizing the symmetry property.
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on April 18,2022 at 11:05:58 UTC from IEEE Xplore. Restrictions apply.
required in these stages. Fig. 6 shows the corresponding PE.
S4_1
× B. Multi-path SC FFT Architecture
+ −
Fig. 6. PE with general complex rotator (PE_TW). TABLE I. ROTATOR DISTRIBUTION AT STAGES 4 TO 6 OF THE
PROPOSED MSC FFT
Fig. 5 shows the implementation of the PE for the stages Stage 4 Stage 5 Stage 6
where a W16 rotation is required. By using the symmetry Path 1 W160 , W164 W80 W40
property around the unit circle, the rotator only has 3 SAS. Path 2 W161 , W165 W81 W40
Therefore, it only needs to rotate by W160,W161 and W162, Path 3 W162 , W166 W82 W40
whereas the rest of rotations are obtained by symmetries. If Path 4 W163 , W167 W83 W41
we consider that W161 = (473-196j)/512, and W162 =
(362-362j)/512, the rotator only needs to multiply its inputs In addition to combining radix-23 and radix-24, the
by 473/512, 362/512, or 196/512. This allows for sharing proposed SC FFT applies the ideas of rotator allocation [17]
the resources to calculate the multiplications by all these to simplify the rotators even further. Table I shows the
values. With proper setting of control signals, W16_MUL rotations at the parallel paths of stages 4, 5 and 6. One can
can calculate the multiplication by 473, 362, or 196 using observe that in the upper path of stage 4, it only has to
only three real adders. calculate rotations by W160 and W84, which are trivial. Hence,
In the final stage of the radix-2k flow graphs in Fig. 2, a PE_W4 is enough here, which much simplifies the rotator
rotations occur both at the upper and lower output paths of complexity. Likewise, other rotators at stages 4, 5 and 6 are
each butterfly. Therefore, a complex multiplication is simplified by following the same ideas. These rotators
correspond to the shaded blue modules in Fig. 7.
3 4
Radix-2 Radix-2
Stage1 Stage2 Stage3 Stage4 Stage5 Stage6 Stage7
0 0 0 0
Path 1 PE_W8 D7 1
PE_W4 D3 1
PE_ TW D1 1
PE_W4 D15 1
PE_W4
1 1 1 1
0 0 0 0
BU2 BU2
0 0 0 0
Path 2 PE_W8 D7 1
PE_W4 D3 1
PE_ TW D1 1
PE_W16 D15 1
PE_W8
1 1 1 1
0 0 0 0
0 0 0 0
Path 3 PE_W8 D7 1
PE_W4 D3 1
PE_ TW D1 1
PE_W8 D15 1
PE_W4
1 1 1 1
0 0 0 0
BU2 BU2
0 0 0 0 -j
Path 4 PE_W8 D7 1
PE_W4 D3 1
PE_ TW D1 1
PE_W16 D15 1
PE_W8
1 1 1 1
0 0 0 0
IV. EXPERIMENTAL RESULTS AND COMPARISON Columns 3 to 7 in Table II show the synthesis results of
compared 128-point multi-path FFT architectures. All the
To verify its performance, the proposed MSC FFT compared designs are based either on an MDC or MDF FFT
architecture has been synthesized by using the TSMC 90 nm architectures which are derived from their single-path
technology. Ten thousand random input patterns of 1 or -1 are predecessors, namely, SDC and SDF. In contrast, the proposed
applied to the FFT architecture. The experimental results are MSC FFT architecture is the first parallel FFT architecture
shown in the second column of Table II. The architecture derived and optimized from the highly-efficient single-path
works at a frequency of 250 MHz, leading to a throughput of 1 SC FFT.
GS/s due to the 4 parallel paths. The word length used in the Together with the proposed architecture, the works of [5],
architecture is 12 bits. This results in a SQNR of 42.7 dB. The [8] and [9] are all 4-parallel architectures. For a proper area
pre-layout gate-level synthesis results show that the whole comparison, we have normalized the area figures of all the
FFT architecture area is 0.167 mm2. Finally, the power compared designs according to the following commonly
consumption is 14.81 mW at the clock rate of 250 MHz.
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on April 18,2022 at 11:05:58 UTC from IEEE Xplore. Restrictions apply.
TABLE II. COMPARISON OF 128-POINT MULTI-PATH FFT ARQUITECTURES
This work [5] [8] [9] [10] [15]
FFT architecture 4-parallel MSC 4-parallel MDF 4-parallel MRMDF 4-parallel MDF 8-parallel MDF 2-parallel MDC
FFT algorithm radix-23/24 radix-24 radix-8/2 radix-24 radix-23/24 radix-2
FFT size 128 128 128/64 128 128 128
Tech. (nm) 90 @1.0V 180 @1.8V 180 @1.8V 180 @1.8V 90 @1V 65 @0.7V
Word length 12 bits 12 bits 10 bits 10 bits 8 bits 12bits
Working frequency 250 MHz 200 MHz 250 MHz 450 MHz 260 MHz 160 MHz
3.097 0.53
A (mm2) 0.167 - - 0.34
(excl. test module) (excl. test module)
Gate count 59049 93200 - 130000 -
14.81 132 175 6.8 2.12
P (mW) -
@1 GS/s @800 MS/s @1 GS/s @409.6 MS/s @317.25 MS/s
Normalized Area: 0.167 0.774 0.53 0.652
- -
AN (mm2) (26%) (100%) (68%) (84%)
11.2*
Normalized Power: 6.75 11.97 7.3 8.31
@440 MS/s -
PN (mW) @440 MS/s (56%) @440 MS/s (100%) @440 MS/s (61%) @440 MS/s (69%)
(94%)
SQNR 42.7dB 40dB - 33dB 26.4dB -
Throughput rate
4R 4R 4R 4R 8R 2R
(R: clock rate)
Experimental results Post-synthesis Post-synthesis Post-implementation Post-synthesis Post-implementation Post-synthesis
*: This value has been normalized by multiplying the working frequency rate.
adopted formula, where the proposed design is used as the Finally, the SQNR is calculated as explained in [20]. By
benchmark design: comparing the results in the table, it can be observed that the
proposed approach achieves the highest SQNR among all the
A 128-point multi-path FFT designs.
AN = 2
, (2)
(Tech. / 90nm) V. CONCLUSIONS
where AN is the normalized area, A is the area, and Tech. is
In this work, we have presented the first multi-path SC FFT
process feature size of the compared design. For the proposed
(MSC FFT). The demonstration design is a 128-point 4-parallel
approach, the process size is 90nm . radix-23/24 MSC FFT. The use of an SC-based processing
element facilitates the minimization of the hardware resources
It is observed that the proposed design requires significantly in the architecture. Furthermore, the utilization of radix-23/24
less area than the compared 4-parallel designs: Compared to FFT combined with optimized rotator allocation results in
[5], the area of the proposed design is 61% smaller in terms of highly hardware-efficient rotator architectures. Compared with
gate counts. With respect to [8], the proposed design requires the existing designs with the same FFT size and degree of
around one-fourth area of [8], while compared to [9], the parallelism, the proposed architecture achieves savings by
proposed design consumes less than half of its gate count. more than 40% in both area and power consumption, while
provide very high throughput. The proposed architecture can
To compare the power consumption, similarly the power be extended to higher parallelism than four with much higher
data are normalized with respect to the proposed design by throughputs which can meet the requirements of current and
using the following formula: future most demanding communication systems (with
throughput of more than 10Gpbs), such as 5G mobile systems,
IEEE802.11ax and IEEE 802.11ay systems. Those application
P designs are currently under investigation.
PN = 2
, (3)
(Tech./ 90nm) × (V /1.0)
REFERENCES
where the number 1.0 in the formula is the operating voltage [1] S. He and M. Torkelson, “Design and implementation of a 1024-point
of the proposed implementation, and V, P and PN are the pipeline FFT processor,” in Proc. IEEE Custom Integrated Circuits
operating voltage, power consumption and normalized power Conf., May 1998, pp. 131-134.
[2] N. Li and N. P. van der Meijs, “A radix-22 based parallel pipeline FFT
consumption of the compared design, respectively. processor for MB-OFDM UWB system,” in Proc. IEEE Int. SOC Conf.,
2009, pp. 383–386.
At 440 MS/s, the proposed approach consumes only 60% [3] J. Lee, H. Lee, S. I. Cho, and S.-S. Choi, “A high-speed, low-complexity
and 56% of the total power consumption in [5] and [8], radix-24 FFT processor for MB-OFDM UWB systems,” in Proc. IEEE
respectively. This represents savings of 40% or larger in Int. Symp. Circuits Syst., 2006, pp. 210–213.
power consumption.
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on April 18,2022 at 11:05:58 UTC from IEEE Xplore. Restrictions apply.
[4] H. Liu and H. Lee, “A high performance four-parallel 128/64-point [12] Y.-N. Chang, “An efficient VLSI architecture for normal I/O order
radix-24 FFT/IFFT processor for MIMO-OFDM systems,” in Proc. pipeline FFT design,” IEEE Trans. Circuit Syst. II: Express Briefs, vol.
IEEE Asia–Pacific Conf. Circuits Syst., Nov. 2008, pp. 834–837. 55, no. 12, pp. 1234-1238, Dec. 2008.
[5] S.-I. Cho, K.-M. Kang, and S.-S. Choi, “Implemention of 128-point fast [13] M. Garrido, J. Grajal, M.A. Sanchez, and O. Gustafsson, “Pipelined
Fourier transformprocessor for UWB systems,” in Proc. Int. Wirel. radix-2k feedforward FFT architectures,” IEEE Trans. VLSI, vol. 21,
Commun. Mobile Comput. Conf., 2008, pp. 210–213. issue. 1, pp. 23-32, 2013.
[6] C.-H. Yang, T.-H. Yu, and D. Markovic, “Power and area minimization [14] M. A. Sánchez, M. Garrido, M. L. López, and J. Grajal, “Implementing
of reconfigurable FFT processors: A 3GPP-LTE example,” IEEE J. FFT-based digital channelized receivers on FPGA platforms,” IEEE
Solid-State Circuits, vol. 47, no. 3, pp. 757–768, Mar. 2012. Trans. Aerosp. Electron. Syst., vol. 44, no. 4, pp. 1567–1585, Oct. 2008.
[7] S.-N. Tang, J.-W. Tsai, and T.-Y. Chang, “A 2.4-GS/s FFT processor for [15] T. Ahmed, “A low-power time-interleaved 128-point FFT for IEEE
OFDM-based WPAN applications,” IEEE Trans. Circuits Syst. I, 802.15.3c standard,” in Proc. Int. Conf. Informatics, Electronics Vision,
Regular Papers, vol. 57, no. 6, pp. 451–455, Jun. 2010. Dhaka, 2013, pp. 1-5.
[8] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A 1-GS/s FFT/IFFT processor for [16] Mario Garrido, Shen-Jui Huang, Sau-Gee Chen and Oscar Gustafsson,
UWB applications,” IEEE J. of Solid-State Circuits, vol. 40, pp. 1726 – “The Serial Commutator (SC) FFT,” IEEE Trans. Circuits Syst. II:
1735, Aug 2005. Express Briefs, Vol. 63, No. 10, pp. 974-978, Oct. 2016.
[9] Minhyeok Shin and Hanho Lee, “A high-speed four-parallel [17] Mario Garrido, Shen-Jui Huang and Sau-Gee Chen, “Feedforward FFT
radix-24 FFT/IFFT processor for UWB applications,” in Proc. IEEE Int. hardware architectures based on rotator allocation,” IEEE Trans.
Symp. Circuits Syst., Seattle, WA, 2008, pp. 960-963. Circuits Syst. I: Regular Papers, Vol. 65, No. 2, pp. 581-592, Feb. 2018.
[10] S. Tang, J. Tsai and T. Chang, “A 2.4-GS/s FFT Processor for [18] R. Andersson. “FFT hardware architectures with reduced twiddle factor
OFDM-Based WPAN Applications,” in IEEE Transactions on Circuits sets,” Master’s thesis. Dpt. Of Electrical Engineering, Linköping
and Systems II: Express Briefs, vol. 57, no. 6, pp. 451-455, June 2010. University, Jun. 2014.
[11] X. Liu, F. Yu, and Z. ke Wang, “A pipelined architecture for normal I/O [19] M. Garrido, J. Grajal and O. Gustafsson. “Optimum circuits for bit-
order FFT,” Journal of Zhejiang University - Science C, vol. 12, no. 1, dimension permutations,” IEEE Trans. on VLSI, Vol 27, No. 5, pp.
pp. 76-82, Jan, 2011. 1148-1160, May 2019.
[20] D. Guinart and M. Garrido, “SQNR in FFT hardware architectures,”
Under review.
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on April 18,2022 at 11:05:58 UTC from IEEE Xplore. Restrictions apply.