FNet: Mixing Tokens with Fourier Transforms
Table 1: Model operations and parameters. n is the sequence length and d_hidden is the model hidden dimension. The "encoder" is the pre-training model without the output projection onto the embedding table for MLM and the binary classification head for NSP. The mixing layer operations and parameter counts are given on a per-layer basis.
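As a point of reference for the mixing operation summarized in Table 1, the Fourier sublayer that replaces self-attention can be sketched in a few lines of JAX. This is a minimal illustration only (the residual connection, layer normalization, and feed-forward sublayer around it are omitted), not the authors' exact implementation:

import jax.numpy as jnp

def fourier_mixing(x):
    """Unparameterized token mixing: a 2D DFT over the sequence and hidden
    dimensions, keeping only the real part of the result.

    x: [batch, seq_len, d_hidden] token representations.
    Returns an array of the same shape; the sublayer has no learned parameters.
    """
    # DFT along the hidden dimension and along the sequence dimension; the two
    # transforms commute, so a single fft2 over the last two axes suffices.
    return jnp.fft.fft2(x, axes=(-2, -1)).real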
Model              Total loss   MLM loss   NSP loss   MLM acc.   NSP acc.
BERT-Base            1.76         1.48       0.28       0.68       0.86
Linear-Base          2.12         1.78       0.35       0.62       0.83
FNet-Base            2.45         2.06       0.40       0.58       0.80
Random-Base          5.02         4.48       0.55       0.26       0.70
FF-only-Base         7.54         6.85       0.69       0.13       0.50
FNet-Hybrid-Base     2.13         1.79       0.34       0.63       0.84
BERT-Large           1.49         1.23       0.25       0.72       0.88
Linear-Large         1.91         1.60       0.31       0.65       0.85
FNet-Large           2.11         1.75       0.36       0.63       0.82
Table 2: Loss and accuracy pre-training metrics on TPUs. The GPU metrics are very similar.
Table 4: GLUE Validation results on TPUs. All models are pre-trained. The mean of F1 and accuracy scores is reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. After controlling for batch size and training steps, the GPU metrics (not shown) are very similar.
addition of just two self-attention sublayers significantly increases accuracy with only small costs in training speed.

4.1.2 GLUE fine-tuning

For the GLUE benchmark (Wang et al., 2018), we use the same settings as in Devlin et al. (2018). We found that different runs with the same base learning rate could yield slightly different results. Consequently, for the Base models, we performed 3 trials for each base learning rate and report the best result across all learning rates and trials. As noted in Devlin et al. (2018), we found that BERT-Large was more unstable than its Base counterpart. As a result, for the Large models, we performed 6 trials for each learning rate.

We report the results for the best base learning rate (no early stopping) on the GLUE Validation split in Table 4.⁶ The GLUE Base results mirror the pre-training metrics: BERT performs best, FNet and the Linear model both slightly underperform BERT by 7.5–8%, and the FF-only and Random models perform very poorly. The hybrid FNet model performs very well, achieving 97% of BERT's accuracy. Recall from Table 3 that all of the aforementioned models pre-train six to seven times faster than BERT on GPUs, and roughly twice as fast on TPUs.

Interestingly, the gap between the BERT and FNet Large models (3%) is smaller than that of their Base counterparts; this may be due to FNet-Large being more stable during training than BERT-Large.⁷ The Linear-Large model severely underperforms its Base counterpart on the GLUE benchmark, presumably due to training instabilities. As noted in Section 4.1.1, we found that the Linear model and BERT were less stable than the FNet, Random and FF-only models.

4.1.3 Speed and accuracy trade-offs

The pre-training and fine-tuning results show that MLM metrics are a good indicator of final GLUE results. Hence, in this subsection, we restrict our attention to MLM accuracy.

Speed vs. MLM accuracy curves for GPU (8 V100 chips) and TPU (4 × 4 TPU v3 chips) pre-training are shown in Figures 2 and 3, respectively. Both TPU and GPU models are trained for 1 million steps as in Devlin et al. (2018). Motivated by the models considered in Turc et al. (2019), we evaluated several model sizes; see Table 5. As in Turc et al. (2019), for all models, we fixed the feed-forward size to 4·d_hidden and the number of self-attention heads (for BERT) to d_hidden/64. We found that the smaller model architectures benefited from larger learning rates, so, for all models, we chose the best results from two base learning rates: 10⁻³ and 10⁻⁴.

Figures 2 (GPU) and 3 (TPU) display the same trends. For larger models, BERT is the most accurate. For fixed training speeds slower than roughly 15 ms, BERT is the most accurate model. But for smaller, very fast models, FNet and the Linear model offer a better trade-off between speed

⁵ …ally found that replacing the final layers was the best.
⁶ WNLI is excluded in Devlin et al. (2018). BERT's accuracy on WNLI is below baseline, unless a special training recipe is used. See also (12) in https://gluebenchmark.com/faq.
⁷ Devlin et al. (2018) obtain a roughly 2.5 point average boost on the Test split going from BERT-Base to BERT-Large. We only see a roughly 1.5 point boost on the Validation split, which may be due to reduced headroom on the Validation split.
Name       d_hidden   d_ff    Layers   BERT params (M)   Linear params (M)   FNet params (M)
Large        1024     4096      24          372                302                271
Base          768     3072      12          136                118                108
              512     2048      12           72                 72                 72
              512     2048       8           59                 59                 59
"Mini"        512     2048       4           46                 46                 46
              256     1024       4           20                 20                 20
"Micro"       256     1024       2           18                 18                 18
              128      512       2            9                  9                  9
Table 5: Model sizes. Named models are highlighted in Figures 2 and 3. The smaller architectures have a similar
number of parameters across all models because the majority of parameters are in the embedding and output layers.
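For concreteness, the sizing rule described in Section 4.1.3 and reflected in Table 5 can be written as a small helper. The function name is ours (illustrative only); the rule itself (feed-forward size 4·d_hidden, and d_hidden/64 self-attention heads for BERT) is from the text:

def model_dims(d_hidden: int) -> tuple[int, int]:
    """Sizing convention used for the models in Table 5 (assumed helper,
    not from the paper's code)."""
    d_ff = 4 * d_hidden          # feed-forward dimension
    n_heads = d_hidden // 64     # number of self-attention heads (BERT only)
    return d_ff, n_heads

# e.g. the "Mini" row: d_hidden = 512 -> d_ff = 2048 and 8 attention heads.
assert model_dims(512) == (2048, 8)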
Figure 2: Speed-accuracy Pareto curves for GPU pre-training. The dashed region is where the FNet (yellow) and Linear (red) curves are both more accurate than (above) and faster than (to the left of) the BERT (blue) curve, indicating better speed-accuracy trade-offs.
Table 6: Accuracy results on the Long-Range Arena benchmark. Average score does not include Path-X. These
results were obtained on TPUs. All models fail the Path-X task, but for different reasons – Transformer fails due to
memory limits, whereas FNet and the Linear model fail because they perform no better than chance on the binary
Path-X classification task.
Table 7: TPU and GPU training speeds, measured in steps per second (larger is better), on the Text Classification
task. Speed up multipliers relative to the Transformer are given in parentheses. Models were not evaluated for the
largest sequence length on GPUs.
Table 8: TPU and GPU peak memory usage (GB; smaller is better) on the Text Classification task during training.
Models were not evaluated for the largest sequence length on GPUs.
Seq. length      512           1024          2048          4096           8192           16384

GPU
Transformer      5.8           21.3          67.6          227.3          FAIL           FAIL
Linear           5.4 (1.1x)    6.1 (3.5x)    22.6 (3.0x)   63.7 (3.6x)    181.8          FAIL
FNet (FFT)       5.3 (1.1x)    5.2 (4.1x)    15.6 (4.3x)   35.8 (6.3x)    74.6           147.1
Performer        5.8 (1.0x)    10.9 (2.0x)   24.8 (2.7x)   53.5 (4.3x)    109.9          227.3

TPU
Transformer      4.6           7.3           34.6          126.6          476.2          FAIL
Linear           5.1 (0.9x)    5.5 (1.3x)    4.8 (7.2x)    14.6 (8.7x)    49.5 (9.6x)    FAIL
FNet (matmul)    4.7 (1.0x)    5.3 (1.4x)    9.3 (3.7x)    35.0 (3.6x)    131.6 (3.6x)   454.5
FNet (FFT)       5.7 (0.8x)    11.8 (0.6x)   28.6 (1.2x)   58.8 (2.2x)    119.0 (4.0x)   232.6
Performer        5.1 (0.9x)    5.4 (1.3x)    6.2 (5.6x)    18.5 (6.8x)    26.5 (18.0x)   55.9
Table 9: GPU and TPU inference speeds on the Text Classification task. Results are given in milliseconds per batch (smaller is better). Speed-up multipliers relative to the Transformer are given in parentheses.
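The "FNet (FFT)" and "FNet (matmul)" rows refer to two mathematically equivalent ways of applying the same Fourier mixing: a fast Fourier transform, or dense multiplications with precomputed DFT matrices, which map well onto TPU matrix units at shorter sequence lengths. A rough sketch of the distinction for inputs of shape [batch, seq_len, d_hidden]; this is illustrative code, not the benchmarked implementation:

import jax.numpy as jnp

def dft_matrix(n):
    """n x n (unnormalized) DFT matrix W with W[j, k] = exp(-2*pi*i*j*k / n)."""
    j = jnp.arange(n)
    return jnp.exp(-2j * jnp.pi * jnp.outer(j, j) / n)

def mix_fft(x):
    """FNet (FFT): mixing via fast Fourier transforms, O(n log n) along the sequence."""
    return jnp.fft.fft2(x, axes=(-2, -1)).real

def mix_matmul(x, seq_dft, hidden_dft):
    """FNet (matmul): the same transform as dense matrix multiplications with
    precomputed DFT matrices, O(n^2) but friendly to TPU matrix units."""
    # seq_dft: [seq_len, seq_len], hidden_dft: [d_hidden, d_hidden]
    return jnp.einsum('ns,bsh,hd->bnd',
                      seq_dft, x.astype(jnp.complex64), hidden_dft).real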
applications, researchers and practitioners may prioritize inference speeds. Table 9 shows that the training speed trends generally carry over to inference too, with FNet's gains on GPUs again being stronger than those on TPUs.

5 Conclusion

In this work, we studied simplified mixing modules for Transformer-like encoder architectures. We made the following contributions:

• We showed that simple, linear token "mixing" transformations, along with the nonlinearities in feed-forward layers, are sufficient to model diverse semantic relationships in text.

• We introduced FNet, a Transformer-like model wherein the self-attention sublayer is replaced by an unparameterized Fourier Transform. FNet achieves 92% of BERT's accuracy on the GLUE benchmark, but trains seven times as fast on GPUs and twice as fast on TPUs.

• Because of its favorable scaling properties, FNet is very competitive with the "efficient" Transformers evaluated on the Long-Range Arena benchmark. Specifically, we find that FNet is comparably accurate to the most accurate "efficient" Transformers. Moreover, on GPUs, FNet was significantly faster at both training and inference time than all efficient Transformers across all sequence lengths; on TPUs, FNet is faster at training and inference for relatively shorter sequence lengths (≤ 2048 and ≤ 1024 tokens, respectively). Our experiments also showed that FNet has a lighter memory footprint than all the efficient Transformer architectures, across all sequence lengths.

• We demonstrated that, for a fixed speed and accuracy budget, small FNet models outperform Transformer models.

Our work has highlighted the potential of linear transformations as a drop-in replacement for the attention mechanism in text classification tasks. We found the Fourier Transform to be a particularly effective mixing mechanism, in part due to the highly efficient FFT. It is quite remarkable that an unparameterized mixing mechanism can yield a relatively accurate model. On a practical note, we only performed a cursory survey of other linear transformations. Therefore, we believe there may be value in exploring other fast transformations.

Because of the speed and accuracy advantages of smaller FNet models relative to Transformer models, we suspect that FNet will be effective as a lightweight, distilled student model deployed in settings where latency or memory is important, such as in production web services or on edge devices. The need for such lightweight serving models is only forecast to grow given the surge in interest in giant models (Raffel et al., 2019; Brown et al., 2020; Lepikhin et al., 2020). A natural avenue to explore in this regard will be using knowledge distillation on small FNet models from larger Transformer teacher models, following approaches established by Sanh et al. (2019), Jiao et al. (2019), Turc et al. (2019) and others.
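As a sketch of what such a distillation setup might look like, the following soft-target loss in the style of Sanh et al. (2019) trains an FNet student to match a Transformer teacher's temperature-softened predictions. This is a generic recipe assumed here for illustration, not an experiment from this paper:

import jax
import jax.numpy as jnp

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation loss: cross-entropy between the teacher's
    softened distribution and the student's, scaled by temperature^2
    (the teacher-entropy term is constant and dropped)."""
    t = temperature
    teacher_probs = jax.nn.softmax(teacher_logits / t, axis=-1)
    student_log_probs = jax.nn.log_softmax(student_logits / t, axis=-1)
    return -(t ** 2) * jnp.mean(jnp.sum(teacher_probs * student_log_probs, axis=-1))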
Another aspect of interest that we only briefly touched on in this work is hybrid FNet-attention models. We found that adding self-attention sublayers to FNet models offers a simple way to trade off speed for accuracy, at least in the relatively short sequence length application we considered. Specifically, replacing the final two Fourier sublayers of FNet with self-attention sublayers yielded a model that achieved 97% of BERT's accuracy on the GLUE benchmark, but pre-trained nearly six times as fast on GPUs and twice as fast on TPUs.
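A minimal way to express that hybrid layout (illustrative only; the helper name and string labels are ours, not the authors' code):

def mixing_plan(num_layers: int, num_attention_layers: int = 2):
    """Layer-by-layer mixing choice for a hybrid FNet encoder: Fourier mixing
    in every layer except the final `num_attention_layers`, which use
    standard self-attention, as in the FNet-Hybrid model."""
    return ["attention" if i >= num_layers - num_attention_layers else "fourier"
            for i in range(num_layers)]

# A 12-layer Base hybrid: ten Fourier layers followed by two self-attention layers.
print(mixing_plan(12))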
Throughout this work we have restricted our attention to encoders. FNet decoders can be designed using causally masked discrete Fourier Transform matrices. However, a lower-level implementation would be required if one wanted to introduce causal masking to the FFT. It is also not immediately clear what the right mechanism for encoder-decoder interactions would be; for example, You et al. (2020) suggest that cross-attention between the encoder and decoder is crucial to Transformer encoder-decoder performance.
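The text only notes that such decoders are possible; one plausible construction of the causally masked DFT mixing, sketched under that assumption, is:

import jax.numpy as jnp

def causal_dft_matrix(n):
    """A causally masked DFT mixing matrix (one possible construction): start
    from the n x n DFT matrix and zero out entries that would let position j
    mix in information from positions k > j."""
    j = jnp.arange(n)
    dft = jnp.exp(-2j * jnp.pi * jnp.outer(j, j) / n)   # full DFT matrix
    mask = jnp.tril(jnp.ones((n, n)))                   # lower-triangular causal mask
    return dft * mask

def causal_fourier_mixing(x):
    """Apply the masked transform along the sequence dimension only.
    x: [batch, seq_len, d_hidden]."""
    w = causal_dft_matrix(x.shape[-2])
    return jnp.einsum('ns,bsh->bnh', w, x.astype(jnp.complex64)).real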
References
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Andrew R Barron. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Kamran Chitsaz, Mohsen Hajabdollahi, Nader Karimi, Shadrokh Samavi, and Shahram Shirani. 2020. Acceleration of convolutional neural network using FFT-based split convolutions. arXiv preprint arXiv:2003.12621.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. 2020a. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020b. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301.

George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Hazem M El-Bakry and Qiangfu Zhao. 2004. Fast object/face detection using neural networks and fast Fourier transform. International Journal of Signal Processing, 1(3):182–187.

Matteo Frigo and Steven G Johnson. 2005. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231.

Himanshu Gothwal, Silky Kedawat, Rajesh Kumar, et al. 2011. Cardiac arrhythmias detection in an ECG beat signal using fast Fourier transform and artificial neural network. Journal of Biomedical Science and Engineering, 4(04):289.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

Tyler Highlander and Andres Rodriguez. 2016. Very efficient training of convolutional neural networks using fast Fourier transform and overlap-and-add. arXiv preprint arXiv:1601.06815.

Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2020. Data movement is all you need: A case study of transformer networks. arXiv preprint arXiv:2007.00072.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR.

Young Jin Kim and Hany Hassan Awadalla. 2020. FastFormers: Highly efficient transformer models for natural language understanding. arXiv preprint arXiv:2010.13382.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

Renée Koplon and Eduardo D Sontag. 1997. Using Fourier-neural recurrent networks to fit sequential input/output data. Neurocomputing, 15(3-4):225–248.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Henry O. Kunz. 1979. On the equivalence between one-dimensional discrete Walsh-Hadamard and multidimensional discrete Fourier transforms. IEEE Computer Architecture Letters, 28(03):267–268.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. 2020. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895.

Sheng Lin, Ning Liu, Mahdi Nazemi, Hongjia Li, Caiwen Ding, Yanzhi Wang, and Massoud Pedram. 2018. FFT-based deep learning deployment in embedded systems. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1045–1050. IEEE.

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.

Michael Mathieu, Mikael Henaff, and Yann LeCun. 2014. Fast training of convolutional networks through FFTs. In 2nd International Conference on Learning Representations, ICLR 2014.

Kei-ichiro Minami, Hiroshi Nakajima, and Takeshi Toyoshima. 1999. Real-time discrimination of ventricular tachyarrhythmia with Fourier-transform neural network. IEEE Transactions on Biomedical Engineering, 46(2):179–185.

Martina Mironovova and Jiří Bíla. 2015. Fast Fourier transform for feature extraction and neural network for classification of electrocardiogram signals. In 2015 Fourth International Conference on Future Generation Communication Technology (FGCT), pages 1–6. IEEE.

Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. 2015. ACDC: A structured efficient linear layer. arXiv preprint arXiv:1511.05946.

Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. 2021. Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. 2021. Random feature attention. arXiv preprint arXiv:2103.02143.

Harry Pratt, Bryan Williams, Frans Coenen, and Yalin Zheng. 2017. FCNN: Fourier convolutional neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 786–798. Springer.

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. 2019. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. 2020. Fixed encoder self-attention patterns in transformer-based machine translation. arXiv preprint arXiv:2002.10260.

Ali Rahimi and Benjamin Recht. 2008. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.

Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, et al. 2020. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217.

S. Reid and M. Mistele. 2019. Fast Fourier transformed transformers: Circulant weight matrices for NMT compression.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.

Sam Shleifer and Alexander M Rush. 2020. Pre-trained summarization distillation. arXiv preprint arXiv:2010.13002.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020a. Synthesizer: Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020b. Sparse Sinkhorn attention. In International Conference on Machine Learning, pages 9438–9447. PMLR.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020c. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020d. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. 2020. Fast transformers with clustered attention. arXiv preprint arXiv:2007.04825.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Weiqiu You, Simeng Sun, and Mohit Iyyer. 2020. Hard-coded Gaussian attention for neural machine translation. arXiv preprint arXiv:2005.00742.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Jiong Zhang, Yibo Lin, Zhao Song, and Inderjit Dhillon. 2018. Learning long term dependencies via Fourier recurrent units. In International Conference on Machine Learning, pages 5815–5823. PMLR.

Ying-Qian Zhang and Lai-Wan Chan. 2000. ForeNet: Fourier recurrent networks for time series prediction.

Zhenyou Zhang, Yi Wang, and Kesheng Wang. 2013. Fault diagnosis and prognosis using wavelet packet decomposition, Fourier transform and artificial neural network. Journal of Intelligent Manufacturing, 24(6):1213–1227.