
FNet: Mixing Tokens with Fourier Transforms

James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontañón


Google Research
{jamesleethorp, jainslie, ilyaeck, santiontanon}@google.com

arXiv:2105.03824v1 [cs.CL] 9 May 2021

Abstract

We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with simple nonlinearities in feed-forward layers, are sufficient to model semantic relationships in several text classification tasks. Perhaps most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs. The resulting model, which we name FNet, scales very efficiently to long inputs, matching the accuracy of the most accurate "efficient" Transformers on the Long Range Arena benchmark, but training and running faster across all sequence lengths on GPUs and for relatively shorter sequence lengths on TPUs. Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.

1 Introduction

The Transformer architecture (Vaswani et al., 2017) has achieved rapid and widespread dominance in natural language processing. At the heart of the Transformer is a self-attention mechanism – an inductive bias that connects each token in the input through a relevance weighted basis of every other token. Each hidden unit is represented in the basis of the hidden units of the previous layer. Many papers have prodded and probed the Transformer, and in particular the self-attention sublayers, in an effort to better understand the architecture; see, for example, Tenney et al. (2019); Vig and Belinkov (2019); Clark et al. (2019); Voita et al. (2019). Although potentially limited in their effectiveness (Hewitt and Liang, 2019), these probes of attention layers generally back the intuition that, by allowing higher order units to form out of compositions of the input, Transformer models can flexibly capture diverse syntactic and semantic relationships.

In this work, we investigate whether simpler mixing mechanisms can wholly replace the relatively complicated attention layers in Transformer-like architectures. We first replace the attention sublayer with two parameterized matrix multiplications – one mixing the sequence dimension and one mixing the hidden dimension. Seeing promising results in this simple linear mixing scheme, we further investigate the efficacy of faster, structured linear transformations. Surprisingly, we find that the Fourier Transform, despite having no parameters at all, achieves nearly the same performance as dense linear mixing and scales very efficiently to long inputs, especially on GPUs (owing to the Fast Fourier Transform). We call the resulting model FNet.

By replacing the attention sublayer with linear transformations, we are able to reduce the complexity and memory footprint of the Transformer architecture. We show that FNet offers an excellent compromise between speed, memory footprint, and accuracy, achieving 92% of the accuracy of BERT in a common classification transfer learning setup on the GLUE benchmark (Wang et al., 2018), but training seven times as fast on GPUs and twice as fast on TPUs. Aside from these practical considerations, it is remarkable that unparameterized (and therefore task independent) mixing, facilitated by the discrete Fourier Transform coefficients, coupled with simple nonlinearities in the feed-forward sublayers, is sufficient to model the GLUE tasks. We also show that an FNet hybrid model containing only two self-attention sublayers achieves 97% of BERT's accuracy on the GLUE benchmark, but trains nearly six times as fast on GPUs and twice as fast on TPUs.
We also show that FNet is competitive with all the "efficient" Transformers evaluated on the Long-Range Arena benchmark (Tay et al., 2020c) – a suite of tasks designed to assess models on longer sequence length inputs. Specifically, we find that FNet achieves comparable accuracies to the most accurate efficient Transformer architectures but is faster. In fact, on GPUs, FNet is significantly faster at both training and inference than all of the Transformer architectures across all sequence lengths. On TPUs, FNet is faster at training and inference for relatively shorter sequence lengths (≤ 2048 and ≤ 1024 tokens, respectively); for longer sequences, the only efficient Transformers that are faster than FNet appear to be less accurate on the Long-Range Arena benchmark. Our experiments also suggest that FNet has a lighter memory footprint than all the efficient Transformer architectures, across all sequence lengths.

The rest of this paper is organized as follows. In the next section, Section 2, we situate our work in the current literature. We then review the basics of the discrete Fourier Transform and introduce the FNet architecture in Section 3. In Section 4, we summarize the main results of this paper in the form of two sets of experimental results:

• In Section 4.1, we compare FNet, BERT (Devlin et al., 2018) and several simple baselines in a transfer learning setting – masked language modelling (MLM) and next sentence prediction (NSP) pre-training, followed by fine-tuning on the GLUE benchmark.

• In Section 4.2, we show that FNet offers a better compromise between speed and accuracy than most, if not all, of the efficient Transformers evaluated in the Long-Range Arena benchmark (Tay et al., 2020c).

We close in Section 5 with a summary of our results.

2 Related work

2.1 Fourier Transforms in neural networks

Fourier analysis features heavily in theoretical studies of neural networks. For example, universal approximation properties of neural networks are typically investigated by first showing that a particular class of neural networks is dense in the space of Fourier series, which in turn are dense in the relevant Hilbert space; see, for example, (Cybenko, 1989; Barron, 1993).

In terms of practical applications, Fourier Transforms, and in particular the Fast Fourier Transform (FFT), appear to have first been used in neural networks to tackle signal processing problems. For example, Minami et al. (1999); Gothwal et al. (2011); Mironovova and Bíla (2015) all fit neural networks to FFTs of electrocardiogram signals of the heart to identify abnormal electrical activity. In a similar spirit, Zhang et al. (2013) use FFTs of vibration signals as features for training neural networks to classify faults and predict degradation in manufacturing systems. Recent work by Li et al. (2020) uses Fourier Transforms in neural networks to evolve solutions of Partial Differential Equations.

Because ordinary multiplication in the frequency domain corresponds to a convolution in the time/real domain, FFTs have been deployed in Convolutional Neural Networks (CNNs) to speed up computations; see, for example, (El-Bakry and Zhao, 2004; Mathieu et al., 2014; Highlander and Rodriguez, 2016; Pratt et al., 2017; Lin et al., 2018; Chitsaz et al., 2020). Fourier Transforms have also been applied to Recurrent Neural Networks (RNNs): Fourier RNN (Koplon and Sontag, 1997) and ForeNet (Zhang and Chan, 2000). More recently, Zhang et al. (2018) used Fourier expansions in an RNN in an effort to stabilize training and reduce the exploding and vanishing gradient problems that are particularly prevalent with RNNs.

Fourier Transforms have even been used to approximate dense, linear layers in neural networks. For example, Moczulski et al. (2015) approximate individual linear layers with multiple sublayers of diagonal parameter matrix multiplications in the real and frequency domains, facilitated by discrete cosine transforms. This approximation reduces the number of parameters and the computational complexity.

Fourier Transforms have been used indirectly in at least two Transformer works. The Performer (Choromanski et al., 2020a) linearizes the complexity of the Transformer self-attention mechanism by leveraging random Fourier features. Specifically, Choromanski et al. (2020a) first represent the (self-attention) softmax kernel in terms of the Gaussian kernel, and then approximate the latter by sampling from the Fourier Transform (Rahimi and Recht, 2008). In our work, rather than approximating the self-attention kernel with sampled Fourier features, we replace the entire self-attention sublayer with a Fourier Transform. Reid and Mistele (2019), on the other hand, replace the feed-forward layers of the Transformer architecture with block-circulant layers, which they compute efficiently using the discrete Fourier Transform. On a German-to-English translation task, they ultimately find pruning to be a more effective compression strategy for Transformers than block-circulant layers.

Our work also connects with two broad themes of attention based models, which we discuss in the next two subsections.
2.2 The power of attention to model semantic relationships

Since attention (Bahdanau et al., 2014) and then the Transformer (Vaswani et al., 2017) architecture were first introduced, attention models have achieved state of the art results across virtually all NLP tasks and even some image tasks (Dosovitskiy et al., 2020). This success is generally attributed to the flexibility and capacity of attention models. Although some works (Ramsauer et al., 2020) have endeavoured to gain a deeper theoretical understanding of self-attention, the pervasive (and arguably, very appealing) intuition is that the success of attention models derives from the token-dependent attention patterns in different layers; see, for example, (Tenney et al., 2019).

However, it is natural to ask: Do we really need the flexibility, and associated cost, of attention, or is mixing tokens in other manners sufficient to train accurate and performant models? Our models show that linear, fixed transformations offer a more efficient, while nearly as powerful, mechanism to mix tokens.

Several other works have also investigated these questions by modifying or replacing the attention sublayers in Transformers. Perhaps most relevant is Tay et al. (2020a), who empirically investigated the importance of the dot product operation in the self-attention mechanism. They find that learning token-dependent attention weights from (dot-product) token-token interactions is powerful, but not necessarily crucial for realizing accurate NLP models. Our "Linear" model baseline is similar to the Random Synthesizer architecture of Tay et al. (2020a), but simplifies that model further by removing the multiple heads and the softmax projections, resulting in just two matrix multiplications for the mixing sublayer.

Motivated by the observation that most learned attention heads trained on Neural Machine Translation (NMT) tasks focus on a local window around the input query position, You et al. (2020) replace attention weights in the Transformer encoder and decoder with unparameterized Gaussian distributions. Provided that they retain the learnable cross-attention weights between the encoder and decoder, You et al. (2020) see minimal degradation in NMT BLEU scores. Working from similar observations, Raganato et al. (2020) find little to no accuracy degradation on NMT tasks when they replace all but one of the attention heads of each attention sublayer in the Transformer encoder with fixed, non-learnable positional patterns; note that, in their setup, the decoder retains all of its learnable parameters.

The results of both You et al. (2020) and Raganato et al. (2020) suggest that most connections in the attention sublayer in the encoder – and possibly the decoder – do not need to be learned at all, but can be replaced by predefined patterns. While reasonable, this conclusion is somewhat obscured by the learnable attention heads that remain in the decoder and/or the cross-attention weights between the encoder and decoder. It is also not fully clear what the exact speed improvements of the You et al. (2020) and Raganato et al. (2020) models are.

2.3 Efficient Transformers and long sequence models

The standard self-attention mechanism (Vaswani et al., 2017) has a quadratic time and memory bottleneck with respect to sequence length. This limits its applicability in text tasks involving long range dependencies, character-level modelling, speech processing, and image and video processing.

Much effort has been devoted to reducing the quadratic complexity with more efficient attention mechanisms. Most efforts to improve attention efficiency are based on sparsifying the attention matrix. Tay et al. (2020d) survey many of the recent efficient attention works, which they broadly categorize as follows:

• Data independent sparsity approaches that use fixed attention patterns (Child et al., 2019; Qiu et al., 2019; Parmar et al., 2018; Beltagy et al., 2020; Ainslie et al., 2020; Zaheer et al., 2020).

• Data dependent sparsity approaches that dynamically compress the attention matrix (Wang et al., 2020; Tay et al., 2020b,a; Kitaev et al., 2020; Roy et al., 2021; Vyas et al., 2020; Liu et al., 2018).
• Linearization of the attention matrix using kernel decompositions (Katharopoulos et al., 2020; Choromanski et al., 2020b; Peng et al., 2021).

Several of these works achieve O(N√N) or even O(N) theoretical complexity. However, the scaling constants hidden by this notation are often large. The Long-Range Arena benchmark (Tay et al., 2020c) attempts to compare many of the above "efficient" Transformers in a series of tasks requiring long range dependencies. Tay et al. (2020c) find that the Performer (Choromanski et al., 2020b), Linear Transformer (Katharopoulos et al., 2020), Linformer (Wang et al., 2020) and Image Transformer (Local Attention) (Parmar et al., 2018) were the fastest on 4 × 4 TPU v3 chips and had the lowest peak memory usages per device.¹ The latter memory footprint results are important as empirical studies have shown that Transformer architectures are often memory-bound (Ivanov et al., 2020; Shazeer, 2019).

¹Although, as the authors state, the results may be somewhat sensitive to implementation.

Instead of following the above paradigm of approximating the attention matrix, we completely replace self-attention with a different mixing, namely the Fourier Transform. This offers several benefits:

• Performance: between the FFT and pre-computed matrices, the discrete Fourier Transform is extremely fast on TPUs and GPUs.

• Model size: because there are no learnable weights in the mixing layer, we dramatically reduce the number of model parameters and memory footprint.

• Simplicity: no complicated sparsity patterns.

Using the Long-Range Arena benchmark (Tay et al., 2020c), we show that the Fourier Transform offers an excellent compromise between efficiency and accuracy.

Finally, we note that there are many optimizations, beyond seeking more efficient attention mechanisms, that are used to improve Transformer performance, from fusing operations and sharing key-value-query matrices to general pruning, quantization, and distillation; see, for example, (Kim and Awadalla, 2020; Shleifer and Rush, 2020). In a similar vein, we ignore several improvements to the original BERT model (Devlin et al., 2018), such as Adafactor (instead of Adam) for optimization, an inverse square root learning schedule, and pre-layer (as opposed to post-layer) normalization (Narang et al., 2021). We consider these directions orthogonal to this work, wherein we compare a vanilla Transformer with a vanilla FNet in an effort to investigate different token mixing mechanisms. We suspect that many of the aforementioned general performance and training optimizations will apply equally well to FNet.

3 Model

This section introduces our proposed model, FNet. We begin with a brief review of the discrete Fourier Transform.

3.1 Background: discrete Fourier Transform

The Fourier Transform decomposes a function into its constituent frequencies. Given a sequence {x_n} with n ∈ [0, N − 1], the discrete Fourier Transform (DFT) is defined by the formula:

X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk}, \quad 0 \le k \le N - 1.   (1)

For each k, the Fourier Transform generates a new representation X_k as a sum of all of the original input tokens x_n, with so-called "twiddle factors".

There are two primary approaches to computing the DFT: the Fast Fourier Transform (FFT) and matrix multiplication. The standard FFT algorithm is the Cooley–Tukey algorithm (Cooley and Tukey, 1965; Frigo and Johnson, 2005), which recursively re-expresses the DFT of a sequence of length N = N_1 N_2 in terms of N_1 smaller discrete Fourier Transforms of sizes N_2 to reduce the computation time to O(N log N).

An alternative approach to computing the DFT is to simply apply the DFT matrix to the input sequence. The DFT matrix is a Vandermonde matrix for the roots of unity up to a normalization factor:

W_{nk} = \frac{e^{-\frac{2\pi i}{N} nk}}{\sqrt{N}},   (2)

where n, k = 0, . . . , N − 1. This matrix multiplication is an O(N²) operation, which has higher asymptotic complexity than the FFT. However, on TPUs, by pre-computing the matrix W, we find that computing the DFT through matrix multiplications is faster for relatively shorter sequences. On GPUs, the FFT is faster than the DFT matrix multiplication for all sequence lengths.
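To make the two computation strategies concrete, here is a minimal JAX sketch – our own illustration, not code from the released implementation – that evaluates the (unnormalized) DFT of a toy sequence both with a pre-computed DFT matrix and with the FFT, and checks that the two agree:

```python
import jax.numpy as jnp

def dft_matrix(n: int) -> jnp.ndarray:
    # Vandermonde matrix of the N-th roots of unity, i.e. Equation (2) without the
    # 1/sqrt(N) normalization, so that it matches the convention used by jnp.fft.fft.
    k = jnp.arange(n)
    return jnp.exp(-2j * jnp.pi * jnp.outer(k, k) / n)

x = jnp.array([1.0, 2.0, 0.0, -1.0])   # toy input sequence of length N = 4

X_matmul = dft_matrix(x.shape[0]) @ x.astype(jnp.complex64)   # O(N^2) matrix multiply
X_fft = jnp.fft.fft(x)                                        # O(N log N) FFT

assert jnp.allclose(X_matmul, X_fft, atol=1e-4)
```

The matrix-multiply path amortizes well on TPUs for shorter sequences because dft_matrix(n) can be computed once and reused across an entire batch, whereas on GPUs the FFT path is the faster of the two in our experiments.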

3.2 FNet architecture

FNet is a layer normalized ResNet architecture with multiple layers, each of which consists of a Fourier mixing sublayer followed by a feed-forward sublayer. The architecture is shown in Figure 1. Essentially, we replace the self-attention sublayer of each Transformer encoder layer with a Fourier Transform sublayer, which applies a 2D Fourier Transform to its (sequence length, hidden dimension) embedding input – one 1D Fourier Transform along the sequence dimension, F_seq, and one 1D Fourier Transform along the hidden dimension, F_hidden:

y = \Re\left(F_{seq}(F_{hidden}(x))\right).   (3)

[Figure 1: FNet encoder architecture with N encoder blocks.]

As indicated by Equation (3), we only keep the real part of the result; hence, we do not need to modify the (nonlinear) feed-forward sublayers or output layers to handle complex numbers, as these are ignored.

We found that FNet obtained the best results when the real part of the total transformation was only extracted at the end of the Fourier sublayer; that is, after applying both F_seq and F_hidden. This negates some opportunity to use the real Fourier Transform. However, empirically, we did not observe major speed ups from the real Fourier Transform anyway.
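The following is a small JAX sketch of a single FNet encoder block as described above – Fourier mixing via Equation (3), followed by residual connections, layer normalizations, and the feed-forward sublayer. It is our own illustrative reconstruction (parameter shapes, initialization, and layer norm details are assumptions), not the authors' released Flax code:

```python
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-6):
    # Standard layer normalization over the hidden dimension (scale/bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def fourier_mixing(x):
    # Equation (3): 2D DFT over the (sequence, hidden) axes, keeping only the real part.
    return jnp.fft.fft2(x).real

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise feed-forward sublayer with a GELU nonlinearity.
    return jax.nn.gelu(x @ w1 + b1) @ w2 + b2

def fnet_encoder_block(x, params):
    # x: (seq_len, d_hidden) embeddings; residual + layer norm around each sublayer.
    x = layer_norm(x + fourier_mixing(x))
    x = layer_norm(x + feed_forward(x, *params))
    return x

# Example with random parameters for a toy configuration.
key = jax.random.PRNGKey(0)
seq_len, d_hidden, d_ff = 8, 16, 64
k1, k2, k3 = jax.random.split(key, 3)
params = (jax.random.normal(k1, (d_hidden, d_ff)) * 0.02, jnp.zeros(d_ff),
          jax.random.normal(k2, (d_ff, d_hidden)) * 0.02, jnp.zeros(d_hidden))
x = jax.random.normal(k3, (seq_len, d_hidden))
print(fnet_encoder_block(x, params).shape)  # (8, 16)
```

Note that the mixing step has no learnable parameters at all; everything the model learns lives in the embeddings, the feed-forward sublayers, and the output projections.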
The simplest interpretation for the Fourier Transform is as a particularly effective mechanism for mixing tokens, which evidently provides the feed-forward sublayers sufficient access to all tokens. Because of the duality of the Fourier Transform, we can also view each alternating encoder block as applying alternating Fourier and inverse Fourier Transforms, transforming the input back and forth between real and frequency space. Because multiplying, in the feed-forward sublayer, in frequency space is the same as convolving in real space, FNet is alternating between multiplications and (large kernel) convolutions.

We use the same embedding layers as in Devlin et al. (2018); namely, we combine the word embeddings, absolute position embeddings of the tokens and type embeddings of the sentences. The type embeddings are used to distinguish input segments – e.g. sentence A and B for Next Sentence Prediction (NSP), sentence pairs for paraphrasing, hypothesis-premise pairs for entailment and question-passage pairs for question answering.

3.3 Implementation

Empirically, we found that:

• On GPUs: the FFT is faster than matrix multiplications for all sequence lengths we consider (512 – 8192 tokens).

• On TPUs: for relatively shorter sequences (≤ 8192 tokens), it is faster to pre-compute the DFT matrix and then compute the Fourier Transform through matrix multiplications than using the FFT; for longer sequences, the FFT is faster.

As a result, our GPU FNet implementation always uses the FFT, while our TPU implementation computes the 2D Fourier Transform using matrix multiplications for sequences up to lengths of 8192 and the FFT for longer lengths.

Presumably the GPU vs TPU difference is primarily a result of two factors: (1) TPUs are even more highly optimized for matrix multiplications than GPUs, and (2) GPUs offer a more efficient FFT implementation than TPUs. We suspect that FNet will only become more performant on TPUs when the XLA implementation of the FFT improves.

We implement our models in JAX using the Flax framework². We plan to release the code with a subsequent version of this pre-print.

²https://github.com/google/flax
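As a rough sketch of this dispatch logic (our own illustration; the 8192-token threshold is the one quoted above, and the helper names are ours), the TPU mixing sublayer might select between the two DFT strategies as follows:

```python
import jax.numpy as jnp

TPU_MATMUL_MAX_LEN = 8192  # threshold quoted in Section 3.3

def dft_matrix(n: int) -> jnp.ndarray:
    k = jnp.arange(n)
    return jnp.exp(-2j * jnp.pi * jnp.outer(k, k) / n)

def fourier_mixing_tpu(x):
    # x: (seq_len, d_hidden). Use pre-computable DFT matrices for shorter sequences
    # and fall back to the FFT for longer ones.
    seq_len, d_hidden = x.shape
    if seq_len <= TPU_MATMUL_MAX_LEN:
        w_seq = dft_matrix(seq_len)        # in practice cached / pre-computed once
        w_hidden = dft_matrix(d_hidden)
        return (w_seq @ x.astype(jnp.complex64) @ w_hidden).real
    return jnp.fft.fft2(x).real
```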
4 Results

4.1 Transfer learning

We compare the FNet and Transformer architectures in a common transfer learning setting. For a fuller picture, we compare multiple models:

• BERT-Base: a Transformer model.

• FNet encoder: we replace every self-attention sublayer with a Fourier sublayer as described in Section 3.2.

• Linear encoder: we replace each self-attention sublayer with two learnable, dense, linear sublayers, one applied to the hidden dimension and one applied to the sequence dimension.

• Random encoder: we replace each self-attention sublayer with two constant random matrices, one applied to the hidden dimension and one applied to the sequence dimension.

• Feed Forward-only (FF-only) encoder: we remove the self-attention sublayer from the Transformer layers; this leaves a ResNet with only feed-forward layers and no token mixing.
We originally introduced the Linear model as a baseline for FNet. However, despite its simplicity, it turns out to be surprisingly accurate and fast, especially on TPUs. Our Linear model is similar to the Random Synthesizer (Tay et al., 2020a), but simplifies that model further by removing the multiple heads and softmax projections, resulting in just two matrix multiplications in the mixing sublayer. In our limited cursory experiments, we found no major accuracy benefit in the softmax or multiple head projections, although they did come with the cost of slightly decreasing training speeds.
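Concretely, the Linear mixing sublayer amounts to the two dense matrix multiplications below – one over the hidden dimension and one over the sequence dimension. This is our own minimal sketch (weight names and shapes are assumptions), intended only to illustrate the baseline:

```python
import jax
import jax.numpy as jnp

def linear_mixing(x, w_seq, w_hidden):
    # x: (seq_len, d_hidden); w_seq: (seq_len, seq_len); w_hidden: (d_hidden, d_hidden).
    # Mix across tokens with w_seq and across features with w_hidden - no heads, no softmax.
    return w_seq @ x @ w_hidden

key = jax.random.PRNGKey(0)
seq_len, d_hidden = 8, 16
k1, k2, k3 = jax.random.split(key, 3)
x = jax.random.normal(k1, (seq_len, d_hidden))
w_seq = jax.random.normal(k2, (seq_len, seq_len)) * 0.02       # learnable in the Linear model
w_hidden = jax.random.normal(k3, (d_hidden, d_hidden)) * 0.02  # learnable in the Linear model
print(linear_mixing(x, w_seq, w_hidden).shape)  # (8, 16)
```

The Random baseline is the same computation with w_seq and w_hidden frozen at their random initial values, and FNet replaces them with the fixed, complex DFT matrices.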
The 2D Fourier Transform of FNet can be represented by two complex DFT matrix multiplications. So although the Linear encoder model cannot exactly recreate the Fourier Transform (because the dense layers are real), it is still reasonable to expect that the Linear encoder model will be a more flexible model. As we shall observe in the following subsections, the Linear model can achieve higher accuracy than FNet on certain tasks, but has several drawbacks relative to FNet – it is slower on GPUs, has a much larger memory footprint, and is often prone to instability during training (especially in its larger configuration).

4.1.1 Masked language modelling pre-training

We adopt the same fixed "Base" and "Large" model and learning configurations as for the original BERT (Devlin et al., 2018). We train on the much larger C4 dataset (Raffel et al., 2019) and use a 32000 SentencePiece vocabulary model (Kudo and Richardson, 2018) trained on a 100 million sentence subset of C4. Our TPU experiments use a batch size of 256 as in Devlin et al. (2018) and are each run on 4 × 4 TPU v3 chips. Our GPU experiments use a smaller batch size of 64 and are run on 8 V100 chips.

Tables 2 and 3 summarize the pre-training metrics and training speeds, respectively, for the different models. Because the training configuration is lifted from Devlin et al. (2018), it may be slightly biased towards the BERT attention model.

From Table 2, we see that the FF-only model severely underperforms the other models, indicating that, as expected, token mixing is critical to the expressivity of the model. For example, a 0.5 accuracy result for the FF-only model on the binary NSP task indicates that the model fails to learn the task. The accuracy metrics of the Random model, which does allow mixing, show that it is at least somewhat able to learn the tasks and perform significantly better than chance. BERT obtains the best results. The Linear model and FNet significantly outperform the FF-only and Random baselines, but underperform BERT.

Turning to Table 3, we see that although they are slightly less accurate, the Linear model and FNet train significantly faster than BERT – about twice as fast on TPUs and roughly seven times faster on GPUs. Remarkably, on GPUs, an FNet-Large model trains more than twice as fast as a BERT-Base model. We also find that the three models with no learnable parameters in their mixing layer, namely FNet, Random and FF-only, are the most stable during training.
Model | Mixing layer ops (forward pass) | Mixing layer params (millions) | Encoder params (millions) | Pre-training params (millions)
BERT-Base | n d_hidden (2n + 4 d_hidden) | 2.36 | 111 | 136
Linear-Base | n d_hidden (n + d_hidden) | 0.85 | 93 | 118
FNet-Base (matmul) | n d_hidden (n + d_hidden) | 0 | 83 | 108
FNet-Base (FFT) | n d_hidden (log(n) + log(d_hidden)) | 0 | 83 | 108
Random-Base | n d_hidden (n + d_hidden) | 0 | 83 | 108
FF-only-Base | 0 | 0 | 83 | 108

Table 1: Model operations and learnable parameters. n is the sequence length and d_hidden is the model hidden dimension. The "encoder" is the pre-training model without the output projections onto the embedding table for MLM and the binary classification head for NSP. The mixing layer operation and parameter values are given on a per-layer basis.
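As a rough illustration of these operation counts (our own back-of-the-envelope numbers, not figures from the table, assuming the Base configuration n = 512, d_hidden = 768 and base-2 logarithms): the BERT mixing sublayer performs n·d_hidden·(2n + 4d_hidden) = 512 · 768 · 4096 ≈ 1.6 × 10⁹ operations per layer; the Linear and matmul FNet mixing performs 512 · 768 · 1280 ≈ 5.0 × 10⁸, roughly 3× fewer; and the FFT-based mixing performs about 512 · 768 · 18.6 ≈ 7.3 × 10⁶, more than two orders of magnitude fewer – although, as the speed results below show, hardware constants matter at least as much as these asymptotic counts.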

Model | Total loss | MLM loss | NSP loss | MLM accuracy | NSP accuracy
BERT-Base | 1.76 | 1.48 | 0.28 | 0.68 | 0.86
Linear-Base | 2.12 | 1.78 | 0.35 | 0.62 | 0.83
FNet-Base | 2.45 | 2.06 | 0.40 | 0.58 | 0.80
Random-Base | 5.02 | 4.48 | 0.55 | 0.26 | 0.70
FF-only-Base | 7.54 | 6.85 | 0.69 | 0.13 | 0.50
FNet-Hybrid-Base | 2.13 | 1.79 | 0.34 | 0.63 | 0.84
BERT-Large | 1.49 | 1.23 | 0.25 | 0.72 | 0.88
Linear-Large | 1.91 | 1.60 | 0.31 | 0.65 | 0.85
FNet-Large | 2.11 | 1.75 | 0.36 | 0.63 | 0.82

Table 2: Loss and accuracy pre-training metrics on TPUs. The GPU metrics are very similar.

BERT's higher accuracy is not simply a result of having more parameters than the other models. Indeed, Table 2 shows that BERT-Base is actually more accurate than FNet-Large, which contains more than twice as many parameters. BERT is presumably more expressive because the mixing (attention) weights are both task specific and token dependent, determined by token-token (query-key) dot products; see also Tay et al. (2020a). FNet's mixing weights, on the other hand, are neither task specific nor token dependent. We found that other task independent mixing transformations, such as the Random model (see Table 3) and the Hadamard³ transformation (results not included), led to less accurate models than FNet.

Model | GPU (batch size 64) | TPU (batch size 256)
BERT-Base | 161 | 41
Linear-Base | 28 (5.7x) | 23 (1.8x)
FNet-Base | 24 (6.9x) | 21 (2.0x)
Random-Base | 26 (6.1x) | 21 (2.0x)
FF-only-Base | 21 (7.8x) | 20 (2.0x)
FNet-Hybrid-Base | 28 (5.7x) | 22 (1.8x)
BERT-Large | FAIL | 89
Linear-Large | FAIL | 51 (1.8x)
FNet-Large | 70 | 44 (2.0x)

Table 3: Milliseconds per training step with batch sizes of 64 (GPU) and 256 (TPU). BERT-Large and Linear-Large are unstable for a batch size of 64.

We also include metrics from a hybrid FNet attention model in Tables 2 and 3. In the hybrid model, we replace the final two Fourier sublayers of FNet with self-attention sublayers.⁵ The addition of just two self-attention sublayers significantly increases accuracy with only small costs in training speed.

³Whereas the Fourier Transform matrix (2) contains N roots of unity, the Hadamard Transform simply contains two roots of unity: {±1}; see Kunz (1979) for a full description of the relationship between the two transforms.
⁴The FF-only-Base average score does not include the failure for STS-B.
⁵Other configurations are certainly possible, but we generally found that replacing the final layers was the best.

Model | MNLI (m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg.
BERT-Base | 84/81 | 87 | 91 | 93 | 73 | 89 | 83 | 69 | 83.3
Linear-Base | 74/75 | 84 | 80 | 94 | 67 | 67 | 83 | 69 | 77.0
FNet-Base | 72/73 | 83 | 80 | 95 | 69 | 79 | 76 | 63 | 76.7
Random-Base | 51/50 | 70 | 61 | 76 | 67 | 4 | 73 | 57 | 56.6
FF-only-Base | 34/35 | 31 | 52 | 48 | 67 | FAIL | 73 | 54 | 49.3⁴
FNet-Hybrid-Base | 78/79 | 85 | 88 | 94 | 76 | 86 | 79 | 60 | 80.6
BERT-Large | 88/88 | 88 | 92 | 95 | 71 | 88 | 86 | 66 | 84.7
Linear-Large | 35/36 | 84 | 80 | 79 | 67 | 24 | 73 | 60 | 59.8
FNet-Large | 78/76 | 85 | 85 | 94 | 78 | 84 | 88 | 69 | 81.9

Table 4: GLUE Validation results on TPUs. All models are pre-trained. F1-accuracy mean scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. After controlling for batch size and training steps, the GPU metrics (not shown) are very similar.

4.1.2 GLUE fine-tuning

For the GLUE benchmark (Wang et al., 2018), we use the same settings as in Devlin et al. (2018). We found that different runs with the same base learning rate could yield slightly different results. Consequently, for the Base models, we performed 3 trials for each base learning rate and reported the best result across all learning rates and trials. As noted in Devlin et al. (2018), we found that BERT-Large was more unstable than its Base counterpart. As a result, for the Large models, we performed 6 trials for each learning rate.

We report the results for the best base learning rate (no early stopping) on the GLUE Validation split in Table 4.⁶ The GLUE Base results mirror the pre-training metrics – BERT performs best, FNet and the Linear model both slightly underperform BERT by 7.5 – 8%, and the FF-only and Random models perform very poorly. The hybrid FNet model performs very well, achieving 97% of BERT's accuracy. Recall from Table 3 that all of the aforementioned models pre-train six to seven times faster than BERT on GPUs, and roughly twice as fast on TPUs.

Interestingly, the gap between the BERT and FNet Large models (3%) is smaller than that of their Base counterparts; this may be due to FNet-Large being more stable during training than BERT-Large.⁷ The Linear-Large model severely underperforms its Base counterpart on the GLUE benchmark, presumably due to training instabilities. As noted in Section 4.1.1, we found that the Linear model and BERT were less stable than the FNet, Random and FF-only models.

⁶WNLI is excluded in Devlin et al. (2018). BERT's accuracy on WNLI is below baseline, unless a special training recipe is used. See also (12) in https://gluebenchmark.com/faq.
⁷Devlin et al. (2018) obtain roughly a 2.5 average point boost on the Test split going from BERT-Base to BERT-Large. We only see a roughly 1.5 point boost on the Validation split, which may be due to reduced headroom on the Validation split.

4.1.3 Speed and accuracy trade-offs

The pre-training and fine-tuning results show that MLM metrics are a good indicator of final GLUE results. Hence, in this subsection, we restrict our attention to MLM accuracy.

Speed vs MLM accuracy curves for GPU (8 V100 chips) and TPU (4 × 4 TPU v3 chips) pre-training are shown in Figures 2 and 3, respectively. Both TPU and GPU models are trained for 1 million steps as in Devlin et al. (2018). Motivated by the models considered in Turc et al. (2019), we evaluated several model sizes; see Table 5. As in Turc et al. (2019), for all models, we fixed the feed-forward size to 4 d_hidden and the number of self-attention heads (for BERT) to d_hidden / 64. We found that the smaller model architectures benefited from larger learning rates, so, for all models, we choose the best results from two base learning rates: 10⁻³ and 10⁻⁴.

Name | d_hidden | d_ff | Layers | BERT params (M) | Linear params (M) | FNet params (M)
Large | 1024 | 4096 | 24 | 372 | 302 | 271
Base | 768 | 3072 | 12 | 136 | 118 | 108
– | 512 | 2048 | 12 | 72 | 72 | 72
– | 512 | 2048 | 8 | 59 | 59 | 59
"Mini" | 512 | 2048 | 4 | 46 | 46 | 46
– | 256 | 1024 | 4 | 20 | 20 | 20
"Micro" | 256 | 1024 | 2 | 18 | 18 | 18
– | 128 | 512 | 2 | 9 | 9 | 9

Table 5: Model sizes. Named models are highlighted in Figures 2 and 3. The smaller architectures have a similar number of parameters across all models because the majority of parameters are in the embedding and output layers.

[Figure 2: Speed-accuracy Pareto curves for GPU pre-training. The dashed region is where the FNet (yellow) and Linear (red) curves are both more accurate (above) and faster (left) of the BERT (blue) curve, indicating better speed-accuracy trade-offs.]

[Figure 3: Speed-accuracy Pareto curves for TPU pre-training.]

Figures 2 (GPU) and 3 (TPU) display the same trends. For larger models, BERT is the most accurate. For fixed training speeds slower than roughly 15 ms, BERT is the most accurate model. But for smaller, very fast models, FNet and the Linear model offer a better trade off between speed
and accuracy than BERT, as indicated by their respective curves rising above (higher accuracy) and leftward (faster) of the BERT curve; see the dashed region in Figure 2. So although a "Mini" FNet is larger than a "Micro" BERT, the FNet model is both more accurate and faster to train than the smaller "Micro" BERT. As we shall see below in Section 4.2, this comparison of larger FNets with smaller Transformers is also reasonable in terms of peak memory usage, as FNet has a much lighter footprint than the Transformer.

4.2 Long-Range Arena benchmark

Of the efficient Transformers evaluated on the Long-Range Arena benchmark, Table 1 of Tay et al. (2020c) shows that the vanilla Transformer, Big Bird (Zaheer et al., 2020) and Longformer (Beltagy et al., 2020) are the most accurate models.

We use the Tay et al. (2020c) codebase and run on the same hardware (4 × 4 TPU v3 chips). To ensure a fair comparison, we also report the results of our Long-Range Arena experiments for the vanilla Transformer in Table 6.⁸

Using the Transformer as a yardstick, we can perform a relative comparison with the results in Tay et al. (2020c). This comparison should be made with two caveats in mind:

• We found that results for certain tasks – in particular, the Text and Retrieval tasks – can vary quite a bit between runs, especially for the Transformer; we report the best results.

• The learning configuration may not be fully optimized even for the Transformer; for example, we found that the Retrieval and ListOps tasks may both benefit from training for longer than suggested in Tay et al. (2020c).

⁸Several hyperparameters are not described in Tay et al. (2020c) and there are a few mismatches between the configurations described in the paper and the code repository. Where possible, we prioritize configurations described in the paper, with only two exceptions. Firstly, for the CIFAR10 (Image) task, we perform a grid search of the number of layers in the range [1, 2, 3, 4]. We found that 1 layer worked best for all models; Tay et al. (2020c) suggest 3 layers yielded the best results. Secondly, for the Pathfinder task, we found that a base learning rate of 0.001 (as given in the code repository) yielded better results for all models than the 0.01 value indicated in Tay et al. (2020c).

Bearing these caveats in mind, the primary observation from Table 6 is that, in aggregate, the (vanilla) Transformer model and FNet obtain very comparable results. Moreover, the results in Tay et al. (2020c) indicate that the (vanilla) Transformer is the second most accurate model studied. Given the relative differences in the average accuracy scores in Tay et al. (2020c) and Table 6, our results suggest that FNet would be competitive with the most accurate of the efficient Transformers.

Tay et al. (2020c) also report training speed and memory usage statistics for the selected Transformer models. In Tables 7 and 8, we provide comparable metrics from our experiments on TPUs (4 × 4 TPU v3 chips) and GPUs (8 V100 chips). We perform a grid search over sequence lengths {512, 1024, 2048, 4096, 8192, 16386} instead of just {1K, 2K, 3K, 4K}. We also include results for the Performer (Choromanski et al., 2020b), which the results in Tay et al. (2020c) suggest is the fastest of the efficient Transformers that they evaluated. We use the Performer as an extra yardstick in comparing model efficiencies.

On GPUs, FNet is much faster than all other models, across all sequence lengths. This is presumably due to the highly efficient FFT implementation on GPUs. As TPU FFT implementations improve, we expect to see similar gains on TPUs.

For the time being, on TPUs, the Linear model and FNet still train faster than all the efficient Transformers for sequence lengths ≤ 4096 and ≤ 2048, respectively. For longer sequences, FNet is slower than the Performer and, based on results in Tay et al. (2020c), likely also slower than the other efficient Transformers that linearize attention, namely Local Attention (Parmar et al., 2018), Linformer (Wang et al., 2020) and Linear Transformer (Katharopoulos et al., 2020). It is worth noting that Table 1 in Tay et al. (2020c) and Table 6 in this paper together suggest that FNet is more accurate than all of the aforementioned models.

For both TPUs and GPUs, Table 8 indicates that FNet has a lighter memory footprint than all of the efficient Transformer models. This is partly because FNet has no learnable parameters in its mixing layer, but also due to the efficiency of the FFT, especially at longer sequence lengths.

A model's training speed is likely of primary concern when first optimizing a model for a given use case, by iterating through data configurations and tuning model hyperparameters. Similarly, in settings where models are continuously trained in an online fashion, such as on-device and other edge applications, both training speed and memory footprint will be important considerations.
Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X | Avg.
Transformer | 36.06 | 61.54 | 59.67 | 41.51 | 80.38 | FAIL | 55.83
Linear | 33.75 | 53.35 | 58.95 | 41.04 | 83.69 | FAIL | 54.16
FNet | 35.33 | 65.11 | 59.61 | 38.67 | 77.80 | FAIL | 55.30

Table 6: Accuracy results on the Long-Range Arena benchmark. Average score does not include Path-X. These results were obtained on TPUs. All models fail the Path-X task, but for different reasons – Transformer fails due to memory limits, whereas FNet and the Linear model fail because they perform no better than chance on the binary Path-X classification task.

Seq. length | 512 | 1024 | 2048 | 4096 | 8192 | 16386
GPU
Transformer | 26 | 11 | 4 | FAIL | FAIL | –
Linear | 49 (1.9x) | 23 (2.0x) | 11 (2.6x) | 4 | FAIL | –
FNet (FFT) | 60 (2.3x) | 30 (2.7x) | 16 (3.9x) | 8 | 4 | –
Performer | 32 (1.3x) | 19 (1.6x) | 10 (2.3x) | 5 | 2 | –
TPU
Transformer | 43 | 16 | 5 | 1 | FAIL | FAIL
Linear | 78 (1.8x) | 62 (3.8x) | 28 (5.7x) | 12 (9.9x) | 4 | FAIL
FNet (matmul) | 92 (2.1x) | 61 (3.8x) | 26 (5.4x) | 11 (8.8x) | 4 | 1
FNet (FFT) | 36 (0.8x) | 25 (1.5x) | 13 (2.7x) | 7 (5.4x) | 3 | 1
Performer | 59 (1.4x) | 42 (2.6x) | 23 (4.6x) | 12 (9.9x) | 6 | 3

Table 7: TPU and GPU training speeds, measured in steps per second (larger is better), on the Text Classification task. Speed up multipliers relative to the Transformer are given in parentheses. Models were not evaluated for the largest sequence length on GPUs.

Seq. length | 512 | 1024 | 2048 | 4096 | 8192 | 16386
GPU
Transformer | 1.6 | 4.0 | 12.2 | FAIL | FAIL | –
Linear | 0.9 | 1.6 | 2.8 | 6.9 | FAIL | –
FNet (FFT) | 0.8 | 1.3 | 2.2 | 3.9 | 7.4 | –
Performer | 1.1 | 1.9 | 3.1 | 5.5 | 10.4 | –
TPU
Transformer | 1.1 | 2.1 | 5.8 | 9.1 | FAIL | FAIL
Linear | 0.9 | 1.1 | 1.9 | 4.9 | 14.8 | FAIL
FNet (matmul) | 0.8 | 0.9 | 1.3 | 2.2 | 4.8 | 11.9
FNet (FFT) | 0.8 | 0.9 | 1.3 | 2.0 | 3.5 | 6.3
Performer | 1.0 | 1.3 | 1.8 | 3.0 | 5.1 | 9.6

Table 8: TPU and GPU peak memory usage (GB; smaller is better) on the Text Classification task during training. Models were not evaluated for the largest sequence length on GPUs.
Seq. length | 512 | 1024 | 2048 | 4096 | 8192 | 16386
GPU
Transformer | 5.8 | 21.3 | 67.6 | 227.3 | FAIL | FAIL
Linear | 5.4 (1.1x) | 6.1 (3.5x) | 22.6 (3.0x) | 63.7 (3.6x) | 181.8 | FAIL
FNet (FFT) | 5.3 (1.1x) | 5.2 (4.1x) | 15.6 (4.3x) | 35.8 (6.3x) | 74.6 | 147.1
Performer | 5.8 (1.0x) | 10.9 (2.0x) | 24.8 (2.7x) | 53.5 (4.3x) | 109.9 | 227.3
TPU
Transformer | 4.6 | 7.3 | 34.6 | 126.6 | 476.2 | FAIL
Linear | 5.1 (0.9x) | 5.5 (1.3x) | 4.8 (7.2x) | 14.6 (8.7x) | 49.5 (9.6x) | FAIL
FNet (matmul) | 4.7 (1.0x) | 5.3 (1.4x) | 9.3 (3.7x) | 35.0 (3.6x) | 131.6 (3.6x) | 454.5
FNet (FFT) | 5.7 (0.8x) | 11.8 (0.6x) | 28.6 (1.2x) | 58.8 (2.2x) | 119.0 (4.0x) | 232.6
Performer | 5.1 (0.9x) | 5.4 (1.3x) | 6.2 (5.6x) | 18.5 (6.8x) | 26.5 (18.0x) | 55.9

Table 9: GPU and TPU inference speeds on the Text Classification task. Results are given in microseconds per batch (smaller is better). Speed up multipliers relative to the Transformer are given in parentheses.

For other applications, researchers and practitioners may prioritize inference speeds. Table 9 shows that the training speed trends generally carry over to inference too, with FNet's gains on GPUs again being stronger than those on TPUs.

5 Conclusion

In this work, we studied simplified mixing modules for Transformer-like encoder architectures. We made the following contributions:

• We showed that simple, linear token "mixing" transformations, along with the nonlinearities in feed-forward layers, are sufficient to model diverse semantic relationships in text.

• We introduced FNet, a Transformer-like model wherein the self-attention sublayer is replaced by an unparameterized Fourier Transform. FNet achieves 92% of BERT's accuracy on the GLUE benchmark, but trains seven times as fast on GPUs and twice as fast on TPUs.

• Because of its favorable scaling properties, FNet is very competitive with the "efficient" Transformers evaluated on the Long-Range Arena benchmark. Specifically, we find that FNet is comparably accurate to the most accurate "efficient" Transformers. Moreover, on GPUs, FNet was significantly faster at both training and inference time than all efficient Transformers across all sequence lengths; on TPUs, FNet is faster at training and inference for relatively shorter sequence lengths (≤ 2048 and ≤ 1024 tokens, respectively). Our experiments also showed that FNet has a lighter memory footprint than all the efficient Transformer architectures, across all sequence lengths.

• We demonstrated that, for a fixed speed and accuracy budget, small FNet models outperform Transformer models.

Our work has highlighted the potential of linear transformations as a drop-in replacement for the attention mechanism in text classification tasks. We found the Fourier Transform to be a particularly effective mixing mechanism, in part due to the highly efficient FFT. It is quite remarkable that an unparameterized mixing mechanism can yield a relatively very accurate model. On a practical note, we only performed a cursory survey of other linear transformations. Therefore, we believe there may be value in exploring other fast transformations.

Because of the speed and accuracy advantages of smaller FNet models relative to Transformer models, we suspect that FNet will be effective as a lightweight, distilled student model deployed in settings where latency or memory is important, such as in production web services or on edge devices. The need for such lightweight serving models is only forecast to grow given the surge in interest in giant models (Raffel et al., 2019; Brown et al., 2020; Lepikhin et al., 2020). A natural avenue to explore in this regard will be using knowledge distillation on small FNet models from larger Transformer teacher models, following approaches established by Sanh et al. (2019); Jiao et al. (2019); Turc et al. (2019) and others.

Another aspect of interest that we only briefly touched on in this work is hybrid FNet-attention models. We found that adding self-attention sublayers to FNet models offers a simple way to trade off speed for accuracy, at least in the relatively shorter sequence length application we considered. Specifically, replacing the final two Fourier sublayers of FNet with self-attention sublayers yielded a model that achieved 97% of BERT's accuracy on the GLUE benchmark, but pre-trained nearly six times as fast on GPUs and twice as fast on TPUs.

Throughout this work we have restricted our attention to encoders. FNet decoders can be designed using causally masked discrete Fourier Transform matrices. However, a lower level implementation would be required if one wanted to introduce causal masking to the FFT. It is also not immediately clear what the right mechanism for encoder-decoder interactions would be; for example, You et al. (2020) suggest that cross attention between the encoder and decoder is crucial to Transformer encoder-decoder performance.

References

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Andrew R Barron. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Kamran Chitsaz, Mohsen Hajabdollahi, Nader Karimi, Shadrokh Samavi, and Shahram Shirani. 2020. Acceleration of convolutional neural network using FFT-based split convolutions. arXiv preprint arXiv:2003.12621.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. 2020a. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020b. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301.

George Cybenko. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Hazem M El-Bakry and Qiangfu Zhao. 2004. Fast object/face detection using neural networks and fast Fourier transform. International Journal of Signal Processing, 1(3):182–187.

Matteo Frigo and Steven G Johnson. 2005. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231.

Himanshu Gothwal, Silky Kedawat, Rajesh Kumar, et al. 2011. Cardiac arrhythmias detection in an ECG beat signal using fast Fourier transform and artificial neural network. Journal of Biomedical Science and Engineering, 4(04):289.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

Tyler Highlander and Andres Rodriguez. 2016. Very efficient training of convolutional neural networks using fast Fourier transform and overlap-and-add. arXiv preprint arXiv:1601.06815.

Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2020. Data movement is all you need: A case study of transformer networks. arXiv preprint arXiv:2007.00072.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR.

Young Jin Kim and Hany Hassan Awadalla. 2020. FastFormers: Highly efficient transformer models for natural language understanding. arXiv preprint arXiv:2010.13382.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.

Renée Koplon and Eduardo D Sontag. 1997. Using Fourier-neural recurrent networks to fit sequential input/output data. Neurocomputing, 15(3-4):225–248.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Henry O. Kunz. 1979. On the equivalence between one-dimensional discrete Walsh-Hadamard and multidimensional discrete Fourier transforms. IEEE Computer Architecture Letters, 28(03):267–268.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. 2020. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895.

Sheng Lin, Ning Liu, Mahdi Nazemi, Hongjia Li, Caiwen Ding, Yanzhi Wang, and Massoud Pedram. 2018. FFT-based deep learning deployment in embedded systems. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1045–1050. IEEE.

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198.

Michael Mathieu, Mikael Henaff, and Yann LeCun. 2014. Fast training of convolutional networks through FFTs. In 2nd International Conference on Learning Representations, ICLR 2014.

Kei-ichiro Minami, Hiroshi Nakajima, and Takeshi Toyoshima. 1999. Real-time discrimination of ventricular tachyarrhythmia with Fourier-transform neural network. IEEE Transactions on Biomedical Engineering, 46(2):179–185.

Martina Mironovova and Jiří Bíla. 2015. Fast Fourier transform for feature extraction and neural network for classification of electrocardiogram signals. In 2015 Fourth International Conference on Future Generation Communication Technology (FGCT), pages 1–6. IEEE.

Marcin Moczulski, Misha Denil, Jeremy Appleyard, and Nando de Freitas. 2015. ACDC: A structured efficient linear layer. arXiv preprint arXiv:1511.05946.

Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. 2021. Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. 2021. Random feature attention. arXiv preprint arXiv:2103.02143.

Harry Pratt, Bryan Williams, Frans Coenen, and Yalin Zheng. 2017. FCNN: Fourier convolutional neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 786–798. Springer.

Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. 2019. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. 2020. Fixed encoder self-attention patterns in transformer-based machine translation. arXiv preprint arXiv:2002.10260.

Ali Rahimi and Benjamin Recht. 2008. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc.

Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, et al. 2020. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217.

S. Reid and M. Mistele. 2019. Fast Fourier transformed transformers: Circulant weight matrices for NMT compression.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.

Sam Shleifer and Alexander M Rush. 2020. Pre-trained summarization distillation. arXiv preprint arXiv:2010.13002.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020a. Synthesizer: Rethinking self-attention in transformer models. arXiv preprint arXiv:2005.00743.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020b. Sparse Sinkhorn attention. In International Conference on Machine Learning, pages 9438–9447. PMLR.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020c. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020d. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. 2020. Fast transformers with clustered attention. arXiv preprint arXiv:2007.04825.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Weiqiu You, Simeng Sun, and Mohit Iyyer. 2020. Hard-coded Gaussian attention for neural machine translation. arXiv preprint arXiv:2005.00742.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Jiong Zhang, Yibo Lin, Zhao Song, and Inderjit Dhillon. 2018. Learning long term dependencies via Fourier recurrent units. In International Conference on Machine Learning, pages 5815–5823. PMLR.

Ying-Qian Zhang and Lai-Wan Chan. 2000. ForeNet: Fourier recurrent networks for time series prediction.

Zhenyou Zhang, Yi Wang, and Kesheng Wang. 2013. Fault diagnosis and prognosis using wavelet packet decomposition, Fourier transform and artificial neural network. Journal of Intelligent Manufacturing, 24(6):1213–1227.
