Attention Augmented Convolutional Networks
Abstract

Convolutional networks have been the paradigm of choice in many computer vision applications. The convolution operation however has a significant weakness in that it only operates on a local neighborhood, thus missing global information. Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. In this paper, we consider the use of [...]

[Figure 1 (plot): top-1 accuracy; numeric axis residue removed.]
[Figure 2 (diagram): input, attention maps (Nh = 2 heads), weighted average of the values, output, alongside a standard convolution.]

Figure 2. Attention-augmented convolution: For each spatial location (h, w), Nh attention maps over the image are computed from queries and keys. These attention maps are used to compute Nh weighted averages of the values V. The results are then concatenated, reshaped to match the original volume's spatial dimensions and mixed with a pointwise convolution. Multi-head attention is applied in parallel to a standard convolution operation and the outputs are concatenated.
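For concreteness, the operation described in Figure 2 can be sketched in a few lines of PyTorch. The snippet below is illustrative only and is not the implementation used in this paper: the class name AugmentedConv2d and its arguments are ours, and the relative position encodings used by the relative self-attention variant are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AugmentedConv2d(nn.Module):
    """Minimal sketch of an attention-augmented convolution (Figure 2)."""

    def __init__(self, in_ch, out_ch, kernel_size, dk, dv, num_heads):
        super().__init__()
        assert dk % num_heads == 0 and dv % num_heads == 0
        self.dk, self.dv, self.nh = dk, dv, num_heads
        # Standard convolution produces the non-attentional output channels.
        self.conv = nn.Conv2d(in_ch, out_ch - dv, kernel_size,
                              padding=kernel_size // 2)
        # A single 1x1 convolution computes queries, keys and values.
        self.qkv = nn.Conv2d(in_ch, 2 * dk + dv, kernel_size=1)
        # Pointwise convolution that mixes the concatenated head outputs.
        self.attn_out = nn.Conv2d(dv, dv, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        q, k, v = self.qkv(x).split([self.dk, self.dk, self.dv], dim=1)

        def split_heads(t, d):
            # (B, d, H, W) -> (B, Nh, H*W, d / Nh)
            return t.reshape(b, self.nh, d // self.nh, h * w).transpose(2, 3)

        q = split_heads(q, self.dk) * (self.dk // self.nh) ** -0.5
        k = split_heads(k, self.dk)
        v = split_heads(v, self.dv)

        # Nh attention maps over all H*W positions for every query location.
        weights = F.softmax(q @ k.transpose(2, 3), dim=-1)
        # Nh weighted averages of the values, concatenated over heads and
        # reshaped back to the original spatial dimensions.
        o = (weights @ v).transpose(2, 3).reshape(b, self.dv, h, w)
        o = self.attn_out(o)

        # Attention output is concatenated with the standard convolution.
        return torch.cat([self.conv(x), o], dim=1)


# Example: 64 -> 128 channels, 32 of which come from attention (dv = 32).
layer = AugmentedConv2d(in_ch=64, out_ch=128, kernel_size=3, dk=32, dv=32, num_heads=4)
print(layer(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 128, 16, 16])
```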
[...] to the total batch size, using a base learning rate of 0.128 for a total batch size of 256. During training, we linearly scale the learning rate from 0 to this value for the first 5% of training epochs and divide it by 10 at epochs 30, 60, 80 and 90. We use standard Inception data augmentation as described in [39].
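The schedule above can be written as a short helper for reference. This is an illustrative sketch rather than the code used in the experiments; the 100-epoch horizon is an assumption suggested by the milestone epochs.

```python
def learning_rate(epoch, total_epochs=100, batch_size=256):
    """Linear warmup for the first 5% of epochs, then step decays at 30/60/80/90."""
    base_lr = 0.128 * batch_size / 256          # linear scaling of the base rate
    warmup_epochs = 0.05 * total_epochs         # first 5% of training epochs
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs  # ramp from 0 to base_lr
    drops = sum(epoch >= milestone for milestone in (30, 60, 80, 90))
    return base_lr / (10 ** drops)              # divide by 10 at each milestone
```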
In all experiments, Attention Augmentation significantly increases performance over the non-augmented baseline and notably outperforms Squeeze-and-Excitation (SE) [17] while being more parameter efficient (see Table 2 and Figure 1). Remarkably, our AA-ResNet-50 performs comparably to the baseline ResNet-101 and our AA-ResNet-101 outperforms the baseline ResNet-152. These results suggest that attention augmentation is preferable to simply making networks deeper. We include and discuss attention map visualizations from different pixel positions in the appendix.
Architecture           GFlops  Params  top-1  top-5
ResNet-34 [14]           7.4   21.8M   73.6   91.5
SE-ResNet-34 [17]        7.4   22.0M   74.3   91.8
AA-ResNet-34 (ours)      7.1   20.7M   74.7   92.0
ResNet-50 [14]           8.2   25.6M   76.4   93.1
SE-ResNet-50 [17]        8.2   28.1M   77.5   93.7
AA-ResNet-50 (ours)      8.3   25.8M   77.7   93.8
ResNet-101 [14]         15.6   44.5M   77.9   94.0
SE-ResNet-101 [17]      15.6   49.3M   78.4   94.2
AA-ResNet-101 (ours)    16.1   45.4M   78.7   94.4
ResNet-152 [14]         23.0   60.2M   78.4   94.2
SE-ResNet-152 [17]      23.1   66.8M   78.9   94.5
AA-ResNet-152 (ours)    23.8   61.6M   79.1   94.6

Table 2. Image classification on the ImageNet dataset [9] across a range of ResNet architectures: ResNet-34, ResNet-50, ResNet-101, and ResNet-152 [14, 45, 13].

Architecture        GFlops  Params  top-1  top-5
MnasNet-0.75          0.45   2.91M   73.3   91.3
AA-MnasNet-0.75       0.51   3.02M   73.9   91.6
MnasNet-1.0           0.63   3.89M   75.2   92.4
AA-MnasNet-1.0        0.70   4.06M   75.7   92.6
MnasNet-1.25          1.01   5.26M   76.7   93.2
AA-MnasNet-1.25       1.11   5.53M   77.2   93.6
MnasNet-1.4           1.17   6.10M   77.2   93.5
AA-MnasNet-1.4        1.29   6.44M   77.7   93.8

Table 3. Baseline and attention augmented MnasNet [40] accuracies with width multipliers 0.75, 1.0, 1.25 and 1.4.

Figure 4. ImageNet top-1 accuracy as a function of number of parameters for MnasNet (black) and Attention-Augmented-MnasNet (red) with width multipliers 0.75, 1.0, 1.25 and 1.4.
4.3. ImageNet classification with MnasNet

In this section, we inspect the use of Attention Augmentation in a resource-constrained setting by conducting ImageNet experiments with the MnasNet architecture [40], which is an extremely parameter-efficient architecture. In particular, the MnasNet was found by neural architecture search [51], using only the highly optimized mobile inverted bottleneck block [34] and the Squeeze-and-Excitation operation [17] as the primitives in its search space. We follow the training setup described in [40] and train all networks for 350 epochs with the RMSProp optimizer using exponential learning rate decay. When training our augmented MnasNets, we divide the learning rate by 2 and adjust the learning rate decay so that the final learning rate stays the same. We apply Attention Augmentation to the mobile inverted bottleneck by replacing convolutional channels in the expansion pointwise convolution using κ = 2υ = 0.1 and Nh = 4. Our augmented MnasNets use augmented inverted bottlenecks in the last 13 blocks out of 18 in the MnasNet architecture, starting when the spatial dimension is 28×28. We downsample only in the first stage where Attention Augmentation is applied. We leave the final pointwise convolution, also referred to as the "head", unchanged.

In Table 3, we report ImageNet accuracies for the baseline MnasNet and its attention augmented variants at different width multipliers. Our experiments show that Attention Augmentation yields accuracy improvements across all width multipliers. Augmenting MnasNets with relative self-attention incurs a slight parameter increase; however, we verify in Figure 4 that the accuracy improvements are not just explained by the parameter increase. Additionally, we note that the MnasNet architecture employs Squeeze-and-Excitation at multiple locations that were optimally selected via architecture search, further suggesting the benefits of our method.
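To spell out what these hyperparameters mean in channel terms: κ and υ denote the ratios of the total key depth dk and total value depth dv to the number of output filters of the augmented convolution, so κ = 2υ = 0.1 corresponds to κ = 0.1 and υ = 0.05. The helper below is illustrative only; rounding dk and dv to multiples of Nh is our own convention, and the 480-filter example is arbitrary rather than a specific MnasNet layer.

```python
def attention_channels(f_out, kappa=0.1, upsilon=0.05, num_heads=4):
    """Derive attention channel counts from the ratios kappa and upsilon."""
    def round_to_heads(x):
        # Keep per-head depths integral (our own convention).
        return max(num_heads, int(x) // num_heads * num_heads)

    dk = round_to_heads(kappa * f_out)    # total query/key depth
    dv = round_to_heads(upsilon * f_out)  # total value depth
    conv_channels = f_out - dv            # channels still produced by the convolution
    return dk, dv, conv_channels


# Example: an expansion pointwise convolution with 480 output filters.
print(attention_channels(480))  # (48, 24, 456)
```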
Backbone architecture   GFlops  Params  mAP_COCO  mAP_50  mAP_75
ResNet-50 [26]             182   33.4M      36.8    54.5    39.5
SE-ResNet-50 [17]          183   35.9M      36.5    54.0    39.1
AA-ResNet-50 (ours)        182   33.1M      38.2    56.5    40.7
ResNet-101 [26]            243   52.4M      38.5    56.4    41.2
SE-ResNet-101 [17]         243   57.2M      37.4    55.0    39.9
AA-ResNet-101 (ours)       245   51.7M      39.2    57.8    41.9
Table 4. Object detection on the COCO dataset [27] using the RetinaNet architecture [26] with different backbone architectures. We report
mean Average Precision at three different IoU values.
4.4. Object Detection with COCO dataset

We next investigate the use of Attention Augmentation on the task of object detection on the COCO dataset [27]. We employ the RetinaNet architecture with a ResNet-50 and ResNet-101 backbone as done in [26], using the open-sourced RetinaNet codebase.³ We follow the setup described in [26, 11] and train the RetinaNet from scratch for 150 epochs without using ImageNet pretraining for the ResNet backbone. We use the preprocessing pipeline described in [26]. We apply multiscale jitter, randomly resizing images from [512, 768] and cropping to a maximum dimension of 640 during training. All images are horizontally flipped with a 50% probability. We apply Attention Augmentation only to the ResNet backbones, modifying them in the same way as in our ImageNet classification experiments. Our relative self-attention mechanism improves the performance of the RetinaNet on both ResNet-50 and ResNet-101, as shown in Table 4. Most notably, Attention Augmentation yields a 1.4% mAP improvement over a strong RetinaNet baseline from [26].

In contrast to the success of Squeeze-and-Excitation in image classification with ImageNet, our experiments show that adding Squeeze-and-Excitation operators in the backbone network of the RetinaNet significantly hurts performance, in spite of grid searching over the squeeze ratio σ ∈ {4, 8, 16}. We hypothesize that localization requires precise spatial information, which SE discards during the spatial pooling operation, thereby negatively affecting performance. Self-attention, on the other hand, maintains spatial information and is likely able to identify object boundaries successfully. Visualizations of attention maps (see Figures 10 and 11 in the Appendix) reveal that some heads are indeed delineating objects from their background, which might be important for localization.

4.5. Ablation Study

Fully-attentional vision models: In this section, we investigate the performance of Attention Augmentation as a function of the fraction of attentional channels. As we increase this fraction to 100%, we begin to replace a ConvNet with a fully attentional model, only leaving pointwise convolutions and the stem unchanged. Table 5 presents the performance of Attention Augmentation on the ResNet-50 architecture for varying ratios κ = υ ∈ {0.25, 0.5, 0.75, 1.0}. Performance slightly degrades as the ratio of attentional channels increases, which we hypothesize is partly explained by the average pooling operation for downsampling at the first stage where Attention Augmentation is applied. Attention Augmentation proves, however, quite robust to the fraction of attentional channels. For instance, AA-ResNet-50 with κ = υ = 0.75 outperforms its ResNet-50 counterpart while being more parameter and flops efficient, indicating that mostly employing attentional channels is readily competitive.

Perhaps surprisingly, these experiments also reveal that our proposed self-attention mechanism is a powerful stand-alone computational primitive for image classification and that fully attentional models may be viable for discriminative visual tasks. In particular, AA-ResNet-50 with κ = υ = 1, which uses exclusively attentional channels, is only 2.5% worse in accuracy than its fully convolutional counterpart, in spite of downsampling with average pooling and having 25% fewer parameters. Notably, this fully attentional architecture⁴ also outperforms ResNet-34 while being more parameter and flops efficient (see Table 5).

Architecture     GFlops  Params  top-1  top-5
ResNet-34 [14]     7.4   21.8M   73.6   91.5
ResNet-50 [14]     8.2   25.6M   76.4   93.1
κ = υ = 0.25       7.9   24.3M   77.7   93.8
κ = υ = 0.5        7.3   22.3M   77.3   93.6
κ = υ = 0.75       6.8   20.7M   76.7   93.2
κ = υ = 1.0        6.3   19.4M   73.9   91.5

Table 5. Attention Augmented ResNet-50 with varying ratios of attentional channels.
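To make the sweep in Table 5 concrete, the ratio κ = υ controls how the output channels of each augmented layer are split between the convolutional and the attentional branch; at κ = υ = 1 the convolution contributes no channels and the layer becomes fully attentional. The sketch below is purely illustrative, and the 256-filter layer is an arbitrary example rather than a specific ResNet-50 layer.

```python
def channel_split(f_out, ratio):
    """Split f_out output channels between convolution and attention for kappa = upsilon = ratio."""
    dv = int(ratio * f_out)   # attentional (value) channels
    return f_out - dv, dv     # (convolutional channels, attentional channels)


for ratio in (0.25, 0.5, 0.75, 1.0):
    conv_ch, attn_ch = channel_split(256, ratio)
    print(f"kappa = upsilon = {ratio}: {conv_ch} conv + {attn_ch} attention channels")
# At ratio 1.0 every output channel comes from attention, matching the
# fully attentional variant discussed above.
```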
³ https://github.com/tensorflow/tpu/tree/master/models/official/retinanet
⁴ We consider pointwise convolutions as dense layers. This architecture employs 4 non-pointwise convolutions in the stem and the first stage of the architecture, but we believe such operations can be replaced by attention too.
Architecture   Position Encodings  top-1  top-5
AA-ResNet-34   None                74.4   91.9
AA-ResNet-34   2d Sine             74.4   92.0
AA-ResNet-34   CoordConv           74.4   92.0
AA-ResNet-34   Relative (ours)     74.7   92.0