
What are Spatial Attention and Channel Attention?


Ref: https://blog.paperspace.com/attention-mechanisms-in-computer-vision-cbam/
Although the Convolutional Block Attention Module (CBAM) was popularized by the ECCV 2018 paper titled "CBAM: Convolutional Block Attention Module", the general concept was introduced in the 2016 paper titled "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning". SCA-CNN demonstrated the potential of combining multi-layered attention: Spatial Attention and Channel Attention, the two building blocks of CBAM, for Image Captioning. The CBAM paper was the first to successfully showcase the wide applicability of the module, especially for Image Classification and Object Detection tasks.

CBAM contains two sequential sub-modules called the Channel Attention Module (CAM) and the Spatial Attention
Module (SAM), which are applied in that particular order.
So, what's meant by Spatial Attention?

• Spatial refers to the domain space encapsulated within each feature map. Spatial Attention represents the attention mechanism/attention mask applied to the feature map, i.e. a single cross-sectional slice of the tensor. For instance, if the object of interest in an image is a bird, Spatial Attention will generate a mask that enhances the features defining that bird. By refining the feature maps with Spatial Attention, we enhance the input to the subsequent convolutional layers, which improves the performance of the model (see the sketch below).
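As a minimal sketch of the idea (assuming a PyTorch-style tensor layout; this is illustrative, not code from the referenced blog), a spatial attention mask holds one value per spatial location and is broadcast across every channel of the feature tensor:

import torch

features = torch.randn(1, 64, 32, 32)      # (batch, c, h, w) feature maps
spatial_mask = torch.rand(1, 1, 32, 32)    # hypothetical (batch, 1, h, w) mask, values in [0, 1)
refined = features * spatial_mask          # broadcast element-wise product over channels
print(refined.shape)                       # torch.Size([1, 64, 32, 32])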
Then what's channel attention, and do we even need it?

• As discussed above, channels are essentially the feature maps stacked in a tensor, where each cross-sectional slice is basically a feature map of dimension (h × w). Usually, the trainable weights making up the filters in convolutional layers learn generally small values (close to zero), so we observe similar feature maps, with many appearing to be near-copies of one another. This observation was a main driver for the CVPR 2020 paper titled "GhostNet: More Features from Cheap Operations". Even though the resulting feature maps look similar, the filters are extremely useful for learning different types of features: some specialize in horizontal and vertical edges, while others are more general and learn a particular texture in the image. Channel attention essentially provides a weight for each channel, enhancing those channels that contribute most to learning and thereby boosting overall model performance (see the sketch below).
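For contrast with the spatial sketch above, here is a minimal sketch (again assuming PyTorch, with a hypothetical weight vector in place of a learned one) of channel attention as a per-channel weight broadcast over the spatial dimensions:

import torch

features = torch.randn(1, 64, 32, 32)        # (batch, c, h, w) feature maps
channel_weights = torch.rand(1, 64, 1, 1)    # hypothetical weight in [0, 1) per channel
refined = features * channel_weights         # scales each feature map as a whole
print(refined.shape)                         # torch.Size([1, 64, 32, 32])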
Why use both? Isn't either one sufficient?

• Well, technically yes and no; the authors provide the option in their code implementation to use only Channel Attention and switch off the Spatial Attention. However, for best results it is advised to use both. In layman's terms, channel attention says which feature map is important for learning and enhances, or as the authors say, "refines" it. Meanwhile, spatial attention conveys what within the feature map is essential to learn. Combining both robustly enhances the feature maps, which accounts for the significant improvement in model performance.
Spatial Attention Module (SAM)

The Spatial Attention Module (SAM) comprises a three-fold sequential operation. The first part is called the Channel Pool, where the input tensor of dimensions (c × h × w) is decomposed to 2 channels, i.e. (2 × h × w), where the 2 channels represent Max Pooling and Average Pooling across the channels. This serves as the input to a convolution layer which outputs a 1-channel feature map, i.e. the dimension of the output is (1 × h × w). This convolution layer therefore preserves the spatial dimensions and uses padding to do so. The output is then passed to a Sigmoid activation layer. Sigmoid, being a probabilistic activation, maps all the values to a range between 0 and 1. This Spatial Attention mask is then applied to all the feature maps in the input tensor using a simple element-wise product.
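A minimal sketch of SAM as described above, assuming a PyTorch implementation; the module name and the kernel_size=7 default are choices made here for illustration (7 matches the CBAM paper's default), not the authors' exact code:

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2 -> 1 channel convolution; padding keeps the (h, w) spatial size unchanged
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                  # x: (batch, c, h, w)
        # Channel Pool: max and average across the channel dimension
        max_pool, _ = torch.max(x, dim=1, keepdim=True)    # (batch, 1, h, w)
        avg_pool = torch.mean(x, dim=1, keepdim=True)      # (batch, 1, h, w)
        pooled = torch.cat([max_pool, avg_pool], dim=1)    # (batch, 2, h, w)
        mask = self.sigmoid(self.conv(pooled))             # (batch, 1, h, w), values in (0, 1)
        return x * mask                                    # element-wise product, broadcast over channels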
Channel Attention Module (CAM)

Squeeze-and-Excitation Networks

At first glance, CAM resembles the Squeeze-and-Excitation (SE) layer. Squeeze-and-Excitation was first proposed in the CVPR 2018 paper titled "Squeeze-and-Excitation Networks" (later extended in TPAMI).
Let's do a quick review of the Squeeze-and-Excitation module. It has the following components: Global Average Pooling (GAP), a Multi-Layer Perceptron (MLP) bottleneck whose width is set by the reduction ratio (r), and a sigmoid activation. The input to the SE block is essentially a tensor of dimension (c × h × w). Global Average Pooling is an Average Pooling operation where each feature map is reduced to a single pixel, so each channel is decomposed to a (1 × 1) spatial dimension. The output of the GAP is thus a 1-D vector of length c, which can be represented as (c × 1 × 1). This vector is then passed as the input to the MLP network, which has a bottleneck whose width, or number of neurons, is decided by the reduction ratio (r). The higher the reduction ratio, the fewer the neurons in the bottleneck, and vice versa. The output vector from this MLP is then passed to a sigmoid activation layer, which maps the values in the vector to the range between 0 and 1.
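A minimal sketch of the SE block as reviewed above, assuming PyTorch; the class name and the reduction=16 default are illustrative assumptions (16 is the value commonly used in the SE paper):

import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                   # (c, h, w) -> (c, 1, 1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),      # bottleneck of width c / r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                    # per-channel weights in (0, 1)
        )

    def forward(self, x):                                    # x: (batch, c, h, w)
        b, c, _, _ = x.shape
        weights = self.mlp(self.gap(x).view(b, c))           # (batch, c) descriptor -> weights
        return x * weights.view(b, c, 1, 1)                  # re-weight each channel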
Channel Attention Module (CAM)

The Channel Attention Module (CAM) is quite similar to the Squeeze-and-Excitation layer, with a small modification. Instead of reducing the feature maps to a single pixel with Global Average Pooling (GAP) alone, it decomposes the input tensor into two vectors of dimensionality (c × 1 × 1): one generated by GAP and the other by Global Max Pooling (GMP). Average pooling is mainly used for aggregating spatial information, whereas max pooling preserves much richer contextual information in the form of the edges of the object within the image, which leads to finer channel attention. Simply put, average pooling has a smoothing effect, while max pooling has a much sharper effect and preserves the natural edges of objects more precisely. The authors validate this in their experiments, showing that using both Global Average Pooling and Global Max Pooling gives better results than using just GAP, as in the case of Squeeze-and-Excitation Networks.
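A minimal sketch of CAM along the same lines, again assuming PyTorch; following the CBAM paper, both the GAP and GMP descriptors pass through a shared MLP and are summed before the sigmoid (the class name and the reduction=16 default are illustrative assumptions):

import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                   # Global Average Pooling
        self.gmp = nn.AdaptiveMaxPool2d(1)                   # Global Max Pooling
        self.mlp = nn.Sequential(                            # shared bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                    # x: (batch, c, h, w)
        b, c, _, _ = x.shape
        avg_desc = self.mlp(self.gap(x).view(b, c))          # (batch, c) from GAP
        max_desc = self.mlp(self.gmp(x).view(b, c))          # (batch, c) from GMP
        weights = self.sigmoid(avg_desc + max_desc)          # combine both descriptors
        return x * weights.view(b, c, 1, 1)                  # re-weight each channel

In CBAM, the output of this channel attention step is then passed through the spatial attention sketch shown earlier, matching the CAM-then-SAM order described at the start of this section.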
