Spatial Attention and Channel Attention
CBAM contains two sequential sub-modules, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which are applied in that particular order.
So, what's meant by Channel Attention?
• As discussed above, channels are essentially the feature maps stacked in a tensor, where each cross-sectional slice is a feature map of dimension (h × w). In convolutional layers, the trainable weights making up the filters generally learn small values (close to zero), so we observe similar feature maps, with many appearing to be near-copies of one another. This observation was a main driver for the CVPR 2020 paper titled "GhostNet: More Features from Cheap Operations". Even though the feature maps look similar, the filters behind them are extremely useful for learning different types of features: some are specific to horizontal and vertical edges, while others are more general and learn a particular texture in the image. Channel attention essentially provides a weight for each channel, enhancing those channels that contribute most to learning, and thus boosts overall model performance.
Why use both, isn't either one sufficient?
• Well, technically yes and no; the authors, in their code implementation, provide the option to use only Channel Attention and switch off Spatial Attention. However, for best results they advise using both. In layman's terms, channel attention says which feature map is important for learning and enhances, or as the authors say, "refines", it, while spatial attention conveys what within the feature map is essential to learn. Combining both robustly enhances the feature maps and thus accounts for the significant improvement in model performance.
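To make the division of labour concrete, here is a minimal NumPy sketch of the sequential refinement. The two masks are toy placeholders standing in for the outputs of CAM and SAM, not the authors' implementation:

```python
import numpy as np

def cbam_refine(x, channel_mask, spatial_mask):
    """Apply channel attention first, then spatial attention (CBAM order).

    x:            (c, h, w) feature tensor
    channel_mask: (c,)      per-channel weights in (0, 1) -- "which" maps matter
    spatial_mask: (h, w)    per-location weights in (0, 1) -- "where" matters
    """
    x = x * channel_mask[:, None, None]  # refine the important feature maps
    return x * spatial_mask              # then highlight locations within them
```

With `channel_mask = [1.0, 0.5, 0.0]`, the first feature map passes through untouched, the second is damped, and the third is suppressed entirely before the spatial mask is applied.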
Spatial Attention Module (SAM)
The Spatial Attention Module (SAM) comprises a three-fold sequential operation. The first part is the Channel Pool, where the input tensor of dimension (c × h × w) is decomposed into 2 channels, i.e. (2 × h × w), the two channels representing Max Pooling and Average Pooling across the channels. This serves as the input to a convolution layer that outputs a 1-channel feature map, i.e. an output of dimension (1 × h × w); the convolution is thus spatial-dimension preserving and uses padding to achieve this. The output is then passed through a Sigmoid activation layer. Sigmoid, being a probabilistic activation, maps all the values to the range between 0 and 1. This spatial attention mask is then applied to all the feature maps in the input tensor using a simple element-wise product.
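The three steps above can be sketched in NumPy. This is a toy illustration of the arithmetic, not the authors' PyTorch code; the kernel here is an assumed random/zero placeholder rather than a learned weight:

```python
import numpy as np

def channel_pool(x):
    # step 1: (c, h, w) -> (2, h, w), max and average across the channel axis
    return np.stack([x.max(axis=0), x.mean(axis=0)])

def conv2d_same(x, kernel):
    # step 2: naive spatial-dimension-preserving convolution (cross-correlation,
    # as deep-learning frameworks compute it); x: (2, h, w), kernel: (2, k, k)
    k = kernel.shape[-1]
    p = k // 2                                    # "same" padding
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, w = x.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (xp[:, i:i + k, j:j + k] * kernel).sum()
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(x, kernel):
    pooled = channel_pool(x)                      # (2, h, w)
    mask = sigmoid(conv2d_same(pooled, kernel))   # step 3: (h, w) mask in (0, 1)
    return x * mask                               # broadcast over all channels
```

Note that the output has the same (c × h × w) shape as the input: the single-channel mask is broadcast across every feature map by the element-wise product.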
Channel Attention Module (CAM)
At first glance, CAM resembles the Squeeze-and-Excitation (SE) layer. Squeeze-and-Excitation was first proposed in the CVPR 2018 paper (later extended in TPAMI) titled "Squeeze-and-Excitation Networks".
Let's do a quick review of the Squeeze-and-Excitation module. It has the following components: Global Average Pooling (GAP) and a Multi-Layer Perceptron (MLP) whose bottleneck is set by a reduction ratio (r), followed by a sigmoid activation. The input to the SE block is a tensor of dimension (c × h × w). Global Average Pooling reduces each feature map to a single pixel, so each channel is decomposed to a (1 × 1) spatial dimension; the output of the GAP is therefore a 1-D vector of length c, which can be represented as (c × 1 × 1). This vector is passed as input to the MLP, whose bottleneck width (number of neurons) is decided by the reduction ratio (r): the higher the reduction ratio, the fewer neurons in the bottleneck, and vice versa. The output vector from the MLP is then passed through a sigmoid activation layer, which maps its values to the range between 0 and 1.
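As a quick numerical sketch of the SE pipeline in NumPy (the weight matrices are illustrative placeholders, not trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    # x: (c, h, w); w1: (c//r, c) and w2: (c, c//r) form the bottleneck MLP
    s = x.mean(axis=(1, 2))          # squeeze: GAP gives a length-c vector
    z = np.maximum(w1 @ s, 0.0)      # bottleneck of width c // r, with ReLU
    a = sigmoid(w2 @ z)              # per-channel weights in (0, 1)
    return x * a[:, None, None]      # excite: rescale each channel
```

For c = 8 and r = 4, `w1` has shape (2, 8) and `w2` has shape (8, 2), so the bottleneck holds 2 neurons.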
The Channel Attention Module (CAM) is quite similar to the Squeeze-and-Excitation layer, with a small modification. Instead of reducing the feature maps to a single pixel by Global Average Pooling (GAP) alone, it decomposes the input tensor into two vectors of dimension (c × 1 × 1): one generated by GAP and the other by Global Max Pooling (GMP). Average pooling mainly aggregates spatial information, whereas max pooling preserves richer contextual information, in the form of the edges of objects within the image, which leads to finer channel attention. Simply put, average pooling has a smoothing effect, while max pooling has a much sharper effect and preserves the natural edges of objects more precisely. The authors validate this in their experiments, showing that using both Global Average Pooling and Global Max Pooling gives better results than using GAP alone, as in Squeeze-and-Excitation Networks.
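The GAP/GMP modification can be sketched as follows. This NumPy illustration uses a single MLP shared between the two pooled descriptors, with the two branch outputs summed before the sigmoid, as described in the CBAM paper; the weights are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    # x: (c, h, w); w1: (c//r, c) and w2: (c, c//r) form the *shared* MLP
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    avg = x.mean(axis=(1, 2))        # GAP descriptor, shape (c,)
    mx = x.max(axis=(1, 2))          # GMP descriptor, shape (c,)
    a = sigmoid(mlp(avg) + mlp(mx))  # sum the two branches, then sigmoid
    return x * a[:, None, None]      # rescale each channel of the input
```

Compared with the SE sketch above, the only changes are the extra GMP descriptor and the summation of the two MLP outputs before the sigmoid.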