
Overfitting

Underfitting
Weighted average of two models
Swin Transformer

The Swin Transformer introduced two key concepts to address the issues faced by the
original ViT — hierarchical feature maps and shifted window attention. In fact, the
name of Swin Transformer comes from “Shifted window Transformer”.

The ‘Patch Merging’ block and the ‘Swin Transformer Block’ are the two key building
blocks in Swin Transformer.

The first significant deviation from ViT is that Swin Transformer builds
‘hierarchical feature maps’.

The spatial resolutions of these hierarchical feature maps are identical to those in ResNet. This was done intentionally, so that the Swin Transformer can conveniently replace the ResNet backbone networks in existing methods for vision tasks.

Hierarchical feature maps allow the Swin Transformer to be applied in areas where
fine-grained prediction is required, such as in semantic segmentation.
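
As a quick illustration of this hierarchy, the short Python sketch below assumes a 224x224 input, a 4x4 patch embedding, and three patch-merging stages (the Swin-T schedule) and prints the feature map size at each stage; the resulting strides of 4, 8, 16 and 32 match the stage outputs of a ResNet.

img_size, patch_size = 224, 4

res = img_size // patch_size                      # stage 1: 56x56 after 4x4 patch embedding
for stage in range(1, 5):
    stride = img_size // res
    print(f"stage {stage}: {res}x{res} feature map (stride {stride})")
    res //= 2                                     # each patch-merging step halves H and W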

Hierarchical feature maps are built by progressively merging patches and downsampling the spatial resolution of the feature maps. In convolutional neural networks such as ResNet, this downsampling is done using strided convolutions.

The convolution-free downsampling technique used in Swin Transformer is known as patch merging.

The ‘patch’ refers to the smallest unit in a feature map. In other words, in a 14x14 feature map, there are 14x14=196 patches.
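
Below is a minimal PyTorch-style sketch of patch merging as described above (illustrative, not the official implementation): each group of 2x2 neighbouring patches is concatenated along the channel dimension and then linearly projected, halving the spatial resolution while doubling the channel count.

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                          # x: (B, H, W, C), H and W even
        x0 = x[:, 0::2, 0::2, :]                   # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                   # bottom-left
        x2 = x[:, 0::2, 1::2, :]                   # top-right
        x3 = x[:, 1::2, 1::2, :]                   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

For example, a 56x56 feature map with 96 channels becomes a 28x28 feature map with 192 channels: PatchMerging(96)(torch.randn(1, 56, 56, 96)) has shape (1, 28, 28, 192).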

The Swin Transformer block consists of two sub-units. Each sub-unit consists of a normalization layer, followed by an attention module, followed by another normalization layer and an MLP layer. The first sub-unit uses a Window MSA (W-MSA) module, while the second sub-unit uses a Shifted Window MSA (SW-MSA) module.
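
The sketch below (again PyTorch-style and purely illustrative) mirrors this structure: each sub-unit applies a normalization layer, self-attention within non-overlapping windows, another normalization layer and an MLP, with residual connections around both halves. The second sub-unit cyclically shifts the feature map before attention to realize SW-MSA. The relative position bias and the attention mask that the paper uses to handle shifted-window boundaries are omitted for brevity, and plain nn.MultiheadAttention stands in for the paper's window attention module.

import torch
import torch.nn as nn

class SwinSubUnit(nn.Module):
    # One sub-unit: LayerNorm -> (shifted) window MSA -> residual,
    # then LayerNorm -> MLP -> residual.
    def __init__(self, dim, num_heads, window_size=7, shift_size=0, mlp_ratio=4):
        super().__init__()
        self.window_size = window_size
        self.shift_size = shift_size               # 0 -> W-MSA, window_size // 2 -> SW-MSA
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x):                          # x: (B, H, W, C), H and W divisible by window_size
        B, H, W, C = x.shape
        ws = self.window_size
        shortcut = x
        x = self.norm1(x)
        if self.shift_size > 0:                    # cyclic shift for SW-MSA
            x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        # partition into non-overlapping ws x ws windows: (num_windows * B, ws*ws, C)
        x = x.reshape(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)                  # self-attention within each window
        # merge the windows back into a (B, H, W, C) feature map
        x = x.reshape(B, H // ws, W // ws, ws, ws, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if self.shift_size > 0:                    # reverse the cyclic shift
            x = torch.roll(x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        x = shortcut + x                           # first residual connection
        return x + self.mlp(self.norm2(x))         # second residual connection

class SwinBlock(nn.Module):
    # W-MSA sub-unit followed by SW-MSA sub-unit.
    def __init__(self, dim, num_heads, window_size=7):
        super().__init__()
        self.wmsa = SwinSubUnit(dim, num_heads, window_size, shift_size=0)
        self.swmsa = SwinSubUnit(dim, num_heads, window_size, shift_size=window_size // 2)

    def forward(self, x):
        return self.swmsa(self.wmsa(x))

For a stage-1 feature map, SwinBlock(dim=96, num_heads=3)(torch.randn(1, 56, 56, 96)) returns a tensor of the same shape.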
