Slides To Read


Slides

Single image super resolution is a task in computer vision that aims to enhance the
resolution and quality of a given low-resolution image. The goal is to generate a
high-resolution image with finer details and improved visual clarity based on the
available information in the low-resolution image.
This task is particularly useful in various applications, such as enhancing the quality of low-resolution
images captured in surveillance systems, improving the quality of compressed images, and enhancing
the resolution of images in medical imaging. Single image super resolution methods typically employ
techniques such as interpolation, image processing filters, and deep learning-based approaches to
reconstruct high-resolution images from their low-resolution counterparts.

We will introduce a new type of network proposed by the authors, called the Hybrid Attention Transformer (HAT).

Slide 2-5 Intro to SR

This synopsis addresses the problem of single image super-resolution (SR) in
computer vision and image processing. Deep learning methods based on convolutional neural
networks (CNNs) have dominated this field, but recently, the Transformer model, popular in natural
language processing, has gained attention. Transformer-based methods have shown promising results
in high-level vision tasks and SR, particularly a network called SwinIR. However, it is still unclear
why Transformers perform better than CNNs.

To investigate this, the authors employ the LAM attribution analysis method to examine the range of
utilized information in SwinIR. Surprisingly, they find that SwinIR does not exploit more input pixels
than CNN-based methods such as RCAN, suggesting that the range of utilized information needs to be
expanded. Additionally, they observe blocking artifacts in the intermediate features of SwinIR,
indicating that the shift window mechanism fails to achieve effective cross-window information
interaction.

To address these limitations and further leverage the potential of Transformers for SR, the authors
propose a Hybrid Attention Transformer (HAT). HAT combines channel attention and self-attention
methods to utilize global information and enhance representation ability, respectively. They also
introduce an overlapping cross-attention module to enable direct interaction between adjacent window
features. With these designs, HAT activates more pixels for reconstruction and achieves significant
performance improvement.

Furthermore, since Transformers lack the inductive biases of CNNs, the authors emphasize the importance
of large-scale data pre-training. They propose an effective same-task pre-training strategy, which
involves pre-training HAT using a large-scale dataset specifically for the SR task. Experimental results
demonstrate the superiority of this strategy.

The contributions of this work are: 1) the design of a novel Hybrid Attention Transformer (HAT) that
combines self-attention, channel attention, and overlapping cross-attention; 2) the proposal of an
effective same-task pre-training strategy to exploit the potential of SR Transformers through large-
scale data pre-training; and 3) the achievement of state-of-the-art performance; by scaling HAT up into a larger model, the authors further push the performance upper bound of the SR task.

Slide 6 Related Work


SRCNN introduced deep convolutional neural networks (CNNs) to image super-resolution
(SR) and showed better performance than traditional methods.
Later networks use more advanced convolutional modules such as residual blocks and dense blocks.
Some explore different frameworks like recursive neural networks and graph neural networks.
Adversarial learning is used for more realistic results, and attention mechanisms improve
reconstruction fidelity.
LAM analyzes which input pixels contribute the most to the final performance using the
integral gradient method. DDR reveals deep semantic representations in SR networks through
feature dimensionality reduction and visualization. FAIG aims to find specific filters for
degradations in blind SR. RDSR demonstrates how Dropout can prevent co-adaptation in SR
networks using channel saliency maps. SRGA evaluates the generalization ability of SR
methods. This work utilizes LAM to analyze and understand the behavior of SR networks.

Slides 7-8 HAT

The authors are interested in understanding why the Swin Transformer performs better than CNN-
based methods. To investigate this, they utilize LAM, an attribution method designed for SR, to
determine the contribution of input pixels to the reconstruction.

Comparing the results of LAM for SwinIR with CNN-based methods like RCAN and EDSR, the
authors find that SwinIR does not exhibit a larger range of utilized pixels, contrary to intuition.
This indicates that SwinIR has a stronger mapping ability and can achieve better performance with less
information. However, it may also result in incorrect textures due to the limited range of utilized
pixels. To overcome this limitation and activate more pixels for reconstruction, the authors propose a
network called HAT (Hybrid Attention Transformer). By incorporating self-attention and activating
pixels all over the image, HAT can restore correct and clear textures.

Additionally, the authors observe significant blocking artifacts in the intermediate features of SwinIR,
caused by the window partition mechanism. This suggests that the shifted window mechanism is
inefficient in establishing cross-window connections. Previous research has shown that enhancing the
connection among windows can improve window-based self-attention methods. Therefore, the authors
strengthen cross-window information interactions in their approach, resulting in a significant reduction
of blocking artifacts in the intermediate features obtained by HAT.

Slide 9 Overall Architecture

The architecture consists of three parts: shallow feature extraction, deep feature extraction, and image
reconstruction. This architecture design has been widely used in previous works.
Initially, a convolution layer is applied to the low-resolution (LR) input ILR to extract shallow
features, resulting in F0. The input has dimensions H×W×Cin, where Cin represents the number of
input channels, and C represents the number of intermediate feature channels.

To perform deep feature extraction, a series of residual hybrid attention groups (RHAG) and a 3 × 3
convolution layer called HConv(·) are employed. The deep features are denoted as FD with
dimensions H×W×C.

A global residual connection is added to combine the shallow features F0 and deep features FD. The
fused features are then passed through a reconstruction module to generate the high-resolution result.

Each RHAG consists of several hybrid attention blocks (HAB), an overlapping cross-attention block
(OCAB), and a 3 × 3 convolution layer with a residual connection. The OCAB enhances cross-window
information interactions.

For the reconstruction module, the pixel-shuffle method is used to up-sample the fused features. The
network parameters are optimized using L1 loss.

Overall, this architecture incorporates shallow and deep feature extraction, attention mechanisms, and
a reconstruction module to achieve image super-resolution.
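
To make this three-part design concrete, the following is a minimal PyTorch sketch of the forward path. The RHAG internals are collapsed into a simple convolutional placeholder, and the channel count, group count, and scale factor are assumed values rather than the authors' configuration.

import torch
import torch.nn as nn

class PlaceholderRHAG(nn.Module):
    """Stand-in for a residual hybrid attention group (HABs + OCAB + 3x3 conv)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x) + x  # residual connection inside each group

class HATSketch(nn.Module):
    def __init__(self, c_in=3, channels=64, n_groups=6, scale=4):
        super().__init__()
        # Shallow feature extraction: one conv lifts I_LR (H x W x C_in) to F0 (H x W x C).
        self.shallow = nn.Conv2d(c_in, channels, 3, padding=1)
        # Deep feature extraction: a stack of RHAGs followed by a 3x3 conv (HConv).
        self.groups = nn.Sequential(*[PlaceholderRHAG(channels) for _ in range(n_groups)])
        self.hconv = nn.Conv2d(channels, channels, 3, padding=1)
        # Reconstruction: pixel-shuffle up-sampling of the fused features.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, c_in, 3, padding=1),
        )

    def forward(self, lr):
        f0 = self.shallow(lr)                  # shallow features F0
        fd = self.hconv(self.groups(f0))       # deep features FD
        return self.reconstruct(f0 + fd)       # global residual connection, then reconstruction

model = HATSketch()
sr = model(torch.randn(1, 3, 48, 48))          # -> (1, 3, 192, 192) for scale 4
loss = nn.L1Loss()(sr, torch.randn_like(sr))   # L1 loss against a stand-in HR target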

Slides 10-11 HAB


As shown in Fig. 2, more pixels are activated when channel attention is utilized, as it
involves global information to calculate channel attention weights. Additionally, the authors mention
that convolutional operations have been observed to improve visual representation and optimization in
Transformer-based models in previous works.

To integrate the channel attention block (CAB) into the Swin Transformer block, it is inserted after the
first LayerNorm (LN) layer in parallel with the window-based multi-head self-attention (W-MSA)
module. This is illustrated in Fig. 4. It is worth noting that a shifted window-based self-attention (SW-
MSA) is employed at intervals in consecutive hybrid attention blocks (HABs).

To avoid conflicts in optimization and visual representation between CAB and MSA, a small constant α is multiplied to the output of CAB. The process of the HAB is computed as follows:

- XN = LN(X): Compute the intermediate feature using LayerNorm.
- XM = (S)W-MSA(XN) + αCAB(XN) + X: Combine the outputs of (S)W-MSA, CAB, and the residual input feature X.
- Y = MLP(LN(XM)) + XM: Apply a multilayer perceptron (MLP) to LN(XM) and add the residual XM to obtain the output Y.
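
The wiring above can be sketched as follows; the (S)W-MSA and CAB branches are replaced with generic stand-ins (a plain multi-head attention over the window tokens and a small bottleneck), so only the structure of the three equations is reproduced, not the authors' exact modules.

import torch
import torch.nn as nn

class HABSketch(nn.Module):
    def __init__(self, dim, alpha=0.01):
        super().__init__()
        self.alpha = alpha                      # small constant weighting the CAB branch
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Stand-in for (shifted) window-based multi-head self-attention.
        self.wmsa = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Stand-in for the channel attention block, acting on the channel dimension.
        self.cab = nn.Sequential(nn.Linear(dim, dim // 3), nn.GELU(), nn.Linear(dim // 3, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):                           # x: (batch, tokens, dim)
        xn = self.norm1(x)                          # XN = LN(X)
        attn, _ = self.wmsa(xn, xn, xn)
        xm = attn + self.alpha * self.cab(xn) + x   # XM = (S)W-MSA(XN) + alpha*CAB(XN) + X
        return self.mlp(self.norm2(xm)) + xm        # Y = MLP(LN(XM)) + XM

y = HABSketch(dim=96)(torch.randn(2, 64, 96))       # 64 tokens = one 8x8 window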

Regarding the self-attention module, the input feature has a size of H × W × C, where H and W
represent the height and width of the feature map, and C represents the number of channels. The
feature map is partitioned into HW/M^2 local windows of size M × M. Inside each window, self-
attention is calculated.

For a local window feature XW ∈ R^(M^2×C), linear mappings are applied to compute the query, key,
and value matrices (Q, K, and V). The self-attention calculation is then formulated using these
matrices. The formula used is:

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V (Equation 2)

Here, d represents the dimension of the query/key matrices, and B denotes the relative position
encoding, as calculated in reference [53]. The softmax function normalizes the scaled products of the query and key matrices, so that each position in the window receives attention weights according to the relevance of the other positions. Finally, the value matrix V is weighted by these attention scores to obtain the self-attention output.
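
A minimal sketch of the window partition and of Equation 2 follows. The single-head projection matrices and the all-zero bias tensor are illustrative placeholders; the actual module uses learned multi-head projections and a learned relative position bias table.

import torch
import torch.nn.functional as F

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into (B * HW/M^2, M*M, C) window tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def window_attention(xw, wq, wk, wv, bias):
    """Self-attention inside each window; xw: (num_windows, M*M, C)."""
    q, k, v = xw @ wq, xw @ wk, xw @ wv                                   # linear mappings for Q, K, V
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5 + bias, dim=-1)   # Equation 2
    return attn @ v

M, C = 8, 32
feat = torch.randn(1, 32, 32, C)
xw = window_partition(feat, M)                     # (16, 64, 32): HW/M^2 = 16 windows
wq, wk, wv = (torch.randn(C, C) for _ in range(3))
bias = torch.zeros(M * M, M * M)                   # stands in for the relative position bias B
out = window_attention(xw, wq, wk, wv, bias)       # (16, 64, 32)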

The authors mention that they use a large window size to compute self-attention. This choice
significantly enlarges the range of used pixels, as discussed in Section 4.2 of the paper. Additionally, in
order to establish connections between neighboring non-overlapping windows, they employ a shifted
window partitioning approach. Specifically, the shift size is set to half of the window size, so that pixels assigned to different windows in one block are grouped into the same window in the next, allowing information exchange between neighboring windows.

( CAB info below)

A CAB consists of two standard convolution layers with a GELU (Gaussian Error Linear Unit) activation
function and a channel attention (CA) module. Since token embedding in the Transformer structure
often requires a large number of channels, the convolution layers compress and expand the channel
numbers by a constant β. The CA module adaptively rescales the channel-wise features.

By incorporating the channel attention-based convolution block and utilizing shifted window-based
self-attention, the network aims to improve the representation ability of the Swin Transformer and
enhance the range of utilized pixels for better image super-resolution.
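
A minimal sketch of such a block is shown below, assuming a squeeze-and-excitation style channel attention module and a compression factor β = 3; both are illustrative assumptions rather than the paper's exact settings.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.scale = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global pooling gathers global information
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.scale(x)                            # adaptively rescale channel-wise features

class CABSketch(nn.Module):
    def __init__(self, channels, beta=3):
        super().__init__()
        squeezed = channels // beta                         # compress channels by the constant beta
        self.body = nn.Sequential(
            nn.Conv2d(channels, squeezed, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(squeezed, channels, 3, padding=1),    # expand back to the original width
            ChannelAttention(channels),
        )

    def forward(self, x):
        return self.body(x)

out = CABSketch(96)(torch.randn(1, 96, 24, 24))
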
(SW-MSA, shifted window-based multi-head self-attention, info below)

In self-attention, the input features are divided into local windows, and attention is calculated within
each window to capture relationships between different positions. However, in standard self-attention,
the windows are typically non-overlapping. This means that information interaction is limited to within
each individual window, without considering connections between neighboring windows.

To address this limitation, SW-MSA introduces a shifted window partitioning approach. This approach
partitions the input features into windows of the same size but shifts the partition by half of the window size in alternating blocks. Pixels that fell into different windows under the previous partition therefore share a window under the shifted partition. In this way, connections and interactions can be established between neighboring windows, allowing for more comprehensive information exchange.
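
One common way to realize a shifted partition of this kind (as in Swin) is to cyclically roll the feature map by half the window size before applying the regular non-overlapping partition; pixels that sat in different windows are then regrouped together. The sketch below illustrates only this regrouping; the attention masking normally required at the rolled borders is omitted for brevity.

import torch

def shifted_windows(x, M):
    """x: (B, H, W, C). Roll by M // 2, then apply the standard M x M partition."""
    shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))
    B, H, W, C = shifted.shape
    shifted = shifted.view(B, H // M, M, W // M, M, C)
    return shifted.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

windows = shifted_windows(torch.randn(1, 32, 32, 16), M=8)   # (16, 64, 16)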

Slides 12-13

Authors introduce OCAB (Overlapping Cross-Attention Block) to enhance the representational ability
of the window self-attention module. OCAB consists of an overlapping cross-attention (OCA) layer
and an MLP layer, similar to the standard Swin Transformer block.

For the OCA layer, the authors use a different window partitioning approach, as shown in Figure 5.
They use different window sizes to partition the projected features. Specifically, the query feature XQ
is partitioned into HW/M^2 non-overlapping windows of size M × M, while the key feature XK and
value feature XV are unfolded to HW/M^2 overlapping windows of size Mo × Mo.

The authors introduce a constant γ, which controls the overlapping size of the windows. The value of
Mo, the size of the overlapping window, is calculated using the formula:

Mo = (1 + γ) × M (Equation 3)

(Difference between standard and overlapping window partition)

When we talk about window partitioning, think of dividing the input feature map into smaller regions.

1. Standard Window Partition:

- Imagine sliding a window of size M × M across the feature map, where M is the window size.
- The window moves by M units each time (stride), treating each window as a separate region.
- This creates non-overlapping windows with no shared pixels.
- Attention is calculated separately within each window.
2. Overlapping Window Partition:

- Now, imagine sliding a window with a larger size Mo × Mo across the feature map.
- Mo is calculated using the equation Mo = (1 + γ) × M, where γ is a constant determining the overlap size.
- The window moves by M units each time (stride), meaning it overlaps with the previous window.
- To keep the number of windows consistent with the standard partition, zero-padding is added around the edges of the feature map.
- Attention is calculated within each window, considering both overlapping and non-overlapping regions.
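
A minimal sketch of how such overlapping key/value windows can be extracted with an unfold operation is shown below; the overlap ratio γ = 0.5 and the tensor layout are illustrative assumptions.

import torch
import torch.nn.functional as F

def overlapping_windows(x, M, gamma=0.5):
    """x: (B, C, H, W) -> (B * HW/M^2, Mo*Mo, C) overlapping key/value windows."""
    B, C, H, W = x.shape
    Mo = int((1 + gamma) * M)                    # Equation 3
    pad = (Mo - M) // 2                          # zero-padding keeps the window count at HW/M^2
    patches = F.unfold(x, kernel_size=Mo, stride=M, padding=pad)   # (B, C*Mo*Mo, HW/M^2)
    patches = patches.view(B, C, Mo * Mo, -1)
    return patches.permute(0, 3, 2, 1).reshape(-1, Mo * Mo, C)

M, C = 8, 32
feat = torch.randn(1, C, 32, 32)
kv_windows = overlapping_windows(feat, M)        # (16, 144, 32): Mo = 12, 16 windows
# Query windows come from the standard non-overlapping M x M partition of the same feature,
# and cross-attention (Equation 2) is then computed between the two.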

(Info about kernel)

Kernel refers to the size or dimensions of the window that is used to extract information from the input
feature map.
In both the standard window partition and the overlapping window partition, the kernel size determines the size of the window that moves across the feature map, i.e., the height and width of the window. The kernel size is typically written as M × M, where M is the window size.
The purpose of the kernel is to define the receptive field of the window. It determines the spatial extent
of the input that the window will consider at each position. By moving the window across the feature
map, the kernel allows the model to extract localized information from different regions.
In the case of the overlapping window partition, the kernel size is Mo x Mo, where Mo is larger than M
to incorporate overlapping regions. This allows the window to capture both overlapping and non-
overlapping regions, enhancing the representation and capturing more contextual information.
Overall, the kernel size defines the size and shape of the window used for extracting information from
the input feature map during the sliding window partitioning process.
(stop)
In both cases, attention is computed using Equation 2 and a relative position bias B is included.
The key difference is that in the standard window partition, the query, key, and value are calculated
from the same window feature. However, in the overlapping window partition, the keys and values are
derived from a larger field, incorporating more useful information for the query within each window.

The key point of OCA is that it computes key and value from a larger field compared to the query. This
means that it utilizes more useful information for the query within each window. In contrast, the
standard window self-attention (WSA) calculates the query, key, and value from the same window feature.
The authors clarify that although the Multi-resolution Overlapped Attention (MOA) module performs a
similar overlapping window partition, OCAB is fundamentally different from MOA. This distinction
arises because MOA calculates global attention using window features as tokens, whereas OCA
computes cross-attention inside each window feature using pixel tokens.

Overall, OCAB introduces overlapping cross-attention by using different window sizes to partition the
projected features. This approach enhances the representative ability of the window self-attention
module by utilizing more information for the query and establishing cross-window connections within
each window.

(Additional info)

1. Let's break it down: HW/M^2

HW/M^2 represents the number of non-overlapping windows obtained by partitioning the input feature map:

- H represents the height dimension of the feature map.
- W represents the width dimension of the feature map.
- M represents the size of the local windows used for partitioning.

By dividing the total number of pixels in the feature map (H * W) by the number of pixels in each local
window (M * M), we get the number of non-overlapping windows created: HW/M^2.

This value indicates the number of separate regions or patches into which the feature map is divided for
subsequent processing. Each non-overlapping window is then used to calculate self-attention within
itself, allowing the model to capture local dependencies and interactions within these windows.
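
As a quick worked example of this count (with arbitrarily chosen sizes):

H, W, M = 64, 64, 16
num_windows = (H * W) // (M * M)   # (64 * 64) / (16 * 16) = 16 non-overlapping windows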

2. To break it down: XQ, XK, XV ∈ R^(H×W×C)

- H represents the height of the feature map.
- W represents the width of the feature map.
- C represents the number of channels in the feature map.

(stop)
The authors suggested that the success of pre-training is largely influenced by the scale and variety of
data used. For instance, to train a model for ×4 SR, they first pre-train a ×4 SR model on ImageNet,
and then fine-tune it on a specific dataset, such as DF2K. This approach, called same-task pre-training,
is simpler yet yields greater performance improvement. It is important to note that for the pre-training
strategy to be effective, it requires sufficient training iterations and an appropriate small learning rate
for fine-tuning. This is because the Transformer architecture requires ample data and training iterations
to learn general knowledge pertaining to the task, but needs a small learning rate during fine-tuning to
prevent overfitting to the specific dataset.
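
To make the two-stage schedule concrete, here is a minimal sketch. The datasets follow the text, but the learning rates and iteration counts are placeholder values, not the authors' settings.

stages = [
    {"name": "pre-train", "dataset": "ImageNet (x4 SR pairs)", "lr": 1e-4, "iterations": 500_000},
    {"name": "fine-tune", "dataset": "DF2K", "lr": 1e-5, "iterations": 100_000},
]

for stage in stages:
    # In a real run: build the dataloader for stage["dataset"], set the optimizer's learning
    # rate to stage["lr"], and minimize the L1 loss for stage["iterations"] steps.
    print(f"{stage['name']}: {stage['dataset']}, lr={stage['lr']}, iters={stage['iterations']}")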
