VIETNAM GENERAL CONFEDERATION OF LABOR

TON DUC THANG UNIVERSITY


FACULTY OF INFORMATION TECHNOLOGY

FINAL REPORT

Introduction to Computer Vision

Supervisor: PHAM VAN HUY


Participants: LIEU DANG KHOA - 520K0140
VI NGUYEN THANH DAT - 520C0001
Class: 20K50301
Group: 02

Ho Chi Minh City, 2023


ACKNOWLEDGEMENTS

We would like to express our sincere gratitude to Prof. Pham Van Huy for
giving us the opportunity to work on the report assignment. The task has
not only enhanced our research and writing skills but also helped us to
gain a deeper understanding of the subject matter.

We would also like to thank our friends for their help and guidance in the
completion of the project. Once again, thank you for your time, patience,
and expertise. We look forward to applying the knowledge and skills
gained from this assignment to future projects.
THIS PROJECT WAS COMPLETED
AT TON DUC THANG UNIVERSITY
We hereby declare that this is our own project, carried out under the guidance of Pham Van
Huy. The research contents and results in this topic are honest and have not been published in any
publication before. The data in the tables used for analysis, comments and evaluation were collected by
the authors themselves from different sources, as clearly stated in the reference section.
In addition, the project also uses a number of comments, assessments as well as data of
other authors, agencies and organizations, with citations and source annotations.
If any fraud is found, we will take full responsibility for the content of our project. Ton Duc
Thang University is not responsible for any copyright infringements caused by us (if any).

Ho Chi Minh, ...........................

Authors

(Sign)

Lieu Dang Khoa

Vi Nguyen Thanh Dat


Abstract
Transformer-based techniques have demonstrated impressive performance in low-
level vision tasks such as image super-resolution. Despite this, attribution analysis
indicates that these networks can only take advantage of a restricted spatial range of
input data. This suggests that the potential of Transformers is not fully utilized in
current networks. To enhance the reconstruction of images by activating more input
pixels, the authors have developed an innovative solution, the Hybrid Attention
Transformer (HAT). This method combines both channel attention and window-based
self-attention approaches, taking advantage of their complementary strengths in
utilizing global statistics and strong local fitting capabilities. Furthermore, to better
combine cross-window information, they have introduced an overlapping cross-
attention module that enhances the interaction between adjacent window features. In
the training phase, they employ a same-task pre-training approach to exploit the
model's potential for further improvement.

1. Introduction
Image Super-Resolution (SR) remains a classic problem in computer vision and image
processing. It aims to reconstruct a high-resolution image from a low-resolution input.
Numerous CNN-based methods have been applied successfully to the SR task.
Recently, with the emergence of Transformers, Transformer-based models have made
significant progress in many high-level vision tasks as well as low-level vision tasks,
including SR. A Transformer-based model called SwinIR has obtained a breakthrough
improvement in this task.

Figure 1: Performance comparison in PSNR (dB) of the proposed HAT with the state-of-the-art
methods SwinIR and EDT. HAT-L denotes a larger variant of HAT. The approach surpasses the
state-of-the-art methods by 0.3dB∼1.2dB.
By using the attribution analysis method LAM to examine the range of information
employed by SwinIR for reconstruction, the authors discovered that SwinIR
does not utilize more input pixels than CNN-based methods such as RCAN for
super-resolution, despite exhibiting higher overall quantitative performance,
as illustrated in Fig. 2. However, SwinIR still produces inferior results to RCAN for
certain samples due to the limited range of employed information. These findings
suggest that the Transformer has a stronger ability to model local information but
needs to expand the range of utilized information.

Figure 2: The LAM attribution shows how significant each pixel of the input LR (low-resolution) image
is when reconstructing the patch marked with a box. The Diffusion Index (DI) measures the extent to
which other pixels are involved; a higher DI means that a wider range of pixels is used. The results show
that, compared to RCAN, SwinIR uses less information, while HAT utilizes the largest number of pixels
for reconstruction.

In order to overcome the limitations mentioned above and to explore the potential of
the Transformer in the field of SR, the authors introduced a novel Hybrid Attention
Transformer called HAT, which combines both channel attention and self-attention
methods to leverage the global information gathering ability of the former and the
powerful representational capability of the latter. They also introduced an
overlapping cross-attention module to encourage more direct interaction between
adjacent window features. These advancements enable the model to activate more
pixels for reconstruction, resulting in a significant performance improvement.

In this work, they also provide an effective same-task pre-training strategy. Different from
previous pre-training strategies, they directly perform pre-training on a large-scale dataset
using the same task. They believe that large-scale data is what really matters for pre-training,
and the experimental results also show the superiority of their strategy.
2. Methodology
2.1. Motivation

In order to understand what makes Transformer-based models perform better than CNN-
based models in the SR task, the authors used LAM to identify which input pixels contribute
most to the reconstruction of the selected region. Fig. 2 shows that the red-marked pixels are
those that contribute most to the reconstruction. The more information is utilized, the better
the performance.
In contrast to what the authors previously believed, the LAM range of the
Transformer-based model SwinIR was not larger than the LAM range of the other
two CNN-based models, EDSR and RCAN (as shown in Fig. 2). However, this does show
that 1) despite using much less information, SwinIR was able to achieve better
performance in the SR task, and 2) because it covers a smaller LAM range, incorrect texture
reconstruction may occur, which could be improved if the LAM coverage were somehow
increased.
They aim to design a network that can take advantage of a similar self-attention mechanism while
activating more pixels for reconstruction. As depicted in Fig. 2, HAT can see pixels
almost all over the image and restore correct and clear textures.
Furthermore, it is noticeable that there are significant blocking artifacts present in the
intermediate features of SwinIR, as depicted in Figure 3. These artifacts arise due to
the partitioning of the window, indicating that the shifted window mechanism
inadequately establishes cross-window connections. Previous works on high-level
visual tasks have also emphasized the need to enhance the connection among
windows to improve window-based self-attention methods. To address this issue, they
have intensified cross-window information interactions in the design of their
approach, resulting in the substantial mitigation of the blocking artifacts observed in
the intermediate features produced by HAT.

2.2 Network architecture


2.2.1 Overall architecture

Figure 4 illustrates that the network is composed of three main parts: shallow feature
extraction, deep feature extraction, and image reconstruction. This architectural
design is commonly used in previous studies. In short, the model takes a low-
resolution input image I_LR ∈ R^(H×W×C_in) and extracts shallow features F_0 ∈ R^(H×W×C)
using a convolution layer, where C_in and C denote the channel numbers of the input and the
intermediate feature. Then, several residual hybrid attention groups (RHAG) and a 3x3
convolution layer (HConv) are used to extract the deep features F_D ∈ R^(H×W×C). The shallow
and deep features are combined using a global residual connection, and the high-resolution
output is generated by a reconstruction module that uses the pixel-shuffle technique. Each
RHAG consists of several hybrid attention blocks (HAB), an overlapping cross-attention block
(OCAB) and a 3x3 convolution layer. The network is optimized using an L1 loss.
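To make the data flow concrete, the following is a minimal PyTorch sketch of the pipeline described above (shallow convolution, a stack of RHAGs followed by a convolution, a global residual connection, and pixel-shuffle reconstruction). The RHAG internals are replaced by placeholders, and all names and defaults are illustrative rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class HATSketch(nn.Module):
        # Sketch of the HAT pipeline: shallow feature extraction, deep feature
        # extraction (RHAGs + conv) with a global residual, and pixel-shuffle
        # reconstruction. RHAG internals are omitted.
        def __init__(self, c_in=3, c=180, num_rhag=6, scale=4):
            super().__init__()
            # Shallow feature extraction: a single 3x3 convolution
            self.conv_first = nn.Conv2d(c_in, c, 3, padding=1)
            # Deep feature extraction: a stack of RHAG placeholders
            self.rhags = nn.Sequential(*[self._rhag_placeholder(c) for _ in range(num_rhag)])
            self.conv_after_body = nn.Conv2d(c, c, 3, padding=1)
            # Reconstruction: pixel-shuffle upsampling to the target scale
            self.reconstruct = nn.Sequential(
                nn.Conv2d(c, c * scale * scale, 3, padding=1),
                nn.PixelShuffle(scale),
                nn.Conv2d(c, c_in, 3, padding=1),
            )

        @staticmethod
        def _rhag_placeholder(c):
            # Stands in for a real RHAG (several HABs, one OCAB and a 3x3 conv)
            return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.GELU())

        def forward(self, x_lr):
            f0 = self.conv_first(x_lr)                  # F_0: shallow features
            fd = self.conv_after_body(self.rhags(f0))   # F_D: deep features
            return self.reconstruct(f0 + fd)            # global residual + upsampling

In the actual network, each placeholder would contain several HABs, one OCAB and a 3x3 convolution, and the reconstructed output would be trained against the ground-truth HR image with an L1 loss.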

2.2.2 HAB architecture


The Hybrid Attention Block (HAB) is the basic building block of the deep feature
extraction stage. It combines the strengths of the self-attention mechanism and the
convolutional neural network (CNN) architecture to improve the accuracy and
efficiency of visual recognition tasks.
As depicted in Figure 2, the adoption of channel attention leads to the activation of
more pixels. This is because the calculation of channel attention weights involves
global information. Additionally, many previous studies have shown that convolution
can improve the visual representation of a Transformer model and make optimization
easier. For this reason, they have added a channel attention-based convolution block
to the standard Transformer block in order to enhance the network's representation
capabilities. As shown in Figure 4, they have inserted a channel attention block
(CAB) into the standard Swin Transformer block after the first LayerNorm (LN)
layer, alongside the window-based multi-head self-attention (W-MSA) module.
To prevent potential issues with optimization and visual representation, they use a
small constant α to scale the output of the channel attention block (CAB). The entire
process of the Hybrid Attention Block (HAB) is as follows. First, the input feature X
is passed through a layer normalization (LN) step to obtain the intermediate feature X_N.
Then, another intermediate feature X_M is computed as the sum of three components:
the output of the (shifted) window-based multi-head self-attention ((S)W-MSA) module
applied to X_N, the output of the CAB applied to X_N and scaled by α, and the input
feature X. The output of the HAB, denoted as Y, is then obtained by applying a
multilayer perceptron (MLP) to the layer-normalized X_M and adding X_M to it.
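Written out, the HAB computation just described is:

    X_N = LN(X)
    X_M = (S)W-MSA(X_N) + α · CAB(X_N) + X
    Y   = MLP(LN(X_M)) + X_M

where LN denotes layer normalization and α is the small scaling constant applied to the CAB branch.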
For the calculation of the self-attention module, given an input feature of size H × W × C,
it is first partitioned into HW/M² local windows of size M × M, and self-attention is then
calculated inside each window. For a local window feature X_W ∈ R^(M²×C), the
query, key and value matrices Q, K and V are computed by linear mappings.
The window-based self-attention is expressed as an Attention function that takes the
query (Q), key (K) and value (V), and outputs a SoftMax-weighted sum of the
values, where the SoftMax function normalizes the similarity scores between query
and key after scaling them by √d, the dimension of the query and key. The relative
position encoding is denoted by B.
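In formula form, this is the standard scaled dot-product attention with a relative position bias:

    Attention(Q, K, V) = SoftMax(Q·K^T / √d + B) · V

where d is the dimension of the query/key and B is the relative position encoding.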

They use a larger window size for computing the self-attention because it allows
them to include more pixels in the input. Additionally, to create connections between
neighboring non-overlapping windows, they use a shifted window partitioning
approach and set the shift size to be half the window size.
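The short PyTorch sketch below illustrates the regular and shifted window partitions described here; the attention masking that normally accompanies the shifted partition is omitted, and the function names are illustrative.

    import torch

    def window_partition(x, window_size):
        # Split a (B, H, W, C) feature map into non-overlapping windows of
        # shape (num_windows * B, window_size * window_size, C).
        b, h, w, c = x.shape
        x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, c)

    def shifted_window_partition(x, window_size):
        # Cyclically shift the feature map by half the window size before
        # partitioning, so the new windows straddle the previous window borders.
        shift = window_size // 2
        x_shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
        return window_partition(x_shifted, window_size)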

A channel attention block (CAB) is made up of two convolution layers with a GELU
activation between them, followed by a channel attention (CA) module, as shown in
Figure 4. Since the Transformer-based structure needs a large number of channels for
token embedding, using convolutions with a constant channel width would lead to a
significant computational cost. To address this issue, they compress the number of
channels of the input feature by a constant β before applying the two convolution
layers: the channel number of the output feature after the first convolution layer is
reduced to C/β and then restored back to C channels by the second layer.
Finally, a standard CA module is used to adaptively rescale the channel-wise
features.
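A rough PyTorch sketch of this block is given below; the squeeze-and-excitation style CA module and its reduction factor are common defaults and are assumptions here, not values given in the text.

    import torch.nn as nn

    class ChannelAttention(nn.Module):
        # Standard CA: pool spatially, then produce per-channel rescaling weights.
        def __init__(self, c, reduction=16):
            super().__init__()
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(c, c // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(c // reduction, c, 1),
                nn.Sigmoid(),
            )

        def forward(self, x):
            return x * self.attn(x)

    class CAB(nn.Module):
        # Compress channels by beta, expand back, then rescale with channel attention.
        def __init__(self, c, beta=3):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(c, c // beta, 3, padding=1),
                nn.GELU(),
                nn.Conv2d(c // beta, c, 3, padding=1),
                ChannelAttention(c),
            )

        def forward(self, x):
            return self.body(x)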

2.2.3 Overlapping Cross-Attention Block (OCAB)


They introduce the Overlapping Cross-Attention Block (OCAB), which establishes direct
connections between overlapping windows. In OCA, they partition the input feature X_Q into non-
overlapping windows of size M×M, while the corresponding features X_K and X_V are partitioned into
overlapping windows of size Mo×Mo. The value of Mo is calculated as (1 + γ) × M, where γ is a
constant that determines the degree of overlap between neighboring windows, as shown in Figure 5.

We can think of the standard window partition as a type of sliding partition where the
kernel size and stride both equal the window size (M). On the other hand, the
overlapping window partition can be considered as a sliding partition with a kernel
size equal to Mo and a stride equal to M.
Unlike WSA, whose query, key and value are calculated from the same window
feature, OCA computes the key/value from a larger field where more useful information
can be utilized for the query. MOA calculates global attention using window
features as tokens, while OCA computes cross-attention inside each window feature
using pixel tokens.
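The sketch below shows how the overlapping key/value partition could be implemented with a sliding unfold (kernel size Mo, stride M, zero-padded at the borders); the query partition and the cross-attention itself are omitted, and the function name is illustrative.

    import torch.nn.functional as F

    def overlapping_window_partition(x, m, gamma=0.5):
        # Partition a (B, C, H, W) feature map into overlapping windows of size
        # Mo = (1 + gamma) * M, sliding with stride M and zero padding at the
        # borders. Returns (B * num_windows, Mo * Mo, C) pixel tokens for key/value.
        b, c, h, w = x.shape
        mo = int((1 + gamma) * m)
        pad = (mo - m) // 2
        patches = F.unfold(x, kernel_size=mo, stride=m, padding=pad)   # (B, C*Mo*Mo, N)
        n = patches.shape[-1]
        patches = patches.view(b, c, mo * mo, n).permute(0, 3, 2, 1)   # (B, N, Mo*Mo, C)
        return patches.reshape(b * n, mo * mo, c)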

2.3. The Same-task Pre-training


The authors suggested that the success of pre-training is largely influenced by the
scale and variety of data used. For instance, to train a model for ×4 SR, they first pre-
train a ×4 SR model on ImageNet, and then fine-tune it on a specific dataset, such as
DF2K. This approach, called same-task pre-training, is simpler yet yields greater
performance improvement. It is important to note that for the pre-training strategy to
be effective, it requires sufficient training iterations and an appropriate small learning
rate for fine-tuning. This is because the Transformer architecture requires ample data
and training iterations to learn general knowledge pertaining to the task, but needs a
small learning rate during fine-tuning to prevent overfitting to the specific dataset.
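As an illustration of this two-stage recipe, the schedule below sketches what such a setup could look like; the learning rates and iteration counts are placeholders for illustration, not values reported here.

    # Hypothetical schedule for same-task x4 SR pre-training and fine-tuning.
    schedule = [
        {"stage": "pre-train", "dataset": "ImageNet", "task": "x4 SR",
         "iterations": 800_000, "learning_rate": 2e-4},
        {"stage": "fine-tune", "dataset": "DF2K", "task": "x4 SR",
         "iterations": 250_000, "learning_rate": 1e-5},  # small LR to avoid overfitting
    ]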

4. Experiments
4.1. Experimental Setup
They use the DF2K dataset, which is a combination of the DIV2K and Flickr2K
datasets, to train the model, because using only the DIV2K dataset leads to overfitting.
Following the pre-training approach proposed in previous studies, they employ the
ImageNet dataset for pre-training. The HAT architecture has the same depth and
width as SwinIR, with both the RHAG and HAB numbers set to 6, the channel number set to
180, and the attention head number and window size set to 6 and 16 for (S)W-MSA and
OCA. The hyper-parameters of the proposed modules, namely the weighting
factor (α) in HAB, the squeeze factor (β) between the two convolutions in CAB, and the
overlapping ratio (γ) in OCA, are set to 0.01, 3, and 0.5, respectively. For the larger
variant HAT-L, they double the depth of HAT by increasing the RHAG number to 12.
The authors also provide a smaller version, HAT-S, which has fewer parameters and
similar computation to SwinIR, with the channel number set to 144 and depth-wise
convolution used in CAB. Five benchmark datasets, namely Set5, Set14, BSD100,
Urban100, and Manga109, are used to evaluate the methods with quantitative metrics
such as PSNR and SSIM calculated on the Y channel.
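These hyper-parameters can be summarized as follows; the dictionary keys are illustrative.

    # HAT base configuration, as described in the experimental setup.
    hat = {
        "num_rhag": 6,        # residual hybrid attention groups
        "num_hab": 6,         # hybrid attention blocks per RHAG
        "channels": 180,
        "num_heads": 6,
        "window_size": 16,    # for (S)W-MSA and OCA
        "alpha": 0.01,        # CAB weighting factor in HAB
        "beta": 3,            # channel squeeze factor in CAB
        "gamma": 0.5,         # overlapping ratio in OCA
    }
    hat_l = {**hat, "num_rhag": 12}    # HAT-L: doubled depth
    hat_s = {**hat, "channels": 144}   # HAT-S: fewer channels, depth-wise conv in CAB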

4.3. Ablation Study


Their findings indicate that both OCAB and CAB deliver a performance gain of
0.1dB when compared to the baseline results. Moreover, the model achieves a further
performance improvement of 0.16dB by leveraging the benefits of these two modules. In
addition, they provide a qualitative comparison between the models with and without
OCAB and CAB to showcase their influence. As shown in Figure 7, the model with
OCAB has a larger scope of utilized pixels and generates better-reconstructed results.
When CAB is adopted, the used pixels expand to almost the entire image.
Furthermore, their method with OCAB and CAB obtains the highest DI, which
indicates that their method utilizes the most input pixels. Although it achieves slightly
lower PSNR than the model with only OCAB on this example, their method obtains the highest
SSIM and reconstructs the clearest textures.
4.4. Comparison with State-of-the-Art Methods
Table 6 presents a quantitative comparison of their approach with the state-of-the-art methods, including
EDSR, RCAN, SAN, IGNN, HAN, NLSN, RCAN-it, and approaches using ImageNet pre-training
such as IPT and EDT. The results indicate that their method significantly outperforms all other methods across
all benchmark datasets. Specifically, HAT outperforms SwinIR by 0.48dB∼0.64dB on Urban100 and
0.34dB∼0.45dB on Manga109. In comparison to the pre-training approaches, HAT achieves a large
performance gain of over 0.5dB against EDT on Urban100 for all three scales. Additionally, HAT with
pre-training outperforms SwinIR by a significant margin of up to 1dB on Urban100 for ×2 SR.
Furthermore, the larger model HAT-L can bring further improvement and greatly expand the
performance upper bound of this task. Even the smaller version, HAT-S, with fewer parameters and
similar computation capabilities, can significantly outperform the state-of-the-art method SwinIR. It is
worth noting that the performance differences are more pronounced in Urban100, which contains more
structured and self-repeated patterns that can offer more useful pixels for reconstruction as the range of
information utilized is expanded.
Additional info

What is SwinIR ?
SwinIR is an image super-resolution approach that uses a hierarchical transformer-
based network architecture, called Swin Transformer, to address the problem of low-
resolution image reconstruction. SwinIR has achieved state-of-the-art results in a
number of image super-resolution benchmarks and has been shown to outperform
traditional methods such as bicubic interpolation and deep learning approaches such
as SRCNN, SRResNet, and ESRGAN.

What is LAM ?
LAM (Local Attribution Map) is an attribution analysis method for super-resolution
networks. It highlights which pixels of the low-resolution input contribute to the
reconstruction of a selected patch in the output, and its associated Diffusion Index (DI)
quantifies how wide the range of utilized pixels is: the higher the DI, the more pixels
are involved in the reconstruction.

What is PSNR ?
PSNR stands for Peak Signal-to-Noise Ratio. It is a commonly used metric to evaluate
the image or video quality after compression or modification. PSNR is calculated by
comparing the original image or video with the corresponding compressed or
modified version.
The higher the value of PSNR, the better the model is. PSNR is inversely proportional
to the error between the original image and the reconstructed image, which means a
higher value of PSNR indicates a smaller error and a better reconstruction quality.
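Concretely, for two images of the same size, PSNR is computed from the mean squared error (MSE) between them:

    PSNR = 10 · log10( MAX_I² / MSE )

where MAX_I is the maximum possible pixel value (e.g. 255 for 8-bit images).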

What is SSIM ?
SSIM stands for Structural Similarity Index Measure, which is a widely used method
for measuring the similarity between two images. It attempts to quantify the perceived
similarity between two images by taking into account three individual measures:
luminance, contrast, and structure.
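For reference, the SSIM between two image patches x and y is typically computed as:

    SSIM(x, y) = ((2·μ_x·μ_y + c1) · (2·σ_xy + c2)) / ((μ_x² + μ_y² + c1) · (σ_x² + σ_y² + c2))

where μ_x and μ_y are the local means, σ_x² and σ_y² the variances, σ_xy the covariance, and c1, c2 are small constants that stabilize the division.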

What is shifted window mechanism ?


The "shifted window mechanism" is a technique that's used in certain super-resolution
models, such as SwinIR. It refers to the strategy of dividing an image into overlapping
patches or windows and then processing those patches using a self-attention
mechanism to generate super-resolved output. In the context of SwinIR, instead of
using non-overlapping patches, the shifted window mechanism divides the image into
overlapping patches that are shifted by a certain number of pixels

What are cross-window connections ?


In the context of this passage, "cross-window connections" refer to the connections
established between different windows or patches in an input image. In super-
resolution models that use a window-based approach, such as SwinIR, each patch is
processed independently using a self-attention mechanism, which allows the model to
attend to different parts of the patch to generate an output. However, in order to
generate accurate and high-quality super-resolved output, these models also need to
establish connections between different patches to ensure that the output is coherent
and consistent across the entire image. To achieve this, the model needs to allow
information to flow between different patches or windows, which is what is meant by
"cross-window connections". The passage suggests that SwinIR's shifted window
mechanism may be inefficient at establishing these connections, which can cause
blocking artifacts in the output.
