FINAL REPORT
We would like to express our sincere gratitude to Prof. Pham Van Huy for
giving us the opportunity to work on this report assignment. The task has
not only enhanced our research and writing skills but also helped us gain a
deeper understanding of the subject matter.
We would also like to thank our friends for their help and guidance in the
completion of the project. Once again, thank you for your time, patience,
and expertise. We look forward to applying the knowledge and skills
gained from this assignment to future projects.
PROJECT COMPLETED
AT TON DUC THANG UNIVERSITY
We hereby declare that this is our own project, carried out under the guidance of Prof. Pham Van
Huy. The research contents and results in this topic are honest and have not been published in any
publication before. The data in the tables used for analysis, comments, and evaluation were collected
by the authors themselves from different sources, as clearly stated in the reference section.
In addition, the project also uses a number of comments, assessments, and data from
other authors, agencies, and organizations, with citations and source annotations.
If any fraud is found, we will take full responsibility for the content of our project. Ton Duc
Thang University is not liable for any copyright violations caused by us (if any).
Author
(Sign)
1. Introduction
Image Super-Resolution (SR) remains a classic problem in image processing and
computer vision. It aims to reconstruct a high-resolution (HR) image from a
low-resolution (LR) input. Numerous CNN-based methods have been applied
successfully to the SR task. Recently, with the emergence of Transformers,
Transformer-based models have made significant progress in many high-level as
well as low-level vision tasks, including SR. A Transformer-based model called
SwinIR has achieved a breakthrough improvement on this task.
The LAM attribution shows how significant each pixel of the input LR (low-resolution)
image is when reconstructing the patch marked with a box. The Diffusion Index (DI)
measures the extent to which other pixels are involved: a higher DI means that a
wider range of pixels is used. The findings reveal that, compared to RCAN,
SwinIR uses less information, while HAT utilizes the largest range of pixels for
reconstruction.
In order to overcome the limitations mentioned above and to explore the potential of
Transformers in the field of SR, the authors introduced a novel Hybrid Attention
Transformer called HAT, which combines channel attention and self-attention
to leverage the global information-gathering ability of the former and the
powerful representational capability of the latter. They also introduced an
overlapping cross-attention module to encourage more direct interaction between
adjacent window features. These advancements enable the model to activate more
pixels for reconstruction, resulting in a significant performance improvement.
In this work, they also provide an effective same-task pre-training strategy: instead
of pre-training on multiple related tasks, they directly perform pre-training on a
large-scale dataset using the same task. They believe that large-scale data is what
really matters for pre-training, and experimental results also show the superiority
of this strategy.
2. Methodology
2.1. Motivation
In order to understand what makes Transformer-based models perform better than CNN-
based models on the SR task, the authors used LAM to identify which input pixels
contribute most to the reconstruction of a selected region. Fig. 2 highlights in red
the pixels that contribute most to the reconstruction. Intuitively, the more
information is utilized, the better the performance should be.
In contrast to what the authors previously believed, the LAM range of the
Transformer-based model SwinIR was not larger than that of the other
two CNN-based models, EDSR and RCAN (as shown in Fig. 2). However, this does show
that 1) despite using much less information, SwinIR was able to achieve better
performance on the SR task, and 2) because it covers a smaller LAM range, wrong
texture reconstruction may occur, which could be improved if the LAM coverage
were somehow increased.
They aim to design a network that can take advantage of self-attention while
activating more pixels for reconstruction. As depicted in Fig. 2, HAT can see pixels
almost all over the image and restore correct and clear textures.
Furthermore, it is noticeable that there are significant blocking artifacts present in the
intermediate features of SwinIR, as depicted in Figure 3. These artifacts arise due to
the partitioning of the window, indicating that the shifted window mechanism
inadequately establishes cross-window connections. Previous works on high-level
visual tasks have also emphasized the need to enhance the connection among
windows to improve window-based self-attention methods. To address this issue, they
have intensified cross-window information interactions in the design of their
approach, resulting in the substantial mitigation of the blocking artifacts observed in
the intermediate features produced by HAT.
Figure 4 illustrates that the network is composed of three main parts: shallow feature
extraction, deep feature extraction, and image reconstruction. This architectural
design is commonly used in previous studies. In short, the model takes a low-
resolution input image $I_{LR} \in \mathbb{R}^{H \times W \times C_{in}}$ and extracts
shallow features $F_0 \in \mathbb{R}^{H \times W \times C}$ using a convolution layer,
where $C_{in}$ and $C$ denote the channel numbers of the input and the intermediate
feature, respectively. Then, residual hybrid attention groups (RHAG) and a 3x3
convolution layer (HConv) are used to extract deep features. The shallow features
$F_0$ and the deep features $F_D \in \mathbb{R}^{H \times W \times C}$ are combined
using a global residual connection, and the high-resolution output is generated by a
reconstruction module that utilizes the pixel-shuffle technique. Each RHAG consists
of several hybrid attention blocks (HAB), an overlapping cross-attention block (OCAB),
and a 3x3 convolution layer. The network is optimized using L1 loss.
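To make the overall dataflow concrete, below is a minimal PyTorch-style sketch of this three-part structure. It is an illustrative simplification, not the authors' implementation: RHAG is stubbed out with a single convolution, and names such as HATSketch are hypothetical.

```python
import torch
import torch.nn as nn

class RHAG(nn.Module):
    """Placeholder for a residual hybrid attention group (HABs + OCAB + 3x3 conv)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for the real blocks
    def forward(self, x):
        return x + self.body(x)

class HATSketch(nn.Module):
    def __init__(self, c_in=3, c=180, n_groups=6, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(c_in, c, 3, padding=1)            # shallow feature extraction
        self.deep = nn.Sequential(*[RHAG(c) for _ in range(n_groups)],
                                  nn.Conv2d(c, c, 3, padding=1))   # RHAGs + HConv
        self.reconstruct = nn.Sequential(                          # pixel-shuffle upsampling
            nn.Conv2d(c, c * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(c, c_in, 3, padding=1))

    def forward(self, i_lr):
        f0 = self.shallow(i_lr)            # F0: shallow features
        fd = self.deep(f0)                 # FD: deep features
        return self.reconstruct(f0 + fd)   # global residual connection, then reconstruction
```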
They use a larger window size for computing self-attention because it allows
more pixels to be involved in each attention computation. Additionally, to create
connections between neighboring non-overlapping windows, they use a shifted window
partitioning approach and set the shift size to half the window size.
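As a rough illustration of the shifted window idea (a sketch, not the paper's code), the feature map can be cyclically shifted by half the window size before partitioning; the helper below is hypothetical:

```python
import torch

def window_partition(x, window_size):
    # x: (B, H, W, C) -> (num_windows*B, window_size, window_size, C)
    b, h, w, c = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, c)

window_size = 16
shift = window_size // 2  # shift size is half the window size
x = torch.randn(1, 64, 64, 180)

# Cyclically shift the feature map so that the new windows straddle
# the boundaries of the previous (non-shifted) partition.
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
windows = window_partition(shifted, window_size)  # (16, 16, 16, 180)
```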
As shown in Figure 4, a channel attention block (CAB) is made up of two convolution
layers with a GELU activation between them, followed by a channel attention (CA)
module. Since the Transformer-based structure needs a large number of channels for
token embedding, using convolutions with a constant channel width would incur a
significant computational cost. To address this issue, they compress the number of
channels of the input feature by a constant factor β before applying the two
convolution layers: the channel number of the output feature after the first
convolution layer is reduced to C/β and then restored to C channels by the second
layer. Finally, a standard CA module is used to adaptively rescale the channel-wise
features.
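A minimal PyTorch sketch of this squeeze-then-restore design follows; the CA module is written in the common squeeze-and-excitation style, and all names are illustrative rather than the authors' code.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Standard CA: global average pooling, then a small bottleneck MLP produces
    per-channel weights that rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())
    def forward(self, x):
        return x * self.attn(x)

class CAB(nn.Module):
    def __init__(self, channels, beta=3):
        super().__init__()
        squeezed = channels // beta                       # compress C -> C/beta
        self.body = nn.Sequential(
            nn.Conv2d(channels, squeezed, 3, padding=1),  # first conv reduces channels
            nn.GELU(),
            nn.Conv2d(squeezed, channels, 3, padding=1),  # second conv restores C
            ChannelAttention(channels))
    def forward(self, x):
        return self.body(x)
```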
We can think of the standard window partition as a type of sliding partition where the
kernel size and stride both equal the window size (M). On the other hand, the
overlapping window partition can be considered as a sliding partition with a kernel
size equal to Mo and a stride equal to M.
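Under this sliding-window view, the overlapping partition can be sketched with torch's unfold. Here Mo = (1 + γ)M with γ the overlapping ratio, and zero-padding of size (Mo − M)/2 on each side is assumed for illustration:

```python
import torch
import torch.nn.functional as F

def overlapping_window_partition(x, m=16, gamma=0.5):
    # x: (B, C, H, W); a standard partition would use kernel=stride=m.
    mo = int((1 + gamma) * m)          # enlarged (overlapping) window size
    pad = (mo - m) // 2                # pad so windows stay aligned with the M-grid
    patches = F.unfold(x, kernel_size=mo, stride=m, padding=pad)
    # patches: (B, C*mo*mo, num_windows) -> one column per overlapping window
    b, _, n = patches.shape
    return patches.transpose(1, 2).reshape(b * n, x.shape[1], mo, mo)

x = torch.randn(1, 180, 64, 64)
wins = overlapping_window_partition(x)   # (16, 180, 24, 24) for a 64x64 input
```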
Unlike WSA, whose query, key, and value are calculated from the same window
feature, OCA computes the key/value from a larger field where more useful information
can be utilized for the query. MOA calculates global attention using window
features as tokens, while OCA computes cross-attention inside each window feature
using pixel tokens.
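The following sketch shows the core of this cross-attention computation for a single window, assuming the query comes from a standard MxM window and the key/value from the enlarged MoxMo window (illustrative shapes and names, not the paper's implementation):

```python
import torch

def overlapping_cross_attention(q_win, kv_win, w_q, w_k, w_v):
    # q_win:  (M*M, C)   tokens of one standard window
    # kv_win: (Mo*Mo, C) tokens of the matching overlapping window
    q = q_win @ w_q                                               # (M*M, d)
    k = kv_win @ w_k                                              # (Mo*Mo, d)
    v = kv_win @ w_v                                              # (Mo*Mo, d)
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)    # (M*M, Mo*Mo)
    return attn @ v                                               # (M*M, d)

c, d, m, mo = 180, 32, 16, 24
q_win, kv_win = torch.randn(m * m, c), torch.randn(mo * mo, c)
w_q, w_k, w_v = (torch.randn(c, d) for _ in range(3))
out = overlapping_cross_attention(q_win, kv_win, w_q, w_k, w_v)   # (256, 32)
```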
4. Experiments
4.1. Experimental Setup
They use the DF2K dataset, which is a combination of the DIV2K and Flickr2K
datasets, to train the model, because using only the DIV2K dataset leads to overfitting.
Following the pre-training approach proposed in previous studies, they employ the
ImageNet dataset for pre-training. The HAT architecture has the same depth and
width as SwinIR, with both RHAG and HAB numbers set to 6, channel number set to
180, and attention head number and window size set to 6 and 16 for (S)W-MSA and
OCA. The values of hyper-parameters for proposed modules, such as the weighting
factor (α) in HAB, the squeeze factor (β) between two convolutions in CAB, and the
overlapping ratio (γ) in OCA, are set to 0.01, 3, and 0.5, respectively. For the larger
variant HAT-L, they double the depth of HAT by increasing the RHAG number to 12.
The authors also provide a smaller version HAT-S that has fewer parameters and
similar computations as SwinIR, with the channel number set to 144 and depth-wise
convolution used in CAB. Five benchmark datasets, namely Set5, Set14, BSD100,
Urban100, and Manga109, are used to evaluate the methods using quantitative metrics
such as PSNR and SSIM calculated on the Y channel.
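For reference, the hyper-parameters stated above can be collected into a single configuration; the dictionary below simply restates the values in code form (the key names are hypothetical):

```python
hat_config = {
    "rhag_num": 6,          # residual hybrid attention groups (12 for HAT-L)
    "hab_num": 6,           # hybrid attention blocks per RHAG
    "channels": 180,        # 144 for the smaller HAT-S variant
    "attn_heads": 6,        # for (S)W-MSA and OCA
    "window_size": 16,
    "alpha": 0.01,          # weighting factor in HAB
    "beta": 3,              # squeeze factor between the two convolutions in CAB
    "gamma": 0.5,           # overlapping ratio in OCA
}
```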
What is SwinIR?
SwinIR is an image super-resolution approach that uses a hierarchical transformer-
based network architecture, called Swin Transformer, to address the problem of low-
resolution image reconstruction. SwinIR has achieved state-of-the-art results in a
number of image super-resolution benchmarks and has been shown to outperform
traditional methods such as bicubic interpolation and deep learning approaches such
as SRCNN, SRResNet, and ESRGAN.
What is LAM?
LAM stands for Local Attribution Map. It is an attribution method for super-resolution
networks that shows which pixels of the input LR image contribute to the reconstruction
of a selected patch of the output, as described in the Introduction. A wider LAM range
means the network exploits information from a larger area of the input image.
What is PSNR?
PSNR stands for Peak Signal-to-Noise Ratio. It is a commonly used metric to evaluate
the image or video quality after compression or modification. PSNR is calculated by
comparing the original image or video with the corresponding compressed or
modified version.
The higher the PSNR value, the better the reconstruction. PSNR is inversely related
to the error between the original image and the reconstructed image, so a higher
PSNR indicates a smaller error and better reconstruction quality.
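Concretely, PSNR is computed from the mean squared error (MSE) between the two images; a minimal NumPy version for images in the [0, 255] range:

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    # PSNR = 10 * log10(MAX^2 / MSE); higher is better.
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10 * np.log10(max_val ** 2 / mse)
```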
What is SSIM?
SSIM stands for Structural Similarity Index Measure, which is a widely used method
for measuring the similarity between two images. It attempts to quantify the perceived
similarity between two images by taking into account three individual measures:
luminance, contrast, and structure.
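In its common single-scale form, these measures are combined into one formula, where $\mu$ and $\sigma$ denote local means, variances, and covariance of the two images, and $C_1$, $C_2$ are small stabilizing constants:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$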