Computer Vision and Image Understanding: Hamed Elwarfalli, Russell C. Hardie
∗ Corresponding author.
E-mail address: rhardie1@udayton.edu (R.C. Hardie).
https://doi.org/10.1016/j.cviu.2020.103097
Received 29 December 2019; Received in revised form 21 August 2020; Accepted 1 September 2020
Available online 3 September 2020
1077-3142/© 2020 Elsevier Inc. All rights reserved.
H. Elwarfalli and R.C. Hardie Computer Vision and Image Understanding 202 (2021) 103097
An impulse-invariant equivalent discrete-space PSF, ℎ𝑑(𝑛1, 𝑛2), may be found by sampling ℎ(𝑥, 𝑦) at the HR sampling rate. This impulse-invariant PSF is applied in Fig. 1 to produce

𝑓̃𝑘(𝑛1, 𝑛2) = 𝑧̃𝑘(𝑛1, 𝑛2) ∗ ℎ𝑑(𝑛1, 𝑛2), (4)

where ∗ is 2D discrete convolution. The next step in the observation model is to downsample the shifted and blurred images, 𝑓̃𝑘(𝑛1, 𝑛2), by a factor of 𝐿 in both spatial dimensions. Finally, we add independent and identically distributed Gaussian noise. Thus, the observed frames are given by

𝑓𝑘(𝑛1, 𝑛2) = 𝑓̃𝑘(𝐿𝑛1, 𝐿𝑛2) + 𝜂𝑘(𝑛1, 𝑛2), (5)

for 𝑘 = 1, 2, … , 𝐾, where 𝜂𝑘(𝑛1, 𝑛2) contains Gaussian noise samples with standard deviation 𝜎𝜂.

The specific camera parameters used in the experimental results in this paper, both simulated and real, are listed in Table 1. The camera is an Imaging Source DMK21BU04 visible USB camera with a Sony ICX098BL CCD sensor. The camera is equipped with a 5 mm focal length lens set with an f-number of 𝐹 = 5.6. The native undersampling is 𝑀 = 3.636. However, we have obtained good results using an upsampling factor of 𝐿 = 3.000. Thus, we define the HR grid with a sample spacing of 𝑝∕𝐿 and downsample the image by 𝐿 = 3 in the observation model. A plot of the overall OTF, along with its two components, is shown in Fig. 2 for the system in Table 1 with 𝐿 = 3. Note that the |𝐻(𝑢, 𝑣)| OTF curve (shown in red) extends well beyond the native folding frequency (i.e., 1/2 the sampling frequency). This shows that a significant amount of signal frequency content may be aliasing with this sensor. However, minimal signal energy will be passed by this OTF above the 𝐿 = 3 SR folding frequency.

Fig. 2. Optical transfer function from Eq. (1) for the camera model using the parameters from Table 1. The SR folding frequency is shown for 𝐿 = 3.

The weights used in the FIF fusion are given by

𝑤𝑘(𝑛1, 𝑛2) = exp(−𝑅𝑘²(𝑛1, 𝑛2)∕𝛽²), (7)

where

𝑅𝑘(𝑛1, 𝑛2) = √(𝑑𝑥²(𝑘, 𝑛1, 𝑛2) + 𝑑𝑦²(𝑘, 𝑛1, 𝑛2)) (8)

and 𝑑𝑥(𝑘, 𝑛1, 𝑛2) and 𝑑𝑦(𝑘, 𝑛1, 𝑛2) are the horizontal and vertical distances of interpolated pixel 𝑔𝑘(𝑛1, 𝑛2) to the nearest LR pixel in the 𝑘’th frame. The distance computation defined by Eq. (8) is illustrated in Fig. 4. The basic idea is that an interpolated pixel that is near an original LR pixel will have less interpolation error and should be given a higher weight. If diverse camera motion is present, each HR interpolated pixel is likely to have an LR pixel nearby in one or more of the input frames. The parameter 𝛽 controls how sensitive the weights are to the distances in 𝑅𝑘(𝑛1, 𝑛2). We use 𝛽 = 0.1 for all of the results presented in this paper (Hardie et al., 2019). Note that the interframe registration information in Eq. (8) has the same dimensions as the interpolated frames and may be viewed as images. This is illustrated in Fig. 5 for 𝐾 = 4 frames with random interframe shifts and 𝐿 = 3. Note that the darker pixels correspond to interpolated values with smaller interpolation distances, and presumably more accurate values. Fig. 5(a) corresponds to the interpolated reference frame with no shift. Thus, every 𝐿’th pixel starting from pixel 𝑛1 = 𝑛2 = 1 lines up exactly with an input pixel and has an interpolation distance of 0.

The final step of the FIF SR method is deconvolution of the PSF blur from the fused image. In the original method, this is accomplished with a Wiener filter (Karch and Hardie, 2015; Hardie et al., 2019). The Wiener filter uses the parameter 𝛤 as the constant noise-to-signal power spectral density ratio. In all of our results in this paper, we search for and use the optimum 𝛤 when reporting FIF SR quantitative performance results.

3.2. FIFNET

In this section we describe the proposed FIFNET SR method. To the best of our knowledge, there are no published algorithms that perform motion-based multiframe SR where the image fusion takes place within a single CNN. One possible reason for this may be the challenge of how to deal with the interframe motion in a manner that both exploits the motion and allows a CNN to train consistently. It is our hypothesis that the proposed FIFNET method can effectively address this challenge by employing interpolation and registration preprocessing steps similar to those of the original FIF SR method.
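The registration preprocessing just mentioned, i.e., the distance maps of Eq. (8) and the weights of Eq. (7), can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the function names and the convention that shifts 𝐬𝑘 are expressed in HR pixel units are our assumptions:

```python
import numpy as np

def distance_maps(shape, shifts, L=3):
    """R_k(n1, n2) of Eq. (8): distance from each HR grid pixel to the
    nearest LR sample of frame k, whose samples lie on the shifted
    lattice s_k + L * (i, j)."""
    H, W = shape
    n1 = np.arange(H)[:, None]  # vertical HR index
    n2 = np.arange(W)[None, :]  # horizontal HR index
    maps = []
    for s1, s2 in shifts:
        # distance to the nearest point of the shifted lattice along each axis
        d1 = np.abs((n1 - s1 + L / 2) % L - L / 2)
        d2 = np.abs((n2 - s2 + L / 2) % L - L / 2)
        maps.append(np.sqrt(d1 ** 2 + d2 ** 2))
    return np.stack(maps)  # K x H x W

def fif_weights(R, beta=0.1):
    """w_k(n1, n2) = exp(-R_k^2 / beta^2) of Eq. (7)."""
    return np.exp(-R ** 2 / beta ** 2)
```

For the zero-shift reference frame, every 𝐿'th pixel has distance 0 and weight 1, matching the description of Fig. 5(a).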
Fig. 3. Block diagram of FIF and FIFNET multiframe SR. The original FIF SR method (Karch and Hardie, 2015) is depicted with the top branch labeled (1). Our proposed FIFNET
replaces the deterministic fusion and restoration blocks with a novel machine learning approach using a specially designed CNN architecture, as shown in the bottom branch
labeled (2).
Fig. 5. Visualization of 𝑅𝑘(𝑛1, 𝑛2) for a 25 × 25 patch size with different shifts and 𝐿 = 3. HR pixel shifts are (a) 𝐬1 = [0.00, 0.00]𝑇, (b) 𝐬2 = [0.56, 2.54]𝑇, (c) 𝐬3 = [−2.03, 4.86]𝑇, and (d) 𝐬4 = [−2.09, 1.25]𝑇.
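The Wiener restoration step of the original FIF SR method (Section 3.1) can be sketched in the frequency domain. This is a textbook constant-NSR Wiener deconvolution under our own naming, not the authors' implementation, and it assumes circular boundary handling:

```python
import numpy as np

def wiener_deconvolve(fused, h_d, gamma):
    """Deconvolve the PSF blur h_d from the fused image with a Wiener
    filter using the constant noise-to-signal PSD ratio gamma
    (the parameter Gamma of Section 3.1). gamma > 0 in practice."""
    H = np.fft.fft2(h_d, s=fused.shape)        # OTF on the image grid
    W = np.conj(H) / (np.abs(H) ** 2 + gamma)  # Wiener transfer function
    return np.real(np.fft.ifft2(np.fft.fft2(fused) * W))
```

As a sanity check, with an ideal delta PSF and gamma = 0 the filter reduces to the identity.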
Fig. 6. FIFNET architecture. The interpolated and aligned observed frames are combined with the subpixel registration information to form the input channels. A description of
the individual layers is provided in Table 2. The output of the FIFNET is a single estimated SR image.
Table 2
Description of the FIFNET layers. Note that the spatial dimensions of the activations are for training using 61 × 61 image patches. The trained network may be applied to any size testing image.

Layer  Type         Spatial kernel  Activations
1      Input        –               61 × 61 × 2𝐾
2      Splitting    –               61 × 61 × 𝐾
3      2D Conv      1 × 1           61 × 61 × 64
4      2D Conv      1 × 1           61 × 61 × 64
5      2D Conv      1 × 1           61 × 61 × 64
6      2D Conv      3 × 3           59 × 59 × 64
7      2D Conv      3 × 3           57 × 57 × 64
8      2D Conv      3 × 3           55 × 55 × 64
9      2D Conv      3 × 3           53 × 53 × 64
10     2D Conv      3 × 3           51 × 51 × 64
11     2D Conv      5 × 5           47 × 47 × 64
12     2D Conv      5 × 5           43 × 43 × 64
13     2D Conv      5 × 5           39 × 39 × 64
14     2D Crop      –               39 × 39 × 𝐾
15     Concatenate  –               39 × 39 × (64 + 𝐾)
16     2D Conv      5 × 5           35 × 35 × (64 + 𝐾)
17     2D Crop      –               35 × 35 × 𝐾
18     Concatenate  –               35 × 35 × (64 + 2𝐾)
19     2D Conv      5 × 5           31 × 31 × 1
20     Regression   –               –

The network is trained with image patches of size 61 × 61. Given 𝐾 frames, and combining these with the corresponding registration arrays, the input to the network is an array of size 61 × 61 × 2𝐾. We do not employ any internal padding with the convolutions in the network. Thus, the spatial dimensions of the activations get smaller with each of the spatial convolution layers, as shown in Table 2. This is also illustrated in Fig. 6 with the reduced size layer blocks. Finally, the output in Layer 19 is a 31 × 31 × 1 image patch estimate. This output is compared with the corresponding target patch for regression error computation in what we list as Layer 20 in Table 2.

We implement the FIFNET using a total of 13 convolution layers. The convolution layers have been divided into three groups, with a different spatial kernel size for each group. The first group contains Layers 3–5 and performs 2D convolution using a 1 × 1 spatial kernel that spans all channels. We employ the 1 × 1 spatial kernel size for this first group so as to exclusively produce channel fusion of 𝑔𝑘(𝑛1, 𝑛2) and 𝑅𝑘(𝑛1, 𝑛2), rather than spatial processing. This ‘‘fusion-first’’ approach is in keeping with the original FIF SR framework and keeps the CNN network complexity low. Layers 6–10 make up the next group that begins the spatial processing using a 3 × 3 kernel size. The third and final group of convolution layers uses a 5 × 5 kernel size. With the exception of the final convolution layer (i.e., Layer 19), all convolution layers employ 64 filters to produce 64 output channels or feature maps and employ a rectified linear unit (ReLU). The final convolutional layer employs only one filter and has no ReLU.

Note in Fig. 6 that we concatenate the original interpolated frames (without the registration channels) to the processed feature maps in Layers 15 and 18. To accommodate this concatenation, Layers 14 and 17 crop the input to match the sizes of the reduced activation dimensions in Layers 15 and 18. The interpolated frames tend to provide the low spatial frequency content for the final estimate. This allows the main spatial processing pipeline to have the arguably simpler task of generating high spatial frequency ‘‘corrections’’, rather than the full SR image. The final layers fuse the correction information with the interpolated frames to produce the final output. We have found that inclusion of the registration channels at the input, and concatenation of the interpolated frames late in the network, are both crucial to network performance.

While the network training uses patches, the trained network can be applied to any size test image. The convolutions in each layer are simply extended across an arbitrary size test image. During testing we pre-pad the borders of the interpolated input images (and registration arrays) by 15 pixels on all sides. This allows the FIFNET output size to match the interpolated input size. We use symmetric reflection padding to minimize discontinuities. The main benefit of the no-padding CNN architecture is that it reduces the network size to allow for faster training. In some cases this also provides improved performance. The improved performance may be the result of avoiding data extrapolation error during training. By selecting patches away from the border, no extrapolated data is ever used in the network during training. In contrast, many other methods employ zero padding within the network. This effectively introduces artificial discontinuities to every patch in the training data.

3.2.2. Network training

The objective function for the FIFNET regression network is the mean squared error between the true HR image 𝑧(𝑛1, 𝑛2) and the SR estimate 𝑧̂(𝑛1, 𝑛2). As mentioned above, this error is computed over a 31 × 31 patch. All network training and hyper-parameter optimization has been done using the publicly available DIV2K HR training dataset (Agustsson and Timofte, 2017; Timofte et al., 2018). This database consists of 800 24-bit RGB images with 2K resolution. All of the input images have been converted to grayscale with a floating point dynamic range of 0–1. The Gaussian noise in the observation model is set to 𝜎𝜂 = 0.001 for the simulated training and testing data. Training is done using stochastic gradient descent with momentum optimization.
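The activation sizes in Table 2 follow from the fact that each unpadded ("valid") convolution shrinks the spatial size by the kernel size minus one, and the 15-pixel symmetric pre-padding at test time exactly compensates the total shrinkage of 30 pixels. A small sketch to verify the bookkeeping (helper names are ours, not from the paper):

```python
import numpy as np

# Spatial kernel sizes of the 13 convolution layers in Table 2
# (Layers 3-5: 1x1, Layers 6-10: 3x3, Layers 11-13, 16, and 19: 5x5).
KERNELS = [1, 1, 1, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5]

def fifnet_output_size(n):
    """Spatial size of the Layer 19 output for an n x n input."""
    for k in KERNELS:
        n = n - k + 1  # unpadded convolution: shrink by k - 1
    return n

# Training: 61 x 61 patches shrink to 31 x 31 (a total loss of 30 pixels).
# Testing: symmetric reflection pre-padding by 15 pixels per side restores
# the output size to that of the interpolated input.
patch = np.zeros((61, 61))
padded = np.pad(patch, 15, mode='symmetric')  # 91 x 91
```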
Sixty-four different random patches are extracted from each training image. We use a mini-batch size of 64 so that each iteration corresponds to one training image. All training is done with a total of 100 epochs. We set the initial learning rate to 0.1 and reduce it by a factor of 0.1 after every 10 epochs.
Network training is performed here using a Windows workstation with an Intel Core i9-9980XE processor with a 3.0 GHz clock speed. It is equipped with two NVIDIA TITAN RTX Graphics Processing Units (GPUs).
Only one GPU is used for training a single network. Training one
FIFNET for 𝐾 = 10 takes 5.56 h using the parameters in Table 2. If
the network is reconfigured to use zero padding with each convolution
layer, and retain the 61 × 61 input size, this larger network takes
6.78 h to train (an increase of 21.95%). For 𝐾 = 1, training the
standard FIFNET takes 2.36 h. Using padding, it takes 3.81 h (an
increase of 61.44%). In addition to the significant speedup in training,
the no-padding architecture gives a modest boost in average PSNR of
approximately 0.2 dB in testing for 𝐾 = 10.
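The learning-rate schedule described above (initial rate 0.1, reduced by a factor of 0.1 after every 10 epochs) is a piecewise-constant decay; a one-line sketch with a hypothetical helper name:

```python
def learning_rate(epoch, initial=0.1, drop=0.1, period=10):
    """Learning rate for a given 0-based epoch index: start at `initial`
    and multiply by `drop` after every `period` epochs."""
    return initial * drop ** (epoch // period)
```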
4. Experimental results
Fig. 7. Quantitative performance comparison using simulated data from the DIV2K Validation dataset for 𝐿 = 3 and 𝜎𝜂 = 0.001. The average PSNR is shown as a function of the number of input frames for the methods shown in the legend.

In this section we present a number of experimental results to demonstrate the efficacy of FIFNET. First, we use imagery with simulated degradation to provide a quantitative performance analysis. Next, we process real camera data with no artificial degradation for subjective analysis in a real application.

4.1. Simulated data

The imagery used for testing comes from three publicly available databases. These are the DIV2K Validation dataset (Agustsson and Timofte, 2017) (100 images), the Set14 dataset (Agustsson and Timofte, 2017) (14 images), and the BSDS100 dataset (Arbelaez et al., 2010) (100 images). None of the images contained in these databases have been used in training. We employ two performance metrics, Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) (Wang et al., 2004).

We consider two motion model scenarios. The first is what we call the Fixed Shifts (FS) scenario. Here, the set of interframe shifts for the testing data are known a priori and are fixed. All training images are artificially degraded using the observation model with these same fixed shifts. The FS scenario might occur in practice when the best possible result is desired for one test sequence and network training time is not a major consideration. The second case is what we call the Random Shifts (RS) scenario. Here each training image gets a different random shift with the hope of making FIFNET robust to shift pattern. Using this approach, one trained network can be applied regardless of the shift pattern in the testing sequence. The RS FIFNET could then be used to process video by employing a temporal moving window of frames. In addition, we evaluate FIFNET with and without using 𝑅𝑘(𝑛1, 𝑛2) as part of the input channels. We do this to study the importance of the subpixel registration information channels.

The average PSNR results for the DIV2K Validation dataset as a function of the number of input frames are provided in Fig. 7. We use 𝐿 = 3 and the noise standard deviation is 𝜎𝜂 = 0.001. A curve for each of the scenarios described in the previous paragraph is included. We also include three single frame methods as benchmarks and we show these with dashed lines. These are single frame bicubic interpolation, VDSR (Kim et al., 2016), and RCAN (Zhang et al., 2018a). The VDSR and RCAN networks are trained here using the same data and degradation model as those for FIFNET. We also include comparison results for two additional multiframe SR methods. These are the original FIF SR (Karch and Hardie, 2015) and the Adaptive Wiener Filter (AWF) SR method (Hardie, 2007; Milanfar, 2017). For the AWF method we apply a window size of 9 × 9 and a one-step correlation value of 𝜌 = 0.7.

Fig. 7 shows that in this experiment the FIFNET outperforms the benchmark methods for 𝐾 > 1. For 𝐾 = 1, the single-frame RCAN method provides the best results. Also, note that increasing the number of input frames significantly improves the FIFNET performance. As expected, the FS scenario yields higher PSNR than the RS scenario. However, the FS network is only intended for the one set of shifts it was trained on. Another important thing to observe in Fig. 7 is the boost in PSNR that results from including the registration distance images as additional input channels (see ‘‘with R’’ in the legend). This clearly demonstrates the importance of providing subpixel registration information to the network. We have also observed that not concatenating the input frames at the end of the network leads to a drop in PSNR of approximately 1.7 dB at 𝐾 = 10.

Additional quantitative results are provided in Table 3. These results show PSNR and SSIM for all three test databases. In these experiments the best quantitative results are obtained using the FIFNET with FS scenario and 𝐾 = 8 or 𝐾 = 10. This is followed by the FIFNET with RS scenario.

In addition to the quantitative results, a number of processed images are provided for subjective evaluation in Figs. 8–10. In each of these figures the truth image is shown in (a), the truth region of interest (ROI) is shown in (b), and the various processed ROIs are shown in (c)–(h). Note that the images in the top rows of these figures are for single frame methods, and the bottom rows are for multiframe methods using 𝐾 = 10 frames. The error metric values associated with the images are listed in the captions.

The Car image in Fig. 8 gives a good illustration of typical results. Note that the bicubic interpolation image in Fig. 8(c) shows significant blurring as well as aliasing artifacts in the form of jagged edges along the borders of the license plate. The RCAN single-frame method in Fig. 8(d) provides significant sharpening and does a good job reconstructing the edges of the license plate. However, the ill-posed nature of the inverse problem is exposed on the numbers ‘‘6’’ and ‘‘8’’ here. The multiframe methods are able to recover the numbers reasonably well. However, the original FIF SR method in Fig. 8(e) has artifacts on the license plate perimeter. The AWF SR method in Fig. 8(f) appears better on the edges and numbers. However, the FIFNET results in Figs. 8(g) and (h) appear to provide the sharpest images with minimal aliasing artifacts. Note that the FIFNET FS result in Fig. 8(h) is 2.63 dB higher than AWF SR. Similar observations can be made in Figs. 9 and 10. It is interesting to note the dark thin wire in Fig. 9 that runs from the top of the light post to the right middle edge of the image. We believe that this detail is most prominently restored in Fig. 9(h) using FIFNET FS.
Table 3
Average PSNR(dB)/SSIM for 𝐾 = 1, 4, 8, and 10 using different methods and three different datasets: DIV2K Validation, Set14, and BSDS100. The bold numbers indicate the best performance for the corresponding metric and dataset category.

Dataset: DIV2K Validation (Agustsson and Timofte, 2017)
𝐾    Bicubic       VDSR          RCAN          FIF           AWF           FIFNET RS     FIFNET FS
1    24.28/0.720   26.36/0.797   27.07/0.831   25.38/0.777   25.64/0.787   26.41/0.811   26.43/0.812
4    –             –             –             27.18/0.848   28.33/0.879   29.36/0.900   30.41/0.914
8    –             –             –             28.10/0.875   29.57/0.907   29.91/0.910   31.80/0.930
10   –             –             –             28.13/0.875   29.84/0.912   30.01/0.912   31.77/0.932

Dataset: Set14 (Agustsson and Timofte, 2017)
𝐾    Bicubic       VDSR          RCAN          FIF           AWF           FIFNET RS     FIFNET FS
1    24.91/0.719   27.33/0.783   27.86/0.816   26.11/0.771   26.38/0.778   27.43/0.803   27.45/0.803
4    –             –             –             27.85/0.837   29.03/0.865   30.14/0.884   31.22/0.899
8    –             –             –             28.77/0.862   30.26/0.893   30.71/0.895   32.36/0.915
10   –             –             –             28.80/0.863   30.54/0.898   30.77/0.897   32.32/0.916

Dataset: BSDS100 (Arbelaez et al., 2010)
𝐾    Bicubic       VDSR          RCAN          FIF           AWF           FIFNET RS     FIFNET FS
1    25.08/0.690   26.74/0.747   27.39/0.783   25.98/0.743   26.17/0.752   26.84/0.770   26.86/0.771
4    –             –             –             27.54/0.812   28.48/0.845   29.37/0.868   30.30/0.886
8    –             –             –             28.32/0.840   29.59/0.878   29.88/0.880   31.63/0.909
10   –             –             –             28.35/0.841   29.83/0.884   29.99/0.884   31.60/0.911
Fig. 8. Results for image ‘‘0879’’ in the DIV2K validation dataset. The PSNR(dB)/SSIM values are (b) 23.99/0.804, (c) 25.96/0.863, (d) 26.44/0.878, (e) 28.55/0.923, (f)
29.00/0.928, (g) 29.71/0.935, and (h) 31.63/0.938. The noise has a standard deviation of 𝜎𝜂 = 0.001 and 𝐾 = 10 frames are used in (e)–(h).
Fig. 9. Results for image ‘‘78004’’ in the BSDS100 dataset. The PSNR(dB)/SSIM values are (b) 23.46/0.665, (c) 25.23/0.737, (d) 26.74/0.803, (e) 27.50/0.854, (f) 27.80/0.862, (g) 28.21/0.867, and (h) 30.34/0.908. The noise has a standard deviation of 𝜎𝜂 = 0.001 and 𝐾 = 10 frames are used in (e)–(h).
In a final experiment with the simulated data, we have performed a study of the processing time for the various methods using the 100-image DIV2K Validation dataset. The average testing time per image is plotted versus 𝐾 in Fig. 11. The reported times include the time for all
Fig. 10. Results for the image ‘‘Man’’ in the Set14 dataset. The PSNR(dB)/SSIM values are (c) 24.74/0.684, (d) 27.09/0.784, (e) 28.36/0.853, (f) 29.11/0.872, (g) 29.24/0.867, and (h) 30.88/0.899. The noise has a standard deviation of 𝜎𝜂 = 0.001 and 𝐾 = 10 frames are used in (e)–(h).
The true test of any SR algorithm is to process real camera data that have not been artificially degraded. In this section we present several results for such a scenario using image data acquired by the authors. Because the data are not artificially degraded, there are no corresponding ground truth images. The results presented here are for subjective evaluation purposes. We have chosen familiar scene content and a well-defined test pattern to facilitate the evaluation.

Since the observation model presented in Section 2 is based on a realistic camera OTF, our trained network can be applied directly to data from that camera. The interframe motion is created here by manually panning and tilting the camera on a tripod during video acquisition at a rate of 30 frames per second. We present results for three distinct datasets in Figs. 12–14. For each of the datasets, the multiframe methods use 𝐾 = 10 frames with 𝐿 = 3.

Fig. 12 shows results for an outdoor scene of a brick house. Note that the single-frame methods in Figs. 12(b) and (c) have difficulty restoring the texture of the brick pattern. On the other hand, all of the multiframe methods in Figs. 12(d)–(f) appear to restore the brick pattern fairly well. The FIFNET RS result trained using 𝜎𝜂 = 0.001 is shown in Fig. 12(f). It appears to have more sharpness than FIF SR, and more noise suppression than AWF.

Results for an indoor bookshelf dataset are shown in Fig. 13. Here the aliasing pattern is evident on the lettering in the bicubic interpolation image in Fig. 13(b). The RCAN method appears to do a good job reducing the aliasing artifacts on the large letters, but has more difficulty with the smaller lettering. The FIFNET FS result in Fig. 13(f) appears to provide better sharpness, compared with AWF and FIF, and more noise suppression. Training was done here using 𝜎𝜂 = 0.01 due to the lower indoor lighting level and corresponding decreased signal-to-noise ratio.

Fig. 12. Image results for the real camera data of a brick house. The images shown are (a) bicubic, (b) bicubic ROI, (c) RCAN, (d) FIF SR, (e) AWF, and (f) FIFNET RS. The multiframe methods use 𝐾 = 10 frames for (d)–(f).

Fig. 13. Image results for the real camera data of a bookshelf. The images shown are (a) bicubic, (b) bicubic ROI, (c) RCAN, (d) FIF SR, (e) AWF, and (f) FIFNET FS. The multiframe methods use 𝐾 = 10 frames for (d)–(f).

Finally, a circularly-symmetric chirp pattern image is shown in Fig. 14. This test image has been selected to clearly illustrate the aliasing reduction capabilities of the methods being tested. Training has also been done here using 𝜎𝜂 = 0.01. Aliasing in the form of a Moiré pattern is visible on the high frequency components of the chirp in the bicubic interpolation image in Fig. 14(a). The aliasing causes the concentric ring pattern to appear inverted on the top and on the right in the image. The single frame restoration methods shown in Figs. 14(b) and (c) improve contrast, but are unable to correctly decode the true chirp structure from the undersampled imagery. For a single-frame method, RCAN does do a good job on most of the chirp pattern. Note that the multiframe methods shown in Figs. 14(d)–(f) correctly restore all of the rings of the chirp and improve the readability of the title text. As with the bookshelf data, FIFNET FS in Fig. 14(f) appears to show the best sharpness and more noise reduction than AWF and FIF.

Fig. 14. Image results for the real camera data of a chirp pattern scene. The images shown are (a) bicubic, (b) VDSR, (c) RCAN, (d) FIF SR, (e) AWF, and (f) FIFNET FS. The multiframe methods use 𝐾 = 10 for (d)–(f).

5. Conclusions

In this paper we have proposed a novel multiframe SR method that we refer to as FIFNET. This method uses a single CNN architecture to fuse and restore multiple interpolated input frames. We believe this is the first motion-based multiframe CNN SR method to perform fusion of input images within one CNN. One of the key innovations that makes this effective is the inclusion of subpixel registration information. This information is packaged as auxiliary information channels that accompany the input frames, as described in Section 3.2.1. The quantitative results in Section 4.1 clearly illustrate that a significant boost in performance is achieved by including these extra channels. Another innovation in the architecture, compared with that of VDSR, is the concatenation of the input frames near the end of the convolution layers. We have found this boosts performance and allows for fewer total layers. A final notable architectural innovation is the elimination of zero padding inside the network to reduce the network size for training and prevent extrapolation error from potentially contaminating the training process.

FIFNET training is done here using a realistic observation model that accounts for camera optics and the corresponding detector array. To the best of our knowledge this has not been done previously with CNN SR methods in the literature. The model allows us to more realistically evaluate the methods on simulated data, and it allows us to apply the trained FIFNET directly to real camera data. The quantitative results in Section 4.1 show that the proposed FIFNET provides the best performance among the methods tested for multiple frames (i.e., 𝐾 > 1). In particular, FIFNET exhibits a consistent performance advantage over the original FIF SR method. We attribute this to the highly flexible non-linear processing of the CNN, and the synergy achieved by jointly performing fusion and restoration. We believe the subjective results with both simulated and real data are consistent with the quantitative results. The results with real camera imagery of the chirp pattern demonstrate objective aliasing reduction in a tough real-world application.

We have considered two motion model scenarios: one where the interframe shifts were known in advance and training could be exclusively focused on those particular shifts, and another where the network is trained to deal with any random shifts. We have demonstrated that FIFNET can be highly effective in both scenarios. The fixed shifts scenario generally produces better results, but only for the particular set of shifts that it is trained for.

One of the benefits of building on the FIF SR framework is the relative ease with which it can be extended to other types of motion and sensors (Karch and Hardie, 2015). In future work, we are looking to extend the FIFNET method to handle more complex camera motion and within-scene motion. We are also looking to extend the method to RGB and other division-of-focal-plane array imaging sensors.
CRediT authorship contribution statement

Hamed Elwarfalli: Methodology, Software, Investigation, Data curation, Writing - original draft. Russell C. Hardie: Conceptualization, Methodology, Software, Validation, Writing - review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Agustsson, E., Timofte, R., 2017. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 1122–1131. http://dx.doi.org/10.1109/CVPRW.2017.150.
Ahn, N., Kang, B., Sohn, K.-A., 2018. Fast, accurate, and lightweight super-resolution with cascading residual network. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 252–268. http://dx.doi.org/10.1007/978-3-030-01249-6_16.
Arbelaez, P., Maire, M., Fowlkes, C., Malik, J., 2010. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33, 898–916. http://dx.doi.org/10.1109/TPAMI.2010.161.
Boreman, G.D., 1998. Basic Electro-Optics for Electrical Engineering, Vol. 31. SPIE Press. http://dx.doi.org/10.1117/3.294180.
Bulat, A., Tzimiropoulos, G., 2018. Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 109–117. http://dx.doi.org/10.1109/CVPR.2018.00019.
Farsiu, S., Robinson, D., Elad, M., Milanfar, P., 2003. Robust shift and add approach to superresolution. In: Tescher, A.G. (Ed.), Applications of Digital Image Processing XXVI, Vol. 5203. International Society for Optics and Photonics, SPIE, pp. 121–130. http://dx.doi.org/10.1117/12.507194.
Goodman, J., 2004. Introduction to Fourier Optics, third ed. Roberts and Company Publishers.
Hardie, R.C., 2007. A fast super-resolution algorithm using an adaptive Wiener filter. IEEE Trans. Image Process. 16 (12), 2953–2964. http://dx.doi.org/10.1109/TIP.2007.909416.
Hardie, R.C., Barnard, K.J., Ordonez, R., 2011a. Fast super-resolution with affine motion using an adaptive Wiener filter and its application to airborne imaging. Opt. Express 19 (27), 26208–26231. http://dx.doi.org/10.1364/OE.19.026208.
Hardie, R.C., Droege, D.R., Dapore, A.J., Greiner, M.E., 2015. Impact of detector-element active-area shape and fill factor on super-resolution. Front. Phys. 3, 31. http://dx.doi.org/10.3389/fphy.2015.00031.
Hardie, R.C., LeMaster, D.A., Ratliff, B.M., 2011b. Super-resolution for imagery from integrated microgrid polarimeters. Opt. Express 19, 12937–12960. http://dx.doi.org/10.1364/OE.19.012937.
Hardie, R.C., Rucci, M., Karch, B.K., Dapore, A.J., Droege, D.R., French, J.C., 2019. Fusion of interpolated frames superresolution in the presence of atmospheric optical turbulence. Opt. Eng. 58 (8), 1–16. http://dx.doi.org/10.1117/1.OE.58.8.083103.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition. CVPR, pp. 770–778. http://dx.doi.org/10.1109/CVPR.2016.90.
Johnson, J., Alahi, A., Fei-Fei, L., 2016. Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. Springer, pp. 694–711. http://dx.doi.org/10.1007/978-3-319-46475-6_43.
Karch, B.K., Hardie, R.C., 2015. Robust super-resolution by fusion of interpolated frames for color and grayscale images. Front. Phys. 3, 28. http://dx.doi.org/10.3389/fphy.2015.00028.
Kawulok, M., Benecki, P., Kostrzewa, D., Skonieczny, L., 2018. Evolving imaging model for super-resolution reconstruction. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion. GECCO ’18, ACM, New York, NY, USA, pp. 284–285. http://dx.doi.org/10.1145/3205651.3205676.
Kawulok, M., Benecki, P., Piechaczek, S., Hrynczenko, K., Kostrzewa, D., Nalepa, J., 2019. Deep learning for multiple-image super-resolution. arXiv preprint arXiv:1903.00440.
Kim, J., Lee, J.K., Lee, K.M., 2016. Accurate image super-resolution using very deep convolutional networks. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition. CVPR, pp. 1646–1654. http://dx.doi.org/10.1109/CVPR.2016.182.
Lai, W.-S., Huang, J.-B., Ahuja, N., Yang, M.-H., 2017. Deep Laplacian pyramid networks for fast and accurate super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 624–632. http://dx.doi.org/10.1109/CVPR.2017.618.
Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al., 2017. Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4681–4690. http://dx.doi.org/10.1109/CVPR.2017.19.
Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K., 2017. Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 136–144. http://dx.doi.org/10.1109/CVPRW.2017.151.
Lucas, B.D., Kanade, T., 1981. An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence, Vancouver.
Ma, Z., Liao, R., Tao, X., Xu, L., Jia, J., Wu, E., 2015. Handling motion blur in multi-frame super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5224–5232. http://dx.doi.org/10.1109/CVPR.2015.7299159.
Milanfar, P., 2017. Super-Resolution Imaging. CRC Press. http://dx.doi.org/10.1201/9781439819319.
Park, S.C., Park, M.K., Kang, M.G., 2003. Super-resolution image reconstruction: a technical overview. IEEE Signal Process. Mag. 20 (3), 21–36. http://dx.doi.org/10.1109/MSP.2003.1203207.
Sajjadi, M.S., Scholkopf, B., Hirsch, M., 2017. EnhanceNet: Single image super-resolution through automated texture synthesis. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4491–4500. http://dx.doi.org/10.1109/ICCV.2017.481.
Timofte, R., Gu, S., Wu, J., Van Gool, L., Zhang, L., Yang, M.-H., Haris, M., et al., 2018. NTIRE 2018 challenge on single image super-resolution: Methods and results. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 965–96511. http://dx.doi.org/10.1109/CVPRW.2018.00130.
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612. http://dx.doi.org/10.1109/TIP.2003.819861.
Wang, Y., Perazzi, F., McWilliams, B., Sorkine-Hornung, A., Sorkine-Hornung, O., Schroers, C., 2018. A fully progressive approach to single-image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 864–873. http://dx.doi.org/10.1109/CVPRW.2018.00131.
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y., 2018a. Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision. ECCV, pp. 286–301. http://dx.doi.org/10.1007/978-3-030-01234-2_18.
Zhang, K., Zuo, W., Zhang, L., 2018b. Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3262–3271. http://dx.doi.org/10.1109/CVPR.2018.00344.