EEDNet: Enhanced Encoder-Decoder Network for AutoISP ({xiangyu.he, jcheng}@nlpr.ia.ac.cn)
Abstract. The Image Signal Processor (ISP) plays a core role in camera systems.
However, ISP tuning is highly complicated and requires professional skills and
advanced imaging experience. To skip the painful ISP tuning process, we
introduce EEDNet, which directly transforms an image in the raw space into an
image in the sRGB space (RAW-to-RGB). Data-driven RAW-to-RGB mapping is a
brand-new low-level vision task. In this work, we propose a hypothesis about
the receptive field: a large receptive field (LRF) is essential in high-level
computer vision tasks, but not crucial in low-level pixel-to-pixel tasks.
Besides, we present a ClipL1 loss, which simultaneously considers easy examples
and outliers during the optimization process. Benefiting from the LRF
hypothesis and the ClipL1 loss, EEDNet can generate high-quality pictures with
more details. Our method achieves promising results on the Zurich RAW2RGB (ZRR)
dataset and won first place in the AIM 2020 ISP challenge.
1 Introduction
An Image Signal Processor (ISP) is a specialized digital signal processor for
reconstructing RGB images from raw Bayer images. In conventional camera
pipelines, whether in smartphones or DSLR cameras, complex and confidential
hardware processes are employed to perform image signal processing. Meanwhile,
ISP tuning is highly complicated, and professional skills and advanced imaging
experience are indispensable. An ISP consists of various processing steps,
including denoising, white balancing, exposure correction, demosaicing, colour
transform, gamma encoding and so on. Because each step in a conventional ISP is
performed sequentially with its own task-specific loss function, residual
errors accumulate along the pipeline [17]. To correct these stepwise
accumulated errors, a tedious parameter tuning process has to be employed at
the later stages.
More concretely, many conventional methods use hand-crafted, heuristics-based
approaches to derive the solution at each step of the ISP pipeline, leaving
oceans of parameters to be tuned in response to the complicated and volatile
environments of the real world. Besides, running the ISP steps sequentially
with modular algorithms results in cumulative errors at every step. A small
change in the parameter configuration may lead to different reconstructed RGB
images.
Meanwhile, smartphones have gradually become a part of daily life. With the
continuous improvement of mobile phone cameras, high-quality photos have gone
from being the privilege of professional cameras to something that ordinary
people can easily access. Heavy image signal processing systems are embedded in
phones to improve the quality of photos. However, due to the limited hardware
resources of mobile cameras, there may always be a big gap between phone and
professional cameras. How to bring the picture quality of a mobile phone camera
as close as possible to that of a professional one is our concern. It is known
that a well-adjusted ISP can bring competitive quality to the images taken by
smartphones. Nevertheless, designing an ISP and adjusting its internal module
parameters is far from simple. For camera and smartphone manufacturers, the ISP
is regarded as a core competency. In light of this, we propose EEDNet to evade
the painful ISP tuning process and narrow the gaps between various smartphone
cameras generated by different ISP pipelines. EEDNet uses a unified loss
function to optimize the entire processing involved in an ISP pipeline in an
end-to-end setting.
Each module in a traditional ISP can neither control the output of other
modules nor recover the signal lost in previous modules. The idea of using a
convolutional neural network (CNN) to replace the hardware-based ISP is
supported by the fact that a CNN can compensate for the information loss of
input images, which makes it more reliable than the traditional ISP, and can
effectively break through the hardware limitation. Ignatov et al. [9] pioneered
the application of CNNs to replace the camera ISP of smartphones and proposed
the RAW-to-RGB dataset together with the PyNET network.
In this paper, we show that deep neural networks with Large Receptive Fields
(LRF) are not required in this task. In contrast to the popular design in object
detection [12,18] and semantic segmentation [4], which emphasize semantic infor-
mation, we assume that low-level image processing tasks such as RAW-to-RGB
could pay more attention to local structures. To further verify our hypothesis,
we conduct extended experiments on SIDD+ [1]. The results show that U-Net
[19] without LRF can also obtain promising results.
Our main contributions can be summarized as follows:
– We show that the RAW-to-RGB task does not require an LRF in the
encoder-decoder structure. Furthermore, we verify our hypothesis on the SIDD+
task.
– We propose ClipL1 loss, which eliminates the effect of easy examples and
outliers during training.
– We present EEDNet with a desirable receptive field configuration, which
outperforms PyNET.
2 Related Work
In this section, we briefly review and discuss work on image signal processing
in two parts, i.e., convolutional neural networks for low-level vision tasks
and previous works that use deep learning techniques to learn the ISP pipeline.
[Figure 1 diagram; level labels: Input (Level 1), Level 2, Level 3, Level 4, The Highest Level]
Fig. 1. Five-level U-Net [19] with an additional upsample layer added to the top of
U-Net.
[Figure 2 plot; y axis: PSNR (ticks from 20.85 to 21.2), x axis: RFs (ticks from 0 to 160)]
Fig. 2. Receptive fields of the highest-level layers in different encoder-decoder
networks and their corresponding fidelity.
Liang et al. [11] categorized the ISP pipeline into two weakly correlated parts,
restoration and enhancement, and proposed a two-stage network to account for
the two independent operations. In this paper, we propose a simple but
effective EEDNet that achieves better performance in both PSNR and visual
quality.
$F_n$ and $F_{n-1}$ denote the required RF of the $n$-th layer and the known RF
of the $(n-1)$-th layer, whose initial value is 1. $k_n$ stands for the kernel
size of the $n$-th layer, and $s_i$ is the stride of layer $i$.
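The recurrence that these symbols belong to is not reproduced in this text. A minimal sketch, assuming the standard form $F_n = F_{n-1} + (k_n - 1)\prod_{i=1}^{n-1} s_i$ described in [2], is given below. The function name, the tuple layout and the handling of dilation (effective kernel $d(k-1)+1$) are illustrative assumptions rather than the authors' code.

```python
def receptive_field(layers):
    """Return the receptive field after a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples ordered
    from the input towards the deepest layer (hypothetical layout).
    """
    rf = 1      # F_0: a single input pixel
    jump = 1    # product of the strides of all earlier layers
    for kernel, stride, dilation in layers:
        # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


two_convs = [(3, 1, 1), (3, 1, 1)]
with_pool = [(3, 1, 1), (2, 2, 1), (3, 1, 1)]
with_dilation = [(3, 1, 1), (3, 1, 2)]

print(receptive_field(two_convs))      # 5
print(receptive_field(with_pool))      # 8
print(receptive_field(with_dilation))  # 7
```

Under this recurrence, downsampling, kernel size, depth and dilation rate each change the RF of the deepest layer, which makes them natural knobs for an RF study.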
In our experiments, we take modified U-Net [19] as a baseline and adjust the
receptive field of the highest level (as shown in the bottom red box in Fig. 1) by
four factors: the number of downsampling operations, the size of the filters, the
depth of each level, and the dilation rate.
– For downsampling, we gradually remove the max pooling layers from top to
bottom. The convolutional layers after a removed pooling layer are removed as
well, which ensures that any PSNR improvement is not obtained by increasing
the computation.
– For kernel size, we randomly select several low-level convolutional layers and
change their kernel sizes from 1 × 1 to 9 × 9 (with a step size of 2) without
changing the architecture.
– For the number of convolutional layers in each level, we randomly remove
some convolutional layers belonging to the highest level. At the same time,
we add corresponding layers at lower levels to guarantee approximately the
same computing cost.
– For dilated convolution, we randomly select one normal convolutional layer
in the encoder and replace it with different dilated convolutional layers.
The results are shown in Fig. 2. In conclusion, for the highest level, the RF
should not be too large, and there is a rough range of favourable RF
configurations (preferably between 10 and 60). This phenomenon is related to
the operations that shrink the receptive field, such as adopting convolutions
with small kernel sizes and cutting the downsampling. Outside this range, the
fidelity changes drastically.
4 Proposed EEDNet
In this section, we introduce EEDNet, inspired by U-Net [19] and the RF
hypothesis. Besides, to make EEDNet focus more on the significant changes of
pixels in the RGB domain, we propose the Channel Attention Residual Dense Block
(CA-RDB) and the ClipL1 loss.
task generally involves both global and local image corrections. Layers belonging
to different levels should have different sensitivity to both high-level properties,
such as brightness or white balance, and low-level features, like textures and
edges. In light of this, we apply the Channel Attention Residual Dense Block
(CA-RDB) to the skip connections, as shown in Fig. 3. Adding channel attention
[5] after the RDBs is a heuristic for making the skip connections focus on
useful information.
For low-level tasks, especially pixel-to-pixel ones, the information in each
pixel of each sample is very important. Therefore, Batch Normalization (BN)
[10], which considers the content of all pictures in a batch, may lose the
unique details of each sample. Similarly, algorithms like Layer Normalization
(LN) [3], which consider correlations across channels, may ignore the
differences between channels. The RAW-to-RGB task is similar to style transfer,
which means that models should focus on the uniqueness of each sample, since
the generated images depend on the corresponding input images. In this case,
Instance Normalization (IN) [22] becomes an ideal choice. Furthermore, to
obtain more effective statistical information, we adopt SN, which combines the
characteristics of BN, LN and IN. In Sect. 5.4, we verify that SN is more
effective than IN. LeakyReLU is applied after each convolutional layer, except
for the last one. Besides, we use nearest-neighbour interpolation for
upsampling to avoid time-consuming deconvolutions.
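The argument about per-sample statistics can be made concrete with a small sketch. It contrasts BN-style and IN-style normalization on toy data, assuming a nested-list layout `batch[n][c][p]` (sample, channel, flattened pixel) and omitting the learned scale and shift; the function names are illustrative and this is not the network code.

```python
import math


def instance_norm(batch, eps=1e-5):
    """Normalize every (sample, channel) plane with its own statistics."""
    out = []
    for sample in batch:
        normed_sample = []
        for channel in sample:
            mean = sum(channel) / len(channel)
            var = sum((v - mean) ** 2 for v in channel) / len(channel)
            scale = 1.0 / math.sqrt(var + eps)
            normed_sample.append([(v - mean) * scale for v in channel])
        out.append(normed_sample)
    return out


def batch_norm(batch, eps=1e-5):
    """Normalize every channel with statistics pooled over the whole batch."""
    num_channels = len(batch[0])
    out = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        pooled = [v for sample in batch for v in sample[c]]
        mean = sum(pooled) / len(pooled)
        var = sum((v - mean) ** 2 for v in pooled) / len(pooled)
        scale = 1.0 / math.sqrt(var + eps)
        for n, sample in enumerate(batch):
            out[n][c] = [(v - mean) * scale for v in sample[c]]
    return out


dark = [[0.1, 0.2, 0.3, 0.4]]    # one sample, one channel, four pixels
bright = [[0.6, 0.7, 0.8, 0.9]]
other = [[0.0, 0.0, 1.0, 1.0]]

# The same "dark" sample, placed in two different batches.
in_a = instance_norm([dark, bright])[0]
in_b = instance_norm([dark, other])[0]
bn_a = batch_norm([dark, bright])[0]
bn_b = batch_norm([dark, other])[0]

print(in_a == in_b)  # True: IN output does not depend on the batch partner
print(bn_a == bn_b)  # False: BN output changes with the batch partner
```

With IN the output for a sample is a function of that sample alone, which matches the requirement that the generated image depend only on its own input.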
Fig. 5. Gradient comparison between L1 loss, L2 loss and ClipL1 loss. The x axis is
the residual value between prediction and ground truth image, and the y axis is the
gradient value
where x is the RGB image reconstructed by our network, and y represents the
ground truth RGB image from the Canon ISP. $c_{\min}$ and $c_{\max}$ are
thresholds for clipping easy samples and outliers. Figure 4 shows the
comparison between the L1 loss, the L2 loss and the ClipL1 loss. As shown
above, we regard every pixel in an image as one sample, and reset it to the
threshold if it is out of the range. The gradient comparison between the
different losses is shown in Fig. 5.
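The loss formula itself is not reproduced in this text. A minimal sketch, assuming from the description above that the per-pixel absolute residual is clipped to the range between the two thresholds and then averaged, is given below. The function names and the threshold values in the example are illustrative assumptions.

```python
def clip_l1_loss(pred, target, c_min, c_max):
    """Mean of per-pixel |pred - target| clipped to [c_min, c_max]."""
    total = 0.0
    for p, t in zip(pred, target):
        residual = abs(p - t)
        total += min(max(residual, c_min), c_max)
    return total / len(pred)


def clip_l1_grad(pred, target, c_min, c_max):
    """Per-pixel derivative of the clipped residual w.r.t. the prediction.

    Pixels whose residual lies outside (c_min, c_max) sit on a flat part of
    the clipped function, so they contribute zero gradient.
    """
    grads = []
    for p, t in zip(pred, target):
        residual = abs(p - t)
        if c_min < residual < c_max:
            grads.append(1.0 if p > t else -1.0)
        else:
            grads.append(0.0)
    return grads


pred = [0.30, 0.70, 1.00]
target = [0.30, 0.20, 0.00]   # residuals: 0.0 (easy), 0.5, 1.0 (outlier)

loss = clip_l1_loss(pred, target, c_min=0.1, c_max=0.8)
grads = clip_l1_grad(pred, target, c_min=0.1, c_max=0.8)
print(grads)  # [0.0, 1.0, 0.0]
```

In this reading, easy pixels and outliers still add a constant to the loss value but do not drive the parameter update, which is consistent with gradients that vanish outside the clipping range.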
178 Y. Zhu et al.
Table 1. The test set results of AIM 2020 Learned Smartphone ISP Challenge Track
1 - Fidelity.
5 Experiments
5.1 Dataset
to 1/255 and 1 in our experiments. Note that since the ClipL1 loss only focuses
on changes within a certain range, it will be sub-optimal when the network is
trained with other losses.
Testing Process. For Track 1, we trained 5 models with the same setting for
ensembling. They are, respectively, a 4-level U-Net with Leaky ReLU and SN
(called Modified U-Net) trained with mean square error (MSE), Modified U-Net
trained with L1, Modified U-Net trained with ClipL1, EEDNet trained with
MSE + 0.8∗MS-SSIM, and EEDNet trained with ClipL1. For Track 2, we only took
the full-resolution output images of EEDNet trained with ClipL1 as the final
submitted results.
5.3 Results
The competition results of Track 1 are shown in Table 1, where we took first
place. Our best single model achieves 21.63 dB on the development set, and the
submitted ensemble model reaches 22.26 dB on the test set. Some of the
full-resolution images are presented in Fig. 6. We still achieved
state-of-the-art results. Compared with PyNET [9], our processed images have
softer lighting, rich colours and no overexposure.
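For reference, the dB figures above are PSNR values. A minimal sketch of the standard definition, PSNR = 10 log10(MAX² / MSE), is given below, assuming images flattened to lists of floats in [0, 1]; the function name and the toy data are illustrative, and the challenge's own evaluation script may differ in details.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


target = [0.0, 0.5, 1.0]
pred = [0.1, 0.6, 0.9]       # every pixel is off by 0.1, so MSE is about 0.01

print(round(psnr(pred, target), 2))  # 20.0
```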
Fig. 6. Comparison of different ISPs. The Canon output photo in the middle is the
ground truth of the RAW-to-RGB task. In particular, in the first row, the photos
processed by our AutoISP are more colorful than those of PyNET [9] and closer to
the Canon camera's output. (Color figure online)
the learning rate is $10^{-4}$ and steps down to $10^{-5}$ at epoch 1500. On the
sRGB data set, we train the model for 1200 epochs with batch size 48, where the
learning rate is $3 \times 10^{-4}$ for the first 10 epochs and then
$3 \times 10^{-5}$ until the training process finishes. The experiments are
performed on an RTX 2080 Ti.
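Both schedules described above are single-step schedules. A minimal sketch is given below, assuming zero-indexed epochs and a drop that takes effect at the boundary epoch; the function name and these conventions are assumptions, not details stated in the text.

```python
def step_lr(epoch, boundary, lr_before, lr_after):
    """Learning rate that drops once, from lr_before to lr_after, at `boundary`."""
    return lr_before if epoch < boundary else lr_after


# First schedule described above: 1e-4, then 1e-5 from epoch 1500.
first_schedule = [step_lr(e, 1500, 1e-4, 1e-5) for e in (0, 1499, 1500)]

# sRGB schedule: 3e-4 for the first 10 epochs, then 3e-5 until epoch 1200.
srgb_schedule = [step_lr(e, 10, 3e-4, 3e-5) for e in (0, 9, 10, 1199)]

print(first_schedule)  # [0.0001, 0.0001, 1e-05]
print(srgb_schedule)   # [0.0003, 0.0003, 3e-05, 3e-05]
```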
The results are shown in Table 2. For the rawRGB task (left), U-Net* clearly
performs better than U-Net in terms of PSNR. Since the SSIM is already high,
there is not much improvement. For the sRGB task (right), however, both the
PSNR and the SSIM of U-Net* are conspicuously higher than those of U-Net.
Therefore, our LRF hypothesis can be applied to the SIDD+ task regardless of
the colour space.
Fig. 7. Overall visual comparisons showing the effect of each component in
EEDNet. Each column represents a model, with its configuration at the top.
6 Conclusion
In this article, we propose and verify that an LRF is not crucially required in
image processing tasks, especially the RAW-to-RGB task and the SIDD task. This
prior knowledge should benefit other basic image processing research. Then, the
ClipL1 loss is proposed to enhance the sensitivity of EEDNet to the RGB colour
space. Finally, our EEDNet further narrows the gap between CNNs and a DSLR's
ISP, making CNNs more likely to replace the ISP on smartphones. Although we
obtain considerable results on the RAW-to-RGB data set, it should be noted that
the data set is relatively small and its scenes are not diverse, making the
model suboptimal in real scenes, such as at night. In future work, we will
expand the RAW-to-RGB data set, enrich its scenes and look for an effective
solution to white balance. At the same time, we will further optimize EEDNet
and reduce its computation so that it can be applied to mobile phones.
References
1. Abdelhamed, A., et al.: NTIRE 2020 challenge on real image denoising: Dataset,
methods and results. arXiv preprint arXiv:2005.04117 (2020)
2. Araujo, A., Norris, W., Sim, J.: Computing receptive fields of convolutional
neural networks. Distill (2019). https://doi.org/10.23915/distill.00021,
https://distill.pub/2019/computing-receptive-fields
3. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint
arXiv:1607.06450 (2016)
4. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab:
Semantic image segmentation with deep convolutional nets, atrous convolution,
and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848
(2017)
5. Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., Chua, T.S.: SCA-CNN:
Spatial and channel-wise attention in convolutional networks for image captioning.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
pp. 5659–5667 (2017)
6. Fan, Y., Yu, J., Liu, D., Huang, T.S.: Scale-wise convolution for image restoration
(2019)
7. Ignatov, A., Patel, J., Timofte, R.: Rendering natural camera bokeh effect with
deep learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) Workshops (June 2020)
8. Ignatov, A., Timofte, R., et al.: AIM 2020 challenge on learned image signal pro-
cessing pipeline. In: European Conference on Computer Vision Workshops (2020)
9. Ignatov, A., Van Gool, L., Timofte, R.: Replacing mobile camera ISP with a single
deep learning model. arXiv preprint arXiv:2002.05509 (2020)
10. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
11. Liang, Z., Cai, J., Cao, Z., Zhang, L.: CameraNet: A two-stage framework for
effective camera ISP learning (2019)
12. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection. In: Proceedings of the IEEE International Conference on Computer
Vision. pp. 2980–2988 (2017)
13. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 3431–3440 (2015)
14. Marnerides, D., Bashford-Rogers, T., Hatchett, J., Debattista, K.: ExpandNet: A
deep convolutional neural network for high dynamic range expansion from low
dynamic range content. Comput. Graph. Forum 37(2), 37–49 (2017)
15. Nah, S., Son, S., Timofte, R., Lee, K.M.: NTIRE 2020 challenge on image and video
deblurring. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) Workshops (2020)
16. Purohit, K., Suin, M., Kandula, P., Ambasamudram, R.: Depth-guided dense
dynamic filtering network for bokeh effect rendering. In: 2019 IEEE/CVF Interna-
tional Conference on Computer Vision Workshop (ICCVW). pp. 3417–3426 (2019)
17. Ratnasingam, S.: Deep camera: A fully convolutional neural network for image
signal processing. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision (ICCV) Workshops (2019)
18. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object
detection with region proposal networks. In: Advances in Neural Information
Processing Systems. pp. 91–99 (2015)
19. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for
biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
20. Sun, Y., Yu, Y., Wang, W.: Moiré photo restoration using multiresolution convo-
lutional neural networks. IEEE Trans. Image Process. 27(8), 4160–4172 (2018)
21. Tao, X., Gao, H., Shen, X., Wang, J., Jia, J.: Scale-recurrent network for deep
image deblurring. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (2018)
22. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: The missing
ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)
23. Wang, X., et al.: ESRGAN: Enhanced super-resolution generative adversarial
networks. In: Proceedings of the European Conference on Computer Vision (ECCV)
Workshops (2018)
24. Yan, Q., et al.: Deep HDR imaging via a non-local network. IEEE Trans. Image
Process. 29, 4308–4322 (2020)
25. Yuan, S., et al.: AIM 2019 challenge on image demoireing: Methods and results.
In: 2019 IEEE/CVF International Conference on Computer Vision Workshop
(ICCVW). pp. 3534–3545 (2019)
26. Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser:
Residual learning of deep CNN for image denoising. IEEE Trans. Image Process.
26(7), 3142–3155 (2017)
27. Zhang, K., Zuo, W., Zhang, L.: FFDNet: Toward a fast and flexible solution for
CNN-based image denoising. IEEE Trans. Image Process. 27(9), 4608–4622 (2018)