
Deep learning-based fusion of Landsat-8 and Sentinel-2 images for a harmonized surface reflectance product

Zhenfeng Shao 1,2, Jiajun Cai 1,2,3, Peng Fu 4,5,*, Leiqiu Hu 6, Tao Liu 7

1 State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China 430072
2 Collaborative Innovation Center for Geospatial Technology, Wuhan University, Wuhan, China 430072
3 Department of Geography and Resource Management, The Chinese University of Hong Kong, Hong Kong, China 999077
4 Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
5 Carl R. Woese Institute of Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
6 Atmospheric Science Department, University of Alabama at Huntsville, Huntsville, AL 35805, USA
7 Geographic Data Science, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA

* Corresponding author
Peng Fu (pengfu@illinois.edu)
Jiajun Cai (cai_jiajun@foxmail.com)

Abstract

Landsat and Sentinel-2 sensors together provide the most widely accessible medium-to-high spatial resolution multispectral data for a wide range of applications, such as vegetation phenology identification, crop yield estimation, and forest disturbance detection. More timely and accurate observations of the Earth's surface and dynamics are expected from the synergistic use of Landsat and Sentinel-2 data, which entails coordinating the spatial resolution gap between Landsat (30 m) and Sentinel-2 (10 m or 20 m) images. However, widely used data fusion techniques may not fulfil the community's needs for generating a temporally dense reflectance product at 10 m spatial resolution from combined Landsat and Sentinel-2 images because of their inherent algorithmic weaknesses. Inspired by the recent advances in deep learning, this study developed an extended super-resolution convolutional neural network (ESRCNN) for a data fusion framework, specifically for blending Landsat-8 Operational Land Imager (OLI) and Sentinel-2 Multispectral Imager (MSI) data. Results demonstrated the effectiveness of the deep learning-based fusion algorithm in yielding a consistent and comparable dataset at 10 m from Landsat-8 and Sentinel-2. Further accuracy assessments revealed that the performance of the fusion network was influenced by both the number of input auxiliary Sentinel-2 images and the temporal interval (i.e., difference in image acquisition dates) between the auxiliary Sentinel-2 images and the target Landsat-8 image. Compared to the benchmark algorithm, area-to-point regression kriging (ATPRK), the deep learning-based fusion framework proved better in the quantitative assessment in terms of RMSE (root mean square error), correlation coefficient (CC), universal image quality index (UIQI), relative global-dimensional synthesis error (ERGAS), and spectral angle mapper (SAM). ESRCNN also better preserved the reflectance distribution of the original image than ATPRK, resulting in improved image quality. Overall, the developed data fusion network that blends Landsat-8 and Sentinel-2 images has the potential to help generate continuous reflectance observations at a higher temporal frequency than can be obtained from a single Landsat-like sensor.

Keywords: Deep learning, Data fusion, Landsat-8, Sentinel-2, continuous monitoring

1 Introduction

The consistent Landsat records of the Earth's surface and dynamics at 30 m spatial resolution represent hitherto the longest space-based observations, dating back to the 1970s (Roy et al., 2014). The opening of the Landsat archive in 2008 (Woodcock et al., 2008) has fostered many previously unimaginable environmental applications based on time series satellite images, such as near-real time disturbance detection (Verbesselt et al., 2012) and continuous land cover change detection and classification (Zhu & Woodcock, 2014). Currently, it has become the norm to use all available time series images (dating back to the 1980s for reflectance products) at a given location for a wide range of applications, from characterizing forest disturbance (Kim et al., 2014) and vegetation phenology (Senf et al., 2017) to revealing urbanization-induced land use and land cover changes (Fu & Weng, 2016a). Despite the popularity of time series analysis of Landsat images, their usability is often limited by the presence of clouds, shadows, and other adverse atmospheric effects (Fu & Weng, 2016b); thus, the actual temporal revisit cycle of usable Landsat images is irregular and sometimes longer than 16 days. The missing temporal information can hinder applications that require near-daily or multi-day imagery at medium spatial resolution (~30 m), e.g., crop yield estimation (Claverie et al., 2012), flood response (Skakun et al., 2014), vegetation phenology identification (Melaas et al., 2013), and forest disturbance detection (White et al., 2017).

Synergies between Landsat Operational Land Imager (OLI) and Sentinel-2 Multispectral Imager (MSI) data are promising for fulfilling the community's needs for high-temporal-resolution images at medium spatial scale. With the launch of the Sentinel-2A and -2B satellites in 2015 and 2017, respectively, the combined Landsat-8 OLI and Sentinel-2 MSI dual system can provide dense global observations at a nominal revisit interval of 2-3 days. Each MSI onboard the twin Sentinel-2 satellites (Sentinel-2A and -2B) acquires images covering thirteen spectral bands (Table 1) at spatial resolutions of 10 m (four visible and near-infrared bands), 20 m (six red edge and shortwave infrared bands), and 60 m (three atmospheric correction bands) every 10 days (Drusch et al., 2012; Malenovský et al., 2012). The MSI images are complementary to images captured by the Landsat OLI sensor since the two instruments share similarities in band specifications (Table 1).

The combination of Landsat and Sentinel-2 reflectance products enriches the temporal information. For example, we can generate 5-day composite products, much more frequent than those from Landsat or Sentinel-2 individually. However, the disparity in spatial resolution between Landsat-8 and Sentinel-2 data remains unresolved. One common but rather simple approach is to sacrifice the spatial resolution of the reflectance products by resampling the finer-resolution data to match the coarser one (Wang et al., 2017). This resampling approach has also been adopted by the NASA Harmonized Landsat and Sentinel-2 (HLS) project to produce temporally dense reflectance products at 30 m. Thus, valuable information provided by the 10 m Sentinel-2 data is wasted. In this study, we aim to develop a data fusion approach that takes full advantage of the available temporal and spatial information in both Landsat-8 and Sentinel-2 images to generate a reflectance product at a finer spatial resolution of 10 m.

Various algorithms have been developed in the past for image fusion to obtain more frequent Landsat-like data, such as the spatial and temporal adaptive reflectance fusion model (STARFM) (Gao et al., 2006) and its variants (Hilker et al., 2009; Zhu et al., 2010; Zhu et al., 2016), unmixing-based data fusion (Gevaert & García-Haro, 2015), wavelet transformation (Acerbi-Junior et al., 2006), sparse representation (Wei et al., 2015), and smoothing filter-based intensity modulation (SFIM) (Liu, 2000). Among these data fusion techniques, STARFM and its variants are probably the most popular fusion algorithms for generating synthetic surface reflectance at both high spatial and temporal resolutions due to their robust prediction performance (Emelyanova et al., 2013). Nevertheless, STARFM and its variants require at least one pair of fine and coarse spatial resolution images (e.g., Landsat and MODIS images) acquired on the same day as inputs to implement the downscaling process. This requisite makes the STARFM framework unsuitable for fusing Landsat-8 and Sentinel-2 images, which often revisit the same target on different days. Recently, Wang et al. (2017) used area-to-point regression kriging (ATPRK) to yield downscaled Landsat-8 images at 10 m and suggested that the ATPRK approach outperformed STARFM in downscaling reflectance. However, ATPRK, as a geostatistical fusion approach, involves complex semi-variogram modeling from the cokriging matrix that is computationally unrealistic for a large domain due to its sheer size. Moreover, ATPRK may not be suitable for areas experiencing rapid land cover changes since its performance relies on input covariates (i.e., spectral bands) that may come from a subjectively selected image on a specific date. Within a short time period (e.g., 2 weeks), the collected Sentinel-2 and Landsat data for a given location may all contain valuable information for image fusion. In contrast, the ATPRK algorithm does not have the flexibility to accommodate a varying number of input images within a specific study period for reflectance prediction. Given the varying number of input images, a new fusion algorithm is needed to automatically select the best features from one or more images to perform reflectance prediction at high spatial resolution (Das & Ghosh, 2016).

The recent advances in deep learning make it promising for addressing the spatial gap between Landsat-8 and Sentinel-2 data, potentially leading to improved image fusion performance over existing algorithms (Masi et al., 2016; Yuan et al., 2018). Deep learning is a fully data-driven approach that can automatically transform the representation at one level into a representation at a higher, slightly more abstract level (LeCun et al., 2015; Schmidhuber, 2015), thus facilitating data predictions at different spatial scales. In particular, convolutional neural networks (CNNs) consist of a series of convolution filters that can extract hierarchical contextual image features (Krizhevsky et al., 2012). As a popular form of deep learning networks, they have been widely used in image classification, object recognition, and natural language processing due to their powerful feature learning ability (Audebert et al., 2017; Hirschberg & Manning, 2015; Simonyan & Zisserman, 2014). Inspired by these successful applications of CNNs, this study extended a super-resolution CNN (SRCNN) to address the gap in spatial resolution between Landsat-8 and Sentinel-2 images. More specifically, the deep learning-based framework was used to downscale the Landsat-8 image of 30 m spatial resolution to 10 m by using Sentinel-2 spectral bands at 10 m and 20 m. Given the better performance of ATPRK in image fusion over previous pan-sharpening and spatiotemporal fusion algorithms (Wang et al., 2017), this study used ATPRK as the benchmark algorithm to assess the effectiveness of the deep learning-based fusion approach.
Table 1. Band specifications for Landsat-8 and Sentinel-2 images

Landsat-8                                          Sentinel-2
Band           Wavelength (nm)  Resolution (m)     Band               Wavelength (nm)  Resolution (m)
1 (Coastal)    430-450          30                 1 (Coastal)        433-453          60
2 (Blue)       450-515          30                 2 (Blue)           458-523          10
3 (Green)      525-600          30                 3 (Green)          543-578          10
4 (Red)        630-680          30                 4 (Red)            650-680          10
-              -                -                  5 (Red Edge)       698-713          20
-              -                -                  6 (Red Edge)       733-748          20
-              -                -                  7 (Red Edge)       773-793          20
5 (NIR)        845-885          30                 8 (NIR)            785-900          10
-              -                -                  9 (Water vapor)    935-955          60
-              -                -                  10 (SWIR-Cirrus)   1360-1390        60
6 (SWIR-1)     1560-1660        30                 11 (SWIR-1)        1565-1655        20
7 (SWIR-2)     2100-2300        30                 12 (SWIR-2)        2100-2280        20
8 (PAN)        503-676          15                 -                  -                -

2 Methodology

2.1 Convolutional neural network

As a popular neural network form in deep learning, CNNs leverage three important ideas, namely sparse interactions, parameter sharing, and sub-sampling, to help improve a machine learning system (He et al., 2015; Krizhevsky et al., 2012; Ouyang et al., 2015; Zhang et al., 2014; Sun et al., 2014). In CNNs, it is impractical to connect all neurons across network layers for images of high dimensionality. Thus, neurons in CNNs are connected only to a local region of the original image (i.e., sparse interactions) through a hyperparameter called the receptive field of a neuron (equivalent to the filter size, as shown in Figure 1). In other words, sparse interactions indicate that the value of each neuron within a convolution layer is calculated from the neuron values of its spatially neighboring region in the previous layer. Parameter sharing ensures that the weights of a convolutional kernel remain the same when they are applied to the input data at different locations. In this way, the number of parameters in the network decreases significantly. Figure 1 illustrates the concepts of local receptive fields (red rectangles) and parameter sharing (i.e., weight values w1, w2, and w3 are equal: w1 = w2 = w3). A typical layer of a convolutional network includes three phases, as shown in Figure 2. In the first phase, a series of convolutions are applied in parallel to produce linear activations. In the second phase, each linear activation from the first phase is passed to a nonlinear activation function, such as Sigmoid, Tanh, or ReLU (Rectified Linear Unit) (Nair & Hinton, 2010). In the third phase, a pooling function configured in a so-called sub-sampling layer is adopted to perform local averaging and sub-sampling, reducing the sensitivity of the output to shifts and distortions as well as the computational complexity of the model. The output of the operation in each phase is called a feature map.

Mathematically, the first two phases are expressed in equation (1):

y^{(j)} = f\left( b^{(j)} + \sum_{i} w^{(i)(j)} * x^{(i)} \right)    (1)

where x^{(i)} and y^{(j)} are the i-th input feature map and the j-th output feature map of a convolutional layer, w^{(i)(j)} is the convolutional kernel applied to x^{(i)}, and b^{(j)} denotes the bias. The convolutional operator is indicated by the symbol *, and f denotes the nonlinear activation function. The third phase, max pooling (or sub-sampling), is expressed in equation (2):

y_{i,j} = \max_{(m,n) \in s \times s} \left( x_{m,n} \right)    (2)

where y_{i,j} is the neuron value at (i, j) in the output layer, and m and n index the pixel locations around the center neuron at (i, j) within a spatial extent of s × s (also known as an image patch). More specifically, equation (2) dictates that the value at location (i, j) is assigned the maximal value over the spatial extent of s × s in the input layer.
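To make the three phases concrete, the following minimal sketch chains a convolution, a ReLU activation, and a 2 × 2 max pooling, corresponding to equations (1) and (2). It is written in PyTorch purely for illustration (this section does not specify any framework), and the channel counts and kernel sizes are arbitrary examples.

```python
# Illustrative only: one CNN layer with the three phases of equations (1)-(2).
import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1),  # phase 1: convolution (eq. 1)
    nn.ReLU(),                                                            # phase 2: nonlinear activation
    nn.MaxPool2d(kernel_size=2),                                          # phase 3: s x s max pooling (eq. 2)
)

x = torch.rand(1, 3, 64, 64)   # a dummy 3-band image patch
y = layer(x)                   # feature maps of shape (1, 8, 32, 32)
print(y.shape)
```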

Figure 1. Schematic illustration of local receptive fields (red rectangles) and parameter sharing (weight values w1, w2, and w3 are equal) between layers in convolutional neural networks.

Figure 2. The three phases, i.e., convolution, nonlinear activation, and pooling, in a typical convolutional neural network layer.

2.2 Extending super-resolution convolutional neural network (SRCNN) for image fusion

Although the CNN was initially designed to predict categorical variables (i.e., classification), it has recently been modified to output continuous variables (i.e., regression) for data fusion (e.g., Dong et al., 2016; Shao & Cai, 2018). Dong et al. (2016) proposed SRCNN to reconstruct a high-resolution image directly from a low-resolution image. As the fusion of Sentinel-2 and Landsat-8 images aims to improve the spatial resolution of Landsat-8 imagery from 30 m to 10 m, it shares some similarities with super-resolution reconstruction (SR). However, SR cannot reconstruct spatial details within image pixels when coarse resolution images are used as the only inputs for predicting images at fine spatial resolution. In this study, an extended SRCNN (ESRCNN) was developed to fuse data with different resolutions, where the high-resolution Sentinel-2 images are treated as auxiliary data for downscaling the low-resolution Landsat-8 image. The ESRCNN framework consists of three layers. The first layer takes inputs with N1 channels and calculates N2 feature maps using a k1 × k1 receptive field and a nonlinear ReLU activation. The second layer computes N3 feature maps with the aid of a k2 × k2 receptive field and ReLU. Finally, the third layer outputs fusion results with N4 channels based on a k3 × k3 receptive field. In summary, these layers can be expressed in equation (3).

F_1(x) = \max(0, b_1 + w_1 * x),        w_1: N_2 \times (k_1 \times k_1 \times N_1),  b_1: N_2 \times 1
F_2(x) = \max(0, b_2 + w_2 * F_1(x)),   w_2: N_3 \times (k_2 \times k_2 \times N_2),  b_2: N_3 \times 1
F_3(x) = b_3 + w_3 * F_2(x),            w_3: N_4 \times (k_3 \times k_3 \times N_3),  b_3: N_4 \times 1    (3)

where the max function returns the larger of the two values in brackets.
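The three-layer structure of equation (3) translates directly into code. The following is a minimal sketch in PyTorch (chosen only for illustration; the original work does not specify the implementation framework). The kernel sizes (9, 1, 5) and feature-map counts (64, 32) follow the SRCNN settings cited later in this section; the padding that keeps the output the same size as the input is an assumption.

```python
# A minimal ESRCNN sketch: three convolutional layers, ReLU after the first two (eq. 3).
import torch
import torch.nn as nn

class ESRCNN(nn.Module):
    def __init__(self, n1_in, n4_out, n2=64, n3=32):
        super().__init__()
        self.layer1 = nn.Conv2d(n1_in, n2, kernel_size=9, padding=4)   # k1 = 9
        self.layer2 = nn.Conv2d(n2, n3, kernel_size=1, padding=0)      # k2 = 1
        self.layer3 = nn.Conv2d(n3, n4_out, kernel_size=5, padding=2)  # k3 = 5
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.layer1(x))  # F1(x) = max(0, b1 + w1 * x)
        x = self.relu(self.layer2(x))  # F2(x) = max(0, b2 + w2 * F1(x))
        return self.layer3(x)          # F3(x) = b3 + w3 * F2(x), no activation
```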

The workflow of ESRCNN for fusing Landsat-8 and Sentinel-2 images is shown in Figure 3. The downscaling procedure is divided into two parts: self-adaptive fusion of Sentinel-2 images and multi-temporal fusion of Landsat-8 and Sentinel-2 images. First, Sentinel-2 spectral bands 11 and 12 (SWIR-1/2) at 20 m are downscaled to 10 m by feeding ESRCNN with bands 2-4 (B, G, and R) and 8 (NIR) at 10 m, together with bands 11 and 12 resampled to 10 m using nearest neighbor interpolation. Second, Landsat-8 bands 1-7 are downscaled through ESRCNN by using the Landsat-8 panchromatic band (15 m) and the Sentinel-2 data sets (bands 2-4, 8, and 11-12 at 10 m). The fusion network in this step can accommodate a flexible number of Sentinel-2 images as auxiliary data sets. Inputs to ESRCNN include multi-temporal Sentinel-2 images (10 m) captured on days relatively close to the target Landsat-8 image, as well as Landsat-8 bands 1-7 resampled to 10 m with nearest neighbor interpolation. The incorporation of resampled bands by the fusion network in both steps ensures that information from coarse resolution images is learned by the deep learning network and used for downscaling. Overall, the ESRCNN fusion workflow illustrated in Figure 3 can be summarized in four steps (a code sketch of the pipeline follows the list):

(1) The 20 m Sentinel-2 bands 11-12 are resampled to 10 m using nearest neighbor interpolation;

(2) the resampled Sentinel-2 bands 11-12 and bands 2-4, 8 at 10 m are fed into the ESRCNN to generate downscaled Sentinel-2 bands 11-12 at 10 m (the self-adaptive fusion process);

(3) the Landsat-8 bands 1-7 at 30 m and band 8 (PAN band) at 15 m are resampled to 10 m using nearest neighbor interpolation;

(4) the resampled Landsat-8 images and Sentinel-2 images at 10 m are used as inputs to the ESRCNN to generate Landsat-8 bands 1-7 at 10 m (i.e., multi-temporal fusion of Landsat-8 and Sentinel-2 images).
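The sketch below strings the four steps together. It is illustrative only: the helper nearest_resize and the function fuse_landsat8 are hypothetical names, the two trained networks (self_adaptive_net, multitemporal_net) are assumed to exist, and every image is treated as a (bands, height, width) NumPy array on the appropriate grid.

```python
# A hedged sketch of the four-step ESRCNN workflow (not the authors' released code).
import numpy as np
import torch
import torch.nn.functional as F

def nearest_resize(bands, size):
    """Nearest-neighbour resampling of a (C, H, W) array to a target (H, W) grid."""
    t = torch.from_numpy(bands).unsqueeze(0).float()
    return F.interpolate(t, size=size, mode="nearest").squeeze(0).numpy()

def fuse_landsat8(l8_30m, l8_pan_15m, s2_10m_list, s2_20m_list,
                  self_adaptive_net, multitemporal_net):
    grid = s2_10m_list[0].shape[1:]                      # the 10 m pixel grid

    # Steps 1-2: self-adaptive fusion -> Sentinel-2 bands 11-12 at 10 m, per acquisition date
    s2_sharp = []
    for b10, b20 in zip(s2_10m_list, s2_20m_list):
        x = np.concatenate([b10, nearest_resize(b20, grid)], axis=0)   # bands 2-4, 8, 11-12
        with torch.no_grad():
            swir = self_adaptive_net(torch.from_numpy(x)[None].float())
        s2_sharp.append(np.concatenate([b10, swir.squeeze(0).numpy()], axis=0))

    # Step 3: resample Landsat-8 bands 1-7 (30 m) and the PAN band (15 m) to the 10 m grid
    l8_up = nearest_resize(l8_30m, grid)
    pan_up = nearest_resize(l8_pan_15m, grid)

    # Step 4: multi-temporal fusion -> Landsat-8 bands 1-7 at 10 m
    x = np.concatenate([l8_up, pan_up] + s2_sharp, axis=0)
    with torch.no_grad():
        fused = multitemporal_net(torch.from_numpy(x)[None].float())
    return fused.squeeze(0).numpy()
```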

In this study, the parameters of the fusion network were configured following SRCNN (Dong et al., 2016). More specifically, the kernel sizes (k1, k2, and k3) of the three layers were 9, 1, and 5, respectively. The numbers of output feature maps (N2 and N3) in the first two layers were 64 and 32, respectively. The numbers of input and output feature maps in the self-adaptive fusion were 6 (bands 2-4, 8, and 11-12) and 2 (bands 11-12), respectively. For multi-temporal fusion of Sentinel-2 and Landsat-8 images, the number of output feature maps was 7 (Landsat-8 bands 1-7), while the number of input feature maps depended on the number of auxiliary Sentinel-2 images used in the fusion network. For instance, if only one Sentinel-2 image was used for downscaling, N1 would be 14 (Sentinel-2 bands 2-4, 8, 11-12 and Landsat-8 bands 1-8).
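Using the illustrative ESRCNN class sketched in Section 2.2, these channel counts would be instantiated as follows (the variable names are hypothetical).

```python
# Self-adaptive fusion: 6 input bands (2-4, 8, 11-12) -> 2 outputs (11-12).
self_adaptive_net = ESRCNN(n1_in=6, n4_out=2)

# Multi-temporal fusion with one auxiliary Sentinel-2 image:
# 14 inputs (6 Sentinel-2 bands + Landsat-8 bands 1-8) -> 7 outputs (Landsat-8 bands 1-7),
# i.e., N1 = 8 + 6 * (number of auxiliary Sentinel-2 images).
multitemporal_net = ESRCNN(n1_in=14, n4_out=7)
```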

After the preparation of algorithm inputs and network configurations, the training stage for image fusion follows. Assume x is the input of the network and y is the desired output (or known reference data). Then, the training set can be described as \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{M}, where M is the number of training samples. The network training learns a mapping function f: \hat{y} = f(x; w), where \hat{y} is the predicted output and w is the set of all parameters, including filter weights and biases. With the help of an optimization function (equation 4), the fusion network learns to reduce the prediction error. Generally, for a deep learning network, the mean square error (equation 4) is used as the optimization function (also known as the loss function):

L = \frac{1}{n} \sum_{i=1}^{n} \left\| y^{(i)} - f(x^{(i)}; w) \right\|^2    (4)

where n is the number of training samples used in each iteration, L refers to the prediction error, y^{(i)} is the reference (desired output), f(x^{(i)}; w) is the predicted output, and || · || denotes the \ell_2 norm. Stochastic gradient descent with standard backpropagation (LeCun et al., 1998) is used for optimization. The weights are updated by equation (5):

\Delta_{t+1} = \mu \cdot \Delta_t - \eta \cdot \frac{\partial L}{\partial w_t^l}, \qquad w_{t+1}^l = w_t^l + \Delta_{t+1}    (5)

where \eta and \mu are the learning rate and momentum, \partial L / \partial w_t^l is the derivative of the loss with respect to the weights of layer l, and \Delta_t refers to the intermediate update value at iteration t. Following Dong et al. (2016), \eta is set to 10^{-4} (except for the last layer, where \eta is set to 10^{-5} for accelerating convergence), \mu is set to 0.9, the patch size (the spatial size of the training image patches) is 32 by 32 pixels (Supporting Information Table S1), and the batch size (a term used in machine learning to refer to the number of training samples in one iteration) is 128.
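The following training-loop sketch wires these settings together: the MSE loss of equation (4), stochastic gradient descent with momentum 0.9 as in equation (5), the smaller learning rate for the last layer, and a batch size of 128. It assumes the illustrative ESRCNN class from Section 2.2 and a hypothetical patch_loader that yields 32 × 32 input/reference patch pairs.

```python
# A minimal training loop matching the stated configuration (sketch, not the authors' code).
import torch
import torch.nn as nn

def train(net, patch_loader, n_epochs=100):
    criterion = nn.MSELoss()                               # eq. (4)
    optimizer = torch.optim.SGD(
        [{"params": net.layer1.parameters()},
         {"params": net.layer2.parameters()},
         {"params": net.layer3.parameters(), "lr": 1e-5}], # smaller rate for the last layer
        lr=1e-4, momentum=0.9)                             # eq. (5) with momentum mu = 0.9
    for epoch in range(n_epochs):
        for x, y in patch_loader:      # x: input patches, y: reference patches (batch of 128)
            optimizer.zero_grad()
            loss = criterion(net(x), y)
            loss.backward()
            optimizer.step()
    return net
```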

Figure 3. The workflow of the extended SRCNN (ESRCNN) for fusing Landsat-8 and Sentinel-2 images. Conv indicates a convolutional layer and ReLU (rectified linear unit) represents the nonlinear activation layer.

2.3 ATPRK for fusion of Landsat-8 and Sentinel-2 images

ATPRK is among the first methods used to downscale Landsat-8 images from 30 m to 10 m spatial resolution. The ATPRK approach performs the downscaling primarily through an area-to-point kriging (ATPK)-based residual downscaling scheme, used synergistically with regression-based overall trend estimation. The ATPRK approach can be regarded as an extension of either regression kriging or ATPK. Further technical details of this downscaling approach can be found in Wang et al. (2016). In this study, the effectiveness of the fusion network ESRCNN was benchmarked against ATPRK, which has outperformed other data fusion algorithms (Wang et al., 2017).
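For orientation only, the sketch below illustrates the regression-trend-plus-residual idea behind ATPRK under strong simplifications: the trend is a linear regression between the coarse Landsat-8 band and the aggregated Sentinel-2 covariates, and the residuals are simply upsampled bilinearly. ATPRK proper instead downscales the residuals with area-to-point kriging based on a fitted semivariogram (see Wang et al., 2016), which is not implemented here.

```python
# Drastically simplified illustration of regression trend + residual correction
# (bilinear upsampling stands in for area-to-point kriging of the residuals).
import numpy as np
from numpy.linalg import lstsq
from scipy.ndimage import zoom

def regression_residual_downscale(l8_coarse, s2_fine, factor=3):
    # aggregate the fine covariates to the coarse grid (mean within each coarse pixel)
    c, H, W = s2_fine.shape
    s2_coarse = s2_fine.reshape(c, H // factor, factor, W // factor, factor).mean(axis=(2, 4))

    # fit the linear trend at the coarse scale: l8 ~ a0 + a . s2
    X = np.column_stack([np.ones(s2_coarse[0].size)] + [b.ravel() for b in s2_coarse])
    coef, *_ = lstsq(X, l8_coarse.ravel(), rcond=None)

    # apply the trend at the fine scale
    Xf = np.column_stack([np.ones(s2_fine[0].size)] + [b.ravel() for b in s2_fine])
    trend_fine = (Xf @ coef).reshape(H, W)

    # residual correction (simplified; ATPRK uses ATPK weights here)
    residual_coarse = l8_coarse - (X @ coef).reshape(l8_coarse.shape)
    residual_fine = zoom(residual_coarse, factor, order=1)
    return trend_fine + residual_fine
```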

2.4 Evaluation metrics for data fusion

The performance of the deep learning-based fusion framework and ATPRK was assessed based on five indicators: the spectral angle mapper (SAM) (Alparone et al., 2007), root-mean-square error (RMSE), relative global-dimensional synthesis error (ERGAS) (Ranchin & Wald, 2000), correlation coefficient (CC), and universal image quality index (UIQI) (Wang & Bovik, 2002). These indicators have been widely used in assessing the performance of data fusion algorithms such as ATPRK (Wang et al., 2017).

The SAM (equation 6) measures the spectral angle between two vectors:

\mathrm{SAM}(v, \hat{v}) = \arccos\left( \frac{\langle v, \hat{v} \rangle}{\| v \|_2 \cdot \| \hat{v} \|_2} \right)    (6)

where v is the pixel vector formed by the reference image and \hat{v} is the corresponding vector formed by the fused image. SAM is first calculated on a per-pixel basis, and then all SAM values are averaged to a single value for the whole image.

The RMSE (equation 7) is used to evaluate the overall spectral differences between the reference image R and the fused image F:

\mathrm{RMSE}(R, F) = \sqrt{ \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ R(i, j) - F(i, j) \right]^2 }    (7)

The ERGAS provides a global quality evaluation of the fusion result and is calculated as in equation 8:

\mathrm{ERGAS} = 100 \cdot \frac{h}{l} \sqrt{ \frac{1}{k} \sum_{i=1}^{k} \left[ \mathrm{RMSE}(i) / \mathrm{Mean}(i) \right]^2 }    (8)

where h/l is the ratio between the spatial resolution of the fused Landsat-8 image and that of the original Landsat-8 image, k is the number of bands of the fused image, Mean(i) is the mean value of the ith band of the reference image, and RMSE(i) is the root mean squared error between the ith band of the reference image and that of the fused image.

The CC (equation 9) indicates the spectral correlation between the reference image R and the fused image F:

\mathrm{CC}(R, F) = \frac{ \sum_{i=1}^{M} \sum_{j=1}^{N} [R(i, j) - \mu(R)][F(i, j) - \mu(F)] }{ \sqrt{ \sum_{i=1}^{M} \sum_{j=1}^{N} [R(i, j) - \mu(R)]^2 \sum_{i=1}^{M} \sum_{j=1}^{N} [F(i, j) - \mu(F)]^2 } }    (9)

The UIQI (equation 10) is a global fusion performance indicator:

Q(R, F) = \frac{ 4 \sigma_{RF} \cdot \mu(R) \cdot \mu(F) }{ (\sigma_R^2 + \sigma_F^2)(\mu(R)^2 + \mu(F)^2) }    (10)

where \mu refers to the mean value, \sigma represents the standard deviation, and \sigma_{RF} is the covariance between R and F. To calculate UIQI for the whole image, a sliding window was used to increase the differentiation capability and measure the local distortion of a fused image. The final UIQI was obtained by averaging all Q values over the sliding windows. The ideal values for SAM, ERGAS, RMSE, CC, and UIQI are 0, 0, 0, 1, and 1, respectively, attained if and only if Ri = Fi for all i = 1, 2, ..., N (where N is the number of pixels).
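The five metrics of equations (6)-(10) can be computed per band (or per image) with a few lines of NumPy, as in the hedged sketch below. The UIQI here is the global form of the Wang & Bovik index; the sliding-window averaging described above is omitted, and the ratio argument of ergas corresponds to h/l.

```python
# Illustrative NumPy implementations of SAM, RMSE, CC, UIQI, and ERGAS (eqs. 6-10).
import numpy as np

def sam_deg(ref, fused):
    """Mean spectral angle in degrees; ref and fused are (bands, H, W) arrays."""
    dot = (ref * fused).sum(axis=0)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(fused, axis=0)
    angle = np.arccos(np.clip(dot / np.maximum(norms, 1e-12), -1.0, 1.0))
    return np.degrees(angle).mean()

def rmse(ref, fused):
    return np.sqrt(np.mean((ref - fused) ** 2))

def cc(ref, fused):
    return np.corrcoef(ref.ravel(), fused.ravel())[0, 1]

def uiqi(ref, fused):
    r, f = ref.ravel(), fused.ravel()
    cov = np.cov(r, f)                  # 2 x 2 covariance matrix
    return 4 * cov[0, 1] * r.mean() * f.mean() / (
        (cov[0, 0] + cov[1, 1]) * (r.mean() ** 2 + f.mean() ** 2))

def ergas(ref, fused, ratio):
    """ratio = fine resolution / coarse resolution, e.g. 10/30 for Landsat-8 downscaling."""
    terms = [(rmse(ref[i], fused[i]) / ref[i].mean()) ** 2 for i in range(ref.shape[0])]
    return 100.0 * ratio * np.sqrt(np.mean(terms))
```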

3 Study area, satellite data, and synthetic data sets
The study area covers parts of Shijiazhuang, the capital and largest city of North China's Hebei Province. It is a city of agriculture and industry and has experienced dramatic growth in population and urban extent since the founding of the People's Republic of China in 1949. Two areas of Shijiazhuang, shown in Figure 4 and Figure 5, were used for training and testing the fusion network, respectively. The two areas contain diverse land cover types (e.g., urban areas, cropland, and forest) and exhibited significant changes in spectral reflectance within the study period (June 15 to July 7, 2017). Figure 4 shows the area with a spatial extent of 19.32 km × 25.20 km, i.e., the 10 m and 20 m Sentinel-2 bands have 1932 × 2520 and 966 × 1260 pixels, respectively, while the 30 m and 15 m Landsat-8 bands have 644 × 830 and 1288 × 1660 pixels, respectively. Figure 5 shows the study area with a spatial extent of 12 km × 12 km, i.e., the 10 m and 20 m Sentinel-2 bands have 1200 × 1200 and 600 × 600 pixels, respectively, while the 30 m and 15 m Landsat-8 bands have 400 × 400 and 800 × 800 pixels, respectively. Three subareas (1, 2, and 3) shown in Figure 4 and two subareas in Figure 5 were selected for visual and quantitative assessment of the fusion network in downscaling Landsat-8 images from low to high spatial resolution.

Sentinel-2 is an Earth observation mission of the European Space Agency (ESA), developed under the European Union Copernicus Programme to acquire optical imagery at high spatial resolution. In this study, Sentinel-2 images acquired on June 20, June 27, and July 7, 2017 were used as auxiliary inputs for the fusion framework. These Sentinel-2 images were from the collection of Level-1C products, which provide top of atmosphere (TOA) reflectance with sub-pixel multispectral registration accuracy in the UTM/WGS84 projection system (Drusch et al., 2012; Claverie et al., 2016, 2018). Two scenes of NASA/USGS Landsat-8 TOA reflectance data (L1TP) acquired on June 15 and July 1, 2017 were also collected. The Landsat-8 images were processed with standard terrain correction in the UTM/WGS84 projection system (USGS, 2015). Both the Landsat-8 and Sentinel-2 images were captured under clear-sky conditions, and these dates were selected to illustrate the effectiveness of the fusion network in yielding downscaled Landsat-8 images at 10 m spatial resolution that can be combined with Sentinel-2 images to generate more frequent and consistent observations.

Naturally, for the self-adaptive fusion, the fusion network can downscale the Sentinel-2 bands 11 and 12 from 20 m to 10 m with the Sentinel-2 bands 2-4 and 8 as inputs; however, reference data at 10 m are not available to evaluate the performance of the fusion network. Thus, in this study, synthetic data sets were constructed per Wald's protocol (Wald et al., 1997). Specifically, the Sentinel-2 spectral bands 2-4 and 8 were degraded by a factor of 2 using a Gaussian model-based degradation function (also known as the point spread function, PSF). The resampled Sentinel-2 bands 2-4 and 8 at 20 m and bands 11-12 at 40 m were then fused in the proposed fusion network to yield bands 11-12 at 20 m. The original Sentinel-2 bands 11-12 at 20 m were used for network training and validation. Note that the input data sets for the fusion framework should all be of the same size (spatial extent and pixel size). In other words, the Sentinel-2 bands at 40 m should be interpolated (nearest neighbor interpolation) to 20 m and used as inputs for the proposed fusion algorithm. With data augmentation (rotation and scaling), 36,864 patches were randomly selected for training and 9,216 patches for validating the fusion network (further summarized in Supporting Information Table S2).
Following the self-adaptive fusion of Sentinel-2, the Sentinel-2 spectral bands 2-4, 8, 11, and 12 at 10 m were used as auxiliary data for downscaling the Landsat-8 spectral bands from 30 m to 10 m. All input images were degraded by a factor of 3 since reference Landsat-8 bands at 10 m were not available. Specifically, the Sentinel-2 bands at 10 m and the Landsat-8 spectral and panchromatic bands at 30 m and 15 m were resampled to 30 m, 90 m, and 45 m, respectively. These resampled data sets were then used as inputs to the fusion framework to yield Landsat-8 bands 1-7 at 30 m. The original Landsat-8 bands 1-7 at 30 m were used for training and validating the fusion network. As in the self-adaptive fusion, all the input images should be of the same size (spatial extent and pixel size). In the multi-temporal fusion step, 12,160 patches were randomly selected for training and 3,040 patches for validating the fusion network (further summarized in Supporting Information Table S2).
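The Wald-protocol degradation used to build both synthetic data sets (factor 2 for the self-adaptive fusion, factor 3 for the multi-temporal fusion) can be sketched as follows: blur with a Gaussian PSF, decimate by the scale factor, and bring the coarse band back to the working grid with nearest neighbor interpolation. The Gaussian width is an assumption; the text only states that a Gaussian PSF was used.

```python
# Sketch of Wald's-protocol degradation and nearest-neighbour regridding.
from scipy.ndimage import gaussian_filter, zoom

def degrade(band, factor, sigma=1.0):
    """Simulate a coarser-resolution version of a 2-D band (Gaussian PSF + decimation)."""
    blurred = gaussian_filter(band, sigma=sigma)
    return blurred[::factor, ::factor]

def to_working_grid(coarse_band, factor):
    """Nearest-neighbour interpolation back to the fine pixel grid."""
    return zoom(coarse_band, factor, order=0)

# e.g., Sentinel-2 bands 2-4, 8 at 10 m -> 20 m:   s2_20m = degrade(s2_10m_band, factor=2)
#       Landsat-8 bands at 30 m -> 90 m:           l8_90m = degrade(l8_30m_band, factor=3)
```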

Figure 4. Landsat-8 and Sentinel-2 data sets used in the training phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2, and 3) were used for visual assessment.

Figure 5. Landsat-8 and Sentinel-2 data sets used in the testing phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2) were used for evaluating the fusion network performance.

4 Results

4.1 Self-adaptive fusion of Sentinel-2 images

Figure 6 shows the self-adaptive fusion results for the Sentinel-2 data acquired on June 20, June 27, and July 7, 2017 for subarea 1 in Figure 4. Visually, the fused 20 m Sentinel-2 bands 11-12 (Figure 6 (c)) were almost identical to the original 20 m Sentinel-2 bands 11-12 (Figure 6 (b)), indicating the good performance of the proposed fusion framework. In addition, the fusion framework revealed much finer spatial details, as shown in the fused 10 m Sentinel-2 bands 11-12 (Figure 6 (d)). Table 2 presents the accuracy assessment of the self-adaptive fusion results for the training area (Figure 4). Clearly, ESRCNN outperformed ATPRK in downscaling Sentinel-2 bands 11-12 from 40 m to 20 m spatial resolution. The RMSE, CC, and UIQI values provided by ESRCNN were much closer to their corresponding ideal values (0, 1, 1) than those provided by ATPRK. SAM and ERGAS between the ESRCNN downscaled and reference images were ~0.8 and ~2.0 for the three dates, less than the ~1.4 and ~5.6 between the ATPRK downscaled and reference images. The two fusion approaches were then used to generate the fused 10 m Sentinel-2 bands 11-12, which were subsequently used as inputs for downscaling Landsat-8 images from 30 m to 10 m spatial resolution.

Table 2. Comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11-12 from 40 m to 20 m on three dates (June 20, June 27, and July 7, 2017). The bold number indicates the best value for accuracy assessment.


June 20, 2017 June 27, 2017 July 7, 2017
Metrics Band
ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN
Band 11 0.0178 0.0065 0.0316 0.0095 0.0304 0.0104
RMSE
Band 12 0.0231 0.0093 0.0324 0.0110 0.0315 0.0119
Band 11 0.9786 0.9962 0.9804 0.9975 0.9769 0.9963
CC
Band 12 0.9772 0.9954 0.9813 0.9970 0.9793 0.9960
Band 11 0.9556 0.9920 0.9575 0.9948 0.9540 0.9929
UIQI
Band 12 0.9511 0.9918 0.9590 0.9942 0.9567 0.9928
ERGAS 5.1905 2.0211 5.7884 1.8727 5.6605 2.0516
SAM (degree) 1.3528 0.6515 1.4396 0.7435 1.5220 0.7420

Figure 6. The results of self-adaptive fusion on June 20, June 27, and July 7, 2017 for subarea 1 in Figure 4. (a) The degraded Sentinel-2 bands at 40 m, (b) the reference Sentinel-2 bands at 20 m, (c) the fused Sentinel-2 bands at 20 m, and (d) the fused Sentinel-2 bands at 10 m.

4.2 Multi-temporal fusion of Landsat-8 and Sentinel-2 images

Figure 7 shows the multi-temporal fusion results for subarea 1 in Figure 4. The Landsat-8 images acquired on June 15 and July 1, 2017 (Figure 7 (e) and (h)) were individually combined with the three Sentinel-2 images acquired on June 20, June 27, and July 7, 2017 (Figure 7 (a), (b), and (c)) to yield the fusion results. Visually, the resampled Landsat-8 images at 90 m (Figure 7 (d) and (g)) were sharply improved in the fused Landsat-8 images at 30 m (Figure 7 (f) and (i)). The fused Landsat-8 images at 30 m (Figure 7 (f) and (i)) were visually similar to the original Landsat-8 images at 30 m (Figure 7 (e) and (h)). Table 3 lists the values of the five accuracy assessment indices for the entire training area. For the fused images of these two dates, the fusion network produced an RMSE less than 0.05, CC larger than 0.95, UIQI greater than 0.94, ERGAS less than 1.2, and SAM less than 0.8. These metrics clearly show that the fused images were spectrally and spatially consistent with the original images.

The proposed fusion framework also performed well in downscaling Landsat-8 images in areas experiencing land use and land cover (LULC) changes. For example, the yellow circles in Figure 8 (a) and (c) highlight LULC changes due to the planting of crops within subarea 2 from June 20 to July 7, 2017. Although the Sentinel-2 image acquired on July 7, 2017 was input to the fusion framework, the final fused image for June 15, 2017 (Figure 8 (f)) did not incorporate such LULC changes. In addition, the Landsat-8 image on July 1, 2017 was more similar to the Sentinel-2 image on July 7, 2017 than to the other two Sentinel-2 images. Despite the obvious LULC changes among the three Sentinel-2 images used as inputs to the fusion network, the final fused image at 30 m for July 1, 2017 still did not incorporate these LULC changes (Figure 8 (i)). These findings suggest that the fusion network can identify areas experiencing LULC changes and then remove features associated with spectral changes that are not consistent with the target image.

Figure 7. The multi-temporal fusion results for subarea 1 in Figure 4. (a)-(c) are the resampled Sentinel-2 images at 30 m on June 20, June 27, and July 7, 2017. (d)-(f) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on June 15, 2017. (g)-(i) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on July 1, 2017.

Figure 8. The multi-temporal fusion results for subarea 2 in Figure 4 (bands 5, 4, 3 as RGB). (a)-(c) are three Sentinel-2 images at 10 m acquired on June 20, June 27, and July 7, 2017. (d)-(f) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on June 15, 2017. (g)-(i) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on July 1, 2017. The yellow circle highlights the land use and land cover (LULC) changes due to planting of crops (a-c).

Table 3. Comparisons between the ESRCNN fused and reference Landsat-8 images at 30 m.
Date Band RMSE CC UIQI ERGAS SAM (degree)
Band 1 0.0243 0.9860 0.9710
Band 2 0.0242 0.9869 0.9737
Band 3 0.0260 0.9861 0.9726
June 15, 2017 Band 4 0.0258 0.9875 0.9755 1.1817 0.7042
Band 5 0.0405 0.9569 0.9415
Band 6 0.0263 0.9848 0.9738
Band 7 0.0263 0.9864 0.9757
Band 1 0.0193 0.9899 0.9683
Band 2 0.0192 0.9904 0.9732
Band 3 0.0210 0.9894 0.9731
July 01, 2017 Band 4 0.0217 0.9897 0.9749 1.0741 0.7904
Band 5 0.0399 0.9599 0.9420
Band 6 0.0254 0.9830 0.9698
Band 7 0.0254 0.9854 0.9727

4.3 Image fusion with a flexible number of auxiliary Sentinel-2 images

In the previous section, the downscaling of Landsat-8 images was accomplished with three auxiliary Sentinel-2 images. However, it may not always be possible to have three or more Sentinel-2 images as inputs to the fusion framework due to cloud, shadow, and snow contamination. Thus, the impact of the number of auxiliary Sentinel-2 images on the downscaling performance was evaluated in this section. A series of data sets with a varying number of Sentinel-2 images as inputs to the fusion framework was prepared: (1) D0 (Landsat-8 image on July 1, 2017 and no auxiliary Sentinel-2 image); (2) D1 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 27, 2017); (3) D2 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on July 7, 2017); (4) D3 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 20, 2017); (5) D1+D2; (6) D2+D3; (7) D1+D3; (8) D1+D2+D3. Note that even if no auxiliary Sentinel-2 image is input to the fusion framework (D0), the fusion network can still reconstruct a high-resolution image at 10 m from the Landsat-8 image at 30 m.

Table 4 presents the accuracy assessment of the downscaling results from the eight different data sets against the original images at 30 m. The accuracy assessment was performed for the entire training area (Figure 4). The fused image generated from D0 had the least accurate results for all bands. Since the downscaling scheme D0 solely used the Landsat-8 PAN band without any Sentinel-2 information, the results suffered from an inherent pan-sharpening problem (Alparone et al., 2007; Liu et al., 2016): the PAN band was only captured at 15 m and its wavelength range only covers those of Landsat-8 bands 2-4 but not bands 5-7. The performance of the fusion network degraded as the time interval between the Sentinel-2 and Landsat-8 images increased. This finding was evidenced by the more accurate fusion results from D1 than those from D2 or D3 (Table 4) when only one auxiliary Sentinel-2 image was used for downscaling the Landsat-8 image. The time intervals between the Landsat-8 image and the Sentinel-2 images in D1, D2, and D3 were 4, 6, and 11 days, respectively. When two auxiliary Sentinel-2 images were used for downscaling the Landsat-8 image, the performance of the fusion network improved as the total time interval between the Landsat-8 image and the two Sentinel-2 images decreased. Table 4 shows that the fused image generated from D1+D2 (total time interval of 10 days) was more accurate than those generated from D2+D3 (total time interval of 17 days) and D1+D3 (total time interval of 15 days). Additionally, the fused image generated from D1+D3 had a higher accuracy level than that generated from D2+D3. Overall, the fusion accuracy improved as the number of auxiliary Sentinel-2 images input to the fusion network increased.

Further visual examination of the downscaled Landsat-8 images at 10 m suggested that the performances of the fusion networks trained with the eight different data sets were not consistent for areas experiencing LULC changes. Figure 9 shows the LULC changes caused by temporary housing (houses temporarily built near construction sites, i.e., the bright dots marked by the red rectangle) in subarea 3. As shown in Figure 9, the number of temporary housing units (i.e., bright dots) in the Landsat-8 image on July 1, 2017 and in the Sentinel-2 images on June 20, June 27, and July 7, 2017 was 4, 2, 3, and 5, respectively. For the fused image derived from D1 (Figure 10 (a)), the number of bright dots was the same as that in Figure 9 (a). However, the dots in the green rectangle (Figure 10 (a)) were relatively blurry compared to the other dots in the fused image. The number of bright dots in the fused image derived from D2 was five, while the top two dots in the fused image derived from D3 were relatively blurry. In contrast, the fusion results derived from two auxiliary Sentinel-2 images (D1+D2, D2+D3, and D1+D3) were much better, although blurry dots were still observed. The best fusion result was observed for the image derived from D1+D2+D3 according to both the qualitative (visual) and quantitative assessments (Table 4). These findings reveal that the fusion network can learn LULC changes given a sufficient number of auxiliary Sentinel-2 images as inputs.

Table 4. The impact of the number of auxiliary Sentinel-2 images input to the fusion network on the downscaling performance. The bold number in each row indicates the best value for accuracy assessment.
D0 D1 D2 D3 D1+D2 D2+D3 D1+D3 D1+D2+D3
Band1 0.0223 0.0147 0.0154 0.0170 0.0144 0.0147 0.0145 0.0141
Band2 0.0223 0.0143 0.0151 0.0166 0.0138 0.0142 0.0139 0.0136
Band3 0.0221 0.0159 0.0171 0.0174 0.0154 0.0159 0.0156 0.0153
RMSE Band4 0.0225 0.0158 0.0170 0.0174 0.0153 0.0158 0.0156 0.0151
Band5 0.0610 0.0265 0.0286 0.0347 0.0245 0.0264 0.0264 0.0240
Band6 0.0325 0.0185 0.0202 0.0222 0.0175 0.0186 0.0181 0.0173
Band7 0.0310 0.0180 0.0201 0.0222 0.0173 0.0184 0.0178 0.0172
Band1 0.9865 0.9942 0.9936 0.9923 0.9945 0.9942 0.9944 0.9947
Band2 0.9870 0.9947 0.9941 0.9829 0.9951 0.9948 0.9950 0.9952
Band3 0.9883 0.9940 0.9930 0.9928 0.9944 0.9940 0.9942 0.9945
CC Band4 0.9888 0.9945 0.9936 0.9934 0.9948 0.9946 0.9946 0.9950
Band5 0.9034 0.9826 0.9796 0.9697 0.9851 0.9827 0.9827 0.9856
Band6 0.9720 0.9910 0.9893 0.9871 0.9920 0.9909 0.9914 0.9922
Band7 0.9781 0.9926 0.9909 0.9888 0.9933 0.9923 0.9929 0.9934
Band1 0.9619 0.9813 0.9802 0.9755 0.9822 0.9813 0.9820 0.9829
Band2 0.9670 0.9857 0.9843 0.9807 0.9866 0.9858 0.9863 0.9871
Band3 0.9712 0.9851 0.9832 0.9825 0.9865 0.9853 0.9858 0.9869
UIQI Band4 0.9733 0.9876 0.9855 0.9847 0.9884 0.9876 0.9876 0.9886
Band5 0.8503 0.9751 0.9706 0.9565 0.9791 0.9755 0.9754 0.9798
Band6 0.9510 0.9841 0.9823 0.9791 0.9864 0.9851 0.9853 0.9869
Band7 0.9614 0.9868 0.9842 0.9810 0.9882 0.9867 0.9874 0.9885
ERGAS 3.0491 1.7754 1.9167 2.1033 1.6972 1.7785 1.7474 1.6768
SAM (degree) 4.6004 2.4788 2.5986 2.9878 2.3462 2.4563 2.4394 2.3162

Note: A total of eight different data sets were prepared as inputs to the fusion network: (1) D0 (Landsat-8 image on July 1, 2017 and no auxiliary Sentinel-2 image); (2) D1 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 27, 2017); (3) D2 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on July 7, 2017); (4) D3 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 20, 2017); (5) D1+D2; (6) D2+D3; (7) D1+D3; (8) D1+D2+D3.
Figure 9. Subarea 3 in Figure 4 experienced changes in temporary housing (bright dots; band 2 gray-scale images are shown here). The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image. The number of bright dots (temporary housing) in the Landsat-8 image on (c) July 1, 2017, and in the three Sentinel-2 images on (a) June 20, (b) June 27, and (d) July 7, 2017 was 4, 2, 3, and 5, respectively.

Figure 10. The downscaled Landsat-8 image at 10 m (bands 7, 6, 2 as RGB) on July 1, 2017 derived from different input data sets: (a) D1, (b) D2, (c) D3, (d) D1+D2, (e) D2+D3, (f) D1+D3, and (g) D1+D2+D3. The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image.

4.4 Comparison between ATPRK and ESRCNN for image fusion

In the previous sections, we described the steps to train the fusion network and presented its performance in downscaling Sentinel-2 and Landsat-8 images for the training area (Figure 4). In this section, we applied the trained network to a different domain within Shijiazhuang (12 km by 12 km, as shown in Figure 5). The performance of the fusion network in downscaling Sentinel-2 bands 11-12 from 40 m to 20 m and Landsat-8 images from 30 m to 10 m for this study area was benchmarked against ATPRK.

Table 5 shows the comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11-12 from 40 m to 20 m for the images acquired on June 20, June 27, and July 7, 2017. Clearly, the evaluation results revealed that ESRCNN outperformed ATPRK in the self-adaptive fusion of Sentinel-2 images. The downscaled Sentinel-2 bands 11-12 at 10 m (resampled to 30 m) were further used in downscaling Landsat-8 images from 90 m to 30 m.

Four groups of data sets were used to compare the performance of the two algorithms in downscaling Landsat-8 images: (1) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 20, 2017, (2) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 27, 2017, (3) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 27, 2017, and (4) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 20, 2017. Only one Sentinel-2 image was included in each group since ATPRK can only take one Sentinel-2 image as input for the downscaling process. Thus, the use of only one Sentinel-2 image per group ensured a fair comparison between ATPRK and ESRCNN. The original Landsat-8 and Sentinel-2 images in each group were resampled by a factor of 3 and then input to the two algorithms to yield fused images at 30 m. The temporal intervals between the Sentinel-2 and Landsat-8 images for the four groups were 5, 12, 4, and 11 days, respectively.

The proposed ESRCNN framework outperformed ATPRK in downscaling Landsat-8 images from 30 m to 10 m based on all evaluation metrics (Table 6). Figure 11 and Figure 12 further show that the fused reflectance yielded by ESRCNN matched the reference reflectance more closely than that yielded by ATPRK. In addition, the evaluation metrics for Group 1 (or 3) were better than those for Group 2 (or 4), further stressing the importance of the temporal interval in image acquisition time between the auxiliary Sentinel-2 and the target Landsat-8 images (as shown in Section 4.3). Examination of the downscaled images yielded by ATPRK revealed spectral distortions. For example, over-sharpened building boundaries were observed in Figure 13. As suggested by the distribution of pixel values from all spectral bands in Figure 14, ATPRK tended to change the distribution of reflectance values. Compared to the distribution curves in Figure 14 (a, b, d, e), those in Figure 14 (c, f) were shorter and flatter (based on the number of pixels at each gray level). In contrast, the fusion network was able to preserve the data distribution of the original image.

Table 7 shows that ESRCNN exhibited better accuracy than ATPRK in downscaling Landsat-8 images in areas experiencing LULC changes (mostly due to changes in vegetation coverage, as shown in Figure 15 (a) and (b)). Both ESRCNN and ATPRK can identify LULC changes and reproduce the original spectral information when downscaling Landsat-8 images (Figure 15 (a), (c), and (d)). However, the fused image generated by ATPRK is over-sharpened compared to that generated by ESRCNN (Figure 15 (c) and (d)).

Figure 11. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on June 15, 2017. (a)-(f) are fused reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 20, 2017. (g)-(l) are fused reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 20, 2017. The color scheme indicates point density.

Figure 12. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on July 1, 2017. (a)-(f) are fused reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 27, 2017. (g)-(l) are fused reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 27, 2017. The color scheme indicates point density.
Figure 13. The multi-temporal fusion results for subarea 1 indicated by the red rectangle in Figure 5 (bands 4, 3, 2 as RGB). The original Landsat-8 images at 30 m on (a) June 15 and (e) July 1, 2017. The resampled Landsat-8 images at 90 m on (b) June 15 and (f) July 1, 2017. The fused Landsat-8 images (ESRCNN) at 30 m on (c) June 15 and (g) July 1, 2017. The fused Landsat-8 images (ATPRK) at 30 m on (d) June 15 and (h) July 1, 2017.

Figure 14. The distribution of pixel values from all spectral bands in the original and fused Landsat-8 images. (a)-(c) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on June 15, 2017. (d)-(f) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on July 1, 2017.

Figure 15. Fusion results for subarea 2 in Figure 5, which experienced changes in vegetation coverage (bands 5, 4, 3 as RGB). (a) The 30 m Landsat-8 image on June 15, 2017, (b) the 30 m Sentinel-2 image on July 7, 2017, (c) the 90 m Landsat-8 image to be downscaled on June 15, 2017, (d) the ESRCNN downscaled image at 30 m, and (e) the ATPRK downscaled image at 30 m.

Table 5. Comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11-12 from 40 m to 20 m for the test area (Figure 5). The bold number indicates the best value for accuracy assessment.


June 20, 2017 June 27, 2017 July 07, 2017
Metrics Band
ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN
Band 11 0.0252 0.0135 0.0424 0.0218 0.0468 0.0248
RMSE
Band 12 0.0325 0.1869 0.0480 0.0278 0.0571 0.0322
Band 11 0.9531 0.9822 0.9545 0.9842 0.9491 0.9808
CC
Band 12 0.9479 0.9774 0.9484 0.9764 0.9397 0.9721
Band 11 0.9143 0.9693 0.9156 0.9713 0.9081 0.9662
UIQI
Band 12 0.9088 0.9614 0.9071 0.9584 0.8964 0.9548
ERGAS 5.8925 3.3045 6.6753 3.6917 7.2484 3.9988
SAM (degree) 1.3316 0.8220 1.6193 1.0906 1.9884 1.1695

Table 6. Comparisons between ATPRK and ESRCNN for downscaling the Landsat-8 images acquired on June 15 and July 1, 2017 from 90 m to 30 m for the entire test area (Figure 5). The bold numbers in each row indicate the best values for accuracy assessment.
Group 1 Group 2 Group 3 Group 4
June 15, 2017 June 15, 2017 July 1, 2017 July 1, 2017
Metrics Band
ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN
Band 1 - 0.0426 - 0.0434 - 0.0317 - 0.0319
Band 2 0.0800 0.0426 0.0843 0.0442 0.0694 0.0338 0.0732 0.0349
Band 3 0.0864 0.0444 0.0908 0.0468 0.0777 0.0392 0.0776 0.0421
Band 4 0.0975 0.0457 0.09555 0.0491 0.0863 0.0406 0.0806 0.0449
RMSE
Band 5 0.0851 0.0415 0.0835 0.0452 0.0903 0.0524 0.0941 0.0448
Band 6 0.0827 0.0458 0.0824 0.0466 0.0874 0.0462 0.0912 0.0492
Band 7 0.0901 0.0461 0.0904 0.0490 0.0877 0.0441 0.0915 0.0498
Mean 0.0870 0.0455 0.0878 0.0463 0.0831 0.0412 0.0847 0.0425
Band 1 - 0.9513 - 0.9493 - 0.9712 - 0.9708
Band 2 0.9131 0.9539 0.9089 0.9504 0.9194 0.9679 0.9139 0.9657
Band 3 0.9130 0.9547 0.9091 0.9493 0.9171 0.9601 0.9200 0.9539
Band 4 0.9072 0.9581 0.9118 0.9511 0.9078 0.9583 0.9197 0.9483
CC
Band 5 0.8251 0.9395 0.8320 0.9236 0.8443 0.9188 0.8358 0.9412
Band 6 0.9100 0.9520 0.9102 0.9495 0.8904 0.9432 0.8831 0.9347
Band 7 0.9159 0.9588 0.9152 0.9525 0.9025 0.9508 0.8966 0.9365
Mean 0.8974 0.9469 0.8979 0.9465 0.8969 0.9529 0.8948 0.9502
Band 1 - 0.9195 - 0.9134 - 0.9224 - 0.9173
Band 2 0.8372 0.9224 0.8259 0.9136 0.8180 0.9260 0.8050 0.9163
Band 3 0.8335 0.9256 0.8228 0.9139 0.8298 0.9256 0.8322 0.9095
Band 4 0.8178 0.9286 0.8243 0.9137 0.8184 0.9279 0.8379 0.9070
UIQI
Band 5 0.7401 0.8949 0.7490 0.8790 0.7429 0.8486 0.7290 0.8935
Band 6 0.8353 0.9204 0.8355 0.9150 0.8105 0.9090 0.7982 0.8936
Band 7 0.8360 0.9294 0.8348 0.9164 0.8239 0.9229 0.8130 0.8969
Mean 0.8166 0.9115 0.8154 0.9093 0.8073 0.9118 0.8025 0.9049
ERGAS 12.7934 6.8257 12.9690 6.9992 12.6265 6.2570 12.8198 6.5302
SAM (degree) 8.5484 6.8257 8.6148 4.6881 8.4717 4.3759 8.6556 4.2710

Note: Four groups of data sets were used: (1) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 20, 2017; (2) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 27, 2017; (3) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 27, 2017; and (4) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 20, 2017.

Table 7. Comparisons between ATPRK and ESRCNN for downscaling the Landsat-8 image acquired on June 15, 2017 (with the Sentinel-2 image on July 7, 2017 as auxiliary data) from 90 m to 30 m for the area experiencing LULC changes (Figure 15). The bold number indicates the best value for accuracy assessment.

Metrics    RMSE     CC       UIQI     ERGAS     SAM (degree)
ATPRK      0.0871   0.8830   0.8054   11.4530   7.5243
ESRCNN     0.0492   0.9131   0.8895   6.6506    4.1589

639

5 Discussion

Presently, there is a need for satellite data with a higher temporal resolution than can be provided by a single imaging sensor (e.g., Landsat-8 or Sentinel-2) to improve understanding of land surface changes and their causal agents (Claverie et al., 2018). The synergistic use of Landsat-8 and Sentinel-2 data provides a promising avenue to satisfy this scientific demand, yet it requires coordination of the spatial resolution gap between the two sensors. To this end, this study presents a deep learning-based fusion approach to downscale Landsat-8 images from 30 m spatial resolution to 10 m. The downscaled Landsat-8 images at 10 m spatial resolution, together with Sentinel-2 data at 10 m, can provide temporally dense and routine information at a short time interval suitable for environmental applications such as crop monitoring (Veloso et al., 2017).

Compared to data fusion algorithms such as STARFM and its variants (Gao et al., 2016; Hilker et al., 2009; Zhu et al., 2010; Zhu et al., 2016), the fusion network does not require as input one or more pairs of satellite images acquired on the same date, a demanding prerequisite when blending Landsat-8 and Sentinel-2 images. In contrast, the proposed fusion network can accommodate a flexible number of auxiliary Sentinel-2 images to enhance the target Landsat-8 image to a higher spatial resolution. This flexibility makes it possible for the fusion network to leverage several Sentinel-2 images acquired on dates relatively close to that of the target Landsat-8 image for reflectance prediction. ATPRK, a geostatistical approach for data fusion (Wang et al., 2016; Wang et al., 2017), does not have this flexibility to digest several auxiliary Sentinel-2 images simultaneously because of the computational difficulty of calculating a cokriging matrix containing spectral bands from all input images for cross-semivariogram modeling. This difference in the ability to use multi-temporal Sentinel-2 images also explains the better performance of the fusion network in generating downscaled Landsat-8 images at a high spatial resolution (Table 5). Examination of the downscaled Landsat-8 images highlights the advantage of the fusion network over ATPRK in preserving the distribution of spectral values of the original image. The high-fidelity reflectance values and the preservation of the statistical distribution of reflectance values provided by the fusion network hold important merits for time series analysis of satellite images. For example, Zhu et al. (2014) and Zhu et al. (2019) showed that continuous land cover change detection and classification is sensitive to anomalous observations that may affect the accurate modeling of time series reflectance values.
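To make the flexible-input idea concrete, the following sketch shows one way a variable number of auxiliary Sentinel-2 scenes could be stacked with a target Landsat-8 image along the channel dimension and fed to an SRCNN-style network. The layer sizes, band counts, and PyTorch framing are illustrative assumptions for this discussion, not the exact ESRCNN configuration released with this paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNNStyleFusion(nn.Module):
    # Three-layer SRCNN-style network; in_channels depends on how many
    # auxiliary Sentinel-2 scenes are stacked with the target Landsat-8 bands.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return self.net(x)

def build_input(landsat_30m, sentinel_10m_list):
    # Upsample the 30 m Landsat-8 bands to the 10 m grid and stack them with
    # the 10 m bands of all available auxiliary Sentinel-2 scenes.
    h, w = sentinel_10m_list[0].shape[-2:]
    landsat_up = F.interpolate(landsat_30m, size=(h, w), mode='bicubic', align_corners=False)
    return torch.cat([landsat_up] + sentinel_10m_list, dim=1)

# Hypothetical example: 7 Landsat-8 bands at 30 m and two auxiliary Sentinel-2
# scenes with 4 bands each at 10 m.
landsat = torch.rand(1, 7, 100, 100)
sentinels = [torch.rand(1, 4, 300, 300) for _ in range(2)]
x = build_input(landsat, sentinels)
model = SRCNNStyleFusion(in_channels=x.shape[1], out_channels=7)
fused_10m = model(x)  # predicted Landsat-8 bands on the 10 m grid

Because only the first convolution depends on the number of stacked channels, adding or removing an auxiliary scene changes the size of a single weight tensor rather than the overall design.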
The twin Sentinel-2 satellites together provide consistent records of the Earth's surface at a nominal revisit cycle of 5 days. Thus, one to three Sentinel-2 images captured on dates relatively close to that of the target Landsat-8 image may be available as inputs to the fusion network. The performance of the fusion network in downscaling Landsat-8 images from 30 m spatial resolution to 10 m improved as the number of auxiliary Sentinel-2 images input into the network increased. In addition, the temporal interval (i.e., difference in acquisition dates) between the target Landsat-8 image and the auxiliary Sentinel-2 images also affected the downscaling performance of the fusion network (Table 5). For example, when only one auxiliary Sentinel-2 image was fed into the fusion network, the best downscaling performance was achieved for data set D1 (Table 5), in which the temporal interval between the target Landsat-8 image and the Sentinel-2 image was the smallest compared to that in D2 and D3. Even when auxiliary Sentinel-2 images are not available (e.g., because of contamination by clouds, shadows, or snow), the fusion network can still yield a high-resolution image at 10 m for a target date with a high level of accuracy (Table 4). As such, the proposed fusion network is more suitable than existing fusion algorithms for practical applications requiring dense time series of images. In addition, the fusion network can generate Landsat-8 band 1 at 10 m by learning complex relationships across all spectral bands within and among images.
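As a small practical illustration of this point, the snippet below (with hypothetical dates, not part of the released code) ranks candidate Sentinel-2 acquisitions by their temporal interval to a target Landsat-8 date and keeps the closest ones.

from datetime import date

def select_auxiliary(target_date, sentinel_dates, max_scenes=3):
    # Rank candidate Sentinel-2 dates by their absolute temporal interval to the
    # target Landsat-8 acquisition and keep at most max_scenes of them.
    ranked = sorted(sentinel_dates, key=lambda d: abs((d - target_date).days))
    return ranked[:max_scenes]

target = date(2017, 7, 1)  # target Landsat-8 acquisition
candidates = [date(2017, 6, 20), date(2017, 6, 27), date(2017, 7, 7)]
print(select_auxiliary(target, candidates))  # closest acquisitions first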
One critical issue in the spatiotemporal fusion of satellite images is that LULC changes may occur at the temporal scale of days or weeks. Currently, only a few data fusion algorithms can handle the downscaling process while accounting for LULC changes. For example, learning-based data fusion methods such as the sparse-representation-based spatiotemporal reflectance fusion model (Huang & Song, 2012) and the extreme learning machine-based fusion method (Liu et al., 2016) have proved able to capture LULC changes. However, these learning-based methods are only suitable for relatively homogeneous areas and thus may not be directly applicable to the study area of this paper, which contains abundant landforms and, particularly in cities, high spatial heterogeneity. The most recent variant of STARFM, the Flexible Spatiotemporal DAta Fusion method (FSDAF) (Zhu et al., 2016), has the potential to capture both gradual and abrupt LULC changes for data fusion by using spectral unmixing analysis and thin plate spline interpolation. It is worth noting that the ability of FSDAF to capture LULC changes depends heavily on whether those changes are detectable in coarse-resolution images. As Landsat-8 has a 16-day revisit cycle, LULC changes identified between two temporally neighboring Landsat images may miss changes occurring at a scale of a few days (e.g., crop phenology). The results of this study revealed that the fusion network was able to identify areas experiencing LULC changes within a few days and to yield fusion results consistent with the original image. When the number of auxiliary Sentinel-2 images was less than three (Figure 10), the fusion network might not yield good fusion results for areas experiencing LULC changes. Thus, it is suggested that three or more auxiliary Sentinel-2 images be input into the fusion network for the downscaling process. In the future, research efforts will be made to explore whether there is a threshold in the temporal interval, or an optimal number of auxiliary Sentinel-2 images, that yields the best downscaling performance. As such, high-performance computing environments that leverage parallel processing, such as the NASA Earth Exchange, Google Earth Engine, and Amazon Web Services (Gorelick et al., 2017), are necessary and should be employed to perform sensitivity analyses of the fusion network. In addition, the fusion network requires sampled pixels that are representative of the different types of changes within a study area for training. Thus, there is a need to collect various types of LULC changes (beyond the temporary housing and planting of crops shown in this study) to further train the fusion network. Validation of the fusion network using 10 m images collected from airborne or ground-based sensing platforms, rather than synthetic data sets, may further help in understanding the algorithm performance.
Although the fusion network was developed to blend Landsat-8 and Sentinel-2 images into a harmonized reflectance data set, it can also be used to enhance the spatial resolution of images from other satellite sensors, such as Landsat-7 ETM+, the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), and the Moderate Resolution Imaging Spectroradiometer (MODIS). Attention should be paid to using surface reflectance products from these sensors, rather than the TOA reflectance products evaluated in this study, to reduce prediction uncertainties induced by atmospheric conditions. The fusion of Sentinel-2 data with images acquired by these satellite sensors at coarse spatial resolutions ranging from 60 m to 500 m may also entail reconfiguration of the fusion network. In particular, the fusion of Sentinel-2 and Landsat-7 images requires special efforts to fill the gaps in Landsat-7 Scan Line Corrector (SLC)-off images.
Recently, NASA launched an initiative, the HLS project (https://hls.gsfc.nasa.gov/), to produce harmonized Landsat-8 and Sentinel-2 reflectance products that fulfil the community's needs for images at both high spatial and temporal resolutions. The project aims to produce reflectance products through a series of algorithms, including atmospheric correction, cloud and cloud-shadow masking, spatial co-registration and a common gridding system, Bi-directional Reflectance Distribution Function (BRDF) normalization, and spectral bandpass adjustment, to resolve differences between Landsat-8 and Sentinel-2 images. To address the gap in spatial resolution between Sentinel-2 and Landsat-8 images, the HLS project adopted the strategy of resampling Sentinel-2 images from 10 m spatial resolution to 30 m. However, this resampling strategy discards valuable information provided by the 10 m Sentinel-2 data. The proposed fusion network therefore provides a promising approach to help generate a better harmonized Landsat-8 and Sentinel-2 dataset at 10 m spatial resolution. Further training and evaluation of the fusion network to downscale Landsat-8 images for other areas are necessary, particularly since deep learning-based methods are normally data-hungry for robust performance (LeCun et al., 2015).
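To illustrate the information loss implied by that resampling strategy, the sketch below (illustrative only; the HLS processing chain uses its own gridding and resampling procedures) block-averages a 10 m Sentinel-2 band to 30 m, which removes exactly the sub-30 m spatial detail that the proposed fusion network instead tries to exploit.

import numpy as np

def aggregate_10m_to_30m(band_10m):
    # Average non-overlapping 3x3 blocks of a 10 m band to produce a 30 m band;
    # assumes both array dimensions are multiples of 3.
    rows, cols = band_10m.shape
    return band_10m.reshape(rows // 3, 3, cols // 3, 3).mean(axis=(1, 3))

band_10m = np.random.rand(300, 300)          # hypothetical 10 m reflectance band (3 km x 3 km)
band_30m = aggregate_10m_to_30m(band_10m)
print(band_10m.shape, '->', band_30m.shape)  # (300, 300) -> (100, 100)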
6 Conclusion

The combination of Landsat-8 and Sentinel-2 satellite images has the potential to provide temporally dense observations of land surfaces at short time intervals (e.g., 5-day composites) suitable for environmental applications such as monitoring of agricultural management and conditions. However, the spatial resolution gap between the two satellite sensors needs to be coordinated. In this study, a deep learning-based data fusion network was developed to downscale Landsat-8 images from 30 m spatial resolution to 10 m with inputs of auxiliary Sentinel-2 images acquired on dates close to those of the target Landsat-8 images. The Landsat-8 images downscaled by the deep learning-based network showed rich spatial information and high quality, suggesting great potential for applications that require a series of satellite observations at both high temporal and high spatial resolutions. Overall, the major advantages of the proposed fusion network can be summarized as follows.

(1) Compared to ATPRK, which outperforms other data fusion algorithms in downscaling, ESRCNN can accommodate a flexible number of auxiliary Sentinel-2 images to enhance Landsat-8 images from 30 m spatial resolution to 10 m with a higher accuracy.

(2) By leveraging information from auxiliary Sentinel-2 images, ESRCNN has the potential to identify LULC changes for data fusion, which generally cannot be handled by existing data fusion algorithms.

(3) ESRCNN is superior to ATPRK in preserving the distribution of reflectance values in the downscaled Landsat-8 images.
Code Availability

The code developed for fusing Landsat-8 and Sentinel-2 images is available at https://github.com/MrCPlusPlus/ESRCNN-for-Landsat8-Sentinel2-Fusion.
Acknowledgements

This work was supported in part by the National Key Research and Development Plan on strategic international scientific and technological innovation cooperation special project under Grant 2016YFE0202300, the National Natural Science Foundation of China under Grants 61671332, 41771452, and 41771454, and the Natural Science Fund of Hubei Province in China under Grant 2018CFA007. The authors are also grateful to the editors and three anonymous reviewers who helped improve the earlier version of this manuscript through their comments and suggestions.
References

Acerbi-Junior, F.W., Clevers, J.G.P.W., & Schaepman, M.E. (2006). The assessment of multi-sensor image fusion using wavelet transforms for mapping the Brazilian Savanna. International Journal of Applied Earth Observation and Geoinformation, 8, 278-288.

Alparone, L., Wald, L., Chanussot, J., Thomas, C., Gamba, P., & Bruce, L.M. (2007). Comparison of pansharpening algorithms: outcome of the 2006 GRS-S data fusion contest. IEEE Transactions on Geoscience and Remote Sensing, 45, 3012-3021.

Audebert, N., Le Saux, B., & Lefèvre, S. (2017). Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing.

Claverie, M., Masek, J. G., Ju, J., & Dungan, J. L. (2017). Harmonized Landsat-8 Sentinel-2 (HLS) product user's guide. National Aeronautics and Space Administration (NASA): Washington, DC, USA.

Claverie, M., Ju, J., Masek, J. G., Dungan, J. L., Vermote, E. F., Roger, J. C., ... & Justice, C. (2018). The Harmonized Landsat and Sentinel-2 surface reflectance data set. Remote Sensing of Environment, 219, 145-161.

Das, M., & Ghosh, S. K. (2016). Deep-STEP: A deep learning approach for spatiotemporal prediction of remote sensing data. IEEE Geoscience and Remote Sensing Letters, 13(12), 1984-1988.

DeVries, B., Huang, C., Huang, W., Jones, J., Lang, M., & Creed, I. (2016). Automated quantification of surface water fraction using Landsat and Sentinel-2 data. Paper presented at the AGU Fall Meeting Abstracts.

Dong, C., Loy, C.C., He, K., & Tang, X. (2016). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 295-307.

Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., Meygret, A., Spoto, F., Sy, O., Marchese, F., & Bargellini, P. (2012). Sentinel-2: ESA's Optical High-Resolution Mission for GMES Operational Services. Remote Sensing of Environment, 120, 25-36.

Emelyanova, I. V., McVicar, T. R., Van Niel, T. G., Li, L. T., & van Dijk, A. I. (2013). Assessing the accuracy of blending Landsat-MODIS surface reflectances in two landscapes with contrasting spatial and temporal dynamics: A framework for algorithm selection. Remote Sensing of Environment, 133, 193-209.

Fu, P., & Weng, Q. (2016a). A time series analysis of urbanization induced land use and land cover change and its impact on land surface temperature with Landsat imagery. Remote Sensing of Environment, 175, 205-214.

Fu, P., & Weng, Q. (2016b). Consistent land surface temperature data generation from irregularly spaced Landsat imagery. Remote Sensing of Environment, 184, 175-187.

Gao, F., Masek, J., Schwaller, M., & Hall, F. (2006). On the blending of the Landsat and MODIS surface reflectance: predicting daily Landsat surface reflectance. IEEE Transactions on Geoscience and Remote Sensing, 44, 2207-2218.

Gevaert, C. M., & García-Haro, F. J. (2015). A comparison of STARFM and an unmixing-based algorithm for Landsat and MODIS data fusion. Remote Sensing of Environment, 156, 34-44.

Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18-27.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 1904-1916.

Hilker, T., Wulder, M.A., Coops, N.C., Linke, J., McDermid, G., Masek, J.G., Gao, F., & White, J.C. (2009). A new data fusion model for high spatial- and temporal-resolution mapping of forest disturbance based on Landsat and MODIS. Remote Sensing of Environment, 113, 1613-1627.

Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261.

Hirschmugl, M., Gallaun, H., Dees, M., Datta, P., Deutscher, J., Koutsias, N., & Schardt, M. (2017). Methods for mapping forest disturbance and degradation from optical earth observation data: a review. Current Forestry Reports, 3(1), 32-45.

Huang, B., & Song, H. (2012). Spatiotemporal reflectance fusion via sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 50(10), 3707-3716.

Kim, D.-H., Sexton, J.O., Noojipady, P., Huang, C., Anand, A., Channan, S., Feng, M., & Townshend, J.R. (2014). Global, Landsat-based forest-cover change from 1990 to 2000. Remote Sensing of Environment, 155, 178-193.

Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems (pp. 1097-1105).

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278-2324.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436.

Lin, Y., David, R., Hankui, Z., et al. (2016). An automated approach for sub-pixel registration of Landsat-8 Operational Land Imager (OLI) and Sentinel-2 Multi Spectral Instrument (MSI) imagery. Remote Sensing, 8(6), 520.

Liu, J. G. (2000). Smoothing Filter-based Intensity Modulation: A spectral preserve image fusion technique for improving spatial details. International Journal of Remote Sensing, 21(18), 3461-3472.

Liu, P., Xiao, L., & Tang, S. (2016). A new geometry enforcing variational model for pan-sharpening. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(12), 5726-5739.

Liu, X., Deng, C., Wang, S., Huang, G. B., Zhao, B., & Lauren, P. (2016). Fast and accurate spatiotemporal fusion based upon extreme learning machine. IEEE Geoscience and Remote Sensing Letters, 13(12), 2039-2043.

Malenovský, Z., Rott, H., Cihlar, J., Schaepman, M.E., García-Santos, G., Fernandes, R., & Berger, M. (2012). Sentinels for science: Potential of Sentinel-1, -2, and -3 missions for scientific observations of ocean, cryosphere, and land. Remote Sensing of Environment, 120, 91-101.

Masi, G., Cozzolino, D., Verdoliva, L., & Scarpa, G. (2016). Pansharpening by convolutional neural networks. Remote Sensing, 8(7), 594.

Nair, V., & Hinton, G.E. (2010). Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (pp. 807-814).

Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, S., Wang, Z., Loy, C.-C., & Tang, X. (2015). DeepID-Net: Deformable deep convolutional neural networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2403-2412).

Quintano, C., Fernández-Manso, A., & Fernández-Manso, O. (2018). Combination of Landsat and Sentinel-2 MSI data for initial assessing of burn severity. International Journal of Applied Earth Observation and Geoinformation, 64, 221-225.

Ranchin, T., & Wald, L. (2000). Fusion of high spatial and spectral resolution images: The ARSIS concept and its implementation. Photogrammetric Engineering and Remote Sensing, 66, 49-61.

Roy, D. P., Lewis, P., Schaaf, C., Devadiga, S., & Boschetti, L. (2006). The global impact of clouds on the production of MODIS bidirectional reflectance model-based composites for terrestrial monitoring. IEEE Geoscience and Remote Sensing Letters, 3(4), 452-456.

Roy, D. P., Wulder, M. A., Loveland, T. R., Woodcock, C. E., Allen, R. G., Anderson, M. C., ... & Zhu, Z. (2014). Landsat-8: Science and product vision for terrestrial global change research. Remote Sensing of Environment, 145, 154-172.

Skakun, S., Kussul, N., Shelestov, A., & Kussul, O. (2014). Flood hazard and flood risk assessment using a time series of satellite images: A case study in Namibia. Risk Analysis, 34(8), 1521-1537.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.

Senf, C., Pflugmacher, D., Heurich, M., & Krueger, T. (2017). A Bayesian hierarchical model for estimating spatial and temporal variation in vegetation phenology from Landsat time series. Remote Sensing of Environment, 194, 155-160.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Sun, Y., Chen, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In International Conference on Neural Information Processing Systems (pp. 1988-1996).

Storey, J. C., Roy, D. P., & Masek, J. (2016). A note on the temporary misregistration of Landsat-8 Operational Land Imager (OLI) and Sentinel-2 Multi Spectral Instrument (MSI) imagery. Remote Sensing of Environment, 186, 121-122.

USGS. (2015). Using the USGS Landsat 8 product. [Online]. Available: http://landsat.usgs.gov/Landsat8_Using_Product.php.

Veloso, A., Mermoz, S., Bouvet, A., Le Toan, T., Planells, M., Dejoux, J.-F., & Ceschia, E. (2017). Understanding the temporal behavior of crops using Sentinel-1 and Sentinel-2-like data for agricultural applications. Remote Sensing of Environment, 199, 415-426.

Verbesselt, J., Zeileis, A., & Herold, M. (2012). Near real-time disturbance detection using satellite image time series. Remote Sensing of Environment, 123, 98-108.

Wald, L., Ranchin, T., & Mangolini, M. (1997). Fusion of satellite images of different spatial resolution: Assessing the quality of resulting images. Photogrammetric Engineering and Remote Sensing, 63, 691-699.

Wang, Q., Shi, W., Li, Z., & Atkinson, P. M. (2016). Fusion of Sentinel-2 images. Remote Sensing of Environment, 187, 241-252.

Wang, Q., Blackburn, G.A., Onojeghuo, A.O., Dash, J., Zhou, L., Zhang, Y., & Atkinson, P.M. (2017). Fusion of Landsat-8 OLI and Sentinel-2 MSI data. IEEE Transactions on Geoscience and Remote Sensing, 55, 3885-3899.

Wang, Z., & Bovik, A.C. (2002). A universal image quality index. IEEE Signal Processing Letters, 9, 81-84.

Wei, Q., Bioucas-Dias, J., Dobigeon, N., & Tourneret, J. Y. (2015). Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 53(7), 3658-3668.

White, J. C., Wulder, M. A., Hermosilla, T., Coops, N. C., & Hobart, G. W. (2017). A nationwide annual characterization of 25 years of forest disturbance and recovery for Canada using Landsat time series. Remote Sensing of Environment, 194, 303-321.

Woodcock, C. E., Allen, R., Anderson, M., Belward, A., Bindschadler, R., Cohen, W., Wynne, R. (2008). Free access to Landsat imagery. Science, 320(5879), 1011.

Yuan, Q., Wei, Y., Meng, X., Shen, H., & Zhang, L. (2018). A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(3), 978-989.

Zhang, L., Weng, Q., & Shao, Z. (2017). An evaluation of monthly impervious surface dynamics by fusing Landsat and MODIS time series in the Pearl River Delta, China, from 2000 to 2015. Remote Sensing of Environment, 201, 99-114.

Zhang, N., Donahue, J., Girshick, R., & Darrell, T. (2014). Part-based R-CNNs for fine-grained category detection. In European Conference on Computer Vision (pp. 834-849).

Zhu, X., Chen, J., Gao, F., Chen, X., & Masek, J.G. (2010). An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sensing of Environment, 114, 2610-2623.

Zhu, X., Helmer, E. H., Gao, F., Liu, D., Chen, J., & Lefsky, M. A. (2016). A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sensing of Environment, 172, 165-177.

Zhu, Z., & Woodcock, C. E. (2014). Continuous change detection and classification of land cover using all available Landsat data. Remote Sensing of Environment, 144, 152-171.
List of Figure Captions

Figure 1. The schematic illustration of local receptive fields (red rectangles) and parameter sharing (weight values w1, w2, and w3 are equal) between layers in convolutional neural networks.

Figure 2. The three phases, i.e., convolution, nonlinear activation, and pooling, in a typical convolutional neural network layer.

Figure 3. The workflow of the extended SRCNN (ESRCNN) for fusing Landsat-8 and Sentinel-2 images. Conv indicates the convolutional layer and the rectified linear unit (ReLU) represents the nonlinear activation layer.

Figure 4. Landsat-8 and Sentinel-2 data sets used in the training phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2, and 3) were used for visual assessment.

Figure 5. Landsat-8 and Sentinel-2 data sets used in the testing phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2) were used for evaluating the fusion network performance.

Figure 6. The results of self-adaptive fusion on June 20, June 27, and July 7, 2017 for subarea 1 in Figure 4. (a) The degraded Sentinel-2 bands at 40 m, (b) the reference Sentinel-2 bands at 20 m, (c) the fused Sentinel-2 bands at 20 m, and (d) the fused Sentinel-2 bands at 10 m.

Figure 7. The multi-temporal fusion results for subarea 1 in Figure 4. (a)-(c) are the resampled Sentinel-2 images at 30 m on June 20, June 27, and July 7, 2017. (d)-(f) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on June 15, 2017. (g)-(i) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on July 1, 2017.

Figure 8. The multi-temporal fusion results for subarea 2 in Figure 4 (bands 5, 4, 3 as RGB). (a)-(c) are three Sentinel-2 images at 10 m acquired on June 20, June 27, and July 7, 2017. (d)-(f) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on June 15, 2017. (g)-(i) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on July 1, 2017. The yellow circle highlights the land use and land cover (LULC) changes due to planting of crops (a-c).

Figure 9. Subarea 3 in Figure 4 experienced changes in temporary housing (bright dots; band 2 gray-scale images are shown here). The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image. The number of bright dots (temporary housing) in the Landsat-8 image on (c) July 1, 2017, and the three Sentinel-2 images on (a) June 20, (b) June 27, and (d) July 7, 2017 was 4, 2, 3, and 5, respectively.

Figure 10. The downscaled Landsat-8 image at 10 m (bands 7, 6, 2 as RGB) on July 1, 2017 derived from seven different data sets: (a) D1, (b) D2, (c) D3, (d) D1+D2, (e) D2+D3, (f) D1+D3, and (g) D1+D2+D3. The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image.

Figure 11. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on June 15, 2017. (a)-(f) are fusion reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 20, 2017. (g)-(l) are fusion reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 20, 2017. The color scheme indicates point density.

Figure 12. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on July 1, 2017. (a)-(f) are fusion reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 27, 2017. (g)-(l) are fusion reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 27, 2017. The color scheme indicates point density.

Figure 13. The multi-temporal fusion results for subarea 1 indicated by the red rectangle in Figure 5 (bands 4, 3, 2 as RGB). The original Landsat-8 images at 30 m on (a) June 15 and (e) July 1, 2017. The resampled Landsat-8 images at 90 m on (b) June 15 and (f) July 1, 2017. The fused Landsat-8 images (ESRCNN) at 30 m on (c) June 15 and (g) July 1, 2017. The fused Landsat-8 images (ATPRK) at 30 m on (d) June 15 and (h) July 1, 2017.

Figure 14. The distribution of pixel values from all spectral bands in Landsat-8 and Sentinel-2 images. (a)-(c) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on June 15, 2017. (d)-(f) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on July 1, 2017.

Figure 15. Fusion results for subarea 2 in Figure 5 experiencing changes in vegetation coverage (bands 5, 4, 3 as RGB). (a) The 30 m Landsat-8 image on June 15, 2017, (b) the 30 m Sentinel-2 image on July 7, 2017, (c) the 90 m Landsat-8 image to be downscaled on June 15, 2017, (d) the ESRCNN downscaled image at 30 m, and (e) the ATPRK downscaled image at 30 m.