
Deep learning-based fusion of Landsat-8 and Sentinel-2 images for a harmonized surface reflectance product

Zhenfeng Shao 1,2, Jiajun Cai 1,2,3, Peng Fu 4,5,*, Leiqiu Hu 6, Tao Liu 7

1 State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China 430072
2 Collaborative Innovation Center for Geospatial Technology, Wuhan University, Wuhan, China 430072
3 Department of Geography and Resource Management, The Chinese University of Hong Kong, Hong Kong, China 999077
4 Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
5 Carl R. Woese Institute of Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
6 Atmospheric Science Department, University of Alabama at Huntsville, Huntsville, AL 35805, USA
7 Geographic Data Science, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA

* Corresponding author
Peng Fu (pengfu@illinois.edu)
Jiajun Cai (cai_jiajun@foxmail.com)

Abstract

Landsat and Sentinel-2 sensors together provide the most widely accessible medium-to-high spatial resolution multispectral data for a wide range of applications, such as vegetation phenology identification, crop yield estimation, and forest disturbance detection. More timely and accurate observations of the Earth's surface and dynamics are expected from the synergistic use of Landsat and Sentinel-2 data, which entails coordinating the spatial resolution gap between Landsat (30 m) and Sentinel-2 (10 m or 20 m) images. However, widely used data fusion techniques may not fulfil the community's needs for generating a temporally dense reflectance product at 10 m spatial resolution from combined Landsat and Sentinel-2 images because of their inherent algorithmic weaknesses. Inspired by the recent advances in deep learning, this study developed an extended super-resolution convolutional neural network (ESRCNN) for a data fusion framework, specifically for blending Landsat-8 Operational Land Imager (OLI) and Sentinel-2 Multispectral Imager (MSI) data. Results demonstrated the effectiveness of the deep learning-based fusion algorithm in yielding a consistent and comparable dataset at 10 m from Landsat-8 and Sentinel-2. Further accuracy assessments revealed that the performance of the fusion network was influenced by both the number of input auxiliary Sentinel-2 images and the temporal interval (i.e., difference in image acquisition dates) between the auxiliary Sentinel-2 images and the target Landsat-8 image. Compared to the benchmark algorithm, area-to-point regression kriging (ATPRK), the deep learning-based fusion framework proved better in the quantitative assessment in terms of RMSE (root mean square error), correlation coefficient (CC), universal image quality index (UIQI), relative global-dimensional synthesis error (ERGAS), and spectral angle mapper (SAM). ESRCNN also better preserved the reflectance distribution of the original image than ATPRK, resulting in improved image quality. Overall, the developed data fusion network that blends Landsat-8 and Sentinel-2 images has the potential to help generate continuous reflectance observations at a higher temporal frequency than can be obtained from a single Landsat-like sensor.

Keywords: Deep learning, Data fusion, Landsat-8, Sentinel-2, continuous monitoring

1 Introduction

The consistent Landsat records of the Earth's surface and dynamics at 30 m spatial resolution represent hitherto the longest space-based observations, dating back to the 1970s (Roy et al., 2014). The opening of the Landsat archive in 2008 (Woodcock et al., 2008) has fostered many previously unimaginable environmental applications based on time series satellite images, such as near-real time disturbance detection (Verbesselt et al., 2012) and continuous land cover change detection and classification (Zhu & Woodcock, 2014). Currently, it has become the norm to use all available time series images (dating back to the 1980s for reflectance products) at a given location for a wide range of applications, from characterizing forest disturbance (Kim et al., 2014) and vegetation phenology (Senf et al., 2017) to revealing urbanization-induced land use and land cover changes (Fu & Weng, 2016a). Despite the popularity of time series analysis of Landsat images, their usability is often limited by the presence of clouds, shadows, and other adverse atmospheric effects (Fu & Weng, 2016b); thus, the actual temporal revisit cycle of usable Landsat images is irregular and sometimes longer than 16 days. The missing temporal information can hinder applications that require near-daily or multi-day imagery at medium spatial resolution (~30 m), e.g., crop yield estimation (Claverie et al., 2012), flood response (Skakun et al., 2014), vegetation phenology identification (Melaas et al., 2013), and forest disturbance detection (White et al., 2017).

Synergies between Landsat Operational Land Imager (OLI) and Sentinel-2 Multispectral Imager (MSI) data are promising for fulfilling the community's needs for high-temporal-resolution images at medium spatial scale. With the launch of the Sentinel-2A and -2B satellites in 2015 and 2017, respectively, the combined Landsat-8 OLI and Sentinel-2 MSI dual system can provide dense global observations at a nominal revisit interval of 2-3 days. Each MSI onboard the twin Sentinel-2 satellites (Sentinel-2A and -2B) acquires images covering thirteen spectral bands (Table 1) at spatial resolutions of 10 m (four visible and near-infrared bands), 20 m (six red edge and shortwave infrared bands), and 60 m (three atmospheric correction bands) every 10 days (Drusch et al., 2012; Malenovský et al., 2012). The MSI images are complementary to images captured by the Landsat OLI sensor since the two instruments share similarities in band specifications (Table 1).

The combination of Landsat and Sentinel-2 reflectance products enriches the temporal information. For example, we can generate 5-day composite products, much more frequent than those from Landsat or Sentinel-2 individually. However, the disparity in spatial resolution between Landsat-8 and Sentinel-2 data remains unresolved. One common but rather simple approach is to sacrifice the spatial resolution of the reflectance products by resampling the finer-resolution data to match the coarser one (Wang et al., 2017). This resampling approach has also been adopted by the NASA Harmonized Landsat and Sentinel-2 (HLS) project to produce temporally dense reflectance products at 30 m. Thus, valuable information provided by the 10 m Sentinel-2 data is wasted. In this study, we aim to develop a data fusion approach that takes full advantage of the available temporal and spatial information in both Landsat-8 and Sentinel-2 images to generate a reflectance product at a finer spatial resolution of 10 m.

Various algorithms have been developed in the past for image fusion to obtain more frequent Landsat-like data, such as the spatial and temporal adaptive reflectance fusion model (STARFM) (Gao et al., 2006) and its variants (Hilker et al., 2009; Zhu et al., 2010; Zhu et al., 2016), unmixing-based data fusion (Gevaert & García-Haro, 2015), wavelet transformation (Acerbi-Junior et al., 2006), sparse representation (Wei et al., 2015), and smoothing filter-based intensity modulation (SFIM) (Liu, 2000). Among these data fusion techniques, STARFM and its variants are probably the most popular fusion algorithms for generating synthetic surface reflectance at both high spatial and temporal resolutions due to their robust prediction performance (Emelyanova et al., 2013). Nevertheless, STARFM and its variants require at least one pair of fine and coarse spatial resolution images (e.g., Landsat and MODIS images) acquired on the same day as inputs to implement the downscaling process. This requisite makes the STARFM framework unsuitable for fusing Landsat-8 and Sentinel-2 images, which often revisit the same target on different days. Recently, Wang et al. (2017) used area-to-point regression kriging (ATPRK) to yield downscaled Landsat-8 images at 10 m and suggested that the ATPRK approach outperformed STARFM in downscaling reflectance. However, ATPRK, as a geostatistical fusion approach, involves complex semi-variogram modeling from the cokriging matrix that is computationally unrealistic for a large domain due to its sheer size. Moreover, ATPRK may not be suitable for areas experiencing rapid land cover changes since its performance relies on input covariates (i.e., spectral bands) that may come from a subjectively selected image on a specific date. Within a short time period (e.g., 2 weeks), the collected Sentinel-2 and Landsat data for a given location may all contain valuable information for image fusion. In contrast, the ATPRK algorithm does not have the flexibility to accommodate a varying number of input images within a specific study period for reflectance prediction. Given the varying number of input images, a new fusion algorithm is needed to automatically select the best features from one or more images to perform reflectance prediction at high spatial resolution (Das & Ghosh, 2016).

The recent advances in deep learning make it promising for addressing the spatial gap between Landsat-8 and Sentinel-2 data, potentially leading to improved image fusion performance over existing algorithms (Masi et al., 2016; Yuan et al., 2018). Deep learning is a fully data-driven approach that can automatically transform the representation at one level into a representation at a higher, slightly more abstract level (LeCun et al., 2015; Schmidhuber, 2015), thus facilitating data predictions at different spatial scales. In particular, convolutional neural networks (CNNs) consist of a series of convolution filters that can extract hierarchical contextual image features (Krizhevsky et al., 2012). As a popular form of deep learning networks, they have been widely used in image classification, object recognition, and natural language processing due to their powerful feature learning ability (Audebert et al., 2017; Hirschberg & Manning, 2015; Simonyan & Zisserman, 2014). Inspired by these successful applications of CNNs, this study extended a super-resolution CNN (SRCNN) to address the gap in spatial resolution between Landsat-8 and Sentinel-2 images. More specifically, the deep learning-based framework was used to downscale the Landsat-8 image of 30 m spatial resolution to 10 m by using Sentinel-2 spectral bands at 10 m and 20 m. Given the better performance of ATPRK in image fusion over previous pan-sharpening and spatiotemporal fusion algorithms (Wang et al., 2017), this study used ATPRK as the benchmark algorithm to assess the effectiveness of the deep learning-based fusion approach.
Table 1. Band specifications for Landsat-8 and Sentinel-2 images

Landsat-8                                          Sentinel-2
Band           Wavelength (nm)  Resolution (m)     Band               Wavelength (nm)  Resolution (m)
1 (Coastal)    430-450          30                 1 (Coastal)        433-453          60
2 (Blue)       450-515          30                 2 (Blue)           458-523          10
3 (Green)      525-600          30                 3 (Green)          543-578          10
4 (Red)        630-680          30                 4 (Red)            650-680          10
-              -                -                  5 (Red Edge)       698-713          20
-              -                -                  6 (Red Edge)       733-748          20
-              -                -                  7 (Red Edge)       773-793          20
5 (NIR)        845-885          30                 8 (NIR)            785-900          10
-              -                -                  9 (Water vapor)    935-955          60
-              -                -                  10 (SWIR-Cirrus)   1360-1390        60
6 (SWIR-1)     1560-1660        30                 11 (SWIR-1)        1565-1655        20
7 (SWIR-2)     2100-2300        30                 12 (SWIR-2)        2100-2280        20
8 (PAN)        503-676          15                 -                  -                -

2 Methodology

2.1 Convolutional neural network

As a popular neural network form in deep learning, CNNs leverage three important ideas, namely sparse interactions, parameter sharing, and sub-sampling, to help improve a machine learning system (He et al., 2015; Krizhevsky et al., 2012; Ouyang et al., 2015; Zhang et al., 2014; Sun et al., 2014). In CNNs, it is impractical to connect all neurons across network layers for images of high dimensionality. Thus, neurons in CNNs are connected only to a local region of the original image (i.e., sparse interactions) through a hyperparameter called the receptive field of a neuron (equivalent to the filter size, as shown in Figure 1). In other words, sparse interactions indicate that the value of each neuron within a convolution layer is calculated from the neuron values of its spatially neighboring region in the previous layer. Parameter sharing ensures that the weights of a convolutional kernel remain the same when they are applied to the input data at different locations. In this way, the number of parameters in the network decreases significantly. Figure 1 illustrates the concepts of local receptive fields (red rectangles) and parameter sharing (i.e., weight values w1, w2, and w3 are equal: w1 = w2 = w3). A typical layer of a convolutional network includes three phases, as shown in Figure 2. In the first phase, a series of convolutions are applied in parallel to produce linear activations. In the second phase, each linear activation from the first phase is passed to a nonlinear activation function, such as Sigmoid, Tanh, or ReLU (Rectified Linear Unit) (Nair & Hinton, 2010). In the third phase, a pooling function configured in a so-called sub-sampling layer is adopted to perform local averaging and sub-sampling, reducing the sensitivity of the output to shifts and distortions as well as the computational complexity of the model. The output of the operation in each phase is called a feature map.

Mathematically, the first two phases are expressed in equation (1):

y^{(j)} = f\left( b^{(j)} + \sum_{i} w^{(i)(j)} * x^{(i)} \right)    (1)

where x^{(i)} and y^{(j)} are the i-th input feature map and the j-th output feature map of a convolutional layer, w^{(i)(j)} is the convolutional kernel applied to x^{(i)}, and b^{(j)} denotes the bias. The convolutional operator is indicated by the symbol *, and f denotes the nonlinear activation function. The third phase, max pooling (or sub-sampling), is expressed in equation (2):

y_{i,j} = \max_{(m,n) \in s \times s} \left( x_{m,n} \right)    (2)

where y_{i,j} is the neuron value at (i, j) in the output layer, and m and n index the pixel locations around the center neuron at (i, j) within a spatial extent of s × s (also known as an image patch). More specifically, equation (2) dictates that the value at location (i, j) is assigned the maximal value over the spatial extent of s × s in the input layer.
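To make the three phases concrete, the following minimal sketch chains a convolution, a ReLU activation, and a 2 × 2 max pooling, corresponding to equations (1) and (2). It is written in PyTorch purely for illustration (this section does not specify any framework), and the channel counts and kernel sizes are arbitrary examples.

```python
# Illustrative only: one CNN layer with the three phases of equations (1)-(2).
import torch
import torch.nn as nn

layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1),  # phase 1: convolution (eq. 1)
    nn.ReLU(),                                                            # phase 2: nonlinear activation
    nn.MaxPool2d(kernel_size=2),                                          # phase 3: s x s max pooling (eq. 2)
)

x = torch.rand(1, 3, 64, 64)   # a dummy 3-band image patch
y = layer(x)                   # feature maps of shape (1, 8, 32, 32)
print(y.shape)
```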

Figure 1. Schematic illustration of local receptive fields (red rectangles) and parameter sharing (weight values w1, w2, and w3 are equal) between layers in convolutional neural networks.

Figure 2. The three phases, i.e., convolution, nonlinear activation, and pooling, in a typical convolutional neural network layer.

2.2 Extending super-resolution convolutional neural network (SRCNN) for image fusion

Although the CNN was initially designed to predict categorical variables (i.e., classification), it has recently been modified to output continuous variables (i.e., regression) for data fusion (e.g., Dong et al., 2016; Shao & Cai, 2018). Dong et al. (2016) proposed SRCNN to reconstruct a high-resolution image directly from a low-resolution image. As the fusion of Sentinel-2 and Landsat-8 images aims to improve the spatial resolution of Landsat-8 imagery from 30 m to 10 m, it shares some similarities with super-resolution reconstruction (SR). However, SR cannot reconstruct spatial details within image pixels when coarse resolution images are used as the only inputs for predicting images at fine spatial resolution. In this study, an extended SRCNN (ESRCNN) was developed to fuse data with different resolutions, where the high-resolution Sentinel-2 images are treated as auxiliary data for downscaling the low-resolution Landsat-8 image. The ESRCNN framework consists of three layers. The first layer takes inputs with N1 channels and calculates N2 feature maps using a k1 × k1 receptive field and a nonlinear ReLU activation. The second layer computes N3 feature maps with the aid of a k2 × k2 receptive field and ReLU. Finally, the third layer outputs fusion results with N4 channels based on a k3 × k3 receptive field. In summary, these layers can be expressed in equation (3).

F_1(x) = \max(0, b_1 + w_1 * x),        w_1: N_2 \times (k_1 \times k_1 \times N_1),  b_1: N_2 \times 1
F_2(x) = \max(0, b_2 + w_2 * F_1(x)),   w_2: N_3 \times (k_2 \times k_2 \times N_2),  b_2: N_3 \times 1
F_3(x) = b_3 + w_3 * F_2(x),            w_3: N_4 \times (k_3 \times k_3 \times N_3),  b_3: N_4 \times 1    (3)

where the max function returns the larger of the two values in brackets.
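The three-layer structure of equation (3) translates directly into code. The following is a minimal sketch in PyTorch (chosen only for illustration; the original work does not specify the implementation framework). The kernel sizes (9, 1, 5) and feature-map counts (64, 32) follow the SRCNN settings cited later in this section; the padding that keeps the output the same size as the input is an assumption.

```python
# A minimal ESRCNN sketch: three convolutional layers, ReLU after the first two (eq. 3).
import torch
import torch.nn as nn

class ESRCNN(nn.Module):
    def __init__(self, n1_in, n4_out, n2=64, n3=32):
        super().__init__()
        self.layer1 = nn.Conv2d(n1_in, n2, kernel_size=9, padding=4)   # k1 = 9
        self.layer2 = nn.Conv2d(n2, n3, kernel_size=1, padding=0)      # k2 = 1
        self.layer3 = nn.Conv2d(n3, n4_out, kernel_size=5, padding=2)  # k3 = 5
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.layer1(x))  # F1(x) = max(0, b1 + w1 * x)
        x = self.relu(self.layer2(x))  # F2(x) = max(0, b2 + w2 * F1(x))
        return self.layer3(x)          # F3(x) = b3 + w3 * F2(x), no activation
```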

The workflow of ESRCNN for fusing Landsat-8 and Sentinel-2 images is shown in Figure 3. The downscaling procedure is divided into two parts: self-adaptive fusion of Sentinel-2 images and multi-temporal fusion of Landsat-8 and Sentinel-2 images. First, Sentinel-2 spectral bands 11 and 12 (SWIR-1/2) at 20 m are downscaled to 10 m by feeding ESRCNN with bands 2-4 (B, G, and R) and 8 (NIR) at 10 m, together with bands 11 and 12 resampled to 10 m using nearest neighbor interpolation. Second, Landsat-8 bands 1-7 are downscaled through ESRCNN by using the Landsat-8 panchromatic band (15 m) and the Sentinel-2 data sets (bands 2-4, 8, and 11-12 at 10 m). The fusion network in this step can accommodate a flexible number of Sentinel-2 images as auxiliary data sets. Inputs to ESRCNN include multi-temporal Sentinel-2 images (10 m) captured on days relatively close to the target Landsat-8 image, as well as Landsat-8 bands 1-7 resampled to 10 m with nearest neighbor interpolation. The incorporation of resampled bands by the fusion network in both steps ensures that information from coarse resolution images is learned by the deep learning network and used for downscaling. Overall, the ESRCNN fusion workflow illustrated in Figure 3 can be summarized in four steps (a code sketch of the pipeline follows the list):

(1) The 20 m Sentinel-2 bands 11-12 are resampled to 10 m using nearest neighbor interpolation;

(2) the resampled Sentinel-2 bands 11-12 and bands 2-4, 8 at 10 m are fed into the ESRCNN to generate downscaled Sentinel-2 bands 11-12 at 10 m (the self-adaptive fusion process);

(3) the Landsat-8 bands 1-7 at 30 m and band 8 (PAN band) at 15 m are resampled to 10 m using nearest neighbor interpolation;

(4) the resampled Landsat-8 images and Sentinel-2 images at 10 m are used as inputs to the ESRCNN to generate Landsat-8 bands 1-7 at 10 m (i.e., multi-temporal fusion of Landsat-8 and Sentinel-2 images).
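The sketch below strings the four steps together. It is illustrative only: the helper nearest_resize and the function fuse_landsat8 are hypothetical names, the two trained networks (self_adaptive_net, multitemporal_net) are assumed to exist, and every image is treated as a (bands, height, width) NumPy array on the appropriate grid.

```python
# A hedged sketch of the four-step ESRCNN workflow (not the authors' released code).
import numpy as np
import torch
import torch.nn.functional as F

def nearest_resize(bands, size):
    """Nearest-neighbour resampling of a (C, H, W) array to a target (H, W) grid."""
    t = torch.from_numpy(bands).unsqueeze(0).float()
    return F.interpolate(t, size=size, mode="nearest").squeeze(0).numpy()

def fuse_landsat8(l8_30m, l8_pan_15m, s2_10m_list, s2_20m_list,
                  self_adaptive_net, multitemporal_net):
    grid = s2_10m_list[0].shape[1:]                      # the 10 m pixel grid

    # Steps 1-2: self-adaptive fusion -> Sentinel-2 bands 11-12 at 10 m, per acquisition date
    s2_sharp = []
    for b10, b20 in zip(s2_10m_list, s2_20m_list):
        x = np.concatenate([b10, nearest_resize(b20, grid)], axis=0)   # bands 2-4, 8, 11-12
        with torch.no_grad():
            swir = self_adaptive_net(torch.from_numpy(x)[None].float())
        s2_sharp.append(np.concatenate([b10, swir.squeeze(0).numpy()], axis=0))

    # Step 3: resample Landsat-8 bands 1-7 (30 m) and the PAN band (15 m) to the 10 m grid
    l8_up = nearest_resize(l8_30m, grid)
    pan_up = nearest_resize(l8_pan_15m, grid)

    # Step 4: multi-temporal fusion -> Landsat-8 bands 1-7 at 10 m
    x = np.concatenate([l8_up, pan_up] + s2_sharp, axis=0)
    with torch.no_grad():
        fused = multitemporal_net(torch.from_numpy(x)[None].float())
    return fused.squeeze(0).numpy()
```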

In this study, the parameters of the fusion network were configured following SRCNN (Dong et al., 2016). More specifically, the kernel sizes (k1, k2, and k3) of the three layers were 9, 1, and 5, respectively. The numbers of output feature maps (N2 and N3) in the first two layers were 64 and 32, respectively. The numbers of input and output feature maps in the self-adaptive fusion were 6 (bands 2-4, 8, and 11-12) and 2 (bands 11-12), respectively. For multi-temporal fusion of Sentinel-2 and Landsat-8 images, the number of output feature maps was 7 (Landsat-8 bands 1-7), while the number of input feature maps depended on the number of auxiliary Sentinel-2 images used in the fusion network. For instance, if only one Sentinel-2 image was used for downscaling, N1 would be 14 (Sentinel-2 bands 2-4, 8, 11-12 and Landsat-8 bands 1-8).
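Using the illustrative ESRCNN class sketched in Section 2.2, these channel counts would be instantiated as follows (the variable names are hypothetical).

```python
# Self-adaptive fusion: 6 input bands (2-4, 8, 11-12) -> 2 outputs (11-12).
self_adaptive_net = ESRCNN(n1_in=6, n4_out=2)

# Multi-temporal fusion with one auxiliary Sentinel-2 image:
# 14 inputs (6 Sentinel-2 bands + Landsat-8 bands 1-8) -> 7 outputs (Landsat-8 bands 1-7),
# i.e., N1 = 8 + 6 * (number of auxiliary Sentinel-2 images).
multitemporal_net = ESRCNN(n1_in=14, n4_out=7)
```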

After the preparation of algorithm inputs and network configurations, the training stage for image fusion follows. Assume x is the input of the network and y is the desired output (or known reference data). Then, the training set can be described as \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{M}, where M is the number of training samples. The network training learns a mapping function f: \hat{y} = f(x; w), where \hat{y} is the predicted output and w is the set of all parameters, including filter weights and biases. With the help of an optimization function (equation 4), the fusion network learns to reduce the prediction error. Generally, for a deep learning network, the mean square error (equation 4) is used as the optimization function (also known as the loss function):

L = \frac{1}{n} \sum_{i=1}^{n} \left\| y^{(i)} - f(x^{(i)}; w) \right\|^2    (4)

where n is the number of training samples used in each iteration, L refers to the prediction error, y^{(i)} is the reference (desired output), f(x^{(i)}; w) is the predicted output, and || · || denotes the \ell_2 norm. Stochastic gradient descent with standard backpropagation (LeCun et al., 1998) is used for optimization. The weights are updated by equation (5):

\Delta_{t+1} = \mu \cdot \Delta_t - \eta \cdot \frac{\partial L}{\partial w_t^l}, \qquad w_{t+1}^l = w_t^l + \Delta_{t+1}    (5)

where \eta and \mu are the learning rate and momentum, \partial L / \partial w_t^l is the derivative of the loss with respect to the weights of layer l, and \Delta_t refers to the intermediate update value at iteration t. Following Dong et al. (2016), \eta is set to 10^{-4} (except for the last layer, where \eta is set to 10^{-5} for accelerating convergence), \mu is set to 0.9, the patch size (the spatial size of the training image patches) is 32 by 32 pixels (Supporting Information Table S1), and the batch size (a term used in machine learning to refer to the number of training samples in one iteration) is 128.
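The following training-loop sketch wires these settings together: the MSE loss of equation (4), stochastic gradient descent with momentum 0.9 as in equation (5), the smaller learning rate for the last layer, and a batch size of 128. It assumes the illustrative ESRCNN class from Section 2.2 and a hypothetical patch_loader that yields 32 × 32 input/reference patch pairs.

```python
# A minimal training loop matching the stated configuration (sketch, not the authors' code).
import torch
import torch.nn as nn

def train(net, patch_loader, n_epochs=100):
    criterion = nn.MSELoss()                               # eq. (4)
    optimizer = torch.optim.SGD(
        [{"params": net.layer1.parameters()},
         {"params": net.layer2.parameters()},
         {"params": net.layer3.parameters(), "lr": 1e-5}], # smaller rate for the last layer
        lr=1e-4, momentum=0.9)                             # eq. (5) with momentum mu = 0.9
    for epoch in range(n_epochs):
        for x, y in patch_loader:      # x: input patches, y: reference patches (batch of 128)
            optimizer.zero_grad()
            loss = criterion(net(x), y)
            loss.backward()
            optimizer.step()
    return net
```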

Figure 3. The workflow of the extended SRCNN (ESRCNN) for fusing Landsat-8 and Sentinel-2 images. Conv indicates a convolutional layer and ReLU (rectified linear unit) represents the nonlinear activation layer.

2.3 ATPRK for fusion of Landsat-8 and Sentinel-2 images

ATPRK is among the first methods used to downscale Landsat-8 images from 30 m to 10 m spatial resolution. The ATPRK approach performs the downscaling primarily through an area-to-point kriging (ATPK)-based residual downscaling scheme, used synergistically with regression-based overall trend estimation. The ATPRK approach can be regarded as an extension of either regression kriging or ATPK. Further technical details of this downscaling approach can be found in Wang et al. (2016). In this study, the effectiveness of the fusion network ESRCNN was benchmarked against ATPRK, which has outperformed other data fusion algorithms (Wang et al., 2017).
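For orientation only, the sketch below illustrates the regression-trend-plus-residual idea behind ATPRK under strong simplifications: the trend is a linear regression between the coarse Landsat-8 band and the aggregated Sentinel-2 covariates, and the residuals are simply upsampled bilinearly. ATPRK proper instead downscales the residuals with area-to-point kriging based on a fitted semivariogram (see Wang et al., 2016), which is not implemented here.

```python
# Drastically simplified illustration of regression trend + residual correction
# (bilinear upsampling stands in for area-to-point kriging of the residuals).
import numpy as np
from numpy.linalg import lstsq
from scipy.ndimage import zoom

def regression_residual_downscale(l8_coarse, s2_fine, factor=3):
    # aggregate the fine covariates to the coarse grid (mean within each coarse pixel)
    c, H, W = s2_fine.shape
    s2_coarse = s2_fine.reshape(c, H // factor, factor, W // factor, factor).mean(axis=(2, 4))

    # fit the linear trend at the coarse scale: l8 ~ a0 + a . s2
    X = np.column_stack([np.ones(s2_coarse[0].size)] + [b.ravel() for b in s2_coarse])
    coef, *_ = lstsq(X, l8_coarse.ravel(), rcond=None)

    # apply the trend at the fine scale
    Xf = np.column_stack([np.ones(s2_fine[0].size)] + [b.ravel() for b in s2_fine])
    trend_fine = (Xf @ coef).reshape(H, W)

    # residual correction (simplified; ATPRK uses ATPK weights here)
    residual_coarse = l8_coarse - (X @ coef).reshape(l8_coarse.shape)
    residual_fine = zoom(residual_coarse, factor, order=1)
    return trend_fine + residual_fine
```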

2.4 Evaluation metrics for data fusion

The performance of the deep learning-based fusion framework and ATPRK was assessed based on five indicators: the spectral angle mapper (SAM) (Alparone et al., 2007), root-mean-square error (RMSE), relative global-dimensional synthesis error (ERGAS) (Ranchin & Wald, 2000), correlation coefficient (CC), and universal image quality index (UIQI) (Wang & Bovik, 2002). These indicators have been widely used in assessing the performance of data fusion algorithms such as ATPRK (Wang et al., 2017).

The SAM (equation 6) measures the spectral angle between two vectors:

\mathrm{SAM}(v, \hat{v}) = \arccos\left( \frac{\langle v, \hat{v} \rangle}{\| v \|_2 \cdot \| \hat{v} \|_2} \right)    (6)

where v is the pixel vector formed by the reference image and \hat{v} is the corresponding vector formed by the fused image. SAM is first calculated on a per-pixel basis, and then all SAM values are averaged to a single value for the whole image.

The RMSE (equation 7) is used to evaluate the overall spectral differences between the reference image R and the fused image F:

\mathrm{RMSE}(R, F) = \sqrt{ \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ R(i, j) - F(i, j) \right]^2 }    (7)

The ERGAS provides a global quality evaluation of the fusion result and is calculated as in equation 8:

\mathrm{ERGAS} = 100 \cdot \frac{h}{l} \sqrt{ \frac{1}{k} \sum_{i=1}^{k} \left[ \mathrm{RMSE}(i) / \mathrm{Mean}(i) \right]^2 }    (8)

where h/l is the ratio between the spatial resolution of the fused Landsat-8 image and that of the original Landsat-8 image, k is the number of bands of the fused image, Mean(i) is the mean value of the ith band of the reference image, and RMSE(i) is the root mean squared error between the ith band of the reference image and that of the fused image.

The CC (equation 9) indicates the spectral correlation between the reference image R and the fused image F:

\mathrm{CC}(R, F) = \frac{ \sum_{i=1}^{M} \sum_{j=1}^{N} [R(i, j) - \mu(R)][F(i, j) - \mu(F)] }{ \sqrt{ \sum_{i=1}^{M} \sum_{j=1}^{N} [R(i, j) - \mu(R)]^2 \sum_{i=1}^{M} \sum_{j=1}^{N} [F(i, j) - \mu(F)]^2 } }    (9)

The UIQI (equation 10) is a global fusion performance indicator:

Q(R, F) = \frac{ 4 \sigma_{RF} \cdot \mu(R) \cdot \mu(F) }{ (\sigma_R^2 + \sigma_F^2)(\mu(R)^2 + \mu(F)^2) }    (10)

where \mu refers to the mean value, \sigma represents the standard deviation, and \sigma_{RF} is the covariance between R and F. To calculate UIQI for the whole image, a sliding window was used to increase the differentiation capability and measure the local distortion of a fused image. The final UIQI was obtained by averaging all Q values over the sliding windows. The ideal values for SAM, ERGAS, RMSE, CC, and UIQI are 0, 0, 0, 1, and 1, respectively, attained if and only if Ri = Fi for all i = 1, 2, ..., N (where N is the number of pixels).
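The five metrics of equations (6)-(10) can be computed per band (or per image) with a few lines of NumPy, as in the hedged sketch below. The UIQI here is the global form of the Wang & Bovik index; the sliding-window averaging described above is omitted, and the ratio argument of ergas corresponds to h/l.

```python
# Illustrative NumPy implementations of SAM, RMSE, CC, UIQI, and ERGAS (eqs. 6-10).
import numpy as np

def sam_deg(ref, fused):
    """Mean spectral angle in degrees; ref and fused are (bands, H, W) arrays."""
    dot = (ref * fused).sum(axis=0)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(fused, axis=0)
    angle = np.arccos(np.clip(dot / np.maximum(norms, 1e-12), -1.0, 1.0))
    return np.degrees(angle).mean()

def rmse(ref, fused):
    return np.sqrt(np.mean((ref - fused) ** 2))

def cc(ref, fused):
    return np.corrcoef(ref.ravel(), fused.ravel())[0, 1]

def uiqi(ref, fused):
    r, f = ref.ravel(), fused.ravel()
    cov = np.cov(r, f)                  # 2 x 2 covariance matrix
    return 4 * cov[0, 1] * r.mean() * f.mean() / (
        (cov[0, 0] + cov[1, 1]) * (r.mean() ** 2 + f.mean() ** 2))

def ergas(ref, fused, ratio):
    """ratio = fine resolution / coarse resolution, e.g. 10/30 for Landsat-8 downscaling."""
    terms = [(rmse(ref[i], fused[i]) / ref[i].mean()) ** 2 for i in range(ref.shape[0])]
    return 100.0 * ratio * np.sqrt(np.mean(terms))
```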

3 Study area, satellite data, and synthetic data sets
The study area covers parts of Shijiazhuang, the capital and largest city of North China's Hebei Province. It is a city of agriculture and industry and has experienced dramatic growth in population and urban extent since the founding of the People's Republic of China in 1949. Two areas of Shijiazhuang, shown in Figure 4 and Figure 5, were used for training and testing the fusion network, respectively. The two areas contain diverse land cover types (e.g., urban areas, cropland, and forest) and exhibited significant changes in spectral reflectance within the study period (June 15 to July 7, 2017). Figure 4 shows the area with a spatial extent of 19.32 km × 25.20 km, i.e., the 10 m and 20 m Sentinel-2 bands have 1932 × 2520 and 966 × 1260 pixels, respectively, while the 30 m and 15 m Landsat-8 bands have 644 × 830 and 1288 × 1660 pixels, respectively. Figure 5 shows the study area with a spatial extent of 12 km × 12 km, i.e., the 10 m and 20 m Sentinel-2 bands have 1200 × 1200 and 600 × 600 pixels, respectively, while the 30 m and 15 m Landsat-8 bands have 400 × 400 and 800 × 800 pixels, respectively. Three subareas (1, 2, and 3) shown in Figure 4 and two subareas in Figure 5 were selected for visual and quantitative assessment of the fusion network in downscaling Landsat-8 images from low to high spatial resolution.

Sentinel-2 is an Earth observation mission of the European Space Agency (ESA), developed under the European Union Copernicus Programme to acquire optical imagery at high spatial resolution. In this study, Sentinel-2 images acquired on June 20, June 27, and July 7, 2017 were used as auxiliary inputs for the fusion framework. These Sentinel-2 images were from the collection of Level-1C products, which provide top of atmosphere (TOA) reflectance with sub-pixel multispectral registration accuracy in the UTM/WGS84 projection system (Drusch et al., 2012; Claverie et al., 2016, 2018). Two scenes of NASA/USGS Landsat-8 TOA reflectance data (L1TP) acquired on June 15 and July 1, 2017 were also collected. The Landsat-8 images were processed with standard terrain correction in the UTM/WGS84 projection system (USGS, 2015). Both the Landsat-8 and Sentinel-2 images were captured under clear-sky conditions, and these dates were selected to illustrate the effectiveness of the fusion network in yielding downscaled Landsat-8 images at 10 m spatial resolution that can be combined with Sentinel-2 images to generate more frequent and consistent observations.

Naturally, for the self-adaptive fusion, the fusion network can downscale the Sentinel-2 bands 11 and 12 from 20 m to 10 m with the Sentinel-2 bands 2-4 and 8 as inputs; however, reference data at 10 m are not available to evaluate the performance of the fusion network. Thus, in this study, synthetic data sets were constructed per Wald's protocol (Wald et al., 1997). Specifically, the Sentinel-2 spectral bands 2-4 and 8 were degraded by a factor of 2 using a Gaussian model-based degradation function (also known as the point spread function, PSF). The resampled Sentinel-2 bands 2-4 and 8 at 20 m and bands 11-12 at 40 m were then fused in the proposed fusion network to yield bands 11-12 at 20 m. The original Sentinel-2 bands 11-12 at 20 m were used for network training and validation. Note that the input data sets for the fusion framework should all be of the same size (spatial extent and pixel size). In other words, the Sentinel-2 bands at 40 m should be interpolated (nearest neighbor interpolation) to 20 m and used as inputs for the proposed fusion algorithm. With data augmentation (rotation and scaling), 36,864 patches were randomly selected for training and 9,216 patches for validating the fusion network (further summarized in Supporting Information Table S2).
Following the self-adaptive fusion of Sentinel-2, the Sentinel-2 spectral bands 2-4, 8, 11, and 12 at 10 m were used as auxiliary data for downscaling the Landsat-8 spectral bands from 30 m to 10 m. All input images were degraded by a factor of 3 since reference Landsat-8 bands at 10 m were not available. Specifically, the Sentinel-2 bands at 10 m and the Landsat-8 spectral and panchromatic bands at 30 m and 15 m were resampled to 30 m, 90 m, and 45 m, respectively. These resampled data sets were then used as inputs to the fusion framework to yield Landsat-8 bands 1-7 at 30 m. The original Landsat-8 bands 1-7 at 30 m were used for training and validating the fusion network. As in the self-adaptive fusion, all the input images should be of the same size (spatial extent and pixel size). In the multi-temporal fusion step, 12,160 patches were randomly selected for training and 3,040 patches for validating the fusion network (further summarized in Supporting Information Table S2).
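The Wald-protocol degradation used to build both synthetic data sets (factor 2 for the self-adaptive fusion, factor 3 for the multi-temporal fusion) can be sketched as follows: blur with a Gaussian PSF, decimate by the scale factor, and bring the coarse band back to the working grid with nearest neighbor interpolation. The Gaussian width is an assumption; the text only states that a Gaussian PSF was used.

```python
# Sketch of Wald's-protocol degradation and nearest-neighbour regridding.
from scipy.ndimage import gaussian_filter, zoom

def degrade(band, factor, sigma=1.0):
    """Simulate a coarser-resolution version of a 2-D band (Gaussian PSF + decimation)."""
    blurred = gaussian_filter(band, sigma=sigma)
    return blurred[::factor, ::factor]

def to_working_grid(coarse_band, factor):
    """Nearest-neighbour interpolation back to the fine pixel grid."""
    return zoom(coarse_band, factor, order=0)

# e.g., Sentinel-2 bands 2-4, 8 at 10 m -> 20 m:   s2_20m = degrade(s2_10m_band, factor=2)
#       Landsat-8 bands at 30 m -> 90 m:           l8_90m = degrade(l8_30m_band, factor=3)
```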

Figure 4. Landsat-8 and Sentinel-2 data sets used in the training phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2, and 3) were used for visual assessment.

Figure 5. Landsat-8 and Sentinel-2 data sets used in the testing phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2) were used for evaluating the fusion network performance.

4 Results

4.1 Self-adaptive fusion of Sentinel-2 images

Figure 6 shows the self-adaptive fusion results for the Sentinel-2 data acquired on June 20, June 27, and July 7, 2017 for subarea 1 in Figure 4. Visually, the fused 20 m Sentinel-2 bands 11-12 (Figure 6 (c)) were almost identical to the original 20 m Sentinel-2 bands 11-12 (Figure 6 (b)), indicating the good performance of the proposed fusion framework. In addition, the fusion framework revealed much finer spatial details, as shown in the fused 10 m Sentinel-2 bands 11-12 (Figure 6 (d)). Table 2 presents the accuracy assessment of the self-adaptive fusion results for the training area (Figure 4). Clearly, ESRCNN outperformed ATPRK in downscaling Sentinel-2 bands 11-12 from 40 m to 20 m spatial resolution. The RMSE, CC, and UIQI values provided by ESRCNN were much closer to their corresponding ideal values (0, 1, 1) than those provided by ATPRK. SAM and ERGAS between the ESRCNN downscaled and reference images were ~0.8 and ~2.0 for the three dates, less than the ~1.4 and ~5.6 between the ATPRK downscaled and reference images. The two fusion approaches were then used to generate the fused 10 m Sentinel-2 bands 11-12, which were subsequently used as inputs for downscaling Landsat-8 images from 30 m to 10 m spatial resolution.

Table 2. Comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11-12 from 40 m to 20 m on three dates (June 20, June 27, and July 7, 2017). The bold number indicates the best value for accuracy assessment.


June 20, 2017 June 27, 2017 July 7, 2017
Metrics Band
ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN
Band 11 0.0178 0.0065 0.0316 0.0095 0.0304 0.0104
RMSE
Band 12 0.0231 0.0093 0.0324 0.0110 0.0315 0.0119
Band 11 0.9786 0.9962 0.9804 0.9975 0.9769 0.9963
CC
Band 12 0.9772 0.9954 0.9813 0.9970 0.9793 0.9960
Band 11 0.9556 0.9920 0.9575 0.9948 0.9540 0.9929
UIQI
Band 12 0.9511 0.9918 0.9590 0.9942 0.9567 0.9928
ERGAS 5.1905 2.0211 5.7884 1.8727 5.6605 2.0516
SAM (degree) 1.3528 0.6515 1.4396 0.7435 1.5220 0.7420

Figure 6. The results of self-adaptive fusion on June 20, June 27, and July 7, 2017 for subarea 1 in Figure 4. (a) The degraded Sentinel-2 bands at 40 m, (b) the reference Sentinel-2 bands at 20 m, (c) the fused Sentinel-2 bands at 20 m, and (d) the fused Sentinel-2 bands at 10 m.

4.2 Multi-temporal fusion of Landsat-8 and Sentinel-2 images

Figure 7 shows the multi-temporal fusion results for subarea 1 in Figure 4. The Landsat-8 images acquired on June 15 and July 1, 2017 (Figure 7 (e) and (h)) were individually combined with the three Sentinel-2 images acquired on June 20, June 27, and July 7, 2017 (Figure 7 (a), (b), and (c)) to yield the fusion results. Visually, the resampled Landsat-8 images at 90 m (Figure 7 (d) and (g)) were sharply improved in the fused Landsat-8 images at 30 m (Figure 7 (f) and (i)). The fused Landsat-8 images at 30 m (Figure 7 (f) and (i)) were visually similar to the original Landsat-8 images at 30 m (Figure 7 (e) and (h)). Table 3 lists the values of the five accuracy assessment indices for the entire training area. For the fused images of these two dates, the fusion network produced an RMSE less than 0.05, CC larger than 0.95, UIQI greater than 0.94, ERGAS less than 1.2, and SAM less than 0.8. These metrics clearly show that the fused images were spectrally and spatially consistent with the original images.

The proposed fusion framework also performed well in downscaling Landsat-8 images in areas experiencing land use and land cover (LULC) changes. For example, the yellow circles in Figure 8 (a) and (c) highlight LULC changes due to the planting of crops within subarea 2 from June 20 to July 7, 2017. Although the Sentinel-2 image acquired on July 7, 2017 was input to the fusion framework, the final fused image for June 15, 2017 (Figure 8 (f)) did not incorporate such LULC changes. In addition, the Landsat-8 image on July 1, 2017 was more similar to the Sentinel-2 image on July 7, 2017 than to the other two Sentinel-2 images. Despite the obvious LULC changes among the three Sentinel-2 images used as inputs to the fusion network, the final fused image at 30 m for July 1, 2017 still did not incorporate these LULC changes (Figure 8 (i)). These findings suggest that the fusion network can identify areas experiencing LULC changes and then remove features associated with spectral changes that are not consistent with the target image.

Figure 7. The multi-temporal fusion results for subarea 1 in Figure 4. (a)-(c) are the resampled Sentinel-2 images at 30 m on June 20, June 27, and July 7, 2017. (d)-(f) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on June 15, 2017. (g)-(i) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on July 1, 2017.

Figure 8. The multi-temporal fusion results for subarea 2 in Figure 4 (bands 5, 4, 3 as RGB). (a)-(c) are three Sentinel-2 images at 10 m acquired on June 20, June 27, and July 7, 2017. (d)-(f) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on June 15, 2017. (g)-(i) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on July 1, 2017. The yellow circle highlights the land use and land cover (LULC) changes due to planting of crops (a-c).

Table 3. Comparisons between the ESRCNN fused and reference Landsat-8 images at 30 m.
Date Band RMSE CC UIQI ERGAS SAM (degree)
Band 1 0.0243 0.9860 0.9710
Band 2 0.0242 0.9869 0.9737
Band 3 0.0260 0.9861 0.9726
June 15, 2017 Band 4 0.0258 0.9875 0.9755 1.1817 0.7042
Band 5 0.0405 0.9569 0.9415
Band 6 0.0263 0.9848 0.9738
Band 7 0.0263 0.9864 0.9757
Band 1 0.0193 0.9899 0.9683
Band 2 0.0192 0.9904 0.9732
Band 3 0.0210 0.9894 0.9731
July 01, 2017 Band 4 0.0217 0.9897 0.9749 1.0741 0.7904
Band 5 0.0399 0.9599 0.9420
Band 6 0.0254 0.9830 0.9698
Band 7 0.0254 0.9854 0.9727

4.3 Image fusion with a flexible number of auxiliary Sentinel-2 images

In the previous section, the downscaling of Landsat-8 images was accomplished with three auxiliary Sentinel-2 images. However, it may not always be possible to have three or more Sentinel-2 images as inputs to the fusion framework due to cloud, shadow, and snow contamination. Thus, the impact of the number of auxiliary Sentinel-2 images on the downscaling performance was evaluated in this section. A series of data sets with a varying number of Sentinel-2 images as inputs to the fusion framework was prepared: (1) D0 (Landsat-8 image on July 1, 2017 and no auxiliary Sentinel-2 image); (2) D1 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 27, 2017); (3) D2 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on July 7, 2017); (4) D3 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 20, 2017); (5) D1+D2; (6) D2+D3; (7) D1+D3; (8) D1+D2+D3. Note that even if no auxiliary Sentinel-2 image is input to the fusion framework (D0), the fusion network can still reconstruct a high-resolution image at 10 m from the Landsat-8 image at 30 m.

Table 4 presents the accuracy assessment of the downscaling results from the eight different data sets against the original images at 30 m. The accuracy assessment was performed for the entire training area (Figure 4). The fused image generated from D0 had the least accurate results for all bands. Since the downscaling scheme D0 solely used the Landsat-8 PAN band without any Sentinel-2 information, the results suffered from an inherent pan-sharpening problem (Alparone et al., 2007; Liu et al., 2016): the PAN band was only captured at 15 m and its wavelength range only covers those of Landsat-8 bands 2-4 but not bands 5-7. The performance of the fusion network degraded as the time interval between the Sentinel-2 and Landsat-8 images increased. This finding was evidenced by the more accurate fusion results from D1 than those from D2 or D3 (Table 4) when only one auxiliary Sentinel-2 image was used for downscaling the Landsat-8 image. The time intervals between the Landsat-8 image and the Sentinel-2 images in D1, D2, and D3 were 4, 6, and 11 days, respectively. When two auxiliary Sentinel-2 images were used for downscaling the Landsat-8 image, the performance of the fusion network improved as the total time interval between the Landsat-8 image and the two Sentinel-2 images decreased. Table 4 shows that the fused image generated from D1+D2 (total time interval of 10 days) was more accurate than those generated from D2+D3 (total time interval of 17 days) and D1+D3 (total time interval of 15 days). Additionally, the fused image generated from D1+D3 had a higher accuracy level than that generated from D2+D3. Overall, the fusion accuracy improved as the number of auxiliary Sentinel-2 images input to the fusion network increased.

Further visual examination of the downscaled Landsat-8 images at 10 m suggested that the performances of the fusion networks trained with the eight different data sets were not consistent for areas experiencing LULC changes. Figure 9 shows the LULC changes caused by temporary housing (houses temporarily built near construction sites, i.e., the bright dots marked by the red rectangle) in subarea 3. As shown in Figure 9, the number of temporary housing units (i.e., bright dots) in the Landsat-8 image on July 1, 2017 and in the Sentinel-2 images on June 20, June 27, and July 7, 2017 was 4, 2, 3, and 5, respectively. For the fused image derived from D1 (Figure 10 (a)), the number of bright dots was the same as that in Figure 9 (a). However, the dots in the green rectangle (Figure 10 (a)) were relatively blurry compared to the other dots in the fused image. The number of bright dots in the fused image derived from D2 was five, while the top two dots in the fused image derived from D3 were relatively blurry. In contrast, the fusion results derived from two auxiliary Sentinel-2 images (D1+D2, D2+D3, and D1+D3) were much better, although blurry dots were still observed. The best fusion result was observed for the image derived from D1+D2+D3 according to both the qualitative (visual) and quantitative assessments (Table 4). These findings reveal that the fusion network can learn LULC changes given a sufficient number of auxiliary Sentinel-2 images as inputs.

Table 4. The impact of the number of auxiliary Sentinel-2 images input to the fusion network on the downscaling performance. The bold number in each row indicates the best value for accuracy assessment.
D0 D1 D2 D3 D1+D2 D2+D3 D1+D3 D1+D2+D3
Band1 0.0223 0.0147 0.0154 0.0170 0.0144 0.0147 0.0145 0.0141
Band2 0.0223 0.0143 0.0151 0.0166 0.0138 0.0142 0.0139 0.0136
Band3 0.0221 0.0159 0.0171 0.0174 0.0154 0.0159 0.0156 0.0153
RMSE Band4 0.0225 0.0158 0.0170 0.0174 0.0153 0.0158 0.0156 0.0151
Band5 0.0610 0.0265 0.0286 0.0347 0.0245 0.0264 0.0264 0.0240
Band6 0.0325 0.0185 0.0202 0.0222 0.0175 0.0186 0.0181 0.0173
Band7 0.0310 0.0180 0.0201 0.0222 0.0173 0.0184 0.0178 0.0172
Band1 0.9865 0.9942 0.9936 0.9923 0.9945 0.9942 0.9944 0.9947
Band2 0.9870 0.9947 0.9941 0.9829 0.9951 0.9948 0.9950 0.9952
Band3 0.9883 0.9940 0.9930 0.9928 0.9944 0.9940 0.9942 0.9945
CC Band4 0.9888 0.9945 0.9936 0.9934 0.9948 0.9946 0.9946 0.9950
Band5 0.9034 0.9826 0.9796 0.9697 0.9851 0.9827 0.9827 0.9856
Band6 0.9720 0.9910 0.9893 0.9871 0.9920 0.9909 0.9914 0.9922
Band7 0.9781 0.9926 0.9909 0.9888 0.9933 0.9923 0.9929 0.9934
Band1 0.9619 0.9813 0.9802 0.9755 0.9822 0.9813 0.9820 0.9829
Band2 0.9670 0.9857 0.9843 0.9807 0.9866 0.9858 0.9863 0.9871
Band3 0.9712 0.9851 0.9832 0.9825 0.9865 0.9853 0.9858 0.9869
UIQI Band4 0.9733 0.9876 0.9855 0.9847 0.9884 0.9876 0.9876 0.9886
Band5 0.8503 0.9751 0.9706 0.9565 0.9791 0.9755 0.9754 0.9798
Band6 0.9510 0.9841 0.9823 0.9791 0.9864 0.9851 0.9853 0.9869
Band7 0.9614 0.9868 0.9842 0.9810 0.9882 0.9867 0.9874 0.9885
ERGAS 3.0491 1.7754 1.9167 2.1033 1.6972 1.7785 1.7474 1.6768
SAM (degree) 4.6004 2.4788 2.5986 2.9878 2.3462 2.4563 2.4394 2.3162

Note: A total of eight different data sets were prepared as inputs to the fusion network: (1) D0 (Landsat-8 image on July 1, 2017 and no auxiliary Sentinel-2 image); (2) D1 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 27, 2017); (3) D2 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on July 7, 2017); (4) D3 (Landsat-8 image on July 1, 2017 and only one Sentinel-2 image on June 20, 2017); (5) D1+D2; (6) D2+D3; (7) D1+D3; (8) D1+D2+D3.
Figure 9. Subarea 3 in Figure 4 experienced changes in temporary housing (bright dots; band 2 gray-scale images are shown here). The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image. The number of bright dots (temporary housing) in the Landsat-8 image on (c) July 1, 2017, and in the three Sentinel-2 images on (a) June 20, (b) June 27, and (d) July 7, 2017 was 4, 2, 3, and 5, respectively.

Figure 10. The downscaled Landsat-8 image at 10 m (bands 7, 6, 2 as RGB) on July 1, 2017 derived from different input data sets: (a) D1, (b) D2, (c) D3, (d) D1+D2, (e) D2+D3, (f) D1+D3, and (g) D1+D2+D3. The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image.

4.4 Comparison between ATPRK and ESRCNN for image fusion

In the previous sections, we described the steps to train the fusion network and presented its performance in downscaling Sentinel-2 and Landsat-8 images for the training area (Figure 4). In this section, we applied the trained network to a different domain within Shijiazhuang (12 km by 12 km, as shown in Figure 5). The performance of the fusion network in downscaling Sentinel-2 bands 11-12 from 40 m to 20 m and Landsat-8 images from 30 m to 10 m for this study area was benchmarked against ATPRK.

Table 5 shows the comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11-12 from 40 m to 20 m for the images acquired on June 20, June 27, and July 7, 2017. Clearly, the evaluation results revealed that ESRCNN outperformed ATPRK in the self-adaptive fusion of Sentinel-2 images. The downscaled Sentinel-2 bands 11-12 at 10 m (resampled to 30 m) were further used in downscaling Landsat-8 images from 90 m to 30 m.

Four groups of data sets were used to compare the performance of the two algorithms in downscaling Landsat-8 images: (1) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 20, 2017, (2) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 27, 2017, (3) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 27, 2017, and (4) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 20, 2017. Only one Sentinel-2 image was included in each group since ATPRK can only take one Sentinel-2 image as input for the downscaling process. Thus, the use of only one Sentinel-2 image per group ensured a fair comparison between ATPRK and ESRCNN. The original Landsat-8 and Sentinel-2 images in each group were resampled by a factor of 3 and then input to the two algorithms to yield fused images at 30 m. The temporal intervals between the Sentinel-2 and Landsat-8 images for the four groups were 5, 12, 4, and 11 days, respectively.

The proposed ESRCNN framework outperformed ATPRK in downscaling Landsat-8 images from 30 m to 10 m based on all evaluation metrics (Table 6). Figure 11 and Figure 12 further show that the fused reflectance yielded by ESRCNN matched the reference reflectance more closely than that yielded by ATPRK. In addition, the evaluation metrics for Group 1 (or 3) were better than those for Group 2 (or 4), further stressing the importance of the temporal interval in image acquisition time between the auxiliary Sentinel-2 and the target Landsat-8 images (as shown in Section 4.3). Examination of the downscaled images yielded by ATPRK revealed spectral distortions. For example, over-sharpened building boundaries were observed in Figure 13. As suggested by the distribution of pixel values from all spectral bands in Figure 14, ATPRK tended to change the distribution of reflectance values. Compared to the distribution curves in Figure 14 (a, b, d, e), those in Figure 14 (c, f) were shorter and flatter (based on the number of pixels at each gray level). In contrast, the fusion network was able to preserve the data distribution of the original image.

Table 7 shows that ESRCNN exhibited better accuracy than ATPRK in downscaling Landsat-8 images in areas experiencing LULC changes (mostly due to changes in vegetation coverage, as shown in Figure 15 (a) and (b)). Both ESRCNN and ATPRK can identify LULC changes and reproduce the original spectral information when downscaling Landsat-8 images (Figure 15 (a), (c), and (d)). However, the fused image generated by ATPRK is over-sharpened compared to that generated by ESRCNN (Figure 15 (c) and (d)).

Figure 11. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on June 15, 2017. (a)-(f) are fused reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 20, 2017. (g)-(l) are fused reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 20, 2017. The color scheme indicates point density.

Figure 12. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on July 1, 2017. (a)-(f) are fused reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 27, 2017. (g)-(l) are fused reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 27, 2017. The color scheme indicates point density.
Figure 13. The multi-temporal fusion results for subarea 1 indicated by the red rectangle in Figure 5 (bands 4, 3, 2 as RGB). The original Landsat-8 images at 30 m on (a) June 15 and (e) July 1, 2017. The resampled Landsat-8 images at 90 m on (b) June 15 and (f) July 1, 2017. The fused Landsat-8 images (ESRCNN) at 30 m on (c) June 15 and (g) July 1, 2017. The fused Landsat-8 images (ATPRK) at 30 m on (d) June 15 and (h) July 1, 2017.

Figure 14. The distribution of pixel values from all spectral bands in the original and fused Landsat-8 images. (a)-(c) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on June 15, 2017. (d)-(f) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on July 1, 2017.

Figure 15. Fusion results for subarea 2 in Figure 5, which experienced changes in vegetation coverage (bands 5, 4, 3 as RGB). (a) The 30 m Landsat-8 image on June 15, 2017, (b) the 30 m Sentinel-2 image on July 7, 2017, (c) the 90 m Landsat-8 image to be downscaled on June 15, 2017, (d) the ESRCNN downscaled image at 30 m, and (e) the ATPRK downscaled image at 30 m.

Table 5. Comparisons between ATPRK and ESRCNN for downscaling Sentinel-2 bands 11-12 from 40 m to 20 m for the test area (Figure 5). The bold number indicates the best value for accuracy assessment.


June 20, 2017 June 27, 2017 July 07, 2017
Metrics Band
ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN
Band 11 0.0252 0.0135 0.0424 0.0218 0.0468 0.0248
RMSE
Band 12 0.0325 0.1869 0.0480 0.0278 0.0571 0.0322
Band 11 0.9531 0.9822 0.9545 0.9842 0.9491 0.9808
CC
Band 12 0.9479 0.9774 0.9484 0.9764 0.9397 0.9721
Band 11 0.9143 0.9693 0.9156 0.9713 0.9081 0.9662
UIQI
Band 12 0.9088 0.9614 0.9071 0.9584 0.8964 0.9548
ERGAS 5.8925 3.3045 6.6753 3.6917 7.2484 3.9988
SAM (degree) 1.3316 0.8220 1.6193 1.0906 1.9884 1.1695

Table 6. Comparisons between ATPRK and ESRCNN for downscaling the Landsat-8 images acquired on June 15 and July 1, 2017 from 90 m to 30 m for the entire test area (Figure 5). The bold numbers in each row indicate the best values for accuracy assessment.
Group 1 Group 2 Group 3 Group 4
June 15, 2017 June 15, 2017 July 1, 2017 July 1, 2017
Metrics Band
ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN ATPRK ESRCNN
Band 1 - 0.0426 - 0.0434 - 0.0317 - 0.0319
Band 2 0.0800 0.0426 0.0843 0.0442 0.0694 0.0338 0.0732 0.0349
Band 3 0.0864 0.0444 0.0908 0.0468 0.0777 0.0392 0.0776 0.0421
Band 4 0.0975 0.0457 0.09555 0.0491 0.0863 0.0406 0.0806 0.0449
RMSE
Band 5 0.0851 0.0415 0.0835 0.0452 0.0903 0.0524 0.0941 0.0448
Band 6 0.0827 0.0458 0.0824 0.0466 0.0874 0.0462 0.0912 0.0492
Band 7 0.0901 0.0461 0.0904 0.0490 0.0877 0.0441 0.0915 0.0498
Mean 0.0870 0.0455 0.0878 0.0463 0.0831 0.0412 0.0847 0.0425
Band 1 - 0.9513 - 0.9493 - 0.9712 - 0.9708
Band 2 0.9131 0.9539 0.9089 0.9504 0.9194 0.9679 0.9139 0.9657
Band 3 0.9130 0.9547 0.9091 0.9493 0.9171 0.9601 0.9200 0.9539
Band 4 0.9072 0.9581 0.9118 0.9511 0.9078 0.9583 0.9197 0.9483
CC
Band 5 0.8251 0.9395 0.8320 0.9236 0.8443 0.9188 0.8358 0.9412
Band 6 0.9100 0.9520 0.9102 0.9495 0.8904 0.9432 0.8831 0.9347
Band 7 0.9159 0.9588 0.9152 0.9525 0.9025 0.9508 0.8966 0.9365
Mean 0.8974 0.9469 0.8979 0.9465 0.8969 0.9529 0.8948 0.9502
Band 1 - 0.9195 - 0.9134 - 0.9224 - 0.9173
Band 2 0.8372 0.9224 0.8259 0.9136 0.8180 0.9260 0.8050 0.9163
Band 3 0.8335 0.9256 0.8228 0.9139 0.8298 0.9256 0.8322 0.9095
Band 4 0.8178 0.9286 0.8243 0.9137 0.8184 0.9279 0.8379 0.9070
UIQI
Band 5 0.7401 0.8949 0.7490 0.8790 0.7429 0.8486 0.7290 0.8935
Band 6 0.8353 0.9204 0.8355 0.9150 0.8105 0.9090 0.7982 0.8936
Band 7 0.8360 0.9294 0.8348 0.9164 0.8239 0.9229 0.8130 0.8969
Mean 0.8166 0.9115 0.8154 0.9093 0.8073 0.9118 0.8025 0.9049
ERGAS 12.7934 6.8257 12.9690 6.9992 12.6265 6.2570 12.8198 6.5302
SAM (degree) 8.5484 6.8257 8.6148 4.6881 8.4717 4.3759 8.6556 4.2710

Note: Four groups of data sets were used: (1) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 20, 2017; (2) the Landsat-8 image on June 15, 2017 and the Sentinel-2 image on June 27, 2017; (3) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 27, 2017; and (4) the Landsat-8 image on July 1, 2017 and the Sentinel-2 image on June 20, 2017.

Table 7. Comparisons between ATPRK and ESRCNN for downscaling the Landsat-8 image acquired on June 15, 2017 (with the Sentinel-2 image on July 7, 2017 as auxiliary data) from 90 m to 30 m for the area experiencing LULC changes (Figure 15). The bold number indicates the best value for accuracy assessment.

Metrics    RMSE     CC       UIQI     ERGAS     SAM (degree)
ATPRK      0.0871   0.8830   0.8054   11.4530   7.5243
ESRCNN     0.0492   0.9131   0.8895   6.6506    4.1589

639

5 Discussion

Presently, there is a need for satellite data with a higher temporal resolution than can be provided by a single imaging sensor (e.g., Landsat-8 or Sentinel-2) to improve understanding of land surface changes and their causal agents (Claverie et al., 2018). The synergistic use of Landsat-8 and Sentinel-2 data provides a promising avenue to satisfy this scientific demand, yet it requires coordination of the spatial resolution gap between the two sensors. To this end, this study presents a deep learning-based fusion approach to downscale Landsat-8 images from 30 m spatial resolution to 10 m. The downscaled Landsat-8 images at 10 m spatial resolution, together with Sentinel-2 data at 10 m, can provide temporally dense and routine information at a short time interval suitable for environmental applications such as crop monitoring (Veloso et al., 2017).

Compared to data fusion algorithms such as STARFM and its variants (Gao et al., 2016; Hilker et al., 2009; Zhu et al., 2010; Zhu et al., 2016), the fusion network does not require as input one or more pairs of satellite images acquired on the same date, a demanding prerequisite when blending Landsat-8 and Sentinel-2 images. In contrast, the proposed fusion network can accommodate a flexible number of auxiliary Sentinel-2 images to enhance the target Landsat-8 image to a higher spatial resolution. This flexibility makes it possible for the fusion network to leverage several Sentinel-2 images acquired on dates relatively close to that of the target Landsat-8 image for reflectance prediction. ATPRK, a geostatistical approach for data fusion (Wang et al., 2016; Wang et al., 2017), does not have this flexibility to digest several auxiliary Sentinel-2 images simultaneously because of the computational difficulty of calculating a cokriging matrix containing spectral bands from all input images for cross-semivariogram modeling. This difference in the ability to use multi-temporal Sentinel-2 images also explains the better performance of the fusion network in generating downscaled Landsat-8 images at a high spatial resolution (Table 5). Examination of the downscaled Landsat-8 images highlights the advantage of the fusion network over ATPRK in preserving the distribution of spectral values of the original image. The high-fidelity reflectance values and the preservation of the statistical distribution of reflectance values provided by the fusion network hold important merits for time series analysis of satellite images. For example, Zhu et al. (2014) and Zhu et al. (2019) showed that continuous land cover change detection and classification is sensitive to anomalous observations that may affect the accurate modeling of time series reflectance values.
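To make the flexible-input idea concrete, the following sketch shows one way a variable number of auxiliary Sentinel-2 scenes could be stacked with a target Landsat-8 image along the channel dimension and fed to an SRCNN-style network. The layer sizes, band counts, and PyTorch framing are illustrative assumptions for this discussion, not the exact ESRCNN configuration released with this paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNNStyleFusion(nn.Module):
    # Three-layer SRCNN-style network; in_channels depends on how many
    # auxiliary Sentinel-2 scenes are stacked with the target Landsat-8 bands.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=5, padding=2),
        )

    def forward(self, x):
        return self.net(x)

def build_input(landsat_30m, sentinel_10m_list):
    # Upsample the 30 m Landsat-8 bands to the 10 m grid and stack them with
    # the 10 m bands of all available auxiliary Sentinel-2 scenes.
    h, w = sentinel_10m_list[0].shape[-2:]
    landsat_up = F.interpolate(landsat_30m, size=(h, w), mode='bicubic', align_corners=False)
    return torch.cat([landsat_up] + sentinel_10m_list, dim=1)

# Hypothetical example: 7 Landsat-8 bands at 30 m and two auxiliary Sentinel-2
# scenes with 4 bands each at 10 m.
landsat = torch.rand(1, 7, 100, 100)
sentinels = [torch.rand(1, 4, 300, 300) for _ in range(2)]
x = build_input(landsat, sentinels)
model = SRCNNStyleFusion(in_channels=x.shape[1], out_channels=7)
fused_10m = model(x)  # predicted Landsat-8 bands on the 10 m grid

Because only the first convolution depends on the number of stacked channels, adding or removing an auxiliary scene changes the size of a single weight tensor rather than the overall design.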
The twin Sentinel-2 satellites together provide consistent records of the Earth's surface at a nominal revisit cycle of 5 days. Thus, one to three Sentinel-2 images captured on dates relatively close to that of the target Landsat-8 image may be available as inputs to the fusion network. The performance of the fusion network in downscaling Landsat-8 images from 30 m spatial resolution to 10 m improved as the number of auxiliary Sentinel-2 images input into the network increased. In addition, the temporal interval (i.e., difference in acquisition dates) between the target Landsat-8 image and the auxiliary Sentinel-2 images also affected the downscaling performance of the fusion network (Table 5). For example, when only one auxiliary Sentinel-2 image was fed into the fusion network, the best downscaling performance was achieved for data set D1 (Table 5), in which the temporal interval between the target Landsat-8 image and the Sentinel-2 image was the smallest compared to that in D2 and D3. Even when auxiliary Sentinel-2 images are not available (e.g., because of contamination by clouds, shadows, or snow), the fusion network can still yield a high-resolution image at 10 m for a target date with a high level of accuracy (Table 4). As such, the proposed fusion network is more suitable than existing fusion algorithms for practical applications requiring dense time series of images. In addition, the fusion network can generate Landsat-8 band 1 at 10 m by learning complex relationships across all spectral bands within and among images.
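As a small practical illustration of this point, the snippet below (with hypothetical dates, not part of the released code) ranks candidate Sentinel-2 acquisitions by their temporal interval to a target Landsat-8 date and keeps the closest ones.

from datetime import date

def select_auxiliary(target_date, sentinel_dates, max_scenes=3):
    # Rank candidate Sentinel-2 dates by their absolute temporal interval to the
    # target Landsat-8 acquisition and keep at most max_scenes of them.
    ranked = sorted(sentinel_dates, key=lambda d: abs((d - target_date).days))
    return ranked[:max_scenes]

target = date(2017, 7, 1)  # target Landsat-8 acquisition
candidates = [date(2017, 6, 20), date(2017, 6, 27), date(2017, 7, 7)]
print(select_auxiliary(target, candidates))  # closest acquisitions first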
One critical issue in the spatiotemporal fusion of satellite images is that LULC changes may occur at the temporal scale of days or weeks. Currently, only a few data fusion algorithms can handle the downscaling process while accounting for LULC changes. For example, learning-based data fusion methods such as the sparse-representation-based spatiotemporal reflectance fusion model (Huang & Song, 2012) and the extreme learning machine-based fusion method (Liu et al., 2016) have proved able to capture LULC changes. However, these learning-based methods are only suitable for relatively homogeneous areas and thus may not be directly applicable to the study area of this paper, which contains abundant landforms and, particularly in cities, high spatial heterogeneity. The most recent variant of STARFM, the Flexible Spatiotemporal DAta Fusion method (FSDAF) (Zhu et al., 2016), has the potential to capture both gradual and abrupt LULC changes for data fusion by using spectral unmixing analysis and thin plate spline interpolation. It is worth noting that the ability of FSDAF to capture LULC changes depends heavily on whether those changes are detectable in coarse-resolution images. As Landsat-8 has a 16-day revisit cycle, LULC changes identified between two temporally neighboring Landsat images may miss changes occurring at a scale of a few days (e.g., crop phenology). The results of this study revealed that the fusion network was able to identify areas experiencing LULC changes within a few days and to yield fusion results consistent with the original image. When the number of auxiliary Sentinel-2 images was less than three (Figure 10), the fusion network might not yield good fusion results for areas experiencing LULC changes. Thus, it is suggested that three or more auxiliary Sentinel-2 images be input into the fusion network for the downscaling process. In the future, research efforts will be made to explore whether there is a threshold in the temporal interval, or an optimal number of auxiliary Sentinel-2 images, that yields the best downscaling performance. As such, high-performance computing environments that leverage parallel processing, such as the NASA Earth Exchange, Google Earth Engine, and Amazon Web Services (Gorelick et al., 2017), are necessary and should be employed to perform sensitivity analyses of the fusion network. In addition, the fusion network requires sampled pixels that are representative of the different types of changes within a study area for training. Thus, there is a need to collect various types of LULC changes (beyond the temporary housing and planting of crops shown in this study) to further train the fusion network. Validation of the fusion network using 10 m images collected from airborne or ground-based sensing platforms, rather than synthetic data sets, may further help in understanding the algorithm performance.
Although the fusion network was developed to blend Landsat-8 and Sentinel-2 images into a harmonized reflectance data set, it can also be used to enhance the spatial resolution of images from other satellite sensors, such as Landsat-7 ETM+, the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER), and the Moderate Resolution Imaging Spectroradiometer (MODIS). Attention should be paid to using surface reflectance products from these sensors, rather than the TOA reflectance products evaluated in this study, to reduce prediction uncertainties induced by atmospheric conditions. The fusion of Sentinel-2 data with images acquired by these satellite sensors at coarse spatial resolutions ranging from 60 m to 500 m may also entail reconfiguration of the fusion network. In particular, the fusion of Sentinel-2 and Landsat-7 images requires special efforts to fill the gaps in Landsat-7 Scan Line Corrector (SLC)-off images.
Recently, NASA launched an initiative, the HLS project (https://hls.gsfc.nasa.gov/), to produce harmonized Landsat-8 and Sentinel-2 reflectance products that fulfil the community's needs for images at both high spatial and temporal resolutions. The project aims to produce reflectance products through a series of algorithms, including atmospheric correction, cloud and cloud-shadow masking, spatial co-registration and a common gridding system, Bi-directional Reflectance Distribution Function (BRDF) normalization, and spectral bandpass adjustment, to resolve differences between Landsat-8 and Sentinel-2 images. To address the gap in spatial resolution between Sentinel-2 and Landsat-8 images, the HLS project adopted the strategy of resampling Sentinel-2 images from 10 m spatial resolution to 30 m. However, this resampling strategy discards valuable information provided by the 10 m Sentinel-2 data. The proposed fusion network therefore provides a promising approach to help generate a better harmonized Landsat-8 and Sentinel-2 dataset at 10 m spatial resolution. Further training and evaluation of the fusion network to downscale Landsat-8 images for other areas are necessary, particularly since deep learning-based methods are normally data-hungry for robust performance (LeCun et al., 2015).
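To illustrate the information loss implied by that resampling strategy, the sketch below (illustrative only; the HLS processing chain uses its own gridding and resampling procedures) block-averages a 10 m Sentinel-2 band to 30 m, which removes exactly the sub-30 m spatial detail that the proposed fusion network instead tries to exploit.

import numpy as np

def aggregate_10m_to_30m(band_10m):
    # Average non-overlapping 3x3 blocks of a 10 m band to produce a 30 m band;
    # assumes both array dimensions are multiples of 3.
    rows, cols = band_10m.shape
    return band_10m.reshape(rows // 3, 3, cols // 3, 3).mean(axis=(1, 3))

band_10m = np.random.rand(300, 300)          # hypothetical 10 m reflectance band (3 km x 3 km)
band_30m = aggregate_10m_to_30m(band_10m)
print(band_10m.shape, '->', band_30m.shape)  # (300, 300) -> (100, 100)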
6 Conclusion

The combination of Landsat-8 and Sentinel-2 satellite images has the potential to provide temporally dense observations of land surfaces at short time intervals (e.g., 5-day composites) suitable for environmental applications such as monitoring of agricultural management and conditions. However, the spatial resolution gap between the two satellite sensors needs to be coordinated. In this study, a deep learning-based data fusion network was developed to downscale Landsat-8 images from 30 m spatial resolution to 10 m with inputs of auxiliary Sentinel-2 images acquired on dates close to those of the target Landsat-8 images. The Landsat-8 images downscaled by the deep learning-based network showed rich spatial information and high quality, suggesting great potential for applications that require a series of satellite observations at both high temporal and high spatial resolutions. Overall, the major advantages of the proposed fusion network can be summarized as follows.

(1) Compared to ATPRK, which outperforms other data fusion algorithms in downscaling, ESRCNN can accommodate a flexible number of auxiliary Sentinel-2 images to enhance Landsat-8 images from 30 m spatial resolution to 10 m with a higher accuracy.

(2) By leveraging information from auxiliary Sentinel-2 images, ESRCNN has the potential to identify LULC changes for data fusion, which generally cannot be handled by existing data fusion algorithms.

(3) ESRCNN is superior to ATPRK in preserving the distribution of reflectance values in the downscaled Landsat-8 images.
Code Availability

The code developed for fusing Landsat-8 and Sentinel-2 images is available at https://github.com/MrCPlusPlus/ESRCNN-for-Landsat8-Sentinel2-Fusion.
Acknowledgements

This work was supported in part by the National Key Research and Development Plan on strategic international scientific and technological innovation cooperation special project under Grant 2016YFE0202300, the National Natural Science Foundation of China under Grants 61671332, 41771452, and 41771454, and the Natural Science Fund of Hubei Province in China under Grant 2018CFA007. The authors are also grateful to the editors and three anonymous reviewers who helped improve the earlier version of this manuscript through their comments and suggestions.
References

Acerbi-Junior, F.W., Clevers, J.G.P.W., & Schaepman, M.E. (2006). The assessment of multi-sensor image fusion using wavelet transforms for mapping the Brazilian Savanna. International Journal of Applied Earth Observation and Geoinformation, 8, 278-288.

Alparone, L., Wald, L., Chanussot, J., Thomas, C., Gamba, P., & Bruce, L.M. (2007). Comparison of pansharpening algorithms: outcome of the 2006 GRS-S data fusion contest. IEEE Transactions on Geoscience and Remote Sensing, 45, 3012-3021.

Audebert, N., Le Saux, B., & Lefèvre, S. (2017). Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing.

Claverie, M., Masek, J. G., Ju, J., & Dungan, J. L. (2017). Harmonized Landsat-8 Sentinel-2 (HLS) product user's guide. National Aeronautics and Space Administration (NASA): Washington, DC, USA.

Claverie, M., Ju, J., Masek, J. G., Dungan, J. L., Vermote, E. F., Roger, J. C., ... & Justice, C. (2018). The Harmonized Landsat and Sentinel-2 surface reflectance data set. Remote Sensing of Environment, 219, 145-161.

Das, M., & Ghosh, S. K. (2016). Deep-STEP: A deep learning approach for spatiotemporal prediction of remote sensing data. IEEE Geoscience and Remote Sensing Letters, 13(12), 1984-1988.

DeVries, B., Huang, C., Huang, W., Jones, J., Lang, M., & Creed, I. (2016). Automated quantification of surface water fraction using Landsat and Sentinel-2 data. Paper presented at the AGU Fall Meeting Abstracts.

Dong, C., Loy, C.C., He, K., & Tang, X. (2016). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 295-307.

Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., Meygret, A., Spoto, F., Sy, O., Marchese, F., & Bargellini, P. (2012). Sentinel-2: ESA's Optical High-Resolution Mission for GMES Operational Services. Remote Sensing of Environment, 120, 25-36.

Emelyanova, I. V., McVicar, T. R., Van Niel, T. G., Li, L. T., & van Dijk, A. I. (2013). Assessing the accuracy of blending Landsat-MODIS surface reflectances in two landscapes with contrasting spatial and temporal dynamics: A framework for algorithm selection. Remote Sensing of Environment, 133, 193-209.

Fu, P., & Weng, Q. (2016a). A time series analysis of urbanization induced land use and land cover change and its impact on land surface temperature with Landsat imagery. Remote Sensing of Environment, 175, 205-214.

Fu, P., & Weng, Q. (2016b). Consistent land surface temperature data generation from irregularly spaced Landsat imagery. Remote Sensing of Environment, 184, 175-187.

Gao, F., Masek, J., Schwaller, M., & Hall, F. (2006). On the blending of the Landsat and MODIS surface reflectance: predicting daily Landsat surface reflectance. IEEE Transactions on Geoscience and Remote Sensing, 44, 2207-2218.

Gevaert, C. M., & García-Haro, F. J. (2015). A comparison of STARFM and an unmixing-based algorithm for Landsat and MODIS data fusion. Remote Sensing of Environment, 156, 34-44.

Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D., & Moore, R. (2017). Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sensing of Environment, 202, 18-27.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 1904-1916.

Hilker, T., Wulder, M.A., Coops, N.C., Linke, J., McDermid, G., Masek, J.G., Gao, F., & White, J.C. (2009). A new data fusion model for high spatial- and temporal-resolution mapping of forest disturbance based on Landsat and MODIS. Remote Sensing of Environment, 113, 1613-1627.

Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349(6245), 261.

Hirschmugl, M., Gallaun, H., Dees, M., Datta, P., Deutscher, J., Koutsias, N., & Schardt, M. (2017). Methods for mapping forest disturbance and degradation from optical earth observation data: a review. Current Forestry Reports, 3(1), 32-45.

Huang, B., & Song, H. (2012). Spatiotemporal reflectance fusion via sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 50(10), 3707-3716.

Kim, D.-H., Sexton, J.O., Noojipady, P., Huang, C., Anand, A., Channan, S., Feng, M., & Townshend, J.R. (2014). Global, Landsat-based forest-cover change from 1990 to 2000. Remote Sensing of Environment, 155, 178-193.

Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems (pp. 1097-1105).

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278-2324.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436.

Lin, Y., David, R., Hankui, Z., et al. (2016). An automated approach for sub-pixel registration of Landsat-8 Operational Land Imager (OLI) and Sentinel-2 Multi Spectral Instrument (MSI) imagery. Remote Sensing, 8(6), 520.

Liu, J. G. (2000). Smoothing Filter-based Intensity Modulation: A spectral preserve image fusion technique for improving spatial details. International Journal of Remote Sensing, 21(18), 3461-3472.

Liu, P., Xiao, L., & Tang, S. (2016). A new geometry enforcing variational model for pan-sharpening. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(12), 5726-5739.

Liu, X., Deng, C., Wang, S., Huang, G. B., Zhao, B., & Lauren, P. (2016). Fast and accurate spatiotemporal fusion based upon extreme learning machine. IEEE Geoscience and Remote Sensing Letters, 13(12), 2039-2043.

Malenovský, Z., Rott, H., Cihlar, J., Schaepman, M.E., García-Santos, G., Fernandes, R., & Berger, M. (2012). Sentinels for science: Potential of Sentinel-1, -2, and -3 missions for scientific observations of ocean, cryosphere, and land. Remote Sensing of Environment, 120, 91-101.

Masi, G., Cozzolino, D., Verdoliva, L., & Scarpa, G. (2016). Pansharpening by convolutional neural networks. Remote Sensing, 8(7), 594.

Nair, V., & Hinton, G.E. (2010). Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (pp. 807-814).

Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, S., Wang, Z., Loy, C.-C., & Tang, X. (2015). DeepID-Net: Deformable deep convolutional neural networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2403-2412).

Quintano, C., Fernández-Manso, A., & Fernández-Manso, O. (2018). Combination of Landsat and Sentinel-2 MSI data for initial assessing of burn severity. International Journal of Applied Earth Observation and Geoinformation, 64, 221-225.

Ranchin, T., & Wald, L. (2000). Fusion of high spatial and spectral resolution images: The ARSIS concept and its implementation. Photogrammetric Engineering and Remote Sensing, 66, 49-61.

Roy, D. P., Lewis, P., Schaaf, C., Devadiga, S., & Boschetti, L. (2006). The global impact of clouds on the production of MODIS bidirectional reflectance model-based composites for terrestrial monitoring. IEEE Geoscience and Remote Sensing Letters, 3(4), 452-456.

Roy, D. P., Wulder, M. A., Loveland, T. R., Woodcock, C. E., Allen, R. G., Anderson, M. C., ... & Zhu, Z. (2014). Landsat-8: Science and product vision for terrestrial global change research. Remote Sensing of Environment, 145, 154-172.

Skakun, S., Kussul, N., Shelestov, A., & Kussul, O. (2014). Flood hazard and flood risk assessment using a time series of satellite images: A case study in Namibia. Risk Analysis, 34(8), 1521-1537.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.

Senf, C., Pflugmacher, D., Heurich, M., & Krueger, T. (2017). A Bayesian hierarchical model for estimating spatial and temporal variation in vegetation phenology from Landsat time series. Remote Sensing of Environment, 194, 155-160.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Sun, Y., Chen, Y., Wang, X., & Tang, X. (2014). Deep learning face representation by joint identification-verification. In International Conference on Neural Information Processing Systems (pp. 1988-1996).

Storey, J. C., Roy, D. P., & Masek, J. (2016). A note on the temporary misregistration of Landsat-8 Operational Land Imager (OLI) and Sentinel-2 Multi Spectral Instrument (MSI) imagery. Remote Sensing of Environment, 186, 121-122.

USGS. (2015). Using the USGS Landsat 8 product. [Online]. Available: http://landsat.usgs.gov/Landsat8_Using_Product.php.

Veloso, A., Mermoz, S., Bouvet, A., Le Toan, T., Planells, M., Dejoux, J.-F., & Ceschia, E. (2017). Understanding the temporal behavior of crops using Sentinel-1 and Sentinel-2-like data for agricultural applications. Remote Sensing of Environment, 199, 415-426.

Verbesselt, J., Zeileis, A., & Herold, M. (2012). Near real-time disturbance detection using satellite image time series. Remote Sensing of Environment, 123, 98-108.

Wald, L., Ranchin, T., & Mangolini, M. (1997). Fusion of satellite images of different spatial resolution: Assessing the quality of resulting images. Photogrammetric Engineering and Remote Sensing, 63, 691-699.

Wang, Q., Shi, W., Li, Z., & Atkinson, P. M. (2016). Fusion of Sentinel-2 images. Remote Sensing of Environment, 187, 241-252.

Wang, Q., Blackburn, G.A., Onojeghuo, A.O., Dash, J., Zhou, L., Zhang, Y., & Atkinson, P.M. (2017). Fusion of Landsat-8 OLI and Sentinel-2 MSI data. IEEE Transactions on Geoscience and Remote Sensing, 55, 3885-3899.

Wang, Z., & Bovik, A.C. (2002). A universal image quality index. IEEE Signal Processing Letters, 9, 81-84.

Wei, Q., Bioucas-Dias, J., Dobigeon, N., & Tourneret, J. Y. (2015). Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Transactions on Geoscience and Remote Sensing, 53(7), 3658-3668.

White, J. C., Wulder, M. A., Hermosilla, T., Coops, N. C., & Hobart, G. W. (2017). A nationwide annual characterization of 25 years of forest disturbance and recovery for Canada using Landsat time series. Remote Sensing of Environment, 194, 303-321.

Woodcock, C. E., Allen, R., Anderson, M., Belward, A., Bindschadler, R., Cohen, W., Wynne, R. (2008). Free access to Landsat imagery. Science, 320(5879), 1011.

Yuan, Q., Wei, Y., Meng, X., Shen, H., & Zhang, L. (2018). A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(3), 978-989.

Zhang, L., Weng, Q., & Shao, Z. (2017). An evaluation of monthly impervious surface dynamics by fusing Landsat and MODIS time series in the Pearl River Delta, China, from 2000 to 2015. Remote Sensing of Environment, 201, 99-114.

Zhang, N., Donahue, J., Girshick, R., & Darrell, T. (2014). Part-based R-CNNs for fine-grained category detection. In European Conference on Computer Vision (pp. 834-849).

Zhu, X., Chen, J., Gao, F., Chen, X., & Masek, J.G. (2010). An enhanced spatial and temporal adaptive reflectance fusion model for complex heterogeneous regions. Remote Sensing of Environment, 114, 2610-2623.

Zhu, X., Helmer, E. H., Gao, F., Liu, D., Chen, J., & Lefsky, M. A. (2016). A flexible spatiotemporal method for fusing satellite images with different resolutions. Remote Sensing of Environment, 172, 165-177.

Zhu, Z., & Woodcock, C. E. (2014). Continuous change detection and classification of land cover using all available Landsat data. Remote Sensing of Environment, 144, 152-171.
List of Figure Captions

Figure 1. The schematic illustration of local receptive fields (red rectangles) and parameter sharing (weight values w1, w2, and w3 are equal) between layers in convolutional neural networks.

Figure 2. The three phases, i.e., convolution, nonlinear activation, and pooling, in a typical convolutional neural network layer.

Figure 3. The workflow of the extended SRCNN (ESRCNN) for fusing Landsat-8 and Sentinel-2 images. Conv indicates the convolutional layer and the rectified linear unit (ReLU) represents the nonlinear activation layer.

Figure 4. Landsat-8 and Sentinel-2 data sets used in the training phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2, and 3) were used for visual assessment.

Figure 5. Landsat-8 and Sentinel-2 data sets used in the testing phase (bands 4, 3, and 2 as RGB). Landsat-8 images at 30 m were acquired on (a) June 15 and (b) July 1, 2017. Sentinel-2 images were acquired on (c) June 20, (d) June 27, and (e) July 7, 2017. The marked red rectangle areas (1, 2) were used for evaluating the fusion network performance.

Figure 6. The results of self-adaptive fusion on June 20, June 27, and July 7, 2017 for subarea 1 in Figure 4. (a) The degraded Sentinel-2 bands at 40 m, (b) the reference Sentinel-2 bands at 20 m, (c) the fused Sentinel-2 bands at 20 m, and (d) the fused Sentinel-2 bands at 10 m.

Figure 7. The multi-temporal fusion results for subarea 1 in Figure 4. (a)-(c) are the resampled Sentinel-2 images at 30 m on June 20, June 27, and July 7, 2017. (d)-(f) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on June 15, 2017. (g)-(i) are the resampled Landsat-8 image at 90 m, the original Landsat-8 reference at 30 m, and the fused Landsat-8 image at 30 m on July 1, 2017.

Figure 8. The multi-temporal fusion results for subarea 2 in Figure 4 (bands 5, 4, 3 as RGB). (a)-(c) are three Sentinel-2 images at 10 m acquired on June 20, June 27, and July 7, 2017. (d)-(f) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on June 15, 2017. (g)-(i) are the original Landsat-8 reference at 30 m and the fused Landsat-8 images at 30 m and 10 m on July 1, 2017. The yellow circle highlights the land use and land cover (LULC) changes due to planting of crops (a-c).

Figure 9. Subarea 3 in Figure 4 experienced changes in temporary housing (bright dots; band 2 gray-scale images are shown here). The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image. The number of bright dots (temporary housing) in the Landsat-8 image on (c) July 1, 2017, and the three Sentinel-2 images on (a) June 20, (b) June 27, and (d) July 7, 2017 was 4, 2, 3, and 5, respectively.

Figure 10. The downscaled Landsat-8 image at 10 m (bands 7, 6, 2 as RGB) on July 1, 2017 derived from seven different data sets: (a) D1, (b) D2, (c) D3, (d) D1+D2, (e) D2+D3, (f) D1+D3, and (g) D1+D2+D3. The red rectangle in the upper-left corner is an enlarged view of the red rectangle in the center of the image.

Figure 11. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on June 15, 2017. (a)-(f) are fusion reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 20, 2017. (g)-(l) are fusion reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 20, 2017. The color scheme indicates point density.

Figure 12. Comparisons between the reference and the fused reflectance produced by ATPRK and ESRCNN for bands 2-7 on July 1, 2017. (a)-(f) are fusion reflectance yielded by ATPRK with the auxiliary Sentinel-2 image on June 27, 2017. (g)-(l) are fusion reflectance yielded by ESRCNN with the auxiliary Sentinel-2 image on June 27, 2017. The color scheme indicates point density.

Figure 13. The multi-temporal fusion results for subarea 1 indicated by the red rectangle in Figure 5 (bands 4, 3, 2 as RGB). The original Landsat-8 images at 30 m on (a) June 15 and (e) July 1, 2017. The resampled Landsat-8 images at 90 m on (b) June 15 and (f) July 1, 2017. The fused Landsat-8 images (ESRCNN) at 30 m on (c) June 15 and (g) July 1, 2017. The fused Landsat-8 images (ATPRK) at 30 m on (d) June 15 and (h) July 1, 2017.

Figure 14. The distribution of pixel values from all spectral bands in Landsat-8 and Sentinel-2 images. (a)-(c) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on June 15, 2017. (d)-(f) are the Landsat-8 bands 1-7 at 30 m, the ESRCNN fused Landsat-8 bands at 30 m, and the ATPRK fused Landsat-8 bands at 30 m on July 1, 2017.

Figure 15. Fusion results for subarea 2 in Figure 5 experiencing changes in vegetation coverage (bands 5, 4, 3 as RGB). (a) The 30 m Landsat-8 image on June 15, 2017, (b) the 30 m Sentinel-2 image on July 7, 2017, (c) the 90 m Landsat-8 image to be downscaled on June 15, 2017, (d) the ESRCNN downscaled image at 30 m, and (e) the ATPRK downscaled image at 30 m.