
IPL-LEO

[Version 1.19]
July 2017

J. Verrelst & J.P. Rivera

Machine Learning Regression Algorithms (MLRA) toolbox

[MLRA toolbox v1.19 Manual]


ARTMO’s Machine Learning Regression Algorithms (MLRA) toolbox enables
evaluating and applying a suite of MLRAs for mapping surface properties. Input data
can either come from radiative transfer models or from field measurements. This
document provides the MLRA Manual.
1 Revision History

Version 1.00 (04/11/2013) - J. Verrelst & J.P. Rivera
First public release of MLRA v1.0 and its documentation, including this Manual.

Version 1.00 (04/11/2013) - J. Muñoz & J. Verrelst
Manual revised by Jordi Muñoz.

Version 1.01 (14/11/2013) - J.P. Rivera & J. Verrelst
• The MLRA figure options have been revised. Options to manipulate figures and maps are now available within the figure top bar. This Options button will appear both when plotting validation outputs and when plotting maps.
• Within Tools, the button Figure has also been added. This figure likewise includes the Options button and its utilities.
• The included regression functions have been updated according to simpleR v2.1.

Version 1.02 (03/01/2014) - J.P. Rivera & J. Verrelst
The goodness-of-fit statistical measures in the Validation table have been extended according to the set of optimal statistics proposed by Richter et al. (2012):
Richter, K., Atzberger, C., Hank, T. and Mauser, W. (2012): Derivation of biophysical variables from Earth Observation data: validation and statistical measures, Journal of Applied Remote Sensing 6(1), DOI: 10.1117/1.jrs.6.063557.

Version 1.03 (04/04/2014) - J.P. Rivera & J. Verrelst
Several small modifications have been introduced:
• The Theil-Sen estimator has been moved to a new table with 'advanced' statistics within the calibration/validation tables. These advanced statistics are only calculated when selecting one or multiple SI models in the calibration/validation table, because their calculation is computationally expensive, i.e. it involves looping over many combinations. In the future more advanced statistics are foreseen to be added to this table.
• Options to visualize residuals have been added to the calibration/validation tables. Residuals can be plotted against validation data.
• It appeared that GPR was no longer providing sigmas. This has been corrected.
• The sigmas as delivered by VHGPR are now also provided, as well as the associated uncertainties (SD, CV) when mapping mean estimates.
• The uncertainties as provided by KRR were incorrect and have been removed.
• It appeared that the storing of the generated maps in ENVI format was wrongly projected.
• Processing time for each regression model (training plus validation) has been added. The time to generate a map can also optionally be saved as a text file; this file contains both the model development processing time and the mapping processing time.
• In Options within Tools, processing speed can be deactivated.
• A bug within one of the advanced PCAs has been corrected.
• Because the MLRA toolbox only provides goodness-of-fit results from the validation, it is not possible to assign 100% to training in Settings. Some portion has to be set aside for validation. To warn the user, the box will turn yellow when 100% is entered.

Version 1.04 (25/04/2014) - J.P. Rivera & J. Verrelst
• The possibility to write away the processing time of generating a map has been moved to Options. By default it is deactivated. In Options the path to a text file can be defined.
• The option has been added to save the MLRA models in the MySQL table. This allows earlier generated models to be re-used.

Version 1.05 (05/06/2014) - J.P. Rivera & J. Verrelst
• It appeared that when an error occurs a temporary variable (dummyvar.m) was not removed, which caused an error in a subsequent run. Now it is first checked whether this variable has been removed.
• Regarding the PCA approaches, a few small bugs have been corrected when applying the 'Best RMSE' option.

Version 1.10 (15/01/2015) - J.P. Rivera & J. Verrelst
• The cross-validation module has been added. This allows more robust sampling. Along with it, statistics and plotting functions have been added.
• The reading of images and the assigning of the output folder are now synchronized with those of the SI toolbox.

Version 1.11 (26/03/2015) - J.P. Rivera & J. Verrelst
Several small bugs have been identified and corrected:
• The cross-validation 1:1 graph appeared to show results reversed. Now corrected.
• R2-adj can in some cases lead to errors. If an error now occurs, the results are converted to NaN and processing continues.
• The original 1:1 line was drawn starting from 0. Now corrected.

Version 1.12 (22/04/2015) - J.P. Rivera & J. Verrelst
A few improvements have been introduced:
• An option has been added to export an MLRA model as a Matlab file and to import an external MLRA model.
• The reading of (User) text files has been improved. An error no longer appears when some white space remains at the end of the text file.

Version 1.13 (16/06/2015) - J.P. Rivera & J. Verrelst
Various improvements have been introduced:
• Apart from standard ENVI images, it is now also possible to read (geo)TIFF images and write TIFF maps.
• In case the image is very big, it is read and processed line by line to avoid memory problems.
• The option to develop an MLRA per land cover class has been revised.
• The multi-output option appeared to be outdated. It has now been revised and updated similarly to the single-output regression algorithms.
• The window that enables reading a (User) text file has been updated: (1) a 'transpose' option has been added to transpose, e.g., exported text data; (2) the 'combined' input variables option has been moved further back.
• In Tools an option has been added so that, during the processing of an image, pixels where all bands have the same value (e.g. 0) are skipped.
• The option to load an image in the Input menu was redundant and has been removed.
• An option to convert the to-be-processed images has been added in the Retrieval window. It is a multiplicative conversion factor.

Version 1.14 (21/07/2015) - J. López-Centelles, J.P. Rivera & J. Verrelst
The following improvements have been introduced:
• An error was resolved regarding displaying the mapping processing time.
• To facilitate the workflow, menu items now change color once the step is completed.
• Also, the Settings step is now disabled until the Input step is completed. The same holds for Validation→New.
• A sigma band analysis tool has been added. This tool iteratively removes the worst band in model development. Currently this tool only works for GPR and VH-GPR.

Version 1.15 (25/10/2015) - J.P. Rivera, J. López-Centelles & J. Verrelst
The following improvements have been introduced:
• The GPR sigma band analysis tool has been updated with new features, such as removal of multiple bands at the first iteration and various results visualization tools.
• A processing bar when analyzing the multiple regression strategies has been added.
• A log window has been added. This window provides an overview of the principal steps executed within the toolbox.
• Apart from processing an image, the option has been added to also process text files, e.g. with data coming from a field spectrometer.
• A bug in mapping in case of NaNs within the image has been corrected. Also a bug in case of selecting SCOPE RTM output data was corrected.
• The scatterplot option in Tools has been updated with calculation of goodness-of-fit indicators.

Version 1.15 (25/10/2015) - Petar Dimitrov
At the end of the Manual a section about how to deal with memory problems has been added.

Version 1.16 (22/03/2016) - J.P. Rivera & J. Verrelst
The following improvements have been introduced:
• The MLRA toolbox has been made Linux-proof (however, there may be issues with MySQL).
• The scatterplot tool has been expanded with error maps and histograms.
• An option has been added to validate an earlier developed regression model with new, external data. This option allows evaluating the portability of a regression model.
• In RTM input data, the 'combined' option has been moved a bit further back, similar to USER input data.
• A bug regarding reading the retrieval TXT file has been resolved.

Version 1.17 (08/10/2016) - J.P. Rivera & J. Verrelst
The following improvements have been introduced:
• There were some bugs in the GPR sigma band analysis tool (GPR-BAT) in some plotting and mapping options. These have now been resolved.
• A new active learning (AL) module has been implemented in this version.
• A 'dir' issue specific to Matlab 2016 has been corrected and an error message in case of reading User data has been improved.

Version 1.18 (08/02/2017) - J.P. Rivera, J. Muñoz-Mari & J. Verrelst
The following improvements have been introduced:
• The Neural Network (NN) and Regression Tree (RT) methods have been updated for the newer Matlab versions. NN should now run faster (with small datasets). However, RT may lead to an error on Matlab 2013 or older. Note that the Matlab toolboxes (Neural Networks and Machine Learning) are required.
• Some bugs have been resolved in case of reopening results with 'measured vs estimated' and in case of 'retrieval→txt'.
• The SVR method has been reintroduced. It now works for both 32 and 64 bit machines.
• There was a problem with processing large images. Now all processing occurs line-by-line to avoid running into out-of-memory problems. A progress bar has been added.
• A mask option has been added to View Maps. As such, in case of GPR it allows masking out uncertain retrievals.
• When plotting the GPR sigmas, the component numbers are now given in case of using a PCA. Also, the wavelength labels have been improved.
• For retrieval of an image, when a single image is selected the output file is editable. When multiple images are selected, only the output folder can be selected.
• When importing an external MLRA model (as created by ARTMO), it can now be applied to retrieval of both an image and a txt file.
• The progress bar when training and validating MLRAs has been improved. It now provides an extra progress bar in case sub-models are being created (e.g. when using cross-validation).
• The measured-estimated figures have been synchronized for with and without cross-validation sampling.

Version 1.19 (02/07/2017) - J.P. Rivera, J. Muñoz-Mari & J. Verrelst
The following improvements have been introduced:
• In case of TIFF image processing, the geo tags are now also written away in case of geoTIFF. However, this option will only work when Matlab's Mapping Toolbox is available.
• Cross-validation estimates in case of k-fold and LOO have been corrected. The statistics are now calculated based on the estimates of all the subsets. Afterwards, the model that is used for retrieval is trained with all data.
• The neural network algorithm has been extended with advanced options. Various alternative optimization algorithms can now be selected.
• The window with the overview of the conducted analyses (Validation→Load) has been improved. Some key properties (e.g. number of bands and samples) can now be observed immediately. The same window has been applied in case of deleting or renaming analyses. In the delete window, multiple exercises can now be deleted at once.
• To speed up image processing, it is now possible to process images per tile (or block). The size (number of lines) of a tile can be adjusted to avoid out-of-memory problems in case of large images.
• In case of ENVI image processing, images with an extension (e.g. '.bsq') can now also be processed.
• The dimensionality reduction SIMFEAT module has been improved. For the kernel dimensionality reduction methods, an internal sigma optimization has been added. Additional options to control the optimization have been added. Also some small bugs have been corrected.
• When a dimensionality reduction method is used, the validation table now gives, apart from the number of bands, also the number of features (components).
Table of Contents
1 Revision History....................................................................................................... 2
2 Introduction .............................................................................................................. 8
2.1 Ongoing development ...................................................................................... 9
2.2 Please cite the toolbox: .................................................................................... 9
3 ARTMO's MLRA toolbox ....................................................................................... 11
3.1 Installation ...................................................................................................... 11
3.2 ARTMO .......................................................................................................... 11
3.3 MLRA's modular architecture ......................................................................... 12
4 Input....................................................................................................................... 13
4.1 Input from RTM model data (LUT) ................................................................. 13
4.2 Input from external User data (TXT) ............................................................... 15
4.3 Load land cover map (optional) ...................................................................... 17
4.4 Inserting input data with land cover class labels ............................................ 18
4.5 Combining RTM data with external User data ................................................ 19
5 Settings.................................................................................................................. 21
5.1 Single-output MLRAs ..................................................................................... 21
5.2 Band tools: redundancy reduction .................................................................. 24
5.3 Cross-validation module ................................................................................. 26
5.4 Active learning ................................................................................................ 28
5.5 Advanced options ........................................................................................... 30
5.6 Configuring per land cover class .................................................................... 30
5.7 Multi-output regression algorithms ................................................................. 31
6 Validation ............................................................................................................... 33
6.1 Validation: New .............................................................................................. 33
6.1.1 Graphics .................................................................................................. 36
6.2 Outputs GPR band analysis tool (GPR-BAT) ................................................. 40
6.2.1 Select regression model.......................................................................... 42
6.3 Active learning module ................................................................................... 42
6.4 Validation: Load.............................................................................................. 44
7 Retrieval ................................................................................................................ 46
7.1 Retrieval image .............................................................................................. 46
7.2 Retrieval: Text file........................................................................................... 50
8 Tools ...................................................................................................................... 53
8.1 Save ............................................................................................................... 53
8.2 Load ............................................................................................................... 53
8.3 Manage tests .................................................................................................. 54
8.4 Options ........................................................................................................... 55

8.5 View maps ...................................................................................................... 56
8.6 View figure...................................................................................................... 56
8.7 Import model .................................................................................................. 57
8.8 ScatterPlot ...................................................................................................... 57
8.9 Validation external data .................................................................................. 59
9 Help ....................................................................................................................... 61
9.1 Show log......................................................................................................... 61
10 Error reporting ....................................................................................................... 62
10.1 Dealing with memory problems ...................................................................... 62
10.2 Error in case of unauthorized writing of temporary files .................................... 63

2 Introduction
Biophysical parameter mapping from optical remote sensing images always requires an
intermediate modeling step to transform spectral observations into useful estimates. This
modeling step can be approached with statistical, physical or hybrid methods. Here the
emphasis is put on statistical methods. Statistical methods can be categorized into either
parametric or nonparametric approaches.

The machine learning regression algorithm (MLRA) assessment toolbox presented here
provides a suite of nonparametric techniques in a graphical user interface (GUI) to
enable semiautomatic mapping.

Nonparametric models are adjusted to predict a variable of interest using a training
dataset of input-output data pairs, which come from concurrent measurements of the
parameter and the corresponding radiometric observation. In particular, the family of
MLRAs has emerged as a powerful nonparametric approach for delivering biophysical
parameters. MLRAs have the potential to generate adaptive, robust relationships and,
once trained, they are very fast to apply. Typically, MLRAs are able to cope with the
strong nonlinearity of the functional dependence between the biophysical parameter and
the observed reflected radiance. They may therefore be powerful candidates for mapping
applications.
The MLRA toolbox requires training data to train an advanced regression model (e.g.
MLRA). This trained model can then be validated and applied to a remote sensing image
to enable mapping (Figure 2-1).

Figure 2-1. Basic principle of the MLRA toolbox.

In short, the MLRA toolbox enables:

• Applying and evaluating multiple MLRAs according to customized training strategies,
e.g. with different noise levels and train/validation partitioning.
• Choosing between either single-output or multi-output models.
• Using data that come either from radiative transfer models or from field measurements,
or a mix of both.
• Optimizing a distinct MLRA for each land cover class, if a land cover map is provided.
• Analyzing multiple MLRA strategies against a validation dataset by using
goodness-of-fit statistics, when validation data are available. Results
are stored in a relational database.
• Loading the best performing strategy and applying it to an image, or directly
developing a model and applying it to an image, for mapping applications.
• Applying a dimensionality reduction method (e.g. PCA) prior to the regression model
in case of hyperspectral data.

2.1 Ongoing development


The MLRA assessment toolbox is part of the ARTMO toolbox. ARTMO is an ongoing
project constantly under development, and releases with updates are foreseen. In each
new version we aim to resolve bugs and to add new functionalities. You can also
contribute to the improvement of ARTMO and its modules, e.g. by reporting bugs or
providing suggestions. Specifically, we encourage including more models, modules and
apps in the toolbox. For instance, if you have an RT model available and you are willing
to share it with the community we would be pleased to implement your model in the
toolbox. Or if you have an application developed in Matlab™ we could add it as an app
within a module. Please direct expressions of interest to artmo.toolbox@gmail.com

2.2 Please cite the toolbox:


ARTMO’s MLRA toolbox is published in:

• Toward a Semiautomatic Machine Learning Retrieval of Biophysical Parameters.
Rivera, J.P., Verrelst, J., Muñoz, J., Moreno, J., Camps-Valls, G. IEEE Journal of
Selected Topics in Applied Earth Observation and Remote Sensing, in press, 2014.

The majority of the algorithms used in the MLRA module are based on G. Camps-Valls'
regression algorithms toolbox (simpleR), published in:

• Retrieval of Biophysical Parameters with Heteroscedastic Gaussian Processes.
Lázaro-Gredilla, M., Titsias, M.K., Verrelst, J., Camps-Valls, G. IEEE Geoscience and
Remote Sensing Letters, 11(4), p. 838-842, 2014.

The source code of the MLRAs is available at:
http://www.uv.es/gcamps/code/simpleR.html

Please also consider citing these related papers regarding the MLRA toolbox:

• Active learning methods for efficient hybrid biophysical variable retrieval.
Verrelst, J., Dethier, S., Rivera, J.P., Munoz-Mari, J., Camps-Valls, G., Moreno, J.
IEEE Geoscience and Remote Sensing Letters, 13, p. 1012-1016, 2016.
• Spectral band selection for vegetation properties retrieval using Gaussian
processes regression. Verrelst, J., Rivera, J.P., Gitelson, A., Delegido, J., Moreno, J.,
Camps-Valls, G. International Journal of Applied Earth Observation and
Geoinformation, 52, p. 554-567, 2016.
• Experimental Sentinel-2 LAI estimation using parametric, non-parametric and
physical retrieval methods – A comparison. Verrelst, J., Rivera, J.P., Veroustraete, F.,
Muñoz-Marí, J., Clevers, J.G.P.W., Camps-Valls, G., Moreno, J. ISPRS Journal of
Photogrammetry and Remote Sensing, 108, p. 260-272, 2015.
• Gaussian processes uncertainty estimates in experimental Sentinel-2 LAI and
leaf chlorophyll content retrieval. Verrelst, J., Rivera, J.P., Moreno, J., Camps-Valls, G.
ISPRS Journal of Photogrammetry and Remote Sensing, 86, p. 157-167, 2013.
• Gaussian process retrieval of chlorophyll content from imaging spectroscopy
data. Verrelst, J., Alonso, L., Rivera, J.P., Moreno, J., Camps-Valls, G. IEEE Journal of
Selected Topics in Applied Earth Observation and Remote Sensing, 6(2), Part 3, 2013.
• Machine Learning Regression Algorithms for Biophysical Parameter Retrieval:
Opportunities for Sentinel-2 and -3. Verrelst, J., Muñoz, J., Alonso, L., Delegido, J.,
Rivera, J.P., Camps-Valls, G., Moreno, J. Remote Sensing of Environment, 118,
p. 127-139, 2012.
• Retrieval of Vegetation Biophysical Parameters using Gaussian Processes
Techniques. Verrelst, J., Alonso, L., Camps-Valls, G., Delegido, J., Moreno, J. IEEE
Transactions on Geoscience and Remote Sensing, 50(5), p. 1832-1843, 2012.

And regarding ARTMO:

• Mapping vegetation structure in a heterogeneous river floodplain ecosystem
using pointable CHRIS/PROBA data. Verrelst, J., Romijn, E., Kooistra, L. Remote
Sensing, 4(9), p. 2866-2889, 2012.
• Optimizing LUT-based RTM Inversion for Semiautomatic Mapping of Crop
Biophysical Parameters from Sentinel-2 and -3 data: Role of Cost Functions.
Verrelst, J., Rivera, J.P., Leonenko, G., Alonso, L., Moreno, J. IEEE Transactions on
Geoscience and Remote Sensing, 52(1), p. 257-269, 2014.
• Multiple cost functions and regularization options for improved retrieval of leaf
chlorophyll content and LAI through inversion of the PROSAIL model. Rivera, J.P.,
Verrelst, J., Leonenko, G., Moreno, J. Remote Sensing, 5(7), p. 3280-3304, 2013.

3 ARTMO's MLRA toolbox
3.1 Installation
The MLRA toolbox operates within the ARTMO environment (Figure 3-1).
Please consult the MLRA Installation guide on how to install the MLRA toolbox
into ARTMO.

3.2 ARTMO
Once the MLRA module is installed, it will automatically appear within
ARTMO's top bar, under Retrieval (see Figure 3-2):

Figure 3-1. ARTMO’s main window.

(Figure: schematic of ARTMO's menu hierarchy. Top-level menus: File, Models, Forward,
Retrieval, Tools and Help. The Retrieval menu contains Spectral Indices, MLRA, LUT-based
Inversion, GSA and Emulator.)

Figure 3-2. ARTMO's hierarchical design as of October 2016.

The MLRA toolbox can be accessed as follows:


Retrieval → MLRA
The following toolbox will appear (Figure 3-3):

Figure 3-3. MLRA module v.1.19.

3.3 MLRA's modular architecture
The MLRA module is organized in a modular way. All modules are accessible from
MLRA's main drop-down menu. A schematic overview is provided below (Figure 3-4).

(Figure: schematic of the MLRA drop-down menus: Input (RTM data, User data (TXT), Load
land cover map (optional)), Settings (Single-output, Multi-output), Validation (New, Load,
Rename, Delete), Retrieval (Image, Txt file), Tools (Save, Load, Manage tests, Options,
View maps, View figure, ScatterPlot, Validation external data) and Help (Show Log,
User's manual, Installation guide, Disclaimer).)

Figure 3-4. ARTMO's MLRA architecture from v1.16 onwards.

The five different modules of the MLRA toolbox are described in the following sections. The
modules have to be used in a logical order:

Input → Settings → Validation → Retrieval

To help the user follow these logical steps, some of the modules will only be
activated once the necessary input is provided; e.g. Settings or Retrieval will only be
activated when Input data is provided.

4 Input
The Input module is the first mandatory step to configure. Input data can come from
two sources:

1 RT models, i.e. a look-up table (LUT) as generated and stored by ARTMO, or
2 external User TXT data, e.g. field measurements.

4.1 Input from RTM model data (LUT)


Input→ RTM data→ Select project

As such, the Project overview window appears where a project can be chosen (Figure
4-1). Then within a Project a single look-up table (LUT) class can be chosen if multiple
LUT classes are configured.

Note that the Project overview window is the same as the one used in ARTMO. By
clicking on a project (any cell in the row of the top panel) and then on Input, the meta
data of the applied models can be consulted. If multiple LUT classes are configured they
will appear in the bottom panel. Then the appropriate LUT class can be selected by
clicking on any cell in a given row.

Figure 4-1. Project overview window to select a project and a LUT class.

Subsequently the window will appear where the output spectra and input variables of the
chosen LUT class can be selected (Figure 4-2). Depending on the complexity of the LUT
class the output from different models can be chosen (e.g. at leaf or at canopy level), the
LUT can be restricted by narrowing the variable ranges, and output parameters for
mapping can be selected.

Figure 4-2. Configure the required parameters to be mapped and the spectral output to be used.

Two input variables can be combined. When clicking on 'Combined' a window will
appear where two variables can be selected (Figure 4-3). Subsequently their product will
be calculated (e.g. LAI and leaf chlorophyll content will lead to canopy chlorophyll
content).

Figure 4-3. Option to combine (multiplicative) 2 input variables.

An important aspect to realize hereby is that only variables originating from the models
can be mapped. Also, when it comes to applying the generated model to a remote
sensing image, the simulated data may be too smooth as compared to real observations.
Therefore, it may be recommended to add noise in a subsequent step (see Settings).

It is also possible to edit previously configured input RTM settings:

Input→ RTM data→ Edit settings

Edit settings can be accessed if earlier RTM input data has been configured. The same
window (Figure 4-2) will then appear and input settings can be modified.

It is important to realize that many MLRAs are computationally demanding. Therefore,
depending on your computer, most of the MLRAs can only be fed with up to 3000-
100000 samples. In a future version we aim to find a way so that larger training datasets
can be inserted.

4.2 Input from external User data (TXT)

As a second option, external data stored in a text file, such as field measurements, can
also be inserted. This option can be of particular interest, not only because MLRA
models can then be developed and optimized for specific local applications, but also
because virtually any measured geo-biophysical parameter can be inserted as input and
then mapped. Thus also variables that are only indirectly related to reflectance can be
linked to spectral data through an MLRA. It is then up to the evaluation of the developed
models to verify how powerful these relationships are.

To be able to insert your own User data, make sure that the data is prepared in a matrix
format, i.e. each cell should contain a number. Please prepare the data according to the
structure below (Figure 4-4).
Figure 4-4. Structure of how the Input data should be prepared: the variables in the top rows
(one variable per row, samples in columns), followed below by the corresponding spectra, with
the wavelengths in the first column.

Below you can find an example (Figure 4-5). Note that the text file can contain a header
(e.g. 1, 2, 3, ...), but that should then be identified in the Input window (see Figure 4-6).

Figure 4-5. Example of an Input file with field data. The first column is a header. The following
columns represent parameters as measured in the field. Starting from row 7, the corresponding
spectra are added below. The first column represents the wavelengths.
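As an illustration, the following Matlab sketch assembles a file following this layout. The
variable names, values and the leading column entries are hypothetical placeholders and
should be adapted to your own data (compare with Figures 4-4 and 4-6):

% Minimal sketch (hypothetical data) of assembling a User input text file:
% one row per variable (samples in columns), followed by the spectra,
% with the wavelengths in the first column and one spectrum per sample column.
nSamples    = 20;
wavelengths = (400:10:2400)';                      % nm, column vector
LAI         = rand(1, nSamples) * 6;               % variable 1, one row
Cab         = rand(1, nSamples) * 80;              % variable 2, one row
spectra     = rand(numel(wavelengths), nSamples);  % reflectance per sample

% Top block: the variables (a placeholder 0 fills the first-column position).
top    = [zeros(2, 1), [LAI; Cab]];
% Bottom block: the wavelengths followed by the corresponding spectra.
bottom = [wavelengths, spectra];

dlmwrite('user_input.txt', [top; bottom], 'delimiter', '\t');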

To open the window to insert User data:

Input→ User data

An Input window will appear (see Figure 4-6).

An important aspect here is that two types of input data are required within the same
file: (1) the parameters to be mapped, e.g. leaf area index, chlorophyll content; and (2)
the related spectra. These data need to be provided together in a plain text file.

When opening the User Input file, the following steps are required:

1 Open a plain text file by clicking on Browser.


2 Inspect the data in the left panel, remove any header line if needed
3 If the data looks fine, click on OK. The data will then be shown in the right panel. If the
data is stored column-wise it will be shown in a proper matrix style. The whole file can
now be inspected. Make sure that a number is provided for each cell. From v1.13
onwards the option is provided to transpose the dataset. That can be of use in case
of entering data that has been exported by ARTMO Graphics.
4 The following step is to decide which parameters will be used for developing MLR
models. In the left bottom panel a parameter name can be given that corresponds
to a selected line – then click on OK. As such, multiple parameters can be defined.
5 The option to insert combined parameters (i.e. the product of two parameters) is also
provided. From v1.13 onwards this option is available by clicking on the 'Combined'
button. A small window will then appear where the two parameters and the
corresponding lines can be entered. For instance, leaf chlorophyll content can be
combined with leaf area index, leading to canopy chlorophyll content.
6 Next, define the starting row of the spectral columns.
7 The following step is to opt for including all samples or excluding some.
8 The option is also provided to convert the spectral data to another unit. This can
be important in case of combining multiple datasets, e.g. in combination with RTM
data (see section 4.5), or to match the dataset with that of the remote sensing
image.
9 Finally, by clicking on Import, the data will be inserted. If a step appears to be
missed then a message may appear or a forgotten box may turn yellow.

Figure 4-6. Input window to load external User data as prepared in a text file. First rows: input
parameters; first column: wavelengths. Spectra below the parameters.

In the top left bar it is also possible to save the inserted data, including the configured
input settings. A file browser allows you to save the data and a message will appear
when done (Figure 4-7, left). These data and settings can subsequently be loaded
(Figure 4-7, right). In this way, there is no need to repeat the input settings each time.
After loading, one can immediately click on Import and proceed with the MLRA settings.

Figure 4-7. Message windows confirming that settings have been saved [left] and that a selected
previously-saved file has been loaded [right].

When importing data, make sure that the dataset is complete, i.e. that a value is given for
each cell. Empty cells can be replaced by 'NaN'; note, however, that NaNs will lead to
errors in further processing. A warning message will appear in case inconsistencies are
encountered.

4.3 Load land cover map (optional)

The Load land cover map step allows configuring MLRA strategies per land cover
class. Note that this step is optional. When it is not configured, one MLRA strategy
will be applied to the complete image.

When aiming to develop MLRA strategies per land cover class, the first step is to
select a land cover map:

Input→ Load land cover map

A remote sensing land cover map can then be selected through a file browser window
(Figure 4-8). From v1.13 onwards images prepared either in ENVI format or as (geo)TIFF
files can be read. ENVI is preferred since the module makes use of the information from
the .hdr file. Either an ENVI image file or its associated .hdr file can be selected. In case
of TIFF files only labels will be identified.

Figure 4-8. File browser to insert an ENVI remote sensing classified map header (.hdr).

When this step is completed then in following steps the name of the different land cover
classes will appear. As such, per land cover class a new MLRA strategy can be
developed (see following sections).

4.4 Inserting input data with land cover class labels

If a land cover map has been loaded it is possible to group User data into multiple (land
cover) classes. For instance, it can be the case that field data were collected over
different parcels with distinct vegetation types. The assigning of input data to land cover
classes occurs through labels. In the text file one row should be assigned to this labeling,
using numbers. That row should be identified in the 'ID class line' box; then click on Add
(Figure 4-9).

Figure 4-9. Input window to load external User data, with a row assigned to the ID class line.

Once completed, a window will appear that allows linking the land cover classes with the
labels to which input data is assigned (Figure 4-10). The user can then manually link
each land cover class to an input class. When no class is assigned, that land cover class
will not be processed.

Figure 4-10. Window to link land cover classes with input classes.

4.5 Combining RTM data with external User data

In case both RTM and User input data have been inserted, these datasets have to be
linked so that they can be combined. The window shown in Figure 4-11 will appear. This
window shows the available input parameters. Since different names may have been
given, the user has to manually link the RTM parameters with the User parameters.
Make sure that both datasets are provided in the same units!

Figure 4-11. Window to match the RTM parameters with external user parameters.

Internally it checks whether both datasets contain the same spectral bands. When the
number of bands does not match the following Error message appears (Figure 4-12):

Figure 4-12. Error message when spectral bands from RTM data do not match with those of
User data.

Once loading input data is completed, the following step is configuring the MLRA Settings
for training and validation. Alternatively, the user can also use all data for training
without a validation in the Retrieval module.

5 Settings
Once the Input data has been configured, the Settings module can be accessed. This
module enables evaluating one or multiple MLRA scenarios prior to applying one to a
remote sensing image for biophysical parameter mapping. It is hereby expected that the
input data will be partitioned into training and validation data, since goodness-of-fit
results for the validation data will be presented.
Note that it is also possible to directly apply an MLRA model to a remote sensing
image. In that case the user can skip the MLRA settings and go immediately to
Retrieval. In that case no validation will be performed. In the Retrieval window a
model will be developed and directly applied to a remote sensing image.

In the MLRA settings it can be opted to select either single-output regression
algorithms or multi-output regression algorithms.
Currently, most of the regression algorithms provide just a single output (i.e. a developed
model provides predictions for one parameter). However, some MLRAs can provide
output estimates of multiple parameters.

The single-output option is explained first.

5.1 Single-output MLRAs


MLRA setting→ Single-output

Currently, the following single-output regression algorithms have been implemented
according to the simpleR toolbox (Figure 5-1). Note that Matlab's Statistics and
Machine Learning Toolbox is required to run simpleR.
1. Least squares linear regression
2. Principal component regression
3. Partial least squares regression
4. Regression tree (this method may lead to an error in Matlab 2013 or older)
5. Bagging trees
6. Boosting trees
7. Neural networks (note that the Matlab Neural Network Toolbox is required!)
8. Support vector regression (this method is activated again in MLRA v1.18)
9. Relevance vector machine
10. Extreme learning machine
11. Kernel ridge regression
12. Gaussian process regression (GPR)
13. Variational heteroscedastic Gaussian process regression (VHGPR)

Note that the structure of the more complicated MLRAs, such as neural networks, is
already internally predefined (e.g. with respect to hidden layers) in order to ensure ease
of use. For the kernel MLRAs (KRR, SVR, GPR) internal tuning takes place. This involves
adjusting the hyperparameters of the models, carried out automatically by partitioning
the training set and following an n-fold cross-validation strategy. For more information
regarding these regression algorithms, please consult:
http://www.uv.es/gcamps/code/simpleR.html.
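To illustrate what such internal tuning involves (this is a conceptual Matlab sketch, not the
toolbox's own code), the following function selects the RBF width and regularization of a
kernel ridge regression by n-fold cross-validation on the training set only:

% Conceptual sketch (not ARTMO's internal code) of tuning kernel ridge
% regression hyperparameters (RBF sigma, regularization lambda) with
% n-fold cross-validation on the training set only.
function [bestSigma, bestLambda] = tuneKRR(Xtr, ytr, nFolds)
    sigmas   = logspace(-1, 2, 10);       % candidate RBF widths
    lambdas  = logspace(-6, 0, 10);       % candidate regularizers
    cv       = cvpartition(size(Xtr,1), 'KFold', nFolds);
    bestRMSE = inf;
    for s = sigmas
        for l = lambdas
            rmseFold = zeros(cv.NumTestSets, 1);
            for f = 1:cv.NumTestSets
                Xa = Xtr(cv.training(f),:);  ya = ytr(cv.training(f));
                Xb = Xtr(cv.test(f),:);      yb = ytr(cv.test(f));
                Ka = exp(-pdist2(Xa, Xa).^2 / (2*s^2));   % train kernel
                Kb = exp(-pdist2(Xb, Xa).^2 / (2*s^2));   % test-vs-train kernel
                alpha = (Ka + l*eye(size(Ka,1))) \ ya;    % KRR weights
                rmseFold(f) = sqrt(mean((Kb*alpha - yb).^2));
            end
            if mean(rmseFold) < bestRMSE
                bestRMSE = mean(rmseFold); bestSigma = s; bestLambda = l;
            end
        end
    end
end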

Figure 5-1. MLRA single-output settings window.

Various configuration options have been implemented which can lead to optimized
performance:
• Multiple nonparametric regression algorithms can be selected at once for
evaluation. All possible combinations with the further defined settings will then be
assessed.

• Gaussian noise can be added, both to the parameters and to the spectra. A range
of noise levels can be configured so that multiple noise scenarios can be evaluated.
The injection of noise can be of importance to account for environmental and
instrumental uncertainties, e.g. when simulated spectra from RTMs are used for
training. Noise will be applied to both training and validation data (see the sketch
after this list).

• Range. A choice can be made to insert a single value or multiple values by activating
the Range option. All inserted configurations will then be assessed in the subsequent
validation. There are three ways to enter a range:

1. According to a step. By default, a minimum, maximum, and step (increment)
can be given. Hence, a range is created accordingly (Figure 5-2).

Figure 5-2: Adding a range through a fixed step.

2. Instead of a step, a number of samples may be given. Samples will be
organized according to a uniform distribution (Figure 5-3).

22
Figure 5-3: Adding a range through a uniform distribution and a number of samples.

3. According to user-inserted values. The distribution of samples can also be
provided manually, by inserting single values separated by a comma ','
(Figure 5-4).

Figure 5-4: User input samples.

• Training/validation (train/val) data partitioning. The train/val data partition can be
controlled by configuring the percentage of data assigned to training. The remaining
part is kept for validation. Training is for developing a model, while the developed
model will be validated with the remaining part of the partitioned dataset. Several
options are possible:
1. Just as for noise, a range can be inserted for the training data. The remaining
parts will go to validation.
2. It is possible to assign RTM data for training and User data for validation, or
the other way around.
3. Both datasets can also be merged by selecting portions of both datasets for
training or for validation. Also here multiple train/val partitions can be
evaluated by entering a range.
It is important to realize that only the training part needs to be provided.
The module will assign the remaining part to validation! For instance, when
inserting 50%, then 50% of the provided data will be used to build a model, and the
remaining 50% to evaluate this model. It is not recommended to provide 100%
training data since the module expects to validate the developed model with
independent (i.e. different) validation data. If 100% training is specified, the
module will automatically keep a minimal part for validation (~1%). Therefore,
it is preferable to allocate at most 95% to training.

• When both RTM input data and User data have been inserted (see section 4.5),
both datasets can be combined. Both the RTM data and USER data boxes will
then be activated. In principle for both data inputs a partition of training data
can be inserted; it is then assumed that for both the remaining parts go to
validation. However, by activating the boxes below 'Only train' or 'Only
Validation' it can be decided whether data will be excluded from training or
validation. In this way all kinds of partitioning combinations are possible by
assigning portions of User or RTM data to either training or validation.
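The following Matlab sketch illustrates, on synthetic data, the kind of preprocessing these
settings describe; the exact noise model and partitioning used by the toolbox may differ:

% Illustrative sketch (synthetic data; ARTMO's exact noise model may differ):
% add relative Gaussian noise to the spectra and split the data into a
% training and a validation portion.
rng(1);                                    % reproducibility
nSamples   = 500;
spectra    = rand(nSamples, 62);           % samples x bands
parameter  = rand(nSamples, 1) * 6;        % e.g. LAI

noisePct   = 5;                            % 5% Gaussian noise on the spectra
noisySpec  = spectra .* (1 + (noisePct/100) * randn(size(spectra)));

trainPct   = 70;                           % 70% training, remainder validation
idx        = randperm(nSamples);
nTrain     = round(nSamples * trainPct/100);
Xtrain = noisySpec(idx(1:nTrain), :);      ytrain = parameter(idx(1:nTrain));
Xval   = noisySpec(idx(nTrain+1:end), :);  yval   = parameter(idx(nTrain+1:end));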

5.2 Band tools: redundancy reduction
Since each added band puts a burden on the computational load, options to select the
most relevant bands have been added to overcome the Hughes phenomenon or 'curse
of dimensionality'.

Several options are available to reduce dimensionality:

1. User select. Here bands can be manually excluded by deactivating them
(Figure 5-5).

Figure 5-5.Band selection/removal option.

2. Mutual information calculates the mutual information between two discrete
variables (or a group and a single variable). See Figure 5-6. This option was
implemented based on code by Will Dwinnell. Some utility testing may be required.

Figure 5-6. Select bands according to Mutual Information technique.

3. Dimensionality reduction. From v1.19 onwards the dimensionality reduction
(DR) module SIMFEAT (simple feature reduction) is introduced (Figure 5-7). This
module replaces the earlier PCA module. In SIMFEAT 11 DR methods are
introduced, including the classical PCA and PLS, but also cluster-based methods
such as CCA and OPLS, as well as MNF and their kernelized (nonlinear) versions.

Regarding the cluster-based methods (CCA, OPLS, KCCA, KOPLS), internally a
clustering step of the selected variable is applied based on k-means clustering.
The DR method is then based on these clusters. By default the same number of
clusters as components is taken, but the user can introduce more clusters. By
clicking on 'cluster' a window appears (Figure 5-8) where more clusters can be
given. Regarding the kernelized DR methods, additional options are provided
regarding the kernel type, the sigma method used, and the optimization of the
sigma in view of regression. By default, 10 repetitions are applied using linear
regression, but that option can be deactivated or customized, e.g. by applying
the optimization with the user's regression method (which can be considerably
slower in case an advanced regression method is chosen). By default 5
components are given, but the user can insert any number.

Figure 5-7. SIMFEAT dimensionality reduction module with 11 DR methods and optimization
options.

Figure 5-8. Option to change the number of clusters for the cluster-based DR methods (CCA,
OPLS, KCCA, KOPLS). By default the same number of clusters as components +1 is provided.

4. GPR band analysis tool (GPR-BAT). From v1.14 onwards a band analysis tool
(BAT) has been added based on the band ranking properties of a few MLRAs.
Specifically, the GPR family, operating in a Bayesian framework, provides band
ranking properties, i.e. the lower the sigma around the band, the more important
the band. Consequently, high sigmas imply less relevant bands. With this property
a backward band reduction option is provided, whereby in each iteration the
poorest performing band is removed and the goodness-of-fit statistics are
recalculated. As such, the best performing bands are eventually identified, e.g.
for 4, 3, 2 bands, until finally one band is left. This approach can be of interest for
finding the most sensitive bands for a variable, as well as for ascertaining the
minimum number of bands needed to keep an acceptable accuracy (a conceptual
sketch is given at the end of this section). The method only works when first
clicking on GPR (or VH-GPR). When clicking on it, the following window appears
(Figure 5-9):

Figure 5-9. Sigma band analysis tool.

The following options are provided:


1. Band sorting based on either ranking or absolute. This option is of importance
in case of combining with a cross-validation strategy where per iteration
multiple models are generated. With ranking it uses the ranked position instead
of the sigma value in removing the worst band. With absolute it uses the actual
sigma value. Ranking tends to lead to better results.
2. With ‘# of bands to delete per iteration’ it can be opted to remove multiple
bands at once. That can be of use in case of having many bands available.
3. With ‘# of bands to delete at first iteration’ it can be opted to remove a large
amount of bands at the first operation. That can be of use in case of having many
bands available, but still interested in deriving the single best, second best, etc.
bands. With many bands, it could be an interesting strategy to remove the large
majority of bands at first iteration, and then continue to analyze band-by-band.
Once completed, the validation table as in Figure 6-2 will appear. When then selecting
results per number of bands, all results will appear.
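As a conceptual illustration of the backward band removal idea (not GPR-BAT's own
implementation), the following Matlab sketch uses fitrgp with an ARD kernel and treats
the learned per-band length-scales as the band relevance scores, removing the least
relevant band in each iteration:

% Conceptual sketch of backward band elimination (not GPR-BAT's own code):
% train a GPR model with an ARD kernel, drop the band with the largest
% learned length-scale (least relevant), and track validation RMSE.
function history = gprBandElimination(Xtr, ytr, Xval, yval)
    bands   = 1:size(Xtr, 2);
    history = struct('nBands', {}, 'rmse', {}, 'bands', {});
    while numel(bands) >= 1
        mdl = fitrgp(Xtr(:, bands), ytr, ...
                     'KernelFunction', 'ardsquaredexponential', ...
                     'Standardize', true);
        rmse = sqrt(mean((predict(mdl, Xval(:, bands)) - yval).^2));
        history(end+1) = struct('nBands', numel(bands), ...
                                'rmse', rmse, 'bands', bands); %#ok<AGROW>
        if numel(bands) == 1, break; end
        lengthScales = mdl.KernelInformation.KernelParameters(1:end-1);
        [~, worst]   = max(lengthScales);   % least relevant band
        bands(worst) = [];
    end
end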

5.3 Cross-validation module


From MLRA v1.10 onwards a cross-validation data partition module has been added
within Settings. This module provides various data partition options as alternatives to
the default single training-validation splitting. With cross-validation options more robust
data partitioning can be obtained, including the possibility of using all data both for
training and for validation.
The module is based on Matlab’s cvpartition, and requires Matlab’s Statistics Toolbox.

Matlab provides the following information (see http://es.mathworks.com/help/stats/cvpartition.html):


Description
c = cvpartition(n,'KFold',k) constructs an object c of
the cvpartition class defining a random partition for k-fold cross validation on n observations.
The partition divides the observations into k disjoint subsamples (or folds), chosen randomly but with
roughly equal size. The default value of k is 10.

c = cvpartition(group,'KFold',k) creates a random partition for a stratified k-fold


cross validation. group is a numeric vector, categorical array, string array, or cell array of strings
indicating the class of each observation. Each subsample has roughly equal size and roughly the same
class proportions as in group. cvpartition treats NaNs or empty strings in group as missing
values.
c = cvpartition(n,'HoldOut',p) creates a random partition for holdout validation
on n observations. This partition divides the observations into a training set and a test (or holdout) set.
The parameter p must be a scalar. When 0 < p < 1, cvpartition randomly selects
approximately p*n observations for the test set. When p is an integer, cvpartition randomly
selects p observations for the test set. The default value of p is 1/10.

c = cvpartition(group,'HoldOut',p) randomly partitions observations into a training


set and a test set with stratification, using the class information in group; that is, both training and test
sets have roughly the same class proportions as in group.

c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross


validation on n observations. Leave-one-out is a special case of 'KFold', in which the number of
folds equals the number of observations.

After clicking on Cross-validation, the following window will appear (Figure 5-10):

Figure 5-10. Single-output window with land cover class activated and Next Class button.

In accordance with the Matlab function cvpartition, the following methods are provided:
• k-fold
• Stratified k-fold
• Hold-out
• Stratified hold-out
• Leave-one-out

In the case of stratified k-fold, the option is provided to supply a file with group data.
As an example, to enable validating MLRA methods with the complete dataset, a k-fold
cross-validation technique can be employed. k-fold cross-validation means that the
dataset is randomly divided into k equal-sized sub-datasets. From these k sub-datasets,
k-1 sub-datasets are used as the training dataset, and the single remaining sub-dataset
is used as the validation dataset for testing the model. The cross-validation process is
then repeated k times, with each of the k sub-datasets used in turn as the validation
dataset. The results from each of the iterative processes are combined to produce a
single estimation. In this way, all the data are used for both training and validation, and
each single observation is used for validation exactly once (see the sketch below).
Once a cross-validation data partitioning has been selected, the Settings can be
completed as usual.
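The following Matlab sketch (illustrative, not the toolbox's internal code) shows the
principle on synthetic data: every sample is predicted exactly once while held out, the
pooled estimates are validated, and the model used afterwards is trained on all data:

% Minimal k-fold illustration with synthetic data and a GPR regressor.
rng(1);
X = rand(200, 20);                       % samples x bands (synthetic)
y = 3*X(:,1) + randn(200,1)*0.1;         % synthetic target
k   = 10;
cv  = cvpartition(size(X,1), 'KFold', k);
est = zeros(size(y));
for f = 1:k
    mdl = fitrgp(X(cv.training(f),:), y(cv.training(f)));   % train on k-1 folds
    est(cv.test(f)) = predict(mdl, X(cv.test(f),:));         % predict held-out fold
end
rmseCV   = sqrt(mean((est - y).^2));     % statistics over all subsets
finalMdl = fitrgp(X, y);                 % model used for retrieval: all data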

5.4 Active learning
From v.1.17 onwards a new module (plug-in) named “Active Learning” has been added
to Settings (Figure 5-11).

Figure 5-11. Settings window with Active Learning module.

The active learning module has been published in Verrelst et al. (2016). Essentially,
active learning (AL) methods enable selecting the most informative samples from an
additional large dataset. The AL methods sequentially search for meaningful samples
within a sampling pool (e.g. a LUT) in order to increase model performance. Six AL
methods are introduced for achieving optimized biophysical variable estimation with a
manageable training dataset. The selected criterion algorithms rank the samples
according to the uncertainty of a sample or its diversity. These criteria are sometimes
used together within classification problems and are here applied separately to
regression. Selecting samples by uncertainty picks the most uncertain samples, i.e.
those with the least confidence. Uncertainty criteria include variance-based pool of
regressors (PAL), entropy query by bagging (EQB), and residual regression AL (RSAL).
Selecting samples by diversity ensures that added samples are dissimilar from those
already accounted for. Diversity criteria include Euclidean distance-based diversity
(EBD), angle-based diversity (ABD), and cluster-based diversity (CBD). The algorithms
are described in Verrelst et al. (2016). A generic sketch of one such criterion is given below.
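As a rough illustration of how a diversity criterion can operate (the exact formulations are
those of Verrelst et al., 2016, and may differ from this sketch), the following Matlab
function mimics a Euclidean distance-based selection: at each iteration the pool samples
lying farthest from the current training set are added.

% Generic sketch of a Euclidean distance-based diversity (EBD) criterion
% (the exact ARTMO / Verrelst et al. 2016 formulation may differ): at each
% iteration add the pool samples that lie farthest from the training set.
% Example use: [Xtr, ytr] = ebdActiveLearning(Xtr0, ytr0, Xpool, ypool, 10, 5);
function [Xtr, ytr] = ebdActiveLearning(Xtr, ytr, Xpool, ypool, nAdd, nIter)
    for it = 1:nIter
        % distance of every pool sample to its nearest training sample
        dMin = min(pdist2(Xpool, Xtr), [], 2);
        [~, order] = sort(dMin, 'descend');
        pick = order(1:min(nAdd, numel(order)));
        Xtr = [Xtr; Xpool(pick,:)];   ytr = [ytr; ypool(pick)];
        Xpool(pick,:) = [];           ypool(pick) = [];
        if isempty(Xpool), break; end
    end
end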

When clicking on Active Learning the following window appears (Figure 5-12):

Figure 5-12. Settings of the Active Learning module.

In short, the following options are provided:

1. Selection of the AL methods. Multiple AL methods can be selected, which
implies that their performances can be compared in the Results section (see
section 6.3).
2. Number of samples to add gives the number of samples added in each
iteration.
3. Number of pool regressors is a decision-tree strategy setting for the pool
active learning (PAL) method. It has no impact on the other AL methods.
The following two options are stopping criteria, i.e. they will stop the AL iterations.
4. Number of iterations. The AL sequence will stop after the given number of
iterations.
5. LESS % RMSE. The AL sequence stops when a given % RMSE is reached.

Apart from selecting the AL methods, an obligatory step is to select the pool data (LUT
or User data). From these data AL selects the samples and adds them to the regression
method. As with the input data, the pool data can come either from ARTMO RTM projects
or from User data. The same steps have to be followed as in section 4. When data is
selected the following window appears (Figure 5-13), where the originally selected
variable has to be matched with the variable of the pool data (this should be the same
variable, but the names may differ).

Figure 5-13. Window to match name original input variable with name of pool input variable.
Obviously, the variable should be the same.

5.5 Advanced options
From v1.19 onwards some advanced options have been added. The idea is that the
advanced user gets the possibility to tune the MLRAs. Until now that was not possible
and only default settings were applied. In this version it becomes possible to tune the
optimization of neural networks; in future versions other MLRAs will also become
tunable. When clicking on:
advanced options→ neural networks
then the following window will appear (Figure 5-14):

Figure 5-14. Advanced Neural network options.

In this window it is possible to select various training optimization methods. By default
'Levenberg-Marquardt backpropagation' is selected, but that method can be slow in case
of large datasets. Alternatively, 'scaled conjugate gradient backpropagation' is also
powerful and considerably faster. Furthermore, some additional settings are provided.
For instance, the development of a net is repeated 10 times, since the initialization can
have an impact on the final result. Also Matlab options to calculate the NN on a GPU or
by using parallel computing can be selected. A sketch of these options is given below.
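The sketch below shows, with Matlab's Neural Network Toolbox and synthetic data, the
kind of choices this window exposes; the settings are illustrative and not necessarily
ARTMO's defaults:

% Sketch of the kind of options the Advanced NN window exposes, using
% Matlab's Neural Network Toolbox directly (settings are illustrative).
X = rand(300, 20);                    % samples x bands (synthetic)
y = 3*X(:,1) + randn(300,1)*0.05;     % synthetic target
nRepeats = 10;  bestPerf = inf;
for r = 1:nRepeats                    % repeat training: initialization matters
    net = feedforwardnet(10, 'trainscg');     % scaled conjugate gradient
    % net = feedforwardnet(10, 'trainlm');    % Levenberg-Marquardt alternative
    net.trainParam.showWindow = false;
    [net, tr] = train(net, X', y');           % add 'useGPU','yes' or
                                              % 'useParallel','yes' if available
    if tr.best_vperf < bestPerf
        bestPerf = tr.best_vperf;  bestNet = net;
    end
end
estimates = bestNet(X')';             % apply the best of the repeated nets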

5.6 Configuring per land cover class


Note that if land cover classes were defined earlier, the aforementioned options can be
configured per land cover class. At the top of the settings window the current class will
be displayed (Figure 5-15). To switch to the next land cover class, click on Next Class.

Figure 5-15. Single-output window with land cover class activated and Next Class button.

5.7 Multi-output regression algorithms


Some MLRAs are able to provide multiple outputs. Multi-output regression algorithms
include:
• partial least squares regression (PLSR)
• neural networks (NN)
• kernel ridge regression (KRR)
Note that these regression algorithms can also be used for single-output regression.
The Settings window is precisely the same as for the single-output case (Figure 5-16).
Hence, when aiming to make use of the multi-output option, multiple input parameters
need to be assigned during the Input process.
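As a minimal illustration of the multi-output principle (synthetic data, not ARTMO's code),
a single PLSR model can predict several variables at once:

% Minimal multi-output illustration (synthetic data): one partial least
% squares regression model predicts two variables simultaneously.
X     = rand(200, 62);                          % samples x bands
Y     = [6*X(:,5)  + randn(200,1)*0.1, ...      % e.g. LAI
         80*X(:,20) + randn(200,1)];            % e.g. leaf chlorophyll content
ncomp = 10;
[~, ~, ~, ~, beta] = plsregress(X, Y, ncomp);   % beta is (nBands+1) x 2
Xnew  = rand(5, 62);                            % new spectra to predict
Ypred = [ones(size(Xnew,1),1), Xnew] * beta;    % columns: LAI, chlorophyll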

Figure 5-16. GUI to configure the MLRA settings for multi-output.

6 Validation
6.1 Validation: New
Finally, once Input data has been provided and the MLRA settings (single-output or
multi-output) have been configured, those scenarios can be run. All possible combinations
that were defined during the MLRA settings will be applied in the training phase and
evaluated with the remaining data that was assigned to validation. To start the analysis
the following steps have to be taken:

Validation→ New

A text box will appear where a name can be filled in (Figure 6-1):

Figure 6-1. Window to provide a name for the new validation table.

Note that if no name is provided, an automatic (default) name consisting of the current
date and time (year, month, day, minute, second) will be used. By clicking on OK, all
configured scenarios will subsequently be analyzed one by one. Validation results are
automatically saved in the current MySQL database. This has the advantage that a large
number of results can be stored in a systematic manner and that they can easily be
queried later.
A message will appear that the MLRA analysis is proceeding, and within Matlab’s
command window the process status can be followed.
Once all scenarios have been analyzed, then an overview table with best validated
results will appear (Figure 6-2). It is important to note that only validation results are
presented in the ‘MLRA test table’. Training results are not provided because they
provide biased information, i.e. they tend to be over-optimistic because they predict on
the same data they were trained with.
For validation the following goodness-of-fit measures are provided:

Table 1. Goodness-of-fit statistical measures

1 Mean absolute error (MAE)
2 Root mean squared error (RMSE)
3 Relative RMSE (RELRMSE)
4 Normalized RMSE (NRMSE)
5 Correlation coefficient (R)
6 Coefficient of determination (R²)
7 Adjusted R²
8 Nash-Sutcliffe efficiency (NSE)
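The corresponding formulas are rendered as images in the original layout; in standard notation
they read as follows (a reconstruction: in particular the normalization chosen for RELRMSE and
NRMSE may differ slightly in the toolbox, and R² is here taken as the square of the correlation
coefficient R). $y_i$ are the observed values, $\hat{y}_i$ the estimates, $\bar{y}$ the observed
mean, $n$ the number of validation samples and $p$ the number of predictors:

\begin{align*}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| &
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \\
\mathrm{RELRMSE} &= 100\,\frac{\mathrm{RMSE}}{\bar{y}} &
\mathrm{NRMSE} &= \frac{\mathrm{RMSE}}{y_{\max}-y_{\min}} \\
R &= \frac{\sum_i (y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_i (y_i-\bar{y})^2}\,\sqrt{\sum_i (\hat{y}_i-\bar{\hat{y}})^2}} &
R^2_{\mathrm{adj}} &= 1-(1-R^2)\,\frac{n-1}{n-p-1} \\
\mathrm{NSE} &= 1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2} & &
\end{align*}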

All measures indicate the degree of association between estimated and observed values
of the same variable. Apart from MAE and adjusted R², these statistical measures are
those proposed by Richter et al. (2012), who consider them an optimal statistical set.

Richter, K., Atzberger, C., Hank, T. and Mauser, W. (2012): Derivation of biophysical variables from Earth
Observation data: validation and statistical measures, Journal of Applied Remote Sensing 6 (1), DOI:
10.1117/1.jrs.6.063557.

MAE was also considered because it can be used together with RMSE to diagnose the
variation in the errors in a set of predictions. The RMSE will always be larger than or
equal to the MAE; the greater the difference between them, the greater the variance in
the individual errors in the sample. If RMSE = MAE, then all the errors are of the same
magnitude².

Regarding this set of recommended statistical measures, Richter et al. (2012) also
provided recommended ranges/values.

Table 2. Proposed optimal set with recommended desirable ranges/values, characteristics/
advantages, as well as shortcomings/disadvantages according to Richter et al., 2012.

² http://www.eumetcal.org/resources/ukmeteocal/verification/www/english/msg/ver_cont_var/uos3/uos3_ko1.htm

In the validation table the best performing results are shown according to the selected land
cover class (if configured), parameter and statistic (Figure 6-2).

Figure 6-2. Validation table with options to organize statistical results and plotting options. A
strategy can be chosen to apply in retrieval.

The statistics are organized by best result per regression model (in case multiple
options are calculated). The user can choose according to which statistic to sort the
results. When more options per regression model are calculated, these results can also
be shown by changing the number at the top. From v.1.19 onwards, also the results of the
dimensionality reduction (DR) methods of the SIMFEAT module can be accessed by
changing the number at the top.

6.1.1 Graphics

In the 'Graphics' section, when selecting a validated regression model (e.g. the best
performing one), it is possible to visualize the prediction performance of the selected
regression model.
Further, various options to display the results are provided:

 Measured vs estimated (Figure 6-3):

Figure 6-3. Measured vs. estimated validation data based on a GPR model.

 Plotting the relevant bands of GPR (or VHGPR) through their sigmas (Figure 6-4):

A special case is the Gaussian process regression algorithm (GPR) and its variational
heteroscedastic variant (VHGPR). These provide additional outputs (which we call sigmas)
that are generated during the development of a model. These sigmas provide an
indicator of the relevance of the contributing bands: the lower the sigma, the more
important the band.

Figure 6-4. Relevant bands (sigmas) of a GPR model.

 Residuals (Figure 6-5)

The residual plots show how much the estimated values deviate from the measured
values.

Figure 6-5. Residual plots vs. predicted values.

Additionally, the residuals can also be visualized as a Histogram (Figure 6-6).

Figure 6-6. Histogram of the residuals.

 Advanced statistics (Figure 6-7)

It can be opted to calculate so-called advanced statistics for a selected regression model.
The model will be recalculated in order to compute these advanced statistics. It was decided
to keep the advanced statistics separate from the standard statistics because they may take
considerable computational time. For the moment only the Theil-Sen slope is calculated, but it
is foreseen that more statistical measures will be added in the future.

Figure 6-7. Table with advanced statistics.
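For reference, the Theil-Sen slope is simply the median of all pairwise slopes between the
validation points, which makes it robust against outliers but quadratic in the number of
samples. A minimal Matlab sketch (assuming vectors yObs and yEst of measured and estimated
values):

% Theil-Sen slope sketch: median of all pairwise slopes between validation points.
n = numel(yObs);
slopes = [];
for i = 1:n-1
    for j = i+1:n
        if yObs(j) ~= yObs(i)                   % skip undefined (vertical) slopes
            slopes(end+1) = (yEst(j) - yEst(i)) / (yObs(j) - yObs(i)); %#ok<AGROW>
        end
    end
end
theilSenSlope = median(slopes);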

 Matrices of performances along ranges such as noise and train/val partitioning
(Figure 6-8).

By clicking on Options in the figure top bar, it is possible to manipulate plotting
features such as the color table, axis and title fonts, etc. (Figure 7-8). A figure is
visualized by clicking on Sample. It is also possible to export the figure in other, more
conventional image types (jpeg, tiff, pdf, emf, eps). Hereby, redundant white space
around the figures will be automatically removed. When clicking on View, the figure will
be visualized according to the configured settings. When subsequently clicking on Save,
a file browser will appear to save the figure in the chosen format.

Figure 6-8. 2D correlation plot with validation statistics for a GPR model based on
training/validation partition and inserted noise range.

Finally, when selecting an MLRA strategy, it will move to the bottom panel. If strategies
are configured per land cover class, then multiple strategies can appear here. When
clicking on Done, these strategies will be transferred to the Retrieval window
(Figure 7-1).

From MLRA v1.10 onwards, thanks to the cross-validation data partition module (see
Section 5.3), the table of validation results shows the cross-validation results, i.e.
the mean of the different cross-validation runs. As such, it provides a more
robust indication of the predictive power of the different regression models. From v.
1.19 onwards the general statistics are calculated based on all estimations vs.
observations: the measured vs. estimated figure will show the 1:1-line scatter plot
with all points. The statistics are also calculated per subset:
 Cross-validation statistics (Figure 6-9):

Figure 6-9. General statistics of the cross-validation results from the selected model.

For the selected result, the cross-validation statistics can be provided. Because the
cross-validation method provides results for data sub-sets, various basic statistics can
be provided. By inspecting, e.g., the min-max range and standard deviation, an indication
of the robustness of the method is obtained.
From v. 1.19 onwards, regardless of the selected cross-validation method, when
proceeding to retrieval the regression model will be based on all training data.
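Conceptually, these cross-validation statistics correspond to the following Matlab sketch,
using the generic cvpartition utility and fitrgp as stand-ins for the toolbox's own partition
module and regression models (X and y are assumed to hold the spectra and the target variable):

% k-fold cross-validation sketch: one RMSE per fold, then summary statistics.
k = 5;
cvp = cvpartition(size(X, 1), 'KFold', k);
rmseFold = zeros(k, 1);
for f = 1:k
    mdl  = fitrgp(X(training(cvp, f), :), y(training(cvp, f)));   % any regression model
    yhat = predict(mdl, X(test(cvp, f), :));
    rmseFold(f) = sqrt(mean((y(test(cvp, f)) - yhat).^2));
end
fprintf('RMSE  mean %.3f  std %.3f  min %.3f  max %.3f\n', ...
    mean(rmseFold), std(rmseFold), min(rmseFold), max(rmseFold));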

In case the dimensionality reduction (DR) SIMFEAT module was selected, from v. 1.19
onwards its statistics can be inspected as well.

6.2 Outputs GPR band analysis tool (GPR-BAT)

From v. 3.17 onwards a new band visualization tool has been implemented that displays
results of the GPR band analysis tool (GPR-BAT) (Figure 6-10). The displayed results
provide insight into the band sensitivity towards a targeted biophysical parameter. The GPR
sigma band analysis tool iteratively removes the least contributing band while developing a
GPR model. As such, it is assumed that eventually the best-performing bands are
identified.
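To illustrate the principle only (this is not the toolbox's own code), a backward band
elimination driven by GPR ARD length scales could look as follows in Matlab, with fitrgp used
as a stand-in for the toolbox's GPR implementation and X, y assumed to hold the spectra and
target variable:

% GPR-BAT principle sketch: iteratively drop the band with the largest sigma
% (ARD length scale), i.e. the least relevant band, and refit the model.
bands = 1:size(X, 2);
while numel(bands) > 1
    mdl    = fitrgp(X(:, bands), y, 'KernelFunction', 'ardsquaredexponential');
    params = mdl.KernelInformation.KernelParameters;   % [length scales; signal std]
    sigmas = params(1:numel(bands));                    % one sigma per remaining band
    [~, worst] = max(sigmas);                           % largest sigma = least relevant band
    fprintf('Removing band %d (%d bands remaining)\n', bands(worst), numel(bands) - 1);
    bands(worst) = [];
end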
The visualization tool only functions in case the GPR sigma band analysis tool was
activated in the Settings module. The displayed statistical results depend on the chosen
goodness-of-fit statistic in the MLRA validation table. When the Graphics check box of a
validated model is selected, the GPR sigma band analysis tool can be activated below by
clicking on Plot.

Figure 6-10. GPR sigma band analysis tool as activated in the MLRA validation table.

The tool provides the following options:

 Figure validation statistics over iterative band removal. This figure provides
goodness-of-fit statistics over the iteratively removed bands until only one
band is left. In case a cross-validation method is applied, additionally the
standard deviation and min-max range are given (Figure 6-11):

Figure 6-11. Goodness-of-fit validation statistics over the wavelengths, where iteratively each
time the least contributing wavelength is removed. In case cross-validation is applied, also the
standard deviation and min-max range are provided.

 Table best band. This table provides the wavelengths of each iterative band
removal round. Of interest are the wavelengths retained in the last few best performing
rounds (Figure 6-12):

Figure 6-12. Table of wavelengths for each iterative round and associated goodness-of-fit
validation statistics.

 Figure frequency top ranked bands. This provides the frequencies of the top
performing bands in case a cross-validation strategy is applied. The user can
choose at which number of wavelengths the frequency ranking is applied and
how many top-performing rankings should be included (Figure 6-13).

Figure 6-13. Frequency ranking of best ranked wavelengths in case of cross-validation
techniques.

6.2.1 Select regression model

In the 'Select' section (left check boxes), when selecting a validated regression
model (e.g. the best performing one), it will be moved to the bottom panel and can
then be used as a prediction model to retrieve biophysical parameters. See Section 7,
Retrieval.
It is also possible to export regression models. By clicking on Export, the option is given
to save a .mat file in which all MLRAs and relevant information from the current
exercise are included. In Tools it is then possible to import the .mat file. See Tools, Import
model.

6.3 Active learning module

From v1.17 onwards a new module (plug-in) named 'Active learning' (AL) is included. In the
overview table, the best performing AL method, the number of iterations (iter) and the
number of added samples (samples) are given. When AL has been selected in Settings,
the results for AL can also be viewed. To do so, select 'active learning performance'.
A table will appear with the statistical results (Figure 6-14). Apart from the statistics, also
the processing time, number of added samples and number of iterations are given. Samples
are only added when the performance improves. That explains why the algorithm can run up
to its maximum number of iterations (e.g., 100) while eventually only a few samples are
added: in the other iterations the performance did not improve and the candidate samples
were discarded.
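The accept-only-if-improved behaviour can be summarized with the simplified Matlab sketch below
(illustrative only: the actual AL methods such as EBD or CBD use more elaborate selection
criteria than the random pick shown here, and fitrgp stands in for the chosen MLRA;
Xtrain/ytrain, Xpool/ypool, Xval/yval and an initial bestRmse are assumed to exist):

% Simplified active-learning loop: a candidate sample is only kept when it
% improves the validation performance (here RMSE); otherwise it is discarded.
for it = 1:100
    idx  = randi(size(Xpool, 1));                           % pick a candidate (randomly here)
    mdl  = fitrgp([Xtrain; Xpool(idx, :)], [ytrain; ypool(idx)]);
    rmse = sqrt(mean((yval - predict(mdl, Xval)).^2));
    if rmse < bestRmse                                      % keep the sample only if it helps
        Xtrain = [Xtrain; Xpool(idx, :)];
        ytrain = [ytrain; ypool(idx)];
        Xpool(idx, :) = [];   ypool(idx) = [];
        bestRmse = rmse;
    end
end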

Figure 6-14. Table with statistical results of the Active Learning (AL) methods, including
processing time, added samples and iterations.

From this table some more results can be visualized, i.e. (1) the performance over the
iterations and (2) the added samples.

An example of the performance plot is given in Figure 6-15. It can be seen that in this case
CBD eventually performs best. It can also be seen that random sampling (RS) shows
very irregular behavior. Again, when the performance goes down, the added samples are
discarded in the following iteration.

Figure 6-15. AL performance for the selected AL methods along the given iterations according
to the selected statistic.

Next, the samples added to reach the best performance can be plotted for a chosen
statistic (e.g. R²) (Figure 6-16). It can again be observed that, despite 100 iterations,
the addition of only a few samples caused the improved performances. Here it can be
observed that EBD and CBD needed the most samples, but at the same time these
methods perform best; they led to the most accurate regression models.

Figure 6-16. Number of added samples for the given iterations.

6.4 Validation: Load

It is also possible to load earlier stored tests:

Test MLRAs→ Load test

Then the following window will appear (Figure 6-17). From v. 1.19 onwards some
additional information about each test is provided when clicking on a test. This information
includes (1) whether the data comes from an RTM or from a USER file, (2) the path of the
used data, (3) the used variable, (4) the number of samples, and (5) the number of
wavelengths. The information will appear in the right panel.

Figure 6-17. List of completed validation tables.

The validation table of the selected name will appear when clicking on 'Select'. In this
way, previously developed models can be easily recalled and used for mapping
applications.

7 Retrieval
From v.1.15 onwards it is possible to retrieve biophysical parameters either from an
image, resulting in a map, or from a text file. A text file could, for instance, contain field
spectrometer measurements for which no validation data are available. The retrieval
module will then process those data and deliver the targeted biophysical parameters as
output.

7.1 Retrieval: Image

In the Retrieval window (Figure 7-1) an earlier selected MLRA strategy appears (e.g. the
one selected in Figure 6-2). Alternatively, it is also possible to directly configure a
relationship and apply it to an image to map a parameter. Hence, the user can select
the required land cover class (if available), the retrievable parameter, the regression
algorithm and the train/val partitioning. Similarly, noise can be added to the spectra or
parameters and the size of the training data can be selected. Multiple retrieval strategies
can be added by clicking on ADD, e.g. one for each retrievable parameter. When opting
directly for a retrieval strategy without validation, 100% of the data can be used for
training.
In earlier versions the retrieval module rebuilt the regression model, so it was required
that the training data was still available. From v. 1.04 onwards the regression model is
also saved within the MySQL Assessment table. There is thus no longer a need to load
the training data. However, when applying the model, a requirement is that the image
consists of the same number of bands as those that were presented during the training
phase. Otherwise they will not match and an error will appear (see also Figure 7-5).

Figure 7-1. MLRA retrieval module.

From v. 1.13 onwards it is possible to convert the to-be-processed image (or all images)
into different units. That can be of importance in case the training data is given in different
units than the image; they have to match. A multiplicative conversion factor can be
entered.
By clicking on OK, the input images can be loaded and the output maps will be written
away. If the input file is in ENVI format the output will also be written as an ENVI file.
Similarly, if the input file is a TIFF file then the output is also written as a TIFF file. The
following window will appear to select the input remote sensing images (either *.hdr or
*.tiff) (Figure 7-2):

Figure 7-2. File browser to select the input remote sensing images.

Finally, the output file or folder can be selected. From v. 1.18 onwards a distinction is
made between processing only one image or multiple images. In case of processing one
image, the output folder can be selected and a suggested output name will be given,
which can be edited. The suggested output name consists of the input name plus the
selected variable and MLRA. See Figure 7-3.

Figure 7-3. Folder browser to select an Output name and folder.

In case multiple images are selected, only the output folder can be selected
(Figure 7-4). The output map will be written away in the same format as the input image.
By default it will point to the folder of the input images. The same name will be used,
with the output variable and MLRA added to it.

Figure 7-4. Folder browser to select a folder where the output maps are located.

It is of importance that an image is selected with the same band settings as those that
were presented during the training phase. If the band numbers do not match, the
following error appears (Figure 7-5):

Figure 7-5. Error message in case the number of bands does not match those that have been
presented during the training phase.

Note that this error check only verifies that the number of bands matches. Make sure that
the wavelengths also match.
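A minimal sketch of such a check on the user side (hypothetical variable names img and
trainWavelengths):

% Band-matching check sketch: the toolbox only verifies the band count.
assert(size(img, 3) == numel(trainWavelengths), ...
    'Number of image bands (%d) does not match the number of training bands (%d).', ...
    size(img, 3), numel(trainWavelengths));
% Verifying the wavelengths themselves, e.g. max(abs(imgWavelengths - trainWavelengths))
% below a tolerance, remains the responsibility of the user.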

Finally, a map is created. In the Matlab command window the processing time will be
displayed. When completed, the following window appears (Figure 7-6). In this window an
output layer can be selected and visualized by clicking on VIEW.

Figure 7-6. Window to select a generated output map (through Open Map) and then to select a
layer. By clicking on Preview mapping options are provided.

From MLRA v. 1.18 onwards, it is possible to apply a mask to an output layer in case the
output file consists of multiple layers (bands), e.g. in case of GPR. By clicking on Mask
in the menu bar, the following GUI appears (Figure 7-7):

Figure 7-7. Mask option when having clicked on Mask. In this GUI a band to be used as mask
can be selected and the boundaries for pixels to keep.

In this window the band to be used as mask can be selected (e.g. the GPR CV map that
gives the relative uncertainties), together with the min and max thresholds for pixels to
keep (e.g. only pixels with uncertainties between 0 and 30%). Then, by clicking on OK, the
mask will be activated. Subsequently, in the Output map window (Figure 7-6), when the
Mask check box is activated, clicking on View will show only the pixels that fall within the
thresholds. The pixels outside the thresholds will be masked out (set to NaN).
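In terms of the underlying operation, the masking corresponds to something like the following
Matlab sketch (hypothetical variable names; outMap is assumed to hold the mean estimate in
layer 1 and the CV in layer 3):

% Mask sketch: keep only pixels whose relative uncertainty (CV, in %) lies
% within the chosen thresholds; all other pixels are set to NaN.
cv     = outMap(:, :, 3);            % e.g. the GPR CV layer
keep   = cv >= 0 & cv <= 30;         % thresholds as entered in the Mask GUI
masked = outMap(:, :, 1);            % the layer to be displayed
masked(~keep) = NaN;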

A figure will be provided (Figure 7-8; left). By clicking on Options in the figure top bar,
options are provided to manipulate mapping features such as the color table, axis and
title fonts, etc. (Figure 7-8; right).
In order to have the map oriented the right way, make sure to click on the orientation "ij".
A map is visualized by clicking on Sample. It is also possible to generate the map in
other, more conventional image types (jpeg, tiff, pdf, emf, eps). Hereby, redundant white
space around the figures will be automatically removed. When clicking on View, the map
will be visualized according to the configured settings. When subsequently clicking on
Save, a file browser will appear to save the map in the chosen format.

Figure 7-8. Generated map in the Figure window [left] and mapping options such as color
tables, colorbar, legend, and exporting options [right].

The GPR deserves special attention since apart from mean estimates it also provides
associated uncertainty estimates, expressed as the standard deviation (SD) around the
mean estimate. This map is presented as a second layer. Further, since the magnitude of
the SD may be related to the magnitude of the mean estimate, as a third layer also the
coefficient of variation (CV: SD/mean estimate) is provided. This map can be considered
as a relative uncertainty (e.g. see Figure 7-9).
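Expressed per pixel, the third layer is derived from the first two as in the sketch below
(hypothetical layer names):

% Coefficient of variation sketch: relative uncertainty in percent.
meanEst = outMap(:, :, 1);               % mean estimates (band_1)
sdEst   = outMap(:, :, 2);               % standard deviation around the mean (band_2)
cvMap   = 100 * sdEst ./ meanEst;        % band_3: relative uncertainty (%)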

Figure 7-9. Examples of output maps.

Regarding the map viewing window (Figure 7-6), note that in case of processing a TIFF
image no band names will be given, because TIFF is not associated with a header file. In
case of GPR and VH-GPR multiple outputs are provided, but only band_1, band_2 and
band_3 are given. They have the following meaning:
1. band_1: retrieved variable (e.g. LAI)
2. band_2: standard deviation (SD) around the mean (absolute uncertainty)
3. band_3: coefficient of variation (CV = SD/mean estimate * 100). This map can be
interpreted as a relative uncertainty, expressed as a percentage.
From v. 1.18 onwards it is possible to save the masked file as an ENVI file. This is done
through Options→ Save as format: ENVI.

7.2 Retrieval: Text file

From v. 1.15 onwards it is also possible to process text files, typically coming from field
spectrometers. With this option not an image but a text file with spectra is processed.
A new text file is then created with the retrieved biophysical parameters and, in the case
of the GPR family, additional statistics (relative and absolute uncertainties).
Essentially the same steps as for processing an image have to be followed, including
selecting the LUT and inversion strategy (see also Section 7.1). However, instead of
selecting an image, a text file with spectra is selected (Figure 7-10).

Figure 7-10. Input window with spectral data. Although the GUI is quite similar to the validation
step, no validation data (field data of biophysical parameters) are required.

After clicking on Import, the following step is to select the output file and its location. By
default the same folder as the input text file is proposed. Also, by default the same name
as the input text file is given, but with the extension '_MLRAinv' (Figure 7-11). This output
name is editable. Once completed, a message 'OK' will appear.

Figure 7-11. Folder browser to select the output file. By default the same location and file name
as the input file is given, with the extension of ‘_MLRAinv’.

The output file provides the retrieved data of the selected variables and some metadata on
how the text file is organized. An example is provided in Figure 7-12.

Figure 7-12. Output text file with retrieved values of selected biophysical parameters, and in
case of GPR absolute and relative uncertainties.

8 Tools
The MLRA module contains various tools:

8.1 Save
Tools→ Save
The Save option enables saving all input configurations defined in the toolbox. A file
browser will appear where the file can be saved (Figure 8-1). As such, input data and
settings can be saved to a Matlab .m file. This option can be of interest when aiming to
repeat an analysis with the same dataset.

Figure 8-1. File browser to save general MLRA settings.

8.2 Load
Tools→ Load
By clicking on Load, an earlier saved .m file can be loaded. A file browser will appear
(Figure 8-2). The settings are then inserted into the MLRA toolbox.

Figure 8-2. File browser to load saved general MLRA settings.

8.3 Manage tests

Tools→ Manage tests →Rename
When clicking on Rename, the possibility is provided to rename a test. The overview of
validation tables in the current DB is first shown (Figure 8-3, left). From v. 1.19 onwards
this GUI provides additional information in the right panel when clicking on a test. Once
a validation table has been selected, the following text box will appear (Figure 8-3, right).

Figure 8-3. List of Validation tables in current DB [left], and text box to insert a new name [right].

Tools→ Manage tests →Delete

Clicking on Delete enables deleting a table with test results. The same validation table
overview window will appear (Figure 8-4, left). By selecting a name, that table will be
deleted. From v. 1.19 onwards it is possible to delete multiple tests at once by first
selecting their check boxes. By then clicking on 'Select' all these tests will be deleted. A
confirmation message will appear (Figure 8-4, right).

Figure 8-4. List of Validation tables in current DB [left], and message that deletion is completed
[right].

8.4 Options
Tools→ Options
Finally, by clicking on Options the following options are provided (Figure 8-5):
 Seed: Here the seed for generating random numbers can be changed. Random
numbers are used for the training and validation partitioning. When changing the seed,
another random training and validation partitioning will thus be applied.
 Change negative results: This option enables converting negative values into other
values, e.g. close to zero. This is because negative values are not physically possible
and can reasonably be assumed to represent zero (non-existing). However, in some
cases, such as GPR, a zero value may not be preferred because of further calculations
(e.g., coefficient of variation: [standard deviation] / [mean estimate]).
 Skip NaN values: To speed up the processing, particularly for geometrically corrected
images, it is wise to skip NaN (not a number) values. Hence only pixels with real values
are processed. The desired output value can be provided.
 Skip pixels where all bands have the same value: Often, instead of NaN, pixels with no
physical meaning are all given the same value, such as 0, 255 or -9999. With this option
these pixels will be skipped during processing.
 Processing speed: By default the processing speed that is required to develop and
validate an MLRA model is recorded. This processing speed is then provided in the
Assessment table. However, here it can be deactivated, since recording the processing
speed also takes processing time.
 Processing mapping speed: By default the processing speed that is required to
process an image through the MLRA model is recorded. This processing speed can
also be saved in a text file when activating the check box.
 As images are processed line by line, from v. 1.19 onwards it is possible to process
images block by block. That can be faster in some cases, and as such it even becomes
possible to process the complete image at once. That is only recommended in case the
image is small, in order to avoid memory problems. In that case the value '-1' has to be
given (see the sketch below).
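A rough sketch of what block-by-block processing amounts to is given below (illustrative only;
img, mdl and blockRows are hypothetical names, and predict stands in for applying the trained
regression model):

% Block-by-block processing sketch: reshape each block of image rows into a
% pixel list, apply the regression model, and write the predictions back.
blockRows = 100;                         % use the full number of rows ('-1') to process at once
[nRows, nCols, nBands] = size(img);
outMap = nan(nRows, nCols);
for r0 = 1:blockRows:nRows
    r1 = min(r0 + blockRows - 1, nRows);
    block = reshape(img(r0:r1, :, :), [], nBands);              % pixels x bands
    outMap(r0:r1, :) = reshape(predict(mdl, block), r1 - r0 + 1, nCols);
end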

Figure 8-5. Options window with option to change Seed, to change negative results to a positive
value, to skip NaN values and to deactivate recording processing speed.

8.5 View maps

Tools→ View maps
With the View maps option the same window as in Figure 7-6 appears, which enables
selecting an ENVI image file (Figure 8-6). An output map can be loaded through Open Map,
and then an output layer can be selected. When clicking on Preview, all kinds of mapping
options are provided (see also Figure 7-8).

Figure 8-6. Window to select a generated output map (through Open Map) and then to select a
layer. By clicking on Preview mapping options are provided.

8.6 View figure

Tools→ View figure
With the View figure option it is possible to reopen a Matlab figure (.fig) by selecting a figure
through the file browser. The Matlab figure window appears, but with the inclusion of the
Options button in the top bar (see also Figure 6-3, Figure 6-4, Figure 6-8, Figure 7-8).

8.7 Import model

Tools→ Import model
From MLRA v. 1.13 onwards it is possible to load earlier-exported MLRA models (.mat
files). A window will appear that provides information on the targeted biophysical variable
and the required wavelengths (Figure 8-7). From v. 1.18 onwards two options are provided
by clicking on 'Retrieval': (1) 'Image' and (2) 'Txt file'. As such the model can be applied
directly to a remote sensing image or used to process a text file with spectral data.

Figure 8-7. Window to Import an earlier-exported MLRA model (.mat) with interface to display
targeted parameter, used MLRA and required wavelengths.

8.8 ScatterPlot
Tools→ScatterPlot
In all the retrieval toolboxes a scatter plot tool has been added (Figure 8-8). With this
tool a scatter figure of two images can be plotted. Additionally, goodness-of-fit statistics
can be displayed. The tool requires loading an image, selecting a band and deciding
whether it should be plotted on the X-axis or the Y-axis. These criteria are then added by
clicking on ADD. The same steps have to be repeated for the second image, and then the
other axis has to be selected. By activating 'Assessment', apart from the scatter plot
(Figure 8-9) also goodness-of-fit statistics are displayed (Figure 8-9). The scatter plot uses
a color scale according to the density of the data cloud.
From v. 1.16 onwards it is also possible to plot error maps, both absolute and relative (in
%). Together with these maps also histograms are plotted (Figure 8-10). The scaling of
the colormap and the number of bins of the histogram can be controlled.

Figure 8-8. Scatterplot tool to display a scatterplot of two images. When activating ‘Assessment’
then also goodness-of-fit statistics are displayed.

Figure 8-9. Example of scatterplot (left) and goodness-of-fit statistics (right).

Figure 8-10. Example of absolute error map (left) and associated histogram (right).

8.9 Validation external data
Tools→ Validation external data
From v. 1.16 onwards it is possible to apply an earlier configured and validated model to
external data for a new validation. This tool is useful to evaluate the portability of a
developed model. To activate this tool, first an earlier validated model has to be selected;
see also Section 6.4 on how to load and select an MLRA model. Otherwise a message
will appear with the request to load a retrieval model. Multiple models for different
variables can be selected.

When a retrieval model has been loaded, the Import USER data window appears
(Figure 8-11). See also Section 4.2. Because a model has been selected that is related
to a specific variable, the given variable appears in the Variable window. In case multiple
models are selected, all the variables are listed in the drop-down menu. For each variable
the associated line number of the file with external data has to be given. Further, it is also
required to give the starting line where the spectral data begin.

Once the required info is entered, clicking on Import will use those data to validate the
performance of the selected model. The goodness-of-fit indicators as described in
Section 6.1 (Figure 8-12, left) and a 1:1 scatter plot will be provided (Figure 8-12, right).
Note that in case multiple models for different variables are selected, the table will list
the statistical indicators for each variable. Also for each variable a 1:1 scatter plot will be
displayed.

Figure 8-11. Input window to load external User data as prepared in a text file. The variables of
the selected models are given in the 'Variable' drop-down menu.

Figure 8-12. Goodness-of-fit statistics (left) and 1:1 measured vs estimated scatterplot
(right).

9 Help
In Help, the Manual (this document) (Help→ User’s manual) and the Installation Guide
(Help→ Installation guide) can be consulted. Also a Disclaimer note is included (Help→
Disclaimer).

9.1 Show log

From v.1.05 onwards it is possible to keep track of all principal executed steps by
activating the Log window. The toolbox will then enable a text window that can be pulled
down (Figure 9-1). When executing steps, the most important informative parts will be
tracked and written to the log window (Figure 9-2). In this way the user can track back
what, e.g., was the input file or LUT project used, the selected input variables, sensor, etc.

Figure 9-1. Activated log window when clicking in Help to Show Log.

Figure 9-2. Log info with information of most important steps.

10 Error reporting
While much effort has gone into developing a bug-free toolbox, there may be situations
in which errors still occur. Errors appear as red messages in the Matlab™ main window.
Please report any bugs to artmo.toolbox@gmail.com and we will try to resolve them.

10.1 Dealing with memory problems

Using a large LUT to test different retrieval strategies (see Chapter 6, Validation) or to
map biophysical parameters based on an image (see Section 7.1, Retrieval: Image) may
sometimes cause memory problems. Typically, a too large amount of data is requested
to be processed by Matlab, possibly in combination with MySQL. This may lead to the
following error messages (Figure 10-1):

Figure 10-1. Error messages with warnings of insufficient memory.

To overcome these memory problems, you may:

- set a higher value for the "max_allowed_packet" variable in MySQL, and/or
- increase the Java heap size in MATLAB.
To change the value of the "max_allowed_packet" variable on Windows you can edit the
"my.ini" file using Notepad. To find the my.ini file on your computer, right-click on the
MySQL Command Line Client shortcut and select Properties. In the field Target you can
see something like: "C:\Program Files\MySQL\MySQL Server 5.6\bin\mysql.exe" "--
defaults-file=C:\ProgramData\MySQL\MySQL Server 5.6\my.ini" "-uroot" "-p".
Open the "my.ini" file in a text editor (best opened as administrator) and find the line
max_allowed_packet (Figure 10-2). Then set the value as appropriate for you (for example
type "1024M", which is 1024 megabytes). This will make the change permanent (it will not
change back to its default value after restarting the MySQL server). Once the ini file is
saved you will have to restart your computer. For more details see this site:
http://stackoverflow.com/questions/8062496/how-to-change-max-allowed-packet-size
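For reference, the relevant entry in my.ini would look similar to the following sketch (the
section name and surrounding entries depend on your MySQL installation):

[mysqld]
# Allow larger packets between Matlab/ARTMO and the MySQL server
max_allowed_packet=1024M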

Figure 10-2. Finding the MySQL my.ini file and editing the "max_allowed_packet" variable.

Secondly, for MATLAB 7.10 (R2010a) onwards, you can also set the Java heap size in
the preferences dialog box (File > Preferences) under the "Java Heap Memory" section of
the "General" tab.
For instructions on how to do this in older MATLAB versions see this site:
http://www.mathworks.com/matlabcentral/answers/92813-how-do-i-increase-the-heap-
space-for-the-java-vm-in-matlab-6-0-r12-and-later-versions

10.2 Error in case of unauthorized writing of temporary files

The MLRA toolbox makes use of writing away temporary files. However, it may happen
that no permission is given to write in the default folder within the ARTMO package, e.g.
when it is stored in the C: drive. An error similar to the following may then occur:
Error using execjp1
com.mysql.jdbc.JDBC4PreparedStatement@a3b0742: select mmodel into dumpfile
'C:/Users/cathe/AppData/Roaming/artmo/temp/dummyvar_20161205204021112'
from test_mla_result where id_t9=3
Java exception occurred:
java.sql.SQLException: The MySQL server is running with the --secure-file-priv
option so it cannot execute this
or:
Java exception occurred:
java.sql.SQLException: Can't create/write to file 'C:

To resolve this it is required to create a folder with writing permissions where the temporary
files can be written away, e.g. in 'Documents'. For instance, create the folder 'ARTMO'.

Then, point to that folder in the Settings menu:

File→ Settings

The following window will appear (Figure 10-3):

Figure 10-3. Settings menu where in ‘Local’ you can point to the newly created writable folder.

In 'Local', point to that newly created empty folder. You can either copy and paste the
path, or click on 'Get' and then select the right folder. In this way, temporary files will be
stored in that folder.

