3-D Deep Learning Approach For Remote Sensing Image Classification


4420 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 56, NO. 8, AUGUST 2018

3-D Deep Learning Approach for Remote Sensing Image Classification
Amina Ben Hamida, Alexandre Benoit, Patrick Lambert, and Chokri Ben Amar, Senior Member, IEEE

Abstract— Recently, a variety of approaches have been enriching the field of remote sensing (RS) image processing and analysis. Unfortunately, existing methods remain limited when faced with the rich spatiospectral content of today’s large data sets. It would seem intriguing to resort to deep learning (DL)-based approaches at this stage with regard to their ability to offer accurate semantic interpretation of the data. However, the specificity introduced by the coexistence of spectral and spatial content in the RS data sets widens the scope of the challenges presented to adapt DL methods to these contexts. Therefore, the aim of this paper is first to explore the performance of DL architectures for the RS hyperspectral data set classification and second to introduce a new 3-D DL approach that enables a joint spectral and spatial information process. A set of 3-D schemes is proposed and evaluated. Experimental results based on well-known hyperspectral data sets demonstrate that the proposed method is able to achieve a better classification rate than state-of-the-art methods with lower computational costs.

Index Terms— Classification, deep learning (DL), hyperspectral, pixel-based, remote sensing (RS).

I. INTRODUCTION

TODAY, remote sensing (RS) plays a fundamental role in providing a rich source of information for a variety of applications. It is now a major means for advanced and numerous purposes, such as long-term climate studies, population evolution analysis, and sometimes even the early prevention of calamities. In fact, RS has opened doors for not only a deeper understanding of the Earth itself but also for delicate investigations into its population and environmental behaviors. In reality, such advances are mainly boosted by the collaboration of serious industrial and academic research. The impressive breakthroughs witnessed on a technical level, acquisition tools, and new open data models have dramatically impacted RS data products. This fast progress has mainly led to the creation of not only overwhelming quantities of data sets but also of very rich spatial and spectral content. Faced with harder and more complex images, the renovation of the classically used approaches has been needed. In fact, the basic philosophy behind the first RS classification methods relied on the so-called “shallow structures.” Different techniques have been used, including artificial neural networks [1], classification trees [2], and support vector machines [3]. The bag of visual words was introduced as a baseline for recent RS image classification [4], since it enables a better understanding of the data content and the interpixel dependencies. Although these tools were, until recently, highly ranked in the classification field, they are now incapable of coping with the abundance of today’s image content. As a response to this lack of efficient methods, serious efforts have been made, and several approaches have been set up, such as the graphic-based one detailed in [5] or some handcrafted feature tools that can effectively describe the spatial and spectral content of the images [6]. Despite high performance, these approaches remain limited because they lack genericity and adaptability to different contexts and have an unavoidable need for expert knowledge in the parameter setup phase. Therefore, there is an urgent need for more convenient analysis methods and approaches that allow hierarchical comprehension of the data and thorough learning of its content.

Certainly, when dealing with tricky learning tasks, it is currently almost impossible not to acknowledge the achievements of deep learning (DL). In fact, since the impressive comeback of neural networks in 2006, machine learning has become very popular, mainly thanks to the emergence of DL-based approaches and their remarkable performances. Early deep methods started with simpler tasks such as digit classification and more recently became a winning tool for complex image classification tasks in the 2012 Large Scale Visual Recognition Challenge (ILSVRC2012). Over the past several years, DL has been growing as one of the most efficient techniques for a wide range of applications and fields and has managed to overcome different challenges when dealing with Big Data issues. However, it likewise brings more precision and accuracy into smaller scale applications that mainly focus on today’s wealth of spatial and spectral content. Therefore, one of the main focuses is now pointed at the ability of DL approaches to solve RS data classification problems. Currently, an important share of RS research is devoted to investigating techniques that enable effective interpretation, analysis, and extraction of relevant knowledge.

Manuscript received May 7, 2017; revised October 16, 2017 and February 2, 2018; accepted March 4, 2018. Date of publication April 20, 2018; date of current version July 20, 2018. This work was supported by the Research and Education France Ministry. (Corresponding author: Alexandre Benoit.)
A. Ben Hamida is with the Informatics, Systems, Information and Knowledge Processing Laboratory, Université Savoie Mont Blanc, F-74000 Annecy, France, and also with Research Groups in Intelligent Machines, Ecole Nationale d’Ingénieurs de Sfax, Sfax 3038, Tunisia (e-mail: amina.ben-hamida@univ-smb.fr).
A. Benoit and P. Lambert are with the Informatics, Systems, Information and Knowledge Processing Laboratory, Université Savoie Mont Blanc, F-74000 Annecy, France (e-mail: alexandre.benoit@univ-smb.fr; patrick.lambert@univ-smb.fr).
C. Ben Amar is with Research Groups in Intelligent Machines, Ecole Nationale d’Ingénieurs de Sfax, Sfax 3038, Tunisia (e-mail: chokri.benamar@ieee.org).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TGRS.2018.2818945
0196-2892 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


In this paper, a general overview of the DL evolution phases and the currently existing methods is presented. The main challenges that are disrupting its progress are also presented along with a special focus on DL techniques used for RS image classification. Finally, a new deep network structure is proposed and examined for an RS case study on a small well-known hyperspectral data set which then draws the basic guidelines of how to deal with the RS data blast. Models and trained weights are then made available at https://github.com/AminaBh/3D_deepLearning_for_hyperspectral_images.

II. DEEP LEARNING OVER THE YEARS

Over the years, a lot of research has been dedicated to machine learning and artificial intelligence. Obviously, talking about neural networks is not a new subject as the field has been around since the 1950s [7]. Starting from the mid-1990s, machine learning has gone through several transition periods paving the way for the impressive comeback of neural networks. However, the proposed techniques relied so much on human involvement in the process for system tuning and data annotation, along with high computational power requirements, that neural nets were surpassed by other less constrained methods, such as support vector machine-based approaches. Consequently, without sufficient resources, neural network-based approaches went through a “winter period” where few advances appeared in state-of-the-art methods. Thanks to Geoffrey E. Hinton and his team, shallow structures with few layers have been replaced by deeper architectures with many more layers with the introduction of deep belief networks (DBNs) in 2006 [12]. This has revolutionized both the academic and industrial worlds. This progress has been greatly assisted by technical evolution on different levels: the world has witnessed the birth of richer annotated databases along with powerful computational tools, such as GPUs. Since then, much literature has been focused on DL methods. The main tool behind the success of DL is the introduction of more processing layers which induces more representational levels and therefore ensures progressive dissociation of the concepts contained in the data. Consequently, not only does DL enhance the learning process, these approaches have also managed to overcome the classical ones by getting rid of the previously used engineered features. Over the years, a rich repository has been established to encompass numerous deep architectures. These methods can be categorized into three main classes regarding their architectures, aims, and techniques that are detailed in Sections II-A–II-D.

A. Deep Networks for Unsupervised Learning: Generative Approaches

The absence of target labels or knowledge information during the learning phase limits the application to a feature identification process. Most of today’s available data are unlabeled data sets, which raises the question: to what extent can we learn meaningful representations? In this case, the lower level abstractions are more tightly related to simple features, and higher level abstractions are dedicated to high semantic level concepts and objects. Several approaches belong to this category, notably autoencoders [8]. Basically, they can perform image hierarchical learning through two types of modules: a first set of data encoding layers followed by a decoding set of layers that tries to reconstruct the input. Another proposed architecture for unsupervised learning is the restricted Boltzmann machine (RBM) [9]. As first introduced, this approach relied on connected neuronlike units that make stochastic decisions about whether to be ON or OFF. Therefore, it can be seen as a neural network model [10]. Although the main concept behind it seems very tempting for image processing applications, its execution was difficult and time-consuming. As a remedy for this problem, restrictions were added to the network topology forbidding connections between the variables within the same layer, leading to the one-layer RBMs [11] as well as their deeper version, the deep Boltzmann machine. DBNs were then proposed in [12] and rely on a clever combination of RBMs along with a classifier.

B. Deep Networks for Supervised Learning: Discriminative Approaches

Target labels are expected to help learning and data classification, whether they are present in a direct or indirect form. This category of approaches is intended to accomplish pattern classification tasks, often by characterizing the posterior distributions of classes presented in the data. They are also called discriminating deep networks. Deep stacking networks (DSNs), convolutional neural networks (CNNs) [13], and recurrent neural networks (RNNs) [14] are the main architectures used for supervised tasks. As first introduced, the main idea behind the DSN design derives from the concept of arranging series of simple classification modules, as proposed and explored in [15] and [16]. At an early stage, classifiers are set to then be stacked on top of each other ensuring the learning of complex concepts. The DSN architecture was originally presented in [17]. Although different varieties of DSNs were created in order to diminish the computational cost of the process, these architectures remain very expensive, which challenges their users and deflects interest toward other sets of approaches. One of the main alternatives is the CNN, as first presented in [13]. Its convolutional (Conv) layers can be viewed as a series of trainable filters that slide all over the input’s dimensions (width, height, and even depth). Since these layers share weights and thus use significantly fewer of them than classical fully connected (FC) neuron-based networks, the stacking of the layers on top of each other allows a gradual increase in the data representation semantic level without exploding the computational cost. Usually, a series of FC layers followed by a classifier is inserted at the end of each phase. This basically finalizes the representation learning process by modeling the target concepts from a composition of their already high semantic level input features. The CNNs benefit from the bonus of introducing the subsampling property which guarantees a decrease in feature map resolutions, thus reducing the computing costs while enforcing robustness against translations. In fact, discriminative approaches generally follow a similar philosophy that aims at first representing input images as high-resolution low-level features representations,


starting by oriented contours, that are gradually subsampled and composed into more complex and more numerous patterns while going through the network architecture. Depending on the task, a final layer may be used to format output to the required type. Luckily, all these basic tools have allowed the creation of a rich benchmark of robust designs for different CNN networks. These architectures have been found highly effective and been commonly used in computer vision and image recognition [18], [19].

C. Hybrid Deep Networks

The term hybrid for this third category refers to the deep architecture that either comprises or makes use of both generative and discriminative model components. These architectures often operate in a multistage learning process, where the generators are trained using a specific strategy. The recent generative adversarial networks [20] highlight the interest of simultaneously training the two components where a generator continuously improves and tries to fool a discriminator which continuously tries to differentiate real and fake generated data. Recently, it was shown in [21] that deep hybrid architectures, or multilevel models that integrate discriminative and generative learning objectives, offer a strong viable alternative to multistage learners.

D. Evolution of Different CNNs

As one of the most successful deep architectures, CNNs have progressively found their way into today’s applications. In less than 20 years, we moved from the five-layered LeNet-5 architecture [13] dedicated to digit recognition to the more advanced residual network [22] variants that can include hundreds of layers and are now able to recognize thousands of visual concepts. Actually, the main architecture that effectively familiarized the world with CNNs is the eight-layer AlexNet. As detailed in [23], this technique has the advantage of being deeper and more expressive than the other approaches by stacking a series of convolution layers. The AlexNet architecture was a clear winner at ILSVRC2012, and since then, this challenge has been systematically won by CNNs every year, always improving performance levels by improving depth, width, and processing path strategies while reducing the number of parameters [24], [25]. Then, as the world has evolved toward more sequence-dependent applications (video, text, and so on), RNNs, whose output depends on the input and the previous iteration states, complement CNN architectures. The long short-term memory cells [26] are a flexible example of such a family that enables long and short range interactions by the use of trainable state gates. Such tools enable impressive results on various application use cases, such as image captioning [27] and video semantic segmentation [28].

III. DEEP LEARNING FOR REMOTE SENSING IMAGE CLASSIFICATION

The content of satellite images with high resolution in both space and frequencies is remarkably complex, providing details of objects, such as houses, trees, or even cars on the parking lots. In order to be able to fully describe the content of these images, a deep hierarchical representation is highly recommended. The lowest level is represented by primitive vectors describing the color, texture, and shape. At a higher level, simple objects, such as roads, forests, or lakes, are described by unique combinations of primitive vectors. However, individually considered, these objects cannot describe the scene or grasp the overall meaning of the image as they give different interpretations according to their neighboring, as well as their spatial and spectral positions. Therefore, in order to extract meaningful information from the image, the spatial interactions at the next level of the representation hierarchy must be taken into account. These models lead to the discovery of semantic rules that define the final level of abstraction: high-level semantic classes, such as residential districts, commercial areas, or ports. This extremely rich repository of information has created the need for DL as a key solution.

A. State-of-the-Art DL for RS

It is often possible to resort to ideas that come from the multimedia field to proceed with RS data sets. Current multimedia-inspired DL models have managed to provide a baseline for the use of DL in RS key applications. Recent challenges focus on semantic segmentation and ensure high accuracy rates in the case of a traditional three spectral band task [29]. However, the creation of more complex RS data has catalyzed more research into a better understanding of data with rich spectral content, such as hyperspectral and multispectral images. As detailed in [30], first trials only relied on the spectral information presented in the data itself. For instance, the approaches as presented in [23], [31], and [32] suffered a lack of spatial information and therefore probably disregarded a very important element of the image content. The same problem was seen when only processing the spatial content of the data as detailed in [33]–[36]. Therefore, more complex yet effective solutions have been presented in different forms and models to take into account both spectral and spatial components, guaranteeing a maximum profit from the insights and information contained in the images. Early solutions resorted to a marginal processing of spectral and spatial information as presented in [37]. In this case, the spectral information is processed apart from the spatial component which is extracted later to be joined for feature extraction based on deep architectures, such as stacked autoencoders (SAEs). Neural network classifiers are then implemented in the final layer. Autoencoders are also the basic concept behind the model introduced in [38], presenting a spatiospectral framework that merges spectral information from adjacent pixels to add spatial information to the processed pixel. Hidden layers are then inserted in order to learn the spectral features, and a supervised learning is ensured by an output softmax layer. Incorporating both spatial and spectral information improves the classification performances as reported in all the above-mentioned methods. However, the use of the SAE or more generally using large layers of FC neurons explodes the number of parameters


to train and demands a large number of training samples. In the case of data sets with few annotations, training systems with a large number of parameters is not tractable or leads to overfitted and suboptimal solutions. In the specific case of hyperspectral image analysis, DL methods have recently enabled significant progress. More specifically, 3-D (i.e., one spectral dimension plus two spatial ones) CNNs were introduced as a solution to obtain an accurate and computationally efficient architecture. As presented in [39], a 3-D-like approach starts the process with a randomized PCA applied on the spectral dimension of the image. Therefore, the inputs for the first convolutional (Conv) layer (C1) are 3-D patches of size s × s × Cr where s is the width and height of the spatial patch while Cr is the number of the retained principal spectral components. A second Conv layer is applied to the output of C1. Finally, a C2 element vector is produced and fed as input to a multilayer perceptron classifier. However, with this approach, each retained spectral dimension is processed independently with standard 2-D convolutional filters. Another approach is developed in [40], where a 3-D convolution is literally deployed by the first layer followed by two 1-D Convs and ending with two FC layers. This approach is close to the one presented in [41] but adds spatial information. However, it still introduces a large number of parameters (60 000) that need to be trained with only 1800 samples. The main concern in most of the above-presented cases is then how to deal with the high number of parameters to be trained with few samples while improving image analysis. Recent architectures, such as [42], which maximizes parameter reuse, significantly reduce this ratio.

B. CNNs for RS Image Classification

In view of the wealth of recent RS image content, it is crucial to find deep architectures that maintain the balance between efficiently processing huge amounts of data and not exploding the computational costs while also providing high accuracy. The employment of deep classification techniques for the RS field would seem to be a promising path of applications. However, further investigation reveals different challenges that must be overcome to reach accurate low-cost data interpretation. The main challenge is to efficiently adapt deep architecture to take into account not only the spatial dimension of hyperspectral images but also their rich spectral content. Out of all the current DL networks, CNNs are one of the best available tools for machine vision. These models have helped DL become one of today’s hottest topics. Thanks to the variety of layers that one CNN can encompass, these networks provide an efficient tool for data comprehension and representation. The fact that they can be fitted to different applications and are relatively low-cost architectures for tremendous tasks makes them one of the most extensively used DL approaches. In this paper’s use cases, CNNs are a primary choice that can be fitted to RS classification tasks. Basically, these architectures must perform well without overfitting or undertraining the system. However, the evolution toward effective CNNs in such cases has been diminished by the following challenges.

1) High-Dimensional Data: When dealing with high-dimensional data, DL approaches become computationally expensive. These high costs are mainly due to the slow learning process that is needed to learn the data abstractions and establish an effective representation from low levels to the highest semantic interpretations. In fact, a high-dimensional data source contributes greatly to the volume of the raw data, as well as complicating learning from this data. The most effective solution so far proposed is the use of CNNs, since the neurons in the hidden layer units do not need to be connected to all of the nodes in the previous layer, but to a more localized receptive field. Moreover, the resolution of the image data is also reduced when moving toward higher layers in the network as depicted in [43] and [44] thanks to pooling layers. However, this problem remains very challenging and leads to more complex and harder issues.

2) Large Heavy Models: DL models have so far accomplished remarkable results relying on deep and wide models. Therefore, large numbers of parameters are required to learn complicated features and representations from the data itself as explained in [45]. These heavy models are hard to train, costly to fit, and complicated to establish. Moreover, such heavy models are greedy in terms of labeled data. This requirement is hard to meet, since the field is suffering from a serious lack of rich hyperspectral and multispectral annotated data.

3) Architecture Optimization: The key point in favor of using DL today is its ability to cope with a wealth of applications. However, this results in hardening and complicating the tasks of establishing deep models that are inexpensive and effective in processing data. Obviously, regarding the variety of fields that today’s community is involved in, one can notice an urgent need to optimize deep architectures in terms of computational costs, accurate results, and required training information. Recently, various strategies have been proposed to optimize pretrained architectures in terms of inference speed and memory footprint reduction by layer factorization and weights pruning [46]–[48]. However, such posttraining optimization strategies do not remove the need for architecture design optimization prior to training, since large architectures cannot be reliably trained on small data sets. Therefore, several careful choices must be made at early stages: selecting one specific type of deep network, choosing whether to fine-tune a pretrained architecture, or starting from scratch.

The CNN structure introduces an accurate solution for most of the previously presented challenges. However, more recent studies that tend to enhance the use of deeper architectures have also been established. Residual modules as presented in residual networks [22] have made it possible to design extremely deep networks with more than a thousand layers. More recently, dense networks were presented in [42] that emphasize the interest of denser connections across layers. The main purpose of such methods is generally to address complex


problems with high variability targets at multiple scales. Recent works tried to adapt these concepts to hyperspectral data use cases as detailed in [49]. The aim of this paper is to prove the efficiency of noncomplex architectures.

IV. 3-D DEEP LEARNING ARCHITECTURE FOR RS IMAGE SEMANTIC SEGMENTATION

As a solution for the challenges presented in the previous sections, we introduce a new 3-D-based architecture that is dedicated to hyperspectral images and tackles most of the DL for RS aspects of difficulty.

A. General Overview of the Architecture

A joint spatiospectral model is needed to examine both spectral and spatial information in hyperspectral data. The advantage of such a framework is that both components are merged and joined in a nonseparable way from the early stages of the process. This solution makes maximum use of the information presented in the data and radically lowers costs. This paper proposes to use a new 3-D CNN architecture that, unlike the previously mentioned approaches, simultaneously processes the spatial and spectral components with real 3-D convolutions, making better use of the few samples available with fewer trainable parameters. This proposal decomposes the problem as the processing of a series of volumetric representations of the image. Therefore, each pixel is associated with an n × n spatial neighborhood and a number of f spectral bands. As a result, each pixel is treated as an n × n × f volume. The main concept behind this architecture is to combine the traditional CNN network with a twist of applying 3-D convolution operations instead of using 1-D convolution operators that only inspect the spectral content of the data. An overview of the 3-D architecture is presented in Fig. 1.

Different blocks of CNN layers are stacked on top of each other in order to ensure deep efficient representations of the image. First, a 3-D convolution-based set of layers is introduced in order to cope with the 3-D input voxels. Each and every one of these layers encompasses a number of volumetric kernels that simultaneously execute convolutions on the width, height, and depth axes of the input. This stack of 3-D convolutions is followed by a set of N_1D 1 × 1 convolution (1-D) layers that discard the spatial neighborhood and a series of N_FC fully connected layers. Basically, the proposed architecture considers 3-D voxels as input data and first generates 3-D feature maps that are gradually reduced into 1-D feature vectors all along the layers. This procedure is ensured by the choice of specific configurations of the convolution filter strides and paddings following (1), where the stride is the distance between two consecutive positions of the kernel expressed as a number of pixels. The padding is used to manage boundary effects and is basically employed as a number of zeros concatenated at the beginning and at the end of an axis. Using padding makes the convolution output the same size as the input, while no padding reduces the output data shape

$$\mathrm{SizeOut} = \frac{\mathrm{SizeIn} - \mathrm{KernelSize} + 2 \times \mathrm{pad}}{\mathrm{stride}} + 1. \tag{1}$$

As previously said, the input is fed to the network as a 3-D volume (voxel) of size n × n × f. The first phase consists of using a series of N_3D 3-D convolutional layers. Each layer i is characterized by a number of k_i filters. The kernels of the filters are of size (m_i × m_i × fl_i) where m_i ≤ n and fl_i ≤ f. In this case, convolution layers are deployed for two purposes. First, they are introduced as conventional spatiospectral convolution layers with a stride equal to 1. Then, they play the role of pooling layers thanks to the choice of larger strides to downsample the data as suggested in [50]. The duality between Conv and Pool layers in a sequential way progressively learns and reduces the data components’ dimensions. The fl_i > m_i rule along with the removal of padding on the spatial dimension for some layers makes the transition toward the creation of a first 1-D output vector. This output is then fed to a series of N_1D 1-D Conv layers that each encompasses p_i filters. In the end, the network introduces a set of N_FC FC layers that ends with a Softmax classifier where the softmax activation of the ith output unit is detailed in (2). Since the final layer’s size is chosen to be equal to the number of targeted classes (nclass), this FC guarantees a probabilistic representation for the different classes

$$P(x_i) = \frac{e^{x_i}}{\sum_{c}^{\mathrm{nclass}} e^{x_c}} \tag{2}$$

where x_i denotes the 1-D output vector of the final Conv layer.

An example of a 3-D Conv layer network is illustrated in Fig. 2 where f = 103, n = 3, fl_i = 3, and m_i can take 3 or 1 as values. Padding is used on the spectral scale only. The strides are alternated between one and two in order to create a pooling effect after every convolutional operation.

So, as a summary and as detailed in Algorithm 1, the proposed process starts by dividing the hyperspectral image into

Fig. 1. Overview of the proposed 3-D deep architecture.

Fig. 2. Example illustrating the evolution of feature shapes (SizeOut) of each layer [output size is obtained according to (1)].
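As a rough numerical check of (1), the sketch below computes the output size along each axis and traces the feature shape through a small stack of 3-D convolutions. It uses the values quoted for Fig. 2 (n = 3, f = 103, spectral kernel size 3, spatial kernel size 3 or 1, padding on the spectral axis only, strides alternating between one and two). The exact layer list, the helper names `conv_output_size` and `trace_shapes`, and the rounding down of the division are assumptions made for illustration; they are not taken from the paper or its released models.

```python
from typing import List, Sequence, Tuple

Axis3 = Tuple[int, int, int]  # (height, width, spectral)


def conv_output_size(size_in: int, kernel_size: int, pad: int = 0, stride: int = 1) -> int:
    """Output length along one axis, following (1).

    Assumption: the division is rounded down, which is the usual convention
    when the stride does not divide the numerator exactly.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    if size_in + 2 * pad < kernel_size:
        raise ValueError("kernel does not fit in the padded input")
    return (size_in - kernel_size + 2 * pad) // stride + 1


def trace_shapes(input_shape: Axis3,
                 layers: Sequence[Tuple[Axis3, Axis3, Axis3]]) -> List[Axis3]:
    """Apply (1) axis by axis for each (kernel, pad, stride) layer description."""
    shapes = [input_shape]
    for kernel, pad, stride in layers:
        previous = shapes[-1]
        shapes.append(tuple(
            conv_output_size(s, k, p, st)
            for s, k, p, st in zip(previous, kernel, pad, stride)
        ))
    return shapes


# Hypothetical stack: unpadded 3 x 3 spatial kernel first, then 1 x 1 spatial
# kernels; spectral padding of 1 everywhere; spectral stride alternates 1, 2.
HYPOTHETICAL_LAYERS = [
    ((3, 3, 3), (0, 0, 1), (1, 1, 1)),
    ((1, 1, 3), (0, 0, 1), (1, 1, 2)),
    ((1, 1, 3), (0, 0, 1), (1, 1, 1)),
    ((1, 1, 3), (0, 0, 1), (1, 1, 2)),
]

if __name__ == "__main__":
    for shape in trace_shapes((3, 3, 103), HYPOTHETICAL_LAYERS):
        print(shape)
```

With this hypothetical stack, the unpadded 3 × 3 spatial kernel collapses the neighborhood to 1 × 1 in the first layer, and each stride-2 layer roughly halves the spectral axis, which is the pooling effect described in the text.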


Algorithm 1: Classification of Hyperspectral Images at the Pixel Level. Each Pixel Is Processed With Respect to Its Neighborhood Only as a Voxel of Shape n × n × f
Input:  S: number of samples per batch;
        epochIT_train: number of iterations required to parse all the train data set with S samples per batch (one epoch);
        epochIT_test: number of iterations required to parse all the test data set with T samples per batch (one epoch);
        MaxEpoch: maximum number of epochs to train the network;
        X_train: input training voxels;
        X_test: input testing voxels
Output: mAccuracy: the average accuracy on the test data set;
        all trained weights
for epoch in range MaxEpoch do
    Training: update the network weights from each voxel of X_train
    for it in range epochIT_train do
        1) Randomly sample S voxels from X_train not already considered in the current epoch;
        2) Forward propagation of the S input voxels through the network in order to calculate the average loss;
        3) Backward propagation to update network weights with respect to the average loss value;
    end
    Testing: evaluate accuracy for each voxel from X_test
    for it in range epochIT_test do
        1) Sample T voxels from X_test not already considered in the current epoch;
        2) Run a forward propagation on the batch samples;
        3) Retrieve and accumulate the obtained accuracy values
    end
    compute mAccuracy, the average accuracy values on the entire test data set obtained in the current epoch.
end

two different sets of pixels: a training set of pixels and a testing one. In fact, each pixel is taken into account as a 3-D n × n × f voxel in the paper’s context. A training period is considered to parse one time all the pixels of the training data set (one epoch) to learn the network weights. A test period follows that evaluates the performance level reached on the whole test data set. Those two periods are applied MaxEpoch times to get the final result. In this way, the evolution of the network performance is monitored all along the training.

B. Main Common Parameters
When establishing a DL architecture, the most crucial phase is to make wise choices for the different parameters to be set. Although different models can be obtained from one basic DL architecture, one can easily notice that there are common parameters that can be set from the early stages of the network creation.

1) Solver Method: It plays a major role in improving loss during the learning process and contributes in both the forward and backward propagation phases and therefore ensures the update of the network’s parameters. Different solver methods have been introduced so far, such as the Stochastic Gradient Descent (type: “SGD”), Adam (type: “Adam”), Nesterov’s Accelerated Gradient (type: “Nesterov”), and RMSprop (type: “RMSProp”) as detailed in [51] and [52]. The SGD with momentum set to 0.9 was selected for this use case. Several tests with different solver types were the reason behind this choice which appears to be a simple and robust method.
2) Weights Regularization Method: It is basically used to introduce lateral regularization to the network. The main types of normalization methods are L1 and L2 weights regularization; they both provide the choice to either normalize within the same channel or even normalize across channels. L1 normalization proved to be more efficient in the paper’s context. One recent regularization technique is the dropout method [53] which randomly disables some inputs thus reducing neuron dependencies. This also complements L1 and L2 regularization by preventing the network from overfitting. Therefore, we have combined the use of both L1 and a 0.5 probability dropout on the FC layer only.
3) Nonlinearity: Most of the attention today is dedicated to the use of ReLU nonlinearities, since they enable faster training convergence. Recent nonlinearities, such as ELU [54], have also been tested but did not lead to improved results.
4) Weight Initialization: It is a crucial preprocessing phase for DL network training, since it sets the state of different parameters at the starting point. Different methods can be used including the recent Xavier and He et al. [55] initialization methods. The latter, often called MSRA, was chosen since it is adapted to the ReLU nonlinearities that we used.
5) Learning Rate: It coordinates to what extent each step of the process influences the weight updates. We consider an initial learning rate of 0.001 to explore rapidly the search space and find a good local minimum. Then, for each (MaxEpoch/3) iteration, the learning rate is divided by 10 in order to converge to a lower local loss value and then increase accuracy (MaxEpoch is the number of epochs needed to train the network).
6) Batch Size: Instead of optimizing the network from a single sample at a time, which can lead to suboptimal solutions, averaging errors over a set of samples was proved to be more efficient. Therefore, the batch size influences the loss convergence efficacy and in turn the system’s update phase, which makes it a very critical parameter and basically data-dependent. Values ranging from 10 to 1 were tested on the small considered data set, and the optimal choice is equal to 3.
7) Bias or Batch Normalization: Batch normalization has recently been proposed to normalize neuron activation across layers and replace neuron bias variables. Despite being efficient, batch normalization requires consistent
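The epoch structure of Algorithm 1 and the step learning-rate rule of Section IV-B can be sketched as follows. This is a minimal sketch under stated assumptions: the names `step_learning_rate` and `epoch_batches` are hypothetical, the phrase "each (MaxEpoch/3) iteration" is read here as every MaxEpoch/3 epochs, and the forward and backward passes themselves are left out.

```python
import random


def step_learning_rate(epoch, max_epoch, base_lr=0.001, factor=10.0):
    """Learning rate for a 0-based epoch: base_lr divided by `factor`
    after each third of training (assumed reading of the MaxEpoch/3 rule)."""
    step = max(1, max_epoch // 3)
    return base_lr / (factor ** (epoch // step))


def epoch_batches(num_samples, batch_size, rng):
    """Yield batches of sample indices so that every sample is drawn exactly
    once per epoch (step 1 of Algorithm 1: no voxel is reused within an epoch)."""
    order = list(range(num_samples))
    rng.shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


rng = random.Random(0)
# 10 hypothetical training voxels with a batch size of 3, the value retained in the paper.
batches = list(epoch_batches(10, 3, rng))
# Learning rate per epoch for a hypothetical run of 9 epochs.
rates = [step_learning_rate(e, max_epoch=9) for e in range(9)]
```

With 9 epochs, the rate stays at 0.001 for the first three epochs, then drops to 0.0001 and finally to 0.00001. The last batch of an epoch may be smaller than S when the number of voxels is not a multiple of the batch size.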


Fig. 3. Captured images. (a) Pavia University. (b) Pavia Center. (c) KSC.

and stable batch statistics in the training and test database. This is often helped by the use of a large batch size. This constraint is difficult to fulfill on tiny databases so each time we restricted our experiments to neuron layers equipped with bias parameters initialized to 0 at the beginning of the training.

V. EXPERIMENTS AND ANALYSIS
In this section, the experiments conducted on RS images using DL architectures are presented and compared with other state-of-the-art approaches.

A. Data Sets
The results are obtained from experiments applied on the Pavia University, Pavia Center, and Kennedy Space Center (KSC) data sets, shown in RGB colors in Fig. 3. The two Pavia scenes were collected by the ROSIS sensor and the KSC scene by the AVIRIS sensor. The first employed data were a capture of an area over Pavia University, northern Italy, with a spatial resolution of 1.3 m. The image comprises 610 × 340 pixels with 103 bands. Next, the Pavia Center data set is a 102-band data set that presents one image of size 1096 × 1096 pixels and of 1.3-m geometric resolution. Finally, the KSC data set was acquired over the KSC, FL, USA, on March 23, 1996. This image has 224 bands from 400 to 2500 nm and the spatial resolution is 18 m. After removing water absorption and low signal-to-noise bands, it has 512 × 453 pixels with 176 bands. The first two data sets include some challenging scenes among the nine classes which are, respectively, asphalt, meadows, gravel, trees, painted metal sheets, bare soil, bitumen, self-blocking bricks, and shadows for Pavia University and water, trees, asphalt, self-blocking bricks, bitumen, tiles, shadows, meadows, and bare soil for the Pavia Center data set. The KSC data set includes 13 classes which are scrub, willow swamp, cabbage palm hammock, cabbage palm/oak hammock, slash pine, oak/broadleaf hammock, hardwood swamp, graminoid marsh, spartina marsh, cattail marsh, salt marsh, mud flats, and water.

B. Different 3-D Architectures
Extensive sets of experiments were conducted. The main approaches that best summarize the performance level of the 3-D architectures are reported in the following. The best performing models and trained weights are then made available at https://github.com/AminaBh/3D_deepLearning_for_hyperspectral_images. The strategy is to start from a simple state-of-the-art architecture and then gradually extend it by understanding its bottlenecks. Fig. 4 presents an overall view of the considered deep networks where the Conv layers with a stride equal to 1 are referred to as Conv and those with a stride equal to 2 as ConvPool. The number of filters per layer is introduced as [number of filters]. So, in Fig. 4, each layer is presented as follows: “LayerName[number of filters].” The a–e schemes, respectively, represent the 3-, 4-, 6-, 8-, and 10-layer networks as detailed in the following.

1) Three Layers Versus Four Layers (a Versus b): As first inspired from the network in [41], the presented architecture created a three-layer 3-D network that gathers two 3-D layers, one single 1-D layer along with two FC layers and a Softmax. However, the large number of neurons included in the first FC layer increases the number of parameters without really improving the accuracy rates. This layer is basically dedicated to increasing the capacity of the final classifier while taking care of the spatial arrangement in the feature maps. However, the increased number of parameters conflicts with the low number of training samples. Therefore, in a new four-layer network, only one FC, with the number of neurons equal to the number of targeted classes, is left and a new Conv layer with a stride equal to two is added to play the role of a pooling layer. The network is then able to keep up with the same performance level with a lower training cost.
2) Six-Layer Network (c): The noticeable drop in the number of trained parameters in the four-layer network provided an interesting opportunity to develop deeper models with more Conv layers. Therefore, one six-layer 3-D architecture was created relying on two 3-D Conv layers followed by four 1-D layers, where the duality Conv/ConvPool is sequentially applied to the network. Finally, one FC layer with a Softmax was kept. The number of filters for each layer was set to 35 except for the first Conv that only gathered 20 filters. In fact, a wide spectrum of filter numbers with values ranging from 5 to 50 filters per layer was tested. The optimal combination is then presented in this paper. This choice goes with the standard state-of-the-art tendency that shows that fewer filters are required at the beginning of the architecture.
3) Squeezing the Net Toward Deeper Architectures (d and e): Since different deep models can guarantee the same accuracy level, the choice among them is then based on the cost and number of parameters each network can take. Therefore, as inspired from the SqueezeNet presented in [56], the trials to compress the model led to the creation of lighter models. Thanks to the use of smaller numbers of filters along with 1 × 1 × 3 filters in the pooling phase, the network can reach the same accuracy levels with a smaller number of parameters. Indeed, 1 × 1 Conv layers with a number of neurons lower than the one in the previous standard convolutional layer (two neurons


Fig. 4. Overview of our 3-D architectures.
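The parameter saving obtained by squeezing the net (Section V-B) can be checked with simple arithmetic. The sketch below counts the weights and biases of a convolution layer as kernel volume × input channels × output filters + output filters. The helper `conv_params` and the 3 × 3 × 3 kernel of the layer that follows are assumptions chosen for illustration; only the filter counts (35 for Conv, 2 for ConvPool) and the 1 × 1 × 3 pooling kernel come from the text, and the totals are not the parameter counts reported in the tables.

```python
def conv_params(kernel_shape, in_channels, out_filters, bias=True):
    """Number of trainable parameters of one convolution layer."""
    volume = 1
    for k in kernel_shape:
        volume *= k
    return volume * in_channels * out_filters + (out_filters if bias else 0)


# Case 1: a 35-filter layer feeds a 35-filter 3 x 3 x 3 Conv directly (assumed kernel).
direct = conv_params((3, 3, 3), in_channels=35, out_filters=35)

# Case 2: a 2-filter 1 x 1 x 3 ConvPool is placed in between, so the next
# 35-filter Conv only sees 2 input channels.
squeeze = conv_params((1, 1, 3), in_channels=35, out_filters=2)
after_squeeze = conv_params((3, 3, 3), in_channels=2, out_filters=35)
squeezed_total = squeeze + after_squeeze
```

Under these assumptions the direct connection costs 33 110 parameters, whereas the squeezed pair costs 2137, which illustrates why narrow pooling layers leave room for deeper networks at a similar or lower parameter budget.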

for ConvPool versus 35 neurons for Conv) allow a significant reduction of the number of parameters. As a result, deeper architectures were created for a better learning.
4) Eight- and Ten-Layer Networks (d and e): A deeper light architecture has been created with fewer parameters. An eight-layer 3-D network was created first introducing the duality between three 3-D Conv layers and three 1-D ConvPool layers, followed by two 1-D layers and a single FC layer with a softmax. The same goes for a 10-layer network with the advantage of adding one more sequence of 3-D layers along with pooling ones.
5) Networks Width Versus Depth: Talking about the depth of the deep networks is a very rich debate that generates a lot of questions. However, it has been recently proved [45] that one of the keys for better performances is to find the right balance between the network’s depth and width. In other words, fixing the number of layers per network is a very crucial step that has a huge impact on the efficiency of the architecture. But, knowing that the network has a variable number of filters per layer raises the question of which number to choose. Therefore, estimating the appropriate number of filters for each layer (the width of the network) according to its depth is an important decision to be taken in order to harmonize the cost and accuracy of a deep network. In this paper, the first layer is characterized with 20 filters, whereas the rest of the layers have 35 filters, except for the pooling layers that have 2, 4, or 8 filters according to their position in the network as discussed above in this section.

C. Experiments and Results
Experiments and tests were executed on the 3, 4, 6, 8, and 10 Conv layered architectures as detailed in the previous section. For the sake of an accurate comparison with state-of-the-art methods, two main data splitting strategies were used. When proceeding with the three- and four-layer networks, only 200 randomly chosen pixels per class were used for training (almost 4%), whereas the rest of the data pixels were kept for the testing phase. Then, in the case of deeper architectures, 5% of the images were deployed for training using the same class balance strategy. Each model is trained and evaluated three times using different nonoverlapping train and test random splits. Accuracy levels are averaged to report a synthetic performance measure. We measured a redundant 0.2% precision error and do not report this value within the tables to lighten the presentation. All the tests were executed on a four-core Intel i7-6600U laptop CPU with no GPU included. The presented results were obtained using the Caffe library [57].

1) Three Layers Versus Four Layers (a Versus b): First, as detailed in [4], the introduction of the 3-D architecture is inspired from [41]. Technically, two main differences distinguish the proposed architecture from the one detailed in [41]. First, trainable convolution layers with strides greater than one are used instead of the classically used max pooling filters. Then, the 100 neuron FC introduced in [41] is replaced by a 50 neuron FC. However, the accuracy difference cannot be explained only by these factors. In fact, as detailed in Table I, if the number of neurons in the FC layer is divided by 2 (50 neurons versus 100 in [41]), the parameter number decreases by almost 18% while the performance level remains stable. This shows that the initial system has actually too many degrees of freedom which misleads the image representation and limits the system efficiency. Although these primitive results do not enhance the level of the accuracy rate, they provide us with hints toward establishing better models. The first observation to be taken into account is that the choice of the spatial neighborhood is very crucial and data-dependent. Here,


TABLE I
ACCURACY LEVEL OF THE THREE-LAYER CNN MODELS (NETWORK a) USING 4.4% OF THE DATA FOR TRAINING

TABLE II
ACCURACY LEVEL OF THE FOUR-LAYER CNN MODELS (NETWORK b) USING 4.4% OF THE DATA FOR TRAINING
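The first splitting strategy of Section V-C (a fixed number of randomly chosen training pixels per class, the remaining labelled pixels kept for testing) could be implemented along these lines. The function name `split_per_class`, the seed handling, and the toy labels are assumptions for illustration; the paper uses 200 pixels per class.

```python
import random
from collections import defaultdict


def split_per_class(labels, per_class, seed=0):
    """Return (train_idx, test_idx): `per_class` random pixels of each class
    go to training, all remaining labelled pixels go to testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        pixels = by_class[label]
        rng.shuffle(pixels)
        train_idx.extend(pixels[:per_class])
        test_idx.extend(pixels[per_class:])
    return sorted(train_idx), sorted(test_idx)


# Toy example: 3 classes with 5, 4 and 6 labelled pixels, 2 training pixels per class.
toy_labels = [0] * 5 + [1] * 4 + [2] * 6
train_idx, test_idx = split_per_class(toy_labels, per_class=2)
```

Calling the function with different seeds gives different random splits, which is one way to repeat the evaluation several times as done in the paper; the training and test indices of a given split never overlap.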

the 5 × 5 input spatial neighborhood seems to be the optimum choice for the Pavia University data set, whereas the 3 × 3 one performs better in the cases of the Pavia Center and KSC data sets. However, the 7 × 7 spatial neighborhood seems to be very extended compared with the spatial components of the data sets which makes it a low performer. Furthermore, the decrease in the number of neurons in the first FC layer results in a decrease in the number of the overall trained parameters without influencing the accuracy rates.
As detailed in Table II, the removal of the first FC layer along with the introduction of the spatiospectral concept in the network enables better results compared with [41]. In fact, in this case, the data representation is more relevant when going from 3-D voxels to 1-D vectors followed by a single nc-neuron FC layer (nc = number of the data set classes) combined with a Softmax. Not only does this model benefit from better accuracy rates, but it also shows an important decrease in computational cost. Going down from over 60 000 parameters trained in [41] to 28 749 parameters, the proposed four-layer network ensures both a better learning of the data content and a lower training cost process. Here again, the results prove that the choice of the spatial neighborhood is 5 × 5 for the Pavia University data set and 3 × 3 for the Pavia Center data set. The reliance on the 3-D Conv layers for combined spatiospectral classification of the data is therefore a basic key for better results when compared with [41] that only resorts to the spectral signature of each pixel in the classification process.
2) Six-Layer Network (c): Establishing a six-layer network was inspired by the important decrease in the number of parameters in the three- to four-layer transition case. The introduction of a sequence of Conv and Pooling duets in the network gathers both the benefits of going deeper and involving fewer parameters. In other words, more Conv layers ensure higher semantic level representation of the data, whereas the Pooling ones guarantee a dimension reduction of the representation. This way, the dimension of the vectors at the entry of the FC layer is remarkably reduced thus significantly reducing the number of parameters. As shown in Table III, an important decrease in the number of parameters (from a tenth of the previous 60 000) is witnessed, along with an increase in the accuracy rate. These tests also prove that the choice of the spatial neighborhood is highly dependent on the data content. The same model can outperform the results in [5] in the case of the Pavia Center data set with a 3 × 3 neighborhood while it does not reach the state-of-the-art approach results in the case of Pavia University even when using 5 × 5 neighborhood.
3) Squeezing the Net, 8- or 10-Layer Networks (d or e): According to the results presented in Tables III, IV and V, the main key behind a successfully performing deep network is to create a balance between a deep yet light architecture. When examining the models, it can be seen that the number of layers in the network (the depth) and the number of neurons per layer (the width) manipulate the most important share of the network performance. Therefore, in Tables VI and VII, the tests were executed for different combinations of a number of neurons per layer, a number of 3-D layers per architecture, and the overall number of layers in the network. Although a wider range of values was tested


TABLE III
ACCURACY LEVEL OF THE SIX-LAYER CNN MODELS (NETWORK c) USING 5% OF THE DATA FOR TRAINING

TABLE IV
ACCURACY LEVEL OF THE EIGHT-LAYER CNN MODELS (NETWORK d) USING 5% OF THE DATA FOR TRAINING

TABLE V
ACCURACY LEVEL ON KSC DATA SET USING 5% OF THE DATA FOR TRAINING

TABLE VI
SQUEEZING THE NET ON PAVIA UNIVERSITY (NETWORKS c–e) USING 5% OF THE DATA FOR TRAINING

TABLE VII
SQUEEZING THE NET ON PAVIA CENTER (NETWORKS c–e) USING 5% OF THE DATA FOR TRAINING
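Since the size n of the spatial neighborhood is a key setting in these experiments, it may help to see how an n × n × f voxel can be cut around a pixel. The sketch below is an illustration under stated assumptions: the image is a nested list indexed as image[row][col][band], the name `extract_voxel` is hypothetical, and border pixels are handled by clamping coordinates to the image (edge replication), a choice the paper does not specify.

```python
def extract_voxel(image, row, col, n):
    """Return the n x n x f neighborhood centered on (row, col).

    `image` is indexed as image[row][col][band]; n must be odd.
    Coordinates outside the image are clamped to the nearest valid pixel.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the pixel is centered")
    height, width = len(image), len(image[0])
    half = n // 2
    voxel = []
    for r in range(row - half, row + half + 1):
        rr = min(max(r, 0), height - 1)
        line = []
        for c in range(col - half, col + half + 1):
            cc = min(max(c, 0), width - 1)
            line.append(list(image[rr][cc]))
        voxel.append(line)
    return voxel


# Toy 4 x 5 image with f = 2 bands; each pixel stores [row, col] for easy checking.
toy_image = [[[r, c] for c in range(5)] for r in range(4)]
center_voxel = extract_voxel(toy_image, 2, 2, 3)
corner_voxel = extract_voxel(toy_image, 0, 0, 3)
```

Changing n from 3 to 5 or 7 only changes the spatial extent of the voxel; the spectral depth f stays that of the image.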

for each parameter presented in the following, only the most effective ones are mentioned at this stage.
As detailed in Tables VI and VII, the decrease in the number of neurons per pooling layer enables lighter models that maintain the same accuracy rate ranges. Basically, the decrease in the width of the network provides more opportunities to create deeper models. Therefore, the six-layer network performs less than the 8- and 10-layer ones. However, the eight-layer architecture seems to be the best choice especially when relying on three 3-D layers, along with a small number of neurons in the pooling layers. Not only does it reduce the number of parameters but it also enhances the accuracy rate.
The best derived combination in our case is an eight-layer network with three 3-D blocks. The number of


Fig. 5. Pavia University data set. (a) Image, (b) ground truth, and (c) classification output of six-layer network.

Fig. 6. Confusion matrix using the eight-layered network. (a) Pavia University. (b) Pavia Center. (c) KSC.
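A confusion matrix such as those of Fig. 6 can be accumulated from the true and predicted labels of the test pixels, and the per-class accuracy then read from its diagonal. The sketch below uses hypothetical names (`confusion_matrix`, `per_class_accuracy`) and toy labels; rows are taken to be true classes and columns predicted classes, which is an assumed convention.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts the pixels of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was correctly predicted (row-wise)."""
    accuracies = []
    for cls, row in enumerate(matrix):
        total = sum(row)
        accuracies.append(row[cls] / total if total else 0.0)
    return accuracies


# Toy example with 3 classes (labels start at 0, as in Fig. 6).
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = per_class_accuracy(cm)
```

Off-diagonal entries of a row show which classes a given true class is confused with, which is how statements such as "class A is mixed up with class B" are read from the matrix.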

filters is equal to 20 for the first Conv layer and increased to 35 for the following Conv layers. However, the numbers of neurons for each Pooling layer are equal to 2, 2, 2, and 4. Here, we proceed with an input voxel of size [5,5,103]. Table IV highlights the best performing architectures in the case of eight-layer networks.

D. Architecture Selection
The introduction of different architectures proposed in Section V-B enables different performance levels. At this stage, the choice of an optimal network is made. As previously detailed, many factors interfere with the selection process of a best performing network. In fact, the progressive evolution toward a deeper, yet lighter network, draws the guideline toward an easier choice.
The first presented architectures (a and b) managed to establish a solid baseline for the creation of deeper 6-, 8-, and 10-layer networks. Since the c, d, and e models enable high accuracy rates, the computational cost then plays a major role in the choice making process. As detailed in Tables I–VII, the eight-layer network enables the highest accuracy rates for all of the three data sets. Besides, it ensures low computational costs with lower training parameter numbers. Therefore, for what comes next, the eight-layered architecture will be used. As shown in Fig. 5, the display of the output classification map does not give much information about the system’s precision, since the accuracy rates exceed 90% in most of the six- and eight-layer cases of study. Therefore, the confusion matrices are drawn and presented in Fig. 6 in order to demonstrate the performance level of the networks. In this figure, the first two data sets represent classes ranging from 0 to 8 and the KSC confusion matrix represents classes ranging from 0 to 12 according to their enumeration order used in Section V-A. Although these matrices show highly accurate classification rates, they also demonstrate the confusions made by the trained network as shown in Fig. 6. In the case of the Pavia University data set, the Bitumen pixels are mixed up with the shadow pixels with almost 1/10 mistaken pixels among all the classified ones. The lowest accuracy rate (almost 60%) is witnessed in the case of the KSC image, where the cabbage palm/oak hammock class is confused with four other classes. The mistaken pixels are classified as 20%


TABLE VIII
PROCESSING TIME REQUIRED TO REACH 95% OF THE FINAL ACCURACY ON THE PAVIA UNIVERSITY DATA SET
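The reporting rule behind Table VIII (the iteration at which 95% of the final accuracy is reached) can be expressed in a few lines. In this sketch the function name `iteration_reaching` and the accuracy log are made up for illustration, and "reports 95% of the final accuracy" is interpreted as the first monitored iteration whose accuracy is at least 0.95 times the last recorded value.

```python
def iteration_reaching(history, fraction=0.95):
    """Return the first (iteration, accuracy) pair whose accuracy reaches
    `fraction` of the final accuracy in `history`.

    `history` is a list of (iteration, accuracy) pairs in training order.
    """
    if not history:
        raise ValueError("history must not be empty")
    target = fraction * history[-1][1]
    for iteration, accuracy in history:
        if accuracy >= target:
            return iteration, accuracy
    # Fallback for unusual logs (e.g. negative values); normally the last
    # entry already meets the target and the loop returns before this line.
    return history[-1]


# Hypothetical accuracy log (iteration, accuracy in %), not taken from the paper.
log = [(0, 11.0), (5000, 62.0), (10000, 85.5), (15000, 93.0), (20000, 96.5), (25000, 97.0)]
hit = iteration_reaching(log)
```

For this made-up log the threshold is 0.95 × 97.0 = 92.15%, first met at iteration 15 000. Averaging the selected iteration over several runs, as the paper suggests, would reduce the effect of random weight initialization.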

Fig. 7. Processing time required to reach 95% of the final accuracy on the Pavia University data set using the eight-layer 3 × 3 network.

oak/broadleaf hammock, 10% cabbage palm hammock, and less than 10% divided between slash pine and water. The system behavior toward such classes is basically driven by the high spectral and spatial similarity between the two class characteristics.
In order to get better insights on the choice of a given architecture, taking a look at the accuracy and loss values along the training phase is critical. Actually, these indicators can validate the choice of hyperparameters, such as the learning rate, but they also show the training stability and how an architecture reaches a stable performance level. Plotting those curves for several architectures can, however, be confusing, added to the fact that they are not perfectly reproducible due to random weights initialization. Then, for readability, we report the monitored values obtained at a specific iteration close to the maximum accuracy value but before the steady state. It somehow represents a form of early stopping in the training process where system performances are estimated. More specifically, we select the iteration index that reports 95% of the final accuracy value. Note that averaging those values along multiple experiments enhances related values confidence. Fig. 7 shows an example of such monitoring report on a single experiment. One can also observe a step at iteration 22 000 for both the loss and accuracy curves. This actually corresponds to the iteration where a decrease in the learning rate occurs. This change suddenly improves accuracy, since a good local minimum has been found previously, and then convergence to a lower loss value is made possible. In the proposed network setup, this step happens a few iterations before the proposed architectures report 95% of their final accuracy value. Finally, Table VIII makes the synthesis of the obtained results on the best performing architectures. As a conclusion, the eight-layer architectures enable high accuracy rates within a reasonable period of time in this paper. One can also notice the impressive convergence speed of the eight-layer architecture relying on 3 × 3 pixels neighborhood. Such architecture enables access to high-accuracy values at very low cost and is then the best compromise to obtain rapidly a good classifier in environment with low energy and processing time limitations.

E. Computational Cost and Training Data Requirements
Obviously, the developed deep architecture outperforms the existing methods in the case of the Pavia Center data set. Unlike different approaches, such as [58]–[60], the 3-D proposed architectures enable training from scratch with no prior data preprocessing. Not only does it simplify the processing task, the 3-D architecture enables the same performance levels as the previously mentioned approaches. Besides, the time taken is a main factor when evaluating DL architectures. Here, the classification process takes almost 3 h at the most with a single Intel i7-6600U laptop CPU (no GPU used) in the case of heavy architectures along with relatively large data set images. However, it only takes less than 2 min in the case of light networks with a few training samples. Another key factor that normally influences the performances of these classification processes is the amount of data involved for the training phase. For example, when using only 5% of the image pixels for training, the 3-D architecture cannot exceed the 98% accuracy rate in the case of the Pavia University data set (even when using the best performing architecture). However, when dedicating 10% of the data for training as experimented in [58], the accuracy rates in that case are more than 99.4%. As detailed in Fig. 8, the 3-D architecture is capable of reaching a 99% accuracy rate while only using 9% of the image pixels to train the network. The state-of-the-art method detailed in [37] also deploys 9% of the data set for training. However, this approach underperforms the proposed 3-D architecture with a 98% accuracy level for about 20 000 trained parameters against our 99% accuracy level for less than 7000 trained parameters.

F. Model Transferability
Sections V-B and V-C demonstrate the possibilities of establishing a new light deep neural network that takes into account both the spectral and the spatial raw data. It has been proven so far that this architecture performs well in the case of hyperspectral data. However, regarding the lack of richly annotated hyperspectral images, this paper and more generally state-of-the-art methods only examine and review training and testing in the same context. First, the training is executed using a subset of a single image that is specific to a given context. Then, testing and inference are based on the remaining pixels of the same image, i.e., we generalize on the same context. However, many questions can arise from this strategy, since the


Fig. 8. Accuracy (%) versus amount of training data (%) using the eight-layer CNN model (network d).
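Section V-F notes that the differing band counts (103 versus 102) change the size of the convolutional output, which is why the FC layer has to be replaced during fine-tuning. The following sketch illustrates this with the usual output-length formula of a strided convolution, floor((L + 2p − k) / s) + 1. The layer stack used here (kernel 3 and padding 1 along the spectral axis, strides 1, 2, 1, and 35 filters) is a hypothetical configuration, not one of the networks of Fig. 4, and the names `conv_out_len` and `spectral_len_after` are assumptions.

```python
def conv_out_len(length, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


def spectral_len_after(bands, strides, kernel=3, padding=1):
    """Spectral length left after a stack of convolutions with the given strides."""
    length = bands
    for stride in strides:
        length = conv_out_len(length, kernel, stride, padding)
    return length


strides = [1, 2, 1]   # hypothetical Conv / ConvPool / Conv sequence
filters = 35          # filter count used for the Conv layers in the paper

# Flattened feature size seen by the FC layer once the spatial extent is 1 x 1.
fc_inputs_university = spectral_len_after(103, strides) * filters
fc_inputs_center = spectral_len_after(102, strides) * filters
```

With these assumed settings the FC layer would need 1820 inputs for Pavia University and 1785 for Pavia Center, so its weights cannot be reused as they are, whereas the convolution kernels do not depend on the number of bands and can be kept.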

TABLE IX
ACCURACY LEVEL FOR FINE-TUNING USING 5% OF THE DATA FOR TRAINING

DL models are highly exposed to the possibility of overfitting and so did the model overfit on each specific data context? Can the learned features be transferred from one data set to the other?
Therefore, in what follows, the learned feature transferability between different contexts is proposed, trained using one specific image, and evaluated based on a second image. Since the target classes are the same, transfer learning between Pavia University and Pavia Center is proposed. However, the number of spectral bands differs (103 bands versus 102 bands) so that the output dimensions of the convolution layers will differ from one data set to the other. In this context, it is necessary to resort to architecture fine-tuning: given a neural network architecture trained on a given data set, all its architecture components (except the last FC layers) are kept and their weights are made constant. Finally, the FC layers are replaced with new ones whose number of connections (weights) is compatible with the new data set and the shape of the convolution layer output. A rapid training of the new network component is performed on the new data set on its own training set, and performances are evaluated on the test set.
As detailed in Table IX, the use of a pretrained eight-layer model in the case of the Pavia University and Pavia Center data sets provides an accurate pixelwise classification of the data. In fact, the deep neural networks are able to maintain nearly the same precision level when fine-tuned and trained from scratch (98.4% versus 98.9% and 90.4% versus 92.9%). Basically, the pretrained architectures proposed in this paper demonstrate a strong ability to generalize to other context images. As in the case of the Pavia University and Pavia Center data sets, the reuse of pretrained networks saves

VI. CONCLUSION
The processing of hyperspectral data in general is a very delicate procedure that demands the effective use of both spatial and spectral components. The benefit of the 3-D architecture introduced in this paper is not only to accurately classify the hyperspectral data but also to establish deep comprehension of the images at low cost. One of the most valuable consequences is the ability to efficiently optimize deep networks on small sized annotated data sets which then also reduces the cost of the data. The main concern now is to investigate ways to innovate and enhance the created models in order to process larger and heavier data sets. As a remedy for such a challenge, both residual and dense networks enable the fusion of different representation levels. Therefore, they would seem like an appealing solution to enhance the existing CNN architectures. Furthermore, hyperspectral data calibration is still an open issue that currently confines its use to limited areas. An interesting path would therefore be to create architectures that are able to cope with this issue.

REFERENCES
[1] T. Kavzoglu and P. M. Mather, “The use of backpropagating artificial neural networks in land cover classification,” Int. J. Remote Sens., vol. 24, no. 23, pp. 4907–4938, 2003.
[2] D. K. McIver and M. A. Friedl, “Using prior probabilities in decision-tree classification of remotely sensed data,” Remote Sens. Environ., vol. 81, nos. 2–3, pp. 253–261, Aug. 2002.
[3] M. Fauvel, J. Chanussot, and J. A. Benediktsson, “Evaluation of kernels for multiclass classification of hyperspectral remote sensing data,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), vol. 2. May 2006, p. 2.
[4] A. Ben Hamida, A. Benoit, P. Lambert, and C. Ben Amar, “Deep learning approach for remote sensing image analysis,” in Proc. Big Data Space (BiDS), 2016, p. 133.
[5] S. Lefèvre, L. Chapel, and F. Merciol, “Hyperspectral image classification from multiscale description with constrained connectivity and metric learning,” in Proc. IEEE Workshop Hyperspectral Image Signal Process., Evol. Remote Sens. (WHISPERS), Jun. 2014, pp. 1–4.
[6] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 45–54, Jan. 2014.
[7] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychol. Rev., vol. 65, no. 6, p. 386, 1958.
[8] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
the huge effort required to recreate a specific architecture to data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
be trained from scratch from each and every use case. 2006.
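The fine-tuning procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 3-D convolutions are simplified to 1-D convolutions along the spectral axis, the frozen kernels in `FROZEN_LAYERS` are hypothetical stand-ins for pretrained weights, the spectra are synthetic, and a two-class logistic regression stands in for the new FC layers. The sketch only shows why the FC part must be rebuilt when the band count changes and how the new part is trained while the convolution weights stay constant.

```python
import math
import random

# Hypothetical frozen stack: (kernel weights, stride) pairs standing in for
# convolution layers pretrained on the 103-band source image. Tuples are used
# so the weights cannot be modified during fine-tuning.
FROZEN_LAYERS = (((0.25, 0.5, 0.25), 1), ((0.25, 0.5, 0.25), 2))


def conv_out_len(n_in, kernel, stride=1):
    """Output length of an unpadded 1-D convolution along the spectral axis."""
    return (n_in - kernel) // stride + 1


def conv1d_relu(x, w, stride):
    """Unpadded 1-D convolution of spectrum x with kernel w, followed by ReLU."""
    k = len(w)
    return [max(0.0, sum(w[j] * x[i + j] for j in range(k)))
            for i in range(0, len(x) - k + 1, stride)]


def extract(x, layers=FROZEN_LAYERS):
    """Run the frozen convolutional stack; its weights are never updated."""
    for w, stride in layers:
        x = conv1d_relu(x, w, stride)
    return x


def train_new_head(features, labels, epochs=100, lr=0.1):
    """Fit a fresh logistic-regression head sized to the new feature length."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = b + sum(wi * fi for wi, fi in zip(w, f))
            g = 1.0 / (1.0 + math.exp(-z)) - y  # d(cross-entropy)/dz
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def fake_spectrum(label, n_bands):
    """Synthetic pixel spectrum (assumption): class 1 is brighter than class 0."""
    base = 0.8 if label else 0.2
    return [base + random.uniform(-0.1, 0.1) for _ in range(n_bands)]


random.seed(0)

# The same frozen stack yields different feature lengths for 103 and 102
# bands, so a head trained on the source image cannot be reused as is.
source_len = len(extract([0.0] * 103))
target_len = len(extract([0.0] * 102))

# Fine-tuning: only the new head is trained on the 102-band target data.
labels = [i % 2 for i in range(40)]
target_features = [extract(fake_spectrum(y, 102)) for y in labels]
head_w, head_b = train_new_head(target_features, labels)

predictions = [int(head_b + sum(w * f for w, f in zip(head_w, feats)) > 0)
               for feats in target_features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

With these assumed kernel sizes and strides, the frozen stack produces 50 features for a 103-band pixel and 49 for a 102-band pixel, which is the kind of shape mismatch that forces the replacement of the FC layers. Only `head_w` and `head_b` are learned on the target data, which keeps the retraining step short.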


[9] R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” in Proc. [36] F. Zhang, B. Du, L. Zhang, and L. Zhang, “Hierarchical feature
Artif. Intell. Statist., 2009, pp. 448–455. learning with dropout k-means for hyperspectral image classification,”
[10] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm Neurocomputing, vol. 187, pp. 75–82, Apr. 2016.
for Boltzmann machines,” Cognit. Sci., vol. 9, no. 1, pp. 147–169, 1985. [37] X. Ma, J. Geng, and H. Wang, “Hyperspectral image classification via
[11] A. Fischer and C. Igel, “An introduction to restricted Boltzmann contextual deep learning,” EURASIP J. Image Video Process., vol. 2015,
machines,” in Progress in Pattern Recognition, Image Analysis, Com- no. 1, p. 20, Dec. 2015.
puter Vision, and Applications. 2012, pp. 14–36. [38] A. Romero, C. Gatta, and G. Camps-Valls, “Unsupervised deep feature
[12] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm extraction for remote sensing image classification,” IEEE Trans. Geosci.
for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, Remote Sens., vol. 54, no. 3, pp. 1349–1362, Mar. 2016.
2006. [39] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis, “Deep
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based supervised learning for hyperspectral data classification through con-
learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, volutional neural networks,” in Proc. IEEE Int. Geosci. Remote Sens.
pp. 2278–2324, Nov. 1998. Symp. (IGARSS), Jul. 2015, pp. 4959–4962.
[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural [40] V. Slavkovikj, S. Verstockt, W. De Neve, S. Van Hoecke, and
Comput., vol. 9, no. 8, pp. 1735–1780, 1997. R. Van de Walle, “Hyperspectral image classification with convolutional
[15] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer- neural networks,” in Proc. 23rd ACM Int. Conf. Multimedia, Oct. 2015,
pp. 1159–1162.
wise training of deep networks,” in Proc. Adv. Neural Inf. Process. Syst.,
[41] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, “Deep convolutional
2007, pp. 153–160.
neural networks for hyperspectral image classification,” J. Sensors,
[16] L. Breiman, “Stacked regressions,” Mach. Learn., vol. 24, no. 1,
vol. 2015, Jan. 2015, Art. no. 258619.
pp. 49–64, 1996.
[42] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten.
[17] L. Deng and D. Yu, “Deep learning: Methods and applications,” (2016). “Densely connected convolutional networks.” [Online]. Avail-
Found. Trends Signal Process., vol. 7, nos. 3–4, pp. 197–387, able: https://arxiv.org/abs/1608.06993
Jun. 2014. [43] Y. Bengio, “Artificial neural networks and their application to sequence
[18] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, recognition,” Ph.D. dissertation, School Comput. Sci., McGill Univ.,
“Deep, big, simple neural nets for handwritten digit recognition,” Neural Montreal, QC, Canada, 1991.
Comput., vol. 22, no. 12, pp. 3207–3220, 2010. [44] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural proba-
[19] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural bilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155,
networks for image classification,” in Proc. IEEE Conf. Comput. Vis. Feb. 2003.
Pattern Recognit. (CVPR), Jun. 2012, pp. 3642–3649. [45] S. Zagoruyko and N. Komodakis. (2016). “Wide residual networks.”
[20] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. [Online]. Available: https://arxiv.org/abs/1605.07146
Neural Inf. Process. Syst., 2014, pp. 2672–2680. [46] S. Han, H. Mao, and W. J. Dally. (2015). “Deep compression: Com-
[21] A. G. Ororbia, II, C. L. Giles, and D. Kifer. (2016). “Unifying adversarial pressing deep neural networks with pruning, trained quantization and
training algorithms with flexible deep data gradient regularization.” Huffman coding.” [Online]. Available: https://arxiv.org/abs/1510.00149
[Online]. Available: https://arxiv.org/abs/1601.07213 [47] S. Han et al., “EIE: Efficient inference engine on compressed deep
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image neural network,” in Proc. 43rd Int. Symp. Comput. Archit., Jun. 2016,
recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 243–254.
pp. 770–778. [48] A. Parashar et al., “SCNN: An accelerator for compressed-sparse
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification convolutional neural networks,” in Proc. 44th Annu. Int. Symp. Comput.
with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Archit., 2017, pp. 27–40.
Process. Syst., 2012, pp. 1097–1105. [49] Z. Zhong, J. Li, L. Ma, H. Jiang, and H. Zhao, Deep Residual Networks
[24] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolu- for Hyperspectral Image Classification. Piscataway, NJ, USA: Institute
tional networks,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833. of Electrical and Electronics Engineers, 2017.
[25] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE [50] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. (2014).
Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9. “Striving for simplicity: The all convolutional net.” [Online]. Available:
[26] M. M. Botvinick and D. C. Plaut, “Short-term memory for serial order: https://arxiv.org/abs/1412.6806
A recurrent neural network model,” Psychol. Rev., vol. 113, no. 2, [51] L. Bottou, “Large-scale machine learning with stochastic gradient
pp. 201–233, 2006. descent,” in Proc. COMPSTAT, 2010, pp. 177–186.
[27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural [52] D. P. Kingma and J. Ba. (2014). “Adam: A method for stochastic
image caption generator,” in Proc. IEEE Conf. Comput. Vis. Pattern optimization.” [Online]. Available: https://arxiv.org/abs/1412.6980
Recognit., Jun. 2015, pp. 3156–3164. [53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
[28] M. Fayyaz, M. H. Saffar, M. Sabokrou, M. Fathy, R. Klette, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks
F. Huang. (2016). “STFCN: Spatio-temporal FCN for semantic video from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958,
segmentation.” [Online]. Available: https://arxiv.org/abs/1608.05971 2014.
[54] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. (2015). “Fast and
[29] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
accurate deep network learning by exponential linear units (ELUs).”
for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
[Online]. Available: https://arxiv.org/abs/1511.07289
Recognit., Jun. 2015, pp. 3431–3440.
[55] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
[30] L. Zhang, L. Zhang, and B. Du, “Deep learning for remote sensing data: Surpassing human-level performance on ImageNet classification,” in
A technical tutorial on the state of the art,” IEEE Geosci. Remote Sens. Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1026–1034.
Mag., vol. 4, no. 2, pp. 22–40, Jun. 2016. [56] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally,
[31] J. Zabalza et al., “Novel segmented stacked autoencoder for effective and K. Keutzer. (2016). “SqueezeNet: AlexNet-level accuracy with
dimensionality reduction and feature extraction in hyperspectral imag- 50x fewer parameters and <0.5 MB model size.” [Online]. Available:
ing,” Neurocomputing, vol. 185, pp. 1–10, Apr. 2016. https://arxiv.org/abs/1602.07360
[32] K. Karalas, G. Tsagkatakis, M. Zervakis, and P. Tsakalides, “Deep [57] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embed-
learning for multi-label land cover classification,” Proc. SPIE, ding,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
vol. 9643, pp. 9643-1–9643-14, Oct. 2015. [Online]. Available: [58] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extrac-
https://doi.org/10.1117/12.2195082, doi: 10.1117/12.2195082. tion and classification of hyperspectral images based on convolutional
[33] M. Fauvel, Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10,
J. C. Tilton, “Advances in spectral-spatial classification of hyperspectral pp. 6232–6251, Oct. 2016.
images,” Proc. IEEE, vol. 101, no. 3, pp. 652–675, Mar. 2013. [59] Y. Yuan, J. Lin, and Q. Wang, “Dual-clustering-based hyperspectral band
[34] K. Huang, S. Li, X. Kang, and L. Fang, “Spectral–spatial hyperspectral selection by contextual analysis,” IEEE Trans. Geosci. Remote Sens.,
image classification based on KNN,” Sens. Imag., vol. 17, no. 1, p. 1, vol. 54, no. 3, pp. 1431–1445, Mar. 2016.
Dec. 2016. [60] Q. Wang, J. Lin, and Y. Yuan, “Salient band selection for
[35] W. Zhao and S. Du, “Learning multiscale and deep representations for hyperspectral image classification via manifold ranking,” IEEE
classifying remotely sensed imagery,” ISPRS J. Photogramm. Remote Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1279–1289,
Sens., vol. 113, pp. 155–165, Mar. 2016. Jun. 2016.


Amina Ben Hamida received the National Diploma degree in electrical
engineering from the Ecole Nationale d’Ingénieurs de Sfax, Sfax, Tunisia. She
is currently pursuing the Ph.D. degree with Université Savoie Mont Blanc,
Annecy, France, with a focus on novel approaches that deploy deep learning
networks for remote sensing image classification tasks, specifically within
the hyperspectral and multispectral data communities. One of the main goals of
her research project is to establish a balance between high accuracy and low
cost for deep learning architectures.

Alexandre Benoit received the Ph.D. degree in electronics and computer science
from the Grenoble Institute of Technology, University of Grenoble, Grenoble,
France, in 2007.
Since 2008, he has been an Associate Professor with the Informatics, Systems,
Information and Knowledge Processing Laboratory, Université Savoie Mont Blanc,
Annecy, France. His research interests include image and video understanding.
He actively participates in multimedia indexing challenges such as TRECVid. He
develops pattern extraction and analysis methods and data fusion processes for
high semantic level image and video indexing and segmentation. He contributes
to the open source OpenCV library by providing a specific bioinspired
spatio-temporal filtering module.

Patrick Lambert received the Ph.D. degree in signal processing from the
National Polytechnic Institute of Grenoble, Grenoble, France, in 1983.
He is currently a Full Professor with the School of Engineering, Université
Savoie Mont Blanc, Annecy, France, and a member of the Informatics, Systems,
Information and Knowledge Processing Laboratory, Annecy, France. His research
interests include image and video analysis, currently focused on nonlinear
color filtering and automatic image understanding.

Chokri Ben Amar (SM’08) received the B.S. degree in electrical engineering
from the Ecole Nationale d’Ingénieurs de Sfax (ENIS), Sfax, Tunisia, in 1989,
and the M.S. and Ph.D. degrees in computer engineering from the National
Institute of Applied Sciences of Lyon, Lyon, France, in 1990 and 1994,
respectively.
He spent one year at the University of Haute Savoie, Annecy, France, as a
Teaching Assistant and a Researcher before joining the Higher School of
Sciences and Techniques of Tunis, Tunis, as an Assistant Professor, in 1995.
In 1999, he joined the University of Sfax, Sfax, where he is currently a Full
Professor with the Department of Computer Engineering and Applied Mathematics,
Ecole Nationale d’Ingénieurs de Sfax. His research interests include computer
vision and image and video analysis. These research activities are centered on
wavelets and wavelet networks and their applications to data classification
and approximation, pattern recognition, and image and video indexing and
securing.
Dr. Ben Amar founded the IEEE Signal Processing Society (SPS) Tunisia Chapter
in 2009, and he is currently the Chair of this Chapter. During this period,
the chapter organized five IEEE Distinguished Lectures and other technical and
professional activities. He has been the Advisor of the IEEE SPS Student
Chapter, ENIS, since 2010.
