
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2018.2809606, IEEE Transactions on Image Processing.

Diverse Region-Based CNN for Hyperspectral Image Classification

Mengmeng Zhang, Student Member, IEEE, Wei Li, Senior Member, IEEE, and Qian Du, Fellow, IEEE

Abstract—Convolutional neural network (CNN) is of great interest in machine learning and has demonstrated excellent performance in hyperspectral image classification. In this paper, we propose a classification framework, called diverse region-based CNN, which can encode a semantic context-aware representation to obtain promising features. By merging a diverse set of discriminative appearance factors, the resulting CNN-based representation exhibits the spatial-spectral context sensitivity that is essential for accurate pixel classification. The proposed method, which exploits diverse region-based inputs to learn contextual interaction features, is expected to have more discriminative power. The joint representation containing rich spectral and spatial information is then fed to a fully-connected network, and the label of each pixel vector is predicted by a softmax layer. Experimental results with widely-used hyperspectral image data sets demonstrate that the proposed method can surpass conventional deep-learning-based classifiers and other state-of-the-art classifiers.

Index Terms—Hyperspectral image, convolutional neural network, deep learning, pattern recognition

This work was supported by the National Natural Science Foundation of China (91638201, 61571033), Beijing Natural Science Foundation (4172043), Beijing Nova Program (Z171100001117050), and partly by the Fundamental Research Funds for the Central Universities (BUCTRC201615) (corresponding author: Wei Li).
M. Zhang and W. Li are with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029 China (e-mail: liwei089@ieee.org).
Q. Du is with the Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS 39762 USA (e-mail: du@ece.msstate.edu).

I. INTRODUCTION

Hyperspectral remote sensing has received considerable interest in recent years for a variety of applications in Earth observation [1-6]. Hyperspectral imagery (HSI) provides hundreds of contiguous narrow spectral bands [7-10], which enables more accurate discrimination of different materials than traditional panchromatic and multispectral remote sensing images. With its high spectral resolution, HSI has unique advantages for finer classification [11, 12], because it can uncover and reveal subtle spectral characteristics that traditional imagery cannot resolve.

In the early stage of HSI classification, numerous machine-learning-based methods were used, such as nearest neighbor, decision trees, and linear functions. Among these methods, k-nearest neighbor (k-NN) [13] can be viewed as the simplest classifier; it employs the Euclidean distance to measure the similarity between a testing sample and the available training samples. Support vector machine (SVM) [14], which determines the decision boundary in a high-dimensional space using the kernel method, was introduced to deal with the Hughes phenomenon, and the SVM classifier has become a benchmark. Meanwhile, sparse representation-based classification (SRC) [15, 16], extreme learning machine (ELM) [17], active learning [18], relevance vector machine (RVM) [19], and other classifiers have been developed to produce superior performance. Nevertheless, because the same material may present spectral discrepancies and different materials may have similar spectral signatures, it is difficult to precisely distinguish different classes via spectral information only [20]. Accordingly, to utilize the spatial information in HSI, many spectral-spatial techniques have been investigated. For example, a Markov random field (MRF)-based model was employed to combine spatial and spectral features [21, 22], and a generalized composite kernel machine for spectral-spatial HSI classification was presented by Li et al. [12], which can balance the use of spectral and spatial information without any weighting parameter.

The aforementioned methods adopt a series of manually-extracted features, which involve substantial expert experience and parameter setting. Deep learning methods [10, 23-29], which provide features more automatically, have been extensively employed for remote sensing image feature extraction and classification. A general approach to construct deep networks for remote sensing images was systematically analyzed by Zhang et al. [30]. In order to extract high-level features, a deep learning architecture with multilayer stacked autoencoders was constructed in an unsupervised manner [31, 32]. In particular, the convolutional neural network (CNN), a class of neural networks with fewer parameters than fully-connected networks with the same number of hidden units, has drawn great attention. Hu et al. employed a CNN to extract spectral features for HSI classification, and the performance was superior to that of SVM [33]. Again, exploiting the spatial information of HSI is of great importance for the classification task, and many CNN-based studies have explored this aspect. Slavkovikj et al. [34] presented a CNN framework for HSI classification where spectral features were extracted from a small neighbourhood. Makantasis et al. introduced randomized principal component analysis (PCA) in their work, followed by a CNN (named R-PCA CNN) to encode spectral and spatial information [35], and a similar strategy of a CNN with spatial-spectral features, named SS-CNN, was discussed by Mei et al. [36]. In the research of Yue et al. [37], hyperspectral data were projected onto several principal components, and a CNN model was adopted to extract spatial features.


Recently, Li et al. [38] presented a novel deep network to learn pixel-pair features and fuse the classification results of pixels in different pairs from a pixel's neighborhood. In such a strategy, the CNN with pixel-pair features (denoted as CNN-PPF) can use pixel pairs within a fixed window during classification, but the convolution operation is mainly executed in the spectral domain while neglecting spatial details. Furthermore, Lee et al. [39] proposed an interesting contextual deep CNN (denoted as CD-CNN), which can optimally explore contextual interactions by jointly exploiting local spatial-spectral relationships of neighboring pixel vectors within a square window. Specifically, the joint exploitation of spatial-spectral information is achieved by a multi-scale convolutional filter bank used as the initial component of the proposed CNN pipeline.

Although existing CNN-based methods have employed spatial-information extraction strategies for obtaining spatial-spectral features, how to more fully utilize the information within HSI (abundant spectral information and detailed spatial information) still remains a great challenge. Different from commonly-used CNN models that apply a sliding window with a specific scale to extract features [36, 39, 40], we present a diverse region-based deep CNN model (denoted as DR-CNN) in this paper. In the proposed framework, different input patterns and topologies of the CNN model are designed to ensure complete information transfer. The input patterns, namely diverse local or global regions (e.g., the central region, the original region, and four direction-oriented regions), support a joint representation of each pixel, ensuring an architecture of greater width. The proposed DR-CNN model employs the mentioned six regions as the input to extract spectral-spatial features of HSI via a well-designed network with the "multi-scale summation" module. A softmax classifier is used to classify each pixel. Moreover, in order to alleviate the problem of limited available training samples, we investigate hyperspectral data augmentation in the learning process.

Limited access to diverse input nodes can prevent a well-designed network from leveraging deeper and wider structures that take full advantage of the very rich spectral-spatial information. In the work of Lee et al. [39], the single-scale input is a square region with fixed size, which is not universally applicable to data sets with various object distributions. Hence, due to the single input style of its feature extraction process, the CD-CNN of Lee et al. [39] cannot fully exploit the abundant semantic-contextual properties around a specific pixel, causing a great loss of information. In fact, we have found that most state-of-the-art CNN-based approaches are designed with single-input architectures (e.g., spatial feature extraction using a fixed-size window) to conduct classification. In our opinion, the single input style restricts the performance; He et al. [41] pointed out that segregation and subsequent aggregation at a deeper layer of a CNN model are more physiologically sound and better suited to modeling hierarchical information processing in human brains. It may be more reasonable to generate representations from flexibly sized windows during training. Accordingly, we propose to construct a CNN framework with a diverse but rich pixel representation, which plays a critical role in classification tasks.

The main contributions of this paper are as follows. (1) The joint representation using diverse regions in the proposed CNN framework can simultaneously take advantage of the spectral information, spatial structure information, and semantic context-aware information of each pixel. (2) An important module, "multi-scale summation", is designed for deep feature extraction; it combines multiple scales and different-level features from unequal layers. The features are extracted in a manner of information supplement and propagation, thus ensuring the completeness of the information.

The remainder of this paper is organized as follows. The proposed classification framework is described in Section II. The experiments and analysis are discussed in Section III. The conclusion is drawn in Section IV.

II. DIVERSE REGION-BASED CNN

A. DR-CNN Architecture

The proposed deep network consists of several CNN branches, each branch representing a different region, and is therefore called the diverse region-based CNN (DR-CNN) model. The architecture of the proposed DR-CNN model is illustrated in Fig. 1. It is based on the assumption that adjacent pixels often consist of similar materials and tend, with high probability, to belong to the same class as the central testing pixel. In other words, it is suboptimal to consider only the central pixel without any neighboring spatial information. The key is how to select the surrounding areas. Different from a traditional square window, six regions with flexible shapes are built in the form of diverse rectangles, followed by six CNN blocks in the feature extractor.

For each CNN branch, a "multi-scale summation" module is employed to avoid the overfitting that is usually caused by limited training data. The detailed framework of the module is illustrated in Fig. 4; it allows for a certain increase in the depth and width of the network, leading to enhanced learning capability and ultimately improved generalization performance. In a typical CNN model, early convolutional layers with high spatial resolutions often capture more local details, while those with low spatial resolutions can capture more structural information with high-level semantics [42]. Inspired by DenseNet [43] and ResNet [44] for image classification, the "multi-scale summation" module is designed to combine local fine details and high-level structure information by cross-layer aggregation of shallow and deep layers. In particular, as shown in Fig. 4, the two cross-layer aggregations with shortcut connections contain convolution operations of different scales, to adjust the size of the feature maps to match that of the high-level hidden layers. The shortcut connections of the "multi-scale summation" module are selectively bridged, with the selection principle depending on the optimal sparseness of the shortcut connections. The strategy then passes earlier features to a subsequent layer via simple concatenation.

After that, all the features derived from the different regions are fused together and fed into the final fully-connected network, as shown in part III of Fig. 1. This fully-connected network is illustrated with more implementation details in Fig. 6. Consequently, the number of layers of the entire architecture adds up to 14. It is worth mentioning that the fully-connected layers perform classification using not only the comprehensive information derived from the diverse regions but also the combination of coarse and fine information within each region. Moreover, the red arrows in Fig. 1 represent the data augmentation operation performed during the training procedure, which will be discussed in Section III-B.
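As a hedged illustration of the cross-layer aggregation described above for the "multi-scale summation" module, the following Keras sketch (Python, the implementation setting reported in Section III) carries a shallow feature map over a shortcut connection, re-convolves it so its spatial size matches the deeper map, and merges the two by simple concatenation. The kernel sizes and channel widths (64/128) are borrowed loosely from Fig. 4; the exact wiring of the module is our assumption, not the authors' code.

```python
# Hedged sketch of the "multi-scale summation" idea: a shortcut from
# a shallow layer is re-convolved so its spatial size matches the
# deeper layer, then merged by simple concatenation. Widths (64/128)
# loosely follow Fig. 4; the exact wiring is an assumption.
from tensorflow.keras import layers

def multi_scale_summation(x):
    a = layers.Conv2D(64, 3, padding='valid', activation='relu')(x)
    b = layers.Conv2D(128, 3, padding='valid', activation='relu')(a)
    # Shortcut: a 3x3 valid convolution shrinks `a` to the size of `b`.
    s = layers.Conv2D(128, 3, padding='valid')(a)
    return layers.Concatenate()([b, s])
```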


Fig. 1. The overall flowchart of the proposed DR-CNN: (I) diverse region inputs (global, right, left, top, bottom, and local regions drawn from the hypothesis set around the current pixel vector of interest), (II) the deep feature extractor with one CNN branch per region, and (III) the fully-connected network that produces the classification result.

B. Diverse Regions and Specific Details of DR-CNN

Spectral-spatial classification techniques via CNN are designed to exploit the spatial correlation across neighboring pixels. That is, hyperspectral pixels in a small neighborhood around the central pixel are jointly represented by the CNN model for spectral-spatial feature extraction. However, as discussed previously, current CNN models [36, 37, 39] commonly adopt a single input style for feature extraction by using a window of fixed size (e.g., 3 × 3, 5 × 5, etc.). This type of selected region may include between-class pixels, especially for urban image scenes with complex distributions. In such heterogeneous areas, materials even within a small region may come from different classes, which is the disadvantage of using merely one fixed-size square window.

Fig. 2. Illustration of pixels in an edge area processed by a single square-shaped region.

For better illustration, Fig. 2 shows pixels in an edge area of synthetic data considered by only one square-shaped region. Suppose there is one spatial feature region (i.e., the red box) extracted for the central pixel belonging to the "BLUE" material. However, most pixels in the extracted region are of the "GREEN" material. Obviously, the classification result for the pixel of interest using this square-shaped region will be disappointing. In this example, we further notice that the homogeneous region of the (central) pixel of interest (i.e., pixels belonging to the same material within the square-shaped region) is widely distributed at the right side of the square-shaped region. Accordingly, if we abandon the coarse-grained strategy of considering the whole area in the red box and instead extract its right side, the utilization of spatial information becomes more suitable and powerful. By considering various possible cases, it is reasonable to infer that the handling of the distribution in a specific square-shaped region can be potentially improved.

Based on the aforementioned analysis, we propose to extract diverse regions according to various possible material distributions. The following are several representative situations.

(1) Direction-based Half Regions: there are several half parts (i.e., left, right, top, bottom) of a square-shaped region, as shown in Fig. 3 (a-d). A CNN model with "multi-scale summation" modules is trained on each of them, extracting the contextual characteristics present only in that half region, which is employed as input data as shown in Fig. 1. The network shown in Fig. 4 serves as the feature extractor for each direction-based half region.

(2) Central Region: as illustrated in Fig. 3 (e), this region is designed to concentrate closely on the small area around the center (i.e., 3 × 3). Specifically, the CNN model trained on this small region can be guided to extract relatively pure spectral features of each central pixel. There are two reasons behind this: firstly, the spectral information of the pixel may be contaminated by some uncontrollable factors; secondly, a 3 × 3 spatial region ensures the similarity of the pixels in this neighborhood, alleviating the impact of unknown factors such as outliers. The features extracted from the central region are often not affected by complex spatial distributions, especially in heterogeneous areas. Correspondingly, unlike the aforementioned half regions, the network shown in Fig. 5 serves as the feature extractor for the central region.
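To make the region definitions concrete, the following NumPy sketch (Python, matching the implementation language reported in Section III) slices the six inputs, including the global region described next, out of an 11 × 11 neighborhood. The sizes follow Table IV; that the half regions overlap the central row/column is our reading of Fig. 3, not a reproduction of the authors' code.

```python
import numpy as np

def diverse_regions(cube, row, col, w=11, c=3, h=7):
    # cube: HSI array (height, width, bands); (row, col) is the pixel
    # of interest, assumed far enough from the image border.
    # Sizes follow Table IV: 11x11 global, 3x3 central, and 11x7 / 7x11
    # direction-based half regions overlapping the center (assumption).
    r, k = w // 2, c // 2
    g = cube[row - r:row + r + 1, col - r:col + r + 1, :]  # global, 11x11
    central = g[r - k:r + k + 1, r - k:r + k + 1, :]       # central, 3x3
    left, right = g[:, :h, :], g[:, w - h:, :]             # 11x7 each
    top, bottom = g[:h, :, :], g[w - h:, :, :]             # 7x11 each
    return central, g, left, right, top, bottom
```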


(3) Global Region: as shown in Fig. 3 (f), this is the candidate square-shaped region itself, where the trained CNN branch is guided to capture the global contextual information of the central pixel, including the whole spectral information, more precise spatial relationships, and even the interactions between different categories.

Fig. 3. Diverse regions of a pixel of interest: (a) left direction-based half part, (b) right direction-based half part, (c) top direction-based half part, (d) bottom direction-based half part, (e) central region, and (f) global region.

Fig. 4. The detailed framework of the "multi-scale summation" module (convolutions with 5×5:64, 3×3:64, 3×3:128, 3×3:128, and 1×1:128 kernels, ReLU activations, batch normalization, and a fully-connected layer).

Fig. 5. The detailed framework of the feature extractor for the central region R_C (1×1:128 and 3×3:128 convolutions, ReLU activations, batch normalization, and a fully-connected layer).

All the regions mentioned above are taken as input data for the proposed DR-CNN model, as illustrated in part I of Fig. 1. The network structures shown in Figs. 4-5 further illustrate the deep feature extractors indicated in part II of Fig. 1. In the framework of DR-CNN, each convolution process involves a convolution (conv) layer, a nonlinear transformation layer, and a batch normalization [43, 45] layer. Specifically, in the implementation, all the convolution operations are executed without zero padding, and the convolution stride is set to 1. Besides, the number of convolution kernels used in each convolution operation is indicated in Figs. 4-5. Let R_C, R_G, R_L, R_R, R_T, and R_B be the input regions of the CNN, where the last four are the direction-based half regions, and R_G and R_C respectively represent the global region and the central region. The convolution layer creates a filter kernel W that is convolved with the input data and adds a bias b to produce a tensor of outputs Z as

Z = W ⊗ R_q + b,   q ∈ {C, G, L, R, T, B}    (1)

where ⊗ denotes the convolution operation. There are many alternatives for the nonlinear transformation, such as the sigmoid function and the hyperbolic tangent. Here, the rectified linear unit (ReLU) [46] is chosen for the nonlinear transformation layer to compute the output activation value Z̃ as

Z̃ = max{0, Z}.    (2)

The batch normalization (BN) layer normalizes the activations of the previous layer over each batch. In other words, it applies a transform that keeps the mean activation close to 0 and the activation standard deviation close to 1. Assume that the batch is of size m, so that there are m values of this activation in the batch, Z̃ = {Z̃_1, Z̃_2, ..., Z̃_m}; the normalized outputs are computed as

Ẑ = γ · (Z̃ − E[Z̃]) / √(Var[Z̃] + ξ) + β    (3)

where Ẑ = {Ẑ_1, Ẑ_2, ..., Ẑ_m} denotes the output of the samples in the batch after applying batch normalization, E[Z̃] and Var[Z̃] represent the expectation and variance of Z̃, respectively, and γ and β are parameters to be learned.


Fig. 6. The detailed framework of the fully-connected network in Part III of Fig. 1 (stacked fully-connected layers with ReLU activations and batch normalization, acting as the classifier of DR-CNN).

In general, the chain of the feature extractor ends in a fully-connected layer, and the entire feature-extraction operation for a specific region is defined as

f_Rq = F(R_q, θ),   q ∈ {C, G, L, R, T, B}    (4)

where the function F consists of the convolution process and the fully-connected process, R_q represents the specific region, f_Rq ∈ ℝ^(1×l) is the feature vector extracted from R_q, and θ consists of W, b, γ, and β.

After obtaining the diverse features of all the regions by the aforementioned feature extraction operations, these representative features are efficiently fused. First, the features of the different CNN pipelines are concatenated to obtain a feature vector f = {f_RC, f_RG, f_RL, f_RR, f_RT, f_RB}. Then, as shown in part III of Fig. 1, fully-connected layers are established to combine these features in depth by taking f as input. Finally, the softmax layer is applied to predict the classification label of the testing pixel.
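The fusion stage of Eq. (4) and part III of Fig. 1 can be sketched in Keras as below: the six per-region feature vectors are concatenated into f and passed through fully-connected layers ending in a softmax. The 256-unit hidden width is an illustrative assumption, since Fig. 6 is not reproduced here.

```python
# Sketch of the fusion and classification stage (part III of Fig. 1):
# concatenate f = {f_RC, f_RG, f_RL, f_RR, f_RT, f_RB} and classify
# with fully-connected layers and a softmax. The 256-unit width is an
# assumption, not the paper's exact configuration.
from tensorflow.keras import layers

def fuse_and_classify(branch_features, num_classes):
    f = layers.Concatenate()(branch_features)  # list of six 1-D feature tensors
    f = layers.Dense(256)(f)
    f = layers.Activation('relu')(f)
    f = layers.BatchNormalization()(f)
    return layers.Dense(num_classes, activation='softmax')(f)
```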

III. EXPERIMENTS AND ANALYSIS

For the proposed DR-CNN, all the programs are implemented in the Python language, and the network is constructed using the Keras(1) and TensorFlow(2) deep learning frameworks. TensorFlow is an open-source software library for numerical computation using data-flow graphs, and Keras can be seen as a simplified interface to TensorFlow.

(1) https://github.com/fchollet/keras
(2) http://tensorflow.org/

A. Experimental Data

The performance of the proposed DR-CNN is evaluated on three datasets, i.e., the Indian Pines dataset, the Salinas dataset, and the University of Pavia dataset, as illustrated in Fig. 7. For each data set, we randomly select 200 labeled pixels per class for training and use all the other pixels in the ground-truth map for testing. The Indian Pines data set, which consists of 145 × 145 pixels, was gathered by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) sensor in northwestern Indiana. There are 220 spectral channels covering the range from 0.4 to 2.5 µm, with a spatial resolution of 20 m. The Indian Pines dataset originally has 16 different land-cover classes; however, from the statistical viewpoint, we discard the small classes and select 8 large classes [14, 47]. The numbers of training and testing samples are listed in Table I.

Fig. 7. For the three experimental datasets: (a) false-color image of the Indian Pines data, (b) ground truth of the Indian Pines data, (c) false-color image of the Salinas data, (d) ground truth of the Salinas data, (e) false-color image of the University of Pavia data, and (f) ground truth of the University of Pavia data.

TABLE I
The Numbers of Training and Testing Samples for the Indian Pines Dataset.

#  Class            Training  Test
1  Corn-notill      200       1228
2  Corn-mintill     200       630
3  Grass-pasture    200       283
4  Hay-windrowed    200       278
5  Soybean-notill   200       772
6  Soybean-mintill  200       2255
7  Soybean-clean    200       393
8  Woods            200       1065
-  Total            1600      6904

The second data set, which consists of 512 × 217 pixels, was also collected by the AVIRIS sensor, over Salinas Valley, California.


The image comprises 224 spectral bands with a spatial resolution of 3.7 m. There are 16 classes in total, and the numbers of training and testing samples are listed in Table II.

TABLE II
The Numbers of Training and Testing Samples for the Salinas Dataset.

#   Class                       Training  Test
1   Broccoli green weeds 1      200       1809
2   Broccoli green weeds 2      200       3526
3   Fallow                      200       1776
4   Fallow rough plow           200       1194
5   Fallow smooth               200       2478
6   Stubble                     200       3759
7   Celery                      200       3379
8   Grapes untrained            200       11071
9   Soil vineyard develop       200       6003
10  Corn senesced green weeds   200       3078
11  Lettuce romaines, 4 wk      200       868
12  Lettuce romaines, 5 wk      200       1727
13  Lettuce romaines, 6 wk      200       716
14  Lettuce romaines, 7 wk      200       870
15  Vineyard untrained          200       7068
16  Vineyard vertical trellis   200       1607
-   Total                       3200      50929

The University of Pavia data set, which contains 610 × 340 pixels, was collected by the Reflective Optics System Imaging Spectrometer sensor covering the city of Pavia, Italy. The image scene comprises 103 spectral bands covering the range from 0.43 to 0.86 µm, with a spatial resolution of 1.3 m. Approximately 42776 labeled pixels of nine classes come from the ground-truth map, and the numbers of training and testing samples are listed in Table III.

TABLE III
The Numbers of Training and Testing Samples for the University of Pavia Dataset.

#  Class     Training  Test
1  Asphalt   200       6431
2  Meadows   200       18449
3  Gravel    200       1899
4  Trees     200       2864
5  Sheets    200       1145
6  Bare soil 200       4829
7  Bitumen   200       1130
8  Bricks    200       3482
9  Shadows   200       747
-  Total     1800      40976

B. Learning the Proposed DR-CNN

A deep network usually requires abundant training data to learn a model with a large number of parameters. However, in HSI classification tasks, only a few labeled samples may be available in practice. To solve this issue, we utilize a simple but effective data augmentation method, as shown in Fig. 8. For each training sample, two steps of data augmentation are executed to generate additional data without introducing extra labeling costs. The first step is flipping, implemented by flipping the original samples horizontally or vertically. The second step is to add small Gaussian noise to the original samples. In doing so, the number of training samples can be increased by a factor of two, ensuring more accurate estimation of the parameters.

Fig. 8. The process of data augmentation (two-fold enlargement of the training set via horizontal/vertical flips and Gaussian noise).
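A small NumPy sketch of this two-fold augmentation follows; the noise level sigma is an assumption (the paper only says "small" Gaussian noise), and composing the flip and the noise into one new sample per original is our reading of Fig. 8.

```python
import numpy as np

_rng = np.random.default_rng(0)

def augment(patch, sigma=0.01):
    # One augmented copy per training patch (H, W, bands): a random
    # horizontal or vertical flip followed by small Gaussian noise,
    # doubling the training set as in Fig. 8. sigma is an assumed
    # noise level; the paper does not state it.
    flipped = np.fliplr(patch) if _rng.random() < 0.5 else np.flipud(patch)
    return flipped + _rng.normal(0.0, sigma, size=patch.shape)
```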
For each training pixel, we use the surrounding 11 × 11 pixels; the diverse regions are extracted from this square region and then fed into a sequence of convolutional layers. Note that the proposed diverse-region strategy can be viewed as a flexible representation of the square-shaped region; hence, the region size affects the final performance of the proposed DR-CNN. Here, we empirically set the global region size to 11 × 11, which is further validated in Section III-C. Table IV summarizes the sizes of the different regions.

Furthermore, stochastic gradient descent (SGD) with a batch size of 450 samples is used for 500 × C iterations (C is the number of classes), with a momentum of 0.99 and a weight decay D of 0.0001. We initially set a base learning rate L of 0.001, and it is decreased as L̂ = L · 1/(1 + D × I), where L̂ is the updated learning rate and I is the index of the current iteration. All the convolutional layers are initialized using zero-mean Gaussian random variables with a standard deviation of √(2/(fan_in + fan_out)), where 'fan_in' is the number of input units and 'fan_out' is the number of output units in the weight tensor [48]. The biases of all the convolutional layers are initialized to zero.
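The learning-rate decay rule can be written out directly; the snippet below is a plain restatement of L̂ = L · 1/(1 + D × I) with the stated constants.

```python
def decayed_learning_rate(iteration, base_lr=0.001, decay=0.0001):
    # Inverse-time decay used for DR-CNN training:
    # L_hat = L * 1 / (1 + D * I), with L = 0.001 and D = 0.0001.
    return base_lr / (1.0 + decay * iteration)

# e.g., decayed_learning_rate(0) == 0.001
#       decayed_learning_rate(4000) ~= 0.000714
```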
C. Analysis on the Diverse Regions

In order to validate the diverse-region strategy described in Section II-B, we compare the classification results obtained with different kinds of input blocks, such as square-shaped regions of different sizes and direction-based half regions. The fully-connected network with softmax loss, as shown in Fig. 6, acts as the classifier for obtaining the classification result.

Fig. 9 illustrates the classification performance of a square-shaped region versus different window sizes, from 3 × 3 to 15 × 15. The classification result for each square-shaped region is obtained by using the feature extraction structure shown in Fig. 4 or 5 connected to the softmax classifier. It is apparent that when the window size is as large as 11 × 11, the performance tends to be satisfactory. The size 11 × 11 may not be the best window size for every experimental data set: the red curve shows that the best window size of the square-shaped region is 9 × 9 for the University of Pavia data set, and the blue curve indicates that the best window size is 9 × 9 for the Indian Pines data set, while the best window size for the Salinas data set is 11 × 11. Hence, we choose a relatively large size (e.g., 11 × 11), within allowable hardware resources, for better analysis of the interactions between different categories.


TABLE IV
The Window Sizes of the Diverse Regions Adopted in the Proposed DR-CNN.

Region  Global  Central  Left  Right  Top   Bottom
Size    11×11   3×3      11×7  11×7   7×11  7×11

Fig. 9. Overall classification accuracies (%) of a square-shaped region versus various block window sizes (3 × 3 to 15 × 15) on the University of Pavia, Salinas, and Indian Pines data sets.

Tables V-VII further list the class-specific accuracy, overall accuracy (OA), average accuracy (AA), and Kappa coefficient of the different direction regions and the proposed fusion strategy. The results show that R_C produces very low accuracy, the main reason being that only the central region covering pure spectral information is employed, without the surrounding spatial information. In the other cases, the classification task can proceed well using only half of the candidate region. Sometimes, the classification result achieved by a half region is close to or even better than that of R_G; for example, the best single-block classification result on the University of Pavia data set is achieved by R_U.

TABLE V
Classification Performance of Diverse Regions for the Indian Pines Data Set.

        OA (%)  AA (%)  Kappa Coefficient
R_L     96.12   95.79   0.9525
R_R     95.44   94.94   0.9442
R_U     94.38   93.90   0.9315
R_B     95.52   95.20   0.9451
R_C     90.43   90.24   0.8836
R_G     96.99   97.34   0.9630
DR-CNN  98.54   98.48   0.9820

TABLE VI
Classification Performance of Diverse Regions for the Salinas Data Set.

        OA (%)  AA (%)  Kappa Coefficient
R_L     95.29   97.73   0.9473
R_R     96.43   98.03   0.9601
R_U     96.47   97.68   0.9605
R_B     96.37   97.97   0.9595
R_C     89.67   95.63   0.8836
R_G     97.41   98.19   0.9710
DR-CNN  98.33   99.12   0.9814

Admittedly, the performance of the direction-based half regions is rather limited because of their positional anisotropy. Even so, the classification results clearly show greater diversity among the different kinds of blocks, which illustrates that different contextual distributions have different generalization capabilities for classification. Which block works better depends heavily on the actual distribution of the data, as analyzed in Section II-B, and the experimental results can surely be improved with the fusion strategy. As a consequence, the proposed DR-CNN achieves the best accuracy because it has a more robust feature representation that considers multiple possible distributions.

D. Classification Performance

Apart from the weights, which are learned automatically, the learning rate is a crucial parameter that significantly affects the training performance. Different initial learning rates are tested on the Indian Pines dataset, as shown in Table VIII. It can be observed that a large learning rate may degrade the classification performance; the best performance is achieved when the learning rate is around 0.001, so we set the learning rate to 0.001 for DR-CNN in the follow-up experiments. Besides, we compare the classification performance achieved with and without the training data augmentation operation; the classification results are listed in Table IX. It is evident that the augmentation operation for the training data plays a positive role in the classification task: the overall classification accuracy achieved with augmentation is higher than that without augmentation for almost all the regions. Additionally, we compare the classification performance of methods with shortcut connections by randomly selecting 10% of the training samples per class on the Indian Pines data; as shown in Table X, DR-CNN achieves the highest accuracy, owing to the combination of diverse regions and the "multi-scale summation" module.

The performance of the proposed DR-CNN is also compared with some state-of-the-art HSI classification approaches, such as SVM with the radial basis function kernel (denoted as SVM-RBF), multiple classifier systems based on SVM with random feature selection (denoted as SVM-RFS [50]), SVM-MRF [21], CNN [33], R-PCA CNN [35], CNN-PPF [38], CD-CNN [39], and SS-CNN [36]. Note that the SVM-based methods are implemented using the libsvm toolbox(3). Tables XI-XIII list the class-specific accuracy, OA, and AA of these methods. For DR-CNN, the experiments are repeated ten times with 200 randomly selected training samples per class, and the average results with standard deviations are reported.

(3) http://www.csie.ntu.edu.tw/~cjlin/libsvm/


TABLE XI
Comparison of the Classification Accuracy (%) Among the Proposed Method and the Baselines Using the Indian Pines Data.

Class  SVM-RBF  SVM-RFS  SVM-MRF  CNN[33]  R-PCA CNN[35]  CNN-PPF[38]  CD-CNN[39]  SS-CNN[36]  DR-CNN
1      76.14    88.73    93.55    78.58    82.39          92.99        90.1        96.28       98.20±0.012
2      85.40    91.20    90.41    85.24    85.41          96.66        97.1        92.26       99.79±0.003
3      97.88    97.52    95.80    96.10    95.24          98.58        100         99.3        100±0
4      99.28    100      100      99.64    100            100          100         100         100±0
5      83.94    91.67    91.12    89.64    82.76          96.24        95.9        92.84       99.78±0.003
6      73.48    78.79    97.72    81.55    96.2           87.8         87.1        98.21       96.69±0.011
7      92.11    93.76    91.71    95.42    82.14          98.98        96.4        92.45       99.86±0.001
8      97.28    98.74    99.84    98.59    99.81          99.81        99.4        98.98       99.99±0
AA     88.19    92.55    95.02    90.60    90.49          96.38        95.75       96.29       99.29±0.001
OA     82.98    88.68    95.34    87.01    91.13          93.9         94.24       96.63       98.54±0.257

TABLE XII
Comparison of the Classification Accuracy (%) Among the Proposed Method and the Baselines Using the Salinas Data.

Class  SVM-RBF  SVM-RFS  SVM-MRF  CNN[33]  R-PCA CNN[35]  CNN-PPF[38]  CD-CNN[39]  SS-CNN[36]  DR-CNN
1      96.81    99.55    100      97.34    98.84          100          100         100         100±0
2      94.67    99.92    99.70    99.29    99.61          99.88        100         99.89       100±0
3      90.27    99.44    98.94    96.51    99.75          99.60        100         99.89       99.98±0
4      98.61    99.86    98.44    99.66    98.79          99.49        99.3        99.25       99.89±0.001
5      94.82    98.02    99.47    96.97    99.84          98.34        98.5        99.39       99.83±0.002
6      97.61    99.7     99.95    99.60    99.7           99.97        100         100         100±0
7      99.24    99.69    100      99.49    79.05          100          99.8        99.82       99.96±0
8      54.69    84.85    87.64    72.25    99.17          88.68        83.4        91.45       94.14±0.018
9      98.32    99.58    99.45    97.53    96.88          98.33        99.6        99.95       99.99±0
10     81.91    96.49    94.41    91.29    99.31          98.6         94.6        98.51       99.20±0.003
11     90.57    98.78    99.91    97.58    100            99.54        99.3        99.31       99.99±0
12     92.43    100      99.64    100      100            100          100         100         100±0
13     98.07    99.13    100      99.02    98.97          99.44        100         99.72       100±0
14     90.39    98.97    98.79    95.05    82.24          98.96        100         100         100±0
15     60.06    76.38    83.37    76.83    97.57          83.53        100         96.24       95.52±0.029
16     90.87    99.56    97.34    98.94    99.61          99.31        98          99.63       99.72±0.002
AA     89.33    96.87    97.32    94.84    96.83          97.73        98.28       98.94       99.26±0
OA     81.55    93.15    94.59    89.28    92.39          94.8         95.42       97.42       98.33±0.171

TABLE XIII
Comparison of the Classification Accuracy (%) Among the Proposed Method and the Baselines Using the University of Pavia Data.

Class  SVM-RBF  SVM-RFS  SVM-MRF  CNN[33]  R-PCA CNN[35]  CNN-PPF[38]  CD-CNN[39]  SS-CNN[36]  DR-CNN
1      84.01    87.95    98.22    88.38    92.43          97.42        94.6        97.4        98.43±0.005
2      88.9     91.17    98.90    91.27    94.84          95.76        96          99.4        99.45±0.006
3      87.57    86.99    88.97    85.88    90.89          94.05        95.5        94.84       99.14±0.003
4      96.09    95.5     93.64    97.24    93.99          97.52        95.9        99.16       99.50±0.003
5      99.91    99.85    99.11    99.91    100            100          100         100         100±0
6      93.33    94.31    80.13    96.41    92.86          99.13        94.1        98.7        100±0
7      93.98    94.74    82.79    93.62    93.89          96.19        97.5        100         99.70±0.003
8      82.94    85.89    91.88    87.45    91.18          93.62        88.8        94.57       99.55±0.002
9      99.6     99.89    100      99.57    99.33          99.6         99.5        99.87       100±0
AA     91.82    92.92    94.04    93.36    94.38          97.03        95.77       98.22       99.53±0.001
OA     89.24    91.1     92.63    92.27    93.87          96.48        96.73       98.41       99.56±0.253

TABLE VII
Classification Performance of Diverse Regions for the University of Pavia Data Set.

        OA (%)  AA (%)  Kappa Coefficient
R_L     98.56   97.52   0.9806
R_R     98.55   97.18   0.9805
R_U     98.81   97.75   0.9839
R_B     98.08   96.69   0.9743
R_C     93.78   92.29   0.9173
R_G     98.76   98.08   0.9833
DR-CNN  99.56   99.42   0.9941

TABLE VIII
Classification Performance (%) of Different Initial Learning Rates for DR-CNN on the Indian Pines Data.

Learning Rate  OA (%)  AA (%)  Kappa Coefficient
0.1            96.25   96.69   0.9540
0.01           98.46   98.45   0.9811
0.001          98.54   98.48   0.9820
From the results on each individual experimental data set, the classification performance based on spectral-spatial features is much better than that based solely on spectral features, and the proposed DR-CNN is obviously superior to all the other classifiers.


TABLE IX
Overall Accuracy (%) With and Without Data Augmentation on the Indian Pines Data.

        With Data Augmentation  Without Data Augmentation
R_L     96.12                   94.87
R_R     95.44                   93.03
R_U     94.38                   95.12
R_B     95.52                   93.77
R_C     90.43                   88.20
R_G     96.99                   95.28
DR-CNN  98.54                   98.07

TABLE X
Overall Accuracy (%) With Shortcut Connections on the Indian Pines Data.

Methods         OA (%)
ResNet-4 [49]   96
ResNet-6 [49]   95
ResNet-8 [49]   94
ResNet-10 [49]  93
DR-CNN          98

TABLE XIV
Classification Accuracy (%) Versus Different Numbers of Training Samples Per Class for CNN-Based Methods.

Dataset              Method        50     100    150    200
Indian Pines         CNN [33]      80.43  84.32  85.3   87.01
                     CNN-PPF [38]  88.34  91.72  93.14  93.9
                     CD-CNN [39]   84.43  88.27  -      94.24
                     DR-CNN        88.74  94.94  97.49  98.54
Salinas              CNN [33]      89.2   89.58  89.6   89.72
                     CNN-PPF [38]  92.15  93.88  93.84  94.8
                     CD-CNN [39]   82.74  98.58  -      95.42
                     DR-CNN        93.46  95.54  97.36  98.33
University of Pavia  CNN [33]      86.39  88.53  90.89  92.27
                     CNN-PPF [38]  88.14  93.35  94.97  96.48
                     CD-CNN [39]   92.19  93.55  -      96.73
                     DR-CNN        96.91  98.67  99.21  99.56

TABLE XV
Elapsed Time (h: hours, s: seconds) of Training and Testing on the Three Data Sets.

Method                      Indian Pines  Salinas  University of Pavia
CNN [33]      Training (h)  0.50          1        0.60
              Testing (s)   0.21          0.26     0.37
CNN-PPF [38]  Training (h)  6             12       1
              Testing (s)   4.76          20.97    16.92
DR-CNN        Training (h)  0.74          1.42     0.43
              Testing (s)   39            240      105
improvement compared to the CD-CNN (i.e., 95.42%). The
Besides, Fig. 13 shows the convergence curves. The training
similar situations exist when using the other two experimental
process is relatively time-consuming as obtaining satisfactory
data. The classification performance of the proposed network
performance requires a large number of iterations, such as 500;
is better than the best baseline classification performance
however, one can find that the proposed DR-CNN actually
by approximately 2%, 1%, and 1% for the Indian Pines
can converge after about 300 iterations. Therefore, more
dataset, the Salinas dataset, and the University of Pavia dataset,
efficient training could be obtained by moderately reducing
respectively.
the iterations.
Figs. 10-12 illustrate the classification maps, where the
visual classification results are consistent with the results listed
in Tables XI–XIII. Ground cover maps of entire image scenes IV. C ONCLUSIONS
(including unlabeled pixels) are provided. We can easily find In this paper, a novel diverse region-based CNN model was
that many regions of the classification maps achieved by the proposed for hyperspectral image classification. The proposed
proposed DR-CNN are obviously less noisy than those of the DR-CNN model extracted spectral-spatial features via the
traditional CNN, CNN-PPF, and CD-CNN, e.g., the regions of well-designed network with “multi-scale summation” module.
Soybean-clean in Fig. 10 and the Bare soil in Fig. 12. The advantage of DR-CNN comes from utilization of diverse
Table XIV further lists the classification performance with region-based input and exploration of the abundant spatial and
different numbers of training samples per class changed from spectral features with the deep network structure. Experimental
50 to 200 with an interval of 50. Obviously, for all the methods results demonstrate that the proposed DR-CNN can provide
the accuracy can grow with the number of training samples. statistically higher accuracy than state-of-the-art techniques.
From the results, the proposed DR-CNN still consistently
performs better than other state-of-the-art methods, i.e., CNN, R EFERENCES
CNN-PPF, and CD-CNN. Note that even for a smaller training [1] Y. Yuan, Y. Feng, and X. Lu, “Projection-based NMF for hyperspectral
dataset size, such as 50 or 100, the proposed network still unmixing,” IEEE Journal of Selected Topics in Applied Earth Observa-
provides excellent classification performance, especially on the tions and Remote Sensing, vol. 8, no. 6, pp. 2632–2643, 2015.
[2] W. Li, Q. Du, and B. Zhang, “Combined sparse and collaborative
complex urban image scene, i.e., the University of Pavia data. representation for hyperspectral target detection,” Pattern Recognition,
Computational complexity of training and testing proce- vol. 48, pp. 3904–3916, 2015.
dures using the proposed DR-CNN, CNN [33], and CNN-PPF [3] X. Huang and L. Zhang, “An SVM ensemble approach combining
spectral, structural, and semantic features for the classification of high
[38] is summarized in Table XV. For the training process, resolution remotely sensed imagery,” IEEE Transactions on Geoscience
CNN [33] is much faster than the other two because the net- and Remote Sensing, vol. 51, no. 1, pp. 257–272, 2012.
work size and input size of CNN[33] are much smaller. While [4] H. Goldberg, H. Kwon, and N. M. Nasrabadi, “Kernel eigenspace
separation transform for subspace anomaly detection in hyperspectral
in the testing procedure, DR-CNN is more time-consuming imagery,” IEEE Geoscience and Remote Sensing Letters, vol. 4, no. 4,
due to the computation burden in diverse regions construction. pp. 581–585, Oct. 2007.


Fig. 10. Classification maps from the proposed DR-CNN and the baselines on the Indian Pines data: (a) CNN: 87.01%, (b) CNN-PPF: 93.9%, (c) CD-CNN: 94.24%, (d) DR-CNN: 98.54%.

Fig. 11. Classification maps from the proposed DR-CNN and the baselines on the Salinas data: (a) CNN: 89.28%, (b) CNN-PPF: 94.8%, (c) CD-CNN: 95.42%, (d) DR-CNN: 98.33%.

[5] S. Vivek, D. Ali, T. Tinne, and V. G. Luc, "Hyperspectral CNN for image classification & band selection, with application to face recognition," Technical report KUL/ESAT/PSI/1604, KU Leuven, ESAT, Leuven, Belgium, Dec. 2016.
[6] K. Makantasis, A. Doulamis, N. Doulamis, and A. Nikitakis, "Tensor-based classifiers for hyperspectral data analysis," arXiv preprint arXiv:1709.08164, 2017.
[7] L. David, "Hyperspectral image data analysis," IEEE Signal Processing Magazine, vol. 19, no. 1, pp. 17-28, Aug. 2002.
[8] H. Li, Y. Song, and C. L. Philip Chen, "Hyperspectral image classification based on multiscale spatial information fusion," IEEE Transactions on Geoscience and Remote Sensing, 2017, in print.
[9] X. Zheng, Y. Yuan, and X. Lu, "Dimensionality reduction by spatial spectral preservation in selected bands," IEEE Transactions on Geoscience and Remote Sensing, 2017, in print.
[10] L. Lin and X. Song, "Using CNN to classify hyperspectral data based on spatial-spectral information," in International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kaohsiung, Taiwan, Nov. 2016, pp. 61-68.
[11] M. Fauvel, Y. Tarabalka, J. A. Benediktsson, J. Chanussot, and J. C. Tilton, "Advances in spectral-spatial classification of hyperspectral images," Proceedings of the IEEE, vol. 101, no. 3, pp. 652-675, Sept. 2013.
[12] J. Li, P. R. Marpu, A. Plaza, J. M. Bioucas-Dias, and J. A. Benediktsson, "Generalized composite kernel framework for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 9, pp. 4816-4829, Sept. 2013.
[13] E. Blanzieri and F. Melgani, "Nearest neighbor classification of remote sensing images with the maximal margin principle," IEEE Transactions on Geoscience and Remote Sensing, vol. 46, no. 6, pp. 1804-1811, June 2008.
[14] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 8, pp. 1778-1790, Aug. 2004.
[15] J. Liu, Z. Wu, Z. Wei, L. Xiao, and L. Sun, "Spatial-spectral kernel sparse representation for hyperspectral image classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, no. 6, pp. 2462-2471, 2013.
[16] W. Li and Q. Du, "A survey on representation-based classification and detection in hyperspectral remote sensing imagery," Pattern Recognition Letters, vol. 83, pp. 115-123, 2016.


Fig. 12. Classification maps from the proposed DR-CNN and the baselines on the University of Pavia data: (a) CNN: 92.27%, (b) CNN-PPF: 96.48%, (c) CD-CNN: 96.73%, (d) DR-CNN: 99.56%.

[17] W. Li, C. Chen, H. Su, and Q. Du, "Local binary patterns and extreme learning machine for hyperspectral imagery classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 7, pp. 3681-3693, 2015.
[18] S. Sun, P. Zhong, H. Xiao, and R. Wang, "Active learning with Gaussian process classifier for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 1746-1760, Aug. 2015.
[19] F. A. Mianji and Y. Zhang, "Robust hyperspectral classification using relevance vector machine," IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 2100-2112, June 2011.
[20] B. Liu, X. Yu, P. Zhang, A. Yu, Q. Fu, and X. Wei, "Supervised deep feature extraction for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, no. 99, pp. 1-13, Nov. 2017.
[21] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. A. Benediktsson, "SVM- and MRF-based method for accurate classification of hyperspectral images," IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 4, pp. 736-740, May 2010.
[22] J. Li, J. M. Bioucas-Dias, and A. Plaza, "Spectral spatial hyperspectral image segmentation using subspace multinomial logistic regression and Markov random fields," IEEE Geoscience and Remote Sensing Letters, vol. 50, no. 3, pp. 809-823, Aug. 2012.
[23] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2175-2184, Apr. 2015.
[24] K. Makantasis, K. Karantzalos, A. Doulamis, and M. Loupos, "Deep learning-based man-made object detection from hyperspectral data," in International Symposium on Visual Computing, Las Vegas, Nevada, Dec. 2016, pp. 717-727.
[25] M. Lichao, G. Pedram, and Z. XiaoXiang, "Deep recurrent neural networks for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3639-3655, 2017.
[26] F. Zhang, B. Du, and L. Zhang, "Scene classification via a gradient boosting random convolutional network framework," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 3, pp. 1793-1802, 2016.
[27] G. J. Scott, M. R. England, W. A. Starms, R. A. Marcum, and C. H. Davis, "Training deep convolutional neural networks for land cover classification of high resolution imagery," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 4, pp. 549-553, 2017.
[28] Q. Zou, L. Ni, T. Zhang, and Q. Wang, "Deep learning based feature selection for remote sensing scene classification," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 11, pp. 2321-2325, 2015.
[29] R. Kemker and C. Kanan, "Deep neural networks for semantic segmentation of multispectral remote sensing imagery," arXiv preprint arXiv:1703.06452, 2017.
[30] L. Zhang, L. Zhang, and B. Du, "Deep learning for remote sensing data: A technical tutorial on the state of the art," IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22-40, June 2016.
[31] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, "Deep learning-based classification of hyperspectral data," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2094-2107, June 2014.
[32] X. Ma, H. Wang, and J. Geng, "Spectral spatial classification of hyperspectral image based on deep auto encoder," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 9, pp. 4073-4085, 2016.
[33] W. Hu, Y. Huang, W. Li, F. Zhang, and H. Li, "Deep convolutional neural networks for hyperspectral image classification," Journal of Sensors, vol. 2015, no. 258619, pp. 1-12, 2015.
[34] V. Slavkovikj, S. Verstockt, W. D. Neve, S. V. Hoecke, and R. V. Walle, "Hyperspectral image classification with convolutional neural networks," in ACM International Conference on Multimedia (ACMMM), Brisbane, Australia, Oct. 2015, pp. 26-30.
[35] K. Makantasis, K. Karantzalos, A. Doulamis, and N. Doulamis, "Deep supervised learning for hyperspectral data classification through convolutional neural networks," in IGARSS, Milan, Italy, July 2015, pp. 4959-4962.
[36] S. Mei, J. Ji, J. Hou, X. Li, and Q. Du, "Learning sensor-specific spatial spectral features of hyperspectral images via convolutional neural networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4520-4533, Aug. 2017.
[37] J. Yue, W. Zhao, S. Mao, and H. Liu, "Spectral spatial classification of hyperspectral images using deep convolutional neural networks," Remote Sensing Letters, vol. 6, no. 6, pp. 468-477, May 2015.
[38] W. Li, G. Wu, F. Zhang, and Q. Du, "Hyperspectral image classification using deep pixel pair features," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 2, pp. 844-853, Feb. 2017.
[39] H. Lee and H. Kwon, "Going deeper with contextual CNN for hyperspectral image classification," IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4843-4855, July 2017.
[40] Y. Chen, X. Zhao, and X. Jia, "Spectral spatial classification of hyperspectral data based on deep belief network," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2381-2392, June 2015.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904-1916, Sept. 2015.
[42] X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, and S. Yan, "Human parsing with contextualized convolutional neural network," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 1, pp. 115-127, Mar. 2016.

[43] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," arXiv preprint arXiv:1608.06993, 2016.
[44] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 2016, pp. 770-778.
[45] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[46] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in International Conference on Machine Learning, Haifa, Israel, June 2010, pp. 21-24.
[47] W. Li, E. W. Tramel, S. Prasad, and J. E. Fowler, "Nearest regularized subspace for hyperspectral classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 1, pp. 477-489, Jan. 2014.
[48] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, Sardinia, Italy, May 2010, pp. 249-256.
[49] Z. Zhong, J. Li, L. Ma, H. Jiang, and H. Zhao, "Deep residual networks for hyperspectral image classification," Texas, USA, July 2017, pp. 23-28.
[50] B. Waske, S. Linden, J. Benediktsson, and P. Hostert, "Sensitivity of support vector machines to random feature selection in classification of hyperspectral data," IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 7, pp. 2880-2889, Aug. 2010.

Fig. 13. Convergence curves (training error rate versus iterations, 0-500) on the three experimental data sets: (a) Indian Pines data, (b) Salinas data, (c) University of Pavia data.

Mengmeng Zhang (S'15) received her B.S. degree from Qingdao University of Science and Technology, Qingdao, China, in 2014. She is currently pursuing the Ph.D. degree at Beijing University of Chemical Technology. Her research interests include remote sensing image processing and pattern recognition. Her supervisor is Dr. Wei Li.

Wei Li (S'11-M'13-SM'16) received the B.E. degree in telecommunications engineering from Xidian University, Xi'an, China, in 2007, the M.S. degree in information science and technology from Sun Yat-Sen University, Guangzhou, China, in 2009, and the Ph.D. degree in electrical and computer engineering from Mississippi State University, Starkville, MS, USA, in 2012. Subsequently, he spent one year as a Postdoctoral Researcher at the University of California, Davis, CA, USA. He is currently a Professor and Vice Dean with the College of Information Science and Technology at Beijing University of Chemical Technology, Beijing, China. His research interests include hyperspectral image analysis, pattern recognition, and data compression. Dr. Li is an active reviewer for the IEEE Transactions on Geoscience and Remote Sensing (TGRS), the IEEE Geoscience and Remote Sensing Letters (GRSL), and the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS). He is currently serving as an Associate Editor for the IEEE Signal Processing Letters. He has served as Guest Editor for special issues of the Journal of Real-Time Image Processing, Remote Sensing, and IEEE JSTARS. He received the 2015 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society (GRSS) for his service to IEEE JSTARS.

Qian Du (S'98-M'00-SM'05-F'18) received the Ph.D. degree in electrical engineering from the University of Maryland, Baltimore, MD, USA, in 2000. She is currently the Bobby Shackouls Professor with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS, USA. Her research interests include hyperspectral remote sensing image analysis and applications, pattern classification, data compression, and neural networks. Dr. Du is a fellow of the SPIE-International Society for Optics and Photonics. She received the 2010 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society. She was a Co-Chair of the Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society from 2009 to 2013, and the Chair of the Remote Sensing and Mapping Technical Committee of the International Association for Pattern Recognition from 2010 to 2014. She has served as an Associate Editor for the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, the Journal of Applied Remote Sensing, and the IEEE Signal Processing Letters. Since 2016, she has been the Editor-in-Chief of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. She was the General Chair of the 4th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, Shanghai, in 2012.
