

FusionNet: Edge Aware Deep Convolutional Networks for Semantic Segmentation of Remote Sensing Harbor Images

Dongcai Cheng, Gaofeng Meng, Member, IEEE, Shiming Xiang, Member, IEEE, and Chunhong Pan

Abstract—Sea–land segmentation and ship detection are two prevalent research topics for optical remote sensing harbor images and find many applications in harbor supervision and management. As the spatial resolution of imaging technology improves, traditional methods struggle to perform well due to the complicated appearance and background distributions. In this paper, we unify the above two tasks into a single framework and apply deep convolutional neural networks to predict a pixelwise label for an input image. Specifically, an edge aware convolutional network is proposed to parse a remote sensing harbor image into three typical objects, i.e., sea, land, and ship. Two innovations are made on top of the deep structure. First, we design a multitask model by simultaneously training the segmentation and edge detection networks, where hierarchical semantic features extracted from the segmentation network are used to learn the edge network. Second, the outputs of the edge pipeline are further employed to refine the entire model by adding an edge aware regularization, which helps our method yield results that are spatially consistent and well boundary located. It also benefits the segmentation of docked ships, which is quite challenging for many previous methods. Experimental results on two datasets collected from Google Earth demonstrate the effectiveness of our approach in both quantitative and qualitative performance compared with state-of-the-art methods.

Index Terms—Edge aware regularization, harbor images, multitask learning, semantic segmentation.

Manuscript received May 12, 2017; revised July 24, 2017; accepted August 22, 2017. This work was supported in part by the National 863 projects under Grant 2015AA042307, in part by the National Natural Science Foundation of China under Grant 91338202, Grant 91438105, Grant 91646207, and Grant 61370039, and in part by the Beijing Natural Science Foundation under Grant 4162064. (Corresponding author: Shiming Xiang.)

The authors are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: dccheng@nlpr.ia.ac.cn; gfmeng@nlpr.ia.ac.cn; smxiang@nlpr.ia.ac.cn; chpan@nlpr.ia.ac.cn).

Digital Object Identifier 10.1109/JSTARS.2017.2747599

I. INTRODUCTION

SEA–LAND segmentation and ship detection are two fundamental and vital topics in the research field of remote sensing images. Sea–land segmentation aims to separate the sea area from the land exactly. The accuracy of the segmentation results is of great importance to coastline navigation, resource management, and protection [1]–[4]. It is also commonly used as a preprocessing step for ship detection [5], [6], and the boundary accuracy of the segmentation results directly influences the location of the regions of interest (ROIs) for docked ships. Ship detection and classification, in turn, play an important role in maritime security and in other applications such as traffic surveillance and sea pollution monitoring [7]. Ship classification is a further task built on top of ship detection; however, limited literature is available on it due to the lack of adequate remote sensing images for model learning.

The difficulties of harbor image processing mainly lie in three aspects. First, the complicated texture distribution in land regions affects the results of both sea–land segmentation and ship detection: traditional sea–land segmentation methods [6], [8] often suffer from misclassification in land areas, and ship detection methods [5], [9] produce false alarms within land regions, which is a key problem to be solved. Second, the intricate boundary between sea and land may cause undersegmentation. This is especially apparent for thin and fine-structured objects such as wharfs [10], and it also degrades inshore ship detection. Finally, disturbances from clouds, waves, and shadows influence the results: clouds and waves are often mistaken for land in sea–land segmentation, and shadows make it difficult to segment a spatially consistent ship body.

The methods for sea–land segmentation differ depending on the type of imagery used. For multispectral images, the normalized difference water index (NDWI) [11] is an important metric, which exploits the fact that the reflectance of water is near zero in the near-infrared band and high in the green band. Most works [12]–[14] on sea–land segmentation of multispectral images are based on the NDWI map; the processing steps are relatively simple and can obtain accurate results. For panchromatic images, most methods [6], [8], [15] are based on threshold selection, together with morphological operations to eliminate errors in the results. With the improvement of the spatial resolution of remote sensing images, threshold-based methods face problems when dealing with scenes with complicated texture and intensity distributions, and their results may contain many misclassifications in both land and sea regions.

Previous works for ship detection in optical remote sensing images can be coarsely categorized into methods for offshore ship detection and methods for inshore ship detection. The traditional processing flow of offshore ship detection includes ROI extraction, feature computation, and classification [6], [7], [16]–[20].


In the stage of ROI extraction, image segmentation approaches [16], [20] and wavelet transformation [7], [17] are frequently used. Since most extraction steps in those methods are unsupervised and employ only low-level spectral information, they sometimes suffer from a high false positive rate. In the stage of feature extraction and classification, shape features of the ship (such as the ratio of length to width) together with some texture features (e.g., local binary patterns [21], local multiple patterns (LMPs) [6], and the circle frequency (CF) feature [22]) are calculated and concatenated. The descriptors are then fed into machine learning models such as support vector machines [6], AdaBoost [22], and extreme learning machines [7] to get final predictions.

For inshore ship detection, ship head detection is critical. Many existing methods [5], [9], [23] employ the "V" shape feature of the head to identify the head point. Afterward, the ship body is located by line detection and some morphological operations. The processing steps of the above methods are usually complicated. Moreover, the parameters are often chosen empirically, which makes the results sensitive to their settings.

During the last several years, deep convolutional neural networks (DCNNs) have experienced rapid advances in many vision tasks such as object recognition [24]–[29], image classification [30], [31], and semantic segmentation [32]–[39]. Recently, many works [7], [40]–[42] have attempted to apply DCNNs to remote sensing images. However, there is still limited application to harbor images, mainly due to the lack of adequate images to train the models well. Harbor images also have their own characteristics compared with natural images. For instance, directly using the region proposal network of Faster R-CNN [27] to generate ship proposals may cause problems, since the orientation of the ship should also be considered.

To overcome the above shortcomings of existing sea–land segmentation and ship detection methods, we propose a novel deep model, i.e., FusionNet, to unify those two tasks into a single framework by tackling a three-class (land, sea, and ship) semantic segmentation problem. For an input image, the prediction of pixel-level labels can simultaneously accomplish segmentation and detection. Our network consists of two components: the segmentation network and the edge detection network. Specifically, the segmentation network produces a class label for each pixel, whereas the edge network outputs boundaries between different classes. The edge network is constructed by extracting hierarchical semantic features from the segmentation network and forms a multitask model with it. To further achieve fusion of both networks, we propose an edge aware regularization to encourage probability propagation among pixels within the same class. The segmentation results become spatially coherent by employing the semantic affinity between pixels.

This paper is distinguished by the following contributions.
1) We introduce a deep model, i.e., FusionNet, for remote sensing image segmentation. In this network, the tasks of both sea–land segmentation and ship detection are unified into a single framework. Our model can be employed to solve both the three-class (sea, land, and ship) and the two-class (sea and land) problems. To the best of our knowledge, it is the first attempt to use a semantic segmentation model to tackle the ship detection problem.
2) We design a segmentation and an edge detection network to form multitask learning. The edge network encodes each position by extracting semantic feature hierarchies from the segmentation network, which provides complementary cues to adjust the gradients of the entire model.
3) Based on the multitask design, we further propose an edge aware regularization to fuse the segmentation and edge networks. The outputs of the edge network are explicitly employed to refine the prediction maps of the segmentation network. Local smoothness propagates among pixels within the same class and is controlled by the semantic edges, which helps to obtain segmentation results that are spatially consistent and well boundary located.

The remainder of this paper is organized as follows. In Section II, we make a brief review of previous works. The structure of the proposed model is described in detail in Section III. In Section IV, we introduce the datasets used in this paper and conduct extensive comparative experiments with state-of-the-art methods. Conclusions are drawn in Section V.

II. PREVIOUS WORKS

A. Sea–Land Segmentation

Many early works on sea–land segmentation for optical remote sensing images are based on threshold selection [6], [8], [15]. For instance, Liu and Jezek [15] proposed an approach to determine the thresholds for local regions by fitting a bimodal Gaussian curve. The method is suitable for images with coarse spatial resolution, while the convergence of the fitting process to a global optimum is not guaranteed. You and Li [8] built a statistical model for the sea area to determine the threshold based on OTSU [43] results. Afterward, they rectified the connected mislabeled regions in the land results by evaluating their variances. However, the misclassification in the sea area cannot be resolved in this way. Recently, a supervised method [10] for sea–land segmentation has been proposed by integrating seeds learning and graph cut [44]. Since it employs superpixels to build the graph model, the results may contain undersegmentation for some thin and fine-structured objects, such as wharfs.

B. Ship Detection

For offshore ship detection, one of the hottest research issues in traditional methods is how to select valuable features. LMP [6] and CF [22] have proved to be effective in distinguishing ships from false alarms. Recently, DCNN-based models have been applied to ship detection. Tang et al. [7] extracted ship candidates using wavelet coefficients and learned high-level features by training two deep networks. Zou and Shi [40] trained two sets of deep filter banks to obtain ship candidates and semantic features, respectively, and constructed the low-level filter banks by singular value decomposition.

For inshore ship detection, the method of ship head detection is frequently used. Li et al. [9] identified the ship head by transforming pixels within a local window into the polar coordinate system, and searched for the ship body from the head according to the direction of line segments. Liu et al. [5] selected the ship

Fig. 1. Architecture of our proposed model. There are two outputs: segmentation results and edge detection results. We employ the encoder–decoder structure [36] as our segmentation network. Our edge network is built by extracting hierarchical semantic features from layers of the segmentation network. During back propagation, the edge output is used to further refine the entire model from the end. The output size of each layer is labeled on the top, whereas the channel numbers are labeled at the bottom. Detailed configurations of the above two networks are listed in Tables I and II.

head by corner detection and parabola fitting, and located the ship body through shape analysis. However, both of the above methods focus on the detection of ships with a "V" shape head. Recently, Liu et al. [45] proposed a supervised approach to generate ship proposals in the rotated bounding box (RBB) space. The learned model can work with various ships and achieves high recall.

C. Semantic Segmentation

The applications of DCNNs in semantic segmentation [32]–[34], [36], [37] have been prevalent since the appearance of fully convolutional networks (FCNs) [32], which convert the fully connected layers in a detection structure to convolutional layers and implement upsampling operations to obtain original-sized prediction maps. Several models promote performance on the basis of FCNs to obtain finer spatial layouts in the results. For instance, Chen et al. [34] increased the size of the dense maps by removing the last several subsampling layers and smoothed the edge results by linking a conditional random field (CRF). Studies in [33] and [36] refined semantic features by stacking a symmetrical deconvolutional and upsampling structure on top of FCNs. Recently, advances in instance segmentation [38], [46]–[49] have been driven by combining R-CNN-based detection models and FCN-based segmentation models. The basic idea is to simultaneously address object classes, boxes, and masks. For instance, DeepMask [48] learns to produce segment proposals, which are classified by Fast R-CNN. Dai et al. [49] designed a model to obtain instance-level segment candidates by computing position-sensitive maps. Mask R-CNN [46] achieves state-of-the-art results by adding a branch to Faster R-CNN to generate segment masks. Both detection and segmentation results are improved by multitask learning.

D. Multitask Learning

There are previous research works on training an edge network by expanding a pipeline from a segmentation network. For instance, semantic boundaries are defined as the linear combination of feature maps of FCNs in [50], and are employed to construct the pairwise term in the final optimization function. Chen et al. [51] computed edge predictions from intermediate layers of the segmentation network, and proposed a domain transform approach to align segmentation scores with object boundaries. However, our method is different in two aspects. First, we use the encoder–decoder model as our basic segmentation network, and extract rich hierarchical features to build the edge network. Compared with FCNs, the decoder network can generate semantically meaningful feature maps with high spatial resolution that maintain finer details, which is effective for predicting boundaries. Second, our entire model is learned end to end. The two networks are trained in parallel, which improves the robustness of the segmentation network by providing multiple regularizations.

III. STRUCTURE OF PROPOSED MODEL

In this section, we give a description of the configuration of our proposed model, i.e., FusionNet. First, we provide the structure of our segmentation network. Second, we describe the configuration of our edge network and how to train it in parallel with the segmentation network. Finally, a fusion strategy of how to employ the outputs of the edge network to refine the segmentation results is proposed.

A. Encoder–Decoder Network for Segmentation

Fig. 1 illustrates the overall architecture of our proposed model. There are two networks in our structure, i.e., the segmentation network and the edge network, which run in parallel in the training process. The segmentation network outputs probability maps that indicate how likely a pixel belongs to each class, and the edge network outputs the probability that a pixel belongs to the boundaries between different classes.

We employ SegNet [36] as the basic structure of our segmentation network for its ability to produce segments with finer details. As presented in Fig. 1, it consists of two components: the encoder network and the decoder network.

The encoder network is configured in the same way as FCNs and contains a series of convolution and pooling operations to achieve feature extraction. As the layers go deeper, we can obtain high-level features with more semantic information. However, the spatial information of the feature maps is lost and cannot be recovered by a simple interpolation to the original size of the input image. The SegNet model overcomes this problem by stacking a decoder network on top of the encoder part, which is a mirrored version of the encoder network and contains a series of convolutional and unpooling layers to restore the finer spatial layout of objects.

Table I presents the detailed configuration of our segmentation network. We adopt the strategy of the VGG network [31], in which the number of convolution channels is augmented with the increase of network depth. The input image has three channels with the size fixed at 300 × 300. There are four groups of convolutional layers (10 convolutional layers altogether) in the encoder network. For each convolutional layer, the kernel size is 3 × 3 and the stride is 1. Between two successive convolutional layers, batch normalization (BN) [52] and rectified linear units (ReLU) [53] are applied sequentially. Both BN and ReLU layers are employed to accelerate the learning process. After each group of convolutional layers, a max-pooling layer with a 2 × 2 kernel and stride 2 is added to achieve translation invariance. Therefore, there are four max-pooling operations in the encoder network, such that the size of the feature map at the end is 19 × 19. In the decoder network, we perform symmetrical operations, which consist of four unpooling layers and corresponding convolution, BN, and ReLU layers. After the last 32-channel convolutional layer, we add another convolutional layer with a three-channel output to fit the deep model to our three-class problem. Finally, a three-channel softmax layer is attached to obtain the prediction maps.

TABLE I
CONFIGURATION OF OUR SEGMENTATION NETWORK

Segmentation network

Encoder

Name      Kernel size  Stride  Pad  Output size
Input     –            –       –    300 × 300 × 3
Conv1-1   3 × 3        1       1    300 × 300 × 32
Conv1-2   3 × 3        1       1    300 × 300 × 32
Pool1     2 × 2        2       0    150 × 150 × 32
Conv2-1   3 × 3        1       1    150 × 150 × 64
Conv2-2   3 × 3        1       1    150 × 150 × 64
Pool2     2 × 2        2       0    75 × 75 × 64
Conv3-1   3 × 3        1       1    75 × 75 × 128
Conv3-2   3 × 3        1       1    75 × 75 × 128
Conv3-3   3 × 3        1       1    75 × 75 × 128
Pool3     2 × 2        2       0    38 × 38 × 128
Conv4-1   3 × 3        1       1    38 × 38 × 256
Conv4-2   3 × 3        1       1    38 × 38 × 256
Conv4-3   3 × 3        1       1    38 × 38 × 256
Pool4     2 × 2        2       0    19 × 19 × 256

Decoder

Unpool4   2 × 2        2       0    38 × 38 × 256
Conv4-1-D 3 × 3        1       1    38 × 38 × 256
Conv4-2-D 3 × 3        1       1    38 × 38 × 256
Conv4-3-D 3 × 3        1       1    38 × 38 × 128
Unpool3   2 × 2        2       0    75 × 75 × 128
Conv3-1-D 3 × 3        1       1    75 × 75 × 128
Conv3-2-D 3 × 3        1       1    75 × 75 × 128
Conv3-3-D 3 × 3        1       1    75 × 75 × 64
Unpool2   2 × 2        2       0    150 × 150 × 64
Conv2-1-D 3 × 3        1       1    150 × 150 × 64
Conv2-2-D 3 × 3        1       1    150 × 150 × 32
Unpool1   2 × 2        2       0    300 × 300 × 32
Conv1-1-D 3 × 3        1       1    300 × 300 × 32
Conv1-2-D 3 × 3        1       1    300 × 300 × 32
Output    3 × 3        1       1    300 × 300 × 3
Softmax   –            –       –    300 × 300 × 3

We use the encoder–decoder architecture for its ability to recover fine details. By adding "-D," we refer to layers in the decoder network. The BN and ReLU layers are omitted in the table for brevity.

We employ the cross entropy loss to train our segmentation network, which is defined as

$$\mathrm{Loss}_{seg}(l, p, \theta_{seg}) = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} -\sigma(l_i = k)\log p_{k,i} \quad (1)$$

where $l_i$ is the ground truth label of point $i$, and $p_{k,i}$ is the output probability that $i$ belongs to the $k$th class. $K$ is the number of classes and equals 3 in our segmentation problem. $\sigma(\cdot)$ is an indicator function: it takes 1 when $l_i = k$ and 0 otherwise. $\theta_{seg}$ is the parameter set of the segmentation network, and $N$ is the number of all pixels in the batch. Normalization over the batch is performed after the total loss is calculated.
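For concreteness, (1) can be written in a few lines of array code. The following is a minimal NumPy sketch; the model itself is implemented in Caffe [55], so the function name and shapes below are illustrative only.

```python
import numpy as np

def segmentation_loss(probs, labels):
    """probs: (K, H, W) softmax outputs; labels: (H, W) integers in [0, K)."""
    k, h, w = probs.shape
    n = h * w  # N in (1): all pixels of the batch (a single map here)
    # sigma(l_i = k) in (1) simply selects the probability of the true class.
    p_true = probs.reshape(k, -1)[labels.reshape(-1), np.arange(n)]
    return -np.log(p_true + 1e-12).sum() / n

# Sanity check: uniform predictions over K = 3 classes give loss = log(3).
probs = np.full((3, 2, 2), 1.0 / 3.0)
print(segmentation_loss(probs, np.zeros((2, 2), dtype=int)))
```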
B. Deep Edge Network

The results of semantic segmentation may contain mislabeled pixels around boundaries. Typically, the ship results often mix with the adjacent wharf due to the loss of semantic details caused by the pooling operations. This phenomenon is notable for objects with unclear boundaries. To this end, we propose a multitask framework by expanding a branch from the segmentation network and learning both the segmentation and edge networks simultaneously. The basic ideas behind this are: first, the training of the edge network provides complementary cues for better regularizing the boundary of adjacent objects; and second, we can further employ the outputs of the semantic edges to refine the segmentation results.

The structure of our edge network is illustrated in Fig. 1, and its detailed configuration is presented in Table II. We expand the edge network by extracting five groups of feature maps from the segmentation network. The lower level features contain local details and are extracted from the encoder network. The high-level features provide semantic information and are extracted from the decoder network. Specifically, for each of the five feature map groups, we first link it with a 1 × 1 convolutional layer, and then upsample it to the original size of the input. The 1 × 1 convolutional layer is employed for two purposes. First, it serves as a connection between the segmentation network and the edge network, which makes the entire model flexible for training task-specific yet closely related subnetworks. Second, it helps to lower the dimension of the feature maps, which reduces memory consumption and shortens the training and inference time. In Table II, for each feature group, the first layer indicates from which layer the branch is expanded. After convolution and unpooling, the final features are concatenated and fed into the classifier. To achieve this, we add a convolutional layer with a two-channel output and a softmax layer at the end of the edge network to obtain the final prediction maps.
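To make the wiring concrete, the sketch below shows how the five branches of Table II could be assembled in PyTorch. The original implementation is in Caffe, so the class names (EdgeBranch, EdgeHead) are our own, and nearest-neighbor interpolation stands in for the unpooling layers of Table II.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeBranch(nn.Module):
    """One branch of Table II: a 1x1 conv to 32 channels, then upsampling
    back to the 300x300 input size (interpolation stands in for unpooling)."""
    def __init__(self, in_channels, out_size=300):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 32, kernel_size=1)
        self.out_size = out_size

    def forward(self, x):
        return F.interpolate(self.conv(x), size=(self.out_size, self.out_size))

class EdgeHead(nn.Module):
    """Concatenates the five branches and classifies edge vs. non-edge."""
    # Input channels of Conv2-2, Conv3-3, Conv3-3-D, Conv2-2-D, Conv1-2-D.
    def __init__(self, in_channels=(64, 128, 64, 32, 32)):
        super().__init__()
        self.branches = nn.ModuleList(EdgeBranch(c) for c in in_channels)
        self.classify = nn.Conv2d(5 * 32, 2, kernel_size=3, padding=1)

    def forward(self, feats):  # feats: the five feature groups of Table II
        fused = torch.cat([b(f) for b, f in zip(self.branches, feats)], dim=1)
        return F.softmax(self.classify(fused), dim=1)
```

The concatenation yields 5 × 32 = 160 channels, matching the Concat row of Table II.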

TABLE II
CONFIGURATION OF OUR EDGE NETWORK

Edge network

Name        Kernel size  Stride  Pad  Output size

Feature 1
Conv2-2     3 × 3        1       1    150 × 150 × 64
Conv1-E     1 × 1        1       0    150 × 150 × 32
Unpool1-E   2 × 2        2       0    300 × 300 × 32

Feature 2
Conv3-3     3 × 3        1       1    75 × 75 × 128
Conv2-E     1 × 1        1       0    75 × 75 × 32
Unpool2-1-E 2 × 2        2       0    150 × 150 × 32
Unpool2-2-E 2 × 2        2       0    300 × 300 × 32

Feature 3
Conv3-3-D   3 × 3        1       1    75 × 75 × 64
Conv3-E     1 × 1        1       0    75 × 75 × 32
Unpool3-1-E 2 × 2        2       0    150 × 150 × 32
Unpool3-2-E 2 × 2        2       0    300 × 300 × 32

Feature 4
Conv2-2-D   3 × 3        1       1    150 × 150 × 32
Conv4-E     1 × 1        1       0    150 × 150 × 32
Unpool4-E   2 × 2        2       0    300 × 300 × 32

Feature 5
Conv1-2-D   3 × 3        1       1    300 × 300 × 32
Conv5-E     1 × 1        1       0    300 × 300 × 32

Concat      –            –       –    300 × 300 × 160
Output      3 × 3        1       1    300 × 300 × 2
Softmax     –            –       –    300 × 300 × 2

Each feature group is expanded from a convolutional layer of the segmentation network. The final features are concatenated and fed into a convolutional layer with a two-class output. By adding "-E," we refer to layers in the edge network.

Fig. 2. Semantic segmentation results of SegNet and our method. (a) Original image. (b) Result of SegNet. (c) Result of FusionNet (ours). (d) Ground truth. Improved regions are marked by yellow rectangles for better comparison. Compared with SegNet, our model can obtain segmentation results with higher spatial consistency and boundary accuracy.
We also utilize the cross entropy loss to train the edge network:

$$\mathrm{Loss}_{edg}(y, p^e, \theta_{edg}) = \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} -\sigma(y_i = c)\log p^e_{c,i} \quad (2)$$

where $y_i$ is the true edge label of point $i$, and $p^e_{c,i}$ denotes the probability that $i$ belongs to the $c$th class. The superscript $e$ indicates a probability from the edge network. $C$ is the number of classes and equals 2 in the edge detection problem. $\theta_{edg}$ is the parameter set of the edge network.

The entire model is learned end to end. During training, the two networks are trained in parallel. Parameters of the segmentation network are shared by the two networks and can be updated jointly, whereas parameters of the last classification layers of the two networks as well as the expanding branch of the edge network are adjusted separately.

We input the network with the original image and its ground truth segmentation map. To obtain the reference map for the edge network, we add another layer that automatically calculates it from the ground truth map of the segmentation network. Specifically, if a point has a different label from one of its eight neighbors, we define it as a true edge point. At the end, the losses of both networks are calculated and the gradients are back propagated. In the inference phase, we can simultaneously obtain segmentation results and semantic boundaries for an input image.
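The reference-map layer can be reproduced with a few lines of array code. Below is a NumPy sketch of the eight-neighbor rule just described; the function name and the replicate border handling are our assumptions.

```python
import numpy as np

def edge_ground_truth(labels):
    """labels: (H, W) integer label map. Returns 1 where a pixel differs
    from any of its eight neighbors (a true edge point), else 0."""
    h, w = labels.shape
    padded = np.pad(labels, 1, mode="edge")  # replicate the border labels
    edge = np.zeros((h, w), dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy or dx:  # skip the center pixel itself
                shifted = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                edge |= shifted != labels
    return edge.astype(np.uint8)
```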

C. Edge Aware Network Regularization

The spatial resolution loss of the prediction maps caused by the pooling operations leads to blobby segments that lack fine object details [50]. This results in two common problems in the semantic segmentation of harbor images. One is the coarsely segmented boundaries between a ship and its neighboring wharf. As presented in Fig. 2(b), the results often contain a mixture around the boundaries due to low contrast, which directly influences the ship detection accuracy. The other is the misclassification in land areas, which brings difficulty to coastline extraction and increases the ratio of false alarms of ships.

In Section III-B, we have introduced a multitask learning framework. The incorporation of the edge network has two advantages: on one hand, the multitask formulation helps to optimize the segmentation model; on the other hand, we can further improve the performance of the whole model with the outputs of the edge network. To this end, a regularization method to achieve the fusion of the two networks is introduced in this section.

Specifically, we employ the outputs of the edge network to regularize the prediction maps of the segmentation network. For each point, a local smooth loss with respect to its neighbors is defined as

$$\mathrm{Loss}_{re,i} = \frac{1}{8K}\sum_{j \in Nb(i)}\sum_{k=1}^{K}(p_{k,i} - p_{k,j})^2\, e^{-\frac{p^e_{1,j}}{1 - p^e_{1,j}}} \quad (3)$$

where $j$ is a neighbor pixel of $i$, $Nb(i)$ is the eight-neighbor system of $i$, $k$ denotes the channel of the prediction maps of the segmentation network and $K$ is the number of channels, $p_{k,i}$ is the segmentation probability of $i$ on channel $k$, and $p^e_{1,j}$ depicts the probability that $j$ belongs to an edge, where the superscript $e$ indicates a probability from the edge network.

We define the above pixelwise loss in a similar way to the pairwise term used in semisupervised methods. Actually, the total smooth loss within an image can be simplified as

$$\mathrm{Loss}_{re} = \frac{1}{4K}\,\mathrm{Tr}(P^T L P) \quad (4)$$

where $P = (p_0, p_1, p_2) \in \mathbb{R}^{N \times 3}$, $N$ is the number of total pixels within the image, and 3 denotes the number of classes in our segmentation problem. Each vector $p_i$, $i = 0, 1, 2$, corresponds to the $i$th probability map of the segmentation network; the probability map is converted to a vector by rows. $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. $L$ is the Laplacian matrix, constructed by $L = D - W$, where $W \in \mathbb{R}^{N \times N}$ is the adjacency matrix and each of its elements is constructed by the edge aware pairwise term $w_{ij} = e^{-\frac{p^e_{1,j}}{1 - p^e_{1,j}}}$. The $W$ matrix is sparse and each row has only eight nonzero elements, which correspond to the eight neighbors of $i$. $D$ is a diagonal matrix with $d_{ii} = \sum_{j=1}^{N} w_{ij}$.

In (3), the smaller the probability that the neighbor pixel belongs to an edge, the more similar the segmentation probabilities of the two pixels should be. The smoothing is faded out as the likelihood that the neighbor pixel belongs to an edge increases. In this way, the segmentation maps are aligned with the boundaries between different classes, which helps to obtain finer results with more spatial consistency.

The gradients of $\mathrm{Loss}_{re,i}$ with respect to $p_{k,i}$, $p_{k,j}$, and $p^e_{1,j}$ are calculated as

$$\frac{\partial \mathrm{Loss}_{re,i}}{\partial p_{k,i}} = \frac{1}{4K}\sum_{j \in Nb(i)}(p_{k,i} - p_{k,j})\, e^{-\frac{p^e_{1,j}}{1 - p^e_{1,j}}} \quad (5)$$

$$\frac{\partial \mathrm{Loss}_{re,i}}{\partial p_{k,j}} = \frac{1}{4K}(p_{k,j} - p_{k,i})\, e^{-\frac{p^e_{1,j}}{1 - p^e_{1,j}}} \quad (6)$$

$$\frac{\partial \mathrm{Loss}_{re,i}}{\partial p^e_{1,j}} = \frac{-1}{8K(1 - p^e_{1,j})^2}\sum_{k=1}^{K}(p_{k,i} - p_{k,j})^2\, e^{-\frac{p^e_{1,j}}{1 - p^e_{1,j}}}. \quad (7)$$

From (5), we can see that the smaller the edge probability of the neighbor pixel and the more different the segmentation probabilities of the two pixels, the larger $\partial \mathrm{Loss}_{re,i}/\partial p_{k,i}$ is. That means the gradient is controlled by both the outputs of the semantic edges and the segmentation results. When the neighbor pixel has a large edge probability, the gradient is dominated by the small value of the exponential term, such that the influence of the edge aware regularization in back propagation is negligible.
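The total of (3) over an image can be evaluated directly, which is also convenient for checking (5)–(7) by finite differences. The following NumPy sketch makes the shape conventions and the replicate border handling explicit; both are our assumptions.

```python
import numpy as np

def edge_aware_loss(seg_probs, edge_prob):
    """seg_probs: (K, H, W) segmentation probabilities; edge_prob: (H, W)
    map of p^e_1 from the edge network. Returns sum_i Loss_re,i as in (3)."""
    k_ch, h, w = seg_probs.shape
    # Pairwise weight w_ij of (3)-(4), evaluated at the neighbor pixel j.
    weight = np.exp(-edge_prob / (1.0 - edge_prob + 1e-12))
    p_pad = np.pad(seg_probs, ((0, 0), (1, 1), (1, 1)), mode="edge")
    w_pad = np.pad(weight, 1, mode="edge")
    total = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy or dx:  # the eight-neighbor system Nb(i)
                p_j = p_pad[:, 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                w_j = w_pad[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                total += ((seg_probs - p_j) ** 2 * w_j).sum()
    return total / (8.0 * k_ch)
```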
We define the final loss function of our network by combining the above three loss terms:

$$\mathrm{Loss}_{FusionNet} = \mathrm{Loss}_{seg} + \mathrm{Loss}_{edg} + \lambda\,\mathrm{Loss}_{re} = \frac{1}{N}\sum_{i=1}^{N}\left\{\sum_{k=1}^{K} -\sigma(l_i = k)\log p_{k,i} + \sum_{c=1}^{C} -\sigma(y_i = c)\log p^e_{c,i} + \frac{\lambda}{8K}\sum_{j \in Nb(i)}\sum_{k=1}^{K}(p_{k,i} - p_{k,j})^2\, e^{-\frac{p^e_{1,j}}{1 - p^e_{1,j}}}\right\}. \quad (8)$$

The parameter $\lambda$ in (8) is employed to adjust the relative weight of the regularization term. If $\lambda$ is too large, the segmentation results will be oversmoothed and the boundary accuracy of neighboring classes will be degraded. We set $\lambda$ to 10 in this paper and find a remarkable performance improvement brought by incorporating the edge aware loss.

In the training process, the parameters of both the segmentation network and the edge network are updated according to their respective cross entropy losses as well as the edge aware loss. In practice, we first train the multitask model in parallel, and then fine-tune it using (8). Implementation details can be found in Section IV-B.

The proposed edge aware regularization has some similarity with a CRF. Actually, we can treat $\mathrm{Loss}_{seg}$ in (8) as the unary term of a CRF, and $\mathrm{Loss}_{re}$ as the pairwise term. In the process of fine-tuning, local consistency among neighboring pixels is encouraged and controlled by the regularization, which also resembles a CRF. In spite of that, our method differs from a CRF in some aspects. First, the proposed model is end to end, which makes our method flexible and efficient, while a CRF is often employed as a postprocessing step after training in CNN-based models. Second, the regularization term in our method is built from semantic edges, which is more effective in measuring the semantic similarity between pixels than the low-level color-based pixel affinity function used in a CRF.
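Reusing the two sketches above, the fine-tuning objective (8) can be assembled as follows. The channel layout of the edge output (channel 1 is the edge class) and the function name are assumptions for illustration.

```python
def fusionnet_loss(seg_probs, edge_probs, labels, edge_labels, lam=10.0):
    """Combines (1), (2), and the edge aware term as in (8); lam is the
    lambda of (8), set to 10 in the paper. edge_probs: (2, H, W)."""
    n = labels.size
    loss = segmentation_loss(seg_probs, labels)          # Loss_seg, K = 3
    loss += segmentation_loss(edge_probs, edge_labels)   # Loss_edg, C = 2
    # edge_aware_loss returns sum_i Loss_re,i, so divide by N as in (8).
    loss += lam * edge_aware_loss(seg_probs, edge_probs[1]) / n
    return loss
```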
IV. EXPERIMENTS

In this section, we first introduce the two datasets used in this paper, and then present both quantitative and qualitative comparisons with state-of-the-art methods.

A. Dataset Description and Augmentation

We evaluate the performance of our proposed model on two datasets. One is the dataset of harbor images, which contains land, sea, and ship and is utilized for the segmentation of the three-class problem. The other is the sea–land dataset, which is used for the segmentation of the two-class problem. Images in both datasets are RGB and collected from Google Earth, which is a suitable platform for providing adequate remote sensing images to train the deep model.

1) Harbor Image Dataset: This dataset contains 380 images with a spatial resolution of 1.0 m. Among them, 300 images are used for training, ten images for validation, and 70 images for testing. Most of the images have sizes larger than 800 × 800. The numbers of ships for training, validation, and testing are 1654, 69, and 397, respectively. Most of them are military ships. The segmentation ground truth of all the images is labeled by hand, and the ground truth maps of the edge network are automatically calculated during training. Note that we have not labeled ship targets of too small a size as ship, since they present unclear visual textures. Actually, we treat them as part of the land due to their high intensities compared with the sea area. In this paper, we only consider ships that have apparent ship features and abandon small targets like boats.

The training and validation datasets are augmented by cropping and rotating. Since the ship head can provide valuable semantic features for ship detection, we extract training patches that are centered at the head, tail, and middle of the ship,

respectively. Moreover, we scan the boundary of sea and land, and sample patch centers with a distance of 30. The patch centers are then screened to avoid getting too close to each other. We further rotate each patch in steps of 90°, and then flip the rotated patches in horizontal and vertical reflections as in [54]. There are in total 78 296 training patches, each of size 300 × 300, obtained in this way. The validation dataset is only cropped, to reduce the testing time during training. The number of cropped patches is 432.

2) Sea–Land Image Dataset: This dataset contains 140 training images, six validation images, and 50 testing images. Most have sizes larger than 800 × 800. We collect those images with spatial resolutions of 3.0–5.0 m to enlarge the imaging scope. The sea backgrounds in those images are diverse and complicated, which makes the segmentation task challenging. There are very few ships with small sizes in these sea–land scenes. They are labeled as land in the ground truth.

We augment the training set by first cropping the four corners and the center of the original image with size 300 × 300. Then, we further crop the original image randomly to generate another 20 samples. After that, each of the 25 patches is rotated and flipped as mentioned above to further enlarge the set eight times. There are in total 28 000 training samples obtained in this way. Images in the validation set are also just cropped, yielding 150 samples.
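The eight-fold rotation-and-flip enlargement can be written compactly. The sketch below assumes patches are (H, W, 3) arrays and enumerates the eight dihedral variants (four 90° rotations, each together with its mirrored copy).

```python
import numpy as np

def augment_patch(patch):
    """patch: (H, W, 3) array. Returns the eight rotated/flipped variants
    used to enlarge each cropped patch eight times."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degrees
        rotated = np.rot90(patch, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # reflection of this rotation
    return variants
```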
B. Implementation Details

The network is implemented with the Caffe [55] library. The training phase includes two stages. First, we train the segmentation network and the edge network in parallel. Then, the overall network is fine-tuned according to (8). Satisfactory results can be obtained in this way. We set the batch size to 4 and the initial learning rate (lr) to 0.001, and decrease the lr to one-tenth when the accuracy on the validation set stops increasing. The network converges after 80 000 iterations during training and 30 000 iterations during fine-tuning. The value of λ in (8) is set to 10 in this paper, at which we can obtain satisfactory results. A detailed analysis of the influence of λ is given in Section IV-G.

At the testing phase, the input size of the network is adjusted to 800 × 800 to increase the segmentation accuracy. The input size for each layer of the network is also adjusted accordingly. We process each testing image with bilinear interpolation, and reinterpolate the outputs to the original size after the forward pass. The negative effect brought by the interpolation is negligible since our testing images have sizes similar to 800 × 800.

C. Evaluation Criteria

To evaluate the segmentation results of different methods, we employ three metrics, i.e., Precision, Recall, and F-score, which are defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (9)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (10)$$

$$\mathrm{F\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (11)$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. Precision measures the accuracy of the predicted points, and Recall corresponds to the fraction of true points of the class that are correctly predicted. F-score is an overall metric that combines Precision and Recall.
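For one class $k$, (9)–(11) reduce to a few array comparisons, as in the following NumPy sketch.

```python
import numpy as np

def class_scores(pred, gt, k):
    """pred, gt: (H, W) label maps; k: class index. Implements (9)-(11)."""
    tp = np.sum((pred == k) & (gt == k))   # true positives
    fp = np.sum((pred == k) & (gt != k))   # false positives
    fn = np.sum((pred != k) & (gt == k))   # false negatives
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f_score
```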
To assess the boundary results of different methods, we first define Boundary Precision (BP) and Boundary Recall (BR) in the same way as (9) and (10):

$$\mathrm{BP} = \frac{\#\ \text{of correct boundary points on the segmentation boundary}}{\#\ \text{of boundary points on the segmentation boundary}} \quad (12)$$

$$\mathrm{BR} = \frac{\#\ \text{of correct boundary points on the segmentation boundary}}{\#\ \text{of true boundary points}}. \quad (13)$$

A boundary point in the segmentation results is considered correct if the distance between it and its closest ground truth boundary is smaller than $N$, where $N$ is set to different values. The F-score of the boundary is then defined as

$$\mathrm{F\text{-}score\text{-}b} = \frac{2 \cdot \mathrm{BP} \cdot \mathrm{BR}}{\mathrm{BP} + \mathrm{BR}}. \quad (14)$$

In the experiments, we assess the boundary accuracy of both the segmentation results and the outputs of the edge network.
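BP, BR, and F-score-b can be computed with a distance transform, as in the sketch below. The use of scipy.ndimage.distance_transform_edt (i.e., Euclidean distance) is our choice, since the text does not specify the distance measure.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_f_score(pred_edge, gt_edge, n=2):
    """pred_edge, gt_edge: (H, W) boolean boundary maps; n: tolerance in
    pixels. A predicted boundary point counts as correct when its closest
    true boundary point lies within n, as in (12)-(14)."""
    dist_to_gt = distance_transform_edt(~gt_edge)  # distance to true boundary
    correct = np.sum(dist_to_gt[pred_edge] <= n)   # correct predicted points
    bp = correct / max(pred_edge.sum(), 1)         # (12)
    br = correct / max(gt_edge.sum(), 1)           # (13)
    return 2 * bp * br / max(bp + br, 1e-12)       # (14)
```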

D. Comparative Results on Harbor Image Dataset

We now discuss the performance of the proposed FusionNet on the semantic segmentation of land, sea, and ship, and present qualitative and quantitative comparisons with state-of-the-art methods. Five CNN-based models are compared, and their details are summarized as follows.
1) FCNs [32]: We build the FCNs in the same way as the encoder network in Fig. 1. The last pooling layer is removed; therefore, there are in total three pooling operations. After that, we unpool the outputs of conv4_3 and concatenate them with the outputs of conv3_3. Then, a four-times unpooling layer and a convolutional layer with a three-channel output are sequentially added. Finally, a softmax layer is used to obtain the prediction maps.
2) Deeplab [34]: We set the stride of the last two pooling layers in the encoder network to 1 and adopt the "holes" strategy in the fourth group of convolutional layers. The outputs are upsampled four times to the original size.
3) SegNet [36]: The SegNet shares its configuration with our segmentation network, i.e., the encoder–decoder model.
4) SegNet+EdgeNet: The parallel model of the segmentation and edge networks. Both networks are trained simultaneously to generate the segmentation results and edge detection results.
5) FusionNet: The final architecture of our proposed network. Compared with SegNet+EdgeNet, we employ (8) to further fine-tune it to achieve the fusion of the two networks.

We present quantitative comparisons of the different methods on the testing set in Table III. A visual illustration of the segmentation results for military ships is exhibited in Fig. 3, and Fig. 4 illustrates visual results for several nonmilitary ships. We discuss the quantitative and qualitative results separately.

TABLE III
AVERAGE SEGMENTATION RESULTS OF DIFFERENT METHODS ON TESTING IMAGES OF THE HARBOR IMAGE DATASET

Values displayed in bold and blue are the best and second best among the five original methods, whereas values displayed in red are the best among the five methods that employ a CRF as the postprocessing step.

Fig. 3. Visual comparison of segmentation results with military ships. (a) Original image. (b) Result of FCNs. (c) Result of SegNet. (d) Result of FusionNet (ours). (e) Ground truth. (f) Output of the edge network. In (b), (c), and (d), improved regions of our method compared with FCNs and SegNet are marked by red rectangles. FusionNet achieves more consistent segmentation with finer boundary results by incorporating semantic edges.

Table III presents the quantitative performance of the different comparison methods on the semantic segmentation of land, sea, and ship. We calculate the Precision, Recall, and F-score of all three classes, and give the average F-score of the three classes in the last column. In each column of Table III, the best value is displayed in bold, and the second best is displayed in blue. We also evaluate the performance of the comparison methods with an additional postprocessing step by adding a CRF [34] layer, which can help to reduce the misclassifications in land areas and obtain smoother results.

From Table III, we can see that FusionNet achieves the best results on the F-score of all three classes. In particular, the F-score value on the ship class of our method is more than 6% higher than that of SegNet, and the average F-score of our method is more than 2% higher than that of SegNet. Compared with the three traditional networks, our parallel model (SegNet+EdgeNet) also achieves better performance. Specifically, the ship F-score and average F-score of SegNet+EdgeNet are more than 3% and 1%

Fig. 4. Visual comparison of segmentation results with nonmilitary ships. (a) Original image. (b) Result of FCNs. (c) Result of SegNet. (d) Result of FusionNet (ours). (e) Ground truth. (f) Output of the edge network. We mark improved regions of our method compared with FCNs and SegNet by red rectangles. Results within the yellow rectangles in the first row illustrate an example of a ship false alarm that exists in all three methods.

TABLE IV
F1-SCORE-B (%) OF DIFFERENT METHODS ON THE HARBOR IMAGE DATASET

Best and second best results of the segmentation methods are displayed in bold and blue, respectively. Best values of the outputs of the edge network are displayed in red.

higher than those of SegNet, which has the best values on these two metrics among the three traditional methods.

It is worth noting that the three traditional methods also perform well on the segmentation of sea and land due to the rich hierarchical feature learning ability of CNNs. The accuracy improvement of our method on sea and land is not as remarkable as that on ship. One of the reasons is that, compared with sea and land, the ship regions are relatively small, which makes the segmentation results fluctuate more across different methods. In spite of that, the visual improvement of our method on sea–land is still obvious, as the results in the fourth row of Fig. 3 and the comparisons in Fig. 7 show. Moreover, a remarkable improvement of our method can be observed by comparing the boundary accuracy of the different methods in Table IV.

The F-score metrics of all the comparison methods are enhanced by adding the CRF layer. Note that the ship Precision is significantly improved due to the smoothing operation within the land region, whereas the ship Recall is degraded because the pixel-level smoothness constraint in the CRF blurs the segmented boundaries between the ship and its adjacent class. Examples of the CRF effect are presented in Fig. 5. The first two rows exhibit that the misclassifications in land areas are reduced with the CRF. The last two rows show zoomed-in results of the regions within the blue rectangles of the first two rows, from which we can see that the boundary accuracy of the ship is decreased due to the low contrast around it.

It is worth noting that the average F-score of FusionNet is more than 1% higher than that of SegNet+EdgeNet_CRF, which demonstrates that employing semantic edges is more effective in measuring the semantic similarity between pixels than a spectral-based affinity function.

Figs. 3 and 4 present visual comparisons of the different methods on military ships and nonmilitary ships, respectively. Outputs of the edge network are illustrated in the last column. We can see that our method also performs well on nonmilitary ships, in spite of rather few training samples, since they share many features with military ships. The results of FusionNet contain fewer false alarms of ships, as shown by the comparisons in the red rectangles of the fourth row in Fig. 3 and the first row in Fig. 4. Results in the first and third rows of Fig. 3 demonstrate that our method can better reduce the misclassification of neighboring ships and wharfs. In addition, our method is more robust to shadows around the ship body, as the ship results in the second row of Fig. 3 show.

The boundary accuracy of the segmentation results of the different methods is further calculated. We also evaluate the outputs of our edge network. Qualitative results are presented in Fig. 6. Detailed values are listed in Table IV. The distance tolerance N

Fig. 5. Visual presentation of the CRF effect. (a) Original image. (b) Result of FusionNet. (c) Result of FusionNet_CRF. (d) Ground truth. The first two rows illustrate comparisons by adding the CRF layer. Improved regions are marked by red rectangles. The last two rows show zoomed-in results of the regions within the blue rectangles in (a) of the first two rows. The CRF operation can smooth results in land regions, while it may also blur the ship boundaries.

Fig. 6. (a) Boundary accuracy of segmentation results on the harbor image dataset. (b) Accuracy of the outputs of the edge network on the harbor image dataset.

is set from 1 to 5. Compared with the parallel model (SegNet+EdgeNet), FusionNet achieves a significant improvement both in the boundary accuracy of the segmentation results and in the outputs of the edge network, which demonstrates the effectiveness of the edge aware regularization in refining the entire model. Note that for both the parallel model and FusionNet, the edge network achieves a higher F-score-b value than that of its corresponding segmentation results. The intuitive explanation is that our edge network is expanded from the segmentation network by adding additional convolutional and unpooling layers, which improves its capacity to learn a more task-specific model.

E. Comparative Results on Sea–Land Image Dataset

Experiments are conducted to evaluate our model on the sea–land dataset, which poses a two-class segmentation problem. We adjust the output channels of the last convolutional layer and softmax layer in the segmentation network to 2. The following five methods are compared to assess their performance.
1) LATM [15]: Locally adaptive thresholding method. It automatically sets thresholds for each local region by fitting a bimodal Gaussian curve. The final results are processed by morphological operations.
2) SMS [8]: Global thresholding method, which builds a statistical model for the sea area based on the OTSU results. The "holes" in the land areas of the results are filled according to the variance difference between the sea model and themselves.
3) SegNet [36]: The SegNet shares its configuration with our segmentation network.
4) SegNet+EdgeNet: The parallel model of the segmentation and edge networks.

Fig. 7. Visual comparison of sea–land segmentation with different methods. (a) Original image. (b) Result of LATM. (c) Result of SMS. (d) Result of SegNet. (e) Result of FusionNet. (f) Ground truth. Improved regions in the FusionNet results compared with SegNet are marked by red rectangles.

TABLE V
AVERAGE SEGMENTATION RESULTS OF DIFFERENT METHODS ON TESTING IMAGES OF THE SEA–LAND IMAGE DATASET

Values displayed in bold and blue are the best and second best among the five comparison methods.

Fig. 8. (a) Boundary accuracy of segmentation results on the sea–land image dataset. (b) Accuracy of the outputs of the edge network on the sea–land image dataset.

5) FusionNet: The final architecture of our proposed network, which is fine-tuned on the basis of SegNet+EdgeNet.

We implement LATM and SMS ourselves in C++, and fill the connected "hole" regions whose sizes are smaller than 200 in their land results to get a better comparative effect.

Fig. 7 illustrates a visual comparison of the five segmentation methods. It can be seen that both LATM and SMS are prone to misclassify land pixels due to their complicated intensity and texture distributions. The misclassifications cannot be simply resolved by morphological operations. The CNN-based methods achieve better segmentation through hierarchical feature learning.

TABLE VI
F1-SCORE-B (%) OF DIFFERENT METHODS ON THE SEA–LAND IMAGE DATASET

Best and second best boundary results of the segmentation methods are displayed in bold and blue, respectively. Best values of the outputs of the edge network are displayed in red.

TABLE VII
COMPARATIVE RESULTS OF SHIP DETECTION WITH INSHORE SHIP DETECTION METHODS

Method         Precision  Recall  F-score
Method in [5]  91.43      54.70   68.45
Method in [9]  88.99      82.91   85.84
Ours           94.61      92.94   93.77

Best values are displayed in bold.

Fig. 9. Ship detection results based on the segmentation results of FusionNet. (a) Original image. (b) Result of FusionNet. (c) Ground truth. The RBB and ship direction are drawn in the input.

Compared with SegNet, FusionNet obtains smoother land results as well as finer segmented boundaries by incorporating the edge information.

We exhibit the quantitative results in Table V. Our method achieves the best performance on all of the evaluation metrics. Specifically, FusionNet has almost a 0.25% improvement in the land F-score and a 0.18% improvement in the sea F-score with respect to SegNet. LATM and SMS have relatively lower land Recall, which corresponds to more misclassifications within the land regions.

An evaluation of the sea–land boundary is further conducted, and the results are presented in Fig. 8 and Table VI. From Fig. 8(a), we can see that the segmentation results of FusionNet and SegNet+EdgeNet have the best and second best boundary accuracy. The traditional methods based on threshold selection and morphological operations perform poorly on F-score-b due to the existence of many errors in their segmented land regions. The results in Fig. 8(b) demonstrate the effectiveness of our edge network on the sea–land problem.

F. Comparative Results With Ship Detection Methods

To better assess the performance of our proposed model, we compare the ship detection results with two inshore ship detection methods, i.e., the methods in [5] and [9]. Both methods are based on ship head detection and are only suitable for "V" shape heads. The former is a supervised method, which transforms local regions around the ship head into the polar coordinate system to extract its shape feature; an SVM classifier is then trained to predict the ship head. The latter is unsupervised and extracts the ship head by parabola fitting. After acquiring the head point, both methods employ line detection to identify the ship body.

We select 30 images from the testing set of the harbor image dataset to evaluate the performance of the three methods. They contain 117 ships with "V" shape heads. The segmentation results of FusionNet are used to extract the RBBs. As for the method in [9], we select 80 images from the training set, which contain 360 ships, to train the head classifier.

To better exhibit the detection results of our method, we calculate the RBB and direction of the segmented ships. Specifically, after acquiring the segmentation map, we first calculate the connected ship regions and compute their center locations. Then, for each ship region, we traverse the angles between 0 and π. For each angle, a local coordinate system at the center of the connected region is built, and the pixels within the region are projected onto the short axis. The best angle is found such that the sum of the lengths of the projections is the minimum. The length and width of the ship can be further computed by transforming the ship region into the local coordinate system with the best angle.
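The angle search can be sketched as follows. Here the spread of the short-axis projections is minimized as a stand-in for the summed projection length described above, and the function name and 1° angular step are our assumptions.

```python
import numpy as np

def best_ship_angle(ys, xs, step_deg=1.0):
    """ys, xs: pixel coordinates of one connected ship region. Tries angles
    in [0, pi) and keeps the one minimizing the short-axis projection extent."""
    cy, cx = ys.mean(), xs.mean()          # local coordinate system origin
    best_angle, best_extent = 0.0, np.inf
    for angle in np.arange(0.0, np.pi, np.deg2rad(step_deg)):
        # project centered pixels onto the axis perpendicular to the
        # candidate ship direction (the short axis)
        proj = (xs - cx) * (-np.sin(angle)) + (ys - cy) * np.cos(angle)
        extent = proj.max() - proj.min()   # spread along the short axis
        if extent < best_extent:
            best_angle, best_extent = angle, extent
    return best_angle  # ship length/width follow by projecting at this angle
```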
Comparative results are listed in Table VII. The method in [5] has a low ship Recall because it performs sea–land segmentation before head detection, which produces an unsatisfactory land mask with coarse boundaries for identifying the head position. Moreover, both traditional methods are sensitive to the contrast near the head regions, which is another reason for the low Recall. Compared with them, our proposed method achieves better results for inshore ship detection by learning rich hierarchical features to discriminate between the ship and its adjacent region. More detection results of our method are illustrated in Fig. 9.

Fig. 10. Influence of λ on the testing images of harbor image dataset. The value of λ is selected from coarse to fine. When λ is 10, the F-score has the best value.

Fig. 11. Visual comparison of segmentation results when λ is set at different values. When λ is too big, the boundary results are worse due to the strong smoothing
effect brought by edge aware regularization term.

G. Influence of Parameter λ

In this section, we further discuss the influence of λ in (8) on the final segmentation results. In (8), λ is employed to adjust the smoothing effect that is controlled by the outputs of the edge network. If the value of λ is too small, the edge aware regularization is negligible and FusionNet degenerates into SegNet+EdgeNet. However, if λ is too large, it oversmooths the image and decreases the boundary accuracy of the segmentation results.

We have evaluated the influence of λ on the testing images of the harbor image dataset. The average F-score of the three classes is presented in Fig. 10. The value of λ is first selected in a coarse range from 0.1 to 1000. Then, it is selected in a finer range from 10 to 100. It can be seen that the F-score reaches its best value when λ is 10. Fig. 11 presents the visual effect when λ is too large: an overly large value of λ decreases the boundary accuracy of the segmentation results because of the strong smoothing effect of the edge aware regularization.
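The coarse-to-fine selection can be summarized by the short sketch below. Here train_and_evaluate is a hypothetical helper that retrains FusionNet with a given λ and returns the average F-score of the three classes on the testing images; the two candidate grids simply mirror the ranges reported above.

# Illustrative coarse-to-fine selection of the regularization weight λ.
# `train_and_evaluate` is a hypothetical helper, not part of our codebase.
import numpy as np

def coarse_to_fine_search(train_and_evaluate):
    # Coarse sweep over orders of magnitude (0.1 to 1000, as above).
    scores = {lam: train_and_evaluate(lam)
              for lam in (0.1, 1.0, 10.0, 100.0, 1000.0)}
    best = max(scores, key=scores.get)
    # Finer sweep starting from the best coarse value (10 to 100 above).
    for lam in np.linspace(best, 10.0 * best, num=10):
        scores[lam] = train_and_evaluate(lam)
    return max(scores, key=scores.get)

On our testing images, this procedure settles on λ = 10, consistent with Fig. 10.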
V. CONCLUSION

In this paper, an edge aware DCNN has been proposed to address the semantic segmentation of land, sea, and ship. Specifically, we constructed the edge network by expanding a branch from the encoder–decoder structure and trained it in parallel to learn semantic boundaries. The outputs of the edge network are further employed to refine the entire model with an edge aware regularization term, which encourages probability propagation among neighboring pixels under the constraint of semantic boundaries.

Extensive experiments have demonstrated the advantages of FusionNet in comparison with state-of-the-art methods. Our model performs better in producing spatially consistent results with well-located boundaries. We have also evaluated FusionNet on the sea–land image dataset to address the two-class problem and achieved better performance in quantitative and qualitative comparisons. Moreover, the ship detection results built on the segmentation results have proved that our method is effective in comparison with inshore ship detection methods.
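Although the regularization acts during training through the term in (8), its intended effect can be illustrated by an edge gated diffusion of predicted probability maps, as in the conceptual sketch below. The exponential gating, the 4-connected neighborhood, and the iteration count are illustrative assumptions rather than the formulation used in FusionNet.

# Conceptual analogue of the edge aware regularization: class probabilities
# diffuse to 4-connected neighbors, and the flow is attenuated where the
# edge network predicts a strong boundary. FusionNet realizes this idea as
# a training-time loss term; this post-hoc smoothing only visualizes it.
import numpy as np

def edge_aware_smooth(prob, edge, gamma=10.0, n_iters=5):
    """prob: (H, W, C) per-class probabilities; edge: (H, W) in [0, 1]."""
    w = np.exp(-gamma * edge)  # near-zero weight across strong boundaries
    for _ in range(n_iters):
        acc = np.zeros_like(prob)
        norm = np.zeros(prob.shape[:2], dtype=prob.dtype)
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            w_n = np.roll(w, (dy, dx), axis=(0, 1))
            acc += np.roll(prob, (dy, dx), axis=(0, 1)) * w_n[..., None]
            norm += w_n
        # Blend each pixel with the edge weighted average of its neighbors.
        prob = 0.5 * prob + 0.5 * acc / (norm[..., None] + 1e-8)
    return prob / prob.sum(axis=-1, keepdims=True)  # renormalize per pixel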
REFERENCES

[1] E. H. Boak and I. L. Turner, "Shoreline definition and detection: A review," J. Coastal Res., vol. 21, no. 4, pp. 688–703, Jul. 2005.
[2] A. A. Fugura, L. Billa, and B. Pradhan, "Semi-automated procedures for shoreline extraction using single RADARSAT-1 SAR image," Estuarine, Coastal Shelf Sci., vol. 95, pp. 395–400, 2011.
[3] F. Baselice and G. Ferraioli, "Unsupervised coastal line extraction from SAR images," IEEE Geosci. Remote Sens. Lett., vol. 10, no. 6, pp. 1350–1354, Nov. 2013.
[4] U. C. Benz, P. Hofmann, G. Willhauck, I. Lingenfelder, and M. Heynen, "Multi-resolution, object-oriented fuzzy analysis of remote sensing data for GIS-ready information," ISPRS J. Photogrammetry Remote Sens., vol. 58, pp. 239–258, 2004.
[5] G. Liu, Y. Zhang, X. Zheng, X. Sun, K. Fu, and H. Wang, "A new method on inshore ship detection in high-resolution satellite images using shape and context information," IEEE Geosci. Remote Sens. Lett., vol. 11, no. 3, pp. 617–621, Mar. 2014.
[6] C. Zhu, H. Zhou, R. Wang, and J. Guo, "A novel hierarchical method of ship detection from spaceborne optical image based on shape and texture features," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 9, pp. 3446–3456, Sep. 2010.
[7] J. Tang, C. Deng, G.-B. Huang, and B. Zhao, "Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1174–1185, Mar. 2015.
[8] X. You and W. Li, "A sea-land segmentation scheme based on statistical model of sea," in Proc. Int. Congr. Image Signal Process., 2011, pp. 1155–1159.
[9] S. Li, Z. Zhou, B. Wang, and F. Wu, "A novel inshore ship detection via ship head classification and body boundary determination," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 12, pp. 1920–1924, Dec. 2016.
[10] D. Cheng, G. Meng, S. Xiang, and C. Pan, "Efficient sea-land segmentation using seeds learning and edge directed graph cut," Neurocomputing, vol. 207, pp. 36–47, Sep. 2016.
[11] S. K. Mcfeeters, "The use of the normalized difference water index (NDWI) in the delineation of open water features," Int. J. Remote Sens., vol. 17, no. 7, pp. 1425–1432, 1996.
[12] T. Kuleli, A. Guneroglu, F. Karsli, and M. Dihkan, "Automatic detection of shoreline change on coastal Ramsar wetlands of Turkey," Ocean Eng., vol. 38, no. 10, pp. 1141–1149, Jul. 2011.
[13] K. Di, J. Wang, R. Ma, and R. Li, "Automatic shoreline extraction from high-resolution IKONOS satellite imagery," in Proc. ASPRS Annu. Conf., 2003, pp. 1–11.
[14] U. R. Aktas, G. Can, and F. T. Y. Vural, "Edge-aware segmentation in satellite imagery: A case study of shoreline," in Proc. IAPR Workshop Pattern Recognit. Remote Sens., Nov. 2012, pp. 1–4.
[15] H. Liu and K. C. Jezek, "Automated extraction of coastline from satellite imagery by integrating Canny edge detection and locally adaptive thresholding methods," Int. J. Remote Sens., vol. 25, no. 5, pp. 937–958, Mar. 2004.
[16] F. Chen, W. Yu, X. Liu, K. Wang, L. Gong, and W. Lv, "Graph-based ship extraction scheme for optical satellite image," in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2011, pp. 491–494.
[17] G. Jubelin and A. Khenchaf, "Multiscale algorithm for ship detection in mid, high and very high resolution optical imagery," in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2014, pp. 2289–2292.
[18] S. Qi, J. Ma, J. Lin, Y. Li, and J. Tian, "Unsupervised ship detection based on saliency and S-HOG descriptor from optical satellite images," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 7, pp. 1451–1455, Jul. 2015.
[19] C. Corbane, F. Marre, and M. Petit, "Using SPOT-5 HRG data in panchromatic mode for operational detection of small ships in tropical area," Sensors, vol. 8, no. 5, pp. 2959–2973, May 2008.
[20] C. Corbane, L. Najman, E. Pecoul, L. Demagistri, and M. Petit, "A complete processing chain for ship detection using optical satellite imagery," Int. J. Remote Sens., vol. 31, no. 22, pp. 5837–5854, Jul. 2010.
[21] T. Ojala and I. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognit., vol. 29, no. 1, pp. 51–59, Jan. 1996.
[22] Z. Shi, X. Yu, Z. Jiang, and B. Li, "Ship detection in high-resolution optical imagery based on anomaly detector and local shape feature," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 8, pp. 4511–4523, Aug. 2014.
[23] J. Lin, X. Yang, S. Xiao, Y. Yu, and C. Jia, "A line segment based inshore ship detection method," in Proc. Future Control Autom., 2012, pp. 261–269.
[24] Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, "Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017.
[25] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. Comp. Vis. Pattern Recognit., 2014, pp. 580–587.
[26] R. Girshick, "Fast R-CNN," in Proc. Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
[27] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99.
[28] I. Sevo and A. Avramovic, "Convolutional neural network based automatic object detection on aerial images," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 5, pp. 740–744, May 2016.
[29] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. ICLR, 2015, pp. 1–14.
[32] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. Comp. Vis. Pattern Recognit., 2015, pp. 3431–3440.
[33] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. Int. Conf. Comput. Vis., 2015, pp. 1520–1528.
[34] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," arXiv:1412.7062, 2014.
[35] W. Zhao, S. Du, and W. J. Emery, "Object-based convolutional neural network for high-resolution imagery classification," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 7, pp. 3386–3396, Jul. 2017.
[36] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," arXiv:1511.00561, 2015.
[37] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," arXiv:1612.01105, 2016.
[38] J. Dai, K. He, and J. Sun, "Instance-aware semantic segmentation via multi-task network cascades," in Proc. Comp. Vis. Pattern Recognit., 2016, pp. 3150–3158.
[39] Y. Li, C. Tao, Y. Tan, K. Shang, and J. Tian, "Unsupervised multilayer feature learning for satellite image scene classification," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 157–161, Feb. 2016.
[40] Z. Zou and Z. Shi, "Ship detection in spaceborne optical image with SVD networks," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 5832–5845, Oct. 2016.
[41] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2175–2184, Apr. 2015.
[42] F. Zhang, B. Du, L. Zhang, and M. Xu, "Weakly supervised learning based on coupled convolutional neural networks for aircraft detection," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 9, pp. 5553–5563, Sep. 2016.
[43] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst. Man Cybern., vol. SMC-9, no. 1, pp. 62–66, Jan. 1979.
[44] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut—interactive foreground extraction using iterated graph cuts," in Proc. ACM SIGGRAPH, 2004, pp. 309–314.
[45] Z. Liu, H. Wang, L. Weng, and Y. Yang, "Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1074–1078, Aug. 2016.
[46] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," arXiv:1703.06870, 2017.
[47] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," arXiv:1605.06409, 2016.
[48] P. O. Pinheiro, R. Collobert, and P. Dollár, "Learning to segment object candidates," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 1990–1998.
[49] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 534–549.
[50] G. Bertasius, J. Shi, and L. Torresani, "Semantic segmentation with boundary neural fields," in Proc. Comp. Vis. Pattern Recognit., 2016, pp. 3602–3610.
[51] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. Yuille, "Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform," in Proc. Comp. Vis. Pattern Recognit., 2016, pp. 4545–4554.
[52] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[53] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn., 2010, pp. 807–814.
[54] G. Cheng, Y. Wang, S. Xu, H. Wang, S. Xiang, and C. Pan, "Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 6, pp. 3322–3337, Jun. 2017.
[55] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.

Dongcai Cheng received the B.S. degree in electronic and information engineering from the University of Science and Technology of China, Hefei, China, in 2011. He is currently working toward the Ph.D. degree in pattern recognition and intelligent system in the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include image processing, object detection, and remote sensing.

Gaofeng Meng (M'17) received the B.S. degree in applied mathematics from Northwestern Polytechnical University, Xi'an, China, in 2002, the M.S. degree in applied mathematics from Tianjin University, Tianjin, China, in 2005, and the Ph.D. degree in control science and engineering from Xi'an Jiaotong University, Xi'an, China, in 2009. He is currently an Associate Professor in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision, image processing, and machine learning. Dr. Meng has been an Associate Editor of Neurocomputing since 2014.

Shiming Xiang (M'12) received the B.S. degree in mathematics from Chongqing Normal University, Chongqing, China, in 1993, the M.S. degree in computing mechanics from Chongqing University, China, in 1996, and the Ph.D. degree in computer sciences from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2004. From 1996 to 2001, he was a Lecturer at the Huazhong University of Science and Technology, Wuhan, China. He was a Postdoctoral Researcher with the Department of Automation, Tsinghua University, Beijing, China, until 2006. He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences. His research interests include pattern recognition and image processing.

Chunhong Pan received the B.S. degree in automatic control from Tsinghua University, Beijing, China, in 1987, the M.S. degree in optic and electronic engineering from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Beijing, China, in 1990, and the Ph.D. degree in pattern recognition and intelligent system from the Institute of Automation, Chinese Academy of Sciences, in 2000. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision, image processing, computer graphics, and remote sensing.