
Vol. 38, No. 1 / January 2021 / Journal of the Optical Society of America A / Research Article

Efficient point cloud segmentation approach using energy optimization with geometric features for 3D scene understanding

X L, G L,* AND S S
Southwest Jiaotong University, School of Mechanical Engineering, Chengdu, Sichuan 610031, China
*Corresponding author: motorliu7810@swjtu.edu.cn

Received 17 September 2020; revised 8 November 2020; accepted 17 November 2020; posted 20 November 2020 (Doc. ID 410458);
published 10 December 2020

Efcient and quick extraction of unknown objects in cluttered 3D scenes plays a signicant role in robotics tasks
such as object search, grasping, and manipulation. This paper describes a geometric-based unsupervised approach
for the segmentation of cluttered scenes into objects. The proposed method rst over-segments the raw point clouds
into supervoxels to provide a more natural representation of 3D point clouds and reduce the computational cost
with a minimal loss of geometric information. Then the fully connected local area linkage graph is used to distin-
guish between planar and nonplanar adjacent patches. Then the initial segmentation is completed utilizing the
geometric features and local surface convexities. After the initial segmentation, many subgraphs are generated, each
of which represents an individual object or part of it. Finally, we use the plane extracted from the scene to rene
the initial segmentation result under the framework of global energy optimization. Experiments on the Object
Cluttered Indoor Dataset dataset indicate that the proposed method can outperform the representative segmen-
tation algorithms in terms of weighted overlap and accuracy, while our method has good robustness and real-time
performance. © 2020 Optical Society of America
https://doi.org/10.1364/JOSAA.410458

1. INTRODUCTION

Accurate grasping of possibly unknown objects is one of the main requirements for robotic systems in cluttered 3D environments. For this reason, robotic systems have to go through multiple costly calculations. First and foremost, the object of interest needs to be identified, which includes a segmentation step resulting in a suitable surface impression of the object [1]. Such a task is known as segmentation. Point cloud segmentation (i.e., classification), which aims to assign each point a proper class label, is a fundamental problem in 3D scene understanding for applications in robotics such as object search, grasping, and manipulation [2].

Object segmentation is still one of the ambitious and elusive goals of computer vision, and it is even considered by some researchers an ill-defined problem, mainly because the perception of what is the best segmentation depends heavily on the application and even changes among humans [3]. With the recent introduction of cheap and powerful 3D sensors (i.e., the Microsoft Kinect or Asus XtionPRO) that can provide dense point clouds with colors for almost any indoor scene, a renewed interest in 3D techniques holds the promise to push the envelope slightly further.

However, although the availability of low-cost RGB-depth map (RGB-D) cameras enables us to capture 3D point clouds more conveniently, there are still some unavoidable problems when processing the segmentation of 3D point clouds. For example, point clouds of unknown scenes having very complex backgrounds, outliers (resulting from reflection rates of materials or errors of stereo matching), occlusions (due to the limited observation positions or scanning ranges), or uneven densities of points (caused by the sensor resolution or varying measuring distances) can frequently occur, significantly lowering the quality of the datasets obtained. Due to these problems, most of the popular segmentation algorithms based only on normal estimation [4,5], distance measurements, or the relations of surfaces [6] can hardly perform well in such complex situations when dealing with tasks related to robot applications, especially for the segmentation of unknown objects of arbitrary shape.

On the other hand, recent deep-learning-based approaches generate synthetic data to train neural networks for unknown object segmentation [1,7–9]. A variety of recent methods has also demonstrated the ability to accurately segment RGB-D data into predefined semantic labels such as humans, bicycles, and cars by training deep neural networks on massive, hand-labeled datasets [10]. These techniques require learning from data that

1084-7529/21/010060-11 Journal © 2021 Optical Society of America



contains large amounts of various objects. However, in many robot environments, large-scale datasets with this property do not exist. Since collecting a large dataset with ground-truth annotations is expensive and time-consuming, such factors restrict their use for real-world robotic applications.

In this paper, we tackle the aforementioned problems by the use of an unsupervised method. No prior information or learning process is required. Our method uses an energy function combined with the geometric features of the objects to refine the result of the initial segmentation. In summary, our contributions are as follows. (1) We developed a novel point cloud segmentation method combining graph- and energy-function-based strategies, which were not used in a robot grasping application before. To suppress the negative effects of outliers and of scenes with extremely textured backgrounds, and especially to facilitate the edge connection of adjacent surface patches, we design a local area linkage (LAL) structure to determine the connection between adjacent surface patches that considers the information of the whole local vicinity instead of merely utilizing the information between two surface patches. (2) The connectivity of different surface patches is assessed purely on the basis of geometric information (i.e., planar, nonplanar, and local surface convexities), avoiding the use of color or intensity of points, which is normally limited by 3D sensor techniques and lighting conditions. (3) Our proposed method can address complex scenes that are highly textured or contain outliers and backgrounds of similar colors. This allows our method to be directly applied to cluttered 3D scenes without denoising. (4) In the past few years, the RGB-D Object Dataset (ROD) has become the de facto standard in the robotics community for the object segmentation task [11]. Therefore, the main part of the experiment in this paper is performed on the state-of-the-art RGB-D Object Cluttered Indoor Dataset (OCID) [12]. Meanwhile, to further prove the applicability of the algorithm in scenes with complex textured backgrounds and noise, experiments on indoor scenes with noise handling are added.

The remainder of this paper is organized as follows. Following this introduction, Section 2 reviews and discusses related methods for 3D scene segmentation. Section 3 gives a detailed description of the proposed 3D segmentation method. Then the proposed method is validated in experimental studies in Section 4. Finally, the conclusions and the future research direction are presented in Section 5.

2. RELATED WORK

Point cloud segmentation is defined as partitioning the point cloud into groups of meaningful regions with similar geometric/spectral characteristics. Various approaches to segment objects in cluttered scenes exist, either in 2D computer vision or in 3D point clouds. Early methods aimed to formulate generic Gestalt principles to organize 2D scenes into objects [13]. For an overview of early work in perceptual organization, refer to Boyer and Sarkar [14]. Most of them are based on simple color and edge features [15–19], some include depth information [20,21], and others rely on combined 3D shape and color information [22,23].

Among the abovementioned methods, the graph-based method has proven to be a very effective segmentation strategy. In 2D computer vision, segmentation methods based on graph cuts complete the segmentation task by introducing graphs to represent data units (i.e., pixels or superpixels). In this type of method, the segmentation problem can usually be transformed into a graph construction and partitioning problem [24]. The work of Felzenszwalb et al. [15] is very representative. The authors use only color as input and require no geometric information. Boundaries between regions of an image are determined using a graph representation. Finally, the image is segmented by greedily making cuts in this graph. However, this method, as a pure RGB-based method, cannot robustly separate objects in cluttered scenes. Inspired by 2D graph-based methods, some studies have applied similar strategies to point cloud segmentation and have achieved good results on different datasets. This is because 3D point clouds have more spatial dimensions (i.e., normal, curvature) than 2D images, which makes the segmentation results more robust. For instance, Golovinskiy and Funkhouser [25] proposed a point cloud segmentation method based on min-cut [26] by constructing a graph using k-nearest neighbors. The min-cut then successfully completed the segmentation of outdoor urban objects. Ural et al. [27] also used min-cut to solve the energy minimization problem of airborne laser scanning (ALS) point cloud segmentation. Each point is considered to be a node in the graph, and each node is connected to its 3D Voronoi neighbors with an edge. The approach by Dutta et al. [28] uses normalized cut for the segmentation of laser point clouds in urban areas; specifically, the authors proposed an edge weight metric that takes into account the local plane parameters, RGB values, and eigenvalues of the covariance matrix of the local point distribution. However, the graph-cut-based methods for point clouds proposed by the abovementioned scholars need to process a large number of points (i.e., calculate the features of the neighborhood points), which is time-consuming and also significantly increases the computational effort used for subsequent tasks such as energy optimization. Markov random field (MRF) [29] and conditional random field (CRF) [30] models are popular machine learning approaches to solve graph-based segmentation problems. Although the use of such techniques has achieved great success, their disadvantage is that as the number of nodes increases, the computational cost of inference on these graphs usually rises sharply, which limits their use in applications that require real-time segmentation.

To reduce the computational cost and the negative impact of noise, a frequent strategy is to over-segment the original point cloud into smaller patches before applying computationally expensive algorithms. The most classical point cloud over-segmentation algorithm is Voxel Cloud Connectivity Segmentation (VCCS) [31]. In this method, a point cloud is first voxelized by an octree. Then a K-means clustering algorithm is employed to realize supervoxel segmentation. However, the supervoxel method only completes the preclustering of the point cloud by over-segmentation, and how to cluster the separated patches into complete segments is still a challenging task. Therefore, researchers have conducted much research on this issue [32–34].
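The VCCS preclustering just described (voxelization followed by K-means-style clustering of points into supervoxels) can be sketched as below. This is a toy illustration, not the actual VCCS implementation: real VCCS also weights color and normal similarity and enforces octree connectivity, and the seed resolution and iteration count here are arbitrary assumptions.

```python
import numpy as np

def supervoxel_sketch(points, seed_res=0.5, iters=3):
    """Toy VCCS-style over-segmentation: seed one cluster center per
    occupied coarse-grid cell, then run a few k-means iterations in
    spatial coordinates only (real VCCS also uses color and normals)."""
    # Seed centers: one per occupied cell of a coarse voxel grid.
    keys = np.floor(points / seed_res).astype(int)
    _, first = np.unique(keys, axis=0, return_index=True)
    centers = points[np.sort(first)].copy()
    for _ in range(iters):
        # Assign each point to its nearest seed center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its member points.
        for c in range(len(centers)):
            m = labels == c
            if m.any():
                centers[c] = points[m].mean(axis=0)
    return labels, centers
```

With a seed resolution larger than the spacing between objects, each well-separated group of points receives its own supervoxel label.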

Stein et al. [35] proposed a classic supervoxel-based method named "local convex connected patches" (LCCP), in which the edges in the adjacency graph are classified as either convex or concave; this classification is combined with region growing to cluster supervoxels into complete segments of objects. The proposed method has been inspired by this convexity criterion. However, a noisy surface normal can lead the convexity classification to fail and wrongfully split linked surfaces. To reduce the influence of noise, Xu et al. [36] presented perceptual grouping laws for supervoxel-based point cloud segmentation, using a new connection criterion that combines both voxel connectivity and geometric attributes of point groups. Saglam et al. [37] proposed a method that inspects only the local curvatures by using a new merging standard, and they used a nonsequential region growing approach to complete the segmentation.

The frequently used feature-clustering-based methods are another representative family as well [38–40]. Methods based on feature clustering organize the point clouds into primitives based on some precalculated local surface attributes (i.e., planarity, curvature, saliency features).

On the other hand, the most state-of-the-art supervised approach is to use a pretrained segmentor for segmentation tasks [41,42]. Recent research in robotic manipulation scenes has largely focused on performing grasp detection using deep learning [43]. In this context, methods such as [44] focused on predicting grasp rectangles (i.e., defined by position, orientation, width, and height) directly in RGB-D images. One challenge of these methods is the availability of a large volume of labeled training data, which can be very expensive and time-consuming to generate in the real world. To overcome this challenge, an alternative approach was presented in [45], where the authors generated training data using a physics simulation engine to learn the features. With the rise of convolutional neural networks (CNNs) in the object recognition domain, most of the discussed deep-neural-network approaches have largely focused on the development of efficient CNN architectures for deployment on embedded devices. Recently, DenseNets [46] introduced an architecture that iteratively concatenates outputs from previous layers and achieved higher recognition accuracy compared to conventional CNN models. However, the iterative growth of feature channels throughout the network in DenseNets demands high computational resources and produces low inference speeds (especially on embedded hardware). In another study [47], a fine-tuned RefineNet was used to segment and classify 40 unique known objects in a bin, with a system to quickly learn new objects through a semi-automated procedure. More recently, a few methods have investigated bottom-up approaches that assign pixels to object instances [48,49]. Most of these algorithms provide instance masks with category-level semantic labels, which do not generalize to unknown objects in novel categories. Another family of methods is class-agnostic object proposal algorithms [50,51]. However, these methods will segment everything and require a postprocessing step to select the masks of interest. By providing object-level supervision, these learning-based methods achieve better performance than some supervised methods. However, this kind of method requires a large amount of training data and may fail in segmenting the point clouds of new scenes. This means the methods also cannot be applied to arbitrary unknown scenes without being retrained, requiring the acquisition of a new dataset tailored to each test environment, and factors such as their high computational runtime also restrict their use for real-world robotic applications.

3. METHODOLOGY

Due to the large number of points, it is highly challenging to perform real-time 3D point cloud segmentation of unknown scenes for robot applications. Simultaneously, in order to use a small cluster of points to more quickly match geometric targets, we use the VCCS method proposed in [31] to over-segment input point clouds into small supervoxel patches, which also provides a more natural representation of 3D point clouds and reduces the computational cost with a minimal loss of geometric information. As the similarity between the points in each supervoxel is measured by considering the attributes of points and spatial distances in a local area, most points in each supervoxel belong to identical objects or object parts, laying a good foundation for segmentation. The overview of our segmentation approach is given in Fig. 1.

Fig. 1. Frame diagram of the proposed segmentation method.

A. Calculation of Geometric Features

The supervoxel method returns a weighted undirected adjacency graph G = 〈V, E〉, where each patch vi ∈ V corresponds to a small surface of objects with geometrical features, and each undirected adjacency edge e_ij ∈ E represents a link between neighboring surface patches vi, vj ∈ V.

Now the use of the VCCS method has already preclustered the points with homogeneous characteristics. We need to calculate the geometric features corresponding to each patch, each of which represents a local surface of the unknown scene with a centroid, a normal vector, and curvatures (i.e., vi = (ci, ni, fi)). The geometric structures of each supervoxel patch are calculated as follows.

First, the covariance matrix C3×3 is constructed using the points belonging to patch vi ∈ V according to Eq. (1):

    C3×3 = (1/k) Σ_{i=1..k} (pi − p̄)·(pi − p̄)^T,
    p̄ = (1/k) Σ_{i=1..k} pi,                                          (1)

where pi (i = 1, 2, ..., k) is a point belonging to patch vi ∈ V and p̄ is the center point of patch vi ∈ V.
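Eq. (1) amounts to a per-patch centroid and sample covariance. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def patch_covariance(P):
    """Centroid and 3x3 covariance of a patch, as in Eq. (1).
    P is a (k, 3) array of the points p_i belonging to the patch."""
    p_bar = P.mean(axis=0)       # centroid p̄ = (1/k) Σ p_i
    D = P - p_bar                # centered points p_i − p̄
    C = (D.T @ D) / len(P)       # (1/k) Σ (p_i − p̄)(p_i − p̄)^T
    return p_bar, C
```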

Let λ1(C) ≥ λ2(C) ≥ λ3(C) and e1(C), e2(C), e3(C) be the eigenvalues and eigenvectors of C3×3; the geometric structures of each patch are then calculated by Eq. (2):

    α1D = (√λ1(C) − √λ2(C)) / √λ1(C),
    α2D = (√λ2(C) − √λ3(C)) / √λ1(C),
    α3D = √λ3(C) / √λ1(C),                                            (2)

where α1D, α2D, α3D are the linear, planar, and scatter geometric features, respectively [52]. Many everyday objects that can be found in living environments are of compact size and have smooth surfaces, which are sometimes planar, but also free-form surfaces. We consider that the statuses of most indoor scenes' point clouds for robot vision towards practical applications are planar points and nonplanar points; simultaneously, in order to simplify the subsequent energy function optimization problem, the features of patch vi ∈ V are finally classified into two categories (i.e., planar patches and nonplanar patches) according to the geometric features described by Eq. (2), by Eq. (3):

    Pf = planar_patch,     if (α2D > α1D) & (α2D > α3D),
         nonplanar_patch,  otherwise,                                  (3)

where Pf represents the geometric feature class of the surface patches.

B. Graph-Based Scene Segmentation

Given a weighted undirected adjacency graph G = 〈V, E〉, segmentation of the unknown objects can be formulated as a graph-cut problem [15]. Specifically, the pairwise linkage Mij± is defined as a binary random variable whose value is set to Mij+ if two patches vi ∈ V and vj ∈ V should be merged and Mij− if they should not. In many segmentation methods, the linkage between patches is estimated only by the geometric feature affinity of the pair itself. However, in complex unknown indoor scenes (i.e., with an increasing amount of clutter, distance from the sensor to the scene, background, and viewpoint), such a linkage estimation is not reliable because of outliers, noise, and stacks of objects [36]. Moreover, the criteria for identifying linkage should adaptively vary in such a scene along with different kinds of structures and objects (i.e., different types of geometric features). Thus, in our method, Mij± is determined by the geometric information of patches within a local adjacency area (i.e., the geometric features of the surface patches and their convexity/concavity). In Fig. 2, we illustrate the definition of the local adjacency area for a patch. The area of a local adjacency is set by the undirected adjacency graph, which is an adjacency relation in voxelized 3D space (specifically, 26-adjacency in a 3D space [31]).

Fig. 2. Definition of the local adjacency area for a patch; marks of the same color have the same geometric attributes.

It can be seen from Fig. 2 that a surface patch vi ∈ V should be merged in the same cluster (i.e., the surface of the same object) with its local neighbor patches vj ∈ V that are more likely to represent the geometric properties of the local area; this relationship between vi ∈ V and vj ∈ V is called a local area linkage (LAL). This conception is derived from the idea of non-maximum suppression (NMS) [53], in which only one data point is needed to compare with its neighbors and will be suppressed if it is not a local maximum. Finally, we design a fully connected local affinity graph to determine the connection between adjacent patches, considering the information of the whole local vicinity instead of merely utilizing the information between two patches. In our method, if more than half of the neighbor patches belong to the same particular type of geometric features, the center patch will belong to the category represented by the neighbor patches (i.e., this patch should be merged into the same surface with its local neighbor patches). By applying the local area linkage, the pairwise linkage Mij± is finally defined by Eq. (4):

    Mij± = Mij+,  if Pfi = Pfj & convex(vi, vj),
           Mij−,  otherwise,                                           (4)

where convex(vi, vj) is a convexity classifier; the details of the convex/concave classification are calculated following the description in [35]. After classification, we remove all Mij− edges from the adjacency graph G = 〈V, E〉, resulting in a number of subgraphs, each of which represents an individual object or part of one. In fact, most of the subgraphs belong to planar surfaces that have not been merged. We then use the planes extracted from the scene to refine the final segmentation result (described in Section 3.D), as shown in Fig. 3.

In fact, the LAL bridges the gap between pairwise linkage and global area linkage, which makes it more robust than the pairwise-based alternatives, especially in cluttered scenes' point clouds full of noise.
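The initial segmentation steps above, Eqs. (2)–(4), can be sketched as follows: dimensionality features from the eigenvalues of the patch covariance, the planar/nonplanar test, and the LAL majority vote over a patch's local adjacency area. The function names and the tie-breaking behavior of the vote are our assumptions, and the convexity classifier convex(vi, vj) of Eq. (4) is omitted (it follows [35]).

```python
import numpy as np

def dimensionality_features(C):
    """Eq. (2): linear/planar/scatter features from a 3x3 patch
    covariance matrix C, with eigenvalues sorted so λ1 ≥ λ2 ≥ λ3."""
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]
    s1, s2, s3 = np.sqrt(np.maximum(lam, 0.0))
    return (s1 - s2) / s1, (s2 - s3) / s1, s3 / s1

def is_planar(C):
    """Eq. (3): a patch is planar if α2D dominates both α1D and α3D."""
    a1, a2, a3 = dimensionality_features(C)
    return a2 > a1 and a2 > a3

def lal_relabel(center_planar, neighbor_planar):
    """LAL majority rule: if more than half of the neighbor patches share
    one feature type, the center patch adopts that type."""
    votes = sum(neighbor_planar)
    n = len(neighbor_planar)
    if votes > n / 2:
        return True
    if (n - votes) > n / 2:
        return False
    return center_planar  # no majority: keep the patch's own label
```

For a perfectly flat patch the covariance has two dominant eigenvalues (α2D → 1), while a line-like patch has a single dominant eigenvalue (α1D → 1), so the Eq. (3) test separates them cleanly.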

C. Extraction of Planar Surface Hypotheses

After the initial segmentation is completed by merging patches, the adjacency graph is divided into many subgraphs, each of which represents an individual object or part of an object in cluttered scenes, and most of the subgraphs belong to planar surfaces that have not been merged. To group the subgraphs that should belong to the same objects together, we first generate a set of initial plane hypotheses that are consistent with the surfaces of the objects in the scene, and then we refine the initial segmentation result through the geometric consistency between the subgraphs and the initial plane hypotheses. However, to ensure the adequacy of the initial hypothesis plane set, a simple method [i.e., RANdom SAmple Consensus (RANSAC)] would need to sample a huge number of minimal subsets (i.e., the number of generated initial hypothesis planes is considerably larger than the ground truth), which is time-consuming and also significantly increases the computational effort for the energy minimization later. In this study, we address this problem by using surface patches to generate plane candidates. As mentioned earlier, the input point cloud has been over-segmented into a set of supervoxels (i.e., surface patches), each of which represents a local surface of the scene. We select surface patches with small curvatures to generate planes using centroids and normal vectors; this is because a small curvature means that the patch has a high probability of being located on a planar surface.

Fig. 3. Graph-based scene segmentation. Level 0 represents the fully connected local affinity graph; Level 1 represents the subgraphs generated after initial segmentation using local area linkage; and Level 2 represents the segmentation result after refinement.

The parameters ph(ah, bh, ch, dh) (i.e., plane functions) of each hypothesis plane Ph are calculated from its corresponding inliers by the least-squares estimation in Eq. (5):

    p̄h = argmin_{ph} Σ_{ci ∈ I} Dist²(ci, ph),
    Dist(ci, ph) = (ah·xi + bh·yi + ch·zi + dh) / √(ah² + bh² + ch²),   (5)

where ci = (xi, yi, zi) is the centroid of the surface patch and I is the set of inliers.

D. Geometric Plane Model Alignment

As mentioned in the graph-based scene segmentation, the input point cloud has been over-segmented into a set of subgraphs, and now we need to extract a set of supporting planes from the input scene point clouds to which the subgraphs (i.e., surface patches) can be assigned; simultaneously, objects with nonplanar features will separate naturally. To assign subgraphs (i.e., surface patches) to these planes through the geometric consistency between the subgraphs and the initial hypothesis plane set (i.e., to complete the separation between planes and non-planes), inspired by [54,55], we formulate this problem as a global energy minimization task:

    T = argmin_{p ∈ Ph} E(p),                                          (6)

where E(p) is a global energy function balancing data cost, smooth cost, and feature cost, as shown in Eq. (7):

    E(p) = Σ_{ci ∈ vi, pc ∈ Ph} Dist(ci, pc)   [data cost]
         + Σ_{vi, vj ∈ V} V(vi, vj)            [smooth cost]
         + Σ_{vi ∈ V} Ψvi,                     [feature cost]          (7)

where Σ_{ci ∈ vi, pc ∈ Ph} Dist(ci, pc) measures the sum of geometric errors between each surface patch and its corresponding plane (i.e., it evaluates how well the plane pc fits the input point cloud), and ci is the centroid of the surface patch. More specifically, the geometric error is calculated as the distance from ci to its corresponding plane pc, as shown in Eq. (8):

    Dist(ci, pc) = |ac·xi + bc·yi + cc·zi + dc| / √(ac² + bc² + cc²),  if lc ≠ 0,
                   2ρ,                                                  if lc = 0,   (8)

where lc = 0 is an extra label for surface patches not belonging to any plane and ρ is a distance threshold in plane fitting. The distance between any surface patch vi and the outlier plane p_out is a constant value 2ρ, which means that surface patches with distances to their corresponding planes larger than 2ρ are more likely to be outliers. The shorter the distance Dist(ci, pc), the lower the penalty for assigning surface patch vi to plane pc.

The smooth cost term is defined as the Potts model [56], as shown in Eq. (9), which means that if a pair of neighboring surface patches vi and vj belong to the same plane, the smooth cost between them is 0; otherwise, the smooth cost is 1:

    V(vi, vj) = 0,  if l_vi = l_vj,
                1,  if l_vi ≠ l_vj,                                    (9)

where l_vi and l_vj are the extra labels for surface patches vi and vj. The feature cost term Ψvi, defined in Eq. (10), expresses the probability that a patch has planar properties: the higher the planar properties of a patch vi, the better the geometric consistency with a plane it will have. Here α2D^vi is the planar geometric feature described earlier, and κ is the minimum number of inliers of a valid plane; we set κ = 50 in our method:

    Ψvi = κ · α2D^vi.                                                  (10)

Since the number of surface patches vi and the number of planes Ph are relatively small, the energy function can be optimized efficiently using the fast graph-cut-based expansion move algorithm [56].
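A sketch of evaluating the energy of Eq. (7) for a candidate labeling, combining the data cost of Eq. (8), the Potts smooth cost of Eq. (9), and the feature cost of Eq. (10). Note that Eq. (10) as printed does not depend on the labeling, so it appears here literally as a constant offset. The actual method minimizes this energy with a graph-cut expansion-move solver [56], which this sketch does not reproduce; the value of ρ and all function names are our assumptions.

```python
import numpy as np

RHO = 0.01     # plane-fitting distance threshold ρ (assumed value)
KAPPA = 50     # minimum number of inliers of a valid plane, κ = 50

def dist_to_plane(c, plane):
    """Eq. (8): point-to-plane distance, or the constant 2ρ for the
    outlier label (plane is None)."""
    if plane is None:
        return 2 * RHO
    a, b, cc, d = plane
    return abs(a * c[0] + b * c[1] + cc * c[2] + d) / np.sqrt(a*a + b*b + cc*cc)

def energy(centroids, alpha2d, labels, planes, edges):
    """Eq. (7): data + smooth + feature cost of a candidate labeling.
    labels[i] indexes into planes, or is None for the outlier label;
    edges lists pairs (i, j) of adjacent patches."""
    data = sum(dist_to_plane(c, planes[l] if l is not None else None)
               for c, l in zip(centroids, labels))
    # Potts smoothness, Eq. (9): cost 1 per adjacent pair with differing labels.
    smooth = sum(1 for i, j in edges if labels[i] != labels[j])
    # Feature cost, Eq. (10), as printed (label-independent).
    feature = sum(KAPPA * a for a in alpha2d)
    return data + smooth + feature
```

A labeling that assigns near-coplanar patches to the same plane pays only their small point-to-plane residuals, while splitting them across labels adds one Potts penalty per cut edge.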

4. EVALUATION

The implementation details of the experiments, including the description of the benchmark dataset, the evaluation criteria, and the parameter settings of the proposed method, are described in this section.

A. Experimental Dataset

To objectively assess the quality of segmentation, we used the OCID, an RGB-D dataset containing pointwise labeled point clouds for each object. The data was captured using two ASUS-PRO Xtion cameras positioned at different heights. It captures diverse settings of objects, background, context, sensor-to-scene distance, viewpoint angle, and lighting conditions. The main purpose of OCID is to allow systematic comparison of existing object segmentation methods in cluttered scenes with an increasing amount of clutter [12].

OCID comprises 96 fully built-up cluttered scenes. Each scene is a sequence of labeled point clouds that are created by building increasingly cluttered scenes incrementally, adding one object after the other. The dataset uses 89 different objects that are chosen representatives from the Autonomous Robot Indoor Dataset (ARID) [57] classes and the YCB Object and Model Set (YCB) [58] dataset objects. The ARID20 subset contains scenes including up to 20 objects from ARID. The ARID10 and YCB10 subsets include cluttered scenes with up to 10 objects from ARID and the YCB, respectively. In addition, OCID also provides ground-truth data for other vision tasks like object classification and recognition [12].

B. Evaluation Metrics

The quantitative performance of our proposed method is assessed by the agreement between the segmentation result and the ground-truth dataset (i.e., the reference dataset). The results of segmentation using our proposed method are compared against the ground-truth dataset in a point-to-point way. Six representative point cloud segmentation algorithms, including the feature-clustering-based RANSAC [59], locally convex connected patches (LCCP) [35], supervoxel- and color-based fast point cloud segmentation (FPCS) [60], graph-based image segmentation (GCUT) [15], deep-learning-based semantic RGB-D segmentation (SCUT) [61], and the support-vector-machine (SVM)-based method (V4R) [62], are used as baseline methods. Here, the RANSAC method is a renowned point-based segmentation algorithm and a popular technique for shape detection in multidimensional data due to its simplicity and robustness. The LCCP method is a popular supervoxel-based segmentation method, adopting convexity as the segmentation criterion. The FPCS method uses supervoxels and color for 3D segmentation. In contrast, the GCUT method is an image-based segmentation method using the boundaries between regions of an image. SCUT (i.e., SceneCut) is a state-of-the-art approach combining object and semantic RGB-D segmentation using convolutional oriented boundaries (COB) and a hierarchical segmentation tree; the network is trained on the NYU dataset [61]. V4R is also a learning-based method. It computes color and depth features for local patches and uses a trained SVM to determine similarity scores between patches for grouping. The method proposed in this paper initially combines the supervoxels and the LAL in segmentation, but it uses the geometric properties of local patches and global energy optimization to refine the connections of patches and enhance the robustness of the segmentation method. The different segmentation methods for this baseline comparison are summarized in Table 1.

Table 1. Characteristics of the Used Object Segmentation Implementations in the Evaluation

    Method                  RGB    Depth    Learning Approach
    RANSAC (point level)    —      —        —
    LCCP (voxel level)      —      ✓        —
    FPCS (voxel level)      ✓      ✓        —
    GCUT (image level)      ✓      —        —
    SCUT (image level)      ✓      —        Deep learning
    V4R (depth level)       ✓      ✓        SVM
    Ours (voxel level)      —      ✓        —

For quantitative comparison, we employed the metrics below, which were also used in [21,35,63,64]. The ground truth GT = {gi}, i = 1, ..., M, is the set of M human-annotated regions gi, and S = {sj}, j = 1, ..., N, is the set of N predicted segments. The information retrieval measurements, including weighted overlap (WOv), precision, recall, and F1 score, are selected as basic measures for assessing the effectiveness and accuracy of our method. WOv is a popular overlap criterion proposed in [65]; specifically, for each region gi, among the regions obtained by the segmentation to be tested, the region having the maximum overlap with gi is selected as its best estimate.

1. Weighted Overlap

We define the overlap as

    Ovi = max_{sj} |gi ∩ sj| / |gi ∪ sj|.                              (11)

The WOv is computed as a weighted average:

    WOv = (1 / Σi |gi|) Σi |gi| · Ovi.                                 (12)

The range of the above metric is [0, 1], where 1 corresponds to the perfect overlap of identical segmentation and ground truth.

2. True- and False-Positive Rates

We denote by si the region in the segmentation having maximum overlap with the region gi in GT. For each gi, we define the true-positive points by TPi = gi ∩ si, the false-positive points by FPi = si \ TPi, and the false-negative points by FNi = gi \ TPi. Finally, the average scores are defined as follows [21]:

    TP = (1/M) Σi |TPi| / |gi|,                                        (13)

    FP = (1/M) Σi |FPi| / |si|,                                        (14)

    FN = (1/M) Σi |FNi| / |gi|.                                        (15)

They all have [0, 1] as their range, but for TP higher is better, while for FP and FN lower is better.

segmented points in the reference data). However, these two measures are sensitive to the existence of false elements and to reference data that are not recognized by the method, respectively [60]. Thus, the F1 score is introduced to balance precision and recall, as an overall measure of effectiveness. These three measures are computed as per Eqs. (16)–(18), using the true positive (TP), false positive (FP), and false negative (FN) [i.e., as in Eqs. (13)–(15)] mentioned above:

precision = |TP| / (|TP| + |FP|),  (16)

recall = |TP| / (|TP| + |FN|),  (17)

F1 = 2 · precision · recall / (precision + recall).  (18)

C. Segmentation Results

To conduct the experiments, all the algorithms were implemented on a computer with 16 GB RAM and an Intel Core i7-6700HQ CPU at 2.60 GHz.

Fig. 4. Qualitative results of segmentation methods for the OCID dataset.

In Fig. 4, qualitative segmentation results of the different tested methods on example scenes in OCID are illustrated, in which different segments are rendered with varying colors. In these experiments, the voxel sizes used in our method, LCCP, and FPCS are 0.05. The seed resolutions of the supervoxels in our proposed method, LCCP, and FPCS are set to 0.02. Regarding the GCUT technique, the best performance was achieved with the following parameters: θ = 0.4, k = 500, and a minimum cluster size of 500.

It can clearly be seen from Fig. 4 that, compared with methods that rely on RGB information (i.e., FPCS, GCUT), methods that favor point cloud data with depth can produce more accurate segmentation. For the results using RANSAC, the segmentation of large planar surfaces (e.g., the ground surface and the surface of the cuboid) shows good performance. This is because the RANSAC method is a shape detection method that has good robustness. However, when it comes to connection parts between surfaces (i.e., stacked objects) or occluded adjacent objects, over- and under-segmentation occur (i.e., this method cannot distinguish different objects correctly). For the results using FPCS, although the geometric properties and color information of the scene are used in this method, over-segmented surfaces are still generated in the scene. At the same time, due to the complexity of the scene (i.e., noise, variations in lighting, and an extremely complex textured background), some surface patches are not correctly merged into the segmented objects; for instance, the convex connections of objects like the box are always over-segmented into two planar surfaces, as shown in Fig. 5.

Fig. 5. Visual segmentation results: (a) FPCS; (b) ours.

A similar phenomenon can be observed from the GCUT method; in the scene with an extremely textured background,

the result of the method that relies highly on RGB is underwhelming. For instance, the details of the carpet may be segmented together with the objects. For the LCCP method, although the segmentation results show good performance, as the scene becomes more cluttered and the number of objects increases, over- and under-segmentation frequently happen when segmenting objects with shapes that occlude each other or scenes with more noise data. Besides, in the LCCP method, due to the influence of the noise surface, the convexity classification may fail and connected surfaces may also be wrongfully split [35]; for example, the surface of the cereal box and the boundaries of the objects shown in Fig. 6. This is because the judgment of the extended convexity criterion used in the LCCP method requires singular connections formed by adjacent surfaces of supervoxels, but for objects that contain noise and occlude each other, the estimated normal can be very imprecise, and there are no correctly neighboring supervoxels for conducting the judgment of singular connections (i.e., convexity), so incorrect segmentations may occur. Both SCUT and V4R utilize models trained on real data, but V4R shows better results than SCUT. V4R was trained on the Object Segmentation Database (OSD) [22], which has an extremely similar data distribution to OCID, giving V4R a substantial advantage. Our proposed method outperforms the other baseline methods and is affected the least by the number of objects and clutter, with more objects completely segmented. At the same time, even in a noisy environment, there is no phenomenon in which patches are not merged into the same object correctly.

Fig. 6. Visual segmentation results: (a) LCCP; (b) ours.

Table 2. Evaluation of Segmentation Results of the OCID Dataset

Subset   Method  WOv     Precision  Recall  F1 Score
YCB10    RANSAC  0.4212  0.5216     0.5509  0.5358
         LCCP    0.8810  0.8822     0.8875  0.8848
         FPCS    0.7131  0.7918     0.7577  0.7743
         GCUT    0.4360  0.7056     0.7100  0.7077
         SCUT    0.6830  0.7272     0.7081  0.7175
         V4R     0.8512  0.7701     0.7873  0.7786
         Ours    0.9055  0.8977     0.9077  0.9026
ARID10   RANSAC  0.4156  0.5423     0.5567  0.5494
         LCCP    0.8522  0.8699     0.8574  0.8636
         FPCS    0.6950  0.7834     0.7617  0.7723
         GCUT    0.3831  0.7220     0.7120  0.7169
         SCUT    0.6850  0.7420     0.7052  0.7231
         V4R     0.7940  0.7720     0.7600  0.7659
         Ours    0.8910  0.9178     0.9166  0.9171
ARID20   RANSAC  0.3834  0.5134     0.4765  0.4942
         LCCP    0.8751  0.9022     0.8826  0.8922
         FPCS    0.6928  0.7721     0.7458  0.7587
         GCUT    0.4165  0.7225     0.7152  0.7188
         SCUT    0.6532  0.7355     0.7210  0.7281
         V4R     0.7423  0.7630     0.7725  0.7677
         Ours    0.8840  0.9025     0.9009  0.9016

Fig. 7. Precision-recall curves.
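To make the evaluation protocol concrete, the metrics of Eqs. (12)–(18) reported in Table 2 can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that ground truth and prediction are given as per-point integer label arrays; the function name `segmentation_metrics` is ours, not from the authors' implementation, and precision/recall are aggregated over point counts.

```python
import numpy as np

def segmentation_metrics(gt_labels, pred_labels):
    """Sketch of WOv (Eq. 12) and precision/recall/F1 (Eqs. 16-18)
    for per-point segmentation labels of one scene."""
    gt_ids = np.unique(gt_labels)
    wov_num = 0.0
    tp_pts = fp_pts = fn_pts = 0
    for g in gt_ids:
        g_mask = gt_labels == g
        # predicted segment s_i with maximum overlap with region g_i
        cand, counts = np.unique(pred_labels[g_mask], return_counts=True)
        s_mask = pred_labels == cand[np.argmax(counts)]
        inter = int(np.logical_and(g_mask, s_mask).sum())
        union = int(np.logical_or(g_mask, s_mask).sum())
        wov_num += g_mask.sum() * inter / union       # |g_i| * Ov_i (Eq. 12)
        tp_pts += inter                               # |TP_i| = |g_i ∩ s_i|
        fp_pts += int(s_mask.sum()) - inter           # |FP_i| = |s_i \ TP_i|
        fn_pts += int(g_mask.sum()) - inter           # |FN_i| = |g_i \ TP_i|
    wov = wov_num / gt_labels.size                    # weights sum to the point count
    precision = tp_pts / (tp_pts + fp_pts)            # Eq. (16)
    recall = tp_pts / (tp_pts + fn_pts)               # Eq. (17)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (18)
    return wov, precision, recall, f1
```

Running this per scene and averaging over a subset would correspond to the per-subset scores tabulated above; note that the predicted label ids need not match the ground-truth ids, since each region is matched to its maximum-overlap segment.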
The results also show that methods relying highly on color information are particularly challenged by the OCID dataset. Unknown scenes that are highly textured, or with objects and backgrounds of similar color, favor geometry-based methods.

In Table 2, a quantitative evaluation is given. As seen from Table 2, LCCP and our proposed method outperform the other comparison methods; in particular, our method achieves F1 measures larger than 0.9.

To fully investigate the performance of the method in this paper, we generate the precision-recall (PR) curves of the segmentation results on the ARID20 Table dataset in Fig. 4 with the different baseline algorithms by changing the thresholds of the segmentation methods. As shown in Fig. 7, the proposed method has better performance than the others when the recall value is smaller than 0.72. In contrast, the popular LCCP method can obtain approximate or even better precision values than those of our method at a large recall. This reveals that, for the used OCID dataset, the LCCP method tends to create under-segmented results. This is because, for this segmentation method, the smoothness and convexity criteria can segment planar surfaces and box-shaped objects well, but when it comes to more complex surfaces or objects, such as occluded objects or rough surfaces with patterns, they may generate under-segmented surfaces and cannot break an entire object into small fragments. The shape of the PR curve indicates that our method can obtain better segments with a good trade-off between precision and recall.

In addition, we compare the average execution time when dealing with different subsets of OCID, as shown in Fig. 8. The execution time contains all processing steps, including the time for creating supervoxel structures. We can find that GCUT

requires the least running time and is significantly faster than the other methods, whereas our method ranks second in execution time, requiring slightly longer computation time than GCUT. It also shows that our method can meet the requirements of some real-time processing applications. The execution times of the classic LCCP, SCUT, V4R, and RANSAC methods are very close. In contrast, the FPCS method has the longest execution time, especially when dealing with the ARID10 dataset.

Fig. 8. Execution time comparison.

D. Noise Handling Analysis

To analyze our method's robustness to noise, we selected a set of scene fragments containing objects and backgrounds from the SUN3D dataset [66]. We then added to this set of scene fragments a certain percentage p of outliers. The position of each outlier sample was defined as a random position on the scene, shifted along the object's surface by some amount h obtained from a Gaussian distribution (i.e., µ = 0, σ). The results are displayed in Fig. 9 and in Table 3 for values ranging from p = 12.5% and σ = 0.125 to p = 50% and σ = 0.5. As shown in Table 3, despite the increasing noise levels and the growing total number of scene points, the WOv does not show an apparent decline. This reinforces a key characteristic of our technique: robustness to noise. The reason for this robustness is that the merging of surface patches uses a robust fully connected LAL graph; moreover, we apply it within the framework of global energy optimization.

Table 3. Results for the Noise Handling Experiment

Parameter             Total Points of the Scene  Weighted Overlap  Running Time (s)
p = 12.5%, σ = 0.125  338276                     0.85              5.2
p = 25%, σ = 0.25     371987                     0.83              5.9
p = 50%, σ = 0.5      447767                     0.78              8.5

5. CONCLUSION

In this paper, we contributed an unsupervised geometric-based segmentation method for cluttered scenes under the framework of global energy optimization and evaluated its performance on the OCID dataset. Our method allows a learning-free, robust, and real-time process. Notably, the proposed approach is a general solution for robotics tasks such as object search, grasping, and manipulation. Comprehensive experiments demonstrated that our method obtained good performance on RGB-D point clouds compared with other methods and avoided over- and under-segmentation as much as possible while simultaneously maintaining a certain accuracy. More importantly, we proposed a fully connected LAL graph and applied it to the optimization of the energy function, which improves the robustness of the method in cluttered scenes. However, the proposed method still has the limitation that some subgraphs cannot be merged correctly. In future work, we will further extend the method to refine the segmentation results.

Funding. The Technology Innovation Fund of the 10th Research Institute of China Electronics Technology Group Corporation (20181218); National Natural Science Foundation of China (51275431).

Disclosures. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
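The outlier-injection protocol of the noise-handling experiment above (a percentage p of samples placed at random scene positions and shifted along the surface by h ~ N(0, σ)) could be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the array layout, and the use of precomputed per-point normals as the shift direction are our assumptions.

```python
import numpy as np

def add_outliers(points, normals, p, sigma, rng=None):
    """Append a fraction p of outliers to an (N, 3) point cloud: samples at
    random surface positions, shifted along the local normal by h ~ N(0, sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    n_out = int(p * len(points))
    idx = rng.integers(0, len(points), size=n_out)  # random positions on the scene
    h = rng.normal(0.0, sigma, size=(n_out, 1))     # Gaussian shift amounts (mu = 0)
    outliers = points[idx] + h * normals[idx]       # shift along the surface normal
    return np.vstack([points, outliers])
```

With p = 0.25 applied to a 100k-point fragment, this would grow the scene by 25k outliers, consistent with the increasing point totals reported in Table 3.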

Fig. 9. Results of three experiments using an increasing percentage p of outlier samples.

REFERENCES
1. M. Danielczuk, M. Matl, S. Gupta, A. Li, A. Lee, J. Mahler, and K. Goldberg, “Segmenting unknown 3D objects from real depth images using mask R-CNN trained on synthetic data,” in International Conference on Robotics and Automation (ICRA) (2019), pp. 7283–7290.
2. A. Nguyen and B. Le, “3D point cloud segmentation: a survey,” in 6th IEEE Conference on Robotics, Automation and Mechatronics (RAM) (2013), pp. 225–230.
3. S. C. Stein, F. Wörgötter, M. Schoeler, J. Papon, and T. Kulvicius, “Convexity based object partitioning for robot applications,” in IEEE International Conference on Robotics and Automation (ICRA) (2014), pp. 3213–3220.
4. Y. Ioannou, B. Taati, R. Harrap, and M. Greenspan, “Difference of normals as a multi-scale operator in unorganized point clouds,” in 2nd International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission (2012), pp. 501–508.

5. T. Rabbani, F. Van Den Heuvel, and G. Vosselman, “Segmentation of point clouds using smoothness constraint,” Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 36, 248–253 (2006).
6. M. Wang and Y. H. Tseng, “Incremental segmentation of lidar point clouds with an octree-structured voxel space,” Photogramm. Rec. 26, 32–57 (2011).
7. C. Xie, Y. Xiang, A. Mousavian, and D. Fox, “The best of both modes: separately leveraging RGB and depth for unseen object instance segmentation,” in Conference on Robot Learning (PMLR) (2020), pp. 1369–1378.
8. Y. Xiang, C. Xie, A. Mousavian, and D. Fox, “Learning RGB-D feature embeddings for unseen object instance segmentation,” arXiv:2007.15157 (2020).
9. L. Shao, Y. Tian, and J. Bohg, “ClusterNet: 3D instance segmentation in RGB-D images,” arXiv:1807.08894 (2018).
10. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 3431–3440.
11. K. Lai, L. Bo, X. Ren, and D. Fox, “A large-scale hierarchical multi-view RGB-D object dataset,” in International Conference on Robotics and Automation (ICRA) (2011), pp. 1817–1824.
12. M. Suchi, T. Patten, D. Fischinger, and M. Vincze, “EasyLabel: a semi-automatic pixel-wise object annotation tool for creating robotic RGB-D datasets,” in International Conference on Robotics and Automation (ICRA) (2019), pp. 6678–6684.
13. A. Aldoma, T. Mörwald, J. Prankl, and M. Vincze, “Segmentation of depth data in piece-wise smooth parametric surfaces,” in Computer Vision Winter Workshop (CVWW) (2015).
14. K. L. Boyer and S. Sarkar, “Guest editors’ introduction: perceptual organization in computer vision: status, challenges, and potential,” in Computer Vision and Image Understanding (Academic, 1999), Vol. 76, pp. 1–5.
15. P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” Int. J. Comput. Vis. 59, 167–181 (2004).
16. C. Rother, V. Kolmogorov, and A. Blake, “‘GrabCut’: interactive foreground extraction using iterated graph cuts,” ACM Trans. Graph. 23, 309–314 (2004).
17. S. Vicente, V. Kolmogorov, and C. Rother, “Joint optimization of segmentation and appearance models,” in 12th International Conference on Computer Vision (ICCV) (2009), pp. 755–762.
18. M. Werlberger, T. Pock, M. Unger, and H. Bischof, “A variational model for interactive shape prior segmentation and real-time tracking,” in International Conference on Scale Space and Variational Methods in Computer Vision (Springer, 2009), pp. 200–211.
19. E. Strekalovskiy and D. Cremers, “Real-time minimization of the piecewise smooth Mumford-Shah functional,” in European Conference on Computer Vision (ECCV) (2014), pp. 127–141.
20. G. Kootstra, N. Bergström, and D. Kragic, “Fast and automatic detection and segmentation of unknown objects,” in 10th IEEE-RAS International Conference on Humanoid Robots (2010), pp. 442–447.
21. A. Ückermann, R. Haschke, and H. Ritter, “Real-time 3D segmentation of cluttered scenes for robot grasping,” in 12th IEEE-RAS International Conference on Humanoid Robots (2012), pp. 198–203.
22. A. Richtsfeld, T. Mörwald, J. Prankl, M. Zillich, and M. Vincze, “Segmentation of unknown objects in indoor environments,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (2012), pp. 4791–4796.
23. T. Mörwald, A. Richtsfeld, J. Prankl, M. Zillich, and M. Vincze, “Geometric data abstraction using B-splines for range image segmentation,” in International Conference on Robotics and Automation (ICRA) (2013), pp. 148–153.
24. Y. Xie, T. Jiaojiao, and X. Zhu, “Linking points with labels in 3D: a review of point cloud semantic segmentation,” in IEEE Geoscience and Remote Sensing Magazine (2020).
25. A. Golovinskiy and T. Funkhouser, “Min-cut based segmentation of point clouds,” in 12th International Conference on Computer Vision (ICCV) (2009), pp. 39–46.
26. Y. Boykov and G. Funka-Lea, “Graph cuts and efficient ND image segmentation,” Int. J. Comput. Vis. 70, 109–131 (2006).
27. S. Ural and J. Shan, “Min-cut based segmentation of airborne LiDAR point clouds,” Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. XXXIX-B3, 167–172 (2012).
28. A. Dutta, J. Engels, and M. Hahn, “A distance-weighted graph-cut method for the segmentation of laser point clouds,” Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. XL-3, 81–88 (2014).
29. M. Johnson-Roberson, J. Bohg, M. Björkman, and D. Kragic, “Attention-based active 3D point cloud segmentation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (2010), pp. 1165–1170.
30. R. B. Rusu, Z. C. Marton, N. Blodow, A. Holzbach, and M. Beetz, “Model-based and learned semantic object labeling in 3D point cloud maps of kitchen environments,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (2009), pp. 3601–3608.
31. J. Papon, A. Abramov, M. Schoeler, and F. Worgotter, “Voxel cloud connectivity segmentation-supervoxels for point clouds,” in IEEE Conference on Computer Vision and Pattern Recognition (2013), pp. 2027–2034.
32. W. Ao, L. Wang, and J. Shan, “Point cloud classification by fusing supervoxel segmentation with multi-scale features,” Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. XLII-2/W13, 919–925 (2019).
33. F. Verdoja, D. Thomas, and A. Sugimoto, “Fast 3D point cloud segmentation using supervoxels with geometry and color for 3D scene understanding,” in International Conference on Multimedia and Expo (ICME) (2017), pp. 1285–1290.
34. Y. Ben-Shabat, T. Avraham, M. Lindenbaum, and A. Fischer, “Graph based over-segmentation methods for 3D point clouds,” Comput. Vis. Image Underst. 174, 12–23 (2018).
35. S. Christoph Stein, M. Schoeler, J. Papon, and F. Worgotter, “Object partitioning using local convexity,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014), pp. 304–311.
36. Y. Xu, L. Hoegner, S. Tuttas, and U. Stilla, “Voxel- and graph-based point cloud segmentation of 3D scenes using perceptual grouping laws,” ISPRS Ann. Photogram. Remote Sens. Spatial Inf. Sci. IV-1/W1, 43–50 (2017).
37. A. Saglam, H. B. Makineci, N. A. Baykan, and Ö. K. Baykan, “Boundary constrained voxel segmentation for 3D point clouds using local geometric differences,” Expert Syst. Appl. 157, 113439 (2020).
38. T. Czerniawski, B. Sankaran, M. Nahangi, C. Haas, and F. Leite, “6D DBSCAN-based segmentation of building point clouds for planar object classification,” Automat. Constr. 88, 44–58 (2018).
39. G. Vosselman, M. Coenen, and F. Rottensteiner, “Contextual segment-based classification of airborne laser scanner data,” ISPRS J. Photogramm. 128, 354–371 (2017).
40. C. Kim, A. Habib, M. Pyeon, G.-R. Kwon, J. Jung, and J. Heo, “Segmentation of planar surfaces from laser scanning data using the magnitude of normal position vector for adaptive neighborhoods,” Sensors 16, 140 (2016).
41. T. T. Pham, I. Reid, Y. Latif, and S. Gould, “Hierarchical higher-order regression forest fields: an application to 3D indoor scene labelling,” in IEEE International Conference on Computer Vision (ICCV) (2015), pp. 2246–2254.
42. L. Landrieu and M. Simonovsky, “Large-scale point cloud semantic segmentation with superpoint graphs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 4558–4567.
43. S. Kumra and C. Kanan, “Robotic grasp detection using deep convolutional neural networks,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2017), pp. 769–776.
44. U. Asif, M. Bennamoun, and F. A. Sohel, “RGB-D object recognition and grasp detection using hierarchical cascaded forests,” IEEE Trans. Robot. 33, 547–564 (2017).
45. S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” Int. J. Robot. Res. 37, 421–436 (2018).
46. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 4700–4708.
47. A. Milan, T. Pham, K. Vijay, D. Morrison, A. W. Tow, L. Liu, J. Erskine, R. Grinover, A. Gurman, and T. Hunn, “Semantic segmentation from

limited training data,” in International Conference on Robotics and Automation (ICRA) (2018), pp. 1908–1915.
48. D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool, “Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 8837–8845.
49. D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi, “Semi-convolutional operators for instance segmentation,” in European Conference on Computer Vision (ECCV) (2018), pp. 86–102.
50. P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in Advances in Neural Information Processing Systems (2015), pp. 1990–1998.
51. P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in European Conference on Computer Vision (ECCV) (2016), pp. 75–91.
52. B. Yang, Z. Dong, G. Zhao, and W. Dai, “Hierarchical extraction of urban objects from mobile laser scanning data,” ISPRS J. Photogramm. 99, 45–57 (2015).
53. A. Neubeck and L. Van Gool, “Efficient non-maximum suppression,” in 18th International Conference on Pattern Recognition (ICPR) (2006), pp. 850–855.
54. H. Isack and Y. Boykov, “Energy-based geometric multi-model fitting,” Int. J. Comput. Vis. 97, 123–147 (2012).
55. A. Delong, A. Osokin, H. N. Isack, and Y. Boykov, “Fast approximate energy minimization with label costs,” Int. J. Comput. Vis. 96, 1–27 (2012).
56. Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. 23, 1222–1239 (2001).
57. M. R. Loghmani, B. Caputo, and M. Vincze, “Recognizing objects in-the-wild: where do we stand?” in International Conference on Robotics and Automation (ICRA) (2018), pp. 2170–2177.
58. B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Yale-CMU-Berkeley dataset for robotic manipulation research,” Int. J. Robot. Res. 36, 261–268 (2017).
59. R. Schnabel, R. Wahl, and R. Klein, “Efficient RANSAC for point-cloud shape detection,” Comput. Graph. Forum 26, 214–226 (2007).
60. F. Verdoja, D. Thomas, and A. Sugimoto, “Fast 3D point cloud segmentation using supervoxels with geometry and color for 3D scene understanding,” in International Conference on Multimedia and Expo (ICME) (2017), pp. 1285–1290.
61. T. Pham, T. T. Do, N. Sunderhauf, and I. Reid, “SceneCut: joint geometric and object segmentation for indoor scenes,” in IEEE International Conference on Robotics and Automation (ICRA) (2018), pp. 3213–3220.
62. E. Potapova, A. Richtsfeld, M. Zillich, and M. Vincze, “Incremental attention-driven object segmentation,” in IEEE-RAS International Conference on Humanoid Robots (2014), pp. 252–258.
63. A.-V. Vo, L. Truong-Hong, D. F. Laefer, and M. Bertolotto, “Octree-based region growing for point cloud segmentation,” ISPRS J. Photogramm. 104, 88–100 (2015).
64. M. Awrangjeb and C. S. Fraser, “Automatic segmentation of raw LiDAR data for extraction of building roofs,” Remote Sens. 6, 3716–3751 (2014).
65. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in European Conference on Computer Vision (ECCV) (2012), pp. 746–760.
66. S. Choi, Q.-Y. Zhou, and V. Koltun, “Robust reconstruction of indoor scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 5556–5565.
