
2017 IEEE International Conference on Robotics and Automation (ICRA)

Singapore, May 29 - June 3, 2017

Probabilistic Data Association for Semantic SLAM


Sean L. Bowman Nikolay Atanasov Kostas Daniilidis George J. Pappas

The authors are with the GRASP Lab, University of Pennsylvania, Philadelphia, PA 19104, USA, {seanbow, atanasov, kostas, pappasg}@seas.upenn.edu. We gratefully acknowledge support by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and the following grants: ARL MAST-CTA W911NF-08-2-0004, ARL RCTA W911NF-10-2-0016.

Abstract— Traditional approaches to simultaneous localization and mapping (SLAM) rely on low-level geometric features such as points, lines, and planes. They are unable to assign semantic labels to landmarks observed in the environment. Furthermore, loop closure recognition based on low-level features is often viewpoint-dependent and subject to failure in ambiguous or repetitive environments. On the other hand, object recognition methods can infer landmark classes and scales, resulting in a small set of easily recognizable landmarks, ideal for view-independent unambiguous loop closure. In a map with several objects of the same class, however, a crucial data association problem exists. While data association and recognition are discrete problems usually solved using discrete inference, classical SLAM is a continuous optimization over metric information. In this paper, we formulate an optimization problem over sensor states and semantic landmark positions that integrates metric information, semantic information, and data associations, and decompose it into two interconnected problems: an estimation of discrete data association and landmark class probabilities, and a continuous optimization over the metric states. The estimated landmark and robot poses affect the association and class distributions, which in turn affect the robot-landmark pose optimization. The performance of our algorithm is demonstrated on indoor and outdoor datasets.

I. INTRODUCTION

In robotics, simultaneous localization and mapping (SLAM) is the problem of mapping an unknown environment while estimating a robot's pose within it. Reliable navigation, object manipulation, autonomous surveillance, and many other tasks require accurate knowledge of the robot's pose and the surrounding environment. Traditional approaches to SLAM rely on low-level geometric features such as corners [1], lines [2], and surface patches [3] to reconstruct the metric 3-D structure of a scene but are mostly unable to infer semantic content. On the other hand, recent methods for object recognition [4]–[6] can be combined with approximate 3D reconstruction of the environmental layout from single frames using priors [7], [8]. These are rather qualitative single 3D snapshots rather than the more precise mapping we need for a robot to navigate. The goal of this paper is to address the metric and semantic SLAM problems jointly, taking advantage of object recognition to tightly integrate both metric and semantic information into the sensor state and map estimation. In addition to providing a meaningful interpretation of the scene, semantically-labeled landmarks address two critical issues of geometric SLAM: data association (matching sensor observations to map landmarks) and loop closure (recognizing previously-visited locations).

Approaches to SLAM were initially most often based on filtering methods in which only the most recent robot pose is estimated [9]. This approach is in general very computationally efficient; however, because of the inability to estimate past poses and relinearize previous measurement functions, errors can compound [1]. More recently, batch methods that optimize over entire trajectories have gained popularity. Successful batch methods typically represent optimization variables as a set of nodes in a graph (a "pose graph"). Two robot-pose nodes share an edge if an odometry measurement is available between them, while a landmark and a robot-pose node share an edge if the landmark was observed from the corresponding robot pose. This pose graph optimization formulation of SLAM traces back to Lu and Milios [10]. In recent years, the state of the art [11], [12] consists of iterative optimization methods (e.g., nonlinear least squares via the Gauss-Newton algorithm) that achieve excellent performance but depend heavily on linearization of the sensing and motion models. This becomes a problem when we consider including discrete observations, such as detected object classes, in the sensing model.

One of the first systems that used both spatial and semantic representations was proposed by Galindo et al. [13]. A spatial hierarchy contained camera images, local metric maps, and the environment topology, while a semantic hierarchy represented concepts and relations, which allowed room categories to be inferred based on object detections. Many other approaches [14]–[19] extract both metric and semantic information, but typically the two processes are carried out separately and the results are merged afterwards. The lack of integration between the metric and the semantic mapping does not allow the object detection confidence to influence the performance of the metric optimization. Focusing on the localization problem only, Atanasov et al. [20] incorporated semantic observations in the metric optimization via a set-based Bayes filter. The works that are closest to ours [21]–[24] consider both localization and mapping and carry out metric and semantic mapping jointly. SLAM++ [22] focuses on a real-time implementation of joint 3-D object recognition and RGB-D SLAM via pose graph optimization. A global optimization for 3D reconstruction and semantic parsing has been proposed by [25], which is the closest work in joint semantic/geometric optimization. The main difference is that 3D space is voxelized and landmarks and/or semantic labels are assigned to voxels which are connected in a conditional random field, while our approach allows the estimation of the continuous pose of objects. Bao et al. [21] incorporate camera parameters, object geometry, and object classes into a structure from motion problem, resulting in a detailed and accurate but large and expensive optimization. A recent comprehensive survey of semantic mapping can be found in [26].



Most related work uses a somewhat arbitrary decomposition between data association, pose graph optimization, and object recognition. Our work makes the following contributions to the state of the art:

• our approach is the first to tightly couple inertial, geometric, and semantic observations into a single optimization framework,
• we provide a formal decomposition of the joint metric-semantic SLAM problem into continuous (pose) and discrete (data association and semantic label) optimization sub-problems,
• we carry out experiments on several long-trajectory real indoor and outdoor datasets, which include odometry and visual measurements in cluttered scenes and varying lighting conditions.

II. PROBABILISTIC DATA ASSOCIATION IN SLAM

Consider the classical localization and mapping problem, in which a mobile sensor moves through an unknown environment, modeled as a collection $\mathcal{L} \triangleq \{\ell_m\}_{m=1}^{M}$ of $M$ static landmarks. Given a set of sensor measurements $\mathcal{Z} \triangleq \{z_k\}_{k=1}^{K}$, the task is to estimate the landmark positions $\mathcal{L}$ and a sequence of poses $\mathcal{X} \triangleq \{x_t\}_{t=1}^{T}$ representing the sensor trajectory. Most existing work focuses on estimating $\mathcal{X}$ and $\mathcal{L}$ and rarely emphasizes that the data association $\mathcal{D} \triangleq \{(\alpha_k, \beta_k)\}_{k=1}^{K}$, stipulating that measurement $z_k$ of landmark $\ell_{\beta_k}$ was obtained from sensor state $x_{\alpha_k}$, is in fact unknown. A complete statement of the SLAM problem involves maximum likelihood estimation of $\mathcal{X}$, $\mathcal{L}$, and $\mathcal{D}$ given the measurements $\mathcal{Z}$:

  $\hat{\mathcal{X}}, \hat{\mathcal{L}}, \hat{\mathcal{D}} = \arg\max_{\mathcal{X}, \mathcal{L}, \mathcal{D}} \log p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \mathcal{D})$   (1)

The most common approach to this maximization has been to decompose it into two separate estimation problems. First, given prior estimates $\mathcal{X}^0$ and $\mathcal{L}^0$, the maximum likelihood estimate $\hat{\mathcal{D}}$ of the data association $\mathcal{D}$ is computed (e.g., via joint compatibility branch and bound [27] or the Hungarian algorithm [28]). Then, given $\hat{\mathcal{D}}$, the most likely landmark and sensor states are estimated¹:

  $\hat{\mathcal{D}} = \arg\max_{\mathcal{D}} p(\mathcal{D} \mid \mathcal{X}^0, \mathcal{L}^0, \mathcal{Z})$   (2a)
  $\hat{\mathcal{X}}, \hat{\mathcal{L}} = \arg\max_{\mathcal{X}, \mathcal{L}} \log p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \hat{\mathcal{D}})$   (2b)

The second optimization above is typically carried out via filtering [30]–[32] or pose-graph optimization [11], [12].

The above process has the disadvantage that an incorrectly chosen data association may have a highly detrimental effect on the estimation performance. Moreover, if ambiguous measurements are discarded to avoid incorrect association choices, they will never be reconsidered later when refined estimates of the sensor pose (and hence their data association) are available. Instead of a simple one-step process, then, it is possible to perform coordinate descent, which iterates the two maximization steps as follows:

  $\mathcal{D}^{i+1} = \arg\max_{\mathcal{D}} p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z})$   (3a)
  $\mathcal{X}^{i+1}, \mathcal{L}^{i+1} = \arg\max_{\mathcal{X}, \mathcal{L}} \log p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \mathcal{D}^{i+1})$   (3b)

This resolves the problem of being able to revisit association decisions once state estimates improve but does little to resolve the problem with ambiguous measurements, since a hard decision on data associations is still required. To address this, rather than simply selecting $\hat{\mathcal{D}}$ as the mode of $p(\mathcal{D} \mid \mathcal{X}, \mathcal{L}, \mathcal{Z})$, we should consider the entire density of $\mathcal{D}$ when estimating $\mathcal{X}$ and $\mathcal{L}$. Given initial estimates $\mathcal{X}^i, \mathcal{L}^i$, an improved estimate that utilizes the whole density of $\mathcal{D}$ can be computed by maximizing the expected measurement likelihood via expectation maximization (EM):

  $\mathcal{X}^{i+1}, \mathcal{L}^{i+1} = \arg\max_{\mathcal{X}, \mathcal{L}} \mathbb{E}_{\mathcal{D}}\!\left[\log p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \mathcal{D}) \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z}\right]$   (4)
  $\qquad = \arg\max_{\mathcal{X}, \mathcal{L}} \sum_{\mathcal{D} \in \mathbb{D}} p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z}) \log p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \mathcal{D})$

where $\mathbb{D}$ is the space of all possible values of $\mathcal{D}$. This EM formulation has the advantage that no hard decisions on data association are required, since it "averages" over all possible associations. To compare this with the coordinate descent formulation in (3), we can rewrite (4) as follows:

  $\arg\max_{\mathcal{X}, \mathcal{L}} \sum_{\mathcal{D} \in \mathbb{D}} \sum_{k=1}^{K} p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z}) \log p(z_k \mid x_{\alpha_k}, \ell_{\beta_k})$
  $= \arg\max_{\mathcal{X}, \mathcal{L}} \sum_{k=1}^{K} \sum_{j=1}^{M} w^i_{kj} \log p(z_k \mid x_{\alpha_k}, \ell_j)$   (5)

where $w^i_{kj} \triangleq \sum_{\mathcal{D} \in \mathbb{D}(k,j)} p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z})$ is a weight, independent of the optimization variables $\mathcal{X}$ and $\mathcal{L}$, that quantifies the influence of the "soft" data association, and $\mathbb{D}(k,j) \triangleq \{\mathcal{D} \in \mathbb{D} \mid \beta_k = j\} \subseteq \mathbb{D}$ is the set of all data associations such that measurement $k$ is assigned to landmark $j$. Note that the coordinate descent optimization (3b) has a similar form to (5), except that for each $k$ there is exactly one $j$ such that $w^i_{kj} = 1$, and $w^i_{kl} = 0$ for all $l \neq j$.

We can also show that the EM formulation, besides being a generalization of coordinate descent, is equivalent to the following matrix permanent maximization problem.

Proposition 1. If $p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i)$ is uniform, the maximizers of the EM formulation in (4) and the optimization below are equal:

  $\mathcal{X}^{i+1}, \mathcal{L}^{i+1} = \arg\max_{\mathcal{X}, \mathcal{L}} \operatorname{per}(Q^i(\mathcal{X}, \mathcal{L})),$

where $\operatorname{per}$ denotes the matrix permanent², $Q^i(\mathcal{X}, \mathcal{L})$ is a matrix with elements $Q^i_{kj} := p(z_k \mid x^i_j, \ell^i_j)\, p(z_k \mid x_j, \ell_j)$, and $\{(x^i_j, \ell^i_j)\}$ and $\{(x_j, \ell_j)\}$ are enumerations of the sets $\mathcal{X}^i \times \mathcal{L}^i$ and $\mathcal{X} \times \mathcal{L}$, respectively.

Proof. See Appendix I.

¹Note that the first maximization in (2a) assumes that $p(\mathcal{D} \mid \mathcal{X}^0, \mathcal{L}^0)$ is uniform. This is true when there are no false positive measurements or missed detections. A more sophisticated model can be obtained using ideas from [29].

²The permanent of an $n \times m$ matrix $A = [A(i,j)]$ with $n \leq m$ is defined as $\operatorname{per}(A) := \sum_{\pi} \prod_{i=1}^{n} A(i, \pi(i))$, where the sum is over all one-to-one functions $\pi : \{1, \ldots, n\} \to \{1, \ldots, m\}$.

Similar to the coordinate descent formulation, the EM formulation (5) allows us to solve the permanent maximization problem iteratively. First, instead of estimating a maximum likelihood data association, we estimate the data association distribution $p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z})$ in the form of the weights $w^i_{kj}$ (the "E" step). Then, we maximize the expected measurement log likelihood over the previously computed distribution (the "M" step).

[Fig. 1: Example keyframe image overlaid with ORB features (green points) and object detections.]
[Fig. 2: Estimated sensor trajectory (blue) and landmark positions and classes using inertial, geometric, and semantic measurements such as those in Fig. 1. The accompanying video shows the estimation process in real time.]

III. SEMANTIC SLAM

In the rest of the paper, we focus on a particular formulation of the SLAM problem that in addition to sensor and landmark poses involves landmark classes (e.g., door, chair, table) and semantic measurements in the form of object detections. We will demonstrate that the expectation maximization formulation (5) is an effective way to solve the semantic SLAM problem.

Let the state $\ell$ of each landmark consist of its position $\ell^p \in \mathbb{R}^3$ as well as a class label $\ell^c$ from a discrete set $\mathcal{C} = \{1, \ldots, C\}$. To estimate the landmark states $\mathcal{L}$ and sensor trajectory $\mathcal{X}$, we utilize three sources of information: inertial, geometric point features, and semantic object observations. Examples of geometric features and semantic observations can be seen in Figure 1.

A. Inertial information

We assume that the sensor package consists of an inertial measurement unit (IMU) and one monocular camera. A subset of the images captured by the camera are chosen as keyframes (e.g., by selecting every nth frame as a keyframe). The sensor state corresponding to the tth keyframe is denoted $x_t$ and consists of the sensor 6-D pose, velocity, and IMU bias values. We assume that the IMU and camera are time synchronized, so between keyframes $t$ and $t+1$, the sensor also collects a set $\mathcal{I}_t$ of IMU measurements (linear acceleration and rotational velocity).

B. Geometric information

In addition to the inertial measurements $\mathcal{I}_t$, we utilize geometric point measurements (e.g., Harris corners, SIFT, SURF, FAST, BRISK, ORB, etc.) $\mathcal{Y}_t$. From each keyframe image, these geometric point features are extracted and tracked forward to the subsequent keyframe. In our experiments we extract ORB features [33] from each keyframe and match them to the subsequent keyframe by minimizing the ORB descriptor distance. Since these features are matched by an external method, we assume that their data association is known.

C. Semantic information

The last type of measurement used is the set of object detections $\mathcal{S}_t$ extracted from every keyframe image. An object detection $s_k = (s^c_k, s^s_k, s^b_k) \in \mathcal{S}_t$ extracted from keyframe $t$ consists of a detected class $s^c_k \in \mathcal{C}$, a score $s^s_k$ quantifying the detection confidence, and a bounding box $s^b_k$. Such information can be obtained from any modern approach for object recognition such as [5], [34]–[36]. In our implementation, we use a deformable parts model (DPM) detector [4], [37], [38], which runs on a CPU in real time. If the data association $D_k = (\alpha_k, \beta_k)$ of measurement $s_k$ is known, the measurement likelihood can be decomposed as follows: $p(s_k \mid x_{\alpha_k}, \ell_{\beta_k}) = p(s^c_k \mid \ell^c_{\beta_k})\, p(s^s_k \mid \ell^c_{\beta_k}, s^c_k)\, p(s^b_k \mid x_{\alpha_k}, \ell^p_{\beta_k})$. The density $p(s^c_k \mid \ell^c_{\beta_k})$ corresponds to the confusion matrix of the object detector and is learned offline along with the score distribution $p(s^s_k \mid \ell^c_{\beta_k}, s^c_k)$. The bounding-box likelihood $p(s^b_k \mid x_{\alpha_k}, \ell^p_{\beta_k})$ is assumed normally distributed with mean equal to the perspective projection of the centroid of the object onto the image plane and covariance proportional to the dimensions of the detected bounding box.

Problem (Semantic SLAM). Given inertial $\mathcal{I} \triangleq \{\mathcal{I}_t\}_{t=1}^{T}$, geometric $\mathcal{Y} \triangleq \{\mathcal{Y}_t\}_{t=1}^{T}$, and semantic $\mathcal{S} \triangleq \{\mathcal{S}_t\}_{t=1}^{T}$ measurements, estimate the sensor state trajectory $\mathcal{X}$ and the positions and classes $\mathcal{L}$ of the objects in the environment.

The inertial and geometric measurements are used to track the sensor trajectory locally and, similar to a visual odometry approach, the geometric structure is not recovered. The semantic measurements, in contrast, are used to construct a map of objects that can be used to perform loop closure that is robust to ambiguities and viewpoint changes and is more efficient than a SLAM approach that maintains the full geometric structure.
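As an illustration of the factored detection likelihood in Sec. III-C, the sketch below evaluates $\log p(s_k \mid x, \ell)$ for one detection. The confusion matrix, score model, pinhole projection, and all numerical choices are hypothetical placeholders; only the three-factor decomposition itself follows the text.

```python
import numpy as np

def log_detection_likelihood(det, x_R, x_t, l_pos, l_cls, K_cam,
                             confusion, log_score_pdf):
    """log p(s | x, l) = log p(s^c | l^c) + log p(s^s | l^c, s^c)
                         + log p(s^b | x, l^p).

    det:           dict with 'cls' (int), 'score' (float), 'bbox' (x, y, w, h)
    x_R, x_t:      camera orientation (3x3) and position (3,) in the world frame
    l_pos, l_cls:  landmark position (3,) and class label (int)
    K_cam:         3x3 camera intrinsics matrix
    confusion:     confusion[l_cls][det_cls], learned offline (placeholder)
    log_score_pdf: callable (score, l_cls, det_cls) -> log density (placeholder)
    """
    # Class and score terms: constant in x and l^p, so they drop out of the
    # continuous optimization (7) but matter for the class posterior.
    log_p = np.log(confusion[l_cls][det["cls"]])
    log_p += log_score_pdf(det["score"], l_cls, det["cls"])

    # Bounding-box term: Gaussian around the projected landmark centroid,
    # with covariance proportional to the detected box dimensions.
    p_cam = x_R.T @ (l_pos - x_t)            # landmark in the camera frame
    u = K_cam @ (p_cam / p_cam[2])           # pinhole perspective projection
    bx, by, bw, bh = det["bbox"]
    center = np.array([bx + bw / 2.0, by + bh / 2.0])
    R_s = np.diag([(0.25 * bw) ** 2, (0.25 * bh) ** 2])  # assumed scaling
    r = center - u[:2]
    log_p += -0.5 * r @ np.linalg.solve(R_s, r)          # up to a constant
    return log_p
```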

IV. SEMANTIC SLAM USING EM

Following the observations from Sec. II, we apply expectation maximization to robustly handle the semantic data association. In addition to treating data association as a latent variable, we also treat the discrete landmark class labels as latent variables in the optimization, resulting in a clean and efficient separation between discrete and continuous variables. As mentioned in Sec. III, the data association of the geometric measurements is provided by the feature tracking algorithm, so the latent variables we use are the data association $\mathcal{D}$ of the semantic measurements and the object classes $\ell^c_{1:M}$. The following proposition specifies the EM steps necessary to solve the semantic SLAM problem. The initial guess $\mathcal{X}^{(0)}$ is provided by odometry integration; the initial guess $\mathcal{L}^{(0)}$ can be obtained from $\mathcal{X}^{(0)}$ by initializing a landmark along the detected camera ray.

Proposition 2. If $p(\mathcal{D} \mid \mathcal{X}, \mathcal{L})$ is uniform and the semantic measurement data associations are independent across keyframes, i.e., $p(\mathcal{D} \mid \mathcal{S}, \mathcal{X}, \mathcal{L}) = \prod_{t=1}^{T} p(\mathcal{D}_t \mid \mathcal{S}_t, \mathcal{X}, \mathcal{L})$,³ the semantic SLAM problem can be solved via the expectation maximization algorithm by iteratively solving for (1) data association weights $w^t_{ij}$ (the "E" step) and (2) continuous sensor states $\mathcal{X}$ and landmark positions $\ell^p_{1:M}$ (the "M" step) via the following equations:

  $w^{t,(i)}_{kj} = \sum_{\ell^c \in \mathcal{C}} \sum_{\mathcal{D}_t \in \mathbb{D}_t(k,j)} \kappa^{(i)}(\mathcal{D}_t, \ell^c) \quad \forall t, k, j$   (6)

  $\mathcal{X}^{(i+1)}, \ell^{p,(i+1)}_{1:M} = \arg\min_{\mathcal{X}, \ell^p_{1:M}} \sum_{t=1}^{T} \sum_{s_k \in \mathcal{S}_t} \sum_{j=1}^{M} -w^{t,(i)}_{kj} \log p(s_k \mid x_t, \ell_j) - \log p(\mathcal{Y} \mid \mathcal{X}) - \log p(\mathcal{I} \mid \mathcal{X})$   (7)

where

  $\kappa^{(i)}(\mathcal{D}_t, \ell^c) = \dfrac{p(\mathcal{S}_t \mid \mathcal{X}^{(i)}, \mathcal{L}^{(i)}, \mathcal{D}_t)}{\sum_{\ell^c} \sum_{\mathcal{D}_t \in \mathbb{D}_t} p(\mathcal{S}_t \mid \mathcal{X}^{(i)}, \mathcal{L}^{(i)}, \mathcal{D}_t)},$

$\mathbb{D}_t$ is the set of all possible data associations for measurements received at timestep $t$, and $\mathbb{D}_t(i,j) \subseteq \mathbb{D}_t$ is the set of all possible data associations for measurements received at time $t$ such that measurement $i$ is assigned to landmark $j$.

Proof. See Appendix II.

A. Object classes and data association (E step)

The computation of the weights for a single keyframe requires several combinatorial sums over all possible data associations. However, due to the assumption of independent associations among keyframes and the fact that only a few objects are present within the sensor field-of-view, it is feasible to compute the summations, and hence $w^t_{kj}$ for all keyframes $t$, measurements $k$, and landmarks $j$, extremely efficiently in practice. Once the weights $w^{t,(i)}_{kj}$ are computed for each measurement-landmark pair, they are used within the continuous optimization over sensor states and landmark positions. Additionally, maximum likelihood landmark class estimates $\ell^c$ can be recovered from the computed $\kappa$ values:

  $\hat{\ell}^c_{1:M} = \arg\max_{\ell^c} p(\ell^c_{1:M} \mid \theta, \mathcal{Z}) = \arg\max_{\ell^c} \prod_{t=1}^{T} \sum_{\mathcal{D}_t \in \mathbb{D}_t} \kappa(\mathcal{D}_t, \ell^c)$

B. Pose graph optimization (M step)

Equation (7) forms the basis of our pose graph optimization over sensor states and landmark positions. A pose graph is a convenient way of representing an optimization problem for which there exists a clear physical structure or a sparse constraint set. The graph consists of a set of vertices $\mathcal{V}$, each of which corresponds to an optimization variable, and a set of factors $\mathcal{F}$ among the vertices that correspond to individual components of the cost function. Graphically, a factor is a generalization of an edge that allows connectivity between more than two vertices. A factor $f$ in the graph is associated with a cost function that depends on a subset of the variables $\mathcal{V}$ such that the entire optimization is of the form

  $\hat{\mathcal{V}} = \arg\min_{\mathcal{V}} \sum_{f \in \mathcal{F}} f(\mathcal{V})$   (8)

In addition to providing a useful representation, factor graphs are advantageous in that there exist computational tools that allow efficient optimization [11], [39].

Our graph has a vertex for each sensor state $x_t$ and for each landmark position $\ell^p_i$. Contrary to most prior work, in which a hard data association decision results in a measurement defining a single factor between a sensor pose and a landmark, our soft semantic data associations result in each measurement defining multiple factors.

1) Semantic Factors: A measurement $s_k$ from sensor state $x_i$ defines factors $f^s_{kj}(x_i, \ell_j)$ for each visible landmark $j$. Assuming the number of visible landmarks and the number of received measurements are approximately equal, with this method the number of semantic factors in the graph is roughly squared. Note that since $\ell^c$ is fixed in (7), $p(s^s \mid \ell^c, s^c)$ and $p(s^c \mid \ell^c)$ are constant. Thus, $\log p(s \mid x, \ell) = \log p(s^b \mid x, \ell^p) + \log p(s^s \mid \ell^c, s^c)\, p(s^c \mid \ell^c)$, and so the latter term can be dropped from the optimization.

Let $h_\pi(x, \ell^p)$ be the standard perspective projection of a landmark $\ell^p$ onto a camera at pose $x$. We assume that the camera measurement of a landmark $\ell^p$ from camera pose $x$ is Gaussian distributed with mean $h_\pi(x, \ell^p)$ and covariance $R_s$. Thus, a camera factor corresponding to sensor state $t$, measurement $k$, and landmark $j$, $f^s_{kj}$, becomes

  $f^s_{kj}(\mathcal{X}, \mathcal{L}) = -w^{t,(i)}_{kj} \log p(s^b_k \mid x_t, \ell^p_j)$   (9)
  $\qquad = \|s^b_k - h_\pi(x_t, \ell_j)\|^2_{R_s / w^{t,(i)}_{kj}}$   (10)

Those semantic factors due to the re-observation of a previously seen landmark are our method's source of loop closure constraints.

2) Geometric Factors: Following [30], [40], we incorporate geometric measurements into the pose graph as structureless constraints between the camera poses that observed them. We can rewrite the term corresponding to geometric factors in (7) as

  $-\log p(\mathcal{Y} \mid \mathcal{X}) = -\sum_{i=1}^{N_y} \sum_{k: \beta^y_k = i} \log p(y_k \mid x_{\alpha^y_k})$   (11)

where $N_y$ is the total number of distinct feature tracks, i.e., the total number of observed physical geometric landmarks. Letting $\rho_{\beta^y_k}$ be the 3D position in the global frame of the landmark that generated measurement $y_k$, and assuming as before that the projection has Gaussian pixel noise with covariance $R_y$, we have

  $-\log p(\mathcal{Y} \mid \mathcal{X}) = \sum_{i=1}^{N_y} \sum_{k: \beta^y_k = i} \|y_k - h_\pi(x_{\alpha^y_k}, \rho_i)\|^2_{R_y}$   (12)

³This "naïve Bayes" assumption might not always hold perfectly in practice, but it significantly simplifies the optimization and allows for efficient implementation.
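To show how the weight in (10) enters a least-squares solver, here is a minimal sketch of the weighted semantic residual: scaling the covariance by $1/w$ is equivalent to whitening the reprojection error and multiplying it by $\sqrt{w}$. The projection function and its inputs mirror the earlier sketch and are again placeholders, not the paper's C++/GTSAM implementation.

```python
import numpy as np

def semantic_factor_residual(bbox_center, x_R, x_t, l_pos, K_cam, R_s, w):
    """Whitened residual of the semantic factor f^s_kj in eq. (10).

    Returns r such that ||r||^2 = ||s^b - h_pi(x, l)||^2 weighted by R_s / w.
    A weight w near 0 effectively disables the factor, while w = 1 recovers
    the usual hard-association reprojection factor.
    """
    p_cam = x_R.T @ (l_pos - x_t)                 # landmark in camera frame
    u = (K_cam @ (p_cam / p_cam[2]))[:2]          # perspective projection h_pi
    err = bbox_center - u                         # pixel-space error
    L = np.linalg.cholesky(np.linalg.inv(R_s))    # whitener: L @ L.T = R_s^-1
    return np.sqrt(w) * (L.T @ err)               # sqrt(w) implements R_s / w
```

A solver that expects a fixed per-factor covariance can equivalently absorb the weight by supplying the noise model $R_s / w$ directly.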

For a single observed landmark $\rho_i$, the factor constraining the camera poses which observed it takes the form

  $f^y_i(\mathcal{X}) = \sum_{k: \beta^y_k = i} \|y_k - h_\pi(x_{\alpha^y_k}, \rho_i)\|^2_{R_y}$   (13)

Because we use iterative methods to optimize the full pose graph, it is necessary to linearize the above cost term. The linearization results in a cost term of the form

  $\sum_{k: \beta^y_k = i} \|H_{\rho_{ik}} \delta\rho_i + H_{x_{ik}} \delta x_{\alpha^y_k} + b_{ik}\|^2$   (14)

where $H_{\rho_{ik}}$ is the Jacobian of the cost function with respect to $\rho_{\beta^y_k}$, $H_{x_{ik}}$ is the Jacobian with respect to $x_{\alpha^y_k}$, $b_{ik}$ is a function of the measurement and its error, and the linearized cost term is in terms of deltas $\delta x$, $\delta\rho$ rather than the true values $x$, $\rho$.

Writing the inner summation in matrix form by stacking the individual components, we can write this simply as $\|H_{\rho_i} \delta\rho_i + H_{x_i} \delta x_{\alpha^y(i)} + b_i\|^2$. To avoid optimizing over $\rho$ values, and hence to remove the dependence of the cost function upon them, we project the cost into the null space of its Jacobian. We premultiply each cost term by $A_i^\top$, where $A_i$ is a matrix whose columns span the left nullspace of $H_{\rho_i}$. The cost term for the structureless geometric features thus becomes a function of only the states which observe it:

  $\|A_i^\top H_{x_i} \delta x_{\alpha^y(i)} + A_i^\top b_i\|^2$   (15)

3) Inertial Factors: To incorporate the accelerometer and gyroscope measurements into the pose graph, we use the method of preintegration factors detailed in [40]. The authors provide an efficient method of computing inertial residuals between two keyframes $x_i$ and $x_j$ in which several inertial measurements were received. By "preintegrating" all IMU measurements received between the two keyframes, the relative pose difference (i.e., difference in position, velocity, and orientation) between the two successive keyframes is estimated. Using this estimated relative pose, the authors provide expressions for inertial residuals on the rotation ($r_{\Delta R_{ij}}$), velocity ($r_{\Delta v_{ij}}$), and position ($r_{\Delta p_{ij}}$) differences between two keyframes as a function of the poses $x_i$ and $x_j$. Specifically, they provide said expressions along with their noise covariances $\Sigma$ such that

  $f^I_i(\mathcal{X}) = -\log p(\mathcal{I}_{ij} \mid \mathcal{X})$   (16)
  $\qquad = \|r_{\mathcal{I}_{ij}}\|^2_{\Sigma_{ij}}$   (17)

The full pose graph optimization corresponding to equation (7) is then a nonlinear least squares problem involving semantic observation terms (see (10)), geometric observation terms (see (15)), and inertial terms (see (17)):

  $\hat{x}_{1:T}, \hat{\ell}_{1:M} = \arg\min_{\mathcal{X}, \ell_{1:M}} \sum_{k=1}^{K} \sum_{j=1}^{M} f^s_{kj}(\mathcal{X}, \ell^p_{1:M}) + \sum_{i=1}^{N_y} f^y_i(\mathcal{X}) + \sum_{t=1}^{T-1} f^I_t(\mathcal{X})$   (18)

We solve this within the iSAM2 framework [12], which is able to provide a near-optimal solution with real-time performance.
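The null-space projection in (13)–(15) can be written compactly with an SVD. The sketch below is a minimal numpy illustration under that formulation, not the paper's implementation: it eliminates the landmark increment $\delta\rho_i$ from a stacked linearized geometric cost.

```python
import numpy as np

def structureless_cost_terms(H_rho, H_x, b):
    """Project the stacked linearized cost ||H_rho*drho + H_x*dx + b||^2
    onto the left nullspace of H_rho, following eqs. (13)-(15).

    H_rho: (2n, 3) stacked Jacobians w.r.t. the 3D feature position
    H_x:   (2n, d) stacked Jacobians w.r.t. the observing sensor states
    b:     (2n,)   stacked residuals
    Returns (A^T H_x, A^T b) with the feature position eliminated.
    """
    # Columns of U beyond rank(H_rho) span the left nullspace of H_rho.
    U, svals, _ = np.linalg.svd(H_rho, full_matrices=True)
    rank = int(np.sum(svals > 1e-10))
    A = U[:, rank:]               # (2n, 2n - rank), satisfies A.T @ H_rho == 0
    return A.T @ H_x, A.T @ b     # cost ||A^T H_x dx + A^T b||^2 is rho-free

# Example: one feature observed from three poses (random placeholder values).
rng = np.random.default_rng(0)
H_rho, H_x, b = rng.normal(size=(6, 3)), rng.normal(size=(6, 12)), rng.normal(size=6)
AHx, Ab = structureless_cost_terms(H_rho, H_x, b)
print(AHx.shape, Ab.shape)        # (3, 12) (3,): 6 rows minus rank 3
```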

V. EXPERIMENTS

We implemented our algorithm in C++ using GTSAM [39] and its iSAM2 implementation as the optimization back-end. All experiments were able to be computed in real time.

The front-end in our implementation simply selects every 15th camera frame as a keyframe. As mentioned in Section III-B, the tracking front-end extracts ORB features [33] from every selected keyframe and tracks them forward through the images by matching the ORB descriptors. Outlier tracks are eliminated by estimating the essential matrix between the two views using RANSAC and removing those features which do not fit the estimated model. We assume that the timeframe between two subsequent images is short enough that the orientation difference between the two frames can be estimated accurately by integrating the gyroscope measurements. Thus, only the unit translation vector between the two images needs to be estimated, and we can estimate the essential matrix using only two point correspondences [41].

The front-end's object detector is an implementation of the deformable parts model detection algorithm [38]. On the acquisition of the semantic measurements from a new keyframe, the Mahalanobis distance from the measurement to all known landmarks is computed. If all such distances are above a certain threshold, a new landmark is initialized in the map, with an initial position estimate along the camera ray, with depth given by the median depth of all geometric feature measurements within its detected bounding box (or some fixed value if no such features were tracked successfully).

While ideally we would iterate between solving for constraint weights $w_{ij}$ and poses as Proposition 2 suggests, in practice for computational reasons we solve for the weights just once per keyframe.

Our experimental platform was a VI-Sensor [42], from which we used the IMU and left camera. We performed three separate experiments. The first consists of a medium-length (approx. 175 meters) trajectory around one floor of an office building, in which the object classes detected and kept in the map were two types of chairs (red office chairs and brown four-legged chairs). The second experiment is a long (approx. 625 meters) trajectory around two different floors of an office building. The classes in the second experiment are red office chairs and doors. The third and final trajectory is several loops around a room equipped with a vicon motion tracking system, in which the only class of objects detected is red office chairs. In addition to our own experiments, we applied our algorithm to the KITTI dataset [43] odometry sequences 05 and 06.

The final trajectory estimate along with the estimated semantic map for the first office experiment is shown in Fig. 3. The trajectories estimated by our algorithm, by the ROVIO visual-inertial odometry algorithm [31], and by the ORB-SLAM2 visual SLAM algorithm [44], [45], projected into the x-y plane, are shown in Fig. 4. Due to a lack of inertial information and a relative lack of visual features in the environment, ORB-SLAM2 frequently got lost and much of the trajectory estimate is missing, but it was always able to recover when entering a previously mapped region.

[Fig. 3: Sensor trajectory and estimated landmarks for the first office experiment.]
[Fig. 4: Estimated trajectories in the first office experiment.]
[Fig. 5: Estimated trajectory in the second office experiment from our algorithm (blue line) along with our estimated door landmark positions (blue circles), overlaid onto a partial ground truth map (red) along with ground truth door locations (green squares).]
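As a sketch of the outlier-rejection step in the front-end described above, the following OpenCV snippet matches ORB descriptors between two keyframes and discards matches inconsistent with a RANSAC-estimated essential matrix. OpenCV's five-point estimator stands in for the two-point method of [41], and the intrinsics and parameter values are placeholders.

```python
import cv2
import numpy as np

def track_keyframe_features(img_prev, img_next, K_cam):
    """ORB matching between consecutive keyframes + essential-matrix RANSAC."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_next, None)

    # Match by minimizing ORB (Hamming) descriptor distance.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC essential matrix; the inlier mask keeps only tracks consistent
    # with a single rigid motion between the two views.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K_cam,
                                      method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    keep = inliers.ravel().astype(bool)
    return pts1[keep], pts2[keep]
```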

The second office experiment trajectory along with the estimated map is shown in Fig. 2. An example image overlaid with object detections from near the beginning of this trajectory is displayed in Fig. 1. We constructed a partial map of the top floor in the experiment using a ground robot equipped with a lidar scanner. On this ground truth map, we manually picked out door locations. The portion of the estimated trajectory on the top floor is overlaid onto this partial truth map (the two were manually aligned) in Fig. 5. Due to the extremely repetitive nature of the hallways in this experiment, bag-of-words based loop closure detections are subject to false positives and incorrect matches. ORB-SLAM2 was unable to successfully estimate the trajectory due to such false loop closures. A partial trajectory estimate after an incorrect loop closure detection is shown in Fig. 6.

The vicon trajectory and the estimated map of chairs are shown in Fig. 7. We evaluated the position error with respect to the vicon's estimate for our algorithm, ROVIO, and ORB-SLAM2, and the results are shown in Fig. 8. Note that the spikes in the estimate errors are due to momentary occlusion from the vicon cameras.

We also evaluated our algorithm on the KITTI outdoor dataset, using odometry sequences 05 and 06. The semantic objects detected and used in our algorithm were cars. Rather than use inertial odometry in this experiment, we used the VISO2 [46] visual odometry algorithm as the initial guess $\mathcal{X}^{(0)}$ for a new keyframe state. Similarly, we replaced the preintegrated inertial relative pose (cf. Sec. IV-B.3) with the relative pose obtained from VISO in the odometry factors. The absolute position errors over time for KITTI sequence 05 with respect to ground truth for our algorithm, VISO2, and ORB-SLAM2 with monocular and stereo cameras are shown in Fig. 9. The same for sequence 06 are shown in Fig. 10. Finally, the mean translational and rotational errors over all possible subpaths of length (100, 200, ..., 800) meters are shown in Fig. 11.

VI. CONCLUSION

The experiments demonstrated that in complex and cluttered real-world datasets our method can be used to reconstruct the full 6-D pose history of the sensor and the positions and classes of the objects contained in the environment. The advantage of our work is that by incorporating semantic features directly into the optimization, we include a relatively sparse and easily distinguishable set of features that allows for improved localization performance and loop closure, while only slightly impacting the computational cost of the algorithm. Furthermore, semantic information about the environment is valuable in and of itself in aiding autonomous operation of robots within a human-centric environment.

In future work, we plan to expand our algorithm to estimate the full pose of the semantic objects (i.e., orientation in addition to position). We also plan to fully exploit our EM decomposition by reconsidering data associations for past keyframes, and to consider systems with multiple sensors and non-stationary objects.

APPENDIX I: PROOF OF PROPOSITION 1

First, we rewrite the optimization in (4) without a logarithm and similarly expand the expectation:

  $\mathcal{X}^{i+1}, \mathcal{L}^{i+1} = \arg\max_{\mathcal{X}, \mathcal{L}} \sum_{\mathcal{D} \in \mathbb{D}} p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z})\, p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \mathcal{D})$

The data association likelihood can then be rewritten as

  $p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z}) = \dfrac{p(\mathcal{Z} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{D})\, p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i)}{\sum_{\mathcal{D}} p(\mathcal{Z} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{D})\, p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i)}$   (19)
  $\qquad = \dfrac{p(\mathcal{Z} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{D})}{\sum_{\mathcal{D}} p(\mathcal{Z} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{D})}$   (20)

with the last equality due to the assumption that $p(\mathcal{D} \mid \mathcal{X}, \mathcal{L})$ is uniform. We can next decompose the measurement likelihood as $p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \mathcal{D}) = \prod_k p(z_k \mid x_{\alpha_k}, \ell_{\beta_k})$, and so

  $\mathcal{X}^{i+1}, \mathcal{L}^{i+1} = \arg\max_{\mathcal{X}, \mathcal{L}} \sum_{\mathcal{D} \in \mathbb{D}} p(\mathcal{D} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{Z})\, p(\mathcal{Z} \mid \mathcal{X}, \mathcal{L}, \mathcal{D})$   (21)
  $\qquad = \arg\max_{\mathcal{X}, \mathcal{L}} \sum_{\mathcal{D} \in \mathbb{D}} \dfrac{\prod_k p(z_k \mid x^i_{\alpha_k}, \ell^i_{\beta_k})\, p(z_k \mid x_{\alpha_k}, \ell_{\beta_k})}{\sum_{\mathcal{D}'} p(\mathcal{Z} \mid \mathcal{X}^i, \mathcal{L}^i, \mathcal{D}')}$

The result then follows by noting that the normalizing denominator is independent of the optimization variables and from the definition of the matrix permanent.
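For intuition, footnote 2's definition of the permanent can be evaluated directly for the small matrices that arise when only a few objects are visible. The brute-force sketch below is illustrative only: it is exponential in the number of measurements, and practical evaluation would use approximations such as those discussed in [29].

```python
import itertools
import numpy as np

def permanent(A):
    """per(A) for an n x m matrix with n <= m (footnote 2):
    sum over all one-to-one maps pi of prod_i A[i, pi(i)]."""
    n, m = A.shape
    assert n <= m, "requires n <= m"
    return sum(np.prod([A[i, j] for i, j in enumerate(pi)])
               for pi in itertools.permutations(range(m), n))

# Sanity check: for a square 0/1 matrix, the permanent counts the
# perfect matchings between measurements and landmarks.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
print(permanent(A))  # 2.0 matchings for this cycle structure
```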

1727
[Fig. 6: Partial ORB-SLAM2 trajectory after an incorrect loop closure in the second office experiment.]
[Fig. 7: Sensor trajectory and estimated landmarks for the vicon experiment.]
[Fig. 8: Position errors with respect to vicon ground truth.]
[Fig. 9: Norm of position error between estimate and ground truth, KITTI seq. 05.]
[Fig. 10: Norm of position error between estimate and ground truth, KITTI seq. 06.]

Fig. 11: KITTI mean translational and rotational error over path lengths (100, 200, ..., 800) meters:

KITTI Sequence 05
  Method             Trans. err [%]   Rot. err [deg/m]
  Ours               1.31             0.0038
  VISO2              4.08             0.0050
  ORBSLAM2 Mono      5.39             0.0019
  ORBSLAM2 Stereo    0.63             0.0017

KITTI Sequence 06
  Method             Trans. err [%]   Rot. err [deg/m]
  Ours               0.77             0.0037
  VISO2              1.81             0.0036
  ORBSLAM2 Mono      6.71             0.0015
  ORBSLAM2 Stereo    0.29             0.0013

APPENDIX II: PROOF OF PROPOSITION 2

Suppose we have some initial guess given by $\theta^{(i)} = \{\mathcal{X}^{(i)}, \ell^{p,(i)}\}$. We can then compute an improved estimate of $\theta = \{\mathcal{X}, \ell^p\}$ by maximizing the expected log likelihood:

  $\theta^{(i+1)} = \arg\max_{\theta} \mathbb{E}_{\mathcal{D}, \ell^c \mid \theta^{(i)}}\!\left[\log p(\mathcal{D}, \ell^c, \mathcal{S}, \mathcal{Y}, \mathcal{I} \mid \theta)\right]$   (22)

Expanding the expectation,

  $\mathbb{E}_{\mathcal{D}, \ell^c \mid \theta^{(i)}}\!\left[\log p(\mathcal{D}, \ell^c, \mathcal{S}, \mathcal{Y}, \mathcal{I} \mid \theta)\right] = \sum_{\mathcal{D}, \ell^c} p(\mathcal{D}, \ell^c \mid \mathcal{S}, \theta^{(i)}) \log\!\left[p(\mathcal{S}, \mathcal{D}, \ell^c \mid \theta)\, p(\mathcal{Y} \mid \theta)\, p(\mathcal{I} \mid \theta)\right]$   (23)

Letting $\kappa(\mathcal{D}, \ell^c) \triangleq p(\mathcal{D}, \ell^c \mid \mathcal{S}, \theta^{(i)})$, a constant with respect to the optimization variables, we continue:

  $\mathbb{E}[\cdot] = \sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{S}, \mathcal{D}, \ell^c \mid \theta) + \sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log\!\left[p(\mathcal{Y} \mid \theta)\, p(\mathcal{I} \mid \theta)\right]$   (24)
  $\qquad = \sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{S}, \mathcal{D}, \ell^c \mid \theta) + \log p(\mathcal{Y} \mid \theta) + \log p(\mathcal{I} \mid \theta)$

Focusing on the leftmost summation over data associations and landmark classes,

  $\sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{S}, \mathcal{D}, \ell^c \mid \theta)$   (25)
  $= \sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{S} \mid \mathcal{D}, \ell^c, \theta) + \sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{D}, \ell^c \mid \theta)$

Using the assumption that $p(\mathcal{D}, \ell^c \mid \theta)$ is a uniform distribution over the space of data associations and landmark classes, the last term doesn't affect which $\theta$ maximizes the objective, so for optimization purposes we have

  $\sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{S}, \mathcal{D}, \ell^c \mid \theta) = \sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{S} \mid \mathcal{D}, \ell^c, \theta)$   (26)
  $= \sum_{t} \sum_{i} \sum_{\mathcal{D}_t, \ell^c} \kappa(\mathcal{D}_t, \ell^c) \log p(s_i \mid x_t, \ell_{\beta_i})$   (27)

Note that if we let $\mathbb{D}(i,j)$ be the subset of all possible data associations that assign measurement $i$ to landmark $j$, we can further decompose this summation as

  $\sum_{\mathcal{D}, \ell^c} \kappa(\mathcal{D}, \ell^c) \log p(\mathcal{S}, \mathcal{D}, \ell^c \mid \theta) = \sum_{t} \sum_{i} \sum_{j} \sum_{\ell^c} \sum_{\mathcal{D}_t \in \mathbb{D}(i,j)} \kappa(\mathcal{D}_t, \ell^c) \log p(s_i \mid x_t, \ell_j)$   (28)

Finally, letting $w^t_{ij} \triangleq \sum_{\ell^c} \sum_{\mathcal{D}_t \in \mathbb{D}(i,j)} \kappa(\mathcal{D}_t, \ell^c)$, we can write the final expectation maximization as

  $\theta^{(i+1)} = \arg\max_{\theta} \sum_{t} \sum_{i} \sum_{j} w^t_{ij} \log p(s_i \mid x_t, \ell_j) + \log p(\mathcal{Y} \mid \theta) + \log p(\mathcal{I} \mid \theta)$   (29)

REFERENCES

[1] J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis, "Consistency Analysis and Improvement of Vision-aided Inertial Navigation," IEEE Trans. on Robotics (TRO), vol. 30, no. 1, pp. 158–176, 2014.

[2] D. G. Kottas and S. I. Roumeliotis, "Efficient and Consistent Vision-aided Inertial Navigation using Line Observations," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2013, pp. 1540–1547.
[3] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments," The International Journal of Robotics Research (IJRR), vol. 31, no. 5, pp. 647–663, 2012.
[4] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-Based Models," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 32, no. 9, pp. 1627–1645, 2010.
[5] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NIPS), 2015.
[6] P. Agrawal, R. Girshick, and J. Malik, "Analyzing the performance of multilayer neural networks for object recognition," in Computer Vision–ECCV 2014. Springer, 2014, pp. 329–344.
[7] X. Liu, Y. Zhao, and S.-C. Zhu, "Single-view 3d scene parsing by attributed grammar," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 684–691.
[8] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, "3d object proposals for accurate object class detection," in NIPS, 2015.
[9] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: part i," Robotics Automation Magazine, IEEE, vol. 13, no. 2, pp. 99–110, June 2006.
[10] F. Lu and E. Milios, "Globally Consistent Range Scan Alignment for Environment Mapping," Auton. Robots, vol. 4, no. 4, pp. 333–349, 1997.
[11] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A General Framework for Graph Optimization," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2011, pp. 3607–3613.
[12] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F. Dellaert, "iSAM2: Incremental Smoothing and Mapping Using the Bayes Tree," The International Journal of Robotics Research (IJRR), vol. 31, no. 2, pp. 216–235, 2012.
[13] C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J. Fernandez-Madrigal, and J. Gonzalez, "Multi-hierarchical Semantic Maps for Mobile Robotics," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2005, pp. 2278–2283.
[14] J. Civera, D. Galvez-Lopez, L. Riazuelo, J. Tardos, and J. Montiel, "Towards Semantic SLAM Using a Monocular Camera," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2011, pp. 1277–1284.
[15] A. Pronobis, "Semantic Mapping with Mobile Robots," dissertation, KTH Royal Institute of Technology, 2011.
[16] J. Stückler, B. Waldvogel, H. Schulz, and S. Behnke, "Dense real-time mapping of object-class semantics from RGB-D video," Journal of Real-Time Image Processing, pp. 1–11, 2013.
[17] V. Vineet, O. Miksik, M. Lidegaard, M. Nießner, S. Golodetz, V. A. Prisacariu, O. Kähler, D. W. Murray, S. Izadi, P. Perez, and P. H. S. Torr, "Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction," in IEEE International Conference on Robotics and Automation (ICRA), 2015.
[18] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool, "Dynamic 3d scene analysis from a moving vehicle," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2007, pp. 1–8.
[19] S. Pillai and J. Leonard, "Monocular slam supported object recognition," in Proceedings of Robotics: Science and Systems (RSS), Rome, Italy, July 2015.
[20] N. Atanasov, M. Zhu, K. Daniilidis, and G. Pappas, "Semantic Localization Via the Matrix Permanent," in Robotics: Science and Systems (RSS), 2014.
[21] S. Bao and S. Savarese, "Semantic Structure from Motion," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2025–2032.
[22] R. Salas-Moreno, R. Newcombe, H. Strasdat, P. Kelly, and A. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1352–1359.
[23] D. Gálvez-López, M. Salas, J. Tardós, and J. Montiel, "Real-time Monocular Object SLAM," arXiv:1504.02398, 2015.
[24] I. Reid, "Towards Semantic Visual SLAM," in Int. Conf. on Control Automation Robotics Vision (ICARCV), 2014.
[25] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. Rehg, "Joint semantic segmentation and 3d reconstruction from monocular video," in Computer Vision ECCV 2014, ser. Lecture Notes in Computer Science, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Springer International Publishing, 2014, vol. 8694, pp. 703–718. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-10599-4_45
[26] I. Kostavelis and A. Gasteratos, "Semantic mapping for mobile robotics tasks: A survey," Robotics and Autonomous Systems, vol. 66, pp. 86–103, 2015.
[27] J. Neira and J. Tardós, "Data Association in Stochastic Mapping Using the Joint Compatibility Test," IEEE Trans. on Robotics and Automation (TRO), vol. 17, no. 6, pp. 890–897, 2001.
[28] J. Munkres, "Algorithms for the Assignment and Transportation Problems," Journal of the Society for Industrial & Applied Mathematics (SIAM), vol. 5, no. 1, pp. 32–38, 1957.
[29] N. Atanasov, M. Zhu, K. Daniilidis, and G. Pappas, "Localization from semantic observations via the matrix permanent," The International Journal of Robotics Research, vol. 35, no. 1-3, pp. 73–99, 2016.
[30] A. Mourikis and S. I. Roumeliotis, "A multi-state constraint kalman filter for vision-aided inertial navigation," in Robotics and Automation, 2007 IEEE International Conference on. IEEE, 2007, pp. 3565–3572.
[31] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct ekf-based approach," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2015, pp. 298–304.
[32] C. Forster, M. Pizzoli, and D. Scaramuzza, "Svo: Fast semi-direct monocular visual odometry," in IEEE Int. Conf. on Robotics and Automation (ICRA), 2014, pp. 15–22.
[33] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "Orb: An efficient alternative to sift or surf," in Int. Conf. on Computer Vision, 2011, pp. 2564–2571.
[34] S. Gidaris and N. Komodakis, "Object detection via a multi-region and semantic segmentation-aware cnn model," in IEEE Int. Conf. on Computer Vision, 2015, pp. 1134–1142.
[35] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[36] Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos, "A unified multi-scale deep convolutional neural network for fast object detection," in European Conference on Computer Vision (ECCV), 2016.
[37] M. Zhu, N. Atanasov, G. Pappas, and K. Daniilidis, "Active Deformable Part Models Inference," in European Conference on Computer Vision (ECCV), ser. Lecture Notes in Computer Science. Springer, 2014, vol. 8695, pp. 281–296.
[38] C. Dubout and F. Fleuret, "Deformable part models with individual part scaling," in British Machine Vision Conference, no. EPFL-CONF-192393, 2013.
[39] F. Dellaert, "Factor graphs and gtsam: A hands-on introduction," GT RIM, Tech. Rep. GT-RIM-CP&R-2012-002, Sept 2012. [Online]. Available: https://research.cc.gatech.edu/borg/sites/edu.borg/files/downloads/gtsam.pdf
[40] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "Imu preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation," in Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
[41] D. G. Kottas, K. Wu, and S. I. Roumeliotis, "Detecting and dealing with hovering maneuvers in vision-aided inertial navigation systems," in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nov 2013, pp. 3172–3179.
[42] J. Nikolic, J. Rehder, M. Burri, P. Gohl, S. Leutenegger, P. T. Furgale, and R. Siegwart, "A synchronized visual-inertial sensor system with fpga pre-processing for accurate real-time slam," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 431–437.
[43] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[44] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: a versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[45] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras," CoRR, vol. abs/1610.06475, 2016. [Online]. Available: http://arxiv.org/abs/1610.06475
[46] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3d Reconstruction in Real-time," in Intelligent Vehicles Symposium (IV), 2011, pp. 963–968.
