
A Benchmark for Visual-Inertial Odometry Systems

Employing Onboard Illumination


Mike Kasper1 Steve McGuire1 Christoffer Heckman1

Abstract— We present a dataset for evaluating the performance of visual-inertial odometry (VIO) systems employing an onboard light source. The dataset consists of 39 sequences, recorded in mines, tunnels, and other dark environments, totaling more than 160 minutes of stereo camera video and IMU data. In each sequence, the scene is illuminated by an onboard light of approximately 1300, 4500, or 9000 lumens. We accommodate both direct and indirect visual odometry methods by providing the geometric and photometric camera calibrations (i.e. response, attenuation, and exposure times). In contrast with existing datasets, we also calibrate the light source itself and publish data for inferring more complex light models. Ground-truth position data are available for a subset of sequences, as captured by a Leica total station. All remaining sequences start and end at the same position, permitting the use of total accumulated drift as a metric for evaluation. Using our proposed benchmark, we analyze the performance of several state-of-the-art VO and VIO frameworks. The full dataset, including sensor data, calibration sequences, and evaluation scripts, is publicly available online at http://arpg.colorado.edu/research/oivio.

Fig. 1: Example frames from each environment (i.e. tunnels, mines, woods, and office). Each row shows a sequence of four frames, separated by a few seconds. These images highlight the primary challenges posed by our dataset: dynamic illumination, motion blur, and poor camera exposure.

I. INTRODUCTION
Given their versatility and relatively low cost, passive
cameras are arguably the most common sensor employed in
robotics applications. However, to be used reliably, sufficient
scene illumination is required. While these conditions are met in many scenarios, there is growing interest for robots to work in darker environments, such as underground or underwater. This is most evident in the recently proposed DARPA Subterranean Challenge [3], but also highlighted by the emergence of workshops focused on the subject [20], [25] and companies fielding robots in this domain.

In the absence of visual information, the robotics community has primarily relied on depth sensors (e.g. LIDAR and active depth cameras). While these sensors are robust to low-texture surfaces and poor illumination, they are often of limited range, lower resolution, and higher cost. More importantly, their utility is reduced in geometrically ambiguous scenes, such as long hallways or tunnels. Traditionally, visual cues have compensated for these limitations [9].

In order to employ cameras in dark environments, robots can be equipped with an onboard light source. However, this would violate the brightness constancy assumption held by most visual perception systems, as scene illumination will change as a result of the robot's movement. This is particularly problematic for direct methods, which work on image intensities [8], [15], [4]. In contrast, indirect methods are robust to such illumination changes, but are far more susceptible to the motion blur and sensor noise we can expect when working in dark environments, due to inadequate camera exposure [23], [13], [4].

To assess the performance of existing methods and to aid the development of novel solutions to the aforementioned challenges, we present a benchmarking dataset for visual-inertial odometry (VIO) systems working in environments illuminated by a single, onboard light source. In total, the dataset contains 39 sequences, with over two hours of stereo camera video and IMU data. The sequences were recorded in a number of challenging environments, including tunnels, mines, low-light indoor scenes, and nighttime outdoor scenes. Several example frames can be seen in Figure 1. For each recorded sequence, we illuminate the scene with a white LED light of approximately 1300, 4500, or 9000 lumens. This allows us to assess how much performance depends on lighting strength. In contrast with other datasets, our benchmark not only provides the geometric and photometric camera calibrations, but also a calibration of the light itself. By publishing a light model, we intend to promote the development of novel VIO algorithms that relax the brightness constancy assumption and model the dynamic illumination of the scene. As a point of comparison, however, we analyze several existing frameworks in Section VII.

1 Autonomous Robotics and Perception Group (ARPG), Department of Computer Science, University of Colorado, Boulder, Colorado, USA. Corresponding author: christoffer.heckman@colorado.edu
Fig. 2: Frames captured while executing approximately the same turn with each lighting configuration. From left to right,
the depicted scene is illuminated with 1300, 4500, and 9000 lumens. Note how the camera’s auto-exposure compensates for
the different levels of illumination, but consequently produces varying amounts of motion blur. To aid visual inspection, we
have provided enlarged images for the regions outlined in red and green.

II. RELATED WORK

We draw guidance from the popular EuRoC dataset [2] in terms of sensors and ground-truthing employed. It contains 11 sequences captured via a hardware-synchronized stereo camera and IMU mounted on top of a micro aerial vehicle. Ground-truth 6-DoF poses and 3-DoF positions were captured via a Vicon Motion Capture system and a Leica MS50 laser tracker, respectively. The dataset does exhibit some challenging lighting scenarios, where large regions of the environment are poorly illuminated. However, lighting remained constant while capturing each sequence. A limitation of the EuRoC dataset is that it is only well-suited for indirect methods; it not only lacks the camera response and attenuation models required by direct methods, but also the exposure times between the stereo cameras are not synchronized.

In contrast, the TUM monoVO dataset [5] targets direct odometry methods, providing full photometric calibration and exposure times as reported by the camera sensor. However, this is a purely monocular dataset, lacking the second camera and IMU sensor found in [2]. The curators of the TUM monoVO dataset also opt for a different ground-truthing strategy. All sequences start and end in the same position, permitting the evaluation of VO frameworks in terms of total accumulated drift over the entire sequence. The published sequences contain a number of challenging scenes, but again exhibit relatively static lighting.

More in line with our focus on onboard illumination is the Oxford RobotCar dataset [12]. It contains over one year of driving data recorded by a car outfitted with six cameras, a LIDAR, and an IMU. It facilitates the development and evaluation of a number of perception problems related to autonomous vehicles. The sequences exhibit a variety of weather conditions, captured at day and night. While nighttime sequences are illuminated by the car's headlights, a significant portion of illumination is contributed by streetlights. Additionally, no model of the car's headlights is provided.

Taking these concepts further, the ETH-ICL dataset [18] focuses directly on the problem of visual SLAM in dynamically lit environments. The dataset contains both real and synthetic sequences largely based on the TUM RGB-D benchmark [24] and the ICL-NUIM dataset [7]. Each sequence exhibits some form of dynamic lighting, either by modulating global and local light sources, or by the movement of a flashlight co-located with the camera. While this dataset does contain sequences illuminated by an onboard light source, only two sequences employ such a lighting solution. Additionally, no model of the light source is provided, which we believe could be leveraged to develop novel methods for visual odometry.

In a different vein, the DiLiGenT dataset [22] is not intended for VO research, but rather for photometric stereo. Photometric stereo, in contrast with binocular stereo, is a technique that typically employs a single camera and one or more lights to infer 3D geometry [26]. The DiLiGenT dataset contains a series of images of 10 objects, captured by a stationary camera under different lighting configurations. In addition to the images themselves, the authors provide a calibration of the employed light array, which consists of a 2D grid of 96 uniformly-spaced, white LED lights. We take a similar approach in our visual odometry dataset.

As this brief review shows, just as there is a large diversity of VO solutions, the same can be said for VO datasets. The dataset we present in the following sections is particularly focused on underground environments, with sensing and calibration considerations to match, including onboard lighting and the use of stereo cameras and IMUs.

III. DATASET

All sequences in our dataset can be characterized as a visual-inertial rig navigating dark environments, illuminated by an onboard light source. We captured sequences in four types of environments: (1) mines, (2) tunnels, (3) outdoors at night, and (4) indoors where all other lights are turned off. While some sequences may exhibit small amounts of external lighting, the predominant illuminant in all scenes is the onboard light source. To permit exploration of lighting solutions, we roughly replicate each trajectory with three different lighting intensities. A few example frames are shown in Figure 2. During data capture, the sensor rig was either handheld or mounted on a ground vehicle (Clearpath Husky UGV). In the remainder of this section we provide details about the sensor rig and ground-truthing strategies.
Fig. 3: Our employed sensors include an Intel RealSense D435i and a LORD Microstrain 3DM-GX5-15 (not visible). The onboard light source is a 9000 lumen, 100W, white LED light. To modulate the light intensity we use a DC-DC boost regulator. Long-term use of this light requires a large passive heat-sink and fan. We equip a tracking prism when ground-truthing position data with the Leica. (Labeled components: Leica prism, computer, light, cameras, batteries, IMU.)

A. Sensor Setup

We capture inertial data with a LORD Microstrain 3DM-GX5-15 at 100Hz, and a stereo pair of 1280 × 720 grayscale images with an Intel RealSense D435i at 30Hz. We opted for the RealSense as it is a widely available consumer device featuring a hardware-synchronized, fixed-lens, global-shutter stereo camera. However, as it is primarily intended to be used as an active IR depth sensor, these cameras do not filter out infrared light. This does not negatively impact the acquired visual information, but does require that we disable the IR emitter during operation. Consequently, we do not publish any depth maps with our dataset.

Each sequence in our benchmark is illuminated by an onboard, maximum 9000 lumen, 100W, white LED light. Long-term use of this light requires a large passive heat sink and fan. Clearly, such a lighting system is not practical for all robotics applications (e.g. micro air vehicles). We therefore attempt to capture the same trajectory three times, modulating the light's intensity to approximately 100, 50, and 15 percent of its full capacity. This allows us to evaluate the performance of visual odometry systems working with different lighting solutions.

The light and sensors are mounted inside a custom housing, equipped with a power supply and onboard computer for logging data. As our sensor rig is self-contained, all captured sequences exhibit consistent extrinsic calibrations, regardless of the mobile platform employed (i.e. handheld or ground vehicle). Depending on the ground-truthing system employed, the rig may also be outfitted with a laser tracking prism. An image of our sensor rig can be seen in Figure 3.

B. Ground-truth

In this work, we employ two methods for ground-truthing, providing a balance between sampling resolution and trajectory complexity. For a subset of our dataset, we employ a Leica TCRP1203 R300 total station to acquire the ground-truth position of our sensor rig at 10Hz. Unfortunately, maintaining the line-of-sight required by any tracking solution puts undesirable constraints on the nature of environments and trajectories we can employ. This is especially true in the narrow passageways common in tunnels and mines. We therefore provide an additional set of sequences that start and end in the same position, permitting the use of total accumulated drift as another evaluation metric. As in [5], we start and end each sequence with a 10-20 second segment of loopy camera motion, observing an easy-to-track scene. We then process just these frames using the ORB-SLAM2 framework [14] to obtain accurate poses for the start and end segments. As our dataset contains stereo images, we need not resolve the scale ambiguity problem that arises from monocular VO, as required in [5].

IV. DATASET FORMAT

A. Data Format

We organize our dataset using the ASL dataset format, similar to the EuRoC dataset [2]. However, we must also accommodate the additional photometric calibrations and light source model. As in [2], each captured sequence comprises a collection of sensor data. Each sensor comes with a sensor.yaml file, specifying calibration parameters, and a data.csv file, that either contains the sensor data itself or references files in an optional data folder:

    <sensor system id>
        <sensor id>
            sensor.yaml
            data.csv
            data

Again, ground-truthing systems are treated as separate sensors. In a slight abuse of terminology, we treat the light source as a sensor without a data.csv file. For example:

    husky0
        imu0
            sensor.yaml
            data.csv
        cam0
            sensor.yaml
            data.csv
            data
                154723105215000000.png
                154723105220000000.png
                ...
        leica0
            sensor.yaml
            data.csv
        light0
            sensor.yaml

1) YAML Files: As with the EuRoC MAV dataset, each sensor provides a sensor.yaml file, which describes all relevant properties unique to the sensor. Additionally, all YAML files share two common fields: sensor_type and T_BS. The sensor_type field specifies one of the following sensor types: {imu|camera|position|pose|light}, and T_BS is a 4 × 4 homogeneous transformation matrix describing the sensor's extrinsic relationship with the sensor rig's body frame. All properties listed in the YAML file are assumed to be static throughout the entire sequence.
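To make the layout above concrete, the following minimal Python sketch reads a sensor.yaml and recovers its extrinsics. It assumes the PyYAML package and that T_BS is stored either as a flat list or, as in the EuRoC/ASL convention, as a rows/cols/data mapping; the file path shown is only illustrative.

    # Minimal sketch: load a sensor.yaml and extract sensor_type and T_BS.
    import numpy as np
    import yaml

    def load_sensor(yaml_path):
        # Some ASL-style YAML files begin with an OpenCV-style "%YAML:1.0"
        # directive that PyYAML rejects, so directive lines are stripped first.
        with open(yaml_path, "r") as f:
            text = "\n".join(l for l in f.read().splitlines() if not l.startswith("%"))
        node = yaml.safe_load(text)
        sensor_type = node["sensor_type"]          # imu, camera, position, pose, or light
        raw = node["T_BS"]
        data = raw["data"] if isinstance(raw, dict) else raw
        T_BS = np.array(data, dtype=float).reshape(4, 4)  # extrinsics w.r.t. the rig body frame
        return sensor_type, T_BS

    # Example (hypothetical path):
    # sensor_type, T_BS = load_sensor("husky0/cam0/sensor.yaml")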
2) CSV Data Files: The data.csv file either contains all the data captured by the sensor throughout the sequence, or references files in the optional data folder. In either case, each line first begins with a timestamp (integer nanoseconds, POSIX) denoting the time at which the corresponding data was recorded. The subsequent fields for each sensor's data.csv file are presented in the following sections.
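Since only the leading integer-nanosecond timestamp is common to all sensors, a generic reader is straightforward. The sketch below is an illustration only; the interpretation of the remaining columns is sensor-specific, as described in the following sections.

    # Minimal sketch: parse a data.csv into (timestamp_ns, fields) pairs.
    import csv

    def read_data_csv(csv_path):
        rows = []
        with open(csv_path, "r") as f:
            for line in csv.reader(f):
                if not line or line[0].lstrip().startswith("#"):
                    continue  # skip blank lines and any header comments
                timestamp_ns = int(line[0])            # integer nanoseconds, POSIX
                fields = [field.strip() for field in line[1:]]
                rows.append((timestamp_ns, fields))
        return rows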
B. Sensors

1) Cameras: As with the EuRoC dataset [2], each line of a camera's data.csv file contains the timestamp and file name of the captured image. We augment this by including the exposure times and gains reported by the camera during capture. The sensor.yaml file specifies the camera intrinsics with the fields: camera_model, intrinsic_coefficients, distortion_model, distortion_coefficients, and resolution.

However, as our dataset accommodates direct VO methods, we also provide a photometric calibration, via the response.csv and vignette.png files in the sensor folder. The response.csv file contains 255 values, providing all camera inverse-response values over the domain of 8-bit color depth. These values can be used as a simple lookup to convert from captured image intensities to irradiance values. The vignette.png file is a monochrome 16-bit image with the same resolution as the camera. Each pixel in vignette.png specifies the attenuation factor for its respective coordinates in camera images. Section V-B describes our photometric calibration process.
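The photometric files described above reduce to a per-pixel lookup and division. The sketch below is our own illustration (it assumes NumPy and OpenCV for image I/O); the exposure-time units and the normalization of the 16-bit vignette are assumptions that should be checked against the released calibration files.

    # Minimal sketch: undo camera response, vignetting, and exposure so that
    # pixel values become (relative) irradiance, as required by direct methods.
    import numpy as np
    import cv2

    def load_photometric_calibration(sensor_dir):
        inv_response = np.loadtxt(sensor_dir + "/response.csv", delimiter=",").ravel()
        vignette = cv2.imread(sensor_dir + "/vignette.png", cv2.IMREAD_UNCHANGED)
        attenuation = vignette.astype(np.float64) / np.iinfo(vignette.dtype).max
        return inv_response, attenuation

    def to_irradiance(image_u8, inv_response, attenuation, exposure_s):
        # Map 8-bit intensities through the inverse response (clipped to the
        # table length, since response.csv holds 255 entries per the text).
        idx = np.clip(image_u8.astype(np.intp), 0, len(inv_response) - 1)
        energy = inv_response[idx]
        # Divide out the per-pixel attenuation and the exposure time.
        return energy / (attenuation * exposure_s)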
2) IMU: As with the EuRoC dataset, each line of an IMU's data.csv file contains the following fields: timestamp, ωS [rad s⁻¹], aS [m s⁻²], where ωS ∈ R³ is the angular rate and aS ∈ R³ is the linear acceleration, both expressed in the sensor's body frame. For all sequences, the Microstrain's data is published under the imu0 sensor. The YAML file contains the noise densities and bias "diffusions", which stochastically describe the sensor's random walk. These are obtained in the same way as in the EuRoC MAV dataset. We refer the reader to [2] for more details.

We also publish measurements from the RealSense D435i's onboard IMU (Bosch Sensortec BMI055) as imu1. However, the RealSense does not capture accelerometer and gyroscope data at the same rate. Rather than complicating IMU data access, we opt to keep the same single-line file format. We configure the RealSense to capture accelerometer data at 250Hz and gyroscope data at 400Hz, and only report the gyroscope measurement closest to each accelerometer measurement. However, we still provide files for the individual sensor streams in the imu1/data folder.

3) Light: As previously mentioned, the onboard light is treated like a sensor without a data.csv file. However, it does have a sensor.yaml file. Like all sensors, this YAML file provides the transformation matrix T_BS, which specifies the pose of the light with respect to the sensor system. It also contains two additional fields: size and lumens. Here, size is a 2D vector indicating the horizontal and vertical size, in meters, of the LED patch. The lumens field specifies the approximate light intensity, in lumens, which is one of three values: 9000, 4500, or 1300.
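For concreteness, a light0/sensor.yaml might look as follows. The field names (sensor_type, T_BS, size, lumens) are those described above; the rows/cols/data layout for T_BS follows the EuRoC/ASL convention and every numeric value shown here is a placeholder rather than an actual calibration result.

    sensor_type: light
    T_BS:
      rows: 4
      cols: 4
      data: [1.0, 0.0, 0.0, 0.05,
             0.0, 1.0, 0.0, 0.00,
             0.0, 0.0, 1.0, 0.02,
             0.0, 0.0, 0.0, 1.00]
    size: [0.04, 0.04]   # horizontal and vertical extent of the LED patch, in meters (placeholder)
    lumens: 9000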
4) Position, Pose: As described in Section III-B, we employ two ground-truthing methods. These are represented by the sensors leica0 and loop0. As with the EuRoC dataset, the data.csv file for these sensors contains the following fields: timestamp, qRS, RpRS. Here, RpRS denotes the position of the sensor with respect to the ground-truthing reference frame R, and qRS is the four-element unit quaternion representing the orientation of the sensor.

Given that the Leica laser tracker only captures 3D position, the qRS field is omitted from its data.csv file. In contrast, the "loop-closure" method, described in [5], can provide full 6-DoF poses. However, it only does so for the very beginning and ending of the trajectory. Consequently, its corresponding data.csv file contains far fewer entries, with a large gap in the reported timestamps.

V. CALIBRATION

In this section we describe the provided calibration parameters and how they are obtained. These calibrations include the camera intrinsics, the extrinsics between the cameras and IMU, and the photometric camera calibration. However, in contrast with existing VO datasets, we additionally provide multiple models for the onboard light source. To permit the use of custom calibration models, we also publish all raw sensor sequences for each mode of calibration.

A. Geometric Camera Calibration

We calibrate our visual-inertial sensor using the Vicalib framework¹. This process yields a geometric calibration comprising the camera intrinsics and the camera-IMU extrinsics, although recent work shows this may be estimated online [16]. To permit the use of other camera models, we publish all calibration sequences.

¹ https://github.com/arpg/vicalib

B. Photometric Camera Calibration

As with the TUM monoVO dataset [5], we provide the photometric calibration for each camera. This includes the camera's response and dense attenuation model. We obtained these properties using the same process described in [5]. This yields independent linear response functions Ul and Ur for the left and right cameras, respectively. However, care must be taken to ensure that observations of the same point in each image elicit the same response. We therefore apply an affine transformation to the response Ur to satisfy the equality

    Ul(Il(p)) = α Ur(Ir(q)) + β.   (1)

Here, p and q are coordinates in the left and right images corresponding to the same world point. We obtain these from the irradiance images generated when performing the calibration described in [5] on both cameras simultaneously. As the two irradiance images will be centered on the calibration target, data association is trivial. We then solve the linear system constructed from these observations to determine the affine transformation that best satisfies Eq. (1).
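Given corresponding irradiance samples from the two cameras, the affine correction in Eq. (1) reduces to an ordinary least-squares problem. The following is a minimal sketch, not the released calibration code; the variable names are illustrative.

    # Minimal sketch: fit alpha, beta so that Ul(Il(p)) ~= alpha * Ur(Ir(q)) + beta
    # over corresponding samples, in the least-squares sense.
    import numpy as np

    def fit_affine_response(u_left, u_right):
        # u_left, u_right: 1-D arrays of irradiance values at corresponding points.
        A = np.stack([u_right, np.ones_like(u_right)], axis=1)
        (alpha, beta), *_ = np.linalg.lstsq(A, u_left, rcond=None)
        return alpha, beta

Applying the recovered α and β to the right camera's irradiance values then makes corresponding observations directly comparable, as required by Eq. (1).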
C. Light Source Calibration

In order to facilitate the development of novel visual odometry algorithms, which anticipate changes in scene illumination, we additionally calibrate the onboard light source. Our dataset includes computationally simple light models, which are feasible for real-time applications. However, we also publish multiple types of calibration sequences so that other models may be inferred in the future.

To obtain an initial geometric calibration, we first model the light source as a planar patch in 3D space. As in [28], [21], we employ reflective spheres as light probes, and exploit their contours and specular highlights, observed in multiple stereo images and scene configurations, to construct an inverse-rendering problem. Given prior knowledge of our light source, we fix the dimensions of the planar patch, reducing the parameter space proposed in [28] to that of a single 6-DoF pose. Figure 4 depicts this calibration process.

Fig. 4: Example frames from our light source calibration sequences. The top row shows images for geometric calibration, where the green 'plus' indicates the hand-annotated specular highlight, and the red 'x' indicates the highlight computed from the inferred geometry. The bottom row shows images for intensity distribution calibration. The motion of the rig is such that the entirety of the side-lobing from the onboard illuminant was captured.

The position of an isotropic point light, the simplest model we provide, is computed as the center of the inferred area light. Using this point light model (and assuming Lambertian reflectance), the irradiance I found at a given point p is then

    I = α L cos θ / d²,   (2)

where α is the surface albedo, L is the radiance emitted by the light, θ is the angle of incidence, and d is the distance between p and the light source. As α is typically unknown in a visual odometry setting, we do not provide an absolute measure of L, and our isotropic light models can therefore be used for all three lighting configurations. In addition, we argue that light occlusions (i.e. cast shadows) need not be modeled, as the light source is located near the cameras.

However, when the sensor system is operating in close proximity to objects in the scene, shadows may be of concern. Additionally, the errors from using a point light model will become non-negligible. Consequently, the area light model should be employed in these circumstances. Theoretically, this would require integrating all incident radiance emitted from the entire surface of the LED patch. In practice, incident radiance arriving from our area light can be approximated, up to some scale, as a summation over a finite number of samples A:

    I ∝ Σ_{a∈A} cos θa / da².   (3)

Given that our light does not employ any reflectors and that the field of view of our cameras is relatively narrow, we believe an isotropic light model offers the best trade-off between fidelity and computational cost. However, should more complex light models be desired, we provide additional calibration sequences for inferring the light's intensity distribution. Guided by similar efforts [1], [17], we capture hundreds of images of a known calibration target, exhibiting largely diffuse reflectance, at different depths and angles. Figure 4 shows two example images from these sequences. Using the provided camera calibration and exposure times, these images should first be photometrically corrected. Intensity variations observed in the resulting images can then be attributed to the extrinsic relationship between the target and light source. Similar approaches have been used to infer radially symmetric light distributions [1] and full 4D light models [17]. We provide separate light distribution calibration data for each lighting configuration.
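The point and area light models of Eqs. (2) and (3) are simple enough to evaluate per surface point. The sketch below is our own illustration, not the published model files; consistent with the discussion above, it omits the unknown scale αL and so returns values only proportional to irradiance.

    # Minimal sketch of the isotropic point-light model (Eq. 2) and the sampled
    # area-light approximation (Eq. 3). All quantities share one coordinate
    # frame, and the surface normal is assumed to be unit length.
    import numpy as np

    def point_light_irradiance(p, normal, light_pos):
        d_vec = light_pos - p                                # surface point -> light
        d = np.linalg.norm(d_vec)
        cos_theta = max(np.dot(normal, d_vec / d), 0.0)      # angle of incidence
        return cos_theta / d**2

    def area_light_irradiance(p, normal, light_samples):
        # light_samples: (N, 3) array of points sampled on the LED patch.
        return sum(point_light_irradiance(p, normal, s) for s in light_samples)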
D. Timestamp Alignment

In order to spatially and temporally align ground-truth data with sensor readings, we employ an implementation of SplineFusion [11]. Here we use a maximum likelihood (ML) state estimator to obtain the set of control points c ∈ SE3 describing the trajectory as a cubic B-spline. As the spline is C²-continuous, we can therefore infer the rotational velocity and linear acceleration at any point for direct comparison with IMU readings. This process yields a temporal offset for each sensor measurement and the biases of each IMU.

VI. EVALUATION METRICS

We propose several metrics for evaluating VIO systems using our dataset. As described in Section III-B, we employ two distinct methods of ground-truthing: (1) trajectory positions, captured with a Leica laser tracker, and (2) relative start and end poses, obtained from the final loop-closure. We therefore propose separate evaluation metrics for each method. To assess how sensitive a VIO algorithm is to the lighting solution, this analysis can also be performed separately for each of the employed light configurations.

When ground-truth position data is available, we follow [6], [24], [19] and evaluate performance using both relative pose error (RPE) and absolute trajectory error (ATE). When the sensor system is used as a handheld rig, we obtain ground-truth poses for the beginning and end of the sequence only. This approach, introduced in the TUM monoVO dataset [5], requires that these segments exhibit loopy camera motion and fixate on the same scene. In a sequence containing n frames, let S ⊂ [1; n] and E ⊂ [1; n] denote the subsets of frames making up the start and end segments, respectively. We concatenate the two segments together and run stereo ORB-SLAM [14] on the resulting sequence to obtain relative positions p̂i ∈ R³ for each frame in S ∪ E. We then independently align these segments with the corresponding position estimates pi yielded by a visual odometry system:

    Tsgt := argmin_{T∈SE3} Σ_{i∈S} (T pi − p̂i)²,   (4)

    Tegt := argmin_{T∈SE3} Σ_{i∈E} (T pi − p̂i)².   (5)

This provides us with two relative transforms for evaluating the alignment error

    ealign = sqrt( (1/n) Σ_{i=1..n} ||Tsgt pi − Tegt pi||² ).

As noted in [5], for fair assessment using this metric, it is imperative that loop-closure is not performed.
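The drift metric above can be sketched in a few lines. The following is an illustration only (the published evaluation scripts remain the reference implementation); it assumes the start and end segments have already been associated with the estimated trajectory, and it computes the SE3 fits of Eqs. (4) and (5) with the closed-form Horn/Umeyama rigid alignment rather than the paper's exact solver.

    # Minimal sketch: align the start (S) and end (E) segments to ground truth
    # with rigid SE3 fits (Eqs. 4 and 5) and evaluate the accumulated-drift
    # metric e_align over the full estimated trajectory.
    import numpy as np

    def rigid_align(p, p_hat):
        # Closed-form least-squares R, t such that R @ p_i + t ~= p_hat_i.
        mu_p, mu_q = p.mean(axis=0), p_hat.mean(axis=0)
        H = (p - mu_p).T @ (p_hat - mu_q)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T
        t = mu_q - R @ mu_p
        return R, t

    def alignment_error(p_all, p_start, phat_start, p_end, phat_end):
        Rs, ts = rigid_align(p_start, phat_start)   # Eq. (4)
        Re, te = rigid_align(p_end, phat_end)       # Eq. (5)
        diff = (p_all @ Rs.T + ts) - (p_all @ Re.T + te)
        return np.sqrt(np.mean(np.sum(diff**2, axis=1)))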
We publish code for computing all evaluation metrics described here. Users need only provide a CSV file representing an estimated trajectory for a given sequence, where each line contains a timestamp followed by a 3-DoF position or 6-DoF pose. We first determine the appropriate SE3 (or optionally SIM3) transform for aligning the two trajectories. A full 6-DoF pose may also be provided, indicated by four additional values representing the unit quaternion. Orientation error will then be assessed using the poses yielded by our batch ML estimator, as described in Section V-D. When evaluating accumulated drift, however, we use the ground-truth 6-DoF pose that is already available.

VII. BENCHMARK

We evaluate the performance of several state-of-the-art visual and visual-inertial odometry systems running on our dataset. Specifically, we analyze the stereo and monocular versions of OKVIS [10], the stereo and monocular versions of ORB-SLAM [14], and the Direct Sparse Odometry (DSO) framework [4]. As our main evaluation metric is accumulated drift, we disabled loop-closure when running ORB-SLAM.

The color-encoded tables presented in Figure 5 illustrate the performance of stereo OKVIS, stereo ORB-SLAM, and monocular ORB-SLAM. Each respective table is further sorted by the onboard light intensity employed. As runtime performance is typically non-deterministic, we run the same sequence 10 times and visualize the results in separate columns. From these results, we can see that stereo OKVIS performs well on all sequences. Stereo ORB-SLAM also exhibits good performance; however, tracking failed on two sequences where the dimmest lighting was employed. ORB-SLAM failed consistently on the same frames, shown in Figure 6. Not surprisingly, both exhibit severe motion blur and poorly saturated image intensities.

Fig. 5: Color-encoded alignment error ealign for each run of three different odometry systems (OKVIS stereo + IMU, ORB-SLAM stereo, and ORB-SLAM mono). Results are broken into three regions based on onboard light intensity (15%, 50%, and 100%), and each row depicts results from running the same trajectory multiple times.

Fig. 6: Frames where stereo ORB-SLAM consistently loses tracking in the TN_015_HH_03 and OD_015_HH_02 data sequences. Both frames exhibit extreme motion blur as well as over- and under-saturated image intensities.

We compare all evaluated frameworks in Figure 7. The plot shows the number of successful runs on the vertical axis, as determined by the alignment-error threshold on the horizontal axis. An ideal framework would have a horizontal line at 300. Notably, stereo ORB-SLAM initially performs better than monocular OKVIS, but flattens out around 250 runs, as tracking was completely lost for many sequences.

Fig. 7: A comparison of several state-of-the-art systems running on the proposed dataset. Here we show the number of successful runs on the vertical axis, as determined by the given alignment error threshold on the horizontal axis. (Curves are shown for OKVIS stereo and mono, ORB-SLAM stereo and mono run forwards and backwards, and DSO run forwards and backwards.)
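Cumulative curves of this kind can be reproduced directly from per-run alignment errors. The sketch below is illustrative only; it assumes matplotlib, and runs that lost tracking can be represented with an infinite error so that they never count as successes.

    # Minimal sketch: count how many runs fall below each alignment-error
    # threshold, producing one cumulative curve of the kind shown in Figure 7.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_success_curve(e_align_per_run, label, max_threshold=10.0):
        errors = np.asarray(e_align_per_run, dtype=float)   # use np.inf for failed runs
        thresholds = np.linspace(0.0, max_threshold, 200)
        successes = [(errors <= t).sum() for t in thresholds]
        plt.plot(thresholds, successes, label=label)
        plt.xlabel("alignment error threshold")
        plt.ylabel("number of successful runs")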
As described in [27], we see that ORB-SLAM exhibits a fair amount of motion bias, performing significantly better when the sequences are run backwards. The same is true for DSO when running on our benchmark. We conjecture this to be a direct result of violating the brightness constancy assumption. In general, corresponding points will become brighter in subsequent frames when tracked through forward motion. Consequently, direct methods will likely match with darker, more distant points. The inferred depth will therefore increase to satisfy the multi-view geometry problem. This phenomenon is illustrated in Figure 8. Note how the scale-drift is reversed under backwards motion.

Fig. 8: Top-down view of point-clouds generated by DSO moving forwards (left) and backwards (right) through a mine. Note the mine shaft itself is of uniform width. Corresponding points have been labeled to aid visual comparison.

In Figure 9, we evaluate each framework's sensitivity to the lighting solution. Both OKVIS and DSO appear to be quite robust to light intensity. This can be attributed to their use of inertial information and direct alignment, making them inherently robust to motion blur. In contrast, we can see that both the stereo and monocular versions of ORB-SLAM are far more sensitive to the lighting.

Fig. 9: Lighting sensitivity analysis of each odometry system. Performance is individually plotted for the three lighting configurations: 15%, 50%, and 100% max intensity.

Finally, we provide individual trajectories generated by each framework, which are indicative of their general performance. The shown sequence was captured in a steam tunnel, has a total distance traveled of around 285 m, and starts and ends in the same location. Both the stereo and monocular versions of OKVIS exhibit very little drift, while the poor performance of monocular ORB-SLAM and DSO is largely the result of extreme scale drift. We note, however, that both systems still manage to reconstruct the overall trajectory well.

Fig. 10: Top-down views of the trajectories produced by each system (OKVIS-Stereo, OKVIS-Mono, ORB-Mono, and DSO) on the TN_100_HH_02 data sequence. Here, the total distance traveled is approximately 285 m. The starting position of each trajectory is at the origin (0, 0).

VIII. DATA ACCESS

Our entire dataset, including the raw data streams and calibration sequences, is publicly available online at: http://arpg.colorado.edu/research/oivio. To facilitate the evaluation of different frameworks, we also provide scripts for changing image resolution and converting from the ASL format to ROS bags. Additionally, code for performing quantitative and qualitative analysis, as well as for generating all figures presented here, is also provided.

ACKNOWLEDGMENTS

We would like to thank Cesar Galan for his work in designing and constructing our sensor rig. Financial support from the DARPA Subterranean Challenge Award HR0011-18-2-0043 is also gratefully acknowledged.
REFERENCES

[1] Takahito Aoto, Tomokazu Sato, Yasuhiro Mukaigawa, and Naokazu Yokoya. Linear estimation of 4-D illumination light field from diffuse reflections. In Asian Conference on Pattern Recognition (IAPR), 2013.
[2] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. Achtelik, and R. Siegwart. The EuRoC MAV Datasets. International Journal of Robotics Research (IJRR), 2015.
[3] Defense Advanced Research Projects Agency (DARPA). Subterranean challenge, 2018. http://www.subtchallenge.com [accessed: 2019-02-01].
[4] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct Sparse Odometry. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
[5] Jakob Engel, Vladyslav Usenko, and Daniel Cremers. A Photometrically Calibrated Benchmark For Monocular Visual Odometry. arXiv preprint, 2016.
[6] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012.
[7] Ankur Handa, Thomas Whelan, John McDonald, and Andrew J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In International Conference on Robotics and Automation (ICRA), 2014.
[8] M. Irani and P. Anandan. About Direct Methods. Vision Algorithms, 2000.
[9] Christian Kerl, Jurgen Sturm, and Daniel Cremers. Dense visual SLAM for RGB-D cameras. In Intelligent Robots and Systems (IROS), 2013.
[10] Stefan Leutenegger, Simon Lynen, Michael Bosse, Roland Siegwart, and Paul Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization. International Journal of Robotics Research (IJRR), 2014.
[11] Steven Lovegrove, Alonso Patron-Perez, and Gabe Sibley. Spline Fusion: A continuous-time representation for visual-inertial fusion with application to rolling shutter cameras. In British Machine Vision Conference (BMVC), 2013.
[12] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 Year, 1000 km: The Oxford RobotCar Dataset. International Journal of Robotics Research (IJRR), 2017.
[13] Michael Milford, Eleonora Vig, Walter Scheirer, and David Cox. Vision-based Simultaneous Localization and Mapping in Changing Outdoor Environments. Journal of Field Robotics (JFR), 2014.
[14] Raul Mur-Artal and Juan Tardos. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. Transactions on Robotics, 2017.
[15] Richard Newcombe, Steven Lovegrove, and Andrew Davison. DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), 2011.
[16] Fernando Nobre, Michael Kasper, and Christoffer Heckman. Drift-correcting self-calibration for visual-inertial SLAM. In International Conference on Robotics and Automation (ICRA), 2017.
[17] Jaesik Park, Sudipta Sinha, Yasuyuki Matsushita, Yu-Wing Tai, and In So Kweon. Calibrating a non-isotropic near point light source using a plane. In Computer Vision and Pattern Recognition (CVPR), 2014.
[18] Seonwook Park, Thomas Schöps, and Marc Pollefeys. Illumination Change Robustness in Direct Visual SLAM. International Conference on Robotics and Automation (ICRA), 2017.
[19] Bernd Pfrommer, Nitin Sanket, Kostas Daniilidis, and Jonas Cleveland. PennCOSYVIO: A challenging Visual Inertial Odometry benchmark. In International Conference on Robotics and Automation (ICRA), 2017.
[20] Nicholas Roy, Henrik Christensen, Dieter Fox, M. Ani Hsieh, Stuart Young, Simon Ng, Dirk Schulz, Ethan Stump, Carlos Nieto-Grand, and Christopher Reardon. International Conference on Robotics and Automation (ICRA) Workshop on Robot Teammates Operating in Dynamic, Unstructured Environments (RT-DUNE), 2018. https://manihsieh.com/icra-2018-workshop/ [accessed: 2019-02-01].
[21] Dirk Schnieders, Kwan-Yee Wong, and Zhenwen Dai. Polygonal Light Source Estimation. In Asian Conference on Computer Vision (ACCV), 2009.
[22] Boxin Shi, Zhe Wu, Zhipeng Mo, Dinglong Duan, Sai-Kit Yeung, and Ping Tan. A Benchmark Dataset and Evaluation for Non-Lambertian and Uncalibrated Photometric Stereo. In Computer Vision and Pattern Recognition (CVPR), 2016.
[23] Zijiang Song and Reinhard Klette. Robustness of Point Feature Detection. Computer Analysis of Images and Patterns (CAIP), 2013.
[24] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Intelligent Robots and Systems (IROS), 2012.
[25] Wennie Tabib and Nathan Michael. Robotics: Science and Systems (RSS) Workshop on Challenges and Opportunities for Resilient Collective Intelligence in Subterranean Environments, 2018. http://rssworkshop18.autonomousaerialrobot.com/ [accessed: 2019-02-01].
[26] Robert J. Woodham. Photometric Stereo: A Reflectance Map Technique For Determining Surface Orientation From Image Intensity. Image Understanding Systems and Industrial Applications, 1979.
[27] Nan Yang, Rui Wang, Xiang Gao, and Daniel Cremers. Challenges in Monocular Visual Odometry: Photometric Calibration, Motion Bias and Rolling Shutter Effect. arXiv preprint, 2018.
[28] Wei Zhou and Chandra Kambhamettu. Estimation of the Size and Location of Multiple Area Light Sources. In International Conference on Pattern Recognition (ICPR), 2004.
