
Journal Pre-proof

An improved SLAM based on RK-VIF: Vision and inertial information fusion via
Runge-Kutta method

Jia-shan Cui, Fang-rui Zhang, Dong-zhu Feng, Cong Li, Fei Li, Qi-chen Tian

PII: S2214-9147(21)00193-8
DOI: https://doi.org/10.1016/j.dt.2021.10.009
Reference: DT 932

To appear in: Defence Technology

Received Date: 3 August 2021


Revised Date: 18 September 2021
Accepted Date: 25 October 2021

Please cite this article as: Cui J-s, Zhang F-r, Feng D-z, Li C, Li F, Tian Q-c, An improved SLAM based
on RK-VIF: Vision and inertial information fusion via Runge-Kutta method, Defence Technology (2021),
doi: https://doi.org/10.1016/j.dt.2021.10.009.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition
of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of
record. This version will undergo additional copyediting, typesetting and review before it is published
in its final form, but we are providing this version to give early visibility of the article. Please note that,
during the production process, errors may be discovered which could affect the content, and all legal
disclaimers that apply to the journal pertain.

© 2021 China Ordnance Society. Publishing services by Elsevier B.V. on behalf of KeAi
Communications Co. Ltd.
Title Page

An improved SLAM based on RK-VIF: vision and inertial information fusion via Runge-Kutta method
Jia-shan Cui1, Fang-rui Zhang1, Dong-zhu Feng1*, Cong Li2, Fei Li3, Qi-
chen Tian4

1 School of Aerospace Science and Technology, Xidian University, Xi'an 710126, China
2 China Academy of Space Technology, Xi'an Branch, Xi'an 710100, China
3 China Academy of Launch Vehicle Technology, Beijing 100076, China
4 Aerospace Information Research Institute, Chinese Academy of Sciences, Jinan 250100, China

*Correspondence: Dongzhu Feng


Affiliation: Xidian University, Xi’an, China
E-mail: jscui@xidian.edu.cn; dzhfeng@xidian.edu.cn
Tel.: +86-29-8189-1034
An improved SLAM based on RK-VIF: vision and inertial

information fusion via Runge-Kutta method

Abstract

Simultaneous Localization and Mapping (SLAM) is the foundation of autonomous navigation for unmanned systems. Existing SLAM solutions are mainly divided into visual SLAM (vSLAM), which relies on cameras, and lidar SLAM, which relies on lidar. Because pure visual SLAM suffers from shortcomings such as low positioning accuracy, this paper proposes a visual-inertial information fusion SLAM based on Runge-Kutta improved pre-integration. First, the Inertial Measurement Unit (IMU) information between two adjacent keyframes is pre-integrated at the front-end to provide IMU constraints for visual-inertial information fusion. In particular, to improve the pre-integration accuracy, the paper uses the Runge-Kutta algorithm instead of the Euler integral to calculate the pre-integration value at the next moment. Then, the IMU pre-integration value is used as the initial value of the system state at the current frame time. We combine the visual reprojection error and the IMU pre-integration error to optimize state variables such as speed and pose, and recover the three-dimensional coordinates of map points. Finally, we set a sliding window to optimize the map points' coordinates and the state variables. The experimental part is divided into a dataset experiment and a complex indoor-environment experiment. The results show that, compared with pure visual SLAM and existing visual-inertial fusion SLAM systems, our method achieves higher positioning accuracy.

Keywords: SLAM; Visual-inertial positioning; Sensor fusion; Unmanned system; Runge-Kutta

1 Introduction

Recently, autonomous navigation based on unmanned systems is gradually being used in various fields. SLAM

is the foundation and core of autonomous navigation [1]. In SLAM, an unmanned system builds a map of an unknown environment while estimating its own trajectory during movement [2]. Vision sensors estimate the pose and create an environmental map from the captured image sequence. However, when the motion is too fast or environmental features are missing, their accuracy and robustness decrease sharply, sometimes even leading to positioning failure [3][4].

With the continuous advancement of hardware technology, low-cost IMUs have become ubiquitous [5]. IMUs and vision sensors are strongly complementary in performance. Therefore, the Visual-Inertial Navigation System (VINS) has gradually become a hot topic in the current SLAM field [7].


The main goal of VINS is to better couple IMU measurement and image information. According to the coupling

scheme, VINS can be divided into loosely coupled systems and tightly coupled systems. The loose coupling uses

the knowledge of the camera and IMU to estimate the motion of the system and then combines the estimation results.

In contrast, the tight coupling method establishes a unified objective function for the camera and IMU state

information and then performs the system motion estimation.

According to the data processing method, VINS can be divided into the filtering-based method and

optimization-based method. Filtering-based methods generally choose the EKF framework, which is divided into

two stages: prediction and update. The prediction stage uses IMU information to predict the system pose; and then

in the update stage, the system state is corrected and optimized according to the image information. The optimization-based method considers the data over a past period and iteratively optimizes the system state [8][9]. Because the filtering-based method only considers data at the current moment and only uses image information when updating the state, its accuracy is limited; therefore, current mainstream VINS implementations choose the optimization-based method.
Limited by processor performance in the early days, VINS research was mainly based on filtering. Mourikis et al. proposed MSCKF, a filter-based tightly coupled scheme, which is the first relatively mature scheme in the field of VINS [10]. MSCKF uses a single feature point to constrain multiple camera motion states at different moments and uses a multi-state constrained extended Kalman filter as the back-end. At the same time, it adopts a sliding window strategy to ensure the real-time performance of the system. However, since MSCKF uses only a monocular camera, the accuracy of its visual-inertial odometry (VIO) is clearly lower than that of stereo-camera systems. The SLAM problem of autonomous underwater vehicles (AUVs) is addressed in [11]: aiming at the nonlinear models and non-Gaussian noise in AUV motion, the authors propose an improved variance-reduction FastSLAM with simulated annealing to solve the problems of particle degradation, particle depletion and particle loss in traditional FastSLAM.

Bloesch et al. proposed a robust visual-inertial odometry(ROVIO), which also uses a tight coupling solution based

on filtering. The ROVIO front-end adopts the direct method to track the camera movement by minimizing the

photometric error of the image block, and the extended Kalman filter at the back-end can follow the image block

features and three-dimensional landmark points at the same time [12]. However, ROVIO doesn't have a loop detection module; it is only an odometry and accumulates a sizeable error over long-term operation.

In recent years, with the continuous improvement of processor performance, the optimized VINS has gradually

become the current hot point. Leutenegger et al. proposed the keyframe-based visual-inertial SLAM using nonlinear

optimization(OKVIS)[13]. Its front-end uses multi-scale Harris corner detection features and BRISK descriptors.

OKVIS pre-integrates the IMU information between keyframes in the back-end and then estimates the system motion from vision and IMU information. However, the scheme doesn't have loop detection or pose graph optimization, which inevitably causes error accumulation. Mur-Artal et al. proposed VIORB-SLAM [14][15] based on ORB-SLAM by fusing IMU information, which also adopts a tightly coupled scheme based on optimization. Its front-end uses ORB features, pre-integrates IMU data, and then uses the g2o library to fuse the two kinds of information to estimate the system motion. The accuracy of VIORB-SLAM is much better than monocular ORB-SLAM. Still, the code is not open-source, and it only supports monocular cameras, which require complex initialization before use [16].

Shen Shaojie et al. proposed the robust and versatile monocular visual-inertial state estimator (VINS-Mono), which also uses a tightly coupled scheme based on optimization. However, VINS-Mono only supports monocular cameras. Its front-end uses the KLT optical flow method to track Harris corners. It also uses the pre-integration idea to process IMU data and then fuses the two sensors' information through iterative optimization [17]. In addition, VINS-Mono can be initialized in an unknown environment and has a loop detection module [18][19], but its accuracy needs to be further improved. Later, the same team released VINS-Fusion based on VINS-Mono. VINS-Fusion retains the optimization-based tightly coupled fusion scheme and supports multiple sensor combinations (pure stereo, stereo + IMU, and stereo + IMU + GPS). Compared with VINS-Mono, it can be initialized statically, and the scale information doesn't entirely rely on the IMU, which makes it more robust. However, because of the mismatch problem in stereo vision, its accuracy is not even as good as VINS-Mono in some situations.
Although existing visual-inertial information fusion SLAM systems are emerging one after another, the visual and IMU information still cannot be fused effectively. Resolving the frequency inconsistency between the camera and the IMU and fully integrating the visual and IMU information is still an urgent problem for VINS. A suitable SLAM system should fully integrate visual and IMU information, especially to improve the positioning accuracy and robustness of the system in extreme situations such as a lack of environmental information or rapid carrier movement. Therefore, this paper proposes a visual-inertial information fusion SLAM based on Runge-Kutta improved pre-integration, aiming to improve the accuracy and robustness of the system.

The sections of our paper are arranged as follows: the second section introduces the visual-inertial information fusion framework. The third section introduces the SLAM method of visual-inertial information fusion based on fourth-order Runge-Kutta improved pre-integration. The fourth section introduces the experimental part of our paper, including dataset experiments and experiments in a complex indoor environment. Finally, the fifth section summarizes the paper and gives conclusions.

2 RK-VIF: System Overview

Pure visual SLAM can estimate the pose of an unmanned system and create a feature map in an unknown environment. Visual information has no data drift, and the positioning accuracy is high in areas with rich image features; however, if image features are scarce or the motion is too fast, it is difficult for pure visual SLAM to work robustly, and positioning may even fail. The IMU does not depend on environmental characteristics, and its measurement frequency is higher than the camera's. IMU relative displacement data have high accuracy over a short time, which can compensate for visual blurring caused by fast motion; however, relying on the IMU for too long causes errors to accumulate over time and produces extremely serious drift. Conversely, visual SLAM can constrain the long-term divergence of the IMU. Thus, the IMU and the camera are highly complementary in performance. Therefore, this paper fuses IMU information into visual SLAM and studies visual-inertial information fusion SLAM to improve the system's positioning accuracy. As shown in Fig.1, the visual-inertial information fusion SLAM framework of this paper mainly includes four modules: measurement data processing, system initialization, joint state estimation with sliding-window local optimization, and loop detection.


Fig.1 Visual-inertial Information Fusion Framework.



3 Measurement process

3.1 Feature extraction and matching

In the visual SLAM based on the feature point method, feature point extraction and matching are first required.

The purpose is to determine the projection of the same spatial point in different images according to the extracted

features. Feature extraction generally includes two parts: feature point detection and description. First, the critical

information in the image is detected, and then it is represented by feature descriptors. Then, by comparing the

distances of feature descriptors, the correspondence between feature points in different images can be matched.

Usually, there are many mismatched points in the matched image, which will affect the accuracy of the subsequent

pose estimation. Therefore, it is necessary to use the mismatch elimination algorithm to eliminate the abnormal

points.

The current visual SLAM method usually uses the Random Sampling Consensus(RANSAC) algorithm to

eliminate mismatched points. Although the RANSAC algorithm can effectively eliminate mismatched points, the

randomness of the algorithm itself causes the number of iterations to be unstable, which in turn causes problems

such as low elimination efficiency. Therefore, this paper introduces the Progressive Sampling Consensus(PROSAC)
algorithm[20] to eliminate mismatched points. The PROSAC algorithm can solve the problems of unstable iterations

and poor robustness of the RANSAC algorithm, effectively improve the algorithm's efficiency, and provide a

guarantee for the real-time performance of SLAM.

The PROSAC is based on the assumption that "the points with good quality are more likely to be interior

points"[21]. Therefore, when selecting sample points, the points with good quality will be chosen first. Specifically,

the PROSAC algorithm first uses certain criteria to evaluate the quality of all sample points and then sorts them by the evaluation results; only high-quality sample points are selected to estimate the matching model in each iteration. Through this sorting strategy, the randomness of sampling is avoided, the probability of obtaining the correct model is improved, and the number of iterations is reduced.

The steps of the PROSAC algorithm are shown in Fig.2:

Input: maximum number of iterations I_m, interior-point judgment error threshold ε, interior-point number threshold N_in
Output: interior-point set and homography matrix H

Step 1: For each matched feature point, calculate the minimum Euclidean distance d_min to the other feature points in the matching image and the second-smallest distance d_min2; the Euclidean distance ratio β = d_min / d_min2 and the matching factor ρ = 1/(β·d_min) are used to evaluate the quality of the match;
Step 2: Choose the m points with the best quality; take four points as a combination and arrange the combinations in descending order of their quality sums;
Step 3: Select the four matching points with the highest quality to calculate the homography matrix H;
Step 4: Calculate the projection points of the other feature points in the other image according to the matrix H;
Step 5: Calculate the error e between each matching point and its projection point; if e < ε, the point is determined to be an interior point, otherwise an exterior point;
Step 6: Count the number S of interior points; if S > N_in, update the number of interior points to S; otherwise, add 1 to the number of iterations I and return to Step 3;
Step 7: Using the updated S interior points, recalculate the homography matrix H and re-determine the interior points;
Step 8: If I ≤ I_m, return the new interior-point set and homography matrix H; otherwise, no suitable model can be found.

Fig.2 The process of the PROSAC algorithm.
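As an illustration of the sampling strategy in Fig.2, the following C++ sketch shows the structure of a PROSAC-style loop. The model type and the fitModel/pointError callables are assumed placeholders (for example, a four-point homography fit and a reprojection error); the quality evaluation and termination details follow the simplified description above rather than the paper's exact implementation.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct Match {
    double quality;  // e.g. the matching factor computed in Step 1 (higher is better)
    int    idx;      // index of the correspondence
};

// Usage: prosacLoop<Homography>(matches, fitModel, pointError, ...), with
// fitModel(indices) -> Model and pointError(model, idx) -> double supplied by the caller.
template <typename Model, typename FitFn, typename ErrFn>
bool prosacLoop(std::vector<Match> matches, FitFn fitModel, ErrFn pointError,
                int maxIters, double errThresh, std::size_t minInliers,
                Model* bestModel, std::vector<int>* bestInliers) {
    // Step 2: sort correspondences so that high-quality matches are sampled first.
    std::sort(matches.begin(), matches.end(),
              [](const Match& a, const Match& b) { return a.quality > b.quality; });

    std::mt19937 rng(42);
    std::size_t pool = std::min<std::size_t>(4, matches.size());  // progressive sampling pool
    for (int iter = 0; iter < maxIters; ++iter) {
        // Step 3: draw a four-point sample from the current top-quality pool.
        std::vector<int> ids(pool);
        for (std::size_t i = 0; i < pool; ++i) ids[i] = matches[i].idx;
        std::shuffle(ids.begin(), ids.end(), rng);
        ids.resize(std::min<std::size_t>(4, ids.size()));
        Model model = fitModel(ids);

        // Steps 4-6: classify every correspondence against the hypothesis.
        std::vector<int> inliers;
        for (const Match& m : matches)
            if (pointError(model, m.idx) < errThresh) inliers.push_back(m.idx);

        if (inliers.size() >= minInliers) {
            *bestModel = fitModel(inliers);  // Step 7: refit on all interior points
            *bestInliers = inliers;
            return true;
        }
        pool = std::min(pool + 1, matches.size());  // enlarge the pool for the next attempt
    }
    return false;  // Step 8: no suitable model found within maxIters
}
```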

3.2 IMU pre-integration based on Runge-Kutta

In the VINS, the camera is an external sensor that collects environmental information and determines the

system's pose during movement; IMU is an internal sensor that provides the system's movement information and

restores the gravity direction of the visual SLAM. The IMU measurement value will be affected by Gaussian noise

and zero bias. To simplify the model, it is assumed that each axis of the accelerometer and gyroscope is measured independently and that the influence of the Earth's rotation is neglected. The measurement process can be modeled as follows [22][23]:

$$\tilde{\omega}_B = \omega_B + b_g + \eta_g, \qquad \tilde{a}_B = a_B + b_a + \eta_a = R_{WB}^{T}\left(a_W - g_W\right) + b_a + \eta_a \tag{1}$$

$\tilde{\omega}_B \in \mathbb{R}^3$ and $\tilde{a}_B \in \mathbb{R}^3$ represent the measured values of angular velocity and acceleration in the body coordinate system, while $\omega_B \in \mathbb{R}^3$ and $a_B \in \mathbb{R}^3$ represent the true values. $R_{WB}$ represents the rotation matrix from the body coordinate system to the world coordinate system. Meanwhile, $\eta_g$ and $\eta_a$ represent the measurement noise of the gyroscope and the accelerometer, which follows a zero-mean Gaussian distribution; $b_g$ and $b_a$ represent the zero biases of the gyroscope and the accelerometer, whose derivatives are Gaussian white noise.

According to the rigid-body kinematics model, the motion state of the body frame in the world coordinate system satisfies the following differential equations:

$$\dot{R}_{WB} = R_{WB}\,\omega_{WB|B}^{\wedge}, \qquad \dot{v}_{B|W} = a_{B|W}, \qquad \dot{p}_{B|W} = v_{B|W} \tag{2}$$

In the formula, $a_{B|W}$ represents the acceleration of the body frame expressed in the world coordinate system, and $v_{B|W}$ represents the velocity of the body frame in the world coordinate system.

Suppose the time difference between two IMU measurements is $\Delta t$. Assuming that the acceleration $a_{B|W}$ and the angular velocity $\omega_{WB|B}$ remain constant within $[t, t+\Delta t]$, substituting the IMU measurement model of formula (1) into formula (2) gives:

$$
\begin{aligned}
R_{WB}(t+\Delta t) &= R_{WB}(t)\,\operatorname{Exp}\!\left(\left(\tilde{\omega}_{WB|B}(t) - b_g(t) - \eta_{gd}(t)\right)\Delta t\right)\\
v_{B|W}(t+\Delta t) &= v_{B|W}(t) + g_W\Delta t + R_{WB}(t)\left(\tilde{a}_B(t) - b_a(t) - \eta_{ad}(t)\right)\Delta t\\
p_{B|W}(t+\Delta t) &= p_{B|W}(t) + v_{B|W}(t)\Delta t + \tfrac{1}{2}\,g_W\,\Delta t^2 + \tfrac{1}{2}\,R_{WB}(t)\left(\tilde{a}_B(t) - b_a(t) - \eta_{ad}(t)\right)\Delta t^2
\end{aligned} \tag{3}
$$

In the formula, $\eta_{gd}(t)$ and $\eta_{ad}(t)$ are the discrete values of the random noise $\eta_g(t)$ and $\eta_a(t)$ respectively.

Traditional pre-integration uses the Euler integral to discretize the IMU motion formula and then derives the IMU pre-integration formula. Our paper uses the Runge-Kutta method to discretize the IMU motion formula to improve the pre-integration accuracy. The basic idea of the Runge-Kutta method is to estimate the derivatives at several points in the integration interval, take a weighted average of these derivatives to obtain the average derivative, and then use the average derivative to estimate the result at the end of the interval. A higher-order Runge-Kutta algorithm leads to a smaller truncation error and therefore higher accuracy. In practical applications, the fourth-order Runge-Kutta algorithm already meets most accuracy requirements. Considering the processor's performance, this paper uses the fourth-order Runge-Kutta algorithm.

Given the following differential equation:

$$\dot{y}(t) = f\left(t, y(t)\right) \tag{4}$$

the Euler integral uses the derivative $f(t_n, y_n)$ at point $t_n$ as the derivative of the whole interval $[t_n, t_{n+1}]$ and updates the state $y_n$ at point $t_n$:

$$y_{n+1} = y_n + f\left(t_n, y_n\right)\Delta t \tag{5}$$

In the formula, $\Delta t = t_{n+1} - t_n$. The Euler integral only uses the derivative information at point $t_n$, so it has a larger estimation error.

The Runge-Kutta algorithm is an improvement of the Euler integral. It selects four points in the interval $[t_n, t_{n+1}]$, calculates the derivatives $k_1 \sim k_4$ at these points iteratively, and then takes a weighted average of these derivatives to calculate the next state $y_{n+1}$. The formula of the fourth-order Runge-Kutta algorithm is as follows:

$$y_{n+1} = y_n + \frac{\Delta t}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right) = y_n + k_{RK}\,\Delta t \tag{6}$$

In the formula,

$$
\begin{aligned}
k_1 &= f\left(t_n,\; y_n\right)\\
k_2 &= f\left(t_n + \tfrac{\Delta t}{2},\; y_n + k_1\tfrac{\Delta t}{2}\right)\\
k_3 &= f\left(t_n + \tfrac{\Delta t}{2},\; y_n + k_2\tfrac{\Delta t}{2}\right)\\
k_4 &= f\left(t_n + \Delta t,\; y_n + k_3\,\Delta t\right)
\end{aligned} \tag{7}
$$

In formula (7), $k_1$ is the derivative at the beginning of the interval $(t_n, y_n)$; $\left(t_n + \tfrac{\Delta t}{2},\, y_n + k_1\tfrac{\Delta t}{2}\right)$ is the midpoint of the interval estimated by the Euler method using $k_1$ as the average derivative, and $k_2$ is the derivative at this point; $\left(t_n + \tfrac{\Delta t}{2},\, y_n + k_2\tfrac{\Delta t}{2}\right)$ is the midpoint estimated using $k_2$ as the average derivative, and $k_3$ is its derivative; $\left(t_n + \Delta t,\, y_n + k_3\Delta t\right)$ is the end point of the interval estimated using $k_3$ as the average derivative, and $k_4$ is its derivative. Formula (7) shows that the algorithm combines the derivative information of the four points, and the derivatives at the two midpoints carry a higher weight.
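As an illustration of formulas (6) and (7), the following minimal C++ sketch implements one fourth-order Runge-Kutta step for a generic state type and applies it to the toy equation dy/dt = -y; it only demonstrates the numerical scheme and is not the paper's implementation.

```cpp
#include <cstdio>
#include <functional>

// One fourth-order Runge-Kutta step, formulas (6)-(7), for a generic state type.
template <typename State>
State rk4Step(const std::function<State(double, const State&)>& f,
              double t, const State& y, double dt) {
    State k1 = f(t, y);                               // slope at the start of the interval
    State k2 = f(t + 0.5 * dt, y + k1 * (0.5 * dt));  // slope at the Euler-predicted midpoint
    State k3 = f(t + 0.5 * dt, y + k2 * (0.5 * dt));  // midpoint slope re-estimated with k2
    State k4 = f(t + dt, y + k3 * dt);                // slope at the predicted end point
    // Weighted average of the four slopes; the two midpoint slopes count double.
    return y + (k1 + k2 * 2.0 + k3 * 2.0 + k4) * (dt / 6.0);
}

int main() {
    // Example: integrate dy/dt = -y from y(0) = 1; the exact solution is exp(-t).
    std::function<double(double, const double&)> f =
        [](double /*t*/, const double& y) { return -y; };
    double y = 1.0, t = 0.0, dt = 0.1;
    for (int i = 0; i < 10; ++i) { y = rk4Step(f, t, y, dt); t += dt; }
    std::printf("y(1.0) ~= %.6f (exact 0.367879)\n", y);
    return 0;
}
```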

Our paper uses the Runge-Kutta method to discretize the IMU motion formula in formula (3):

$$
\begin{aligned}
R_{WB}^{k+1} &= R_{WB}^{k}\,\operatorname{Exp}\left(\omega_{WB\text{-}RK|B}\,\Delta t\right)\\
v_{B|W}^{k+1} &= v_{B|W}^{k} + g_W\Delta t + R_{WB}^{k}\,a_{B\text{-}RK}\,\Delta t\\
p_{B|W}^{k+1} &= p_{B|W}^{k} + v_{B|W}^{k}\Delta t + \tfrac{1}{2}\,g_W\,\Delta t^2 + \tfrac{1}{2}\,R_{WB}^{k}\,a_{B\text{-}RK}\,\Delta t^2
\end{aligned} \tag{8}
$$

$\omega_{WB\text{-}RK|B}$ and $a_{B\text{-}RK}$ are the weighted average values of the angular velocity and acceleration of the system obtained by the Runge-Kutta algorithm within $[k, k+1]$. Compared with using only the angular velocity $\omega_{WB|B}^{k}$ and acceleration $a_{B}^{k}$ at moment $k$ to recursively calculate the system state at moment $k+1$, $\omega_{WB\text{-}RK|B}$ and $a_{B\text{-}RK}$ have higher accuracy.

As shown in Fig.3, the IMU frequency is much higher than the camera frequency, and there is a large amount of IMU data between two adjacent keyframes [24][25]. Therefore, the IMU data cannot be fused with the camera data directly. The IMU data between the two frames must be integrated to align the frequencies of the two sensors.

Fig.3 Sensor frequency.

Assuming that the IMU information and the camera images are synchronized in time and only the frequencies differ, we accumulate the IMU information from moment $i$ to moment $j$. The state variables such as the rotation, velocity, and translation of the system at moment $i$ can then be updated directly to moment $j$. Combining formula (8), the numerical integration of the IMU information between adjacent keyframes $i$ and $j$ satisfies the following formula:

$$
\begin{aligned}
R_{WB}^{j} &= R_{WB}^{i}\prod_{k=i}^{j-1}\operatorname{Exp}\left(\omega_{WB\text{-}RK|B}^{k}\,\Delta t\right)\\
v_{B|W}^{j} &= v_{B|W}^{i} + \sum_{k=i}^{j-1}\left(g_W + R_{WB}^{k}\,a_{B\text{-}RK}^{k}\right)\Delta t\\
p_{B|W}^{j} &= p_{B|W}^{i} + \sum_{k=i}^{j-1}\left(v_{B|W}^{k}\,\Delta t + \tfrac{1}{2}\,g_W\,\Delta t^2 + \tfrac{1}{2}\,R_{WB}^{k}\,a_{B\text{-}RK}^{k}\,\Delta t^2\right)
\end{aligned} \tag{9}
$$

Formula (9) describes the state estimation obtained from the IMU information from moment $i$ to moment $j$. However, if $R_{WB}^{i}$ changes during the optimization process, all the other rotations $R_{WB}^{k}\;(k = i, \ldots, j-1)$ will change, and all the integrals need to be recalculated. In order to avoid repeated integration, we use the pre-integration form to define a set of variables that are not related to the initial pose and velocity:

$$
\begin{aligned}
\Delta R_{i,j} &= R_{BW}^{i} R_{WB}^{j} = \prod_{k=i}^{j-1}\operatorname{Exp}\left(\omega_{WB\text{-}RK|B}^{k}\,\Delta t\right)\\
\Delta v_{i,j} &= R_{BW}^{i}\left(v_{B|W}^{j} - v_{B|W}^{i} - g_W\,\Delta t_{i,j}\right) = \sum_{k=i}^{j-1}\Delta R_{i,k}\,a_{B\text{-}RK}^{k}\,\Delta t\\
\Delta p_{i,j} &= R_{BW}^{i}\left(p_{B|W}^{j} - p_{B|W}^{i} - v_{B|W}^{i}\,\Delta t_{i,j} - \tfrac{1}{2}\,g_W\sum_{k=i}^{j-1}\Delta t^2\right) = \sum_{k=i}^{j-1}\left(\Delta v_{i,k}\,\Delta t + \tfrac{1}{2}\,\Delta R_{i,k}\,a_{B\text{-}RK}^{k}\,\Delta t^2\right)
\end{aligned} \tag{10}
$$

In formula, Ri , k  RBW


i k
RWB
k
i

, Δvi , k  RBW vB|W  vB|W  gW Δti , k . The left side of formula (10) is the relative
i

motion increment between adjancent keyframes i and j , and the right side is the IMU pre-integration

measurement model. The definition makes the right side of the formula independent of the state and gravity at

moment i . Repeated integration in the calculation process is solved, and calculation is significantly reduced.
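To make the accumulation in formula (10) concrete, the following is a minimal C++/Eigen sketch of the pre-integration loop. It uses a simple mid-point average of consecutive IMU samples as a stand-in for the full fourth-order Runge-Kutta averaging described above, and it omits bias Jacobians and noise-covariance propagation; all type and function names (ImuSample, Preintegration, Exp) are illustrative, not the paper's code.

```cpp
#include <Eigen/Dense>
#include <vector>

// SO(3) exponential map: rotation vector -> rotation matrix.
static Eigen::Matrix3d Exp(const Eigen::Vector3d& phi) {
    double angle = phi.norm();
    if (angle < 1e-12) return Eigen::Matrix3d::Identity();
    return Eigen::AngleAxisd(angle, phi / angle).toRotationMatrix();
}

struct ImuSample {
    double t;              // timestamp [s]
    Eigen::Vector3d gyro;  // angular velocity, body frame [rad/s]
    Eigen::Vector3d acc;   // specific force, body frame [m/s^2]
};

struct Preintegration {
    Eigen::Matrix3d dR = Eigen::Matrix3d::Identity();  // Delta R_{i,j}
    Eigen::Vector3d dv = Eigen::Vector3d::Zero();      // Delta v_{i,j}
    Eigen::Vector3d dp = Eigen::Vector3d::Zero();      // Delta p_{i,j}
    double dt_ij = 0.0;

    void integrate(const std::vector<ImuSample>& samples,
                   const Eigen::Vector3d& bg, const Eigen::Vector3d& ba) {
        for (std::size_t k = 0; k + 1 < samples.size(); ++k) {
            double dt = samples[k + 1].t - samples[k].t;
            // Averaged (mid-point) measurements over [k, k+1], bias removed.
            Eigen::Vector3d w = 0.5 * (samples[k].gyro + samples[k + 1].gyro) - bg;
            Eigen::Vector3d a = 0.5 * (samples[k].acc + samples[k + 1].acc) - ba;
            // Accumulate formula (10): position first (uses current dv, dR),
            // then velocity, then rotation.
            dp += dv * dt + 0.5 * (dR * a) * dt * dt;
            dv += (dR * a) * dt;
            dR = dR * Exp(w * dt);
            dt_ij += dt;
        }
    }
};
```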
3.3 System initialization

VINS initialization is a loosely coupled process of visual estimation and IMU estimation [26]. First, visual initialization is performed to estimate the initial poses of the system from the matched images. Then the zero biases of the gyroscope and accelerometer are initialized from the estimation results, and the IMU velocity is calculated.

Assuming that the gyroscope bias remains constant between two consecutive keyframes, we minimize the difference between the gyroscope pre-integrated rotation and the relative rotation estimated by vision to calculate the gyroscope bias:

$$b_g^{*} = \arg\min_{b_g}\sum_{i=1}^{N-1}\left\|\operatorname{Log}\left(\left(\Delta R_{i,i+1}\operatorname{Exp}\left(J_{\Delta R}^{g}\,b_g\right)\right)^{T} R_{BW}^{i} R_{WB}^{i+1}\right)\right\|^{2} \tag{11}$$

$N$ is the number of keyframes required for initialization, $\Delta R_{i,i+1}$ is the pre-integrated rotation measurement between adjacent keyframes, and $R_{BW}^{i} R_{WB}^{i+1}$ is the relative rotation estimated by vision. Formula (11) is a least-squares problem, and the Gauss-Newton method can find the gyroscope's zero bias. After the gyroscope zero bias is calculated, in order to eliminate its influence on the pre-integration, it is necessary to recalculate the pre-integration from the IMU data.


In order to obtain the zero bias of the accelerometer, we first estimate the gravitational acceleration from the poses of visual SLAM and then calculate the zero bias [27]. The conversion relationship between the body coordinate system and the camera coordinate system in the world coordinate system satisfies:

$$p_{B|W} = p_{C|W} + R_{WC}\,p_{B|C} \tag{12}$$

$p_{B|W}$ and $p_{C|W}$ respectively represent the positions of the IMU and the camera in the world coordinate system, $R_{WC}$ is the rotation matrix of the camera coordinate system relative to the world coordinate system, and $p_{B|C}$ is the position of the IMU in the camera coordinate system.

Introducing the constraint that the magnitude of the gravitational acceleration is $G = 9.8\,\mathrm{m/s^2}$, suppose a body coordinate system $I$ that shares its origin with the world coordinate system. The rotation matrix between the coordinate system $I$ and the world coordinate system is $R_{WI}$. The direction of gravitational acceleration in the coordinate system $I$ is $\hat{g}_I = [0,0,1]^T$, and the direction of gravitational acceleration in the world coordinate system is $\hat{g}_W = g_W^{*} / \lVert g_W^{*}\rVert$, so the angle between these two direction vectors can be used to calculate $R_{WI}$. The gravitational acceleration $g_W$ can then be expressed as:

$$g_W = R_{WI}\,\hat{g}_I\,G \tag{13}$$

$R_{WI}$ will change after the accelerometer zero bias is introduced. We use a disturbance $\delta\theta$ to describe the change and perform a first-order approximation:

$$g_W = R_{WI}\operatorname{Exp}\left(\delta\theta\right)\hat{g}_I G \approx R_{WI}\,\hat{g}_I G + R_{WI}\left(\delta\theta\right)^{\wedge}\hat{g}_I G = R_{WI}\,\hat{g}_I G - R_{WI}\left(\hat{g}_I\right)^{\wedge}\delta\theta\, G \tag{14}$$

In the formula, $\delta\theta = \left[\delta\theta_{xy}^{T},\, 0\right]^{T}$ and $\delta\theta_{xy} = \left[\delta\theta_{x},\, \delta\theta_{y}\right]^{T}$.

Combining formula (9) and formula (12), considering the effect of the accelerometer zero bias on $\Delta p_{i,j}$, considering the constraints between three consecutive keyframes, and eliminating the velocity terms, we obtain the following formula:

$$\lambda(i)\,\delta\theta_{xy} + \xi(i)\,b_a = \psi(i) \tag{15}$$
where

$$
\begin{aligned}
\lambda(i) ={}& \tfrac{1}{2}\,R_{WI}\left(\hat{g}_I\right)^{\wedge} G\left(\Delta t_{i,i+1}^{2}\,\Delta t_{i+1,i+2} + \Delta t_{i,i+1}\,\Delta t_{i+1,i+2}^{2}\right)\\
\xi(i) ={}& R_{WB}^{i+1} J_{\Delta p_{i+1,i+2}}^{a}\,\Delta t_{i,i+1} - R_{WB}^{i} J_{\Delta p_{i,i+1}}^{a}\,\Delta t_{i+1,i+2} + R_{WB}^{i} J_{\Delta v_{i,i+1}}^{a}\,\Delta t_{i,i+1}\,\Delta t_{i+1,i+2}\\
\psi(i) ={}& R_{WB}^{i+1}\Delta p_{i+1,i+2}\,\Delta t_{i,i+1} - R_{WB}^{i}\Delta p_{i,i+1}\,\Delta t_{i+1,i+2} + \left(R_{WC}^{i+1}-R_{WC}^{i}\right) p_{B|C}\,\Delta t_{i+1,i+2}\\
&- \left(R_{WC}^{i+2}-R_{WC}^{i+1}\right) p_{B|C}\,\Delta t_{i,i+1} + R_{WB}^{i}\Delta v_{i,i+1}\,\Delta t_{i,i+1}\,\Delta t_{i+1,i+2} + \left(p_{C|W}^{i+2}-p_{C|W}^{i+1}\right)\Delta t_{i,i+1}\\
&- \left(p_{C|W}^{i+1}-p_{C|W}^{i}\right)\Delta t_{i+1,i+2} + \tfrac{1}{2}\,R_{WI}\,\hat{g}_I\,G\left(\Delta t_{i,i+1}^{2}\,\Delta t_{i+1,i+2}+\Delta t_{i,i+1}\,\Delta t_{i+1,i+2}^{2}\right)
\end{aligned} \tag{16}
$$

We use Singular Value Decomposition (SVD) to solve formula (15); the zero bias $b_a$ of the accelerometer and the optimized gravitational acceleration can then be obtained.

According to the calculated zero biases of the gyroscope and accelerometer, combined with formula (9), the system velocity at each keyframe moment can be further obtained. The system velocity at the last keyframe is:

$$v_{B|W}^{k+1} = v_{B|W}^{k} + g_W\,\Delta t_{k,k+1} + R_{WB}^{k}\,\Delta v_{k,k+1} \tag{17}$$

The system velocity at the other keyframe moments is:

$$v_{B|W}^{k} = \frac{1}{\Delta t_{k,k+1}}\left(p_{B|W}^{k+1} - p_{B|W}^{k} - R_{WB}^{k}\,\Delta p_{k,k+1} - \tfrac{1}{2}\,g_W\left(\Delta t_{k,k+1}\right)^{2}\right) \tag{18}$$
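As a small illustration of formula (18), the following C++/Eigen sketch recovers a keyframe's velocity from two consecutive keyframe positions and the pre-integrated Delta p; the inputs are assumed to be expressed in the world frame, and all names are illustrative rather than the paper's code.

```cpp
#include <Eigen/Dense>

// v_k = (p_{k+1} - p_k - R_WB^k * Delta p_{k,k+1} - 0.5 * g_W * dt^2) / dt, formula (18).
Eigen::Vector3d keyframeVelocity(const Eigen::Vector3d& p_k, const Eigen::Vector3d& p_k1,
                                 const Eigen::Matrix3d& R_WB_k, const Eigen::Vector3d& dp_k_k1,
                                 const Eigen::Vector3d& g_W, double dt) {
    return (p_k1 - p_k - R_WB_k * dp_k_k1 - 0.5 * g_W * dt * dt) / dt;
}
```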

3.4 Joint estimation and local optimization



The initial velocity, gravitational acceleration, and zero bias of the gyroscope and accelerometer are solved

through the IMU initialization. Then we use image and IMU information to estimate and optimize the system's pose.

First, the joint estimation calculates the system's pose at the current keyframe by minimizing the visual reprojection

error and the IMU pre-integration error. Then, the system's pose and the map points' location are optimized in the

sliding window.

3.4.1 State and error

The state variables of the VINS include the rotation matrix, translation, and velocity. The IMU bias is a random-walk process, which also needs to be estimated. We define the $i$th frame's state as $\chi_i = \left\{R_i, p_i, v_i, b_{g_i}, b_{a_i}\right\}$. In the subsequent local optimization, the map points' locations in the sliding window also need to be jointly optimized as part of the state.

As shown in Fig.4, the VINS includes three constraints:

- the constraint between the image feature points observed in a keyframe and the map points;
- the IMU pre-integration measurement constraint between adjacent keyframes;
- the random-walk constraint on the zero bias between adjacent keyframes.
Fig.4 VINS state and constraints.

According to the above constraints, we define the visual reprojection error and the IMU measurement error. The two types of errors are introduced below.

The visual error is defined by the reprojection error of the map points. Assuming that $m$ map points are observed at the time of the $i$th frame, the visual error at that time is

$$E_{proj}(i) = \sum_{k=1}^{m}\rho\left(\left\| x_{uv}^{k} - \pi\left(X_{C}^{k}\right)\right\|_{\Sigma_k}^{2}\right) \tag{19}$$

In the formula, $\rho$ is the kernel function and $\Sigma_k$ is the covariance matrix of the reprojection error. Meanwhile, $x_{uv}^{k} = \left[u_{L}^{k}, v_{L}^{k}, u_{R}^{k}\right]$ is the projection coordinate of the $k$th landmark point obtained by the stereo camera, $X_{C}^{k} = \left[X, Y, Z\right]$ is the landmark in the camera coordinate system, and $\pi$ is the projection function:

$$\pi\left(X_{C}^{k}\right) = \begin{bmatrix} f_x\left(X/Z\right) + c_u\\ f_y\left(Y/Z\right) + c_v\\ f_x\left(\left(X - b_l\right)/Z\right) + c_u \end{bmatrix} \tag{20}$$

$f_x$, $f_y$, $c_u$, and $c_v$ are the camera intrinsic parameters, and $b_l$ is the baseline.
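As a concrete reading of formulas (19) and (20), the following C++/Eigen sketch implements the stereo projection function and the per-observation reprojection residual; the robust kernel and the covariance weighting are omitted, and the struct and function names are illustrative assumptions, not the paper's code.

```cpp
#include <Eigen/Dense>

struct StereoIntrinsics {
    double fx, fy, cu, cv, baseline;  // f_x, f_y, c_u, c_v, b_l
};

// Projects a landmark expressed in the camera frame to (u_L, v_L, u_R), formula (20).
Eigen::Vector3d projectStereo(const Eigen::Vector3d& Xc, const StereoIntrinsics& K) {
    const double X = Xc.x(), Y = Xc.y(), Z = Xc.z();
    return Eigen::Vector3d(K.fx * X / Z + K.cu,
                           K.fy * Y / Z + K.cv,
                           K.fx * (X - K.baseline) / Z + K.cu);
}

// Reprojection residual x_uv - pi(X_C) for one observation, as summed in formula (19).
Eigen::Vector3d reprojectionError(const Eigen::Vector3d& x_uv,
                                  const Eigen::Vector3d& Xc,
                                  const StereoIntrinsics& K) {
    return x_uv - projectStereo(Xc, K);
}
```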



The IMU error comprehensively considers two constraints: the IMU pre-integration measurement and the zero-bias random-walk process. The IMU error between frames $i$ and $j$ is defined as:

$$E_{IMU}(i,j) = \left\|e_{IMU}(i,j)\right\|_{\Sigma_{IMU}}^{2} + \left\|e_{b}(i,j)\right\|_{\Sigma_{b}}^{2} \tag{21}$$

In the formula, $\Sigma_{IMU}$ is the covariance matrix of the pre-integrated measurement, $\Sigma_{b}$ is the covariance matrix of the zero-bias random walk, $e_{IMU}$ is the IMU pre-integrated measurement residual, and $e_{b}$ is the IMU zero-bias random-walk residual:

$$e_{IMU}(i,j) = \begin{bmatrix} e_{R_{i,j}}\\ e_{v_{i,j}}\\ e_{p_{i,j}} \end{bmatrix}, \qquad e_{b}(i,j) = \begin{bmatrix} e_{b_{i,j}}^{g}\\ e_{b_{i,j}}^{a} \end{bmatrix} \tag{22}$$

In the formula,

$$
\begin{aligned}
e_{R_{i,j}} &= \operatorname{Log}\left(\left(\Delta R_{i,j}\operatorname{Exp}\left(J_{\Delta R_{i,j}}^{g}\,b_{g}^{j}\right)\right)^{T} R_{BW}^{i} R_{WB}^{j}\right)\\
e_{v_{i,j}} &= R_{BW}^{i}\left(v_{B|W}^{j} - v_{B|W}^{i} - g_W\,\Delta t_{i,j}\right) - \left(\Delta v_{i,j} + J_{\Delta v_{i,j}}^{g}\,b_{g}^{j} + J_{\Delta v_{i,j}}^{a}\,b_{a}^{j}\right)\\
e_{p_{i,j}} &= R_{BW}^{i}\left(p_{B|W}^{j} - p_{B|W}^{i} - v_{B|W}^{i}\,\Delta t_{i,j} - \tfrac{1}{2}\,g_W\left(\Delta t_{i,j}\right)^{2}\right) - \left(\Delta p_{i,j} + J_{\Delta p_{i,j}}^{g}\,b_{g}^{j} + J_{\Delta p_{i,j}}^{a}\,b_{a}^{j}\right)\\
e_{b_{i,j}}^{g} &= b_{g}^{j} - b_{g}^{i}\\
e_{b_{i,j}}^{a} &= b_{a}^{j} - b_{a}^{i}
\end{aligned} \tag{23}
$$
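For illustration, the following C++/Eigen sketch evaluates the residuals e_R, e_v and e_p of formula (23) from two keyframe states and the pre-integrated Delta R, Delta v, Delta p; the bias-update Jacobian terms and the bias random-walk residuals are omitted, and the state struct and names are illustrative assumptions rather than the paper's code.

```cpp
#include <Eigen/Dense>

// SO(3) logarithm: rotation matrix -> rotation vector.
static Eigen::Vector3d LogSO3(const Eigen::Matrix3d& R) {
    Eigen::AngleAxisd aa(R);
    return aa.angle() * aa.axis();
}

struct NavState {              // state of one keyframe
    Eigen::Matrix3d R_WB;      // rotation body -> world
    Eigen::Vector3d p_W, v_W;  // position and velocity in the world frame
};

void imuResidual(const NavState& si, const NavState& sj,
                 const Eigen::Matrix3d& dR, const Eigen::Vector3d& dv,
                 const Eigen::Vector3d& dp, double dt_ij,
                 const Eigen::Vector3d& g_W,
                 Eigen::Vector3d* eR, Eigen::Vector3d* ev, Eigen::Vector3d* ep) {
    Eigen::Matrix3d R_BW_i = si.R_WB.transpose();
    *eR = LogSO3(dR.transpose() * R_BW_i * sj.R_WB);                       // rotation residual
    *ev = R_BW_i * (sj.v_W - si.v_W - g_W * dt_ij) - dv;                   // velocity residual
    *ep = R_BW_i * (sj.p_W - si.p_W - si.v_W * dt_ij
                    - 0.5 * g_W * dt_ij * dt_ij) - dp;                     // position residual
}
```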

3.4.2 Joint state estimation

The joint state estimation computes the system pose in real time. First, the IMU provides a relatively credible initial estimate, and the map points observed in the previous keyframe are reprojected into the current keyframe. Then, the system calculates the visual reprojection error for all projection points and their matching points in the two adjacent keyframes. Finally, it minimizes the sum of the visual reprojection error and the IMU pre-integration error. Formula (19) defines the visual error $E_{proj}(j)$, and formula (21) defines the IMU error $E_{IMU}(i,j)$. Therefore, the objective function is defined as:

$$\chi^{*} = \arg\min_{\chi}\left(E_{proj}(j) + E_{IMU}(i,j)\right) \tag{24}$$
re
The above problem is a least-squares problem. Therefore, the optimal state $\chi^{*}$ can be solved by minimizing the sum of the visual reprojection error and the IMU error. In this paper, IMU pre-integration constraints are added to the visual estimation in the optimization process to improve the accuracy of state estimation. At the same time, when environmental features are lacking or the unmanned system moves too fast, the IMU measurement information is used to estimate the system state so that the system can run robustly.

3.4.3 Local sliding window optimization



The state information of the latest keyframe is estimated through the joint state estimation module, and the

observed landmarks are added to the local map to become map points. Next, we optimize the system pose and map

point locations in the local map.

In the VINS, since the IMU provides constraints between two adjacent keyframes, the keyframes contained in

the local map must be continuous before they can be used for optimization. At the same time, the computational

complexity after fusion will increase rapidly because of the high frequency of IMU. To balance accuracy and

calculation efficiency, we choose the sliding window strategy to optimize the local map. As shown in Fig.5, the

continuous keyframes in the most recent period of time constitute a sliding window. During the optimization process,

only the keyframes in the sliding window and the map points observed by these keyframes are optimized. In our

paper, the sliding window's length in the local map is set to 10 to ensure reliability. The method ensures that the IMU

pre-integration can provide effective constraints during local optimization.


Fig.5 Sliding window optimization range.

The state in the sliding window is defined as:

$$\chi = \left\{ T_{WB}^{k},\; X_{W}^{i},\; v_{W}^{k},\; b_{g}^{k},\; b_{a}^{k} \right\} \tag{25}$$

In the formula, $k$ is the keyframe index in the sliding window, and $T_{WB}^{k}$ is the system pose in the world coordinate system at the $k$th keyframe. Meanwhile, $X_{W}^{i}$ represents the $i$th landmark's coordinate in the world coordinate system.



Combining the reprojection error and the IMU pre-integration error of all keyframes in the sliding window, the objective function is constructed as:

$$\chi^{*} = \arg\min_{\chi}\sum_{k=1}^{N}\left(E_{proj}(k) + E_{IMU}(k-1, k)\right) \tag{26}$$

Formula (26) is still a least-squares problem, and it is solved by the Gauss-Newton method in g2o.

The local sliding window optimization comprehensively considers the state information of $N$ consecutive keyframes and updates the system poses in the sliding window at each iteration. After local optimization is completed, the loop detection module performs loop detection and global pose graph optimization on the latest keyframes to eliminate accumulated errors and optimize the system pose.
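The bookkeeping implied by this strategy can be sketched as follows in C++; the KeyFrame type and the optimizeWindow() stub are placeholders standing in for the full local bundle adjustment of formula (26), so this is only a structural sketch rather than the paper's implementation.

```cpp
#include <cstddef>
#include <deque>

struct KeyFrame { /* pose, velocity, biases, observations (omitted) */ };

// Placeholder for the local optimization of formula (26), e.g. Gauss-Newton in g2o.
inline void optimizeWindow(const std::deque<KeyFrame*>& window) { (void)window; }

class SlidingWindow {
public:
    explicit SlidingWindow(std::size_t size) : size_(size) {}
    void addKeyFrame(KeyFrame* kf) {
        window_.push_back(kf);
        if (window_.size() > size_) window_.pop_front();  // drop the oldest keyframe
        optimizeWindow(window_);  // optimize window poses and observed map points
    }
private:
    std::size_t size_;            // set to 10 in this paper
    std::deque<KeyFrame*> window_;
};
```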

3.5 Back-end optimization

Unmanned systems explore unknown environments, and there is always an error in estimating their pose. As the exploration area expands, previous errors gradually accumulate. Therefore, correcting the pose in time is particularly important for the long-term robust operation of the unmanned system. A feasible idea to solve this problem is to examine the image information in the scene and determine whether the current position has been reached before. If the information is found to be similar, a loop is formed at the repeated position. Then, the system's poses in the loop are re-optimized to eliminate accumulated errors and obtain a globally consistent pose graph.

Loop detection based on the scene's appearance is essentially an image similarity detection problem. However, if all historical frames in the map were matched against the current frame, the computational cost would be too large, which seriously affects real-time detection. In this paper, we use the Bag of Words (BoW) model to accomplish loop detection.

The words in the BoW model are average representations of similar features of a specific type, and the dictionary is the collection of all words. We use the ORB features extracted by the front-end. For each frame, the ORB features are compared with the dictionary elements to convert the frame into a bag-of-words vector. In loop detection, by comparing the bag-of-words vectors of two frames, it can be judged whether the two frames constitute a loop, which speeds up the detection. The detection process includes three parts: (1) building a dictionary; (2) describing frames with bag-of-words vectors; (3) comparing the bag-of-words vectors to determine the similarity of two frames.
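The comparison of bag-of-words vectors can be illustrated with the following simplified C++ sketch, in which a frame is represented by a sparse word-to-weight map and similarity is a cosine score; production systems normally rely on a dedicated BoW library, so this toy version only conveys the idea and is not the paper's implementation.

```cpp
#include <cmath>
#include <map>

using BowVector = std::map<int, double>;  // word id -> (e.g. tf-idf) weight

double bowScore(const BowVector& a, const BowVector& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& [word, wa] : a) {
        na += wa * wa;
        auto it = b.find(word);
        if (it != b.end()) dot += wa * it->second;  // only shared words contribute
    }
    for (const auto& kv : b) nb += kv.second * kv.second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));   // 1.0 means identical word histograms
}

// A loop candidate is accepted when the score exceeds a threshold, e.g.
//   if (bowScore(current, past) > threshold) { /* verify geometrically, e.g. with PnP */ }
```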


Fig.6 Loop detection process.



After the system successfully detects the loop, it can establish the constraint relationship between the two frames

through feature matching. Firstly, we need to fuse the map points matched by the loop and eliminate redundant map

points. Then the relative pose between the two keyframes is calculated with the Perspective-n-Point (PnP) algorithm, and the current keyframe's pose is corrected. In addition, the other keyframes in the loop also have accumulated errors of different degrees and need to be optimized globally.

Considering that the number of map points is much larger than the number of keyframe poses, we only optimize the keyframe poses in the global optimization process to reduce calculation. After the pose optimization is completed, we use the optimized poses to adjust the map points.

In the pose graph optimization, the nodes represent the camera poses, and the edges represent the pose transformation relationships between adjacent nodes. Suppose the camera poses of two adjacent nodes in the loop are $T_i$ and $T_j$, the relative pose transformation is $T_{i,j}$, and the corresponding Lie algebra elements are $\xi_i$, $\xi_j$, and $\xi_{i,j}$ respectively. According to the relationship of pose transformation:

$$T_{i,j} = T_i^{-1}\,T_j \tag{27}$$

Due to pose estimation errors, the above formula does not hold exactly. Pose estimation for unmanned systems iteratively solves for an optimal transformation matrix $T \in SE(3)$ that minimizes the system error. In order to simplify the iterative solution, Lie algebra is introduced, so that the matrix derivation can be transformed into operations on the vector corresponding to the anti-symmetric matrix. We stipulate that the symbol '$\vee$' converts an anti-symmetric matrix to the corresponding vector and the symbol '$\wedge$' converts a vector to the corresponding anti-symmetric matrix. Thus we can construct the error term by moving $T_{i,j}$ to the right side of formula (27):

$$e_{i,j} = \ln\left(T_{i,j}^{-1}\,T_i^{-1}\,T_j\right)^{\vee} = \ln\left(\exp\left(\left(-\xi_{i,j}\right)^{\wedge}\right)\exp\left(\left(-\xi_i\right)^{\wedge}\right)\exp\left(\xi_j^{\wedge}\right)\right)^{\vee} \tag{28}$$

Formula (28) describes the error term between two adjacent pose nodes. Considering the error terms formed by all pose nodes in the pose graph, a least-squares problem is built:

$$\min_{\xi}\;\frac{1}{2}\sum_{(i,j)\in\mathcal{E}} e_{i,j}^{T}\,\Sigma_{i,j}^{-1}\,e_{i,j} \tag{29}$$

where $\mathcal{V}$ is the set of all nodes in the pose graph and $\mathcal{E}$ is the set of all edges; $\Sigma_{i,j}^{-1}$ is the information matrix, representing the weight of the error term $e_{i,j}$.
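For illustration, the following C++/Eigen sketch evaluates a pose-graph edge residual in the spirit of formula (28); it replaces the full SE(3) logarithm with separate rotation and translation parts, which is a common practical simplification, and the names are illustrative assumptions rather than the paper's code.

```cpp
#include <Eigen/Dense>
#include <Eigen/Geometry>

// SO(3) logarithm: rotation matrix -> rotation vector.
static Eigen::Vector3d LogSO3(const Eigen::Matrix3d& R) {
    Eigen::AngleAxisd aa(R);
    return aa.angle() * aa.axis();
}

// 6-vector residual for one edge: measured relative pose T_ij vs. current estimates T_i, T_j.
Eigen::Matrix<double, 6, 1> poseGraphError(const Eigen::Isometry3d& T_ij,
                                           const Eigen::Isometry3d& T_i,
                                           const Eigen::Isometry3d& T_j) {
    Eigen::Isometry3d E = T_ij.inverse() * T_i.inverse() * T_j;  // identity if consistent
    Eigen::Matrix<double, 6, 1> e;
    e.head<3>() = LogSO3(Eigen::Matrix3d(E.rotation()));  // rotational part of the error
    e.tail<3>() = E.translation();                        // translational part of the error
    return e;
}
```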


Finally, we use g2o to solve the above least-squares problem: the poses of all keyframes in the loop are corrected, and the positions of all map points are re-adjusted. Fig.7 shows the pose graph. Due to the accumulation of errors, the motion trajectory of the unmanned system does not form a closed loop when a previous position is revisited. Pose optimization can align the system poses at the same place and correct the global motion trajectory.

Fig.7 Globally optimized pose.

4 Experiment

We evaluate the proposed visual-inertial information fusion algorithm through two parts: dataset experiment

and a real-environment experiment. In the first experiment, we compare the proposed visual-inertial information fusion algorithm with other state-of-the-art algorithms on a public dataset; the comparison of experimental data demonstrates the positioning accuracy and robustness of our algorithm in detail. Then, we test the algorithm in an indoor environment

to evaluate its performance in real scenarios.

4.1 Dataset experiment

The experimental platform configuration for our experiment is shown in Table 1. The open-source EuRoC dataset provided by ETH Zurich is used to verify the SLAM algorithm. The EuRoC dataset was collected by a micro-drone and contains 20 Hz stereo images from an MT9V034 camera, synchronized 200 Hz IMU measurements from an ADIS16448 sensor, and ground-truth trajectories from a motion capture system. Therefore, it is very suitable for verifying the visual-inertial information fusion SLAM algorithm.

Table 1 Experimental platform configuration.


Option	Configuration
CPU	Intel Core i7-8700 (4.3 GHz)
RAM	16 GB
Programming language	C++
Operating system	Ubuntu 16.04
Libraries	OpenCV 3.4.0, Eigen 3.2, g2o, etc.

The dataset includes 11 data sequences collected during flights of the micro-drone shown in Fig.8(a) in an industrial plant and two rooms. According to each sequence's exposure change, number of features, motion speed, and degree of motion blur, the EuRoC dataset divides the 11 sequences into three levels: easy, medium, and difficult. Among them, the maximum speed of 7 sequences exceeds 2 m/s, and the highest speed reaches 2.87 m/s. The average speed of three sequences exceeds 0.9 m/s. Fig.8(b) shows images collected by the micro-drone in the industrial plant. The environment in the plant is relatively complex and can fully simulate the actual operating conditions of unmanned systems.

Fig.8 EuRoC Dataset.

In order to verify the performance of the algorithm, we conducted extensive comparative experiments. To simplify the notation, we use RK-VIF (the SLAM method for visual-inertial information fusion via Runge-Kutta) to denote our algorithm and compare it with the current state-of-the-art visual-inertial information fusion systems OKVIS, ROVIO, and VINS-Mono. OKVIS is an advanced VIO algorithm developed by the Autonomous Systems Lab of ETH Zurich, which can be used with monocular or stereo cameras; VINS-Mono is a tightly coupled monocular visual-inertial system developed by Shaojie Shen's team at the Hong Kong University of Science and Technology; ROVIO was developed by ETH Zurich as a tightly coupled monocular visual-inertial system. This part of the experiment details the results on the MH_05_difficult and V1_02_medium sequences of the EuRoC dataset.
For sequence MH_05_difficult and sequence V1_02_medium, the running trajectory graph of ROVIO, OKVIS,

RK-VIF, and Ground Truth is shown in Fig.9.


Fig.9 Dataset experiment.

Through multiple comparison experiments, we give the Root Mean Square Error(RMSE) of various algorithms

under each sequence in the EuRoC dataset to evaluate the performance of multiple algorithms.

Table 2 The RMSE in EuRoC Dataset(unit: m)

Sequence ROVIO OKVIS VINS-Mono RK-VIF

MH_01_easy 0.224 0.213 0.186 0.041

MH_02_easy / 0.138 0.178 0.049

MH_03_medium 0.445 0.239 0.175 0.046

MH_04_difficult 0.893 0.257 0.193 0.221

MH_05_difficult 1.071 0.313 0.280 0.214

V1_01_easy 0.162 0.097 0.138 0.081


V1_02_medium 0.173 0.064 0.085 0.063

V1_03_difficult 0.163 0.184 0.118 0.060

V2_01_easy 0.496 0.078 0.125 0.097

V2_02_medium 0.383 0.145 0.176 0.059

The RMSE of 10 sequences in the EuRoC dataset is shown in Table 2. Compared with the other algorithms, RK-VIF has the smallest RMSE on most data sequences, indicating that RK-VIF has the highest positioning accuracy and the best performance. However, OKVIS has a slight advantage on the sequence V2_01_easy. After comparative analysis, we believe that RK-VIF has a certain positioning error on this sequence; the fast motion of the unmanned aerial vehicle (UAV) and the resulting motion blur are the main reasons for the increase in RMSE. Moreover, on our test platform, OKVIS cannot run in real time. Fig.10 shows the running time of the various algorithms on the EuRoC dataset. The blue column represents Ground Truth, which is the actual recording time of each sequence. It can be seen that the running speed of our algorithm is better than ROVIO. However, compared with VINS-Mono, the running time is longer on certain sequences, because our algorithm spends more time matching feature points and performing Runge-Kutta pre-integration. OKVIS has the longest running time; a possible reason is that it spends a long time on image feature extraction and global pose optimization. In summary, the experiments verify the performance of our algorithm in terms of positioning accuracy and running speed. Our algorithm supports the long-term and stable operation of the SLAM system.

Fig. 10 Running time in EuRoC Dataset(unit: s).

4.2 Real environment experiment

In the actual indoor experiment, the experimental environment is shown in Fig.11(a). We use an indoor GPS as the ground truth and compare RK-VIF with OKVIS, VINS-Mono and ROVIO to verify the performance of the algorithms.

The experimental platform is shown in Fig.11(b). A TX2 development board is mounted on a Mecanum-wheel mobile robot. Its computing resources mainly include a CPU (HMP dual-core Denver 2, 2 MB L2, plus quad-core ARM A57, 2 MB L2) and a GPU (NVIDIA Pascal, 256 CUDA cores); the board is connected to a ZED stereo camera and an STM32 driver board equipped with an MPU-6050 IMU.

Fig.11 Test environment and equipment.

In this experiment, we tested the positioning accuracy of ROVIO, OKVIS, VINS-Mono, and RK-VIF. The laptop and the TX2 are connected to the same local area network (LAN). We use the laptop to control the TX2 mobile robot and drive three laps along the same trajectory in the indoor experimental environment, as shown in Fig.12. The indoor GPS provides the ground truth, and the total track length is 72.72 m. Fig.13 and Fig.14 respectively show the translation and rotation of the mobile robot at each moment and the corresponding errors.

Fig.12 Running trajectory in indoor environment.



Fig. 13 The positions and corresponding errors of the various algorithms.

The RK-VIF, ROVIO, OKVIS, and VINS-Mono are compared and analyzed in the indoor environment experiment. Compared against the ground truth generated by the indoor GPS in the X and Y directions, the RK-VIF algorithm has the smallest trajectory error and the highest positioning accuracy, with errors between -0.2 m and 0.2 m. VINS-Mono is second and OKVIS third, while ROVIO performs worst with a trajectory error of about 0.3 m. In the Z direction, since the mobile robot always stays on the same horizontal plane, the position at each moment fluctuates around 0, which is consistent with the actual motion.

Considering that the indoor experimental site is flat with no slope, we do not count the roll and pitch angles in the attitude error comparison; only the yaw angle is included in the attitude error analysis.

Fig.14 The attitude and corresponding errors of various algorithms.

As shown in Fig.14, comparing the four algorithms RK-VIF, OKVIS, VINS-Mono, and ROVIO, the attitude error of RK-VIF is significantly smaller than that of the other three algorithms, lying between -4° and 4°. OKVIS and VINS-Mono have similar attitude errors, and ROVIO has the largest error, reaching 6°.
Table 3 The average RMSE in the indoor environment (unit: m).

ROVIO	OKVIS	VINS-Mono	RK-VIF
0.395	0.312	0.325	0.296

Fig. 13 and Fig. 14 analyze the positioning errors of the four algorithms under the same experimental conditions and scene. In order to demonstrate the advantages of our algorithm more intuitively, we follow the dataset experiment in Section 4.1: in Table 3, we conduct five experiments for each algorithm and report the best RMSE in the indoor environment. It can be seen from the table that RK-VIF has smaller positioning errors and better robustness in actual scenes. Thus, combining the dataset and real-scene experiments, our algorithm balances positioning accuracy and running time, providing an excellent solution for unmanned systems to explore unknown environments robustly.

5 Conclusion

This paper proposes a visual-inertial information fusion SLAM based on Runge-Kutta improved pre-integration, aiming at the low positioning accuracy of pure visual SLAM under fast motion and in complex indoor environments. In the process of visual-inertial information fusion, the IMU pre-integration value is used as a constraint between two adjacent keyframes. To calculate the pre-integration value, the paper uses the Runge-Kutta method instead of the Euler integral to discretize the IMU motion equations, avoiding the error caused by the first-order approximation. The experiments show that RK-VIF can effectively perform positioning in environments of different difficulty levels. Furthermore, compared with other current state-of-the-art open-source algorithms, our method achieves higher positioning accuracy. However, further extensive research is still needed to improve the accuracy and robustness of the system.


Acknowledgements

This work was supported by the China Postdoctoral Science Foundation under Grant 2019M653870XB,

National Natural Science Foundation of Shanxi Province under Grants No.2020GY-003 and 2021GY-036, National

Natural Science Foundation of China under Grants 62001340 and Fundamental Research Funds for the Central

Universities, China, XJS211306 and JC2007.

Reference

[1] Cadena C, Carlone L, Carrillo H, Latif Y, Scaramuzza D, Neira J, et al. Past, Present, and Future of

Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Transactions on Robotics

2016;32(6): 1309-1332.

[2] Saputra M R U, Markham A, Trigoni N. Visual SLAM and structure from motion in dynamic environments:

A survey. ACM Computing Surveys (CSUR) 2018;51(2): 1-36.

[3] Fuentes-Pacheco J, Ruiz-Ascencio J, Rendón-Mancha J M. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review 2015;43(1): 55-81.

[4] Sualeh M, Kim G W. Simultaneous localization and mapping in the epoch of semantics: a survey. International Journal of Control, Automation and Systems 2019;17(3): 729-742.

[5] Younes G, Asmar D, Shammas E, Zelek J. Keyframe-based monocular SLAM: design, survey, and future

directions. Robotics and Autonomous Systems 2017;98: 67-88.



[6] Barfoot Timothy D. State estimation for robotics. Cambridge University Press, 2017.

[7] Huang G. Visual-inertial navigation: A concise review. In 2019 international conference on robotics and

automation (ICRA). IEEE 2019; 9572-9582.



[8] Heo S, Cha J, Park C G. EKF-based visual inertial navigation using sliding window nonlinear optimization.

IEEE Transactions on Intelligent Transportation Systems 2018;20(7): 2470-2479.

[9] Forster C, Carlone L, Dellaert F, Scaramuzza D. IMU Preintegration on Manifold for Efficient Visual-Inertial

Maximum-a-Posteriori Estimation. Robotics: Science and Systems 2015.

[10] Ma F, Shi J, Yang Y, Li J, Dai K. ACK-MSCKF: Tightly-Coupled Ackermann Multi-State Constraint Kalman

Filter for Autonomous Vehicle Localization. Sensors 2019;19(21): 4816.

[11] Cui J, Feng D, Li Y, Tian Q. Research on simultaneous localization and mapping for AUV by an improved

method: Variance reduction FastSLAM with simulated annealing. Defence Technology 2020;16(3): 651-661.

[12] Forster C, Carlone L, Dellaert F, Scaramuzza D. On-Manifold Preintegration for Real-Time Visual-Inertial

Odometry. IEEE Transactions on Robotics 2016;33(1): 1-21.

[13] Ma S, Bai X, Wang Y, Fang R. Robust Stereo Visual-Inertial Odometry Using Nonlinear Optimization. Sensors

2019;19(17): 3747.

[14] Liu F, Zhang J, Wang J, Han H, Yang D. An UWB/Vision Fusion Scheme for Determining Pedestrians’ Indoor

Location. Sensors 2020;20(4): 1139.


[15] Mur-Artal R, Tardós J D. ORB-SLAM2: An open-source slam system for monocular, stereo, and rgb-d cameras.

IEEE Transactions on Robotics 2017;33(5): 1255-1262.

[16] Mur-Artal R, Tardós J D. Visual-inertial monocular SLAM with map reuse. IEEE Robotics and Automation

Letters 2017;2(2): 796-803.

[17] Qin T, Li P, Shen S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE

Transactions on Robotics 2018;34(4): 1004-1020.

[18] Qin T, Shen S. Online Temporal Calibration for Monocular Visual-Inertial Systems. In 2018 IEEE/RSJ

International Conference on Intelligent Robots and Systems (IROS). IEEE 2018; 3662-3669.

[19] Yang Z, Shen S. Monocular visual-inertial state estimation with online initialization and camera-IMU extrinsic

calibration. IEEE Transactions on Automation Science and Engineering 2016;14(1): 39-51.

[20] Zhao P, Ding D, Wang Y, Liu H. An improved GMS-PROSAC algorithm for image mismatch elimination.

Systems Science & Control Engineering 2018;6(1): 220-229.

[21] Senthooran I, Murshed M, Barca J C, Kamruzzaman J, Chung H. An efficient RANSAC hypothesis evaluation
using sufficient statistics for RGB-D pose estimation. Autonomous Robots 2019;43(5): 1257-1270.
[22] Xiao Y, Ruan X, Chai J, Zhang X, Zhu X. Online IMU self-calibration for visual-inertial systems. Sensors

2019;19(7): 1624.

[23] Kok M, Hol J D, Schön T B. Using Inertial Sensors for Position and Orientation Estimation. Foundations and

Trends in Signal Processing 2018;11: 1-153.


na

[24] Eckenhoff K, Geneva P, Huang G. Closed-form preintegration methods for graph-based visual-inertial

navigation. The International Journal of Robotics Research 2019;38(5): 563-586.


ur

[25] Wen S, Zhao Y, Zhang H, Lam H K, Manfredi L. Joint optimization based on direct sparse stereo visual-inertial
Jo

odometry. Autonomous Robots 2020; 1-19.

[26] Yang G, Zhao L, Mao J, Liu X. Optimization-based, simplified stereo visual-inertial odometry with high-

accuracy initialization. IEEE Access 2019;7: 39054-39068.

[27] Zou D, Wu Y, Pei L, Ling H, Yu W. StructVIO: visual-inertial odometry with structural regularity of man-

made environments. IEEE Transactions on Robotics 2019;35(4): 999-1013.


Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.

☐The authors declare the following financial interests/personal relationships which may be considered
as potential competing interests:

