CU Thesis Template
by
Muhammad Abdul Wasae
in
Mechanical Engineering
Carleton University
Ottawa, Ontario
© 2023 Muhammad Abdul Wasae
Abstract
In this thesis, the author develops a sensor fusion model that combines information from
stereo vision, an IMU, and GPS to provide odometry for autonomous vehicles. The proposed
model addresses three main problems in the automobile industry: (1) it gives precise and
accurate odometry for real-time applications where the margin of error is low; (2) unlike
most models, it does not use LiDAR, which is expensive and impractical; (3) the model is
robust and reliable, meaning that if it loses information from one of the sensors it will
still give a reasonably good pose estimate. The system was tested on the KITTI dataset.
Acknowledgements
I would like to thank the faculty of Carleton University for motivating me and showing
me the way throughout this thesis.
Table of Contents
Abstract  ii
3.4.2 Orientation accuracy  20
8.1 Position estimates  48
References  49
List of Tables
Table 4  Sensor types and their respective standard deviation of error for position estimate  42
Table 5  Sensor types and their respective standard deviation of error for orientation estimate  42
Table 6  Each sensor model and their respective RMS error for position estimate  46
Table 7  Each sensor model and their respective RMS error for orientation estimate  47
List of Illustrations
Illustration 17  Robot position & orientation from linear acceleration and angular velocity  22
Illustration 35  Vision, IMU and GPS fused position readings (Zoomed in)  46
Chapter 1: Introduction
Odometry is the process of determining the position and orientation of a mobile robot by
analyzing data from its own sensors. The term "odometry" is derived from the Greek words
"odos" (meaning path) and "metron" (meaning measure), and it essentially involves
measuring the robot's path or motion.
There are many ways to perform this task: we can use wheel encoders, an inertial
measurement unit (IMU), LiDAR, or anything else that gives us information about the
robot's location and orientation. Every sensor has its own advantages and disadvantages.
Sensor fusion is the method used to combine sensors to get a more accurate and reliable
estimate of the robot's pose.
Precision and accuracy of odometry are essential in advanced robotics. For example, robots
are now used for operations in hospitals, where they have to balance force and displacement
with high precision and accuracy. Autonomous cars need precise and accurate odometry when
they are moving fast and the margin of error is very low. In the future, machine learning
and robotics will bring robots into our homes to do tasks for us, which will also require
highly precise and accurate odometry. Our method improves both the precision and accuracy
of odometry beyond what any individual sensor provides.
No Lidar:
Autonomous automobile and robotics companies use LiDAR sensors to try to perceive the
environment better, but the additional cost of a LiDAR sensor is a big issue. LiDAR's
weight and size are also problems in both cars and humanoid robots. Unlike most approaches,
we have not used LiDAR in our sensor fusion method, which makes it more practical.
Reliability:
Our method is reliable because it does not rely on only one sensor for odometry. For
example, if a car or a robot is in a place where it does not receive GPS signals, such as
an underground tunnel, it can still predict accurate and precise odometry from the IMU and
vision. And if bad weather gives us poor visibility, our model can still fall back on the
IMU and GPS.
We have used one of the sensor fusion methods, the particle filter, which is derived from
the Bayes filter, to combine the information that we get from vision, the IMU, and GPS into
high-precision, robust, and reliable odometry that does not drift with time.
This requires that we first calculate the pose for each sensor individually and find the
standard deviation of its error by comparing it to the ground truth. Since the errors we
get may not be Gaussian and the system may not be linear, we cannot use a Kalman filter. We
instead used a derivative of the Bayes filter called the particle filter, which does not
require linear-Gaussian assumptions and whose computational cost can be kept modest, making
it a good candidate for real-time robotics applications like autonomous cars or robots.
We used the KITTI dataset to evaluate our model and then concluded the thesis with our
findings.
Chapter 2: Literature Review

In 1948, Claude Shannon [1] introduced a mathematical way to quantify information and
defined the bit (short for binary digit) to represent the smallest amount of information.
This allowed for the digital representation and transmission of information.
Claude Shannon also showed, in his master's thesis at MIT, that logic gates can perform
computation. In 1936, Alan Turing introduced the concept of the Turing machine, a
theoretical model of a simple computational device that can simulate the logic of any
computer algorithm.
This gave birth to new fields like computer vision, and soon to visual odometry, which uses
computer vision to find the pose (position and orientation) of a robot. The term "visual
odometry" refers to estimating a robot's motion by observing a sequence of images; some of
the earliest work in this direction was done by Hans Moravec. Moravec also developed a
corner detection method, which he published in "Sensor Fusion in Certainty Grids for Mobile
Robots" in 1988.
In 1987, Matthies and Shaffer [4] built on Moravec's work and extended it by deriving the
motion error model using 3D Gaussian distributions instead of a scalar model.
In 1988, Chris Harris and Mike Stephens [5] developed a corner detection operator, now
known as the Harris corner detector, which improved upon Moravec's method.
Most of the early research on visual odometry was done by NASA in preparation for Mars
exploration missions.
In 2004, David Lowe [7] introduced a method to find image features from scale-invariant
key points, which proved to be a very powerful tool for visual odometry that is both robust
and accurate. SIFT (scale-invariant feature transform) is often used in combination with
RANSAC, introduced by Fischler and Bolles [8].
Visual odometry can be combined with an IMU and GPS using multi-sensor fusion. Different
architectures for multi-sensor fusion have been proposed in the literature, for example by
Luo and Kay [9]–[11]. One type of architecture is built upon Bayesian inference.
The Kalman filter was introduced by Rudolf Kalman [12] in 1960 and was used in the guidance
and navigation of the Apollo spacecraft. Its limitation is that it only works for linear
models with Gaussian noise. In 1993, Gordon, Salmond and Smith [13] developed the particle
filter, which can handle non-linear and non-Gaussian systems.
Chapter 3: Visual Odometry
Odometry means to find the pose (position & orientation) of a robot. Visual Odometry is
the method where we use vision to determine the pose of the robot. It has been used in a
wide variety of robotic applications, such as on the Mars Exploration Rovers [14].
Illustration 1 Visual Odometry
The method of getting the pose of the robot using vision can be broken into three main
parts:
1. Camera Calibration
2. Stereo Depth
3. Visual Odometry
We have three reference frames in our system. The first is the image coordinate frame,
which gives the image coordinates of the pixel that corresponds to a 3-dimensional point.
The second is the world coordinate frame, which gives the coordinates of a 3-dimensional
point from the world's perspective, and the third is the camera coordinate frame, which
gives the coordinates of a 3-dimensional point from the camera's perspective.
Illustration 3 Transformation of world coordinates to image coordinates using camera intrinsic and
extrinsic matrix
Where:
(fx, fy) are the focal lengths expressed in pixel units.
(cx, cy) are the principal point coordinates (the optical center).
s is the scaling factor.
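For reference, the standard pinhole-camera relation that Illustration 3 depicts can be
written as follows (a reconstruction of the usual form, since the original figure is not
reproduced here):

s · [u, v, 1]^T = K · [R | t] · [Xw, Yw, Zw, 1]^T,   with   K = [ fx 0 cx ; 0 fy cy ; 0 0 1 ]

where [R | t] is the extrinsic matrix, K is the intrinsic matrix, (u, v) are image
coordinates and (Xw, Yw, Zw) are world coordinates.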
We have four unknowns (fx, fy, cx, cy) that we need to calculate so that we can transform
world coordinates into image coordinates. To find the four unknowns, we take an image of a
chessboard whose world coordinates with respect to one edge can easily be calculated, since
a chessboard has a grid-like structure. We then know the world coordinates of some points
and their respective image coordinates.
Illustration 4 Camera calibration to find unknown parameters of intrinsic matrix using corner
detection algorithm
Since we know both the world coordinates and the image coordinates, we can calculate the
projection matrix. Since the extrinsic matrix is just a translation in the horizontal
direction of known length, with no change in orientation, we also know the extrinsic matrix
from the baseline of the stereo images. We can then calculate the intrinsic matrix from the
extrinsic matrix and the projection matrix.
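In practice, a common way to obtain K from chessboard images is OpenCV's built-in
calibration routine, shown here as a minimal sketch. The thesis text describes recovering K
from the projection and extrinsic matrices; this snippet is simply an alternative route to
the same intrinsic parameters, and the chessboard size and file paths are assumptions.

```python
import cv2
import numpy as np
import glob

# World coordinates of the chessboard corners (z = 0 plane), one unit per square.
pattern_size = (9, 6)                      # inner corners per row/column (assumed)
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in glob.glob("calib/*.png"):     # placeholder path
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# calibrateCamera returns the intrinsic matrix K containing fx, fy, cx, cy.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("Intrinsic matrix K:\n", K)
```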
The values of the 4 unknown parameters of the camera intrinsic matrix are shown below:

Parameter   Value (pixels)
fx          718.856
fy          718.856
cx          607
cy          185
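Using these values, the intrinsic matrix can be assembled and used, for example, to recover
a point's camera-frame coordinates from its pixel location and depth. This is a small
illustrative sketch; the pixel and depth values are made up.

```python
import numpy as np

# Intrinsic matrix from the calibrated parameters reported above.
fx, fy, cx, cy = 718.856, 718.856, 607.0, 185.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Back-project a pixel (u, v) with known depth Z (metres) into the camera frame.
u, v, Z = 800.0, 250.0, 15.0               # example values only
X = (u - cx) * Z / fx
Y = (v - cy) * Z / fy
print("Camera-frame point:", (X, Y, Z))
```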
To find world coordinates from image coordinates, we first have to create a depth map.
To create a depth map we will need a disparity map between the stereo images.
We start with left and right images to compute the disparity between them.
This is an RGB (color) image, and it contains three times more information than a grayscale
image. In our stereo vision algorithm, however, color is not useful and only makes things
more complicated, so we will use grayscale images for all our calculations.
Disparity is the difference in pixel position (horizontal in our case) between the points
in the two stereo images that correspond to the same world point. A disparity map can be
calculated using block-matching algorithms such as StereoBM and StereoSGBM.
We notice that the StereoBM algorithm leaves holes in the information and is not
continuous. Another algorithm that calculates a better disparity map, with no holes, is the
StereoSGBM algorithm.
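A minimal disparity computation with OpenCV's StereoSGBM is sketched below. The parameter
values are typical choices from the OpenCV documentation, not necessarily the exact
settings used in this thesis, and the file names are placeholders.

```python
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=96,        # must be a multiple of 16
    blockSize=11,
    P1=8 * 11 ** 2,           # smoothness penalties (commonly recommended values)
    P2=32 * 11 ** 2,
)
# compute() returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype("float32") / 16.0
```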
If we know the disparity, the baseline (given), and the focal length (given), then we can
get the depth of each pixel:

depth = (focal length × baseline) / disparity

Where:
Baseline is the distance between the two stereo cameras (camera separation).
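Applied to the whole disparity map, this relation gives a per-pixel depth map. The sketch
below uses the focal length reported above and a baseline of roughly 0.54 m, which is the
commonly quoted value for the KITTI grayscale stereo pair; both are stated here only for
illustration.

```python
import numpy as np

def depth_from_disparity(disparity, focal_px=718.856, baseline_m=0.54):
    """depth = focal length (pixels) * baseline (metres) / disparity (pixels)."""
    disp = disparity.copy()
    disp[disp <= 0.0] = 0.1          # avoid division by zero in invalid regions
    return focal_px * baseline_m / disp

# depth = depth_from_disparity(disparity)   # disparity from the previous step
```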
3.3 Visual Odometry

This is the part where we use the camera intrinsic matrix and the depth map to get the pose
of the robot, using epipolar geometry, the SIFT algorithm, the RANSAC algorithm and the P3P
algorithm.
3.3.1 Epipolar geometry

Here the points O1 and O2 are the camera centers and Xw is the world point. Our goal is to
find the relative position and orientation of the camera at time t and at the next time
step t+1. The camera center at time t is O1 and the camera center at time t+1 is O2. This
relative orientation and translation between the two camera centers can be described by
epipolar geometry: we are trying to compute the R and t shown in the image. The points O1,
O2 and Xw make up a plane called the epipolar plane.
3.3.2 Scale-Invariant Feature Transform (SIFT)
SIFT is a computer vision algorithm used for keypoint detection, description, and
matching in images. SIFT was developed by David Lowe in 1999 and has been widely
adopted in various computer vision applications due to its robustness and invariance
properties.
We will use SIFT to get matching key points between two consecutive images and then feed
these matches to the P3P and RANSAC steps described below.
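A minimal sketch of SIFT matching between two consecutive frames with OpenCV is shown
below. Lowe's ratio-test threshold of 0.7 is a common choice, not necessarily the one used
in this thesis, and the file names are placeholders.

```python
import cv2

img_t = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)      # placeholder names
img_t1 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_t, des_t = sift.detectAndCompute(img_t, None)
kp_t1, des_t1 = sift.detectAndCompute(img_t1, None)

# Match descriptors and keep only matches that pass Lowe's ratio test.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des_t, des_t1, k=2)
        if m.distance < 0.7 * n.distance]
```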
3.3.3 Projective 3-Point (P3P) algorithm

The projective 3-point algorithm is a method to find the camera orientation and translation
from three world points and their corresponding image coordinates. Though 3 is the least
number of points required, using only 3 points gives us 4 possible solutions; with a 4th
point we can narrow these down to only one possible solution.
The P3P method works only if the camera is internally calibrated, that is, if the intrinsic
matrix is known. Let A, B, C be the 3 world points and O be the projection center of the
camera; then we can generate a tetrahedron, that is, 4 triangles connecting these
individual points.
This is a 2-step approach. The first step is to calculate the lengths x, y, z of the
projection rays, and the second step is the computation of the translation and orientation.
In the first step we exploit the geometry of the triangles and use cosine relations to
relate the different lengths of the triangles to each other, as shown below. Combined,
these relations result in a polynomial of degree 4 whose coefficients depend on the known
distances between the world points and the angles between the projection rays.
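A standard way to write these cosine relations (following the classical Grunert
formulation) is: with x = |OA|, y = |OB|, z = |OC| the ray lengths, a = |BC|, b = |AC|,
c = |AB| the known distances between the world points, and α, β, γ the angles between the
corresponding pairs of projection rays,

y² + z² − 2yz·cos α = a²
x² + z² − 2xz·cos β = b²
x² + y² − 2xy·cos γ = c²

Eliminating two of the unknown lengths from these three equations yields the fourth-degree
polynomial mentioned in the text.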
Since we have a polynomial of degree 4, solving this equation yields 4 possible solutions.
If we have one extra point, then we can verify which of these 4 possible solutions is the
correct one. In other words, we get 4 solutions using 3 world coordinates, but only one of
them is physically correct.
In the second step we can get the translation and rotation of the camera at time t+1
relative to time t. The details of this step are given in the section where all the
algorithms are combined. The P3P method is simple and uses the minimum number of points to
give us the translation and rotation.
3.3.4 Random Sample Consensus
RANSAC takes some random (not all) points and, using the P3P algorithm, computes the
relative orientation and translation between the camera at time t and at time t+1. After
getting a translation and rotation from these points, it calculates the error between where
the remaining points (those not included in the calculation) actually are and where the
calculated translation and rotation predict them to be.
After a few samples of points, it chooses the set of points that gives the minimum error.
With SIFT and RANSAC together, we have a robust way of getting key point matches between
consecutive images.
Illustration 13 RANSAC model fitting
This section is about how SIFT, P3P, RANSAC and epipolar geometry are combined to compute
the pose of the robot:
1. We have already computed the depth map and the intrinsic matrix (K).
2. Using the SIFT algorithm we find matching key points between the camera images at
time t and time t+1.
3. Using the depth map we compute the world coordinates of each pixel in the camera image
at time t.
4. We use the P3P algorithm to compute the translation and rotation of the camera at
time t+1 relative to time t, using the image coordinates at time t+1 and the world
coordinates obtained in step 3 from the depth map at time t. The formula for this step is
as follows:

s · [u, v, 1]^T = K · [R | t] · [Xw, Yw, Zw, 1]^T

Here the extrinsic matrix [R | t] is the only unknown. We know the intrinsic matrix, the
image coordinates at time t+1 with their SIFT matching key points in the image at time t,
and the corresponding world coordinates.
We get hundreds of SIFT key point matches but only need 4 matching points to get the pose.
We can use the remaining points to make our algorithm more accurate by reducing outliers,
using RANSAC to select the 4 points that give us the best fit on our data.
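A sketch of this step using OpenCV's built-in PnP + RANSAC solver, which combines a
minimal P3P-style solve with RANSAC outlier rejection, is shown below. The variable names
follow the SIFT example above and are assumptions, and the reprojection threshold is a
typical value rather than the one used in the thesis.

```python
import cv2
import numpy as np

# pts3d_t: Nx3 points at time t (from the depth map)
# pts2d_t1: Nx2 matching pixel coordinates at time t+1 (from SIFT matches)
# K: 3x3 intrinsic matrix
def relative_pose(pts3d_t, pts2d_t1, K):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(pts3d_t, dtype=np.float32),
        np.asarray(pts2d_t1, dtype=np.float32),
        K, None,
        reprojectionError=3.0,            # pixel threshold for inliers (typical value)
        flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)            # rotation vector -> rotation matrix
    return R, tvec, inliers
```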
We can get the standard deviation of the error, to be used in sensor fusion, by comparing
the poses that this algorithm produces on a sequence with the ground truth.
Illustration 14 Position accuracy for stereo vision
Illustration 16 Orientation accuracy for stereo vision
Chapter 4: Inertial Measurement Unit
An IMU can be used for odometry as well, but it is usually less accurate than visual
odometry. An IMU works by measuring linear acceleration and angular velocity. The
acceleration is integrated twice to get the position, and the angular velocity is
integrated to get the orientation of the robot. Errors in the IMU are also integrated, so
its estimate drifts with time. GPS, on the other hand, never drifts with time, although
over short intervals it is less accurate than the IMU. The two can be fused to give a more
robust and accurate translation and orientation; they are complementary to each other.
The angular velocity and linear acceleration are converted to orientation and position as
shown below.
Illustration 17 Robot position & orientation from linear acceleration and angular velocity
The change in position is found from the acceleration by numerical integration. Because we
have discrete data, we take the average of two consecutive acceleration readings and
multiply it by Δt = 0.1 s (the data rate is 10 Hz) to get an approximately instantaneous
velocity, which is again multiplied by Δt = 0.1 s to get the change in position. Similarly,
we can get the orientation from the discrete angular velocity data.
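A sketch of this discrete integration (trapezoidal averaging of consecutive accelerometer
samples at 10 Hz) is shown below. In a real pipeline the accelerations would first be
rotated into the world frame and gravity removed; both are assumed already done here.

```python
import numpy as np

def integrate_position(accels, dt=0.1, v0=None, p0=None):
    """Dead-reckon position from accelerometer samples (Nx3 array, world frame,
    gravity already removed). Averages consecutive samples before integrating."""
    v = np.zeros(3) if v0 is None else np.asarray(v0, dtype=float)
    p = np.zeros(3) if p0 is None else np.asarray(p0, dtype=float)
    positions = [p.copy()]
    for a_prev, a_curr in zip(accels[:-1], accels[1:]):
        a_avg = 0.5 * (a_prev + a_curr)   # average of two consecutive readings
        v = v + a_avg * dt                # integrate acceleration -> velocity
        p = p + v * dt                    # integrate velocity -> position
        positions.append(p.copy())
    return np.array(positions)
```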
Error bias in an inertial measurement unit (IMU) refers to a systematic or constant offset
in the sensor's measurements. This bias results in a consistent error in the IMU's output,
which affects the accuracy of the IMU's measurements over time. It needs to be estimated
and corrected before integration.
We can correct for a constant bias affecting both position and orientation by taking IMU
readings at rest, with only gravitational acceleration acting on the sensor, and then
taking the mean of the data.
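A sketch of this bias estimate is shown below; `accel_rest` is a hypothetical Nx3 array of
stationary accelerometer samples, and gravity is assumed to act along the z axis, which is
an assumption about the axis convention.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, 9.81])      # assumed axis convention

def estimate_accel_bias(accel_rest):
    """Mean of stationary accelerometer readings minus gravity = constant bias."""
    return accel_rest.mean(axis=0) - GRAVITY

# corrected = accel_samples - estimate_accel_bias(accel_rest)
```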
The Euler angles are three angles introduced by Leonhard Euler to describe the orientation
of a rigid body with respect to a fixed coordinate system. They describe the orientation as
a sequence of three elemental rotations about the axes of the coordinate system.
We can convert the rotation matrix into Euler angles using the algorithm below.
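A standard version of this conversion, given here as a sketch using the ZYX
(roll-pitch-yaw) convention, is shown below; the exact convention used in the thesis may
differ.

```python
import numpy as np

def rotation_matrix_to_euler(R):
    """Return roll, pitch, yaw (radians) from a 3x3 rotation matrix, ZYX convention."""
    sy = np.hypot(R[0, 0], R[1, 0])
    if sy > 1e-6:                          # normal case
        roll = np.arctan2(R[2, 1], R[2, 2])
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = np.arctan2(R[1, 0], R[0, 0])
    else:                                  # gimbal-lock case (pitch near +/-90 deg)
        roll = np.arctan2(-R[1, 2], R[1, 1])
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = 0.0
    return roll, pitch, yaw
```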
We can compare the position and orientation that we get from the IMU with the ground truth
to get the standard deviation of the IMU model's error. This will be useful in the sensor
fusion step.
Illustration 20 Position accuracy for IMU
Illustration 22 Orientation accuracy for IMU
Chapter 5: Global Positioning System (GPS)

GPS stands for Global Positioning System, and it is a satellite-based navigation system
that allows users to determine their precise location and track movement almost anywhere
on Earth. The system works by using a network of satellites in orbit around the Earth.
These satellites continuously transmit signals that are received by GPS receivers, which
use them to compute the receiver's position.
GPS can be used to calculate the robot's position, but it is not very accurate. Unlike the
IMU, its error never drifts with time and stays almost the same. GPS has no long-term
errors, only short-term errors. The two sensors can therefore be fused to get a more
accurate position for the robot.
We get data from the GPS sensor in ENU (East, North, Up) coordinates, and they can be
converted into the robot's local frame. Given ENU coordinates (E, N, U) and the azimuth
angle (θ) of the local frame, the Up axes coincide:

Local Up = U
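A sketch of one possible convention for the remaining components is shown below: the Up
component is kept, and the East/North components are rotated about the Up axis by the
azimuth θ (measured clockwise from North) so that the local x axis points along the
heading. The axis convention here is an assumption, not necessarily the one used in the
thesis.

```python
import numpy as np

def enu_to_local(e, n, u, azimuth_rad):
    """Rotate ENU coordinates into a local frame whose x axis points along the
    heading given by the azimuth (clockwise from North). Up is unchanged."""
    x = n * np.cos(azimuth_rad) + e * np.sin(azimuth_rad)   # local forward
    y = -n * np.sin(azimuth_rad) + e * np.cos(azimuth_rad)  # local right (assumed)
    z = u                                                    # local up = U
    return x, y, z
```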
Illustration 23 GPS coordinates & ENU coordinates relation
We can get the standard deviation of the GPS error either by looking at its specifications
or by comparing its output to the ground truth. This information tells us how accurate our
GPS system is and how much it can be trusted, and it will be required when we do sensor
fusion later.
GPS service   Typical position accuracy
SPS           1.5 m
SBAS          0.6 m
DGPS          0.4 m
PPP           0.1 m
RTK           0.01 m
The following conversion shows how to get a standard deviation from other units of error
specification; it can also be used to convert a value expressed in one error metric into
another. For example, to convert a manufacturer's error that is given as R95 (the radius
containing 95 % of readings) into an RMS error:

0.409 × R95 error (given by the manufacturer) = RMS error
Chapter 6: Sensor Fusion

Sensor fusion, also known as sensor data fusion, is the process of integrating data from
multiple sensors. The primary goals of sensor fusion are to improve data accuracy, reduce
uncertainty, enhance situational awareness, and make more informed decisions. Classical
fusion methods are built on probabilistic techniques.
At the core of probabilistic robotics is the idea of estimating state from sensor data.
Probability theory and statistics are the building blocks of sensor fusion.
Each sensor conveys some information with a different degree of certainty. By integrating
different sensors, we can get the advantages of each while their disadvantages cancel each
other out.
For example, an IMU gives a very accurate estimate of distance over the short term, but it
is bad for long-term prediction of distance because its error is integrated, so we become
less sure of the position as time passes. GPS gives us position with less short-term
accuracy than the IMU, but it does not suffer from error that grows with time.

If two independent events A and B have probabilities P(A) and P(B), then the probability
that both happen, P(A AND B), is given by:

P(A AND B) = P(A) × P(B)
There is a difference between the two cases. When the random variables are no longer
independent, we must ask what the probability is that A happens when B has already
happened; this is the conditional probability P(A|B), and the joint probability becomes
P(A AND B) = P(A|B) × P(B). Notice that we can write the joint probability in two ways,
either P(A|B) × P(B) or P(B|A) × P(A), because both are equal. This expression is used to
formulate Bayes' rule.
Bayes' Rule:

P(A|B) = P(B|A) × P(A) / P(B)

Prior: P(A) is our initial belief before seeing the evidence.
Evidence: P(B) is the overall probability of observing the evidence.
Likelihood: P(B|A) is the probability of observing the evidence/data given that A is true.
Posterior: P(A|B) is our new updated belief that is much closer to reality.
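A small worked example (the numbers are purely illustrative): suppose a GPS fix suggests
the robot is inside a tunnel. Let P(A) = 0.2 be the prior probability of actually being in
the tunnel, P(B|A) = 0.9 the probability of such a fix when the robot is in the tunnel,
and P(B) = 0.3 the overall probability of such a fix. Then

P(A|B) = P(B|A) × P(A) / P(B) = 0.9 × 0.2 / 0.3 = 0.6,

so the belief of being in the tunnel rises from 0.2 to 0.6 after the measurement.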
6.1.2 Bayes filter
A Bayes filter is a powerful and widely used framework for estimating the state of a
dynamic system in the presence of noise and uncertainty. This filter is crucial in various
robotics and state estimation applications. It proceeds in the following steps:
1. Prediction (Prior Update): In this step, the filter predicts the new state of the system
based on the previous state and a mathematical model that describes how the system evolves
over time. The result is a probability distribution representing the expected state, called
the prior belief.

2. Correction (Measurement Update): When a new measurement arrives, the filter calculates
the likelihood of the measurement given the predicted state. Bayes' theorem is employed to
update the prior belief with this measurement information, producing the posterior belief.

3. Estimation (Posterior Estimate): From the posterior belief, an estimate of the system's
state is computed. Common estimates include the mean or mode of the posterior distribution.
4. Resampling (Optional): In some filter variants, like the particle filter, resampling is
performed to concentrate particles in the high-probability regions of the posterior
distribution. This helps prevent particle depletion and maintains a more accurate state
estimate.
5. Iteration: The process repeats as new measurements arrive, using the posterior belief
from the previous step as the prior belief for the next prediction. This recursive approach
allows the filter to track the state of the system over time, as summarized in the
equations below.
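In equation form (standard notation, with x_t the state, u_t the control input and z_t the
measurement), the prediction and correction steps described above are:

bel_pred(x_t) = ∫ p(x_t | u_t, x_{t-1}) · bel(x_{t-1}) dx_{t-1}        (prediction)
bel(x_t) = η · p(z_t | x_t) · bel_pred(x_t)                            (correction)

where η is a normalizing constant.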
Bayes filters are versatile, with variants like the Kalman filter for linear systems and
the particle filter for non-linear and non-Gaussian systems. They play a fundamental role
in enabling systems to make informed decisions and navigate uncertainty, which makes them
essential in modern robotics.
The particle filter is a derivative of the Bayes filter, and it is the sensor fusion method
that we will use. A particle filter, also known as a sequential Monte Carlo filter, is a
recursive Bayesian filtering method used for estimating the state of a dynamic system.
Particle filters are particularly useful in situations where the state estimation problem
involves nonlinear and non-Gaussian systems, where traditional filters like the Kalman
filter may not be applicable. Particle filters work by representing the probability
distribution of the state with a set of random samples (particles). The algorithm proceeds
as follows:

1. Initialization: A set of particles is generated to represent the possible states of the
system. These particles are drawn from the prior state distribution (the initial belief
about the state), which may include information from prior knowledge or initial sensor
readings.
2. Prediction: In each time step, the particles are propagated forward in time based on the
system's dynamics. This accounts for how the system is expected to evolve, and some process
noise is usually added to each particle.

3. Update (Weighting): Each particle is assigned a weight that represents how well it
agrees with the observed measurements. Particles that are more consistent with the
measurements receive higher weights, while those that are less consistent receive lower
weights.

4. Resampling: Particles are resampled with replacement from the existing set, and
particles with higher weights have a higher chance of being selected. This process
emphasizes the particles that are more likely to represent the true state.

5. State estimation: The state estimate is computed from the resampled particles, typically
as their mean. This estimate represents the most probable state of the system given the
measurements received so far.
Illustration 29 Particle filter algorithm []
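A minimal one-dimensional particle filter illustrating the five steps above is sketched
below; the motion and measurement noise values are illustrative assumptions, not the values
used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, control, measurement,
                         motion_std=0.1, meas_std=0.5):
    """One predict-weight-resample-estimate cycle for a 1-D state."""
    # 1-2. Prediction: propagate each particle through the motion model + noise.
    particles = particles + control + rng.normal(0.0, motion_std, particles.shape)

    # 3. Weighting: Gaussian likelihood of the measurement for each particle.
    weights = np.exp(-0.5 * ((measurement - particles) / meas_std) ** 2)
    weights /= weights.sum()

    # 4. Resampling: draw particles with replacement, proportional to weight.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]

    # 5. State estimation: the mean of the resampled set.
    return particles, particles.mean()

particles = rng.normal(0.0, 1.0, 1000)          # initialization (prior belief)
particles, estimate = particle_filter_step(particles, control=0.5, measurement=0.6)
print("state estimate:", estimate)
```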
We will build two separate models. One for position and one for orientation of the robot.
We have cleaned and processed the data and have already calculated the standard deviations
of each sensor's error (Tables 4 and 5).

6.3.1 First Model for position using Particle filter

1. We predict the position of the particles from the stereo-vision position estimate.
(Prediction step)
2. Each particle is assigned a weight that represents how well it agrees with the IMU
position estimate. (Correction step)
3. Generate a new sample set by resampling (with replacement) such that particles with
higher weights are more likely to be selected.
4. The combined vision and IMU position estimate is calculated as the average of the
resampled particles.
5. We predict the position of the particles from the fused vision and IMU position.
(Prediction step)
6. Each particle is assigned a weight that represents how well it agrees with the GPS
position estimate. (Correction step)
7. Generate a new sample set by resampling (with replacement) such that particles with
higher weights are more likely to be selected; the final fused position estimate is the
average of these particles. (State estimation)
6.3.2 Second Model for orientation using Particle filter

1. We predict the orientation of the particles from the stereo-vision orientation estimate.
(Prediction step)
2. Each particle is assigned a weight that represents how well it agrees with the IMU
orientation estimate. (Correction step)
3. Generate a new sample set by resampling (with replacement) such that particles with
higher weights are more likely to be selected; the fused orientation estimate is the
average of these particles. (State estimation)

A compact sketch of the two-stage position update is given below.
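The snippet below is a sketch of one way the two-stage position model of section 6.3.1
could be implemented for a single axis, using the standard deviations from Table 4 as the
measurement spreads. The sensor readings, particle count, and the spread used in the
prediction step are assumptions made for illustration; the orientation model of 6.3.2
follows the same pattern with vision and the IMU only.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_and_resample(particles, measurement, sigma):
    """Weight particles by a Gaussian likelihood of the measurement, then resample."""
    w = np.exp(-0.5 * ((measurement - particles) / sigma) ** 2)
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

def fused_position(vision_pos, imu_pos, gps_pos, n=1000,
                   sigma_vision=0.03867, sigma_imu=0.5018, sigma_gps=0.0404):
    """Two-stage fusion for one axis of position (sketch of the model in 6.3.1)."""
    # Stage 1: predict particles around the vision estimate, weight by IMU, resample.
    particles = rng.normal(vision_pos, sigma_vision, n)
    particles = weight_and_resample(particles, imu_pos, sigma_imu)
    vision_imu = particles.mean()          # combined vision + IMU estimate

    # Stage 2: predict from the fused vision+IMU estimate, weight by GPS, resample.
    particles = rng.normal(vision_imu, sigma_vision, n)
    particles = weight_and_resample(particles, gps_pos, sigma_gps)
    return particles.mean()                # final fused position estimate

print(fused_position(vision_pos=10.02, imu_pos=10.30, gps_pos=10.05))
```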
In the next chapter we will test this model using the KITTI dataset and the standard
deviations of error listed below.
Following are the tables for position and orientation standard deviation of error from
ground truth:
Sensor   Standard deviation of position error
Vision   0.03867
IMU      0.5018
GPS      0.0404

Table 4 Sensor types and their respective standard deviation of error for position estimate
Sensor   Standard deviation of orientation error
Vision   0.0427
IMU      0.012435

Table 5 Sensor types and their respective standard deviation of error for orientation estimate
Chapter 7: Experimental Results
7.1 Set up
We have used the KITTI dataset for our calculations. The KITTI dataset, or the KITTI
Vision Benchmark Suite, is a widely used collection of datasets for various computer
vision and robotics tasks, primarily focused on autonomous driving and scene
understanding.
Illustration 32 Set up of sensors in KITTI dataset
It contains a GPS/IMU unit and stereo grayscale cameras, which will be used in our
calculations. The above image helps us establish the sensors' locations and orientations
relative to one another.
2. The images that we get are not raw; they are rectified.
3. The sensor readings are synced at 10 Hz, that is, 10 readings per second.
We load the readings that we get from our sequence into our sensor fusion model to get
experimental results.
Illustration 33 Vision & IMU fused position readings
Illustration 35 Vision, IMU and GPS fused position readings (Zoomed in)
The following table shows the Root mean squared (RMS) error for position:
Sensor   RMS position error
IMU      34.084
Vision   5.5174
GPS      0.0703

Table 6 Each sensor model and their respective RMS error for position estimate
Illustration 36 Vision and IMU fused orientation readings
The following table shows the Root mean squared (RMS) error for orientation:
Sensor   RMS orientation error
IMU      0.0220
Vision   0.0169

Table 7 Each sensor model and their respective RMS error for orientation estimate
Chapter 8: Conclusion
8.1 Position estimates
The first sensor fusion model, which fuses position from the IMU and stereo vision, gives a
very good estimate that is comparable to GPS. However, because both the IMU and visual
odometry are bad at long-term accuracy, this model will drift with time: small errors
integrate over time and there is no long-term correction such as GPS to remove them.

The next sensor fusion model, which combines all three sensors, gives the best estimate and
the lowest root mean squared error compared to each individual sensor. The fused path lies,
most of the time, between the GPS path and the IMU-stereo vision path. This combination
will not drift with time, as GPS ensures that there are no long-term errors. This model is
reliable, with high accuracy and precision compared to each individual sensor alone.
8.2 Orientation estimates

Just as for the position estimate, the fused orientation curve lies, most of the time,
between the curves for the IMU and stereo vision. The fused model has the lowest RMS error.
References
Rover” by H. P. Moravec
Exploration Rovers,” IEEE Robot. Autom. Mag., vol. 13, no. 2, pp. 54–62, 2006.
Model Fitting with,” Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
Proc. 16th Anuu. Conf. IEEE Ind. Electron., 1990, vol. 1, pp. 707–722.
systems,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 5, pp. 901–931, Sep./Oct.
1989.
11. R. C. Luo and M. G. Kay, Multisensor Integration and Fusion for Intelligent
12. R. E. Kalman, "A new approach to linear filtering and prediction problems," in
13. Gordon, N.J.; Salmond, D.J.; Smith, A.F.M., "Novel approach to nonlinear/non-
14. Maimone, M.; Cheng, Y.; Matthies, L. "Two years of Visual Odometry on the
15. “Information Theory, Inference and Learning Algorithms” by David J.C. MacKay.
16. “Deep learning” by Aaron Courville, Ian Goodfellow, and Yoshua Bengio.
20. Multisensor Fusion and Integration: Theories, Applications, and its Perspectives
Ren C. Luo, Fellow, IEEE, Chih Chia Chang, and Chun Chi Lai.
21. Sensor Fusion INS/GNSS based on Fuzzy Logic Weighted Kalman Filter by
23. Navigation with IMU/GPS/Digital Compass with Unscented Kalman Filter Pifu
24. Wikipedia