Depth based Sensor Fusion in Object

Detection and Tracking

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor

of Philosophy in the Graduate School of The Ohio State University

By

Ankita Sikdar, B.Tech.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Dissertation Committee

Dong Xuan, Co-Advisor

Yuan F. Zheng, Co-Advisor

Han-Wei Shen
Copyrighted by

Ankita Sikdar

2018

Abstract

Multi-sensor fusion is the method of combining sensor data obtained from multiple sources

to estimate the state of the environment. Its common applications are in automated manufacturing,

automated navigation, target detection and tracking, environment perception, biometrics,

etc. Among these applications, object detection and tracking is particularly important in the fields of robotics and computer vision, with uses in diverse areas such as video surveillance, person following and autonomous navigation. In the context of purely two-dimensional (2-D) camera based tracking, situations such as erratic motion of the object, scene changes and occlusions, along with noise and illumination changes, are an impediment to successful object tracking. Integration of information from range sensors with cameras

helps alleviate some of the issues faced by 2-D tracking. This dissertation aims to explore

novel methods to develop a sensor fusion framework to combine depth information from

radars, infrared and Kinect sensors with an RGB camera to improve object detection and

tracking accuracy.

In indoor robotics applications, the use of infrared sensors has mostly been limited to a

proximity sensor to avoid obstacles. The first part of the dissertation focuses on extending

the use of these low-cost, but extremely fast infrared sensors to accomplish tasks such as

identifying the direction of motion of a person and fusing the sparse range data obtained

from infrared sensors with a camera to develop a low-cost and efficient indoor tracking

sensor system. A linear infrared array network has been used to classify the direction of

motion of a human being. A histogram based iterative clustering algorithm segments data

into clusters, from which extracted features are fed to a classification algorithm to classify

the motion direction. To address circumstances in which a robot tracks an object that behaves unpredictably, making abrupt turns or stopping while moving along an irregular, wavy track, such as when a personal robot assistant follows a shopper in a store, a tourist in a museum or a child playing around, an adaptive motion model has been proposed to keep track of the object. Therefore, an array of infrared sensors can be advantageous over a depth camera when discrete data is required at a fast processing rate.

Research regarding 3-D tracking has proliferated in the last decade with the advent of the

low-cost Kinect sensors. Prior work on depth based tracking using Kinect sensors focuses

mostly on depth based extraction of objects to aid in tracking. The next part of the

dissertation focuses on object tracking in the x-z domain using a Kinect sensor, with an

emphasis on occlusion handling. Particle filters, used for tracking, are propagated based on

a motion model in the horizontal-depth framework. Observations are obtained by

extracting objects using a suitable depth range. Particles, depicted by patches extracted in

the x-z domain, are associated to these observations based on the closest match according

to a likelihood model and then a majority voting is employed to select a final observation,

based on which, particles are reweighted, and a final estimation is made. An occluder

tracking system has been developed, which uses a part based association of the partially

visible occluded objects to the whole object prior to its occlusion, thus helping to keep

track of the object when it recovers from occlusion.

The latter part of the dissertation discusses a classical data association problem, where

discrete range data from a depth sensor has to be associated to 2-D objects detected by a

camera. A vision sensor can locate objects only in a 2-D plane, and estimating distance using a single vision sensor has limitations. A radar sensor returns the range of objects accurately; however, it does not indicate which range corresponds to which object. A sensor fusion approach for radar-vision integration has been proposed which, using a modified Hungarian algorithm with geometric constraints, associates data from a simulated radar with 2-D information from an image to establish the three-dimensional (3-D) position of vehicles around an ego vehicle on a highway. This information would help an

autonomous vehicle to maneuver safely.

Dedication

I dedicate this dissertation to my mother, who has always been my strength and

inspiration.

Acknowledgments

I would like to express my sincere gratitude to my co-advisors Dr. Dong Xuan and Dr.

Yuan Fang Zheng, who have guided me throughout my doctoral studies, inculcating in me

a spirit of independent research. I would especially thank Dr. Zheng for his continuous

inspiration and guidance as a faculty member from the Department of Electrical and

Computer Engineering, outside of my home Department, Computer Science and

Engineering. I would also like to thank my committee member Dr. Han-Wei Shen. This

work would not have been possible without their support.

I would like to thank all my lab mates at the Multimedia and Robotics Laboratory, with

whom I have had the pleasure of holding interesting discussions about my work and

projects, and participating in numerous data collection experiments. Some special thanks

also go to my batchmates at The Ohio State University, who made initial study sessions

fun. I would also like to thank my friends, made during my undergraduate studies at West

Bengal University of Technology, for providing the inspiration and courage to pursue

doctoral studies abroad.

I would also like to thank my parents, my grandparents, my younger sister, cousins,

extended family and friends, who have been a source of constant support.

Finally, I would like to thank my husband for always being there for me through thick and

thin, with his strong encouragement and active support.

Vita

2008……………………….……….…...........Mahadevi Birla Girls’ Higher Secondary


School

July 2012…….…………………………...…B.Tech. Computer Science and Engineering,


West Bengal University of Technology

August 2012 to December 2017……………. Graduate Teaching Associate, Department


of Computer Science and Engineering, The
Ohio State University

Publications

Sikdar, A., Cao, S., Zheng, Y.F. and Ewing, R.L., 2014, May. Radar depth association
with vision detected vehicles on a highway. In Proc. 2014 IEEE Radar Conference (pp.
1159-1164).

Sikdar, A., Zheng, Y.F. and Xuan, D., 2015, May. An iterative clustering algorithm for
classification of object motion direction using infrared sensor array. In Proc. 2015 IEEE
International Conference on Technologies for Practical Robot Applications (TePRA) (pp.
1-6).

Sikdar, A., Zheng, Y.F. and Xuan, D., 2015, June. Using an A-priori learnt motion model
with particle filters for tracking a moving person by a linear infrared array network. In
Proc. 2015 IEEE National Aerospace and Electronics Conference (NAECON) (pp. 75-
80).

Sikdar, A., Zheng, Y.F. and Xuan, D., 2016, September. Robust object tracking in the XZ
domain. In Proc. 2016 IEEE Multisensor Fusion and Integration for Intelligent Systems
(MFI) (pp. 499-504).

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract ............................................................................................................................... ii
Dedication ........................................................................................................................... v
Acknowledgments.............................................................................................................. vi
Vita................................................................................................................................... viii
List of Tables .................................................................................................................... xii
List of Figures .................................................................................................................. xiii
Chapter 1. Introduction ....................................................................................................... 1
Chapter 2. Use of Low Cost Range Sensors for Indoor Object Tracking Applications .... 9
2.1. Introduction .............................................................................................................. 9
2.2. Infrared Sensors ..................................................................................................... 12
2.3. An Iterative Clustering Algorithm for Classification of Object Motion Direction
Using Infrared Sensor Array ......................................................................................... 14
2.3.1. Introduction ..................................................................................................... 14
2.3.2. Related Work .................................................................................................. 15
2.3.3. Methodology ................................................................................................... 17
2.3.4. Results ............................................................................................................. 24
2.4. Using an A-Priori Learnt Motion Model with Particle Filters for Tracking a
Moving Person by a Linear Infrared Array Network.................................................... 31
2.4.1. Introduction ..................................................................................................... 31
2.4.2. Related Work .................................................................................................. 32
2.4.3. Methodology ................................................................................................... 32
2.4.4. Results ............................................................................................................. 38
2.4.5. Conclusion ...................................................................................................... 43
2.5. An Infrared Sensor Guided Approach to Camera Based Tracking of Erratic Human
Motion ........................................................................................................................... 45
2.5.1. Introduction ..................................................................................................... 45
2.5.2. Related Work .................................................................................................. 46
2.5.3. Methodology ................................................................................................... 47
2.5.4. Results ............................................................................................................. 57
2.5.5. Conclusion ...................................................................................................... 62
Chapter 3. Occlusion Handling in Tracking ..................................................................... 63
3.1. Introduction ............................................................................................................ 63
3.2. Related Work ......................................................................................................... 65
3.3. Methodology .......................................................................................................... 67
3.3.1. Object Representation ..................................................................................... 67
3.3.2. Object Extraction and Filtering ....................................................................... 68
3.3.4. Particle filter tracker ....................................................................................... 72
3.4. Results .................................................................................................................... 76
3.5. Conclusion ............................................................................................................. 85
Chapter 4. Data Association in Tracking ......................................................................... 86
4.1. Introduction ............................................................................................................ 86
4.2. Related Work ......................................................................................................... 88
4.3. Methodology .......................................................................................................... 89
4.3.1. Derivation of equation .................................................................................... 89
4.3.2. Procedure ........................................................................................................ 95
4.4. Results .................................................................................................................... 99
4.5. Conclusion ........................................................................................................... 101
Chapter 5. Conclusion and Future Work ....................................................................... 102
References ....................................................................................................................... 107

List of Tables

Table 1. Data from each sensor representing motion in the left to right direction ........... 25
Table 2. Classification Accuracy ...................................................................................... 28
Table 3. Confusion Matrix for KNN, k=5 (Predicted classes shown in columns, actual
classes shown in rows) ...................................................................................................... 28
Table 4. Confusion matrix for SVM classifier.................................................................. 39

List of Figures

Figure 1. A graph showing the non-linearity of the distance measurement as returned by


the infrared sensors. .......................................................................................................... 13
Figure 2. Timing diagram of the SHARP GP2Y0A710K0F infrared sensor as provided by
the manufacturer’s manual. ............................................................................................... 13
Figure 3. Infrared sensor array setup which is installed on the top of a robotic platform.
The platform co-ordinate system is shown, and data is measured w.r.t this co-ordinate
system (a) Each sensor is placed with a separation of 0.5ft on the platform (i.e. at 0.5ft,
1ft and 1.5ft distances along the x-axis); (b) The platform is mounted at a height of 2.42ft
above the ground. .............................................................................................................. 16
Figure 4. Raw data collected from three infrared sensors. (a) Person walking away from
the platform and then towards it; (b) Person moving from left to right and then from right
to left across the platform in a straight line; (c) Person moving from right to left and then
from left to right diagonally. ............................................................................................. 19
Figure 5. Plots showing the intermediate processing steps. (a) Data collected between the
1st and 2nd second; (b) Histogram of the range (Y) or longitudinal distance values; (c)
Clustering done in the range domain; (d) Range domain clusters merged in time domain
to form super clusters representing an object in motion or a stationary object. .............. 21
Figure 6. A plot of the two-dimensional feature space. Certain classes may overlap at the
boundaries. The classes are: 1. In front and away; 2. In front and towards; 3. Left to right
straight line; 4. Right to left straight line; 5. Left to right diagonal line; 6. Right to left
diagonal line; 7. Stationary. .............................................................................................. 23
Figure 7. (a-c) Person moving from left to right across the infrared sensor array which is
mounted on a robotic platform. ......................................................................................... 25

Figure 8. Plot showing data analysis for time period 3s – 4s. (a) Raw data capturing
person’s motion; (b) histogram showing peak values (c-d) clustering, with the one in red
being the real cluster. ........................................................................................................ 26
Figure 9. Plots showing straight lines fitting data points in (a) the x-t plane and (b) y-t
plane. The slope pair (0.1581, 0.0003371) is used as the feature vector for classification.
It can also be verified that this slope pair falls in the domain of class 3 as represented by
the feature space in Fig. 6. ................................................................................................ 27
Figure 10. Color coded peaks because of an object in motion as detected by the SVM
classifier. Other peaks are also noticed; however, they are created by inconsistent data
and are discarded by the SVM as true negatives. ............................................................. 38
Figure 11. (a)-(b) A traditional particle filter with 500 particles is used to track the
infrared sensor simulated data with an average position estimation error of 0.9008 ft. ... 40
Figure 12. (a)-(b) A particle filter with 500 particles and a continuously updated motion
model with coefficients of 0.5(Fig. 12(a)) and 0.6(Fig. 12(b)) respectively is used to track
the infrared sensor simulated data with an average position estimation error of 0.35ft. .. 41
Figure 13. True position versus estimation. (a) A traditional particle filter with a fixed
linear model tracks object on real infrared sensor data with an average position estimation
error of 0.64ft (b)A particle filter receiving feedback from the controller regarding
position estimation error tracks the object on real infrared sensor data with an average
position estimation error of 0.34ft. .................................................................................. 42
Figure 14. A plot showing the average error in the position estimation of the object at
different coefficient values for the motion model update parameters. For these runs, a
coefficient of 0.6 produced good results. .......................................................................... 43
Figure 15. Infrared sensor setup with camera (a)The coordinate system (b) Distance
measurements shown on the robotic platform .................................................................. 48
Figure 16. Images from some video sequences illustrating the target tracking under
various occlusion/illumination scenarios. (a) Video sequence to demonstrate target
walking in a scene without any occlusion. Frames 1905, 1941, 2019 and 2055 have been
shown; (b) Video sequence to demonstrate target being occluded by an object with

similar appearance as well as by an object with a different appearance. Frames 1359,
1407, 1425, 1431, 1521, 1545, 1659, 1701, 1725, 1773, 1857 and 2236 have been shown;
(c) Video sequence to demonstrate target being tracked when multiple persons are present
in the scene, however, there is no occlusion. Frames 408, 433, 450, 492 and 505 have
been shown; (d) Video sequence to demonstrate target occluded in presence of other
objects as well. Frames 1002, 1074, 1110 and 1182 have been shown; (e) Video sequence
to demonstrate target occluded in presence of other objects as well. Frames 300, 306, 318
and 360 have been shown; (f) Video sequence to demonstrate target being tracked in the
presence of other objects in low illumination condition in the hallway. Frames 2221,
2293, 2329 and 2341 have been shown. ........................................................................... 59
Figure 17. Graphs showing tracking error at a frequency of 3 s⁻¹ for two different
sequences (a) and (b). ....................................................................................................... 61
Figure 18. (a) Depth image showing the human body. (b) projection of the human body
depth data on the x-z plane. (c) normalized depth histogram for the human object ......... 69
Figure 19. Image sequence showing an object executing a simple linear motion being
tracked. .............................................................................................................................. 78
Figure 20. Image sequence showing an object facing partial occlusion being tracked
correctly. At frame number 262, the two objects are at similar depths and the target is partly
occluded. ........................................................................................................................... 79
Figure 21. Image sequence showing an object that is fully occluded for a short time,
however, on reappearing, it is tracked again..................................................................... 80
Figure 22. Image sequence showing an object that is partially occluded for a long
duration of time; it is successfully tracked all through, and although it is heavily occluded towards the end, the algorithm tracks it correctly till the end. .................................. 81
Figure 23. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 1); Bold black bounding box represents target, light black bounding
box represents occluder..................................................................................................... 82

Figure 24. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 2); Bold black bounding box represents target, light black bounding
box represents occluder..................................................................................................... 83
Figure 25. Partially visible target is obstructed by an occluder which in turn is occluded;
Bold black bounding box represents target; light black bounding box represents occluder.
........................................................................................................................................... 84
Figure 26. A real-world figure projected onto camera co-ordinate plane......................... 90
Figure 27. Plot of size of chessboard projected at increasing depth ranges ..................... 92
Figure 28. Plot confirms that observed data follows derived Eq. (28) ............................. 93
Figure 29. Vehicles with their ranks based on their relative positions determined by the
size. ................................................................................................................................... 94
Figure 30. Testing images. (a) Testing on cars having the same average size; (b) testing on a partly occluded vehicle (a small car occluded by a large truck); (c) & (d) testing on a
large vehicle along with cars............................................................................................. 96
Figure 31. Radar simulation results for Fig. 30(a)-(d) respectively ............................... 100

Chapter 1. Introduction

Object tracking is an important and challenging field of research in the areas of computer

vision and robotics and finds applications in human-computer interaction, surveillance in public places, player tracking in sports and person following. The goal of object tracking is, given an initial state (position, bounding box, size) of the target, to robustly estimate the position of the target object in successive frames of

the input sequence. Some of the difficulties faced by an object tracker include changes in

illumination and shadows, similarity with the background scene, unpredictable motion

behavior and occlusion.

Object tracking algorithms can be categorized into 2-D tracking algorithms and 3-D

tracking algorithms based on whether or not tracking includes the depth dimension. 2-D object tracking has been prevalent and uses monocular RGB cameras. [1] is a

famous survey that categorizes the tracking methods based on the different object

representations and motion representations used, discusses the pros and cons and lists the

important object tracking issues. In [2], Li et al present a survey of 2-D appearance models

for visual object tracking, focusing on visual representations and statistical modeling

schemes for tracking-by-detection. Visual representations try to robustly describe the

spatio-temporal characteristics of object appearance, while the statistical modeling

schemes for tracking-by-detection emphasize capturing the generative and

discriminative statistical information of the object regions. Effective appearance models

combine both visual representation and statistical modeling. In [3], Wu et al carry out large

scale experiments to evaluate the performance of existing online tracking algorithms,

identify new challenges and provide evaluation metrics for in-depth analysis of tracking

algorithms from several perspectives.

When building a 2-D appearance model, the right balance between tracking robustness and

tracking accuracy must be achieved. To improve tracking accuracy, more visual features

and geometric constraints are incorporated into the models, resulting in a precise object

localization, which might also lower the generalization capabilities of the models when

there are variations in the appearance of the target object. On the other hand, to improve

tracking robustness, the appearance models might relax some constraints, which might lead

to ambiguous localization. Additionally, a more complex model composed of many

components may improve tracking robustness at the cost of increased computational

power, when compared to a simpler model that may be computationally more efficient but

has a lower discriminability. Due to the hardware limits of processing speed and memory

usage, the rate at which frames acquired from the video are processed is typically low. The

object’s appearance model may have undergone some variation due to occlusion or

illumination changes, and thus the appearance model used to represent the target object

must be able to generalize well and have the capability to adapt itself based on these

changes. Another aspect to consider is that with a low frame rate, the object may have

executed large or abrupt motion and thus the motion model is also crucial for object

tracking. Good location prediction based on the dynamic or adaptive motion model can

narrow down the search space and lead to improved tracking efficiency and robustness.

Information about the background is also essential as it can be used to effectively

discriminate the foreground, as in [4], or it can serve as the tracking context explicitly as

in [5]. Due to the information loss caused by projection from 3-D to 2-D, the appearance

models in 2-D cannot accurately estimate the poses of tracked objects, leading to failures

in case of occlusion. Local models have been proposed such as in [6, 7], which help when

the object has undergone partial change in appearance, such as when it is partially occluded

or partially deformed.

When multiple overlapping cameras are used to fuse information from different

viewpoints, tracking in a large camera network with transfer of target information from

one camera sub-network to another becomes an important study. In [8], Ercan et al propose

a sensor network of cameras to track a single object in the presence of static and moving

occluders, where each camera does some simple processing to detect the horizontal

position of the target. This data is then sent to the head of the cluster (a subset of cameras) to

track the object.

RGB-D tracking is popular among researchers and [9] provides a benchmark for standard

RGBD algorithms along with comparison of the various algorithms. The 3-D object

tracking algorithms use depth information such as when depth is obtained from stereo or

multiple cameras. The algorithms can then also be extended to crowded scenes to handle

occlusion such as in [10], where pixels are assigned to humans based on their distance and

color models. In [11], multiple human beings are tracked based on motion estimation and

detection, background subtraction, shadow removal and occlusion detection. In [12], stereo

images are used, and appearance-based representation methods based on luminance with disparity information and Local Steering Kernel (LSK) descriptors are employed. In [13], the

occlusion situation is analyzed by exploiting the spatiotemporal context information, which

is further double checked by the reference target and motion constraints to improve

tracking performance along with several templated mask approaches.

Depth information can also be obtained from a range of depth sensors such as radar, laser,

infrared or ultrasonic sensors. These are typically integrated with RGB cameras in a multi-

sensor fusion framework. The addition of depth dimension to the traditional 2-D camera

based object tracking helps to alleviate some of the important challenges faced by 2-D

tracking such as partial or full occlusion of object, better object segmentation based on

depth, better distinguishability between objects having similar appearances. In [14], Fod et

al have described a method for real time tracking of objects with multiple laser range

finders covering a workspace. Vision oriented methods are adapted to laser scanners,

grouping range measurements into entities such as blobs and objects. In [15], Vu et al have

presented a method for simultaneous detection and tracking of moving objects from a moving

vehicle equipped with a single-layer laser scanner. In [16], Labayrade et al. use a laser scanner to first detect objects, and then a stereo vision system to

validate the detections. In [17], Cho et al present a reliable and effective moving object

detection and tracking system for a self-driving car using radars, LIDARS and vision

sensors. In [18], Kumar et al have combined a thermal infrared sensor along with a wide

angle RGB camera to correct the errors of the camera and reduce the false positives,

improve segmentation of tracked objects and correct false negatives.

Research regarding 3-D tracking has proliferated in the last decade with the advent of the

low-cost Kinect sensors [19]. In [20], a combination of histogram of oriented gradients and

histogram of oriented depths is used for detecting humans. In [21], a hierarchical

spatiotemporal data association method (HSTA) is introduced to robustly track multiple

objects without prior knowledge. In [22], an adaptive depth segmentation procedure is

described to perform real-time tracking analysis. In [23], Nakamura et al propose a 3-D

object tracking method by integrating the range and color information using camera

intrinsic parameters and relative transformation between the cameras, followed by tracking

the desired target regions by processing the depth pixels with color information.

Object tracking algorithms generally follow a bottom-up approach or a top down approach

or a combination of both. In the first case, objects are usually extracted from the image frame or data and are then used for tracking, as in model-based approaches [24, 25] or template matching approaches [26]. Particle filtering [27] is a top-down approach, as it generates a set of hypotheses on which evaluation takes place. It is a widely used technique in object tracking, which applies a recursive Bayesian filter based on samples drawn from a proposal distribution. An advantage of particle filters is that they can be applied to non-

linear and non-Gaussian systems. In [28], Spinello et al use a bottom up detector which

generates candidate detection hypotheses that are validated by a top down classifier

procedure for tracking people in 3-D.
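
To make the mechanics concrete, the following is a minimal sketch, not the dissertation's implementation, of one predict-update-resample cycle of a bootstrap particle filter for a 1-D position track; the Gaussian random-walk motion model, the Gaussian measurement likelihood and all parameter values are illustrative assumptions.

import numpy as np

def particle_filter_step(particles, weights, measurement,
                         motion_std=0.2, meas_std=0.5):
    """One predict-update-resample cycle of a bootstrap particle filter.

    particles   : (N,) array of hypothesized 1-D positions
    weights     : (N,) array of normalized particle weights
    measurement : scalar range observation for this time step
    """
    n = len(particles)

    # Predict: propagate each particle through a random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_std, size=n)

    # Update: reweight particles by the Gaussian likelihood of the measurement.
    weights = weights * np.exp(-0.5 * ((measurement - particles) / meas_std) ** 2)
    weights += 1e-300                      # guard against all-zero weights
    weights /= weights.sum()

    # Resample when the effective sample size drops too low.
    if 1.0 / np.sum(weights ** 2) < n / 2:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)

    estimate = np.average(particles, weights=weights)  # posterior mean estimate
    return particles, weights, estimate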

This dissertation is focused on depth based sensor fusion in solving the challenges of object

detection and tracking. A part of the research has focused on using the low-cost range

sensors to solve some of the issues of object tracking such as object association, where

objects detected in a 2-D image have to be associated with a set of depth values

corresponding to the scene returned by a radar sensor. This has been applied to the task of

predicting the 3-D positions of vehicles on a highway with respect to an ego vehicle, which would facilitate the navigation of a self-driving autonomous vehicle. Another issue that

has been explored is the use of infrared (IR) sensors with the aim of utilizing these low-

cost, low-power, easy to use sensors for indoor robotic applications beyond obstacle

distance estimation, such as using an IR sensor array network for classifying the direction

of motion of a human walking in the field of view of the sensors. The direction estimate

can then be used by the robotic platform to guide its motion, avoiding the object or the

person, for example.
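
As a rough illustration of the assignment step underlying such radar-to-camera association (the dissertation's method uses a modified Hungarian algorithm with geometric constraints; this sketch shows only a plain Hungarian assignment), a cost matrix between camera-detected objects and radar ranges can be solved with SciPy's linear_sum_assignment. The size-based depth estimates and the absolute-difference cost below are assumed placeholders.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_ranges(estimated_depths, radar_ranges):
    """Assign each camera-detected object (with a rough depth estimate,
    e.g. inferred from its apparent size) to one radar range reading."""
    # Cost = absolute disagreement between the rough depth estimate and the radar range.
    cost = np.abs(np.subtract.outer(estimated_depths, radar_ranges))
    obj_idx, range_idx = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(obj_idx, range_idx))

# Example: three detected vehicles and three radar returns (values in meters, illustrative).
print(associate_ranges(np.array([22.0, 48.0, 35.0]),
                       np.array([34.1, 23.5, 47.2])))
# -> [(0, 1), (1, 2), (2, 0)]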

Another issue that often poses a challenge in object tracking with particle filters is that of choosing an appropriate motion model, one dynamic enough to adapt to changes in the target object's trajectory. Such changes could arise because the object executes random and erratic motion, or because few frames are processed per second, so that the object's position has shifted from the position predicted by the motion model.

Small errors in position estimation could add up over time making the particle filter

completely lose track of the person. Thus, instead of using a fixed motion model, a motion

model is statistically learnt from the initial target motion data and subsequently this model

is used with the particle filtering approach to track the person. In addition, the learnt motion

model is regularly updated to support the particle filtering approach in establishing a more

accurate track of the person.
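
One simple way such a continuously updated motion model could be realized is sketched below under assumptions of my own: a constant-velocity model whose velocity estimate is blended with the most recently observed displacement using a fixed coefficient, in the spirit of the update coefficients reported later for Fig. 12 and Fig. 14.

import numpy as np

class AdaptiveMotionModel:
    """Constant-velocity model whose velocity estimate is blended with the
    most recently observed displacement (illustrative sketch, not the
    dissertation's exact formulation)."""

    def __init__(self, initial_velocity, coeff=0.6):
        self.velocity = np.asarray(initial_velocity, dtype=float)
        self.coeff = coeff          # weight given to the previously learnt velocity

    def predict(self, position, dt=0.1):
        # Propagate the current position with the learnt velocity.
        return np.asarray(position, dtype=float) + self.velocity * dt

    def update(self, observed_displacement, dt=0.1):
        # Blend the learnt velocity with the velocity implied by the newest observation.
        observed_velocity = np.asarray(observed_displacement, dtype=float) / dt
        self.velocity = self.coeff * self.velocity + (1.0 - self.coeff) * observed_velocity

# Usage: predict particle positions, then refine the model from the observed motion.
model = AdaptiveMotionModel(initial_velocity=[1.2, 0.0], coeff=0.6)
pred = model.predict([0.0, 0.0])
model.update(observed_displacement=[0.15, 0.02])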

Another multi-sensor based tracking approach has been proposed using an infrared sensor

array based secondary tracker to deal with abrupt changes in motion, and a camera based

primary tracker to deal with simple non-linear motion. The former uses an omnidirectional

motion model to keep track of detections and thereby helps to re-initialize the latter in case

it fails to track the object due to sudden motion changes. Additionally, location prediction

made by the infrared tracker is used to influence the likelihood function for the primary

tracker, which helps achieve better object tracking results than relying solely on the

primary tracker.

An important challenge that most 2-D trackers would fail to address is that of occlusion.

Kinect sensors have been used for robust object tracking with occlusion handling in the x-

z domain instead of the traditional x-y domain. Tracking is done by particle filters which

are propagated based on the motion model in the horizontal-depth movement framework.

An adaptive (based on the occlusion status of the target) joint color and depth histogram model

is used to represent a human being. Depth segmented objects are filtered out in each frame.

Particles, depicted by patches extracted in the x-z domain, are associated to these depth

segmented objects (observations) based on a closest match according to a likelihood model

and then a majority voting is employed to select a final observation, based on which,

particles are reweighted, and a final estimation is made. The addition of the depth

dimension in motion propagation and tracking alleviates challenges faced due to change of

object appearance, illumination changes and partial occlusion (or full occlusion in some

cases). Most occlusion handling strategies aim to model prior appearance, shape or motion

models of the occluded target and match it with the observed portion of the target upon its

reappearance. This strategy often fails due to change in the dynamics of the target to be

tracked. In this work, occlusion is handled by using an occluder tracking approach, which

indirectly provides position estimates for the occluded target, thus improving the likelihood

of observing the target correctly when it comes out of its state of occlusion. Complex

occlusion scenarios have been explored and utilizing depth distribution to keep track of

target in heavily occluded scenes improves object tracking accuracy.
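
Since the observations above come from cutting the scene at a suitable depth range, a minimal sketch of that extraction step is shown below; it assumes a Kinect-style depth image in millimeters and an illustrative window around the target's last estimated depth, and is not the dissertation's full segmentation pipeline.

import numpy as np

def extract_depth_slice(depth_image, target_depth_mm, window_mm=400):
    """Return a binary mask and the x-z points of pixels whose depth lies
    within +/- window_mm of the target's estimated depth."""
    lo, hi = target_depth_mm - window_mm, target_depth_mm + window_mm
    mask = (depth_image > lo) & (depth_image < hi)          # depth-range gating

    # Project the selected pixels onto the x-z plane (column index vs. depth).
    ys, xs = np.nonzero(mask)
    xz_points = np.stack([xs, depth_image[ys, xs]], axis=1)
    return mask, xz_points

# Example with a synthetic 480x640 depth frame (values in mm, illustrative).
depth = np.full((480, 640), 4000, dtype=np.uint16)
depth[100:300, 200:260] = 2500                              # a person-like blob
mask, xz = extract_depth_slice(depth, target_depth_mm=2500)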

Chapter 2. Use of Low Cost Range Sensors for Indoor Object

Tracking Applications

2.1. Introduction

In the field of indoor robotics, commonly used distance sensors are ultrasonic sensors,

infrared sensors, lasers or stereo cameras. While each sensor has its pros and cons, one

has to select a combination of sensors for the particular task at hand based on factors of

cost, availability and usage.

Infrared sensors are quite indispensable and are widely used as proximity sensors for obstacle avoidance. They are easy to use and consume a small amount of power. They are small and compact enough to be fitted on any platform and, most importantly, they are extremely low-cost devices (the infrared sensors used in this work cost only 16 USD). These sensors are almost always used in combination with other sensors such as ultrasonic sensors, cameras, etc. to obtain an understanding of the environment. However, their use has been limited to obstacle distance estimation. Constructing a map of an environment using infrared sensors

alone is not considered quite reliable. This is because the measurements obtained from

these sensors can be imprecise owing to the non-linearity of the device and its dependence on the reflectivity of the surrounding objects. Moreover, unlike most sensors that have a wider beam width, the beam width of the infrared sensors is very narrow (around 16 cm at the middle, making the beam angle roughly 3.5° for the sensor), and this could result in the infrared light passing right beside the object without being reflected

by it.

However, the focused beam width of the infrared sensor has the advantage of hitting

smaller objects and suffering from less interference from other infrared sensors, in

comparison to an ultrasonic sensor, which has a wider sound pulse and is susceptible to

noise and interference from other sensors in its vicinity. While a laser is quite expensive

when compared to an infrared sensor array, a Kinect sensor would match the price range.

The advantage of an infrared sensor array over a Kinect would be in processing time, as

the Kinect provides dense depth data (hundreds of thousands of points), whereas an array

of infrared sensors would provide discrete depth information which would be much faster

to process. This chapter presents research work performed to extend the use of infrared

sensors from being basic range sensors. Section 2.2 outlines the theory behind infrared

sensors. Section 2.3 shows how data from an array of infrared sensors can be studied to

extract the direction of motion of a human being walking in its vicinity. Section 2.4

introduces a motion model with feedback to track the motion of a human being using sparse

data from infrared sensor array. In Section 2.5, data from infrared sensor array and a camera

have been fused to perform indoor tracking.

The development of new low-cost IR sensors capable of accurately measuring distances

with reduced response times is worth researching, as stated in [41], where Benet et al. describe some ranging techniques used in infrared sensors and also propose a technique based on light intensity back-scattering from objects. Other infrared sensors

are based on the measurement of phase shift such as [42].

In one application of autonomous navigation [43], infrared sensor emitters and receivers

were arranged in a ring at the bottom of the robot. The front sensors performed collision

avoidance while side sensors were used to follow a wall. In [44], Park et al describe an

infrared sensor array network designed to provide a 360◦ coverage of the environment

using 12 infrared sensors. In [45], rings of several infrared sensors are arranged around

robot links to develop a sensing skin for a robotic arm. Some research has been performed

to develop a 2-D array of infrared sensors [46-47] so as to extract 3-D information of the

environment. In [47], a 2-D array has been used for obstacle detection, safe navigation and

estimating object pose. In [48], such a system has been used for an interesting application

on touchless human computer interaction. In [49], multiple infrared sensors have been used

in localization and obstacle avoidance. A more beneficial way to use these infrared sensors

would be to combine them with other sensors to utilize the strengths of each sensor while

minimizing the disadvantages. An array of ultrasonic and infrared sensors has been used

in [50-51] for researching obstacle avoidance problems. Infrared sensors have also been

fused with vision sensor for obstacle avoidance [52].

2.2. Infrared Sensors

The sensors used in this work have been purchased from Sharp (model number

GP2Y0A710) and have minimum and maximum detection ranges of 3 feet and 18 feet, respectively. The Sharp sensors work by the method of triangulation. Each sensor has two parts, an emitter and a receiver. A pulse of light is emitted which, upon hitting an object, is reflected

back at an angle depending on the distance of the reflecting object. By knowing this angle,

the distance is calculated. The IR receiver part has a precision lens that transmits the

reflected light onto an enclosed linear CCD array, based on this triangulation angle. The

CCD array determines the angle and converts it to a corresponding analog voltage value to

be fed to the microcontroller.

The output of these detectors is non-linear with respect to the distance being measured.

This is because of the trigonometry involved in computing the distance to an object based

on the triangulation angle. Eq. (1) is used in this work to convert the analog readings to

distance values.

distance = (24.41 × voltage + 81.24) / (voltage − 1.155)        (1)
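
A small helper implementing Eq. (1) might look as follows (a sketch only; the guard against the pole of the formula and the example reading are assumptions, and any further scaling of the output to feet depends on the calibration used):

def ir_voltage_to_distance(voltage):
    """Convert the Sharp sensor's analog output voltage to a distance
    estimate using Eq. (1).  (Sketch only; the output unit and any
    further scaling to feet depend on the calibration used.)"""
    if voltage <= 1.155:
        raise ValueError("voltage outside the sensor's usable range")
    return (24.41 * voltage + 81.24) / (voltage - 1.155)

# Example: a mid-range reading (value illustrative).
print(ir_voltage_to_distance(2.0))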

A graph showing the non-linearity is plotted in Fig. 1. A timing diagram from the datasheet

of the sensor is presented in Fig. 2. The latter illustrates that there is only a 16.5 ms delay before the sensor starts to produce output readings.

Figure 1. A graph showing the non-linearity of the distance measurement as returned by


the infrared sensors.

Figure 2. Timing diagram of the SHARP GP2Y0A710K0F infrared sensor as provided by


the manufacturer’s manual.
2.3. An Iterative Clustering Algorithm for Classification of Object Motion
Direction Using Infrared Sensor Array

2.3.1. Introduction

In this work, these disadvantages have been accounted for and a system has been proposed

that can do much more than just acquire a distance estimate: this system is able to identify the direction of motion of an object such as a person in front of the sensor. An infrared sensor array consisting of three sensors has been mounted on a robotic platform with the sensors spaced 0.5 feet apart. The platform is at a height of 2.4 feet, which is sufficient to detect the torso of a human adult. Readings are taken at 100 ms intervals. These

readings are analog voltage values, which must be converted to corresponding distance

measures before they can be used. The values can then be used by the robotic platform to

guide its motion, avoiding the object or the person, for example.

The algorithm uses data collected over 1 second (i.e., 10 readings from each of the 3 sensors) and, after initial filtering to remove out-of-range or background values, performs a distance based iterative histogram clustering on the data to obtain clusters in the distance domain. These clusters are then analyzed and, where possible, merged with other clusters in the time domain to form an observation corresponding to some object (either in motion or stationary). This is followed by feature extraction. Two features are used in this work: the y-t slope and the x-t slope of the straight lines fitted to the data points of the cluster in the y-t and x-t planes, respectively. The slopes of these two fits are used as features to classify the direction of motion of the person (or to decide that the person is stationary). The k-nearest neighbors algorithm is

used to perform this classification.

2.3.2. Related Work

In this work, infrared sensors have been used to identify the direction of motion of a single person walking in front of the infrared sensor array. The array is installed on a robotic platform. The motion information obtained by the sensors can then be used to avoid the person for safe navigation. The array is one-dimensional. Similar work has been carried out using other sensors such as pyroelectric infrared (PIR) sensors [53]. In [54], PIR sensors are used for human movement detection as well as human identification. In [55], distributed PIR sensors have been used to estimate the people count in an office. However, PIR sensors measure the light radiating from objects and have a wide field of view (up to 180°, or even 360° in some models), which is quite different from an infrared

sensor which has a very narrow field of view. This limitation of the infrared sensor makes

the problem more challenging; however, it is worth investigating because of the sensor's many

advantages such as low cost, low power, fast response rate and compactness. In addition,

a narrow beam gives rise to high resolution in detection of the objects. Multiple infrared

sensors will thus provide a wide view as well as high resolution in detection.

(a)

(b)

Figure 3. Infrared sensor array setup which is installed on the top of a robotic platform.
The platform co-ordinate system is shown, and data is measured w.r.t this co-ordinate
system (a) Each sensor is placed with a separation of 0.5ft on the platform (i.e. at 0.5ft, 1ft
and 1.5ft distances along the x-axis); (b) The platform is mounted at a height of 2.42ft
above the ground.

2.3.3. Methodology

Fig. 3 shows the arrangement of the sensors on the robotic platform. Fig. 4 is a plot of the

raw data obtained from the infrared sensors in the y-t domain with the colors representing

the data coming from each sensor. Each data point can thus be represented as a 3-D point (x, y, t), with ‘x’ being the lateral distance, ‘y’ the longitudinal distance (both expressed in

terms of sensor coordinates) and ‘t’ being the time. The background obstruction (a wall) is

present at around 12.5ft. Fig. 4(a) shows the data points when a person is moving away

from the sensor array and then walking towards it in a straight-line perpendicular to the

sensors. Fig. 4(b) shows the same person moving from left to right and then from right to

left across the sensor array in a straight line parallel to the sensors. Fig. 4(c) shows motion

across the sensor array in a diagonal line from right to left and then left to right, with the

line making an angle of 45◦ with the sensor array. From these plots, it is quite clear that

there is a distinct pattern that can be extracted for each type of motion. In Fig. 4(a), the

person was walking in front of the left and middle sensors, and that is why both these

sensors agree on the distance of the person. In Fig. 4(b), one can see the patterns near the

5ft mark clearly indicate the time order in which each sensor could spot the person, giving

us an idea of whether the person moved from left to right or vice versa. The pattern in Fig. 4(c), in

addition to capturing the direction like Fig. 4(b), also captures the diagonal aspect of

motion. This work aims to extract these patterns and classify the direction of motion based

on these patterns.

At first, background or out of range data is removed. Any value above 12ft has been

ignored. However, this value can also be learned by allowing the sensor to make a few

initial observations and estimate a background such as a wall. The data pre-processing step

is followed by calculating the histogram in the range (or longitudinal distance) domain.

Analyzing the histogram, regional peak values are found that correspond to probable object detections (each could be a stationary or moving object). These regional peak values give an initial estimate of where the clusters could lie. Based on these peaks,

clusters are obtained iteratively in the range domain with a constraint that if a range value

falls a certain threshold away from the peak value, a new cluster is formed. Tight clusters

are required whose standard deviation is within the threshold, which was chosen to be 2ft

for this work. This can be represented mathematically as follows: If there are ‘n’ regional

max values (initial cluster means) in the range domain that are obtained from the histogram

bin values, then data point d_present taken at time instant t_present is assigned to a cluster based

on

d_present ∈ cluster_i, 1 ≤ i ≤ n, where i = argmin_s |d_present − mean(cluster_s)|, s = 1, 2, …, n,
    if min_s |d_present − mean(cluster_s)| < cluster_separation_distance        % d_present added to the closest existing cluster
else, d_present ∈ cluster_i, where i = n + 1        % a new cluster is formed        (2)
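
A compact sketch of this histogram-guided clustering is given below; the 12 ft background cutoff and 2 ft separation threshold follow the text, while the 1 ft bin width and the simple local-maximum peak search are assumptions about details the text leaves open.

import numpy as np

def cluster_ranges(y_values, max_range=12.0, bin_width=1.0, sep_dist=2.0):
    """Histogram-based iterative clustering in the range (y) domain, Eq. (2)."""
    y = np.asarray(y_values, dtype=float)
    y = y[y <= max_range]                                   # drop background / out-of-range data

    # Histogram of the longitudinal distances and its regional (local) peaks.
    counts, edges = np.histogram(y, bins=np.arange(0.0, max_range + bin_width, bin_width))
    centers = (edges[:-1] + edges[1:]) / 2.0
    peaks = [centers[i] for i in range(len(counts))
             if counts[i] > 0
             and (i == 0 or counts[i] >= counts[i - 1])
             and (i == len(counts) - 1 or counts[i] >= counts[i + 1])]

    # Iteratively assign each point to the nearest cluster, or start a new one.
    clusters = [[p] for p in peaks]                         # seed clusters at the histogram peaks
    for d in y:
        means = [np.mean(c) for c in clusters]
        i = int(np.argmin([abs(d - m) for m in means])) if means else -1
        if i >= 0 and abs(d - means[i]) < sep_dist:
            clusters[i].append(d)                           # d added to the closest cluster
        else:
            clusters.append([d])                            # a new cluster is formed
    return clusters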

(a)

(b)

continued

Figure 4. Raw data collected from three infrared sensors. (a) Person walking away from
the platform and then towards it; (b) Person moving from left to right and then from right
to left across the platform in a straight line; (c) Person moving from right to left and then
from left to right diagonally.

Figure 4 continued

(c)

(a) (b)

(c) (d)

Figure 5. Plots showing the intermediate processing steps. (a) Data collected between the
1st and 2nd second; (b) Histogram of the range (Y) or longitudinal distance values; (c)
Clustering done in the range domain; (d) Range domain clusters merged in time domain
to form super clusters representing an object in motion or a stationary object.

Clusters obtained in the range domain are further merged together in the time domain,

where two neighboring clusters are joined if the time gap between the last recorded time instant of one cluster and the first recorded time instant of the other is less than 500 ms and the difference between the corresponding range values reflects the amount that an average human could have walked in that time interval. Thus, if there are ‘m’ clusters

obtained by clustering in the range domain, then,

cluster_k = cluster_i ∪ cluster_j, for all 1 ≤ i < j ≤ m
    such that time gap < 500 ms and range gap < time gap × 0.8        (3)
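
A sketch of this time-domain merging rule is given below; each cluster is assumed to be a list of (t, y) samples with time in seconds, so the 500 ms gap becomes 0.5, and the walk_factor constant mirrors the 0.8 of Eq. (3) without fixing its units.

import numpy as np

def merge_in_time(clusters, max_time_gap=0.5, walk_factor=0.8):
    """Merge neighbouring range-domain clusters in the time domain, Eq. (3).
    Each cluster is a list of (t, y) samples; time is assumed to be in seconds."""
    merged = [sorted(c) for c in clusters if c]             # sort each cluster by time
    merged.sort(key=lambda c: c[0][0])                      # order clusters by start time

    out = [merged[0]] if merged else []
    for cur in merged[1:]:
        prev = out[-1]
        time_gap = cur[0][0] - prev[-1][0]                  # gap between end of one and start of next
        range_gap = abs(cur[0][1] - prev[-1][1])
        if 0 <= time_gap < max_time_gap and range_gap < walk_factor * time_gap:
            out[-1] = prev + cur                            # join into one super cluster
        else:
            out.append(cur)
    return out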

At this point, clusters are obtained that could possibly represent one full motion direction

or maybe a stationary object. Fig. 5(a) shows the raw data for a one second interval. Fig.

5(b) shows the histogram computation over the longitudinal distance or ‘y’ values. Fig.

5(c) shows the clustering in the range domain. These range domain clusters are further

merged in time to form time domain clusters representing an object (in motion or

stationary) shown in Fig. 5(d).

These clusters are then used in the classification process as described. The data points in

each cluster are viewed in the x-t plane and the y-t plane. Straight lines are fitted to the

points in each of the two planes. The slopes of these lines are used as features to classify

the direction of the motion. The motion classes w.r.t. the robot coordinate system are: 1. In

front and away; 2. In front and towards; 3. Left to right straight line; 4. Right to left straight

line; 5. Left to right diagonal line; 6. Right to left diagonal line; 7. Stationary. Fig. 6 shows

a plot of the 2-D training data. From this plot, it can be observed that using these two

features, it will be possible to learn a well separated space for each of the motion directions

as well as a stationary object using some classification algorithm. The k-nearest neighbors (knn) algorithm has been used with k values of 1, 3 and 5. The knn algorithm is one of the simplest supervised classification algorithms. Given a training set, this algorithm stores the labeled examples and classifies a new pattern as belonging to the class most common among its k closest neighbors. This algorithm suffers if the dimensions of the training set

features are large. It also requires large storage. However, in this work, the dimension of

the feature vector is just 2, so knn is a reasonable choice.
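
The feature extraction and classification steps can be sketched as follows, assuming each super cluster is an array of (t, x, y) samples; the prototype slope pairs, the jitter and the use of scikit-learn's KNeighborsClassifier are illustrative stand-ins rather than the original training data or implementation.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def slope_features(cluster_txy):
    """Fit straight lines in the x-t and y-t planes of one super cluster
    (rows of (t, x, y)) and return the pair of slopes used as features."""
    t, x, y = cluster_txy[:, 0], cluster_txy[:, 1], cluster_txy[:, 2]
    return np.array([np.polyfit(t, x, 1)[0],      # x-t slope
                     np.polyfit(t, y, 1)[0]])     # y-t slope

# Illustrative training set: one slope-pair prototype per class (values guessed to
# mimic Fig. 6) with small jitter; the real features come from the 280 instances.
rng = np.random.default_rng(0)
prototypes = {1: (0.0, 0.9), 2: (0.0, -0.9), 3: (0.15, 0.0), 4: (-0.15, 0.0),
              5: (0.15, 0.4), 6: (-0.15, 0.4), 7: (0.0, 0.0)}
X_train = np.vstack([np.array(p) + 0.02 * rng.standard_normal((6, 2))
                     for p in prototypes.values()])
y_train = np.repeat(list(prototypes.keys()), 6)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# A synthetic left-to-right, straight-line cluster: x grows with t, y stays constant.
t = np.linspace(3.0, 4.0, 10)
cluster = np.column_stack([t, 0.5 + 0.16 * (t - 3.0), np.full_like(t, 5.0)])
print(knn.predict(slope_features(cluster).reshape(1, -1)))   # expected: class 3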

Figure 6. A plot of the two-dimensional feature space. Certain classes may overlap at the
boundaries. The classes are: 1. In front and away; 2. In front and towards; 3. Left to right
straight line; 4. Right to left straight line; 5. Left to right diagonal line; 6. Right to left
diagonal line; 7. Stationary.

2.3.4. Results

To evaluate this method, data has been collected for each class of motion (40 instances

each × 7 classes = 280 instances in total). Fig. 7(a-c) shows snapshots of a subject moving

from left to right in a straight line in front of the infrared sensor array. This is to illustrate

the data capturing procedure. Each of the infrared sensors is sensing the distance values

over 10 seconds (which should be sufficient for a person to make a move in any direction).

Table 1 shows a portion of the distance values captured by the sensors that correspond to

the actual motion of the person. Fig. 8(a) shows the raw data from time 3s – 4s plotted in

the y-t domain. This is the time when the person makes the actual move across the sensor

array and the distance readings in the rest of the 10 seconds are just wall/junk/out-of-range

values and hence not shown in the demonstration. Fig. 8(b) shows the histogram

computation to give us an initial estimate of the clusters which is further refined in Fig.

8(c-d). In this instance, the data is clean enough and the initial peaks obtained from the

histogram analysis are sufficient. Fig. 9(a-b) shows the final cluster representing the motion

in the x-t as well as y-t planes with straight lines fitted to the points to obtain the slopes.

Using these slope values, it can be verified in Fig. 6 that the feature vector does fall in the region for class 3 (left to right motion in a straight line). These two features are used as inputs to the knn classifier, which also labels this feature vector as belonging to class 3.

Time instant (s)    Left sensor (ft)    Middle sensor (ft)    Right sensor (ft)
2.5 12.52 11.19 11.19
2.6 12.91 11.34 12.91
2.7 12.91 11.98 12.52
2.8 11.65 15.01 15.01
2.9 12.71 12.71 12.71
3.0 4.69 9.79 11.82
3.1 4.46 12.71 14.74
3.2 4.18 12.34 14.48
3.3 4.13 4.13 12.91
3.4 7.62 4.35 13.76
3.5 12.52 12.52 12.52
3.6 12.52 4.34 12.52
3.7 11.98 11.98 11.98
3.8 13.53 11.82 11.82
3.9 11.98 11.98 11.98
4.0 12.71 12.52 12.52
4.1 12.16 11.98 14.74
4.2 12.34 12.34 12.34
4.3 12.16 10.91 12.16

Table 1. Data from each sensor representing motion in the left to right direction

(a) (b) (c)

Figure 7. (a-c) Person moving from left to right across the infrared sensor array which is
mounted on a robotic platform.

For building the training and testing feature vectors, the 280 data instances collected were

analyzed using the method described above to obtain 280 2-D feature vectors (x-t slope

and y-t slope). These were divided into training and testing data equally. 5-fold cross

validation has been performed on the training set, where the data was divided into 5 groups,

and four of them were used to train the knn classifier and one group was used for validation.

This was repeated 5 times, each time taking a different combination. The results of classification on the training data are shown in Table 2.

Figure 8. Plot showing data analysis for time period 3s – 4s. (a) Raw data capturing person’s motion; (b) histogram showing peak values; (c-d) clustering, with the one in red being the real cluster.

From Table 2, it can be observed

that the knn classifier achieves a high accuracy in performing the classifications

classification accuracy = correct classifications / total number of instances        (4)

(especially at k=5) and thus can reasonably be used on infrared sensor data to classify a person's direction of motion. Table 3 presents the confusion matrix for the knn classifier,

when k=5. From the matrix, it can be verified that classes 1 and 2, classes 3 and 4, classes


Figure 9. Plots showing straight lines fitting data points in (a) the x-t plane and (b) y-t
plane. The slope pair (0.1581, 0.0003371) is used as the feature vector for classification. It
can also be verified that this slope pair falls in the domain of class 3 as represented by the
feature space in Fig. 6.

From the matrix, it can be verified that classes 1 and 2, classes 3 and 4, and classes 5 and 6 are never confused. Some confusion might remain between classes 3 and 5 as well as classes 4 and 6, as each of these pairs essentially represents motion in a similar direction, the difference being that one is diagonal and the other is straight, thereby causing overlap. Similarly, if diagonal motion becomes nearly straight and perpendicular, the knn classifier might confuse diagonal and perpendicular motions, as in classes 1 and 6 and classes 2 and 5. Stationary objects are classified accurately as well.

Classifier Accuracy (%)

KNN, k=1 91.2

KNN, k=3 93

KNN, k=5 93.6

Table 2. Classification Accuracy

Class1 Class2 Class3 Class4 Class5 Class6 Class7

Class1 20 0 0 0 0 0 0

Class2 0 20 0 0 0 0 0

Class3 0 0 19 0 1 0 0

Class4 0 0 0 18 0 1 1

Class5 0 0 2 0 18 0 0

Class6 1 0 0 3 0 16 0

Class7 0 0 0 0 0 0 20

Table 3. Confusion Matrix for KNN, k=5 (Predicted classes shown in columns, actual
classes shown in rows)
2.3.5. Conclusion

In this work, infrared sensors have been applied to tasks more complex than obstacle

detection, such as finding out the direction of motion of a moving object in front of them. This can assist the robot in achieving collision-free navigation. From the results, it can be seen that the k-nn algorithm is able to identify the direction of motion, or the absence of motion, correctly, achieving a high accuracy of up to 93%. It takes around 0.06 seconds to process the reading from a sensor, thus making it suitable for real-time operation.

This method works for humans of varying sizes, as long as they are detected by the sensors. Since the sensors are placed quite close together, it can be guaranteed that a person will not go undetected. Also, because range data is collected from each infrared sensor at a high rate (10 readings per second), even a person walking very fast can be detected. Slow motions do not affect the model either.

Future challenging research in this field would be to extend this work to be able to detect

more than one person in motion in front of the infrared sensor array. However, one must

keep the limitation of the infrared sensor in mind and understand that the persons need to

be considerably separated to obtain accurate information. Another important direction

would be to perform the same activities when the infrared sensor array mounted over the

robotic platform is itself in motion. This would be challenging because the data collected while the sensor is moving would be considerably noisier.

One could combine other sensors such as ultrasonic sensors, cameras, depth cameras, etc.
and involve sensor fusion to integrate the multi-dimensional information. Ultimately, the

goal is to use the infrared sensors to achieve something more than merely obtaining a

distance estimate.

2.4. Using an A-Priori Learnt Motion Model with Particle Filters for Tracking a
Moving Person by a Linear Infrared Array Network

2.4.1. Introduction

The aim of this work is to extend the work presented in Section 2.3 to detect and track a person using particle filters with a modified motion model. This work introduces

a robust method to extract the peak values corresponding to human motion from noisy

sensor data. An algorithm based on a sliding window approach is used to slide over the

infrared sensor returns and extract values corresponding to peaks caused by a person

appearing in front of the infrared sensor using an SVM based detector. These peak values

extracted are fed to a particle filter to track the moving person. The second contribution is

the use of the feedback from a proportional controller, which is based on the difference

between the particle filter predicted values and the observed values, to update the

parameters of the motion model. A series of 10 infrared sensors was arranged in a linear array on a stationary platform, with a separation of more than 15 cm between adjacent sensors. A person walked across the sensors following a straight line with slight deviations, and data was recorded simultaneously by each of the sensors in order to predict the person's track. The person's speed averaged between 1 and 3 ft/s. Simulated data has

also been used for testing purposes.

2.4.2. Related Work

In the literature, the use of these low-priced infrared sensors in combination with other sensors for tracking is limited. Among the interesting works is [56], where infrared

sensors have been used with electro-optical sensors for detection and tracking of moving

objects. In addition, pyroelectric infrared sensors have been used for person tracking [57].

Prior research has been done to experiment with learning models, such as in [58], where the motion parameters are learnt by switching among a multi-model system using particle

filters. In [59], a probabilistic motion model is learnt based on the motion of the target as

observed by a camera in the learning phase, instead of using an initial empirical distribution

for the motion model. This work focuses on tracking an object moving across an infrared

sensor array in a simple linear manner using particle filters with a motion model that

updates its parameters based on the feedback of a proportional error controller. The

controller uses the difference in the position estimation versus the position observation to

steer the particle filtering algorithm in the right direction.

2.4.3. Methodology

Infrared sensors are sensitive to noise and therefore a robust method to extract data points

from the sensor needs to be developed if this information has to be leveraged elsewhere.

The data from the infrared sensor has values fluctuating around a mean point. Also, during

a change of depth value, when an obstacle appears in the view of a sensor, a point in

between the ranges could also be reported. In this work, an algorithm is presented, that uses

a sliding window to go over the data points reported by the sensor and predict one value

for that sliding window or ignore that window if the points appear to be too noisy. The

logic behind this algorithm is that within the time duration of one sliding window, which

is typically around 0.1 to 0.5 seconds, the values reported by the infrared sensor should not

be very different. The algorithm starts by filtering out the out of range values and then uses

the standard deviation of the points to decide whether to retain that set of points or disregard them. A threshold is used which is estimated from trial runs where the sensor reports almost consistent values for the same depth of the object. If the data points in the sliding window have a standard deviation less than the threshold, it indicates that there is a higher

possibility that the data points relate to the same obstacle, and thus a histogram is computed

over the points and the value of the bin that has the highest concentration of observations

is chosen to represent the depth value for that sliding window. A simple averaging of the

points within a sliding window that passes the threshold test was also tried; however, it did not perform as well as the histogram approach. This is probably because some points in the sliding window can be slightly off from the true value, while the true-value points occur at a higher concentration. Experiments showed that it is better to use the most concentrated value than to take an average.

This is followed by finding the gradients of the depth values over time and flagging any

point as a peak point if the gradient value at that point is greater than a chosen peak

threshold value. This work focuses on finding sudden changes in the depth values

reported, which ideally would correspond to an object moving across the sensor. However,

false positives also appear, especially when sudden noisy detections dominate the majority

of the observations in the sliding window.

Algorithm

The algorithm uses a sliding window of size 'n' with an overlap of 'n-1' to extract points which satisfy a standard deviation threshold and is given as follows (a brief code sketch follows the listing):

At any time instant t = n, n+1, ..., for each sensor in the array:

(i) use a sliding window of length 'n' to obtain 'n' observations recorded by the sensor

(ii) filter out observations which are above the maximum range, 'max', or below the minimum range, 'min', of the infrared sensor

(iii) if the standard deviation of the remaining observations is within a threshold 's', analyze the distribution of the data using a histogram, choose the bin that has the highest concentration of observations and add those points to the set of valid observations for this sensor; otherwise, disregard this set of observations

(iv) compute the gradient of the distances w.r.t. time for the function F(t), i.e. ∇F = δF/δt

(v) flag the observation as a 'peak point' if the distance gradient at that point is greater than the chosen peak threshold value
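A minimal Python sketch of this sliding-window extraction for a single sensor's distance stream is shown below; the window length, range limits, bin count and thresholds are illustrative placeholders rather than the exact settings used in the experiments.

import numpy as np

def extract_peak_points(readings, n=5, min_range=1.5, max_range=18.0,
                        std_thresh=0.3, peak_thresh=2.0, bins=5):
    """Slide a window of length n over one sensor's readings (in feet) and
    return the time indices flagged as peak points."""
    depths, idx = [], []                       # one representative depth per valid window
    for t in range(n, len(readings) + 1):
        window = np.asarray(readings[t - n:t], dtype=float)
        window = window[(window >= min_range) & (window <= max_range)]
        if window.size == 0 or window.std() > std_thresh:
            continue                           # too noisy: disregard this window
        counts, edges = np.histogram(window, bins=bins)
        b = counts.argmax()                    # bin with the highest concentration
        in_bin = window[(window >= edges[b]) & (window <= edges[b + 1])]
        depths.append(in_bin.mean())
        idx.append(t - 1)
    if len(depths) < 2:
        return []
    grads = np.abs(np.gradient(np.asarray(depths)))   # gradient of depth w.r.t. time
    return [idx[i] for i in np.flatnonzero(grads > peak_thresh)]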

If a peak detected by sensor i at time t corresponds to motion, then in the absence of other obstacles, there is a high probability that peaks at similar range values should be detected by the neighboring sensors i+1 or i-1 (or even i, depending on the direction of motion) at time instants prior to t. On the contrary, the probability that a sensor's invalid peak observation will be corroborated by its neighboring sensors is low. This is very important for modelling a method to discard the false positives while retaining the peaks caused by motion.

A method was developed to classify between the two types of peaks (peaks due to true

motion of an object across the sensor versus peaks due to consistent noisy data) reported

by the algorithm above. For any peak detected, the peaks recorded by the neighboring

sensors are collected and 4 geometric features are computed, which are the total number of

peaks reported by the neighboring sensors, the standard deviation of the peaks, the average

distance between the peaks and the sum of squared error of points that fit the line joining

the peaks. These features are then fed as input to a Support Vector Machine (SVM)

classifier to classify whether the peak is caused by true motion or not.
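The feature computation for a candidate peak can be sketched as follows; neighbor_peaks is assumed to hold the (time, distance) peaks reported by the neighboring sensors for the peak under test, and the classifier is a standard scikit-learn SVC trained on labelled peaks (the training arrays X and y are assumed to exist).

import numpy as np
from sklearn.svm import SVC

def peak_features(neighbor_peaks):
    """Return the 4 geometric features described above for one candidate peak.
    neighbor_peaks: list of (time, distance) peaks from the neighboring sensors."""
    pts = np.asarray(neighbor_peaks, dtype=float)
    count = len(pts)
    if count < 2:
        return np.array([count, 0.0, 0.0, 0.0])
    std = pts[:, 1].std()                                           # spread of peak distances
    avg_gap = np.linalg.norm(np.diff(pts, axis=0), axis=1).mean()   # mean gap between peaks
    slope, intercept = np.polyfit(pts[:, 0], pts[:, 1], 1)
    sse = np.sum((pts[:, 1] - (slope * pts[:, 0] + intercept)) ** 2)  # line-fit error
    return np.array([count, std, avg_gap, sse])

# clf = SVC(kernel="rbf").fit(X, y)          # X: feature rows, y: 1 = true motion, 0 = noise
# is_motion = clf.predict([peak_features(neighbor_peaks)])[0]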

The first detection of object motion is used to initialize a particle filter with the location

values and provide the velocity components in the x and y directions respectively to build

the initial motion model. As the target moves, more peaks are detected which are fed as

sensor measurements to the particle filter algorithm.

However, a linear motion model which is often used with particle filter based tracking does

not work well for infrared sensor based data. This is because in the presence of noisy or

missed data, the predictions for the motion update made at each step can lead to small

errors which, if accumulated over time, can put the prediction completely off track and necessitate restarting the particle filter. Therefore, to improve the estimation, it is

suggested, that in the velocity update step, instead of adding random noise, an informed

guess is made about the velocity based on the difference in the position estimated by the

motion model and the sensor, so that at every step, the motion model can make predictions

closer to the actual location and minimize errors. During the velocity update, this difference

is multiplied by a coefficient and the term is either added to or subtracted from the velocity

depending on whether the motion model prediction lags the observation or is ahead of it.

Thus, the particle filter algorithm switches to a faster velocity model if it is behind the

observed value and if it is ahead of the observed value, then it slows down the prediction

value by choosing a slower model.

The particle filter algorithm is as follows:

• Initialize the particle filter with initial peak corresponding to motion as the starting

point and use the computed velocity components for the motion model.

• Generate N initial particles around the initial point following a normal distribution.

• For all later time instants,

(i) Update the particles according to the motion model.

(ii) Compute the weight of each particle

(iii) Resample the particles so that particles with higher weights are sampled

more frequently than those with lower weights.

(iv) Instead of making a random selection for the velocity components, update

the velocity components based on the feedback of an error based

proportional controller, in effect switching between faster and slower

paced models.

• The model update parameters are given as:

difference = √((obs_x − x)² + (obs_y − y)²)    (5)

Ẋ_t = Ẋ_{t-1} + coeff · difference,   if (obs_x − x) > Ẋ_0/2
    = Ẋ_{t-1} − coeff · difference,   if (obs_x − x) < −Ẋ_0/2    (6)

Ẏ_t = Ẏ_{t-1} + coeff · difference,   if (obs_y − y) > Ẏ_0/2
    = Ẏ_{t-1} − coeff · difference,   if (obs_y − y) < −Ẏ_0/2    (7)

where Ẋ_0 and Ẏ_0 are the initial velocity components in the x and y directions respectively, obs_x and obs_y are the observations recorded by the sensor, and x and y are the predictions of the motion update model.
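A direct sketch of Eqs. (5)-(7) is given below; the variable names are illustrative and the default coefficient value is only a placeholder (the results section experiments with a range of coefficient values).

import math

def update_velocity(vx, vy, x, y, obs_x, obs_y, vx0, vy0, coeff=0.6):
    """Proportional-feedback update of the motion model velocities, Eqs. (5)-(7).
    (x, y): motion-model prediction; (obs_x, obs_y): sensor observation;
    vx0, vy0: initial velocity components."""
    difference = math.hypot(obs_x - x, obs_y - y)      # Eq. (5)
    if (obs_x - x) > vx0 / 2:                          # prediction lags the observation
        vx += coeff * difference
    elif (obs_x - x) < -vx0 / 2:                       # prediction is ahead of the observation
        vx -= coeff * difference
    if (obs_y - y) > vy0 / 2:
        vy += coeff * difference
    elif (obs_y - y) < -vy0 / 2:
        vy -= coeff * difference
    return vx, vy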

2.4.4. Results

In order to evaluate the proposed algorithm to robustly extract peak points corresponding

to motion detection, data is recorded using a linear array of 10 infrared sensors mounted

on a flat platform. A person is allowed to move across the sensor array network following

a straight line with small deviations. Fig. 10 shows a sample run from two neighboring

sensors, when the person walks across the sensors in one direction and then returns in the opposite direction.

Figure 10. Color coded peaks because of an object in motion as detected by the SVM
classifier. Other peaks are also noticed; however, they are created by inconsistent data and
are discarded by the SVM as true negatives.

Table 4. Confusion matrix for SVM classifier

The peaks, which are basically the distance gradients, are circled. The black circled ones represent the peaks caused when the person walked in front

of the sensors from the first sensor to the second one respectively. The green circled ones

are when the person crossed the second and then first sensors respectively. The negative

peaks are basically gradients when the person was last detected by the sensors. (The

background is at a much farther distance from the scene of motion).

For each set of data recorded, the sliding window based algorithm extracts the peak points

and also computes for each peak point, the set of 4 features as discussed in Section 2.4.3.

to be fed as inputs to the SVM classifier to classify the points as peaks created due to true

motion across the sensor or peaks created by noisy data. The confusion matrix for the classification is shown in Table 4. The SVM classifier does a good job at distinguishing

between the two types of peaks reported, with a true positive rate of 92% and a low false

negative rate of 8%. A good rate for true detections ensures the smooth running of the

particle filters as points representing the target are obtained consistently. The false positive

rate is 15%.

The other contribution to this work is on the modification of the motion update model of

the particle filter algorithm. To test this, both simulated data as well as recorded data from

the infrared sensors has been used. For the simulation, 2-D points were sampled, enacting

motion along a straight line. White noise was added to the sampled points along with some

random peaks representing noise. Each recording had 100 observations mimicking an array

network of 100 infrared sensors, however, for the real data, only 10 infrared sensors have

been used. The approach has been tested on 30 sets of simulated data and 10 sets of real

life data. The particle filter’s motion model is updated based on information about the error

between the recorded observation and the predicted value scaled by a coefficient.


Figure 11. (a)-(b) A traditional particle filter with 500 particles is used to track the infrared
sensor simulated data with an average position estimation error of 0.9008 ft.


Figure 12. (a)-(b) A particle filter with 500 particles and a continuously updated motion
model with coefficients of 0.5(Fig. 12(a)) and 0.6(Fig. 12(b)) respectively is used to track
the infrared sensor simulated data with an average position estimation error of 0.35ft.

The filter uses 500 particles, and a range of values has been experimented with for the motion model coefficient. For each dataset, both the traditional particle filter as

well as this approach was tested. In Fig. 11(a) and 11(b), the traditional filter does manage

to follow the track of the target with some vertical offset, however it never catches up with

the target. The average position estimate error is around 0.9 feet for the simulated data,

whereas for the approach with feedback, the filter follows the target closely and has an

average estimate error of 0.4ft as in Fig. 12(a) and 12(b). Fig. 13 (a) shows the traditional

particle filter deviating from the actual trajectory on the data recorded by the infrared sensor

array, with an average position estimate error of 0.6ft whereas in Fig. 13(b), the filter using

feedback makes an average error of 0.3ft. It has been found that by leveraging the

information about the error between the estimated state of the object and the observation

as reported by the sensor, more accurate subsequent predictions can be made, reducing the
average position estimation error by almost 50%. In some cases, when the traditional

particle filter loses track of the object because the predictions tend to drift away slowly

from the true values, the feedback provided in this approach helps to guide the particle

filter towards the true observation values. Fig. 14 shows a plot of the average position error

estimation versus the coefficient value used for the motion model update. It can be

observed that a coefficient value around 0.6 is effective in tracking the motion of the

object. Lower or higher coefficients will introduce higher tracking errors.


Figure 13. True position versus estimation. (a) A traditional particle filter with a fixed
linear model tracks object on real infrared sensor data with an average position estimation
error of 0.64ft (b)A particle filter receiving feedback from the controller regarding position
estimation error tracks the object on real infrared sensor data with an average position
estimation error of 0.34ft.

Figure 14. A plot showing the average error in the position estimation of the object at
different coefficient values for the motion model update parameters. For these runs, a
coefficient of 0.6 produced good results.

2.4.5. Conclusion

In this work, infrared sensors have been used for tracking the motion of an object along a

linear sensor array network. A robust method to extract points corresponding to the motion

of a moving object has been proposed. The use of a learning method to distinguish between

peaks corresponding to true motion versus peaks corresponding to sudden noise spikes gives better performance than a heuristic based approach and can identify peaks accurately around 92% of the time. The second contribution is the modification of the motion model

update of the particle filters. The particle filter is supplied with valuable feedback from the

proportional error controller which updates the motion model parameters accordingly and

has given, on average, at least 50% more accurate location estimations over 30 test runs on simulated data as well as 10 test runs on infrared sensor data versus using a fixed motion

model. In effect, switching between faster and slower models to keep track of the person in the presence of noisy data does a better job than using one fixed model.

To develop this work further, the feature set for the peaks could be modified and the missed detection rate could be reduced. Another interesting direction would be to test the

approach on complex motion patterns, especially to and fro motions or free form motion.

Also, the feedback based motion model update can be tested on data from another sensor

such as tracking using a camera, whose data properties, when applied to this task, are

different from those of an infrared sensor in terms of data continuity, frequency and accuracy. Ultimately, the infrared sensor can also be integrated with the

camera sensor to improve tracking of a moving object using the modified motion model

with particle filters.

2.5. An Infrared Sensor Guided Approach to Camera Based Tracking of Erratic
Human Motion

2.5.1. Introduction

Tracking a human being in motion is an important research topic in robotics or computer

vision. Substantial research has been conducted to address this issue; however, as Yilmaz et al. state in [1], researchers mostly simplify tracking by constraining the motion of the

object, assuming that the object motion will be smooth and that the object will not make any random

abrupt change in its direction. Further constraints are made to assume either fixed velocity

or fixed acceleration and motion models are constructed to resemble these types of motion.

This work accounts for the fact that the object to be tracked can make unpredictable changes during its motion. To address this, it uses two trackers based on the two types of sensors used, an infrared sensor array and a camera, identifies the point of failure of the camera tracker, and recovers the tracking using input from the infrared sensor tracker. A

purely camera based object tracking system can encounter failures if the object size gets

smaller, or if the appearance model changes, or in case of a robot following a person who

is going out of focus of the camera, an incorrect location estimation may not make the robot

turn in the proper direction. Also, when the object makes sudden turns and the camera

tracker errs, predicting the position of the object can get complicated and time consuming

if one has to make an exhaustive search of the visual space to detect the object. On the

other hand, the infrared sensor array measures distance estimates of objects in front of it at

a sufficiently high sampling rate, enabling it to detect sudden changes in the position of the

objects quite accurately. However, these sensors do not provide any information about the

appearance of the object, and thus cannot validate whether a detection would belong to the

object to be tracked or anything else, but when these sensors are used in conjunction with

a camera, these detections can guide the camera to reduce the search space for observations

in an image and can also help to detect change in motion directions and thereby help restart

the camera tracker in case it gets lost. Another drawback of a purely camera based tracker

is that it does not provide 3-D information about the scene, so depth values are not obtained. Integrating an infrared sensor array with the camera therefore also provides 3-D information, which can be used in applications such as a person following robot to maintain a safe following distance and to increase or decrease the speed of the

robot as desired. Additionally, infrared sensors are extremely low in cost when compared to other, more accurate, distance measuring sensors such as radar or LIDAR, and they also have a narrower field of view than ultrasonic sensors, which makes localization of an object easier. Thus, if the aim of the application is to provide tracking in indoor areas while keeping the system affordable to the common public, the combination of a camera and infrared sensors sounds feasible.

2.5.2. Related Work

Using distance measuring IR sensors to aid object tracking is a largely unexplored area and

this work aims to address this issue. This work also aims to make use of IR sensors to aid

a camera based tracker in failure detection and recovery. In the research literature,

appearance or motion characteristics have been used for detecting failure by comparing to
reference features [60], or by comparing trajectories [61]. A time reversed Markov process

is used in [62] to identify failed trackers and perform recovery. This work combines inputs

from the IR tracker with appearance features of the object to detect failure, recover and

restart the tracking process.

2.5.3. Methodology

In this work, observations are dealt with in the frame of reference of the camera and therefore

the real-world distances obtained by the infrared sensors have to be mapped to the pixels

of the image captured by the camera. The sensor system is placed on a robotic platform. A

total of 5 infrared sensors have been used. Three are positioned facing the front direction and the remaining two are placed sideways at an angle of 70° to the front direction. The camera is

placed at a height of approximately 3.8ft above the ground and 50cm behind the infrared

sensors as shown in Fig. 15. The camera's roll, pitch and yaw have been set to zero. The infrared sensor measurement (x_ir, y_ir, z_ir) is transformed to the frame of reference of the camera, (x_cam, y_cam, z_cam). Using the mapping for a rectilinear lens, the radial position

(angle) of the point on the image is found, which is given by

𝑅 = 𝑓 ∗ tan(𝜃) (8)

where f is the focal length in mm (or pixels) and θ is the angle in radians (or degrees)

between a point in the real world (𝑥𝑐𝑎𝑚 ,𝑦𝑐𝑎𝑚 , 𝑧𝑐𝑎𝑚 ) and the optical axis of the camera.

The angle that the line joining the center of the image and the projected point on the image

makes with the image axis is given by

∅ = tan−1(𝑦𝑐𝑎𝑚 ⁄𝑧𝑐𝑎𝑚 ) (9)


Figure 15. Infrared sensor setup with camera (a)The coordinate system (b) Distance
measurements shown on the robotic platform

Following the spatial coordinate system, the corresponding projection of the infrared

sensor detected points on the image is given by

𝑦𝑝𝑟𝑜𝑗 = (−𝑅 ∗ cos(∅) + 𝑦𝑐𝑒𝑛𝑡𝑒𝑟 ) (10)

𝑥𝑝𝑟𝑜𝑗 = (−𝑅 ∗ sin(∅) + 𝑥𝑐𝑒𝑛𝑡𝑒𝑟 ) (11)

where 𝑥𝑐𝑒𝑛𝑡𝑒𝑟 and 𝑦𝑐𝑒𝑛𝑡𝑒𝑟 are the x and y spatial coordinates of the center of the image.
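A sketch of this projection is given below; because the exact axis convention is fixed by Fig. 15, the code assumes that x_cam points along the optical axis, and the focal length and image center are placeholder values rather than calibrated parameters.

import math

def project_ir_point(x_cam, y_cam, z_cam, f=600.0, x_center=320.0, y_center=240.0):
    """Project a point already transformed into the camera frame onto the image,
    following Eqs. (8)-(11).  f is the focal length in pixels (placeholder value)."""
    theta = math.atan2(math.hypot(y_cam, z_cam), x_cam)   # angle from the assumed optical axis
    R = f * math.tan(theta)                               # Eq. (8): radial position on the image
    phi = math.atan2(y_cam, z_cam)                        # Eq. (9)
    y_proj = -R * math.cos(phi) + y_center                # Eq. (10)
    x_proj = -R * math.sin(phi) + x_center                # Eq. (11)
    return x_proj, y_proj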

In this research, the infrared sensor has been made responsible for tracking the irregularities

of the human motion, that is when the object makes an unpredictable turn in a different

direction, or when it remains static for some time and then moves suddenly. Most of the

tracking algorithms developed generally put a constraint on the object motion and assume

a relatively simple non-linear track with no sudden turns, etc. Such motion is usually

represented by a constant velocity model or a constant acceleration model. However, these

models will not be able to represent motion where the object executes random turns. An

infrared sensor, on the other hand, continuously returns distance estimates of the objects in

front of it and thus can extract a sudden detection made by one of the sensors indicating

that an object might have suddenly moved into its field of view. The job of this secondary

infrared tracker is to track these detections using particle filters; however, none of the conventional motion models is used. Instead, in this work, inspired by Brownian motion, a type of random walk model [63] has been used, but with a fixed

range for speed, which is termed the omnidirectional motion model. In this model, the state

evolves following a randomly guessed speed inside a fixed range and a randomly guessed

direction. The speed v ranges from 0 to 45 cm/s and the direction θ lies between 0° and 360°.

Thus, this distribution ensures that there are particles representing motion in any direction

at any speed, including the condition that the object is at rest. The state vector is represented

as 𝑋_𝑖𝑟𝑘 = {𝑥_𝑖𝑟𝑘 , 𝑦_𝑖𝑟𝑘 } , where 𝑥_𝑖𝑟𝑘 is the distance between the sensor and the object

and 𝑦_𝑖𝑟𝑘 is the lateral distance. The motion model is given by

𝑥_𝑖𝑟𝑘 = 𝑥_𝑖𝑟𝑘−1 + 𝑣 ∗ cos(𝜃) (11)

𝑦_𝑖𝑟𝑘 = 𝑦_𝑖𝑟𝑘−1 + 𝑣 ∗ sin(𝜃) (12)

The likelihood model is given by the Gaussian density and is chosen as

p(z_ir_k | X_ir_k) ∝ exp(−d² / 2σ²)    (13)

where 𝑑 is the Euclidean distance between the observed point and the sample particle and

𝜎 specifies the Gaussian noise in the measurements.
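A compact sketch of the omnidirectional motion model and the likelihood of Eq. (13) is given below, with the N particles stored as an (N, 2) array of (x_ir, y_ir) values; the noise level sigma and the time step are placeholders.

import numpy as np

rng = np.random.default_rng(0)

def propagate_omnidirectional(particles, v_max=45.0, dt=1.0):
    """Eqs. (11)-(12): move each particle with a random speed in [0, v_max] cm/s
    and a random direction in [0, 2*pi)."""
    n = len(particles)
    v = rng.uniform(0.0, v_max, n) * dt
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    particles[:, 0] += v * np.cos(theta)
    particles[:, 1] += v * np.sin(theta)
    return particles

def likelihood_weights(particles, z_ir, sigma=10.0):
    """Eq. (13): Gaussian likelihood based on the Euclidean distance to the observation."""
    d2 = np.sum((particles - np.asarray(z_ir, dtype=float)) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()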

The tracking starts with the initial location specified by the user. The state of the system is

updated according to Eq. (11-12) and when new observations are received by the infrared

sensor array, a gating technique is used to filter out observations beyond a particular range

of the current location of the object. The final observation,𝑧_𝑖𝑟𝑘 is used to update the prior

distribution. Weights are assigned based on Eq. (13). The residual resampling method has

been used [64]. The mean of the posterior distribution is output as the estimated location

by the infrared tracker. This output is used in the primary tracker to modify the weight of

the particles and to also reinitialize the primary tracker in the event that it loses the object.

The camera based tracker is used as the primary tracker in this work and its function is to

perform tracking of the object when it is executing simple non-linear motion and to

reinitialize itself with inputs from the secondary infrared sensor tracker. For tracking

purposes, a bounding box representing the object is chosen by the user which is rectangular

and is fixed in size and is characterized by the state vector at time k as X_c_k = {x_c_k, y_c_k, ẋ_c_k, ẏ_c_k, ẍ_c_k, ÿ_c_k}, where x_c_k, y_c_k are the coordinates of the center of the bounding box, ẋ_c_k, ẏ_c_k are the respective velocities and ẍ_c_k, ÿ_c_k are the respective accelerations. A

constant acceleration model represents the state evolution and is given by

𝑋_𝑐𝑘 = 𝐹 ∗ 𝑋𝑐 𝑘−1 + 𝑣𝑘−1 (14)

where F is the state transition matrix given by

      [ 1   Δt   Δt²/2 ]
F =   [ 0   1    Δt    ]    (15)
      [ 0   0    1     ]

and 𝑣𝑘−1 is the process noise assumed to be white, zero mean and Gaussian.
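Applied per axis to the six-dimensional state above, Eqs. (14)-(15) amount to the following sketch; the process noise level is a placeholder and the function names are illustrative.

import numpy as np

def transition_matrix(dt):
    """3x3 block of Eq. (15) acting on [position, velocity, acceleration] for one axis."""
    return np.array([[1.0, dt, 0.5 * dt ** 2],
                     [0.0, 1.0, dt],
                     [0.0, 0.0, 1.0]])

def propagate_constant_acceleration(state, dt, noise_std=1.0):
    """Eq. (14): X_c_k = F * X_c_{k-1} + v_{k-1}, for state [x, y, vx, vy, ax, ay]."""
    F = transition_matrix(dt)
    x_new = F @ np.array([state[0], state[2], state[4]])   # x, vx, ax
    y_new = F @ np.array([state[1], state[3], state[5]])   # y, vy, ay
    propagated = np.array([x_new[0], y_new[0], x_new[1], y_new[1], x_new[2], y_new[2]])
    return propagated + np.random.normal(0.0, noise_std, 6)  # white, zero-mean Gaussian noise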

Normalized color histograms [65] and normalized Histogram of Oriented Gradients (HOG)

features [66] are employed to build feature vectors for the selected region and make it the

reference model. Based on the state evolution model, the particles are propagated to their

new predicted positions and upon receiving a new image, patches are extracted around the

predicted positions and a feature vector is computed for each extracted patch. The

Bhattacharyya distance between the feature vector of a sample patch and that of a reference

patch is computed and is used to assign weights to each particle.
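The Bhattacharyya distance used for this comparison, for two normalized histograms p and q, can be written as a small helper:

import numpy as np

def bhattacharyya_distance(p, q):
    """d(p, q) = sqrt(1 - rho(p, q)), where rho = sum(sqrt(p_i * q_i))
    is the Bhattacharyya coefficient of the two normalized histograms."""
    rho = np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))
    return float(np.sqrt(max(1.0 - rho, 0.0)))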

Additionally, the infrared sensor tracking system provides an estimation of the object’s

position as projected on the image. This is an important term when assigning weights to

the particles as patches which contain the projected point will have a higher weight than

patches which do not contain the secondary tracker’s estimated point. To construct the

likelihood model, it is assumed that the color and HOG features as well as the distance

estimates provided by the infrared sensor are independent of each other. Therefore, the

overall likelihood is the product of the separate likelihoods.

The integration of the infrared system plays a significant role in recovery after detection of

failure. To address this, two metrics have been introduced, ir_camera_overlap and patch_matches, whose values determine how the likelihood model is formed. These are defined as

ir_camera_overlap = fraction of the extracted patches that contain the infrared sensor projected point    (16)

patch_matches = fraction of the extracted patches that closely match the reference patch    (17)

For computing the metric patch_matches, a threshold, thresh_match_ref, is used to determine if a patch closely matches the reference patch. Thresholds thresh_ir_camera_overlap and thresh_patch_matches are used to test the above metrics as detailed below. If ir_camera_overlap is greater than thresh_ir_camera_overlap, indicating sufficient overlap, and patch_matches is greater than thresh_patch_matches, the likelihood function is defined as

p(z_c_k | X_c_k) ∝ exp(−d_Color(Ĉ_k, Ĉ_ref)² / 2σ₁²) · exp(−d_Hog(Ĥog_k, Ĥog_ref)² / 2σ₂²) · exp(−d_Euc(X_c_k, X_ir_k)² / 2σ₃²)    (18)

where d_Color(Ĉ_k, Ĉ_ref) = √(1 − ρ(Ĉ_k, Ĉ_ref)) is the Bhattacharyya distance between the color histograms, d_Hog(Ĥog_k, Ĥog_ref) = √(1 − ρ(Ĥog_k, Ĥog_ref)) is the Bhattacharyya distance between the HOG features, d_Euc(X_c_k, X_ir_k) = √((x_c_k − x_ir_k)² + (y_c_k − y_ir_k)²) is the Euclidean distance between the center of the current patch and the infrared sensor estimated object position at time k, Ĉ_k and Ĥog_k are the normalized color histograms and HOG features respectively for the current patch centered at (x_c_k, y_c_k), Ĉ_ref and Ĥog_ref are the normalized color histograms and normalized HOG features for the reference patch, ρ is the Bhattacharyya coefficient, and σ₁, σ₂, σ₃ specify the Gaussian noise in the measurements.

On the other hand, if a significant number of patches do not contain the infrared sensor projected point, it means that there is a disagreement between the two sensors. However, if there are sufficient patches which match closely with the reference patch, that is, if patch_matches is greater than thresh_patch_matches, the infrared sensor tracker's data is ignored and the algorithm relies on the primary camera tracker's data only. Thus, the
likelihood function would be the product of the likelihoods of the color and HOG features

and is given as

p(z_c_k | X_c_k) ∝ exp(−d_Color(Ĉ_k, Ĉ_ref)² / 2σ₁²) · exp(−d_Hog(Ĥog_k, Ĥog_ref)² / 2σ₂²)    (19)

where terms have the usual meaning as stated above. The object is classified as lost if the

extracted patches have low similarity to the reference patch, or if patch_matches is less than thresh_patch_matches. In such a case, the algorithm relies on the distance estimates provided by the infrared sensor (if these estimates belong to an object) and iteratively checks whether patches generated around the infrared-sensor-reported distance estimate match the reference patch. When a high match score, exceeding thresh_restart, has been found,

the primary tracker is reinitialized.
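A compact sketch of this weighting logic is shown below; treating the two metrics as fractions of the extracted patches follows the reconstruction of Eqs. (16)-(17) above (an assumption), and the threshold and noise values are placeholders rather than the settings used in the experiments.

import numpy as np

def camera_particle_weights(color_d, hog_d, euc_d, contains_ir_point, matches_ref,
                            thresh_overlap=0.3, thresh_matches=0.2,
                            s1=0.2, s2=0.2, s3=30.0):
    """Assign camera-tracker particle weights following Eqs. (18)-(19).
    color_d, hog_d, euc_d: per-patch distances (NumPy arrays);
    contains_ir_point, matches_ref: boolean arrays over the same patches."""
    ir_camera_overlap = contains_ir_point.mean()        # fraction of patches, per Eq. (16)
    patch_matches = matches_ref.mean()                  # fraction of patches, per Eq. (17)
    if patch_matches < thresh_matches:
        return None                # object considered lost: IR-guided recovery takes over
    w = np.exp(-color_d ** 2 / (2 * s1 ** 2)) * np.exp(-hog_d ** 2 / (2 * s2 ** 2))
    if ir_camera_overlap > thresh_overlap:
        w = w * np.exp(-euc_d ** 2 / (2 * s3 ** 2))     # Eq. (18): include the IR proximity term
    # otherwise Eq. (19): colour and HOG cues only
    return w / w.sum()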

Algorithm

1. Initialization:

Select a point representing the object as input to the secondary infrared tracker

and a patch enclosing the object as input for the primary camera tracker. At step

k = 0, for i = 1,2, ……. N, generate samples based on an initial Gaussian

distribution around user inputs.

2. Infrared sensor based secondary tracker:

(i) For k = 1, 2, … and for i = 1, 2, …, N, sample X_ir_{k+1}^(i) ~ p(X_ir_{k+1} | X_ir_k^(i)) from the omnidirectional motion model using Eq. (11-12).

(ii) When new distance measurements recorded by the infrared sensor array

come in, use a gating technique to filter out observations beyond a threshold

of past estimated detection, then take an average of remaining points. The

weights for each particle are computed as w_ir_{k+1}^(i) ∝ p(z_ir_{k+1} | X_ir_{k+1}^(i)) using Eq. (13).

(iii) Normalize the weights: ŵ_{k+1}^(i) = w_{k+1}^(i) / Σ_{i=1}^{N} w_{k+1}^(i)

(iv) Resample from the posterior distribution p(𝑋_𝑖𝑟𝑘+1 |𝑍_𝑖𝑟 𝑘+1 )

(v) Output the mean of the posterior distribution and project the point on to the

camera's image (to be used in 3(ii))

3. Camera based primary tracker:

(i) For k = 1,2,…… and for i = 1,2,….,N, sample

X_c_{k+1}^(i) ~ p(X_c_{k+1} | X_c_k^(i)) from the constant acceleration model in Eq. (14-15).

(ii) When a new image is received, extract patches around the sampled

points in 3(i), and using the point obtained in 2(v), compute the metrics,

ir_camera_overlap and patch_matches. The weights for each particle are computed as w_c_{k+1}^(i) ∝ p(z_c_{k+1} | X_c_{k+1}^(i))

• If ir_camera_overlap > thresh_ir_camera_overlap and patch_matches > thresh_patch_matches, use Eq. (18)

• If ir_camera_overlap < thresh_ir_camera_overlap and patch_matches > thresh_patch_matches, use Eq. (19)

• If patch_matches < thresh_patch_matches,

Set t = 0;

do

a) Set IR estimated position at time k as the object

location.

b) Generate patches around the point randomly.

c) k=k+1, t=t+1

while (patch_matches < thresh_restart & t < max_iterations)

if(t<max_iterations)

go to step 3(i) to restart the particle filter with the

patch around the infrared sensor estimated point as

the starting patch.

else

report object has been lost (quit).

(iii) Normalize the weights: ŵ_{k+1}^(i) = w_{k+1}^(i) / Σ_{i=1}^{N} w_{k+1}^(i)

(iv) Resample from the posterior distribution p(𝑋_𝑐𝑘+1 |𝑍_𝑐 𝑘+1 )

(v) Output the mean of the posterior distribution as the center of the patch

representing the object.

4. Go to step 2.

2.5.4. Results

The setup consists of an array of five Sharp infrared sensors (model number GP2Y0A710)

mounted on a robotic platform at a height of about 2.4 feet so that an average human’s

torso can be detected. Two of them face sideways and the remaining three are placed at intervals of 18 cm from each other. A camera (Logitech C920) is mounted at a height of 42 cm from the plane of the infrared sensors to get a nearly full frame of the object to be tracked. To evaluate the algorithm, 15 recorded video sequences have been used with an object

executing different types of motion, such as walking in the hallway or a cluttered lab and

making a sudden turn to the left or right or moving in a zig-zag, random and stop and go

motion patterns. The algorithm is evaluated thrice per second and ground truth about the

object’s position is obtained manually. The accuracy of the algorithm is expressed by Root

Mean Square Error (RMSE), which is given by

mean track error = ( Σ_{i=1}^{N} √((x_i − x_est_i)² + (y_i − y_est_i)²) ) / N_frames    (20)

where 𝑥𝑖 , 𝑦𝑖 is the known position of the center of the object at frame 𝑖 and 𝑥_𝑒𝑠𝑡𝑖 , 𝑦_𝑒𝑠𝑡𝑖

is the estimated position by the tracking algorithm, N is the number of detections and Nframes

is the total number of frames. Fig. 16 shows selected frames from videos capturing different

types of motion and Fig. 17 shows graphs recording the tracking error for each iteration

when only using the infrared tracker, only using the camera tracker and using both the

trackers respectively. In Fig. 16(a), the target moves in an unobstructed scene and the

infrared-camera tracker performs well. In Fig. 16(b), the target is occluded by two persons,

one who has a similar appearance to the target in frames 1359 to 1431, and the other person, who has a different appearance, occludes the target in frames 1701 to 1725. In the occlusion scenario, using the infrared-camera tracker helps to recover the person

immediately after the occlusion, as it finds the depth of the person and thereby helps to

track it. When using a camera based baseline tracker, it loses track of the target after the

first occlusion and then drifts, and thus it must be restarted.


Figure 16. Images from some video sequences illustrating the target tracking under various
occlusion/illumination scenarios. (a) Video sequence to demonstrate target walking in a
scene without any occlusion. Frames 1905, 1941, 2019 and 2055 have been shown; (b)
Video sequence to demonstrate target being occluded by an object with similar appearance
as well as by an object with a different appearance. Frames 1359, 1407, 1425, 1431, 1521,
1545, 1659, 1701, 1725, 1773, 1857 and 2236 have been shown; (c) Video sequence to
demonstrate target being tracked when multiple persons are present in the scene, however,
there is no occlusion. Frames 408, 433, 450, 492 and 505 have been shown; (d) Video
sequence to demonstrate target occluded in presence of other objects as well. Frames 1002,
1074, 1110 and 1182 have been shown; (e) Video sequence to demonstrate target occluded
in presence of other objects as well. Frames 300, 306, 318 and 360 have been shown; (f)
Video sequence to demonstrate target being tracked in the presence of other objects in low
illumination condition in the hallway. Frames 2221, 2293, 2329 and 2341 have been
shown.

In Fig. 16(c), multiple objects are present in the scene but do not occlude the target, and both the baseline camera and the infrared-camera trackers perform well. Fig. 16(d) and 16(e) provide some

more examples of the target being occluded in the presence of multiple persons and in both

cases, the combined infrared-camera tracker outperforms the baseline camera or baseline

infrared tracker. Fig 16(f) demonstrates tracking when the illumination in the hallway

changes. The baseline camera tracker does not fail; however, when combined with the infrared tracker, its performance improves, as the latter tracker's success does not depend

on the illumination.


Figure 17. Graphs showing tracking error evaluated three times per second for two different sequences (a) and (b).

Examining the graphs in Fig. 17(a)-(b), it is evident that the camera and infrared sensor

based tracker performs better overall than using either tracker alone. In Fig. 17(a), between 8 and 10 seconds, the infrared sensor reports noisy data; however, the camera tracker tracks the object quite accurately, probably because the feature representations matched the reference well, as is evident from the pure camera based tracker performing well during this time period. In Fig. 17(b), the trackers individually track with higher accuracy than in Fig. 17(a), and the combined tracker is also able to track the object accurately. However, during the 11 to 14 second interval, the pure camera based tracker's performance drops, but

because the infrared tracker maintains consistent accuracy during this time, the overall

tracking doesn’t fail.


2.5.5. Conclusion

This work has introduced a technique for tracking an object making unpredictable turns

using a primary camera tracker and a secondary infrared sensor tracker. Fusing inputs from

both trackers helps determine the object's location more accurately by giving more weight

to particles having closer proximity to infrared and camera detections. Tracking failure by

either one or both sensors is handled by using suitable recovery methods. Infrared sensor

detections have been used to restart the particle filter if it is lost, where possible. The results

show that tracking using both the sensors give better performance accuracy and help keep

tracking errors lower than using either the camera or infrared based tracker alone.

A problem faced is the extraction of infrared sensor data to associate with the object. Infrared sensor data can get noisy, especially when the tracking platform is in motion. Therefore, a more efficient gating technique or data association algorithm has to be

developed. Other future areas of research involve exploring different fusion techniques,

improved motion models and dealing with noisier environments (possibly occlusion).

Chapter 3. Occlusion Handling in Tracking

3.1. Introduction

Human object detection and tracking is a challenging research topic in the field of computer

vision or robotics and finds wide applications in the areas of video surveillance, robot

follower, autonomous navigation, etc. RGB cameras have been used extensively for

tracking purposes with a combination of other sensors such as lidars, infrared cameras,

ultrasonic sensors, etc. Stereo cameras have also been used to generate depth maps to assist

in tracking as one can exploit the depth associated to the pixels of the image. However,

most appearance model based object detection and tracking algorithms encounter problems

when the appearance of the object changes as it interacts with the background. If the

background also changes a lot, this can lead to incorrect foreground-background

segmentation. Moreover, such models based on the appearance or features extracted from

a color image will also depend on the lighting conditions of the scene, which poses

problems with rapid illumination changes. In addition, in the presence of occlusion, it

might fail to successfully track the object because of difficulty in identifying the occluded

object whose appearance may have changed considerably because of the presence of the

occluding object. In the traditional x-y tracking domain using particle filters, the ‘y’

corresponds to the vertical displacement in the image, which appears linear when the object

is close to the camera but at further distances, the relationship between the positions of the

persons and the vertical displacement on the image is non-linear, which might lead to a
discrepancy between the motion model predictions and the representation on the image.

Using a camera matrix for obtaining real-world coordinates adds another step in the

processing algorithm, which can be avoided by utilizing the depth data returned by the

Kinect sensor. In case of a stereo camera, computing depth values increases the time

complexity as one must match corresponding feature points in both images.

In comparison to the appearance based tracking in the x-y frame, depth based tracking adds

the third dimension of depth, which is able to provide geometrical representations of the

objects without being affected by illumination changes. Also, given a particular depth,

objects at those depth ranges can be extracted and segmented more robustly than trying to

extract objects from color images or video sequences which might have a changing

background that might be similar to the object in appearance. If the object to be tracked is

occluded, a tracker based on the appearance features might fail to extract the partially

occluded object or might even switch to the object causing the occlusion in case their

appearances are similar. It might not be able to detect the occlusion in those cases. However, when depth data is used with object tracking, given the object's current position and depth, there is only a small change in depth in the next frame, and thus extraction of the fully or partially visible object at the next possible depth ranges will help extract a more

accurate region for the object to be tracked.

This chapter presents research which aims to utilize a motion model which uses the

horizontal-depth frame for propagating particles of a particle filter used to track a given

target and demonstrates the advantages of incorporating depth into the motion model. In

addition, the depth data helps in determining if occlusion has taken place and to extract a

target more precisely than using feature or appearance based models. The algorithm has

been further enhanced, to handle dynamic occlusion scenarios, such as when a target is

occluded by one or more occluders for a period of time. This is achieved by observing the

occlusion status of the target and initiating occluder track(s), which serves the dual purpose

of providing a distribution of the location probability for the target in case of full or partial

occlusion. This is combined with a part based matching template system for associating

partially visible object parts to the whole object as detected in the pre-occlusion stage, or

even for object recovery purposes.

3.2. Related Work

Researchers have approached the problem of occlusion handling in different ways, such as

by producing detailed object representations for parts of objects, such as in [72], where a

hierarchical deformable part-based model is used for handling occlusion. In [73], two types

of detectors are used, a global detector which generates an occlusion map, which is to be

used by a part based local detector. Some researchers consider the occluder-occludee as a

pair and use suitable feature representations for the same, such as in [74], where an and-or

model has been adopted for studying occluder-occludee occlusion patterns in a car, which

can also be extended to other objects such as humans. [75] proposes to use a double person

detector for detecting occluder-occludee pairs. In [76], occluder-occludee occlusion

patterns are mined for robust object detection. Some other approaches use a combination
of context information along with visual and depth cues to track an object robustly in the

presence of an occluder, such as in [77-78]. In [79], a vehicle detection and tracking

approach has been proposed that handles dynamic occlusion of vehicles on a road, by tracking occluders and occludees using a context based multiple cue method. The challenge faced by these approaches is that modeling the target object by itself ignores the

fact that it can undergo occlusion, which would drastically change the model

representation. Even if the occluder-occludee pair is modeled, that will also undergo

change, as the objects move and interact with each other.

Increased availability of depth sensors has encouraged researchers to pursue RGB-D

tracking, which has potential for yielding better results, since the addition of the depth data

can handle occlusion better or prevent model drift arising from a change in appearance of

the target. [80] presents an RGB-D tracker, where the RGB Kernelized Correlation Filters

tracker is enhanced by fusing color and depth cues, and by exploiting the depth distribution

of the target, scale changes are studied and occlusion is handled. Lost tracks are recovered by

searching in key areas. In [81], Gaussian Mixture Models have been used to detect

occlusion. Partial occlusion is handled by tracking the partially visible object based on

fusing depth and color data. A motion tracker is also used to predict positions in case of

full occlusion. However, research focusing on RGB-D tracking with occlusion handling is

limited and this work aims to address this situation in a different way. Object tracking is

done by propagating the object in the horizontal-depth framework followed by depth based

extraction. Occlusion handling is done by matching partially occluded object parts to prior

models and maintaining separate occluder tracks to narrow down search for the occluded

target.

3.3. Methodology

This section presents the approach used for depth based tracking with occlusion handling.

It is divided into 4 sub-sections, Object Representation, Object Extraction and Filtering,

Occlusion Detection and Handling, and Particle Filter Tracker.

3.3.1. Object Representation

This algorithm uses particle filters for object tracking in the x-z domain. A human object

in the depth image is depicted in Fig. 18(a) and the corresponding depth profile w.r.t. the

horizontal axis is given in Fig. 18(b). The depth profile takes on a characteristic shape for

an upright human body, either stationary or in motion. As the person moves, his depth

profile does not alter much, unless the person is occluded, in which case the shape of the

blob is going to change. However, for the purpose of tracking, one can consider the center

of the patch in Fig. 18(a) to be the (x,z) center of the object, where x stands for the

horizontal displacement and z stands for the depth value. Additionally, normalized color

histograms [65], extracted from the corresponding color image and normalized histogram

of depth values with 50 bins (Fig. 18(c)), obtained from the depth images are used for

computing the feature vector for the object. Other characteristics such as object’s (x,y)

position, where x stands for the horizontal displacement in the image and y stands for the

vertical displacement in the image, and size of the bounding box are also computed. The

patches are checked for occlusion and the Occlusion Detection and Handling section

describes how to handle partial or full occlusion scenarios.
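A sketch of building this representation from an RGB patch and the matching depth patch follows; the 50-bin depth histogram matches the description above, while the 8-bins-per-channel color histogram, the use of the median depth as a stand-in for the depth mode, and the assumption of zero-valued missing depth pixels are illustrative choices.

import numpy as np

def object_representation(rgb_patch, depth_patch, color_bins=8, depth_bins=50):
    """Normalized color histogram, normalized 50-bin depth histogram and (x, z) center.
    rgb_patch: (H, W, 3) uint8 image patch; depth_patch: (H, W) depth values."""
    color_hist, _ = np.histogramdd(
        rgb_patch.reshape(-1, 3).astype(float),
        bins=(color_bins,) * 3, range=((0, 256),) * 3)
    color_hist = color_hist.ravel() / max(color_hist.sum(), 1.0)
    valid = depth_patch[depth_patch > 0]                 # ignore missing depth readings
    depth_hist, _ = np.histogram(valid, bins=depth_bins)
    depth_hist = depth_hist / max(depth_hist.sum(), 1.0)
    z_center = float(np.median(valid)) if valid.size else 0.0   # stand-in for the depth mode
    x_center = rgb_patch.shape[1] / 2.0                  # horizontal center of the patch
    return color_hist, depth_hist, (x_center, z_center)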

3.3.2. Object Extraction and Filtering

The tracking initiates with the user selecting the object to be tracked. The ground plane is

removed in all images before beginning the processing following the method in [67]. The

corresponding patch in the depth image is analyzed to get the mode of the depth value.

Using this value, the algorithm makes an informed guess about the possible depth ranges

that the object can be at in the next frame (-100cm to +200cm) and objects are extracted at

depth intervals of 25 cm. If the object is partially occluded by another object at a different depth, extraction still works unaffected as it is based on depth. However,

in case occlusion occurs because the occluding object is almost at a similar depth to the

object or if it is at the same depth as the object to be tracked and two or more objects appear

as a joint blob, then the depth segmented blob is going to be split into multiple hypothesis patches with sizes pertaining to the given depth value. The patch size, given by the length and width parameters and associated with a depth value, is learnt by conducting

experiments, recording average human sizes at those depths in a lookup table and then,

using extrapolation. Once the objects are extracted using the method explained, a two-step

gating technique is applied which filters out some objects based on their proximity in position and size to the estimated human object in the previous frame. In the second step of the gating method, these filtered objects are matched by their color and depth features using

the Bhattacharyya distance measure with the previously detected human object.
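A simplified sketch of the depth-interval extraction step is given below, using SciPy connected-component labelling; the depth window and 25 cm step follow the description above, while the minimum blob size is a placeholder, and the subsequent position/size/feature gating described in the text is omitted.

import numpy as np
from scipy import ndimage

def candidate_patches(depth_cm, z_prev, lo=-100, hi=200, step=25, min_pixels=400):
    """Extract candidate blobs in the depth window [z_prev+lo, z_prev+hi),
    scanning in slices of `step` cm.  Returns (bounding-box slices, blob depth) pairs."""
    candidates = []
    for z in range(int(z_prev) + lo, int(z_prev) + hi, step):
        mask = (depth_cm >= z) & (depth_cm < z + step)
        labels, n_blobs = ndimage.label(mask)            # connected components in this slice
        for box in ndimage.find_objects(labels):
            region = mask[box]
            if np.count_nonzero(region) < min_pixels:
                continue                                 # too small to be a person
            blob_depth = float(np.median(depth_cm[box][region]))
            candidates.append((box, blob_depth))
    return candidates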


Figure 18. (a) Depth image showing the human body. (b) projection of the human body
depth data on the x-z plane. (c) normalized depth histogram for the human object
3.3.3. Occlusion Detection and Handling

Before any extracted patch is processed, a check is performed to detect the presence of

occlusion. In case the person to be tracked is occluded by another object, then the occluding

object has to be present at a depth which is less than the depth of the object. Therefore,

when analyzing the depth values in a patch, if there exists a concentration of pixels having

depth values lower than the depth of the object to be tracked, such that the concentration

exceeds an occlusion threshold, thresh_occ, then the object is said to be partially occluded. The threshold thresh_occ is computed according to

thresh_occ = (number of occluded pixels) / (total number of non-zero pixels in the depth image)    (21)
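A small sketch of this occlusion check, where the ratio of Eq. (21) is compared against a chosen cutoff; the depth margin and cutoff values are placeholders.

import numpy as np

def is_partially_occluded(depth_patch, target_depth, margin=15.0, occ_cutoff=0.3):
    """Compute the ratio of Eq. (21) over the patch and flag partial occlusion when
    the pixels lying clearly in front of the target exceed the chosen cutoff.
    depth_patch and target_depth are assumed to share the same units (e.g. cm)."""
    valid = depth_patch[depth_patch > 0]                 # non-zero (valid) depth pixels
    if valid.size == 0:
        return False
    occluded = np.count_nonzero(valid < (target_depth - margin))
    return (occluded / valid.size) > occ_cutoff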

Modelling the prior appearance, shape or motion of the target and matching it to the object when it reappears after occlusion might fail. In this approach, when an object is partially

occluded, the algorithm starts tracking the occluder as well, until the object is completely

visible, thus indirectly keeping track of the object’s 2-D location. One has to keep in mind

that the object to be tracked might be occluded on its left or right by other occluders,

especially in a crowded scene, or the occluders might themselves be occluded by newer

occluders. The proposed algorithm tracks all the visible occluders.

The goal is to use the location of the tracked occluder as a prior over the distribution of the

object to be tracked. This can be easily observed because the target to be tracked is either

visible on the left side or the right side (or top side) of the occluder, or the target is

completely covered by the occluder (full occlusion). One might argue that the target could

also be lost, and while that is a possibility in certain cases (such as when an occluder is

approaching a target from the right side, and upon occlusion, if the occluder is stationary

and the target simply changes track and moves perpendicular to his original path, all along

covered by the occluder, and finally exits the tracking scene), the goal of this algorithm is

to help identify the occluded object and associate it correctly upon its reappearance.

Therefore, fusing prior information about the target with the location of its occluder(s) helps

recover the target upon its reappearance more effectively.

Additionally, when the target undergoes occlusion, its appearance changes, and therefore

the partially visible object (when the object is getting occluded or when it is coming out of

occlusion), does not necessarily match the appearance model of its pre-occlusion stage.

When the target is heavily occluded, with only parts of the target visible, such as an arm, part of a leg, or half of the body, the use of depth data can extract the sub-parts more reliably than appearance based models can. The immediate background is also monitored for possible

occluders at similar depths, and in such cases, color and position based filters will help to

extract the target (or its parts in case of occlusion) precisely.

Once occlusion has been detected, this work does not strive to maintain an updated model

of the partially visible object, and instead tries to associate the partly visible object to the

prior model learnt before occlusion. The motivation behind this approach is that an

occluded object interacts with its occluders and any model representing the visible part of

the occluder or even the occluder-occludee pair will be changing with the change in

interaction, and therefore, this work updates the probability distribution for the occluded

object based on the occluder and associates partially visible object parts to the stored

model. This is done by template matching, and if a high probability of association can be

found, the part object is output as the tracked object, otherwise the algorithm proceeds with

tracking the occluder(s). When the object becomes fully occluded, the algorithm does not

output anything for the object, but updates its possible location distribution. Finally, when

the object is visible (i.e. not occluded), the depth and appearance models are updated by a

weighted combination of the prior and posterior models. After a threshold number of

frames, framethresh, if the object is still occluded, the algorithm assumes that it has been lost

and a search is done to locate the object.
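A hedged sketch of this part-to-prior association step is given below, using OpenCV's normalized cross-correlation template matching as one possible realization; the 0.7 acceptance score and the function names are illustrative assumptions, not parameters reported in this work, and the sketch assumes one region fits entirely inside the other.

```python
import cv2

def associate_part_to_prior(part_bgr, prior_template_bgr, accept_score=0.7):
    """Try to associate a partially visible object part with the stored
    pre-occlusion appearance template.

    Returns (matched, score, top_left), where `matched` is True when the
    normalized correlation score exceeds `accept_score`.
    """
    # Template matching needs the searched image to be at least as large as
    # the template, so match the smaller region inside the larger one.
    if (part_bgr.shape[0] < prior_template_bgr.shape[0] or
            part_bgr.shape[1] < prior_template_bgr.shape[1]):
        image, templ = prior_template_bgr, part_bgr
    else:
        image, templ = part_bgr, prior_template_bgr

    result = cv2.matchTemplate(image, templ, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_val >= accept_score, float(max_val), max_loc
```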

3.3.4. Particle Filter Tracker

The particle filter tracker in this work uses a motion model in the horizontal-depth or x-z motion framework. The state of the object at time $t$ is represented by the vector $S_t = [X_t \; \dot{X}_t \; Z_t \; \dot{Z}_t]$, where $X_t$ and $Z_t$ are the x and z coordinate positions, and $\dot{X}_t$ and $\dot{Z}_t$ are the velocity components in the x and z directions, respectively, at time $t$. This motion model, which gives a prior estimate of the state, can be represented by the following equations:

$$X_t = X_{t-1} + \dot{X}_{t-1}\,\Delta t \qquad (22)$$

$$Z_t = Z_{t-1} + \dot{Z}_{t-1}\,\Delta t \qquad (23)$$

$$|\dot{X}_t| = |\dot{X}_{t-1}| + N(0, \sigma_x^2) \qquad (24)$$

$$|\dot{Z}_t| = |\dot{Z}_{t-1}| + N(0, \sigma_z^2) \qquad (25)$$

where $\sigma_x^2$ and $\sigma_z^2$ are the variances of the velocity components in the x and z directions.
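A minimal sketch of this propagation step for a set of particles is shown below (NumPy); the array layout, the noise values, and the choice to keep the previous direction of motion, which Eqs. (24)-(25) leave unspecified since they only constrain the speed, are illustrative assumptions.

```python
import numpy as np

def propagate_particles(particles, dt=1.0, sigma_x=5.0, sigma_z=5.0, rng=None):
    """Propagate particles through the x-z motion model of Eqs. (22)-(25).

    particles : (N, 4) array with columns [X, X_dot, Z, Z_dot] (e.g. in cm)
    """
    rng = np.random.default_rng() if rng is None else rng
    X, X_dot, Z, Z_dot = particles.T

    X_new = X + X_dot * dt                                                   # Eq. (22)
    Z_new = Z + Z_dot * dt                                                   # Eq. (23)
    # Perturb the speeds; the sign (direction) is carried over from the
    # previous step, which Eqs. (24)-(25) do not prescribe.
    speed_x = np.abs(np.abs(X_dot) + rng.normal(0, sigma_x, X_dot.shape))    # Eq. (24)
    speed_z = np.abs(np.abs(Z_dot) + rng.normal(0, sigma_z, Z_dot.shape))    # Eq. (25)
    X_dot_new = np.where(X_dot >= 0, speed_x, -speed_x)
    Z_dot_new = np.where(Z_dot >= 0, speed_z, -speed_z)

    return np.column_stack([X_new, X_dot_new, Z_new, Z_dot_new])
```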

Tracking in the x-z domain is beneficial because the particles are propagated directly in the horizontal-depth plane. The motion model estimates the depth at which the object can be expected to be found in the next time instant, and the objects can be extracted using these depth values. If there are multiple pixels in the depth image with the predicted (x, z) values, then patches are extracted by averaging over all those (x, z) values. In the x-y tracking domain, by contrast, the y estimate is essentially a vertical displacement in the image; at longer distances, the motion model's prediction for the y value may drift upwards in the image, whereas the object's actual y position changes non-linearly. This problem is avoided entirely when depth values are used, since depth can be used to extract the objects in the depth image or the corresponding color image.
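As a simple illustration of this depth-gated extraction, a sketch is given below; the half-window size, the minimum blob size, and the helper names are assumptions, with connected components of the resulting mask serving as candidate object blobs.

```python
import numpy as np
from scipy import ndimage

def extract_at_depth(depth_image, z_pred, half_range=150, min_pixels=500):
    """Return candidate object blobs within a depth window around z_pred.

    depth_image : 2-D array of depths (0 = missing), same units as z_pred
    half_range, min_pixels : illustrative gating parameters
    """
    valid = depth_image > 0
    mask = valid & (np.abs(depth_image - z_pred) <= half_range)

    labels, num = ndimage.label(mask)                  # connected components
    blobs = [labels == i for i in range(1, num + 1)
             if (labels == i).sum() >= min_pixels]     # discard tiny blobs
    return blobs
```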

The object’s feature vector is computed as described in the Object Representation section.

When a set of objects is obtained as the observations following the Object Extraction and Filtering section, the primary task is to retain only one observation. To do so, each patch generated during the motion model prediction step casts a vote for an observation.

This vote is cast depending on the maximum likelihood achieved for the state, given the

current set of observations. The likelihood model is given by

$$p(\mathrm{obs} \mid X) \;\propto\; \exp\!\left(-\frac{d_{Color}(\hat{C}_k, \hat{C}_{ref})^2}{2\sigma_1^2}\right) \cdot \exp\!\left(-\frac{d_{Dep}(\widehat{Dep}_k, \widehat{Dep}_{ref})^2}{2\sigma_2^2}\right) \cdot \exp\!\left(-\frac{d_{Euc}(X_{cur_k}, X_{obs_k})^2}{2\sigma_3^2}\right) \qquad (26)$$

where $d_{Color}(\hat{C}_k, \hat{C}_{ref}) = \sqrt{1 - \rho(\hat{C}_k, \hat{C}_{ref})}$ is the Bhattacharyya distance between the color histograms, $d_{Dep}(\widehat{Dep}_k, \widehat{Dep}_{ref}) = \sqrt{1 - \rho(\widehat{Dep}_k, \widehat{Dep}_{ref})}$ is the Bhattacharyya distance between the normalized depth histograms, and $d_{Euc}(X_{cur_k}, X_{obs_k}) = \sqrt{(x_{cur_k} - x_{obs_k})^2 + (y_{cur_k} - y_{obs_k})^2}$ is the Euclidean distance between the center of the current patch and the observed object's center position at time $k$. Here $\hat{C}_k$ and $\widehat{Dep}_k$ are the normalized color and depth histograms for the current patch centered at $(x, z)$, $\hat{C}_{ref}$ and $\widehat{Dep}_{ref}$ are the normalized color and depth histograms for the reference patch, $\rho$ is the Bhattacharyya coefficient, $\sigma_1$, $\sigma_2$, $\sigma_3$ specify the Gaussian noise in the measurements, obs is the observation, and $X$ is the current state.
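A hedged sketch of evaluating Eq. (26) for one patch-observation pair is given below; the histogram layout and the sigma values are illustrative assumptions, not tuned parameters from this work.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two normalized histograms."""
    rho = np.sum(np.sqrt(h1 * h2))          # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - rho, 0.0))

def likelihood(color_k, color_ref, depth_k, depth_ref,
               center_cur, center_obs, sigmas=(0.1, 0.1, 20.0)):
    """Unnormalized observation likelihood of Eq. (26).

    Histograms are 1-D arrays summing to one; centers are (x, y) pixel
    coordinates; the sigma values are placeholders for illustration.
    """
    s1, s2, s3 = sigmas
    d_color = bhattacharyya_distance(color_k, color_ref)
    d_depth = bhattacharyya_distance(depth_k, depth_ref)
    d_euc = np.hypot(center_cur[0] - center_obs[0], center_cur[1] - center_obs[1])
    return (np.exp(-d_color**2 / (2 * s1**2)) *
            np.exp(-d_depth**2 / (2 * s2**2)) *
            np.exp(-d_euc**2  / (2 * s3**2)))
```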

Once all the patches have associated themselves with an observation, a majority vote is conducted and the observation receiving the highest number of votes is retained as the final observation. Following this, the particle weights are reassigned using the same likelihood model with respect to the final selected observation, the residual resampling method [69] is applied, and the mean of the posterior distribution is output as the estimated position of the tracked object. The overall procedure is summarized in the algorithm below.
Algorithm

(i) Obtain depth mean and standard deviation of target to be tracked from user

selected patch.

(ii) Obtain the depth distribution of the immediate neighborhood.

(iii) Update the position of target to be tracked based on depth propagation and filter

using estimated position of target. If neighborhood depth distribution obtained

in step 2 is similar to the target, apply filters based on color distribution and

position of target.

(iv) Update the position for any current occluder(s) which are occluding the target.

(v) For the updated position of the target in step 3, determine if occlusion has occurred. Checks are conducted to detect the presence of new occluder(s) as well as to determine any intersection between old occluder(s) and the target.

(vi) In case of occlusion in step 5, extract the partially visible object using depth of

the target and use part based template matching to match it with the appearance

model of the object prior to occlusion to see if the visible part belongs to this

object.

(vii) In case of full occlusion, the position estimates of the occluder(s) serve as

possible locations for the target.

(viii) Reinitialize search for target if it does not reappear after some frames

depending on the situation.

(ix) Repeat steps 2 to 8.
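For reference, a generic sketch of the residual resampling step [69] applied after the weight update is given below; the NumPy helpers are a choice made for illustration.

```python
import numpy as np

def residual_resample(weights, rng=None):
    """Residual resampling: return the indices of the particles to keep.

    weights : 1-D array of normalized particle weights (sums to one).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)

    # Deterministic part: each particle is copied floor(n * w_i) times.
    counts = np.floor(n * weights).astype(int)
    indices = np.repeat(np.arange(n), counts)

    # Stochastic part: draw the remaining particles from the residual weights.
    n_residual = n - counts.sum()
    if n_residual > 0:
        residual = n * weights - counts
        residual /= residual.sum()
        extra = rng.choice(n, size=n_residual, p=residual)
        indices = np.concatenate([indices, extra])
    return indices

# particles = particles[residual_resample(weights)]   # resampled particle set
```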


3.4. Results

The algorithm has been tested on data collected in the laboratory as well as on some occlusion scenes from the Princeton tracking dataset [82]. A Microsoft Kinect sensor, which provides depth data for each pixel in the color image, is used; mapping from the depth image to the color image is done using [68]. The sensor is fixed on a table and the motion of object(s) in front of it is captured, with a maximum range of 400 cm. To evaluate the proposed method, 20 video sequences were recorded and corresponding depth and color images of size 424 × 512 were obtained. In these videos, the object moves in a linear manner, either walking towards or away from the camera, or going from left to right in front of the camera and vice versa. Another object steps in and occludes the person to be tracked, partially or, in some cases, fully. Additionally, some occlusion scenes from the Princeton tracking dataset have been used, which test occlusion at several levels, such as when the target is occluded for a period of time by multiple occluders, or when the target is occluded by an occluder at a similar depth range. The ground truth of the human's position is recorded manually, and the Root Mean Square Error (RMSE) is used to measure the accuracy of the algorithm.
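For completeness, a small sketch of the RMSE computation over a tracked sequence is shown below; whether the error is taken over the Euclidean position error, as here, or per coordinate is an assumption.

```python
import numpy as np

def rmse(estimated_xz, ground_truth_xz):
    """Root Mean Square Error between estimated and ground-truth positions.

    Both inputs are (T, 2) arrays of (x, z) positions over T frames.
    """
    err = np.asarray(estimated_xz, dtype=float) - np.asarray(ground_truth_xz, dtype=float)
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```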

In Fig. 19, the object executes a simple linear walk and the algorithm is able to successfully track it throughout the sequence. In Fig. 20, two objects walk towards each other; they cross at some point in time and thus have similar depth values with partial occlusion. The algorithm is able to detect and track the selected object without drifting to the other object when the occlusion takes place. In Fig. 21, an object is fully occluded for some time; however, when it comes out of its state of occlusion, the tracker picks it up again without drifting to the other object, even though the occluding object is at a very close depth. This is possibly because of the combination of depth and color models used for matching. In Fig. 22, the object to be tracked moves from the back of the scene towards the camera while being partly occluded, and the algorithm tracks it throughout the occlusion state. At the end, the occluding object moves towards the camera and only a small portion of the object to be tracked is visible, yet the algorithm can detect the non-occluded portion and continue the tracking. Similar occlusion scenarios are observed in Fig. 23 and Fig. 24. Fig. 25 is more challenging, as the target to be tracked is occluded and the occluder in turn is occluded. By using depth data to extract visible portions of the occluded target and matching these parts with the prior appearance model stored before occlusion, the algorithm is able to report a suitable location of the target to the viewer. Additionally, tracking the occluder helps in recovering the target when it emerges from a state of full occlusion.

Figure 19. Image sequence showing an object executing a simple linear motion being
tracked.

Figure 20. Image sequence showing an object facing partial occlusion being tracked correctly. At frame number 262, the two objects are at similar depths and the target is partly occluded.

Figure 21. Image sequence showing an object that is fully occluded for a short time; on reappearing, it is tracked again.

Figure 22. Image sequence showing an object that is partially occluded for a long duration of time; it is successfully tracked throughout, and towards the end, when it is heavily occluded, the algorithm still tracks it correctly.

Figure 23. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 1); Bold black bounding box represents target, light black bounding
box represents occluder.
Figure 24. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 2); Bold black bounding box represents target, light black bounding
box represents occluder.

Figure 25. Partially visible target is obstructed by an occluder which in turn is occluded;
Bold black bounding box represents target; light black bounding box represents occluder.

3.5. Conclusion

This work has introduced human object tracking with occlusion in the x-z domain, where

particles are propagated using the horizontal-depth framework. Using depth values for

tracking and object extraction avoids issues otherwise faced in appearance-based tracking, such as lighting changes altering the appearance of the object, difficulty in extraction when objects have similar appearances, or unsuccessful extraction when the object is occluded. As seen from the results, tracking in the x-z domain leads

to more accurate state motion model predictions which help in the extraction of the objects.

This method handles occlusion scenarios robustly, because it integrates information about

the occluders, thus producing a better estimate of the location of the target. Also, by

avoiding the update of the target’s appearance model by using partially visible object parts,

and instead, simply associating those object parts to the whole object detected prior to

occlusion, additional information about the target’s position can be obtained.

Tracking in the x-z domain instead of the x-y domain has not been well researched, and this work aims to contribute towards that. Improvements can be made to obtain a better

representation of the object features in the depth domain. The gating technique could be

changed by using a machine learning algorithm that can classify extracted depth patches

as humans or non-human objects. Also, the shape of the x-z blob projection could be used

for human identification with or without occlusion. While this work assumes that the sensor

is stationary, future work could use these sensors in motion. Additionally, multiple objects

could be tracked, and better data association techniques could be explored.


Chapter 4. Data Association in Tracking

4.1. Introduction

Correctly modeling the 3-D environment around a given object is an important first step in applications where the object has to navigate, such as autonomous robots, autonomous vehicles, and the like. For instance, an autonomous vehicle requires a map of its surrounding environment so that it can steer itself in the correct direction. Information about the surrounding environment can be obtained from sensors. However, it is challenging to obtain complete 3-dimensional information using one kind of sensor. Using a radar, one can obtain accurate range information about the surrounding objects, yet it is difficult to resolve the azimuth (x) and elevation (y) directions accurately. For instance, a monopulse radar generates multiple closely spaced beams from the same antenna and uses the sum beam and the delta beam to resolve the target's azimuth and elevation directions; at least three receive channels and a relatively complex antenna feed network are needed to realize such a radar. The cross-track scanning mode used on satellites is another way for a radar to resolve targets along the azimuth direction; however, it needs a rotating mirror (an opto-mechanical device), and the mechanical part increases the cost and is not suitable for automotive use, where the system constantly experiences vibration during driving. Many researchers have proposed phased-array radars or multiple-input multiple-output (MIMO) radars; however, to deploy these techniques the transmitter and receiver should have as many array elements as possible for high azimuth resolution, and the cost and complexity of realizing these elements become high. On the other hand, a camera can provide the location of an object in a 2-D plane (such as the x-y plane) accurately, but gives only some idea about the relative depth. Several camera-only methods cannot resolve depth well: a depth camera is restricted to providing information up to very short ranges, and a stereo camera can measure depth but makes the whole system more complex because of the need to locate corresponding feature points between the two cameras.

This work proposes to use two different sensors (i.e. low-cost radar and camera) that give

accurate information along different dimensions and tries to associate these pieces of

information obtained from the two sensors to correctly predict the 3-D position of objects.

Consider the case of a car moving on a highway, trying to determine the positions of other vehicles around it. The complicated problem of detecting the 3-D environment with one kind of sensor is therefore formulated as a problem of associating range information from a radar with 2-D

position coordinates from the vision system. This is where sensor fusion comes into play.

It means integrating information obtained from various sources (sensors) in an appropriate

manner so that an estimation can be made about the scene in question which further helps

in establishing a model of the environment.

This method establishes a relationship between the sizes of a fixed object as projected on

an image taken at different ranges from the object versus the corresponding ranges. This

relationship is then used to estimate the relative depth of objects from their sizes as

perceived in an image. This data, when combined with the relative positions of the vehicles in the image, can be used to predict which absolute distance value obtained from a radar signal return is associated with which object, using an optimization algorithm such as the Hungarian algorithm.

4.2. Related Work

Prior work that has attempted to associate a radar depth to a vision detected object in a real-

world outdoor scene is limited. However, work has been done to estimate absolute depth

using monocular vision system. For instance, [29] tries to estimate depth of static objects

in traffic scenes from monocular video using structure from motion. Depth estimation from

unstructured scenes was explored in [30]. Semantic knowledge of the scene is used to

estimate the depth in [31]. A depth based segmentation using radar and stereo camera is

given in [32]. In [33], a radar-vision fusion system is studied. Some radar and vision fusion

techniques try to associate the radar data to visual detection and tracking as in [34]. In

[35], a vision system is used to confirm the contour of the object detected by a radar. In

[36], the radar and vision detected objects are associated during initialization and tracking is performed thereafter; however, it is not clear how the initial association takes place. In

this work, one sensor is not used to validate another sensor. Instead, the 3-D position of the

objects is established by combining the strengths of two sensors.

4.3. Methodology

4.3.1. Derivation of equation

Using the perspective projection equations for the pinhole camera model, a point in 3-D space $(x, y, z)$ is transformed to a point $(u, v)$ in the camera coordinate system according to [37]:

$$u = \alpha\,\frac{x}{z} - \alpha\cot\theta\,\frac{y}{z} + u_0, \qquad v = \frac{\beta}{\sin\theta}\,\frac{y}{z} + v_0 \qquad (27)$$

where $z$ is the depth of the point from the camera, $\alpha = kf$ and $\beta = lf$, $f$ is the focal length expressed in meters, a pixel has dimensions $\frac{1}{k} \times \frac{1}{l}$ with $k$ and $l$ expressed in pixel·m$^{-1}$, $u_0$ and $v_0$ are the coordinates of the center of the image plane in the camera coordinate system, and $\theta$ is the skew of the camera coordinate system. For this derivation, it is assumed that the origin of the world coordinate system coincides with that of the camera coordinate system. Consider a figure viewed at depth $z$ from the camera; its 3-D world coordinates and the corresponding camera coordinates are shown in Fig. 26.

Figure 26. A real-world figure projected onto camera co-ordinate plane

Using Eq. (27), we can write

$$u_1 = \alpha\,\frac{x_1}{z} - \alpha\cot\theta\,\frac{y_1}{z} + u_0, \qquad v_1 = \frac{\beta}{\sin\theta}\,\frac{y_1}{z} + v_0,$$

$$u_2 = \alpha\,\frac{x_2}{z} - \alpha\cot\theta\,\frac{y_2}{z} + u_0, \qquad v_2 = \frac{\beta}{\sin\theta}\,\frac{y_2}{z} + v_0.$$

Therefore,

$$u_2 - u_1 = \frac{\alpha}{z}(x_2 - x_1) - \frac{\alpha\cot\theta}{z}(y_2 - y_1) \;\Rightarrow\; l_{projected} = \frac{\alpha}{z}\,(l - h\cot\theta),$$

$$v_2 - v_1 = \frac{\beta}{z\sin\theta}(y_2 - y_1) \;\Rightarrow\; h_{projected} = \frac{\beta h}{z\sin\theta},$$

where $l = x_2 - x_1$ and $h = y_2 - y_1$. Therefore, one has

$$size_{projected} = l_{projected} \times h_{projected} = \frac{\alpha\beta\,(lh - h^2\cot\theta)}{z^2\sin\theta} = \frac{const}{z^2}, \quad \text{where } const = \frac{\alpha\beta\,(lh - h^2\cot\theta)}{\sin\theta},$$

$$\Rightarrow\; size_{projected} \propto \frac{1}{z^2}. \qquad (28)$$

In real life, the world and camera origin coordinates are not the same. However, translation

and rotation factors transforming a figure from the world frame to the camera frame would

not affect the projected size of the figure on the image, which is inversely proportional to

depth raised to the power of two.

An experiment was conducted where a fixed size chessboard was placed on the wall. An

image of the chessboard was taken from different ranges (moving from a distance near the

chessboard to far away) and at each time, the size of the chessboard was recorded as

projected on the image and the corresponding range measurement was also taken. The size

of the object as perceived in the image versus the distance at which the image was taken is

plotted and shown in Fig. 27.
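A small sketch of the consistency check behind this experiment is given below; the data arrays would come from such a calibration run, and the helper name is an assumption.

```python
import numpy as np

def size_depth_constant(sizes, distances):
    """Check Eq. (28): projected size x distance^2 should be nearly constant.

    sizes     : projected sizes in pixel units
    distances : corresponding ranges (e.g. in cm)
    Returns the mean constant and its relative standard deviation.
    """
    k = np.asarray(sizes, dtype=float) * np.asarray(distances, dtype=float) ** 2
    return k.mean(), k.std() / k.mean()
```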

The plot of size × distance² for each data pair is given in Fig. 28. This value is almost constant, with a very small standard deviation. Therefore, the experimental data confirms the relationship between projected size and depth derived from the pinhole model in Eq. (28). The motivation for using this relationship is the assumption that most cars moving on the road have roughly similar sizes and can therefore be represented by an average size. This average-sized car, viewed from different distances, would project different sizes on the image: viewed from very close, it would appear very large, and at large distances it would appear much smaller (Fig. 29). Thus, we can use Eq.

(28) to get an idea about the relative depth of the cars in the scene. The absolute depth is

not predicted because that would involve knowing the actual sizes of the cars, which is not

possible. Also, using relative depth does not change the way the data association works out

as these associations are optimized with the absolute depth obtained by the radar.
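A hedged sketch of these two image-derived cues, relative depth from projected size (Eq. (28)) and nearness ranking from the bounding-box bottom edge, is given below; the (x, y, w, h) box format with image y increasing downward and the unit scale constant are assumptions for illustration.

```python
import numpy as np

def relative_depths(bbox_areas, const=1.0):
    """Relative depth estimates from projected areas, following Eq. (28):
    size = const / z^2  =>  z = sqrt(const / size).
    `const` only scales the values, so the relative ordering is preserved."""
    return np.sqrt(const / np.asarray(bbox_areas, dtype=float))

def nearness_ranks(bboxes):
    """Rank detections by nearness using the bottom edge of each (x, y, w, h)
    bounding box: a larger bottom-y (lower in the image) means nearer."""
    bottoms = np.array([y + h for (_, y, _, h) in bboxes])
    order = np.argsort(-bottoms)                    # nearest first
    ranks = np.empty(len(bboxes), dtype=int)
    ranks[order] = np.arange(1, len(bboxes) + 1)    # rank 1 = nearest
    return ranks
```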

Figure 27. Plot of the size of the chessboard (in pixel units) projected on the image versus the distance (in cm) at which the image was taken.
Figure 28. Plot of size × distance² versus observation number, confirming that the observed data follows the derived Eq. (28).

However, predicting the depth from projected size will give accurate results only when objects of a similar physical size are considered. In real life, vehicles of different sizes (trucks versus small cars) are seen on highways. The projected size of a bus in an image will be much larger than that of an average car when both are at the same range from the camera. Even if the car is nearer to the camera than the bus, as in Fig. 30(c)(d), the projected size of the bus may still be larger than that of the car, leading to the false conclusion that the bus is nearer. This results from the assumption that all vehicles on the road have an average size. So, to include vehicles of all classes, additional information obtained from the image itself is used, namely the relative positions of the vehicles, which is described next.

Figure 29. Vehicles with their ranks based on their relative positions in the image.

From the bounding box that is obtained around every detected vehicle, the y-coordinate of

the lower left corner (i.e. along the vertical plane of image) is used as a measure of relative

positions of the vehicles. For instance, in Fig. 29, the detected targets have been ranked

according to their nearness to the camera based on the y-value. Thus, for a situation as

illustrated in Fig. 30(c)(d), this ranking would tell us that the bus is farther away from the

car. Therefore, this ranking is used as an additional constraint on the associations made by the Hungarian algorithm, which is described below.

4.3.2. Procedure

After obtaining an image of vehicles moving on a highway, we would have to detect the

vehicles in the image. Detection of vehicles in an image would be a separate research topic

altogether and we do not go into details of that in this work. However, the reader could

consider [38] for a survey on techniques for detecting vehicles. The Histogram of Oriented

Gradients (HOG) detector has been used for detecting vehicles as described in [39],

followed by computing the area of the bounding box surrounding each detected car.

Assuming this area to represent the size of the car as projected in the image, the relative

depth of the car is calculated using Eq. (28). The radar also gives us a set of range

measurements representing the depths of the objects located around. But a radar with a small number of array elements, or even a single array element, cannot resolve targets in the x-y plane well. Therefore, these range measurements are just values and do not tell us to which object each corresponds. We now have a set of objects represented by their relative depths and sizes along with a ranking, and a set of absolute depths. We have to associate the objects with the absolute depth values. This can be formulated as an assignment problem and solved using the Hungarian algorithm. Let the vehicles detected by the vision system be the collection $O = \{o_1, o_2, o_3, \ldots, o_j\}$ and the ranges returned by the radar sensor be the collection $R = \{r_1, r_2, r_3, \ldots, r_k\}$. The problem is modeled as a bipartite graph, where O is one set of nodes and R is the other, so the total number of nodes is $N = |O| + |R|$. The cost matrix for the above problem is constructed by letting each cell be the squared difference between the absolute distance returned by the radar and the distance estimate returned by the vision system, which is then fed to the Hungarian algorithm.
Figure 30. Testing images. (a) Testing on cars having the same average size. (b) Testing on a partly occluded vehicle (a small car occluded by a large truck). (c), (d) Testing on a large vehicle along with cars.
The aim is to minimize this cost matrix, that is, to find an optimal assignment such that the total cost of assignment is minimized. If m is the

minimum number of assignments for objects that are from the sets of O and R respectively,

the goal is to make an optimal assignment for the two groups of m objects. Therefore, the

following objective has to be minimized:

$$\arg\min \sum_{i=1}^{m} \left\| r(i) - v(i) \right\|^2$$

subject to the constraint

$$\text{if } rank(o(i)) < rank(o(j)), \text{ then } distance(o(i)) < distance(o(j)) \qquad (29)$$

where $v(i) = \sqrt{const / size(i)}$, $r(i)$ and $v(i)$ are the radar and vision computed depth values, $\|r(i) - v(i)\|^2$ is a measure of the difference between the computed depth values, $size(i)$ is the projected size of object $i$, and $distance(o(i))$ is the absolute distance assigned to object $i$ by the Hungarian method. Thus, an optimal assignment of the absolute distances obtained by the radar to the objects detected by the vision system is obtained, thereby associating the range returns with spatial positions and thus giving 3-D information about the surrounding vehicles.
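A hedged sketch of this assignment step is given below, using SciPy's Hungarian solver; the ordering constraint of Eq. (29) is only verified after the assignment here, rather than enforced inside the optimization as in the proposed method, and the vision depths are assumed to be scaled roughly to the radar's units.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(radar_ranges, vision_depths, ranks):
    """Assign radar ranges to vision-detected vehicles (cf. Eq. (29)).

    radar_ranges  : 1-D array of absolute ranges from the radar
    vision_depths : 1-D array of vision depth estimates v(i) = sqrt(const/size(i))
    ranks         : nearness ranks of the detections (1 = nearest)
    Returns (assignment, constraint_ok), where assignment maps detection
    index -> assigned radar range.
    """
    v = np.asarray(vision_depths, dtype=float)
    r = np.asarray(radar_ranges, dtype=float)
    cost = (v[:, None] - r[None, :]) ** 2            # squared-difference cost matrix
    rows, cols = linear_sum_assignment(cost)         # Hungarian method
    assignment = {int(i): float(r[j]) for i, j in zip(rows, cols)}

    # Check the ordering constraint of Eq. (29): a nearer-ranked detection
    # should be assigned a smaller distance than a farther-ranked one.
    constraint_ok = all(assignment[i] < assignment[j]
                        for i in assignment for j in assignment
                        if ranks[i] < ranks[j])
    return assignment, constraint_ok
```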

4.4. Results

Several test images were collected by driving along a highway. The images included small cars at varying distances from the ego car. Images with a truck and a bus were also collected to see how the method would perform. Radar return signals for all the vehicles in the images have been simulated. The simulated radar used a 76.5 GHz carrier, 1 GHz bandwidth, a 0.2 μs pulse width, and an SNR of 13.2 dB. It applied a wavelet-based waveform [40] for high range resolution detection. The radar simulation results are shown in Fig. 31.

After processing the images following the method described in Section 4.3, the relative depths of the vehicles in the scene are obtained. These are then associated with the absolute ones using the Hungarian method of optimizing the associations. Some of the test images are shown in Fig. 30, in which the associated vehicles are depicted by bounding boxes and the associated distances are written in meters (m). The ground truth of the vehicle distances is obtained by manual inspection. Correct associations were made to the radar-returned distances. The method worked well with trucks and small cars in the same scene, which can be attributed to the inclusion of relative positions among the vehicles.

In this process, some false detections have been eliminated (this occurs when the radar system returns more range values than the number of objects detected by the vision system).

For instance, in Fig. 30(b), radar detected three objects. However, since one is just partially

visible, the vision system fails to detect it. The proposed method is still able to discard the

radar return for that occluded vehicle and do the other associations correctly. However, it
might also happen that the vision system failed to detect some object whereas the radar

sensor detected it.

Figure 31. Radar simulation results for Fig. 30(a)-(d), respectively.

4.5. Conclusion

In this work, a camera and a radar have been used to estimate the 3-D position of vehicles

on a highway as seen from an ego car. The relative depth of the vehicles has been estimated

using the size of the vehicles as projected on the image using Eq. (28). Then using a

constraint, based on the ranking of vehicles according to their nearness to the ego car, and

the Hungarian algorithm, associations have been performed to match the accurate range

data of the radar with each vehicle as seen by the vision system. The method works well on the tested images. Several directions for future work on this approach can be identified. One issue arises when the computed size of a vehicle does not reflect its true back view. In Fig. 30(d), for the car in the leftmost lane, part of the side view is included inside the bounding box. Weights could be assigned to the computed size, with a lower weight assigned when the bounding box contains a back view together with a partial side view; however, this problem arises for vehicles near the ego car rather than far-away ones. Also, at a lane intersection, the side view of a car in a perpendicular lane would project a larger size, and the two views would have to be distinguished. Another direction could be to estimate the absolute depth of the vehicles from a single image based on [30] and [31]; that could make the whole approach more robust under some conditions, while significantly complicating the computation. The use of radar and vision fusion could also be extended to estimate 3-D positions in more complex scenarios, such as a college campus.

Chapter 5. Conclusion and Future Work

This dissertation makes a contribution to depth based sensor fusion by exploring several

topics related to object detection and tracking. The more traditional range sensors such as

lasers and radars are generally quite expensive and have been used in military or large scale

industrial projects, etc. Indoor robotics has mostly used low-cost sensors such as infrared,

ultrasonic, or stereo cameras, etc. Nowadays, with the development of depth cameras such

as Kinect or PrimeSense, which are affordable as well, a substantial amount of research

has been conducted to fuse depth data with existing models based on RGB features.

However, as noted in [82], this integration has a lot of potential for improvement.

In indoor robotics applications, the use of infrared sensors has mostly been limited to a

proximity sensor to avoid obstacles. Chapter 2 presents work on extending the use of these

low-cost, but extremely fast infrared sensors to accomplish tasks such as identifying the

direction of motion of a person and fusing the sparse range data obtained from infrared

sensors with a camera to develop a low-cost and efficient indoor tracking sensor system.

Therefore, an array of infrared sensors can be advantageous over a depth camera, when

discrete data is required at a fast processing rate.

In Chapter 3, a Kinect sensor has been used to track an object with a focus on occlusion

handling. A Kinect sensor provides data for the depth at every pixel and this information

is useful for extracting objects based on depth even if the object is partially occluded. An

occluder tracking system with part based association of the partially visible occluded

objects helps to keep track of the object when it recovers from occlusion. There are many

state-of-the-art algorithms for object tracking using RGB data; however, object tracking using RGB-D data is relatively new, and occlusion handling using depth data needs more exploration.

In Chapter 4, a classical data association problem has been explored, where discrete range

data from a depth sensor has to be associated to 2-D objects detected by a camera. This

problem has been applied to a situation where a radar returns a set of ranges corresponding

to objects in the environment and a camera provides the 2-D information about the objects,

with a focus on vehicles driving in a highway. This data association using a Hungarian

algorithm with specified constraints works on a structured environment, however, more

research is required to extend this use to complex environments like an urban scene. This

would eliminate the need to use very expensive radars or 3-D lasers.

There is scope for a lot of interesting work to be done to extend the research presented in

this dissertation. Effectively tracking a human being in a crowd using the idea of

propagation in the x-z domain and extending it to explore multi-target tracking could be

considered. Both these topics have immense potential in today’s world, for instance, a robot

whose job is to follow a target, such as a patient in the hospital (in order to monitor the

movement of the patient and alert concerned department in case the patient falls down or

needs some other help that the robot cannot provide), or a personal shopping assistant robot

following a shopper (to perhaps direct the shopper to a particular product in an aisle), or a

robot that aids a blind man to navigate on the roads, or a robot which can follow a particular

factory worker, to carry instruments or materials from one place to another and so on.

Important applications for multi-target tracking would be in a sport such as football, where

the coach would like to track the movements of the players and also their interactions and

in a video surveillance scene, where every person has to be tracked.

Putting more focus on the steps leading to object detection and tracking is necessary. In

order to successfully extract target object(s) from the scene, various methods have been

proposed in the literature to build a 2-D model such as the survey in [2] or a 3-D model,

such as the survey in [69]. As the scenes get more complex, the background illumination

might change, or the background might have a similar appearance with the foreground, or

it could have noisier stationary occluders at similar depths as the object(s) to be tracked, so

starting with some relevant work [4-5], research could be conducted on this topic. In the

case of multi-target tracking, it is highly possible that the object(s) will occlude each other,

or might exhibit a complex motion trajectory [70], thus a dynamic motion model which is

able to make accurate predictions about the object(s) positions is desired.

When x-z propagation is used in a particle filtering tracker, it provides some advantages

over the x-y tracking, because the depth information is utilized and thus one can extract

objects at a particular depth range. It must be taken into account that if two objects are observed to be at the same depth range, then either they are beside each other, or one object is occluding the other, with the occluder and the occluded person being at different depths. Thus, based solely on depth information, it is possible to extract the objects by utilizing the pixel depth information in case of occlusion. Secondly, when an extracted object blob comprises two persons beside each other, the information provided by the size of the

bounding box (at that particular depth, which can be established prior to execution) is an

indicator of unsuccessful object extraction and some blob segmentation technique can be

applied to split the blob into several candidate hypotheses. Based on pure depth

information, it doesn’t matter if the two objects have similar appearance or not. On the

contrary, in a 2-D image, extracting occluded objects or objects having similar appearance

is not quite successful.

Moreover, when the motion model uses x-z propagation, it must be understood that the

object can move only a finite distance ahead or behind, unlike in x-y propagation, where

the object has the liberty to take on any random y position. The x position in either case

can be random. Therefore, x-z propagation is definitely going to limit the number of

particles needed for tracking. In fact, one can consider the particles to be the depth extracted

objects and since each depth extraction takes place over a small range, we could possibly

be looking at 20 to 30 samples for tracking. For example, if the object is at z depth 200 cm

from the sensor, then considering the average stride of the person to be 81 cm [71], the

tracker would ideally check a depth range of 200 ± 100 cm, i.e., from 100 cm to 300 cm, for the object; choosing a depth processing interval of 30 cm, this could mean processing only on the order of 10 candidate objects per frame. This would significantly reduce the computational load.

Another interesting direction to consider would be to use a network of RGB-D sensors, or

a network of infrared arrays situated at strategic locations in an environment to facilitate

object tracking even in complex scenarios. A sensor network of cameras has been used in

[8] for tracking a single object and presents interesting research on the trade-off between

using subsets of multiple cameras versus having more prior information about the

occluders. If the environment is very crowded, the target remains occluded most of the

time, and therefore information from multiple cameras might help to resolve occlusion

errors, rather than tracking multiple occluders. In [83], a network of RGB-D cameras has

been used for tracking multiple persons to build an infrastructure for emergency relief

operations. It has been demonstrated that multiple visual cameras have helped localize and

track humans, handling occlusion scenarios well. The ultimate goal could be to develop a

multiview approach using 3-D representations of space.

References

[1] Yilmaz, A., Javed, O. and Shah, M., 2006. Object tracking: A survey. Acm computing
surveys (CSUR), 38(4), p.13.
[2] Li, X., Hu, W., Shen, C., Zhang, Z., Dick, A. and Hengel, A.V.D., 2013. A survey of
appearance models in visual object tracking. ACM transactions on Intelligent Systems and
Technology (TIST), 4(4), p.58.
[3] Wu, Y., Lim, J. and Yang, M.H., 2013. Online object tracking: A benchmark.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
2411-2418).
[4] Hare, S., Saffari, A. and Torr, P.H., 2011, November. Struck: Structured output tracking
with kernels. In 2011 International Conference on Computer Vision (pp. 263-270). IEEE.
[5] Dinh, T.B., Vo, N. and Medioni, G., 2011, June. Context tracker: Exploring supporters
and distracters in unconstrained environments. In Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on (pp. 1177-1184). IEEE.
[6] Jia, X., Lu, H. and Yang, M.H., 2012, June. Visual tracking via adaptive structural local
sparse appearance model. In Computer vision and pattern recognition (CVPR), 2012 IEEE
Conference on (pp. 1822-1829). IEEE.
[7] Zhong, W., Lu, H. and Yang, M.H., 2012, June. Robust object tracking via sparsity-
based collaborative model. In Computer vision and pattern recognition (CVPR), 2012
IEEE Conference on (pp. 1838-1845). IEEE.
[8] Ercan, A.O., Gamal, A.E. and Guibas, L.J., 2013. Object tracking in the presence of
occlusions using multiple cameras: A sensor network approach. ACM Transactions on
Sensor Networks (TOSN), 9(2), p.16.
[9] Song, S. and Xiao, J., 2013. Tracking revisited using rgbd camera: Unified benchmark
and baselines. In Proceedings of the IEEE international conference on computer
vision (pp. 233-240).

[10] Tsai, Y.T., Shih, H.C. and Huang, C.L., 2006, August. Multiple human objects
tracking in crowded scenes. In 18th International Conference on Pattern Recognition
(ICPR'06) (Vol. 3, pp. 51-54). IEEE.
[11] Saravanakumar, S., Vadivel, A. and Ahmed, C.S., 2010, December. Multiple human
object tracking using background subtraction and shadow removal techniques. In Signal
and Image Processing (ICSIP), 2010 International Conference on (pp. 79-84). IEEE.
[12] Zoidi, O., Nikolaidis, N. and Pitas, I., 2013, May. Appearance based object tracking
in stereo sequences. In 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing (pp. 2434-2438). IEEE.
[13] Pan, J. and Hu, B., 2007, June. Robust occlusion handling in object tracking. In 2007
IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). IEEE.
[14] Fod, A., Howard, A. and Mataric, M.A.J., 2002. A laser-based people tracker.
In Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference
on (Vol. 3, pp. 3024-3029). IEEE.
[15] Vu, T.D. and Aycard, O., 2009, May. Laser-based detection and tracking moving
objects using data-driven markov chain monte carlo. In Robotics and Automation, 2009.
ICRA'09. IEEE International Conference on (pp. 3800-3806). IEEE.
[16] Labayrade, R., Perrollaz, M., Gruyer, D. and Aubert, D., 2010. Sensor Data Fusion
for Road Obstacle Detection, Sensor Fusion and its Applications.
[17] Cho, H., Seo, Y.W., Kumar, B.V. and Rajkumar, R.R., 2014, May. A multi-sensor
fusion system for moving object detection and tracking in urban driving environments.
In 2014 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1836-
1843). IEEE.
[18] Kumar, S., Marks, T.K. and Jones, M., 2014. Improving person tracking using an
inexpensive thermal infrared sensor. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops (pp. 217-224).
[19] Cruz, L., Lucio, D. and Velho, L., 2012, August. Kinect and rgbd images: Challenges
and applications. In Graphics, Patterns and Images Tutorials (SIBGRAPI-T), 2012 25th
SIBGRAPI Conference on (pp. 36-49). IEEE.

[20] Spinello, L. and Arras, K.O., 2011, September. People detection in RGB-D data.
In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference
on (pp. 3838-3843). IEEE.
[21] Koo, S., Lee, D. and Kwon, D.S., 2013, November. Multiple object tracking using an
rgb-d camera by hierarchical spatiotemporal data association. In Intelligent Robots and
Systems (IROS), 2013 IEEE/RSJ International Conference on (pp. 1113-1118). IEEE.
[22] Parvizi, E. and Wu, Q.J., 2008, May. Multiple object tracking based on adaptive depth
segmentation. In Computer and Robot Vision, 2008. CRV'08. Canadian Conference on (pp.
273-277). IEEE.
[23] Nakamura, T., 2011, December. Real-time 3-D object tracking using Kinect sensor.
In Robotics and Biomimetics (ROBIO), 2011 IEEE International Conference on (pp. 784-
788). IEEE.
[24] Isard, M. and Blake, A., 1996, April. Contour tracking by stochastic propagation of
conditional density. In European conference on computer vision (pp. 343-356). Springer,
Berlin, Heidelberg.
[25] Blake, A. and Isard, M., 1997. The CONDENSATION algorithm-conditional density
propagation and applications to visual tracking. In Advances in Neural Information
Processing Systems (pp. 361-367).
[26] Comaniciu, D., Ramesh, V. and Meer, P., 2000. Real-time tracking of non-rigid
objects using mean shift. In Computer Vision and Pattern Recognition, 2000. Proceedings.
IEEE Conference on (Vol. 2, pp. 142-149). IEEE.
[27] Arulampalam, M.S., Maskell, S., Gordon, N. and Clapp, T., 2002. A tutorial on
particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on
signal processing, 50(2), pp.174-188.
[28] Spinello, L., Luber, M. and Arras, K.O., 2011, May. Tracking people in 3-D using a
bottom-up top-down detector. In Robotics and Automation (ICRA), 2011 IEEE
International Conference on (pp. 1304-1310). IEEE.

[29] Wedel, A., Franke, U., Klappstein, J., Brox, T. and Cremers, D., 2006, September.
Realtime depth estimation and obstacle detection from monocular video. In Joint Pattern
Recognition Symposium (pp. 475-484). Springer Berlin Heidelberg.
[30] Saxena, A., Sun, M. and Ng, A.Y., 2007, October. Learning 3-d scene structure from
a single still image. In 2007 IEEE 11th International Conference on Computer Vision (pp.
1-8). IEEE.
[31] Liu, B., Gould, S. and Koller, D., 2010, June. Single image depth estimation from
predicted semantic labels. In Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on (pp. 1253-1260). IEEE.
[32] Fang, Y., Masaki, I. and Horn, B., 2002. Depth-based target segmentation for
intelligent vehicles: Fusion of radar and binocular stereo. IEEE transactions on intelligent
transportation systems, 3(3), pp.196-202.
[33] Bauson, W.A., 2010. Integrated Radar-Vision Sensors: the Next Generation of Sensor
Fusion. Available online:
http://www.sae.org/events/gim/presentations/2010/williambauson.pdf
(accessed on 17 December 2017).
[34] Wang, T., Zheng, N., Xin, J. and Ma, Z., 2011. Integrating millimeter wave radar with
a monocular vision sensor for on-road obstacle detection applications. Sensors, 11(9),
pp.8992-9008.
[35] Bertozzi, M., Bombini, L., Cerri, P., Medici, P., Antonello, P.C. and Miglietta, M.,
2008, June. Obstacle detection and classification fusing radar and vision. In Intelligent
Vehicles Symposium, 2008 IEEE (pp. 608-613). IEEE.
[36] Chavez-Garcia, R.O., Burlet, J., Vu, T.D. and Aycard, O., 2012, June. Frontal object
perception using radar and mono-vision. In Intelligent Vehicles Symposium (IV), 2012
IEEE (pp. 159-164). IEEE.
[37] Forsyth, D. and Ponce, J., 2011. Computer Vision: A Modern Approach. 2nd ed. Prentice Hall.
[38] Sivaraman, S. and Trivedi, M.M., 2013, June. A review of recent developments in
vision-based vehicle detection. In Intelligent Vehicles Symposium (pp. 310-315).

[39] Mao, L., Xie, M., Huang, Y. and Zhang, Y., 2010, July. Preceding vehicle detection
using Histograms of Oriented Gradients. In Communications, Circuits and Systems
(ICCCAS), 2010 International Conference on (pp. 354-358). IEEE.
[40] Cao, S., Zheng, Y.F. and Ewing, R.L., 2011, July. Scaling function waveform for
effective side-lobe suppression in radar signal. In Proceedings of the 2011 IEEE National
Aerospace and Electronics Conference (NAECON) (pp. 231-236). IEEE.
[41] Benet, G., Blanes, F., Simó, J.E. and Pérez, P., 2002. Using infrared sensors for
distance measurement in mobile robots. Robotics and autonomous systems, 40(4), pp.255-
266.
[42] Everett, H.R., 1995. Sensors for Mobile Robots. AK Peters, Ltd., Wellesley, MA.
[43] Malik, R. and Yu, H., 1992, August. The infrared detector ring: obstacle detection for
an autonomous mobile robot. In Circuits and Systems, 1992., Proceedings of the 35th
Midwest Symposium on (pp. 76-79). IEEE.
[44] Park, H., Baek, S. and Lee, S., 2005, July. IR sensor array for a mobile robot.
In Proceedings, 2005 IEEE/ASME International Conference on Advanced Intelligent
Mechatronics. (pp. 928-933). IEEE.
[45] Gandhi, D. and Cervera, E., 2003, October. Sensor covering of a robot arm for
collision avoidance. In Systems, Man and Cybernetics, 2003. IEEE International
Conference on (Vol. 5, pp. 4951-4955). IEEE.
[46] Tar, A., Koller, M. and Cserey, G., 2009, April. 3-D geometry reconstruction using
Large Infrared Proximity Array for robotic applications. In Mechatronics, 2009. ICM 2009.
IEEE International Conference on (pp. 1-6). IEEE.
[47] Do, Y. and Kim, J., 2013. Infrared range sensor array for 3-D sensing in robotic
applications. International Journal of Advanced Robotic Systems, 10(4), p.193.
[48] Ryu, D., Um, D., Tanofsky, P., Koh, D.H., Ryu, Y.S. and Kang, S., 2010, October. T-
less: A novel touchless human-machine interface based on infrared proximity sensing.
In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference
on (pp. 5220-5225). IEEE.

[49] Sharma, R., Daniel, H. and Dušek, F., 2014. Sensor fusion: an application to
localization and obstacle avoidance in robotics using multiple ir sensors. In Nostradamus
2014: Prediction, Modeling and Analysis of Complex Systems (pp. 385-392). Springer,
Cham.
[50] Fu, C., Wu, S., Luo, Z., Fan, X. and Meng, F., 2009, December. Research and design
of the differential autonomous mobile robot based on multi-sensor information fusion
technology. In Information Engineering and Computer Science, 2009. ICIECS 2009.
International Conference on (pp. 1-4). IEEE.
[51] Sabatini, A.M., Genovese, V., Guglielmelli, E., Mantuano, A., Ratti, G. and Dario, P.,
1995, August. A low-cost, composite sensor array combining ultrasonic and infrared
proximity sensors. In Intelligent Robots and Systems 95.'Human Robot Interaction and
Cooperative Robots', Proceedings. 1995 IEEE/RSJ International Conference on (Vol. 3,
pp. 120-126). IEEE.
[52] Duan, S., Li, Y., Chen, S., Chen, L., Min, J., Zou, L., Ma, Z. and Ding, J., 2011, June.
Research on obstacle avoidance for mobile robot based on binocular stereo vision and
infrared ranging. In Intelligent Control and Automation (WCICA), 2011 9th World
Congress on (pp. 1024-1028). IEEE.
[53] Zappi, P., Farella, E. and Benini, L., 2008, October. Pyroelectric infrared sensors
based distance estimation. In Sensors, 2008 IEEE (pp. 716-719). IEEE.
[54] Yun, J. and Lee, S.S., 2014. Human movement detection and identification using
pyroelectric infrared sensors. Sensors, 14(5), pp.8057-8081.
[55] Wahl, F., Milenkovic, M. and Amft, O., 2012, December. A distributed PIR-based
approach for estimating people count in office environments. In Computational Science
and Engineering (CSE), 2012 IEEE 15th International Conference on (pp. 640-647).
IEEE.
[56] Kang, J., Gajera, K., Cohen, I. and Medioni, G., 2004, June. Detection and tracking
of moving objects from overlapping EO and IR sensors. In Computer Vision and Pattern
Recognition Workshop, 2004. CVPRW'04. Conference on (pp. 123-123). IEEE.

[57] Hosokawa, T. and Kudo, M., 2005. Person tracking with infrared sensors.
In Knowledge-Based Intelligent Information and Engineering Systems (pp. 907-907).
Springer Berlin/Heidelberg.
[58] Gu, Y. and Veloso, M., 2007, June. Learning Tactic-Based Motion Models of a
Moving Object with Particle Filtering. In Computational Intelligence in Robotics and
Automation, 2007. CIRA 2007. International Symposium on (pp. 1-6). IEEE.
[59] Madrigal, F., Rivera, M. and Hayet, J.B., 2011, November. Learning and regularizing
motion models for enhancing particle filter-based target tracking. In Pacific-Rim
Symposium on Image and Video Technology (pp. 287-298). Springer Berlin Heidelberg.
[60] Erdem, C.E., Sankur, B. and Tekalp, A.M., 2004. Performance measures for video
object segmentation and tracking. IEEE Transactions on Image Processing, 13(7), pp.937-
951.
[61] Piciarelli, C., Foresti, G.L. and Snidaro, L., 2005, September. Trajectory clustering
and its applications for video surveillance. In Advanced Video and Signal Based
Surveillance, 2005. AVSS 2005. IEEE Conference on (pp. 40-45). IEEE.
[62] Biresaw, T.A., Alvarez, M.S. and Regazzoni, C.S., 2011, August. Online failure
detection and correction for Bayesian sparse feature-based object tracking. In Advanced
Video and Signal-Based Surveillance (AVSS), 2011 8th IEEE International Conference
on (pp. 320-324). IEEE.
[63] Jaward, M., Mihaylova, L., Canagarajah, N. and Bull, D., 2006, March. Multiple
object tracking using particle filters. In 2006 IEEE Aerospace Conference (pp. 8-pp).
IEEE.
[64] Liu, J.S. and Chen, R., 1998. Sequential Monte Carlo methods for dynamic
systems. Journal of the American statistical association, 93(443), pp.1032-1044.
[65] Nummiaro, K., Koller-Meier, E. and Van Gool, L., 2002, September. Object tracking
with an adaptive color-based particle filter. In Joint Pattern Recognition Symposium (pp.
353-360). Springer Berlin Heidelberg.

[66] Dalal, N. and Triggs, B., 2005, June. Histograms of oriented gradients for human
detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR'05) (Vol. 1, pp. 886-893). IEEE.
[67] Wei, X., Phung, S.L. and Bouzerdoum, A., 2014. Object segmentation and
classification using 3-D range camera. Journal of Visual Communication and Image
Representation, 25(1), pp.74-85.
[68] Terven, J.R. and Cordova, D.M., 2016. A Kinect 2 Toolbox for MATLAB. https://github.com/jrterven/Kin2.
[69] Li, H., Liu, X., Cai, Q. and Du, J., 2015. 3-D Objects Feature Extraction and Its
Applications: A Survey. In Transactions on Edutainment XI (pp. 3-18). Springer Berlin
Heidelberg.
[70] Morris, B.T. and Trivedi, M.M., 2008. A survey of vision-based trajectory learning
and analysis for surveillance. IEEE transactions on circuits and systems for video
technology, 18(8), pp.1114-1127.
[71] http://www.livestrong.com/article/438170-the-average-walking-stride-length/
(accessed on 17 December 2017).
[72] Girshick, R.B., Felzenszwalb, P.F. and Mcallester, D.A., 2011. Object detection with
grammar models. In Advances in Neural Information Processing Systems (pp. 442-450).
[73] Wang, X., Han, T.X. and Yan, S., 2009, September. An HOG-LBP human detector
with partial occlusion handling. In Computer Vision, 2009 IEEE 12th International
Conference on(pp. 32-39). IEEE.
[74] Li, B., Wu, T. and Zhu, S.C., 2014, September. Integrating context and occlusion for
car detection by hierarchical and-or model. In European Conference on Computer
Vision (pp. 652-667). Springer, Cham.
[75] Pepikj, B., Stark, M., Gehler, P. and Schiele, B., 2013. Occlusion patterns for object
class detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 3286-3293).
[76] Tang, S., Andriluka, M. and Schiele, B., 2014. Detection and tracking of occluded
people. International Journal of Computer Vision, 110(1), pp.58-69.

[77] Chen, G., Ding, Y., Xiao, J. and Han, T.X., 2013. Detection evolution with multi-
order contextual co-occurrence. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (pp. 1798-1805).
[78] Ouyang, W. and Wang, X., 2013. Single-pedestrian detection aided by multi-
pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (pp. 3198-3205).
[79] Tao, J., Enzweiler, M., Franke, U., Pfeiffer, D. and Klette, R., 2015, September. What
is in front? multiple-object detection and tracking with dynamic occlusion handling. In
International Conference on Computer Analysis of Images and Patterns (pp. 14-26).
Springer International Publishing.
[80] Camplani, M., Hannuna, S.L., Mirmehdi, M., Damen, D., Paiement, A., Tao, L. and
Burghardt, T., 2015, September. Real-time RGB-D Tracking with Depth Scaling
Kernelised Correlation Filters and Occlusion Handling. In BMVC (pp. 145-1).
[81] Benou, A., Benou, I. and Hagage, R., 2014, December. Occlusion handling method
for object tracking using RGB-D data. In Electrical & Electronics Engineers in Israel
(IEEEI), 2014 IEEE 28th Convention of (pp. 1-5). IEEE.
[82] Song, S. and Xiao, J., 2013. Tracking revisited using RGBD camera: Unified
benchmark and baselines. In Proceedings of the IEEE international conference on
computer vision (pp. 233-240).
[83] Galanakis, G., Zabulis, X., Koutlemanis, P., Paparoulis, S. and Kouroumalis, V., 2014,
May. Tracking persons using a network of RGBD cameras. In Proceedings of the 7th
International Conference on PErvasive Technologies Related to Assistive
Environments (p. 63). ACM.

