Depth based Sensor Fusion in Object

Detection and Tracking

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor

of Philosophy in the Graduate School of The Ohio State University

By

Ankita Sikdar, B.Tech.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Dissertation Committee

Dong Xuan, Co-Advisor

Yuan F. Zheng, Co-Advisor

Han-Wei Shen
Copyrighted by

Ankita Sikdar

2018

Abstract

Multi-sensor fusion is the method of combining sensor data obtained from multiple sources

to estimate the state of the environment. Its common applications are in automated manufacturing,

automated navigation, target detection and tracking, environment perception, biometrics,

etc. Among these applications, object detection and tracking is particularly important in the fields of robotics and computer vision, with uses in diverse areas such as video surveillance, person following and autonomous navigation. In the context of purely two-dimensional (2-D) camera based tracking, situations such as erratic motion of the object, scene changes and occlusions, along with noise and illumination changes, are an impediment to successful object tracking. Integration of information from range sensors with cameras

helps alleviate some of the issues faced by 2-D tracking. This dissertation aims to explore

novel methods to develop a sensor fusion framework to combine depth information from

radars, infrared and Kinect sensors with an RGB camera to improve object detection and

tracking accuracy.

In indoor robotics applications, the use of infrared sensors has mostly been limited to a

proximity sensor to avoid obstacles. The first part of the dissertation focuses on extending

the use of these low-cost, but extremely fast infrared sensors to accomplish tasks such as

identifying the direction of motion of a person and fusing the sparse range data obtained

from infrared sensors with a camera to develop a low-cost and efficient indoor tracking

sensor system. A linear infrared array network has been used to classify the direction of

motion of a human being. A histogram based iterative clustering algorithm segments data

into clusters, from which extracted features are fed to a classification algorithm to classify

the motion direction. To address circumstances in which a robot tracks an object that behaves unpredictably, making abrupt turns or stopping while moving along an irregular, wavy track, such as when a personal robot assistant follows a shopper in a store, a tourist in a museum or a child playing around, an adaptive motion model has been proposed to keep track of the object. Therefore, an array of infrared sensors can be advantageous over a depth camera when discrete data is required at a fast processing rate.

Research regarding 3-D tracking has proliferated in the last decade with the advent of the

low-cost Kinect sensors. Prior work on depth based tracking using Kinect sensors focuses

mostly on depth based extraction of objects to aid in tracking. The next part of the

dissertation focuses on object tracking in the x-z domain using a Kinect sensor, with an

emphasis on occlusion handling. Particle filters, used for tracking, are propagated based on

a motion model in the horizontal-depth framework. Observations are obtained by

extracting objects using a suitable depth range. Particles, depicted by patches extracted in

the x-z domain, are associated to these observations based on the closest match according

to a likelihood model and then a majority voting is employed to select a final observation,

based on which, particles are reweighted, and a final estimation is made. An occluder

tracking system has been developed, which uses a part based association of the partially

visible occluded objects to the whole object prior to its occlusion, thus helping to keep

track of the object when it recovers from occlusion.

The latter part of the dissertation discusses a classical data association problem, where

discrete range data from a depth sensor has to be associated to 2-D objects detected by a

camera. A vision sensor can locate objects only in a 2-D plane, and estimating distance using a single vision sensor has limitations. A radar sensor returns the range of objects accurately; however, it does not indicate which range corresponds to which object. A sensor fusion approach for radar-vision integration has been proposed which, using a modified Hungarian algorithm with geometric constraints, associates data from a simulated radar with 2-D information from an image to establish the three-dimensional (3-D) position of vehicles around an ego vehicle on a highway. This information would help an

autonomous vehicle to maneuver safely.

Dedication

I dedicate this dissertation to my mother, who has always been my strength and

inspiration.

Acknowledgments

I would like to express my sincere gratitude to my co-advisors Dr. Dong Xuan and Dr.

Yuan Fang Zheng, who have guided me throughout my doctoral studies, inculcating in me

a spirit of independent research. I would especially thank Dr. Zheng for his continuous

inspiration and guidance as a faculty member from the Department of Electrical and

Computer Engineering, outside of my home Department, Computer Science and

Engineering. I would also like to thank my committee member Dr. Han-Wei Shen. This

work would not have been possible without their support.

I would like to thank all my lab mates at the Multimedia and Robotics Laboratory, with

whom I have had the pleasure of holding interesting discussions about my work and

projects, and participating in numerous data collection experiments. Some special thanks

also go to my batchmates at The Ohio State University, who made initial study sessions

fun. I would also like to thank my friends, made during my undergraduate studies at West

Bengal University of Technology, for providing the inspiration and courage to pursue

doctoral studies abroad.

I would also like to thank my parents, my grandparents, my younger sister, cousins,

extended family and friends, who have been a source of constant support.

Finally, I would like to thank my husband for always being there for me through thick and

thin, with his strong encouragement and active support.

Vita

2008……………………….……….…...........Mahadevi Birla Girls’ Higher Secondary


School

July 2012…….…………………………...…B.Tech. Computer Science and Engineering,


West Bengal University of Technology

August 2012 to December 2017……………. Graduate Teaching Associate, Department


of Computer Science and Engineering, The
Ohio State University

Publications

Sikdar, A., Cao, S., Zheng, Y.F. and Ewing, R.L., 2014, May. Radar depth association
with vision detected vehicles on a highway. In Proc. 2014 IEEE Radar Conference (pp.
1159-1164).

Sikdar, A., Zheng, Y.F. and Xuan, D., 2015, May. An iterative clustering algorithm for
classification of object motion direction using infrared sensor array. In Proc. 2015 IEEE
International Conference on Technologies for Practical Robot Applications (TePRA) (pp.
1-6).

Sikdar, A., Zheng, Y.F. and Xuan, D., 2015, June. Using an A-priori learnt motion model
with particle filters for tracking a moving person by a linear infrared array network. In
Proc. 2015 IEEE National Aerospace and Electronics Conference (NAECON) (pp. 75-
80).

Sikdar, A., Zheng, Y.F. and Xuan, D., 2016, September. Robust object tracking in the XZ
domain. In Proc. 2016 IEEE Multisensor Fusion and Integration for Intelligent Systems
(MFI) (pp. 499-504).

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract ............................................................................................................................... ii
Dedication ........................................................................................................................... v
Acknowledgments.............................................................................................................. vi
Vita................................................................................................................................... viii
List of Tables .................................................................................................................... xii
List of Figures .................................................................................................................. xiii
Chapter 1. Introduction ....................................................................................................... 1
Chapter 2. Use of Low Cost Range Sensors for Indoor Object Tracking Applications .... 9
2.1. Introduction .............................................................................................................. 9
2.2. Infrared Sensors ..................................................................................................... 12
2.3. An Iterative Clustering Algorithm for Classification of Object Motion Direction
Using Infrared Sensor Array ......................................................................................... 14
2.3.1. Introduction ..................................................................................................... 14
2.3.2. Related Work .................................................................................................. 15
2.3.3. Methodology ................................................................................................... 17
2.3.4. Results ............................................................................................................. 24
2.4. Using an A-Priori Learnt Motion Model with Particle Filters for Tracking a
Moving Person by a Linear Infrared Array Network.................................................... 31
2.4.1. Introduction ..................................................................................................... 31
2.4.2. Related Work .................................................................................................. 32
2.4.3. Methodology ................................................................................................... 32
2.4.4. Results ............................................................................................................. 38
2.4.5. Conclusion ...................................................................................................... 43
2.5. An Infrared Sensor Guided Approach to Camera Based Tracking of Erratic Human
Motion ........................................................................................................................... 45
2.5.1. Introduction ..................................................................................................... 45
2.5.2. Related Work .................................................................................................. 46
2.5.3. Methodology ................................................................................................... 47
2.5.4. Results ............................................................................................................. 57
2.5.5. Conclusion ...................................................................................................... 62
Chapter 3. Occlusion Handling in Tracking ..................................................................... 63
3.1. Introduction ............................................................................................................ 63
3.2. Related Work ......................................................................................................... 65
3.3. Methodology .......................................................................................................... 67
3.3.1. Object Representation ..................................................................................... 67
3.3.2. Object Extraction and Filtering ....................................................................... 68
3.3.4. Particle filter tracker ....................................................................................... 72
3.4. Results .................................................................................................................... 76
3.5. Conclusion ............................................................................................................. 85
Chapter 4. Data Association in Tracking ......................................................................... 86
4.1. Introduction ............................................................................................................ 86
4.2. Related Work ......................................................................................................... 88
4.3. Methodology .......................................................................................................... 89
4.3.1. Derivation of equation .................................................................................... 89
4.3.2. Procedure ........................................................................................................ 95
4.4. Results .................................................................................................................... 99
4.5. Conclusion ........................................................................................................... 101
Chapter 5. Conclusion and Future Work ....................................................................... 102
References ....................................................................................................................... 107

List of Tables

Table 1. Data from each sensor representing motion in the left to right direction ........... 25
Table 2. Classification Accuracy ...................................................................................... 28
Table 3. Confusion Matrix for KNN, k=5 (Predicted classes shown in columns, actual
classes shown in rows) ...................................................................................................... 28
Table 4. Confusion matrix for SVM classifier.................................................................. 39

List of Figures

Figure 1. A graph showing the non-linearity of the distance measurement as returned by


the infrared sensors. .......................................................................................................... 13
Figure 2. Timing diagram of the SHARP GP2Y0A710K0F infrared sensor as provided by
the manufacturer’s manual. ............................................................................................... 13
Figure 3. Infrared sensor array setup which is installed on the top of a robotic platform.
The platform co-ordinate system is shown, and data is measured w.r.t this co-ordinate
system (a) Each sensor is placed with a separation of 0.5ft on the platform (i.e. at 0.5ft,
1ft and 1.5ft distances along the x-axis); (b) The platform is mounted at a height of 2.42ft
above the ground. .............................................................................................................. 16
Figure 4. Raw data collected from three infrared sensors. (a) Person walking away from
the platform and then towards it; (b) Person moving from left to right and then from right
to left across the platform in a straight line; (c) Person moving from right to left and then
from left to right diagonally. ............................................................................................. 19
Figure 5. Plots showing the intermediate processing steps. (a) Data collected between the
1st and 2nd second; (b) Histogram of the range (Y) or longitudinal distance values; (c)
Clustering done in the range domain; (d) Range domain clusters merged in time domain
to form super clusters representing an object in motion or a stationary object. .............. 21
Figure 6. A plot of the two-dimensional feature space. Certain classes may overlap at the
boundaries. The classes are: 1. In front and away; 2. In front and towards; 3. Left to right
straight line; 4. Right to left straight line; 5. Left to right diagonal line; 6. Right to left
diagonal line; 7. Stationary. .............................................................................................. 23
Figure 7. (a-c) Person moving from left to right across the infrared sensor array which is
mounted on a robotic platform. ......................................................................................... 25

Figure 8. Plot showing data analysis for time period 3s – 4s. (a) Raw data capturing
person’s motion; (b) histogram showing peak values (c-d) clustering, with the one in red
being the real cluster. ........................................................................................................ 26
Figure 9. Plots showing straight lines fitting data points in (a) the x-t plane and (b) y-t
plane. The slope pair (0.1581, 0.0003371) is used as the feature vector for classification.
It can also be verified that this slope pair falls in the domain of class 3 as represented by
the feature space in Fig. 6. ................................................................................................ 27
Figure 10. Color coded peaks because of an object in motion as detected by the SVM
classifier. Other peaks are also noticed; however, they are created by inconsistent data
and are discarded by the SVM as true negatives. ............................................................. 38
Figure 11. (a)-(b) A traditional particle filter with 500 particles is used to track the
infrared sensor simulated data with an average position estimation error of 0.9008 ft. ... 40
Figure 12. (a)-(b) A particle filter with 500 particles and a continuously updated motion
model with coefficients of 0.5(Fig. 12(a)) and 0.6(Fig. 12(b)) respectively is used to track
the infrared sensor simulated data with an average position estimation error of 0.35ft. .. 41
Figure 13. True position versus estimation. (a) A traditional particle filter with a fixed
linear model tracks object on real infrared sensor data with an average position estimation
error of 0.64ft (b)A particle filter receiving feedback from the controller regarding
position estimation error tracks the object on real infrared sensor data with an average
position estimation error of 0.34ft. .................................................................................. 42
Figure 14. A plot showing the average error in the position estimation of the object at
different coefficient values for the motion model update parameters. For these runs, a
coefficient of 0.6 produced good results. .......................................................................... 43
Figure 15. Infrared sensor setup with camera (a)The coordinate system (b) Distance
measurements shown on the robotic platform .................................................................. 48
Figure 16. Images from some video sequences illustrating the target tracking under
various occlusion/illumination scenarios. (a) Video sequence to demonstrate target
walking in a scene without any occlusion. Frames 1905, 1941, 2019 and 2055 have been
shown; (b) Video sequence to demonstrate target being occluded by an object with

similar appearance as well as by an object with a different appearance. Frames 1359,
1407, 1425, 1431, 1521, 1545, 1659, 1701, 1725, 1773, 1857 and 2236 have been shown;
(c) Video sequence to demonstrate target being tracked when multiple persons are present
in the scene, however, there is no occlusion. Frames 408, 433, 450, 492 and 505 have
been shown; (d) Video sequence to demonstrate target occluded in presence of other
objects as well. Frames 1002, 1074, 1110 and 1182 have been shown; (e) Video sequence
to demonstrate target occluded in presence of other objects as well. Frames 300, 306, 318
and 360 have been shown; (f) Video sequence to demonstrate target being tracked in the
presence of other objects in low illumination condition in the hallway. Frames 2221,
2293, 2329 and 2341 have been shown. ........................................................................... 59
Figure 17. Graphs showing tracking error at a frequency of 3 s⁻¹ for two different
sequences (a) and (b). ....................................................................................................... 61
Figure 18. (a) Depth image showing the human body. (b) projection of the human body
depth data on the x-z plane. (c) normalized depth histogram for the human object ......... 69
Figure 19. Image sequence showing an object executing a simple linear motion being
tracked. .............................................................................................................................. 78
Figure 20. Image sequence showing an object facing partial occlusion being tracked
correctly. At frame number 262, the two objects are at similar depths and the target is partly
occluded. ........................................................................................................................... 79
Figure 21. Image sequence showing an object that is fully occluded for a short time,
however, on reappearing, it is tracked again..................................................................... 80
Figure 22. Image sequence showing an object that is partially occluded for a long
duration of time; it is successfully tracked all through, and although it is heavily occluded towards the end, the algorithm tracks it correctly till the end. .................................. 81
Figure 23. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 1); Bold black bounding box represents target, light black bounding
box represents occluder..................................................................................................... 82

Figure 24. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 2); Bold black bounding box represents target, light black bounding
box represents occluder..................................................................................................... 83
Figure 25. Partially visible target is obstructed by an occluder which in turn is occluded;
Bold black bounding box represents target; light black bounding box represents occluder.
........................................................................................................................................... 84
Figure 26. A real-world figure projected onto camera co-ordinate plane......................... 90
Figure 27. Plot of size of chessboard projected at increasing depth ranges ..................... 92
Figure 28. Plot confirms that observed data follows derived Eq. (28) ............................. 93
Figure 29. Vehicles with their ranks based on their relative positions determined by the
size. ................................................................................................................................... 94
Figure 30. Testing images. (a) Testing on cars having the same average size; (b) testing on a partly occluded vehicle (a small car occluded by a large truck); (c) & (d) testing on a
large vehicle along with cars............................................................................................. 96
Figure 31. Radar simulation results for Fig. 30(a)-(d) respectively ............................... 100

Chapter 1. Introduction

Object tracking is an important and challenging field of research in the areas of computer

vision and robotics and finds applications in human-computer interaction, surveillance in public places, player tracking in sports and person following. The goal of object tracking is, given an initial state (position, bounding box, size) of the target, to robustly estimate the position of the target object in successive frames of

the input sequence. Some of the difficulties faced by an object tracker include changes in

illumination and shadows, similarity with the background scene, unpredictable motion

behavior and occlusion.

Object tracking algorithms can be categorized into 2-D tracking algorithms and 3-D

tracking algorithms based on whether or not tracking includes the depth dimension. 2-D object tracking has been prevalent and uses monocular RGB cameras. [1] is a

famous survey that categorizes the tracking methods based on the different object

representations and motion representations used, discusses the pros and cons and lists the

important object tracking issues. In [2], Li et al present a survey of 2-D appearance models

for visual object tracking, focusing on visual representations and statistical modeling

schemes for tracking-by-detection. Visual representations try to robustly describe the

spatio-temporal characteristics of object appearance, while the statistical modeling

schemes for tracking-by-detection emphasize capturing the generative and

discriminative statistical information of the object regions. Effective appearance models

combine both visual representation and statistical modeling. In [3], Wu et al carry out large

scale experiments to evaluate the performance of existing online tracking algorithms,

identify new challenges and provide evaluation metrics for in-depth analysis of tracking

algorithms from several perspectives.

When building a 2-D appearance model, the right balance between tracking robustness and

tracking accuracy must be achieved. To improve tracking accuracy, more visual features

and geometric constraints are incorporated into the models, resulting in a precise object

localization, which might also lower the generalization capabilities of the models when

there are variations in the appearance of the target object. On the other hand, to improve

tracking robustness, the appearance models might relax some constraints, which might lead

to ambiguous localization. Additionally, a more complex model composed of many

components may improve tracking robustness at the cost of increased computational

power, when compared to a simpler model that may be computationally more efficient but

has a lower discriminability. Due to the hardware limits of processing speed and memory

usage, the rate at which frames acquired from the video are processed is typically low. The

object’s appearance model may have undergone some variation due to occlusion or

illumination changes, and thus the appearance model used to represent the target object

must be able to generalize well and have the capability to adapt itself based on these

changes. Another aspect to consider is that with a low frame rate, the object may have

executed large or abrupt motion and thus the motion model is also crucial for object

tracking. Good location prediction based on the dynamic or adaptive motion model can

narrow down the search space and lead to improved tracking efficiency and robustness.

Information about the background is also essential as it can be used to effectively

discriminate the foreground, as in [4], or it can serve as the tracking context explicitly as

in [5]. Due to the information loss caused by projection from 3-D to 2-D, the appearance

models in 2-D cannot accurately estimate the poses of tracked objects, leading to failures

in case of occlusion. Local models have been proposed such as in [6, 7], which help when

the object has undergone partial change in appearance, such as when it is partially occluded

or partially deformed.

When multiple overlapping cameras are used to fuse information from different

viewpoints, tracking in a large camera network with transfer of target information from

one camera sub-network to another becomes an important study. In [8], Ercan et al propose

a sensor network of cameras to track a single object in the presence of static and moving

occluders, where each camera does some simple processing to detect the horizontal

position of the target. This data is then sent to the head of the cluster (a subset of cameras) to

track the object.

RGB-D tracking is popular among researchers and [9] provides a benchmark for standard

RGBD algorithms along with comparison of the various algorithms. The 3-D object

tracking algorithms use depth information such as when depth is obtained from stereo or

multiple cameras. The algorithms can then also be extended to crowded scenes to handle

occlusion such as in [10], where pixels are assigned to humans based on their distance and

color models. In [11], multiple human beings are tracked based on motion estimation and

detection, background subtraction, shadow removal and occlusion detection. In [12], stereo

images are used, and appearance-based representation methods based on luminance with disparity information and Local Steering Kernel (LSK) descriptors are employed. In [13], the

occlusion situation is analyzed by exploiting the spatiotemporal context information, which

is further double checked by the reference target and motion constraints to improve

tracking performance along with several templated mask approaches.

Depth information can also be obtained from a range of depth sensors such as radar, laser,

infrared or ultrasonic sensors. These are typically integrated with RGB cameras in a multi-

sensor fusion framework. The addition of depth dimension to the traditional 2-D camera

based object tracking helps to alleviate some of the important challenges faced by 2-D

tracking such as partial or full occlusion of object, better object segmentation based on

depth, better distinguishability between objects having similar appearances. In [14], Fod et

al have described a method for real time tracking of objects with multiple laser range

finders covering a workspace. Vision oriented methods are adapted to laser scanners,

grouping range measurements into entities such as blobs and objects. In [15], Vu et al have

presented a method for simultaneous detection and tracking of moving objects from a moving

vehicle equipped with a single-layer laser scanner. In [16], Labayrade et al. use a laser scanner to first detect objects, and then a stereo vision system to

validate the detections. In [17], Cho et al present a reliable and effective moving object

detection and tracking system for a self-driving car using radars, LIDARS and vision

sensors. In [18], Kumar et al have combined a thermal infrared sensor along with a wide

angle RGB camera to correct the errors of the camera and reduce the false positives,

improve segmentation of tracked objects and correct false negatives.

Research regarding 3-D tracking has proliferated in the last decade with the advent of the

low-cost Kinect sensors [19]. In [20], a combination of histogram of oriented gradients and

histogram of oriented depths is used for detecting humans. In [21], a hierarchical

spatiotemporal data association method (HSTA) is introduced to robustly track multiple

objects without prior knowledge. In [22], an adaptive depth segmentation procedure is

described to perform real-time tracking analysis. In [23], Nakamura et al propose a 3-D

object tracking method by integrating the range and color information using camera

intrinsic parameters and relative transformation between the cameras, followed by tracking

the desired target regions by processing the depth pixels with color information.

Object tracking algorithms generally follow a bottom-up approach or a top down approach

or a combination of both. In the first case, objects are usually extracted from the image frame or data and are then used for tracking, as in model-based approaches [24, 25] or template matching approaches [26]. Particle filtering [27] is a top-down approach, as it generates a set of hypotheses on which evaluation takes place. It is a widely used technique in object tracking, which applies a recursive Bayesian filter based on samples drawn from a proposal distribution. An advantage of particle filters is that they can be applied to non-

linear and non-Gaussian systems. In [28], Spinello et al use a bottom up detector which

generates candidate detection hypotheses that are validated by a top down classifier

procedure for tracking people in 3-D.
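
To make the mechanics concrete, the following is a minimal sketch, not the dissertation's implementation, of one predict-update-resample cycle of a bootstrap particle filter for a 1-D position track; the Gaussian random-walk motion model, the Gaussian measurement likelihood and all parameter values are illustrative assumptions.

import numpy as np

def particle_filter_step(particles, weights, measurement,
                         motion_std=0.2, meas_std=0.5):
    """One predict-update-resample cycle of a bootstrap particle filter.

    particles   : (N,) array of hypothesized 1-D positions
    weights     : (N,) array of normalized particle weights
    measurement : scalar range observation for this time step
    """
    n = len(particles)

    # Predict: propagate each particle through a random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_std, size=n)

    # Update: reweight particles by the Gaussian likelihood of the measurement.
    weights = weights * np.exp(-0.5 * ((measurement - particles) / meas_std) ** 2)
    weights += 1e-300                      # guard against all-zero weights
    weights /= weights.sum()

    # Resample when the effective sample size drops too low.
    if 1.0 / np.sum(weights ** 2) < n / 2:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)

    estimate = np.average(particles, weights=weights)  # posterior mean estimate
    return particles, weights, estimate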

This dissertation is focused on depth based sensor fusion in solving the challenges of object

detection and tracking. A part of the research has focused on using the low-cost range

sensors to solve some of the issues of object tracking such as object association, where

objects detected in a 2-D image have to be associated with a set of depth values

corresponding to the scene returned by a radar sensor. This has been applied to the task of

predicting the 3-D positions of vehicles on a highway with respect to an ego vehicle, which would facilitate the navigation of a self-driving autonomous vehicle. Another issue that

has been explored is the use of infrared (IR) sensors with the aim of utilizing these low-

cost, low-power, easy to use sensors for indoor robotic applications beyond obstacle

distance estimation, such as using an IR sensor array network for classifying the direction

of motion of a human walking in the field of view of the sensors. The direction estimate

can then be used by the robotic platform to guide its motion, avoiding the object or the

person, for example.
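
As a rough illustration of the assignment step underlying such radar-to-camera association (the dissertation's method uses a modified Hungarian algorithm with geometric constraints; this sketch shows only a plain Hungarian assignment), a cost matrix between camera-detected objects and radar ranges can be solved with SciPy's linear_sum_assignment. The size-based depth estimates and the absolute-difference cost below are assumed placeholders.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_ranges(estimated_depths, radar_ranges):
    """Assign each camera-detected object (with a rough depth estimate,
    e.g. inferred from its apparent size) to one radar range reading."""
    # Cost = absolute disagreement between the rough depth estimate and the radar range.
    cost = np.abs(np.subtract.outer(estimated_depths, radar_ranges))
    obj_idx, range_idx = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(obj_idx, range_idx))

# Example: three detected vehicles and three radar returns (values in meters, illustrative).
print(associate_ranges(np.array([22.0, 48.0, 35.0]),
                       np.array([34.1, 23.5, 47.2])))
# -> [(0, 1), (1, 2), (2, 0)]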

Another issue that often poses a challenge in object tracking with particle filters is that of choosing an appropriate motion model, one dynamic enough to adapt to changes in the target object's trajectory. Such changes could arise because the object executes random and erratic motion, or because few frames are processed per second, so that the object's position has shifted from the position predicted by the motion model.

Small errors in position estimation could add up over time making the particle filter

completely lose track of the person. Thus, instead of using a fixed motion model, a motion

model is statistically learnt from the initial target motion data and subsequently this model

is used with the particle filtering approach to track the person. In addition, the learnt motion

model is regularly updated to support the particle filtering approach in establishing a more

accurate track of the person.
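
One simple way such a continuously updated motion model could be realized is sketched below under assumptions of my own: a constant-velocity model whose velocity estimate is blended with the most recently observed displacement using a fixed coefficient, in the spirit of the update coefficients reported later for Fig. 12 and Fig. 14.

import numpy as np

class AdaptiveMotionModel:
    """Constant-velocity model whose velocity estimate is blended with the
    most recently observed displacement (illustrative sketch, not the
    dissertation's exact formulation)."""

    def __init__(self, initial_velocity, coeff=0.6):
        self.velocity = np.asarray(initial_velocity, dtype=float)
        self.coeff = coeff          # weight given to the previously learnt velocity

    def predict(self, position, dt=0.1):
        # Propagate the current position with the learnt velocity.
        return np.asarray(position, dtype=float) + self.velocity * dt

    def update(self, observed_displacement, dt=0.1):
        # Blend the learnt velocity with the velocity implied by the newest observation.
        observed_velocity = np.asarray(observed_displacement, dtype=float) / dt
        self.velocity = self.coeff * self.velocity + (1.0 - self.coeff) * observed_velocity

# Usage: predict particle positions, then refine the model from the observed motion.
model = AdaptiveMotionModel(initial_velocity=[1.2, 0.0], coeff=0.6)
pred = model.predict([0.0, 0.0])
model.update(observed_displacement=[0.15, 0.02])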

Another multi-sensor based tracking approach has been proposed using an infrared sensor

array based secondary tracker to deal with abrupt changes in motion, and a camera based

primary tracker to deal with simple non-linear motion. The former uses an omnidirectional

motion model to keep track of detections and thereby helps to re-initialize the latter in case

it fails to track the object due to sudden motion changes. Additionally, location prediction

made by the infrared tracker is used to influence the likelihood function for the primary

tracker, which helps achieve better object tracking results than relying solely on the

primary tracker.

An important challenge that most 2-D trackers would fail to address is that of occlusion.

Kinect sensors have been used for robust object tracking with occlusion handling in the x-

z domain instead of the traditional x-y domain. Tracking is done by particle filters which

are propagated based on the motion model in the horizontal-depth movement framework.

An adaptive (based on the occlusion status of the target) joint color and depth histogram model

is used to represent a human being. Depth segmented objects are filtered out in each frame.

Particles, depicted by patches extracted in the x-z domain, are associated to these depth

segmented objects (observations) based on a closest match according to a likelihood model

and then a majority voting is employed to select a final observation, based on which,

particles are reweighted, and a final estimation is made. The addition of the depth

dimension in motion propagation and tracking alleviates challenges faced due to change of

object appearance, illumination changes and partial occlusion (or full occlusion in some

cases). Most occlusion handling strategies aim to model prior appearance, shape or motion

models of the occluded target and match it with the observed portion of the target upon its

reappearance. This strategy often fails due to change in the dynamics of the target to be

tracked. In this work, occlusion is handled by using an occluder tracking approach, which

indirectly provides position estimates for the occluded target, thus improving the likelihood

of observing the target correctly when it comes out of its state of occlusion. Complex

occlusion scenarios have been explored and utilizing depth distribution to keep track of

target in heavily occluded scenes improves object tracking accuracy.
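
Since the observations above come from cutting the scene at a suitable depth range, a minimal sketch of that extraction step is shown below; it assumes a Kinect-style depth image in millimeters and an illustrative window around the target's last estimated depth, and is not the dissertation's full segmentation pipeline.

import numpy as np

def extract_depth_slice(depth_image, target_depth_mm, window_mm=400):
    """Return a binary mask and the x-z points of pixels whose depth lies
    within +/- window_mm of the target's estimated depth."""
    lo, hi = target_depth_mm - window_mm, target_depth_mm + window_mm
    mask = (depth_image > lo) & (depth_image < hi)          # depth-range gating

    # Project the selected pixels onto the x-z plane (column index vs. depth).
    ys, xs = np.nonzero(mask)
    xz_points = np.stack([xs, depth_image[ys, xs]], axis=1)
    return mask, xz_points

# Example with a synthetic 480x640 depth frame (values in mm, illustrative).
depth = np.full((480, 640), 4000, dtype=np.uint16)
depth[100:300, 200:260] = 2500                              # a person-like blob
mask, xz = extract_depth_slice(depth, target_depth_mm=2500)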

Chapter 2. Use of Low Cost Range Sensors for Indoor Object

Tracking Applications

2.1. Introduction

In the field of indoor robotics, commonly used distance sensors are ultrasonic sensors,

infrared sensors, lasers or stereo cameras. While each sensor has its pros and cons, one

has to select a combination of sensors for the particular task at hand based on factors of

cost, availability and usage.

Infrared sensors are quite indispensable and are widely used as proximity sensors for obstacle avoidance. They are easy to use and consume a small amount of power. They are small and compact enough to be fitted on any platform and, most importantly, they are extremely low-cost devices (the infrared sensors used in this work cost only 16 USD). These sensors are almost always used in combination with other sensors such as ultrasonic sensors, cameras, etc. to obtain an understanding of the environment. However, their use has been limited to obstacle distance estimation. Constructing a map of an environment using infrared sensors

alone is not considered quite reliable. This is because the measurements obtained from

these sensors can be imprecise owing to the non-linearity of the device and its dependence on the reflectivity of the surrounding objects. Moreover, unlike most sensors that have a wider beam width, the beam width of the infrared sensors is very narrow (around 16 cm at the middle, making the beam angle roughly 3.5° for the sensor), and this could result in the infrared light passing right beside the object without being reflected

by it.

However, the focused beam width of the infrared sensor has the advantage of hitting

smaller objects and suffering from less interference from other infrared sensors, in

comparison to an ultrasonic sensor, which has a wider sound pulse and is susceptible to

noise and interference from other sensors in its vicinity. While a laser is quite expensive

when compared to an infrared sensor array, a Kinect sensor would match the price range.

The advantage of an infrared sensor array over a Kinect would be in processing time, as

the Kinect provides dense depth data (hundreds of thousands of points), whereas an array

of infrared sensors would provide discrete depth information which would be much faster

to process. This chapter presents research work performed to extend the use of infrared

sensors from being basic range sensors. Section 2.2 outlines the theory behind infrared

sensors. Section 2.3 shows how data from an array of infrared sensors can be studied to

extract the direction of motion of a human being walking in its vicinity. Section 2.4

introduces a motion model with feedback to track the motion of a human being using sparse

data from infrared sensor array. In Section 2.5, data from infrared sensor array and a camera

have been fused to perform indoor tracking.

The development of new low-cost IR sensors capable of accurately measuring distances

with reduced response times is worth researching, as stated in [41], where Benet et al. describe some ranging techniques used in infrared sensors and also propose a technique based on light intensity back-scattering from objects. Other infrared sensors

are based on the measurement of phase shift such as [42].

In one application of autonomous navigation [43], infrared sensor emitters and receivers

were arranged in a ring at the bottom of the robot. The front sensors performed collision

avoidance while side sensors were used to follow a wall. In [44], Park et al describe an

infrared sensor array network designed to provide a 360◦ coverage of the environment

using 12 infrared sensors. In [45], rings of several infrared sensors are arranged around

robot links to develop a sensing skin for a robotic arm. Some research has been performed

to develop a 2-D array of infrared sensors [46-47] so as to extract 3-D information of the

environment. In [47], a 2-D array has been used for obstacle detection, safe navigation and

estimating object pose. In [48], such a system has been used for an interesting application

on touchless human computer interaction. In [49], multiple infrared sensors have been used

in localization and obstacle avoidance. A more beneficial way to use these infrared sensors

would be to combine them with other sensors to utilize the strengths of each sensor while

minimizing the disadvantages. An array of ultrasonic and infrared sensors has been used

in [50-51] for researching obstacle avoidance problems. Infrared sensors have also been

fused with vision sensor for obstacle avoidance [52].

2.2. Infrared Sensors

The sensors used in this work have been purchased from Sharp (model number

GP2Y0A710) and have minimum and maximum detection ranges of 3 feet and 18 feet, respectively. The Sharp sensors work by the method of triangulation. Each sensor has two parts, an emitter and a receiver. A pulse of light is emitted which, upon hitting an object, is reflected

back at an angle depending on the distance of the reflecting object. By knowing this angle,

the distance is calculated. The IR receiver part has a precision lens that transmits the

reflected light onto an enclosed linear CCD array, based on this triangulation angle. The

CCD array determines the angle and converts it to a corresponding analog voltage value to

be fed to the microcontroller.

The output of these detectors is non-linear with respect to the distance being measured.

This is because of the trigonometry involved in computing the distance to an object based

on the triangulation angle. Eq. (1) is used in this work to convert the analog readings to

distance values.

distance = (24.41 × voltage + 81.24) / (voltage − 1.155)        (1)
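
A small helper implementing Eq. (1) might look as follows (a sketch only; the guard against the pole of the formula and the example reading are assumptions, and any further scaling of the output to feet depends on the calibration used):

def ir_voltage_to_distance(voltage):
    """Convert the Sharp sensor's analog output voltage to a distance
    estimate using Eq. (1).  (Sketch only; the output unit and any
    further scaling to feet depend on the calibration used.)"""
    if voltage <= 1.155:
        raise ValueError("voltage outside the sensor's usable range")
    return (24.41 * voltage + 81.24) / (voltage - 1.155)

# Example: a mid-range reading (value illustrative).
print(ir_voltage_to_distance(2.0))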

A graph showing the non-linearity is plotted in Fig. 1. A timing diagram from the datasheet

of the sensor is presented in Fig. 2. The latter illustrates that there is only a 16.5 ms delay before the sensor starts to produce output readings.

Figure 1. A graph showing the non-linearity of the distance measurement as returned by


the infrared sensors.

Figure 2. Timing diagram of the SHARP GP2Y0A710K0F infrared sensor as provided by


the manufacturer’s manual.
2.3. An Iterative Clustering Algorithm for Classification of Object Motion
Direction Using Infrared Sensor Array

2.3.1. Introduction

In this work, these disadvantages have been accounted for and a system has been proposed

that can do much more than just acquire a distance estimate: this system is able to identify the direction of motion of an object such as a person in front of the sensor. An infrared sensor array consisting of three sensors has been mounted on a robotic platform with the sensors spaced 0.5 feet apart. The platform is at a height of 2.4 feet, which is sufficient to detect the torso of a human adult. Readings are taken at 100 ms intervals. These

readings are analog voltage values, which must be converted to corresponding distance

measures before they can be used. The values can then be used by the robotic platform to

guide its motion, avoiding the object or the person, for example.

The algorithm uses data collected over 1 second (i.e., 10 readings from each of the 3 sensors) and, after initial filtering to remove out-of-range or background values, performs a distance based iterative histogram clustering on the data to obtain clusters in the distance domain. These clusters are then analyzed and, where possible, merged with other clusters in the time domain to form an observation corresponding to some object (either in motion or stationary). This is followed by feature extraction. Two features are used in this work: the y-t slope and the x-t slope of the straight lines fitted to the data points of the cluster in the y-t and x-t planes, respectively. The slopes of these two fits are used as features to classify the direction of motion of the person (or to decide that the person is stationary). The k-nearest neighbors algorithm is

used to perform this classification.

2.3.2. Related Work

In this work, infrared sensors have been used to identify the direction of motion of a single person walking in front of the infrared sensor array. The array is installed on a robotic platform. The motion information obtained by the sensors can then be used to avoid the person for safe navigation. The array is one-dimensional. Similar work has been carried out using other sensors such as pyroelectric infrared (PIR) sensors [53]. In [54], PIR sensors are used for human movement detection as well as human identification. In [55], distributed PIR sensors have been used to estimate the people count in an office. However, PIR sensors measure the light radiating from objects and have a wide field of view (up to 180°, or even 360° in some models), which is quite different from an infrared

sensor which has a very narrow field of view. This limitation of the infrared sensor makes

the problem more challenging; however, it is worth investigating because of the sensor's many

advantages such as low cost, low power, fast response rate and compactness. In addition,

a narrow beam gives rise to high resolution in detection of the objects. Multiple infrared

sensors will thus provide a wide view as well as high resolution in detection.

(a)

(b)

Figure 3. Infrared sensor array setup which is installed on the top of a robotic platform.
The platform co-ordinate system is shown, and data is measured w.r.t this co-ordinate
system (a) Each sensor is placed with a separation of 0.5ft on the platform (i.e. at 0.5ft, 1ft
and 1.5ft distances along the x-axis); (b) The platform is mounted at a height of 2.42ft
above the ground.

2.3.3. Methodology

Fig. 3 shows the arrangement of the sensors on the robotic platform. Fig. 4 is a plot of the

raw data obtained from the infrared sensors in the y-t domain with the colors representing

the data coming from each sensor. Each data point can thus be represented as a 3-D point (x, y, t), with ‘x’ being the lateral distance, ‘y’ the longitudinal distance (both expressed in

terms of sensor coordinates) and ‘t’ being the time. The background obstruction (a wall) is

present at around 12.5ft. Fig. 4(a) shows the data points when a person is moving away

from the sensor array and then walking towards it in a straight-line perpendicular to the

sensors. Fig. 4(b) shows the same person moving from left to right and then from right to

left across the sensor array in a straight line parallel to the sensors. Fig. 4(c) shows motion

across the sensor array in a diagonal line from right to left and then left to right, with the

line making an angle of 45◦ with the sensor array. From these plots, it is quite clear that

there is a distinct pattern that can be extracted for each type of motion. In Fig. 4(a), the

person was walking in front of the left and middle sensors, and that is why both these

sensors agree on the distance of the person. In Fig. 4(b), one can see the patterns near the

5ft mark clearly indicate the time order in which each sensor could spot the person, giving

us an idea of whether the person moved from left to right or vice versa. The pattern in Fig. 4(c), in

addition to capturing the direction like Fig. 4(b), also captures the diagonal aspect of

motion. This work aims to extract these patterns and classify the direction of motion based

on these patterns.

At first, background or out of range data is removed. Any value above 12ft has been

ignored. However, this value can also be learned by allowing the sensor to make a few

initial observations and estimate a background such as a wall. The data pre-processing step

is followed by calculating the histogram in the range (or longitudinal distance) domain.

Analyzing the histogram, regional peak values are found that correspond to probable object detections (each could be a stationary or moving object). These regional peak values give an initial estimate of where the clusters could lie. Based on these peaks,

clusters are obtained iteratively in the range domain with a constraint that if a range value

falls a certain threshold away from the peak value, a new cluster is formed. Tight clusters

are required whose standard deviation is within the threshold, which was chosen to be 2ft

for this work. This can be represented mathematically as follows: If there are ‘n’ regional

max values (initial cluster means) in the range domain that are obtained from the histogram

bin values, then data point d_present taken at time instant t_present is assigned to a cluster based

on

d_present ∈ cluster_i, 1 ≤ i ≤ n, where i = argmin_s |d_present − mean(cluster_s)|, s = 1, 2, …, n,
    if min_s |d_present − mean(cluster_s)| < cluster_separation_distance        % d_present added to the closest existing cluster
else, d_present ∈ cluster_i, where i = n + 1        % a new cluster is formed        (2)
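
A compact sketch of this histogram-guided clustering is given below; the 12 ft background cutoff and 2 ft separation threshold follow the text, while the 1 ft bin width and the simple local-maximum peak search are assumptions about details the text leaves open.

import numpy as np

def cluster_ranges(y_values, max_range=12.0, bin_width=1.0, sep_dist=2.0):
    """Histogram-based iterative clustering in the range (y) domain, Eq. (2)."""
    y = np.asarray(y_values, dtype=float)
    y = y[y <= max_range]                                   # drop background / out-of-range data

    # Histogram of the longitudinal distances and its regional (local) peaks.
    counts, edges = np.histogram(y, bins=np.arange(0.0, max_range + bin_width, bin_width))
    centers = (edges[:-1] + edges[1:]) / 2.0
    peaks = [centers[i] for i in range(len(counts))
             if counts[i] > 0
             and (i == 0 or counts[i] >= counts[i - 1])
             and (i == len(counts) - 1 or counts[i] >= counts[i + 1])]

    # Iteratively assign each point to the nearest cluster, or start a new one.
    clusters = [[p] for p in peaks]                         # seed clusters at the histogram peaks
    for d in y:
        means = [np.mean(c) for c in clusters]
        i = int(np.argmin([abs(d - m) for m in means])) if means else -1
        if i >= 0 and abs(d - means[i]) < sep_dist:
            clusters[i].append(d)                           # d added to the closest cluster
        else:
            clusters.append([d])                            # a new cluster is formed
    return clusters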

(a)

(b)

continued

Figure 4. Raw data collected from three infrared sensors. (a) Person walking away from
the platform and then towards it; (b) Person moving from left to right and then from right
to left across the platform in a straight line; (c) Person moving from right to left and then
from left to right diagonally.

Figure 4 continued

(c)

(a) (b)

(c) (d)

Figure 5. Plots showing the intermediate processing steps. (a) Data collected between the
1st and 2nd second; (b) Histogram of the range (Y) or longitudinal distance values; (c)
Clustering done in the range domain; (d) Range domain clusters merged in time domain
to form super clusters representing an object in motion or a stationary object.

Clusters obtained in the range domain are further merged together in the time domain,

where two neighboring clusters are joined if the time gap between the last recorded time instant of one cluster and the first recorded time instant of the other is less than 500 ms and the difference between the corresponding range values reflects the amount that an average human could have walked in that time interval. Thus, if there are ‘m’ clusters

obtained by clustering in the range domain, then,

cluster_k = cluster_i ∪ cluster_j, for all 1 ≤ i < j ≤ m
    such that time gap < 500 ms and range gap < time gap × 0.8        (3)
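
A sketch of this time-domain merging rule is given below; each cluster is assumed to be a list of (t, y) samples with time in seconds, so the 500 ms gap becomes 0.5, and the walk_factor constant mirrors the 0.8 of Eq. (3) without fixing its units.

import numpy as np

def merge_in_time(clusters, max_time_gap=0.5, walk_factor=0.8):
    """Merge neighbouring range-domain clusters in the time domain, Eq. (3).
    Each cluster is a list of (t, y) samples; time is assumed to be in seconds."""
    merged = [sorted(c) for c in clusters if c]             # sort each cluster by time
    merged.sort(key=lambda c: c[0][0])                      # order clusters by start time

    out = [merged[0]] if merged else []
    for cur in merged[1:]:
        prev = out[-1]
        time_gap = cur[0][0] - prev[-1][0]                  # gap between end of one and start of next
        range_gap = abs(cur[0][1] - prev[-1][1])
        if 0 <= time_gap < max_time_gap and range_gap < walk_factor * time_gap:
            out[-1] = prev + cur                            # join into one super cluster
        else:
            out.append(cur)
    return out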

At this point, clusters are obtained that could possibly represent one full motion direction

or maybe a stationary object. Fig. 5(a) shows the raw data for a one second interval. Fig.

5(b) shows the histogram computation over the longitudinal distance or ‘y’ values. Fig.

5(c) shows the clustering in the range domain. These range domain clusters are further

merged in time to form time domain clusters representing an object (in motion or

stationary) shown in Fig. 5(d).

These clusters are then used in the classification process as described. The data points in

each cluster are viewed in the x-t plane and the y-t plane. Straight lines are fitted to the

points in each of the two planes. The slopes of these lines are used as features to classify

the direction of the motion. The motion classes w.r.t. the robot coordinate system are: 1. In

front and away; 2. In front and towards; 3. Left to right straight line; 4. Right to left straight

line; 5. Left to right diagonal line; 6. Right to left diagonal line; 7. Stationary. Fig. 6 shows

a plot of the 2-D training data. From this plot, it can be observed that using these two

features, it will be possible to learn a well separated space for each of the motion directions

as well as a stationary object using some classification algorithm. The k-nearest neighbors (knn) algorithm has been used with k values of 1, 3 and 5. The knn algorithm is one of the simplest supervised classification algorithms. Given a training set, this algorithm stores the labeled examples and classifies a new pattern as belonging to the class most common among its k closest neighbors. This algorithm suffers if the dimensions of the training set

features are large. It also requires large storage. However, in this work, the dimension of

the feature vector is just 2, so knn is a reasonable choice.
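
The feature extraction and classification steps can be sketched as follows, assuming each super cluster is an array of (t, x, y) samples; the prototype slope pairs, the jitter and the use of scikit-learn's KNeighborsClassifier are illustrative stand-ins rather than the original training data or implementation.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def slope_features(cluster_txy):
    """Fit straight lines in the x-t and y-t planes of one super cluster
    (rows of (t, x, y)) and return the pair of slopes used as features."""
    t, x, y = cluster_txy[:, 0], cluster_txy[:, 1], cluster_txy[:, 2]
    return np.array([np.polyfit(t, x, 1)[0],      # x-t slope
                     np.polyfit(t, y, 1)[0]])     # y-t slope

# Illustrative training set: one slope-pair prototype per class (values guessed to
# mimic Fig. 6) with small jitter; the real features come from the 280 instances.
rng = np.random.default_rng(0)
prototypes = {1: (0.0, 0.9), 2: (0.0, -0.9), 3: (0.15, 0.0), 4: (-0.15, 0.0),
              5: (0.15, 0.4), 6: (-0.15, 0.4), 7: (0.0, 0.0)}
X_train = np.vstack([np.array(p) + 0.02 * rng.standard_normal((6, 2))
                     for p in prototypes.values()])
y_train = np.repeat(list(prototypes.keys()), 6)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# A synthetic left-to-right, straight-line cluster: x grows with t, y stays constant.
t = np.linspace(3.0, 4.0, 10)
cluster = np.column_stack([t, 0.5 + 0.16 * (t - 3.0), np.full_like(t, 5.0)])
print(knn.predict(slope_features(cluster).reshape(1, -1)))   # expected: class 3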

Figure 6. A plot of the two-dimensional feature space. Certain classes may overlap at the
boundaries. The classes are: 1. In front and away; 2. In front and towards; 3. Left to right
straight line; 4. Right to left straight line; 5. Left to right diagonal line; 6. Right to left
diagonal line; 7. Stationary.

2.3.4. Results

To evaluate this method, data has been collected for each class of motion (40 instances

each × 7 classes = 280 instances in total). Fig. 7(a-c) shows snapshots of a subject moving

from left to right in a straight line in front of the infrared sensor array. This is to illustrate

the data capturing procedure. Each of the infrared sensors is sensing the distance values

over 10 seconds (which should be sufficient for a person to make a move in any direction).

Table 1 shows a portion of the distance values captured by the sensors that correspond to

the actual motion of the person. Fig. 8(a) shows the raw data from time 3s – 4s plotted in

the y-t domain. This is the time when the person makes the actual move across the sensor

array and the distance readings in the rest of the 10 seconds are just wall/junk/out-of-range

values and hence not shown in the demonstration. Fig. 8(b) shows the histogram

computation to give us an initial estimate of the clusters which is further refined in Fig.

8(c-d). In this instance, the data is clean enough and the initial peaks obtained from the

histogram analysis are sufficient. Fig. 9(a-b) shows the final cluster representing the motion

in the x-t as well as y-t planes with straight lines fitted to the points to obtain the slopes.

Using these slope values, it can be verified in Fig. 6 that the feature vector does fall in the region for class 3 (left to right motion in a straight line). These two features are used as inputs to the knn classifier, which also labels this feature vector as belonging to class 3.

Time instant (s)    Left sensor (ft)    Middle sensor (ft)    Right sensor (ft)
2.5 12.52 11.19 11.19
2.6 12.91 11.34 12.91
2.7 12.91 11.98 12.52
2.8 11.65 15.01 15.01
2.9 12.71 12.71 12.71
3.0 4.69 9.79 11.82
3.1 4.46 12.71 14.74
3.2 4.18 12.34 14.48
3.3 4.13 4.13 12.91
3.4 7.62 4.35 13.76
3.5 12.52 12.52 12.52
3.6 12.52 4.34 12.52
3.7 11.98 11.98 11.98
3.8 13.53 11.82 11.82
3.9 11.98 11.98 11.98
4.0 12.71 12.52 12.52
4.1 12.16 11.98 14.74
4.2 12.34 12.34 12.34
4.3 12.16 10.91 12.16

Table 1. Data from each sensor representing motion in the left to right direction

(a) (b) (c)

Figure 7. (a-c) Person moving from left to right across the infrared sensor array which is
mounted on a robotic platform.

For building the training and testing feature vectors, the 280 data instances collected were

analyzed using the method described above to obtain 280 2-D feature vectors (x-t slope

and y-t slope). These were divided into training and testing data equally. 5-fold cross

validation has been performed on the training set, where the data was divided into 5 groups,

and four of them were used to train the knn classifier and one group was used for validation.

This was repeated 5 times, each time taking a different combination. The results of classification on the training data are shown in Table 2.

Figure 8. Plot showing data analysis for time period 3s – 4s. (a) Raw data capturing person’s motion; (b) histogram showing peak values; (c-d) clustering, with the one in red being the real cluster.

From Table 2, it can be observed

that the knn classifier achieves a high accuracy in performing the classifications

classification accuracy = correct classifications / total number of instances        (4)

(especially at k=5) and thus can reasonably be used on infrared sensor data to classify a person's direction of motion. Table 3 presents the confusion matrix for the knn classifier,

when k=5. From the matrix, it can be verified that classes 1 and 2, classes 3 and 4, classes


Figure 9. Plots showing straight lines fitting data points in (a) the x-t plane and (b) y-t
plane. The slope pair (0.1581, 0.0003371) is used as the feature vector for classification. It
can also be verified that this slope pair falls in the domain of class 3 as represented by the
feature space in Fig. 6.

From the matrix, it can be verified that classes 1 and 2, classes 3 and 4, and classes 5 and 6 are never confused. Some confusion might remain between classes 3 and 5 as well as classes 4 and 6, as each of these pairs essentially represents motion in a similar direction, the difference being that one is diagonal and the other is straight, thereby causing overlap. Similarly, if diagonal motion becomes nearly straight and perpendicular, the knn classifier might confuse diagonal and perpendicular motions, as in classes 1 and 6 and classes 2 and 5. Stationary objects are classified accurately as well.

Classifier Accuracy (%)

KNN, k=1 91.2

KNN, k=3 93

KNN, k=5 93.6

Table 2. Classification Accuracy

Class1 Class2 Class3 Class4 Class5 Class6 Class7

Class1 20 0 0 0 0 0 0

Class2 0 20 0 0 0 0 0

Class3 0 0 19 0 1 0 0

Class4 0 0 0 18 0 1 1

Class5 0 0 2 0 18 0 0

Class6 1 0 0 3 0 16 0

Class7 0 0 0 0 0 0 20

Table 3. Confusion Matrix for KNN, k=5 (Predicted classes shown in columns, actual
classes shown in rows)
2.3.5. Conclusion

In this work, infrared sensors have been applied to tasks more complex than obstacle

detection, such as finding out the direction of motion of a moving object in front of them. This can assist the robot in achieving collision-free navigation. From the results, it can be seen that the k-nn algorithm is able to identify the direction of motion, or the absence of motion, correctly, achieving a high accuracy of up to 93%. It takes around 0.06 seconds to process the reading from a sensor, thus making it suitable for real-time operation.

This method works for humans of varying sizes, as long as they are detected by the sensors. Since the sensors are placed quite close together, it can be guaranteed that a person will not go undetected. Also, because range data is collected from each infrared sensor at a high rate (10 readings per second), even a person walking very fast can be detected. Slow motions do not affect the model either.

Future challenging research in this field would be to extend this work to be able to detect

more than one person in motion in front of the infrared sensor array. However, one must

keep the limitation of the infrared sensor in mind and understand that the persons need to

be considerably separated to obtain accurate information. Another important direction

would be to perform the same activities when the infrared sensor array mounted over the

robotic platform is itself in motion. This would be challenging because the data collected while the sensor is moving would be considerably noisier.

One could combine other sensors such as ultrasonic sensors, cameras, depth cameras, etc.
and involve sensor fusion to integrate the multi-dimensional information. Ultimately, the

goal is to use the infrared sensors to achieve something more than merely obtaining a

distance estimate.

2.4. Using an A-Priori Learnt Motion Model with Particle Filters for Tracking a
Moving Person by a Linear Infrared Array Network

2.4.1. Introduction

The aim of this work is to extend the work presented in Section 2.3 to detect and track a person using particle filters with a modified motion model. This work introduces

a robust method to extract the peak values corresponding to human motion from noisy

sensor data. An algorithm based on a sliding window approach is used to slide over the

infrared sensor returns and extract values corresponding to peaks caused by a person

appearing in front of the infrared sensor using an SVM based detector. These peak values

extracted are fed to a particle filter to track the moving person. The second contribution is

the use of the feedback from a proportional controller, which is based on the difference

between the particle filter predicted values and the observed values, to update the

parameters of the motion model. A series of 10 infrared sensors was arranged in a linear array on a stationary platform, with a separation of more than 15 cm between adjacent sensors. A person walked across the sensors following a straight line with slight deviations, and data was recorded simultaneously by each of the sensors in order to predict the person's track. The person's speed averaged between 1 and 3 ft/s. Simulated data has

also been used for testing purposes.

2.4.2. Related Work

In the literature, the use of these low-priced infrared sensors in combination with other sensors for tracking is limited. Among the interesting works is [56], where infrared

sensors have been used with electro-optical sensors for detection and tracking of moving

objects. In addition, pyroelectric infrared sensors have been used for person tracking [57].

Prior research has been done to experiment with learning models, such as in [58], where the motion parameters are learnt by switching among a multi-model system using particle

filters. In [59], a probabilistic motion model is learnt based on the motion of the target as

observed by a camera in the learning phase, instead of using an initial empirical distribution

for the motion model. This work focuses on tracking an object moving across an infrared

sensor array in a simple linear manner using particle filters with a motion model that

updates its parameters based on the feedback of a proportional error controller. The

controller uses the difference in the position estimation versus the position observation to

steer the particle filtering algorithm in the right direction.

2.4.3. Methodology

Infrared sensors are sensitive to noise and therefore a robust method to extract data points

from the sensor needs to be developed if this information has to be leveraged elsewhere.

The data from the infrared sensor has values fluctuating around a mean point. Also, during

a change of depth value, when an obstacle appears in the view of a sensor, a point in

between the ranges could also be reported. In this work, an algorithm is presented, that uses

a sliding window to go over the data points reported by the sensor and predict one value

for that sliding window or ignore that window if the points appear to be too noisy. The

logic behind this algorithm is that within the time duration of one sliding window, which

is typically around 0.1 to 0.5 seconds, the values reported by the infrared sensor should not

be very different. The algorithm starts by filtering out the out of range values and then uses

the standard deviation of the points to decide whether to retain that set of points or disregard them. A threshold is used which is estimated from trial runs where the sensor reports almost consistent values for the same depth of the object. If the data points in the sliding window have a standard deviation less than the threshold, it indicates that there is a higher

possibility that the data points relate to the same obstacle, and thus a histogram is computed

over the points and the value of the bin that has the highest concentration of observations

is chosen to represent the depth value for that sliding window. A simple averaging of the

points within a sliding window that passes the threshold test was also tried; however, it did not perform as well as the histogram approach. This is probably because some points in the sliding window can be slightly off from the true value, while the true-value points occur at a higher concentration. Experiments showed that it is better to use the most concentrated value than to take an average.

This is followed by finding the gradients of the depth values over time and flagging any

point as a peak point if the gradient value at that point is greater than a chosen peak

threshold value. This work focuses on finding sudden changes in the depth values

reported, which ideally would correspond to an object moving across the sensor. However,

false positives also appear, especially when sudden noisy detections dominate the majority

of the observations in the sliding window.

Algorithm

The algorithm uses a sliding window of size 'n' with an overlap of 'n-1' to extract points which satisfy a standard deviation threshold and is given as follows (a brief code sketch follows the listing):

At any time instant t = n, n+1, ..., for each sensor in the array:

(i) use a sliding window of length 'n' to obtain 'n' observations recorded by the sensor

(ii) filter out observations which are above the maximum range, 'max', or below the minimum range, 'min', of the infrared sensor

(iii) if the standard deviation of the remaining observations is within a threshold 's', analyze the distribution of the data using a histogram, choose the bin that has the highest concentration of observations and add those points to the set of valid observations for this sensor; otherwise, disregard this set of observations

(iv) compute the gradient of the distances w.r.t. time for the function F(t), i.e. ∇F = δF/δt

(v) flag the observation as a 'peak point' if the distance gradient at that point is greater than the chosen peak threshold value
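A minimal Python sketch of this sliding-window extraction for a single sensor's distance stream is shown below; the window length, range limits, bin count and thresholds are illustrative placeholders rather than the exact settings used in the experiments.

import numpy as np

def extract_peak_points(readings, n=5, min_range=1.5, max_range=18.0,
                        std_thresh=0.3, peak_thresh=2.0, bins=5):
    """Slide a window of length n over one sensor's readings (in feet) and
    return the time indices flagged as peak points."""
    depths, idx = [], []                       # one representative depth per valid window
    for t in range(n, len(readings) + 1):
        window = np.asarray(readings[t - n:t], dtype=float)
        window = window[(window >= min_range) & (window <= max_range)]
        if window.size == 0 or window.std() > std_thresh:
            continue                           # too noisy: disregard this window
        counts, edges = np.histogram(window, bins=bins)
        b = counts.argmax()                    # bin with the highest concentration
        in_bin = window[(window >= edges[b]) & (window <= edges[b + 1])]
        depths.append(in_bin.mean())
        idx.append(t - 1)
    if len(depths) < 2:
        return []
    grads = np.abs(np.gradient(np.asarray(depths)))   # gradient of depth w.r.t. time
    return [idx[i] for i in np.flatnonzero(grads > peak_thresh)]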

If a peak detected by sensor i at time t corresponds to motion, then in the absence of other obstacles, there is a high probability that peaks at similar range values should be detected by the neighboring sensors i+1 or i-1 (or even i, depending on the direction of motion) at time instants prior to t. On the contrary, the probability that a sensor's invalid peak observation will be corroborated by its neighboring sensors is low. This is very important for modelling a method to discard the false positives while retaining the peaks caused by motion.

A method was developed to classify between the two types of peaks (peaks due to true

motion of an object across the sensor versus peaks due to consistent noisy data) reported

by the algorithm above. For any peak detected, the peaks recorded by the neighboring

sensors are collected and 4 geometric features are computed, which are the total number of

peaks reported by the neighboring sensors, the standard deviation of the peaks, the average

distance between the peaks and the sum of squared error of points that fit the line joining

the peaks. These features are then fed as input to a Support Vector Machine (SVM)

classifier to classify whether the peak is caused by true motion or not.
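The feature computation for a candidate peak can be sketched as follows; neighbor_peaks is assumed to hold the (time, distance) peaks reported by the neighboring sensors for the peak under test, and the classifier is a standard scikit-learn SVC trained on labelled peaks (the training arrays X and y are assumed to exist).

import numpy as np
from sklearn.svm import SVC

def peak_features(neighbor_peaks):
    """Return the 4 geometric features described above for one candidate peak.
    neighbor_peaks: list of (time, distance) peaks from the neighboring sensors."""
    pts = np.asarray(neighbor_peaks, dtype=float)
    count = len(pts)
    if count < 2:
        return np.array([count, 0.0, 0.0, 0.0])
    std = pts[:, 1].std()                                           # spread of peak distances
    avg_gap = np.linalg.norm(np.diff(pts, axis=0), axis=1).mean()   # mean gap between peaks
    slope, intercept = np.polyfit(pts[:, 0], pts[:, 1], 1)
    sse = np.sum((pts[:, 1] - (slope * pts[:, 0] + intercept)) ** 2)  # line-fit error
    return np.array([count, std, avg_gap, sse])

# clf = SVC(kernel="rbf").fit(X, y)          # X: feature rows, y: 1 = true motion, 0 = noise
# is_motion = clf.predict([peak_features(neighbor_peaks)])[0]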

The first detection of object motion is used to initialize a particle filter with the location

values and provide the velocity components in the x and y directions respectively to build

the initial motion model. As the target moves, more peaks are detected which are fed as

sensor measurements to the particle filter algorithm.

However, a linear motion model which is often used with particle filter based tracking does

not work well for infrared sensor based data. This is because in the presence of noisy or

missed data, the predictions for the motion update made at each step can lead to small

errors which, if accumulated over time, can put the prediction completely off track and necessitate restarting the particle filter. Therefore, to improve the estimation, it is

suggested, that in the velocity update step, instead of adding random noise, an informed

guess is made about the velocity based on the difference in the position estimated by the

motion model and the sensor, so that at every step, the motion model can make predictions

closer to the actual location and minimize errors. During the velocity update, this difference

is multiplied by a coefficient and the term is either added to or subtracted from the velocity

depending on whether the motion model prediction lags the observation or is ahead of it.

Thus, the particle filter algorithm switches to a faster velocity model if it is behind the

observed value and if it is ahead of the observed value, then it slows down the prediction

value by choosing a slower model.

The particle filter algorithm is as follows:

• Initialize the particle filter with initial peak corresponding to motion as the starting

point and use the computed velocity components for the motion model.

• Generate N initial particles around the initial point following a normal distribution.

• For all later time instants,

(i) Update the particles according to the motion model.

(ii) Compute the weight of each particle

(iii) Resample the particles so that particles with higher weights are sampled

more frequently than those with lower weights.

(iv) Instead of making a random selection for the velocity components, update

the velocity components based on the feedback of an error based

proportional controller, in effect switching between faster and slower

paced models.

• The model update parameters are given as:

difference = √((obs_x − x)² + (obs_y − y)²)    (5)

Ẋ_t = Ẋ_{t-1} + coeff · difference,   if (obs_x − x) > Ẋ_0/2
    = Ẋ_{t-1} − coeff · difference,   if (obs_x − x) < −Ẋ_0/2    (6)

Ẏ_t = Ẏ_{t-1} + coeff · difference,   if (obs_y − y) > Ẏ_0/2
    = Ẏ_{t-1} − coeff · difference,   if (obs_y − y) < −Ẏ_0/2    (7)

where Ẋ_0 and Ẏ_0 are the initial velocity components in the x and y directions respectively, obs_x and obs_y are the observations recorded by the sensor, and x and y are the predictions of the motion update model.
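A direct sketch of Eqs. (5)-(7) is given below; the variable names are illustrative and the default coefficient value is only a placeholder (the results section experiments with a range of coefficient values).

import math

def update_velocity(vx, vy, x, y, obs_x, obs_y, vx0, vy0, coeff=0.6):
    """Proportional-feedback update of the motion model velocities, Eqs. (5)-(7).
    (x, y): motion-model prediction; (obs_x, obs_y): sensor observation;
    vx0, vy0: initial velocity components."""
    difference = math.hypot(obs_x - x, obs_y - y)      # Eq. (5)
    if (obs_x - x) > vx0 / 2:                          # prediction lags the observation
        vx += coeff * difference
    elif (obs_x - x) < -vx0 / 2:                       # prediction is ahead of the observation
        vx -= coeff * difference
    if (obs_y - y) > vy0 / 2:
        vy += coeff * difference
    elif (obs_y - y) < -vy0 / 2:
        vy -= coeff * difference
    return vx, vy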

2.4.4. Results

In order to evaluate the proposed algorithm to robustly extract peak points corresponding

to motion detection, data is recorded using a linear array of 10 infrared sensors mounted

on a flat platform. A person is allowed to move across the sensor array network following

a straight line with small deviations. Fig. 10 shows a sample run from two neighboring

sensors, when the person walks across the sensors in one direction and then returns in the opposite direction.

Figure 10. Color coded peaks because of an object in motion as detected by the SVM
classifier. Other peaks are also noticed; however, they are created by inconsistent data and
are discarded by the SVM as true negatives.

Table 4. Confusion matrix for SVM classifier

The peaks, which are basically the distance gradients, are circled. The black circled ones represent the peaks caused when the person walked in front

of the sensors from the first sensor to the second one respectively. The green circled ones

are when the person crossed the second and then first sensors respectively. The negative

peaks are basically gradients when the person was last detected by the sensors. (The

background is at a much farther distance from the scene of motion).

For each set of data recorded, the sliding window based algorithm extracts the peak points

and also computes for each peak point, the set of 4 features as discussed in Section 2.4.3.

to be fed as inputs to the SVM classifier to classify the points as peaks created due to true

motion across the sensor or peaks created by noisy data. The confusion matrix for the classification is shown in Table 4. The SVM classifier does a good job at distinguishing

between the two types of peaks reported, with a true positive rate of 92% and a low false

negative rate of 8%. A good rate for true detections ensures the smooth running of the

particle filters as points representing the target are obtained consistently. The false positive

rate is 15%.

The other contribution to this work is on the modification of the motion update model of

the particle filter algorithm. To test this, both simulated data as well as recorded data from

the infrared sensors has been used. For the simulation, 2-D points were sampled, enacting

motion along a straight line. White noise was added to the sampled points along with some

random peaks representing noise. Each recording had 100 observations mimicking an array

network of 100 infrared sensors, however, for the real data, only 10 infrared sensors have

been used. The approach has been tested on 30 sets of simulated data and 10 sets of real

life data. The particle filter’s motion model is updated based on information about the error

between the recorded observation and the predicted value scaled by a coefficient.


Figure 11. (a)-(b) A traditional particle filter with 500 particles is used to track the infrared
sensor simulated data with an average position estimation error of 0.9008 ft.


Figure 12. (a)-(b) A particle filter with 500 particles and a continuously updated motion
model with coefficients of 0.5(Fig. 12(a)) and 0.6(Fig. 12(b)) respectively is used to track
the infrared sensor simulated data with an average position estimation error of 0.35ft.

The filter uses 500 particles, and a range of values has been experimented with for the motion model coefficient. For each dataset, both the traditional particle filter as

well as this approach was tested. In Fig. 11(a) and 11(b), the traditional filter does manage

to follow the track of the target with some vertical offset, however it never catches up with

the target. The average position estimate error is around 0.9 feet for the simulated data,

whereas for the approach with feedback, the filter follows the target closely and has an

average estimate error of 0.4ft as in Fig. 12(a) and 12(b). Fig. 13 (a) shows the traditional

particle filter deviating from the actual trajectory on the data recorded by the infrared sensor

array, with an average position estimate error of 0.6ft whereas in Fig. 13(b), the filter using

feedback makes an average error of 0.3ft. It has been found that by leveraging the

information about the error between the estimated state of the object and the observation

as reported by the sensor, more accurate subsequent predictions can be made, reducing the
average position estimation error by almost 50%. In some cases, when the traditional

particle filter loses track of the object because the predictions tend to drift away slowly

from the true values, the feedback provided in this approach helps to guide the particle

filter towards the true observation values. Fig. 14 shows a plot of the average position error

estimation versus the coefficient value used for the motion model update. It can be

observed that a coefficient value around 0.6 is effective in tracking the motion of the

object. Lower or higher coefficients will introduce higher tracking errors.


Figure 13. True position versus estimation. (a) A traditional particle filter with a fixed
linear model tracks object on real infrared sensor data with an average position estimation
error of 0.64ft (b)A particle filter receiving feedback from the controller regarding position
estimation error tracks the object on real infrared sensor data with an average position
estimation error of 0.34ft.

Figure 14. A plot showing the average error in the position estimation of the object at
different coefficient values for the motion model update parameters. For these runs, a
coefficient of 0.6 produced good results.

2.4.5. Conclusion

In this work, infrared sensors have been used for tracking the motion of an object along a

linear sensor array network. A robust method to extract points corresponding to the motion

of a moving object has been proposed. The use of a learning method to distinguish between

peaks corresponding to true motion versus peaks corresponding to sudden noise spikes gives better performance than a heuristic based approach and can identify peaks accurately around 92% of the time. The second contribution is the modification of the motion model

update of the particle filters. The particle filter is supplied with valuable feedback from the

proportional error controller which updates the motion model parameters accordingly and

has given, on average, at least 50% more accurate location estimations over 30 test runs on simulated data as well as 10 test runs on infrared sensor data versus using a fixed motion

model. In effect, switching between faster and slower models to keep track of the person in the presence of noisy data does a better job than using one fixed model.

To develop this work further, the feature set for the peaks could be modified and the missed detection rate could be reduced. Another interesting direction would be to test the

approach on complex motion patterns, especially to and fro motions or free form motion.

Also, the feedback based motion model update can be tested on data from another sensor

such as tracking using a camera, whose data properties, when applied to this task, are

different from those of an infrared sensor in terms of data continuity, frequency and accuracy. Ultimately, the infrared sensor can also be integrated with the

camera sensor to improve tracking of a moving object using the modified motion model

with particle filters.

2.5. An Infrared Sensor Guided Approach to Camera Based Tracking of Erratic
Human Motion

2.5.1. Introduction

Tracking a human being in motion is an important research topic in robotics or computer

vision. Substantial research has been conducted to address this issue; however, as Yilmaz et al. state in [1], researchers mostly simplify tracking by constraining the motion of the

object, assuming that the object motion will be smooth and that the object will not make any random

abrupt change in its direction. Further constraints are made to assume either fixed velocity

or fixed acceleration and motion models are constructed to resemble these types of motion.

This work accounts for the fact that the object to be tracked can make unpredictable changes during its motion. To address this, it uses two trackers based on the two types of sensors used, an infrared sensor array and a camera, identifies the point of failure of the camera tracker, and recovers the tracking using input from the infrared sensor tracker. A

purely camera based object tracking system can encounter failures if the object size gets

smaller, or if the appearance model changes, or in case of a robot following a person who

is going out of focus of the camera, an incorrect location estimation may not make the robot

turn in the proper direction. Also, when the object makes sudden turns and the camera

tracker errs, predicting the position of the object can get complicated and time consuming

if one has to make an exhaustive search of the visual space to detect the object. On the

other hand, the infrared sensor array measures distance estimates of objects in front of it at

a sufficiently high sampling rate, enabling it to detect sudden changes in the position of the

objects quite accurately. However, these sensors do not provide any information about the

appearance of the object, and thus cannot validate whether a detection would belong to the

object to be tracked or anything else, but when these sensors are used in conjunction with

a camera, these detections can guide the camera to reduce the search space for observations

in an image and can also help to detect change in motion directions and thereby help restart

the camera tracker in case it gets lost. Another drawback of a purely camera based tracker

is that it does not provide 3-D information about the scene, so depth values are not obtained. Integrating an infrared sensor array with the camera therefore also provides 3-D information, which can be used in applications such as a person following robot to maintain a safe following distance and to increase or decrease the speed of the

robot as desired. Additionally, infrared sensors are extremely low in cost when compared to other, more accurate, distance measuring sensors such as radar or LIDAR, and they also have a narrower field of view than ultrasonic sensors, which makes localization of an object easier. Thus, if the aim of the application is to provide tracking in indoor areas while keeping the system affordable to the common public, the combination of a camera and infrared sensors sounds feasible.

2.5.2. Related Work

Using distance measuring IR sensors to aid object tracking is a largely unexplored area and

this work aims to address this issue. This work also aims to make use of IR sensors to aid

a camera based tracker in failure detection and recovery. In the research literature,

appearance or motion characteristics have been used for detecting failure by comparing to
reference features [60], or by comparing trajectories [61]. A time reversed Markov process

is used in [62] to identify failed trackers and perform recovery. This work combines inputs

from the IR tracker with appearance features of the object to detect failure, recover and

restart the tracking process.

2.5.3. Methodology

In this work, observations are dealt with in the frame of reference of the camera and therefore

the real-world distances obtained by the infrared sensors have to be mapped to the pixels

of the image captured by the camera. The sensor system is placed on a robotic platform. A

total of 5 infrared sensors have been used. Three are positioned facing the front direction and the remaining two are placed sideways at an angle of 70° to the front direction. The camera is

placed at a height of approximately 3.8ft above the ground and 50cm behind the infrared

sensors as shown in Fig. 15. The camera's roll, pitch and yaw have been set to zero. The infrared sensor measurement (x_ir, y_ir, z_ir) is transformed to the frame of reference of the camera, (x_cam, y_cam, z_cam). Using the mapping for a rectilinear lens, the radial position

(angle) of the point on the image is found, which is given by

𝑅 = 𝑓 ∗ tan(𝜃) (8)

where f is the focal length in mm (or pixels) and θ is the angle in radians (or degrees)

between a point in the real world (𝑥𝑐𝑎𝑚 ,𝑦𝑐𝑎𝑚 , 𝑧𝑐𝑎𝑚 ) and the optical axis of the camera.

The angle that the line joining the center of the image and the projected point on the image

makes with the image axis is given by

∅ = tan−1(𝑦𝑐𝑎𝑚 ⁄𝑧𝑐𝑎𝑚 ) (9)


Figure 15. Infrared sensor setup with camera (a)The coordinate system (b) Distance
measurements shown on the robotic platform

Following the spatial coordinate system, the corresponding projection of the infrared

sensor detected points on the image is given by

𝑦𝑝𝑟𝑜𝑗 = (−𝑅 ∗ cos(∅) + 𝑦𝑐𝑒𝑛𝑡𝑒𝑟 ) (10)

𝑥𝑝𝑟𝑜𝑗 = (−𝑅 ∗ sin(∅) + 𝑥𝑐𝑒𝑛𝑡𝑒𝑟 ) (11)

where 𝑥𝑐𝑒𝑛𝑡𝑒𝑟 and 𝑦𝑐𝑒𝑛𝑡𝑒𝑟 are the x and y spatial coordinates of the center of the image.
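A sketch of this projection is given below; because the exact axis convention is fixed by Fig. 15, the code assumes that x_cam points along the optical axis, and the focal length and image center are placeholder values rather than calibrated parameters.

import math

def project_ir_point(x_cam, y_cam, z_cam, f=600.0, x_center=320.0, y_center=240.0):
    """Project a point already transformed into the camera frame onto the image,
    following Eqs. (8)-(11).  f is the focal length in pixels (placeholder value)."""
    theta = math.atan2(math.hypot(y_cam, z_cam), x_cam)   # angle from the assumed optical axis
    R = f * math.tan(theta)                               # Eq. (8): radial position on the image
    phi = math.atan2(y_cam, z_cam)                        # Eq. (9)
    y_proj = -R * math.cos(phi) + y_center                # Eq. (10)
    x_proj = -R * math.sin(phi) + x_center                # Eq. (11)
    return x_proj, y_proj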

In this research, the infrared sensor has been made responsible for tracking the irregularities

of the human motion, that is when the object makes an unpredictable turn in a different

direction, or when it remains static for some time and then moves suddenly. Most of the

tracking algorithms developed generally put a constraint on the object motion and assume

a relatively simple non-linear track with no sudden turns, etc. Such motion is usually

represented by a constant velocity model or a constant acceleration model. However, these

models will not be able to represent motion where the object executes random turns. An

infrared sensor, on the other hand, continuously returns distance estimates of the objects in

front of it and thus can extract a sudden detection made by one of the sensors indicating

that an object might have suddenly moved into its field of view. The job of this secondary

infrared tracker is to track these detections using particle filters; however, none of the conventional motion models is used. Instead, in this work, inspired by Brownian motion, a type of random walk model [63] has been used, but with a fixed

range for speed, which is termed the omnidirectional motion model. In this model, the state

evolves following a randomly guessed speed inside a fixed range and a randomly guessed

direction. The speed v ranges from 0 to 45 cm/s and the direction θ lies between 0° and 360°.

Thus, this distribution ensures that there are particles representing motion in any direction

at any speed, including the condition that the object is at rest. The state vector is represented

as 𝑋_𝑖𝑟𝑘 = {𝑥_𝑖𝑟𝑘 , 𝑦_𝑖𝑟𝑘 } , where 𝑥_𝑖𝑟𝑘 is the distance between the sensor and the object

and 𝑦_𝑖𝑟𝑘 is the lateral distance. The motion model is given by

𝑥_𝑖𝑟𝑘 = 𝑥_𝑖𝑟𝑘−1 + 𝑣 ∗ cos(𝜃) (11)

𝑦_𝑖𝑟𝑘 = 𝑦_𝑖𝑟𝑘−1 + 𝑣 ∗ sin(𝜃) (12)

The likelihood model is given by the Gaussian density and is chosen as

p(z_ir_k | X_ir_k) ∝ exp(−d² / 2σ²)    (13)

where 𝑑 is the Euclidean distance between the observed point and the sample particle and

𝜎 specifies the Gaussian noise in the measurements.
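A compact sketch of the omnidirectional motion model and the likelihood of Eq. (13) is given below, with the N particles stored as an (N, 2) array of (x_ir, y_ir) values; the noise level sigma and the time step are placeholders.

import numpy as np

rng = np.random.default_rng(0)

def propagate_omnidirectional(particles, v_max=45.0, dt=1.0):
    """Eqs. (11)-(12): move each particle with a random speed in [0, v_max] cm/s
    and a random direction in [0, 2*pi)."""
    n = len(particles)
    v = rng.uniform(0.0, v_max, n) * dt
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    particles[:, 0] += v * np.cos(theta)
    particles[:, 1] += v * np.sin(theta)
    return particles

def likelihood_weights(particles, z_ir, sigma=10.0):
    """Eq. (13): Gaussian likelihood based on the Euclidean distance to the observation."""
    d2 = np.sum((particles - np.asarray(z_ir, dtype=float)) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()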

The tracking starts with the initial location specified by the user. The state of the system is

updated according to Eq. (11-12) and when new observations are received by the infrared

sensor array, a gating technique is used to filter out observations beyond a particular range

of the current location of the object. The final observation,𝑧_𝑖𝑟𝑘 is used to update the prior

distribution. Weights are assigned based on Eq. (13). The residual resampling method has

been used [64]. The mean of the posterior distribution is output as the estimated location

by the infrared tracker. This output is used in the primary tracker to modify the weight of

the particles and to also reinitialize the primary tracker in the event that it loses the object.

The camera based tracker is used as the primary tracker in this work and its function is to

perform tracking of the object when it is executing simple non-linear motion and to

reinitialize itself with inputs from the secondary infrared sensor tracker. For tracking

purposes, a bounding box representing the object is chosen by the user which is rectangular

and is fixed in size and is characterized by the state vector at time k as X_c_k = {x_c_k, y_c_k, ẋ_c_k, ẏ_c_k, ẍ_c_k, ÿ_c_k}, where x_c_k, y_c_k are the coordinates of the center of the bounding box, ẋ_c_k, ẏ_c_k are the respective velocities and ẍ_c_k, ÿ_c_k are the respective accelerations. A

constant acceleration model represents the state evolution and is given by

𝑋_𝑐𝑘 = 𝐹 ∗ 𝑋𝑐 𝑘−1 + 𝑣𝑘−1 (14)

where F is the state transition matrix given by

      [ 1   Δt   Δt²/2 ]
F =   [ 0   1    Δt    ]    (15)
      [ 0   0    1     ]

and 𝑣𝑘−1 is the process noise assumed to be white, zero mean and Gaussian.
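Applied per axis to the six-dimensional state above, Eqs. (14)-(15) amount to the following sketch; the process noise level is a placeholder and the function names are illustrative.

import numpy as np

def transition_matrix(dt):
    """3x3 block of Eq. (15) acting on [position, velocity, acceleration] for one axis."""
    return np.array([[1.0, dt, 0.5 * dt ** 2],
                     [0.0, 1.0, dt],
                     [0.0, 0.0, 1.0]])

def propagate_constant_acceleration(state, dt, noise_std=1.0):
    """Eq. (14): X_c_k = F * X_c_{k-1} + v_{k-1}, for state [x, y, vx, vy, ax, ay]."""
    F = transition_matrix(dt)
    x_new = F @ np.array([state[0], state[2], state[4]])   # x, vx, ax
    y_new = F @ np.array([state[1], state[3], state[5]])   # y, vy, ay
    propagated = np.array([x_new[0], y_new[0], x_new[1], y_new[1], x_new[2], y_new[2]])
    return propagated + np.random.normal(0.0, noise_std, 6)  # white, zero-mean Gaussian noise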

Normalized color histograms [65] and normalized Histogram of Oriented Gradients (HOG)

features [66] are employed to build feature vectors for the selected region and make it the

reference model. Based on the state evolution model, the particles are propagated to their

new predicted positions and upon receiving a new image, patches are extracted around the

predicted positions and a feature vector is computed for each extracted patch. The

Bhattacharyya distance between the feature vector of a sample patch and that of a reference

patch is computed and is used to assign weights to each particle.
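The Bhattacharyya distance used for this comparison, for two normalized histograms p and q, can be written as a small helper:

import numpy as np

def bhattacharyya_distance(p, q):
    """d(p, q) = sqrt(1 - rho(p, q)), where rho = sum(sqrt(p_i * q_i))
    is the Bhattacharyya coefficient of the two normalized histograms."""
    rho = np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))
    return float(np.sqrt(max(1.0 - rho, 0.0)))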

Additionally, the infrared sensor tracking system provides an estimation of the object’s

position as projected on the image. This is an important term when assigning weights to

the particles as patches which contain the projected point will have a higher weight than

patches which do not contain the secondary tracker’s estimated point. To construct the

likelihood model, it is assumed that the color and HOG features as well as the distance

estimates provided by the infrared sensor are independent of each other. Therefore, the

overall likelihood is the product of the separate likelihoods.

The integration of the infrared system plays a significant role in recovery after detection of

failure. To address this, two metrics have been introduced, ir_camera_overlap and patch_matches, whose values determine how the likelihood model is formed. These are defined as

ir_camera_overlap = fraction of the extracted patches that contain the infrared sensor projected point    (16)

patch_matches = fraction of the extracted patches that closely match the reference patch    (17)

For computing the metric patch_matches, a threshold, thresh_match_ref, is used to determine if a patch closely matches the reference patch. Thresholds thresh_ir_camera_overlap and thresh_patch_matches are used to test the above metrics as detailed below. If ir_camera_overlap is greater than thresh_ir_camera_overlap, indicating sufficient overlap, and patch_matches is greater than thresh_patch_matches, the likelihood function is defined as

p(z_c_k | X_c_k) ∝ exp(−d_Color(Ĉ_k, Ĉ_ref)² / 2σ₁²) · exp(−d_Hog(Ĥog_k, Ĥog_ref)² / 2σ₂²) · exp(−d_Euc(X_c_k, X_ir_k)² / 2σ₃²)    (18)

where d_Color(Ĉ_k, Ĉ_ref) = √(1 − ρ(Ĉ_k, Ĉ_ref)) is the Bhattacharyya distance between the color histograms, d_Hog(Ĥog_k, Ĥog_ref) = √(1 − ρ(Ĥog_k, Ĥog_ref)) is the Bhattacharyya distance between the HOG features, d_Euc(X_c_k, X_ir_k) = √((x_c_k − x_ir_k)² + (y_c_k − y_ir_k)²) is the Euclidean distance between the center of the current patch and the infrared sensor estimated object position at time k, Ĉ_k and Ĥog_k are the normalized color histograms and HOG features respectively for the current patch centered at (x_c_k, y_c_k), Ĉ_ref and Ĥog_ref are the normalized color histograms and normalized HOG features for the reference patch, ρ is the Bhattacharyya coefficient, and σ₁, σ₂, σ₃ specify the Gaussian noise in the measurements.

On the other hand, if a significant number of patches do not contain the infrared sensor projected point, it means that there is a disagreement between the two sensors. However, if there are sufficient patches which match closely with the reference patch, that is, if patch_matches is greater than thresh_patch_matches, the infrared sensor tracker's data is ignored and the algorithm relies on the primary camera tracker's data only. Thus, the
likelihood function would be the product of the likelihoods of the color and HOG features

and is given as

p(z_c_k | X_c_k) ∝ exp(−d_Color(Ĉ_k, Ĉ_ref)² / 2σ₁²) · exp(−d_Hog(Ĥog_k, Ĥog_ref)² / 2σ₂²)    (19)

where terms have the usual meaning as stated above. The object is classified as lost if the

extracted patches have low similarity to the reference patch, or if patch_matches is less than thresh_patch_matches. In such a case, the algorithm relies on the distance estimates provided by the infrared sensor (if these estimates belong to an object) and iteratively checks whether patches generated around the infrared-sensor-reported distance estimate match the reference patch. When a high match score, exceeding thresh_restart, has been found,

the primary tracker is reinitialized.
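A compact sketch of this weighting logic is shown below; treating the two metrics as fractions of the extracted patches follows the reconstruction of Eqs. (16)-(17) above (an assumption), and the threshold and noise values are placeholders rather than the settings used in the experiments.

import numpy as np

def camera_particle_weights(color_d, hog_d, euc_d, contains_ir_point, matches_ref,
                            thresh_overlap=0.3, thresh_matches=0.2,
                            s1=0.2, s2=0.2, s3=30.0):
    """Assign camera-tracker particle weights following Eqs. (18)-(19).
    color_d, hog_d, euc_d: per-patch distances (NumPy arrays);
    contains_ir_point, matches_ref: boolean arrays over the same patches."""
    ir_camera_overlap = contains_ir_point.mean()        # fraction of patches, per Eq. (16)
    patch_matches = matches_ref.mean()                  # fraction of patches, per Eq. (17)
    if patch_matches < thresh_matches:
        return None                # object considered lost: IR-guided recovery takes over
    w = np.exp(-color_d ** 2 / (2 * s1 ** 2)) * np.exp(-hog_d ** 2 / (2 * s2 ** 2))
    if ir_camera_overlap > thresh_overlap:
        w = w * np.exp(-euc_d ** 2 / (2 * s3 ** 2))     # Eq. (18): include the IR proximity term
    # otherwise Eq. (19): colour and HOG cues only
    return w / w.sum()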

Algorithm

1. Initialization:

Select a point representing the object as input to the secondary infrared tracker

and a patch enclosing the object as input for the primary camera tracker. At step

k = 0, for i = 1,2, ……. N, generate samples based on an initial Gaussian

distribution around user inputs.

2. Infrared sensor based secondary tracker:

(i) For k = 1, 2, … and for i = 1, 2, …, N, sample X_ir_{k+1}^(i) ~ p(X_ir_{k+1} | X_ir_k^(i)) from the omnidirectional motion model using Eq. (11-12).

(ii) When new distance measurements recorded by the infrared sensor array

come in, use a gating technique to filter out observations beyond a threshold

of past estimated detection, then take an average of remaining points. The

weights for each particle are computed as w_ir_{k+1}^(i) ∝ p(z_ir_{k+1} | X_ir_{k+1}^(i)) using Eq. (13).

(iii) Normalize the weights: ŵ_{k+1}^(i) = w_{k+1}^(i) / Σ_{i=1}^{N} w_{k+1}^(i)

(iv) Resample from the posterior distribution p(𝑋_𝑖𝑟𝑘+1 |𝑍_𝑖𝑟 𝑘+1 )

(v) Output the mean of the posterior distribution and project the point on to the

camera's image (to be used in 3(ii))

3. Camera based primary tracker:

(i) For k = 1,2,…… and for i = 1,2,….,N, sample

X_c_{k+1}^(i) ~ p(X_c_{k+1} | X_c_k^(i)) from the constant acceleration model in Eq. (14-15).

(ii) When a new image is received, extract patches around the sampled

points in 3(i), and using the point obtained in 2(v), compute the metrics,

ir_camera_overlap and patch_matches. The weights for each particle are computed as w_c_{k+1}^(i) ∝ p(z_c_{k+1} | X_c_{k+1}^(i))

• If ir_camera_overlap > thresh_ir_camera_overlap and patch_matches > thresh_patch_matches, use Eq. (18)

• If ir_camera_overlap < thresh_ir_camera_overlap and patch_matches > thresh_patch_matches, use Eq. (19)

• If patch_matches < thresh_patch_matches,

Set t = 0;

do

a) Set IR estimated position at time k as the object

location.

b) Generate patches around the point randomly.

c) k=k+1, t=t+1

while (patch_matches < thresh_restart & t < max_iterations)

if(t<max_iterations)

go to step 3(i) to restart the particle filter with the

patch around the infrared sensor estimated point as

the starting patch.

else

report object has been lost (quit).

(iii) Normalize the weights: ŵ_{k+1}^(i) = w_{k+1}^(i) / Σ_{i=1}^{N} w_{k+1}^(i)

(iv) Resample from the posterior distribution p(𝑋_𝑐𝑘+1 |𝑍_𝑐 𝑘+1 )

(v) Output the mean of the posterior distribution as the center of the patch

representing the object.

4. Go to step 2.

2.5.4. Results

The setup consists of an array of five Sharp infrared sensors (model number GP2Y0A710)

mounted on a robotic platform at a height of about 2.4 feet so that an average human’s

torso can be detected. Two of them face sideways and the remaining three are placed at intervals of 18 cm from each other. A camera (Logitech C920) is mounted at a height of 42 cm from the plane of the infrared sensors to get a nearly full frame of the object to be tracked. To evaluate the algorithm, 15 recorded video sequences have been used with an object

executing different types of motion, such as walking in the hallway or a cluttered lab and

making a sudden turn to the left or right or moving in a zig-zag, random and stop and go

motion patterns. The algorithm is evaluated thrice per second and ground truth about the

object’s position is obtained manually. The accuracy of the algorithm is expressed by Root

Mean Square Error (RMSE), which is given by

mean track error = ( Σ_{i=1}^{N} √((x_i − x_est_i)² + (y_i − y_est_i)²) ) / N_frames    (20)

where 𝑥𝑖 , 𝑦𝑖 is the known position of the center of the object at frame 𝑖 and 𝑥_𝑒𝑠𝑡𝑖 , 𝑦_𝑒𝑠𝑡𝑖

is the estimated position by the tracking algorithm, N is the number of detections and Nframes

is the total number of frames. Fig. 16 shows selected frames from videos capturing different

types of motion and Fig. 17 shows graphs recording the tracking error for each iteration

when only using the infrared tracker, only using the camera tracker and using both the

trackers respectively. In Fig. 16(a), the target moves in an unobstructed scene and the

infrared-camera tracker performs well. In Fig. 16(b), the target is occluded by two persons,

one who has a similar appearance to the target in frames 1359 to 1431, and the other person, who has a different appearance, occludes the target in frames 1701 to 1725. In the occlusion scenario, using the infrared-camera tracker helps to recover the person

immediately after the occlusion, as it finds the depth of the person and thereby helps to

track it. When using a camera based baseline tracker, it loses track of the target after the

first occlusion and then drifts, and thus it must be restarted.


Figure 16. Images from some video sequences illustrating the target tracking under various
occlusion/illumination scenarios. (a) Video sequence to demonstrate target walking in a
scene without any occlusion. Frames 1905, 1941, 2019 and 2055 have been shown; (b)
Video sequence to demonstrate target being occluded by an object with similar appearance
as well as by an object with a different appearance. Frames 1359, 1407, 1425, 1431, 1521,
1545, 1659, 1701, 1725, 1773, 1857 and 2236 have been shown; (c) Video sequence to
demonstrate target being tracked when multiple persons are present in the scene, however,
there is no occlusion. Frames 408, 433, 450, 492 and 505 have been shown; (d) Video
sequence to demonstrate target occluded in presence of other objects as well. Frames 1002,
1074, 1110 and 1182 have been shown; (e) Video sequence to demonstrate target occluded
in presence of other objects as well. Frames 300, 306, 318 and 360 have been shown; (f)
Video sequence to demonstrate target being tracked in the presence of other objects in low
illumination condition in the hallway. Frames 2221, 2293, 2329 and 2341 have been
shown.

In Fig. 16(c), multiple objects are present in the scene but do not occlude the target, and both the baseline camera and the infrared-camera trackers perform well. Fig. 16(d) and 16(e) provide some

more examples of the target being occluded in the presence of multiple persons and in both

cases, the combined infrared-camera tracker outperforms the baseline camera or baseline

infrared tracker. Fig 16(f) demonstrates tracking when the illumination in the hallway

changes. The baseline camera tracker does not fail; however, when combined with the infrared tracker, its performance improves, as the latter tracker's success does not depend

on the illumination.


Figure 17. Graphs showing tracking error evaluated three times per second for two different sequences (a) and (b).

Examining the graphs in Fig. 17(a)-(b), it is evident that the camera and infrared sensor

based tracker performs better overall than using either tracker alone. In Fig. 17(a), between 8 and 10 seconds, the infrared sensor reports noisy data; however, the camera tracker tracks the object quite accurately, probably because the feature representations matched the reference well, as is evident from the pure camera based tracker performing well during this time period. In Fig. 17(b), the trackers individually track with higher accuracy than in Fig. 17(a), and the combined tracker is also able to track the object accurately. However, during the 11 to 14 second interval, the pure camera based tracker's performance drops, but

because the infrared tracker maintains consistent accuracy during this time, the overall

tracking doesn’t fail.


2.5.5. Conclusion

This work has introduced a technique for tracking an object making unpredictable turns

using a primary camera tracker and a secondary infrared sensor tracker. Fusing inputs from

both trackers helps determine the object's location more accurately by giving more weight

to particles having closer proximity to infrared and camera detections. Tracking failure by

either one or both sensors is handled by using suitable recovery methods. Infrared sensor

detections have been used to restart the particle filter if it is lost, where possible. The results

show that tracking using both the sensors give better performance accuracy and help keep

tracking errors lower than using either the camera or infrared based tracker alone.

A problem faced is the extraction of infrared sensor data to associate with the object. Infrared sensor data can get noisy, especially when the tracking platform is in motion. Therefore, a more efficient gating technique or data association algorithm has to be

developed. Other future areas of research involve exploring different fusion techniques,

improved motion models and dealing with noisier environments (possibly occlusion).

Chapter 3. Occlusion Handling in Tracking

3.1. Introduction

Human object detection and tracking is a challenging research topic in the field of computer

vision or robotics and finds wide applications in the areas of video surveillance, robot

follower, autonomous navigation, etc. RGB cameras have been used extensively for

tracking purposes with a combination of other sensors such as lidars, infrared cameras,

ultrasonic sensors, etc. Stereo cameras have also been used to generate depth maps to assist

in tracking as one can exploit the depth associated to the pixels of the image. However,

most appearance model based object detection and tracking algorithms encounter problems

when the appearance of the object changes as it interacts with the background. If the

background also changes a lot, this can lead to incorrect foreground-background

segmentation. Moreover, such models based on the appearance or features extracted from

a color image will also depend on the lighting conditions of the scene, which poses

problems with rapid illumination changes. In addition, in the presence of occlusion, it

might fail to successfully track the object because of difficulty in identifying the occluded

object whose appearance may have changed considerably because of the presence of the

occluding object. In the traditional x-y tracking domain using particle filters, the ‘y’

corresponds to the vertical displacement in the image, which appears linear when the object

is close to the camera but at further distances, the relationship between the positions of the

persons and the vertical displacement on the image is non-linear, which might lead to a
discrepancy between the motion model predictions and the representation on the image.

Using a camera matrix for obtaining real-world coordinates adds another step in the

processing algorithm, which can be avoided by utilizing the depth data returned by the

Kinect sensor. In case of a stereo camera, computing depth values increases the time

complexity as one must match corresponding feature points in both images.

In comparison to the appearance based tracking in the x-y frame, depth based tracking adds

the third dimension of depth, which is able to provide geometrical representations of the

objects without being affected by illumination changes. Also, given a particular depth,

objects at those depth ranges can be extracted and segmented more robustly than trying to

extract objects from color images or video sequences which might have a changing

background that might be similar to the object in appearance. If the object to be tracked is

occluded, a tracker based on the appearance features might fail to extract the partially

occluded object or might even switch to the object causing the occlusion in case their

appearances are similar. It might not be able to detect the occlusion in those cases. However, when depth data is used with object tracking, given the object's current position and depth, there is only a small change in depth in the next frame, and thus extraction of the fully or partially visible object at the next possible depth ranges will help extract a more

accurate region for the object to be tracked.

This chapter presents research which aims to utilize a motion model which uses the

horizontal-depth frame for propagating particles of a particle filter used to track a given

target and demonstrates the advantages of incorporating depth into the motion model. In

addition, the depth data helps in determining if occlusion has taken place and to extract a

target more precisely than using feature or appearance based models. The algorithm has

been further enhanced, to handle dynamic occlusion scenarios, such as when a target is

occluded by one or more occluders for a period of time. This is achieved by observing the

occlusion status of the target and initiating occluder track(s), which serves the dual purpose

of providing a distribution of the location probability for the target in case of full or partial

occlusion. This is combined with a part based matching template system for associating

partially visible object parts to the whole object as detected in the pre-occlusion stage, or

even for object recovery purposes.

3.2. Related Work

Researchers have approached the problem of occlusion handling in different ways, such as

by producing detailed object representations for parts of objects, such as in [72], where a

hierarchical deformable part-based model is used for handling occlusion. In [73], two types

of detectors are used, a global detector which generates an occlusion map, which is to be

used by a part based local detector. Some researchers consider the occluder-occludee as a

pair and use suitable feature representations for the same, such as in [74], where an and-or

model has been adopted for studying occluder-occludee occlusion patterns in a car, which

can also be extended to other objects such as humans. [75] proposes to use a double person

detector for detecting occluder-occludee pairs. In [76], occluder-occludee occlusion

patterns are mined for robust object detection. Some other approaches use a combination
of context information along with visual and depth cues to track an object robustly in the

presence of an occluder, such as in [77-78]. In [79], a vehicle detection and tracking

approach has been proposed that handles dynamic occlusion of vehicles on a road, by tracking occluders and occludees using a context based multiple cue method. The challenge faced by these approaches is that modeling the target object by itself ignores the

fact that it can undergo occlusion, which would drastically change the model

representation. Even if the occluder-occludee pair is modeled, that will also undergo

change, as the objects move and interact with each other.

Increased availability of depth sensors has encouraged researchers to pursue RGB-D

tracking, which has potential for yielding better results, since the addition of the depth data

can handle occlusion better or prevent model drift arising from a change in appearance of

the target. [80] presents an RGB-D tracker, where the RGB Kernelized Correlation Filters

tracker is enhanced by fusing color and depth cues, and by exploiting the depth distribution

of the target, scale changes are studied and occlusion is handled. Lost tracks are recovered by

searching in key areas. In [81], Gaussian Mixture Models have been used to detect

occlusion. Partial occlusion is handled by tracking the partially visible object based on

fusing depth and color data. A motion tracker is also used to predict positions in case of

full occlusion. However, research focusing on RGB-D tracking with occlusion handling is

limited and this work aims to address this situation in a different way. Object tracking is

done by propagating the object in the horizontal-depth framework followed by depth based

extraction. Occlusion handling is done by matching partially occluded object parts to prior

models and maintaining separate occluder tracks to narrow down search for the occluded

target.

3.3. Methodology

This section presents the approach used for depth based tracking with occlusion handling.

It is divided into 4 sub-sections, Object Representation, Object Extraction and Filtering,

Occlusion Detection and Handling, and Particle Filter Tracker.

3.3.1. Object Representation

This algorithm uses particle filters for object tracking in the x-z domain. A human object

in the depth image is depicted in Fig. 18(a) and the corresponding depth profile w.r.t. the

horizontal axis is given in Fig. 18(b). The depth profile takes on a characteristic shape for

an upright human body, either stationary or in motion. As the person moves, his depth

profile does not alter much, unless the person is occluded, in which case the shape of the

blob is going to change. However, for the purpose of tracking, one can consider the center

of the patch in Fig. 18(a) to be the (x,z) center of the object, where x stands for the

horizontal displacement and z stands for the depth value. Additionally, normalized color

histograms [65], extracted from the corresponding color image and normalized histogram

of depth values with 50 bins (Fig. 18(c)), obtained from the depth images are used for

computing the feature vector for the object. Other characteristics such as object’s (x,y)

position, where x stands for the horizontal displacement in the image and y stands for the

vertical displacement in the image, and size of the bounding box are also computed. The

patches are checked for occlusion and the Occlusion Detection and Handling section

describes how to handle partial or full occlusion scenarios.
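A sketch of building this representation from an RGB patch and the matching depth patch follows; the 50-bin depth histogram matches the description above, while the 8-bins-per-channel color histogram, the use of the median depth as a stand-in for the depth mode, and the assumption of zero-valued missing depth pixels are illustrative choices.

import numpy as np

def object_representation(rgb_patch, depth_patch, color_bins=8, depth_bins=50):
    """Normalized color histogram, normalized 50-bin depth histogram and (x, z) center.
    rgb_patch: (H, W, 3) uint8 image patch; depth_patch: (H, W) depth values."""
    color_hist, _ = np.histogramdd(
        rgb_patch.reshape(-1, 3).astype(float),
        bins=(color_bins,) * 3, range=((0, 256),) * 3)
    color_hist = color_hist.ravel() / max(color_hist.sum(), 1.0)
    valid = depth_patch[depth_patch > 0]                 # ignore missing depth readings
    depth_hist, _ = np.histogram(valid, bins=depth_bins)
    depth_hist = depth_hist / max(depth_hist.sum(), 1.0)
    z_center = float(np.median(valid)) if valid.size else 0.0   # stand-in for the depth mode
    x_center = rgb_patch.shape[1] / 2.0                  # horizontal center of the patch
    return color_hist, depth_hist, (x_center, z_center)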

3.3.2. Object Extraction and Filtering

The tracking initiates with the user selecting the object to be tracked. The ground plane is

removed in all images before beginning the processing following the method in [67]. The

corresponding patch in the depth image is analyzed to get the mode of the depth value.

Using this value, the algorithm makes an informed guess about the possible depth ranges

that the object can be at in the next frame (-100cm to +200cm) and objects are extracted at

depth intervals of 25 cm. If the object is partially occluded by another object at a different depth, extraction still works unaffected as it is based on depth. However,

in case occlusion occurs because the occluding object is almost at a similar depth to the

object or if it is at the same depth as the object to be tracked and two or more objects appear

as a joint blob, then the depth segmented blob is going to be split into multiple hypothesis patches with sizes pertaining to the given depth value. The patch size, given by the length and width parameters and associated with a depth value, is learnt by conducting

experiments, recording average human sizes at those depths in a lookup table and then,

using extrapolation. Once the objects are extracted using the method explained, a two-step

gating technique is applied which filters out some objects based on their proximity in position and size to the estimated human object in the previous frame. In the second step of the gating method, these filtered objects are matched by their color and depth features using

the Bhattacharyya distance measure with the previously detected human object.
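A simplified sketch of the depth-interval extraction step is given below, using SciPy connected-component labelling; the depth window and 25 cm step follow the description above, while the minimum blob size is a placeholder, and the subsequent position/size/feature gating described in the text is omitted.

import numpy as np
from scipy import ndimage

def candidate_patches(depth_cm, z_prev, lo=-100, hi=200, step=25, min_pixels=400):
    """Extract candidate blobs in the depth window [z_prev+lo, z_prev+hi),
    scanning in slices of `step` cm.  Returns (bounding-box slices, blob depth) pairs."""
    candidates = []
    for z in range(int(z_prev) + lo, int(z_prev) + hi, step):
        mask = (depth_cm >= z) & (depth_cm < z + step)
        labels, n_blobs = ndimage.label(mask)            # connected components in this slice
        for box in ndimage.find_objects(labels):
            region = mask[box]
            if np.count_nonzero(region) < min_pixels:
                continue                                 # too small to be a person
            blob_depth = float(np.median(depth_cm[box][region]))
            candidates.append((box, blob_depth))
    return candidates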


Figure 18. (a) Depth image showing the human body. (b) projection of the human body
depth data on the x-z plane. (c) normalized depth histogram for the human object
3.3.3. Occlusion Detection and Handling

Before any extracted patch is processed, a check is performed to detect the presence of

occlusion. In case the person to be tracked is occluded by another object, then the occluding

object has to be present at a depth which is less than the depth of the object. Therefore,

when analyzing the depth values in a patch, if there exists a concentration of pixels having

depth values lower than the depth of the object to be tracked, such that the concentration

exceeds an occlusion threshold, thresh_occ, then the object is said to be partially occluded. The threshold thresh_occ is computed according to

thresh_occ = (number of occluded pixels) / (total number of non-zero pixels in the depth image)    (21)
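A small sketch of this occlusion check, where the ratio of Eq. (21) is compared against a chosen cutoff; the depth margin and cutoff values are placeholders.

import numpy as np

def is_partially_occluded(depth_patch, target_depth, margin=15.0, occ_cutoff=0.3):
    """Compute the ratio of Eq. (21) over the patch and flag partial occlusion when
    the pixels lying clearly in front of the target exceed the chosen cutoff.
    depth_patch and target_depth are assumed to share the same units (e.g. cm)."""
    valid = depth_patch[depth_patch > 0]                 # non-zero (valid) depth pixels
    if valid.size == 0:
        return False
    occluded = np.count_nonzero(valid < (target_depth - margin))
    return (occluded / valid.size) > occ_cutoff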

Modelling the prior appearance, shape or motion of the target and matching it to the object when it reappears after occlusion might fail. In this approach, when an object is partially

occluded, the algorithm starts tracking the occluder as well, until the object is completely

visible, thus indirectly keeping track of the object’s 2-D location. One has to keep in mind

that the object to be tracked might be occluded on its left or right by other occluders,

especially in a crowded scene, or the occluders might themselves be occluded by newer

occluders. The proposed algorithm tracks all the visible occluders.

The goal is to use the location of the tracked occluder as a prior over the distribution of the

object to be tracked. This can be easily observed because the target to be tracked is either

visible on the left side or the right side (or top side) of the occluder, or the target is

completely covered by the occluder (full occlusion). One might argue that the target could

also be lost, and while that is a possibility in certain cases (such as when an occluder is

approaching a target from the right side, and upon occlusion, if the occluder is stationary

and the target simply changes track and moves perpendicular to his original path, all along

covered by the occluder, and finally exits the tracking scene), the goal of this algorithm is

to help identify the occluded object and associate it correctly upon its reappearance.

Therefore, fusing prior information about the target with the location of its occluder(s) helps

recover the target upon its reappearance more effectively.

Additionally, when the target undergoes occlusion, its appearance changes, and therefore

the partially visible object (when the object is getting occluded or when it is coming out of

occlusion), does not necessarily match the appearance model of its pre-occlusion stage.

When the target is heavily occluded, with only parts of the target visible, such as an arm, part of a leg, or half of the body, the use of depth data can extract the sub-parts more reliably than appearance based models can. The immediate background is also monitored for possible

occluders at similar depths, and in such cases, color and position based filters will help to

extract the target (or its parts in case of occlusion) precisely.

Once occlusion has been detected, this work does not strive to maintain an updated model

of the partially visible object, and instead tries to associate the partly visible object to the

prior model learnt before occlusion. The motivation behind this approach is that an

occluded object interacts with its occluders and any model representing the visible part of

the occluder or even the occluder-occludee pair will be changing with the change in

interaction, and therefore, this work updates the probability distribution for the occluded

object based on the occluder and associates partially visible object parts to the stored

model. This is done by template matching, and if a high probability of association can be

found, the part object is output as the tracked object, otherwise the algorithm proceeds with

tracking the occluder(s). When the object becomes fully occluded, the algorithm does not

output anything for the object, but updates its possible location distribution. Finally, when

the object is visible (i.e. not occluded), the depth and appearance models are updated by a

weighted combination of the prior and posterior models. After a threshold number of

frames, framethresh, if the object is still occluded, the algorithm assumes that it has been lost

and a search is done to locate the object.
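A hedged sketch of this part-to-prior association step is given below, using OpenCV's normalized cross-correlation template matching as one possible realization; the 0.7 acceptance score and the function names are illustrative assumptions, not parameters reported in this work, and the sketch assumes one region fits entirely inside the other.

```python
import cv2

def associate_part_to_prior(part_bgr, prior_template_bgr, accept_score=0.7):
    """Try to associate a partially visible object part with the stored
    pre-occlusion appearance template.

    Returns (matched, score, top_left), where `matched` is True when the
    normalized correlation score exceeds `accept_score`.
    """
    # Template matching needs the searched image to be at least as large as
    # the template, so match the smaller region inside the larger one.
    if (part_bgr.shape[0] < prior_template_bgr.shape[0] or
            part_bgr.shape[1] < prior_template_bgr.shape[1]):
        image, templ = prior_template_bgr, part_bgr
    else:
        image, templ = part_bgr, prior_template_bgr

    result = cv2.matchTemplate(image, templ, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    return max_val >= accept_score, float(max_val), max_loc
```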

3.3.4. Particle Filter Tracker

The particle filter tracker in this work uses a motion model in the horizontal-depth or x-z motion framework. The state of the object at time $t$ is represented by the vector $S_t = [X_t \; \dot{X}_t \; Z_t \; \dot{Z}_t]$, where $X_t$ and $Z_t$ are the x and z coordinate positions, and $\dot{X}_t$ and $\dot{Z}_t$ are the velocity components in the x and z directions, respectively, at time $t$. This motion model, which gives a prior estimate of the state, can be represented by the following equations:

$$X_t = X_{t-1} + \dot{X}_{t-1}\,\Delta t \qquad (22)$$

$$Z_t = Z_{t-1} + \dot{Z}_{t-1}\,\Delta t \qquad (23)$$

$$|\dot{X}_t| = |\dot{X}_{t-1}| + N(0, \sigma_x^2) \qquad (24)$$

$$|\dot{Z}_t| = |\dot{Z}_{t-1}| + N(0, \sigma_z^2) \qquad (25)$$

where $\sigma_x^2$ and $\sigma_z^2$ are the variances of the velocity components in the x and z directions.
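A minimal sketch of this propagation step for a set of particles is shown below (NumPy); the array layout, the noise values, and the choice to keep the previous direction of motion, which Eqs. (24)-(25) leave unspecified since they only constrain the speed, are illustrative assumptions.

```python
import numpy as np

def propagate_particles(particles, dt=1.0, sigma_x=5.0, sigma_z=5.0, rng=None):
    """Propagate particles through the x-z motion model of Eqs. (22)-(25).

    particles : (N, 4) array with columns [X, X_dot, Z, Z_dot] (e.g. in cm)
    """
    rng = np.random.default_rng() if rng is None else rng
    X, X_dot, Z, Z_dot = particles.T

    X_new = X + X_dot * dt                                                   # Eq. (22)
    Z_new = Z + Z_dot * dt                                                   # Eq. (23)
    # Perturb the speeds; the sign (direction) is carried over from the
    # previous step, which Eqs. (24)-(25) do not prescribe.
    speed_x = np.abs(np.abs(X_dot) + rng.normal(0, sigma_x, X_dot.shape))    # Eq. (24)
    speed_z = np.abs(np.abs(Z_dot) + rng.normal(0, sigma_z, Z_dot.shape))    # Eq. (25)
    X_dot_new = np.where(X_dot >= 0, speed_x, -speed_x)
    Z_dot_new = np.where(Z_dot >= 0, speed_z, -speed_z)

    return np.column_stack([X_new, X_dot_new, Z_new, Z_dot_new])
```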

Tracking in the x-z domain is beneficial because the particles are propagated directly in the horizontal-depth plane. The motion model estimates the depth at which the object can be expected to be found in the next time instant, and the objects can be extracted using these depth values. If there are multiple pixels in the depth image with the predicted (x, z) values, then patches are extracted by averaging over all those (x, z) values. In the x-y tracking domain, by contrast, the y estimate is essentially a vertical displacement in the image; at longer distances, the motion model's prediction for the y value may drift upwards in the image, whereas the object's actual y position changes non-linearly. This problem is avoided entirely when depth values are used, since depth can be used to extract the objects in the depth image or the corresponding color image.
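As a simple illustration of this depth-gated extraction, a sketch is given below; the half-window size, the minimum blob size, and the helper names are assumptions, with connected components of the resulting mask serving as candidate object blobs.

```python
import numpy as np
from scipy import ndimage

def extract_at_depth(depth_image, z_pred, half_range=150, min_pixels=500):
    """Return candidate object blobs within a depth window around z_pred.

    depth_image : 2-D array of depths (0 = missing), same units as z_pred
    half_range, min_pixels : illustrative gating parameters
    """
    valid = depth_image > 0
    mask = valid & (np.abs(depth_image - z_pred) <= half_range)

    labels, num = ndimage.label(mask)                  # connected components
    blobs = [labels == i for i in range(1, num + 1)
             if (labels == i).sum() >= min_pixels]     # discard tiny blobs
    return blobs
```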

The object’s feature vector is computed as described in the Object Representation section.

When a set of objects is obtained as the observations following the Object Extraction and Filtering section, the primary task is to retain only one observation. To do so, each patch generated during the motion model prediction step casts a vote for an observation.

This vote is cast depending on the maximum likelihood achieved for the state, given the

current set of observations. The likelihood model is given by

$$p(\mathrm{obs} \mid X) \;\propto\; \exp\!\left(-\frac{d_{Color}(\hat{C}_k, \hat{C}_{ref})^2}{2\sigma_1^2}\right) \cdot \exp\!\left(-\frac{d_{Dep}(\widehat{Dep}_k, \widehat{Dep}_{ref})^2}{2\sigma_2^2}\right) \cdot \exp\!\left(-\frac{d_{Euc}(X_{cur_k}, X_{obs_k})^2}{2\sigma_3^2}\right) \qquad (26)$$

where $d_{Color}(\hat{C}_k, \hat{C}_{ref}) = \sqrt{1 - \rho(\hat{C}_k, \hat{C}_{ref})}$ is the Bhattacharyya distance between the color histograms, $d_{Dep}(\widehat{Dep}_k, \widehat{Dep}_{ref}) = \sqrt{1 - \rho(\widehat{Dep}_k, \widehat{Dep}_{ref})}$ is the Bhattacharyya distance between the normalized depth histograms, and $d_{Euc}(X_{cur_k}, X_{obs_k}) = \sqrt{(x_{cur_k} - x_{obs_k})^2 + (y_{cur_k} - y_{obs_k})^2}$ is the Euclidean distance between the center of the current patch and the observed object's center position at time $k$. Here $\hat{C}_k$ and $\widehat{Dep}_k$ are the normalized color and depth histograms for the current patch centered at $(x, z)$, $\hat{C}_{ref}$ and $\widehat{Dep}_{ref}$ are the normalized color and depth histograms for the reference patch, $\rho$ is the Bhattacharyya coefficient, $\sigma_1$, $\sigma_2$, $\sigma_3$ specify the Gaussian noise in the measurements, obs is the observation, and $X$ is the current state.
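A hedged sketch of evaluating Eq. (26) for one patch-observation pair is given below; the histogram layout and the sigma values are illustrative assumptions, not tuned parameters from this work.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two normalized histograms."""
    rho = np.sum(np.sqrt(h1 * h2))          # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - rho, 0.0))

def likelihood(color_k, color_ref, depth_k, depth_ref,
               center_cur, center_obs, sigmas=(0.1, 0.1, 20.0)):
    """Unnormalized observation likelihood of Eq. (26).

    Histograms are 1-D arrays summing to one; centers are (x, y) pixel
    coordinates; the sigma values are placeholders for illustration.
    """
    s1, s2, s3 = sigmas
    d_color = bhattacharyya_distance(color_k, color_ref)
    d_depth = bhattacharyya_distance(depth_k, depth_ref)
    d_euc = np.hypot(center_cur[0] - center_obs[0], center_cur[1] - center_obs[1])
    return (np.exp(-d_color**2 / (2 * s1**2)) *
            np.exp(-d_depth**2 / (2 * s2**2)) *
            np.exp(-d_euc**2  / (2 * s3**2)))
```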

Once all the patches have associated themselves with an observation, a majority vote is conducted and the observation receiving the highest number of votes is retained as the final observation. Following this, the particle weights are reassigned using the same likelihood model with respect to the final selected observation, the residual resampling method [69] is applied, and the mean of the posterior distribution is output as the estimated position of the tracked object. The overall procedure is summarized in the algorithm below.
Algorithm

(i) Obtain depth mean and standard deviation of target to be tracked from user

selected patch.

(ii) Obtain the depth distribution of the immediate neighborhood.

(iii) Update the position of target to be tracked based on depth propagation and filter

using estimated position of target. If neighborhood depth distribution obtained

in step 2 is similar to the target, apply filters based on color distribution and

position of target.

(iv) Update the position for any current occluder(s) which are occluding the target.

(v) For the updated position of the target in step 3, determine if occlusion has occurred. Checks are conducted to detect the presence of new occluder(s) as well as to determine any intersection between old occluder(s) and the target.

(vi) In case of occlusion in step 5, extract the partially visible object using depth of

the target and use part based template matching to match it with the appearance

model of the object prior to occlusion to see if the visible part belongs to this

object.

(vii) In case of full occlusion, the position estimates of the occluder(s) serve as

possible locations for the target.

(viii) Reinitialize search for target if it does not reappear after some frames

depending on the situation.

(ix) Repeat steps 2 to 8.
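For reference, a generic sketch of the residual resampling step [69] applied after the weight update is given below; the NumPy helpers are a choice made for illustration.

```python
import numpy as np

def residual_resample(weights, rng=None):
    """Residual resampling: return the indices of the particles to keep.

    weights : 1-D array of normalized particle weights (sums to one).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)

    # Deterministic part: each particle is copied floor(n * w_i) times.
    counts = np.floor(n * weights).astype(int)
    indices = np.repeat(np.arange(n), counts)

    # Stochastic part: draw the remaining particles from the residual weights.
    n_residual = n - counts.sum()
    if n_residual > 0:
        residual = n * weights - counts
        residual /= residual.sum()
        extra = rng.choice(n, size=n_residual, p=residual)
        indices = np.concatenate([indices, extra])
    return indices

# particles = particles[residual_resample(weights)]   # resampled particle set
```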


3.4. Results

The algorithm has been tested on data collected in the laboratory as well as on some occlusion scenes from the Princeton tracking dataset [82]. A Microsoft Kinect sensor, which provides depth data for each pixel in the color image, is used; mapping from the depth image to the color image is done using [68]. The sensor is fixed on a table and the motion of object(s) in front of it is captured, with a maximum range of 400 cm. To evaluate the proposed method, 20 video sequences were recorded and corresponding depth and color images of size 424 × 512 were obtained. In these videos, the object moves in a linear manner, either walking towards or away from the camera, or going from left to right in front of the camera and vice versa. Another object steps in and occludes the person to be tracked, partially or, in some cases, fully. Additionally, some occlusion scenes from the Princeton tracking dataset have been used, which test occlusion at several levels, such as when the target is occluded for a period of time by multiple occluders, or when the target is occluded by an occluder at a similar depth range. The ground truth of the human's position is recorded manually, and the Root Mean Square Error (RMSE) is used to measure the accuracy of the algorithm.
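For completeness, a small sketch of the RMSE computation over a tracked sequence is shown below; whether the error is taken over the Euclidean position error, as here, or per coordinate is an assumption.

```python
import numpy as np

def rmse(estimated_xz, ground_truth_xz):
    """Root Mean Square Error between estimated and ground-truth positions.

    Both inputs are (T, 2) arrays of (x, z) positions over T frames.
    """
    err = np.asarray(estimated_xz, dtype=float) - np.asarray(ground_truth_xz, dtype=float)
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=1))))
```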

In Fig. 19, the object executes a simple linear walk and the algorithm is able to successfully track it throughout the sequence. In Fig. 20, two objects walk towards each other; they cross at some point in time and thus have similar depth values with partial occlusion. The algorithm is able to detect and track the selected object without drifting to the other object when the occlusion takes place. In Fig. 21, an object is fully occluded for some time; however, when it comes out of its state of occlusion, the tracker picks it up again without drifting to the other object, even though the occluding object is at a very close depth. This is possibly because of the combination of depth and color models used for matching. In Fig. 22, the object to be tracked moves from the back of the scene towards the camera while being partly occluded, and the algorithm tracks it throughout the occlusion state. At the end, the occluding object moves towards the camera and only a small portion of the object to be tracked is visible, yet the algorithm can detect the non-occluded portion and continue the tracking. Similar occlusion scenarios are observed in Fig. 23 and Fig. 24. Fig. 25 is more challenging, as the target to be tracked is occluded and the occluder in turn is occluded. By using depth data to extract visible portions of the occluded target and matching these parts with the prior appearance model stored before occlusion, the algorithm is able to report a suitable location of the target to the viewer. Additionally, tracking the occluder helps in recovering the target when it emerges from a state of full occlusion.

Figure 19. Image sequence showing an object executing a simple linear motion being
tracked.

Figure 20. Image sequence showing an object facing partial occlusion being tracked correctly. At frame number 262, the two objects are at similar depths and the target is partly occluded.

Figure 21. Image sequence showing an object that is fully occluded for a short time; on reappearing, it is tracked again.

Figure 22. Image sequence showing an object that is partially occluded for a long duration of time; it is successfully tracked throughout, and towards the end, when it is heavily occluded, the algorithm still tracks it correctly.

Figure 23. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 1); Bold black bounding box represents target, light black bounding
box represents occluder.
Figure 24. Target enters a stage of partial occlusion, until it is fully occluded and then
reappears (example 2); Bold black bounding box represents target, light black bounding
box represents occluder.

Figure 25. Partially visible target is obstructed by an occluder which in turn is occluded;
Bold black bounding box represents target; light black bounding box represents occluder.

3.5. Conclusion

This work has introduced human object tracking with occlusion in the x-z domain, where

particles are propagated using the horizontal-depth framework. Using depth values for

tracking and object extraction avoids issues otherwise faced in appearance-based tracking, such as lighting changes altering the appearance of the object, difficulty in extraction when objects have similar appearances, or unsuccessful extraction when the object is occluded. As seen from the results, tracking in the x-z domain leads

to more accurate state motion model predictions which help in the extraction of the objects.

This method handles occlusion scenarios robustly, because it integrates information about

the occluders, thus producing a better estimate of the location of the target. Also, by

avoiding the update of the target’s appearance model by using partially visible object parts,

and instead, simply associating those object parts to the whole object detected prior to

occlusion, additional information about the target’s position can be obtained.

Tracking in the x-z domain instead of the x-y domain has not been well researched, and this work aims to contribute towards that. Improvements can be made to obtain a better

representation of the object features in the depth domain. The gating technique could be

changed by using a machine learning algorithm that can classify extracted depth patches

as humans or non-human objects. Also, the shape of the x-z blob projection could be used

for human identification with or without occlusion. While this work assumes that the sensor

is stationary, future work could use these sensors in motion. Additionally, multiple objects

could be tracked, and better data association techniques could be explored.


Chapter 4. Data Association in Tracking

4.1. Introduction

Correctly modeling the 3-D environment around a given object is an important first step in applications where the object has to navigate, such as autonomous robots, autonomous vehicles, and the like. For instance, an autonomous vehicle requires a map of its surrounding environment so that it can steer itself in the correct direction. Information about the surrounding environment can be obtained from sensors. However, it is challenging to obtain complete 3-dimensional information using one kind of sensor. Using a radar, one can obtain accurate range information about the surrounding objects, yet it is difficult to resolve the azimuth (x) and elevation (y) directions accurately. For instance, a monopulse radar generates multiple closely spaced beams from the same antenna and uses the sum beam and the delta beam to resolve the target's azimuth and elevation directions; at least three receive channels and a relatively complex antenna feed network are needed to realize such a radar. The cross-track scanning mode used on satellites is another way for a radar to resolve targets along the azimuth direction; however, it needs a rotating mirror (an opto-mechanical device), and the mechanical part increases the cost and is not suitable for automotive use, where the system constantly experiences vibration during driving. Many researchers have proposed phased-array radars or multiple-input multiple-output (MIMO) radars; however, to deploy these techniques the transmitter and receiver should have as many array elements as possible for high azimuth resolution, and the cost and complexity of realizing these elements become high. On the other hand, a camera can provide the location of an object in a 2-D plane (such as the x-y plane) accurately, but gives only some idea about the relative depth. Several camera-only methods cannot resolve depth well: a depth camera is restricted to providing information up to very short ranges, and a stereo camera can measure depth but makes the whole system more complex because of the need to locate corresponding feature points between the two cameras.

This work proposes to use two different sensors (i.e. low-cost radar and camera) that give

accurate information along different dimensions and tries to associate these pieces of

information obtained from the two sensors to correctly predict the 3-D position of objects.

Consider the case of a car moving on a highway, trying to determine the positions of other vehicles around it. The complicated problem of detecting the 3-D environment with one kind of sensor is therefore formulated as a problem of associating range information from a radar with 2-D

position coordinates from the vision system. This is where sensor fusion comes into play.

It means integrating information obtained from various sources (sensors) in an appropriate

manner so that an estimation can be made about the scene in question which further helps

in establishing a model of the environment.

This method establishes a relationship between the sizes of a fixed object as projected on

an image taken at different ranges from the object versus the corresponding ranges. This

relationship is then used to estimate the relative depth of objects from their sizes as

perceived in an image. This data, when combined with the relative positions of the vehicles in the image, can be used to predict which absolute distance value obtained from a radar signal return is associated with which object, using an optimization algorithm such as the Hungarian algorithm.

4.2. Related Work

Prior work that has attempted to associate a radar depth to a vision detected object in a real-

world outdoor scene is limited. However, work has been done to estimate absolute depth

using monocular vision system. For instance, [29] tries to estimate depth of static objects

in traffic scenes from monocular video using structure from motion. Depth estimation from

unstructured scenes was explored in [30]. Semantic knowledge of the scene is used to

estimate the depth in [31]. A depth based segmentation using radar and stereo camera is

given in [32]. In [33], a radar-vision fusion system is studied. Some radar and vision fusion

techniques try to associate the radar data to visual detection and tracking as in [34]. In

[35], a vision system is used to confirm the contour of the object detected by a radar. In

[36], the radar and vision detected objects are associated during initialization and tracking is performed thereafter; however, it is not clear how the initial association takes place. In

this work, one sensor is not used to validate another sensor. Instead, the 3-D position of the

objects is established by combining the strengths of two sensors.

4.3. Methodology

4.3.1. Derivation of equation

Using the perspective projection equations for the pinhole camera model, a point in 3-D space $(x, y, z)$ is transformed to a point $(u, v)$ in the camera coordinate system according to [37]:

$$u = \alpha\,\frac{x}{z} - \alpha\cot\theta\,\frac{y}{z} + u_0, \qquad v = \frac{\beta}{\sin\theta}\,\frac{y}{z} + v_0 \qquad (27)$$

where $z$ is the depth of the point from the camera, $\alpha = kf$ and $\beta = lf$, $f$ is the focal length expressed in meters, a pixel has dimensions $\frac{1}{k} \times \frac{1}{l}$ with $k$ and $l$ expressed in pixel·m$^{-1}$, $u_0$ and $v_0$ are the coordinates of the center of the image plane in the camera coordinate system, and $\theta$ is the skew of the camera coordinate system. For this derivation, it is assumed that the origin of the world coordinate system coincides with that of the camera coordinate system. Consider a figure viewed at depth $z$ from the camera; its 3-D world coordinates and the corresponding camera coordinates are shown in Fig. 26.

Figure 26. A real-world figure projected onto camera co-ordinate plane

Using Eq. (27), we can write

$$u_1 = \alpha\,\frac{x_1}{z} - \alpha\cot\theta\,\frac{y_1}{z} + u_0, \qquad v_1 = \frac{\beta}{\sin\theta}\,\frac{y_1}{z} + v_0,$$

$$u_2 = \alpha\,\frac{x_2}{z} - \alpha\cot\theta\,\frac{y_2}{z} + u_0, \qquad v_2 = \frac{\beta}{\sin\theta}\,\frac{y_2}{z} + v_0.$$

Therefore,

$$u_2 - u_1 = \frac{\alpha}{z}(x_2 - x_1) - \frac{\alpha\cot\theta}{z}(y_2 - y_1) \;\Rightarrow\; l_{projected} = \frac{\alpha}{z}\,(l - h\cot\theta),$$

$$v_2 - v_1 = \frac{\beta}{z\sin\theta}(y_2 - y_1) \;\Rightarrow\; h_{projected} = \frac{\beta h}{z\sin\theta},$$

where $l = x_2 - x_1$ and $h = y_2 - y_1$. Therefore, one has

$$size_{projected} = l_{projected} \times h_{projected} = \frac{\alpha\beta\,(lh - h^2\cot\theta)}{z^2\sin\theta} = \frac{const}{z^2}, \quad \text{where } const = \frac{\alpha\beta\,(lh - h^2\cot\theta)}{\sin\theta},$$

$$\Rightarrow\; size_{projected} \propto \frac{1}{z^2}. \qquad (28)$$

In real life, the world and camera origin coordinates are not the same. However, translation

and rotation factors transforming a figure from the world frame to the camera frame would

not affect the projected size of the figure on the image, which is inversely proportional to

depth raised to the power of two.

An experiment was conducted where a fixed size chessboard was placed on the wall. An

image of the chessboard was taken from different ranges (moving from a distance near the

chessboard to far away) and at each time, the size of the chessboard was recorded as

projected on the image and the corresponding range measurement was also taken. The size

of the object as perceived in the image versus the distance at which the image was taken is

plotted and shown in Fig. 27.
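A small sketch of the consistency check behind this experiment is given below; the data arrays would come from such a calibration run, and the helper name is an assumption.

```python
import numpy as np

def size_depth_constant(sizes, distances):
    """Check Eq. (28): projected size x distance^2 should be nearly constant.

    sizes     : projected sizes in pixel units
    distances : corresponding ranges (e.g. in cm)
    Returns the mean constant and its relative standard deviation.
    """
    k = np.asarray(sizes, dtype=float) * np.asarray(distances, dtype=float) ** 2
    return k.mean(), k.std() / k.mean()
```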

The plot of size × distance² for each data pair is given in Fig. 28. This value is almost constant, with a very small standard deviation. Therefore, the experimental data confirms the relationship between projected size and depth derived from the pinhole model in Eq. (28). The motivation for using this relationship is the assumption that most cars moving on the road have roughly similar sizes and can therefore be represented by an average size. This average-sized car, viewed from different distances, would project different sizes on the image: viewed from very close, it would appear very large, and at large distances it would appear much smaller (Fig. 29). Thus, we can use Eq.

(28) to get an idea about the relative depth of the cars in the scene. The absolute depth is

not predicted because that would involve knowing the actual sizes of the cars, which is not

possible. Also, using relative depth does not change the way the data association works out

as these associations are optimized with the absolute depth obtained by the radar.
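A hedged sketch of these two image-derived cues, relative depth from projected size (Eq. (28)) and nearness ranking from the bounding-box bottom edge, is given below; the (x, y, w, h) box format with image y increasing downward and the unit scale constant are assumptions for illustration.

```python
import numpy as np

def relative_depths(bbox_areas, const=1.0):
    """Relative depth estimates from projected areas, following Eq. (28):
    size = const / z^2  =>  z = sqrt(const / size).
    `const` only scales the values, so the relative ordering is preserved."""
    return np.sqrt(const / np.asarray(bbox_areas, dtype=float))

def nearness_ranks(bboxes):
    """Rank detections by nearness using the bottom edge of each (x, y, w, h)
    bounding box: a larger bottom-y (lower in the image) means nearer."""
    bottoms = np.array([y + h for (_, y, _, h) in bboxes])
    order = np.argsort(-bottoms)                    # nearest first
    ranks = np.empty(len(bboxes), dtype=int)
    ranks[order] = np.arange(1, len(bboxes) + 1)    # rank 1 = nearest
    return ranks
```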

Figure 27. Plot of the size of the chessboard (in pixel units) projected on the image versus the distance (in cm) at which the image was taken.
Figure 28. Plot of size × distance² versus observation number, confirming that the observed data follows the derived Eq. (28).

However, predicting the depth from projected size will give accurate results only when objects of a similar physical size are considered. In real life, vehicles of different sizes (trucks versus small cars) are seen on highways. The projected size of a bus in an image will be much larger than that of an average car when both are at the same range from the camera. Even if the car is nearer to the camera than the bus, as in Fig. 30(c)(d), the projected size of the bus may still be larger than that of the car, leading to the false conclusion that the bus is nearer. This results from the assumption that all vehicles on the road have an average size. So, to include vehicles of all classes, additional information obtained from the image itself is used, namely the relative positions of the vehicles, which is described next.

Figure 29. Vehicles with their ranks based on their relative positions in the image.

From the bounding box that is obtained around every detected vehicle, the y-coordinate of

the lower left corner (i.e. along the vertical plane of image) is used as a measure of relative

positions of the vehicles. For instance, in Fig. 29, the detected targets have been ranked

according to their nearness to the camera based on the y-value. Thus, for a situation as

illustrated in Fig. 30(c)(d), this ranking would tell us that the bus is farther away from the

car. Therefore, this ranking is used as an additional constraint on the associations made by the Hungarian algorithm, which is described below.

4.3.2. Procedure

After obtaining an image of vehicles moving on a highway, we would have to detect the

vehicles in the image. Detection of vehicles in an image would be a separate research topic

altogether and we do not go into details of that in this work. However, the reader could

consider [38] for a survey on techniques for detecting vehicles. The Histogram of Oriented

Gradients (HOG) detector has been used for detecting vehicles as described in [39],

followed by computing the area of the bounding box surrounding each detected car.

Assuming this area to represent the size of the car as projected in the image, the relative

depth of the car is calculated using Eq. (28). The radar also gives us a set of range

measurements representing the depths of the objects located around. But a radar with a small number of array elements, or even a single array element, cannot resolve targets in the x-y plane well. Therefore, these range measurements are just values and do not tell us to which object each corresponds. We now have a set of objects represented by their relative depths and sizes along with a ranking, and a set of absolute depths. We have to associate the objects with the absolute depth values. This can be formulated as an assignment problem and solved using the Hungarian algorithm. Let the vehicles detected by the vision system be the collection $O = \{o_1, o_2, o_3, \ldots, o_j\}$ and the ranges returned by the radar sensor be the collection $R = \{r_1, r_2, r_3, \ldots, r_k\}$. The problem is modeled as a bipartite graph, where O is one set of nodes and R is the other, so the total number of nodes is $N = |O| + |R|$. The cost matrix for the above problem is constructed by letting each cell be the squared difference between the absolute distance returned by the radar and the distance estimate returned by the vision system, which is then fed to the Hungarian algorithm.
Figure 30. Testing images. (a) Testing on cars having the same average size. (b) Testing on a partly occluded vehicle (a small car occluded by a large truck). (c), (d) Testing on a large vehicle along with cars.
The aim is to minimize this cost matrix, that is, to find an optimal assignment such that the total cost of assignment is minimized. If m is the

minimum number of assignments for objects that are from the sets of O and R respectively,

the goal is to make an optimal assignment for the two groups of m objects. Therefore, the

following objective has to be minimized:

$$\arg\min \sum_{i=1}^{m} \left\| r(i) - v(i) \right\|^2$$

subject to the constraint

$$\text{if } rank(o(i)) < rank(o(j)), \text{ then } distance(o(i)) < distance(o(j)) \qquad (29)$$

where $v(i) = \sqrt{const / size(i)}$, $r(i)$ and $v(i)$ are the radar and vision computed depth values, $\|r(i) - v(i)\|^2$ is a measure of the difference between the computed depth values, $size(i)$ is the projected size of object $i$, and $distance(o(i))$ is the absolute distance assigned to object $i$ by the Hungarian method. Thus, an optimal assignment of the absolute distances obtained by the radar to the objects detected by the vision system is obtained, thereby associating the range returns with spatial positions and thus giving 3-D information about the surrounding vehicles.
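A hedged sketch of this assignment step is given below, using SciPy's Hungarian solver; the ordering constraint of Eq. (29) is only verified after the assignment here, rather than enforced inside the optimization as in the proposed method, and the vision depths are assumed to be scaled roughly to the radar's units.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(radar_ranges, vision_depths, ranks):
    """Assign radar ranges to vision-detected vehicles (cf. Eq. (29)).

    radar_ranges  : 1-D array of absolute ranges from the radar
    vision_depths : 1-D array of vision depth estimates v(i) = sqrt(const/size(i))
    ranks         : nearness ranks of the detections (1 = nearest)
    Returns (assignment, constraint_ok), where assignment maps detection
    index -> assigned radar range.
    """
    v = np.asarray(vision_depths, dtype=float)
    r = np.asarray(radar_ranges, dtype=float)
    cost = (v[:, None] - r[None, :]) ** 2            # squared-difference cost matrix
    rows, cols = linear_sum_assignment(cost)         # Hungarian method
    assignment = {int(i): float(r[j]) for i, j in zip(rows, cols)}

    # Check the ordering constraint of Eq. (29): a nearer-ranked detection
    # should be assigned a smaller distance than a farther-ranked one.
    constraint_ok = all(assignment[i] < assignment[j]
                        for i in assignment for j in assignment
                        if ranks[i] < ranks[j])
    return assignment, constraint_ok
```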

4.4. Results

Several test images were collected by driving along a highway. The images included small cars at varying distances from the ego car. Images with a truck and a bus were also collected to see how the method would perform. Radar return signals for all the vehicles in the images have been simulated. The simulated radar used a 76.5 GHz carrier, 1 GHz bandwidth, a 0.2 μs pulse width, and an SNR of 13.2 dB. It applied a wavelet-based waveform [40] for high range resolution detection. The radar simulation results are shown in Fig. 31.

After processing the images following the method described in Section 4.3, the relative depths of the vehicles in the scene are obtained. These are then associated with the absolute ones using the Hungarian method of optimizing the associations. Some of the test images are shown in Fig. 30, in which the associated vehicles are depicted by bounding boxes and the associated distances are written in meters (m). The ground truth of the vehicle distances is obtained by manual inspection. Correct associations were made to the radar-returned distances. The method worked well with trucks and small cars in the same scene, which can be attributed to the inclusion of relative positions among the vehicles.

In this process, some false detections have been eliminated (this occurs when the radar system returns more range values than the number of objects detected by the vision system).

For instance, in Fig. 30(b), radar detected three objects. However, since one is just partially

visible, the vision system fails to detect it. The proposed method is still able to discard the

radar return for that occluded vehicle and do the other associations correctly. However, it
might also happen that the vision system failed to detect some object whereas the radar

sensor detected it.

Figure 31. Radar simulation results for Fig. 30(a)-(d), respectively.

4.5. Conclusion

In this work, a camera and a radar have been used to estimate the 3-D position of vehicles

on a highway as seen from an ego car. The relative depth of the vehicles has been estimated

using the size of the vehicles as projected on the image using Eq. (28). Then using a

constraint, based on the ranking of vehicles according to their nearness to the ego car, and

the Hungarian algorithm, associations have been performed to match the accurate range

data of the radar with each vehicle as seen by the vision system. The method works well on the tested images. Several directions for future work on this approach can be identified. One issue arises when the computed size of a vehicle does not reflect its true back view. In Fig. 30(d), for the car in the leftmost lane, part of the side view is included inside the bounding box. Weights could be assigned to the computed size, with a lower weight assigned when the bounding box contains a back view together with a partial side view; however, this problem arises for vehicles near the ego car rather than far-away ones. Also, at a lane intersection, the side view of a car in a perpendicular lane would project a larger size, and the two views would have to be distinguished. Another direction could be to estimate the absolute depth of the vehicles from a single image based on [30] and [31]; that could make the whole approach more robust under some conditions, while significantly complicating the computation. The use of radar and vision fusion could also be extended to estimate 3-D positions in more complex scenarios, such as a college campus.

Chapter 5. Conclusion and Future Work

This dissertation makes a contribution to depth based sensor fusion by exploring several

topics related to object detection and tracking. The more traditional range sensors such as

lasers and radars are generally quite expensive and have been used in military or large scale

industrial projects, etc. Indoor robotics has mostly used low-cost sensors such as infrared,

ultrasonic, or stereo cameras, etc. Nowadays, with the development of depth cameras such

as Kinect or PrimeSense, which are affordable as well, a substantial amount of research

has been conducted to fuse depth data with existing models based on RGB features.

However, as noted in [82], this integration has a lot of potential for improvement.

In indoor robotics applications, the use of infrared sensors has mostly been limited to a

proximity sensor to avoid obstacles. Chapter 2 presents work on extending the use of these

low-cost, but extremely fast infrared sensors to accomplish tasks such as identifying the

direction of motion of a person and fusing the sparse range data obtained from infrared

sensors with a camera to develop a low-cost and efficient indoor tracking sensor system.

Therefore, an array of infrared sensors can be advantageous over a depth camera, when

discrete data is required at a fast processing rate.

In Chapter 3, a Kinect sensor has been used to track an object with a focus on occlusion

handling. A Kinect sensor provides data for the depth at every pixel and this information

is useful for extracting objects based on depth even if the object is partially occluded. An

occluder tracking system with part based association of the partially visible occluded

objects helps to keep track of the object when it recovers from occlusion. There are many

state-of-the-art algorithms for object tracking using RGB data; however, object tracking using RGB-D data is relatively new, and occlusion handling using depth data needs more exploration.

In Chapter 4, a classical data association problem has been explored, where discrete range

data from a depth sensor has to be associated to 2-D objects detected by a camera. This

problem has been applied to a situation where a radar returns a set of ranges corresponding

to objects in the environment and a camera provides the 2-D information about the objects,

with a focus on vehicles driving in a highway. This data association using a Hungarian

algorithm with specified constraints works on a structured environment, however, more

research is required to extend this use to complex environments like an urban scene. This

would eliminate the need to use very expensive radars or 3-D lasers.

There is scope for a lot of interesting work to be done to extend the research presented in

this dissertation. Effectively tracking a human being in a crowd using the idea of

propagation in the x-z domain and extending it to explore multi-target tracking could be

considered. Both these topics have immense potential in today’s world, for instance, a robot

whose job is to follow a target, such as a patient in the hospital (in order to monitor the

movement of the patient and alert concerned department in case the patient falls down or

needs some other help that the robot cannot provide), or a personal shopping assistant robot

following a shopper (to perhaps direct the shopper to a particular product in an aisle), or a

robot that aids a blind man to navigate on the roads, or a robot which can follow a particular

factory worker, to carry instruments or materials from one place to another and so on.

Important applications for multi-target tracking would be in a sport such as football, where

the coach would like to track the movements of the players and also their interactions and

in a video surveillance scene, where every person has to be tracked.

Putting more focus on the steps leading to object detection and tracking is necessary. In

order to successfully extract target object(s) from the scene, various methods have been

proposed in the literature to build a 2-D model such as the survey in [2] or a 3-D model,

such as the survey in [69]. As the scenes get more complex, the background illumination

might change, or the background might have a similar appearance with the foreground, or

it could have noisier stationary occluders at similar depths as the object(s) to be tracked, so

starting with some relevant work [4-5], research could be conducted on this topic. In the

case of multi-target tracking, it is highly possible that the object(s) will occlude each other,

or might exhibit a complex motion trajectory [70], thus a dynamic motion model which is

able to make accurate predictions about the object(s) positions is desired.

When x-z propagation is used in a particle filtering tracker, it provides some advantages

over the x-y tracking, because the depth information is utilized and thus one can extract

objects at a particular depth range. It must be taken into account that if two objects are observed to be at the same depth range, then either they are beside each other, or one object is occluding the other, with the occluder and the occluded person being at different depths. Thus, based solely on depth information, it is possible to extract the objects by utilizing the pixel depth information in case of occlusion. Secondly, when an extracted object blob comprises two persons beside each other, the information provided by the size of the

bounding box (at that particular depth, which can be established prior to execution) is an

indicator of unsuccessful object extraction and some blob segmentation technique can be

applied to split the blob into several candidate hypotheses. Based on pure depth

information, it doesn’t matter if the two objects have similar appearance or not. On the

contrary, in a 2-D image, extracting occluded objects or objects having similar appearance

is not quite successful.

Moreover, when the motion model uses x-z propagation, it must be understood that the

object can move only a finite distance ahead or behind, unlike in x-y propagation, where

the object has the liberty to take on any random y position. The x position in either case

can be random. Therefore, x-z propagation is definitely going to limit the number of

particles needed for tracking. In fact, one can consider the particles to be the depth extracted

objects and since each depth extraction takes place over a small range, we could possibly

be looking at 20 to 30 samples for tracking. For example, if the object is at z depth 200 cm

from the sensor, then considering the average stride of the person to be 81 cm [71], the

tracker would ideally check a depth range of 200 ± 100 cm, i.e., from 100 cm to 300 cm, for the object; choosing a depth processing interval of 30 cm, this could mean processing only on the order of 10 candidate objects per frame. This would significantly reduce the computational load.

Another interesting direction to consider would be to use a network of RGB-D sensors, or

a network of infrared arrays situated at strategic locations in an environment to facilitate

object tracking even in complex scenarios. A sensor network of cameras has been used in

[8] for tracking a single object and presents interesting research on the trade-off between

using subsets of multiple cameras versus having more prior information about the

occluders. If the environment is very crowded, the target remains occluded most of the

time, and therefore information from multiple cameras might help to resolve occlusion

errors, rather than tracking multiple occluders. In [83], a network of RGB-D cameras has

been used for tracking multiple persons to build an infrastructure for emergency relief

operations. It has been demonstrated that multiple visual cameras have helped localize and

track humans, handling occlusion scenarios well. The ultimate goal could be to develop a

multiview approach using 3-D representations of space.

References

[1] Yilmaz, A., Javed, O. and Shah, M., 2006. Object tracking: A survey. Acm computing
surveys (CSUR), 38(4), p.13.
[2] Li, X., Hu, W., Shen, C., Zhang, Z., Dick, A. and Hengel, A.V.D., 2013. A survey of
appearance models in visual object tracking. ACM transactions on Intelligent Systems and
Technology (TIST), 4(4), p.58.
[3] Wu, Y., Lim, J. and Yang, M.H., 2013. Online object tracking: A benchmark.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp.
2411-2418).
[4] Hare, S., Saffari, A. and Torr, P.H., 2011, November. Struck: Structured output tracking
with kernels. In 2011 International Conference on Computer Vision (pp. 263-270). IEEE.
[5] Dinh, T.B., Vo, N. and Medioni, G., 2011, June. Context tracker: Exploring supporters
and distracters in unconstrained environments. In Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on (pp. 1177-1184). IEEE.
[6] Jia, X., Lu, H. and Yang, M.H., 2012, June. Visual tracking via adaptive structural local
sparse appearance model. In Computer vision and pattern recognition (CVPR), 2012 IEEE
Conference on (pp. 1822-1829). IEEE.
[7] Zhong, W., Lu, H. and Yang, M.H., 2012, June. Robust object tracking via sparsity-
based collaborative model. In Computer vision and pattern recognition (CVPR), 2012
IEEE Conference on (pp. 1838-1845). IEEE.
[8] Ercan, A.O., Gamal, A.E. and Guibas, L.J., 2013. Object tracking in the presence of
occlusions using multiple cameras: A sensor network approach. ACM Transactions on
Sensor Networks (TOSN), 9(2), p.16.
[9] Song, S. and Xiao, J., 2013. Tracking revisited using rgbd camera: Unified benchmark
and baselines. In Proceedings of the IEEE international conference on computer
vision (pp. 233-240).

[10] Tsai, Y.T., Shih, H.C. and Huang, C.L., 2006, August. Multiple human objects
tracking in crowded scenes. In 18th International Conference on Pattern Recognition
(ICPR'06) (Vol. 3, pp. 51-54). IEEE.
[11] Saravanakumar, S., Vadivel, A. and Ahmed, C.S., 2010, December. Multiple human
object tracking using background subtraction and shadow removal techniques. In Signal
and Image Processing (ICSIP), 2010 International Conference on (pp. 79-84). IEEE.
[12] Zoidi, O., Nikolaidis, N. and Pitas, I., 2013, May. Appearance based object tracking
in stereo sequences. In 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing (pp. 2434-2438). IEEE.
[13] Pan, J. and Hu, B., 2007, June. Robust occlusion handling in object tracking. In 2007
IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). IEEE.
[14] Fod, A., Howard, A. and Mataric, M.A.J., 2002. A laser-based people tracker.
In Robotics and Automation, 2002. Proceedings. ICRA'02. IEEE International Conference
on (Vol. 3, pp. 3024-3029). IEEE.
[15] Vu, T.D. and Aycard, O., 2009, May. Laser-based detection and tracking moving
objects using data-driven markov chain monte carlo. In Robotics and Automation, 2009.
ICRA'09. IEEE International Conference on (pp. 3800-3806). IEEE.
[16] Labayrade, R., Perrollaz, M., Gruyer, D. and Aubert, D., 2010. Sensor Data Fusion
for Road Obstacle Detection, Sensor Fusion and its Applications.
[17] Cho, H., Seo, Y.W., Kumar, B.V. and Rajkumar, R.R., 2014, May. A multi-sensor
fusion system for moving object detection and tracking in urban driving environments.
In 2014 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1836-
1843). IEEE.
[18] Kumar, S., Marks, T.K. and Jones, M., 2014. Improving person tracking using an
inexpensive thermal infrared sensor. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops (pp. 217-224).
[19] Cruz, L., Lucio, D. and Velho, L., 2012, August. Kinect and rgbd images: Challenges
and applications. In Graphics, Patterns and Images Tutorials (SIBGRAPI-T), 2012 25th
SIBGRAPI Conference on (pp. 36-49). IEEE.

[20] Spinello, L. and Arras, K.O., 2011, September. People detection in RGB-D data.
In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference
on (pp. 3838-3843). IEEE.
[21] Koo, S., Lee, D. and Kwon, D.S., 2013, November. Multiple object tracking using an
rgb-d camera by hierarchical spatiotemporal data association. In Intelligent Robots and
Systems (IROS), 2013 IEEE/RSJ International Conference on (pp. 1113-1118). IEEE.
[22] Parvizi, E. and Wu, Q.J., 2008, May. Multiple object tracking based on adaptive depth
segmentation. In Computer and Robot Vision, 2008. CRV'08. Canadian Conference on (pp.
273-277). IEEE.
[23] Nakamura, T., 2011, December. Real-time 3-D object tracking using Kinect sensor.
In Robotics and Biomimetics (ROBIO), 2011 IEEE International Conference on (pp. 784-
788). IEEE.
[24] Isard, M. and Blake, A., 1996, April. Contour tracking by stochastic propagation of
conditional density. In European conference on computer vision (pp. 343-356). Springer,
Berlin, Heidelberg.
[25] Blake, A. and Isard, M., 1997. The CONDENSATION algorithm-conditional density
propagation and applications to visual tracking. In Advances in Neural Information
Processing Systems (pp. 361-367).
[26] Comaniciu, D., Ramesh, V. and Meer, P., 2000. Real-time tracking of non-rigid
objects using mean shift. In Computer Vision and Pattern Recognition, 2000. Proceedings.
IEEE Conference on (Vol. 2, pp. 142-149). IEEE.
[27] Arulampalam, M.S., Maskell, S., Gordon, N. and Clapp, T., 2002. A tutorial on
particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on
signal processing, 50(2), pp.174-188.
[28] Spinello, L., Luber, M. and Arras, K.O., 2011, May. Tracking people in 3-D using a
bottom-up top-down detector. In Robotics and Automation (ICRA), 2011 IEEE
International Conference on (pp. 1304-1310). IEEE.

[29] Wedel, A., Franke, U., Klappstein, J., Brox, T. and Cremers, D., 2006, September.
Realtime depth estimation and obstacle detection from monocular video. In Joint Pattern
Recognition Symposium (pp. 475-484). Springer Berlin Heidelberg.
[30] Saxena, A., Sun, M. and Ng, A.Y., 2007, October. Learning 3-d scene structure from
a single still image. In 2007 IEEE 11th International Conference on Computer Vision (pp.
1-8). IEEE.
[31] Liu, B., Gould, S. and Koller, D., 2010, June. Single image depth estimation from
predicted semantic labels. In Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on (pp. 1253-1260). IEEE.
[32] Fang, Y., Masaki, I. and Horn, B., 2002. Depth-based target segmentation for
intelligent vehicles: Fusion of radar and binocular stereo. IEEE transactions on intelligent
transportation systems, 3(3), pp.196-202.
[33] Bauson, W.A., 2010. Integrated Radar-Vision Sensors: the Next Generation of Sensor
Fusion. Available online:
http://www.sae.org/events/gim/presentations/2010/williambauson.pdf
(accessed on 17 December 2017).
[34] Wang, T., Zheng, N., Xin, J. and Ma, Z., 2011. Integrating millimeter wave radar with
a monocular vision sensor for on-road obstacle detection applications. Sensors, 11(9),
pp.8992-9008.
[35] Bertozzi, M., Bombini, L., Cerri, P., Medici, P., Antonello, P.C. and Miglietta, M.,
2008, June. Obstacle detection and classification fusing radar and vision. In Intelligent
Vehicles Symposium, 2008 IEEE (pp. 608-613). IEEE.
[36] Chavez-Garcia, R.O., Burlet, J., Vu, T.D. and Aycard, O., 2012, June. Frontal object
perception using radar and mono-vision. In Intelligent Vehicles Symposium (IV), 2012
IEEE (pp. 159-164). IEEE.
[37] Forsyth, D. and Ponce, J., 2011. Computer Vision: A Modern Approach. 2nd ed. Prentice Hall.
[38] Sivaraman, S. and Trivedi, M.M., 2013, June. A review of recent developments in
vision-based vehicle detection. In Intelligent Vehicles Symposium (pp. 310-315).

[39] Mao, L., Xie, M., Huang, Y. and Zhang, Y., 2010, July. Preceding vehicle detection
using Histograms of Oriented Gradients. In Communications, Circuits and Systems
(ICCCAS), 2010 International Conference on (pp. 354-358). IEEE.
[40] Cao, S., Zheng, Y.F. and Ewing, R.L., 2011, July. Scaling function waveform for
effective side-lobe suppression in radar signal. In Proceedings of the 2011 IEEE National
Aerospace and Electronics Conference (NAECON) (pp. 231-236). IEEE.
[41] Benet, G., Blanes, F., Simó, J.E. and Pérez, P., 2002. Using infrared sensors for
distance measurement in mobile robots. Robotics and autonomous systems, 40(4), pp.255-
266.
[42] Everett, H.R., 1995. Sensors for Mobile Robots. AK Peters, Ltd., Wellesley, MA.
[43] Malik, R. and Yu, H., 1992, August. The infrared detector ring: obstacle detection for
an autonomous mobile robot. In Circuits and Systems, 1992., Proceedings of the 35th
Midwest Symposium on (pp. 76-79). IEEE.
[44] Park, H., Baek, S. and Lee, S., 2005, July. IR sensor array for a mobile robot.
In Proceedings, 2005 IEEE/ASME International Conference on Advanced Intelligent
Mechatronics. (pp. 928-933). IEEE.
[45] Gandhi, D. and Cervera, E., 2003, October. Sensor covering of a robot arm for
collision avoidance. In Systems, Man and Cybernetics, 2003. IEEE International
Conference on (Vol. 5, pp. 4951-4955). IEEE.
[46] Tar, A., Koller, M. and Cserey, G., 2009, April. 3-D geometry reconstruction using
Large Infrared Proximity Array for robotic applications. In Mechatronics, 2009. ICM 2009.
IEEE International Conference on (pp. 1-6). IEEE.
[47] Do, Y. and Kim, J., 2013. Infrared range sensor array for 3-D sensing in robotic
applications. International Journal of Advanced Robotic Systems, 10(4), p.193.
[48] Ryu, D., Um, D., Tanofsky, P., Koh, D.H., Ryu, Y.S. and Kang, S., 2010, October. T-
less: A novel touchless human-machine interface based on infrared proximity sensing.
In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference
on (pp. 5220-5225). IEEE.

[49] Sharma, R., Daniel, H. and Dušek, F., 2014. Sensor fusion: an application to
localization and obstacle avoidance in robotics using multiple ir sensors. In Nostradamus
2014: Prediction, Modeling and Analysis of Complex Systems (pp. 385-392). Springer,
Cham.
[50] Fu, C., Wu, S., Luo, Z., Fan, X. and Meng, F., 2009, December. Research and design
of the differential autonomous mobile robot based on multi-sensor information fusion
technology. In Information Engineering and Computer Science, 2009. ICIECS 2009.
International Conference on (pp. 1-4). IEEE.
[51] Sabatini, A.M., Genovese, V., Guglielmelli, E., Mantuano, A., Ratti, G. and Dario, P.,
1995, August. A low-cost, composite sensor array combining ultrasonic and infrared
proximity sensors. In Intelligent Robots and Systems 95.'Human Robot Interaction and
Cooperative Robots', Proceedings. 1995 IEEE/RSJ International Conference on (Vol. 3,
pp. 120-126). IEEE.
[52] Duan, S., Li, Y., Chen, S., Chen, L., Min, J., Zou, L., Ma, Z. and Ding, J., 2011, June.
Research on obstacle avoidance for mobile robot based on binocular stereo vision and
infrared ranging. In Intelligent Control and Automation (WCICA), 2011 9th World
Congress on (pp. 1024-1028). IEEE.
[53] Zappi, P., Farella, E. and Benini, L., 2008, October. Pyroelectric infrared sensors
based distance estimation. In Sensors, 2008 IEEE (pp. 716-719). IEEE.
[54] Yun, J. and Lee, S.S., 2014. Human movement detection and identification using
pyroelectric infrared sensors. Sensors, 14(5), pp.8057-8081.
[55] Wahl, F., Milenkovic, M. and Amft, O., 2012, December. A distributed PIR-based
approach for estimating people count in office environments. In Computational Science
and Engineering (CSE), 2012 IEEE 15th International Conference on (pp. 640-647).
IEEE.
[56] Kang, J., Gajera, K., Cohen, I. and Medioni, G., 2004, June. Detection and tracking
of moving objects from overlapping EO and IR sensors. In Computer Vision and Pattern
Recognition Workshop, 2004. CVPRW'04. Conference on (pp. 123-123). IEEE.

[57] Hosokawa, T. and Kudo, M., 2005. Person tracking with infrared sensors.
In Knowledge-Based Intelligent Information and Engineering Systems (pp. 907-907).
Springer Berlin/Heidelberg.
[58] Gu, Y. and Veloso, M., 2007, June. Learning Tactic-Based Motion Models of a
Moving Object with Particle Filtering. In Computational Intelligence in Robotics and
Automation, 2007. CIRA 2007. International Symposium on (pp. 1-6). IEEE.
[59] Madrigal, F., Rivera, M. and Hayet, J.B., 2011, November. Learning and regularizing
motion models for enhancing particle filter-based target tracking. In Pacific-Rim
Symposium on Image and Video Technology (pp. 287-298). Springer Berlin Heidelberg.
[60] Erdem, C.E., Sankur, B. and Tekalp, A.M., 2004. Performance measures for video
object segmentation and tracking. IEEE Transactions on Image Processing, 13(7), pp.937-
951.
[61] Piciarelli, C., Foresti, G.L. and Snidaro, L., 2005, September. Trajectory clustering
and its applications for video surveillance. In Advanced Video and Signal Based
Surveillance, 2005. AVSS 2005. IEEE Conference on (pp. 40-45). IEEE.
[62] Biresaw, T.A., Alvarez, M.S. and Regazzoni, C.S., 2011, August. Online failure
detection and correction for Bayesian sparse feature-based object tracking. In Advanced
Video and Signal-Based Surveillance (AVSS), 2011 8th IEEE International Conference
on (pp. 320-324). IEEE.
[63] Jaward, M., Mihaylova, L., Canagarajah, N. and Bull, D., 2006, March. Multiple
object tracking using particle filters. In 2006 IEEE Aerospace Conference (pp. 8-pp).
IEEE.
[64] Liu, J.S. and Chen, R., 1998. Sequential Monte Carlo methods for dynamic
systems. Journal of the American statistical association, 93(443), pp.1032-1044.
[65] Nummiaro, K., Koller-Meier, E. and Van Gool, L., 2002, September. Object tracking
with an adaptive color-based particle filter. In Joint Pattern Recognition Symposium (pp.
353-360). Springer Berlin Heidelberg.

[66] Dalal, N. and Triggs, B., 2005, June. Histograms of oriented gradients for human
detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR'05) (Vol. 1, pp. 886-893). IEEE.
[67] Wei, X., Phung, S.L. and Bouzerdoum, A., 2014. Object segmentation and
classification using 3-D range camera. Journal of Visual Communication and Image
Representation, 25(1), pp.74-85.
[68] Terven, J.R. and Cordova, D.M., 2016. A Kinect 2 Toolbox for MATLAB. https://github.com/jrterven/Kin2.
[69] Li, H., Liu, X., Cai, Q. and Du, J., 2015. 3-D Objects Feature Extraction and Its
Applications: A Survey. In Transactions on Edutainment XI (pp. 3-18). Springer Berlin
Heidelberg.
[70] Morris, B.T. and Trivedi, M.M., 2008. A survey of vision-based trajectory learning
and analysis for surveillance. IEEE transactions on circuits and systems for video
technology, 18(8), pp.1114-1127.
[71] http://www.livestrong.com/article/438170-the-average-walking-stride-length/
(accessed on 17 December 2017).
[72] Girshick, R.B., Felzenszwalb, P.F. and Mcallester, D.A., 2011. Object detection with
grammar models. In Advances in Neural Information Processing Systems (pp. 442-450).
[73] Wang, X., Han, T.X. and Yan, S., 2009, September. An HOG-LBP human detector
with partial occlusion handling. In Computer Vision, 2009 IEEE 12th International
Conference on(pp. 32-39). IEEE.
[74] Li, B., Wu, T. and Zhu, S.C., 2014, September. Integrating context and occlusion for
car detection by hierarchical and-or model. In European Conference on Computer
Vision (pp. 652-667). Springer, Cham.
[75] Pepikj, B., Stark, M., Gehler, P. and Schiele, B., 2013. Occlusion patterns for object
class detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 3286-3293).
[76] Tang, S., Andriluka, M. and Schiele, B., 2014. Detection and tracking of occluded
people. International Journal of Computer Vision, 110(1), pp.58-69.

[77] Chen, G., Ding, Y., Xiao, J. and Han, T.X., 2013. Detection evolution with multi-
order contextual co-occurrence. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (pp. 1798-1805).
[78] Ouyang, W. and Wang, X., 2013. Single-pedestrian detection aided by multi-
pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (pp. 3198-3205).
[79] Tao, J., Enzweiler, M., Franke, U., Pfeiffer, D. and Klette, R., 2015, September. What
is in front? multiple-object detection and tracking with dynamic occlusion handling. In
International Conference on Computer Analysis of Images and Patterns (pp. 14-26).
Springer International Publishing.
[80] Camplani, M., Hannuna, S.L., Mirmehdi, M., Damen, D., Paiement, A., Tao, L. and
Burghardt, T., 2015, September. Real-time RGB-D Tracking with Depth Scaling
Kernelised Correlation Filters and Occlusion Handling. In BMVC (pp. 145-1).
[81] Benou, A., Benou, I. and Hagage, R., 2014, December. Occlusion handling method
for object tracking using RGB-D data. In Electrical & Electronics Engineers in Israel
(IEEEI), 2014 IEEE 28th Convention of (pp. 1-5). IEEE.
[82] Song, S. and Xiao, J., 2013. Tracking revisited using RGBD camera: Unified
benchmark and baselines. In Proceedings of the IEEE international conference on
computer vision (pp. 233-240).
[83] Galanakis, G., Zabulis, X., Koutlemanis, P., Paparoulis, S. and Kouroumalis, V., 2014,
May. Tracking persons using a network of RGBD cameras. In Proceedings of the 7th
International Conference on PErvasive Technologies Related to Assistive
Environments (p. 63). ACM.

