
Proceedings of the 2009 IEEE

International Conference on Robotics and Biomimetics


December 19-23, 2009, Guilin, China

Robust Face Tracking Based on Active Stereo Camera System


Yuzhang Gu, Makoto Sato, and Xiaolin Zhang

Abstract—This study presents a robust face tracking method based on an active stereo camera system. The system is a stereo vision system that can hold its gaze point on the face of the visual target while tracking it. When the system fixes its gaze point on the face exactly, the face can be discriminated from the background by binocular disparity. To keep the gaze point of the system on the face of a moving person, a binocular motor control model based on human binocular neural pathways is proposed. This face tracking method based on an active stereo camera system can detect the face being tracked in a complex scene, even when the face changes its appearance as it rotates. The system can be applied to surveillance applications even under complex scene conditions.

I. INTRODUCTION

THIS study presents a robust face tracking method based on an active stereo camera system. Face tracking is very useful in surveillance applications because it makes it possible to get detailed information about the person being tracked with a Pan-Tilt-Zoom camera.

It is still a big challenge to track a face in a complex scene. In general, owing to variations in illumination conditions and changes in the face's position and pose, the face keeps changing its appearance in the image sequences captured by the camera. Appearance variation makes it difficult to track the face robustly because the image features of the face may disappear.

Many object tracking algorithms have been developed over the last several decades. These algorithms can be separated into two categories: gradient-based optical flow approaches and region-based matching approaches. Optical flow methods are characterized by their low computational cost [1][2]. However, when the background is noisy, optical flow becomes unstable.

The mean-shift algorithm uses a color histogram to describe an object [3]. It can track faces with pose changes because skin color is largely invariant to viewpoint changes. However, if the color of the background surrounding the object is similar to that of the object, this algorithm performs poorly.

The active contour algorithm can represent the precise target shape according to the target movement [4]. The performance of this algorithm is good when tracking non-rigid objects. However, it requires that the object contour be defined or trained before tracking.

Template matching, the best-known object tracking algorithm, searches for the block in the input image that is most similar to a template of the object registered in advance [5][6]. Because a fixed template is used, it fails when the appearance of the object changes. Updatable template matching algorithms can solve this problem to some extent. Several updatable template algorithms have been proposed [7][8][9], but at each template update some background pixels are interfused into the template, which causes the template to drift away from the object.

3D position information is expected to prevent the background from being mistaken for a part of the template. Usually, a parallel stereo vision system can only provide limited depth resolution within a close range, since it needs wide-angle lens cameras to create a wide overlapping field of view. On the contrary, an active stereo camera system can adjust its gaze point to a specified object which needs to be observed, so that telephoto lenses can be used to provide higher depth resolution around the target object.

An algorithm named Zero Disparity Filter (ZDF) for active stereo camera systems was proposed in [10]. The ZDF is a way to restrict the input from a binocular vision system to the horopter. The horopter is the surface in physical space, any point of which produces images in the two cameras that stimulate exactly corresponding points [11]. Changing camera vergence sweeps the horopter in or out through space. Implementing the ZDF in practice involves finding areas of the images that have no binocular disparity and permitting these regions of the image to pass through the filter. The algorithm has several drawbacks that make it hard to implement in applications at a practical level.

In section II, the original ZDF is reviewed and its shortcomings are discussed. In section III, a new ZDF based on the Scale Invariant Feature Transform (SIFT) is proposed, which improves the performance of the ZDF by using a more elaborate feature extractor. In section IV, a binocular motor control model is proposed to keep the gaze point of the active stereo camera system on the target face during tracking, which is a necessary condition of the algorithm proposed in section III. An experimental system implementing the proposed tracking algorithm is developed and described in section V. Finally, face tracking results are presented in section VI.

Manuscript received July 31, 2009. This work was supported in part by the Japan Science and Technology Agency under grant number 1907.
Yuzhang Gu is with the Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 JAPAN (phone: +81-45-924-5069; fax: +81-45-924-5069; e-mail: gu.y.ab@m.titech.ac.jp).
Makoto Sato is with the Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 JAPAN (e-mail: msato@pi.titech.ac.jp).
Xiaolin Zhang is with the Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 JAPAN (e-mail: zhang@pi.titech.ac.jp).
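Fixed-template matching as surveyed in the introduction can be sketched with normalized cross-correlation. Everything below (the toy image, the blob template, the function name) is illustrative; in practice a library routine such as OpenCV's matchTemplate would be used instead of this brute-force search:

```python
import numpy as np

def match_template_ncc(image, template):
    """Return (row, col) of the best normalized cross-correlation match.
    A brute-force sketch of fixed-template matching; a fixed template
    like this drifts or fails once the object's appearance changes."""
    th, tw = template.shape
    t = template - template.mean()
    best, best_pos = -2.0, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (t * t).sum())
            score = (p * t).sum() / denom if denom > 0 else 0.0
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos

# Toy example: a bright 3x3 blob hidden at (4, 5) in a noisy image.
rng = np.random.default_rng(0)
img = rng.normal(0.0, 0.1, (12, 12))
tmpl = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]], float)
img[4:7, 5:8] += tmpl
print(match_template_ncc(img, tmpl))  # -> (4, 5)
```

The search finds the blob despite the noise, but only because the template and the object still look alike; this is exactly the assumption that breaks under the appearance changes discussed above.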

978-1-4244-4775-6/09/$25.00 © 2009 IEEE. 2307


II. ZERO DISPARITY FILTER OVERVIEW

Consider that the two cameras of the active stereo camera system are horizontally displaced and share a common tilt angle. Any setting of their two pan angles induces a point of fixation in 3D space where the camera axes intersect in the tilt plane. The 2D shape of the horopter in the tilt plane is the circle passing through the two nodal points of the cameras and the fixation point, as shown in Fig. 1(a). The ZDF is expected to be a robust method for distinguishing the visual target from the background scene. As shown in Fig. 1(b), the object being gazed at necessarily lies on the horopter, so its binocular disparity should be zero or approximately zero. Conversely, objects far from the horopter generate large binocular disparity. As a result, it is possible to eliminate distracting features from the background when updating the template of the target object becomes necessary due to an appearance change of the target object.

Fig. 1. Horopter circle and binocular disparity: (a) the horopter circle through the nodal points of Camera L and Camera R; (b) the images in Camera L and Camera R.

The ZDF proposed by Kaenel et al. is a simple non-linear image filter that suppresses features that have non-zero stereo disparity [10]. The features it uses are vertical edges, since they give useful information about horizontal disparity. It applies a Sobel vertical edge operator to each stereo image as a feature detector. The two edge images are then compared pixel by pixel: if an edge of similar contrast is present at corresponding locations in each image, then an edge of similar strength will appear at that location in the resulting zero-disparity image. This algorithm has several drawbacks. First, if two edges happen to occupy the same place in each image, an edge will appear in the filter output even if they arise from different objects. Second, edges at the particular resolution detected by the edge operator may not always be the appropriate features for matching. Finally, thresholding to detect edge features is an inexact way of eliminating noise. By applying the ZDF at various resolutions and using richer matching and feature quality measures, its performance can be improved to some extent, but it can hardly be implemented at a practical level in applications with cluttered environments such as public area surveillance.

III. ZDF BASED ON SIFT

A feasible approach to improving the performance of the ZDF is to use a more elaborate feature extractor than the Sobel filter used in Kaenel's ZDF. Lowe proposed an object recognition system that uses a new class of local scale-invariant features [12]. SIFT features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. Features are detected through a staged filtering approach that identifies stable points in scale space. Feature mismatching can be reduced significantly by using this kind of feature, which makes the output of a ZDF based on it more reliable and stable.

An example of the ZDF based on SIFT is shown in Fig. 2. The lens axes of the active stereo camera system have been pointed at the same point on the face. All marks shown in the stereo images are SIFT key points extracted from them. Green points represent key points with binocular disparity less than 2 pixels, and red points represent key points with binocular disparity more than 2 pixels. Obviously, the ZDF based on SIFT succeeded in discriminating the visual target being tracked from the background.

Fig. 2. An example of ZDF based on SIFT.
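The disparity test behind this example can be sketched as follows. The keypoint and descriptor arrays are assumed inputs standing in for the output of a SIFT extractor (e.g. OpenCV's cv2.SIFT_create()); the filter itself simply keeps the features whose best match in the other image lies at nearly the same position:

```python
import numpy as np

def zdf_sift(kp_left, desc_left, kp_right, desc_right, max_disparity=2.0):
    """Zero Disparity Filter over SIFT-style features (a sketch).

    kp_* are (N, 2) arrays of (x, y) keypoint positions; desc_* are
    (N, D) descriptor arrays, assumed to come from a SIFT extractor.
    Returns indices of left keypoints whose nearest-descriptor match in
    the right image sits at nearly the same image position, i.e. whose
    binocular disparity is near zero (the object on the horopter).
    """
    on_target = []
    for i, d in enumerate(desc_left):
        # Nearest-neighbour match by descriptor distance.
        dists = np.linalg.norm(desc_right - d, axis=1)
        j = int(np.argmin(dists))
        dx = kp_left[i, 0] - kp_right[j, 0]
        dy = kp_left[i, 1] - kp_right[j, 1]
        # The gazed object lies on the horopter: disparity ~ 0.
        if abs(dx) < max_disparity and abs(dy) < max_disparity:
            on_target.append(i)
    return on_target

# Toy data: features 0 and 1 lie on the gazed target (nearly identical
# positions in both images); feature 2 is background with a 10-pixel
# disparity and is rejected.
kp_l = np.array([[100.0, 50.0], [120.0, 60.0], [200.0, 80.0]])
kp_r = np.array([[100.5, 50.0], [119.8, 60.1], [210.0, 80.0]])
desc = np.eye(3)  # distinctive unit descriptors, matched one-to-one
print(zdf_sift(kp_l, desc, kp_r, desc))  # -> [0, 1]
```

The 2-pixel threshold mirrors the one used in Fig. 2; section III below relates this threshold to the system geometry.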
Implementing the ZDF based on SIFT is a two-stage process. First, keep the gaze point of the active stereo camera system on the moving object continuously. Second, extract SIFT key points and estimate which points belong to the target object through a binocular disparity threshold.

Conventionally, an active stereo camera system is treated as two independent monocular vision systems, each controlled by an independent PID feedback loop. However, this approach can hardly guarantee that the gaze point stays on the target object stably. A binocular motor control model to resolve this problem is discussed in section IV.

The binocular disparity of an object represents its depth distance from the gaze point. The value of the binocular disparity threshold should be decided based on the baseline length of the active stereo camera system, the pixel spatial resolution of the cameras, and the distance between the target object and the camera system.

The coordinate system in the vergence plane is shown in Fig. 3. Denote z_T as the distance between the target and the system, l as the length of the baseline, α as the vergence angle, and φ_l (φ_r) as the pan angle of the left (right) camera from the zero line to the target.

Fig. 3. Coordinate system in the vergence plane.

Obviously

  α = φ_l + φ_r    (1)

  z_T = l / (tan φ_l + tan φ_r)    (2)

If z_T is much bigger than l, then tan φ ≈ φ, so

  z_T = l / α    (3)

By taking the derivative of (3), (4) can be obtained:

  dz_T = (1/α) dl − (l/α²) dα = (z_T/l) dl − (z_T²/l) dα    (4)

Obviously, given the length of the baseline l, the distance of the target object z_T, the pixel spatial resolution, and the acceptable distance between the target object and the background dz_T, the value of the binocular disparity threshold can be calculated.

IV. BINOCULAR MOTOR CONTROL MODEL

Active camera movements include horizontal rotation and vertical rotation. Since the two cameras are arranged horizontally, to keep the gaze point on a moving object the two cameras have to perform horizontal conjugate movement or vergence movement from time to time.

As mentioned in section III, an active stereo camera system is usually treated as two independent monocular vision systems. Making them work properly will yield vergence movement in appearance. Unfortunately, independent PID feedback control loops cannot guarantee that the gaze point stays on the target object continuously, which is a necessary condition for the ZDF based on SIFT.

It is well known that the ocular motor control system of primates is an effective device for capturing an object in the central pits of the retinas of both eyes. Human eyes always orient their lines of sight to keep the image of the object from leaving the central pit. In fact, it is very difficult for a human to direct the lines of sight of the two eyes onto different objects. By investigating the eye movement control neural system, we identified the cooperative eye movement mechanism, which provides a new approach to obtaining cooperative camera movements other than image processing methods.

The mathematical model of the human horizontal binocular motor neural pathways is shown in Fig. 4, where σ, ν, μ, κ, ρ1, ρ2 denote the gains of the blocks of the model [13].

Fig. 4. Mathematical model of horizontal binocular motor neural pathways.

As shown in Fig. 4, denote φ_l (φ_r) as the pan angle of the left (right) camera from the zero line to the target, and E_l (E_r) as the desired pan angle of the left (right) camera. Accordingly
  [E_l(s); E_r(s)] = (1/2) G_c(s) [φ_l(s)+φ_r(s); φ_l(s)+φ_r(s)]
                   + (1/2) G_v(s) [φ_l(s)−φ_r(s); φ_r(s)−φ_l(s)]    (5)

where

  G_c(s) = [(σ+ν) + (μ+κ)s] / {(ρ1+ρ2)[(σ+ν) + (μ+κ)s] + s}
  G_v(s) = [(σ−ν) + (μ−κ)s] / {(ρ1−ρ2)[(σ−ν) + (μ−κ)s] + s}

As shown in (5), the first part of the right-hand side is the transfer function of conjugate movement, and the second part is the transfer function of vergence movement. Therefore, the time constant of the conjugate transfer function Tc and that of the vergence transfer function Tv can be obtained, as shown in (6) and (7):

  Tc = [1 + (ρ1+ρ2)(μ+κ)] / [(ρ1+ρ2)(σ+ν)]    (6)

  Tv = [1 + (ρ1−ρ2)(μ−κ)] / [(ρ1−ρ2)(σ−ν)]    (7)

To make sure that the system is stable, the following condition (8) should be met:

  σ > ν,  μ > κ,  ρ1 > ρ2    (8)

Condition (8) makes Tc < Tv [13]. In other words, in a stable system the time constant of vergence movement is longer than that of conjugate movement.

This conclusion accords with normal physiological phenomena. When the visual target moves approximately in a plane perpendicular to the line of sight, its image on the retina moves quickly; the short conjugate time constant enables the eye rotation to respond quickly enough for smooth pursuit. When the visual target moves along the line of sight, its image on the retina moves slowly; even though the time constant of vergence movement is long, the two eyes can still follow the visual target.

A long vergence time constant stabilizes the gaze point. The values of the gains shown in Fig. 4 can be decided according to the following steps. First, set μ and κ to zero, which simplifies the gain setting without affecting the signal crossover between the two eyes of the system. Second, regulate ρ1 and ρ2 so that ρ1 + ρ2 = 1. Third, according to (6), set the values of σ and ν to yield a small Tc. Finally, according to (7), set the values of ρ1 and ρ2 to yield a large Tv.

In this study, we set σ=50, ν=0.1, μ=0, κ=0, ρ1=0.51, ρ2=0.49. Accordingly, we get Tc=0.022 s and Tv=1 s. In section VI.B, an experimental result comparing the gaze point control performance of the proposed binocular motor control model and traditional independent control loops is given.

V. EXPERIMENT SYSTEM

A. Mechanism of the Active Stereo Camera System

As shown in Fig. 5, the proposed active stereo camera system is equipped with five monochrome cameras and four motors which provide four degrees of freedom (DOF). Two cameras with different lenses are mounted in each of the two camera sockets. Camera 1 and camera 3 are equipped with telephoto lenses; camera 2 and camera 4 are equipped with wide-angle lenses. The cameras with telephoto lenses provide high-resolution images of the target object at a distance. Camera 5 is equipped with a super-wide-angle lens to monitor the whole area. Except for camera 5, which is fixed on the foundation of the system, the cameras tilt up and down together on a common platform. There are three pan motors: two for the camera sockets separately, and one for the platform.

Fig. 5. Picture of the active stereo camera system.

The length of the baseline between the two camera sockets, l, is about 200 mm, which is expected to be much shorter than the distance between the system and the target object. That means there is only a minor difference in the appearance of the target object between the stereo images.

Table I shows the specifications of the cameras. If we model the human head as a ball with a diameter of 250 mm, the face image of a person at a distance of 50 m captured by a telephoto lens camera should be a circle about 80 pixels in diameter, which is big enough to be recognized by an operator or face recognition software.

TABLE I
SPECIFICATION OF CAMERAS

                          Camera 1, 3   Camera 2, 4   Camera 5
Resolution [pixel]        1024×768      640×480       1600×1200
Hor. angle of view [deg]  3.67          21.73         180
Ver. angle of view [deg]  2.75          16.38         180

We chose Maxon DC motors to drive the plant of the system. After fine tuning, the motors achieve excellent control performance; the cameras can rotate at a speed exceeding 720 deg/s.

To achieve highly accurate motor control, high-precision two-phase encoders and high-ratio reduction gears are used. The theoretical angular resolution of the encoders is up to 7.5e-4 deg/count.

We chose the Camera Link interface to transfer video data from the cameras to the host PC. Camera Link is a serial

communication protocol designed for computer vision applications, based on the National Semiconductor Channel Link interface. At the maximum chipset operating frequency, the base configuration yields a video data throughput of 255 MB/s. This ensures enough bandwidth to transfer video data from the three kinds of cameras used in the system, which need 27.6 MB/s, 23.6 MB/s, and 57.6 MB/s respectively. Three Dalsa Xcelera-CL PX4 Dual frame grabbers are used to obtain data from the five cameras. The Xcelera-CL PX4 Dual is a highly versatile PCI Express frame grabber capable of acquiring images from two independent Camera Link base configuration cameras. Its point-to-point PCI Express host interface allows simultaneous image acquisition and transfer without loading the system bus and with little intervention from the host CPU.

B. Tracking Algorithm

The flowchart of the face tracking algorithm is shown in Fig. 6.

Fig. 6. Flowchart of the face tracking algorithm (Begin → detect face → if a face is detected, gaze on the face and register it as the tracking template → if SIFT matching succeeds, smooth-pursue the target object; otherwise, try to detect the target with the ZDF based on SIFT and update the tracking template).

There are three steps of image processing: face detection, SIFT template matching, and the ZDF based on SIFT.
1) Before tracking can be carried out, the system keeps searching for an object to be tracked. In this system, the face detection algorithm proposed by Viola et al. [14] is in charge of detecting the target object. Once a face is detected, all of the active cameras are driven to gaze at the face, and the face is registered as the tracking template for the subsequent processing.
2) The original SIFT algorithm [12] is used to detect the target object while its appearance change remains sustainable. As mentioned in [12], SIFT can be used as an object detector. The object models of SIFT are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60-degree rotation away from the camera, or to allow up to a 20-degree rotation of a 3D object.
3) When the appearance change of the target object is too big to be matched with the template, the ZDF based on SIFT is used to create a new template near the location where the visual target disappeared. The new template is registered as the tracking template for the following processing.
Steps 2) and 3) are repeated until the target object can no longer be detected at all.

VI. EXPERIMENTS

A. Head Rotation vs. Number of SIFT Key Points

A sufficient number of SIFT key points is necessary for the SIFT matching algorithm. In this experiment, we verify whether enough SIFT key points belonging to the target object can be detected by the ZDF based on SIFT.

A person sits in a swivel chair about 3.3 m in front of the active stereo camera system. Images from Camera 2 and Camera 4 are used, and the image of the person's frontal face is a rectangle of about 100×100 pixels. The swivel chair is rotated from 0 degrees to 90 degrees at intervals of 10 degrees, and the number of SIFT key points with binocular disparity less than 2 pixels is recorded. The relation between the head rotation angle and the number of SIFT key points detected by the ZDF based on SIFT is shown in Fig. 7.

Fig. 7. Head rotation angle vs. number of SIFT key points detected.

Obviously, the number of SIFT key points of the template remains at a high level over the range from -90 degrees to 90 degrees of head rotation.
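The three-step procedure of section V.B can be sketched as a control loop. All helper names below (detect_face, match_sift, zdf_sift, drive_gaze, and the scripted frame strings) are hypothetical stand-ins for the Viola-Jones detector [14], SIFT template matching [12], the proposed ZDF based on SIFT, and the binocular motor control model:

```python
# Sketch of the tracking loop from Fig. 6, with the image processing
# replaced by injectable stand-in functions so the control flow can be
# exercised on its own.

def track(frames, detect_face, match_sift, zdf_sift, drive_gaze):
    template, log = None, []
    for frame in frames:
        if template is None:
            face = detect_face(frame)        # step 1: search for a face
            if face is not None:
                drive_gaze(face)             # gaze on the detected face
                template, event = face, "register"
            else:
                event = "search"
        elif match_sift(frame, template):    # step 2: pursue by SIFT match
            drive_gaze(template)
            event = "pursue"
        else:                                # step 3: re-acquire via ZDF
            face = zdf_sift(frame)
            if face is not None:
                template, event = face, "update"
            else:
                template, event = None, "lost"
        log.append(event)
    return log

# Toy run: a face appears in frame 2, is pursued, changes appearance in
# frame 4 (matching fails, the ZDF supplies a new template), then is
# pursued again with the updated template.
frames = ["empty", "faceA", "faceA", "faceB", "faceB"]
log = track(
    frames,
    detect_face=lambda f: f if f.startswith("face") else None,
    match_sift=lambda f, t: f == t,
    zdf_sift=lambda f: f if f.startswith("face") else None,
    drive_gaze=lambda target: None,
)
print(log)  # -> ['search', 'register', 'pursue', 'update', 'pursue']
```

The "update" event corresponds to the template re-registrations observed in the tracking experiment of section VI.C.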

B. Gaze Point Control

In this experiment, a stationary visual target is placed in front of the active stereo camera system, and the system is commanded to fix its gaze point on the visual target. Two sets of experiments are carried out: first, the system is controlled by traditional independent PID feedback loops; next, it is controlled by the proposed binocular motor control model. The distance between the gaze point and the system is recorded and shown in Fig. 8.

Fig. 8. Comparison experiments of gaze point control.

In Fig. 8, the red curve denotes the distance between the active stereo camera system and its gaze point measured by the system with the proposed binocular motor control model, and the blue curve denotes the distance measured by the system with the traditional independent PID feedback control model. Obviously, the stability of gaze point control is improved dramatically by implementing the binocular motor control model, which guarantees that a stable ZDF based on SIFT can be carried out while tracking a moving object.
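The gap between the two controllers reflects the time constants designed in section IV. A quick numerical check, assuming the time-constant forms Tc = [1+(ρ1+ρ2)(μ+κ)]/[(ρ1+ρ2)(σ+ν)] and Tv = [1+(ρ1−ρ2)(μ−κ)]/[(ρ1−ρ2)(σ−ν)] reconstructed here from (6) and (7), with the gains chosen in this study:

```python
# Conjugate and vergence time constants of the binocular motor control
# model, computed from the gains used in this study. The closed-form
# expressions for Tc and Tv are a reconstruction (an assumption; the
# exact derivation is in [13]).
sigma, nu = 50.0, 0.1    # sigma > nu
mu, kappa = 0.0, 0.0     # set to zero in the first gain-setting step
rho1, rho2 = 0.51, 0.49  # rho1 > rho2, rho1 + rho2 = 1

Tc = (1 + (rho1 + rho2) * (mu + kappa)) / ((rho1 + rho2) * (sigma + nu))
Tv = (1 + (rho1 - rho2) * (mu - kappa)) / ((rho1 - rho2) * (sigma - nu))

# Fast conjugate response (smooth pursuit across the line of sight),
# slow vergence response (stable gaze depth along the line of sight).
print(f"Tc = {Tc:.3f} s, Tv = {Tv:.2f} s")  # -> Tc = 0.020 s, Tv = 1.00 s
assert Tc < Tv
```

Under these expressions Tc comes out near the 0.022 s reported in section IV and Tv near 1 s, consistent with the stable gaze point seen with the proposed model.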
C. Face Tracking Experiment

In this experiment, a person is asked to walk back and forth in front of the active stereo system, and the system is commanded to detect the face and track it.

Fig. 9(a) shows some frames of the tracking experiment image sequences. The person walks back and forth in front of the system at a distance of about 4 to 5 meters. The background is an ordinary laboratory room with a poster on the wall. The tracking procedure lasts about 30 seconds, and 1000 pairs of stereo images are recorded. During the tracking procedure, the system successfully kept its gaze point on the target object, which ensured that the system could update the SIFT template correctly when notable changes in the person's appearance occurred.

Fig. 9(b) shows the SIFT templates updated during the tracking experiment. Template updating occurred eight times during the tracking procedure, at frames 224, 271, 352, 366, 386, 415, 651, and 800 respectively. As shown in Fig. 9(b), green points are SIFT key points with binocular disparity less than 2 pixels, red points are SIFT key points with binocular disparity more than 2 pixels, and yellow polygons are convex hulls of the sets of green points.

Fig. 9. Face tracking experiment: (a) left and right images at frames 1, 300, and 1000; (b) updated templates at frames 224, 271, 352, 366, 386, 415, 651, and 800.

VII. CONCLUSION

We proposed an active stereo camera system which is expected to track a face robustly in scenes with complex backgrounds. The main contributions of this work are:
1) A robust updatable template matching algorithm named ZDF based on SIFT is proposed. This algorithm makes it possible to eliminate distracting features from the background when template updating is necessary due to an appearance change of the target object.
2) A novel gaze point control method is proposed. This method prevents the gaze point from leaving the target object, which is crucial for the ZDF based on SIFT algorithm.

A face tracking experiment has been carried out in an indoor environment. In the future, outdoor experiments tracking faces at a distance should be carried out to evaluate the system's performance.

ACKNOWLEDGMENT

This work was supported in part by the Japan Science and Technology Agency under grant number 1907.

REFERENCES
[1] B. Horn and B. Schunck, “Determining optical flow,” Artificial
Intelligence, vol. 17, pp. 185-203, 1981.
[2] H. Nagel, "Displacement vectors derived from second-order intensity variations in image sequences," Computer Graphics and Image Processing, vol. 21, pp. 85-117, 1983.
[3] D. Comaniciu, V.Ramesh, and P. Meer, “Real-time tracking of
non-rigid objects using mean shift”, in Proc. IEEE Conf. on Computer
Vision and Pattern Recognition, vol. 2, pp. 142-149, 2000.
[4] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active contour
models,” Int. J. of Computer Vision, vol. 1, pp. 321-332, 1988.
[5] H. D. Crane and C. M. Steele, “Translation-tolerant mask matching
using noncoherent reflective optics,” in Pattern Recognition, vol. 1,
issue 2, pp. 129-136, Nov. 1968.
[6] J. Martin and J. L. Crowley, “Comparison of correlation techniques,” in
Intelligent Autonomous Systems, pp. 86-93, 1995.
[7] B. Li and R. Chellappa, “Simultaneous tracking and verification via
sequential posterior estimation,” in Proc. IEEE Conf. Computer Vision
and Pattern Recognition, vol. 2, pp. 110-117, 2000.
[8] M. Black and Y. Yacoob, “Recognizing facial expressions in image
sequences using local parameterized models of image motion,” Int. J. of
Computer Vision, vol. 25, no. 1, pp 23-48, 1997.
[9] T. Kaneko and O. Hori, “Update criterion of image template for visual
tracking using template matching,” Trans. of the Institute of Electronics,
Information and Communication Engineers, vol. J88-D-II, no. 8,
pp. 1378-1388, 2005.
[10] P. Kaenel, C. Brown, and D. Coombs, “Detecting region of zero
disparity in binocular images,” Technical Report 388, Computer
Science Department, University of Rochester, 1991.
[11] R. Reading, Binocular Vision: Foundations and Applications,
Butterworth, Boston, 1983, p. 88.
[12] D. Lowe, “Object recognition from local scale-invariant features,” in
Proc. 7th IEEE Int. Conf. on Computer Vision, vol. 2, pp. 1150-1157, 1999.
[13] X. Zhang and H. Wakamatsu, "Mathematical model for binocular
movements mechanism and construction of eye axes control system,” J.
of the Robotics Society of Japan, vol. 20, no. 1, pp. 89-97, 2002.
[14] P. Viola and M. Jones, "Rapid Object Detection using a Boosted
Cascade of Simple Features,” in Proc. of IEEE Computer Society Conf.
on Computer Vision and Pattern Recognition, vol. 1, pp. 511-518,
2001.

