Robust Face Tracking Based On Active Stereo
Abstract—This study presents a robust face tracking method based on an active stereo camera system. The system is a stereo vision system that can hold its gaze point on the face of the visual target while tracking it. When the system fixes its gaze point on the face exactly, the face can be discriminated from the background by binocular disparity. To keep the gaze point of the system on the face of a moving person, a binocular motor control model based on the human binocular neural pathways is proposed. The proposed face tracking method can detect the tracked face in a complex scene, even when the face changes its appearance as it rotates. The system can be applied to surveillance applications even under complex scene conditions.

Manuscript received July 31, 2009. This work was supported in part by the Japan Science and Technology Agency under grant number 1907.
Yuzhang Gu is with the Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 JAPAN (phone: +81-45-924-5069; fax: +81-45-924-5069; e-mail: gu.y.ab@m.titech.ac.jp).
Makoto Sato is with the Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 JAPAN (e-mail: msato@pi.titech.ac.jp).
Xiaolin Zhang is with the Precision and Intelligence Laboratory, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503 JAPAN (e-mail: zhang@pi.titech.ac.jp).

I. INTRODUCTION

THIS study presents a robust face tracking method based on an active stereo camera system. Face tracking is very useful in surveillance applications, as it makes it possible to obtain detailed information about the person being tracked with a Pan-Tilt-Zoom camera.

It is still a big challenge to track a face in a complex scene. In general, owing to variations in illumination conditions and changes in the face's position and pose, the face keeps changing its appearance in the image sequences captured by the camera. Appearance variation makes it difficult to track the face robustly, because image features of the face will disappear.

Many object tracking algorithms have been developed over the last several decades. They can be separated into two categories: gradient-based optical flow approaches and region-based matching approaches. Optical flow methods are characterized by their low computational cost [1][2]. However, when the background is noisy, optical flow becomes unstable.

The mean-shift algorithm uses a color histogram to describe an object [3]. It can track faces with pose changes, because skin color is largely invariant to viewpoint changes. However, if the color of the background surrounding the object is similar to the color of the object, this algorithm performs poorly.

The active contour algorithm can represent the precise target shape as the target moves [4]. It performs well when tracking non-rigid objects, but it requires that the object contour be defined or trained before tracking.

Template matching, the best-known object tracking algorithm, searches for the block in the input image that is most similar to a template of the object registered in advance [5][6]. Because a fixed template is used, it fails when the appearance of the object changes. Updatable template matching algorithms can solve this problem to some extent. Several updatable template algorithms have been proposed [7][8][9], but at each template update some background pixels are mixed into the template, which causes the template to drift away from the object.

3D position information is expected to prevent the background from being mistaken for part of the template. Usually, a parallel stereo vision system can only provide limited depth resolution within a close range, since it needs wide-angle lens cameras to create a wide overlapping field of view. In contrast, an active stereo camera system can adjust its gaze point to a specified object that needs to be observed, so that telephoto lenses can be used to provide higher depth resolution around the target object.

An algorithm named the Zero Disparity Filter (ZDF) for active stereo camera systems was proposed in [10]. The ZDF is a way to restrict the input from a binocular vision system to the horopter. The horopter is the surface in physical space any point of which produces images in the two cameras that stimulate exactly corresponding points [11]. Changing the camera vergence sweeps the horopter in or out through space. Implementing the ZDF in practice involves finding areas of the images that have no binocular disparity and permitting those regions of the image to pass through the filter. Several drawbacks of this algorithm make it hard to implement in applications at a practical level.

In section II, the original ZDF is reviewed and its shortcomings are discussed. In section III, a new ZDF based on the Scale Invariant Feature Transform (SIFT) is proposed, which improves the performance of the ZDF by using a more elaborate feature extractor. In section IV, a binocular motor control model is proposed to keep the gaze point of the active stereo camera system on the target face during tracking, which is a necessary condition for the algorithm proposed in section III. An experimental system implementing the proposed tracking algorithm is described in section V. Finally, face tracking results are presented in section VI.
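As a concrete aside, the fixed-template matching scheme discussed above can be sketched in a few lines. The example below is an illustrative sum-of-squared-differences (SSD) search over a synthetic image; it is not the matching method used in this paper, and the image and template are made up for demonstration:

```python
import numpy as np

def match_template_ssd(image, template):
    """Exhaustive template matching: return the top-left corner (x, y) of the
    block in `image` with the smallest sum of squared differences (SSD)
    to `template`."""
    ih, iw = image.shape
    th, tw = template.shape
    best_pos, best_ssd = None, float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            block = image[y:y + th, x:x + tw].astype(float)
            ssd = np.sum((block - template.astype(float)) ** 2)
            if ssd < best_ssd:
                best_ssd, best_pos = ssd, (x, y)
    return best_pos

# Synthetic example: a bright 3x3 patch embedded at (x=5, y=2) in a dark image.
image = np.zeros((10, 12), dtype=np.uint8)
template = np.full((3, 3), 200, dtype=np.uint8)
image[2:5, 5:8] = template
print(match_template_ssd(image, template))  # → (5, 2)
```

Because the template here is fixed, a change in the object's appearance would move the SSD minimum away from the true location, which is exactly the drift problem the updatable-template and ZDF approaches try to address.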
[Figure: (a) Camera L and Camera R; (b) image in Camera L (left image) and image in Camera R (right image).]
… binocular disparity of more than 2 pixels. Obviously, the ZDF based on SIFT succeeded in discriminating the visual target being tracked from the background.

Implementing the ZDF based on SIFT is a two-stage process. First, the gaze point of the active stereo camera system is kept on the moving object continuously. Second, SIFT key points are extracted, and the points belonging to the target object are identified through a binocular disparity threshold.

Conventionally, an active stereo camera system is treated as two independent monocular vision systems, each controlled by an independent PID feedback loop. However, this approach can hardly guarantee that the gaze point stays on the target object stably. A binocular motor control model resolving this problem is discussed in section IV.

The binocular disparity of an object represents its depth distance from the gaze point. The value of the binocular disparity threshold should be decided based on the baseline length of the active stereo camera system, the pixel spatial resolution of the cameras, and the distance between the target object and the camera system.

The coordinate system in the vergence plane is shown in Fig. 3. Denote z_T as the distance between the target and the system, l as the length of the baseline, α as the vergence angle, and φ_l (φ_r) as the pan angle of the left (right) camera from the zero line to the target.

Fig. 3. Coordinate system in the vergence plane.

Obviously,

α = φ_l + φ_r    (1)

z_T = l / (tan φ_l + tan φ_r)    (2)

If z_T is much larger than l, then tan φ ≈ φ, and

z_T ≈ l / α    (3)

By taking the derivative of (3), (4) is obtained:

dz_T = (1/α) dl − (l/α²) dα = (z_T/l) dl − (z_T²/l) dα    (4)

Obviously, given the length of the baseline l, the distance of the target object z_T, the pixel spatial resolution dα, and the acceptable distance between the target object and the background dz_T, the value of the binocular disparity threshold can be calculated.

IV. BINOCULAR MOTOR CONTROL MODEL

Active camera movements include horizontal rotation and vertical rotation. Since the two cameras are arranged horizontally, keeping the gaze point on a moving object requires the two cameras to perform horizontal conjugate movements or vergence movements from time to time.

As mentioned in section III, an active stereo camera system is usually treated as two independent monocular vision systems. Making each of them work properly will yield vergence movement in appearance. Unfortunately, independent PID feedback control loops cannot guarantee that the gaze point stays on the target object continuously, which is a necessary condition for the ZDF based on SIFT.

It is well known that the ocular motor control system of primates is an effective device for capturing an object in the central pits of the retinas of both eyes. Human eyes always orient their lines of sight so as to keep the image of the object from leaving the central pit. In fact, it is very difficult for a human to direct the lines of sight of the two eyes onto different objects. By investigating the neural system controlling eye movements, we identified the cooperative eye movement mechanism, which provides a new approach to obtaining cooperative camera movements other than by image processing.

The mathematical model of the human horizontal binocular motor neural pathways is shown in Fig. 4, where σ, κ, ν, η, ρ1 and ρ2 denote the gain of each block of the model [13].

Fig. 4. Mathematical model of the horizontal binocular motor neural pathways.

As shown in Fig. 3, denote φ_l (φ_r) as the pan angle of the left (right) camera from the zero line to the target, and E_l (E_r) as the desired pan angle of the left (right) camera. Accordingly,
(E_l(s), E_r(s))^T = (1/2) · [(σ + κ + (ν + η)s) / (σ + κ + (1/(ρ1 + ρ2) + ν + η)s)] · (φ_l(s) + φ_r(s), φ_l(s) + φ_r(s))^T
                   + (1/2) · [(σ − κ + (ν − η)s) / (σ − κ + (1/(ρ1 − ρ2) + ν − η)s)] · (φ_l(s) − φ_r(s), φ_r(s) − φ_l(s))^T    (5)

As shown in (5), the first part of the right side of (5) is the transfer function of the conjugate movement, and the second part is the transfer function of the vergence movement. Therefore, the time constant of the conjugate transfer function, Tc, and that of the vergence transfer function, Tv, can be obtained as shown in (6) and (7):

Tc = (1 + (ρ1 + ρ2)(ν + η)) / ((ρ1 + ρ2)(σ + κ))    (6)

Tv = (1 + (ρ1 − ρ2)(ν − η)) / ((ρ1 − ρ2)(σ − κ))    (7)

To make sure that the system is stable, the following condition (8) should be met:

σ > κ,  ν > η,  ρ1 > ρ2    (8)

Condition (8) makes Tc < Tv [13]. In other words, in a stable system the time constant of vergence movement is longer than that of conjugate movement.

This conclusion accords with normal physical phenomena. When the visual target moves approximately in a plane perpendicular to the line of sight, its image on the retina moves quickly; the short conjugate time constant enables the eye rotation to respond quickly and maintain smooth pursuit. When the visual target moves along the line of sight, its image on the retina moves slowly; even though the time constant of vergence movement is long, the two eyes can still follow the visual target.

A long vergence time constant also stabilizes the gaze point. The values of the gains shown in Fig. 4 can be decided according to the following steps. First, set κ and η to zero, which simplifies gain setting without affecting the signal crossover between the two eyes of the system. Second, regulate ρ1 and ρ2 so that ρ1 + ρ2 = 1. Third, according to (6), set the values of σ and ν to yield a small Tc. Finally, according to (7), set the values of ρ1 and ρ2 to yield a large Tv.

In this study, we set σ = 50, ν = 0.1, κ = 0, η = 0, ρ1 = 0.51, and ρ2 = 0.49. Accordingly, we get Tc = 0.022 s and Tv ≈ 1 s. In section VI.B, an experiment comparing the gaze point control performance of the proposed binocular motor control model with that of traditional independent control loops is given.

… equipped with wide-angle lenses. The cameras with telephoto lenses can provide high-resolution images of the target object at a distance. Camera 5 is equipped with a super-wide-angle lens to monitor the whole area. Apart from Camera 5, which is fixed on the foundation of the system, the other cameras tilt up and down together on a common platform. There are three pan motors: two for the camera sockets separately, and one for the platform.

Fig. 5. Picture of the active stereo camera system.

The length of the baseline between the two camera sockets, l, is about 200 mm, which is expected to be much shorter than the distance between the system and the target object. That means there is only a minor difference in the target object's appearance between the stereo images.

The following Table I shows the specifications of the cameras. If we assume the head of a human to be a ball with a diameter of 250 mm, the face image of a person at a distance of 50 m captured by the telephoto lens camera should be a circle about 80 pixels in diameter, which is big enough to be recognized by an operator or by face recognition software.

TABLE I
SPECIFICATION OF CAMERAS

                          Camera 1, 3   Camera 2, 4   Camera 5
Resolution [pixel]        1024×768      640×480       1600×1200
Hor. angle of view [deg]  3.67          21.73         180
Ver. angle of view [deg]  2.75          16.38         180
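The gain-setting procedure above can be sanity-checked numerically. The sketch below (plain Python, not part of the original system) plugs the gains chosen in this study into the time-constant expressions (6) and (7):

```python
# Gains chosen in section IV: sigma = 50, nu = 0.1, kappa = eta = 0,
# rho1 = 0.51, rho2 = 0.49.
sigma, kappa = 50.0, 0.0
nu, eta = 0.1, 0.0
rho1, rho2 = 0.51, 0.49

# Time constant of conjugate movement, equation (6).
Tc = (1 + (rho1 + rho2) * (nu + eta)) / ((rho1 + rho2) * (sigma + kappa))
# Time constant of vergence movement, equation (7).
Tv = (1 + (rho1 - rho2) * (nu - eta)) / ((rho1 - rho2) * (sigma - kappa))

# Stability condition (8): sigma > kappa, nu > eta, rho1 > rho2, so Tc < Tv.
print(round(Tc, 3), round(Tv, 3))  # → 0.022 1.002
```

This reproduces the values quoted in the text, Tc = 0.022 s and Tv ≈ 1 s, confirming that vergence is roughly fifty times slower than conjugate movement for these gains.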
… communication protocol designed for computer vision applications based on the National Semiconductor Channel Link interface. At the maximum chipset operating frequency, the base configuration yields a video data throughput of 255 MB/s. This ensures enough bandwidth to transfer video data from the three kinds of cameras used in the system, which need 27.6 MB/s, 23.6 MB/s, and 57.6 MB/s of bandwidth respectively. Three Dalsa Xcelera-CL PX4 Dual frame grabbers are used to obtain data from the five cameras. The Xcelera-CL PX4 Dual is a highly versatile PCI Express frame grabber capable of acquiring images from two independent Camera Link base configuration cameras. The PCI Express host interface is a point-to-point host interface that allows simultaneous image acquisition and transfer without loading the system bus and with little intervention from the host CPU.

B. Tracking Algorithm

The flowchart of the face tracking algorithm is shown in Fig. 6.

Fig. 6. Flowchart of the face tracking algorithm (begin; detect face; if a face is detected, gaze on the face and register it as the tracking template; loop on SIFT matching).

1) … all of the active cameras are driven to gaze at the face, and the face is registered as the tracking template for the subsequent processing.
2) The original SIFT algorithm [12] is used to detect the target object while its appearance change remains sustainable. As mentioned in [12], SIFT can be used as an object detector. The object models of SIFT are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60-degree rotation away from the camera, or to allow up to a 20-degree rotation of a 3D object.
3) When the appearance change of the target object is too big to be matched with the template, the ZDF based on SIFT is used to create a new template near the location where the visual target disappeared. The new template is registered as the tracking template for the following processing.

Steps 2) and 3) are repeated until the target object cannot be detected at all.

VI. EXPERIMENTS

A. Head Rotation and the Number of SIFT Key Points

A sufficient number of SIFT key points is necessary for the SIFT matching algorithm. This experiment verifies whether enough SIFT key points belonging to the target object can be detected by the ZDF based on SIFT.

A person sits in a swivel chair about 3.3 m in front of the active stereo camera system. Images from Camera 2 and Camera 4 are used, and the image of the person's frontal face is a rectangle of about 100 pixel × 100 pixel. The swivel chair is rotated from 0 degrees to 90 degrees at intervals of 10 degrees, and the number of SIFT key points with binocular disparity of less than 2 pixels is recorded. The relation between the head rotation angle and the number of SIFT key points detected by the ZDF based on SIFT is shown in Fig. 7.
B. Gaze Point Control

In this experiment, a stationary visual target is placed in front of the active stereo camera system, and the system is commanded to fix its gaze point on the visual target. Two sets of experiments are carried out. First, the system is controlled by traditional independent PID feedback loops. Next, the system is controlled by the proposed binocular motor control model. The distance between the gaze point and the system is recorded and shown in Fig. 8.
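For intuition about what the two time constants from section IV mean for gaze stability, each movement component can be approximated as a first-order lag. The sketch below is an illustrative simplification, not the full neural-pathway model of Fig. 4; the unit-step input and Euler integration are assumptions for demonstration:

```python
# First-order step responses with the time constants from section IV:
# conjugate movement (Tc = 0.022 s) vs. vergence movement (Tv ≈ 1 s).

def step_response(T, t, dt=0.001):
    """Euler-integrate y' = (1 - y) / T from y(0) = 0 up to time t,
    i.e. the response of a first-order lag to a unit step."""
    y = 0.0
    for _ in range(round(t / dt)):
        y += dt * (1.0 - y) / T
    return y

Tc, Tv = 0.022, 1.002
# After 0.1 s the conjugate component has essentially settled,
# while the vergence component has barely started moving.
print(step_response(Tc, 0.1))
print(step_response(Tv, 0.1))
```

This asymmetry is what keeps the gaze point stable: lateral target motion is followed quickly by conjugate rotation, while the slow vergence response suppresses jitter of the gaze point along the line of sight.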
REFERENCES
[1] B. Horn and B. Schunck, “Determining optical flow,” Artificial
Intelligence, vol. 17, pp. 185-203, 1981.
[2] H. Nagel, “Displacement vectors derived from second-order intensity
variations in image sequences,” Computer Graph, Image Process, vol.
21, pp. 85-117, 1983.
[3] D. Comaniciu, V.Ramesh, and P. Meer, “Real-time tracking of
non-rigid objects using mean shift”, in Proc. IEEE Conf. on Computer
Vision and Pattern Recognition, vol. 2, pp. 142-149, 2000.
[4] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: active contour
models,” Int. J. of Computer Vision, vol. 1, pp. 321-332, 1988.
[5] H. D. Crane and C. M. Steele, “Translation-tolerant mask matching
using noncoherent reflective optics,” in Pattern Recognition, vol. 1,
issue 2, pp. 129-136, Nov. 1968.
[6] J. Martin and J. L. Crowley, “Comparison of correlation techniques,” in
Intelligent Autonomous Systems, pp. 86-93, 1995.
[7] B. Li and R. Chellappa, “Simultaneous tracking and verification via
sequential posterior estimation,” in Proc. IEEE Conf. Computer Vision
and Pattern Recognition, vol. 2, pp. 110-117, 2000.
[8] M. Black and Y. Yacoob, “Recognizing facial expressions in image
sequences using local parameterized models of image motion,” Int. J. of
Computer Vision, vol. 25, no. 1, pp. 23-48, 1997.
[9] T. Kaneko and O. Hori, “Update criterion of image template for visual
tracking using template matching,” Trans. of the Institute of Electronics,
Information and Communication Engineers, vol. J88-D-II, no. 8,
pp.1378-1388, 2005.
[10] P. Kaenel, C. Brown, and D. Coombs, “Detecting region of zero
disparity in binocular images,” Technical Report 388, Computer
Science Department, University of Rochester, 1991.
[11] R. Reading, Binocular Vision: Foundations and Applications,
Butterworth, Boston, 1983, p. 88.
[12] D. Lowe, “Object recognition from local scale-invariant features,” in
Proc. the 7th IEEE Int. Conf. on Computer Vision, vol. 2, pp.
1150-1157, 1999.
[13] X. Zhang, and H. Wakamatsu: “Mathematical model for binocular
movements mechanism and construction of eye axes control system,” J.
of the Robotics Society of Japan, vol. 20, no. 1, pp. 89-97, 2002.
[14] P. Viola, and M. Jones, “Rapid Object Detection using a Boosted
Cascade of Simple Features,” in Proc. of IEEE Computer Society Conf.
on Computer Vision and Pattern Recognition, vol. 1, pp. 511-518,
2001.