

Human Motion Tracking by Multiple RGBD Cameras

Zhenbao Liu, Member, IEEE, Jinxin Huang, Junwei Han, Senior Member, IEEE, Shuhui Bu, and Jianfeng Lv

Abstract— The advent of low-cost depth cameras, such as the Microsoft Kinect, in the consumer market has made many indoor applications and games based on motion tracking available to the everyday user. However, it is a large challenge to track human motion via such a camera because of its low-quality images, missing depth values, and noise. In this paper, we propose a novel human motion capture method based on a cooperative structure of multiple low-cost RGBD cameras, which can effectively avoid these problems. This structure can also manage the problem of body occlusions that appears when a single camera is used. Moreover, the whole process does not require training data, which makes this approach easy to deploy and reduces operation time. We use the color image, depth image, and point cloud acquired in each view as the data source, and an initial pose is extracted in our optimization framework by aligning multiple point clouds from different cameras. The pose is dynamically updated by combining a filtering approach with a Markov model to estimate new poses in the video streams. To verify the efficiency and robustness of our approach, we capture a wide variety of human actions via three cameras in indoor scenes and compare the tracking results of the proposed method to those of current state-of-the-art methods. Moreover, our system is tested in more complex situations, in which multiple humans move within a scene, possibly occluding each other to some extent. The actions of multiple humans are tracked simultaneously, which would assist group behavior analysis.

Index Terms— Human motion tracking, multiple depth cameras, multiple humans, skeleton.

Manuscript received December 15, 2015; revised March 22, 2016; accepted May 2, 2016. Date of publication May 6, 2016; date of current version September 5, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61003137, Grant 61473231, Grant 61573284, and Grant 61522207; in part by the Northwestern Polytechnical University Basic Research Fund under Grant 3102016JKBJJGZ08; in part by the Open Fund of State Key Laboratory of Computer-Aided Design & Computer Graphics in Zhejiang University under Grant A1509; in part by the Open Research Foundation of State Key Laboratory of Digital Manufacturing Equipment and Technology in Huazhong University of Science and Technology under Grant DMETKF2015009; in part by the Fund of National Engineering and Research Center for Commercial Aircraft Manufacturing under Grant SAMC14-JS-15-045; and in part by the Shaanxi Natural Science Fund under Grant 2015JM6344. This paper was recommended by Associate Editor H. Yao. (Corresponding author: Junwei Han.)

The authors are with Northwestern Polytechnical University, Xi'an 710072, China (e-mail: liuzhenbao@nwpu.edu.cn; hjxin627@nwpu.edu.cn; jhan@nwpu.edu.cn; bushuhui@nwpu.edu.cn; jianfengswjtu@nwpu.edu.cn).

Digital Object Identifier 10.1109/TCSVT.2016.2564878

I. INTRODUCTION

IN RECENT years, human motion tracking has attracted much attention for its wide applicability to a variety of fields, such as video games and animation. A virtual character is commonly driven by captured human motion using expensive inertial and optical systems. With the development of sensor technologies, many new and economical camera devices have been produced, e.g., the Microsoft Kinect and Asus Xtion, which make spatial information acquisition easier [1]. These affordable devices are regarded as an alternative technology for performing tracking tasks in indoor scenes. Although these cameras have appealing advantages, it is still a big challenge to track human motion via a depth camera in cluttered indoor environments, as they suffer from noise and missing data [2]. Furthermore, when tracking via a single camera, self-occlusion frequently occurs when the tracked human turns around or crosses his or her limbs. As a result, the depth information of the occluded body parts cannot be captured, and these parts may be lost during tracking.

Currently, one category of motion tracking algorithms requires the manual initialization of human joints, while an alternative needs to obtain initial joint positions from a motion capture system. In addition, many state-of-the-art motion tracking algorithms based on a single depth camera rely on a large training data set to train classifiers that recognize complex human motion, which is difficult to track. In contrast to these methods, we track complicated human actions with multiple depth cameras in 3D space without manual initialization or full human body reconstruction, which is more suitable for home applications. Furthermore, the proposed system is efficient, robust to different types of poses, and able to handle body part self-occlusion.

The core of our system is that we formulate the motion tracking problem in a joint framework consisting of a Markov model and filtering, which predicts new poses via mean square error (MSE) minimization. Our approach only requires two video streams, an RGB stream and a depth stream. We compute an initial pose from the aligned point cloud and generate a number of candidate poses to approximate the possible pose at the current frame according to the initial pose. We incorporate two image features, silhouettes and edges, to estimate an optimal pose: silhouettes and edges are employed to evaluate how well the candidate poses fit the observed images in the three views.

A. Overview

The flowchart of our approach is shown in Fig. 1. We introduce the idea of a joint filtering and Markov model framework to formulate the motion tracking problem via multiple cameras.

Fig. 1. Overview of our algorithm.

1) Data Acquisition and Processing: Color images and depth images are captured by the two respective cameras in each Kinect, that is, a color camera and a depth camera. The two types of images are registered using the external calibration parameters of the two cameras. The depth information, also called the point cloud, is converted from a depth image after the internal calibration of each depth camera. The three sets of point clouds are integrated into one full point cloud of the human body. The skeleton is extracted from the full point cloud using geometrical and topological operations. The skeleton of a standard T-pose is captured before tracking. A T-pose is a reference pose in which the legs of the tracked human are straight and his/her arms are stretched out horizontally, forming a T-shape. The T-pose is used to obtain the limb lengths of the human model. In our method, although we can obtain the initial skeleton from the aligned point cloud, the skeleton is represented by a set of points that should be selected via a priori knowledge of the human body, namely, the limb lengths. The motion angles of connected limbs are deduced from the initial pose, namely, the skeleton at each frame. The angle information is employed to construct a human model in which cylinders represent body parts. The generated human model is then projected onto a 2D image plane.

2) Filtering Process: The filtering process is used to estimate the final pose for each frame. In particular, the filtering process can approximate the probability density function (pdf) by random sampling. The mathematical model of the human motion can be simplified using the sampled candidate poses. Therefore, these poses can be randomly generated according to an initial pose that is similar to the true pose. These poses are assigned associated weights that are updated by a likelihood function. The estimated pose is finally computed as the weighted sum of all sampled poses.

3) Error Calculation and Pose Updating: We obtain the optimal pose via MSE minimization. In detail, we project the predicted pose onto the three views of the depth cameras to calculate the errors. A likelihood function composed of two image features, edge features extracted from the RGB images and silhouette features extracted from the corresponding depth images, is modeled to estimate how well the predicted pose fits the acquired images. The objective of the likelihood function is to compare the projected pose with the image features. After the error calculation, we integrate the errors from the different views and assign weights to the possible poses according to the error. Following the weight calculation, the best pose is generated by summing up the products of the pose states and their corresponding weights. The human model is then replaced with the best pose.

B. Contributions

We implement the whole pipeline for human motion tracking in a prototype system and demonstrate its usefulness with several types of human actions captured from three depth cameras. No accurate full human body mesh is needed in our algorithm, which results in high efficiency and makes tracking easier. Our approach is also free of training data, which saves deployment time. This approach is made possible by the following two technical contributions.

1) We extract an initial skeleton from the point cloud, to which the point clouds captured from the three depth cameras are aligned for each frame. The use of an initial skeleton per frame helps avoid motion discontinuity and abrupt variation in motion velocity, and reduces the search space as well, which reduces the tracking time.

2) We propose a motion tracking algorithm in a joint filtering and Markov model framework with three RGBD cameras. The joint tracking framework fully uses the pose information observed with each single camera and, hence, improves tracking performance. The experimental results support the conclusion that this algorithm is robust to self-occlusions of the human body and cross occlusions among multiple humans.
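Read as pseudocode, the three stages above form a simple per-frame loop. The sketch below is only a minimal illustration of that loop, not the authors' implementation; `align_point_clouds`, `extract_skeleton`, `perturb`, and `likelihood` are hypothetical placeholders for the operations detailed in Sections II and III.

```python
import numpy as np

def track_frame(rgb_images, depth_images, limb_lengths, n_candidates=100):
    """One tracking iteration over the three camera views (illustrative only)."""
    # 1) Data acquisition and processing: fuse the three registered point
    #    clouds and extract an initial skeleton constrained by limb lengths.
    cloud = align_point_clouds(depth_images)              # hypothetical helper
    initial_pose = extract_skeleton(cloud, limb_lengths)  # hypothetical helper

    # 2) Filtering: sample candidate poses around the initial pose and
    #    weight them with the image-based likelihood in every view.
    candidates = [perturb(initial_pose) for _ in range(n_candidates)]
    weights = np.array([likelihood(p, rgb_images, depth_images)
                        for p in candidates])
    weights /= weights.sum()

    # 3) Error calculation and pose updating: the estimate is the
    #    weight-averaged candidate pose.
    return sum(w * np.asarray(p) for w, p in zip(weights, candidates))
```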

C. Related Work

In this section, we mainly survey related applications based on depth cameras, body tracking methods, and pose tracking methods in the domains of video, image, multimedia, and computer vision. Body tracking means that the human body is tracked with a bounding box in 2D or 3D, and its objective is to know the person's position and trajectory in outdoor and indoor scenes. In contrast, pose tracking methods aim at tracking the person's body parts, such as the arms and legs, and analyzing his or her actions.

1) RGBD Camera Applications: With the development of sensor technologies, RGBD cameras, such as the Microsoft Kinect, have been applied to many research fields in computer vision [3], [4], such as 3D reconstruction, object recognition, and pose estimation. In [5], reconstruction of the 3D human body from a sequence of RGBD frames is performed by employing a novel parameterization of cylindrical-type objects using the Cartesian tensor and B-spline bases along the radial and longitudinal dimensions, respectively. Lai et al. [6] performed object recognition and detection through the combination of visual cues, depth cues, and rough knowledge of the configuration between the turntable and the RGBD cameras. Jiang et al. [7] proposed a novel multilayered gesture recognition method with Kinect and obtained relatively high performance for one-shot learning gesture recognition. Haker et al. [8] presented a pose estimation method that fits a simple model of the human body to the point cloud of a pose in 3D space. Liu et al. [9] proposed a 3D human body reconstruction method based on template fitting via multiple depth cameras.

2) Body Tracking: In some applications, the motion trajectory and full human body are tracked for video surveillance systems. A review and comparison of state-of-the-art tracking methods based on sparse coding are presented in [10]. Zhu et al. [11] proposed an object tracking method for structured environments. They use the distance transform to model the environment state and solve the tracking problem in a Bayesian framework. Yang et al. [12] boosted an optimal combination of features and kernels in an extended multiple kernel learning framework to achieve effective and efficient tracking. Ess et al. [13] proposed a two-stage tracking solution on a mobile platform, where they first build a simplified model to estimate the scene geometry and an overcomplete set of object detections and, then, address object interactions, tracking, and prediction. In [14], a human detection and tracking system is proposed in which visual features are extracted from the RGB images to track the human body segmented from the scene represented by the depth images in successive frames. Xia et al. [15] proposed a model-based approach using depth information from Kinect, which detects humans using a 2D head contour model and a 3D head surface model. Zhang et al. [16] introduced structurally random projection and weighted least squares into visual tracking to relax the sparsity constraint when a set of target and trivial templates is used to linearly represent each target candidate. Wang et al. [17] tailored keypoint matching to track the 3D pose of the user's head in a video stream. Zhang et al. [18] proposed local patch movement modeling from the perspective of the uncertainty principle.

3) Pose Tracking: Another type of tracking algorithm aims at tracking the movement details of human body poses. The tracked skeletons can be applied to animation, games, and so on, which is the motivation of our work as well. Machine learning methods that use prior knowledge to obtain better estimates of 3D human pose are introduced in [19]–[22]. Li et al. [23] proposed a heterogeneous multitask learning framework using a deep convolutional neural network. However, these methods require a large number of training samples, and the tracking accuracy relies heavily on the quality and quantity of the training data, which limits their applications. Bingbing et al. [24] learned a compact low-dimensional representation of motion statistics from similar motion patterns and then sample in the low-dimensional space during tracking to reduce the computation time. In [25], the images are described using salient interest points represented by scale-invariant feature transform-like descriptors, and the mapping between poses and features is then modeled by a Gaussian process and multiple linear regression.

In retrieval-based tracking systems, poses are usually compared with a motion database, and the most similar pose is selected. Helten et al. [26] demonstrated a sensor fusion approach for real-time full-body tracking, in which a generative tracker and a discriminative tracker are combined to retrieve the closest pose in a database. Liu et al. [27] presented a full-body human motion tracking system based on an exemplar-based conditional particle filter, where the best exemplar is chosen by comparison with the current frame. The state of the target to be tracked over time can also be predicted by introducing decision-theoretic online learning and using a set of weighted experts [28]. Targeting the fusion of multiple tracking results, a symbiotic tracker ensemble framework [29] was proposed to effectively combine the outputs of multiple trackers and jointly explore the consistency of each tracker and the pairwise tracker correlations.

In image-based tracking systems, motion tracking usually initializes the 3D human pose manually and minimizes the difference between the hypothesized pose and the observed image features. However, high-dimensional 3D human motion is not easy to map onto a 2D image due to self-occlusion, and the depth information of a hypothesized pose is also ambiguous. Therefore, 3D human motion tracking via multiple views has been proposed. Gavrila and Davis [30] formulated the 3D tracking problem as a search problem and find the most similar appearance of the subject in the multiview images. Deutscher and Reid [31] applied an annealed particle filter algorithm to motion tracking via four cameras.

The advent of depth cameras has attracted much attention in the motion-tracking field. Knoop et al. [32] achieved 3D tracking of human body movements based on a 3D body model and the iterative closest point (ICP) algorithm, but it easily falls into local optima. Wei et al. [33] presented a fast, automatic tracking method that iteratively registers a 3D articulated human body model with monocular depth cues via linear system solvers in a maximum a posteriori framework. Shotton et al. [34] proposed a quick and accurate recognition approach, in which an intermediate body part representation is used to transform a pose estimation problem [4] into a per-pixel classification problem [35], [36]. Given a depth image, a per-pixel body part distribution is inferred, and the local modes are estimated to determine the locations of joints. Alexiadis et al. [37] built a real-time automatic system for dance performance evaluation using a Kinect RGBD sensor and provided visual feedback for beginners in a 3D virtual scene.

Fig. 2. Our human motion capture system composed of multiple depth cameras.

Fig. 3. RGBD camera calibration.

Fig. 4. (a) Skeleton obtained from OpenNI. (b) Human body parts. (c) Established model using a parameter set.

These pose tracking methods via depth cameras depend on either human body recognition or full human body mesh reconstruction. Different from these algorithms, our approach does not require training samples or a human body mesh. Moreover, our method effectively solves self-occlusions via multiple depth cameras and accommodates cross occlusions among multiple humans.

II. DATA ACQUISITION AND PREPROCESSING

We focus on tracking human motion using multiple depth cameras, each of which captures images via both a depth camera and an RGB camera at 30 frames/s. We capture the RGB images and the depth images simultaneously, extract human silhouettes, and represent the human model with joints.

A. Data Acquisition

Human motion is tracked by collecting data from three low-cost depth cameras, each of which captures 640 × 480 depth and RGB images synchronously. The depth cameras are placed in a circle with a radius of ∼2.5 m and ∼90° between each camera. Each camera is connected to a computer, and we randomly select one of the computers to be the master. The other two computers are connected to the master via network cables. The data are stored simultaneously when the master computer sends a storage signal to the other two computers. The motion capture system is shown in Fig. 2.
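The storage trigger between the master and the slave computers could be implemented with plain sockets. The paper does not specify the protocol, so the snippet below is only an assumed sketch: the master broadcasts a one-byte UDP signal and each slave saves its current RGBD frame on receipt; the addresses, port, and `save_frame` callback are hypothetical.

```python
import socket

SLAVES = [("192.168.0.2", 9000), ("192.168.0.3", 9000)]  # assumed addresses

def master_send_storage_signal():
    # The master tells both slave computers to store the current frame.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        for addr in SLAVES:
            s.sendto(b"S", addr)

def slave_wait_and_store(save_frame, port=9000):
    # Each slave blocks until the signal arrives, then saves its RGBD pair.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("", port))
        while True:
            data, _ = s.recvfrom(1)
            if data == b"S":
                save_frame()  # user-supplied callback writing RGB + depth
```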
B. RGBD Camera Calibration

One RGBD camera consists of a depth sensor, an RGB camera, and a projector that casts a fixed speckle pattern. Because of the hardware configuration, a scene obtained from the RGB camera is slightly different from the one captured via the depth sensor. As mentioned in [38], there is a fixed offset between the raw infrared (IR) image and the depth map. To correct for the offset and convert them into the same coordinate system, we first calibrate the depth camera, as shown in Fig. 3. In this way, we convert a 3D point coordinate in the depth coordinate system into a 2D point coordinate in the RGB coordinate system.

1) Self-Calibration: We prepare a planar checkerboard and take a few images of it from different views using the RGB camera. To calibrate the IR camera, we cover the projector so that the calibration grid is not corrupted by the projected pattern, and then repeat the same steps as for the RGB camera. We then use a standard calibration technique [39] to calibrate the RGB and IR cameras separately. The offset between the IR camera and the depth map is corrected [38]. We compute a set of intrinsic parameters to perform the projection between 2D points and 3D points. After the intrinsic calibration, we perform an extrinsic calibration between the depth and RGB cameras by applying a local rotation and translation. The extrinsic parameters are used to convert the depth coordinate system into the RGB coordinate system.

2) Stereo Calibration: To integrate the point clouds from different views, we need to know the relative positions and orientations between each pair of depth cameras. We apply a global registration method [39] between every pair of RGB cameras so as to align the different RGBD cameras.

C. Human Model Representation

In motion tracking, two main methods are used to obtain the initial human pose. One is to initialize the pose by applying a motion tracking system with markers, and the other is to locate the human joints manually in advance. The former requires deploying a complicated and expensive motion tracking hardware system, while the latter is not convenient for users, as they must mark the joints manually. To initialize the system automatically, we use OpenNI to obtain the skeleton coordinates of a standard T-pose before the tracking process, as shown in Fig. 4(a). The skeleton consists of 15 joints covering the full human body: the head, neck, torso center, hips, knees, feet, shoulders, elbows, and hands. We determine the length of each limb, l = \{l_1, \ldots, l_{10}\}, in advance according to the skeleton information.

Instead of reconstructing an accurate human mesh model, we represent the human body with ten limbs to avoid the cost of full reconstruction, as shown in Fig. 4(b). Each limb has its own local coordinate system. Each part is simulated by one cylinder, except for the torso, which is represented by an elliptical cylinder.
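The ten-limb cylinder model can be held in a small data structure. The sketch below uses assumed joint names and mirrors the parameterization described here and in the cascade description that follows: a limb-length vector l = {l1, ..., l10} and a parameter set X with 25 degrees of freedom.

```python
import numpy as np

JOINTS = ["head", "neck", "torso", "l_hip", "r_hip", "l_knee", "r_knee",
          "l_foot", "r_foot", "l_shoulder", "r_shoulder", "l_elbow",
          "r_elbow", "l_hand", "r_hand"]            # the 15 OpenNI joints

class HumanModel:
    """Ten-limb cylinder model; the torso is an elliptical cylinder."""
    def __init__(self, limb_lengths):
        assert len(limb_lengths) == 10              # l = {l1, ..., l10}
        self.limb_lengths = np.asarray(limb_lengths)
        # Parameter set X with N = 25 degrees of freedom (see the cascade
        # description below):
        #   torso root: 3 coordinates + 3 angles            ->  6
        #   head, 2 upper arms, 2 thighs: 3 angles each     -> 15
        #   2 lower arms, 2 lower legs: 1 angle each        ->  4
        self.X = np.zeros(25)

# Usage: model = HumanModel(np.ones(10))  # unit limb lengths, as in the paper
```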

Rather than modeling the human body with coordinates, we convert the coordinates into angles between the related local coordinate systems. The reason we do not adopt coordinates to model human motion is that using angles as the tracking parameters is more convenient and suitable for the subsequent tracking in our algorithm. We use the concept of a cascade to model the full body, and the details are as follows.

First, we regard the torso as a root node and characterize its location and orientation with six degrees of freedom (three coordinates and three angles). Next, the head, upper arms, and thighs are treated as child nodes of the torso, each having three degrees of freedom (only the angles around the three axes with respect to the torso coordinate system). Finally, we represent the lower arms and legs with one degree of freedom each; their degrees of freedom are related to the upper arms and thighs, respectively. All of these angles and coordinates form our parameter set X = \{\varphi_1, \ldots, \varphi_N\}, which contains N = 25 degrees of freedom.

It is well known that a coordinate system requires at least two axes to be defined, while only one axis of the local coordinate system represents an upper arm or thigh. We deal with this situation when information is lacking (e.g., when the left thigh's two points cannot form a 3D coordinate system). Taking the left upper arm as an example, only the z-axis along the left upper arm, computed from the skeleton joint points, is known, and the rotation of the left lower arm around the x-axis is defined. The plane formed by the z-axes of the left upper and lower arms is deduced, which is invariant to the left lower arm rotation.

The coordinates are adopted to calculate the related angles between the coordinate systems of the connected limbs. We predefine three rotations in the order R_z(\alpha)R_x(\beta)R_y(\gamma), and the rotation matrix R between the two coordinate systems of a parent joint A and its child joint B is

R = \begin{bmatrix} c(\alpha)c(\gamma) - s(\alpha)s(\beta)s(\gamma) & -s(\alpha)c(\beta) & c(\alpha)s(\gamma) + s(\alpha)s(\beta)c(\gamma) \\ s(\alpha)c(\gamma) + c(\alpha)s(\beta)s(\gamma) & c(\alpha)c(\beta) & s(\alpha)s(\gamma) - c(\alpha)s(\beta)c(\gamma) \\ -c(\beta)s(\gamma) & s(\beta) & c(\beta)c(\gamma) \end{bmatrix} \quad (1)

where s and c denote the sin and cos functions. The angles between A and B are calculated according to these axes. Similarly, all the other angles are obtained. The full human body is modeled with unit lengths and the parameter set X, as shown in Fig. 4(c).
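A minimal NumPy sketch of (1): rather than typing the expanded matrix by hand, R is composed from the three elementary rotations in the predefined order R_z(α)R_x(β)R_y(γ), which reproduces the entries of (1).

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_x(b):
    c, s = np.cos(b), np.sin(b)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(g):
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def parent_to_child_rotation(alpha, beta, gamma):
    # Rotation between the coordinate systems of parent joint A and
    # child joint B, in the predefined order Rz(alpha) Rx(beta) Ry(gamma).
    return rot_z(alpha) @ rot_x(beta) @ rot_y(gamma)
```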
is used to represent it. The model is described by the following
Fig. 5. Left: original depth image recorded from a depth sensor. Center: effect after filling holes. Right: silhouette extracted from the original depth image.

D. Silhouette Extraction

In this section, we introduce silhouette extraction, which is performed by removing the background of the depth image recorded by each depth sensor.

1) Hole Filling: Compared with a time-of-flight camera, a depth camera like that of the Kinect has the defect that the captured depth data are frequently incomplete and suffer from the reflection and absorption of environmental light on the surfaces of objects. The missing depth data lead to holes in the depth image, as shown in Fig. 5 (left). To accurately segment the foreground from the depth image, we have to complete the depth information in advance.

It is assumed that the background does not change over time, which means that the objects in the indoor scene are static except for the moving humans. In our scene, no matter how heavily the background is cluttered, the tracked person is commonly at the center of a circle surrounded by the Kinects. Therefore, there are clear distances between the points on the person and the points in the background, which roughly separates the foreground from the background. Based on this fact, we calculate the central pixel of the human body, denoted by (C_x, C_y), as the mean of the foreground pixels. We select the maximum and minimum pixel coordinates of the person in the x- and y-directions separately, and the central pixel is defined as the mean of these coordinates in each direction.

Two thresholds T_1 and T_2 are set to judge whether a pixel belongs to the foreground: if the value of a pixel (x, y) falls into the interval [T_1, T_2], it is considered a foreground pixel. In our practical application, the farthest possible distance is 7 m from the Kinects, and the person moves in a range of 1–6 m. The distance is normalized to values between 0 and 255, and T_1 and T_2 are set to 35 and 200, respectively. Because the depth information can mostly be captured around the central pixel, we infer the depth values of the holes pixel-by-pixel outward from the central pixel. The value of a hole is determined as the mean of the values of the foreground pixels around the hole. It can be seen in the center image of Fig. 5 that the noise level is suppressed significantly.

2) Gaussian Model: The background can be modeled as prior knowledge to detect foreground objects [40]. In this paper, to remove the background, a Gaussian model [41], [42] is used to represent it:

P(i_{x,y}) = \frac{1}{\sqrt{2\pi\delta_{x,y}^2}} \exp\left(-\frac{(i_{x,y} - \mu_{x,y})^2}{2\delta_{x,y}^2}\right). \quad (2)

Here, we denote the value of pixel (x, y) by i_{x,y}, \mu_{x,y} is the mean of pixel (x, y), and \delta_{x,y}^2 is its variance. Furthermore, P(i_{x,y}) is the probability of i_{x,y} under the current model. For a pixel (x, y) that was a background point until it becomes a foreground point, i_{x,y} will vary significantly relative to \mu_{x,y}. We define another threshold T to determine whether a pixel belongs to the foreground or background: if P(i_{x,y}) > T, then (x, y) is regarded as a foreground pixel; otherwise, it is a background pixel. The Gaussian model is updated by

M_t = \alpha I_t + (1 - \alpha)M_{t-1} \quad (3)

where M_t is the Gaussian model at time t and M_{t-1} is the model at time t−1. The parameter \alpha balances the influence of the current image on the previous Gaussian model when it updates. A larger value may result in unstable background removal; because the background in indoor scenes stays almost unchanged, it is desirable for the Gaussian distribution to be updated slowly. In our experiment, the value of \alpha is tuned to 0.1 empirically. Furthermore, I_t represents the mean and variance of the current frame at time t. The essence of the Gaussian model is to update the mean and variance of each pixel. The result is shown in Fig. 5 (right).
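A vectorized sketch of the foreground thresholds and of (2) and (3), assuming 8-bit normalized depth images. The paper states that both the per-pixel mean and variance are updated via (3); the specific variance update shown below is one common choice and is our assumption.

```python
import numpy as np

T1, T2, ALPHA = 35, 200, 0.1   # thresholds and update rate from the text

def depth_foreground(depth):
    # A pixel is foreground if its normalized depth falls inside [T1, T2].
    return (depth >= T1) & (depth <= T2)

def gaussian_probability(frame, mu, var):
    # Eq. (2): per-pixel Gaussian probability of the current intensity.
    return np.exp(-(frame - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def update_background(frame, mu, var):
    # Eq. (3): slow running update of the per-pixel mean and variance.
    mu_new = ALPHA * frame + (1 - ALPHA) * mu
    var_new = ALPHA * (frame - mu_new) ** 2 + (1 - ALPHA) * var
    return mu_new, var_new
```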

III. 3D POSE TRACKING

The proposed tracking framework for a 3D human pose is introduced in this section. We describe how to incorporate the silhouette features and the edge features originating from multiple cameras into the whole framework. With the color images and the depth images captured from the three Kinects, the depth information, namely, the point clouds from the different Kinects, is obtained from the depth images after an internal calibration of each depth camera. A complete point cloud is aligned from the three point clouds of the different views. During the tracking process, a full 3D skeleton is computed from the aligned 3D point cloud by geometrical and topological operations, followed by a combination of a filtering stage and Markov model estimation of the current pose. A likelihood function is adopted to generate pose weights and compute estimation errors.

A. Initial Skeleton Computation

With the camera parameters of the relative positions and orientations obtained via stereo calibration, we register the point clouds from the three cameras to a global coordinate system. We downsample the aligned point cloud uniformly and then apply a skeleton computation on the point cloud data. The computation process is divided into two main steps: geometry contraction and topological thinning. On the downsampled point cloud P, a tangent plane is determined based on the K nearest neighbors of a point P_i in P, and then a planar Delaunay triangulation is performed to define the one-ring neighbors of P_i. During the contraction, we contract the point cloud by iteratively solving the following linear system:

\begin{bmatrix} W_L L \\ W_H \end{bmatrix} P' = \begin{bmatrix} 0 \\ W_H P \end{bmatrix} \quad (4)

where P' denotes the point cloud after contraction, and W_L and W_H are the two matrices that weigh the two operations, Laplacian (L) contraction and attraction. The matrix computation runs iteratively until the initial point cloud becomes a point set without volume. We use farthest-point sampling to further reduce the size of the contracted point cloud. Two neighboring points may be combined into one if their distance is below a threshold, which is set empirically; in this paper, a better sampling effect is achieved when its value equals 0.18 times the length of the longest diagonal of the human bounding box. Two neighboring sampled nodes are connected to form a skeleton segment if their one-ring neighbors stay consistent. The iterations are terminated when all the triangles are removed. Finally, an initial skeleton with a set of candidate nodes is obtained, and these nodes are further optimized to a few nodes according to their connection relations and prior information provided by the OpenNI skeleton.

B. Filtering Process

Because there are differences between an initial pose and the true pose, we propose a filtering process to predict the true pose from the initial pose aligned from the point cloud. To make the state estimation more robust, we further introduce an annealing stage. Several layers are used to estimate the optimal state at each time step. The human model is improved at each layer, and the new estimated poses are generated randomly. The poses are accepted according to their related weights. As each layer propagates, the generation space for producing new poses becomes more restricted. After all the layers finish, an optimal pose state is obtained. The detailed steps are as follows.

1) Initialize the number of layers M and the number of estimated poses N.

2) For every time step t, the unweighted poses X_t^m are generated at each layer m. Note that the optimization starts from the Mth layer and ends at the first layer. The parameter set X_0 of the initial pose is used as the mean to produce the unweighted poses as follows:

X_{t,i}^m = X_0 + B_m \quad (5)

where B_m is a multivariate Gaussian random variable with covariance P_m and zero mean. For X_0, m = M. The covariance decreases as the layer number decreases.

3) A weight \omega_{t,i}^m is assigned to each pose according to the error computed by the cost function. The weights are normalized such that \sum_{i=1}^{N} \omega_{t,i}^m = 1.

4) New poses X_t^{m-1} are generated. More poses are distributed around the ones with higher weights at the upper layer. For example, the ith pose with weight \omega_i^m is chosen to produce a new pose. Each pose is produced by

X_{t,i}^{m-1} = X_{t,i}^m + B_m. \quad (6)

The number of new candidate poses satisfies the requirement that the value of N cannot be changed.

5) The process is repeated until layer 1, and the obtained possible poses X_{t,i}^1 are used to estimate the real pose state at time t as

\hat{X}_t = \sum_{i=1}^{N} \omega_t^{(i)} X_{t,i}^1. \quad (7)

6) The process returns to step 2, and a new initial pose is initialized at time t + 1. Then, X_{t+1} is predicted by steps 2–5.
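The annealed sampling of steps 1)–6) can be condensed as follows. This is a sketch under stated assumptions: `likelihood` stands in for the cost/likelihood function of Section III-D, the per-layer covariance simply shrinks geometrically, and new poses are resampled in proportion to the weights so that N stays fixed.

```python
import numpy as np

def annealed_filter(x0, likelihood, n_layers=5, n_poses=100,
                    sigma0=0.2, decay=0.5, rng=np.random.default_rng()):
    """Estimate the pose at one time step from the initial skeleton x0."""
    dim = x0.size
    sigma = sigma0
    # Layer M, eq. (5): unweighted poses sampled around the initial pose.
    poses = x0 + rng.normal(0.0, sigma, size=(n_poses, dim))
    for _ in range(n_layers):
        # Weight each pose by the image likelihood and normalize.
        w = np.array([likelihood(p) for p in poses])
        w /= w.sum()
        # Eq. (6): resample, placing more new poses around high-weight
        # poses while keeping the total number of candidates N unchanged.
        idx = rng.choice(n_poses, size=n_poses, p=w)
        sigma *= decay                     # covariance shrinks per layer
        poses = poses[idx] + rng.normal(0.0, sigma, size=(n_poses, dim))
    # Eq. (7): the weighted mean of the layer-1 poses is the estimate.
    w = np.array([likelihood(p) for p in poses])
    w /= w.sum()
    return (w[:, None] * poses).sum(axis=0)
```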

C. Markov Model

In most cases, a pose can be predicted by the filtering process; however, the skeletons extracted from the aligned point cloud can be completely wrong compared with the true pose. We judge the tracking performance using the concept of relative error. The error is computed by the likelihood function (see Section III-D). The smaller the error is, the more accurate the tracking result will be; namely, poses that are closer to the true pose have smaller errors. In our experiment, we set the error threshold to 50%. If the error rate exceeds this threshold, we regard the skeleton as completely wrong. Once this case occurs, we assume that the state X, namely, the parameter set of the pose, is subject to the Markov model. This means that the current pose state X_t is only related to the state X_{t-1} of the previous frame. At the same time, the observations \{O_{1:t}\} are supposed to be independent. Hence, the state estimation problem is converted into the problem of obtaining the posterior pdf p(X_t|O_{1:t}) according to Bayes' theory. The optimal estimate of the pose state can then be inferred. The whole process is divided into two stages: prediction and updating.

In the prediction stage, we predict the probability density of the current pose. It is assumed that the pdf p(X_{t-1}|O_{1:t-1}) at the previous time is known. Thus, we can obtain

p(X_t, X_{t-1}|O_{1:t-1}) = p(X_t|X_{t-1}, O_{1:t-1}) p(X_{t-1}|O_{1:t-1}) = p(X_t|X_{t-1}) p(X_{t-1}|O_{1:t-1}). \quad (8)

The state X_t at time t is then predicted as follows:

p(X_t|O_{1:t-1}) = \int p(X_t|X_{t-1}) p(X_{t-1}|O_{1:t-1})\, dX_{t-1}. \quad (9)

In the update stage, the pdf p(X_t|O_{1:t}) is updated according to Bayes' theory by

p(X_t|O_{1:t}) = \frac{p(O_t|X_t) p(X_t|O_{1:t-1})}{p(O_t|O_{1:t-1})} \quad (10)

where p(O_t|O_{1:t-1}) is a normalizing constant. The estimated pose state is calculated by minimizing the MSE:

\hat{X}_t^{MSE} = \int X_t\, p(X_t|O_t)\, dX_t. \quad (11)

To make the estimated state computable, we approximate the pdf by a set of weighted samples: N samples \{X_t^{(i)}\} from the pdf p(X_t|O_{1:t}) are drawn by the Monte Carlo method, and the pdf is approximated by

p(X_t|O_{1:t}) = \sum_{i=1}^{N} \omega_t^{(i)} \sigma\big(X_t - X_t^{(i)}\big), \qquad \omega_t^{(i)} = \frac{\omega_t^{(i)}}{\sum_{i=1}^{N} \omega_t^{(i)}}. \quad (12)

After the pose is estimated at time t using (7), the Markov process automatically switches back to the filtering process.

D. Likelihood Function

In this tracking framework, an important step is to calculate the weight \omega_i of each possible pose i. A likelihood function E(X_i, O_k) is introduced to evaluate each weight from the view of the kth camera:

\omega_i \propto E(X_i, O_k). \quad (13)

For each frame captured from one of the cameras, the likelihood function measures the similarity using two terms, a silhouette similarity term S_{sil}(X_i, O) and an edge similarity term S_{edg}(X_i, O), as follows:

E(X_i, O) = \exp\{-(S_{sil}(X_i, O) + S_{edg}(X_i, O))\}. \quad (14)

1) Silhouette Similarity Term: To quantify the degree of similarity, the estimated pose is compared with the silhouette O_s, which is extracted from the depth data.

Note that the silhouette image is captured by the IR camera; namely, the points Q_{IR} of the silhouette image are represented in the IR coordinate system, while the estimated 3D pose is described in the RGB coordinate system. Hence, we first need to eliminate the offset d_{offset} between the IR image and the depth map and then use the calibrated stereo parameters, rotation R and translation T, between the depth and RGB coordinate systems to convert a 3D point in the depth coordinate system into the RGB coordinate system:

Q_{RGB} = R(Q_{IR} + d_{offset}) + T. \quad (15)

After obtaining the silhouette images in the RGB coordinate system, we convert the 3D points Q_{RGB} of both the silhouette image and the estimated pose into 2D points q_{RGB} to compute the error:

q_{RGB} = M Q_{RGB} \quad (16)

M = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}. \quad (17)

We then uniformly sample N_1 points on each projected cylinder of the limbs and compare them with the silhouette image based on the following equation:

S_{sil}(X_i, O) = \frac{1}{N_1} \sum_{i=1}^{N_1} (1 - O_s(i))^2 \quad (18)

where i denotes the index of the sampled points obtained from the estimated pose and O_s(i) is the ith pixel value in the silhouette image. Thus, we can obtain the degree of similarity between the estimation and the observation by counting the number of points that are equal to one. A value of one is considered a correct position for an estimated point, while a zero indicates that the estimated point is not on the human body. The sum of the point values reflects how well they match, as shown in Fig. 6(d).

2) Edge Similarity Term: We convert the RGB image into a grayscale image, as shown in Fig. 6(a), and then extract the edges from this image, as shown in Fig. 6(b). Similar to the silhouette term, N_2 points are sampled uniformly at the edges of the projected cylinders, and the errors are accumulated by

S_{edg}(X_i, O) = \frac{1}{N_2} \sum_{i=1}^{N_2} (1 - O_e(i))^2 \quad (19)

where i denotes the index of the sampled points obtained from the estimated pose and O_e(i) is the ith pixel value in the edge image in Fig. 6(c).

3) Weight Computation: Three weights \omega_k (k = 1, 2, 3) are obtained at every frame, but the final weight used to compute the predicted pose is unique. We choose the minimum of the three weights as the final value based on the following consideration. If a candidate pose with the minimum weight is right in one single view, the pose should be reasonable in the other two views with larger weights as well. However, once a wrong pose is generated, its minimum weight helps to decrease its influence on the tracking process.

Fig. 6. (a) Gray image. (b) Edge map. (c) One of the candidate poses projected onto the edge image; the red points are the points sampled along the projected cylinders. (d) One of the candidate poses projected onto the silhouette image; the red points are the points sampled from the interior of each cylinder.

IV. RESULTS

In this section, we show the accuracy and robustness of our prototype system using three Kinects on a 3.5-GHz 4-core Intel PC with 8-GB memory. We present the tracking results for different actions. The influence of various aspects of our method is investigated, e.g., by dropping each term in our likelihood function, tracking poses with one camera, and capturing skeletons only at the first frame. The proposed method is compared with three state-of-the-art methods in terms of tracking accuracy: the OpenNI tracker [43], which we used for initialization; the ICP technique [32], which is a representative method independent of training; and a recent method proposed in [34] that is based on training. Tracking multiple persons with multiple cameras is also implemented.

A. Tracking Results

Our system works neither by marking joint positions manually nor by wearing special sensors to obtain the markers' initial positions. Moreover, we do not need to reconstruct an accurate full human body mesh either. The proposed method automatically adapts to users with different body sizes without any training and effectively handles self-occlusion of the body parts during complicated motions. Our system does not require people to face the camera, although this is necessary if tracking via only a single camera, especially when occlusions occur. Fig. 7 shows the tracking results for a wide variety of human actions. The poses are squatting, boxing, kicking, walking, and three very challenging motions, as shown in the first row of Fig. 7. The tracking results of our method are given in the three bottom rows of Fig. 7 and observed from different views, which shows that the tracking performance is robust not only for common actions (first to fourth columns) but also for complicated actions (fifth to seventh columns).

B. Comparison With the State-of-the-Art Methods

We evaluated the performance of our system by comparing it with three popular methods: the OpenNI tracker [43], the ICP technique [32], and a recent method proposed in [34].

1) Comparison With OpenNI Tracker: Our system starts the pose estimation with an initial skeleton extracted from the aligned point cloud at each frame. Our system and the OpenNI tracker [43] operate in a similar way. Fig. 8 shows the tracking results at several frames for the two methods. It is clear that our tracking result is more accurate and robust than the pose captured by OpenNI.

2) Comparison With the ICP Technique: We compare our tracking algorithm with the ICP technique proposed in [32]. Similar to our approach, they also model the human with cylinders. The difference is that they use only one single depth camera. The tracking process starts with the same initial pose, and we minimize the distance between the point clouds from the three cameras and the human model. Fig. 9 compares the tracking results. It is clear that our tracking system is superior to the ICP technique, which often falls into locally optimal solutions.

3) Comparison With Shotton et al.'s [34] Method: We compare the tracking performance of our method with the widely used algorithm proposed in [34], the key of which is a body part classifier obtained by training a random forest on a large, realistic, and highly varied synthetic set of training images. The body part classifier can use test images to estimate body part labels and is invariant to pose, body shape, clothing, and other irrelevances. However, the approach does not predict the position of an occluded joint. Fig. 10 shows the tracking results of both the body part classifier and our algorithm. There is hardly any difference between the two methods for general poses, as shown in the first and second columns of Fig. 10. When human actions exhibit extreme variability in pose, the body part classifier cannot recognize body parts with overlap and occlusion, as shown in the last two columns of Fig. 10. In contrast to Shotton et al.'s [34] approach, our method performs well when tracking these complicated actions and does not miss any body parts.

C. Evaluation of the Tracking Process

1) 3D Skeleton Extraction: To test the stability of the 3D skeleton extracted from a point cloud, a fast-moving complicated action is adopted, as shown in Fig. 11(a). The complete point cloud, shown in Fig. 11(b), is first integrated from the multiple cameras. After the processes of geometric contraction and topological thinning, an initial skeleton with a set of candidate nodes is obtained, as shown in Fig. 11(c). These nodes are further optimized to a few nodes according to their connection relations and prior information provided by the OpenNI skeleton, and the optimized skeleton is shown in Fig. 11(d). The example indicates that our algorithm can obtain accurate 3D skeletons, even for some complicated actions.

Fig. 7. Tracking results for different types of actions, including complicated actions.

Fig. 8. Comparison against OpenNI. The first row provides the original color image as the ground truth, the second row is the skeleton captured by OpenNI, and the third row is the skeleton generated by our system.

Fig. 9. Comparison against ICP. The first row provides the original color image as the ground truth, the second row is the skeleton captured by ICP, and the third row is the skeleton generated by our system.

Fig. 10. Comparison against the method proposed in [34]. The first row provides the original color image as the ground truth, the second row is the skeleton captured by the body part classifier, and the third row is the skeleton generated by our system.

2) Importance of Information Captured by Three Cameras: We evaluated the importance of using three cameras by comparing the tracking results obtained with three cameras with those obtained with only a single camera. Fig. 12 shows the tracking results from a single depth camera and those from three cameras. We observe that using three cameras improves the tracking accuracy, especially when significant self-occlusions occur (the fourth column in Fig. 12). The results indicate that using three depth cameras can not only improve the tracking accuracy, but also handle most of the self-occlusion situations and eliminate depth ambiguity.

3) Importance of Skeleton Capture: We investigated the importance of capturing the skeleton at each frame. Existing algorithms that capture skeletons only in the first frame often need a large amount of training data to handle various actions. We compare tracking using skeletons captured only in the first frame with tracking using skeletons captured in each frame. Frames in which the motion changes were significant were selected. For example, two groups

of actions, boxing and walking, are examples in which the first pose is right. The tracking results in Fig. 13 show that tracking with skeletons captured in each frame is able to solve the problems of motion discontinuity and abrupt variation in motion velocity.

Fig. 11. Initial skeleton computation. (a) Original RGB image. (b) Aligned point cloud. (c) Skeleton before the final node selection. (d) Optimized skeleton after the final node selection.

Fig. 12. Importance of adopting three depth cameras. The first row provides the original color image as the ground truth, the second row is the skeleton captured by one single camera, and the third row is the skeleton generated via three cameras.

4) Importance of the Edge Term: We evaluated the effect of the edge term in the likelihood function by tracking human poses with and without it. The edge formed by a human provides a clear contour for every visible limb part. This term commonly works unless people wear very loose clothes. Moreover, the edge term is invariant to clothing color, lighting, and different actions. Fig. 14 shows the comparison results, which verify that the pose is more accurate when the edge term is included.

5) Importance of the Silhouette Term: To investigate the importance of the silhouette term in the likelihood function, we compared the tracking results with and without it. The silhouette can help distinguish whether a hypothesized body part is in the background or foreground. Fig. 15 shows that the silhouette term improves the tracking result significantly.

D. Quantitative Evaluation of Tracking Quality

1) Similarity Error: The similarity error is used to measure the quantitative accuracy of tracking by comparing predicted poses with ground truth poses. Each predicted pose is imposed on the human model, and the cylinders of the human model are projected onto the silhouette and edge images separately, which converts the human model into a 2D representation. It is desirable that all the sampled points on the 2D representation lie in the foreground of the silhouette image and overlap with the edge of the human in the edge image. An exponential function of the difference between their point positions forms the likelihood function. The average of the tracking likelihood functions over the three cameras represents the similarity error between a predicted pose and its ground truth pose.

To prove that our algorithm is able to handle partial occlusions of the human body, we show a set of tracking results with occluded body parts in Fig. 16. Partial occlusions exist in the set of poses in certain views of each single camera. The tracking results show that, although some limbs of the human body are invisible in a certain view, the tracking accuracy is not affected in our multicamera system. Moreover, to further analyze the tracking quality quantitatively, we computed the similarity error as an evaluation criterion. Seven groups of actions were used to measure the tracking errors of our method and the other three representative methods: the OpenNI tracker, the ICP technique, and Shotton et al.'s [34] method. We compare our tracking results to those using 2.5D skeletons extracted using OpenNI and Shotton et al.'s [34] method to verify the advantage of a full 3D skeleton. It is clear that the error rate of 2.5D skeleton tracking by OpenNI is much higher than that of 3D skeleton tracking by our algorithm, especially when occlusion occurs or the action is complicated. Moreover, the performance of our method is superior to that of the latest method, i.e., Shotton et al.'s [34] method, for most actions. Fig. 17 shows these similarity errors, where it is clear that our tracking system achieves relatively higher tracking accuracy.

E. Tracking Multiple Persons

Our system can accommodate more complicated situations in which multiple humans move in a scene, possibly occluding each other to some extent. The actions of multiple humans are tracked simultaneously. In the experiment, we track two to four humans, and appropriate interactions are allowed. Fig. 18 shows the tracking results, where people interact and partial occlusion happens to a certain extent. It is clear that the tracking accuracy is not affected in these cases.

F. Investigation of Genetic Algorithms

We attempted to substitute the filtering process with a genetic algorithm (GA) to observe the tracking performance. A GA is an effective search algorithm based on natural selection rules. It starts from an initial population, and global optimization is reached as the population evolves through genetic operators such as selection, crossover, and mutation. In our test, we formed an initial population as a set of possible poses based on a mean pose, that is, the initial skeleton extracted from the point cloud, and then evaluated every individual of the population using the likelihood function. A new population with M poses was produced iteratively using the genetic operators according to the evaluation results until a certain number of iterations was reached. The individual with the maximum fitness value after optimization was selected as the predicted pose.
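For completeness, a bare-bones version of the GA variant. Selection, crossover, and mutation are the only operators the paper names; the tournament selection, blend crossover, and Gaussian mutation below are our illustrative choices, not necessarily the authors'.

```python
import numpy as np

def ga_pose_search(mean_pose, fitness, pop_size=100, generations=30,
                   sigma=0.1, rng=np.random.default_rng()):
    """Search for the pose maximizing the likelihood-based fitness."""
    dim = mean_pose.size
    # Initial population of possible poses around the mean (initial) pose.
    pop = mean_pose + rng.normal(0.0, sigma, size=(pop_size, dim))
    for _ in range(generations):
        f = np.array([fitness(ind) for ind in pop])
        new_pop = np.empty_like(pop)
        for k in range(pop_size):
            # Selection: two binary tournaments on fitness.
            a, b = rng.integers(pop_size, size=2)
            p1 = pop[a] if f[a] > f[b] else pop[b]
            a, b = rng.integers(pop_size, size=2)
            p2 = pop[a] if f[a] > f[b] else pop[b]
            # Crossover: random blend of the two parents.
            lam = rng.random()
            child = lam * p1 + (1 - lam) * p2
            # Mutation: small Gaussian perturbation.
            new_pop[k] = child + rng.normal(0.0, 0.02, size=dim)
        pop = new_pop
    f = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(f)]   # individual with maximum fitness
```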

The tracking performance using the GA is shown in Fig. 19. We quantitatively compared the GA and the filtering process by calculating the error rate using the likelihood function for both the GA and filtering. The error rate of the GA is slightly lower than that of filtering, as shown in Table I. Moreover, the tracking accuracy improves as the number of generations M increases.

Fig. 13. Importance of skeleton capture. The first column represents the starting frame with the right pose. Frames are selected randomly for the second to fourth columns. The hypothesized pose with the skeleton captured at every frame is marked in magenta, and the pose with the skeleton captured at only the first frame is marked in cyan.

Fig. 14. Importance of the edge term. The first row provides the original color image as the ground truth, the second row is the skeleton captured without the edge term, and the third row is the skeleton generated with the edge term.

Fig. 15. Importance of the silhouette term. The first row provides the original color image as the ground truth, the second row is the skeleton captured without the silhouette term, and the third row is the skeleton generated with the silhouette term.

Fig. 16. Tracking results under partial occlusions of the human body.

Fig. 17. Quantitative evaluation of tracking accuracy.

V. DISCUSSION

A. Number of Depth Cameras

If only one Kinect is used, human body self-occlusion frequently occurs when the user turns around or crosses his/her limbs, which results in missing depth data for the occluded body parts. When two cameras are employed, we need to put them face to face to expand the field of view as much as possible. As we know, the depth data are acquired by an IR projector that sends out a fixed pattern of light and an IR camera that records the reflected speckle from objects. The depth is then calculated from the speckle pattern, which is memorized at a known depth. However, two Kinects produce strong IR interference, and depth value errors are yielded.

Therefore, we use three Kinects to obtain the depth data and distribute them uniformly over the space. They are placed in a circle with about 120° between each Kinect to ensure a full-view scan and reduce the interference as much as possible. There is no significant performance improvement with more than three Kinects; hence, we chose three Kinects for our final system.

Fig. 18. Actions of multiple humans are tracked simultaneously.

TABLE I. Quantitative comparison between GA and filtering by error rates.

Fig. 19. Tracking performance using the GA.

B. Initial Pose

In other literature, updating the pose of the current frame from the previous frame usually needs a priori knowledge to determine the accurate range of the degrees of freedom. Without the accurate variation range, which is obtained from a training set in most cases, the tracking result may fall into a local optimum. The tracking performance may become totally wrong when a sudden change in human motion velocity or a discontinuous motion occurs. However, we do not resort to any training sets. To deal with the problem of sudden motion changes, we extract the skeleton from the aligned point cloud in each frame to reduce the difference between the possible poses generated from the initial pose and the true pose as much as possible.

C. Markov Models

In general, updating the current pose according to the previous pose is based on Markov model theory. However, a prelearned model is required to determine the motion range of the possible poses generated from the previous pose. Therefore, using only the Markov model may lead to inaccurate tracking performance, especially when discontinuous or abrupt motions occur. In our algorithm, a relatively correct pose is extracted as the initial pose from the point cloud, which ensures that our algorithm does not yield an initial pose that is significantly distinct from the estimated pose. The search space is dramatically reduced. The Markov model is only used when the error rate of the initial pose exceeds a threshold. In this situation, we use the Markov model, which takes the previous pose as the initial pose and estimates the current pose over a relatively large space with the poses of the previous frames taken into consideration. The integration of a Markov model into our algorithm makes our tracking performance more stable and accurate.

D. Edge and Silhouette Images

In fact, the edge and silhouette images play different roles in the proposed method. The silhouette image extracted from one depth image is used to determine the basic outline of the captured human. However, when the limbs overlap, for example, when the arms lie in front of the torso from the viewpoint of the camera, the silhouette image can only capture the position of the torso. In other words, the silhouette image cannot reduce the ambiguity of the limb positions in such a situation. In this case, the edge images computed from the color images can help to reduce the ambiguity and better localize the body limbs.

E. Occlusion

Occlusion is regarded as a relative concept. In general, the occlusion we mention is a pose that cannot be captured by a single camera. For example, a pose cannot be estimated accurately when a person moves an arm or leg that is invisible to the camera. However, a complete point cloud of the person can be captured in a system composed of multiple depth cameras, which reduces the ambiguity of the pose to a certain extent. Furthermore, we estimate the pose with an initially known pose computed from the aligned point cloud as a priori knowledge, which ensures that the initial pose will not differ much from the true pose to be estimated.

F. 3D Skeleton

Tracking methods based on a single depth camera obtain only 2.5D depth data, and the skeleton is estimated from these data; we therefore regard such a skeleton as a 2.5D representation. In our method, the three depth cameras forming one stereo environment capture the full human body in 3D, and we regard the resulting skeleton as a full 3D version. The initial skeleton is computed from the 3D point cloud aligned from the depth data of the three cameras.

VI. CONCLUSION
In this paper, we have developed a human body motion tracking system with multiple depth cameras. Our system is appealing because it tracks full human body motion with low-cost sensors. Moreover, our solution can handle different types of motion as well as the self-occlusions that cannot be overcome by a single camera. We verified the performance of our system by capturing a group of human actions in indoor scenes. The system is robust against sensor noise, missing depth data, and occlusion, and it is comparable with the state-of-the-art methods. The tracking results show that our algorithm is competent for tracking human actions via three depth cameras without any training, manual operation, or mesh model reconstruction of the full human body.
A. Limitations

Currently, our approach depends heavily on the point cloud aligned from the three depth cameras. Thus, the tracking process will fail when the depth data of some body part are largely missing, which may cause the initial skeleton extraction to fail. Another limitation is that complicated interactions between two or more persons, for example, wrestling and hugging, cannot be handled, because the arms and legs are frequently occluded by one another. In this case, our algorithm may lose some body parts while tracking.
B. Future Work

In the future, the current tracking system should be improved by adding extra inertial sensors to capture complicated motions, such as crawling and rolling. Moreover, we are particularly interested in incorporating acceleration information, which reflects abrupt changes in human motion, into the current tracking framework. Finally, we wish to explore rich human interactions between multiple persons, e.g., wrestling.
Zhenbao Liu (M'11) received the bachelor's and master's degrees from Northwestern Polytechnical University, Xi'an, China, in 2001 and 2004, respectively, and the Ph.D. degree from the College of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan, in 2009.
He was a Visiting Scholar with Simon Fraser University, Burnaby, BC, Canada, in 2012. He is currently an Associate Professor with Northwestern Polytechnical University. He has authored over 50 papers in major international journals and conferences. His research interests include computer graphics, computer vision, and shape analysis.

Jinxin Huang was born in Hubei, China, in 1992. She received the bachelor's degree in electrical engineering and automation from Northwestern Polytechnical University, Xi'an, China, in 2014, where she is currently working toward the master's degree in transportation tools and applications.
Her research interests include human–computer interaction, including 3D human reconstruction.
Junwei Han (M'12–SM'15) received the Ph.D. degree in pattern recognition and intelligent systems from the School of Automation, Northwestern Polytechnical University, Xi'an, China, in 2003.
He is a Professor with Northwestern Polytechnical University. His research interests include multimedia processing and brain imaging analysis.
Dr. Han is an Associate Editor of the IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, Neurocomputing, and Multidimensional Systems and Signal Processing.
Shuhui Bu received the master's and Ph.D. degrees from the College of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan, in 2006 and 2009, respectively.
He was an Assistant Professor with Kyoto University, Kyoto, Japan, from 2009 to 2011. He is currently an Associate Professor with Northwestern Polytechnical University, Xi'an, China. He has authored approximately 40 papers in major international journals and conferences. His research interests include computer vision and robotics.

Jianfeng Lv was born in Xinjiang, China, in 1993. He is currently working toward the master's degree in transportation tools and applications with Northwestern Polytechnical University, Xi'an, China.
His research interests include human–computer interaction.
