Abstract— The advent of low-cost depth cameras, such as the Microsoft Kinect in the consumer market, has made many indoor applications and games based on motion tracking available to the everyday user. However, it is a large challenge to track human motion via such a camera because of its low-quality images, missing depth values, and noise. In this paper, we propose a novel human motion capture method based on a cooperative structure of multiple low-cost RGBD cameras, which effectively avoids these problems. This structure also manages the body occlusions that appear when a single camera is used. Moreover, the whole process does not require training data, which makes this approach easy to deploy and reduces operation time. We use the color image, depth image, and point cloud acquired in each view as the data source, and an initial pose is extracted in our optimization framework by aligning multiple point clouds from different cameras. The pose is dynamically updated by combining a filtering approach with a Markov model to estimate new poses in video streams. To verify the efficiency and robustness of our approach, we capture a wide variety of human actions via three cameras in indoor scenes and compare the tracking results of the proposed method to those of current state-of-the-art methods. Moreover, our system is tested in more complex situations, in which multiple humans move within a scene, possibly occluding each other to some extent. The actions of multiple humans are tracked simultaneously, which would assist group behavior analysis.

Index Terms— Human motion tracking, multiple depth cameras, multiple humans, skeleton.

I. INTRODUCTION

With the development of sensor technologies, many new and economical camera devices have been produced, e.g., the Microsoft Kinect and Asus Xtion, which make spatial information acquisition easier [1]. These affordable devices are regarded as an alternative technology for performing tracking tasks in indoor scenes. Although these cameras have appealing advantages, it is still a big challenge to track human motion via a depth camera in cluttered indoor environments, as they suffer from noise and missing data [2]. Furthermore, when tracking via a single camera, self-occlusion frequently occurs when the tracked human turns around or crosses his or her limbs. As a result, the depth information of the occluded body parts cannot be captured, and these parts may be lost during tracking.

Currently, one category of motion tracking algorithms requires the manual initialization of human joints, while an alternative needs to obtain initial joint positions from a motion capture system. In addition, many state-of-the-art motion tracking algorithms via a single depth camera rely on a large training data set to train classifiers to recognize complex human motion, which is difficult to track. In contrast to these methods, we track complicated human actions with multiple depth cameras in 3D space without manual initialization or full human body reconstruction, which is more suitable for home applications. Furthermore, the proposed system is efficient, robust to different types of poses, and able to handle body part self-occlusion.
parameters of the two cameras. The depth information, also called the point cloud, is converted from a depth image after the internal calibration of one depth camera. Three sets of point clouds are integrated into one full point cloud of the human body. The skeleton is extracted from the full point cloud using geometrical and topological operations. The skeleton of a standard T-pose is captured before tracking. A T-pose is a reference pose in which the legs of the tracked human are straight and his/her arms are straightened horizontally, forming a T-shape. The T-pose is used to obtain the limb lengths of the human model. In our method, although we can obtain the initial skeleton from the aligned point cloud, the skeleton is represented by a set of points that should be selected via a priori knowledge of the human body, namely, the limb lengths. The motion angles of connected limbs are deduced from the initial pose, namely, the skeleton at each frame. The angle information is employed to construct a human model in which cylinders represent body parts. The generated human model is then projected onto a 2D image plane.

2) Filtering Process: The filtering process is used to estimate the final pose for each frame. In particular, the filtering process approximates the probability density function (pdf) by random sampling. The mathematical model of the human motion can be simplified using the sampled candidate poses. Therefore, these poses can be randomly generated according to an initial pose that is similar to the true pose. These poses are assigned associated weights that are updated by a likelihood function. The estimated pose is finally computed as the weighted sum of all sampled poses.

3) Error Calculation and Pose Updating: We obtain the optimal pose via MSE minimization. In detail, we project the predicted pose onto the three views of the depth cameras to calculate the errors. A likelihood function composed of two image features, namely, edge features extracted from the RGB images and silhouette features extracted from the corresponding depth images, is modeled to estimate how well the predicted pose fits the acquired image. The objective of the likelihood function is to compare the projected pose with the image features. After the error calculation, we integrate the errors from different views and assign weights to the possible poses according to the error. Following the weight calculation, the best pose is generated by summing up the products of the pose states and their corresponding weights. The human model is then replaced with the best pose.

B. Contributions

We implement the whole pipeline for human motion tracking in a prototype system and demonstrate its usefulness with several types of human actions taken from three depth cameras. No accurate full human body mesh is needed in our algorithm, which results in high efficiency and makes tracking easier. Our approach is also free of training recognition data, which saves deployment time. This approach is made possible by the following two technical contributions.

1) We extract an initial skeleton from the point cloud, to which the point clouds captured from three depth cameras are aligned for each frame. The use of the initial skeleton per frame helps avoid motion discontinuity and abrupt variation in motion velocity, and also reduces the search space, which reduces the tracking time.

2) We propose a motion tracking algorithm in a joint filtering and Markov model framework with three RGBD cameras. The joint tracking framework fully uses the pose information observed with each single camera and, hence, improves tracking performance. The experimental results support the conclusion that this algorithm is robust to self-occlusions of the human body and cross occlusions among multiple humans.

C. Related Work

In this section, we mainly survey related applications based on depth cameras, body tracking methods, and pose tracking methods in the domains of video, image, multimedia, and computer vision. Body tracking means that the human body is tracked with a bounding box in 2D or 3D, and its objective is to know the person's position and trajectory in outdoor and indoor scenes. In contrast, pose tracking methods aim at tracking the person's body parts, such as arms and legs, and analyzing his or her actions.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 27, NO. 9, SEPTEMBER 2017

1) RGBD Camera Applications: With the development of sensor technologies, RGBD cameras, such as the Microsoft Kinect, have been applied in many research fields in computer vision [3], [4], such as 3D reconstruction, object recognition, and pose estimation. In [5], the reconstruction of a 3D human body from a sequence of RGBD frames is represented by employing a novel parameterization of cylindrical-type objects using the Cartesian tensor and the B-spline bases along the radial and longitudinal dimensions, respectively. Lai et al. [6] performed object recognition and detection through the combination of visual cues, depth cues, and rough knowledge of the configuration between the turntable and the RGBD cameras. Jiang et al. [7] proposed a novel multilayered gesture recognition method with Kinect and obtained relatively high performance for one-shot-learning gesture recognition. Haker et al. [8] presented a pose estimation method by fitting a simple model of the human body to the point cloud of a pose in 3D space. Liu et al. [9] proposed a 3D human body reconstruction method by template fitting via multiple depth cameras.

2) Body Tracking: In some applications, the motion trajectory and the full human body are tracked for video surveillance systems. A review and a comparison of state-of-the-art tracking methods based on sparse coding are presented in [10]. Zhu et al. [11] proposed an object tracking method in structured environments. They use a distance transform to model the environment state and solve the tracking problem in a Bayesian framework. Yang et al. [12] boosted an optimal combination of features and kernels in an extended multiple kernel learning framework to achieve effective and efficient tracking. Ess et al. [13] proposed a two-stage tracking solution on a mobile platform, where they first build a simplified model to estimate the scene geometry and an overcomplete set of object detections and then address object interactions, tracking, and prediction. In [14], a human detection and tracking system is proposed in which visual features are extracted from the RGB images to track the human body segmented from the scene represented by the depth images in successive frames. Xia et al. [15] proposed a model-based approach using depth information from Kinect, which detects humans using a 2D head contour model and a 3D head surface model. Zhang et al. [16] introduced structurally random projection and weighted least squares into visual tracking to relax the sparsity constraint when a set of target and trivial templates is used to linearly represent each target candidate. Wang et al. [17] tailored keypoint matching to track the 3D pose of the user's head in a video stream. Zhang et al. [18] proposed local patch movement modeling from the perspective of the uncertainty principle.

3) Pose Tracking: Another type of tracking algorithm aims at tracking the movement details of human body poses. The tracked skeletons can be applied in animation, games, and so on, which is the motivation of our work as well. Machine learning methods based on prior knowledge to obtain better estimates of 3D human pose are introduced in [19]–[22]. Li et al. [23] proposed a heterogeneous multitask learning framework using a deep convolutional neural network. However, these methods require a large number of training samples, and the tracking accuracy relies heavily on the quality and quantity of the training data, which limits their applications. Bingbing et al. [24] learned a compact low-dimensional representation of motion statistics from similar motion patterns and then sampled in the low-dimensional space during tracking to reduce the computation time. In [25], the images are described using salient interest points represented by scale-invariant feature transform-like descriptors, and the mapping between poses and features is then modeled by a Gaussian process and multiple linear regression.

In retrieval-based tracking systems, poses are usually compared with a motion database and the most similar pose is selected. Helten et al. [26] demonstrated a sensor fusion approach for real-time full-body tracking, in which a generative tracker and a discriminative tracker are combined to retrieve the closest pose in a database. Liu et al. [27] presented a full-body human motion tracking system by proposing an exemplar-based conditional particle filter, in which the best exemplar is chosen by comparison with the current frame. The state of the target to be tracked over time can also be predicted by introducing decision-theoretic online learning and using a set of weighted experts [28]. Targeting the fusion of multiple tracking results, a symbiotic tracker ensemble framework [29] is proposed to effectively combine the outputs of multiple trackers and jointly explore the consistency of each tracker and the pairwise tracker correlations.

In image-based tracking systems, motion tracking usually initializes the 3D human pose manually and minimizes the difference between the hypothesized pose and the observed image features. However, high-dimensional 3D human motion is not easy to map onto a 2D image due to self-occlusion, and the depth information of the hypothesized pose is also ambiguous. Therefore, 3D human motion tracking via multiple views has been proposed. Gavrila and Davis [30] formulated the 3D tracking problem as a search problem and found the most similar appearance of the subject in the multiview images. Deutscher and Reid [31] applied an annealed particle filter algorithm to motion tracking via four cameras.

The advent of depth cameras has attracted much attention in the motion-tracking field. Knoop et al. [32] achieved 3D tracking of human body movements based on a 3D body model and the iterative closest point (ICP) algorithm, but it easily falls into local optima. Wei et al. [33] presented a fast, automatic tracking method by iteratively registering a 3D articulated human body model with monocular depth cues via linear system solvers in a maximum a posteriori framework. Shotton et al. [34] proposed a quick and accurate recognition approach in which an intermediate body part representation is used to transform a pose estimation problem [4] into a per-pixel classification problem [35], [36]. Given a depth image, a per-pixel body part distribution is inferred, and the local modes are estimated to determine the locations of joints. Alexiadis et al. [37] built a real-time automatic system for dance performance evaluation using a Kinect RGBD sensor and provided visual feedback for beginners in a 3D virtual scene.
II. DATA ACQUISITION AND PREPROCESSING

We focus on tracking human motion using multiple depth cameras, each of which captures images via both a depth camera and an RGB camera at 30 frames/s. We capture the RGB images and the depth images simultaneously, extract human silhouettes, and represent the human model with joints.

A. Data Acquisition

Human motion is tracked by collecting data from three low-cost depth cameras, each of which captures 640 × 480 depth and RGB images synchronously. The depth cameras are placed in a circle with a radius of ∼2.5 m and ∼90° between each pair of cameras. Each camera is connected to a computer, and we randomly select one of the computers to be the master. The other two computers are connected to the master via network cables. The data are stored simultaneously when the master computer sends a storage signal to the other two computers. The motion capture system is shown in Fig. 2.

B. RGBD Camera Calibration

One RGBD camera consists of a depth sensor, an RGB camera, and a projector that casts a fixed speckle pattern. Because of the hardware configuration, a scene obtained from the RGB camera is slightly different from the one captured via the depth sensor. As mentioned in [38], there is a fixed offset between the raw infrared (IR) image and the depth map. To correct for the offset and convert them into the same coordinate system, we first calibrate the depth camera, as shown in Fig. 3. In this way, we convert the 3D point coordinates in the depth coordinate system into the 2D point coordinates in the RGB coordinate system.

1) Self-Calibration: We prepare a planar checkerboard and take a few images of the checkerboard from different views using the RGB camera. To calibrate the IR camera, we cover the projector so that the calibration grid is not corrupted by the projected pattern and then repeat the same steps as for the RGB camera. We then use a standard calibration technique [39] to calibrate the RGB and IR cameras separately. The offset between the IR camera and the depth map is corrected [38]. We compute a set of intrinsic parameters to perform the projection between the 2D points and the 3D points. After the intrinsic calibration, we perform an extrinsic calibration between the depth and RGB cameras by applying a local rotation and translation. The extrinsic parameters are used to convert the depth coordinate system into the RGB coordinate system.

2) Stereo Calibration: To integrate the point clouds from different views, we need to know the relative positions and orientations between each pair of depth cameras. We apply a global registration method [39] between every pair of RGB cameras so as to align the different RGBD cameras.

C. Human Model Representation

In some motion tracking methods, two main approaches are used to obtain the initial human pose. One is to initialize the pose by applying a motion tracking system with markers, and the other is to locate the human joints manually in advance. The former is a very complicated and expensive way to deploy the whole motion tracking hardware system, while the latter is not convenient for users, as they must mark the joints manually. To initialize the system automatically, we use OpenNI to obtain the skeleton coordinates of a standard T-pose before the tracking process, as shown in Fig. 4(a). The skeleton coordinates consist of 15 joints for a full human body, namely, the head, neck, torso center, hips, knees, feet, shoulders, elbows, and hands. We determine the length of each limb l = {l_1, . . . , l_10} in advance according to the skeleton information.

Instead of reconstructing an accurate human mesh model, we represent the human body with ten limbs to reduce the time of real reconstruction, as shown in Fig. 4(b). Each limb has its own local coordinate system. Each part is simulated by one cylinder, except for the torso, which is represented by an elliptical cylinder. Rather than modeling the human body
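The intrinsic and extrinsic conversions described in Section II-B can be sketched as follows. The pinhole back-projection and the rigid transform are standard operations; the function names and the parameter values used below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (in meters) to 3D points in the depth camera
    frame using pinhole intrinsics, as in the intrinsic-calibration step."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def to_rgb_frame(points, R, t):
    """Apply the extrinsic rotation R and translation t to move points from
    the depth coordinate system into the RGB coordinate system."""
    return points @ R.T + t

# Hypothetical example: a flat 2x2 depth map one meter away, identity
# rotation, and a 1 m translation along the optical axis.
pts = depth_to_points(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
moved = to_rgb_frame(pts, np.eye(3), np.array([0.0, 0.0, 1.0]))
```

The same rigid-transform helper also applies to the stereo-calibration step, where the rotation and translation relate a pair of depth cameras rather than the depth and RGB sensors of one device.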
the reflection and absorption of environmental light on the surfaces of objects. The missing depth data lead to holes in the depth image, as shown in Fig. 5 (left). To accurately segment the foreground from the depth image, we have to complement the depth information in advance.

Furthermore, P(i_{x,y}) is the probability of i_{x,y} in the current model. For pixel (x, y), supposing that it was a background point until it becomes a foreground point, i_{x,y} will vary significantly relative to \mu_{x,y}. Here, we define another threshold T to determine whether a pixel belongs to the foreground

R = \begin{bmatrix} c(\alpha)c(\gamma) - s(\alpha)s(\gamma)s(\beta) & -s(\alpha)c(\beta) & c(\alpha)s(\gamma) + s(\alpha)s(\beta)c(\gamma) \\ s(\alpha)c(\gamma) + c(\alpha)s(\beta)s(\gamma) & c(\alpha)c(\beta) & s(\alpha)s(\gamma) - c(\alpha)s(\beta)c(\gamma) \\ -c(\beta)s(\gamma) & s(\beta) & c(\beta)c(\gamma) \end{bmatrix} \quad (1)
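The rotation matrix of (1) can be checked numerically. The sketch below builds R with c = cos and s = sin; the entry in row 2, column 3 is taken as s(α)s(γ) − c(α)s(β)c(γ), which is what row orthonormality requires (the extracted text garbles this term), and the angle values are arbitrary test inputs.

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Rotation matrix R of (1), with c = cos and s = sin."""
    ca, cb, cg = np.cos(alpha), np.cos(beta), np.cos(gamma)
    sa, sb, sg = np.sin(alpha), np.sin(beta), np.sin(gamma)
    return np.array([
        [ca * cg - sa * sg * sb, -sa * cb, ca * sg + sa * sb * cg],
        [sa * cg + ca * sb * sg,  ca * cb, sa * sg - ca * sb * cg],
        [-cb * sg,                sb,      cb * cg],
    ])

# A proper rotation satisfies R @ R.T = I and det(R) = 1.
R = rotation_matrix(0.1, 0.2, 0.3)
print(np.allclose(R @ R.T, np.eye(3)), np.isclose(np.linalg.det(R), 1.0))
```

The entries are consistent with composing elementary rotations about the z-, x-, and y-axes by α, β, and γ, respectively.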
or background. If P(i_{x,y}) > T, then (x, y) is regarded as a foreground pixel; otherwise, it is a background pixel. The Gaussian model is updated by the following equation:

M_t = \alpha I_t + (1 - \alpha) M_{t-1} \quad (3)

where M_t is the Gaussian model at time t and M_{t-1} is the model at time t − 1. The parameter α of the Gaussian model is used to balance the influence of the current image on the previous Gaussian model when it updates. A larger value may result in unstable background removal, because the background in indoor scenes stays almost unchanged. It is desirable for the Gaussian distribution to be updated slowly. In our experiments, the value of α is tuned to 0.1 empirically. Furthermore, I_t represents the mean and variance of the current frame at time t. The essence of the Gaussian model is to update the mean and variance of each pixel. The result is shown in Fig. 5 (right).

III. 3D POSE TRACKING

The proposed tracking framework for a 3D human pose is introduced in this section. We describe how to incorporate the silhouette features and the edge features originating from multiple cameras into the whole framework. With the color images and the depth images captured from the three Kinects, the depth information, namely, the point clouds from the different Kinects, is obtained from the depth images after an internal calibration of each depth camera. A complete point cloud is aligned from the three point clouds from different views. During the tracking process, a full 3D skeleton is computed from the aligned 3D point cloud by geometrical and topological operations, followed by a combination of a filtering stage and Markov model estimation of the current pose. A likelihood function is adopted to generate pose weights and compute estimation errors.

A. Initial Skeleton Computation

With the camera parameters of the relative positions and orientations obtained via stereo calibration, we register the point clouds from the three cameras to a global coordinate system. We downsample the aligned point cloud uniformly and then apply a skeleton computation on the point cloud data. The computation process is divided into two main steps: geometry contraction and topological thinning. On the downsampled point cloud P, a tangent plane is determined based on the K nearest neighbors of a point P_i in P, and then a planar Delaunay triangulation is performed to define the one-ring neighbors of P_i. During the contraction, we contract the point cloud by iteratively solving the following linear system:

\begin{bmatrix} W_L L \\ W_H \end{bmatrix} P' = \begin{bmatrix} 0 \\ W_H P \end{bmatrix} \quad (4)

where P' denotes the point cloud after contraction, and W_L and W_H are the two matrices that weigh the two operations, namely, Laplacian (L) contraction and attraction. The matrix computation runs iteratively until the initial point cloud becomes a point set without volume. We use farthest-point sampling to further reduce the size of the contracted point cloud. Two neighboring points may be combined into one if their distance is below a threshold, which is set empirically. In this paper, a better sampling effect can be achieved when its value equals 0.18 times the length of the longest diagonal of the human bounding box. Two neighboring sampled nodes are connected to form a skeleton segment if their one-ring neighbors stay consistent. The iterations are terminated when all the triangles are removed. Finally, an initial skeleton with a set of candidate nodes is obtained, and these nodes are further optimized to a few nodes according to their connection relations and the prior information provided by the OpenNI skeleton.

B. Filtering Process

Because there are differences between an initial pose and the true pose, we propose a filtering process to predict the true pose with the initial pose aligned from the point cloud. To make the state estimation more robust, we further introduce an annealing stage. Several layers are used to estimate the optimal state at each time. The human model is improved at each layer, and the new estimated poses are generated randomly. The poses are accepted according to their weights. As each layer propagates, the generation space for producing new poses becomes more restricted. After all the layers finish, an optimal pose state is obtained. The detailed steps are described as follows.

1) Initialize the number of layers M and the number of estimated poses N.

2) For every time step t, the unweighted poses X_t^m are generated at each layer m. Note that the optimization starts from the Mth layer and ends at the first layer. The parameter set X_0 of the initial pose is used as the mean to produce the unweighted poses as follows:

X_{t,i}^m = X_0 + B_m \quad (5)

where B_m is a multivariate Gaussian random variable with covariance P_m and zero mean. For X_0, m = M. The covariance decreases as the layer number decreases.

3) A weight \omega_{t,i}^m is assigned to each pose according to the error computed by the cost function. The weights are normalized such that \sum_{i=1}^{N} \omega_{t,i}^m = 1.

4) New poses X_t^{m-1} are generated. More poses are distributed around the ones with higher weights at the upper layer. For example, the ith pose with weight \omega_i^m is chosen to produce a new pose. Each pose is produced by

X_{t,i}^{m-1} = X_{t,i}^m + B_m \quad (6)

The number of new candidate poses satisfies the requirement that the value of N cannot be changed.

5) The process is repeated until layer 1, and the obtained possible poses X_{t,i}^1 are used to estimate the real pose state at time t as

\hat{X}_t = \sum_{i=1}^{N} \omega_t^{(i)} X_{t,i}^1 \quad (7)

6) The process returns to step 2, and a new initial pose is initialized at time t + 1. Then, X_{t+1} is predicted by steps 2–5.
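Steps 1)–6) can be sketched as the following annealed sampling loop. The quadratic cost, the exponential weighting, and the per-layer covariance schedule below are simplified assumptions for illustration, not the exact cost function or covariance sequence P_m of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def annealed_estimate(x0, cost, M=3, N=50, sigma0=0.2):
    """Annealed sampling sketch of steps 1)-6): starting from the initial
    pose x0, layers m = M..1 draw N Gaussian perturbations as in Eq. (5),
    weight them by exp(-cost) and normalize, resample around the heavier
    poses with a shrinking covariance as in Eq. (6), and finally return the
    weighted sum of Eq. (7)."""
    poses = x0 + rng.normal(0.0, sigma0, size=(N, x0.size))   # Eq. (5)
    for m in range(M, 0, -1):
        w = np.exp(-np.array([cost(p) for p in poses]))
        w = w / w.sum()                                       # normalize weights
        if m > 1:
            idx = rng.choice(N, size=N, p=w)                  # favor heavy poses
            sigma = sigma0 * (m - 1) / M                      # covariance shrinks
            poses = poses[idx] + rng.normal(0.0, sigma, size=poses.shape)  # Eq. (6)
    return w @ poses                                          # Eq. (7)

# Hypothetical target pose and cost; the estimate should move toward it.
true_pose = np.array([0.5, -0.3])
est = annealed_estimate(np.zeros(2), lambda p: 20.0 * np.sum((p - true_pose) ** 2))
```

In the paper the cost comes from the likelihood function over edge and silhouette features across the three views; the toy quadratic cost here only demonstrates the layer mechanics.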
To make the estimated state available, we approximate the pdf by summing a set of weighted samples; N samples \{x_t^{(i)}\} from the pdf p(X_t | O_{1:t}) are approximated by the Monte Carlo method, in which the pdf is approximated by

\begin{cases} p(X_t \mid O_{1:t}) = \sum_{i=1}^{N} \omega_t^{(i)} \delta\left(X_t - X_t^{(i)}\right) \\ \omega_t^{(i)} = \dfrac{\omega_t^{(i)}}{\sum_{i=1}^{N} \omega_t^{(i)}} \end{cases} \quad (12)

After the pose is estimated at the tth time step using (7), the Markov process is automatically switched to the filtering process.

S_{sil}(X_i, O) = \sum_{i=1}^{N_1} (1 - O_s(i))^2 \quad (18)

where i denotes the index of the sampled points obtained from the estimated pose and O_s(i) is the ith pixel value in the silhouette image. Thus, we can obtain the similarity degree between the estimation and the observation by counting the number of points that are equal to one. A value of one is considered to be the correct position for an estimated point, while a zero indicates that the estimated point is not on the human body. The sum of the point values can reflect how well they match, as shown in Fig. 6(d).

2) Edge Similarity Term: We convert the RGB image into a grayscale image, as shown in Fig. 6(a), and then extract the edges from this image, as shown in Fig. 6(b). Similar to the
Fig. 8. Comparison against OpenNI. The first row provides the original color image as the ground truth, the second row is the skeleton captured by OpenNI, and the third row is the skeleton generated by our system.

Fig. 10. Comparison against the method proposed in [34]. The first row provides the original color image as the ground truth, the second row is the skeleton captured by the body part classifier, and the third row is the skeleton generated by our system.

Fig. 13. Importance of skeleton capture. The first column represents the starting frame with the right pose. Frames are selected randomly for the second to fourth columns. The hypothesized pose with the skeleton captured at every frame is marked in magenta, and the pose with the skeleton captured at only the first frame is marked in cyan.

Fig. 14. Importance of edge term. The first row provides the original color image as the ground truth, the second row is the skeleton captured without the edge term, and the third row is the skeleton generated with the edge term.
calculating the error rate using the likelihood function for both the GA and filtering. The error rate of the GA is slightly lower than that of filtering, as shown in Table I. Moreover, the tracking accuracy improves as the number of generations M increases.

Fig. 15. Importance of silhouette term. The first row provides the original color image as the ground truth, the second row is the skeleton captured without the silhouette term, and the third row is the skeleton generated with the silhouette term.

V. DISCUSSION

A. Number of Depth Cameras

If only one Kinect is used, human body self-occlusion frequently occurs when the user turns around or crosses his/her limbs, which results in missing depth data for the occluded body parts. When two cameras are employed, we need to place them face to face to expand the field of view as much as possible. As we know, the depth data are acquired by an IR projector that sends out a fixed pattern of light and an IR camera that records the speckle reflected from objects. The depth is then calculated from the speckle, which is memorized at a known depth. However, two Kinects produce strong IR interference, and depth value errors result.
[30] D. M. Gavrila and L. S. Davis, "3-D model-based tracking of humans in action: A multi-view approach," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 1996, pp. 73–80.
[31] J. Deutscher and I. Reid, "Articulated body motion capture by stochastic search," Int. J. Comput. Vis., vol. 61, no. 2, pp. 185–205, 2005.
[32] S. Knoop, S. Vacek, and R. Dillmann, "Sensor fusion for 3D human body tracking with an articulated 3D body model," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2006, pp. 1686–1691.
[33] X. Wei, P. Zhang, and J. Chai, "Accurate realtime full-body motion capture using a single depth camera," ACM Trans. Graph., vol. 31, no. 6, pp. 188:1–188:12, Nov. 2012.
[34] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," Commun. ACM, vol. 56, no. 1, pp. 116–124, Jan. 2013.
[35] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS J. Photogramm. Remote Sens., vol. 98, pp. 119–132, Dec. 2014.
[36] G. Cheng, J. Han, L. Guo, Z. Liu, S. Bu, and J. Ren, "Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4238–4249, Aug. 2015.
[37] D. S. Alexiadis, P. Kelly, P. Daras, N. E. O'Connor, T. Boubekeur, and M. B. Moussa, "Evaluating a dancer's performance using Kinect-based skeleton tracking," in Proc. 19th ACM Int. Conf. Multimedia, 2011, pp. 659–662.
[38] J. Smisek, M. Jancosek, and T. Pajdla, "3D with Kinect," in Consumer Depth Cameras for Computer Vision (Advances in Computer Vision and Pattern Recognition), A. Fossati, J. Gall, H. Grabner, X. Ren, and K. Konolige, Eds. London, U.K.: Springer, 2013, pp. 3–25.
[39] J.-Y. Bouguet. (2007). Camera Calibration Toolbox for MATLAB. [Online]. Available: http://www.vision.caltech.edu/bouguetj/calib_doc
[40] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, "Background prior-based salient object detection via deep reconstruction residual," IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 8, pp. 1309–1321, Aug. 2015.
[41] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 780–785, Jul. 1997.
[42] J. Han, S. He, X. Qian, D. Wang, L. Guo, and T. Liu, "An object-oriented visual saliency detection framework based on sparse coding representations," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 12, pp. 2009–2021, Dec. 2013.
[43] (2012). OpenNI Tracker. [Online]. Available: http://wiki.ros.org/openni_tracker

Jinxin Huang was born in Hubei, China, in 1992. She received the bachelor's degree in electrical engineering and automation from Northwestern Polytechnical University, Xi'an, China, in 2014, where she is currently working toward the master's degree in transportation tools and applications. Her research interests include human–computer interaction, including 3D human reconstruction.

Junwei Han (M'12–SM'15) received the Ph.D. degree in pattern recognition and intelligent systems from the School of Automation, Northwestern Polytechnical University, Xi'an, China, in 2003. He is a Professor with Northwestern Polytechnical University. His research interests include multimedia processing and brain imaging analysis. Dr. Han is an Associate Editor of IEEE TRANSACTIONS ON HUMAN MACHINE SYSTEMS, Neurocomputing, and Multidimensional Systems and Signal Processing.

Shuhui Bu received the master's and Ph.D. degrees from the College of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan, in 2006 and 2009, respectively. He was an Assistant Professor with Kyoto University, Kyoto, Japan, from 2009 to 2011. He is currently an Associate Professor with Northwestern Polytechnical University, Xi'an, China. He has authored approximately 40 papers in major international journals and conferences. His research interests include computer vision and robotics.