Markerless Monocular Motion Capture: Constraints
Jinho Lee
Rick Parent
The Ohio State University
ABSTRACT
We present a technique to extract motion parameters of a human
figure from a single video stream. Our goal is to prototype motion
synthesis rapidly for game design and animation applications. For
example, our approach is especially useful in situations where motion capture systems are restricted in their usefulness given the various required instrumentation. Similarly, our approach can be used
to synthesize motion from archival footage. By extracting the silhouette of the foreground figure and using a model-based approach,
the problem is re-formulated as a local, optimized search of the pose
space. The pose space consists of 6 rigid-body transformation parameters plus the internal joint angles of the figure. The silhouette of
the figure from the captured video is compared against the silhouette of a synthetic figure using a pixel-by-pixel, distance-based cost
function to evaluate goodness-of-fit. For a single video stream,
this is not without problems. Occlusion and ambiguities arising
from the use of a single view often cause spurious reconstruction of
the captured motion. By using temporal coherence, physical constraints, and knowledge of the anatomy, a viable pose sequence can
be reconstructed for many live-action sequences.
INTRODUCTION
Raghu Machiraju
The Ohio State University
We describe herein a method to reconstruct arbitrary motion sequences that is model-based, operates on images, and exploits
knowledge about the anatomy. Consequently, our method is simple,
efficient, and requires limited manual intervention. Will our method
reconstruct all motion sequences successfully? The answer is no.
Occlusion of limbs by larger parts of the anatomy cannot always be
resolved through the use of image silhouettes. On the other hand,
we wish to explore the limits of efficient monocular motion reconstruction. Our results show that we can reconstruct increasingly
complex motion when we include a larger set of anatomical features and constraints and employ robust image comparison metrics.
We now provide an overview of our approach.
cost functions and incremental search strategies. Our focus on simpler cost functions will eventually allow for the realization of near
real-time reconstruction often needed for computer graphics applications while at the same time making few assumptions about the
motion being tracked.
The starting point of our method is a model of a human actor with
multiple quadrics assembled at various joints. After an initial pose
is established either automatically or with the aid of the user, a
frame-to-frame tracking procedure ensues in which the solution of
the last frame is the initial guess for the next frame. In addition to
silhouettes and edges, our objective function uses anatomical and
physical constraints to aid in disambiguating the view.
are given in Section 5. Section 6 includes results which demonstrate the effectiveness of our technique. Concluding remarks and
pointers to future work are described in Section 7.
PREVIOUS WORK
estimate the pose and shape such that the model's 2D projection closely
fits the captured person in the image. Bregler and Malik [2] propose
a framework to estimate 3D poses of each body segment in a kinematic chain using a twist representation of general rigid-body motion. However, their reconstruction assumes lateral symmetry in the
tracked motion. Pavlovic et al. [16] present a system to track fronto-parallel motion using a dynamic Bayesian network approach. Their
work focuses on the dynamics of human behavior as described by
bio-mechanical models.
Deutscher et al. [4] use a human model to build a framework of a
kinematic chain using limbs of conical sections for computational
simplicity and high-level interpretation of output. They use edges
and silhouettes in their cost function to estimate pose from multiple camera views. A condensation algorithm is employed to search
the high-dimensional space without restrictions. Carranza et al. [3]
use silhouettes from multi-view synchronized video footage to reconstruct the motion of a 3D human body model, and then re-render
the model's appearance and motion interactively from any novel
viewpoint. Kakadiaris et al. [10] use a spatio-temporal analysis to
track upper-body motions from multiple cameras.
Monocular markerless motion capture has been studied by a few researchers. Sminchisescu and Triggs [20] achieve successful motion
synthesis based on the propagation of a mixture of Gaussian density functions, each representing probable 3D configurations of the
human body over time. DiFranco et al. [5] propose a batch framework to reconstruct poses from a single viewpoint based on 2D
correspondences subject to a number of constraints. Their methods
are shown to be successful when deployed on moderately difficult
sequences that include athletic and dance movements.
Alternative approaches to model-based tracking of human bodies
have also been used. Bobick and Davis [1] construct temporal templates from a sequence of silhouette images and present a method
to match the temporal template against stored views of known
actions. Wren et al. [22] use 2D blobs for tracking motion in a video
image. Leventon and Freeman [14] take a statistical approach and
use a set of motion examples to build a probability model for the
purpose of recovering 3D joint angles for a new input sequence.
OUR HUMAN MODEL
Once we derive a background model, we can subtract the background from all the images in the sequence. Often, we do not have
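As a minimal sketch of this subtraction step, we can treat the background model as a per-pixel mean image and apply a fixed contrast threshold; both are simplifications of the statistical approach cited in [9], and the function and parameter names below are ours:

```python
import numpy as np

def extract_silhouette(frame, background_mean, threshold=25.0):
    """Label a pixel as foreground when it deviates from the
    background model by more than a fixed threshold.
    Operates on grayscale frames; the fixed threshold is an
    illustrative simplification of statistical subtraction."""
    diff = np.abs(frame.astype(float) - background_mean.astype(float))
    return diff > threshold  # boolean silhouette image
```

In practice the background mean would be estimated from frames known to contain no actor, and a per-pixel variance (as in [9]) would replace the single global threshold.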
Core-Weighted XOR
Our goal is to find the motion parameter set β that minimizes the
total penalty, a weighted sum of the core-weighted XOR term, the edge term, and the constraint terms described below.
THE OPTIMIZATION STRATEGY
The use of a single view makes the reconstruction problem ill-posed, as stated earlier. We now describe our optimization strategy to fit the motion parameters to a single stream of silhouette
images. First, we describe how we extract silhouettes from a video
sequence. Then, we present an objective function based on the difference of area between the model-generated silhouette and the input silhouette. Next, we discuss how we improve the performance of our
optimization algorithm. We achieve this by incorporating edge information along with both physical and anatomical constraints into
the objective function. Finally, we discuss a non-linear multidimensional optimization algorithm we employ to minimize the proposed
objective function.
β*(i) = argmin_β f( S_input^(i), S_model(β) )    (1)

for a suitable cost function f, where S_input^(i) and S_model(β^(i)) are the
i-th input silhouette image and the silhouette image generated by the model M(β^(i)),
respectively. For the sake of clarity, we instead denote S_input^(i) and
S_model(β^(i)) simply as S_input and S_model, respectively.
How does one design a viable and robust cost function f as described in Eq. (1)? The easiest way to measure the difference of
two binary images is to count the 'on' pixels after a pixel-wise
XOR operation is applied to the two images [13]. In this case,

f = Σ_x Σ_y ( S_input(x, y) ⊕ S_model(x, y) )    (2)

where the double summation iterates over all pixel locations. If our
goal requires that f = 0, that is, if the two silhouettes overlap exactly,
the optimal solution will be unique in terms of S_model. However, if
our objective function f cannot be reduced to zero given inherent
characteristics of the problem, it is likely that there are multiple
optimal solutions. Any preference among those multiple optimal
solutions should be incorporated into the cost function.
Since limbs are features of particular importance in any articulated
figure, we do not want to lose track of them. Limbs in a
human body are often well characterized by their skeleton, or medial
axis, derived from the silhouette image. Therefore, whenever ambiguity occurs it is better to choose the direction in parametric
shape space such that the model-generated silhouette covers the core
area of the input silhouette. The core area includes the silhouette
pixels close to the skeleton. This requirement can be incorporated
in the cost function by imposing a higher penalty if S_model(β^(i)) does
not overlap any region near the core area of the input silhouette
S_input.
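The core-weighted XOR penalty described above can be sketched as follows. The specific weighting (one plus a normalized distance-to-background, computed here with a crude iterative-erosion distance transform) is an illustrative choice of ours, not necessarily the paper's exact formula:

```python
import numpy as np

def core_weighted_xor(s_input, s_model):
    """Core-weighted XOR penalty between two binary silhouettes.
    Pixels near the medial axis (core) of the input silhouette
    carry a higher weight, so failing to cover the core costs
    more than a mismatch near the silhouette boundary."""
    s_in = np.asarray(s_input, dtype=bool)
    s_mo = np.asarray(s_model, dtype=bool)
    # Crude distance transform via iterative 4-neighbour erosion:
    # a pixel's value counts how many erosions it survives, so the
    # skeleton (core) of the figure receives the largest values.
    dist = np.zeros(s_in.shape, dtype=float)
    cur = s_in.copy()
    while cur.any():
        dist += cur
        p = np.pad(cur, 1)
        cur = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
               & p[1:-1, :-2] & p[1:-1, 2:])
    # Weight in [1, 2]: core mismatches cost up to twice as much.
    weight = 1.0 + dist / max(dist.max(), 1.0)
    return float(np.sum(weight * np.logical_xor(s_in, s_mo)))
```

A production version would precompute the weight image once per frame, since it depends only on the input silhouette and not on the model pose being evaluated.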
Arms and the V shape are treated the same as the edge term,
which means we try to achieve a maximal matching between the detected
features and those of the model.
where T is the image after edges have been extracted from the foreground image using a contrast threshold. This term is summed over
all pixel positions along with the core-weighted XOR function.
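As an illustration, such a contrast-thresholded edge image T might be computed as below; the finite-difference gradient and the threshold value are our assumptions, not the paper's stated extractor:

```python
import numpy as np

def edge_map(gray, threshold=30.0):
    """Binary edge image T: mark pixels whose local contrast
    (finite-difference gradient magnitude) exceeds a threshold.
    The threshold value is an illustrative choice."""
    gray = gray.astype(float)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:] = gray[:, 1:] - gray[:, :-1]  # horizontal contrast
    gy[1:, :] = gray[1:, :] - gray[:-1, :]  # vertical contrast
    return np.hypot(gx, gy) > threshold
```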
Moreover, the outlines of the arms and the V shape between the two
legs can be detected based on concave points along edge contours; the neck and the crotch are usually the concave points along
the edge contours, the arms begin from the neck, and the V shape
between the legs is formed by the feet and the crotch. However,
since we are only using a single camera, it is hard to distinguish
between the left and right limbs. Therefore, we do not require an
exact matching of the limbs from the model and the silhouettes. The
hand/foot constraint term only focuses on reducing the distance between the model's hands/feet and the nearest detected ones.
E_tip = w · sqrt( (i_m − i_d)^2 + (j_m − j_d)^2 )    (5)

where (i_m, j_m) is the image position of a model hand/foot and (i_d, j_d) that of the nearest detected one.
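A sketch of this hand/foot constraint term, assuming model and detected tips are given as (row, column) pixel coordinates; the function name and the uniform weight are illustrative:

```python
import math

def tip_penalty(model_tips, detected_tips, weight=1.0):
    """Sum, over model hands/feet, of the distance to the nearest
    detected tip. Left/right assignment is deliberately ignored,
    matching the 'nearest detected' formulation of the constraint."""
    total = 0.0
    for (im, jm) in model_tips:
        nearest = min(
            math.hypot(im - i, jm - j) for (i, j) in detected_tips
        )
        total += weight * nearest
    return total
```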
Figure 2: Three examples of hands/feet, arm, leg and head detection along silhouette outlines. The head is marked in red, body
tips in white, arms in green and legs in red.
IMPLEMENTATION
Figure 3: Keyframes are necessary to reconstruct this sequence containing ambiguities and self-occlusions.

Figure 10 shows an example where this keyframe technique is employed. We specify the left and
right poses as the keyframe poses.
4.4
tion parameters.
The simplex method can be easily adapted to our multi-dimensional
human body model. The initial simplex of n dimensions consists of
n + 1 vertices. Though we use this method for alignment and shape
fitting, we explain its workings by only describing reconstruction
of the motion. Let the solution of the previous frame, β^(i−1), be one of the initial points, p_0, of the
simplex. We can choose the remaining n points to be

p_i = p_0 + μ_i e_i,    i = 1, ..., n

where the e_i are n unit vectors and the μ_i are defined by the characteristic length scale of each component of β. The movement of this
n-dimensional simplex is confined to be within the motion space
close enough to the configuration of the current frame; there is
no need to perform exhaustive searches beyond certain ranges of
movement between two consecutive frames. To further target the
most relevant parameter space to search, parameter velocity is used
to bias the simplex location. This bias and the size of the simplex
are determined by limits on parametric acceleration that arise from
principles of physics and basic anatomy.
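The simplex construction above can be sketched as follows; the additive velocity bias is one plausible reading of the biasing step, not necessarily the paper's exact rule, and all names are ours:

```python
import numpy as np

def initial_simplex(beta_prev, scales, velocity=None):
    """Build the n+1 vertices of the starting simplex for a frame.
    beta_prev : solution vector from the previous frame (length n)
    scales    : characteristic length scale mu_i of each parameter
    velocity  : optional per-parameter velocity used to bias the
                simplex toward where the pose is heading
                (an illustrative choice)."""
    beta_prev = np.asarray(beta_prev, dtype=float)
    n = beta_prev.size
    p0 = beta_prev.copy()
    if velocity is not None:
        p0 += np.asarray(velocity, dtype=float)  # bias by motion
    simplex = np.tile(p0, (n + 1, 1))
    for i in range(n):
        simplex[i + 1, i] += scales[i]  # p_i = p_0 + mu_i * e_i
    return simplex
```

Such a simplex can be handed directly to an off-the-shelf Nelder-Mead implementation as its starting configuration.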
the cost function is minimized.
The middle and right images in Figure 4 depict the results of the two
steps described above. Note that our core-weighted XOR metric is
helpful to align the shape of the initial model at the center of the
silhouette image. If the initial model were not aligned at the center
of the silhouette, the subsequent shape-fitting process would
fail.
5.2 Motion Tracking
After the initialization described in the previous section, actual motion tracking is performed by searching for a motion configuration β*(i)
that minimizes the objective function, as described in Section 4, for each frame i. The optimal parameter vector β*(i) found
at frame i is used as the initial guess for the optimization routine of
frame i + 1.
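This frame-to-frame loop can be sketched as below, where `minimize_pose` is a hypothetical callback standing in for the simplex search over the objective of Section 4:

```python
def track_sequence(silhouettes, beta_init, minimize_pose):
    """Frame-to-frame tracking: the optimum found at frame i seeds
    the search at frame i+1. `minimize_pose(silhouette, beta0)` is
    a stand-in for the per-frame simplex optimization; it returns
    the pose minimizing the objective from the warm start beta0."""
    poses = []
    beta = beta_init
    for sil in silhouettes:
        beta = minimize_pose(sil, beta)  # warm start from last frame
        poses.append(beta)
    return poses
```

The warm start is what keeps the per-frame search local: under the inter-frame motion limits discussed above, the previous solution is already close to the new optimum.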
5.3 Motion Refinement
The final tracking result cannot be perfect because the shape of
the articulated model does not match the actor in the video exactly;
there are occlusions and ambiguities; and each articulated joint has more DOFs
than our model represents. However, our system generates reasonably good motion data and allows users to refine the generated
motion in a post-processing stage. The user can also add new
constraints that reduce the search space for particular parameters to
get better tracking results. In addition, given our motion data, the user
can modify the motion with any 3D character animation tool, such as
Poser® from Curious Labs.
6 RESULTS
We illustrate our method using several full-body tracking sequences. All the video clips are sampled at a rate of 30 Hz, and
some of them are from the CMU Graphics Lab Motion Capture Database [8]. Although we use only one camera and do not employ any markers, we still obtain desirable results. Figure 8, Figure 9
and Figure 10 are examples of tracking artists and athletes. The first,
dance, sequence (Figure 8) is relatively straightforward to track because there are no occlusions and the motion is mainly planar. Our
tracker manages to track the fronto-parallel motions. The second,
badminton, sequence (Figure 9) is more complex because of multiple occlusions occurring during the action. However, we can reconstruct the motion when we assume that both legs move with constant velocity. In addition, we capture the in-depth (out-of-plane)
motions of the player successfully. The previous sequences could
all be tracked without user intervention. The last sequence (Figure 10) contains many more challenges given the complex torso rotations and arm swings. After introducing keyframing, our tracker
estimates the uncertain positions of the arms and torso, and recovers the motions of the arms even though they are totally occluded
in the silhouettes. For this sequence, six keyframes were used.
More reconstruction results can be found at http://www.cse.ohiostate.edukhenyis/research/motion/index.htm.

The time expended for analysis is about 3 sec. per frame for simple sequences and about 5 sec. per frame for difficult ones, when
executed on a Pentium IV PC with a 2 GHz CPU and 512 MB of
memory. It should be noted that [3] reported times of 7 to 12 sec. per frame.
REFERENCES

[1] A. Bobick and J. Davis. The Recognition of Human Movement Using Temporal Templates. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3), pages 257-267, 2001.
CONCLUSION AND FUTURE WORK
We presented a simple and robust technique to reconstruct 3D motion parameters of a human model using image silhouettes. A novel
cost metric called core-weighted XOR was introduced and consistently used for the automatic alignment, shape fitting and motion
tracking. The computation time of the cost function is directly related to the overall performance of our method. Currently our implementation does not exploit any hardware acceleration. In the future, we intend to accelerate the weighted-XOR computation using
features of modern graphics hardware. Our work is closest to that
of Sminchisescu [20] and that of DiFranco [5]. We compare favorably in terms of generality to the results they report. By detecting
more human features like head, arms, hands, legs and feet, we can
further improve the correctness of registering our model with
all the images and recovering the 3D poses.
[2] C. Bregler and J. Malik. Tracking People with Twists and Exponential Maps. In Proceedings of IEEE CVPR, pages 8-15, 1998.
[3] J. Carranza, C. Theobalt, M. Magnor, and H.-P. Seidel. Free-Viewpoint Video of Human Actors. In Proceedings of SIGGRAPH 2003, volume 22, No. 3, pages 569-577, 2003.
[4] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In Proceedings of IEEE CVPR, volume 2, pages 126-133, 2000.
[5] D.E. DiFranco, T. Cham, and J.M. Rehg. Reconstruction of 3-D Figure Motion from 2-D Correspondences. In Proc. Conf. Computer Vision and Pattern Recognition, pages 307-314, 2001.
[6] M. Gleicher and N. Ferrier. Evaluating Video-Based Motion Capture. In Proceedings of Computer Animation, pages 75-80, 2002.
[7] I. Haritaoglu, D. Harwood, and L. Davis. Ghost: A Human Body Part Labeling System Using Silhouettes. In International Conference on Pattern Recognition, volume 1, pages 77-82, 1998.
[8] J.K. Hodgins. Carnegie Mellon University Graphics Lab Motion Capture Database, http://mocap.cs.cmu.edu/.
[9] T. Horprasert, D. Harwood, and L.S. Davis. A Statistical Approach for Real-Time Robust Background Subtraction and Shadow Detection. In IEEE Frame-Rate Workshop, 1999.
[10] I. Kakadiaris and D. Metaxas. Model-Based Estimation of 3D Human Motion. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 22, December 2000.
[11] J. Lee, J. Chai, P.S.A. Reitsma, J.K. Hodgins, and N.S. Pollard. Interactive Control of Avatars Animated with Human Motion Data. In Proceedings of SIGGRAPH 2002, Computer Graphics Proceedings, Annual Conference Series, pages 491-500. ACM Press / ACM SIGGRAPH, 2002.
[12] M.W. Lee, I. Cohen, and S.K. Jung. Particle Filter with Analytical Inference for Human Body Tracking. In IEEE Workshop on Motion and Video Computing, 2002.
[13] H.P.A. Lensch, W. Heidrich, and H.-P. Seidel. Automated Texture Registration and Stitching for Real World Models. In Proceedings of Pacific Graphics 2000, 2000.
ACKNOWLEDGEMENTS
The authors would like to thank the Advanced Computing Center for Art and Design at Ohia State University for their support,
specifically for the use of their motion capture lab and software environment. The authors would also like to thank the folks at CMU
for making available their motion captured database. The database
was created with funding from NSF EIA-0196217. This work was
supported, in part, by U.S. National Science Foundation Grant ITRUS-0428249 and by the Secure Knowledge Management Program,
Air Force Research Laboratory, Information Directorate, WrightPatterson AFB.
[15] T.B. Moeslund and E. Granum. A Survey of Computer Vision-Based Human Motion Capture. In Computer Vision and Image Understanding, 81(3), pages 231-268, 2001.
[16] V. Pavlovic, J.M. Rehg, T. Cham, and K.P. Murphy. A Dynamic Bayesian Network Approach to Figure Tracking Using Learned Dynamic Models. In Int'l Conf. on Computer Vision, pages 94-101, 1999.
[17] W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 1988.
[18] L. Ren, G. Shakhnarovich, J. Hodgins, H. Pfister, and P. Viola. Learning Silhouette Features for Control of Human Motion. In Proceedings of the SIGGRAPH 2004 Conference on Sketches & Applications, August 2004.
[19] C. Sminchisescu. Consistency and Coupling in Human Model Likelihoods. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 27-32, 2002.
[20] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In Proc. Conf. Computer Vision and Pattern Recognition, pages 69-76, 2003.
[21] C. Sminchisescu and B. Triggs. Estimating Articulated Human Motion with Covariance Scaled Sampling. International Journal of Robotics Research, 2003.
[22] C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland. Pfinder: Real-Time Tracking of the Human Body. In Transactions on Pattern Analysis and Machine Intelligence, 19(7), 1997.
Figure 8: Tracking a dancer: displaying representative frames. The top row shows the footage, the middle row the matching between
the silhouette (white area) and our model (colored area), and the bottom row the animation after rendering. The limb-stretching term is used
here.
Figure 9: Tracking a badminton player: displaying representative frames. The limb-stretching term for the legs is used here.
Figure 10: Tracking an actor: displaying representative frames. We employ the keyframe technique on the arms to reconstruct this sequence.