Fusion4D: Real-time Performance Capture of Challenging Scenes
Mingsong Dou   Sameh Khamis   Yury Degtyarev   Philip Davidson   Sean Ryan Fanello
Adarsh Kowdle   Sergio Orts Escolano   Christoph Rhemann   David Kim   Jonathan Taylor
Pushmeet Kohli   Vladimir Tankovich   Shahram Izadi
Microsoft Research
Figure 1: We present a new method for real-time high quality 4D (i.e. spatio-temporally coherent) performance capture, allowing for
incremental nonrigid reconstruction from noisy input from multiple RGBD cameras. Our system demonstrates unprecedented reconstructions
of challenging nonrigid sequences, at real-time rates, including robust handling of large frame-to-frame motions and topology changes.
Fusion4D combines the concept of volumetric fusion with estimation of a smooth deformation field across RGBD views. This enables both incremental reconstruction, improving the surface estimation over time, as well as parameterization of nonrigid scene motion. Our approach robustly handles large frame-to-frame motion by using a novel, fully parallelized, nonrigid registration framework, including a learning-based RGBD correspondence matching regime. It also robustly handles topology changes, by switching between reference models to better explain the data over time, and robustly blending between data and reference volumes based on correspondence estimation and alignment error. We compare to related work and show several clear improvements over real-time approaches that either track an online generated template or fuse depth data into a single reference model incrementally. Further, we show geometric reconstruction results on-par with offline methods which require orders of magnitude more processing time and many more RGBD cameras.

2 Related Work

Multi-view Performance Capture: Many compelling offline performance capture systems have been proposed. Some specifically model complex human motion and dynamic geometry, including people with general clothing, possibly along with pose parameters of an underlying kinematic skeleton (see [Theobalt et al. 2010] for a full review). Some methods employ variants of shape-from-silhouette [Waschbüsch et al. 2005] or active or passive stereo [Starck and Hilton 2007]. Template-based approaches deform a static shape model such that it matches a human [de Aguiar et al. 2008; Vlasic et al. 2008; Gall et al. 2009] or a person's clothing [Bradley et al. 2008]. Vlasic et al. [2009] use a sophisticated photometric stereo light stage with multiple high-speed cameras to capture the geometry of a human at high detail. Dou et al. [2013] capture precise surface deformations using an eight-Kinect rig, by deforming a human template, generated from a KinectFusion scan, using embedded deformation [Sumner et al. 2007]. Other methods jointly track a skeleton and the nonrigidly deforming surface [Vlasic et al. 2008; Gall et al. 2009].

Whilst compelling, these multi-camera approaches require considerable compute and are orders of magnitude slower than real-time, also requiring dense camera setups in controlled studios, with sophisticated lighting and/or chroma-keying for background subtraction. Perhaps the high-end nature of these systems is exemplified by [Collet et al. 2015], which uses over 30 RGBD cameras and a large studio setting with green screen and controlled lighting, producing extremely high quality results, but at approximately 30 seconds per frame. We compare to this system later, and demonstrate comparable results in real-time with a greatly reduced set of RGBD cameras.

Accommodating General Scenes: The approach of [Li et al. 2009] uses a coarse approximation of the scanned object as a shape prior to obtain high quality nonrigid reconstructions of general scenes. Others also treat the template as a generally deformable shape without a skeleton and use volumetric [de Aguiar et al. 2008] or patch-based deformation methods [Cagniart et al. 2010]. Other nonrigid techniques remove the need for a shape or template prior, but assume small and smooth motions [Zeng et al. 2013; Wand et al. 2009; Mitra et al. 2007]; or deal with topology changes in the input data (e.g., the fusing and then separation of hands) but suffer from drift and over-smoothing of results for longer sequences [Tevs et al. 2012; Bojsen-Hansen et al. 2012]. [Guo et al. 2015; Collet et al. 2015] introduce the notion of keyframe-like transitions in offline nonrigid reconstructions, to accommodate topology changes and tracking failures. [Dou et al. 2015] demonstrate a compelling offline system with nonrigid variants of loop closure and bundle adjustment to create compelling scans of arbitrary scenes without a prior human or template model. All these more general techniques are far from real-time, ranging from seconds to hours per frame.

Real-time Approaches: Only recently have we seen real-time nonrigid reconstruction systems appear. Approaches fall into three categories. Single object parametric approaches focus on a single object of interest, e.g. face, hand, or body, which is parametrized ahead of time in an offline manner, and tracked or deformed to fit the data in real-time. Compelling real-time reconstructions of nonrigid articulated motion (e.g. [Ye et al. 2013; Stoll et al. 2011; Zhang et al. 2014]) and shape (e.g. [Ye et al. 2013; Ye and Yang 2014]) have been demonstrated. However, by their very nature, these approaches rely on strong priors based on either pre-learned statistical models, articulated skeletons, or morphable shape models, prohibiting capture of arbitrary scenes or objects. Often the parametric model is not rich enough to capture challenging poses or all types of shape variation. For human bodies, even with extremely rich offline shape and pose models [Bogo et al. 2015], reconstructions can suffer from the uncanny valley effect [Mori et al. 2012]; and clothing or hair can prove problematic [Bogo et al. 2015].

Recently, real-time template-based reconstruction of more diverse nonrigidly moving objects was demonstrated [Zollhöfer et al. 2014]. Here an online template model was captured statically, and deformed in real-time to fit the data captured from a novel RGBD sensor. Additionally, displacements on this tracked surface model were computed from the input data and fused over time. Despite impressive real-time results, this work still requires a template to be first acquired rigidly, making it impractical for capturing children, animals or other objects that rarely hold still. Furthermore, the template model is fixed and so any scene topology change will break the fitting. Such approaches also rely heavily on closest point correspondences [Rusinkiewicz and Levoy 2001] and are not robust to large frame-to-frame motions. Finally, in both template-based and single object parametric approaches the model is fixed, and the aim is to deform or articulate the model to explain the data rather than incrementally reconstruct the scene. This means that new input data does not refine the reconstructed model over time.

DynamicFusion [Newcombe et al. 2015] addresses some of the challenges inherent in template-based reconstruction techniques by demonstrating compelling results of nonrigid volumetric fusion using a single Kinect sensor. The reference surface model is incrementally updated based on new depth measurements, refining and completing the model over time. This is achieved by warping a reference volume nonrigidly to each new input frame, and fusing depth samples into the model. However, as shown in the supplementary video of this work, the frame-to-frame motions are slow and carefully orchestrated, again due to reliance on closest point correspondences. Also, the reliance on a single volume registered to a single point in time means that the current data being captured cannot represent a scene dramatically different from the model. This makes fitting the model to the data and incorporating it back into the model more challenging. Gross inconsistencies between the reference volume and data can result in tracking failures. For example, if the reference model is built with a user's hands fused together, estimation of the deformation field will fail when the hands are seen to separate in the data. In practice, these types of topology changes occur often as people interact in the scene.

3 System Overview

Our work, Fusion4D, attempts to bring aspects inherent in multi-view performance capture systems to real-time scenarios. In so doing, we need to design a new pipeline that addresses the limitations outlined in current real-time nonrigid reconstruction systems. Namely, we need to be robust to fast motions and topology changes and support multi-view input, whilst still maintaining real-time rates.
ACM Trans. Graph., Vol. 35, No. 4, Article 114, Publication Date: July 2016
114:3 • Fusion4D: Real-time Performance Capture of Challenging Scenes
Figure 2: The Fusion4D pipeline. Please see text in Sec. 3 for details.
Fig. 2 shows the main system pipeline. We accumulate our 3D reconstruction in a hierarchical voxel grid and employ volumetric fusion [Curless and Levoy 1996] to denoise the surface over time (Sec. 6). Unlike existing real-time approaches, we use the concept of key volumes to deal with radically different surface topologies over time (Sec. 6). A key volume is a voxel grid that maintains the reference model; it ensures smooth nonrigid motions within the key volume sequence, but allows more drastic changes across key volumes. This is conceptually similar to the keyframe or anchor frame used in nonrigid tracking [Guo et al. 2015; Collet et al. 2015; Beeler et al. 2011], but extended here for online nonrigid volumetric reconstruction.

We take multiple RGBD frames as input and first estimate a segmentation mask per camera (Sec. 4). A dense correspondence field is estimated per separate RGBD frame using a novel learning-based technique (Sec. 5.2.4). This correspondence field is used to initialize the nonrigid alignment, and allows for robustness to fast motions – a failure case when closest point correspondences are assumed, as in [Zollhöfer et al. 2014; Newcombe et al. 2015].

Next is nonrigid alignment, where we estimate a deformation field to warp the current key volume to the data. We cover the details of this step in Sec. 5. In addition to fusing data into the key (or reference) volume as in [Newcombe et al. 2015], we also fuse the currently accumulated model into the data volume by warping and resampling the key volume. This allows Fusion4D to be more responsive to new data, whilst allowing more conservative model updates. Nonrigid alignment error and the estimated correspondence fields can be used to guide the fusion process, allowing new data to appear very quickly when occluded regions, topology changes, or tracking failures occur, but also allowing fusion into the model over time.

4 Raw Depth Acquisition and Preprocessing

In terms of acquisition our setup is similar to [Collet et al. 2015], but with a reduced number of cameras and no green screen. Our system, in its most general form, produces N depthmaps using 2N monocular infrared (IR) cameras and N RGB images used only to provide texture information. Whereas the setup in [Collet et al. 2015] consists of 106 cameras producing 24 depthmaps, our setup uses only 24 cameras, producing N = 8 depthmaps and RGB images. All of our cameras are in a trinocular configuration and have a 1 megapixel output resolution. Depth estimation is carried out using the PatchMatch Stereo algorithm [Bleyer et al. 2011], which runs in real-time on GPU hardware (see [Zollhöfer et al. 2014] and [Pradeep et al. 2013] for more details).

A segmentation step follows the depth computation algorithm, where 2D silhouettes of the regions of interest are produced. The segmentation mask plays a crucial role in estimating the visual hull constraint (see Sec. 5.2.3) that helps ameliorate issues with missing data in the input depth and ensures that foreground data is not deleted from the model. Our segmentation also avoids the need for a green screen setup as in [Collet et al. 2015] and allows capture in natural and realistic settings. In our pipeline we employed a simple background model (using both RGB and depth cues) that does not take into account temporal consistency. This background model is used to compute unary potentials by considering pixel-wise differences with the current scene observation. We then use a dense Conditional Random Field (CRF) [Krähenbühl and Koltun 2011] model to enforce smoothness constraints between neighboring pixels. Due to our real-time requirements, we use an approximate GPU implementation similar to [Vineet et al. 2012].

5 Nonrigid Motion Field Estimation

In each frame we observe N depthmaps {D_n}_{n=1}^N and N foreground masks {S_n}_{n=1}^N. As is common [Curless and Levoy 1996; Newcombe et al. 2011; Newcombe et al. 2015], we accumulate this depth data into a non-parametric surface represented implicitly by a truncated signed distance function (TSDF) or volume V in some "reference frame" (which we denote as the key volume). This allows efficient alignment and allows for all the data to be averaged into a complete surface with greatly reduced noise. Further, the zero crossings of the TSDF can be easily located to extract a high quality mesh¹ V = {v_m}_{m=1}^M ⊆ R³ with corresponding normals {n_m}_{m=1}^M. The goal of this section is to show how to estimate a deformation field that warps the key volume V or the mesh V to align with the raw depth maps {D_n}_{n=1}^N. We typically refer to V or V as the model, and {D_n}_{n=1}^N as the data.

¹ A triangulation is also extracted which we use for rendering.

5.1 Deformation Model

Following [Li et al. 2009] and [Dou et al. 2015] we choose the embedded deformation (ED) model of [Sumner et al. 2007] to parameterize the nonrigid deformation field. Before processing each new frame, we begin by uniformly sampling a set of K "ED nodes" within the reference volume by sampling locations {g_k}_{k=1}^K ⊆ R³ from the mesh V extracted from this volume. Every vertex v_m in that mesh is then "skinned" to its closest ED nodes S_m ⊆ {1, ..., K} using a set of fixed skinning weights {w_k^m : k ∈ S_m} ⊆ [0, 1] calculated as w_k^m = (1/Z) exp(−‖v_m − g_k‖² / (2σ²)), where Z is a normalization constant ensuring that, for each vertex, these weights
add to one. Here σ defines the effective radius of the ED nodes, which we set as σ = 0.5d, where d is the average distance between neighboring ED nodes after the uniform sampling.

We then represent the local deformation around each ED node g_k using an affine transformation A_k ∈ R^{3×3} and a translation t_k ∈ R³. In addition, a global rotation R ∈ SO(3) and translation T ∈ R³ are added. The set G = {R, T} ∪ {A_k, t_k}_{k=1}^K fully parameterizes the deformation that warps any point v_m ∈ R³ to

  T(v_m; G) = R [ Σ_{k∈S_m} w_k^m (A_k(v_m − g_k) + g_k + t_k) ] + T.   (1)

Equally, a normal n_m will be transformed to

  T^⊥(n_m; G) = R Σ_{k∈S_m} w_k^m A_k^{−⊤} n_m,   (2)

and normalization is applied afterwards.

5.2 Energy Function

To estimate the parameters G, we formulate an energy function E(G) that penalizes the misalignment between our model and the observed data, regularizes the types of allowed deformations, and encodes other priors and constraints. The energy function

  E(G) = λ_data E_data(G) + λ_hull E_hull(G) + λ_corr E_corr(G) + λ_rot E_rot(G) + λ_smooth E_smooth(G)   (3)

consists of a variety of terms that we systematically define below.

5.2.1 Data Term

The most crucial portion of our energy formulation is a data term that penalizes misalignments between the deformed model and the data. In its most natural form, this term would be written as

  Ê_data(G) = Σ_{n=1}^N Σ_{m=1}^M min_{x ∈ P(D_n)} ‖T(v_m; G) − x‖²   (4)

where P(D_n) ⊆ R³ extracts a point cloud from depth map D_n. We, however, approximate this using a projective point-to-plane term as

  E_data(G) = Σ_{n=1}^N Σ_{m ∈ V_n(G)} [ ñ_m(G)^⊤ (ṽ_m(G) − Γ_n(ṽ_m(G))) ]²   (5)

where ñ_m(G) = T^⊥(n_m; G) and ṽ_m(G) = T(v_m; G) (with slight notational abuse we simply use ṽ_m and ñ_m to represent the warped points and normals); Γ_n(v) = P_n(Π_n(v)), with Π_n : R³ → R² projecting a point into the n'th depth map and P_n : R² → R³ back-projecting the corresponding pixel in D_n into 3D; and V_n(G) ⊆ {1, ..., M} are the vertex indices that are considered to be "visible" in view n when the model is deformed using G. In particular, we consider a vertex to be visible if

  Π_n(ṽ_m) is a valid and visible pixel in view n, and
  ‖ṽ_m − P_n(Π_n(ṽ_m))‖ ≤ ε_d, and
  ñ_m^⊤ P_n^⊥(Π_n(ṽ_m)) < ε_n

where P_n^⊥ : R² → R³ maps pixels to normal vectors estimated from D_n; ε_d and ε_n are the truncation thresholds for depth and normal respectively.

Although (5) is an approximation to (4), it offers a variety of key benefits. First, the use of a point-to-plane term is a well known strategy to speed up convergence [Chen and Medioni 1992]. Second, the use of a "projective correspondence" avoids the expensive minimization in (4). Lastly, the visibility set V_n(G) is explicitly computed to be robust to outliers, which avoids employing a robust data term here that often slows Gauss-Newton like methods [Zach 2014]. Interestingly, the last two points interfere with the differentiability of (5), as P_n(Π_n(ṽ)) jumps as the projection crosses pixel boundaries and V_n(G) undergoes discrete modifications as G changes. Nonetheless, we use a further approximation (see Sec. 5.3) at each Gauss-Newton iteration whose derivative both exists everywhere and is more efficient to compute.

5.2.2 Regularization Terms

As the deformation model above could easily represent unreasonable deformations, we follow [Dou et al. 2015] by deploying two regularization terms to restrict the class of allowed deformations. The first term

  E_rot(G) = Σ_{k=1}^K ‖A_k^⊤ A_k − I‖_F + Σ_{k=1}^K (det(A_k) − 1)²   (6)

encourages each local deformation to be close to a rigid transform. The second encourages neighboring affine transformations to be similar:

  E_smooth(G) = Σ_{k=1}^K Σ_{j∈N_k} w_{jk} ρ(‖A_j(g_k − g_j) + g_j + t_j − (g_k + t_k)‖²)   (7)

where w_{jk} = exp(−‖g_k − g_j‖² / (2σ²)) is a smoothness weight that is inversely proportional to the distance between two neighboring ED nodes, and σ is set to be the average distance between all pairs of neighboring ED nodes. Here N_k denotes the set of ED nodes neighboring node k, and ρ(·) is a robustifier to allow for discontinuities in the deformation field.

5.2.3 Visual Hull Term

The data term above only constrains the deformation when the warped model is close to the data. To see why this is problematic, let us assume momentarily the best case scenario where we happen to have a perfect model that should be able to fully "explain" the data. If a piece of the model is currently being deformed to a location outside the truncation threshold of the depth maps, the gradient will be zero. Another, more fundamental, issue is that a piece of the model that is currently unobserved (e.g. a hand hidden behind a user's back) is allowed to enter free-space. This occurs despite the fact that we know that free-space should not be occupied, as we have observed it to be free. Up until now, other methods [Newcombe et al. 2015] have generally ignored this constraint, or equivalently their model has only been forced to explain the foreground data while ignoring "negative" background data.

To address this, we formulate an additional energy term that encodes the constraint that the deformed model lies within the visual hull. The visual hull is a concept used in shape-from-silhouette and space-carving reconstruction techniques [Kutulakos and Seitz 2000]. Typically in 2D it is defined as the intersection of the cones cut out by the back-projection of an object's silhouette into free-space.

The near-camera side of the back-projected cone that each silhouette generates is cut using the observed depth data (see Fig. 3) before being intersected. In the single viewpoint scenario, where there is much occlusion, the constraint helps push portions of the deformed
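Returning to the deformation model of Sec. 5.1, the skinning weights and the warps of Eqs. (1) and (2) can be sketched in NumPy as follows. This is an illustrative CPU sketch under our own assumptions: the function names and the fixed k-nearest skinning size are ours, and the actual system evaluates these warps on the GPU.

```python
import numpy as np

def skinning(verts, nodes, sigma, knn=4):
    """Skin each vertex to its `knn` closest ED nodes with normalized
    Gaussian weights w_k^m = exp(-||v_m - g_k||^2 / (2 sigma^2)) / Z."""
    d2 = ((verts[:, None, :] - nodes[None, :, :]) ** 2).sum(-1)  # (M, K) squared dists
    S = np.argsort(d2, axis=1)[:, :knn]                          # index sets S_m
    w = np.exp(-np.take_along_axis(d2, S, axis=1) / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)                            # weights add to one
    return S, w

def warp_point(v, S, w, A, t, g, R, T):
    """Eq. (1): blend the per-node affine transforms over the skinned
    nodes, then apply the global rotation R and translation T."""
    local = sum(wk * (A[k] @ (v - g[k]) + g[k] + t[k]) for k, wk in zip(S, w))
    return R @ local + T

def warp_normal(n, S, w, A, R):
    """Eq. (2): transform a normal by the blended inverse-transpose of the
    affine parts, then renormalize."""
    local = sum(wk * (np.linalg.inv(A[k]).T @ n) for k, wk in zip(S, w))
    out = R @ local
    return out / np.linalg.norm(out)
```

With identity affine transforms, Eq. (1) reduces to a rigid transform of the vertex, which makes a handy sanity check for an implementation.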
The exact computation of H is computationally expensive and unsuitable for a real-time setting. Instead, we approximate H by applying a Gaussian blur to the occupancy volume², which is implemented efficiently on the GPU.

² Followed with postprocessing, i.e., applying 1.0 − H and scaling.

5.2.4 Correspondence Term

Finding the 3D motion field of nonrigid surfaces is an extremely challenging task. Approaches relying on non-convex optimization can easily end up in erroneous local optima due to bad starting points caused by noisy and inconsistent input data, e.g. due to large motions. A key role is played by the initial alignment of the current input data D_n and the model. Our aim is therefore to find point-wise correspondences to provide a robust initialization for the solver. Finding reliable matches between images has been exhaustively studied; recently, deep learning techniques have shown superior performance [Weinzaepfel et al. 2013; Revaud et al. 2015; Wei et al. 2015]. However these are computationally expensive, and currently prohibitive for real-time scenarios (even with GPU implementations).

In this paper we extend the recently proposed Global Patch Collider (GPC) [Wang et al. 2016] framework to efficiently generate accurate correspondences.

During training, we select the split functions to maximize the weighted harmonic mean between precision and recall of the patch correspondences. Ground truth correspondences for training the split function parameters of the decision trees are obtained via the offline but accurate nonrigid bundle adjustment method proposed by [Dou et al. 2015]. We tested different configurations of the algorithm and empirically found that 5 trees with 15 levels give the best trade-off between precision and recall. At test time, when simple pixel differences are used as features, the intersection strategy proposed in [Wang et al. 2016] is not robust due to perspective transformations of the RGB images. A single tree does not have the ability to handle all possible image patch transformations. Intersection across multiple trees (as proposed in [Wang et al. 2016]) also fails to retrieve the correct match in the case of RGBD data: only few correspondences, usually belonging to small motion regions, are estimated.

We address this by taking the union over all the trees, thus modeling all image transformations. However, a simple union strategy generates many false positives. We solve this problem by proposing a voting scheme. Each tree with a unique collision (i.e. a leaf with only two candidates) votes for a possible match, and the one with the highest number of votes is returned. This approach generates much more dense and reliable correspondences even when large motion is present. We evaluate this method in Sec. 7.3.
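The union-over-trees voting scheme can be sketched as a toy CPU illustration. Here each "tree" is simply a function hashing a patch to a leaf code, standing in for the learned split functions; all names are ours, not the paper's API.

```python
from collections import Counter

def match_patches(trees, patches_a, patches_b, min_votes=2):
    """Union-over-trees voting: a leaf reached by exactly one patch from
    each frame (a 'unique collision') casts a vote for that pair; for each
    source patch the pair with the most votes (>= min_votes) wins."""
    votes = Counter()
    for tree in trees:
        leaves_a, leaves_b = {}, {}
        for i, p in enumerate(patches_a):
            leaves_a.setdefault(tree(p), []).append(i)
        for j, p in enumerate(patches_b):
            leaves_b.setdefault(tree(p), []).append(j)
        for code, ia in leaves_a.items():
            jb = leaves_b.get(code, [])
            if len(ia) == 1 and len(jb) == 1:  # unique collision in this tree
                votes[(ia[0], jb[0])] += 1
    best = {}
    for (i, j), v in votes.items():
        if v >= min_votes and v > best.get(i, (None, 0))[1]:
            best[i] = (j, v)
    return {i: j for i, (j, _) in best.items()}
```

Taking the union of collisions across trees recovers matches that any single tree would miss, while the vote threshold suppresses the false positives a plain union would admit.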
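The training criterion for the split functions, a weighted harmonic mean of precision and recall, is the standard F_β score. A generic sketch follows; the particular β used by the authors is not stated in this excerpt.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall: beta > 1 weights
    recall more heavily, beta < 1 weights precision more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```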
This method gives us, in the n'th view, a set of F_n matches {(u_nf^prev, u_nf)}_{f=1}^{F_n} between pixels in the current frame and the previous frame. For each match (u_nf^prev, u_nf) we can find a corresponding point q_nf ∈ R³ in the reference frame using

  q_nf = argmin_{v ∈ V} ‖Π_n(T(v; G^prev)) − u_nf^prev‖   (10)

where G^prev are the parameters that deform the reference surface V to the previous frame. We would then like to encourage these model points to deform to their 3D correspondences. To this end, we employ the energy term

  E_corr(G) = Σ_{n=1}^N Σ_{f=1}^{F_n} ρ(‖T(q_nf; G) − P_n(u_nf)‖²)   (11)

where ρ(·) is a robustifier to handle correspondence outliers.

5.3 Optimization

In this section, we show how to rapidly and robustly minimize E(G) on the GPU to obtain an alignment between the model and the current frame. To this end, we let X ∈ R^D represent the concatenation of all the parameters and let each entry of f(X) ∈ R^C contain each of the C unsquared terms (i.e. the residuals) from the energy above, so that E(G) = f(X)^⊤ f(X). In this form, the problem of minimizing E(G) can be seen as a standard sparse nonlinear least squares problem which can be solved by approaches based on the Gauss-Newton algorithm. We handle the robust terms using the square-rooting technique described in [Engels et al. 2006; Zach 2014].

For each frame we initialize all the parameters from the motion field of the previous frame. We then fix the ED node parameters {A_k, t_k}_{k=1}^K and estimate the global rigid motion parameters {R, T} using projective iterative closest point (ICP) [Rusinkiewicz and Levoy 2001]. Next we fix the global rigid motion parameters and estimate the ED node parameters. The details of the optimization are presented in the following sections.

5.3.1 Computing a Step Direction

We compute a step direction h ∈ R^D in the style of the Levenberg-Marquardt (LM) solver on the GPU. At any point X in the search space we solve

  (J^⊤J + μI) h = −J^⊤f   (12)

where μ is a damping factor, J ∈ R^{C×D} is the Jacobian of f(X), and f is simply an abbreviation for f(X), to obtain a step direction h. If the update will lower the energy (i.e. E(X + h) < E(X)) the step is accepted (i.e. X ← X + h) and the damping factor is lowered to be more aggressive. When the step is rejected, as it would raise the energy, the damping factor is raised and (12) is solved again. This behaviour can be interpreted as interpolating between an aggressive Gauss-Newton minimization and a robust gradient descent search, as lowering the damping factor implicitly downscales the update as a back-tracking line search would.

In addition to being differentiable, the independence of ñ_m greatly simplifies the necessary derivative calculations, as the derivative with respect to any parameter in G is the same for any view.

Evaluation of J^⊤J and J^⊤f. In order to make this algorithm tractable for the large number of parameters we must handle, we bypass the traditional approach of evaluating and storing J so that it can be reused in the computation of J^⊤J and J^⊤f. Instead we directly evaluate both J^⊤J and J^⊤f given the current parameters X. In our scenario, this approach results in a dramatically cheaper memory footprint while simultaneously minimizing global memory reads and writes. This is because the number of residuals in our problem is orders of magnitude larger than the number of parameters (i.e. C ≫ D) and therefore the size of the Jacobian J ∈ R^{C×D} dwarfs that of J^⊤J ∈ R^{D×D}.

Further, J^⊤J itself is a sparse matrix composed of non-zero blocks {h_ij ∈ R^{12×12} : i, j ∈ {1, ..., K} ∧ i ∼ j} created by ordering parameter blocks from the K ED nodes, where i ∼ j denotes that the i'th and j'th ED nodes simultaneously contribute to at least one residual. The (i, j)'th block can be computed as

  h_ij = Σ_{c ∈ I_ij} j_ci^⊤ j_cj   (14)

where I_ij is the collection of residuals dependent on both parameter blocks i and j, and j_ci is the gradient of the c'th residual, f_c, w.r.t. the i'th parameter block. Note that each I_ij will not change during a step calculation (due to our approximation), so we only need to calculate each index set once. Further, the cheap derivatives of the approximation in (13) ensure that the complexity of computing J^⊤J, although linearly proportional to the number of surface vertices, is independent of the number of cameras.

To avoid atomic operations on the GPU global memory, we let each CUDA block handle one J^⊤J block and perform the reduction on the GPU shared memory. Similarly, J^⊤f ∈ R^{D×1} can be divided into K segments {(J^⊤f)_i ∈ R^{12×1}}_{i=1}^K, with the i'th segment calculated as

  (J^⊤f)_i = Σ_{c ∈ I_i} j_ci^⊤ f_c   (15)

where I_i contains all the constraints related to ED node i. We assign one GPU block per (J^⊤f)_i and again perform the reduction on shared memory.

Linear Equations Solver. Solving the cost function in Eq. (3) amounts to a series of linear solves of the normal equations (Eq. (12)). DynamicFusion [Newcombe et al. 2015] uses a direct sparse Cholesky decomposition. Given their approximation of the data term component of J^⊤J as a block diagonal matrix, this still results in a real-time system. However, we do not wish to compromise the fidelity of the reconstruction by approximating J^⊤J if we can still optimize the cost function in real-time, so we chose to iteratively solve using preconditioned conjugate gradient (PCG). The diagonal blocks of J^⊤J are used as the preconditioner.
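For reference, one damped step of the solver in Sec. 5.3.1 can be written densely in NumPy. This is illustrative only: the actual system never materializes J, accumulates the J^⊤J blocks of Eq. (14) in CUDA shared memory, and solves Eq. (12) iteratively with block-Jacobi preconditioned CG rather than a dense direct solve.

```python
import numpy as np

def lm_step(residual_fn, jacobian_fn, x, mu, mu_down=0.5, mu_up=4.0):
    """One Levenberg-Marquardt update: solve (J^T J + mu I) h = -J^T f
    and accept the step only if it lowers the energy f^T f."""
    f = residual_fn(x)
    J = jacobian_fn(x)
    h = np.linalg.solve(J.T @ J + mu * np.eye(len(x)), -(J.T @ f))  # Eq. (12)
    f_new = residual_fn(x + h)
    if f_new @ f_new < f @ f:        # energy decreased: accept the step...
        return x + h, mu * mu_down   # ...and lower damping (more Gauss-Newton)
    return x, mu * mu_up             # rejected: raise damping (more gradient-like)
```

Lowering μ on accepted steps pushes the update toward a full Gauss-Newton step; raising it on rejected steps shrinks the step toward scaled gradient descent, matching the interpolation behaviour described above.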
Figure 5: Left: reference surface. Middle and Right: surfaces from the warped volume without and with voxel collision detection.

⟨d_r, w_r⟩ at neighboring voxels within some distance τ on the regular lattice of the data volume. Every data voxel x_d would then calculate the weighted average of the accumulated data ⟨d̄_r, w_r⟩, both SDF value and SDF weight, using the weight exp(−‖x̃_r − x_d‖² / (2σ²)). Note, this blending (or averaging) is bound to cause some geometric blur. To ameliorate this effect, each reference voxel x_r does not directly vote for the SDF value it is carrying (i.e., d_r) but for the corrected value d̄_r using the gradient field of the SDF, i.e.,

  d̄_r = d_r + (x̃_r − x_d)^⊤ Δ̃,

where Δ is the gradient at x_r in the reference, and Δ̃ is the warped gradient using Eq. (2), which approximates the gradient field at the data volume. In other words, d̄_r is the prediction of the SDF value at x_d given the SDF value and gradient at x̃_r.

6.1.2 Selective Fusion

To ensure a high-fidelity reconstruction at the data frame, we need to ensure that each warped reference voxel x̃_r will not corrupt the reconstructed result. To this end we perform two tests before fusing in a warped voxel, and reject its vote if it fails either.

Voxel Collision. When two model parts move towards each other (e.g., clapping hands), the reference voxels contributing to different surface areas might collide after warping, and averaging the SDF values voted for by these voxels is problematic: in the worst case, the voxels with a higher absolute SDF value will overwhelm the voxels at the zero crossing, leading to a hole in the model (Fig. 5). To deal with this voxel collision problem, we perform the fusion in two passes. In the first pass, and for any given data voxel x_d, we evaluate all the reference voxels voting at its location and save the reference voxel ẋ_r with the smallest absolute SDF value. In the second pass we reject the vote of any reference voxel x_r at this location if |x_r − ẋ_r| > η.

Voxel Misalignment. We also need to evaluate a proxy error at each reference voxel x_r to detect if the nonrigid tracking failed, so we

where D_d is the fused TSDF at the data frame, and H_d is the visual hull distance transform (Sec. 5.2.3). We then aggregate this error at the ED nodes by averaging the errors from the vertices associated with the same ED node. This aggregation process reduces the influence of the noise in the depth data on the alignment error. Finally, we reject any reference voxel if any of its neighboring ED nodes has an alignment error beyond a certain threshold. The extracted surface from Ṽ_r is illustrated in Fig. 6.

Figure 6: Volume blending. Left to right: reference surface; nonrigid alignment residual showing topology change; extracted surface at the warped reference; extracted surface from the final blended volume.

After we fuse the depth maps into a data volume V_d and warp the reference volume to the data frame forming Ṽ_r, the next step is to blend the two volumes V_d and Ṽ_r to get the final fused volume V̄_d, used for the reconstructed output.³

Even after the conservative selective fusion described in the previous section, simply taking a weighted average of the two volumes (i.e., d̄_d = (d̃_r w̃_r + d_d w_d) / (w̃_r + w_d)) leads to artifacts. This naive blending does not guarantee that the SDF band around the zero-crossing will have a smooth transition of values. This is because boundary voxels that survived the rejection phase will suppress any zero-crossings coming from the data, causing artifacts and lowering the quality at the output.

To handle this problem, we start by projecting the reference surface vertices V_r to the depth maps. We can then calculate a per-pixel depth alignment error as the difference between the vertex depth d and its projective depth d_proj, normalized by a maximum d_max. Put together, we calculate

  e_pixel = min(1.0, |d − d_proj| / d_max) if d_proj is valid, and e_pixel = 1.0 otherwise.   (17)

Each voxel in the data volume V_d can then have an aggregated average depth alignment error e_voxel when projecting it to the depth maps. Finally, instead of using the naive blending described above, we use the blending function

  d̄_d = (d̃_r w̃_r (1.0 − e_voxel) + d_d w_d) / (w̃_r (1.0 − e_voxel) + w_d)   (18)

downweighting the reference voxel data by its depth misalignment.

6.2 Fusion at the Reference Frame

As in [Newcombe et al. 2015], to update the reference model we warp each reference voxel x_r to the data frame, project it to the depth maps, and update the TSDF value and weight. This avoids an explicit data-to-model warp. Additionally, we also know the reference voxels x̃_r not aligned well to the data from Eq. (16). For these voxels we discard their data and refresh it from the data in the current data frame. Finally, we periodically reset the entire volume to the fused data volume V̄_d (i.e., key volumes) to handle large misalignments that cannot be recovered from by the per-voxel refresh.

³ Marching cubes is applied to this volume to extract the final mesh representation.
ACM Trans. Graph., Vol. 35, No. 4, Article 114, Publication Date: July 2016
114:9 • Fusion4D: Real-time Performance Capture of Challenging Scenes
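The selective-fusion logic above (two-pass collision rejection followed by the misalignment-weighted blending of Eqs. (17)-(18)) can be sketched for a single data voxel as follows. This is a toy NumPy sketch under our own simplifications: uniform accumulation weights stand in for the paper's Gaussian scatter weights, and the names `fuse_data_voxel` and `eta` are illustrative, not from the released system.

```python
import numpy as np

def fuse_data_voxel(votes, d_d, w_d, e_voxel, eta=2.0):
    """Toy fusion of warped reference votes into one data voxel.

    votes:    list of (x_r, d_r) pairs: original reference-voxel position
              and its (gradient-corrected) SDF vote at this data voxel.
    d_d, w_d: SDF value and weight observed in the data frame.
    e_voxel:  aggregated depth-alignment error in [0, 1] (Eq. 17, averaged).
    """
    if not votes:
        return d_d
    # Pass 1: find the voting reference voxel with the smallest |SDF|.
    x_dot, _ = min(votes, key=lambda v: abs(v[1]))
    # Pass 2: reject colliding voters that originate far from that voxel.
    kept = [(x, d) for x, d in votes if np.linalg.norm(x - x_dot) <= eta]
    d_ref = float(np.mean([d for _, d in kept]))  # blended reference SDF
    w_ref = float(len(kept))                      # uniform toy weights
    # Eq. (18): downweight the reference vote by its depth misalignment.
    w_eff = w_ref * (1.0 - e_voxel)
    return (d_ref * w_eff + d_d * w_d) / (w_eff + w_d)
```

With `e_voxel = 1.0` (a fully misaligned reference) the reference vote is suppressed entirely and the data-frame SDF wins outright, which is the smooth fallback the error-weighted blend is designed to give.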
We now provide results, experiments and comparisons of our real-time performance capture method.

7.1 Live Performance Capture

Our system is fully implemented on the GPU using CUDA. Results of live multi-view scene captures for our test scenes are shown in Figures 1 and 7 as well as in the supplementary material. It is important to stress that all these sequences were captured online and in real-time, including depth estimation and full nonrigid reconstruction. Furthermore, these sequences are captured over long time periods comprising many minutes. We make a strong case for nonrigid alignment in Fig. 8. While volumetrically fusing the live data does produce a more aesthetically appealing result compared to the input point cloud, it cannot resolve issues arising from missing data (holes) or noise. These issues, on the other hand, are significantly ameliorated in the mesh reconstructed by Fusion4D, which leverages temporal information.

We captured a variety of diverse and challenging nonrigidly moving scenes, including multiple people interacting, deforming objects, topology changes and fast motions. Fig. 7 shows multiple examples for each of these scenes. Our reconstruction algorithm is able to deal with extremely fast motion, where most online nonrigid systems would fail. Fig. 10 depicts typical situations where the small-motion assumption does not hold. This robustness is due to the ability to estimate fast RGBD correspondences, allowing for robust initialization of the ED graph, and the ability to recover from misalignment errors. In Fig. 9 we show a number of challenging topology changes that our system copes with in a robust manner. This includes hands initially reconstructed on the hips of the performer and then moved, and items of clothing, such as a jacket or scarf, being removed.

Other examples of reconstructions in Fig. 7 and the supplementary video depict clothing changes, taekwondo moves, dancing, animals, moving hair and interaction with objects. In all of these situations the algorithm automatically retrieves the nonrigid reconstruction with real-time performance. Notice also that the method has no shape prior of the object of interest and easily generalizes to non-human subjects, for example animals or objects.

7.2 Computational Time

Similar to [Collet et al. 2015], the input RGBD and segmentation data is generated on dedicated PCs. Each machine has an Intel Core i7 3.4GHz CPU and 16GB of RAM, and uses two NVIDIA Titan X GPUs. Each PC processes two depth maps and two segmentation masks in parallel. The total time is 21ms for stereo matching and 4ms for segmentation. Correspondence estimation requires 5ms with a parallel GPU implementation. In total each machine uses no more than 30ms to generate the input for the nonrigid reconstruction pipeline. RGBD frames are generated in parallel to the nonrigid pipeline, but do introduce one frame of latency.

A master PC (another Intel Core i7 3.4GHz CPU with 16GB of RAM and a single NVIDIA Titan X) aggregates and synchronizes all the depth maps, segmentation masks and correspondences. Once the RGBD inputs are available, the average processing time to nonrigidly reconstruct a frame is 32ms (i.e., 31fps): 3ms for preprocessing (10% of the overall pipeline), 2ms (7%) for rigid pose estimation (4 iterations on average), 20ms (64%) for the nonrigid registration (5 LM iterations, with 10 PCG iterations each), and 6ms (19%) for fusion.

7.3 RGBD Correspondence Evaluation

In Sec. 5.2.4 we described our approach to estimating RGBD correspondences. We now evaluate its robustness compared to other state-of-the-art methods on one sequence with very fast motions. In order to compare different correspondence algorithms, we minimize only the Ecorr(G) term in Eq. 3 and compute the residual error. We report results as the percentage of alignment error between the current observation and the model; in particular, we show the percentage of vertices with error > 5mm. We compared several methods: the standard SIFT detector and descriptors [Lowe 2004], a FAST detector [Rosten and Drummond 2005] followed by SIFT descriptors, DeepMatch [Weinzaepfel et al. 2013], EpicFlow [Revaud et al. 2015], and our extension of the Global Patch Collider [Wang et al. 2016] described in Sec. 5.2.4. Quantitative results on this fast-motion sequence are reported in Fig. 12. The best results are obtained by our method with 29% outliers, followed by SIFT (34%), FAST+SIFT (34%), DeepMatch (36%) and EpicFlow (36%). Most of the error occurred in regions with very large motion; a qualitative comparison is depicted in Fig. 13.

7.4 Nonrigid Reconstruction Comparisons

In Fig. 15, we compare to the dataset of [Collet et al. 2015] for a sequence with extremely high motions. The figure compares renderings of the original meshes and multiple reconstructions, where red corresponds to a fitting error of 15mm. In particular, we compare our method with [Zollhöfer et al. 2014] and [Newcombe et al. 2015], showing our superior reconstructions in these challenging situations. We also show results and distance metrics for the method of [Collet et al. 2015], an offline technique with a runtime of about 30 minutes per frame on the CPU that uses 30 more cameras than our system. In a more quantitative analysis (Fig. 11) we plot the error over the input mesh for our method and [Collet et al. 2015], which shows that our algorithm can match the motion and fine-scale details exhibited in this sequence. Our approach shows qualitatively similar results with a system that is about 4 orders of magnitude faster, allowing for true real-time performance capture.

Finally, multiple qualitative comparisons among different state-of-the-art methods are shown in Fig. 14. These sequences exhibit all the classical situations where online methods fail, such as large motions and topology changes. Again, our real-time reconstruction method correctly retrieves the nonrigid shapes in all of these scenarios. Please also see the accompanying video figure.

8 Limitations

Even though we demonstrated one of the first methods for real-time nonrigid reconstruction from multiple views, showing reconstructions of challenging scenes, our system is not without limitations. Given the tight real-time constraint (33ms/frame) of our approach, we rely on temporal coherence of the RGBD input stream, making processing at 30Hz a necessity. If the frame rate is too low or the frame-to-frame motion is too large, either the frame-to-frame correspondences would be inaccurately estimated or the nonrigid alignment would fail to converge within the tight time budget. In either case our method might lose tracking. In both scenarios our system falls back to the live fused data; however, as shown in Fig. 16, the volume blending can look noisy as new data is first being fused. Another issue in our current work is robustness to segmentation errors. Large segmentation errors, for instance when depth data is missing, can lead to incorrect visual hull estimation. This can cause some noise to be integrated into the model, as shown in Fig. 16. Finally, any small nonrigid alignment errors can cause slight oversmoothing of the model at times, e.g., Fig. 16. We deal
Figure 7: Real-time results captured by Fusion4D, showing a variety of challenging sequences. Please also see the accompanying video.
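The correspondence evaluation above scores each method by the percentage of vertices whose residual alignment error exceeds 5 mm. A minimal NumPy sketch of that statistic follows; the brute-force nearest-neighbor residual and the function name are our own simplifications of the paper's protocol.

```python
import numpy as np

def alignment_outlier_pct(warped_vertices, observed_points, threshold_mm=5.0):
    """Percentage of model vertices whose alignment residual exceeds the
    threshold. The residual of each warped vertex is its distance to the
    closest observed point (brute force; a GPU nearest-neighbor search
    would be used in a real pipeline)."""
    diff = warped_vertices[:, None, :] - observed_points[None, :, :]
    residuals = np.linalg.norm(diff, axis=-1).min(axis=1)
    return 100.0 * np.mean(residuals > threshold_mm)
```

For example, two warped vertices of which one lies 10 mm from every observed point give a 50% outlier rate under the 5 mm threshold.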
Figure 8: A comparison of the input data as a point cloud (left), the fused live data without nonrigid alignment (center), and the output of our system (right).

Figure 11: Quantitative comparison with [Collet et al. 2015].
References
sensor. In CVPR.

Engels, C., Stewénius, H., and Nistér, D. 2006. Bundle adjustment rules. Photogrammetric Computer Vision 2, 124–131.

Gall, J., Stoll, C., de Aguiar, E., Theobalt, C., Rosenhahn, B., and Seidel, H.-P. 2009. Motion capture using joint skeleton tracking and surface estimation. In Proc. CVPR, IEEE, 1746–1753.

Guo, K., Xu, F., Wang, Y., Liu, Y., and Dai, Q. 2015. Robust non-rigid motion tracking and surface reconstruction using L0 regularization. In ICCV, 3083–3091.

Krähenbühl, P., and Koltun, V. 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS.

Kutulakos, K. N., and Seitz, S. M. 2000. A theory of shape by space carving. IJCV.

Li, H., Adams, B., Guibas, L. J., and Pauly, M. 2009. Robust single-view geometry and motion reconstruction. ACM TOG.

Lowe, D. G. 2004. Distinctive image features from scale-invariant keypoints. IJCV.

Mitra, N. J., Flöry, S., Ovsjanikov, M., Gelfand, N., Guibas, L. J., and Pottmann, H. 2007. Dynamic geometry registration. In Proc. SGP, 173–182.

Mori, M., MacDorman, K. F., and Kageki, N. 2012. The uncanny valley [from the field]. IEEE Robotics & Automation Magazine 19, 2, 98–100.

Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., and Fitzgibbon, A. 2011. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 127–136.

Newcombe, R. A., Fox, D., and Seitz, S. M. 2015. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In CVPR, 343–352.

Pons-Moll, G., Taylor, J., Shotton, J., Hertzmann, A., and Fitzgibbon, A. 2015. Metric regression forests for correspondence estimation. IJCV 113, 3, 163–175.

Pradeep, V., Rhemann, C., Izadi, S., Zach, C., Bleyer, M., and Bathiche, S. 2013. MonoFusion: Real-time 3D reconstruction of small scenes with a single web camera. In Proc. ISMAR, IEEE, 83–88.

Revaud, J., Weinzaepfel, P., Harchaoui, Z., and Schmid, C. 2015. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR.

Rosten, E., and Drummond, T. 2005. Fusing points and lines for high performance tracking. In ICCV.

Rusinkiewicz, S., and Levoy, M. 2001. Efficient variants of the ICP algorithm. In 3DIM, 145–152.

Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., and Fitzgibbon, A. 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR.

Smolic, A. 2011. 3D video and free viewpoint video – from capture to display. Pattern Recognition 44, 9, 1958–1968.

Starck, J., and Hilton, A. 2007. Surface capture for performance-based animation. IEEE Computer Graphics and Applications 27, 3, 21–31.

Stoll, C., Hasler, N., Gall, J., Seidel, H., and Theobalt, C. 2011. Fast articulated motion tracking using a sums of Gaussians body model. In Proc. ICCV, IEEE, 951–958.

Sumner, R. W., Schmid, J., and Pauly, M. 2007. Embedded deformation for shape manipulation. ACM TOG 26, 3, 80.

Tevs, A., Berner, A., Wand, M., Ihrke, I., Bokeloh, M., Kerber, J., and Seidel, H.-P. 2012. Animation cartography – intrinsic reconstruction of shape and motion. ACM TOG.

Theobalt, C., de Aguiar, E., Stoll, C., Seidel, H.-P., and Thrun, S. 2010. Performance capture from multi-view video. In Image and Geometry Processing for 3D-Cinematography, R. Ronfard and G. Taubin, Eds. Springer, 127ff.

Vineet, V., Warrell, J., and Torr, P. H. S. 2012. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. In ECCV.

Vlasic, D., Baran, I., Matusik, W., and Popović, J. 2008. Articulated mesh animation from multi-view silhouettes. ACM TOG (Proc. SIGGRAPH).

Vlasic, D., Peers, P., Baran, I., Debevec, P., Popović, J., Rusinkiewicz, S., and Matusik, W. 2009. Dynamic shape capture using multi-view photometric stereo. ACM TOG (Proc. SIGGRAPH Asia) 28, 5, 174.

Wand, M., Adams, B., Ovsjanikov, M., Berner, A., Bokeloh, M., Jenke, P., Guibas, L., Seidel, H.-P., and Schilling, A. 2009. Efficient reconstruction of nonrigid shape and motion from real-time 3D scanner data. ACM TOG.

Wang, S., Fanello, S. R., Rhemann, C., Izadi, S., and Kohli, P. 2016. The global patch collider. In CVPR.

Waschbüsch, M., Würmlin, S., Cotting, D., Sadlo, F., and Gross, M. 2005. Scalable 3D video of dynamic scenes. In Proc. Pacific Graphics, 629–638.

Wei, L., Huang, Q., Ceylan, D., Vouga, E., and Li, H. 2015. Dense human body correspondences using convolutional networks. arXiv preprint arXiv:1511.05904.

Weinzaepfel, P., Revaud, J., Harchaoui, Z., and Schmid, C. 2013. DeepFlow: Large displacement optical flow with deep matching. In ICCV.

Ye, M., and Yang, R. 2014. Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In CVPR, IEEE.

Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., and Gall, J. 2013. A survey on human motion analysis from depth data. In Time-of-Flight and Depth Imaging: Sensors, Algorithms, and Applications. Springer, 149–187.

Zach, C. 2014. Robust bundle adjustment revisited. In Computer Vision – ECCV 2014. Springer, 772–787.

Zeng, M., Zheng, J., Cheng, X., and Liu, X. 2013. Templateless quasi-rigid shape modeling with implicit loop-closure. In Proc. CVPR, IEEE, 145–152.

Zhang, Q., Fu, B., Ye, M., and Yang, R. 2014. Quality dynamic human body modeling using a single low-cost depth camera. In CVPR, IEEE, 676–683.

Zollhöfer, M., Niessner, M., Izadi, S., Rhemann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., et al. 2014. Real-time non-rigid reconstruction using an RGB-D camera. ACM TOG.