Patch-Based Non-Rigid 3D Reconstruction From a Single Depth Stream
Figure 2. Our proposed pipeline. The reconstruction process is incremental; each frame is patched and back-projected. We divide the sequence into keyframes, building a local incremental model in each. We obtain the final reconstruction using a correspondence-based global fusion approach.
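To make the caption concrete, the keyframe loop can be sketched in a few lines of Python. This is an illustrative skeleton only: `patch_and_backproject`, `deform_onto`, and `fuse` are hypothetical stand-ins for the stages detailed in §3.1–§3.3, not the authors' implementation.

```python
import numpy as np

def patch_and_backproject(depth_frame):
    """Stand-in: patch the depth frame and back-project it to a 3D cloud."""
    return {"points": np.asarray(depth_frame, dtype=float)}

def deform_onto(model, cloud):
    """Stand-in for the EM-ICP deformation of the cloud onto the model."""
    return cloud  # a real implementation would warp `cloud` toward `model`

def fuse(model, cloud):
    """Stand-in surfel fusion: merge deformed geometry into the model."""
    model["points"] = np.vstack([model["points"], cloud["points"]])
    return model

def reconstruct(frames, F=20):
    """Process frames in keyframes of F frames; fuse keyframe models at the end."""
    keyframe_models, model = [], None
    frames_in_keyframe = 0
    for frame in frames:
        cloud = patch_and_backproject(frame)
        if model is None:                      # first frame initializes the model
            model, warped = cloud, cloud
        else:
            warped = deform_onto(model, cloud)
            model = fuse(model, warped)
        frames_in_keyframe += 1
        if frames_in_keyframe == F:
            keyframe_models.append(model)      # store the local keyframe model
            # the last deformed output seeds the next keyframe model
            model = {"points": warped["points"].copy()}
            frames_in_keyframe = 0
    if frames_in_keyframe:                     # trailing, incomplete keyframe
        keyframe_models.append(model)
    # global correspondence-based fusion of all keyframe models (simplified
    # here to concatenation; the paper fuses corresponding surfels, §3.5)
    return {"points": np.vstack([m["points"] for m in keyframe_models])}
```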
…fast inter-frame motions. To combat drift, Dou et al. [11] employed non-rigid bundle adjustment to automatically detect loop closures. They demonstrated high-quality results, but required hours of processing and were still unable to fully handle rapid movements. We aim to handle drift with a keyframe-based approach, which eliminates the need for bundle-adjustment-based techniques and speeds up the reconstruction pipeline.

Patch-based Representations: Patch-based representations have been proposed for the problem of non-rigid structure from motion [25, 7]. The work of Varol et al. [32] calculated homographies between corresponding planar patches from a short image sequence, enabling the reconstruction of textured deformable surfaces. Similarly, Fayad et al. [12] reconstructed strongly deforming objects, such as a waving flag, by dividing the surface into overlapping patches and reconstructing each of these individually to obtain a final 3D model. Recent works in non-rigid structure from motion (NRSfM), such as the work of Ji et al. [18], have shown more accurate results than previous state-of-the-art methods for non-rigid shape reconstruction. The patch-based method of Cagniart et al. [3] has been particularly successful in the multi-view tracking setting, as it splits the surface into non-rigidly connected rigid patches, which can withstand noise and occlusions. We also adopt a patch-based representation for our approach, but couple it with a keyframe-based framework to address the problems of the single-view sensor setting, i.e., partial views, noisy data, and occlusions.

3. Proposed Method

Our framework reconstructs a 3D model of a dynamic object, as illustrated in Figure 2. The method keeps track of correspondences throughout the sequence, allowing for fast inter-frame motions. The most crucial aspect of our pipeline is the use of a patch-based surface representation, enabling us to deform incoming frames non-rigidly onto a growing model. The reconstruction process is incremental; each frame is patched (§3.1) and deformed onto a growing model using a deformation framework (§3.2). We use a surfel-based representation for the data, which enables us to fuse new geometry onto the model easily (§3.3).

The idea of a patch-based representation has been used before in multi-view reconstruction and tracking [4]. The key to enabling it for single-stream 3D reconstruction is the incorporation of keyframes into our approach (§3.4). This is accomplished by dividing the sequence of depth images into keyframes, each holding F consecutive frames, and building a local incremental model within each. Finally, we fuse all the models to obtain a global model of the deforming object (§3.5). As we use a surfel-based representation, we also provide a quick overview of our visualization approach (§3.6).

3.1. Surface Patching

Our patching approach constructs roughly equally sized patches distributed uniformly on the object surface. The patching is computed directly on each incoming depth frame, as depicted in Figure 3, allowing us to propagate the patch information for each point when projected to 3D. We employ a superpixel approach for the patch generation, based on the mask Simple Linear Iterative Clustering (maskSLIC) [17] algorithm. In certain situations the mask of the object is complex, containing holes or disparate features, which makes it possible to generate disconnected patches. These detached patches may contain regions of the object surface which deform non-rigidly to different locations. We show an example of detached patches in Figure 3. Finally, we ensure that detached patches are split, such that the deformation framework is not adversely affected by them.

3.2. Patch-based Deformation Framework

We employ a patch-based surface rigidity deformation framework, inspired by [4], to compute the non-rigid deformations of each frame with respect to the canonical pose.
Figure 3. The patching, depicted on the left, is accomplished by placing seeds and clustering all pixels to the nearest seed location. Patches which are detached due to disparate features, as illustrated on the right-hand side, are split to ensure they can deform non-rigidly to their respective locations. The top row shows the output of the patching framework before the final step of splitting the detached patches; the bottom row shows the additional patches created by splitting them.
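The patch-and-split procedure of §3.1, shown in Figure 3, can be sketched as follows. This is a minimal sketch under our own simplifications: a plain nearest-seed clustering stands in for the actual maskSLIC algorithm, and a 4-connected flood fill performs the splitting of detached patches.

```python
import numpy as np
from collections import deque

def patch_mask(mask, seeds):
    """Assign each True pixel of `mask` to its nearest seed (label index)."""
    ys, xs = np.nonzero(mask)
    pix = np.stack([ys, xs], axis=1)                              # (N, 2)
    d = np.linalg.norm(pix[:, None, :] - np.asarray(seeds)[None, :, :], axis=2)
    labels = np.full(mask.shape, -1, dtype=int)                   # -1 = background
    labels[ys, xs] = np.argmin(d, axis=1)
    return labels

def split_detached(labels):
    """Relabel each patch's connected components so no patch is detached."""
    h, w = labels.shape
    out = np.full((h, w), -1, dtype=int)
    next_id = 0
    for y in range(h):
        for x in range(w):
            if labels[y, x] < 0 or out[y, x] >= 0:
                continue
            # flood fill the 4-connected component carrying the same input label
            lab, q = labels[y, x], deque([(y, x)])
            out[y, x] = next_id
            while q:
                cy, cx = q.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and out[ny, nx] < 0 and labels[ny, nx] == lab):
                        out[ny, nx] = next_id
                        q.append((ny, nx))
            next_id += 1
    return out, next_id
```

A mask with a hole or a gap, as in Figure 3, yields a single nearest-seed patch that falls into two components; `split_detached` then assigns the disconnected part a fresh patch id.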
The surface of an object is represented by its 3D coordinates, where X = {x_v}_{v=1:V} ⊂ ℝ³ are the positions of all its points. The task of the deformation framework is to register the points of the incoming frame X to the canonical pose Y. With each incoming depth frame, we deform the frame onto the reference model and integrate the new geometry.

Patch Parameterization. Once the surface patching is completed on a frame, the depth data is back-projected into a point cloud. A rigid transformation with respect to the world coordinates is associated with each patch P_k, parameterized by the position of the patch center c_k and a rotation matrix R_k. The rigid transformation of a point x by patch k can be computed as follows:

$$x_k = R_k (x - c_k) + c_k. \qquad (1)$$

The parameters θ_k = {R_k, c_k} of each patch are combined into a vector Θ = {θ_k}_{k=1:N_p} describing the entire surface, where N_p is the number of patches. The framework computes one rigid transformation per patch, while the motion of each vertex is computed using linear blending from the patch centers. A Gaussian weighting function α_k is employed in the blending, and the position of a point x is computed using its patch and the direct neighbors k ∪ N_k:

$$x = \sum_{s \in k \cup \mathcal{N}_k} \alpha_s x_s. \qquad (2)$$

Deformation Framework. Given an observed point cloud X, where Y is the canonical pose, we register X to Y, estimating Θ̂ such that X(Θ̂) resembles Y as closely as possible. This is accomplished in a two-part process. First, each point in Y is associated with a point in X to build a correspondence set C. Next, Θ̂ is estimated by minimizing an energy E which describes the discrepancy between the pair associations from the first step: Θ̂ = arg min_Θ E(Θ; C).

Assuming that Y and X(Θ) lie near each other, the correspondence set C is built with a nearest-neighbor search. The deformation parameter Θ is initialized and X(Θ) is transformed accordingly. We then evaluate a new correspondence set C for X(Θ), and the process repeats until convergence. This iterative approach is accomplished with a variant of non-rigid ICP [1].

Energy Definition. We utilize an energy formulation combining a data term E_data, which enforces the deformed cloud to be aligned with the reference model, and a regularization term E_r, which imposes as-rigid-as-possible constraints between neighboring patches. Given a correspondence set C, where each point i on the target is associated with a point p of the input cloud, we formulate the data term as:

$$E_{\mathrm{data}}(\Theta; C) = \sum_{(i,p) \in C} w_{i,p} \, \lVert y_i - x_p(\Theta) \rVert_2^2, \qquad (3)$$

where each correspondence pair is associated with a weight w_{i,p}. Given two neighboring patches k and l ∈ N_k, the rigidity energy enforces the predictions x_k(v) and x_l(v) of a vertex v to be consistent:

$$E_r(\Theta) = \sum_{k=1}^{N_p} \sum_{l \in \mathcal{N}_k} \sum_{v \in P_k \cup P_l} w_{kl}(v) \, \lVert x_k(v) - x_l(v) \rVert_2^2, \qquad (4)$$

where the weights w_{kl}(v) encode a property of uniform stiffness. These are computed and normalized proportionally to the sum of the blending weights from Equation 2 for each patch and its neighbors. Our final energy is defined as:

$$\hat{\Theta} = \arg\min_{\Theta} E(\Theta; C) = \arg\min_{\Theta} \, E_{\mathrm{data}}(\Theta; C) + \lambda_r E_r(\Theta). \qquad (5)$$

EM-ICP. To minimize the above energy we model the problem in a Bayesian context and employ Expectation-Maximization (EM) for MAP estimation [4]. This formulation is robust to outliers in the canonical model, reducing
the overall error accumulation as well as drift. Additionally, it is robust to noise and missing data in the incoming depth frames.

3.3. Surfel-based Fusion

Our surfel representation is based on [20]: the surfels of a point cloud are represented as a set of points P̄_k, each associated with a position v_k ∈ ℝ³, a normal n_k ∈ ℝ³, a confidence c_k ∈ ℝ, and a timestamp t_k ∈ ℝ. The deformation framework produces a deformed point cloud, with associated normals, of an incoming depth frame with respect to its keyframe model. This deformed point cloud must be integrated into the model. As we use a surfel-based representation, we do not need to switch to a different data representation (e.g., DynamicFusion [23] requires constant conversions between a point cloud representation and an SDF). When averaging points, we use correspondences within a pre-defined radius δ_near. If a point from the deformed point cloud and its correspondence in the model are within this radius, we average them together with the same averaging scheme as in [20]. Any point from the deformed point cloud which does not have a correspondence, but is not outside a radius δ_far, is added to the model as new geometry. Points outside the bounding radius δ_far are discarded, as they are most likely outliers or artifacts of sensor noise. To further eliminate noise, we remove points with a low confidence (c_k < c_stable) after a time interval t_max. We choose c_stable = 10 and compute the confidence in the same manner as in [20].

3.4. Keyframes

We propose a keyframe-based scheme similar to [10], outlined in Figure 2, which enables us to handle lengthy sequences such as a human turning. This approach consists of dividing the entire sequence into keyframes, holding F frames each, and using correspondence-based fusion to obtain the final reconstruction. Following the initialization of the first keyframe, we process all frames incrementally. We use the initial frame of the first keyframe to initialize the model. We perform surface patching on each subsequent frame and deform it non-rigidly onto the local keyframe model using the EM-ICP deformation framework. New geometry is fused into the model using surfel-based fusion (§3.3). At the end of each keyframe, the last output from the deformation framework initializes the model for the next keyframe, which helps to combat drift and allows us to establish correspondences throughout the entire sequence. This is made possible by our surfel-based representation. The pipeline generates a model for each keyframe, all of which are fused at the end of the sequence to form a final global model.

Each depth frame is surface patched (§3.1) and back-projected into a 3D point cloud, stored using a surfel-based representation. Consistent with prior work [27], we estimate the rigid motion of the camera employing a fully volumetric approach [28], enabling our method to work on sequences with free camera motion. We obtain the deformed point cloud after warping the incoming frame through the EM-ICP framework with respect to the current keyframe model. We fuse this output (§3.3) into the current keyframe model. If a frame is the last one in a keyframe, the deformed point cloud from the deformation framework for that frame is fused into the keyframe model and is additionally used to initialize the next keyframe model. To obtain the final global model, all keyframes are fused together (§3.5) using a global correspondence-based approach. This fusion approach is particularly advantageous for our deformation framework and allows us to track correspondences throughout the entire sequence.

3.5. Global Correspondence-based Fusion

To obtain the final reconstruction, all keyframe models must be fused together. Contrary to [10], we can accomplish this without switching data representations, by fusing the keyframes together using keyframe correspondences. We track correspondences between frames and propagate them between keyframes; when the last deformation of a keyframe is obtained, it is used to initialize the next keyframe model and the correspondences are propagated as well. Other keyframe-based methods, such as [36, 9], cannot track correspondences and must instantiate a new volume for each new keyframe without any associations between them.

The global correspondence-based fusion is performed as the last step of the proposed method. Starting from the last model generated, the corresponding surfels between the last keyframe model and its predecessor are fused, while non-corresponding surfels are added as new geometry to the global model. This process is repeated, fusing all preceding models onto the growing global model. Furthermore, the global correspondence-based fusion enables us to obtain accurate results for lengthy sequences while reducing drift and eliminating the need for complicated constraints such as bundle adjustment. The correspondences found between keyframes are averaged with the same surfel-based scheme as outlined in §3.3. All new points without correspondences are added to the global model as new geometry.

3.6. Visualization

The surfel-based representation consists of point primitives with no connectivity [24]. As such, we employ a surface splatting approach similar to [39] for visualization. This is an advantage compared to other approaches in non-rigid monocular reconstruction, which switch data representations to an SDF, where the visualization is extracted through ray tracing or marching cubes [22]. In contrast, our approach allows us to stay in one representation and is less prone to data loss through representation switching.

Figure 4. Patch-based non-rigid reconstruction results on several non-rigid sequences. Each sequence contains natural dynamic movements of a human such as swaying from side to side, leg movements, and head tilting. Our proposed method achieves full loop reconstructions without using any explicit loop closure detection methods.

4. Evaluation

In this section we present an extensive qualitative and quantitative evaluation of our proposed framework, using both publicly available datasets and our own acquisitions. As we employ a surfel-based representation, the rendering is done using surface splatting, as described in §3.6. Our entire pipeline is implemented on commodity hardware. Running on a 2.70GHz i7 CPU, the EM-ICP deformation framework takes 3 seconds per frame on average. Additionally, the final step (§3.5) takes on average 2 seconds for the entire sequence. This is significantly faster than the non-rigid bundle adjustment scheme of Dou et al. [11], the only other related single-stream technique executed on a CPU, which takes 30 seconds per frame for pre-processing, together with another 6 hours for joint optimization. It may be possible to accelerate the EM-ICP framework using a GPU implementation, but due to the specific nature of our formulation, this is beyond the scope of this paper.

4.1. Qualitative Evaluation

Our Sequences: The experiments on the new sequences captured by us highlight our ability to recover reconstructions of full loops around deformable objects, something not typically demonstrated in comparable approaches. Our method is capable of capturing their geometry with high fidelity by only employing the patch-based rigidity constraints, without explicitly accounting for drift or non-rigid loop closure. Figure 4 depicts three of our own sequences, where a man deforms dynamically while turning. Our framework is able to successfully complete the reconstruction despite motions such as head tilts, back-and-forth swaying, and leg and arm movements.

Large Inter-frame Motions: Next we demonstrate the ability of our framework to recover rapid motions in challenging […]
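The keyframe bookkeeping behind these results (§3.4–§3.5) can be summarized in a toy sketch. The index-map correspondences and the equal-weight average below are our simplifications for illustration; the paper averages with the confidence-weighted scheme of [20].

```python
import numpy as np

def fuse_keyframes(models, correspondences):
    """Global correspondence-based fusion, back-to-front (toy version).

    models: list of (N_i, 3) surfel position arrays, one per keyframe.
    correspondences: for each model but the last, a dict mapping a surfel
    index in that model to a surfel index in the growing global model.
    """
    global_model = models[-1].copy()            # start from the last model
    for model, corr in zip(models[-2::-1], correspondences[::-1]):
        matched = np.full(len(model), False)
        for src, dst in corr.items():
            # corresponding surfels are averaged (equal weights here)
            global_model[dst] = 0.5 * (global_model[dst] + model[src])
            matched[src] = True
        # non-corresponding surfels are added as new geometry
        global_model = np.vstack([global_model, model[~matched]])
    return global_model
```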
Target Frame (First) | Input Frame (Second) | Patched Input Frame | Input Deformed to Target

Figure 5. Ability of our proposed patch-based deformation framework to handle large motions. We manage to successfully warp the input frame towards the target frame despite the large pose difference between them.
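The warps shown in Figure 5 result from iterating correspondence search and minimization of the energy in Equations 3–5. As a toy numeric check of that energy, assume each patch is reduced to a single translation t_k (the paper additionally estimates a per-patch rotation R_k) and uniform weights; both simplifications are ours, for illustration only.

```python
import numpy as np

def energy(T, X, patch_of, Y, corr, neighbors, lam=1.0):
    """E(Theta) = E_data + lam * E_r for per-patch translations T (K, 3).

    X: (N, 3) input points, patch_of: (N,) patch index per point,
    Y: (M, 3) canonical points, corr: list of (i, p) pairs y_i <-> x_p,
    neighbors: dict patch -> list of neighboring patches.
    """
    Xd = X + T[patch_of]                                     # deformed cloud X(Theta)
    # Eq. (3): weighted squared distances over correspondences (uniform weights)
    e_data = sum(np.sum((Y[i] - Xd[p]) ** 2) for i, p in corr)
    # Eq. (4): predictions of a point v by patch k and by neighbor l must agree
    e_r = 0.0
    for k, nbrs in neighbors.items():
        for l in nbrs:
            for v in np.nonzero((patch_of == k) | (patch_of == l))[0]:
                e_r += np.sum(((X[v] + T[k]) - (X[v] + T[l])) ** 2)
    return e_data + lam * e_r                                # Eq. (5)
```

With a globally consistent translation the energy vanishes; with no deformation at all, only the data term remains.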
…BodyFusion [36] Vicon markers. For the YT sequence we report an average global error of 2.3cm, using 20 frames in each keyframe and 70 patches. This compares favourably to the average errors of 4.4cm for DynamicFusion [23] and 3.7cm for VolumeDeform [16]. BodyFusion [36] reports a lower error of 2.2cm, but requires skeleton-based priors to reduce the ambiguities of the deformation parameters, while our method does not involve the use of priors. Our patch-based method therefore manages to track correspondences with high accuracy under fast motions. This is an advantage of our approach compared to other keyframe-based techniques [10, 9], which instantiate a new volume for every keyframe and lose data association between keyframes, and compared to purely SDF-based approaches like KillingFusion [27], which completely lacks correspondence information.

Figure 7. Comparison to the non-rigid bundle adjustment method of Dou et al. [11]. Our method, illustrated in the top row, is able to reconstruct the dynamic sequence without employing any loop closure methodologies, preserving features such as garment folds. We manage to accurately reconstruct small features (which may appear noisy due to the surfel-based visualization, e.g., the nose).
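The accuracy above rests on the surfel fusion rule of §3.3, which reduces to a radius test per incoming point. A simplified sketch follows, with a plain midpoint average standing in for the confidence-weighted scheme of [20] and illustrative default radii:

```python
import numpy as np

def fuse_point(model, p, d_near=0.01, d_far=0.05):
    """Fuse one deformed point p into `model`, an (N, 3) array of positions.

    - nearest model point within d_near -> average the pair (correspondence)
    - no correspondence but within d_far -> add p as new geometry
    - farther than d_far from the model -> discard as noise/outlier
    """
    d = np.linalg.norm(model - p, axis=1)
    j = np.argmin(d)
    if d[j] <= d_near:
        model[j] = 0.5 * (model[j] + p)      # midpoint instead of [20]'s scheme
        return model, "averaged"
    if d[j] <= d_far:
        return np.vstack([model, p]), "added"
    return model, "discarded"
```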
[2] F. Bogo, M. J. Black, M. Loper, and J. Romero. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. In IEEE International Conference on Computer Vision (ICCV), 2015.

[3] C. Cagniart, E. Boyer, and S. Ilic. Free-form mesh tracking: A patch-based approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[4] C. Cagniart, E. Boyer, and S. Ilic. Probabilistic deformable surface tracking from multiple videos. In European Conference on Computer Vision (ECCV), pages 326–339. Springer, 2010.

[5] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG), 34(4), 2015.

[6] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (TOG), 34(4):69, 2015.

[7] T. Collins and A. Bartoli. Locally affine and planar deformable surface reconstruction from video. In International Workshop on Vision, Modeling and Visualization, pages 339–346, 2010.

[8] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H. Seidel, and S. Thrun. Performance capture from sparse multi-view video. ACM Transactions on Graphics (TOG), 27(3), 2008.

[9] M. Dou, P. Davidson, S. R. Fanello, S. Khamis, A. Kowdle, C. Rhemann, V. Tankovich, and S. Izadi. Motion2Fusion: Real-time volumetric performance capture. ACM Transactions on Graphics (TOG), 2017.

[10] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, et al. Fusion4D: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG), 35(4):114, 2016.

[11] M. Dou, J. Taylor, H. Fuchs, A. Fitzgibbon, and S. Izadi. 3D scanning deformable objects with a single RGBD sensor. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 493–501, 2015.

[12] J. Fayad, L. Agapito, and A. Del Bue. Piecewise quadratic reconstruction of non-rigid surfaces from monocular sequences. In European Conference on Computer Vision (ECCV), pages 297–310. Springer, 2010.

[13] K. Fujiwara, K. Nishino, J. Takamatsu, B. Zheng, and K. Ikeuchi. Locally rigid globally non-rigid surface registration. In IEEE International Conference on Computer Vision (ICCV), pages 1527–1534, 2011.

[14] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H. P. Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[15] K. Guo, F. Xu, T. Yu, X. Liu, Q. Dai, and Y. Liu. Real-time geometry, albedo and motion reconstruction using a single RGB-D camera. ACM Transactions on Graphics (TOG), 2017.

[16] M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, and M. Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In European Conference on Computer Vision (ECCV), pages 362–379. Springer, 2016.

[17] B. Irving, I. A. Popescu, R. Bates, P. D. Allen, A. L. Gomes, P. Kannan, P. Kinchesh, S. Gilchrist, V. Kersemans, S. Smart, et al. maskSLIC: Regional superpixel generation with application to local pathology characterisation in medical images. arXiv preprint arXiv:1606.09518, 2016.

[18] P. Ji, H. Li, Y. Dai, and I. D. Reid. "Maximizing rigidity" revisited: A convex programming approach for generic 3D shape reconstruction from multiple perspective views. In IEEE International Conference on Computer Vision (ICCV), pages 929–937, 2017.

[19] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic Studio: A massively multiview system for social motion capture. In IEEE International Conference on Computer Vision (ICCV), pages 3334–3342, 2015.

[20] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In International Conference on 3D Vision (3DV), pages 1–8. IEEE, 2013.

[21] Q. Liu-Yin, R. Yu, L. Agapito, A. Fitzgibbon, and C. Russell. Better together: Joint reasoning for non-rigid 3D reconstruction with specularities and shading. arXiv preprint arXiv:1708.01654, 2017.

[22] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Proceedings of SIGGRAPH '87, pages 163–169, 1987.

[23] R. A. Newcombe, D. Fox, and S. M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 343–352, 2015.

[24] H. Pfister, M. Zwicker, J. van Baar, and M. Gross. Surfels: Surface elements as rendering primitives. In Proceedings of SIGGRAPH 2000, pages 335–342. ACM Press/Addison-Wesley, 2000.

[25] C. Russell, J. Fayad, and L. Agapito. Energy based multiple model fitting for non-rigid structure from motion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3009–3016, 2011.

[26] T. Schmidt, R. Newcombe, and D. Fox. Self-supervised visual descriptor learning for dense correspondence. IEEE Robotics and Automation Letters, 2(2):420–427, 2017.

[27] M. Slavcheva, M. Baust, D. Cremers, and S. Ilic. KillingFusion: Non-rigid 3D reconstruction without correspondences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[28] M. Slavcheva, W. Kehl, N. Navab, and S. Ilic. SDF-2-SDF: Highly accurate 3D object reconstruction. In European Conference on Computer Vision (ECCV), pages 680–696. Springer, 2016.

[29] T. Tamaki, M. Abe, B. Raytchev, and K. Kaneda. Softassign and EM-ICP on GPU. In First International Conference on Networking and Computing, 2010.

[30] D. J. Tan, T. Cashman, J. Taylor, A. Fitzgibbon, D. Tarlow, S. Khamis, S. Izadi, and J. Shotton. Fits like a glove: Rapid and reliable hand shape personalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[31] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[32] A. Varol, M. Salzmann, E. Tola, and P. Fua. Template-free monocular reconstruction of deformable surfaces. In IEEE International Conference on Computer Vision (ICCV), pages 1811–1818, 2009.

[33] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics (TOG), 27(3), 2008.

[34] D. Vlasic, P. Peers, I. Baran, P. Debevec, J. Popović, S. Rusinkiewicz, and W. Matusik. Dynamic shape capture using multi-view photometric stereo. ACM Transactions on Graphics (TOG), 28(5), 2009.

[35] R. Yu, C. Russell, N. D. Campbell, and L. Agapito. Direct, dense, and deformable: Template-based non-rigid 3D reconstruction from RGB video. In IEEE International Conference on Computer Vision (ICCV), pages 918–926, 2015.

[36] T. Yu, K. Guo, F. Xu, Y. Dong, Z. Su, J. Zhao, J. Li, Q. Dai, and Y. Liu. BodyFusion: Real-time capture of human motion and surface geometry using a single depth camera. In IEEE International Conference on Computer Vision (ICCV), 2017.

[37] M. Zollhöfer, M. Nießner, S. Izadi, C. Rhemann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, and M. Stamminger. Real-time non-rigid reconstruction using an RGB-D camera. ACM Transactions on Graphics (TOG), 33(4), 2014.

[38] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb. State of the art on 3D reconstruction with RGB-D cameras. Computer Graphics Forum, 37(2), 2018.

[39] M. Zwicker, H. Pfister, J. van Baar, and M. Gross. Surface splatting. In Proceedings of SIGGRAPH 2001, pages 371–378. ACM, 2001.