Folding Patch Multi-View Stereo (CAVW 2020)
Jie Liao1, Yanping Fu1, Qingan Yan2, and Chunxia Xiao1∗
1 School of Computer Science, Wuhan University
2 Silicon Valley Research Center for Multimedia Software, JD.com
Abstract

In this paper, we propose the novel folding patch model, which can replace the traditional patch model utilized in patch-based Multi-View Stereo (MVS) methods and significantly improve the reconstruction results. The patch model is applied as an approximation of the scene surface differential in the geometric estimation procedure. By minimizing the photometric discrepancy of the projection of the patch model on multiple source images, patch-based MVS algorithms optimize the position and normal values for the 3D hypothesis of the target pixel. The optimization is based on the assumption that the patch model can fit the target scene surface perfectly. However, when it comes to complex scenes crowded with sharp edges, splintery surfaces, or round surfaces, the patch model is inherently not suitable, since even from the microscopic perspective these surfaces are not entirely flat. We construct the folding patch model by folding the traditional patch model along its middle line. By adjusting the folding angle and direction, the folding patch model can fit complex surfaces more flexibly. We apply our folding patch model to the representative open-source Patch-Based Multi-View Stereo (PMVS) and COLMAP, and validate the effectiveness on the ETH3D benchmark and datasets captured in nature. The results demonstrate that utilizing the folding patch model can significantly improve the behavior of PMVS and COLMAP, especially on datasets mainly consisting of complex surfaces from plants.

Keywords: 3D Reconstruction, Multi-View Stereo, PatchMatch

∗ This work was partly supported by The National Key Research and Development Program of China (2017YFB1002600), the NSFC (No. 61672390), Wuhan Science and Technology Plan Project (No. 2017010201010109), and Key Technological Innovation Projects of Hubei Province (2018AAA062). Chunxia Xiao is the corresponding author of this paper, E-mail: cxxiao@whu.edu.cn

1 Introduction

Given a set of calibrated images, Multi-View Stereo (MVS) recovers a dense 3D representation of the target scene. The reconstruction results can be applied to automatic geometry, scene classification, image-based modeling, and robot navigation. With known camera poses, MVS boils down to a dense 3D matching problem under the constraint of projection geometry. The matching operation requires a model which can carry enough contextual or feature information for the pixel to be matched and discern the correct matches across multiple images. The patch model is widely adopted by the current top-performing MVS algorithms according to 3D reconstruction benchmarks [1, 2] to perform the dense correspondence. A patch in the MVS algorithm is essentially a micro-plane in 3D space designed as an approximation of the target scene surface. The 3D geometry corresponding to the pixel in the image is encoded as the depth (or distance to the camera optical center along the back-projection ray) and normal of the patch model. By minimizing the photometric discrepancy of the projection of regularly sampled points from the patch model on different images, MVS algorithms optimize the depth and normal values of the patch model to estimate the geometry of the target pixel. The estimation procedure is based on the assumption that the microscopic surface can be approximated by a small rectangle. The assumption stands when the patch is small or the surface is flat enough. However, there exist cases where the surface is not that flat, and the size of the patch model has to be big enough to collect sufficient texture information.

In this paper, we propose the novel folding patch correspondence, which can be integrated into patch-based MVS algorithms and significantly improve the reconstruction results. The key insight is to make the patch
model folding to make it more flexible for approximating different kinds of surfaces. The main contributions of our paper are summarized as follows:

• We introduce the novel folding patch model, which can replace the traditional patch model widely utilized in patch-based MVS algorithms. The folding patch model consists of four elements: position, normal direction, folding direction, and folding angle. The four elements can be easily encoded as five variants accordingly in different MVS algorithms, as described in Sec 3.

• We propose the adjusting strategy for replacing the traditional patch model in MVS algorithms with the folding patch model. The adjustment includes the optimization and propagation of geometric hypotheses, which are critical to ensure the density of reconstructed point clouds.

We integrate our folding patch model in the open-source PMVS [3] and COLMAP [4], and validate the effectiveness on the ETH3D benchmark and our manually captured datasets. Qualitative and quantitative comparisons demonstrate that exploiting the folding patch can significantly improve the behavior of PMVS and COLMAP.

Figure 1: Geometric elements of the folding patch

2 Related Work

This paper proposes a novel folding patch model as an alternative to the traditional patch model in patch-based MVS algorithms, intended to create dense stereo correspondences for 3D reconstruction. Therefore, we describe existing work in two domains: 1) PatchMatch algorithms and 2) stereo correspondence descriptors.

2.1 PatchMatch Algorithms

The concept of PatchMatch was first proposed in [5], using random sampling to find suitable matches and propagating such matches to surrounding areas based on the natural coherence of patches in images. Barnes et al. [6] further modified their work by extending the search dimensions to scales and rotations in addition to just translations. Non-Rigid Dense Correspondence (NRDC), proposed in [7], interleaves nearest-neighbor field computations with fitting a global non-linear parametric color model and aggregating consistent matching regions using locally adaptive constraints. NRDC realizes dense correspondences between similar areas acquired by different cameras and lenses, under non-rigid transformations, different lighting, and diverse backgrounds. [8] extended the PatchMatch search with DAISY descriptors and filter-based efficient flow inference, coupling and optimizing these modules seamlessly with image segments as the bridge. Though the previous attempts mentioned above have obtained high accuracy and density in dense correspondence estimation, these methods cannot be directly applied in 3D reconstruction. As their work concentrates on building reliable dense correspondence fields for two images in terms of two dimensions, the mapping relationship is limited to a simple similarity transformation. Thus their matching accuracy hardly reaches the pixel-wise precision which is significant for triangulation.

Bleyer et al. [9] try to perform dense stereo matching by sampling and propagation. They use slanted support windows instead of patches to approximate the tangent plane of the surface. Several variants of this algorithm have been proposed, like PMBP by Besse et al. [10] and PM-Huber by Heise et al. [11], which introduce explicit regularization based on [9]. Building upon the previous PM-Huber algorithm, Heise et al. [12] further introduced an extension to the multi-view scenario that employs an iterative refinement scheme.

Differently from directly applying PatchMatch in 3D reconstruction, [13] creates depth maps for each input image through image selection and PatchMatch, and subsequently reconstructs surface parts by multiple depth map filtering and fusion. Based on the PatchMatch for depth maps in [13], [14] proposes a probabilistic framework that jointly models pixel-level view selection and depth map estimation given the local pairwise image photometric consistency. Schönberger et al. [4] develop the work in [14] by jointly estimating depth and normal information, integrating it with geometric priors for view selection and introducing a multi-view geometric consistency term for simultaneous refinement and image-based depth and normal fusion. Wei et al. [15] propose a novel selective joint bilateral upsampling and depth propagation strategy to improve the performance of [4]. Based on [4], Romanoni et al. [16] assumed that textureless regions are piece-wise flat and fitted a plane for each color-homogeneous superpixel segmented from the reference image. This method effectively estimates geometric hypotheses for planar-like surfaces but is not suitable for curved surfaces. Liao et al. [17] propose the local consistency to guide the depth and normal estimation with geometry from neighboring pixels with similar colors, and introduce the pyramid architecture to speed up the convergence of local consistency. However, the depth estimation for areal textureless regions is still not ideal.

2.2 Stereo Correspondence Descriptors

Stereo matching descriptors have been studied for several decades, and many methods have been proposed. Lowe [18] proposed the Scale-Invariant Feature Transform (SIFT) descriptor, which is invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine transformation or 3D projection. Based on Lowe's work, lots of 3D reconstruction (mainly Structure from Motion) algorithms have been introduced [19, 20, 21, 22, 23, 24, 25]. As feature points corresponding to the SIFT descriptor are usually sparsely distributed in images and they essentially do not have the capacity to propagate correspondences to neighboring pixels, the SIFT descriptor is mainly limited to being a matching tool in sparse 3D reconstruction, despite the fact that it has performed pretty well in correspondence, especially in terms of accuracy.

The patch model is essentially a local tangent plane approximation of a surface. Algorithms proposed by [13, 26, 4, 14, 16, 27, 17] utilize a pixel window in the reference view to represent the projection of a small planar patch in the scene and conduct a region-growing procedure to generate dense correspondences and depth maps. Methods like [28, 9, 10, 11, 12] directly encode the patch model as a rectangle in three-dimensional space which is oriented so that one of its edges is parallel to the x-axis of the reference camera. The size of the rectangle is chosen so that the smallest axis-aligned square in the reference image containing its image projection is of appropriate size. The 3D patch model is widely utilized in dense reconstruction for its approximation of the scene surface and its ability to propagate the 3D mapping relationship to neighboring pixels. However, it is still an issue for such algorithms to address surfaces with sharp and arbitrary edges.

3 Methodology

In this section, we describe in detail the proposed folding patch model, which is complementary to the existing patch-based MVS algorithms. Here we use the representative PMVS [3] and COLMAP [4] as the backbone architecture to demonstrate the application of the folding patch in patch-based MVS algorithms.

3.1 Patch Model in MVS

As the information afforded by the pixel alone is insufficient to support dense matching across different views, the information of surrounding pixels is usually exploited. The patch model is such a geometry that is used to collect these surrounding pixels. A patch p in the MVS algorithm is essentially a micro-plane in 3D space which is designed as an approximation of the target scene surface. Its geometry is fully determined by its center c(p) and normal vector n(p). By minimizing the photometric discrepancy of the projection of the patch model on different images, MVS algorithms optimize the position and normal vector of the patch model to estimate the geometry of the target pixel.

Patch Model in PMVS   The patch model in PMVS is a (3D) rectangle, which is oriented so that one of its edges is parallel to the x-axis of the reference camera (the camera associated with the reference image R(p)). The extent of the rectangle is chosen so that the smallest axis-aligned square in R(p) containing its image projection is of size µ × µ pixels (µ is usually set as 5 or 7). The patch center c(p) is encoded as the Euclidean distance dis(p) to the optical center O(R(p)) of the reference image and is constrained to lie along the back-projection ray. The distance unit is set as the average distance along the projection ray corresponding to the displacement of one pixel in each selected visible image. The normal vector n(p) is encoded as the roll and yaw angles that rotate the vector in the opposite direction of the reference camera to the normal direction. The photometric discrepancy of the patch p on the reference image R(p) and a source image Ii is calculated by overlaying a µ × µ grid on p and computing one minus the normalized cross-correlation score between the colors of the projections of all grid points on each image.

Patch Model in COLMAP   The patch model in COLMAP is a micro-plane whose projection on the reference image is a square consisting of µ × µ pixels. The patch center c(p) is encoded as its projection depth θ(p) on the reference image. The normal vector n(p) is encoded as the x and y values of the unit normal vector. As the patch model is a plane, the coordinate mapping from the projection of patch p on the reference image R(p) to image Ii is determined by the planar transformation x′ ∼ Hi(p)x, where '∼' denotes projective equality and Hi(p) the homography defined by patch p. Let Ri, Ki and ti be the camera rotation, intrinsics and translation of image Ii; the homography is expressed by the 3 × 3 matrix

Hi(p) = Ki · Ri · (I − (t1 − ti) · n(p)^T / θ(p)) · R1^T · K1^{−1}.   (1)

The photometric discrepancy of the patch p on the reference image R(p) and a source image Ii is defined as one minus the bilaterally weighted normalized cross-correlation between
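To make Equation 1 and the photometric score concrete, the plane-induced homography and a one-minus-NCC discrepancy could be sketched as follows. This is a minimal NumPy sketch, not the actual PMVS or COLMAP code; the pose convention x_cam = R·X + t, the helper names, and the use of plain (rather than bilaterally weighted) normalized cross-correlation are our assumptions:

```python
import numpy as np

def plane_homography(K1, R1, t1, Ki, Ri, ti, n, theta):
    """Homography induced by a patch plane with normal n and projection
    depth theta, following Equation 1. The world-to-camera pose convention
    (x_cam = R @ X + t) is an assumption of this sketch."""
    M = np.eye(3) - np.outer(t1 - ti, n) / theta
    return Ki @ Ri @ M @ R1.T @ np.linalg.inv(K1)

def one_minus_ncc(a, b, eps=1e-8):
    """Photometric discrepancy: one minus the normalized cross-correlation
    between two equally sized color/intensity sample vectors (0 = identical,
    values approach 2 for anti-correlated samples)."""
    a = np.asarray(a, dtype=float).ravel() 
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

As a sanity check on the formula, two identical cameras observing the same plane should induce the identity homography, and identical sample vectors should yield a discrepancy of zero.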
Figure 2: Folding patches in 3D space

Figure 4: Approximating micro-surfaces (the transparent blue ones) with the folding patch model: (a) plane micro-surface, (b) curved micro-surface, (c) folding micro-surface

…lated as

(x(p), y(p), z(p)) = (x(c), y(c), −z(c)) · Rα · Rβ · Rγ.   (2)

Rα, Rβ and Rγ are local rotation matrices defined respectively as

Rα = R_pitch = [[1, 0, 0], [0, cos(α), −sin(α)], [0, sin(α), cos(α)]],
Rβ = R_roll = [[cos(β), 0, sin(β)], [0, 1, 0], [−sin(β), 0, cos(β)]],
Rγ = R_yaw = [[cos(γ), −sin(γ), 0], [sin(γ), cos(γ), 0], [0, 0, 1]].   (3)

The folding angle a(p∗) is encoded directly as the angle between the local x-axis and the corresponding plane of the folding patch, as shown in Figure 1. When a(p∗) is zero, a folding patch equals the traditional flat patch.

The initialization of the folding patch center and normal direction is the same as that in the original PMVS. The folding unit vector is initialized as the cross product of the normal vector n(p) and the unit vector x(c) along the x-axis of the reference camera. The folding angle a(p) is initialized as zero. Intuitively, the shapes and poses of the initialized folding patch and the original patch in PMVS are the same. The optimization of the folding patch is also conducted by overlaying a regular grid on the model and minimizing the photometric discrepancy between the projections of grid points on reference images and other view-selected images. The difference is that a(p) and f(p) participate in the coordinate calculation of grid points. We adopt the Bound Optimization BY Quadratic Approximation (BOBYQA) technique proposed by M. J. D. Powell [29] to optimize the geometric parameters of the folding patch. For the expansion procedure, the expanded folding patch center is also calculated as the intersection of the back-projection ray with the folding patch. The normal vector of the expanded folding patch is assigned as the normal direction of the intersected half-plane of the seed folding patch. The folding direction and folding angle are directly copied from the seed folding patch. We adopt the same filtering strategy as PMVS.

Figure 5: The pixels constituting the folding patch projection on the reference image are divided into two clusters by the projection of the folding line. The green line represents the projection of the folding line. The pixels belonging to the same cluster are labeled with the same color. The black arrow points in the x-axis direction.

Folding Patch Model in COLMAP   We integrate the folding patch model in COLMAP and name it FCOLMAP. The center and normal vectors are encoded in the same way as in COLMAP. We represent the folding direction as a line in the reference image, passing through the center of the projected pixel matrix and dividing the pixels into two clusters as shown in Figure 5. The two clusters belong respectively to the two planes containing the two faces of the folding patch. The cutting line is encoded as the angle between it and the x-axis of the reference image in the anti-clockwise direction, and we denote it as σ(p). Then the unit vector in the folding direction f(p) can be represented as

f(p) = ((cos(σ(p)), sin(σ(p)), 0) × n(p)) × n(p).   (4)

In COLMAP, the coordinate of each pixel inside the patch projection can be mapped to the source image Ii through the planar homography transformation formulated as Equation 1. For FCOLMAP, coordinates of pixels in each cluster are also mapped to other images through
Figure 6: Qualitative point cloud comparisons between different algorithms on some high-resolution multi-view test datasets (botanical_garden, observatory, bridge, boulders) of the ETH3D benchmark.
Figure 7: Qualitative point cloud comparisons between different algorithms on datasets of natural scenes captured by ourselves.
and FCOLMAP significantly improve the performance of the backbone PMVS and COLMAP. This is attributed to the better flexibility of the folding patch model, which makes it approximate complex scene surfaces better.

Figure 6 exhibits qualitative point cloud comparisons. It can be observed that utilizing the folding patch model can significantly improve the quality of the reconstructed point clouds. By comparing the results generated by FCOLMAP and COLMAP, we can see that regions containing complex surfaces are better reconstructed with folding patches, even though COLMAP works pretty well on most of the regions. For the botanical_garden dataset, FCOLMAP reconstructs more grass and leaves than COLMAP. For the observatory dataset, the round building surface and the cracks between the bricks are reconstructed better with FCOLMAP. For the boulders dataset, the rope on the ground is completely reconstructed through FCOLMAP but is lost by the original COLMAP.

4.2 Natural Scenes

In this section, we densely reconstruct models of natural scene datasets using PMVS, FPMVS, COLMAP, and FCOLMAP. All these datasets are captured directly by a smartphone camera. Although we do not have the corresponding ground truth to calculate precise values for accuracy and completeness, the results of these algorithms exhibit visible differences. Experiments on such datasets can exhibit the capability of the algorithms to fit complex shapes and edges in natural scenes. We use four datasets of natural scenes as shown in Figure 7. From top to bottom, the corresponding datasets consist of 15, 10, 17, and 12 unstructured images respectively, all with a resolution of 1632 × 1224. We obtain the camera calibration and sparse points reconstructed by SFM [19] as the input of PMVS and FPMVS, and by [22] for COLMAP and FCOLMAP. We use the default parameters afforded by the public implementation for COLMAP and FCOLMAP. For PMVS and FPMVS, we use the default parameters except for level = 0 and csize = 1, which means that source images are not resized and each pixel is supposed to generate one 3D point. Figure 7 shows one source image and the corresponding reconstructed point clouds. It can be observed that the point clouds reconstructed with the folding patch model are better than the point clouds reconstructed with the traditional patch model.

4.3 Limitations

Table 1 shows the accuracy scores of PMVS, FPMVS, COLMAP, and FCOLMAP on the high-resolution test datasets of the ETH3D benchmark. It can be observed that utilizing the folding patch model will decrease the accuracy of the reconstructed models. The flexibility of the folding patch model makes it better approximate various surfaces. However, the parameters introduced in addition to the traditional patch model make the geometric optimization results less precise, which leads to the decreasing of reconstruction accuracy. In addition, the two extra dimensions introduced by the folding patch model definitely increase the time consumption of geometry optimization, which in turn increases the final running time, as shown in Table 2.

References

[3] Furukawa Y, and Ponce J. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2010;32(8):1362–1376.

[4] Schönberger JL, Zheng E, Pollefeys M, and Frahm JM. Pixelwise View Selection for Unstructured Multi-View Stereo. In: ECCV. Springer; p. 501–518.

[5] Barnes C, Shechtman E, Finkelstein A, and Goldman DB. PatchMatch: A randomized correspondence algorithm for structural image editing. SIGGRAPH. 2009;p. 24.

[16] Romanoni A, and Matteucci M. TAPA-MVS: Textureless-Aware PatchMatch Multi-View Stereo. In: CVPR; 2019. p. 10413–10422.

[17] Liao J, Fu Y, Yan Q, and Xiao C. Pyramid Multi-View Stereo with Local Consistency. Computer Graphics Forum. 2019;38(7):335–346.
[18] Lowe DG. Object recognition from local scale-invariant features. ICCV. 1999;p. 1150–1157.

[19] Snavely N, Seitz SM, and Szeliski R. Photo tourism: exploring photo collections in 3D. ACM Transactions on Graphics. 2006;25(3):835–846.

[20] Agarwal S, Furukawa Y, Snavely N, Curless B, Seitz S, and Szeliski R. Building Rome in a day. In: ICCV; 2009. p. 105–112.

[21] Zheng E, and Wu C. Structure from motion using structure-less resection. In: ICCV; 2015. p. 2075–2083.

[22] Schönberger JL, and Frahm JM. Structure-from-motion revisited. In: CVPR; 2016. p. 4104–4113.

[23] Yan Q, Xu Z, and Xiao C. Fast Feature-Oriented Visual Connection for Large Image Collections. Computer Graphics Forum. 2014;33(7):339–348.

[24] Yan Q, Yang L, Zhang L, and Xiao C. Distinguishing the indistinguishable: Exploring structural ambiguities via geodesic context. In: CVPR; 2017. p. 3836–3844.

[25] Chen Y, Hao C, Cai Z, Wu W, and Wu E. Live Accurate and Dense Reconstruction from a Handheld Camera. Computer Animation and Virtual Worlds. 2013;24:387–397.

[26] Goesele M, Snavely N, Curless B, Hoppe H, and Seitz SM. Multi-view stereo for community photo collections. In: CVPR. IEEE; 2007. p. 1–8.

[27] Xu Q, and Tao W. Multi-Scale Geometric Consistency Guided Multi-View Stereo. In: CVPR; 2019. p. 5483–5492.

[28] Furukawa Y, and Ponce J. Accurate, Dense, and Robust Multi-View Stereopsis. CVPR. 2007;p. 1–14.

[29] Powell MJD. The BOBYQA Algorithm for Bound Constrained Optimization without Derivatives. Cambridge Technical Report. 2009;p. 26–46.

[30] Zhao X, Zhou Z, Duan Y, and Wu W. Detail-feature-preserving surface reconstruction. Computer Animation and Virtual Worlds. 2012;23(3-4):407–416.

[31] Fu Y, Yan Q, Yang L, Liao J, and Xiao C. Texture Mapping for 3D Reconstruction with RGB-D Sensor. In: CVPR; p. 4645–4653.

Jie Liao received the BSc degree from Wuhan University in 2013, and the MSc degree from Huazhong University of Science and Technology in 2015. He is currently pursuing the Ph.D. degree at the School of Computer Science, Wuhan University, Wuhan, China. His work mainly focuses on Multi-view Stereo, IBR and light-field reconstruction.

Yanping Fu received the BSc degree from Changchun University in 2008, and the MSc degree from Yanshan University in 2011. He is currently pursuing the PhD degree with the School of Computer Science, Wuhan University, Wuhan, China. His research is focused on 3D reconstruction, texture mapping and SLAM.

Qingan Yan received his Ph.D. degree at the Computer Science Department of Wuhan University in 2017. Before that, he got his M.S. and B.E. degrees in Computer Science from Southwest University of Science and Technology, and Hubei Minzu University, in 2012 and 2008 respectively. He is currently working at the JD.com Silicon Valley Research Center in California as a research scientist. His research interests include 3D reconstruction, geometric scene perception, visual correspondence and image synthesis.

Chunxia Xiao received his BSc and MSc degrees from the Mathematics Department of Hunan Normal University in 1999 and 2002, respectively, and his PhD degree from the State Key Lab of CAD&CG of Zhejiang University in 2006. Currently, he is a professor in the School of Computer, Wuhan University, China. From October 2006 to April 2007, he worked as a postdoc at the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, and from February 2012 to February 2013, he visited the University of California, Davis for one year. His main interests include computer graphics, computer vision and machine learning. He is a member of IEEE.
Table 1: Quantitative accuracy score comparisons on high-resolution test datasets of the ETH3D benchmark with a threshold of 5cm.

Table 2: Quantitative running time comparisons on high-resolution test datasets of the ETH3D benchmark with the time unit as second.

Table 3: Quantitative F1 score comparisons on the ETH3D benchmark with thresholds of 2cm and 5cm.

                               2cm                                 5cm
                  PMVS   FPMVS  COLMAP FCOLMAP      PMVS   FPMVS  COLMAP FCOLMAP
botanical garden  40.82  65.08  87.13  89.28         5.72  20.80  95.28  96.35
boulders          42.54  51.03  65.67  66.95        46.20  56.71  79.75  80.88
bridge            65.12  69.48  88.30  88.41        73.41  84.42  94.28  94.73
door              63.99  69.58  84.17  89.03        78.31  84.42  92.28  95.45
exhibition hall   34.43  41.46  62.96  64.05        41.03  51.78  75.17  79.68
lecture room      24.48  40.67  63.80  66.87        30.06  49.57  78.02  82.04
living room       57.12  72.32  87.69  88.61        65.68  81.61  93.77  94.61
lounge             3.92  16.46  38.04  46.78         6.08  23.49  58.60  71.13
observatory       76.17  70.00  92.56  92.89        83.43  81.48  97.84  98.50
old computer      17.70  24.24  46.66  57.48        30.04  40.61  65.06  76.24
statue            55.52  65.42  74.91  78.55        65.86  76.25  85.92  91.05
terrace           48.75  58.40  84.24  87.67        60.80  74.46  91.63  94.80
All               44.16  53.68  73.01  76.38        52.22  64.68  83.96  87.95