NeuralFusion: Online Depth Fusion in Latent Space (CVPR)
In summary, we make the following contributions:

• We propose a novel, end-to-end trainable network architecture which separates the scene representations for depth fusion and the final output into two different modules.

• The proposed latent representation yields more accurate and complete fusion results for larger feature dimensions, allowing accuracy to be balanced against resource demands.

• Our network architecture allows for end-to-end learnable outlier filtering within a translation step that significantly improves outlier handling.

• Although fully trainable, our approach still only performs very localized updates to the global map, which maintains the online capability of the overall approach.

2. Related Work

Representation Learning for 3D Reconstruction. Many works proposed learning-based algorithms using implicit shape representations. Ladicky et al. [31] learn to regress a mesh from a point cloud using a random forest. The concurrent works OccupancyNetworks [38], DeepSDF [43], and IM-Net [5] encode the object surface as the decision boundary of a trained classification or regression network. These methods only use a single feature vector to encode an entire scene or object within a unit cube and are, unlike our approach, not suitable for large-scale scenes. This was improved in [3, 7, 21, 45], which use multiple features to encode parts of the scene. [8, 19, 25, 50, 51] extract image features and learn scene occupancy labels from single or multi-view images. DISN [65] works in a similar fashion but regresses SDF values. Jiang et al. [22] propose a local implicit grid representation for 3D scenes. However, they encode the scene through direct optimization of the latent representation through a pre-trained neural network. More recently, Scene Representation Networks [58] learn to reconstruct novel views from a single RGB image. Liu et al. [33] learn implicit 3D shapes from 2D images in a self-supervised manner. These works also operate within a unit cube and are difficult to scale to larger scenes. DeepVoxels [57] encodes visual information in a feature grid with a neural network to generate novel, high-quality views onto an observed object. Nevertheless, the work is not directly applicable to a 3D reconstruction task. Our method combines the concept of learned scene representations with data fusion in a learned latent space. Recently, several works proposed more local scene representations [1, 15, 16] that allow larger-scale scenes and multiple objects. Further, [34, 41] learn implicit representations with 2D supervision via differentiable rendering. Overall, none of the mentioned works consider online updates of the shape representation as new information becomes available, and adding such functionality is by no means straightforward.
Classic Online Depth Fusion Approaches. The majority of depth map fusion approaches are built upon the seminal "TSDF Fusion" work by Curless and Levoy [9], which fuses depth maps using an implicit representation by averaging TSDFs on a dense voxel grid. This approach became especially popular with the wide availability of low-cost depth sensors like the Kinect and has led to works like KinectFusion [40] and related works like sparse-sequence fusion [66], BundleFusion [12], or variants on sparse grids like VoxelHashing [42], InfiniTAM [46], Voxgraph [47], octree-based approaches [14, 35, 60], or hierarchical hashing [24]. However, the output of these methods usually contains typical noise artifacts, such as surface thickening and outlier blobs. Another line of works uses a surfel-based representation and directly fuses depth values into a sparse point cloud [27, 32, 36, 56, 61, 63]. This representation perfectly adapts to inhomogeneous sampling densities and requires little memory, but it lacks connectivity and topological surface information. An overview of depth fusion approaches is given in [73].
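For reference, the underlying per-voxel update from [9] is just a running weighted average of truncated signed distances; a minimal sketch (NumPy, with array shapes and the unit observation weight as our assumptions):

```python
import numpy as np

def tsdf_update(D, W, d, w=1.0):
    """Classic TSDF averaging in the spirit of Curless and Levoy [9]:
    per-voxel running weighted average of truncated signed distances.
    D, W: current TSDF and weight grids; d: new truncated signed-distance
    observations for the same voxels; w: weight of the new observation."""
    D_new = (W * D + w * d) / (W + w)
    W_new = W + w
    return D_new, W_new

# Example: two noisy observations of the same voxel average out.
D, W = np.zeros(1), np.zeros(1)
for d in (0.04, 0.06):
    D, W = tsdf_update(D, W, np.array([d]))
print(D)  # -> [0.05]
```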
Classic Global Depth Fusion Approaches. While online approaches only process one depth map at a time, global approaches use all information at once and typically apply additional smoothness priors like total variation [30, 67] and its variants including semantic information [6, 17, 18, 52, 53], or refine surface details using color information [72]. Consequently, their high compute and memory requirements prevent their application in online scenarios, unlike our approach.

Learned Global Depth Fusion Approaches. Octnet [49] and its follow-up OctnetFusion [48] fuse depth maps using TSDF fusion into an octree and then post-process the fused geometry using machine learning. RayNet [44] uses a learned Markov random field and a view-invariant feature representation to model view dependencies. SurfaceNet [20] jointly estimates multi-view stereo depth maps and the fused geometry, but requires storing a volumetric grid for each input depth map. 3DMV [11] optimizes shape and semantics of a pre-fused TSDF scene given 2D view information. Contrary to these methods, we learn online depth fusion and can process an arbitrary number of input views.
Learned Online Depth Fusion Approaches. In the context of simultaneous localization and mapping, CodeSLAM [2], SceneCode [69], and DeepFactors [10] learn a 2.5D depth representation and its probabilistic fusion rather than fusing into a full 3D model. The DeepTAM [70] mapping algorithm builds upon traditional cost volume computation with hand-crafted photoconsistency measures, which are fed into a neural network to estimate depth, but full 3D model fusion is not considered. DeFuSR [13] refines depth maps by improving cross-view consistency via reprojection, but it is not real-time capable. Similar to our approach, Weder et al. [62] perform online reconstruction and learn the fusion updates and weights. In contrast to our work, all information is fused into an SDF representation, which requires handcrafted pre- or post-filtering to handle outliers and is therefore not end-to-end trainable. The recent ATLAS [39] method fuses features from RGB input into a voxel grid and then regresses a TSDF volume. While our method learns the fusion of features, they use simple weighted averaging. Their large ResNet50 backbone limits real-time capabilities.

Sequence to Vector Learning. On a high level, our method processes a variable-length sequence of depth maps and learns a 3D shape represented by a fixed-length vector. It is thus loosely related to areas like video representation learning [59] or sentiment analysis [37, 68], which process text, audio, or video to estimate a single vectorial value. In contrast to these works, we process 3D data and explicitly model spatial geometric relationships.
3. Method

Overview. Given a stream of input depth maps D_t : R^2 → R with known camera calibration for each time step t ∈ N, we aim to fuse all surface information into a globally consistent scene representation g : R^3 → R^N while removing noise and outliers as well as completing potentially missing observations. The final output of our method is a TSDF map s : R^3 → R, which can be processed into a mesh with standard iso-surface extraction methods, as well as an occupancy map o : R^3 → R. Figure 2 provides an overview of our method. The key idea is to decouple the scene representation for geometry fusion from the output scene representation. This decoupling is motivated by the difficulty existing methods have in handling outliers within an online fusion method. Therefore, we propose to fuse geometric information into a latent feature space without any preliminary outlier pre-filtering. A subsequent translator network then decodes the latent feature space into the output scene representation (e.g., a TSDF grid). This approach allows for better, end-to-end trainable handling of outliers and avoids any handcrafted post-filtering, which is inherently difficult to tune and typically decreases the completeness of the reconstruction. Furthermore, the learned latent representation also captures complex and higher-resolution shape information, leading to more accurate reconstruction results.
Our feature fusion pipeline consists of four key stages, depicted as networks in Figure 2. The first stage extracts the current state of the global feature volume into a local, view-aligned feature volume using an affine mapping defined by the given camera parameters. After the extraction, this local feature volume is passed together with the new depth measurement and the ray directions through a feature fusion network. This feature fusion network predicts optimal updates for the local feature volume, given the new measurement and its old state. The updates are integrated back into the global feature volume using the inverse affine mapping defined in the first stage. These three stages form the core of the fusion pipeline and are executed iteratively on the input depth map stream. An additional fourth stage translates the feature volume into an application-specific scene representation, such as a TSDF volume, from which one can finally render a mesh for visualization. We detail our pipeline in the following and refer to the supplementary material for additional low-level architectural details.
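Concretely, one online fusion step can be summarized by the following sketch (Python; extract, fuse, and integrate are placeholder callables we introduce for illustration, not the authors' published API):

```python
def process_frame(global_grid, counts, depth, extract, fuse, integrate):
    """One fusion step of the pipeline, as we read it: the three core stages
    run per incoming depth map, while the fourth stage (translation) may run
    asynchronously on the latent grid. All names here are illustrative."""
    # 1. Extraction: read a local, view-aligned feature volume from the
    #    global grid along each measurement ray.
    local_feats, grid_idx = extract(global_grid, depth)
    # 2. Fusion: predict updates for the local volume from its old state,
    #    the new depth measurements, and the ray directions.
    updates = fuse(local_feats, depth)
    # 3. Integration: scatter the updates back via the inverse mapping.
    return integrate(global_grid, counts, updates, grid_idx)
```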
Feature Extraction. The goal of iteratively fusing depth measurements is to (a) fuse information about previously unknown geometry, (b) increase the confidence about already fused geometry, and (c) correct wrong or erroneous entries in the scene. Towards these goals, the fusion process takes the new measurements to update the previous scene state g^{t-1}, which encodes all previously seen geometry. For fast depth integration, we extract a local, view-aligned feature subvolume v^{t-1} with one ray per depth measurement, centered at the measured depth, via nearest-neighbor search in the grid positions. Each ray of features in the local feature volume is concatenated with the ray direction and the new depth measurement. This feature volume is then passed to the fusion network.
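A minimal sketch of this ray-based extraction, under assumed tensor shapes and a grid anchored at the world origin (all names, window size T, and sizes are our illustrative assumptions):

```python
import torch

def extract_local_volume(global_grid, depth, ray_o, ray_d, T=5, voxel_size=0.01):
    """Gather a view-aligned feature subvolume: for every pixel ray, sample T
    points centered at the measured depth and read the nearest grid voxel.
    global_grid: (X, Y, Z, C) feature volume; depth: (H*W,) measured depths;
    ray_o / ray_d: (H*W, 3) ray origins and unit directions in world space."""
    offsets = (torch.arange(T) - T // 2).float() * voxel_size            # (T,)
    # Points along each ray: p = origin + (measured depth + offset) * direction
    pts = ray_o[:, None, :] + (depth[:, None] + offsets[None, :])[..., None] * ray_d[:, None, :]
    idx = torch.round(pts / voxel_size).long()   # nearest-neighbor voxel index
    idx = idx.clamp(min=0)                       # naive bounds handling
    idx[..., 0].clamp_(max=global_grid.shape[0] - 1)
    idx[..., 1].clamp_(max=global_grid.shape[1] - 1)
    idx[..., 2].clamp_(max=global_grid.shape[2] - 1)
    feats = global_grid[idx[..., 0], idx[..., 1], idx[..., 2]]           # (H*W, T, C)
    return feats, idx
```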
Feature Fusion. The fusion network fuses the new depth measurements D_t into the existing local feature representation v^{t-1}. To this end, we pass the feature volume through four convolutional blocks to encode neighborhood information from a larger receptive field. Each of these encoding blocks consists of two convolutional layers with a kernel size of three, each followed by layer normalization and a tanh activation function. We found layer normalization to be crucial for training convergence. The output of each block is concatenated with its input, thereby generating a successively larger feature volume with an increasing receptive field. The decoder then takes the output of the four encoding blocks to predict feature updates. The decoder consists of four blocks with two convolutional layers and interleaved layer normalization and tanh activations. The output of the final layer is passed through a single linear layer. Finally, the predicted feature updates are normalized and passed as v^t to the feature integration.
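One way such an encoding block could look in PyTorch (a sketch; the channel counts, the 2D layout over the image plane with per-ray features as channels, and the fixed normalized shape passed to LayerNorm are our assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class EncodingBlock(nn.Module):
    """One encoding block as described above: two 3x3 convolutions, each
    followed by layer normalization and tanh; the block output is then
    concatenated with its input (dense-style skip connection)."""
    def __init__(self, in_ch, out_ch, spatial=(240, 320)):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.LayerNorm([out_ch, *spatial]),
            nn.Tanh(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.LayerNorm([out_ch, *spatial]),
            nn.Tanh(),
        )

    def forward(self, x):
        # Growing feature volume, growing receptive field.
        return torch.cat([x, self.body(x)], dim=1)
```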
Feature Integration. The updated feature state is integrated back into the global feature grid using the inverse global-local grid correspondences of the extraction mapping. Similar to the extraction, we write the mapped features into the nearest-neighbor grid location. Since this mapping is not unique, we aggregate colliding updates using an average pooling operation. Finally, the pooled features are combined with the old ones using a per-voxel running average operation, where we use the update counts as weights. This residual update operation ensures stable training and a homogeneous latent space, compared to direct prediction of the global features. Both the feature extraction and integration steps are inspired by [62], which, however, uses tri-linear interpolation instead of nearest-neighbor sampling.
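The collision-aware write-back can be sketched as follows (a sketch under assumed dense grid layout and shapes; the paper's implementation details may differ):

```python
import torch

def integrate_updates(global_grid, counts, updates, idx):
    """Scatter predicted feature updates back into the global grid.
    Assumed shapes: global_grid (X,Y,Z,C), counts (X,Y,Z), updates (R,T,C),
    idx (R,T,3) long. Colliding updates are average-pooled, then blended
    with the old features via a count-weighted running average."""
    X, Y, Z, C = global_grid.shape
    flat_idx = (idx[..., 0] * Y * Z + idx[..., 1] * Z + idx[..., 2]).reshape(-1)
    vals = updates.reshape(-1, C)

    # Average-pool updates that collide in the same voxel.
    upd_sum = torch.zeros(X * Y * Z, C).index_add_(0, flat_idx, vals)
    hits = torch.zeros(X * Y * Z).index_add_(
        0, flat_idx, torch.ones_like(flat_idx, dtype=torch.float))
    touched = hits > 0
    pooled = upd_sum[touched] / hits[touched].unsqueeze(1)

    # Count-weighted running average against the previous voxel state.
    grid = global_grid.reshape(-1, C).clone()
    cnt = counts.reshape(-1).clone()
    w_old = cnt[touched].unsqueeze(1)
    grid[touched] = (grid[touched] * w_old + pooled) / (w_old + 1.0)
    cnt[touched] += 1.0
    return grid.reshape(X, Y, Z, C), cnt.reshape(X, Y, Z)
```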
[Figure 2: Method overview. A depth map stream D^{t-2}, D^{t-1}, D^t is processed by an extraction layer that reads a local, view-aligned feature grid from the global feature grid g^t; the fusion network predicts latent-space feature updates v^t, which the integration layer writes back; a translator network with an update prediction and a TSDF head decodes the latent features into the output representation.]

Feature Translation. In the final and possibly asynchronous step, the translator network decodes the latent feature volume into the output TSDF representation. We sample points p_i in world coordinates. Then, at each of these sampled points p_i, the translator aggregates the information stored in the features of the local neighborhood and predicts the TSDF s(p_i). The network architecture in this step is inspired by IM-Net [5]. For efficiency, we only translate those locations where the latent feature grid was just updated, and render the TSDF from the result. This step is independent of the fusion process and can be used asynchronously for an efficient fusion process.
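An IM-Net-style per-point translator could be sketched as follows (the layer sizes, the neighborhood size k, and the occupancy head are our assumptions):

```python
import torch
import torch.nn as nn

class Translator(nn.Module):
    """Per-point decoder sketch in the spirit of IM-Net [5]: aggregates the
    latent features of a point's local neighborhood and regresses a TSDF
    value s(p_i) and an occupancy probability o(p_i)."""
    def __init__(self, feat_dim, k_neighbors=8, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(k_neighbors * feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.tsdf_head = nn.Linear(hidden, 1)   # signed distance s(p_i)
        self.occ_head = nn.Linear(hidden, 1)    # occupancy o(p_i)

    def forward(self, p, neighbor_feats):
        # p: (B, 3) query points; neighbor_feats: (B, k, feat_dim) gathered
        # from the latent grid around each point.
        x = torch.cat([p, neighbor_feats.flatten(1)], dim=1)
        h = self.mlp(x)
        return torch.tanh(self.tsdf_head(h)), torch.sigmoid(self.occ_head(h))
```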
4. Experiments
Our un-optimized implementation runs at ~7 frames per second with a depth map resolution of 240 × 320 on an NVIDIA RTX 2080. This demonstrates the real-time applicability of our approach.

Evaluation Metrics. We use the following evaluation metrics to quantify the performance of our approach: Mean Squared Error (MSE), Mean Absolute Distance (MAD), Accuracy (Acc.), Intersection-over-Union (IoU), Mesh Completeness (M.C.), Mesh Accuracy (M.A.), and F1 score. Further details are in the supplementary material.

Method              | MSE↓ [e-05] | MAD↓ [e-02] | Acc.↑ [%] | IoU↑ [0,1] | F1↑ [0,1]
DeepSDF [43]        | 464.0       | 4.99        | 66.48     | 0.538      | 0.66
Occ.Net. [38]       | 56.8        | 1.66        | 85.66     | 0.484      | 0.62
IF-Net [7]          | 6.2         | 0.47        | 93.16     | 0.759      | 0.86
TSDF Fusion [9]     | 11.0        | 0.78        | 88.06     | 0.659      | 0.79
TSDF + 2D denoising | 27.0        | 0.84        | 87.48     | 0.650      | 0.78
TSDF + 3D denoising | 8.2         | 0.61        | 94.76     | 0.816      | 0.89
RoutedFusion [62]   | 5.9         | 0.50        | 94.77     | 0.785      | 0.87
Ours                | 2.9         | 0.27        | 97.00     | 0.890      | 0.94
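For concreteness, one plausible implementation of the grid-based metrics reads (a sketch; the paper's precise definitions are in its supplementary material):

```python
import torch

def grid_metrics(tsdf_pred, tsdf_gt, occ_pred, occ_gt, thresh=0.5):
    """One reading of the voxel-grid metrics: MSE/MAD on TSDF values,
    accuracy and IoU on thresholded occupancies."""
    mse = ((tsdf_pred - tsdf_gt) ** 2).mean().item()
    mad = (tsdf_pred - tsdf_gt).abs().mean().item()
    p, g = occ_pred > thresh, occ_gt > thresh
    acc = (p == g).float().mean().item() * 100.0          # Acc. in percent
    iou = ((p & g).sum() / (p | g).sum().clamp(min=1)).item()
    return mse, mad, acc, iou
```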
[Table 1: Ablation study over the feature dimension N, reporting MSE↓, MAD↓, Acc.↑, and IoU↑ against TSDF Fusion.]
Outlier Fraction | Method            | MSE↓ [e-05] | MAD↓ [e-02] | Acc.↑ [%] | IoU↑ [0,1]
0.01             | TSDF Fusion [9]   | 34.51       | 1.17        | 85.17     | 0.645
0.01             | RoutedFusion [62] | 5.43        | 0.57        | 95.21     | 0.837
0.01             | Ours              | 2.27        | 0.29        | 97.57     | 0.884
0.05             | TSDF Fusion [9]   | 80.72       | 2.02        | 73.86     | 0.432
0.05             | RoutedFusion [62] | 9.84        | 0.68        | 94.46     | 0.803
0.05             | Ours              | 4.91        | 0.22        | 98.05     | 0.851
0.1              | TSDF Fusion [9]   | 102.50      | 2.43        | 67.47     | 0.341
0.1              | RoutedFusion [62] | 14.25       | 0.77        | 92.95     | 0.764
0.1              | Ours              | 3.35        | 0.22        | 98.48     | 0.865

Figure 7. Reconstruction from Outlier-Contaminated Data. The table states performance measures for various outlier fractions. The accompanying figures (reconstructed geometry and outlier projections onto the xy-plane for TSDF Fusion [9], RoutedFusion [62], and ours) show corresponding reconstruction results and errors for an exemplary ModelNet [64] model. Our method outperforms state-of-the-art depth fusion methods regardless of the outlier fraction, and in particular with larger outlier amounts. Note that high outlier rates are common in multi-view stereo, as shown in the supplementary material of [29].
Our method achieves higher completeness than [62]. This is due to our learned translation from feature to TSDF space, which allows us to better handle noise artifacts and outliers without the need for hand-tuned, heuristic post-filtering. Further, we show improved outlier and noise artifact removal compared to TSDF Fusion [9] while being on par with respect to completeness. These results are also quantitatively shown in Table 2.

Figure 9. Random frame order permutations (plots of the error and IoU over fusion steps for ten random orderings). The proposed method seems to be largely invariant to the frame integration order, since it always converges to the same result.
Tanks and Temples Dataset. In order to demonstrate our method's outlier handling capability, we also run experiments on the Tanks and Temples dataset [29]. We computed stereo depth maps using COLMAP [54, 55] and fused this data using our method, PSR [26], TSDF Fusion [9], and RoutedFusion [62]. To demonstrate the easy applicability to new datasets in scenarios with limited ground truth, we train our method on one single scene (Ignatius) from the Tanks and Temples training set. We reconstructed a dense mesh using Poisson Surface Reconstruction (PSR) [26], rendered artificial depth maps, and used TSDF Fusion to generate a ground-truth SDF. Then, we used this ground truth to train the fusion of stereo depth maps. Figure 12 shows the reconstructions of the unseen Caterpillar, Truck, and M60 scenes from [29]. Our proposed method significantly reduces the amount of outliers in the scene across all models. While [62] shows comparable results on some scenes, it is heavily dependent on its outlier post-filter, which fails as soon as there are too many outliers in the scene (see also Figure 1). Further, they also pre-process the depth maps using a 2D denoising network, while our network uses the raw depth maps.

Figure 10. Visualization of our learned latent space encoding (XY-, XZ-, and YZ-plane projections and the output mesh). Our asynchronous fusion network integrates all measurements, including outliers, but the translator effectively filters the outliers to generate a clean output mesh.
[Figure: qualitative results on the Lounge and Stone wall scenes, comparing the input frame, PSR [26], TSDF Fusion [9], RoutedFusion [62], and ours.]

Figure 12. Results on Tanks and Temples [29]. Our method significantly reduces the number of outliers compared to the other methods without using any outlier filtering heuristic, learning it solely from data. We especially highlight the results on the Caterpillar scene, where our proposed method filters most outliers while the reconstructions of competing methods are heavily cluttered with outliers.
Limitations. While our pipeline shows excellent generalization capabilities (e.g., generalizing from a single MVS training scene), it is biased towards the number of observations integrated during training, leading to less complete results on test scene parts with very few observations. However, this issue can be overcome by a more diverse set of training sequences with different numbers of observations.

5. Conclusion

We presented a novel approach to online depth map fusion with real-time capability. The key idea is to perform the fusion operation in a learned latent space that allows encoding additional information about undesired outliers and super-resolution complex shape information. The separation of scene representations for fusion and final output allows for an end-to-end trainable post-filtering as a translator network, which takes the latent scene encoding and decodes it into a standard TSDF representation. Our experiments on various synthetic and real-world datasets demonstrate superior reconstruction results, especially in the presence of large amounts of noise and outliers.

Acknowledgments. We thank Akihito Seki from Toshiba Japan for insightful discussions and valuable comments. This research was supported by Toshiba and Innosuisse funding (Grant No. 34475.1 IP-ICT).
References

[1] Abhishek Badki, Orazio Gallo, Jan Kautz, and Pradeep Sen. Meshlet priors for 3d mesh reconstruction. CoRR, abs/2001.01744, 2020.

[2] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J. Davison. Codeslam - learning a compact, optimisable representation for dense visual SLAM. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2560–2568, 2018.

[3] Rohan Chabra, Jan Eric Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, and Richard A. Newcombe. Deep local shapes: Learning local SDF priors for detailed 3d reconstruction. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Proc. European Conference on Computer Vision (ECCV), volume 12374 of Lecture Notes in Computer Science, pages 608–625. Springer, 2020.

[4] Angel X. Chang, Thomas A. Funkhouser, Leonidas J. Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.

[5] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5939–5948, 2019.

[6] Ian Cherabier, Christian Häne, Martin R. Oswald, and Marc Pollefeys. Multi-label semantic 3d reconstruction using voxel blocks. In International Conference on 3D Vision (3DV), 2016.

[7] Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. Implicit functions in feature space for 3d shape reconstruction and completion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.

[8] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.

[9] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1996, New Orleans, LA, USA, August 4-9, 1996, pages 303–312, 1996.

[10] J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison. Deepfactors: Real-time probabilistic dense monocular slam. IEEE Robotics and Automation Letters, 5:721–728, 2020.

[11] Angela Dai and Matthias Nießner. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part X, pages 458–474, 2018.

[12] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Trans. Graph., 36(3):24:1–24:18, 2017.

[13] Simon Donné and Andreas Geiger. Defusr: Learning non-volumetric depth fusion using successive reprojections. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 7634–7643. Computer Vision Foundation / IEEE, 2019.

[14] Simon Fuhrmann and Michael Goesele. Fusion of depth maps with multiple scales. ACM Trans. Graph., 30(6):148:1–148:8, 2011.

[15] Kyle Genova, Forrester Cole, Avneesh Sud, Aaron Sarna, and Thomas A. Funkhouser. Deep structured implicit functions. CoRR, abs/1912.06126, 2019.

[16] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T. Freeman, and Thomas A. Funkhouser. Learning shape templates with structured implicit functions. CoRR, abs/1904.06447, 2019.

[17] Christian Häne, Christopher Zach, Andrea Cohen, Roland Angst, and Marc Pollefeys. Joint 3d scene reconstruction and class segmentation. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 97–104, 2013.

[18] Christian Häne, Christopher Zach, Andrea Cohen, and Marc Pollefeys. Dense semantic 3d reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1730–1743, 2017.

[19] Zeng Huang, Tianye Li, Weikai Chen, Yajie Zhao, Jun Xing, Chloe LeGendre, Linjie Luo, Chongyang Ma, and Hao Li. Deep volumetric video from very sparse multi-view performance capture. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Proc. European Conference on Computer Vision (ECCV), volume 11220 of Lecture Notes in Computer Science, pages 351–369. Springer, 2018.

[20] Mengqi Ji, Juergen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. Surfacenet: An end-to-end 3d neural network for multiview stereopsis. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2326–2334, 2017.

[21] Chiyu Max Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas A. Funkhouser. Local implicit grid representations for 3d scenes. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 6000–6009. IEEE, 2020.

[22] Chiyu Max Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas A. Funkhouser. Local implicit grid representations for 3d scenes. CoRR, abs/2003.08981, 2020.

[23] Olaf Kähler, Victor Adrian Prisacariu, Carl Yuheng Ren, Xin Sun, Philip H. S. Torr, and David W. Murray. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. Comput. Graph., 21(11):1241–1250, 2015.

[24] Olaf Kähler, Victor Adrian Prisacariu, Julien P. C. Valentin, and David W. Murray. Hierarchical voxel block hashing for efficient integration of depth images. IEEE Robotics and Automation Letters, 1(1):192–197, 2016.

[25] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 365–376, 2017.

[26] Michael M. Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Trans. Graph., 32(3):29:1–29:13, 2013.

[27] Maik Keller, Damien Lefloch, Martin Lambers, Shahram Izadi, Tim Weyrich, and Andreas Kolb. Real-time 3d reconstruction in dynamic scenes using point-based fusion. In 2013 International Conference on 3D Vision, 3DV 2013, Seattle, Washington, USA, June 29 - July 1, 2013, pages 1–8, 2013.

[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[29] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.

[30] Kalin Kolev, Maria Klodt, Thomas Brox, and Daniel Cremers. Continuous global optimization in multiview 3d reconstruction. International Journal of Computer Vision, 84(1):80–96, 2009.

[31] Lubor Ladicky, Olivier Saurer, SoHyeon Jeong, Fabio Maninchedda, and Marc Pollefeys. From point clouds to mesh using regression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3913–3922, 2017.

[32] Damien Lefloch, Tim Weyrich, and Andreas Kolb. Anisotropic point-based fusion. In 18th International Conference on Information Fusion, FUSION 2015, Washington, DC, USA, July 6-9, 2015, pages 2121–2128, 2015.

[33] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3d supervision. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 8293–8304, 2019.

[34] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. DIST: rendering deep implicit signed distance function with differentiable sphere tracing. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 2016–2025. IEEE, 2020.

[35] Nico Marniok, Ole Johannsen, and Bastian Goldluecke. An efficient octree design for local variational range image fusion. In Proc. German Conference on Pattern Recognition (GCPR), pages 401–412, 2017.

[36] John McCormac, Ankur Handa, Andrew J. Davison, and Stefan Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, Singapore, May 29 - June 3, 2017, pages 4628–4635. IEEE, 2017.

[37] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093–1113, 2014.

[38] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 4460–4470, 2019.

[39] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Proc. European Conference on Computer Vision (ECCV), 2020.

[40] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew W. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 10th IEEE International Symposium on Mixed and Augmented Reality, ISMAR 2011, Basel, Switzerland, October 26-29, 2011, pages 127–136, 2011.

[41] Michael Niemeyer, Lars M. Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 3501–3512. IEEE, 2020.

[42] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Trans. Graph., 32(6):169:1–169:11, 2013.

[43] Jeong Joon Park, Peter Florence, Julian Straub, Richard A. Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 165–174, 2019.

[44] Despoina Paschalidou, Ali Osman Ulusoy, Carolin Schmitt, Luc Van Gool, and Andreas Geiger. Raynet: Learning volumetric 3d reconstruction with ray potentials. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3897–3906, 2018.

[45] Songyou Peng, Michael Niemeyer, Lars M. Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In Proc. European Conference on Computer Vision (ECCV), 2020.

[46] Victor Adrian Prisacariu, Olaf Kähler, Stuart Golodetz, Michael Sapienza, Tommaso Cavallari, Philip H. S. Torr, and David William Murray. Infinitam v3: A framework for large-scale 3d reconstruction with loop closure. CoRR, abs/1708.00783, 2017.

[47] V. Reijgwart, A. Millane, H. Oleynikova, R. Siegwart, C. Cadena, and J. Nieto. Voxgraph: Globally consistent, volumetric mapping using signed distance function submaps. IEEE Robotics and Automation Letters, 2020.

[48] Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. Octnetfusion: Learning depth fusion from data. In 2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, October 10-12, 2017, pages 57–66, 2017.

[49] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6620–6629, 2017.

[50] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Hao Li, and Angjoo Kanazawa. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proc. International Conference on Computer Vision (ICCV), pages 2304–2314. IEEE, 2019.

[51] Shunsuke Saito, Tomas Simon, Jason M. Saragih, and Hanbyul Joo. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 81–90. IEEE, 2020.

[52] Nikolay Savinov, Christian Häne, Lubor Ladicky, and Marc Pollefeys. Semantic 3d reconstruction with continuous regularization and ray potentials using a visibility consistency constraint. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5460–5469, 2016.

[53] Nikolay Savinov, Lubor Ladicky, Christian Häne, and Marc Pollefeys. Discrete optimization of ray potentials for semantic 3d reconstruction. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 5511–5518, 2015.

[54] J. L. Schönberger and J. Frahm. Structure-from-motion revisited. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.

[55] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Proc. European Conference on Computer Vision (ECCV), volume 9907 of Lecture Notes in Computer Science, pages 501–518. Springer, 2016.

[56] Thomas Schöps, Torsten Sattler, and Marc Pollefeys. Surfelmeshing: Online surfel-based mesh reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019.

[57] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. Deepvoxels: Learning persistent 3d feature embeddings. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 2437–2446, 2019.

[58] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 1119–1130, 2019.

[59] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.

[60] Frank Steinbrücker, Christian Kerl, and Daniel Cremers. Large-scale multi-resolution surface reconstruction from RGB-D sequences. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 3264–3271, 2013.

[61] Jörg Stückler and Sven Behnke. Multi-resolution surfel maps for efficient dense 3d modeling and tracking. J. Visual Communication and Image Representation, 25(1):137–147, 2014.

[62] Silvan Weder, Johannes L. Schönberger, Marc Pollefeys, and Martin R. Oswald. RoutedFusion: Learning real-time depth map fusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[63] Thomas Whelan, Renato F. Salas-Moreno, Ben Glocker, Andrew J. Davison, and Stefan Leutenegger. Elasticfusion: Real-time dense SLAM and light source estimation. I. J. Robotics Res., 35(14):1697–1716, 2016.

[64] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1912–1920. IEEE Computer Society, 2015.

[65] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomír Mech, and Ulrich Neumann. DISN: deep implicit surface network for high-quality single-view 3d reconstruction. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 490–500, 2019.

[66] Long Yang, Qingan Yan, Yanping Fu, and Chunxia Xiao. Surface reconstruction via fusing sparse-sequence of depth images. IEEE Trans. Vis. Comput. Graph., 24(2):1190–1203, 2018.

[67] Christopher Zach, Thomas Pock, and Horst Bischof. A globally optimal algorithm for robust tv-l1 range image integration. In Proc. International Conference on Computer Vision (ICCV), pages 1–8, 2007.

[68] Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. WIREs Data Mining and Knowledge Discovery, 8(4):e1253, 2018.

[69] Shuaifeng Zhi, Michael Bloesch, Stefan Leutenegger, and Andrew J. Davison. Scenecode: Monocular dense semantic reconstruction using learned encoded scene representations. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 11776–11785, 2019.

[70] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. Deeptam: Deep tracking and mapping with convolutional neural networks. Int. J. Comput. Vis., 128(3):756–769, 2020.

[71] Qian-Yi Zhou and Vladlen Koltun. Dense scene reconstruction with points of interest. ACM Trans. Graph., 32(4):112:1–112:8, 2013.

[72] Michael Zollhöfer, Angela Dai, Matthias Innmann, Chenglei Wu, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Shading-based refinement on volumetric signed distance functions. ACM Trans. Graph., 34(4):96:1–96:14, 2015.

[73] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb. State of the Art on 3D Reconstruction with RGB-D Cameras. Computer Graphics Forum (Eurographics State of the Art Reports 2018), 37(2), 2018.