ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling
JINGWEI HUANG, Huawei Technologies, China
SHANSHAN ZHANG, Huawei Technologies, China
BO DUAN, Huawei Technologies, China
YANFENG ZHANG, Huawei Technologies, China
XIAOYANG GUO, Huawei Technologies, China
MINGWEI SUN, Huawei Technologies and Wuhan University, China
LI YI, Tsinghua University, China
[Figure 1 panels: Planes and Cuboids → Initial Arrangements → Floor Plan Arrangement → Scene Arrangement; the GNN refines the floor plan, whose boundary is reused for the ceiling and into which cuboids are embedded.]
Fig. 1. (a) We prepare the scene by detecting planes and cuboids as basic primitives. (b) We propose ArrangementNet to reconstruct floor plan arrangements and enrich them to organize primitives with different semantics. (c) We generate the scene from enriched arrangements to obtain a BIM model.
ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
a novel graph neural network that consumes noisy over-partitioned initial arrangements extracted through non-learning techniques and outputs high-quality scene arrangements. The core of ArrangementNet is an extended graph convolution that leverages co-linear and co-face relationships in the arrangement and improves the quality of prediction in complex scenes. We apply ArrangementNet to improve floor plan and ceiling arrangements and enrich them with semantic objects as scene arrangements for scene generation. Our approach faithfully models challenging scenes obtained from laser scans or multiview stereo and shows significant improvement in BIM model reconstruction compared to the state-of-the-art. Our code is available at https://github.com/zssjh/ArrangementNet.

CCS Concepts: • Computing methodologies → Mesh geometry models; Scene understanding.

Additional Key Words and Phrases: Building information model (BIM), Arrangement, Graph neural network, Floor plan

ACM Reference Format:
Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi. 2023. ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling. ACM Trans. Graph. 42, 4, Article 51 (August 2023), 15 pages. https://doi.org/10.1145/3592122

1 INTRODUCTION

Vectorized reconstruction of point clouds is a fundamental problem in the computer graphics and vision communities. The ultimate goal is to represent the scene using concise polygonal meshes, where the main structures¹ are well-segmented according to the semantics. Such a compact representation of a building is called a building information model (BIM) and is the foundation for real-time downstream applications in gaming, civil engineering, and virtual/augmented reality. However, the reconstruction quality from existing solutions can hardly meet the standard of these applications.

¹ wall, floor, ceiling, door, and window

Existing works suffer from incomplete point clouds as input. Local geometry simplification [Garland and Heckbert 1997; Salinas et al. 2015] requires smooth surfaces and is sensitive to noise and incompleteness, especially around transparent objects like windows. A popular direction is to vectorize point clouds into large planar shapes [Chen and Chen 2008; Huang et al. 2017; Schindler et al. 2011; Van Kreveld et al. 2011] and complete the scene by extending planes to connect each other. However, connectivity analysis is a challenging problem to solve. A common geometric solution is space partition [Bauchet and Lafarge 2020; Fang and Lafarge 2020; Nan and Wonka 2017], which produces noisy planes around small objects. Further, these methods assume watertight geometry and cannot preserve thin structures or openings. Floor plan reconstruction [Chen et al. 2019, 2021; Liu et al. 2018, 2017; Stekovic et al. 2021; Xu et al. 2021] aims to simplify the problem and focus on better connectivity analysis among floors and walls. However, these learning methods require accurate corner regression and corner connection prediction based on accurate room segmentation, both of which are extremely hard for large scenes. Further, 3D modeling requires seamlessly assembling doors, windows, and ceilings to floor plans, which is non-trivial but not well-studied.

Our key observation is that BIM modeling can be effectively represented by an arrangement [Ron et al. 2022] as a 2D partition of a horizontal plane by a set of edges. Since plane detection is well-solved in large and complex scenes, [Fang et al. 2021; Han et al. 2021] robustly construct an arrangement as a superset of the floor plan by over-partitioning the horizontal floor region with detected wall planes. Then, they filter it to construct the floor plan by solving an energy minimization. Since a floor plan arrangement is insufficient to represent a 3D scene, we propose to enrich it into a scene arrangement with an additional ceiling arrangement sharing boundary edges with the floor plan, and other semantic objects as cuboids embedded into floor plan edges. Such a representation can be elegantly converted into a vectorized 3D model.

Our key idea is to formulate arrangement construction as a graph filtering problem so that we can leverage learning techniques to convert over-partitioned arrangements [Fang et al. 2021] into compact scene arrangements with dramatically improved quality. We propose ArrangementNet (Fig. 1(b)) as a graph neural network (GNN) that filters the over-partitioned arrangement by classifying whether to preserve or drop certain arrangement elements. Such a formulation avoids challenging corner regression [Chen et al. 2019; Stekovic et al. 2021] by converting connectivity analysis into a binary classification problem, so that the network can learn from data effectively. In detail, we model each arrangement face as a node of the graph, and insert a graph edge for each arrangement edge connecting the adjacent faces at both sides of the edge. We additionally insert links between arrangement edges that are co-face or co-linear, and extend the graph convolution to operate on arrangements and highlight these specific relationships. Such a graph fully captures the structure of the partitioned space based on the arrangement, and our GNN jointly considers floor and wall regions by message passing through this graph. For example, the network tends to preserve wall edges at floor boundaries or drop them where the faces at both sides are not floor regions. In other words, our GNN analyzes connectivity based on messages passed through the arrangement in an end-to-end manner, without requiring challenging room segmentation [Chen et al. 2019; Fang et al. 2021].

Next, we enrich the predicted floor plans into scene arrangements to model 3D scenes. We reuse floor plan boundaries and insert intersection lines among ceiling planes to build a ceiling arrangement. We detect objects as cuboids and project each instance as a rectangle in the horizontal plane. Each rectangle edge is either merged into the floor plan arrangement or embedded into a close wall edge. As such, we obtain scene arrangements that fully describe connectivity among various semantic parts. Different from [Bauchet and Lafarge 2020; Han et al. 2021; Ikehata et al. 2015], we generate the 3D model with multiple seamlessly assembled semantic parts.

We evaluate our approach on overall 3D modeling and floor plan reconstruction. We deliver 3D models with better quality than existing works (Sec. 6), and handle challenging settings where point clouds are produced by a multiview stereo algorithm. Further, our scene arrangement representation generates door openings and unscanned transparent windows, which cannot be directly recovered using state-of-the-art solutions [Han et al. 2021; Ikehata et al. 2015]. Our automatic pipeline faithfully produces BIMs with concise geometry and correctly assembled semantic components. Sec. 7 shows that our floor plan prediction significantly outperforms the state-of-the-art methods on various datasets. While existing methods fail to detect corners and connect
them as walls, we correctly reconstruct most wall structures, especially for large scenes. Our extended GNN convolution further improves performance by jointly considering floor faces and wall edges.

Overall, our core research contributions are:
• A formulation for vectorized modeling by learning scene arrangements to analyze connectivity.
• A novel ArrangementNet that learns to filter over-partitioned initial arrangements for floor plan and ceiling construction.
• An extended GNN convolution to fully capture the arrangement structure.
• An automatic pipeline that significantly outperforms existing methods for floor plan and vectorized 3D modeling.

2 RELATED WORKS

2.1 3D Plane Assembly

Plane Detection. Indoor reconstruction usually begins with the detection of plane primitives. The most popular traditional methods are based on RANSAC [Chum and Matas 2005; Fischler and Bolles 1981; Kang and Li 2015; Matas and Chum 2004; Torr and Zisserman 2000] and region growing [Marshall et al. 2001; Rabbani et al. 2006]. Recently, primitive fitting has been further addressed by supervised [Huang et al. 2021; Li et al. 2019b; Sharma et al. 2020; Zou et al. 2017] and unsupervised [Fang et al. 2018; Sharma et al. 2018; Tulsiani et al. 2017] networks. We adopt a modified version of [Rabbani et al. 2006] and find it sufficiently robust for plane detection. [Yu and Lafarge 2022] focus on plane detection and is also worth following.

Plane Connectivity. After plane detection, [Chen and Chen 2008; Huang et al. 2017; Schindler et al. 2011; Van Kreveld et al. 2011] solve the reconstruction problem by computing an adjacency graph and extracting edges, corners, and faces based on plane affinity. However, pure geometric rules are not sufficient to describe the scene, and they often yield connectivity errors and produce incomplete models. While the topology of surface reconstruction can guide the connectivity analysis [Holzmann et al. 2018; Mehra et al. 2009] to alleviate this problem, the surface reconstruction itself suffers from large incompleteness. We instead train ArrangementNet to learn connectivity from data rather than relying on low-level geometric analysis.

Space Partition. Vectorized building reconstruction can also be handled by space partition based on detected primitives. A subset of edges or faces from the partition is selected based on clustering or energy minimization. [Ochmann et al. 2019, 2016a; Zhang et al. 2021] extend and intersect detected wall edges and perform room labeling to optimize the structural topology. Similarly, [Cui et al. 2019; Li et al. 2019a; Oesau et al. 2014; Previtali et al. 2014; Tran and Khoshelham 2019; Wang et al. 2018, 2020] also rely on robust room segmentation. [Boulch et al. 2014; Mura et al. 2016; Nan and Wonka 2017] adopt a general 3D pipeline where plane primitives are directly used to slice 3D space into convex polyhedra. Such a formulation can handle complex geometries like sloping ceiling planes. [Bauchet and Lafarge 2020] further accelerate the algorithm with a novel kinetic structure. However, these methods are based on pure geometric rules and cannot deliver clean structures in cluttered scenes. More importantly, they assume the scene to be watertight, which does not fit real scans with thin structures or completely missing planar regions. [Fang and Lafarge 2020] exploit principles from connectivity-based methods but still rely on the watertight assumption. Our learning-based approach solves a binary classification problem, which resolves these issues and is not limited by the watertight assumption.

2.2 2D Vectorized Geometry

Primitive-based Partition. Partition-based methods can be performed in 2D by projecting walls as line segments and floors as faces onto a horizontal plane. Similar to space partition, [Li et al. 2020; Mura et al. 2014; Ochmann et al. 2016b; Turner and Zakhor 2014] require room segmentation before analyzing the floor plan. [Fang et al. 2021] address this issue by separately reconstructing outer boundaries and inner walls. However, room segmentation is itself a challenging problem to solve. We jointly learn to classify faces and edges as floor plans based on the arrangement in an end-to-end manner, without explicit room segmentation.

Image-based Understanding. Floor plan reconstruction is also studied in the imaging communities given images as input. Traditional methods produce wireframes [Furukawa et al. 2009; Silberman et al. 2012], room layouts [Delage et al. 2006; Hedau et al. 2009; Izadinia et al. 2017], or floor plans [Cabral and Furukawa 2014]. [Liu et al. 2015; Vidanapathirana et al. 2021] further augment floor plans into textured meshes. Neural networks have been introduced to replace the basic primitive detection modules to produce corners, edges, or regions [Hu et al. 2021; Liu et al. 2018, 2017; Qian and Furukawa 2020; Zou et al. 2018]. [Chen et al. 2019; Nauata and Furukawa 2019; Phalak et al. 2020] detect room instances using Mask-RCNN [He et al. 2017] and recover the floor plan via post optimization or Monte Carlo Tree Search [Stekovic et al. 2021]. Therefore, their performance depends on room segmentation and is hard to generalize to novel scenes. [Xue et al. 2020; Zhang et al. 2019; Zhou et al. 2019] propose to parse 3D wireframes in an end-to-end manner by jointly predicting junctions and their connections. Recently, [Chen et al. 2021; Xu et al. 2021] introduce transformer-based object detection [Carion et al. 2020] into wireframe parsing. They project point clouds onto a horizontal plane as a density image to predict floor plans. However, we observe that it is more robust and straightforward to directly fit planes from 3D point clouds, since this correctly detects wall candidates in large and complex scenes. Our network is built upon an arrangement initialized with these wall candidates and can better focus on connectivity analysis.

2.3 Vectorized 3D Modeling

An important direction for indoor modeling is to decompose the problem into subproblems for reconstructing different semantic elements. [Ikehata et al. 2015] propose a grammar that consists of rooms, structural details, objects, and room connections via doors. [Han et al. 2021] follow a similar philosophy and separately reconstruct multi-plane ceilings, floors, and walls with structural details. However, these representations are not unified while energy minimization is required, where the performance highly depends on
Fig. 2. Framework overview. We prepare the scene by detecting basic primitives (planes and cuboids) and segmenting the point cloud into multiple stories
(Sec. 3). Then, we propose ArrangementNet to construct and enrich arrangements to organize primitives with different semantics. Finally, we generate the 3D
model from enriched arrangements (Sec. 5).
(a) Input point cloud (b) Region growing (c) Scale-aware (ours)
Fig. 3. Scale-aware region growing for plane detection. (b) [Rabbani et al.
2006] segment ceilings into several primitives inside blue and yellow ellipses.
(c) Point normals at different scale levels help ignore non-planar details
(blue and yellow) and preserve structural curved surfaces (red) on walls.
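The scale-aware test of Fig. 3 can be sketched as follows: estimate one normal per point at several neighborhood sizes, then accept a neighbor during region growing if its normal at any scale agrees with the plane. This is a minimal illustration, not the paper's implementation; the helper names (`knn_indices`, `grow_plane`), the brute-force neighbor search, and the angle threshold are assumptions for the sketch.

```python
import numpy as np

def knn_indices(points, i, k):
    # Brute-force k-nearest neighbours of point i (fine for a small sketch).
    d = np.linalg.norm(points - points[i], axis=1)
    return np.argsort(d)[:k]

def pca_normal(pts):
    # Normal = direction of the smallest singular value of the centered patch.
    c = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(c, full_matrices=False)
    return vt[-1]

def multiscale_normals(points, scales=(32, 64, 96, 128)):
    # One normal per point per scale (32*N neighbours, N = 1..4).
    n = np.zeros((len(points), len(scales), 3))
    for i in range(len(points)):
        for s, k in enumerate(scales):
            idx = knn_indices(points, i, min(k, len(points)))
            n[i, s] = pca_normal(points[idx])
    return n

def grow_plane(points, normals, seed, plane_n, angle_thresh_deg=15.0):
    # Accept a neighbour if its normal at ANY scale agrees with the plane.
    cos_t = np.cos(np.radians(angle_thresh_deg))
    region, frontier = {seed}, [seed]
    while frontier:
        i = frontier.pop()
        for j in knn_indices(points, i, 8):
            if j in region:
                continue
            if np.abs(normals[j] @ plane_n).max() >= cos_t:
                region.add(int(j))
                frontier.append(int(j))
    return region
```

On a noisy horizontal patch, the fine scales keep small details while the coarse scales smooth them away, so the any-scale test lets the region cover the whole plane despite local bumps.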
Planes. We prepare the scene by segmenting it into multiple stories with semantic planar shapes. We first detect a set of 3D planes for floors, walls, and ceilings. For plane detection, we modify [Rabbani et al. 2006] by estimating point normals at different scale levels with 32N (1 ≤ N ≤ 4) nearest neighborhoods. As a result, the normals preserve details at lower levels and are robust to noise at higher levels. Then, we grow the region if the point normals agree with the plane at any scale level. The scale-aware region growing better ignores non-planar details and robustly approximates curved surfaces with piece-wise planes (Fig. 3). We determine the semantics of 3D planes by voting over the semantic labels associated with each point, predicted by a sparse convolution network [Graham et al. 2018]. Finally, we segment the scene into multiple stories by 3D floor planes and assign detected planes to different stories so that each story can be modeled separately.

o is the cuboid center. θ is a 1D orientation assuming that the cuboid is aligned with the up vector. s records the width, height, and depth of the cuboid. l is the semantic label. t denotes the number of front/back faces that are occupied by points and should be modeled (Fig. 4). We modify FCAF3D [Rukhovich et al. 2021] to detect B. We generate pseudo point candidates at empty regions for wall planes, as shown in Fig. 5 (black points in bottom-left). Then, we send both the original point cloud and the generated points to FCAF3D, which improves the detection quality for open doorways/windows, as shown in Fig. 5 (bottom-right). We present details of pseudo point generation in the supplemental material.

4 ARRANGEMENTNET

In this section, we construct arrangements with ArrangementNet to organize detected primitives for each story. Sec. 4.1 describes the
4.1 Formulation

We first formulate floor plan reconstruction as a binary classification problem on an arrangement. A 2D geometric arrangement [Ron et al. 2022] is a subdivision of a 2D plane induced by geometric objects. By constraining the geometric objects to line segments, the arrangement is a partition of the plane by line segments into cells. Partitioned cells are named "segments" (1-dimensional) and "faces" (2-dimensional). The subdivision implies several relationships: adjacency among cells, co-linear relationships between adjacent segments that share the same input line segment, and co-face relationships between adjacent segments that share the same face. We define a graph on the arrangement in Eq. 2.

  G = {V_f, E, E_l, E_f}    (2)

V_f and E are the sets of faces and segments, where each segment connects the two faces at both sides of the segment. E_l and E_f represent the co-linear and co-face relationships between segments. Existing works [Fang et al. 2021; Turner and Zakhor 2014] notice that an arrangement is intrinsically a compact representation of a floor plan. By projecting the indoor scene onto the floor, we can describe the main structure as a subdivision of floor regions by walls as line segments. Our goal is to over-partition the horizontal plane to build an initial graph G, and design a network to classify whether to preserve faces in V_f as floors and segments in E as walls to form the floor plan.

4.2 Arrangement Initialization

We aim to initialize an arrangement G by over-partitioning the horizontal plane with wall segments. We project vertical wall planes onto the horizontal plane as 2D line segments and extend the line segments to intersect with each other to over-partition the 2D plane following [Fang et al. 2021], which aims to recover missing wall segments and intersection corners. In practice, we extrude both endpoints of each line segment by θ_d = 5 m and preserve the line segment between the furthest intersection points after the extrusion. Fig. 6(b) shows an example of initialized line segments, and Fig. 6(c) shows an example of extended line segments. Finally, we perform constrained Delaunay triangulation [Paul Chew 1989] to partition the regions (Fig. 6(d)). The triangulation can produce connections among endpoints to recover missing wall primitives from input scans. We illustrate an example where missing primitives are recovered by connecting endpoints with our algorithm in Fig. 7. We experiment on a large-scale scene dataset (Sec. 7.1) and find that the triangulation recovers 78% of segments by connecting endpoints.

(a) Projected points (b) Initial segments (c) Final segments
Fig. 7. Planes can be uncovered in scans. We recover missing planes in (b) by triangulation to connect existing primitive endpoints (green in (c)).

4.3 Arrangement Prediction

With an over-segmented floor plan arrangement from arrangement initialization, our goal is to select a subset of faces and segments to produce a compact floor plan with ArrangementNet. Our analysis
Fig. 8. (a) We propose the arrangement convolution, which consists of a node convolution, an edge convolution, and a link convolution on the arrangement graph, capturing relationships among all the arrangement elements. (b) Network architecture. We extract node and edge features by multiple layers of arrangement convolution and pass them through a 2-layer MLP to determine whether to preserve arrangement elements.
is on the graph of the arrangement (Eq. 2). Fig. 8(a) illustrates an example floor plan graph.

We design a GNN (Fig. 8(b)) as a neural network on top of the graph to select a subset of it as the final floor plan. In detail, we pass input signals associated with nodes and edges through six layers of arrangement convolutions to extract high-level features. The arrangement convolution is an extended version of graph convolution consisting of three operators: node convolution, edge convolution, and link convolution. These operators fully exploit the spatial structure of the arrangement. The node convolution intends to encode adjacency and pass messages from neighboring nodes through edges. Eq. 3 describes the node convolution at the n-th layer of arrangement convolution.

  h_v^{(n)} = \Phi^{(n)} \big( h_v^{(n-1)} + \sum_{(u,v) \in E} f_{e \to v}^{(n)} (h_{uv}^{(n-1)}) \cdot h_u^{(n-1)} \big)    (3)

We denote the node and edge features after the n-th convolution as h_v^{(n)} and h_{uv}^{(n)}. \Phi^{(n)} is a 2-layer MLP. f_{e \to v}^{(n)} is a 1D convolution that translates the edge feature into a square matrix, serving as a linear transformation to aggregate neighboring node features. Since each node is a triangle and thus always adjacent to three neighbors, we do not require additional normalization during convolution.

The edge convolution intends to aggregate adjacent node and edge features from the previous arrangement convolution (Eq. 4).

  \hat{h}_{uv}^{(n)} = \Phi_e^{(n)} \big( h_{uv}^{(n-1)} + g^{(n)} (h_u^{(n)} + h_v^{(n)}) \big)    (4)

\Phi_e^{(n)} is a 2-layer MLP for the edge convolution, and g^{(n)} is a 1D convolution that projects node features into the edge feature space.

The output of the edge convolution \hat{h}_{uv}^{(n)} is a temporary edge feature and is further processed by the link convolution to pass information through the co-linear and co-face paths offered by the arrangement. The link convolution produces the final edge feature of the arrangement convolution, as shown in Eq. 5. Note that the link convolution aggregates temporary edge features into final edge features and is not a standard GNN operator. Such a difference makes our final edge features better at capturing co-linear and co-face structures in arrangements than those in a standard GNN.

  h_{uv}^{(n)} = \Phi_l^{(n)} \big( \hat{h}_{uv}^{(n)} + \sum_{(u,v,w) \in E_l} g_l^{(n)} (\hat{h}_{vw}^{(n)}) \big) + \Phi_f^{(n)} \big( \hat{h}_{uv}^{(n)} + \sum_{(u,v,x) \in E_f} g_f^{(n)} (\hat{h}_{vx}^{(n)}) \big)    (5)

\hat{h}_{uv}^{(n)} is aggregated with its co-linear neighbors \hat{h}_{vw}^{(n)} via a 1D convolution g_l^{(n)} and passed through a 2-layer MLP \Phi_l^{(n)} to obtain the co-linear signal. The co-face signal is obtained similarly by aggregating \hat{h}_{vx}^{(n)} via g_f^{(n)} and passing through \Phi_f^{(n)}.

For input signals, we send a 5-dimensional input node feature to the first arrangement convolution as the concatenation of the center position, the area, the ratio of the face region occupied by the point cloud to that of the whole face, and a boundary indicator of the floor face. The center position is 2-dimensional and the other features are scalars. We set the boundary indicator to 1 if the face is adjacent to a wall segment. The edge feature is 7-dimensional, the concatenation of the center position and the ratios of the segment region occupied by scanned points at different height ranges to the whole segment. We compute ratios for five height ranges evenly split from [h_f, h_f + 2.5] (h_f is the floor height). Note that the face/segment ratios are important signals indicating whether an arrangement element is covered by the point cloud. These signals make floor plan reconstruction learnable from an overly segmented arrangement (Fig. 6(d)). The output of the final arrangement convolution is passed through two separate 2-layer MLPs to predict binary labels denoting whether to preserve certain arrangement elements as part of the final floor plan. We supervise the network with a binary cross-entropy (BCE) loss given ground-truth annotations of floor plans as subsets of the arrangement.

Since floor boundaries should practically be adjacent to wall edges, we rectify the network prediction by optimizing a binary graph cut [Boykov and Kolmogorov 2004; Boykov et al. 2001] to refine the floor labels.

  \min_{\{l_v\}} \sum_v w_v |l_v - s_v| + \sum_{l_u \neq l_v} w_{uv} (1 - s_{uv})    (6)
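The three operators of the arrangement convolution (Eqs. 3–5) can be sketched in NumPy. This is an illustrative simplification, not the released implementation: random weights stand in for trained parameters, the `ArrangementConv` class and its loop-based message passing are invented for the sketch, and the paper's 1D convolutions are reduced to plain linear maps.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # feature width (arbitrary for the sketch)

def mlp(d_in, d_out):
    # Stand-in for a 2-layer MLP: random weights with a ReLU in between.
    w1 = rng.normal(0, 0.1, (d_in, d_out))
    w2 = rng.normal(0, 0.1, (d_out, d_out))
    return lambda x: np.maximum(x @ w1, 0) @ w2

class ArrangementConv:
    """One layer: node conv (Eq. 3), edge conv (Eq. 4), link conv (Eq. 5)."""
    def __init__(self):
        self.phi, self.phi_e = mlp(D, D), mlp(D, D)
        self.phi_l, self.phi_f = mlp(D, D), mlp(D, D)
        self.f_ev = rng.normal(0, 0.1, (D, D * D))  # edge feature -> DxD matrix
        self.g = rng.normal(0, 0.1, (D, D))         # node -> edge projection
        self.g_l = rng.normal(0, 0.1, (D, D))       # co-linear projection
        self.g_f = rng.normal(0, 0.1, (D, D))       # co-face projection

    def __call__(self, h_v, h_e, edges, colinear, coface):
        # edges: (u, v, eid) triples; colinear/coface: directed (eid, eid2) links.
        # --- node convolution (Eq. 3): neighbours mixed by an edge-dependent matrix
        agg = np.zeros_like(h_v)
        for u, v, eid in edges:
            M = (h_e[eid] @ self.f_ev).reshape(D, D)
            agg[v] += h_v[u] @ M
            agg[u] += h_v[v] @ M
        h_v_new = self.phi(h_v + agg)
        # --- edge convolution (Eq. 4): fold the two endpoint node features in
        h_hat = np.zeros_like(h_e)
        for u, v, eid in edges:
            h_hat[eid] = h_e[eid] + (h_v_new[u] + h_v_new[v]) @ self.g
        h_hat = self.phi_e(h_hat)
        # --- link convolution (Eq. 5): add co-linear and co-face messages
        lin, fac = np.zeros_like(h_hat), np.zeros_like(h_hat)
        for e1, e2 in colinear:
            lin[e1] += h_hat[e2] @ self.g_l
        for e1, e2 in coface:
            fac[e1] += h_hat[e2] @ self.g_f
        h_e_new = self.phi_l(h_hat + lin) + self.phi_f(h_hat + fac)
        return h_v_new, h_e_new
```

Stacking six such layers and feeding the outputs to two classification MLPs mirrors the architecture of Fig. 8(b); the link step is what distinguishes this layer from a standard message-passing GNN.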
(a) Ceiling point cloud (3D) (b) Ceiling instances (2D) (c) Extended instances (d) Instance contour (e) Arrangement
Fig. 9. Ceiling arrangement reconstruction. (b) We raster initial primitive instance labels onto an image. (c) We expand regions to fill non-occupied pixels. (d) We combine floor plan boundaries and internal lines to form ceiling arrangements (e).

(a) Floor plan arrangement (b) Ceiling arrangement (c) Reuse boundary (d) Relationships: co-linear, co-face, embedding, reuse
Fig. 10. The enriched scene arrangements for each story include a floor plan and a ceiling arrangement. Boundary segments of the floor plan arrangement are reused in the ceiling arrangement. Edges of cuboids are either embedded into wall segments or merged into the floor plan arrangement.
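The embed-or-merge rule for cuboid edges in Fig. 10 can be illustrated with a small 2D sketch: a projected rectangle edge that lies close to a roughly parallel wall segment is embedded into it, and any other edge is marked for merging into the floor plan arrangement. The function names, thresholds, and nearest-parallel-segment heuristic here are assumptions for illustration; the actual pipeline operates on the arrangement itself.

```python
import numpy as np

def point_to_segment_dist(p, a, b):
    # Distance from point p to segment ab.
    ab, ap = b - a, p - a
    t = np.clip(ap @ ab / (ab @ ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def embed_or_merge(rect_edges, wall_segs, dist_tol=0.2, angle_tol_deg=10.0):
    """For each rectangle edge (p, q), either embed it into a nearby,
    roughly parallel wall segment or mark it for merging into the
    floor plan arrangement (thresholds are illustrative)."""
    out = []
    cos_tol = np.cos(np.radians(angle_tol_deg))
    for p, q in rect_edges:
        d_e = (q - p) / np.linalg.norm(q - p)
        best, best_d = None, dist_tol
        for k, (a, b) in enumerate(wall_segs):
            d_w = (b - a) / np.linalg.norm(b - a)
            if abs(d_e @ d_w) < cos_tol:  # not parallel enough to embed
                continue
            d = max(point_to_segment_dist(p, a, b),
                    point_to_segment_dist(q, a, b))
            if d <= best_d:
                best, best_d = k, d
        out.append(("embed", best) if best is not None else ("merge", None))
    return out
```

For example, a door rectangle whose long edge runs a few centimeters from a wall would be embedded into that wall, while a free-standing pillar's edges would be merged as new floor plan segments.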
[Figure panels: (a) Ceiling edges, (b) Floor plan edges; (a) BIM reconstruction]
Fig. 13. BIM Reconstruction by our approach from complex laser scans.
Finally, we compare our reconstruction with human modeling. We ask experienced artists to create BIM models from scratch according to the scanned point clouds in our large-scene dataset. Considering that each scene takes more than 20 hours to draw, we are surprised that our reconstruction is even better than the human-created models in terms of geometry accuracy (Tab. 1). This suggests that our pipeline is ready for automatic BIM production with sufficient quality. Further, another important metric for BIM modeling is the simplification rate (output face number divided by the number of points in the input point cloud). Tab. 2 compares our method with the outputs from human modeling and [Mehra et al. 2009] as a representative of mesh abstraction. Our simplification rate is significantly better than [Mehra et al. 2009] and is close to the BIM models created by artists.

7 EVALUATION

7.1 Datasets

We use several datasets to evaluate the performance of our method. To compare with existing state-of-the-art methods on floor plans, we follow [Stekovic et al. 2021] and use Structure3D [Zheng et al. 2020] and Floor-SP [Chen et al. 2019] (captured with commodity RGB-D sensors) to demonstrate the performance. Since these datasets focus only on small-scale rooms, we collect a large-scene dataset containing 54 buildings as multi-story offices using the NavVis scanner [NavVis 2022]. It provides high-quality point clouds aligned with images captured from RGB cameras. We show statistics of this dataset in Tab. 3 and input point clouds in Fig. 19(a). We annotate our dataset with semantics, door/window bounding boxes, and floor plans for
Table 3. Statistics of our large-scene dataset.
  # of scenes   # of storeys   # of rooms   # of points   area (m²)
  54            2.24           10.7         4.63×10⁷      3.74×10⁴

each story of the building. We use 40 scenes for training and the other 14 scenes for testing.

7.2 Floor Plans

We evaluate the quality of floor plan reconstruction in terms of connectivity and geometry accuracy. Connectivity accuracy can be measured using the metrics proposed in [Chen et al. 2019], which measure the precision and recall of predicted corners, edges, and room instances.

As shown in Tab. 4, we compare our approach with DP [Wu and Marquez 2003], Floor-SP [Chen et al. 2019], MonteFloor [Stekovic et al. 2021], and HEAT [Chen et al. 2021] on Structure3D, Floor-SP, and our own large-scene datasets. The metrics are directly borrowed from [Chen et al. 2019; Stekovic et al. 2021]. We report scores for methods if implementations are available or scores are available from the original paper. As a result, our method shows significant improvement on all datasets compared to the state-of-the-art. Since a high-quality arrangement initialization is easy to obtain on synthetic datasets, our performance is nearly perfect on Structure3D. According to the corner and angle metrics, we are especially good at estimating accurate corners and edges. Our floor plan does not directly segment rooms, but our room segmentation quality is still the best. Fig. 17 shows the reconstructed walls from the Structure3D and Floor-SP datasets. As a result, we faithfully reconstruct small details of wall structures.

Existing methods require room instance segmentation and fail on our challenging large-scene dataset. Fig. 18 shows that the floor plan cannot be reasonably recovered by Floor-SP but can be handled well by our approach. We investigate the internal reason and find that Mask-RCNN [He et al. 2017] fails to produce correct segmentations for such complex scenes. In contrast, we can accurately recover edges and corners. Although our prediction can be wrong for some inner walls and influence the room segmentation, the quality of our result is sufficient for BIM modeling.

[Fang et al. 2021] point out that more accurate geometry from space partition-based approaches is possible. Tab. 5 shows that our geometry accuracy is even better than [Fang et al. 2021] in terms of the CD and RMS metrics proposed by [Fang et al. 2021], attributed to both accurate primitives from geometry processing and robust connectivity prediction from the network.

7.3 Semantics, Planes and Cuboids

Semantics. We train [Graham et al. 2018] on 40 scenes in our large-scene dataset. We find that the trained model generalizes well to various scenes. Tab. 6 reports the mean IoU of the different semantics that we consider. The first row directly measures the prediction from the network, and the second row measures the results after assigning each point the semantics of its primitive. As expected, the point-level and primitive-level IoU are close to those on public datasets [Dai et al. 2017].

Planes. We detect planes using scale-aware region growing (Sec. 3). We report the mean distance between the original points and the fitted planes, the ratio of points covered by detected planes, and the number of
[Fig. 16 rows: Input scan; Bauchet et al.; Ours]
Fig. 16. We faithfully reconstruct main structures for complex multi-story buildings.
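The corner metric used in the comparison above follows the convention of [Chen et al. 2019; Stekovic et al. 2021]: a predicted corner counts as correct if it lies close to a not-yet-matched ground-truth corner. A minimal sketch of such an evaluation; the greedy matching strategy and the 10 cm radius are assumptions for illustration, not the papers' exact protocol:

```python
import numpy as np

def corner_precision_recall(pred, gt, thresh=0.1):
    """Greedily match predicted corners to ground-truth corners.

    pred, gt : sequences of 2D corner positions (meters).
    thresh   : match radius in meters (10 cm is an assumed value).
    Returns (precision, recall).
    """
    pred = np.asarray(pred, dtype=float).reshape(-1, 2)
    gt = np.asarray(gt, dtype=float).reshape(-1, 2)
    matched = set()  # indices of GT corners already claimed
    tp = 0
    for p in pred:
        if len(gt) == 0:
            break
        d = np.linalg.norm(gt - p, axis=1)
        d[list(matched)] = np.inf  # each GT corner may be matched once
        j = int(np.argmin(d))
        if d[j] <= thresh:
            matched.add(j)
            tp += 1
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gt), 1)
    return precision, recall
```

Precision divides the matches by the number of predictions; recall divides them by the number of ground-truth corners, so spurious and missing corners are penalized separately.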
Table 6. Semantic prediction performance at point and plane primitive level.

Semantics      Floor  Wall  Ceiling  Pillar  Door  Window
Pt. IoU (%)    95.6   88.1  94.2     78.1    68.0  62.5
Prim. IoU (%)  98.3   91.7  95.4     83.6    65.8  60.7

Table 7. Comparison of plane detection between the standard region growing [Rabbani et al. 2006] and our scale-aware extension.

                Dist. (cm)  Coverage (%)  # of planes
Region-growing  0.68        96.6          31553
Scale-aware     1.27        97.8          14241
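The two IoU rows of Tab. 6 differ only in whether labels are read per point or propagated from primitives. A sketch of both computations; the majority-vote propagation is an assumption about how primitive semantics could be assigned, not necessarily the paper's exact rule:

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Per-class intersection-over-union for point-wise labels."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        ious.append(inter / union if union else float("nan"))
    return ious

def primitive_vote(pred, prim_ids):
    """Replace each point's label by the majority label of its primitive."""
    out = pred.copy()
    for pid in np.unique(prim_ids):
        mask = prim_ids == pid
        out[mask] = np.bincount(pred[mask]).argmax()
    return out
```

In a toy example where one point of a primitive is mislabeled, the vote corrects it, which is consistent with the primitive-level IoU in Tab. 6 being higher for most classes.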
51:12 • Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi
[Figure panels: results on the Floor-SP dataset]
Fig. 19. Scanned point clouds, semantics, and primitive segmentation for our large scene dataset.
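The plane statistics of Tab. 7 (mean point-to-plane distance, point coverage, and plane count) can be computed from a primitive segmentation like the one shown in Fig. 19. A sketch, assuming each point stores the index of its supporting plane (-1 when uncovered):

```python
import numpy as np

def plane_metrics(points, labels, planes):
    """Tab. 7-style statistics for detected planes.

    points : (N, 3) scan points.
    labels : (N,) index of the supporting plane per point, -1 if uncovered.
    planes : list of (n, d) with unit normal n, so the plane is n . x + d = 0.
    Returns (mean point-to-plane distance, coverage ratio, #planes).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    residuals = []
    for i, (n, d) in enumerate(planes):
        pts = points[labels == i]
        if len(pts):
            # unsigned distance of each supporting point to its plane
            residuals.append(np.abs(pts @ np.asarray(n, dtype=float) + d))
    mean_dist = float(np.concatenate(residuals).mean()) if residuals else 0.0
    coverage = float(np.mean(labels >= 0))
    return mean_dist, coverage, len(planes)
```

Fewer planes with slightly larger residuals but better coverage, as in the scale-aware row of Tab. 7, is consistent with small noisy segments being merged into larger structural planes.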
To understand the behavior, we simulate different qualities of arrangement initialization on the large-scene dataset by perturbing the wall detection accuracy. Starting from a primitive IoU of 91.7% (Tab. 6), we simulate more wrong predictions and evaluate the room prediction accuracy in Tab. 11. The quality of the estimated floor plan decreases with the initialization quality. However, we find that the rate of decrease slows, indicating that ArrangementNet tends to compensate for errors in the initialization. Arrangement

8 CONCLUSION

We present ArrangementNet, an automatic pipeline to generate high-quality BIMs by learning floor plan arrangements with a graph neural network and enriching them to organize different semantic parts. We show significant improvement in floor plan and 3D vectorized scene reconstruction compared to the state-of-the-art. We observe two directions to improve our pipeline. First, we can cover more semantic elements such as beams, ladders, or windows/doors with more complex shapes. Second, we believe there is a potential
point clouds. IEEE Journal of Selected Topics in Applied Earth Observations and
Erick Delage, Honglak Lee, and Andrew Y Ng. 2006. A dynamic bayesian network model for autonomous 3d reconstruction from a single indoor image. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), Vol. 2. IEEE, 2418–2428.
Hao Fang and Florent Lafarge. 2020. Connect-and-Slice: an hybrid approach for reconstructing 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision
Photogrammetry and Remote Sensing 154 (2019), 127–138.
Chenxi Liu, Alexander G Schwing, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. 2015. Rent3d: Floor-plan priors for monocular layout estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3413–3421.
Chen Liu, Jiaye Wu, and Yasutaka Furukawa. 2018. Floornet: A unified framework for floorplan reconstruction from 3d scans. In Proceedings of the European conference on computer vision (ECCV). 201–217.
Chen Liu, Jiajun Wu, Pushmeet Kohli, and Yasutaka Furukawa. 2017. Raster-to-vector: Revisiting floorplan transformation. In Proceedings of the IEEE International Conference on Computer Vision. 2195–2203.
David Marshall, Gabor Lukacs, and Ralph Martin. 2001. Robust segmentation of primitives from range data in the presence of geometric degeneracy. IEEE Transactions on pattern analysis and machine intelligence 23, 3 (2001), 304–314.
Jiri Matas and Ondrej Chum. 2004. Randomized RANSAC with Td,d test. Image and vision computing 22, 10 (2004), 837–842.
Ravish Mehra, Qingnan Zhou, Jeremy Long, Alla Sheffer, Amy Gooch, and Niloy J Mitra. 2009. Abstraction of man-made shapes. In ACM SIGGRAPH Asia 2009 papers. 1–10.
Claudio Mura, Oliver Mattausch, and Renato Pajarola. 2016. Piecewise-planar reconstruction of multi-room interiors with arbitrary wall arrangements. In Computer Graphics Forum, Vol. 35. Wiley Online Library, 179–188.
Claudio Mura, Oliver Mattausch, Alberto Jaspe Villanueva, Enrico Gobbetti, and Renato Pajarola. 2014. Automatic room detection and reconstruction in cluttered indoor environments with complex room layouts. Computers & Graphics 44 (2014), 20–32.
Liangliang Nan and Peter Wonka. 2017. Polyfit: Polygonal surface reconstruction from point clouds. In Proceedings of the IEEE International Conference on Computer Vision. 2353–2361.
Nelson Nauata and Yasutaka Furukawa. 2019. Vectorizing world buildings: Planar graph reconstruction by primitive detection and relationship classification. arXiv preprint arXiv:1912.05135 1, 3 (2019).
NavVis. 2022. NavVis Scanner. https://www.navvis.com/.
Sebastian Ochmann, Richard Vock, and Reinhard Klein. 2019. Automatic reconstruction of fully volumetric 3D building models from oriented point clouds. ISPRS journal of photogrammetry and remote sensing 151 (2019), 251–262.
Sebastian Ochmann, Richard Vock, Raoul Wessel, and Reinhard Klein. 2016. Automatic reconstruction of parametric building models from indoor point clouds. Computers & Graphics 54 (2016), 94–103.
Sven Oesau, Florent Lafarge, and Pierre Alliez. 2014. Indoor scene reconstruction using feature sensitive primitive extraction and graph-cut. ISPRS journal of photogrammetry and remote sensing 90 (2014), 68–82.
L Paul Chew. 1989. Constrained delaunay triangulations. Algorithmica 4, 1 (1989), 97–108.
Ameya Phalak, Vijay Badrinarayanan, and Andrew Rabinovich. 2020. Scan2plan: efficient floorplan generation from 3d scans of indoor scenes. arXiv preprint arXiv:2003.07356 (2020).
Mattia Previtali, Marco Scaioni, Luigi Barazzetti, and Raffaella Brumana. 2014. A flexible methodology for outdoor/indoor building reconstruction from occluded point clouds. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2, 3 (2014), 119.
Yiming Qian and Yasutaka Furukawa. 2020. Learning pairwise inter-plane relations for piecewise planar reconstruction. In European Conference on Computer Vision. Springer, 330–345.
Tahir Rabbani, Frank Van Den Heuvel, and George Vosselmann. 2006. Segmentation of point clouds using smoothness constraint. International archives of photogrammetry, remote sensing and spatial information sciences 36, 5 (2006), 248–253.
Ron Wein, Eric Berberich, Efi Fogel, Dan Halperin, Michael Hemmer, Oren Salzman, and Baruch Zukerman. 2022. 2D Arrangements. https://doc.cgal.org/latest/Arrangement_on_surface_2/index.html.
Danila Rukhovich, Anna Vorontsova, and Anton Konushin. 2021. FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection. arXiv preprint arXiv:2112.00322 (2021).
David Salinas, Florent Lafarge, and Pierre Alliez. 2015. Structure-aware mesh decimation. In Computer Graphics Forum, Vol. 34. Wiley Online Library, 211–227.
Falko Schindler, Wolfgang Förstner, and Jan-Michael Frahm. 2011. Classification and reconstruction of surfaces from point clouds of man-made objects. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 257–263.
Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. 2018. Csgnet: Neural shape parser for constructive solid geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5515–5523.
Gopal Sharma, Difan Liu, Subhransu Maji, Evangelos Kalogerakis, Siddhartha Chaudhuri, and Radomír Měch. 2020. Parsenet: A parametric surface fitting network for 3d point clouds. In European Conference on Computer Vision. Springer, 261–276.
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European conference on computer vision. Springer, 746–760.
Sinisa Stekovic, Mahdi Rad, Friedrich Fraundorfer, and Vincent Lepetit. 2021. Montefloor: Extending mcts for reconstructing accurate large-scale floor plans. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16034–16043.
Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J. Guibas. 2019. KPConv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Philip HS Torr and Andrew Zisserman. 2000. MLESAC: A new robust estimator with application to estimating image geometry. Computer vision and image understanding 78, 1 (2000), 138–156.
H Tran and K Khoshelham. 2019. A stochastic approach to automated reconstruction of 3D models of interior spaces from point clouds. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 4 (2019), 299–306.
Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. 2017. Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2635–2643.
Eric Turner and Avideh Zakhor. 2014. Floor plan generation and room labeling of indoor environments from laser range data. In 2014 international conference on computer graphics theory and applications (GRAPP). IEEE, 1–12.
Marc Van Kreveld, Thijs Van Lankveld, and Remco C Veltkamp. 2011. On the shape of a set of points and lines in the plane. In Computer Graphics Forum, Vol. 30. Wiley Online Library, 1553–1562.
Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X Chang, and Manolis Savva. 2021. Plan2scene: Converting floorplans to 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10733–10742.
Cheng Wang, Shiwei Hou, Chenglu Wen, Zheng Gong, Qing Li, Xiaotian Sun, and Jonathan Li. 2018. Semantic line framework-based indoor building modeling using backpacked laser scanning point cloud. ISPRS journal of photogrammetry and remote sensing 143 (2018), 150–166.
Senyuan Wang, Guorong Cai, Ming Cheng, José Marcato Junior, Shangfeng Huang, Zongyue Wang, Songzhi Su, and Jonathan Li. 2020. Robust 3D reconstruction of building surfaces from point clouds based on structural and closed constraints. ISPRS Journal of Photogrammetry and Remote Sensing 170 (2020), 29–44.
Chenglu Wen, Yudi Dai, Yan Xia, Yuhan Lian, Jinbin Tan, Cheng Wang, and Jonathan Li. 2019. Toward efficient 3-D colored mapping in GPS-/GNSS-denied environments. IEEE Geoscience and Remote Sensing Letters 17, 1 (2019), 147–151.
S-T Wu and Mercedes Rocio Gonzales Marquez. 2003. A non-self-intersection Douglas-Peucker algorithm. In 16th Brazilian symposium on computer graphics and Image Processing (SIBGRAPI 2003). IEEE, 60–66.
Yifan Xu, Weijian Xu, David Cheung, and Zhuowen Tu. 2021. Line segment detection using transformers without edges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4257–4266.
Nan Xue, Tianfu Wu, Song Bai, Fudong Wang, Gui-Song Xia, Liangpei Zhang, and Philip HS Torr. 2020. Holistically-attracted wireframe parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2788–2797.
Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. 2020. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1790–1799.
Mulin Yu and Florent Lafarge. 2022. Finding Good Configurations of Planar Primitives in Unorganized Point Clouds. In CVPR 2022 - IEEE Conference on Computer Vision and Pattern Recognition.
Wenyuan Zhang, Zhixin Li, and Jie Shan. 2021. Optimal Model Fitting for Building Reconstruction From Point Clouds. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021), 9636–9650.
Ziheng Zhang, Zhengxin Li, Ning Bi, Jia Zheng, Jinlei Wang, Kun Huang, Weixin Luo, Yanyu Xu, and Shenghua Gao. 2019. Ppgnet: Learning point-pair graph for line segment detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7105–7114.
Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. 2020. Structured3d: A large photo-realistic dataset for structured 3d modeling. In European Conference on Computer Vision. Springer, 519–535.
Yichao Zhou, Haozhi Qi, and Yi Ma. 2019. End-to-end wireframe parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 962–971.
Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. 2018. Layoutnet: Reconstructing the 3d room layout from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2051–2059.
Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 2017. 3d-prnn: Generating shape primitives with recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 900–909.