Paper

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

ArrangementNet: Learning Scene Arrangements for Vectorized Indoor

Scene Modeling
JINGWEI HUANG, Huawei Technologies, China
SHANSHAN ZHANG, Huawei Technologies, China
BO DUAN, Huawei Technologies, China
YANFENG ZHANG, Huawei Technologies, China
XIAOYANG GUO, Huawei Technologies, China
MINGWEI SUN, Huawei Technologies and Wuhan University, China
LI YI, Tsinghua University, China

Ceiling

reuse
embed

Floor plan

embed
GNN
Planes Cuboids Initial Arrangements Floor Plan Arrangement Scene Arrangement

(a) Prepared Scene (b) ArrangementNet

(c) Vectorized Modeling (BIM)

Fig. 1. (a) We prepare the scene by detecting planes and cuboids as basic primitives. (b) we propose ArrangementNet to reconstruct floor plan arrangements
and enrich them to organize primitives with different semantics. (c) We generate the scene from enriched arrangements to obtain a BIM model.

Authors’ addresses: Jingwei Huang, huangjingwei6@huawei.com, Huawei Tech-


We present a novel vectorized indoor modeling approach that converts point
nologies, China; Shanshan Zhang, zhangshanshan15@huawei.com, Huawei Tech- clouds into building information models (BIM) with concise and semanti-
nologies, China; Bo Duan, duanbo5@huawei.com, Huawei Technologies, China; cally segmented polygonal meshes. Existing methods detect planar shapes
Yanfeng Zhang, zhangyanfeng8@huawei.com, Huawei Technologies, China; Xi- and connect them to complete the scene. Some focus on floor plan recon-
aoyang Guo, guoxiaoyang3@huawei.com, Huawei Technologies, China; Mingwei Sun,
sunmingwei2@huawei.com, Huawei Technologies and Wuhan University, China; Li Yi,
struction as a simplified problem to better analyze connectivity between
ericyi0124@gmail.com, Tsinghua University, China. planes of floors and walls. However, the connectivity analysis is still chal-
lenging and ill-posed with incomplete point clouds as input. We propose
Permission to make digital or hard copies of all or part of this work for personal or ArrangementNet to estimate scene arrangements from an incomplete point
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
cloud, which we can easily convert into a BIM model. ArrangementNet is
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
republish, to post on servers or to redistribute to lists, requires prior specific permission 0730-0301/2023/8-ART51 $15.00
and/or a fee. Request permissions from permissions@acm.org. https://doi.org/10.1145/3592122

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
51:2 • Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi

a novel graph neural network that consumes noisy over-partitioned initial well-solved in large and complex scenes, [Fang et al. 2021; Han
arrangements extracted through non-learning techniques and outputs high- et al. 2021] robustly construct an arrangement as a superset of the
quality scene arrangement. The core of ArrangementNet is an extended floor plan by over-partitioning the horizontal floor region with
graph convolution that leverages co-linear and co-face relationships in the detected wall planes. Then, they filter it to construct the floor plan
arrangement and improves the quality of prediction in complex scenes. We by solving energy minimization. Since a floor plan arrangement is
apply ArrangementNet to improve floor plan and ceiling arrangements and
insufficient to represent a 3D scene, we propose to enrich it as a
enrich them with semantic objects as scene arrangements for scene gen-
eration. Our approach faithfully models challenging scenes obtained from scene arrangement with an additional ceiling arrangement sharing
laser scans or multiview stereo and shows significant improvement in BIM boundary edges with the floor plan and other semantic objects as
model reconstruction compared to the state-of-the-art. Our code is available cuboids embedded into floor plan edges. Such a representation can
at https://github.com/zssjh/ArrangementNet. be elegantly converted into a vectorized 3D model.
CCS Concepts: • Computing methodologies → Mesh geometry models;
Our key idea is to formulate arrangement construction as a graph
Scene understanding. filtering problem so that we can leverage learning techniques to con-
vert over-partitioned arrangements [Fang et al. 2021] into compact
Additional Key Words and Phrases: Building information model (BIM), Ar-
scene arrangements with dramatically improved quality. We pro-
rangement, Graph neural network, Floor plan
pose ArrangementNet (Fig. 1(b)) as a graph neural network (GNN)
ACM Reference Format: to filter the over-partitioned arrangement by classifying whether to
Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, preserve or drop certain arrangement elements. Such a formulation
Mingwei Sun, and Li Yi. 2023. ArrangementNet: Learning Scene Arrange-
avoids challenging corner regression [Chen et al. 2019; Stekovic
ments for Vectorized Indoor Scene Modeling. ACM Trans. Graph. 42, 4,
et al. 2021] by converting connectivity analysis into a binary classi-
Article 51 (August 2023), 15 pages. https://doi.org/10.1145/3592122
fication problem, so that the network can learn from data effectively.
1 INTRODUCTION In detail, we model each arrangement face as a node of the graph,
and insert a graph edge for each arrangement edge connecting ad-
Vectorized reconstruction of point clouds is a fundamental problem
jacent faces at both sides of the edge. We additionally insert links
in computer graphics and vision communities. The ultimate goal is
between arrangement edges that are co-face or co-linear, and extend
to represent the scene using concise polygonal meshes, where the
the graph convolution to operate on arrangements and highlight
main structures1 are well-segmented according to the semantics.
these specific relationships. Such a graph fully captures the structure
Such compact representation for a building is called a building infor-
of the partitioned space based on the arrangement, and we argue
mation model (BIM) and is the foundation for real-time downstream
that our GNN jointly considers floor and wall regions by message
applications in gaming, civil engineering, and virtual/augmented
passing through such a graph. For example, the network tends to
reality. However, the reconstruction quality from existing solutions
preserve wall edges at floor boundaries or drop them where faces at
can hardly meet the standard of applications.
both sides are not floor regions. In other words, our GNN analyzes
Existing works suffer from incomplete point clouds as input. Lo-
connectivity based on messages passed through the arrangement
cal geometry simplification [Garland and Heckbert 1997; Salinas
in an end-to-end manner without the requirement of challenging
et al. 2015] requires smooth surfaces and is sensitive to noises and
room segmentation [Chen et al. 2019; Fang et al. 2021].
incompleteness, especially around transparent objects like windows.
Next, we enrich the predicted floor plans into scene arrangements
A popular direction is to vectorize point clouds into big planar
to model 3D scenes. We reuse floor plan boundaries and insert in-
shapes [Chen and Chen 2008; Huang et al. 2017; Schindler et al.
tersection lines among ceiling planes to build a ceiling arrangement.
2011; Van Kreveld et al. 2011] and complete the scene by extend-
We detect objects as cuboids and project each instance as a rectangle
ing planes to connect each other. However, connectivity analysis
in the horizontal plane. Each rectangle edge is either merged into
is a challenging problem to solve. A common geometric solution
the floor plan arrangement or embedded into a close wall edge. As
is through space partition [Bauchet and Lafarge 2020; Fang and
such, we obtain scene arrangements that fully describe connectivity
Lafarge 2020; Nan and Wonka 2017], which produces noisy planes
among various semantic parts. Different from [Bauchet and Lafarge
around small objects. Further, they assume watertight geometry and
2020; Han et al. 2021; Ikehata et al. 2015], we generate the 3D model
cannot preserve thin structures or openings. Floor plan reconstruc-
with multiple seamlessly assembled semantic parts.
tion [Chen et al. 2019, 2021; Liu et al. 2018, 2017; Stekovic et al. 2021;
We evaluate our approach concerning the overall 3D modeling
Xu et al. 2021] aims to simplify the problem and focus on better con-
and floor plan reconstruction. We deliver 3D models with better
nectivity analysis among floors and walls. However, these learning
quality than existing works (Sec. 6), and handle challenging set-
methods require accurate corner regression and corner connection
tings where point clouds are produced from the multiview stereo
prediction based on accurate room segmentation, both of which
algorithm. Further, our scene arrangement representation gener-
are extremely hard for large scenes. Further, 3D modeling requires
ates door openings and unscanned transparent windows, which
seamlessly assembling doors, windows, and ceilings to floor plans,
cannot be directly recovered using state-of-the-art solutions [Han
which is non-trivial but not well-studied.
et al. 2021; Ikehata et al. 2015]. Our automatic pipeline faithfully
Our key observation is that BIM modeling can be effectively
produces BIMs with concise geometry and correctly assembled se-
represented by an arrangement [Ron et al. 2022] as a 2D partition
mantic components. Sec. 7 shows that our floor plan prediction
of a horizontal plane by a set of edges. Since plane detection is
significantly outperforms the state-of-the-art methods on various
1 wall, floor, ceiling, door, and window datasets. While existing methods fail to detect corners and connect

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling • 51:3

them as walls, we correctly reconstruct most wall structures, es- geometric rules and cannot deliver clean structures under cluttered
pecially for large scenes. Our extended GNN convolution further scenes. More importantly, they assume the scene to be watertight,
improves performance by jointly considering floor faces and wall which does not fit real scans with thin structures or completely
edges. missed planar regions. [Fang and Lafarge 2020] exploit principles
Overall, our core research contributions are: from connectivity-based methods but still rely on the watertight
• A formulation for vectorized modeling by learning scene assumption. Our learning-based approach solves a binary classifica-
arrangements to analyze connectivity. tion problem, which resolves these issues and is not limited by the
• A novel ArrangementNet that learns to filter over-partitioned watertight assumption.
initial arrangements for floor plan and ceiling construction.
• An extended GNN convolution to fully capture the arrange- 2.2 2D Vectorized Geometry
ment structure. Primitive-based Partition. Partition-based method can be performed
• An automatic pipeline that significantly outperforms existing in 2D by projecting walls as line segments and floors as faces into
methods for floor plan and vectorized 3D modeling. a horizontal plane. Similar to space partition, [Li et al. 2020; Mura
et al. 2014; Ochmann et al. 2016b; Turner and Zakhor 2014] require
2 RELATED WORKS first performing room segmentation before analyzing the floor plan.
2.1 3D Plane Assembly [Fang et al. 2021] address this issue by separately reconstructing
outer boundaries and inner walls. However, room segmentation is it-
Plane Detection. Indoor reconstruction usually begins with the self a challenging problem to solve. We jointly learn to classify faces
detection of plane primitives. Among them, the most popular tradi- and edges as floor plans based on the arrangement in an end-to-end
tional methods are through Ransac [Chum and Matas 2005; Fischler manner without explicit room segmentation.
and Bolles 1981; Kang and Li 2015; Matas and Chum 2004; Torr and
Zisserman 2000] and region growing [Marshall et al. 2001; Rabbani Image-based Understanding. Floor plan reconstruction is also stud-
et al. 2006]. Recently, primitive fitting is further addressed by su- ied in the imaging communities given images as inputs. Traditional
pervised [Huang et al. 2021; Li et al. 2019b; Sharma et al. 2020; Zou methods are proposed to produce wireframes [Furukawa et al. 2009;
et al. 2017] and unsupervised [Fang et al. 2018; Sharma et al. 2018; Silberman et al. 2012], room layouts [Delage et al. 2006; Hedau
Tulsiani et al. 2017] networks. We adopt a modified version of [Rab- et al. 2009; Izadinia et al. 2017] or floor plans [Cabral and Furukawa
bani et al. 2006] and find it sufficiently robust for plane detection. 2014]. [Liu et al. 2015; Vidanapathirana et al. 2021] further augment
[Yu and Lafarge 2022] focus on plane detection and also deserve to floor plans into textured meshes. Neural networks are introduced
follow. to replace the basic primitive detection modules to produce cor-
ners, edges, or regions [Hu et al. 2021; Liu et al. 2018, 2017; Qian
Plane Connectivity. After plane detection, [Chen and Chen 2008; and Furukawa 2020; Zou et al. 2018]. [Chen et al. 2019; Nauata
Huang et al. 2017; Schindler et al. 2011; Van Kreveld et al. 2011] solve and Furukawa 2019; Phalak et al. 2020] detect room instances using
the reconstruction problem by computing an adjacency graph and Mask-RCNN [He et al. 2017] and recover floor plan via post optimiza-
extracting edges, corners, and faces based on plane affinity. How- tion or Monte Carlo Tree Search [Stekovic et al. 2021]. Therefore,
ever, pure geometric rules are not sufficient to describe the scene, their performance depends on room segmentation and is hard to
and they often yield connectivity errors and produce incomplete generalize well to novel scenes. [Xue et al. 2020; Zhang et al. 2019;
models. While the topology of surface reconstruction can guide Zhou et al. 2019] propose to parse 3D wireframes in an end-to-end
the connectivity analysis [Holzmann et al. 2018; Mehra et al. 2009] manner by jointly predicting junctions and their connections. Re-
to alleviate this problem, the surface reconstruction itself suffers cently, [Chen et al. 2021; Xu et al. 2021] introduce transformer-based
from large incompletion. We train the ArrangementNet to learn object detection [Carion et al. 2020] into wireframe parsing. They
the connectivity from data instead of analysis based on low-level project point clouds into a horizontal plane as a density image to
geometry information. predict floor plans. However, we observe that it is more robust and
Space Partition. Vectorized building reconstruction can also be straightforward to directly fit planes from 3D point clouds since it
handled by space partition based on detected primitives. A subset correctly detects wall candidates in large and complex scenes. Our
of edges or faces from the partition is selected based on clustering network is built upon an arrangement initialized with these wall
or energy minimization. [Ochmann et al. 2019, 2016a; Zhang et al. candidates and can better focus on connectivity analysis.
2021] extend and intersect detected wall edges and perform room
labeling to optimize the structural topology. Similarly, [Cui et al. 2.3 Vectorized 3D Modeling
2019; Li et al. 2019a; Oesau et al. 2014; Previtali et al. 2014; Tran An important direction for indoor modeling is to decompose the
and Khoshelham 2019; Wang et al. 2018, 2020] also rely on robust problem into subproblems for reconstructing different semantic el-
room segmentation. [Boulch et al. 2014; Mura et al. 2016; Nan and ements. [Ikehata et al. 2015] propose a grammar that consists of
Wonka 2017] adopt a general 3D pipeline where plane primitives rooms, structural details, objects, and room connections via doors.
are directly used to slice 3D space into convex polyhedra. Such a [Han et al. 2021] follow a similar philosophy and separately recon-
formulation can handle complex geometries like sloping ceiling struct multi-plane ceilings, floors, and walls with structural details.
planes. [Bauchet and Lafarge 2020] further accelerate the algorithm However, these representations are not unified while energy min-
with a novel kinetic structure. However, they are based on pure imization is required, where the performance highly depends on

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
51:4 • Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi

Fig. 2. Framework overview. We prepare the scene by detecting basic primitives (planes and cuboids) and segmenting the point cloud into multiple stories
(Sec. 3). Then, we propose ArrangementNet to construct and enrich arrangements to organize primitives with different semantics. Finally, we generate the 3D
model from enriched arrangements (Sec. 5).

t=0 t=1 t=2 Door Window

(a) Input point cloud (b) Region growing (c) Scale-aware (ours)

Fig. 3. Scale-aware region growing for plane detection. (b) [Rabbani et al.
2006] segment ceilings into several primitives inside blue and yellow ellipses.
(c) Point normals at different scale levels help ignore non-planar details
(blue and yellow) and preserve structural curved surfaces (red) on walls.

the scan quality. We propose to use scene arrangements to organize


different semantics. Such a representation allows us to seamlessly
assemble doors, windows, and ceilings with floor plans to model 3D
scenes. Fig. 4. Different point occupation status for front/back faces of cuboids
for doors and windows. 𝑡 denotes the number of front/back faces that are
occupied by points and should be modeled (Fig. 4).
3 SCENE PREPARATION
Our indoor reconstruction system consists of mainly three stages:
scene preparation, arrangement construction, and 3D model gen- Cuboids. We represent doors and windows by detecting bounding
eration from arrangements, as shown in Fig. 2. In this section, we boxes as cuboids with several attributes.
preprocess the scene by detecting basic primitives necessary for
arrangement construction. B = {o, 𝜃, s, 𝑙, 𝑡 } (1)

Planes. We prepare the scene by segmenting it into multiple sto- o is the cuboid center. 𝜃 is a 1D orientation assuming that the
ries with semantic planar shapes. We first detect a set of 3D planes cuboid is aligned with the up vector. s records the width, height, and
for floors, walls, and ceilings. For plane detection, we modify [Rab- depth of the cuboid. 𝑙 is the semantic label. 𝑡 denotes the number
bani et al. 2006] by estimating point normals for different scale of front/back faces that are occupied by points and should be mod-
levels with 32𝑁 (1 ≤ 𝑁 ≤ 4) nearest neighborhoods. As a result, the eled (Fig. 4). We modify FCAF3D [Rukhovich et al. 2021] to detect
normals preserve details for lower levels and are robust to noises B. We generate pseudo point candidates at empty regions for wall
at higher levels. Then, we grow the region if point normals agree planes as shown in Fig. 5 (black points in bottom-left). Then, we
with the plane at any scale level. The scale-aware region growing send both the original point cloud and generated points to FCAF3D,
better ignores non-planar details and robustly approximate curved which improves the detection quality for open doorways/windows
surfaces with piece-wise planes (Fig. 3). We determine the semantics as shown in Fig. 5 (bottom-right). We present details for pseudo
of 3D planes by voting semantic labels associated to each point pre- points generation in the supplemental material.
dicted by a sparse convolution network [Graham et al. 2018]. Finally,
we segment the scene into multiple stories by 3D floor planes and 4 ARRANGEMENTNET
assign detected planes to different stories so that each story can be In this section, we construct arrangements with ArrangementNet
modeled separately. to organize detected primitives for each story. Sec. 4.1 describes the

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
Completed Scan Original Scan ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling • 51:5

(a) Projected points (b) Initial lines

Input point cloud Active points Object Detection

Fig. 5. Existing detection approaches cannot recognize openings since very


few active points can be extracted to vote bounding boxes (first row). We (c) Extended lines (d) Arrangement
generate pseudo points (bottom-left) to recover these boxes (bottom-right).
Fig. 6. Initialization of floor plan arrangement. (b) We project detected wall
planes as initial lines. (c) We extend lines to intersect with each other. (d)
scene representation with arrangements. We initialize an arrange- We perform constrained Delaunay triangulation to form the arrangement.
ment (Sec. 4.2) and learn ArrangementNet as a graph neural network
to filter the initial arrangement and preserve the floor plan (Sec. 4.3).
The network converts the connectivity analysis into a binary clas-
sification problem and learns from data effectively. We aggregate
doors/windows/ceilings into the filtered floor plan arrangements to
represent the scene in Sec. 4.4.

4.1 Formulation
We first formulate the floor plan reconstruction as a binary classifi-
cation problem on an arrangement. 2D geometric arrangement[Ron
et al. 2022] is a subdivision of a 2D plane induced by geometric
objects. By constraining geometric objects to line segments, the
arrangement is a plane partition by line segments into cells. Parti-
tioned cells are named as “segments”(1-dimensional) and “faces”(2-
dimensional). The subdivision implies several relationships: adja-
cency among cells, co-linear relationships between adjacent seg- (a) Projected points (b) Initial segments (c) Final segments
ments that share the same input line segments, and co-face rela-
tionships between adjacent segments that share the same face. We Fig. 7. Planes can be uncovered in scans. We recover missing planes in (b)
define a graph on the arrangement in Eq. 2. by triangulation to connect existing primitive endpoints (green in (c)).
G = {V 𝑓 , E, E𝑙 , E 𝑓 } (2)
V𝑓 and E are sets of faces and segments, where each segment intersection corners. In practice, we extrude both endpoints of each
connects two faces at both sides of the segment. E𝑙 and E 𝑓 represent line segment by 𝜃𝑑 = 5𝑚 and preserve the line segment between the
the co-linear and co-face relationship between segments. Existing furthest intersection points after the extrusion. Fig. 6(b) shows an
works [Fang et al. 2021; Turner and Zakhor 2014] notice that an example of initialized line segments, and Fig. 6(c) shows an example
arrangement is intrinsically a compact representation of a floor plan. of extended line segments. Finally, we perform constrained Delau-
By projecting the indoor scene to the floor, we can describe the main nay triangulation [Paul Chew 1989] to partition the regions (Fig. 6
structure as a subdivision of floor regions by walls as line segments. (d)). The triangulation can produce connections among endpoints
Our goal is to over-partition the horizontal plane to build an initial to recover missing wall primitives from input scans. We illustrate
graph G, and design a network to classify whether to preserve faces an example where missing primitives are recovered by connecting
in V 𝑓 as floors and segments in E as walls to form the floor plan. endpoints using our algorithm in Fig. 7. We experiment on a large-
scale scene dataset (Sec. 7.1) and find that the triangulation recovers
4.2 Arrangement Initialization 78% of segments by connecting endpoints.
We aim to initialize an arrangement G by over-partitioning the hori-
zontal plane with wall segments. We project vertical wall planes into 4.3 Arrangement Prediction
the horizontal plane as 2D line segments and extend line segments With an over-segmented floor plan arrangement from arrangement
to intersect with each other to over-partition the 2D plane following initialization, our goal is to select a subset of faces and segments to
[Fang et al. 2021], which aims to recover missing wall segments and produce a compact floor plan with ArrangementNet. Our analysis

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
51:6 • Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi

Line Segment

Arrangement Conv (32,32)(32,32)

Arrangement Conv (32,32)(32,32)


Cells ( )

Arrangement Conv (5,7)(32,32)


Floor
2-Layer MLP Prediction
Edges ( ) Input Cell (32, 32, 2)
Feature
Edges ( ) connecting cells
x5
…… Supervision BCE
Edges’ co-face links ( ) Ground Truth
Loss
Edges’ co-linear links ( )
Wall
Input Segment 2-Layer MLP Prediction
Node Convolution Edge Convolution Link Convolution Feature (32, 32, 2)
(Eq. 3) (Eq. 4) (Eq. 5)

(a) Arrangement Convolution (b) ArrangementNet

Fig. 8. (a) We propose arrangement convolution that consists of node convolution, edge convolution and link convolution on the arrangement graph, which
captures relationships among all the arrangement elements. (b)Network architecture. We extract nodes and edges features by a multilayer arrangement
convolution and pass them through 2-layer MLP to determine whether to preserve arrangement elements.

is on the graph of the arrangement (Eq. 2). Fig. 8(a) illustrates an arrangements than that in standard GNN.
example floor plan graph. (𝑛) (𝑛) (𝑛)
∑︁
(𝑛) (𝑛)
We design a GNN (Fig. 8(b)) as a neural network on top of the h𝑢𝑣 = Φ𝑙 ( ĥ𝑢𝑣 + 𝑔𝑙 ( ĥ𝑣𝑤 ))
graph to select its subset as the final floor plan. In detail, we pass (𝑢,𝑣,𝑤 ) ∈ E𝑙
(5)
(𝑛) (𝑛) (𝑛) (𝑛)
∑︁
input signals associated with nodes and edges through six layers + Φ𝑓 ( ĥ𝑢𝑣 + 𝑔𝑓 ( ĥ𝑣𝑥 ))
of arrangement convolutions to extract high-level features. The ar- (𝑢,𝑣,𝑥 ) ∈ E 𝑓
rangement convolution is an extended version of graph convolution
(𝑛) (𝑛)
consisting of three operators as node convolution, edge convolu- h𝑢𝑣 are aggregated with colinear neighbors ĥ𝑣𝑤 ) via a 1D con-
(𝑛) (𝑛)
tion, and link convolution. These operators fully exploit the spatial volution 𝑔𝑙 and passed through a 2-layer MLP Φ𝑙 to obtain
structure of the arrangement. Node convolution intends to encode the colinear signal. The co-face signal can be obtained similarly by
adjacency and pass messages from neighboring nodes through edges. (𝑛) (𝑛)
aggregating ĥ𝑣𝑥 ) via 𝑔 𝑓 and through Φ 𝑓 .
(𝑛)
Eq. 3 describes the node convolution at the 𝑛-th layer of arrangement
For input signals, we send a 5-dimensional input node feature to
convolution.
the first arrangement convolution as the concatenation of the center
(𝑛) (𝑛−1) (𝑛) (𝑛−1) (𝑛−1)
∑︁
h𝑣 = Φ (𝑛) (h𝑣 + 𝑓𝑒→𝑣 (h𝑢𝑣 ) · h𝑢 ) (3) position, area, the ratio of occupied face region by point cloud to
(𝑢,𝑣) ∈ E that of the whole face, and a boundary indicator of the floor face.
The center position is 2-dimensional and other features are scalars.
We denote node and edge features after the 𝑛-th convolution as We set the boundary indicator as 1 if the face is adjacent to a wall
(𝑛) (𝑛) (𝑛)
h𝑣 and h𝑢𝑣 . Φ (𝑛) is a 2-layer MLP. 𝑓𝑒→𝑣 is a 1D convolution that segment. The edge feature is 7-dimensional as the concatenation
translates the edge feature into a square matrix, serving as a linear of the center position and the ratios of occupied segment region by
transformation to aggregate neighboring node features. Since each scanned points at different height ranges to the whole segment. We
node is a triangle and constantly adjacent to three neighbors, we do compute ratios for five height ranges evenly split from [ℎ 𝑓 , ℎ 𝑓 + 2.5]
not require additional normalization during convolution. (ℎ 𝑓 is the floor height). Note that face/segment ratios are important
The edge convolution intends to aggregate adjacent node and signals to indicate whether an arrangement element is covered
edge features from the previous arrangement convolution (Eq. 4). by the point cloud. These signals make floor plan reconstruction
learnable from an overly segmented arrangement (Fig. 6(d)). The
(𝑛) (𝑛) (𝑛−1) (𝑛) (𝑛)
ĥ𝑢𝑣 = Φ𝑒 (h𝑢𝑣 + 𝑔 (𝑛) (h𝑢 + h𝑣 )) (4) output from the final arrangement convolution is passed through
two separate 2-layer MLPs to predict binary labels denoting whether
(𝑛)
Φ𝑒 is a 2-layer MLP for edge convolution, and 𝑔 is a 1D convolution to preserve certain arrangement elements as part of the final floor
that projects node feature to edge feature space. plan. We supervise the network with a binary cross-entropy loss
(𝑛)
The output of edge convolution ĥ𝑢𝑣 is a temporary edge feature (BCE) given ground truth annotation of floor plans as a subset of
and is further processed by link convolution to pass information the arrangement.
through co-linear and co-face paths offered by arrangement. The Since floor boundaries should practically be adjacent to the wall
link convolution produces the final edge feature of the arrangement edge, we rectify the network prediction by optimizing a binary
convolution, as shown in Eq. 5. Note that the link convolution graphcut [Boykov and Kolmogorov 2004; Boykov et al. 2001] to
aggregates temporary edge features into final edge features and is refine the floor labels.
∑︁ ∑︁
not a standard GNN operator. Such a difference makes our final minimize 𝑤 𝑣 |𝑙 𝑣 − 𝑠 𝑣 | + 𝑤𝑢𝑣 (1 − 𝑠𝑢𝑣 ) (6)
edge feature better at capturing co-linear and co-face structures in {𝑙 𝑣 }
𝑖 𝑙𝑢 ≠𝑙 𝑣

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling • 51:7

embedding

embedding
(a) Ceiling point cloud (3D) (b) Ceiling instances (2D) (a) Floor plan Arrangement (b) Ceiling Arrangement

Ceiling ↔ : co-linear
↔ : co-face
↔ : embedding
↔ : reuse
(c) Extended instances (d) Instance contour (e) Arrangement
Floor
(c) Reuse Boundary (d) Relationships
Fig. 9. Ceiling arrangement reconstruction. (b) We raster initial primitive
instance labels onto an image. (c) We expand regions to fill non-occupied
pixels. (d) We combine floor plan boundaries and internal lines to form Fig. 10. The enriched scene arrangements for each story include a floor plan
ceiling arrangements (e). and a ceiling arrangement. Boundary segments of the floor plan arrange-
ment are reused in the ceiling arrangement. Edges of cuboids are either
embedded into wall segments or merged into the floor plan arrangement.

𝑠 𝑣 and 𝑠𝑢𝑣 are network prediction scores of nodes and segments


ranging from 0 to 1 as the likelihood to keep the cell. For the first boundary. Fig. 9(d) visualizes floor plan boundaries in blue and in-
term, we penalize the target node label 𝑙 𝑣 if it is inconsistent with ternal segments in red. Fig. 9(e) shows the final ceiling arrangement.
the network prediction 𝑠 𝑣 . For the second term, penalize floor face To determine the 3D geometry, we assign each face of the arrange-
boundaries (𝑙𝑢 ≠ 𝑙 𝑣 ) if the network prediction 𝑠𝑢𝑣 rejects it as wall ment with a 3D plane parameter of a detected ceiling plane instance
segments. 𝑤 𝑣 denotes the area of the face. 𝑤𝑢𝑣 denotes the length by solving a multi-label graph cut[Boykov and Kolmogorov 2004;
of the segment. After optimizing the final floor regions, we deter- Boykov et al. 2001] similar to [Han et al. 2021].
mine an segment as a boundary wall if it is at the floor boundary. Fig. 10 illustrates the enriched scene arrangements. More im-
Otherwise, we keep it as an inner segment if it is adjacent to at least plementation details are discussed in the supplemental material.
one floor face and the prediction score is above 0.5.

4.4 Arrangement Enrichment 5 SCENE GENERATION


We enrich the learned arrangement as scene arrangements to seam- We aim to generate the final 3D scene from the enriched arrange-
lessly assemble doors, windows, and ceilings to the floor plan. ment. We convert each face or segment of the arrangement to 3D
polygons based on the following rules.
Cuboids. We project detected cuboids (Sec. 3) as rectangles on the
Arrangement faces. We generate a 3D polygon for each arrange-
horizontal plane. To correctly embed cuboids into walls, we identify
ment face given its 3D plane parameter and cell boundaries. Since we
whether an edge of a rectangle is close to any wall segment. If so,
represent the ceiling as a hybrid of segments and faces, we need to
we append the rectangle edge to the “embedding” list of the wall
model a vertical ceiling plane from each ceiling segment to connect
segment to record the embedding relationship between the cuboid
non-vertical ceiling planes at both sides of the segment.
and the wall. Otherwise, we directly merge the rectangle edge of a
rectangle into the floor plan arrangement. Ceiling segments. For each ceiling segment, we determine the
height value of its endpoints given the 3D plane parameters associ-
Ceilings. For a ceiling point cloud (Fig. 9(a)), we obtain its piece- ated with its two adjacent faces. Therefore, we produce four vertices
wise planar shape instances as described in Sec. 3. We rasterize the and form a trapezoid as a vertical ceiling plane. Fig. 11(a) illustrates
ceiling plane instance labels onto a horizontal plane as an image an example of reconstructed ceilings. Blue and red regions are gen-
(Fig. 9(b)). We expand the regions using a breadth-first search to fill erated by the ceiling arrangement faces and segments, respectively.
non-occupied pixels (Fig. 9(c)). We generate ceiling segments for
the arrangement by computing the intersection between each pair
of adjacent plane instances. For degenerated cases at boundaries Floor plan segments. It is more complicated to model arrangement
or where adjacent 3D planes are nearly parallel (angles between segments of the floor plan since we need to consider the embedding
normal vectors are smaller than 20◦ ), we adopt a Douglas-Peucker relationship and the open/close status of doors and windows. Notice
polygonization [Wu and Marquez 2003] to vectorize the boundary that the embedding list of a wall segment can contain multiple
in the image space between adjacent primitives. Floor plan bound- overlapped door/window edges. To solve this problem, we need to
ary segments are reused in the ceiling arrangement as the ceiling split the wall segment in both horizontal and vertical directions.

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
51:8 • Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi

(a) Ceiling edges (b) Floor plan edges (a) BIM reconstruction

Fig. 11. Illustration of scene generation from arrangement segments. (a)


Vertical ceiling planes (red) are extruded from ceiling arrangement segments.
(b) Wall, door, and window are generated from floor plan segments.

We split each wall segment at every embedded door/window edge


endpoint or intersection point with the ceiling arrangement, as
shown in (𝑙 1 to 𝑙 8 in Fig. 11(b)). We sort the upper and lower bounds
of each embedded segment and generate rectangles starting from
the bottom floor (𝑟 0 and 𝑟 1 in Fig. 11(b)). Notice that the top region
is adjacent to the ceiling, where we generate a trapezoid instead (𝑟 2
in Fig. 11(b)).
As a result, Fig. 12 shows that we faithfully recover the geometry
(b) Input scan (c) Main structures
of main structures in a unified BIM model.
Fig. 12. Enriched arrangements offer sufficient information for BIM genera-
6 RESULTS
tion. Main structures are faithfully modeled in the final model.
6.1 BIM Reconstruction
High-quality BIM. We compare our method with [Bauchet and
Lafarge 2020] as a recent state-of-the-art for BIM reconstruction using multiview stereo. While all network modules are trained us-
on a large-scene dataset (Sec. 7.1). Figure 16 visualizes BIM recon- ing laser scans, it generalizes well to MVS-based point clouds. We
struction from laser scans using [Bauchet and Lafarge 2020] or our believe the core reason is that our enriched scene arrangement cor-
approach rendered from outside and inside. [Bauchet and Lafarge rectly organizes complex semantic parts and relationships, and the
2020] provides a single model (shown in gray). Since we reconstruct ArrangementNet successfully learns connectivity rules.
the scene with well-segmented semantic parts, we visualize each
Generalization to novel scenes. We further test our pipeline on
semantic label with a different material. As a result, our method
publicly available data, including two indoor scenes captured by
produces better reconstruction with richer semantics information.
lidar from [Knapitsch et al. 2017] and the colored scan from [Wen
Salient artifacts from [Bauchet and Lafarge 2020] are marked with
et al. 2019]. Fig. 15 shows our holistic reconstruction of these avail-
red rectangles. [Bauchet and Lafarge 2020] can sometimes discard
able indoor scenes. As a result, different semantic parts are correctly
a whole room, while our approach handles these cases well. From
recovered and assembled, and our enriched scene arrangement pro-
inside, [Bauchet and Lafarge 2020] fails to recover door openings.
duces reasonable reconstruction.
It also misses or wrongly adds big wall structures. In addition, it
In sum, our pipeline aims to cover different sources of point
is not semantic-aware and preserves unwanted elements. All the
clouds including multiview stereo, commodity RGB-D sensors, and
above issues are addressed decently by our approach. As a result,
high-end laser scanners. Our solution aims to cover various indoor
we faithfully recover details of structural objects and produce BIM
scenes from small but cluttered residences to large-scale and com-
models at the standard of LOD 300. We show more reconstruction
plex indoor environments.
results of challenging laser scans in Fig. 13.
Robust to MVS. Beyond high-quality laser scans, our method han- 6.2 Overall statistics
dles point clouds from poor-quality multiview stereo. To provide We run all our experiments with a 24-threads 3.0GHz CPU with
the input, we use a modified version of [Gu et al. 2020] that fits one GeForce RTX 3090 GPU. We analyze the running time for all
panoramas trained on BlendedMVS [Yao et al. 2020] and our large- scenes in our large-scene dataset. The expected running time for
scene dataset. Fig. 14 shows the holistic reconstruction of large our pipeline for every 100𝑚 2 is 1.4 second. We show detailed per-
scenes using our pipeline given densely reconstructed point clouds formance for each stage in the supplemental.

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling • 51:9

Fig. 13. BIM Reconstruction by our approach from complex laser scans.

Table 1. Comparison with human-created BIM models. Our simplification


rate is significantly better than [Mehra et al. 2009] and is close to BIM
models created by artists.
Input scan Bauchet et al. Ours
Floor Wall Ceiling
Dist (cm) Human Ours Human Ours Human Ours Fig. 14. Our scene arrangements can handle MVS-based point clouds as
<5 71 90 70 77 15 46 input.
<10 85 95 80 84 25 60
< 20 90 97 85 90 37 72 Table 2. Simplification rate comparison.
Mean Dist (cm) 4.5 1.5 3.3 1.3 42.6 7.9
Method [Mehra et al. 2009] Ours Human
Simp. rate (%) 0.30 0.018 0.016

7 EVALUATION
Finally, we compare our reconstruction with human modeling. We
ask experienced artists to create BIM models from scratch according 7.1 Datasets
to scanned point clouds in our large-scene dataset. Considering that We use several datasets to evaluate the performance of our method.
each scene takes more than 20 hours to draw, we are surprised that To compare with existing state-of-the-art methods on floor plans, we
our reconstruction is even better than human-created models ac- follow [Stekovic et al. 2021] and use Structure3D [Zheng et al. 2020]
cording to geometry accuracy (Tab. 1). It suggests that our pipeline is and Floor-SP [Chen et al. 2019] (captured with commodity RGB-D
ready for automatic BIM production with sufficient quality. Further, sensors) to demonstrate the performance. Since these datasets focus
another important metric for BIM modeling is the simplification only on small-scale rooms, we collect a large-scene dataset contain-
rate (output face number divided by the number of points in the ing 54 buildings as multi-story offices using NavVis scanner [NavVis
input point cloud). Tab. 2 compares our method with outputs from 2022]. It provides high-quality point clouds aligned with images
human modeling and [Mehra et al. 2009] as a representative for captured from RGB cameras. We show statistics of this dataset in
mesh abstraction. Our simplification rate is significantly better than Tab. 3 and input point clouds in Fig. 19(a). We annotate our dataset
[Mehra et al. 2009] and is close to BIM models created by artists. with semantics, door/window bounding boxes, and floor plans for

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
51:10 • Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi

Table 4. Evaluation on connectivity accuracy for floor plan reconstruction.


Church

Room Corner Angle MA


Prec Rec Prec Rec Prec Rec Prec Rec
Structure3D
DP 0.93 0.94 0.74 0.79 0.49 0.52 0.72 0.75
Meeting room

Floor-SP 0.89 0.88 0.81 0.73 0.80 0.72 0.83 0.78


MonteFloor 0.96 0.94 0.89 0.77 0.86 0.75 0.90 0.82
HEAT 0.97 0.94 0.82 0.83 0.78 0.79 0.86 0.85
Ours 0.99 0.99 0.97 0.99 0.96 0.98 0.97 0.98
Floor-SP
Floor-SP 0.85 0.83 0.72 0.58 0.65 0.52 0.74 0.64
MonteFloor 0.88 0.85 0.78 0.63 0.68 0.54 0.78 0.67
MiMAP Colored

Ours 0.87 0.94 0.85 0.86 0.83 0.85 0.85 0.88


Large-scene
Floor-SP 0.34 0.18 0.21 0.02 0.11 0.01 0.22 0.07
Ours 0.88 0.81 0.82 0.83 0.80 0.81 0.83 0.82

Table 5. Evaluation on geometry accuracy for floor plan reconstruction


Input scan BIM reconstruction
in Floor-SP dataset. Our results outperform DP [Wu and Marquez 2003],
ASIP [Li et al. 2020], Floor-SP [Chen et al. 2019], and SP [Fang et al. 2021].
Fig. 15. Our scene arrangement correctly organizes complex semantic parts
and relationships, and successfully processes novel scenes from [Knapitsch
Method DP ASIP Floor-SP SP Ours
et al. 2017] and [Wen et al. 2019].
RMS 0.184 0.187 0.160 0.138 0.102
CD 0.201 0.193 0.172 0.147 0.117
Table 3. Statistics of our large scene dataset.

# of scenes # of storeys # of rooms # of points area (𝑚 2 ) Existing methods require room instance segmentation and fail on
54 2.24 10.7 4.63 ×107 3.74 ×104 our challenging large-scene dataset. Fig. 18 shows that the floor plan
cannot be reasonably recovered by Floor-SP but can be handled well
using our approach. We investigate the internal reason and find that
each story of the building. We use 40 scenes for training and the Mask-RCNN [He et al. 2017] fails to produce correct segmentation
other 14 scenes for testing. for such complex scenes. In contrast, we can accurately recover
edges and corners. Although our prediction can be wrong for some
7.2 Floor Plans
inner walls and influence the room segmentation, the quality of our
We evaluate the quality of floor plan reconstruction from the as- result is sufficient for BIM modeling.
pects of connectivity and geometry accuracy. Connectivity accuracy [Fang et al. 2021] points out that more accurate geometry from
can be measured using metrics proposed in [Chen et al. 2019] that space partition-based approaches is possible. Tab. 5 shows that our
measure the precision and recall of predicted corners, edges, and geometry accuracy is even better than [Fang et al. 2021] in terms
room instances. of CD and RMS metrics proposed by [Fang et al. 2021], attributing
As shown in Tab. 4, we compare our approach with DP [Wu and to both accurate primitive from geometry processing and robust
Marquez 2003], Floor-SP [Chen et al. 2019], MonteFloor [Stekovic connectivity prediction from the network.
et al. 2021] and HEAT [Chen et al. 2021] on Structure3D, Floor-SP,
and our own large-scene datasets. The metrics are directly borrowed 7.3 Semantics, Planes and Cuboids
from [Chen et al. 2019; Stekovic et al. 2021]. We report scores for Semantics. We train [Graham et al. 2018] on 40 scenes in our
methods if implementations are available or scores are available large scene dataset. We find that the trained model generalizes
from the original paper. As a result, our method shows significant well to various scenes. Tab. 6 reports the mean IoU of different
improvement in all datasets compared to the state-of-the-art. Since semantics that we consider. The first row directly measures the
the high-quality arrangement initialization is easy to obtain on prediction from the network, and the second row measures the
synthetic datasets, our performance is nearly perfect on Structure3D. results by assigning each point the primitive semantics. As expected,
According to the metric of corner and angle, we are especially good the performance of point-level and primitive-level IoU is close to
at estimating accurate corners and edges. Our floor plan does not that in public datasets [Dai et al. 2017].
directly segment rooms, but the room segmentation quality is still
the best. Fig. 17 shows the reconstructed walls from Structure3D Planes. We detect planes using scale-aware region growing (Sec. 4.2).
and Floor-SP datasets. As a result, we faithfully reconstruct small We report the mean distance between original points to fitted planes,
details of wall structures. the ratio of points covered by detected planes, and the number of

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
Input scan
Bauchet et al.
Ours
Input scan
Bauchet et al.
Ours ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling • 51:11

Fig. 16. We faithfully reconstruct main structures for complex multi-story buildings.

Table 6. Semantic prediction performance at point and plane primitive level. Table 7. Comparison of plane detection between the standard region grow-
ing[Rabbani et al. 2006] and that with our scale-aware extension.
Semantics Floor Wall Ceiling Pillar Door Window
Pt. IoU (%) 95.6 88.1 94.2 78.1 68.0 62.5 Dist. (cm) Coverage (%) # of planes
Prim. IoU (%) 98.3 91.7 95.4 83.6 65.8 60.7 Region-growing 0.68 96.6 31553
Scale-aware 1.27 97.8 14241

detected planes in Tab. 7. As a result, the fitting error and coverage


ratio remain almost the same with our modification to region grow- our complete-and-detect approach. Tab. 8 evaluate several metrics
ing [Rabbani et al. 2006], while our number of required planes is including mAP@0.25 and mAP@0.5 (from [Rukhovich et al. 2021])
much smaller. and mean angle error (𝜖𝜃 ). Scores are collected based on different
semantics and three different open/close surface statuses(discussed
Cuboids. We train the door and window objects using the 40 in Sec. 3). We significantly improve the object detection at different
training scenes in our large-scene dataset. We evaluate the perfor- open/close surface statuses. Further, the smaller 𝑡 is, the contribu-
mance of FCAF3D [Rukhovich et al. 2021] and its extension with tion of completion is more salient, indicating that completion is

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
51:12 •
Floor-SP Dataset Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi

Input Floor-SP Ours Ground truth


Structured3D Dataset

Input Floor-SP MonteFloor Ours Ground truth

Fig. 17. Visualization of floor plan reconstruction from different methods


on Floor-SP dataset.

Table 8. Evaluation on object detection according to different semantics


and levels of emptiness.
Input scan Floor-SP Ours Ground truth
FCAF3D/Ours mAP@0.25 mAP@0.5 𝜖𝜃◦
Window 27.63/78.08 14.87/54.06 38.21/- Fig. 18. Visualization of floor plan reconstruction on our large-scene dataset.
Door 56.74/74.33 36.54/52.78 21.24/-
𝑡 =0 0.08/61.10 0.01/45.88 55.24/- Table 9. Running time and memory consumption.
𝑡 =1 50.44/85.53 26.85/61.27 23.65/-
𝑡 =2 76.04/82.12 48.25/53.10 10.28/- # of points Time CPU Mem GPU Mem
Floor-SP 6.5×105 18s 2.12 GB 2.48GB
Structured3D 5.8×106 175s 4.84 GB 2.61GB
critical for openings detection. We find the angle error from the Large-scene 2.7×107 589s 8.45GB 3.01GB
original prediction is large but can be decently reduced by aligning
objects to wall structures. Fig. 20 shows qualitative results, where we
correctly detect the two door openings with the correct orientation. 7.5 Ablation Studies and Limitation Discussion
7.4 Performance Profile Arrangment convolution. We perform an ablation study on Floor-
SP dataset to investigate the influence of the newly-proposed ar-
We evaluate the end-to-end running time and memory (including
rangement convolution (Eq. 5) to the performance in Tab. 10. As a
CPU and GPU) statistics of our algorithm on three datasets of differ-
result, our arrangement convolution improves the floor plan predic-
ent scales in Table. 9. We notice that the processing time increases
tion, since it exploits co-linear and co-face relationships as additional
linearly with the increase in point number. The CPU memory con-
information offered by the special arrangement graph. As a result,
sumption increases slower and the usage of GPU memory is not
link convolution improves the quality of floor plan estimation.
related to the point number but to the density of points due to
chunking. It costs 16.4GB CPU memory and 2.78GB GPU memory Influence from arrangement Initialization. It is obvious that ar-
for the largest scene in our dataset with 7.6×107 points. As a result, rangement initialization (Sec. 4.2) has influences on the final floor
our method shows good scalability to process scans in the large plan quality. As shown in Tab. 4, high-quality initialization from
scale. synthetic data leads to nearly perfect reconstruction. To further

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling • 51:13

(a) Point cloud (b) Semantic Prediction (c) Primitive Detection

Fig. 19. Scanned point clouds, semantics, and primitive segmentation for our large scene dataset.

Table 11. Ablation study on influence from arrangement initialization.

Wall Prim. IoU (%) 91.7 85.0 80.0 75.0


Room Prec. 0.88 0.84 0.83 0.82

initialization can also be influenced by the quality of semantic seg-


mentation. We replace SparseConvNet [Graham et al. 2018] with
RandLA-Net [Hu et al. 2019] and KPConv [Thomas et al. 2019]. The
primitive IOU decreases from 91.7% to 91.1% and 90.8%, while the
room precision remains 88.0%, suggesting that we are robust to
(a) Input scan (b) FCAF3D (c) Ours (d) Ground truth
different choices of semantic segmentation algorithms.
Fig. 20. Visualization of detected objects using FCAF3D and our extension. Sensor noises and density. To understand how the noises and den-
By completing and aligning bounding boxes to walls, we improve the quality sity of the input point cloud influence the BIM reconstruction, we
of object detection. pick a scene from the large-scene dataset and simulate different
levels of noise and density. Fig. 21 shows the BIM reconstruction
Table 10. Ablation study on the influence of link convolution. results from these simulated scans. Our floorplan and ceiling struc-
ture is more robust than Floor-SP [Chen et al. 2019] to point density
Room Corner Angle MA and noises. We observe that semantic prediction causes incorrect
Prec Rec Prec Rec Prec Rec Prec Rec reconstruction when 𝑁 = 10cm (Fig. 21(c)). Door and window de-
Without Eq. 5 0.81 0.84 0.83 0.82 0.77 0.79 0.80 0.82 tection is more sensitive to both noises and density, but could be
With Eq. 5 0.87 0.94 0.85 0.86 0.83 0.85 0.85 0.88 potentially alleviated by adding noises in training examples.

8 CONCLUSION
understand the behavior, we simulate different qualities of arrange- We present ArrangementNet, an automatic pipeline to generate high-
ment initialization on the large-scene dataset by perturbing the wall quality BIMs by learning floor plan arrangements using a graph
detection accuracy. Starting from primitive IoU as 91.7% (Tab. 6), neural network and enriching them to organize different semantic
we simulate more wrong predictions and evaluate the room predic- parts. We show significant improvement in floor plans and 3D vec-
tion accuracy in Tab. 11. The quality of the estimated floor plan is torized scene reconstruction compared to the state-of-the-art. We
decreasing with the initialization quality. However, we find that the observe two directions to improve our pipeline. First, we can cover
rate of descent is slowing, indicating that ArrangementNet tends more semantic elements such as beams, ladders, or windows/doors
to compensate for the errors from the initialization. Arrangement with more complex shapes. Second, we believe there is a potential

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
51:14 • Jingwei Huang, Shanshan Zhang, Bo Duan, Yanfeng Zhang, Xiaoyang Guo, Mingwei Sun, and Li Yi

point clouds. IEEE Journal of Selected Topics in Applied Earth Observations and
Point cloud

Remote Sensing 12, 8 (2019), 3117–3130.


Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and
Matthias Nießner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor
scenes. In Proceedings of the IEEE conference on computer vision and pattern recogni-
tion. 5828–5839.
Floor-SP

Erick Delage, Honglak Lee, and Andrew Y Ng. 2006. A dynamic bayesian network
model for autonomous 3d reconstruction from a single indoor image. In 2006 IEEE
computer society conference on computer vision and pattern recognition (CVPR’06),
Vol. 2. IEEE, 2418–2428.
Hao Fang and Florent Lafarge. 2020. Connect-and-Slice: an hybrid approach for recon-
structing 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision
Ours

and Pattern Recognition. 13490–13498.


Hao Fang, Florent Lafarge, and Mathieu Desbrun. 2018. Planar shape detection at
(a) N = 0cm D=2cm (b) N = 8cm D=2cm (c) N = 10cm D=2cm structural scales. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2965–2973.
Hao Fang, Florent Lafarge, Cihui Pan, and Hui Huang. 2021. Floorplan generation from
3D point clouds: A space partitioning approach. ISPRS Journal of Photogrammetry
Point cloud

and Remote Sensing 175 (2021), 44–55.


Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a paradigm
for model fitting with applications to image analysis and automated cartography.
Commun. ACM 24, 6 (1981), 381–395.
Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. 2009.
Floor-SP

Manhattan-world stereo. In 2009 IEEE Conference on Computer Vision and Pattern


Recognition. IEEE, 1422–1429.
Michael Garland and Paul S Heckbert. 1997. Surface simplification using quadric error
metrics. In Proceedings of the 24th annual conference on Computer graphics and
interactive techniques. 209–216.
Benjamin Graham, Martin Engelcke, and Laurens Van Der Maaten. 2018. 3d semantic
Ours

segmentation with submanifold sparse convolutional networks. In Proceedings of


the IEEE conference on computer vision and pattern recognition. 9224–9232.
Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. 2020.
(d) N = 4cm D=4cm (e) N = 4cm D=16cm (f) N = 4cm D=32cm Cascade cost volume for high-resolution multi-view stereo and stereo matching. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2495–2504.
Fig. 21. BIM Reconstruction results under different levels of noises and Jiali Han, Mengqi Rong, Hanqing Jiang, Hongmin Liu, and Shuhan Shen. 2021. Vector-
density. We use 𝑁 to represent the deviation of gaussian noises added to ized indoor surface reconstruction from 3D point cloud with multistep 2D optimiza-
the scan and 𝐷 to represent the average space of the point. tion. ISPRS Journal of Photogrammetry and Remote Sensing 177 (2021), 57–74.
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In
Proceedings of the IEEE international conference on computer vision. 2961–2969.
Varsha Hedau, Derek Hoiem, and David Forsyth. 2009. Recovering the spatial layout of
to introduce learning-based methods for texture generation based cluttered rooms. In 2009 IEEE 12th international conference on computer vision. IEEE,
1849–1856.
on arrangements. Thomas Holzmann, Michael Maurer, Friedrich Fraundorfer, and Horst Bischof. 2018.
Semantically aware urban 3d reconstruction with plane-based regularization. In
REFERENCES Proceedings of the European Conference on Computer Vision (ECCV). 468–483.
Qingyong Hu, Bo Yang, Linhai Xie, Andrew Markham, Stefano Rosa, Yulan Guo, Zhihua
Jean-Philippe Bauchet and Florent Lafarge. 2020. Kinetic shape reconstruction. ACM Wang, and Niki Trigoni. 2019. RandLA-Net: Efficient Semantic Segmentation of
Transactions on Graphics (TOG) 39, 5 (2020), 1–14. Large-Scale Point Clouds. Cornell University - arXiv (Nov 2019).
Alexandre Boulch, Martin de La Gorce, and Renaud Marlet. 2014. Piecewise-planar 3D Zhihua Hu, Bo Duan, Yanfeng Zhang, Mingwei Sun, and Jingwei Huang. 2021. MVLay-
reconstruction with edge and corner regularization. In Computer Graphics Forum, outNet: 3D layout reconstruction with multi-view panoramas. arXiv preprint
Vol. 33. Wiley Online Library, 55–64. arXiv:2112.06133 (2021).
Yuri Boykov and Vladimir Kolmogorov. 2004. An experimental comparison of min- Jingwei Huang, Angela Dai, Leonidas J Guibas, and Matthias Nießner. 2017. 3Dlite:
cut/max-flow algorithms for energy minimization in vision. IEEE transactions on towards commodity 3D scanning for content creation. ACM Trans. Graph. 36, 6
pattern analysis and machine intelligence 26, 9 (2004), 1124–1137. (2017), 203–1.
Yuri Boykov, Olga Veksler, and Ramin Zabih. 2001. Fast approximate energy minimiza- Jingwei Huang, Yanfeng Zhang, and Mingwei Sun. 2021. PrimitiveNet: Primitive
tion via graph cuts. IEEE Transactions on pattern analysis and machine intelligence Instance Segmentation with Local Primitive Embedding under Adversarial Metric.
23, 11 (2001), 1222–1239. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15343–
Ricardo Cabral and Yasutaka Furukawa. 2014. Piecewise planar and compact floorplan 15353.
reconstruction from images. In 2014 IEEE Conference on Computer Vision and Pattern Satoshi Ikehata, Hang Yang, and Yasutaka Furukawa. 2015. Structured indoor modeling.
Recognition. IEEE, 628–635. In Proceedings of the IEEE international conference on computer vision. 1323–1331.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Hamid Izadinia, Qi Shan, and Steven M Seitz. 2017. Im2cad. In Proceedings of the IEEE
and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In conference on computer vision and pattern recognition. 5134–5143.
European conference on computer vision. Springer, 213–229. Zhizhong Kang and Zhen Li. 2015. Primitive fitting based on the efficient multiBaySAC
Jie Chen and Baoquan Chen. 2008. Architectural modeling from sparsely scanned range algorithm. PloS one 10, 3 (2015), e0117341.
data. International Journal of Computer Vision 78, 2 (2008), 223–236. Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. 2017. Tanks and
Jiacheng Chen, Chen Liu, Jiaye Wu, and Yasutaka Furukawa. 2019. Floor-sp: Inverse cad temples: Benchmarking large-scale scene reconstruction. ACM Transactions on
for floorplans by sequential room-wise shortest path. In Proceedings of the IEEE/CVF Graphics (ToG) 36, 4 (2017), 1–13.
International Conference on Computer Vision. 2661–2670. Lingxiao Li, Minhyuk Sung, Anastasia Dubrovina, Li Yi, and Leonidas J Guibas. 2019b.
Jiacheng Chen, Yiming Qian, and Yasutaka Furukawa. 2021. HEAT: Holistic Edge At- Supervised fitting of geometric primitives to 3d point clouds. In Proceedings of the
tention Transformer for Structured Reconstruction. arXiv preprint arXiv:2111.15143 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2652–2660.
(2021). Muxingzi Li, Florent Lafarge, and Renaud Marlet. 2020. Approximating shapes in
Ondrej Chum and Jiri Matas. 2005. Matching with PROSAC-progressive sample con- images with low-complexity polygons. In Proceedings of the IEEE/CVF Conference on
sensus. In 2005 IEEE computer society conference on computer vision and pattern Computer Vision and Pattern Recognition. 8633–8641.
recognition (CVPR’05), Vol. 1. IEEE, 220–226. Minglei Li, Franz Rottensteiner, and Christian Heipke. 2019a. Modelling of build-
Yang Cui, Qingquan Li, Bisheng Yang, Wen Xiao, Chi Chen, and Zhen Dong. 2019. ings from aerial LiDAR point clouds using TINs and label maps. ISPRS Journal of
Automatic 3-D reconstruction of indoor environment with mobile laser scanning

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.
ArrangementNet: Learning Scene Arrangements for Vectorized Indoor Scene Modeling • 51:15

Photogrammetry and Remote Sensing 154 (2019), 127–138. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor
Chenxi Liu, Alexander G Schwing, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. segmentation and support inference from rgbd images. In European conference on
2015. Rent3d: Floor-plan priors for monocular layout estimation. In Proceedings of computer vision. Springer, 746–760.
the IEEE conference on computer vision and pattern recognition. 3413–3421. Sinisa Stekovic, Mahdi Rad, Friedrich Fraundorfer, and Vincent Lepetit. 2021. Monte-
Chen Liu, Jiaye Wu, and Yasutaka Furukawa. 2018. Floornet: A unified framework for floor: Extending mcts for reconstructing accurate large-scale floor plans. In Proceed-
floorplan reconstruction from 3d scans. In Proceedings of the European conference on ings of the IEEE/CVF International Conference on Computer Vision. 16034–16043.
computer vision (ECCV). 201–217. Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois
Chen Liu, Jiajun Wu, Pushmeet Kohli, and Yasutaka Furukawa. 2017. Raster-to-vector: Goulette, and Leonidas J. Guibas. 2019. KPConv: Flexible and Deformable Convo-
Revisiting floorplan transformation. In Proceedings of the IEEE International Confer- lution for Point Clouds. In Proceedings of the IEEE/CVF International Conference on
ence on Computer Vision. 2195–2203. Computer Vision (ICCV).
David Marshall, Gabor Lukacs, and Ralph Martin. 2001. Robust segmentation of primi- Philip HS Torr and Andrew Zisserman. 2000. MLESAC: A new robust estimator with
tives from range data in the presence of geometric degeneracy. IEEE Transactions application to estimating image geometry. Computer vision and image understanding
on pattern analysis and machine intelligence 23, 3 (2001), 304–314. 78, 1 (2000), 138–156.
Jiri Matas and Ondrej Chum. 2004. Randomized RANSAC with Td, d test. Image and H Tran and K Khoshelham. 2019. A stochastic approach to automated reconstruction of
vision computing 22, 10 (2004), 837–842. 3D models of interior spaces from point clouds. ISPRS Annals of the Photogrammetry,
Ravish Mehra, Qingnan Zhou, Jeremy Long, Alla Sheffer, Amy Gooch, and Niloy J Remote Sensing and Spatial Information Sciences 4 (2019), 299–306.
Mitra. 2009. Abstraction of man-made shapes. In ACM SIGGRAPH Asia 2009 papers. Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. 2017.
1–10. Learning shape abstractions by assembling volumetric primitives. In Proceedings of
Claudio Mura, Oliver Mattausch, and Renato Pajarola. 2016. Piecewise-planar recon- the IEEE Conference on Computer Vision and Pattern Recognition. 2635–2643.
struction of multi-room interiors with arbitrary wall arrangements. In Computer Eric Turner and Avideh Zakhor. 2014. Floor plan generation and room labeling of indoor
Graphics Forum, Vol. 35. Wiley Online Library, 179–188. environments from laser range data. In 2014 international conference on computer
Claudio Mura, Oliver Mattausch, Alberto Jaspe Villanueva, Enrico Gobbetti, and Renato graphics theory and applications (GRAPP). IEEE, 1–12.
Pajarola. 2014. Automatic room detection and reconstruction in cluttered indoor Marc Van Kreveld, Thijs Van Lankveld, and Remco C Veltkamp. 2011. On the shape of
environments with complex room layouts. Computers & Graphics 44 (2014), 20–32. a set of points and lines in the plane. In Computer Graphics Forum, Vol. 30. Wiley
Liangliang Nan and Peter Wonka. 2017. Polyfit: Polygonal surface reconstruction from Online Library, 1553–1562.
point clouds. In Proceedings of the IEEE International Conference on Computer Vision. Madhawa Vidanapathirana, Qirui Wu, Yasutaka Furukawa, Angel X Chang, and Manolis
2353–2361. Savva. 2021. Plan2scene: Converting floorplans to 3d scenes. In Proceedings of the
Nelson Nauata and Yasutaka Furukawa. 2019. Vectorizing world buildings: Planar IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10733–10742.
graph reconstruction by primitive detection and relationship classification. arXiv Cheng Wang, Shiwei Hou, Chenglu Wen, Zheng Gong, Qing Li, Xiaotian Sun, and
preprint arXiv:1912.05135 1, 3 (2019). Jonathan Li. 2018. Semantic line framework-based indoor building modeling using
NavVis. 2022. NavVis Scanner. https://www.navvis.com/. backpacked laser scanning point cloud. ISPRS journal of photogrammetry and remote
Sebastian Ochmann, Richard Vock, and Reinhard Klein. 2019. Automatic reconstruction sensing 143 (2018), 150–166.
of fully volumetric 3D building models from oriented point clouds. ISPRS journal of Senyuan Wang, Guorong Cai, Ming Cheng, José Marcato Junior, Shangfeng Huang,
photogrammetry and remote sensing 151 (2019), 251–262. Zongyue Wang, Songzhi Su, and Jonathan Li. 2020. Robust 3D reconstruction of
Sebastian Ochmann, Richard Vock, Raoul Wessel, and Reinhard Klein. 2016a. Automatic building surfaces from point clouds based on structural and closed constraints. ISPRS
reconstruction of parametric building models from indoor point clouds. Computers Journal of Photogrammetry and Remote Sensing 170 (2020), 29–44.
& Graphics 54 (2016), 94–103. Chenglu Wen, Yudi Dai, Yan Xia, Yuhan Lian, Jinbin Tan, Cheng Wang, and Jonathan
Sebastian Ochmann, Richard Vock, Raoul Wessel, and Reinhard Klein. 2016b. Automatic Li. 2019. Toward efficient 3-D colored mapping in GPS-/GNSS-denied environments.
reconstruction of parametric building models from indoor point clouds. Computers IEEE Geoscience and Remote Sensing Letters 17, 1 (2019), 147–151.
& Graphics 54 (2016), 94–103. S-T Wu and MERCEDES ROCio GONZALES Marquez. 2003. A non-self-intersection
Sven Oesau, Florent Lafarge, and Pierre Alliez. 2014. Indoor scene reconstruction using Douglas-Peucker algorithm. In 16th Brazilian symposium on computer graphics and
feature sensitive primitive extraction and graph-cut. ISPRS journal of photogramme- Image Processing (SIBGRAPI 2003). IEEE, 60–66.
try and remote sensing 90 (2014), 68–82. Yifan Xu, Weijian Xu, David Cheung, and Zhuowen Tu. 2021. Line segment detection
L Paul Chew. 1989. Constrained delaunay triangulations. Algorithmica 4, 1 (1989), using transformers without edges. In Proceedings of the IEEE/CVF Conference on
97–108. Computer Vision and Pattern Recognition. 4257–4266.
Ameya Phalak, Vijay Badrinarayanan, and Andrew Rabinovich. 2020. Scan2plan: Nan Xue, Tianfu Wu, Song Bai, Fudong Wang, Gui-Song Xia, Liangpei Zhang, and
efficient floorplan generation from 3d scans of indoor scenes. arXiv preprint Philip HS Torr. 2020. Holistically-attracted wireframe parsing. In Proceedings of the
arXiv:2003.07356 (2020). IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2788–2797.
Mattia Previtali, Marco Scaioni, Luigi Barazzetti, and Raffaella Brumana. 2014. A flexible Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and
methodology for outdoor/indoor building reconstruction from occluded point clouds. Long Quan. 2020. Blendedmvs: A large-scale dataset for generalized multi-view
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and
2, 3 (2014), 119. Pattern Recognition. 1790–1799.
Yiming Qian and Yasutaka Furukawa. 2020. Learning pairwise inter-plane relations Mulin Yu and Florent Lafarge. 2022. Finding Good Configurations of Planar Primitives
for piecewise planar reconstruction. In European Conference on Computer Vision. in Unorganized Point Clouds. In CVPR 2022-IEEE Conference on Computer Vision
Springer, 330–345. and Pattern Recognition.
Tahir Rabbani, Frank Van Den Heuvel, and George Vosselmann. 2006. Segmentation of Wenyuan Zhang, Zhixin Li, and Jie Shan. 2021. Optimal Model Fitting for Building
point clouds using smoothness constraint. International archives of photogrammetry, Reconstruction From Point Clouds. IEEE Journal of Selected Topics in Applied Earth
remote sensing and spatial information sciences 36, 5 (2006), 248–253. Observations and Remote Sensing 14 (2021), 9636–9650.
Wein Ron, Eric Berberich, Fogel Efi, Halperin Dan, Hemmer Michael, Salzman Oren, Ziheng Zhang, Zhengxin Li, Ning Bi, Jia Zheng, Jinlei Wang, Kun Huang, Weixin Luo,
and Zukerman Baruch. 2022. 2D Arrangements. https://doc.cgal.org/latest/ Yanyu Xu, and Shenghua Gao. 2019. Ppgnet: Learning point-pair graph for line
Arrangement_on_surface_2/index.html. segment detection. In Proceedings of the IEEE/CVF Conference on Computer Vision
Danila Rukhovich, Anna Vorontsova, and Anton Konushin. 2021. FCAF3D: Fully and Pattern Recognition. 7105–7114.
Convolutional Anchor-Free 3D Object Detection. arXiv preprint arXiv:2112.00322 Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. 2020.
(2021). Structured3d: A large photo-realistic dataset for structured 3d modeling. In European
David Salinas, Florent Lafarge, and Pierre Alliez. 2015. Structure-aware mesh decima- Conference on Computer Vision. Springer, 519–535.
tion. In Computer Graphics Forum, Vol. 34. Wiley Online Library, 211–227. Yichao Zhou, Haozhi Qi, and Yi Ma. 2019. End-to-end wireframe parsing. In Proceedings
Falko Schindler, Wolfgang Wörstner, and Jan-Michael Frahm. 2011. Classification and of the IEEE/CVF International Conference on Computer Vision. 962–971.
reconstruction of surfaces from point clouds of man-made objects. In 2011 IEEE Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. 2018. Layoutnet: Reconstruct-
International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, ing the 3d room layout from a single rgb image. In Proceedings of the IEEE Conference
257–263. on Computer Vision and Pattern Recognition. 2051–2059.
Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 2017. 3d-
2018. Csgnet: Neural shape parser for constructive solid geometry. In Proceedings of prnn: Generating shape primitives with recurrent neural networks. In Proceedings
the IEEE Conference on Computer Vision and Pattern Recognition. 5515–5523. of the IEEE International Conference on Computer Vision. 900–909.
Gopal Sharma, Difan Liu, Subhransu Maji, Evangelos Kalogerakis, Siddhartha Chaud-
huri, and Radomír Měch. 2020. Parsenet: A parametric surface fitting network for
3d point clouds. In European Conference on Computer Vision. Springer, 261–276.

ACM Trans. Graph., Vol. 42, No. 4, Article 51. Publication date: August 2023.

You might also like