Automation in Construction 136 (2022) 104169

Contents lists available at ScienceDirect

Automation in Construction
journal homepage: www.elsevier.com/locate/autcon

Review

Applications of natural language processing in construction


Yuexiong Ding a,b, Jie Ma c, Xiaowei Luo a,b,*
a Department of Architecture and Civil Engineering, City University of Hong Kong, Hong Kong, China
b Architecture and Civil Engineering Research Center, Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China
c Department of Building and Real Estate, Hong Kong Polytechnic University, Hong Kong, China

ARTICLE INFO

Keywords:
Natural language processing
Artificial intelligence
Scientometric analysis
Construction research
Industry 4.0

ABSTRACT

In the construction industry under “Industry 4.0”, Natural Language Processing (NLP) has been widely used to process and analyze text data to achieve construction intelligence. However, a comprehensive review of NLP applications in construction-related areas is lacking, which raises the bar for entering the research area and hinders its rapid development. Ninety-one NLP-related research articles in construction-related fields were retrieved, analyzed scientometrically using CiteSpace and VOSviewer, and summarized from the perspectives of datasets/data sources, technologies/tools, and applications and progress. The results show that data isolation, which makes research non-reproducible, is one of the severe problems to be solved. Besides, pure NLP application studies will no longer meet the future development needs of the industry, and more cross-modal interdisciplinary research based on the end-to-end pre-trained neural network model framework is needed. This study helps readers gain an in-depth understanding of NLP application and development in construction.

* Corresponding author at: Department of Architecture and Civil Engineering, City University of Hong Kong, Hong Kong, China.
E-mail address: xiaowluo@cityu.edu.hk (X. Luo).

https://doi.org/10.1016/j.autcon.2022.104169
Received 23 September 2021; Received in revised form 9 January 2022; Accepted 12 February 2022
Available online 24 February 2022
0926-5805/© 2022 Published by Elsevier B.V.
Y. Ding et al. Automation in Construction 136 (2022) 104169

1. Introduction

According to the McKinsey digital globalization index, the construction industry is currently one of the least digitized industries globally [56]. In the context of “Industry 4.0”, construction-related fields, including architecture, engineering, and construction, are therefore seizing this opportunity to develop further toward digitization and intelligence and to achieve significant improvements in automation, productivity, and reliability. To implement digital strategies in construction-related fields, AI has served as an efficient and feasible solution for changing the traditional execution mode of construction projects. In the process of digitization, data in construction-related fields have been stored in various electronic text formats, such as Word, Sheet, Email, Extensible Markup Language (XML), Hypertext Markup Language (HTML), Portable Document Format (PDF), Computer-aided Design (CAD), and Industry Foundation Classes (IFC), all of which contain human language to some degree. As an essential branch of AI, NLP provides an intelligent way to process such text data, enabling an intelligent agent to learn from human language and automatically complete knowledge representation (KRep), retrieval, and reasoning in a human-like way. Therefore, NLP technologies have been key to achieving further intelligence in construction-related fields.

A growing number of NLP-related studies have been carried out in construction-related areas, such as the application of NLP in document management [13,25,90], safety management [16,24,82], compliance checking [73,93,101], risk management [41–43], and Building Information Modeling (BIM) [50,91,113]. However, there is no review of NLP applications in construction-related areas. This causes difficulties for researchers who want to enter the area, since a literature review is commonly regarded as a desirable way to gain a preliminary understanding of a research area. On the other hand, most review studies in construction-related fields focus only on scientometric analysis or a summary of main ideas, ignoring the importance of sorting and summarizing datasets/data sources, technologies, and tools. For example, Martinez et al. [57] made a comprehensive scientometric analysis of computer vision applications in the construction field, discussing the current research status and future trends without summarizing data and technology. Fang et al. [26] summarized the application of computer vision-related technology for behavior-based safety in construction, omitting scientometric analysis and a review of the data used in the relevant literature. Although the data sources, algorithms, and technologies related to data mining were reviewed by Yan et al. [94], the summary of data sources is only a simple statistic of data acquisition methods, without links to the data sources or other related information. Romero-Silva and de Leeuw [72] conducted a review of text mining analysis in the Operations Research and Management Science (OR/MS) subject area, which focused only on publication analysis and future development. Without an all-around review, readers might find it hard to gain an overall in-depth understanding of the research field even after reading the relevant review article.

In view of the current problems and limitations, this paper aims to conduct a systematic and comprehensive review of NLP applications in construction-related fields from 2000 to 2020, to reveal the development and evolution pattern and future trends, and to help readers gain an overall understanding of NLP in this field by providing getting-started support and inspiration for in-depth research. Specifically, the objectives of this study include 1) conducting literature collection and systematic scientometric analysis (Section 2); 2) summarizing and reviewing the related articles comprehensively from the aspects of datasets/data sources, technologies/tools, and application & progress (Section 3); 3) discussing the current challenges, possible solutions, and future research trends of NLP application in construction-related fields (Section 4).

2. Publications analysis

2.1. Data collection

Two well-known literature retrieval platforms, Scopus and Web of Science (WOS), were adopted in this study. Two types of keywords were used for retrieval, NLP-related keywords (NRK) and construction-related keywords (CRK), as shown in Table 1. In the early stages, the concept of NLP had not been widely used. Instead, “text mining” or “text processing” was primarily used. Thus, it is difficult to find some earlier related papers using only the keyword “NLP”. However, with the development of NLP in recent years, many text processing technologies now fall into the category of NLP. In view of this situation, the authors extended the search range by adding keywords related to some major application fields of NLP or text mining, such as “Semantics”, “Sentiment”, “Knowledge”, “Ontology”, “Text”, and “Document”. The authors combined the two types of keywords to search the literature in the advanced search mode of Scopus and WOS. The search format on Scopus is TITLE-ABS-KEY ((NRK 1 OR NRK 2 OR …) AND (CRK 1 OR CRK 2 OR …)), and TS = ((NRK 1 OR NRK 2 OR …) AND (CRK 1 OR CRK 2 OR …)) on WOS. Here “TITLE-ABS-KEY” and “TS” indicate finding records containing the keywords in the Abstract, Title, and/or Keywords fields of a record. Other search restrictions were set as follows: Time (“2000–2020”), Document Type (“Article”), Language (“English”), Subject area (“Engineering”). The retrieval scope defined by these keywords and settings is relatively broad, which has both advantages and disadvantages. The advantage is that as many relevant articles as possible are collected, avoiding omissions. The disadvantage is that a large amount of irrelevant literature is introduced, increasing the workload of manual screening.

Table 1
Two types of searching keywords.
NLP-related keywords (NRK): NLP, Natural Language Processing, Semantic Analy*, Semantic Similarity, Sentiment Analy*, Knowledge Manage*, Extract*, Knowledge Graph, Ontolog*, Entit*, Text Mining, Text Summar*, LDA, Latent Dirichlet Allocation, Document Translat*, Document Manage*, Document Analy*.
Construction-related keywords (CRK): Jobsite*, Job Site*, Construction, Construction Site*, Construction Project*, Construction Industry, Building Industry, Civil Engineering.

Fig. 1 shows the data collection process. Firstly, 1689 and 1248 articles were retrieved from Scopus and WOS, respectively, using the keywords and settings mentioned above. These two sources of data were then merged to remove duplicate items, leaving 1962 nonredundant papers. Finally, 91 articles strongly related to NLP were selected through fine manual screening. A traditional but rigorous approach was used in this last phase. The authors first scanned each paper and then made the decision based on the following criteria: 1) whether NLP-related technologies were part of the research methodology; 2) whether the application field of the research was construction-related.

Fig. 1. Literature collection process.

After collecting the literature, some preprocessing operations were applied, mainly unifying the data format and merging keywords and authors. Since the data formats exported from Scopus and WOS were inconsistent, which was not conducive to the subsequent analysis, the authors re-exported the final selected articles from the WOS platform to obtain a unified data format. Another issue is that different researchers have diverse writing and expression styles, resulting in synonyms. For example, the keywords “NLP” and “Natural Language Processing” represent the same concept, and the keywords “Contract Management” and “Construction Contract Management” are the same because these papers are all in the construction field. Phrases with the same meaning but different forms of expression must be mapped to the same symbol, or they would be treated as different during the subsequent scientometric analysis. Similarly, different publishers have their own name formats. For instance, the authors “Weiming Wang”, “W. Wang” and “W. M. Wang” may be the same person and need to be transformed into a unified name. Note that when merging authors, two entries can be determined to be the same person only when their personal information, such as Scopus ID, ORCID, or email address, is the same.

Fig. 2. Source publication distribution.
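The search format described in Section 2.1 can be assembled programmatically. The following is a minimal sketch, assuming the keyword lists of Table 1. The helper names (`or_group`, `scopus_query`, `wos_query`) are hypothetical, and quoting multi-word terms is an assumption about each platform's phrase syntax rather than something the paper states. The additional restrictions (time span, document type, language, subject area) would still have to be set separately on each platform.

```python
# Keyword lists copied from Table 1 (wildcards kept as published).
NRK = [
    "NLP", "Natural Language Processing", "Semantic Analy*",
    "Semantic Similarity", "Sentiment Analy*", "Knowledge Manage*",
    "Extract*", "Knowledge Graph", "Ontolog*", "Entit*", "Text Mining",
    "Text Summar*", "LDA", "Latent Dirichlet Allocation",
    "Document Translat*", "Document Manage*", "Document Analy*",
]
CRK = [
    "Jobsite*", "Job Site*", "Construction", "Construction Site*",
    "Construction Project*", "Construction Industry", "Building Industry",
    "Civil Engineering",
]


def or_group(terms):
    """Join terms with OR; multi-word terms are quoted (an assumption)."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def scopus_query(nrk, crk):
    """Build a query in the TITLE-ABS-KEY ((...) AND (...)) format."""
    return f"TITLE-ABS-KEY ({or_group(nrk)} AND {or_group(crk)})"


def wos_query(nrk, crk):
    """Build a query in the TS = ((...) AND (...)) format."""
    return f"TS = ({or_group(nrk)} AND {or_group(crk)})"


if __name__ == "__main__":
    print(scopus_query(NRK, CRK))
    print(wos_query(NRK, CRK))
```

Keeping the two keyword groups as separate lists mirrors the paper's NRK/CRK split, so either group can be extended without touching the query format.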

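The merging and preprocessing steps of Section 2.1 (removing duplicates across Scopus and WOS, unifying synonymous keywords, and merging author name variants only when a personal identifier matches) can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual procedure: records are assumed to be dictionaries with hypothetical `doi`, `title`, `scopus_id`, `orcid`, and `email` fields, duplicates are matched by DOI or, failing that, by normalized title, and the synonym map is a hand-made example built from the two keyword pairs mentioned in the text.

```python
import re

# Hand-made synonym map (illustrative); keys are lower-cased variants.
SYNONYMS = {
    "nlp": "NLP",
    "natural language processing": "NLP",
    "contract management": "Contract Management",
    "construction contract management": "Contract Management",
}


def record_key(record):
    """Duplicate-detection key: DOI if present, else normalized title."""
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return "title:" + title


def merge_records(*sources):
    """Merge record lists, keeping the first occurrence of each key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record_key(record), record)
    return list(merged.values())


def normalize_keywords(keywords):
    """Map synonymous keywords to one label, preserving first-seen order."""
    labels = []
    for keyword in keywords:
        label = SYNONYMS.get(keyword.strip().lower(), keyword.strip())
        if label not in labels:
            labels.append(label)
    return labels


def same_author(a, b):
    """Treat two entries as one person only if an identifier matches."""
    for field in ("scopus_id", "orcid", "email"):
        if a.get(field) and a.get(field) == b.get(field):
            return True
    return False
```

Note that name similarity alone never merges two authors in this sketch, which follows the caution in the text that “W. Wang” and “Weiming Wang” only may be the same person. A record that carries a DOI in one export but not in the other would not be matched by `record_key`, so a real pipeline would need a further check.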

2.2. Overview

In this step, some straightforward statistics are presented to reveal the data distribution.

The first one is source publication statistics. As shown in Fig. 2, the 91 selected articles were published in 23 different journals, among which Automation in Construction published the highest percentage of articles, followed by the Journal of Computing in Civil Engineering and Advanced Engineering Informatics. The number of papers published by these top three journals far exceeds that of the others, accounting for 56% of all papers, reflecting their leading position in the field of computing in construction. Furthermore, the number of articles published by Automation in Construction is nearly twice that of the second-ranked journal, which shows its dominance in this area. Journal of Construction Engineering and Management, Journal of Management in Engineering, and Engineering Construction and Architectural Management are the middle-ranking journals according to Fig. 2, with related articles accounting for 3% to 10%. The remaining 17 journals each account for less than 3% of the articles.

Next are the statistics of the authors’ countries/regions. According to the statistical results, a total of 18 countries/regions were identified. In Fig. 3, only countries/regions with a proportion of no less than 3% are shown. Since an article may have multiple authors from different countries/regions, the sum of the statistical values of the columns is greater than the total number of articles (91). Besides, the percentage here is the statistical value of each column divided by the total number of articles, not the total number of authors. Hence, the sum of the percentages is greater than 100%. From the bar chart, these countries/regions can be roughly divided into three tiers. The first tier includes the USA and China, both of which account for more than 12%. The percentages of these two countries are far ahead of the others, undoubtedly benefiting from their excellent domestic environmental support. The USA and China are the top two construction countries globally [71]. On the other hand, it also indicates to a large extent that they are pioneers and leaders in this area and have considerable international academic influence and good cooperation ability. The second tier includes Australia, Great Britain, South Korea, Canada, and France, accounting for 3% to 12%. These countries are also important research forces in the field. The third-tier countries/regions account for less than 3% and are not displayed here.

Fig. 3. Distribution of authors’ countries/regions.

Another essential statistic is the publication year, ranging from 2000 to 2020, as shown in Fig. 4. According to the line chart, the development of NLP in the construction field can be divided into three stages. The first stage is the germination stage (2000–2011), the second stage is the gradual development stage (2012–2018), and the third stage is the rapid development stage (2019–2020). After consulting the relevant literature, it is found that the development pattern revealed in Fig. 4 is related to the historical development of NLP in general. Before 2011, NLP technology was mainly based on some machine learning models [10,40,63] or simple feed-forward neural networks [9]. Coupled with expensive computing power, this resulted in the low research popularity of NLP in various fields, let alone the construction field. In 2012, AlexNet started the prelude of deep learning [39]. After that, neural networks gradually became the leading technology of NLP, such as the word embedding neural network in 2013 [61,62], the sequence-to-sequence model in 2014 [78], and the attention mechanism in 2015 [6], which continuously improved the performance of NLP. In addition, the computing power of GPUs was constantly improving, and various deep learning frameworks were successively open-sourced, such as TensorFlow in 2015 and PyTorch in 2016. These favorable factors led NLP into a small peak of development in many areas, including the construction area. During this period, the main models of NLP were memory-based networks and their variants, such as Long Short-term Memory networks (LSTMs) and Gated Recurrent Unit networks (GRUs). However, from 2017, the memory-based networks seemed to encounter a bottleneck, with no significant breakthrough in performance. As a result, the research enthusiasm for NLP gradually declined. It was not until Google’s new model called Bidirectional Encoder Representations from Transformers (BERT) [22] came out that this dilemma was broken. As soon as BERT was proposed, it achieved state-of-the-art performance in many downstream tasks of NLP, making the pre-trained language model the mainstream model of this period and setting off another upsurge of NLP research. This path from bottleneck to breakthrough seems to explain why the number of articles in Fig. 4 drops sharply in 2018 and then rises dramatically in the next two years.

Fig. 4. Distribution of annual publications. [Line chart; y-axis: number of annual publications.]

2.3. Keyword co-occurrence

In this part, two analysis tools, VOSviewer and CiteSpace, were adopted to carry out the analysis. VOSviewer was used to draw and display the co-occurrence network due to its friendly visualization [87]. CiteSpace was used for computing tasks (e.g., bursts, centrality) because of its powerful numerical calculation functions [15]. The same tools are also used in Sections 2.4 and 2.5.

There are 198 keywords (given by the authors) in total, and 47 appear more than two times. Fig. 5 shows the keyword co-occurrence network. Each node represents a keyword, the node’s size indicates the frequency of keyword occurrence, the connection between nodes denotes co-occurrence, and the color distinguishes keyword clusters. There is no doubt that NLP and text mining are the two most frequent keywords, echoing the observation in Section 2.1 that NLP and text mining share many concepts and technologies. According to the node color, six different clusters are identified. If we focus only on the keywords in the domain of construction and ignore the keywords that represent technologies, then the application of NLP in the construction field can be divided into the following sub-domains: cluster “Red” is related to accident analysis, cluster “Navy” is related to
document/information management and retrieval, cluster “Yellow” is related to automatic compliance checking, cluster “Green” is related to safety management, cluster “Orange” is related to building information modeling, cluster “Azure” is related to risk management, and cluster “Purple” is related to project management.

Fig. 5. Keywords co-occurrence network.

Fig. 6. Top-10 keywords with the strongest bursts.

Fig. 6 shows the top 10 keywords with the strongest bursts and their burst time slots, in which “Strength” means the intensity of a burst, and “Begin” and “End” denote the start and end years of the burst. The greater the strength of a burst, the more attention the keyword has received. Therefore, keyword burst analysis can clearly show the research focus in different periods and be regarded as an indicator of research hotspots. For example, the burst period of “semantic analysis”, which has the highest burst intensity, is from 2015 to 2017, indicating that many studies focused on the application of “semantic analysis” in the field of construction in this time interval. In terms of strength, the top three keywords are “semantic analysis”, “automated compliance checking”, and “ontology”. From the perspective of different burst periods, the early popular keywords are “ontology”, “retrieval”, and
“information management”. Limited by the development of hardware and software technology, the research in this period mainly focused on knowledge management (including extraction and retrieval). The burst duration of these three topics is relatively long, indicating that the research directions were limited to these three topics, which might also be due to the limitations of the technology. On the other hand, continuous and long-term research might help full exploration. In the medium term, “sentiment analysis”, “automatic compliance checking”, “reasoning”, and “risk management” are the hot directions. Notably, the popularity of “ontology” and “retrieval” continues into the middle stage. The technologies in the middle period were relatively mature, allowing researchers to study more directions, such as simulating human beings for reasoning and various inspection work, resulting in more types of burst keywords after 2015. However, the burst durations after 2015 are shorter, which means that those directions were not studied for a long period. Thus, a concern is that the articles after 2015 might have focused more and more on the application of new technologies or the development of new directions/domains, leading to many new but shallow studies that were difficult to apply in practice. In the recent stage, more attention is paid to “BIM”, “text mining”, and “web crawler”. According to these burst keywords, there might be two hot issues in this period. One is how to manage a large amount of BIM data efficiently as BIM gradually prevails. The other is that the volume of data seems to have become the main bottleneck in the development of this field. Thus, the convenient and fast web crawler has become the first choice for many studies to supplement their research data.

2.4. Co-authorship

Academic literature is one of the leading scientific research achievements, and most of it results from the joint efforts of different scholars or diverse research groups. Studying the co-authorship of these articles might help reveal the characteristics of academic exchanges and collaboration, as well as the law of discipline development in related fields.

Fig. 7. Author collaboration network (ignored some small disconnected networks).

Table 2
Betweenness centrality of top-5 authors.
Author | Centrality
Heng LI | 0.03
Botao ZHONG | 0.02
Hanbin LUO | 0.01
Hongqin FAN | 0.01
Weili FANG | 0.01

Fig. 7 is the central part of the co-authorship network. For clarity, only the top-3 sub-networks with the most nodes are shown. Each node represents an author, and the connection between nodes indicates their cooperation. The more links a node has, the larger the node. Table 2 shows the top-5 nodes with the largest betweenness centrality (calculated in CiteSpace). Centrality is a crucial index to measure the importance of nodes in the whole network. According to the network and the centrality results, the collaboration network of Heng LI and Botao ZHONG, the two authors with the highest centrality, forms the largest sub-network, which reveals the critical
position of these two authors, who play an indispensable role in promoting cooperation in this field. In addition, the top-5 authors with the highest centrality are all in the same network, indicating that they and the collaboration network formed by their research teams are the significant research and contribution force in this field. The key nodes of the other two sub-networks are Geoffrey SHEN and Liyaning TANG, respectively. Generally, close and extensive collaboration often occurs in research on novel topics or directions because it requires multilateral resources in both knowledge and technology to better handle the great research challenges. Therefore, to some extent, the key authors and their cooperation outputs in these collaboration sub-networks could be used as a weathervane for research in this field.

2.5. Co-citation

The co-citation network can reveal the structure and characteristics of the cited literature over the whole period so as to find the core research in the field. After calculation, the total number of citations of the selected articles is 3446, of which 38 articles were cited no less than five times, as shown in Fig. 8. In the co-citation network, each node represents an article. The node size denotes the number of times the article was cited by the selected articles, and a link between two nodes indicates that the same article has cited both of them. Different colors in the graph represent different clusters. As can be seen from Fig. 8, Caldas’s articles on document classification in 2003 and 2002 are the most cited references [13,14], followed by Tixier’s article related to construction safety accidents and risk analysis published in 2016 [83] and Eastman’s article on compliance checking published in 2009 [23]. The research topics of these four most-cited references correspond with the topics of the clusters identified in the keyword analysis, which further reveals the critical enlightening role they played and the solid research foundation they laid in this field.

Fig. 8. Co-citation network.

Fig. 9. Top-10 references with the strongest citation bursts.

The co-citation network reveals the core research of the whole period, while citation burst analysis can reflect the fundamental research that was significantly relied on in different periods. Fig. 9 shows the top-10 articles with the strongest bursts. According to the
strength of the bursts, they can be roughly divided into three intensity levels: greater than three, greater than two but not greater than three, and not greater than two. The articles with the highest burst intensity level are the article on compliance checking published by Tan in 2010 [79] and the article on document classification published by Caldas in 2003 [13]. From the perspective of emergence time, the bursts can be roughly divided into three stages: early stage (before 2011), middle stage (2011–2016), and recent stage (2017–2020). Notably, there were no burst articles in the early stage because this area was just in the exploratory phase, with low research popularity and few related studies, which was also mentioned in the publication time analysis in Section 2.2. The early basic research laid a solid foundation for the development of this area, which also explains why seven of the top-10 references with the strongest bursts were published before 2011. After 2010, this field gradually saw a research boom. Similarly, the research in the middle stage would continue to pave the way for the following research period. For example, the burst references in the recent period, Fan’s article related to accident analysis and safety management [24] and Zhang’s articles on document information extraction and compliance checking [102,103], were published in the middle stage. Since there is a specific correlation between the original papers and their references, if an article was frequently cited in a certain period, the research focus or topics of that period might be relevant to that article. Therefore, it can be speculated that some of the current research trends in this field are accident analysis, safety management, and compliance checking.

3. NLP in construction-related areas

In this section, the authors introduce NLP in construction-related areas from three aspects: datasets and sources, technologies and tools, and applications and progress. With a slight modification of the keyword clustering results in Section 2.3, the selected articles are classified into six construction domains. They are Document/Information/Knowledge Management (DM/IM/KM), Accident Analysis/Safety Management (AA/SM), Automated Compliance Checking (ACC), Building Information Modeling (BIM), Risk Management (RM), and Other Construction-related Domains (Others). To analyze the selected articles in a more structured way, these six primary categories are further divided into secondary categories. As shown in Table 3, the secondary categories of the first five primary categories are divided by types of research objectives, including Knowledge Extraction/Retrieval (KE/KRet), Classification/Clustering (C/C), Question Answering System (Q&A Sys), Factor Analysis (FA), and Checking. However, the last primary category is still divided by the research domains of construction, i.e., Social Media Analysis (SMA), Scene Understanding (SU), Maintenance Requests Analysis (MRA), Cost Management (CM), Social Responsibility Analysis (SRA), and Worker Competencies Analysis (WCA).

Readers should note that the taxonomy and its corresponding results in Table 3 are not unique because some articles may belong to multiple construction domains. Thus, each article is forced to belong to one construction domain and one secondary category to facilitate analysis. The priority of article classification is: “AA/SM > ACC > BIM > RM > DM/IM/KM”. “DM/IM/KM” has the lowest priority because almost all text data stored in digital format can be called an “electronic document”, which means “DM/IM/KM” has the largest scope and partially contains the other domains. For example, Chi et al. [17] classified accident documents into different accident categories to better manage and store them. This article belongs to both “DM/IM/KM” and “AA/SM”, but was finally assigned to “AA/SM” according to the priority.

Table 3
Statistical distribution results of articles in different construction domains and their sub-categories.
Construction domains | Secondary categories | Related articles | Count | Total
DM/IM/KM | KE/KRep/KRet | [46], [51], [19], [5], [48], [97], [60], [25], [70], [45], [67], [75], [77], [85] | 14 | 28
DM/IM/KM | C/C | [14], [13], [86], [4], [3], [2], [55], [65], [33], [30], [90], [1] | 12 |
DM/IM/KM | Q&A Sys | [38], [105] | 2 |
AA/SM | C/C | [17], [83], [84], [82], [28], [99], [98], [16], [107], [108], [7], [27], [8] | 13 | 22
AA/SM | KE/KRep/KRet | [24], [95], [96], [115], [37], [106], [58] | 7 |
AA/SM | FA | [66], [36] | 2 |
ACC | KE/KRep/KRet | [68], [102], [103], [101], [110], [93], [109] | 7 | 13
ACC | C/C | [74], [111,112] | 3 |
ACC | Checking | [73], [44], [100] | 3 |
BIM | KE/KRep/KRet | [50], [53], [59], [91], [20] | 5 | 8
BIM | C/C | [35], [18], [113] | 3 |
RM | KE/KRep/KRet | [76], [43], [41] | 3 | 5
RM | C/C | [42], [32] | 2 |
Others | SMA | [34], [81], [69], [88], [80], [92], [114], [47] | 8 | 15
Others | SU | [54], [49] | 2 |
Others | MRA | [11], [64] | 2 |
Others | CM | [89] | 1 |
Others | SRA | [52] | 1 |
Others | WCA | [104] | 1 |
Total | | | | 91

3.1. Datasets and sources

In this section, the datasets and data sources used in the selected papers are arranged methodically. According to the statistics, as many as 36 articles (39.6%) have unclear data sources, as shown in Table 4, including papers without a relevant introduction to the dataset and papers with a simple introduction but without source links. For example, Caldas et al. [14] did not describe their data. Al Qady and Kandil [3] only said that their data was collected from an international airport construction project, without more specific information. Table 5 shows a total of 33 data types and 41 datasets adopted by the selected articles, where each dataset might be adopted by multiple articles, and each source might be used by multiple datasets. Among those datasets, seven have a data volume of less than a hundred (17.1%), fourteen have more than a hundred but less than a thousand (34.1%), fifteen have more than a thousand but less than ten thousand (36.6%), and five have more than ten thousand (12.2%). The size of most datasets is not very large.

The authors also noted that few articles in Table 5 shared the data they collected, preprocessed, or labeled, which means that much time needs to be spent on data collection, preprocessing, and labeling if researchers want to repeat the studies and make progress. Unfortunately, the newly collected data may be different from the original. Liu et al. [54] are the only exception (AEC Image Caption in Table 5): they shared the annotated dataset, source code, and even the trained model to enable other researchers to easily make further exploration based on their research. Due to the limitation of data, some studies can only be

Table 4
Articles with unclear data sources.
Construction domains | Related articles | Count
DM/IM/KM | [14,51,46,19,86,25,3,2,67,55,77,30,105,85] | 14
AA/SM | [95,82,106,58,108,7,27,8] | 8
ACC | [73,68,74,44,109] | 5
BIM | [50,59,91,20] | 4
RM | [76] | 1
Others | [52,11,64,49] | 4
Total | | 36
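The priority rule that forces each article into a single construction domain (“AA/SM > ACC > BIM > RM > DM/IM/KM”) can be sketched as below. This is a minimal illustration, assuming each article has already been tagged with a set of candidate domains. The function name `assign_domain` and the fallback to “Others” for articles matching none of the five prioritized domains are assumptions, not something the paper states.

```python
# Priority order quoted from the text: AA/SM > ACC > BIM > RM > DM/IM/KM.
PRIORITY = ["AA/SM", "ACC", "BIM", "RM", "DM/IM/KM"]


def assign_domain(candidates):
    """Return the highest-priority domain among an article's candidates."""
    for domain in PRIORITY:
        if domain in candidates:
            return domain
    # Assumption: articles outside the five prioritized domains go to "Others".
    return "Others"


if __name__ == "__main__":
    # Chi et al. [17] fits both DM/IM/KM and AA/SM and is assigned to AA/SM.
    print(assign_domain({"DM/IM/KM", "AA/SM"}))
```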

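The percentages quoted in Section 3.1 follow directly from the reported counts. The short check below uses only numbers stated in the text (41 datasets split into four size bands, and 36 of 91 articles with unclear data sources); the variable names and band labels are illustrative.

```python
# Dataset counts per size band, as reported in Section 3.1.
size_bands = {
    "fewer than 100": 7,
    "100 to 1,000": 14,
    "1,000 to 10,000": 15,
    "more than 10,000": 5,
}

total_datasets = sum(size_bands.values())

# Share of each band among all datasets, rounded to one decimal place.
shares = {
    band: round(100 * count / total_datasets, 1)
    for band, count in size_bands.items()
}

# Share of the 91 selected articles whose data sources are unclear.
unclear_share = round(100 * 36 / 91, 1)

if __name__ == "__main__":
    print(total_datasets, shares, unclear_share)
```

Roughly half of the datasets (21 of 41) contain fewer than a thousand items, which is consistent with the remark that most datasets are not very large.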

Table 5
Datasets adopted by related articles. Y/P means paid access, and Y/AL denotes public but area limited.
Data Types Source Links Public Dataset Related Articles Construction https://www.
Accessible Size Accident Cases in westlawasia.com/
DM/IM/KM Hong Kong hong-kong
https://www.cpwr 1167 [17]
Classified Online https://sweets. constructionsolut Y
Product construction.com/ Y 3030 [13] ions.org
Descriptions default.jsp Construction https://www.cdc. 140 [83]
Web Pages in Accident Cases or gov/niosh/face/ Y 3556 [84]
https://dmoz-odp.
Human-edited Y 7500 [38] Injury Reports or inhouse.html
org
Directory Safety Risk Cases https://www.
General Conditions Y
https://www. or Accident cross-safety.org/uk
of the Contract Y/P 1 [4,5,70] 590 [115]
aiacontracts.org Narratives in the https://www.
for Construction US worksafebc.com/ Y
Engineering en
domain-Specific https://www. https://www.osha. 1000 [28,98,99,16]
Y 111 [48] N
Technical ncree.narl.org.tw gov 2000 [107]
Documents Investigation
https://pcces.arch Reports on
nowledge.com/c http://www.
Construction Y/AL 100 [66]
CAD Drawings si/Default.aspx? Y 134 [97] chinasafety.gov.cn
Accidents in
aspxerrorpath China
=/csi/Default.aspx News of
http://www. https://news.sbs.
Construction
scotland.gov.uk/ co.kr/news/ Y 28,263 [36]
Fire-Accident in
Topics/Built- newsMain.do
Korea
Environment/ https://www.
Scottish Technical
Building/Building- Y 1 [60] kosha.or.kr/kosha/ Y
Standards Korea Accident
standards/ index.do 4263 [37]
publications/ Reports
https://www.
pubtech/ N
cosmis.or.kr/
thb2011octdom
https://forums.aut Count Data Types: 5, Dataset: 10
Y
odesk.com ACC
Q & A Data related http://www.cad
Y 451 [45] [102], [103],
to CAD tutor.net International https://www.iccs
https://www. Y/P 1 [111], [112],
Y Building Code afe.org/
cadforum.cz/en [93,100,101,110]
https:// Count Data Types: 1, Dataset: 1
buildingdata.
N
energy.gov/ BIM
projects
Green Building https://standards.
https://www. 71 [75]
Cases buildingsmart.
wbdg.org/ Standards of IFC Y 1 [53]
org/IFC
additional- Y
/RELEASE/
resources/case-
https://www.
studies BIM Case Studies Y 240 [35]
bimobject.com/
http://cdaily.kr/ Y
https://www.
https://www. Y
Y dbpia.co.kr/
dnews.co.kr/
BIM-related http://riss.kr/ Y
https://www. 450 [18]
Y literatures https://kiss.
Issues of fnnews.com/
kstudy.com/index. Y
International http://www.
25,143 [65] asp
Construction kiscon.net/ Y
https://apps.
Market ksc_main.asp
User comments for autodesk.com/
http://www.molit. Y 2120 [113]
Y BIM Application RVT/en/Home/
go.kr/portal.do
Index
http://www.
Y
ohmycon.co.kr/ Count Data Types: 4, Dataset: 4
(LexisNexis)
Construction- RM
https://www.
Defect Litigation Y/P 1498 [33]
lexisnexis.com/en- Bid Documents of
Cases https://dot.ca.go
us/home.page Infrastructure Y 260 [42]
v/
https://www.uspt Projects
Patent Documents Y 348 [90]
o.gov/ International
https://www. Federation of
Publications scopus.com/ Consulting https://fidic.org/ Y/P 1 [41,43]
Related to Civil search/form.uri? Y/P 663 [1] Engineers Red
Engineering display = Book
basic#basic https://www.sec.
Transaction Data of gov/edgar/
Count Data Types: 12, Dataset: 12
Construction searchedgar/ Y 995 [32]
AA/SM Companies companysearch.
html
Y/P 360 [24]
Count Data Types: 3, Dataset: 3

(continued on next page)


Table 5 (continued)

Data Types | Source Links | Public Accessible | Dataset Size | Related Articles

Others
WeiBo data | https://weibo.com/login.php | Y | 25,289; 5360; 80,793 | [34]; [88]; [80]
Twitter data | https://twitter.com/ | Y | 275,325; 3200; 23 | [80]; [81]; [69]
Legislative Document on Construction Project | https://www.legco.gov.hk/general/english/library/index.html | Y | 1748 | [92]
News of Bursting Water Pipes in Hong Kong | https://www.hkpl.gov.hk/tc/e-resources/e-databases/keyword/wisenews/all/1 | Y/P | 2828 | [114]
News of Prefabricated Buildings in China | https://www.cnki.net/ | Y/P | 615 | [47]
Competitively Bid Data of Highway Projects | https://dot.ca.gov/ | Y | 1221 | [89]
Image Caption for Construction Activity Scenes | https://github.com/HannahHuanLIU/AEC-image-captioning | Y | >1000 | [54]
Recruitment Advertisements for Construction Project Manager | https://www.chinahr.com/; https://www.zhaopin.com/; http://www.job1001.com/; https://www.liepin.com/; https://www.51job.com/ | Y; Y/AL; Y; Y; Y | 243,521 | [104]
Count Data Types: 8, Dataset: 11
Total Data Types: 33, Dataset: 41

Fig. 10. Relationship of NLU and NLG in the scope of NLP.

carried out by a few researchers with data, forming a research monopoly. The most apparent research monopoly, for example, is the ACC research with “International Building Code” as the data source in Table 5, which is almost entirely conducted by the research team led by El-Gohary. Although the International Building Code is available through paid access, the regulations extracted and processed by El-Gohary’s team and the evaluation data they used are the key resources. Unfortunately, those data were shared only within their team. Such a research monopoly is very unfavorable to the healthy development and progress of academia. In addition, 27 of 33 types of data (84.4%) in Table 5 have only one related article, indicating a lack of iterative/progressive research in the same or similar research directions, which might be caused to a large extent by data monopoly.

3.2. Technologies and tools

This section aims to provide the reader with an overall context of NLP without going into detail. Therefore, the authors systematically organize but only briefly introduce the basic ideas of the NLP technologies/algorithms applied or mentioned in the selected articles. Please refer to the original sources for more detailed information.

The primary tasks of NLP can generally be divided into Natural Language Understanding (NLU) and Natural Language Generation (NLG). The relationships between NLP, NLU, and NLG are shown in Fig. 10. NLU structurally represents/encodes unstructured text to make machines understand human language as humans do, while NLG transforms/decodes structured data into text to allow machines to generate natural language that humans can understand. The input knowledge of NLG can also be transformed from other types of data (e.g., image, audio) by other encoding models (e.g., CNN, LSTM). For example, an image can be encoded into a set of vectors by a CNN model and then used as the input of an NLG model (e.g., LSTM) to gradually generate a series of words.

Fig. 11 is the mind map of the overall NLP context generated according to the technologies used in the selected articles, and Table 6 provides details of the tools adopted in the selected papers for readers’ reference and adoption. The main applications of NLG include machine translation (MT), answer generation in Q&A systems, and auto-summary, such as generating reports from company income data or summarizing the main content of an image. At the current stage, the major technologies and models of NLG include LSTM, GRU, Seq2seq, BERT, and Generative Pre-trained Transformer (GPT), among which BERT and GPT are state-of-the-art (SOTA) models. On the other hand, according to the review of collected articles, NLG appears far less often than NLU in the field of construction (only two articles were related to NLG; more details in Section 3.3). NLU mainly includes the following parts:

1) Morphology Analysis (MA). MA uses the computer to analyze the morphology of natural language to judge the structure and category of words, and usually involves the following operations:
a) Tokenization/Segmentation. Tokenization is the process of separating and possibly classifying various parts of the input sentences. English sentences can simply be divided according to spaces. The situation becomes complicated for languages whose words are not separated by spaces (such as Chinese). The initial segmentation method was based on dictionary matching, which was fast and low cost but had poor adaptability. Statistics-based methods were then developed, such as the Hidden Markov Model (HMM) and the Conditional Random Field (CRF). The latest methods are deep learning-based, such as the Bi-LSTM-CRF proposed by Huang et al. [31].
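To make the contrast between space-based and dictionary-based tokenization concrete, the following is a minimal sketch in Python. It assumes a toy dictionary and uses forward maximum matching as one simple instance of dictionary matching; the function names and the dictionary are hypothetical and are not taken from any of the reviewed studies or tools.

```python
def whitespace_tokenize(sentence):
    """Split an English sentence on whitespace."""
    return sentence.split()


def forward_max_match(text, dictionary, max_len=4):
    """Dictionary-based segmentation by forward maximum matching.

    At each position, take the longest dictionary word (up to max_len
    characters); fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"建筑", "工人", "安全帽", "佩戴"}

english_tokens = whitespace_tokenize("He is wearing gloves")
chinese_tokens = forward_max_match("建筑工人佩戴安全帽", toy_dictionary)
```

A word that is missing from the dictionary falls apart into single characters, which is one way to see the poor adaptability of pure dictionary matching that motivated the statistical and deep learning-based methods.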


Fig. 11. Mind map of the overall NLP context, including technologies, algorithms, and tools.

b) Word Normalization (for some languages, e.g., English) includes Stemming and Lemmatization. Stemming is the process of reducing inflected (and sometimes derived) words to their stem, word base, or root form, typically by stripping suffixes. For example, “doing” is reduced to “do” through Stemming. Lemmatization is the process of transforming the complex form of words into the most basic form. For example, “cities”, “children”, and “teeth” will be transformed into the basic forms “city”, “child”, and “tooth”, respectively, and irregular verb forms such as “did” and “done” will be restored to “do”. Lemmatization is generally a lexicon-based method implemented by dictionary lookup, whereas Stemming usually relies on suffix-stripping rules.
c) Part-of-Speech (POS) tagging. POS tagging is the task of giving each word in a sentence a POS category. The POS categories here may be Nouns, Verbs, Adjectives, Prepositions, Conjunctions, etc.
d) Named Entity Recognition (NER). NER is the task of locating entities mentioned in unstructured text and classifying them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
2) Syntactic Parsing (SP). SP is the process of analyzing the input sentence to obtain the syntactic structure, mainly including Phrase Structure Parsing (PSP) and Dependency Syntactic Parsing (DSP). PSP is used to identify the phrase structure in the sentence and the hierarchical syntactic relationship between phrases, such as Noun Phrase (NP) and Verb Phrase (VP). DSP is utilized to recognize the grammatical interdependency between words in sentences, such as the dependency among subject, predicate, and object.
3) Semantic Analysis (SA). For different text units, the tasks of SA are different. At the word level, the basic task of SA is Word Sense Disambiguation (WSD), which refers to identifying the correct meaning of an ambiguous word in a specific context. For example, the word “table” has different meanings in the sentences “There is a table in the room” and “This is the metric conversion table”. At the sentence level, the task is Semantic Role Labeling (SRL), which assigns labels to words or phrases in sentences to indicate their semantic roles. Take the sentence “Who did what to whom at where” as an example: no matter how the sentence structure changes, “who” is always the agent, “did” is the predicate, “what” is the theme, “whom” is the recipient, and “where” is the location. At the textual level, the task is to find all expressions that refer to the same entity, called Coreference Resolution (CR). For example, the word “It” refers to the “window” when the text is “There is a window. It is open.”.
4) Text Representation (TextRep). TextRep aims to numerically represent the unstructured text/documents to make them mathematically computable. The NLU implementations of the above tasks simulate human understanding, while TextRep is a numerical task that can be regarded as a computer’s specific way of performing NLU. Generally, TextRep can be implemented in the following ways:
a) Boolean Model (BooMo). BooMo generates a vector consisting of 0 and 1 to indicate whether each specific word appears in the text. BooMo is a strict matching model that only uses the logical operators “and”, “or”, and “not” for vector calculation.
b) Vector Space Model (VSM). VSM utilizes a real-valued vector to represent the words, sentences, or documents, which implicitly projects the knowledge contained in the text into a specific vector space. BooMo can be regarded as a particular case of the VSM, but the primary methods are the following.


Table 6
Tools/lexicons that were used by the selected papers. Multi NLP denotes that the tool can perform multiple NLP operations. TA/DM/ML denotes that the tool is an integrated platform that supports text analysis, data mining, and machine learning. Multi-Lan means multiple languages.

Tool name | Tool types | Functions | Language | Open-source? | Source/URL
Jieba | Python package | Tokenization | Chinese | Y | https://pypi.org/project/jieba/
PKUSeg | Python package | Tokenization | Chinese | Y | https://pypi.org/project/pkuseg/
PanGu | Software | Tokenization | Chinese, English | Y | http://pangusegment.codeplex.com/
Stanford Tagger | Java package | POS | Multi-Lan | Y | https://nlp.stanford.edu/software/tagger.shtml
Stanford Parser | Java package | SP | Multi-Lan | Y | https://nlp.stanford.edu/software/lex-parser.shtml
SyntaxNet | Python package | SP | Multi-Lan | Y | https://github.com/tensorflow/models/tree/f2f25096d3dc6561a855dab914cf2913100728d6/research/syntaxnet
Word2Vec | Python package | WordEmb | Multi-Lan | Y | https://www.tensorflow.org/tutorials/text/word2vec
FastText | Python package | WordEmb | Multi-Lan | Y | https://fasttext.cc/
Gensim | Python package | TopMo | Multi-Lan | Y | https://pypi.org/project/gensim/
MALLET | Java package | TopMo | Multi-Lan | Y | http://mallet.cs.umass.edu/topics.php
BERT | Python package | Multi NLP | Multi-Lan | Y | https://github.com/google-research/bert
GPT | Python package | Multi NLP | Multi-Lan | Y | https://github.com/openai/gpt-3
NLTK | Python package | Multi NLP | Multi-Lan | Y | https://www.nltk.org/
HanLP | Web API | Multi NLP | Multi-Lan | Y | https://github.com/hankcs/HanLP
CoreNLP | Java package | Multi NLP | Multi-Lan | Y | https://stanfordnlp.github.io/CoreNLP/
GATE | Software | Multi NLP | Multi-Lan | Y | https://gate.ac.uk/
Sundance | Software | Multi NLP | English | N | https://www.cs.utah.edu/~riloff/pdfs/official-sundance-tr.pdf
PolyAnalyst | Software | TA/DM/ML | Multi-Lan | N | https://www.megaputer.com/polyanalyst/
KNIME | Software | TA/DM/ML | English | Y | https://www.knime.com/
Voyant-tool | Web Application | TA/DM/ML | Multi-Lan | Y | https://voyant-tools.org/
RapidMiner | Software | TA/DM/ML | English, Japanese | N | https://rapidminer.com/
IBM SPSS Modeler | Software | TA/DM/ML | Multi-Lan | N | https://www.ibm.com/products/spss-modeler
Leximancer | Software | TA/DM/ML | Multi-Lan | N | https://www.leximancer.com/
Protégé | Software | Ontology editor | English | Y | https://protege.stanford.edu/
SUMO | Public ontology lexicon | Word mapping | Multi-Lan | Y | http://www.ontologyportal.org/
MatWeb | Material ontology lexicon | Word mapping | English | Y | http://www.matweb.com/search/SearchSubcat.asp
WordNet | Lexicon | Word mapping | English | Y | https://wordnet.princeton.edu/
HowNet | Semantic lexicon | Word mapping | Chinese | Y | https://github.com/thunlp/OpenHowNet
CKIP | Corpus | / | Chinese | Y | https://ckip.iis.sinica.edu.tw/
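As a concrete illustration of the Boolean Model (BooMo) introduced under Text Representation, the sketch below builds 0/1 vectors over a small vocabulary and answers a query using only logical operators. The vocabulary, documents, and function name are hypothetical and chosen purely for illustration; none of the tools in Table 6 is used.

```python
def boolean_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms occur in text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["worker", "helmet", "gloves", "scaffold"]
documents = [
    "worker wearing helmet and gloves",
    "worker on scaffold without helmet",
    "gloves left on scaffold",
]
vectors = [boolean_vector(doc, vocabulary) for doc in documents]

# Query "helmet AND NOT gloves", evaluated with logical operators only.
helmet = vocabulary.index("helmet")
gloves = vocabulary.index("gloves")
hits = [i for i, v in enumerate(vectors) if v[helmet] and not v[gloves]]
```

Note that the second document is returned even though it says “without helmet”: the model only records whether a term is present, which is the strict matching behavior that weighted representations in the vector space family try to improve on.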

i. Bag-of-Words (BOW). In BOW, the text is represented as a bag (multiset) of words. The frequency of each word in the text is counted to obtain the vector representation of the text. However, the position information of words in sentences cannot be preserved, and the vectors become high-dimensional and sparse as the corpus grows.
ii. Term Frequency-Inverse Document Frequency (TF-IDF). Intuitively, the importance of a word increases with the number of times it appears in the document but decreases with the frequency with which it appears in the corpus. Based on this idea, TF-IDF considers not only the term frequency in the document but also how many documents in the whole corpus contain the term. The more a word appears in a document and the fewer documents it appears in, the higher the TF-IDF value of the word, and the better it represents the document. Normalized TF-IDF is often called TFC-weighting, which considers a third factor: the document vector’s length. TF-IDF can be used for feature selection (by setting a threshold value of TF-IDF) and keyword extraction. However, it still cannot retain the positional relationship of words in sentences.
iii. Latent Semantic Analysis (LSA), sometimes called Latent Semantic Indexing (LSI). LSA can solve the problems of high dimension and sparsity. LSA decomposes the document-term matrix obtained by TF-IDF into a low-rank product of smaller matrices through Singular Value Decomposition (SVD) to remove the “noise” in the document, such as stop-words, misused words, or occasionally irrelevant words. After matrix decomposition, the semantic structure of the text gradually emerges. Compared with the traditional vector space, the latent semantic space has a lower dimension and the semantic relationships become clearer. In addition to dimensionality reduction, LSA can also be used for document Topic Modeling (TopMo).
iv. Word Embedding (WordEmb). WordEmb uses a large amount of text data to train a representation model that can generate dense low-dimensional vectors. The closer the meaning of two words, the closer the distance between the two word-vectors in the vector space. The initial WordEmb models, such as Word2Vec and FastText, are static representation methods that cannot handle polysemy. The latest and SOTA WordEmb models are dynamic representation methods based on context embedding, whose output representation can change according to the context, such as the Embeddings from Language Models (ELMo), BERT, and GPT models.
c) N-Gram. N-Gram is a statistical language model used to calculate the next most likely word for the given text. It is based on the Markov hypothesis, which assumes that the occurrence of the Nth word depends only on the previous N-1 words and not on any other words, so that the occurrence probability of the whole sentence is equal to the product of the conditional occurrence probabilities of each word. N-Gram divides the text with a sliding window of size N to form a fragment sequence with fragment length N. When N = 1, it is called Unigram, N = 2 is called Bigram, and N = 3 is called Trigram. Take the sentence “He is wearing gloves” and N = 3 as an example: with the window truncated at the sentence boundaries (so that the first and last fragments are shorter than N), the fragment sequence will be [“He”, “He is”, “He is wearing”, “is wearing gloves”, “wearing gloves”, “gloves”].
d) Latent Dirichlet Allocation (LDA). LDA is a Bayesian-based generative statistical model that allows sets of observations to


be explained by unobserved groups. LDA assumes that each document is a mixture of a small number of topics and that each word’s presence is attributed to one of the document’s topics. In this case, the topic of each document is given in the form of a probability distribution. After analyzing some documents (observed data) and extracting their topics (or probability distributions), topic clustering or text classification can be carried out according to the topic distributions.

3.3. Applications & progress

Since there are a relatively large number of articles, this section briefly summarizes the articles according to the taxonomic results shown in Table 3. In addition, almost all NLP studies have conducted the same preprocessing operations on text data, including tokenization, stopwords removal, stemming, lemmatization, POS tagging, NER, phrase chunking, etc. The authors confirmed this after reading through each article. Therefore, the preprocessing operations for data will not be mentioned in the summary of the articles.

3.3.1. Document/information/knowledge management

3.3.1.1. Knowledge extraction/retrieval. In knowledge extraction, Choudhary et al. [19] used commercial software (PolyAnalyst 5.0) to systematically analyze post-project review (PPR) files and mined valuable experience and knowledge to facilitate dissemination and exploitation of PPR learning. Al Qady and Kandil [5] adopted a natural language parsing tool (Sundance) to parse contract documents into noun phrases (NP), verb phrases (VP), prepositional phrases (PP), etc. By identifying the subject and object associated with each VP, an SVO triplet concept <subj, VP, obj> was identified, with the F1 score reaching 90% of human performance. McGibbney and Kumar [60] developed a user-oriented legislation browser with viewing, retrieval, and navigation functions by automatically adding corresponding notes to the named entities and specific terms detected by the tool GATE. Niu and Issa [70] also utilized GATE to extract the common root concepts/ontologies to help build a taxonomy for the domain ontology of construction contracts. Knowledge visualization is an essential part of knowledge extraction, which is conducive to knowledge absorption. The common forms of knowledge visualization are tree, graph, and word cloud. For example, by integrating multiple methods (CRF, Fuzzy Neural Networks, LDA, Family-tree), Li et al. [45] modeled the long-term evolution of empirical engineering knowledge in the development of CAD. Nedeljkovic and Kovacevic [67] proposed a two-phase method to extract key phrases from project documents, and then built a phrase graph network to visualize valuable project knowledge according to the relationships calculated from the semantic similarity between phrases. Sun et al. [77] extracted keywords from monthly construction reports by TF-IDF and visualized the keywords with co-occurrence network and word cloud technologies to reveal the hidden knowledge.
Another topic is knowledge retrieval, which is usually achieved by comparing the similarity of two represented vectors. Li and Ramani [46] developed an ontology-based design document query system using TF-IDF and cosine similarity and achieved better performance than keyword-based search technology. Lin and Soibelman [51] used various query expansion strategies combined with the extended Boolean model to assist online product information retrieval in AEC areas. Many studies have sought to improve the performance of traditional retrieval models. For example, considering that full-text-based retrieval might reduce the relevance ranking due to the multi-topic attribute of documents, Lin et al. [48] combined ontology concepts with paragraphs to form ontopassages and realize a paragraph-level retrieval system for engineering domain-specific technical documents. By using external knowledge, Yu and Hsu [97] developed a corpus-based approach based on Bayesian classifiers to reduce the dimension of VSM and improve the efficiency of CAD document retrieval. Fan et al. [25] improved the retrieval performance of project information in AEC documents by considering more features, including the project-specific terms in the documents and their dependency relations. In addition to adding extra features, Shen et al. [75] also used different methods for different features to measure the local similarity, because a single similarity measurement might be one-sided. The global similarity was then obtained by weighted addition. Similarly, by combining several similarity measures, Torkanfar and Azar [85] proposed a method to quantify the similarity of construction projects based on semantic and structural metrics derived from their Work Breakdown Structures (WBSs).

3.3.1.2. Classification/clustering. The general process of document classification is preprocessing, text representation, and classification modeling. Caldas et al. [14] developed a construction document classification model using different text representation methods, such as Boolean Weighting, Absolute Frequency, TF-IDF, and TFC-weighting. TFC-weighting combined with Support Vector Machine (SVM) finally achieved the best results. Based on the same combination, Caldas and Soibelman [13] implemented an automated hierarchical document classification system whose average classification accuracy over three levels was 92.05%. To refine the represented knowledge, Ur-Rahman and Harding [86] developed a three-step classification system for post-project reviews (PPR) document classification. However, the system’s performance was somewhat low and left much room for improvement. After Caldas, Al Qady and Kandil [4] implemented automated hierarchical classification by document clustering using LSA, which was also adopted by the research of [2,3] for document clustering and classification. Document tagging is a multilabel classification task. For example, Mahfouz et al. [55] developed a classification model based on TF-IDF and Naive Bayes (NB) to automatically tag legal factors for differing site conditions (DSC) litigations. Moon et al. [65] developed a tagging system for global contracts that automatically assigned five tags to each document. The tags in these two studies were defined from human experience. Unlike them, Jallan et al. [33] and Afolabi et al. [1] adopted LDA to automatically assign each document one or more topics to realize document tagging. Among other recent studies, Hassan and Le [30] classified contractual text into requirement and non-requirement text with Word2Vec and SVM, achieving better performance than other machine learning methods. Wu et al. [90] vectorized construction patent documents with a combination of N-Gram and TF-IDF. A Multilayer Perceptron (MLP) model was then used to identify patents of the information and communication technology type.

3.3.1.3. Question answering system. The question answering system is a next-step application of knowledge retrieval, which further generates the answer from relevant documents or paragraphs that have been retrieved or located. For example, Kovacevic et al. [38] developed a system to generate the answers to construction-related questions according to the key elements identified in the retrieved paragraphs of web pages processed by semi-supervised SVM and TF-IDF. However, answer generation according to the key elements was a rule-based method. To achieve full automation, Zhong et al. [105] implemented an NLG model based on the BERT model to automatically generate answers from the retrieved relevant paragraphs.

3.3.2. Accident analysis/safety management

3.3.2.1. Knowledge extraction/retrieval. With the help of POS tagging, Yeung et al. [95] simplified and reorganized the complex safety accident narrative text to generate a narrative map, which was widely believed to help learners better understand and remember the key factors and events leading to the accidents. Yeung et al. [96] further developed a computational narrative semi-fiction generation model to generate new


narratives from the existing narrative knowledge base to support training and learning. Tixier et al. [83] developed a set of NLP programs based on manual rules to automatically extract attributes and outcomes from injury reports with an F1 score of 96%. Martínez-Rojas et al. [58] used an NLP software (KNIME) to extract seven types of content from safety and health plan documents using manually defined rules to check the important specific requirements of the plans at an early stage. There were also some improvement strategies for retrieval tasks in the AA/SM domain. For example, Fan and Li [24] reduced the dimension of the TF-IDF vector using LSI and completed the retrieval task of similar accident cases with the cosine similarity. However, manual effort was required for the selection of terms. In addition to using TF-IDF for document representation, Zou et al. [115] also applied Word2Vec to vectorize and expand the query text to improve the performance of case retrieval. Kim and Chi [37] used Word2Vec for query text expansion and CRF for latent knowledge extraction of documents to achieve a better retrieval performance of accident cases. A simple cross-modal retrieval case in the AA/SM domain was carried out by Zhong et al. [106]. The research first applied HowNet to extract the ontologies related to potential hazards in project documents as the core concepts and then manually labeled the potential hazard categories of construction images based on the extracted ontologies. Finally, similar images were retrieved by calculating the similarity between the query text vector and the annotation text vector of the image.

3.3.2.2. Classification/clustering. The classification task in AA/SM is usually to classify the categories of accident reports. For example, Chi et al. [17] combined TF-IDF, principal component analysis (PCA), and SVM to classify the accident category of the document. Goh and Ubeynarayana [28] also used SVM to classify construction accident narrative documents, while Zhang et al. [99] integrated multiple machine learning models to improve the classification performance. Based on the knowledge extracted by Tixier et al. [83], a series of follow-up studies were carried out. For example, Tixier et al. [84] predicted the construction injury types according to the extracted attributes and outcomes. Tixier et al. [82] constructed an undirected graph using the key attributes extracted from the injury reports and then applied several graph clustering algorithms to dig out possible safety clashes. Baker et al. [8] took the extracted attributes as features to classify the categories of safety outcomes by using different Machine Learning (ML) models. With the arrival of the deep learning era, Word2Vec was commonly used instead of TF-IDF. Word2Vec is often used in conjunction with other deep learning models, such as LSTM [98], GRU [7,16], and Convolutional Neural Network (CNN) [107,108]. The studies of Goh and Ubeynarayana [28], Zhang et al. [99], Zhang [98], and Cheng et al. [16] were based on the same dataset, with F1 scores of 0.67, 0.68, 0.723, and 0.69, respectively. A generally upward trend suggests that the deep learning models perform better than the machine learning models. However, Word2Vec is a static method that cannot solve the problem of polysemy. Therefore, dynamic representation methods started to be used. For instance, Fang et al. [27] used the BERT model to represent the documents and achieved a classification accuracy of 86.91%.

3.3.2.3. Factor analysis. Na et al. [66] extracted 43 safety risk factors from the investigation reports on construction accidents and analyzed the relationships among the safety risk factors through interpretative structural modeling (ISM). Through several morphological analyses, Kim and Kim [36] identified the construction fire accident factors from news articles and then analyzed the main factors causing fire accidents in different seasons via PCA.

paper related to ACC in the construction field appeared in 2013 (Salama and El-Gohary [73]), which proposed a deontic model based on syntactic and semantic features consisting of a hierarchy of normative concepts, interconcept relations, and deontic axioms (for representing regulations). The study preliminarily explored the feasibility of ACC. ACC first requires understanding and extracting constraints from various construction regulatory documents and then transforming them into a formalized format that enables checking/reasoning. For example, Niemeijer et al. [68] parsed the design constraints described in natural language into a tree structure according to predefined rules using syntactic parsing to realize the standardized representation of constraints. After POS tagging and dependency parsing, J. Zhang and El-Gohary [102] adopted pattern-matching-based rules and conflict resolution rules to extract regulations and then represented them using the deontic logic method of Salama and El-Gohary [73], while Jiansong Zhang and El-Gohary [103] proposed a bottom-up transformation method based on semantic mapping rules and conflict resolution rules to extract constraints and transform them into first-order logic (FOL) with Prolog syntax. After extracting regulatory concepts and IFC concepts from compliance documents using pattern-matching-based rules, Zhang and El-Gohary [101] then identified the relationship between each pair of regulatory concepts and IFC concepts to develop an IFC extension model combining compliance-related information. Li et al. [44] utilized POS tagging, Gazetteer Lookup, and chunking to extract spatial constraints from spatial configurations.
Many strategies have also been applied to improve the performance of regulation extraction. For example, Zhou and El-Gohary [110] introduced ontology knowledge to help recognize the semantic features of the text better, improving the rule-based extraction performance. Since almost all rule-based extraction methods use POS tagging, the extraction accuracy largely depends on the accuracy of POS tagging. Therefore, Xue and Zhang [93] proposed error-driven transformational rules to improve the existing taggers, which eliminated 82.7% of the errors in the POS tagging results of building codes. On the other hand, rule-based extraction is limited and usually relies on predefined vocabularies involving heavy feature engineering. To solve this problem, Zhong et al. [109] proposed a two-step deep neural network, which first combined the Bi-LSTM and the CRF to extract the entities and then predicted the constraint patterns between those entities with a combination of LSTM and MLP. Unlike rule-based extraction methods, the two-step deep neural network can be applied without complex handcrafted feature engineering.

3.3.3.2. Classification/clustering. Regulatory documents usually contain other types of text. Therefore, the classification task in the ACC domain plays a filtering role by identifying sentences or paragraphs containing constraints to improve the efficiency and accuracy of rule extraction. For example, Salama and El-Gohary [74] proposed a text classification model based on SVM to classify the parts of regulatory documents into different categories to reduce the complexity of subsequent constraint extraction. P. Zhou and El-Gohary [111] proposed an ML-based approach to categorize and group regulatory documents hierarchically, while Peng Zhou and El-Gohary [112] further added ontology knowledge to the model to achieve better performance.

3.3.3.3. Checking. The checking process of ACC in the earliest related paper was carried out manually [73]. To make it more automatic, Li et al. [44] defined a mechanism in the geographical information system to check compliance by executing multiple spatial rules in a logical sequence. Based on the research of J. Zhang and El-Gohary [102], Zhang and El-Gohary [100] systematically proposed an automatic ACC system, including regulation extraction and
3.3.3. Automated compliance checking transformation, fact extraction and transformation, and compliance
reasoning. The regulation and fact were transformed into logic rules and
3.3.3.1. Knowledge extraction/representation/retrieval. The earliest logic facts using the FOL format, and then a B-Prolog’s reasoner was


used to reason according to the logic rules and logic facts to complete compliance checking.

3.3.4. Building information modeling

3.3.4.1. Knowledge extraction/retrieval. With the help of the International Framework for Dictionaries (IFD), Lin et al. [50] mapped the keywords queried by users into corresponding IFC entities. A graph-based path search method was then applied to find the relations between entities within the IFC graph model to realize intelligent fine-grained retrieval of BIM data. Liu et al. [53] used explicit semantic analysis technology to build an external knowledge repository specific to the BIM product model from IFC4 release documents to enhance product model retrieval via two-step enhancements: concept expansion and re-ranking based on local context analysis. Marzouk and Enaba [59] developed a dynamic text analysis model of BIM contracts and correspondence by extracting keywords and essential terms from construction contracts and closely monitoring them in BIM change requests. Dawood et al. [20] also developed a design change tracking system that automatically extracted the change type, target object, content, and other information from the change request text. The system then compared the old and new IFC documents to judge whether the design change request was met. After extracting keywords related to equipment installation from a real-world facility installation information table and conducting semantic disambiguation based on IFD, Xie et al. [91] constructed a hierarchy tree from the BIM model to display devices’ spatial information and carried out a matrix matching operation on it to match the real-world facility with the BIM data.

3.3.4.2. Classification/clustering. Jung and Lee [35] built an automatic classification model for BIM cases to compare the classification performance of unsupervised clustering methods (LSA and LDA) and a supervised classification method (SVM), and concluded that LDA produced the highest F1 score. Likewise, based on LDA, Choo et al. [18] conducted a topic clustering analysis of BIM-related research literature to explore the research and development trends of BIM. Zhou et al. [113] trained a Bayesian classifier for sentiment classification of user comments to analyze users’ attitudes towards each BIM application, then used LDA to cluster the negative comments to further identify the user problems they reflect.
In addition to the above two categories, the application of NLP in BIM usually involves ACC. Relevant studies such as [68,73,100,101] have been introduced in Subsection 3.3.3.

3.3.5. Risk management

3.3.5.1. Knowledge extraction/retrieval. Siu et al. [76] applied an NLP software package (IBM SPSS Modeler) to identify 16 New Engineering Contract (NEC) risk categories from unstructured textual descriptions of Hong Kong NEC projects and analyzed risk ratings using a decision tree. Based on extracted SVO tuples, an approach inspired by Al Qady and Kandil [5], Lee et al. [43] developed a rule-based automatic contract risk extraction model for poisonous clauses in international construction contracts, while Lee et al. [41] proposed a proactive risk assessment model from the perspective of the contractor to identify potential contractual risks that may raise disputes in contract conditions modified by the owner.

3.3.5.2. Classification/clustering. Lee and Yi [42] combined unstructured bidding documents processed by LDA with structured numeric data to fit an ML-based classifier that identifies the risk level of bidding documents. Jallan and Ashuri [32] categorized the risk factors of sentences extracted from the 10-K documents of publicly traded construction companies using an extended Word2Vec model (FastText) and cosine similarity.

3.3.6. Others
Note that there are some new directions in the “Others” category in Table 3 that do not appear in the results of keyword clustering, such as SMA, SU, and MRA. SMA in particular has as many as 8 relevant articles. One possible reason it was not identified during keyword clustering is that the articles in “Social Media Analysis” did not use unified keywords. For example, the related keywords those articles used are “social network”, “social network analysis”, “social factor”, “public opinion”, “public attitude”, and “news analysis”. However, fundamentally, all these studies were based on “Social Media”. Having a certain number of relevant studies but no unified keywords may indicate a new hot research trend or direction. On the other hand, the keyword inconsistency is not obvious in SU and MRA because the articles in these two domains were all published in 2020. The latest publications in new domains might also signal new research directions.

3.3.6.1. Social media analysis. Social media data are often used to analyze public opinion, attitude, and sentiment towards certain public events/projects. The common media platforms are Twitter and Weibo. For example, Jiang et al. [34] crawled public messages and detailed information about the Three Gorges Project from Weibo to conduct sentiment classification using a lexicon-based sentiment analysis algorithm. Similarly, Tang et al. [81] carried out a sentiment analysis on the Twitter data of four groups (construction workers, construction companies, construction unions, and construction media) in the United States to understand the views of different construction communities on the construction industry and the focus of their daily concerns. Nik-Bakht and El-Diraby [69] also collected feeds for 23 (out of the top 100) infrastructure projects in North America and analyzed the data along two dimensions, topic modeling and sentiment classification, utilizing a new term importance measure based on modified TF-IDF. However, the above studies adopted only two or three sentiment categories. Wang et al. [88], therefore, extended the emotions to seven (interest, disgust, happiness, sadness, fear, shock, anger) to analyze Chinese public attitudes towards off-site construction in more detail based on Weibo data. Following up on [81], Tang et al. [80] further crawled the Weibo data of four construction groups in China to compare the similarities and differences in public perspectives on the construction industry in China and the United States. As time passes, the focus of public concern gradually changes. To capture the changes, Xue et al. [92] developed a dynamic topic model based on LDA to reveal the trends in public concerns over time and the relevance between each public concern and stakeholder. News reports are also a kind of social media; they contain valuable information obtained by analyzing facts and interviewing experts. For example, Zhou et al. [114] combined keyword association rule mining (Apriori) and NER to extract infrastructure failure interdependencies (IFIs) and identify affected stakeholders from the social news of water pipe bursts, providing valuable references for decision-makers in the process of infrastructure management. Li et al. [47] identified 79 obstacle factors from news reports on China’s prefabricated buildings (PBs) via content analysis and ranked the factors according to their importance calculated by TF-IDF to help the stakeholders in the Chinese PBs industry better understand the barriers hindering the industry’s development.

3.3.6.2. Scene understanding. To understand text-based on-site inspection scenes, Lin et al. [49] introduced keyword extraction and topic modeling to identify the major concerns about on-site issues and their dynamics for a better decision-making process. For image scene understanding, Liu et al. [54] developed a cross-modal NLG model combining word embedding, CNN, and LSTM to generate sentences describing the main content of an image, which is of great significance for vision-based intelligent monitoring.
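
Several studies summarized in Section 3.3 weight terms by TF-IDF (e.g., [47,69]) or compare texts by cosine similarity (e.g., [32], which applies it to FastText embeddings). As an illustration only, and not a reproduction of any cited method, the following Python sketch computes TF-IDF weights over a tiny made-up corpus and the cosine similarity between the resulting sparse vectors. The corpus, the function names `tfidf` and `cosine`, and the choice of a smoothed IDF, ln((1 + n) / (1 + df)) + 1, are assumptions introduced here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times smoothed IDF (assumed variant).
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical accident-style snippets, invented for this example.
corpus = [
    "worker fell from scaffold without harness".split(),
    "worker fell from ladder during inspection".split(),
    "crane operator reported hydraulic leak".split(),
]
vectors = tfidf(corpus)
sim_01 = cosine(vectors[0], vectors[1])  # the two fall reports share terms
sim_02 = cosine(vectors[0], vectors[2])  # no shared terms
top_terms = sorted(vectors[0], key=vectors[0].get, reverse=True)[:3]
```

In this toy corpus the two fall-related reports share terms and therefore have a positive similarity, while the crane report shares none with the first and scores zero. The terms that occur in only one document receive the highest weights, which is the property that the factor-ranking studies rely on.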


3.3.6.3. Maintenance requests analysis. With the popularization of informatization, the number of electronic maintenance requests has increased dramatically. To assist in dealing with massive numbers of electronic maintenance requests, Bortolini and Forcada [11] used word cloud technology to visualize keywords extracted from the requests, identifying the common keywords of maintenance requests with different severity and thus increasing the productivity of facility managers in making control and preventive strategies. Mo et al. [64] transformed the task of maintenance staff assignment into a classification problem and fitted an ML-based classification model to determine the appropriate workforce by learning from past staff assignment cases. In the end, the logistic regression model achieved the best performance in predicting crew and maintenance priority.

3.3.6.4. Other applications. Other applications of NLP in the construction field include CM, SRA, and WCA. For example, Williams and Gong [89] developed an ensemble stacking model to recognize projects completed near the original low bid and projects completed with large cost overruns by using N-Gram, TF-IDF, and SVD in an integrated software package (RapidMiner). Lin et al. [52] analyzed the interview corpus of 25 construction industry practitioners using a text processing tool (Leximancer) to generate concept maps and topics and discuss the influence of stakeholders on the implementation of social responsibility. Zheng et al. [104] crawled a large number of job advertisements to extract the competency topics of the construction project manager (CPM) emphasized in the advertisements, and thus tracked the demand for CPM competencies at the industry level and showed how companies of different scales emphasize different CPM competencies.

4. Discussion

This section discusses the current challenges, possible solutions, and future research trends of NLP applications in the construction field. Though purely theoretical challenges are important, such as the proposal of new NLP algorithms or the improvement of old algorithms, they are not the focus of this section.

4.1. Challenges

First of all, data has become the basic threshold for research in the era of big data and artificial intelligence. It is difficult to fit a sufficiently intelligent and general AI agent without enough data. The most advanced CV and NLP studies are based on large datasets with tens of millions of samples. For example, ImageNet [21] is a database commonly used in the CV area, containing 14 million images. The training corpus of GPT-3 reaches 45 Terabytes [12]. However, most datasets used in the selected articles are not very large: 87.8% of them are smaller than 10,000 (Table 5). In addition, researchers in this field were reluctant to disclose and share the datasets they processed/labeled, making the relevant datasets difficult to access. The phenomenon of data monopoly in this field is serious. For example, nearly 40% of the selected articles did not publish data or data sources (Table 4), and only one study [54] shared the processed and annotated data (Table 5). Therefore, breaking the data monopoly and obtaining enough high-quality datasets in this field is also a great challenge. The third challenge is related to data diversity. With the maturity of hardware technology, various equipment has gradually been used in many construction projects to assist and facilitate daily management. Different devices produce data in different formats or types, including text data (e.g., from electronic documents), image data (e.g., from surveillance cameras or drones), sensor data (e.g., from wearable devices), and audio data (e.g., recordings of radio calls). Thus, the challenge mainly lies in fusing different kinds of data to develop a more comprehensive model with higher performance. Future studies, in this case, might no longer be pure NLP tasks, with NLP being just one part of the research.
The other challenges may lie in the degree of automation/intelligence, mainly in knowledge extraction and reasoning. Knowledge extraction in many of the selected articles is based on manually predefined rules. For example, many regulations/constraints in the ACC domain were extracted according to handcrafted rules. On the other hand, the current reasoning method in ACC is low-level reasoning based on First Order Logic (FOL), which requires that the extracted regulations/constraints be logically represented in FOL format for the subsequent reasoning. Therefore, the current ACC implementations can only be called semi-automatic in a strict sense, and advanced extraction and high-level reasoning models are needed to eliminate these transformation rules and achieve full automation. However, when trying to implement full automation and high-level reasoning, the challenge of high-level text understanding arises because both knowledge extraction and reasoning are based on the understanding of the text. In short, full automation and its supporting methods, high-level understanding and reasoning, are the technical challenges both at present and in the future.

4.2. Possible solutions

The best way to solve the challenges related to data scale and monopoly is to encourage researchers to open-source data and code, which greatly helps expand datasets and improve models through the power of the community. Some business data can also be shared after desensitization. In addition, web crawlers can automatically and quickly obtain a large amount of public data from the World Wide Web, and have become one of the popular means of data acquisition in the context of the big data era (one of the burst keywords in Section 2.4). Therefore, the reasonable use of web crawlers can also solve the problem of insufficient data to a certain extent. For studies that have few samples and cannot expand their data, some targeted methods can be adopted as appropriate, such as semi-supervised learning, few-shot learning, transfer learning, knowledge distillation, federated transfer learning, and pre-trained model methods. Both federated transfer learning and pre-trained model methods are based on the idea of transfer learning. Federated transfer learning enables encrypted, shared training on sensitive data, while the pre-trained model method trains the model on a large corpus and then fine-tunes it for specific NLP tasks.
For addressing the other challenges, developing an end-to-end neural network may be the better choice. Firstly, compared with other methods, a neural network can more easily achieve the cross-modal fusion of multiple types of data, which addresses the challenge of data diversity. Besides, an end-to-end network can eliminate the intermediate manual processes to the greatest extent, so that the whole process from raw data to results is completed by the model. The end-to-end NLG neural network is one of the fully automatic models that can convert other types of data to text form, which has various applications such as regulation/constraint extraction and answer generation. The middle-layer output of the NLG model can be regarded as the representation of the transformed knowledge and can be used as a feature extractor. Neural networks can also be used for reasoning over text. For example, Gupta et al. [29] developed an end-to-end text reasoning network based on Neural Module Networks (NMN), providing an interpretable neural network model for text reasoning. Combining the NLG and reasoning networks into one end-to-end network is also a feasible way to go from knowledge extraction to reasoning in one step, which might be suitable for ACC. Finally, dynamic representation methods, such as BERT and GPT, are preferred for word/sentence/text representation since they have the highest level of text understanding ability and achieve SOTA performance. Some pioneering studies that combined the above technologies have been carried out in the construction field. For example, Zhong et al. [105] developed an NLG model based on the BERT model to automatically generate answers from the relevant paragraphs, and Liu et al. [54] implemented a cross-modal NLG model to generate description sentences for images.
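
For contrast with the end-to-end approach, the rule-plus-fact pattern that Section 4.1 describes as semi-automatic can be illustrated on a toy scale. The Python sketch below treats each rule as a single universally quantified constraint over one predicate (for example, every door width must be at least some threshold) and checks it against a list of facts. It is not the B-Prolog system of [100] and does not implement general FOL inference; the predicate names, the numeric thresholds, and the `Rule` and `check` helpers are all hypothetical and were chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A fact is (predicate, entity, value), e.g. ("door_width_mm", "D1", 850.0).
Fact = Tuple[str, str, float]


@dataclass
class Rule:
    """A hand-written constraint: every fact with `predicate` must pass `test`."""
    name: str
    predicate: str
    test: Callable[[float], bool]


def check(rules: List[Rule], facts: List[Fact]) -> Dict[str, List[str]]:
    """Return, for each rule name, the entities whose facts violate the rule."""
    violations: Dict[str, List[str]] = {rule.name: [] for rule in rules}
    for rule in rules:
        for predicate, entity, value in facts:
            if predicate == rule.predicate and not rule.test(value):
                violations[rule.name].append(entity)
    return violations


# Hypothetical rules; the thresholds are invented, not taken from any code.
rules = [
    Rule("min_door_width", "door_width_mm", lambda w: w >= 800.0),
    Rule("max_stair_riser", "stair_riser_mm", lambda h: h <= 180.0),
]
# Hypothetical facts, standing in for values extracted from a design model.
facts = [
    ("door_width_mm", "D1", 850.0),
    ("door_width_mm", "D2", 760.0),
    ("stair_riser_mm", "S1", 175.0),
]
report = check(rules, facts)
```

Every rule in `rules` has to be written by hand from the regulation text, and every fact has to be mapped to a matching predicate name. That manual transformation step is what the end-to-end models discussed above aim to remove.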


Table 7
A simple roadmap for the future trends.

| | At present | Next 3–5 years | Distant future |
| Industry status | Low-level automation | Whole process automation | Multi-agent automation cooperation |
| Data | Data monopoly | More open-source research; public/web data received great attention | Open source becomes standard |
| Dominant technologies | Machine learning and neural networks | End-to-end pre-trained neural networks | More advanced models |
| Research directions | AA/SM, BIM, ACC; pure NLP research | AA/SM, BIM, ACC; diversify and concretize (SMA, SU, MRA, SR, etc.); preliminary interdisciplinary research (SU + ACC) | Interdisciplinary research is the norm |

4.3. Future trends

Based on the above discussion of the current research status, this section boldly discusses the future trends of NLP in construction-related fields. A simple roadmap is given in Table 7. Note that this section mainly discusses the research trends in the next 3–5 years; the distant future is beyond the scope of the discussion.
First of all, open access to data and source code will be one of the possible trends in this field. By bringing together many excellent researchers and ideas, the research will achieve iterative development and make continuous progress. A good example is deep learning, whose recent explosive development and brilliant achievements are due to the joint efforts of many open-source tools, public data, and a wide range of excellent developers. In addition, given their great advantages, web crawlers will play a more significant role in data acquisition in the future. As mentioned above, there will be less pure NLP research and more cross-modal studies based on multi-type data fusion in the future. In terms of technologies and tools, end-to-end neural networks, especially end-to-end pre-trained neural network models, will be the mainstream technology in this field. The end-to-end pre-trained neural network is an ideal model that might address the challenges of data scale, cross-modal data fusion, full automation, high-level understanding, and high-level reasoning at the same time. Currently, according to Section 3, there are just a few studies based on pre-trained neural networks. However, given the advantages mentioned above, research based on end-to-end pre-trained neural networks will see explosive growth in the next few years, especially in applications of higher-level understanding and reasoning.
Trends for future research directions mainly consist of two parts. From the analysis in Section 2, the current hot research topics related to NLP in construction domains include accident analysis/safety management, BIM, and ACC. However, the NLP technologies used in these domains are still not intelligent enough, which is mainly reflected in low-level text understanding and reasoning ability and, as a consequence, low-level automation. Therefore, these research domains will continue to receive significant attention in the future with the support of more advanced technologies like the end-to-end pre-trained model. In addition, there are some emerging directions according to the “Others” category in Table 3, such as SMA, SU, MRA, SR, and WCA, indicating that future research directions may gradually diversify and concretize. These emerging directions may also be applied in or combined with the original hot domains, resulting in new applications. For example, the combination of SU and ACC technologies can monitor workers’ safety behavior. In this case, the intelligent agent needs to identify, extract, and understand the relevant safety construction rules from the specification documents and then assist the vision system in capturing the unsafe behavior of workers based on SU. Similarly, SU and ACC can be used for compliance inspection of construction processes. Another feasible example is to analyze the role of social media in worker safety management and accident prevention.

5. Conclusion

In this paper, 91 NLP-related articles in construction-related areas were selected manually from thousands of articles. A systematic scientometric analysis was then carried out using CiteSpace and VOSviewer to reveal the development patterns, crucial studies, and research trends of NLP applications in those areas. After that, the authors comprehensively introduced the NLP applications of the selected articles from three aspects: datasets and sources, technologies and tools, and applications and progress. At the end of this paper, the authors discussed some challenges faced by current NLP-related research in construction-related fields, followed by some possible solutions and a forecast of future research trends. The main contributions and significance of this paper are as follows: 1) This paper might be the first review paper focusing on NLP applications in construction-related fields; 2) A systematic scientometric analysis and multi-aspect summary of NLP-related research in the construction field were conducted; 3) The results of the multi-aspect summary provide sufficient support for researchers who want to start NLP research in the construction field, from data acquisition and technology and tool selection to the current research status; 4) Some enlightening ideas for further research directions were also provided to the readers by discussing the current challenges, possible solutions, and future research trends.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the Shenzhen Science and Technology Innovation Committee Grant (PJ#JCYJ20180507181647320). The conclusions herein are those of the authors and do not necessarily reflect the views of the sponsoring agency.

References

[1] I.T. Afolabi, J. Badejo, S.A. Adubi, O.A. Odetunmibi, Identifying major civil engineering research influencers and topics using social network analysis, Cogent Engineering 7 (2020) 1835147, https://doi.org/10.1080/23311916.2020.1835147.
[2] M. Al Qady, A. Kandil, Automatic classification of project documents on the basis of text content, J. Comput. Civ. Eng. 29 (2015) 04014043, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000338.
[3] M. Al Qady, A. Kandil, Automatic clustering of construction project documents based on textual similarity, Autom. Constr. 42 (2014) 36–49, https://doi.org/10.1016/j.autcon.2014.02.006.
[4] M. Al Qady, A. Kandil, Document discourse for managing construction project documents, J. Comput. Civ. Eng. 27 (2013) 466–475, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000201.
[5] M. Al Qady, A. Kandil, Concept relation extraction from construction documents using natural language processing, J. Constr. Eng. Manag. 136 (2010) 294–302, https://doi.org/10.1061/(ASCE)CO.1943-7862.0000131.
[6] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: 3rd International Conference on Learning Representations,


ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, [31] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging,
2015. Available at: http://arxiv.org/abs/1409.0473 (Accessed June 12, 2021). CoRR (2015) vol. abs/1508.01991. Available at: http://arxiv.org/abs/1508.01
[7] H. Baker, M.R. Hallowell, A.J.-P. Tixier, Automatically learning construction 991. (Accessed 28 August 2021).
injury precursors from text, Autom. Constr. 118 (2020), 103145, https://doi.org/ [32] Y. Jallan, B. Ashuri, Text mining of the securities and exchange commission
10.1016/j.autcon.2020.103145. financial filings of publicly traded construction firms using deep learning to
[8] H. Baker, M.R. Hallowell, A.J.-P. Tixier, AI-based prediction of independent identify and assess risk, J. Constr. Eng. Manag. 146 (2020), https://doi.org/
construction safety outcomes from universal attributes, Autom. Constr. 118 10.1061/(ASCE)CO.1943-7862.0001932.
(2020), 103146, https://doi.org/10.1016/j.autcon.2020.103146. [33] Y. Jallan, E. Brogan, B. Ashuri, C.M. Clevenger, Application of natural language
[9] Y. Bengio, R. Ducharme, P. Vincent, A Neural Probabilistic Language Model, in: processing and text mining to identify patterns in construction-defect litigation
Advances in Neural Information Processing Systems., 2001, pp. 932–938. cases, J. Leg. Aff. Disput. Resolut. Eng. Constr. 11 (2019), https://doi.org/
Available at: https://www.jmlr.org/papers/volume3/tmp/bengio03a.pdf. 10.1061/(ASCE)LA.1943-4170.0000308.
(Accessed 12 June 2021). [34] H. Jiang, P. Lin, M. Qiang, Public-opinion sentiment analysis for large hydro
[10] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, The Journal of projects, J. Constr. Eng. Manag. 142 (2016) 05015013, https://doi.org/10.1061/
Machine Learning Research 3 (2003) 993–1022, https://doi.org/10.5555/ (ASCE)CO.1943-7862.0001039.
944919.944937. [35] N. Jung, G. Lee, Automated classification of building information modeling (BIM)
[11] R. Bortolini, N. Forcada, Analysis of building maintenance requests using a text case studies by Bim use based on natural language processing (NLP) and
mining approach: building services evaluation, Build. Res. Inf. 48 (2020) unsupervised learning, Adv. Eng. Inform. 41 (2019), 100917, https://doi.org/
207–217, https://doi.org/10.1080/09613218.2019.1609291. 10.1016/j.aei.2019.04.007.
[12] T.B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, [36] J.-S. Kim, B.-S. Kim, Analysis of fire-accident factors using big-data analysis
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, method for construction areas, KSCE J. Civ. Eng. 22 (2018) 1535–1543, https://
G. Krueger, T. Henighan, R. Child, A. Ramesh, D.M. Ziegler, J. Wu, C. Winter, doi.org/10.1007/s12205-017-0767-7.
C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, [37] T. Kim, S. Chi, Accident case retrieval and analyses: using natural language
S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few- processing in the construction industry, J. Constr. Eng. Manag. 145 (2019)
shot learners, Advances in neural information processing systems 33 (2020) 04019004, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001625.
1877–1901, in: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4 [38] M. Kovacevic, J.-Y. Nie, C. Davidson, Providing answers to questions from
967418bfb8ac142f64a-Paper.pdf. (Accessed 11 September 2021). automatically collected web pages for intelligent decision making in the
[13] C. Caldas, L. Soibelman, Automating hierarchical document classification for construction sector, J. Comput. Civ. Eng. 22 (2008) 3–13, https://doi.org/
construction management information systems, Autom. Constr. 12 (2003) 10.1061/(ASCE)0887-3801(2008)22:1(3).
395–406, https://doi.org/10.1016/S0926-5805(03)00004-9. [39] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep
convolutional neural networks, in: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105. Available at: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. (Accessed 1 May 2020).
[14] C.H. Caldas, L. Soibelman, J. Han, Automated classification of construction project documents, J. Comput. Civ. Eng. 16 (2002) 234–243, https://doi.org/10.1061/(ASCE)0887-3801(2002)16:4(234).
[15] C. Chen, Searching for intellectual turning points: progressive knowledge domain visualization, Proc. Natl. Acad. Sci. 101 (2004) 5303–5310, https://doi.org/10.1073/pnas.0307513100.
[16] M.-Y. Cheng, D. Kusoemo, R.A. Gosno, Text mining-based construction site accident classification using hybrid supervised machine learning, Autom. Constr. 118 (2020), 103265, https://doi.org/10.1016/j.autcon.2020.103265.
[17] N.-W. Chi, K.-Y. Lin, S.-H. Hsieh, Using ontology-based text classification to assist job Hazard analysis, Adv. Eng. Inform. 28 (2014) 381–394, https://doi.org/10.1016/j.aei.2014.05.001.
[18] S. Choo, H. Park, T. Kim, J. Seo, Analysis of trends in Korean BIM research and technologies using text mining, Appl. Sci. 9 (2019) 4424, https://doi.org/10.3390/app9204424.
[19] A.K. Choudhary, P.I. Oluikpe, J.A. Harding, P.M. Carrillo, The needs and benefits of text mining applications on post-project reviews, Comput. Ind. 60 (2009) 728–740, https://doi.org/10.1016/j.compind.2009.05.006.
[20] H. Dawood, J. Siddle, N. Dawood, Integrating IFC and NLP for automating change request validations, Journal of Information Technology in Construction 24 (2019) 540–552, https://doi.org/10.36680/J.ITCON.2019.030.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, Li Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition. Presented at the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), IEEE, 2009, pp. 248–255, https://doi.org/10.1109/CVPR.2009.5206848.
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, Association for Computational Linguistics 1 (2018) 4171–4186, https://doi.org/10.18653/v1/N19-1423.
[23] C. Eastman, Jae-min Lee, Y. Jeong, Jin-kook Lee, Automatic rule-based checking of building designs, Autom. Constr. 18 (2009) 1011–1033, https://doi.org/10.1016/j.autcon.2009.07.002.
[24] H. Fan, H. Li, Retrieving similar cases for alternative dispute resolution in construction accidents using text mining techniques, Autom. Constr. 34 (2013) 85–91, https://doi.org/10.1016/j.autcon.2012.10.014.
[25] H. Fan, F. Xue, H. Li, Project-based as-needed information retrieval from unstructured AEC documents, J. Manag. Eng. 31 (2014), https://doi.org/10.1061/(ASCE)ME.1943-5479.0000341.
[26] W. Fang, P.E.D. Love, H. Luo, L. Ding, Computer vision for behaviour-based safety in construction: a review and future directions, Adv. Eng. Inform. 43 (2020), 100980, https://doi.org/10.1016/j.aei.2019.100980.
[27] W. Fang, H. Luo, S. Xu, P.E.D. Love, Z. Lu, C. Ye, Automated text classification of near-misses from safety reports: an improved deep learning approach, Adv. Eng. Inform. 44 (2020), 101060, https://doi.org/10.1016/j.aei.2020.101060.
[28] Y.M. Goh, C.U. Ubeynarayana, Construction accident narrative classification: an evaluation of text mining techniques, Accid. Anal. Prev. 108 (2017) 122–130, https://doi.org/10.1016/j.aap.2017.08.026.
[29] N. Gupta, K. Lin, D. Roth, S. Singh, M. Gardner, Neural module networks for reasoning over text. CoRR, vol. abs/1912.04971, Available at: http://arxiv.org/abs/1912.04971, 2019. (Accessed 5 September 2021).
[30] F.U. Hassan, T. Le, Automated requirements identification from construction contract documents using natural language processing, J. Leg. Aff. Disput. Resolut. Eng. Constr. 12 (2020), https://doi.org/10.1061/(ASCE)LA.1943-4170.0000379.
[40] J. Lafferty, A. McCallum, F.C. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Available at: https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers, 2001 (Accessed June 12, 2021).
[41] J. Lee, Y. Ham, J.-S. Yi, J. Son, Effective risk positioning through automated identification of missing contract conditions from the Contractor’s perspective based on Fidic contract cases, J. Manag. Eng. 36 (2020) 05020003, https://doi.org/10.1061/(ASCE)ME.1943-5479.0000757.
[42] J. Lee, J.-S. Yi, Predicting Project’s uncertainty risk in the bidding process by integrating unstructured text data and structured numerical data using text mining, Appl. Sci. 7 (2017) 1141, https://doi.org/10.3390/app7111141.
[43] J. Lee, J.-S. Yi, J. Son, Development of automatic-extraction model of poisonous clauses in international construction contracts using rule-based NLP, J. Comput. Civ. Eng. 33 (2019) 04019003, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000807.
[44] S. Li, H. Cai, V.R. Kamat, Integrating natural language processing and spatial reasoning for utility compliance checking, J. Constr. Eng. Manag. 142 (2016) 04016074, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001199.
[45] X. Li, Z. Jiang, B. Song, L. Liu, Long-term knowledge evolution modeling for empirical engineering knowledge, Adv. Eng. Inform. 34 (2017) 17–35, https://doi.org/10.1016/j.aei.2017.08.001.
[46] Z. Li, K. Ramani, Ontology-based design information extraction and retrieval, Artificial Intelligence for Engineering Design, Analysis and Manufacturing 21 (2007) 137–154, https://doi.org/10.1017/S0890060407070199.
[47] Z. Li, S. Zhang, Q. Meng, X. Hu, Barriers to the development of prefabricated buildings in China: a news coverage analysis, Eng. Constr. Archit. Manag. (2020), https://doi.org/10.1108/ECAM-03-2020-0195 vol. ahead-of-print.
[48] H.-T. Lin, N.-W. Chi, S.-H. Hsieh, A concept-based information retrieval approach for engineering domain-specific technical documents, Adv. Eng. Inform. 26 (2012) 349–360, https://doi.org/10.1016/j.aei.2011.12.003.
[49] J.-R. Lin, Z.-Z. Hu, J.-L. Li, L.-M. Chen, Understanding on-site inspection of construction projects based on keyword extraction and topic modeling, IEEE Access 8 (2020) 198503–198517, https://doi.org/10.1109/ACCESS.2020.3035214.
[50] J.-R. Lin, Z.-Z. Hu, J.-P. Zhang, F.-Q. Yu, A natural-language-based approach to intelligent data retrieval and representation for cloud BIM, Computer-Aided Civil and Infrastructure Engineering 31 (2016) 18–33, https://doi.org/10.1111/mice.12151.
[51] K.-Y. Lin, L. Soibelman, Knowledge-assisted retrieval of online product information in architectural/engineering/construction, J. Constr. Eng. Manag. 133 (2007) 871–879, https://doi.org/10.1061/(ASCE)0733-9364(2007)133:11(871).
[52] X. Lin, B. McKenna, C.M.F. Ho, G.Q.P. Shen, Stakeholders’ influence strategies on social responsibility implementation in construction projects, J. Clean. Prod. 235 (2019) 348–358, https://doi.org/10.1016/j.jclepro.2019.06.253.
[53] H. Liu, Y.-S. Liu, P. Pauwels, H. Guo, M. Gu, Enhanced explicit semantic analysis for product model retrieval in construction industry, IEEE Transactions on Industrial Informatics 13 (2017) 3361–3369, https://doi.org/10.1109/TII.2017.2708727.
[54] H. Liu, G. Wang, T. Huang, P. He, M. Skitmore, X. Luo, Manifesting construction activity scenes via image captioning, Autom. Constr. 119 (2020), 103334, https://doi.org/10.1016/j.autcon.2020.103334.

[55] T. Mahfouz, A. Kandil, S. Davlyatov, Identification of latent legal knowledge in differing site condition (DSC) litigations, Autom. Constr. 94 (2018) 104–111, https://doi.org/10.1016/j.autcon.2018.06.011.
[56] J. Manyika, S. Lund, J. Bughin, Digital globalization: the new era global flows, McKinsey Global Institute (2016). Available at: https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%20Digital/Our%20Insights/Digital%20globalization%20The%20new%20era%20of%20global%20flows/MGI-Digital-globalization-Full-report.ashx (Accessed August 17, 2021).
[57] P. Martinez, M. Al-Hussein, R. Ahmad, A Scientometric analysis and critical review of computer vision applications for construction, Autom. Constr. 107 (2019), 102947, https://doi.org/10.1016/j.autcon.2019.102947.
[58] M. Martínez-Rojas, R. Martín Antolín, F. Salguero-Caparrós, J.C. Rubio-Romero, Management of Construction Safety and Health Plans Based on automated content analysis, Autom. Constr. 120 (2020), 103362, https://doi.org/10.1016/j.autcon.2020.103362.
[59] M. Marzouk, M. Enaba, Text analytics to analyze and monitor construction project contract and correspondence, Autom. Constr. 98 (2019) 265–274, https://doi.org/10.1016/j.autcon.2018.11.018.
[60] L.J. McGibbney, B. Kumar, An intelligent authoring model for subsidiary legislation and regulatory instrument drafting within construction and engineering industry, Autom. Constr. 35 (2013) 121–130, https://doi.org/10.1016/j.autcon.2013.04.005.
[61] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, in: Proceedings of the International Conference on Learning Representations (ICLR 2013), 2013. Available at: http://arxiv.org/abs/1301.3781 (Accessed June 12, 2021).
[62] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems (NIPS 2013), 2013, pp. 3111–3119. Available at: https://proceedings.neurips.cc/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html. (Accessed 12 June 2021).
[63] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011. Available at: https://aclanthology.org/P09-1113.pdf. (Accessed 12 June 2021).
[64] Y. Mo, D. Zhao, J. Du, M. Syal, A. Aziz, H. Li, Automated staff assignment for building maintenance using natural language processing, Autom. Constr. 113 (2020), 103150, https://doi.org/10.1016/j.autcon.2020.103150.
[65] S. Moon, Y. Shin, B.-G. Hwang, S. Chi, Document management system using text Mining for Information Acquisition of international construction, KSCE J. Civ. Eng. 22 (2018) 4791–4798, https://doi.org/10.1007/s12205-018-1528-y.
[66] X. Na, W. Jianping, L. Jie, N. Guodong, Analysis on relationships of safety risk factors in metro construction, Journal of Engineering Science and Technology Review 9 (2016) 150–157, https://doi.org/10.25103/jestr.095.24.
[67] D. Nedeljkovic, M. Kovacevic, Building a construction project key-phrase network from unstructured text documents, J. Comput. Civ. Eng. 31 (2017) 04017058, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000708.
[68] R.A. Niemeijer, B. de Vries, J. Beetz, Freedom through constraints: user-oriented architectural design, Adv. Eng. Inform. 28 (2014) 28–36, https://doi.org/10.1016/j.aei.2013.11.003.
[69] M. Nik-Bakht, T. El-Diraby, Project collective mind: unlocking project discussion networks, Autom. Constr. 84 (2017) 50–69, https://doi.org/10.1016/j.autcon.2017.08.026.
[70] J. Niu, R.R.A. Issa, Developing taxonomy for the domain ontology of construction contractual semantics: a case study on the AIA A201 document, Adv. Eng. Inform. 29 (2015) 472–482, https://doi.org/10.1016/j.aei.2015.03.009.
[71] G. Robinson, Global construction market to grow $8 trillion by 2030: driven by China, US and India, Global Construction. (2015). Available at: https://www.ice.org.uk/ICEDevelopmentWebPortal/media/Documents/News/ICE%20News/Global-Construction-press-release.pdf (Accessed June 12, 2021).
[72] R. Romero-Silva, S. de Leeuw, Learning from the past to shape the future: a comprehensive text mining analysis of OR/MS reviews, Omega 100 (2021), 102388, https://doi.org/10.1016/j.omega.2020.102388.
[73] D.A. Salama, N.M. El-Gohary, Automated compliance checking of construction operation plans using a deontology for the construction domain, J. Comput. Civ. Eng. 27 (2013) 681–698, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000298.
[74] D.M. Salama, N.M. El-Gohary, Semantic text classification for supporting automated compliance checking in construction, J. Comput. Civ. Eng. 30 (2016) 04014106, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000301.
[75] L. Shen, H. Yan, H. Fan, Y. Wu, Y. Zhang, An integrated system of text mining technique and case-based reasoning (TM-CBR) for supporting green building design, Build. Environ. 124 (2017) 388–401, https://doi.org/10.1016/j.buildenv.2017.08.026.
[76] M.-F.F. Siu, W.-Y.J. Leung, W.-M.D. Chan, A data-driven approach to identify-quantify-analyse construction risk for Hong Kong NEC projects, J. Civ. Eng. Manag. 24 (2018) 592–606, https://doi.org/10.3846/jcem.2018.6483.
[77] J. Sun, K. Lei, L. Cao, B. Zhong, Y. Wei, J. Li, Z. Yang, Text visualization for construction document information management, Autom. Constr. 111 (2020), 103048, https://doi.org/10.1016/j.autcon.2019.103048.
[78] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems (NIPS 2014), 2014, pp. 3104–3112. Available at: https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html. (Accessed 12 June 2021).
[79] X. Tan, A. Hammad, P. Fazio, Automated code compliance checking for building envelope design, J. Comput. Civ. Eng. 24 (2010) 203–211, https://doi.org/10.1061/(ASCE)0887-3801(2010)24:2(203).
[80] L. Tang, L. Griffith, M. Stevens, M. Hardie, Social media analytics in the construction industry comparison study between China and the United States, Eng. Constr. Archit. Manag. 27 (2020) 1877–1889, https://doi.org/10.1108/ECAM-12-2019-0717.
[81] L. Tang, Y. Zhang, F. Dai, Y. Yoon, Y. Song, R.S. Sharma, Social media data analytics for the U.S. construction industry: preliminary study on twitter, J. Manag. Eng. 33 (2017) 04017038, https://doi.org/10.1061/(ASCE)ME.1943-5479.0000554.
[82] A.J.-P. Tixier, M.R. Hallowell, B. Rajagopalan, D. Bowman, Construction safety clash detection: identifying safety incompatibilities among fundamental attributes using data mining, Autom. Constr. 74 (2017) 39–54, https://doi.org/10.1016/j.autcon.2016.11.001.
[83] A.J.-P. Tixier, M.R. Hallowell, B. Rajagopalan, D. Bowman, Automated content analysis for construction safety: a natural language processing system to extract precursors and outcomes from unstructured injury reports, Autom. Constr. 62 (2016) 45–56, https://doi.org/10.1016/j.autcon.2015.11.001.
[84] A.J.-P. Tixier, M.R. Hallowell, B. Rajagopalan, D. Bowman, Application of machine learning to construction injury prediction, Autom. Constr. 69 (2016) 102–114, https://doi.org/10.1016/j.autcon.2016.05.016.
[85] N. Torkanfar, E.R. Azar, Quantitative similarity assessment of construction projects using WBS-based metrics, Adv. Eng. Inform. 46 (2020), 101179, https://doi.org/10.1016/j.aei.2020.101179.
[86] N. Ur-Rahman, J.A. Harding, Textual data Mining for Industrial Knowledge Management and Text Classification: a business oriented approach, Expert Syst. Appl. 39 (2012) 4729–4739, https://doi.org/10.1016/j.eswa.2011.09.124.
[87] N.J. van Eck, L. Waltman, Visualizing bibliometric networks, in: Y. Ding, R. Rousseau, D. Wolfram (Eds.), Measuring Scholarly Impact, Springer International Publishing, 2014, pp. 285–320, https://doi.org/10.1007/978-3-319-10377-8_13.
[88] Y. Wang, H. Li, Z. Wu, Attitude of the Chinese public toward off-site construction: a text mining study, J. Clean. Prod. 238 (2019), 117926, https://doi.org/10.1016/j.jclepro.2019.117926.
[89] T.P. Williams, J. Gong, Predicting construction cost overruns using text mining, numerical data and ensemble classifiers, Autom. Constr. 43 (2014) 23–29, https://doi.org/10.1016/j.autcon.2014.02.014.
[90] H. Wu, G. Shen, X. Lin, M. Li, B. Zhang, C.Z. Li, Screening patents of ICT in construction using deep learning and NLP techniques, Eng. Constr. Archit. Manag. 27 (2020) 1891–1912, https://doi.org/10.1108/ECAM-09-2019-0480.
[91] Q. Xie, X. Zhou, J. Wang, X. Gao, X. Chen, L. Chun, Matching real-world facilities to building information modeling data using natural language processing, IEEE Access 7 (2019) 119465–119475, https://doi.org/10.1109/ACCESS.2019.2937219.
[92] J. Xue, G.Q. Shen, Y. Li, J. Wang, I. Zafar, Dynamic stakeholder-associated topic modeling on public concerns in Megainfrastructure projects: case of Hong Kong–Zhuhai–Macao bridge, J. Manag. Eng. 36 (2020) 04020078, https://doi.org/10.1061/(ASCE)ME.1943-5479.0000845.
[93] X. Xue, J. Zhang, Building codes part-of-speech tagging performance improvement by error-driven transformational rules, J. Comput. Civ. Eng. 34 (2020) 04020035, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000917.
[94] H. Yan, N. Yang, Y. Peng, Y. Ren, Data Mining in the Construction Industry: present status, opportunities, and future trends, Autom. Constr. 119 (2020), 103331, https://doi.org/10.1016/j.autcon.2020.103331.
[95] C.L. Yeung, C.F. Cheung, W.M. Wang, E. Tsui, A knowledge extraction and representation system for narrative analysis in the construction industry, Expert Syst. Appl. 41 (2014) 5710–5722, https://doi.org/10.1016/j.eswa.2014.03.044.
[96] C.L. Yeung, C.F. Cheung, W.M. Wang, E. Tsui, W.B. Lee, Managing knowledge in the construction industry through computational generation of semi-fiction narratives, J. Knowl. Manag. 20 (2016) 386–414, https://doi.org/10.1108/JKM-07-2015-0253.
[97] W. Yu, J. Hsu, Content-based text mining technique for retrieval of CAD documents, Autom. Constr. 31 (2013) 65–74, https://doi.org/10.1016/j.autcon.2012.11.037.
[98] F. Zhang, A hybrid structured deep neural network with Word2vec for construction accident causes classification, Int. J. Constr. Manag. (2019) 1–21, https://doi.org/10.1080/15623599.2019.1683692.
[99] F. Zhang, H. Fleyeh, X. Wang, M. Lu, Construction site accident analysis using text mining and natural language processing techniques, Autom. Constr. 99 (2019) 238–248, https://doi.org/10.1016/j.autcon.2018.12.016.
[100] J. Zhang, N.M. El-Gohary, Integrating semantic NLP and logic reasoning into a unified system for fully-automated code checking, Autom. Constr. 73 (2017) 45–57, https://doi.org/10.1016/j.autcon.2016.08.027.
[101] J. Zhang, N.M. El-Gohary, Extending building information models Semiautomatically using semantic natural language processing techniques, J. Comput. Civ. Eng. 30 (2016), https://doi.org/10.1061/(ASCE)CP.1943-5487.0000536.
[102] J. Zhang, N.M. El-Gohary, Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking, J. Comput. Civ. Eng. 30 (2015), https://doi.org/10.1061/(ASCE)CP.1943-5487.0000346.
[103] Jiansong Zhang, N.M. El-Gohary, Automated information transformation for automated regulatory compliance checking in construction, J. Comput. Civ. Eng. 29 (2015), https://doi.org/10.1061/(ASCE)CP.1943-5487.0000427.

[104] J. Zheng, Q. Wen, M. Qiang, Understanding demand for project manager competences in the construction industry: data mining approach, J. Constr. Eng. Manag. 146 (2020) 04020083, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001865.
[105] B. Zhong, W. He, Z. Huang, P.E.D. Love, J. Tang, H. Luo, A building regulation question answering system: a deep learning methodology, Adv. Eng. Inform. 46 (2020), 101195, https://doi.org/10.1016/j.aei.2020.101195.
[106] B. Zhong, H. Li, H. Luo, J. Zhou, W. Fang, X. Xing, Ontology-based semantic modeling of knowledge in construction: classification and identification of hazards implied in images, J. Constr. Eng. Manag. 146 (2020) 04020013, https://doi.org/10.1061/(ASCE)CO.1943-7862.0001767.
[107] B. Zhong, X. Pan, P.E.D. Love, L. Ding, W. Fang, Deep learning and network analysis: classifying and visualizing accident narratives in construction, Autom. Constr. 113 (2020), 103089, https://doi.org/10.1016/j.autcon.2020.103089.
[108] B. Zhong, X. Pan, P.E.D. Love, J. Sun, C. Tao, Hazard analysis: a deep learning and text mining framework for accident prevention, Adv. Eng. Inform. 46 (2020), 101152, https://doi.org/10.1016/j.aei.2020.101152.
[109] B. Zhong, X. Xing, H. Luo, Q. Zhou, H. Li, T. Rose, W. Fang, Deep learning-based extraction of construction procedural constraints from construction regulations, Adv. Eng. Inform. 43 (2020), 101003, https://doi.org/10.1016/j.aei.2019.101003.
[110] P. Zhou, N. El-Gohary, Ontology-based automated information extraction from building energy conservation codes, Autom. Constr. 74 (2017) 103–117, https://doi.org/10.1016/j.autcon.2016.09.004.
[111] P. Zhou, N. El-Gohary, Domain-specific hierarchical text classification for supporting automated environmental compliance checking, J. Comput. Civ. Eng. 30 (2016) 04015057, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000513.
[112] Peng Zhou, N. El-Gohary, Ontology-based multilabel text classification of construction regulatory documents, J. Comput. Civ. Eng. 30 (2016) 04015058, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000530.
[113] S. Zhou, S.T. Ng, S.H. Lee, F.J. Xu, Y. Yang, A domain knowledge incorporated text mining approach for capturing user needs on BIM applications, Eng. Constr. Archit. Manag. 27 (2020) 458–482, https://doi.org/10.1108/ECAM-02-2019-0097.
[114] S. Zhou, S.T. Ng, Y. Yang, J.F. Xu, Delineating infrastructure failure interdependencies and associated stakeholders through news mining: the case of Hong Kong’s water pipe bursts, J. Manag. Eng. 36 (2020) 04020060, https://doi.org/10.1061/(ASCE)ME.1943-5479.0000821.
[115] Y. Zou, A. Kiviniemi, S.W. Jones, Retrieving similar cases for construction project risk management using natural language processing techniques, Autom. Constr. 80 (2017) 66–76, https://doi.org/10.1016/j.autcon.2017.04.003.
