
Construction and Building Materials 258 (2020) 120291

Contents lists available at ScienceDirect

Construction and Building Materials


journal homepage: www.elsevier.com/locate/conbuildmat

Patch-based weakly supervised semantic segmentation network for crack detection

Zhiming Dong 1, Jiajun Wang 1, Bo Cui *, Dong Wang, Xiaoling Wang
State Key Laboratory of Hydraulic Engineering Simulation and Safety, Tianjin 300072, China

Highlights

• A weakly supervised semantic segmentation network for crack detection is proposed.
• The proposed patch-based method can be flexibly applied to images of different sizes.
• The proposed method can significantly reduce the annotation workload of the semantic segmentation method.

Article info

Article history:
Received 5 February 2020
Received in revised form 5 June 2020
Accepted 16 July 2020
Available online 1 September 2020

Keywords:
Crack detection
Deep learning
Weakly supervised learning
Semantic segmentation
Computer vision
Convolutional neural network

Abstract

Obtaining spatial and topological information for cracks in construction materials is important for the evaluation of service performance in infrastructure engineering. Manually extracting crack information from an image is a tedious process. Recently, popular deep learning-based semantic segmentation technologies have been employed to alleviate this problem. However, existing semantic segmentation methods for crack detection are fully supervised, i.e., these methods require manual annotation of data to obtain pixel-level labels for training, which is time-consuming. To solve this problem, this paper proposes a patch-based weakly supervised semantic segmentation network for crack detection. The proposed method uses image-level annotation as the supervision condition and fully considers the local similarity of the crack topology in the image. The use of patches cropped from the image as the input in this method can reduce the image complexity significantly without losing the spatial location information of the crack. A discriminative localization technique is used to extract rough location information of the crack from a trained classification network, which is then refined by a conditional random field to obtain a synthetic label. These synthetic labels can replace the manually annotated pixel-level labels for the training of the segmentation network. Thereafter, a neighborhood fusion strategy is used to merge the patches into the final output. Two datasets are employed to train and evaluate the proposed method. The results indicate that this method can achieve a performance comparable to that of fully supervised methods (an MIoU of 0.821 and F-score of 0.8 were obtained for the fully supervised method compared with an MIoU of 0.782 and F-score of 0.741 with the proposed method), while reducing the annotation workload by approximately 80%.

© 2020 Published by Elsevier Ltd.

1. Introduction

The performance and functionality of infrastructure engineering projects such as roads, bridges, and buildings are significantly affected by cracks in the construction materials. Using a visual-based method to extract key information regarding these cracks for service performance evaluations is an effective means for inspectors to evaluate the residual capacity and safety of engineering structures. At present, a large number of projects are under construction or nearing their design life. Therefore, the task of visual inspection is becoming increasingly challenging and important. Monitoring cracks and obtaining crack information statistics are time-consuming and subjective tasks. To perform crack detection more rapidly and accurately, many computer-aided visual inspection methods have been proposed. Among these methods, deep learning (DL)-based methods often achieve the best performance.

* Corresponding author: State Key Laboratory of Hydraulic Engineering Simulation and Safety, Tianjin 300072, China.
E-mail addresses: dongzm@tju.edu.cn (Z. Dong), cuib@tju.edu.cn (B. Cui).
1 These authors contributed equally to this work and should be considered co-first authors.

https://doi.org/10.1016/j.conbuildmat.2020.120291
0950-0618/© 2020 Published by Elsevier Ltd.

In recent years, DL-based computer vision methods have achieved state-of-the-art performance in various tasks, such as image classification [1], object detection [2], and semantic

segmentation [3]. For the crack detection task, an image classification method can obtain image-level semantic information; an object detection method can obtain the location and coarse shape information; and a semantic segmentation method can obtain detailed pixel-level semantic information of the crack. As the cracks shown in an image often have no fixed shape or location, the image classification and object detection methods cannot obtain sufficiently accurate crack information based on the image, such as the length and shape of the crack. However, this information is important for further quantitative analysis of the crack data [4,5]. Therefore, semantic segmentation methods are worth studying for application to the crack detection task.

The current semantic segmentation methods for crack detection are based on fully supervised learning (FSL) [6–11], i.e., the training process of the model must be supervised using manually annotated pixel-level labels. The architecture of the semantic segmentation network usually employs a convolutional neural network (CNN) to assign an initial label to each pixel, and then a loss function is designed based on the characteristics of the dataset and task. This utilizes various gradient-based parameter optimization methods to minimize the loss [3,12,13]. A fully supervised labeling process is time-consuming and limits the application of DL-based semantic segmentation methods for crack detection. For example, an image annotated at the pixel level takes approximately 15 times as long as a bounding box and 60 times as long as image-level annotation to complete [14].

Weakly supervised learning (WSL) methods use incomplete annotation data to supervise the model learning process. WSL methods can effectively solve the problem of time-consuming labeling in DL methods. In image semantic segmentation tasks, compared to the pixel-level labels used in fully supervised learning methods, WSL-based semantic segmentation methods can use bounding boxes [15,16], image-level labels [17,18], points [19], scribbles [20], etc. as supervision conditions. The workload for these annotation types is much less than that for pixel-level labels. Applying a WSL semantic segmentation method to crack detection can significantly reduce the workload of labeling, thus allowing high-precision DL methods to better assist in the automatic detection of cracks.

However, the existing WSL methods may not be able to deal well with complex cracks in images with no fixed shape and location [21–24]. Compared with the semantic segmentation methods used in other tasks, such as traffic scene segmentation or medical image segmentation, the topological structure of a crack has both self-similarity and local similarity. After a crack image is divided into patches, the topological structure of the crack is simplified, but still maintains shapes and structural features similar to the original. Therefore, an image processing method based on patches can efficiently and accurately process a crack image. To this end, a patch-based WSL method for crack detection is proposed in this paper. This method uses image-level annotations rather than pixel-level annotations as the supervision conditions.

The remainder of this paper is organized as follows: Section 2 provides a review of the relevant work involved in this study; Section 3 describes the proposed method in detail; Section 4 introduces the datasets and evaluation metrics; Section 5 provides the experimental results and discussions, and the analysis of examples thereof is provided in Section 6; finally, conclusions are presented in Section 7.

2. Related works

To reduce the data annotation workload, this paper proposes a WSL method for segmentation of a crack image, and the attention mechanism is used to improve the prediction accuracy. Related research regarding crack detection and WSL is reviewed in this section.

2.1. Crack detection

2.1.1. Traditional crack detection methods

Crack detection is a crucial task for infrastructure engineering safety assessments. Therefore, many methods have been proposed for detecting cracks in various surfaces. In recent years, several automatic crack detection methods have been designed based on digital image processing (DIP) and machine learning (ML) technologies. Methods based on DIP usually need to design a set of filters for discriminating cracks from the background by accounting for the differences in texture features between the crack and the background [25–28]. In general, these methods do not require manual annotation, i.e., they are unsupervised methods. However, while DIP-based methods are simple and efficient for processing simple crack images, they lose their effectiveness when processing more complex textures. To detect cracks in construction materials more effectively, ML-based methods have been designed. In general, ML-based methods require a manually designed feature extractor to encode the input image, and the detection task is then executed in an embedded feature space [29,30]. Compared with DIP-based methods, ML-based methods often have higher accuracy and stronger robustness; however, the design of the feature extractor depends on manually selected model parameters, making the solution subjective. In addition, ML-based methods can only extract low-level features such as color and texture; they may not be able to extract high-level semantic information, even though such information is very important for locating an object in an image.

2.1.2. Deep learning-based crack detection methods

DL technology has developed rapidly and become increasingly prevalent since 2012 [1]. DL-based computer vision methods, especially those based on a CNN, have achieved state-of-the-art performance in many relevant vision tasks [1–3]. Therefore, various CNN-based methods such as image classification [31], object detection [32,33], and semantic segmentation [6,8] have been applied to crack detection tasks. Compared with image classification and object detection, semantic segmentation methods can provide more shape and location information for the crack in an image. The architectures of semantic segmentation networks generally utilize convolution and downsampling layers to encode an input image and embed it into a characteristic space, while deconvolution and upsampling layers are employed to decode the embedded feature [3,11–13]. In previous studies on crack detection, numerous methods have aimed to achieve better results for a crack image dataset. For example, Liu et al. used a U-Net to achieve accurate segmentation on a small dataset [9]. Wang et al. converted the iterative inference process of a conditional random field (CRF) to a recurrent neural network (RNN) layer [10] and combined the RNN layer with a CNN layer to train them together; the application of the CRF improved the precision of the segmentation result. Li et al. used a hierarchical classification approach to process unbalanced information in data [34]. Tong et al. used X-ray imaging technology to obtain image data inside a material to allow the properties of the material to be evaluated more accurately [6]. Dorafshan et al. combined a deep CNN with edge detectors and proposed a new method for reducing noise in the final binary images [7]. Bang et al. proposed a pixel-level detection method in black-box images that could identify road cracks effectively and inexpensively [35]. Luo et al. proposed "DeepCrack," which combined multi-level, multi-scale features to enhance the precision of segmentation and established an open-source dataset and benchmark for evaluation methods [11].

These strategies use the different characteristics of the crack image dataset to improve their methods, which effectively improves the accuracy of the prediction. However, these methods all use pixel-level labels for supervision of the training process. The annotation of pixel-level labels is time-consuming and tedious. Although these improved methods have achieved outstanding performance, when they are applied in practice, it is important to consider achieving a balance between the workload and accuracy.

Patch-based image classification methods crop the image into patches and then classify the patches. This allows the complexity of the crack image to be reduced, and the position of the patches belonging to the crack category can be used to roughly represent the position of the crack in the image. The work of creating image-level labels on a patch is less arduous than that required for the creation of pixel-level labels. In addition, the patch-based image classification method can extract rough location information for the cracks from the image. As a result, many researchers have studied patch-based image classification methods. Cha et al. proposed a CNN-based image classification method for detecting concrete cracks, in which sliding window technology was used to clip the images [36]. Ali et al. proposed a DL-based subsurface damage detection method that used cropped patches for training; this method could locate subsurface damage using bounding boxes on infrared thermography images [37]. Park et al. proposed a patch-based crack detection method in which the interference was removed using a segmentation method, and three types of cracks were detected from black box images [38]. In addition, some studies have also cropped images into patches and then processed them to reduce the complexity of the crack images or to obtain rough location information for the cracks [39–41].

The above methods can reduce the difficulty of data processing or enhance the availability of the methods from various aspects. However, these methods are fully supervised. As shown in Fig. 1, the methods using pixel-level labels (e) can extract abundant crack information from an image (f) but require complex annotation; those using bounding boxes (a) or image-level labels (c) can easily obtain labels, but can extract less crack information (b, d). A method that can use image-level labels (g) to train a semantic segmentation network to obtain pixel-level output (h) will have more practical application possibilities.

Fig. 1. Comparison of supervision conditions and results between the proposed WSL method and FSL methods.

2.2. Weakly supervised learning (WSL)

To solve the problem of time-consuming data annotation, many image segmentation methods have attempted to supervise the training of segmentation networks using incomplete annotations, also known as WSL. Among the many WSL approaches, those based on bounding boxes and image-level labels are relatively popular owing to their easier data labeling procedures. However, crack data can have complex topological structures, e.g., the shape and size of cracks can be irregular, and cracks often run through the entire image. Therefore, the labeling of bounding boxes is not a suitable method for this study. Image-level annotation can minimize the work required for annotation and can be effectively applied to the task of crack segmentation. Accordingly, this study uses image-level annotation to supervise the training procedure of the semantic segmentation network.

Some WSL methods based on image-level annotation for segmentation network training design an end-to-end network. By modifying the structure of the network, the middle layers of the network can learn more significant location and shape information for an object, and then the network can be trained using the image-level labels [18,21,42]. Recently, numerous methods have instead designed a workflow. For example, some methods use discriminative localization techniques to obtain the segmentation seeds of a target category, and then reasonably expand the segmentation seeds within the object boundary to obtain synthetic labels. Finally, the synthetic labels are used to complete the training of the segmentation network [22,24]. These methods have recently achieved state-of-the-art precision.

Semantic segmentation tasks are mostly applied for medical image segmentation or traffic scene segmentation. The objects in such tasks have relatively fixed shapes, such as cells and buses. Many WSL methods make full use of this information to supervise the training process, including those mentioned above and all the weakly supervised methods based on bounding boxes. However, cracks generally have no fixed shape or position, and thus the methods based on shape information or seed-expanding do not handle crack images well. Crack images also have complex topologies and structures. The strategy in this study is inspired by patch-based methods. It was found that cropping the crack image into small patches can reduce the complexity of the crack image without deterioration of the performance of the results. To this end, a semantic segmentation method based on WSL is proposed in this paper.

3. Proposed method

3.1. Overview of the proposed method

The workflow of the proposed method is shown in Fig. 2. The proposed method contains five steps: (a) crop the input image, a_in, and then annotate the patches as a1_True and a1_False; (b) use a discriminative localization technique on a1_True to obtain a class activation mapping of the crack, a2_True; (c) use DenseCRF to generate synthetic labels, a3_True, from the class activation mapping, a2_True; (d) utilize the synthetic labels, a3_True, and corresponding original patches, a1_True, to train a segmentation network and obtain the predicted output, a4_True; and (e) use a neighborhood fusion strategy to join the probability maps from the segmentation network, a4_True, and those patches not including cracks, a4_False, into the original size, a5, and then use the argmax operation to obtain the final output, a_out.

The proposed method thus comprises four stages: preprocessing, label generating, training the Seg net, and postprocessing. Sections 3.2–3.5 provide details of the proposed weakly supervised semantic segmentation method.

Fig. 2. Illustration of the proposed method. First, the input image is clipped into patches, and a discriminative localization technique is used to obtain the class activation mapping of the crack in each patch. Then, DenseCRF is used on the class activation mapping and the corresponding patch to generate synthetic labels, which are utilized to train a segmentation network (Seg net). At the end of the proposed method, the probability heat map of patches as predicted by the Seg net is merged to the original size via a neighborhood fusion strategy.

3.2. Discriminative localization technique

Class activation mapping (CAM) [21] is a classic discriminative localization technique that is used to obtain the approximate location of objects in many WSL-based semantic segmentation methods [22–24]. In the proposed method, CAM is also used to obtain the approximate location and shape information of the object. The last fully connected layer of ResNet is removed (after the pool5 layer), and then a global average pooling layer and a fully connected layer are added to obtain the final output. Using global average pooling instead of a fully connected layer can avoid the flattening of the feature map, thus allowing the spatial location information of the object to be preserved completely in the embedded features. Then, the CAM of the positive class can be calculated using Eq. (1):

CAM_positive = Σ_{i < N_channel} F_i · W_positive,i    (1)

where CAM_positive refers to the CAM of the positive class, F refers to the feature map after pool5, F_i is the i-th slice of the channel dimension, N_channel is the size of the channel dimension, W_positive is the weight vector of the positive class in the fully connected layer, and W_positive,i is the weight with which F_i contributes to CAM_positive. As this study addresses a two-category task, the technique only needs to extract the crack information from the background, and thus the negative category does not need to be handled.

3.3. Position attention module

The attention mechanism can effectively improve the performance of the network. The attention module proposed in [43] is used in this method; it can fully consider the relationships between elements in feature maps using a non-local operation. Application of the attention module allows for a fuller consideration of the contextual relationships in the feature map, thereby enhancing the representation ability of the features.

Fig. 3. Illustration of the position attention module: (a) detailed structure of the PAM; (b–g) different operations in the PAM.
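For concreteness, the core computation of such a non-local position attention module can be sketched in NumPy for a single image. All function and weight names here are our own, the batch dimension and the optional max-pooling shortcut are omitted, and the 1 × 1 convolutions are represented as plain channel-mixing matrices; this is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(x, w_theta, w_phi, w_g, w_out):
    """Non-local position attention over all spatial positions.

    x:                        (C, H, W) input feature map
    w_theta, w_phi, w_g:      (C_low, C) 1x1-conv weights embedding the
                              input into a lower channel dimension C_low
    w_out:                    (C, C_low) 1x1-conv weights restoring the
                              original channel dimension
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)              # flatten spatial positions
    theta = w_theta @ flat                  # (C_low, HW)
    phi = w_phi @ flat                      # (C_low, HW)
    g = w_g @ flat                          # (C_low, HW)
    attn = softmax(theta.T @ phi, axis=-1)  # (HW, HW) self-attention map
    out = g @ attn.T                        # aggregate features from all positions
    return (w_out @ out).reshape(C, H, W)   # restore the input shape
```

Because the output tensor has the same shape as the input, a module like this can be dropped between existing layers, as the text below describes.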

The position attention module (PAM) is shown in Fig. 3. The dimensions of the input tensor are (B × C × H × W), where B is the batch size, H and W are the height and width, respectively, C represents the channel, and different subscripts represent different sizes of each dimension. First, a 1 × 1 convolution layer is used to embed the input tensor in a feature space with a lower channel dimension to obtain θ, Φ, and g. After reshaping, θ and Φ are multiplied to obtain the overall self-attention map. Then, the overall self-attention map and the reshaped g are multiplied, and the result is reshaped to the same shape as the embedded tensor. Finally, a 1 × 1 convolution layer is used to restore the embedded feature to the original dimensions. To reduce the computational complexity, a max pooling layer can be connected after the convolution layers of Φ and g, which allows the dimensions of the data to be reduced while ensuring that the output dimensions are not affected. Fig. 3 details the structure of the PAM. As the input tensor and output tensor of the PAM have the same shape, they can be combined with existing architectures [43–45] to better consider the spatial correlation of different positions in the feature map.

3.4. Fully connected conditional random field (DenseCRF)

In the proposed method, DenseCRF is used to process the probability map obtained from the CAM [46] in order to obtain a sufficiently fine segmentation mask as a synthetic label. DenseCRF is a type of probabilistic graphical model and is widely used in image segmentation [46–48]. Generally, other image segmentation methods use the output pixel-level classification probability map as the input to DenseCRF, while the output is a more elaborate segmentation mask.

DenseCRF considers the impact of all other pixels on each pixel, and thus it is also called a "fully connected" or "densely connected" CRF. The unary potential function considers the characteristics of each pixel and is generally initialized with the initial classification probability of each pixel, whereas the pairwise potential function fully considers the global features, and two Gaussian kernels are used to account for the pixel features and location information. The final energy function is obtained by adding the unary potentials and pairwise potentials of all pixels. Finally, a mean field approximation inference technique is used to optimize the model iteratively until convergence, and a refined segmentation result is obtained. DenseCRF uses the following update equation for the mean field approximation inference:

Q_i(x_i) = (1 / Z_i) exp{ −ψ_u(x_i) − Q̂_i(x_i) }    (2)

where Q_i(x_i) is the probability that the i-th pixel belongs to category x_i, ψ_u(x_i) is the unary potential of the i-th pixel, and Z_i is the normalization factor, Z_i = Σ_{x_i ∈ L} exp{ −ψ_u(x_i) − Q̂_i(x_i) }. Q̂_i(x_i) can be expanded as follows:

Q̂_i(x_i) = Σ_{l ∈ L} μ(x_i, l) Σ_m ω^(m) Q_i^(m)(l)    (3)

where μ(x_i, l) = [x_i ≠ l], [·] is the Iverson bracket, which has a value of 1 if the condition in the bracket is satisfied and 0 otherwise; L is the collection of all categories; M is the number of Gaussian kernels; and ω^(m) is the linear combination weight. Q_i^(m)(l) can be expanded as follows:

Q_i^(m)(l) = Σ_{j ≠ i} k^(m)(f_i, f_j) Q'_j(l)    (4)

where k^(m)(f_i, f_j) is the m-th Gaussian kernel, and Q'_j(l) is the output of the last iteration, which is initialized to (1 / Z_j) exp{ −ψ_u(x_j) }.

The processing flow of DenseCRF is shown in Fig. 4.

Fig. 4. Detailed explanation of DenseCRF, which uses the original image and the pixel-wise probability map generated by the CAM as the input, assigning a label to each pixel as the output.

The CAM activation map is normalized by SoftMax and used as the unary potential function of DenseCRF. The binary potential function is relatively complex and has several parameters in the Gaussian kernels, which can affect the performance of the DenseCRF output. The parameter selection strategy is discussed in detail in Section 5.2.

The generation process for the synthetic labels can be briefly described as follows. ResNet is used as a backbone network to design a classification network with a global average pooling layer, and the PAM is added before the last residual block of each stage of ResNet. In this paper, this network is abbreviated as "Cls-GAP net." The process then utilizes the image-level annotations as labels to train the Cls-GAP net. After the network training is completed, the discriminative localization technique described in Section 3.2 is used to obtain the CAM probability map of the crack object, and DenseCRF is used to process the CAM probability map to obtain the synthetic labels used to train the segmentation network. The network structure of the Cls-GAP net is shown in detail in Fig. 5, which also describes the calculation of the CAM.
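The mean field update of Eqs. (2)–(4) can be illustrated with a toy NumPy implementation. This sketch assumes a single precomputed Gaussian kernel matrix and the Potts compatibility μ(x, l) = [x ≠ l]; practical implementations (e.g., the pydensecrf library) instead use efficient high-dimensional filtering and never materialize the full N × N kernel:

```python
import numpy as np

def mean_field_step(Q, unary, K, w=1.0):
    """One simplified mean field update of a dense CRF.

    Q:     (N, L) current label distribution for each of N pixels
    unary: (N, L) unary potentials psi_u (e.g. -log of CAM probabilities)
    K:     (N, N) precomputed Gaussian kernel k(f_i, f_j), zero diagonal
    w:     linear combination weight of the single kernel
    """
    # message passing, Eq. (4): sum_j k(f_i, f_j) Q_j(l)
    msg = K @ Q
    # Potts compatibility mu(x, l) = [x != l], Eq. (3): each label is
    # penalized by the messages of all *other* labels
    Q_hat = w * (msg.sum(axis=1, keepdims=True) - msg)
    # Eq. (2): exponentiate and normalize per pixel (the 1/Z_i factor)
    logits = -unary - Q_hat
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    Q_new = np.exp(logits)
    return Q_new / Q_new.sum(axis=1, keepdims=True)
```

In use, Q would be initialized from the unary potentials alone, i.e., Q ∝ exp(−ψ_u), and the step repeated until the distribution converges.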

3.5. Segmentation network

To extract more precise pixel-level semantic information from images, a DL-based semantic segmentation method, called Seg net, is introduced in the proposed method and trained using the synthetic labels generated by DenseCRF. The unique feature of the proposed method is that a workflow is designed to detect cracks in images at the pixel level. The proposed method is a weakly supervised semantic segmentation method, and thus it only requires image-level labels rather than pixel-level labels to complete the training.

A U-Net with a ResNet backbone is used as the Seg net, and the weighted cross-entropy loss is used to process the category imbalance problems in the dataset [49]. The weighted cross-entropy loss function introduces a balance factor, α, to control the weight of the different categories in the loss function, which allows the influence of minority categories to be better considered. The weighted cross-entropy loss can be written as:

α = N / N_positive    (5)

loss = −α y log ŷ − (1 − α)(1 − y) log(1 − ŷ)    (6)

where N_positive is the number of pixels in the label that belong to the crack category; N is the total number of pixels; y is the ground truth, y ∈ {0, 1}; and ŷ is the predicted probability, ŷ ∈ [0, 1]. In the above, a value of one represents the crack category and zero represents the non-crack category.

3.6. Neighborhood fusion strategy

To fuse the output masks of the cropped patches into the original size, a neighborhood fusion strategy based on the maximum probability is used. First, the probability maps of the Seg net are obtained before the argmax layer, and these probability maps are stored in the positions corresponding to the original image. In areas where multiple patches overlap, the maximum probability value is selected as the final output probability. As this method is intended for binary classification tasks, good results can be achieved with this very simple strategy.

4. Evaluation

4.1. Datasets

Two datasets were used to train and evaluate the proposed method: a crack image classification dataset [50] and the DeepCrack crack image semantic segmentation dataset [11]. The author of the image classification dataset did not give it a name; thus, for convenience, "Cls-dataset" is used to represent this classification dataset. The goal is to use the DeepCrack dataset to evaluate the performance of the semantic segmentation in the crack image, and this method requires training the Cls-GAP net to obtain the synthetic labels. The Cls-dataset is used for pre-training the Cls-GAP net. The Cls-dataset crops 458 high-resolution (4032 × 3024) original images into 227 × 227 patches. A total of 40,000 images are used in the dataset, in two categories: 20,000 positive images with cracks and 20,000 negative images without cracks. Each image is resized to 224 × 224 for input to the network.

DeepCrack contains 537 crack images with a resolution of 544 × 384, all of which have manually annotated labels. DeepCrack includes 300 images as a training set and 237 images as a test set. The images are cropped to a size of 224 × 224 according to a sliding step. After cropping, approximately 2560 images can be obtained. As the images in the DeepCrack dataset all contain cracks, they all belong to the positive category. Thus, these images are matched with the same number of negative-category images to fine-tune the Cls-GAP net. It is worth noting that the DeepCrack dataset is category-imbalanced: in the training set, the proportion of pixels in the positive category is 2.91%, while this proportion is 4.33% in the test set. The weighted cross-entropy loss is used to deal with the problem of the category imbalance, as described in detail in Section 3.5.

4.2. Evaluation metrics

Three factors in the proposed method have a large impact on the pixel-wise final output performance: the Cls-GAP net, the CRF parameters, and the Seg net. In the experiments, the contributions of these components are evaluated. The evaluation metrics are similar to those of DeepCrack [11]. For the binary pixel-level classification task, the relationship between the predicted result for each pixel and the ground truth can be divided into four categories: TP, FP, FN, and TN. For crack detection, TP indicates the number of pixels that are predicted as cracks for which the ground truth is also a crack; FP indicates the number of pixels that are predicted as cracks, but where the ground truth is a non-crack; FN indicates the number of pixels that are predicted as non-cracks, but where the ground truth is a crack; and TN indicates the number of pixels that are predicted as non-cracks, where the ground truth is also a non-crack.

The pixel accuracy (PA) is the simplest metric, indicating the proportion of pixels that are predicted correctly; the mean PA (MPA) is a simple improvement of the PA: first, the proportion of pixels correctly classified in each category is computed from the total number of pixels in that category, and the result is then averaged. The mean intersection-over-union (MIoU) is a standard measure of the semantic segmentation task. This measure computes the proportions of the intersection and union between the ground truth and prediction for each category, which are then averaged. These three metrics can be calculated using TP, FP, FN, and TN, as summarized in Eqs. (7)–(9).

PA = (TP + TN) / (TP + FP + FN + TN)    (7)

MPA = [TP / (TP + FN) + TN / (TN + FP)] / 2    (8)
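As a minimal sketch, the three metrics of Eqs. (7)–(9) can be computed directly from the four pixel counts (the function and variable names are ours, not from the paper):

```python
import numpy as np

def pixel_metrics(pred, gt):
    """PA, MPA, and MIoU for binary crack masks, per Eqs. (7)-(9).

    pred, gt: boolean arrays of the same shape (True = crack pixel).
    """
    tp = np.sum(pred & gt)     # predicted crack, ground truth crack
    fp = np.sum(pred & ~gt)    # predicted crack, ground truth non-crack
    fn = np.sum(~pred & gt)    # predicted non-crack, ground truth crack
    tn = np.sum(~pred & ~gt)   # predicted non-crack, ground truth non-crack
    pa = (tp + tn) / (tp + fp + fn + tn)
    mpa = (tp / (tp + fn) + tn / (tn + fp)) / 2
    miou = (tp / (tp + fp + fn) + tn / (tn + fn + fp)) / 2
    return pa, mpa, miou
```
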

Fig. 5. Architecture of the Cls-GAP net. This network uses ResNet as a backbone, removes all layers after pool5, and adds a global average pooling layer and a fully connected layer in turn after the pool5 layer; the PAM is added before the last residual block of each stage of ResNet. The calculation of the CAM specifically refers to extracting the feature map after the pool5 layer of the Cls-GAP net; it is then averaged using the weights of the fully connected layer.
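The CAM computation described in the caption above (weighting the pool5 feature map by the fully connected layer weights of the target class and summing over channels) can be sketched in plain Python; batching and upsampling back to the input resolution, which the full pipeline would also need, are omitted here:

```python
def class_activation_map(feature_maps, fc_weights):
    """CAM sketch: weight each channel's feature map by the fully
    connected layer weight for the target class and sum over channels,
    yielding one coarse localization map.
    feature_maps: C x H x W nested lists; fc_weights: length-C list."""
    C = len(feature_maps)
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(fc_weights[c] * feature_maps[c][h][w] for c in range(C))
             for w in range(W)]
            for h in range(H)]
```

A high value at (h, w) means the classifier's crack evidence concentrates there, which is what allows an image-level classifier to localize cracks without pixel labels.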

 
MIoU = [TP / (TP + FP + FN) + TN / (TN + FN + FP)] / 2    (9)

Several other metrics are also commonly used in crack detection tasks, as shown in Eqs. (10)–(12).

P = TP / (TP + FP)    (10)

R = TP / (TP + FN)    (11)

F1 = 2TP / (2TP + FP + FN)    (12)

In the above equations, P refers to the precision (i.e., the fraction of relevant instances among the retrieved instances), R refers to the recall (the fraction of the total number of relevant instances that were actually retrieved), and F1 refers to the F-score, which balances the effects of P and R and evaluates the performance of a classifier more comprehensively. The area under the ROC curve (AUC) metric is also used to account for the TN (true negatives), which are not considered in the P, R, and F1 metrics, thereby providing a more comprehensive evaluation of the method performance.

5. Results and discussion

5.1. Implementation details

All of the experiments were conducted on a computer with the Ubuntu 16.04 system installed. An NVIDIA TITAN X Pascal GPU was used for training and evaluation. When training the Cls-GAP net, the cross-entropy loss was used to evaluate the error between the prediction and the ground truth, and the stochastic gradient descent (SGD) optimizer with momentum was used to update the parameters. The learning rate was 0.1, the momentum was 0.9, and the number of epochs for training was 50. When training the Seg net, the weighted cross-entropy loss [49] was used to process the unbalanced data, and the Adadelta optimizer was used to update the parameters. The learning rate was 0.1, and the number of epochs for training was 30.

5.2. DenseCRF parameters

The performance of the CRF depends strongly on the choice of parameters. Of these parameters, the unary potential can be directly assigned using the CAM; the role of the pairwise potential is to describe the similarity between two pixels and to encourage similar pixels to be assigned the same label so that the CRF can split at the edge. This similarity can be expressed as in Eq. (13):

k(f_i, f_j) = ω^(1) exp(−|p_i − p_j|² / (2θ_a²) − |I_i − I_j|² / (2θ_b²)) + ω^(2) exp(−|p_i − p_j|² / (2θ_c²))    (13)

where the first term is the appearance kernel and the second term is the smoothness kernel; f_i refers to the eigenvector of the i-th pixel, k(f_i, f_j) is used to describe the similarity of f_i and f_j, ω is the weight, p_i is the color vector of the i-th pixel, and I_i is the position vector of the i-th pixel. θ_a, θ_b, and θ_c control the effect on k(f_i, f_j) of the color vector and the position vector in the appearance kernel, and of the color vector in the smoothness kernel, respectively.

The choice of CRF parameters also affects the qualification rate of the synthetic labels. Owing to limitations in the ability of DenseCRF to extract pixel-level semantic information from images, some images cannot be properly processed. For example, DenseCRF may classify all pixels in an image as cracks or as non-cracks. To eliminate the impact of unqualified synthetic labels on the training of the Seg net, synthetic labels whose pixels are classified as all cracks or all non-cracks are removed. The qualification rate represents the ratio of the number of qualified synthetic labels to the total number of synthetic labels. To ensure that a sufficient number of synthetic labels can be used to train the Seg net, the qualification rate is required to be greater than 50%.

Changes in parameters θ_c and ω^(2) had little effect on the dataset [46]; thus, similar to the procedure in [12], both θ_c and ω^(2) were set equal to three. A grid search was used to search for the optimal values of parameters θ_a, θ_b, and ω^(1) within a certain range (ω^(1) ∈ {5, 10, 15}, θ_a ∈ {10, 30, 50, 70, 150, 200}, and θ_b ∈ {1, 5, 10, 30, 50, 300}). A total of 100 images were selected from the DeepCrack dataset for the grid search experiment. To

show the relationship between the performance and qualification rate of DenseCRF intuitively, the grid search results for θ_a and θ_b with ω^(1) = 10 are shown in Fig. 6. It can be seen from the results that the MIoU metric increases as the parameters change; however, the number of qualified synthetic labels generally decreases. Finally, to balance the performance and the qualification rate, the values of θ_a, θ_b, ω^(1), θ_c, and ω^(2) are determined as 150, 30, 10, 3, and 3, respectively.

5.3. Comparison with fully supervised learning

A five-fold cross-validation is utilized to evaluate the performance and stability of the proposed WSL method. The ResNet-U-Net is used as the Seg net, and the DeepCrack dataset is used for evaluation. The synthetic labels generated by DenseCRF are used to train the Seg net in the proposed method, and the ground truth is used to train the FSL method. The proposed WSL method and the FSL method are trained to convergence using the same training parameters. The average of the five-fold cross-validation for each metric is used to evaluate the performance of the method, and the standard deviation is used to evaluate its stability. Fig. 7 shows the results of this evaluation.

In Fig. 7, the x-axis is the evaluation metric, the y-axis is the metric value, the red color represents the results of the FSL method, and the blue color represents the results of the proposed WSL method. The error bars on the bar graph represent the standard deviations, which indicate the degree of dispersion in the five-fold cross-validation results. The value of the standard deviation is given below the error bar. As shown in Fig. 7, the proposed WSL method and the FSL method provide consistent results, although the proposed method exhibits a slight loss in performance and a larger standard deviation. Therefore, compared with the FSL method, the proposed WSL method can significantly reduce the annotation workload without losing too much performance or stability.

Other than ResNet-U-Net, several popular network structures were chosen as the Seg net in the proposed method to provide a

Fig. 6. Grid search results for θ_a and θ_b with ω^(1) = 10.
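The grid search behind Fig. 6 enumerates every combination of ω^(1), θ_a, and θ_b and keeps the best-scoring one. A minimal sketch is below; `score_fn` is an assumption standing in for the actual evaluation (in the paper, the MIoU of the DenseCRF output on the 100 selected images, subject to the qualification-rate constraint):

```python
from itertools import product

# Parameter grids from Section 5.2 (theta_c and omega2 are fixed at 3).
OMEGA1 = [5, 10, 15]
THETA_A = [10, 30, 50, 70, 150, 200]
THETA_B = [1, 5, 10, 30, 50, 300]

def grid_search(score_fn):
    """Exhaustively score every (omega1, theta_a, theta_b) combination
    and return the best parameter tuple together with its score."""
    best_params, best_score = None, float("-inf")
    for params in product(OMEGA1, THETA_A, THETA_B):
        score = score_fn(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With 3 × 6 × 6 = 108 combinations and 100 images, the exhaustive search is cheap compared with training, which is why a simple grid rather than a smarter optimizer suffices here.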

Fig. 7. Evaluation of the method performance and stability.
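The five-fold aggregation plotted in Fig. 7 reduces to two numbers per metric: the mean across folds (performance) and the sample standard deviation (stability, the error bars). A minimal sketch:

```python
from statistics import mean, stdev

def summarize_folds(fold_scores):
    """Aggregate one metric across cross-validation folds.
    Returns (mean, sample standard deviation): the mean measures
    performance and the standard deviation measures stability."""
    return mean(fold_scores), stdev(fold_scores)
```

For example, five hypothetical MIoU fold values such as `[0.6, 0.7, 0.8, 0.7, 0.7]` would be reported as their mean with the standard deviation shown below the error bar, as in Fig. 7.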



more comprehensive comparison with the FSL method. All of the models were trained to convergence using the same training parameters, and all models were evaluated on the same test set. The evaluation results are shown in Fig. 8.

In Fig. 8, the x-axis is the evaluation metric, and the y-axis is the corresponding metric value. Different colors represent different models; the solid lines represent the FSL method, and the dashed lines represent the proposed WSL method. In the legend, the bracketed value after the model name indicates the training time. As shown in Fig. 8, the evaluation results of the WSL method are slightly lower than those of the FSL method for most metrics, which is consistent with the results of ResNet-U-Net. The proposed WSL method has a higher P metric than the FSL method because the prediction results of the proposed WSL method have a smaller FP, which leads to a larger value of the P metric. Compared with FCN-based models, U-Net-based models usually require shorter training time and obtain better performance. In general, when using different semantic segmentation networks as the Seg net, the results of the proposed WSL method and the FSL method are consistent.

5.4. Comparison with other WSL methods

Two typical WSL semantic segmentation methods were compared with the proposed method. One method is an end-to-end network called WILDCAT [18], and the other is a workflow that generates synthetic labels and then uses the synthetic labels to train the segmentation networks [22]; in this study, this method is referred to as "PSA." The DenseCRF parameters optimized using the DeepCrack dataset are applied to these two methods, and are the same as those in the proposed WSL method. The weight adjustment parameter a and the number of modalities M in WILDCAT are optimized in the same way as in the original paper, and a = 0.5 and M = 4 are used for the comparative experiment.

As shown in Fig. 9, when processing an image with a simple crack topology structure, PSA and WILDCAT can approximately locate the position of the crack in the image (a–c), but cracks with sudden changes in direction may not be accurately processed (d–f). PSA and WILDCAT lack sufficient feature extraction capabilities for images with more complex crack topology structures (g). In contrast, the proposed WSL method is relatively robust in dealing with different types of crack images.

5.5. Influence of the position attention module

ResNet was used as the backbone network in the Cls-GAP net, and the PAM is introduced to better consider the contextual information of the feature map. To evaluate the practical impact of the PAM on the Cls-GAP net performance, similar parameters were used to train the ResNet with and without a PAM. The two trained models were used to generate synthetic labels, and the evaluation metric calculation results are summarized in Table 1.

It can be seen from Table 1 that the evaluation results for the Cls-GAP net with the PAM are generally better than those for the Cls-GAP net without the PAM. However, the Cls-GAP net without the PAM has a larger P metric because the synthetic labels generated by the Cls-GAP net with the PAM have a larger TP, but the FP also increases accordingly. Because the number of pixels belonging to TP is larger, its relative increase is smaller than that of FP, and thus the P metric may be decreased.

To show the operation of the PAM in the intermediate layer of the Cls-GAP net visually, the PAM added in the third stage of the ResNet backbone in the Cls-GAP net is selected for visualization. In the intermediate layer of the PAM, the shape of the overall self-attention map is (B × HW × H_down W_down), which means that for each position (x, y) of the input feature map, there is a corresponding (H_down × W_down) feature in the overall self-attention map. The feature map corresponding to each position in the input feature map is called the sub-attention map. In Fig. 10, (a) is the original image. The sub-attention maps of all positions are drawn and arranged in order according to their positions in (g). Five positions in (a) are selected, and their sub-attention maps are displayed in (b–f). To enable better visualization, the sub-attention maps were normalized. As shown in (b–f), the red rectangle indicates the position of the selected point; a position with a larger response value in the sub-attention map belongs to the same category as this point, and vice versa. The visualization results show that the PAM considers the spatial correlation of pixels at different positions in the intermediate layer feature map, and thus it can better locate the cracks in the image.

5.6. Comparison of the annotation workload

An experiment was designed to quantitatively compare the workload of annotating image-level labels on patches with that of annotating pixel-level labels. Pixel-level labels were annotated using two tools: an online image editor, "Pixlr" [51], and "LabelMe" [52].

As shown in Fig. 11, (a) is the original image with a resolution of 384 × 544. First, the image is cropped into 12 patches (b) with a resolution of 224 × 224 according to a sliding step of 80, and then the images of different categories are simply placed in different folders to complete the image-level annotation (c). The interface of LabelMe is shown in (d). The user can outline the crack through

Fig. 8. Evaluation results with various Seg net structures.
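Returning to the DenseCRF parameters of Section 5.2: the pairwise potential of Eq. (13) can be evaluated pointwise. The sketch below follows the paper's notation (p is the color vector and I the position vector of a pixel) and uses the parameter values finally chosen in Section 5.2 as defaults; it makes no claim about the full DenseCRF inference, only about the kernel itself:

```python
import math

def pairwise_kernel(p_i, p_j, I_i, I_j,
                    w1=10.0, w2=3.0,
                    theta_a=150.0, theta_b=30.0, theta_c=3.0):
    """Eq. (13): appearance kernel plus smoothness kernel.
    Defaults are the values selected in Section 5.2
    (omega1=10, omega2=3, theta_a=150, theta_b=30, theta_c=3)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Appearance kernel: similar color AND nearby position -> high affinity.
    appearance = w1 * math.exp(-sqdist(p_i, p_j) / (2 * theta_a ** 2)
                               - sqdist(I_i, I_j) / (2 * theta_b ** 2))
    # Smoothness kernel: depends on the color vector only.
    smoothness = w2 * math.exp(-sqdist(p_i, p_j) / (2 * theta_c ** 2))
    return appearance + smoothness
```

Two identical pixels attain the maximum affinity w1 + w2, and the affinity decays as color or position differences grow, which is what lets the CRF "split at the edge" of a crack.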



Fig. 9. Pixel-level segmentation results of the proposed method, PSA, and WILDCAT for the DeepCrack dataset.

Table 1
Performance of the position attention module.

Metrics PA MPA MIoU P R F1 AUC


Cls-GAP net with PAM 0.963 0.701 0.677 0.932 0.403 0.563 0.724
Cls-GAP net without PAM 0.958 0.648 0.625 0.951 0.298 0.454 0.667
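The metrics reported in Table 1 follow directly from the confusion matrix. A plain-Python sketch of Eqs. (7)–(12) is below; F1 can equivalently be written as the harmonic mean of precision and recall, which allows the Table 1 rows to be cross-checked:

```python
def segmentation_metrics(TP, FP, FN, TN):
    """Pixel-level metrics of Eqs. (7)-(12) from the confusion matrix."""
    return {
        "PA":   (TP + TN) / (TP + FP + FN + TN),
        "MPA":  (TP / (TP + FN) + TN / (TN + FP)) / 2,
        "MIoU": (TP / (TP + FP + FN) + TN / (TN + FN + FP)) / 2,
        "P":    TP / (TP + FP),
        "R":    TP / (TP + FN),
        "F1":   2 * TP / (2 * TP + FP + FN),
    }

def f1_from_pr(P, R):
    """Eq. (12) rewritten as the harmonic mean of precision and recall."""
    return 2 * P * R / (P + R)

# Cross-checking Table 1: f1_from_pr(0.932, 0.403) rounds to 0.563 and
# f1_from_pr(0.951, 0.298) rounds to 0.454, matching the F1 column.
```

The same functions reproduce the asymmetry discussed above: raising TP while FP grows faster in relative terms lowers P even as R improves.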

Fig. 10. Visualization of the overall self-attention map in the PAM.
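The affinity computation behind the sub-attention maps in Fig. 10 can be sketched as a dot-product similarity followed by a softmax. This is a simplification: the actual PAM also projects the features into query/key spaces and re-weights the value features, which is omitted here; only the (HW × H_down W_down) affinity map that the figure visualizes is produced, with the batch dimension dropped:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def position_attention_map(queries, keys):
    """Row i is the sub-attention map of input position i: its softmax-
    normalized dot-product affinity to every downsampled position.
    queries: HW feature vectors; keys: H_down*W_down feature vectors."""
    return [softmax([sum(q * k for q, k in zip(qv, kv)) for kv in keys])
            for qv in queries]
```

Positions whose features resemble the query position receive large weights, which is exactly the "same category responds more strongly" behavior seen in Fig. 10(b–f).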

the polygon and generate a label image (e). The interface of Pixlr is shown in (f), and its lasso cutout tool can easily separate the crack in the image from the background using a smooth curve (g). The Python language and the open-source library OpenCV are used to convert the output (g) of the online image editor into pixel-level labels (h).

The time and the point counts for different annotation methods are evaluated to compare their workloads. The point count for a single image roughly reflects the number of mouse clicks during the annotation process. LabelMe's point count indicates the number of vertices of the labeled polygon, and the point count for the image-level annotation method is defined as 12 because an image is cropped into 12 patches. The lasso tool in Pixlr uses smooth-curve annotation, which makes it difficult to calculate point counts, so only the time spent on a single image is used to evaluate the workload of Pixlr.

Fig. 11. Detailed explanation of the annotation method.
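The cropping step in Fig. 11(b) slides a 224 × 224 window across the image with a step of 80. A naive top-left-aligned sketch is below; note that the exact patch count depends on how windows near the image border are handled (the paper obtains 12 patches from the 384 × 544 image, so its scheme differs slightly from this simple grid):

```python
def patch_origins(length, patch=224, step=80):
    """Top-left-aligned sliding-window start coordinates along one axis."""
    return list(range(0, length - patch + 1, step))

def crop_coords(height, width, patch=224, step=80):
    """(row, col) origin of every patch; the patch itself would be
    image[r:r + patch, c:c + patch]."""
    return [(r, c)
            for r in patch_origins(height, patch, step)
            for c in patch_origins(width, patch, step)]
```

Because the step (80) is much smaller than the patch size (224), adjacent patches overlap heavily, which is what makes the neighborhood fusion strategy at merge time meaningful.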

Table 2
Comparison of annotation workloads using different methods.

Annotation method                          Time spent on a single image (s)   Point count for a single image
Annotating image-level labels on patches   22.3                               12
Online image editor                        145                                /
LabelMe                                    173                                148

Twenty images were selected from the DeepCrack dataset for the annotation workload experiment. All of the experiments were conducted under the same conditions. The experiment was repeated three times, and the average over all rounds of the experiment was used to represent the overall result.

From Table 2, it can be seen that the time required to annotate pixel-level labels using LabelMe is approximately 8 times that required to annotate image-level labels; moreover, the point count for the pixel-level labels using LabelMe is approximately 12 times that of the image-level labels. Using Pixlr to annotate pixel-level labels takes roughly seven times longer than annotating image-level labels, which is slightly less time-consuming than using LabelMe. The results show that the annotation workload for the image-level labels used in the proposed weakly supervised semantic segmentation method is much smaller than that for the pixel-level labels used in the fully supervised semantic segmentation method.

6. Analysis examples

6.1. Generating synthetic labels

The proposed weakly supervised semantic segmentation method requires synthetic labels to train the Seg net. Therefore, the quality of the synthetic labels directly affects the performance of the output. The ResNet with the PAM was used as the Cls-GAP net and was trained using image-level annotated labels. Then, the trained Cls-GAP net was utilized to process the DeepCrack training set to generate synthetic labels for use in training the Seg net. Some typical examples are shown in detail in Fig. 12.

In Fig. 12, the first line is the original image, the second line is the ground truth, the third line is the probability map generated by the CAM technology, the fourth line is the synthetic label generated by DenseCRF, and the fifth line is the confusion matrix of the synthetic label. As shown in Fig. 12, the synthetic labels generated using this method generally have a strong object location ability for different types of cracks, such as simple cracks (a, b), cracks with widely varying widths (c), cracks disturbed near brick joints (d) or paint (e), and cracks with more complex topology (f). However, for some cracks with complex topologies (g) or very wide cracks (h), the synthetic labels cannot be generated correctly. According to the confusion matrices in the fifth row, the FN of the synthetic labels is often greater than the FP, which causes the evaluation results to have a larger P and a smaller R.

6.2. Segmentation network

The ResNet-U-Net was used as the Seg net, and the Seg net was trained using synthetic labels. Some typical examples are shown in Fig. 13.

In Fig. 13, the first line is the original image, the second line is the ground truth, the third line is the probability map generated by the Seg net, the fourth line is the output of the Seg net, and the fifth line is the corresponding confusion matrix of the Seg net output. These examples show that the results obtained from the Seg net are greatly improved compared with the synthetic labels. In Fig. 13, (a, b) are relatively simple cracks, the crack width in (c) varies greatly, the crack in (d) passes through paint, the background material in (e) is very rough, the crack topology in (f, g) is more complex, and the crack in (h) is very wide. According to the Seg net output in the fourth line, the proposed method can effectively deal with cracks in complex situations. For some cracks that could not be correctly handled by DenseCRF in Fig. 12, the Seg net can deal with them more accurately, e.g., the cracks in (g, h). This is because the DL-based Seg net has stronger feature extraction capabilities than DenseCRF, and thus it can more accurately extract the pixel-level semantic information of the cracks.

7. Conclusion

This study proposes a patch-based weakly supervised semantic segmentation method for crack detection. The proposed method only needs to annotate image-level labels, rather than pixel-level labels, to complete the training. This method crops the image into patches to reduce the complexity, and a discriminative localization technique and CRF processing are used to obtain the synthetic labels. The synthetic labels are then used to train a segmentation network, and a neighborhood fusion strategy is utilized to merge the patches into the final output.

Fig. 12. Examples of synthetic labels.
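The P/R asymmetry noted for the synthetic labels of Fig. 12 (FN often greater than FP) follows directly from the definitions; the counts below are illustrative, not taken from the paper:

```python
def precision_recall(TP, FP, FN):
    P = TP / (TP + FP)  # few false positives -> high precision
    R = TP / (TP + FN)  # many missed crack pixels -> low recall
    return P, R

# Illustrative counts with FN > FP, as observed for the synthetic labels:
P, R = precision_recall(TP=400, FP=30, FN=200)  # P ~ 0.93, R ~ 0.67
```

In other words, DenseCRF tends to under-segment the crack: what it does mark as crack is usually right (high P), but it misses many true crack pixels (low R).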

Fig. 13. Examples of Seg net outputs.

The proposed method has three main contributions, as follows:

1. A patch-based weakly supervised semantic segmentation network is proposed for crack detection. Instead of using pixel-level annotation, this method uses image-level annotation as labels for training. As a result, the proposed method reduces the cost of annotation, and thus the image segmentation task can be more widely used for crack images.

2. This method can be flexibly applied to images of different sizes. A crack image often has a complex topological structure and locally similar characteristics. Accordingly, a patch-based method is designed to crop the complex original images into smaller patches before processing. This method reduces the difficulty of data processing and compensates for the inaccuracy caused by incomplete annotation.

3. The attention mechanism is introduced to improve the performance of the proposed method. Two public datasets are used to train and evaluate the proposed method, and the results indicate that this method can achieve a performance comparable to that of a fully supervised model. This method balances the trade-off between the annotation workload and performance.

The proposed WSL method significantly reduces the annotation workload, which can lower the threshold for applying semantic segmentation methods to crack detection, thus allowing cracks to be detected and analyzed more effectively. To the best of the authors' understanding, this study is the first time that a WSL semantic segmentation method has been applied to research in civil infrastructure engineering.

In future research, this method will be applied in more varied scenarios, and the method will be improved to enhance its ability to detect tiny cracks. In addition, applied research based on the semantic segmentation results of the crack images will be conducted.

CRediT authorship contribution statement

Dong Zhiming: Conceptualization, Methodology, Writing - original draft. Wang Jiajun: Writing - review & editing, Supervision, Resources. Cui Bo: Resources, Supervision, Funding acquisition. Wang Dong: Writing - review & editing. Wang Xiaoling: Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Funding: This work was supported by the Yalong River Joint Funds of the National Natural Science Foundation of China (grant number U1965207); the National Natural Science Foundation of China (grant number 51839007); and the National Natural Science Foundation of China (grant number 51779169).

References

[1] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012).
[2] S. Ren, K. He, R.B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 1137–1149, https://doi.org/10.1109/TPAMI.2016.2577031.
[3] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2015), Boston, MA, USA, 2015, pp. 3431–3440, https://doi.org/10.1109/CVPR.2015.7298965.
[4] B. Kim, S. Cho, Image-based concrete crack assessment using mask and region-based convolutional neural network, Struct. Control Heal. Monit. 26 (2019) 1–15, https://doi.org/10.1002/stc.2381.
[5] X. Yang, H. Li, Y. Yu, X. Luo, T. Huang, X. Yang, Automatic pixel-level crack detection and measurement using fully convolutional network, Comput. Civ. Infrastruct. Eng. 33 (2018) 1090–1109, https://doi.org/10.1111/mice.12412.
[6] Z. Tong, J. Gao, Z. Wang, Y. Wei, H. Dou, A new method for CF morphology distribution evaluation and CFRC property prediction using cascade deep learning, Constr. Build. Mater. 222 (2019) 829–838, https://doi.org/10.1016/j.conbuildmat.2019.06.160.
[7] S. Dorafshan, R.J. Thomas, M. Maguire, Comparison of deep convolutional neural networks and edge detectors for image-based crack detection in concrete, Constr. Build. Mater. 186 (2018) 1031–1045, https://doi.org/10.1016/j.conbuildmat.2018.08.011.
[8] S. Zhou, W. Sheng, Z. Wang, W. Yao, H. Huang, Y. Wei, R. Li, Quick image analysis of concrete pore structure based on deep learning, Constr. Build. Mater. 208 (2019) 144–157, https://doi.org/10.1016/j.conbuildmat.2019.03.006.
[9] Z. Liu, Y. Cao, Y. Wang, W. Wang, Computer vision-based concrete crack detection using U-net fully convolutional networks, Autom. Constr. 104 (2019) 129–139, https://doi.org/10.1016/j.autcon.2019.04.005.
[10] M. Wang, J.C.P. Cheng, A unified convolutional neural network integrated with conditional random field for pipe defect segmentation, Comput. Civ. Infrastruct. Eng. (2019) 1–16, https://doi.org/10.1111/mice.12481.
[11] Y. Liu, J. Yao, X. Lu, R. Xie, L. Li, DeepCrack: a deep hierarchical feature learning architecture for crack segmentation, Neurocomputing 338 (2019) 139–153, https://doi.org/10.1016/j.neucom.2019.01.036.
[12] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2018) 834–848, https://doi.org/10.1109/TPAMI.2017.2699184.
[13] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 2015, pp. 234–241, https://doi.org/10.1007/978-3-319-24574-4_28.
[14] P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context, in: Computer Vision – ECCV 2014, Zurich, Switzerland, 2014, pp. 740–755, https://doi.org/10.1007/978-3-319-10602-1_48.
[15] J. Dai, K. He, J. Sun, BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation, in: Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1635–1643, https://doi.org/10.1109/ICCV.2015.191.
[16] M. Rajchl, M.C.H. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M.A. Rutherford, J.V. Hajnal, B. Kainz, D. Rueckert, DeepCut: object segmentation from bounding box annotations using convolutional neural networks, IEEE Trans. Med. Imaging 36 (2017) 674–683, https://doi.org/10.1109/TMI.2016.2621185.
[17] D. Pathak, P. Krahenbuhl, T. Darrell, Constrained convolutional neural networks for weakly supervised segmentation, in: Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1796–1804, https://doi.org/10.1109/ICCV.2015.209.
[18] T. Durand, T. Mordan, N. Thome, M. Cord, WILDCAT: weakly supervised learning of deep ConvNets for image classification, pointwise localization and segmentation, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR 2017), 2017, pp. 5957–5966, https://doi.org/10.1109/CVPR.2017.631.
[19] A. Bearman, O. Russakovsky, V. Ferrari, L. Fei-Fei, What's the point: semantic segmentation with point supervision, (n.d.).
[20] D. Lin, J. Dai, J. Jia, K. He, J. Sun, ScribbleSup: scribble-supervised convolutional networks for semantic segmentation, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3159–3167, https://doi.org/10.1109/CVPR.2016.344.
[21] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2921–2929, https://doi.org/10.1109/CVPR.2016.319.
[22] J. Ahn, S. Kwak, Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4981–4990, https://doi.org/10.1109/CVPR.2018.00523.
[23] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: visual explanations from deep networks via gradient-based localization, in: IEEE Int. Conf. Comput. Vis. (ICCV 2017), Venice, Italy, 2017, pp. 618–626, https://doi.org/10.1109/ICCV.2017.74.
[24] Z. Huang, X. Wang, J. Wang, W. Liu, J. Wang, Weakly-supervised semantic segmentation network with deep seeded region growing, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7014–7023, https://doi.org/10.1109/CVPR.2018.00733.
[25] H.D. Cheng, J.-R. Chen, C. Glazier, Y. Hu, Novel approach to pavement cracking detection based on fuzzy set theory, J. Comput. Civ. Eng. 13 (1999) 270–280, https://doi.org/10.1061/(ASCE)0887-3801(1999)13:4(270).
[26] T. Yamaguchi, S. Nakamura, R. Saegusa, S. Hashimoto, Image-based crack detection for real concrete surfaces, IEEJ Trans. Electr. Electron. Eng. 3 (2008) 128–135, https://doi.org/10.1002/tee.20244.
[27] H. Oliveira, P. Correia, Automatic road crack segmentation using entropy and image dynamic thresholding, 2009.
[28] L. Li, Q. Wang, G. Zhang, L. Shi, J. Dong, P. Jia, A method of detecting the cracks of concrete undergo high-temperature, Constr. Build. Mater. 162 (2018) 345–358, https://doi.org/10.1016/j.conbuildmat.2017.12.010.
[29] A. Frangi, W.J. Niessen, R. Hoogeveen, T. Walsum, M. Viergever, Model-based quantitation of 3-D magnetic resonance angiographic images, IEEE Trans. Med. Imaging 18 (1999) 946–956, https://doi.org/10.1109/42.811279.
[30] H.K. Ryu, J.K. Lee, E.T. Hwang, J. Liu, H.H. Lee, W.H. Choi, A new corner detection method of gray-level image using Hessian matrix, 2007, pp. 537–540.

[31] K. Gopalakrishnan, S.K. Khaitan, A. Choudhary, A. Agrawal, Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection, Constr. Build. Mater. 157 (2017) 322–330, https://doi.org/10.1016/j.conbuildmat.2017.09.110.
[32] J.C.P. Cheng, M. Wang, Automated detection of sewer pipe defects in closed-circuit television images using deep learning techniques, Autom. Constr. 95 (2018) 155–171, https://doi.org/10.1016/j.autcon.2018.08.006.
[33] Y. Xu, S. Wei, Y. Bao, H. Li, Automatic seismic damage identification of reinforced concrete columns from images by a region-based deep convolutional neural network, Struct. Control Heal. Monit. 26 (2019) 1–22, https://doi.org/10.1002/stc.2313.
[34] D. Li, A. Cong, S. Guo, Sewer damage detection from imbalanced CCTV inspection data using deep convolutional neural networks with hierarchical classification, Autom. Constr. (2019).
[35] S. Bang, S. Park, H. Kim, H. Kim, Encoder–decoder network for pixel-level road crack detection in black-box images, Comput. Civ. Infrastruct. Eng. 34 (2019) 713–727, https://doi.org/10.1111/mice.12440.
[36] Y.J. Cha, W. Choi, O. Büyüköztürk, Deep learning-based crack damage detection using convolutional neural networks, Comput. Civ. Infrastruct. Eng. 32 (2017) 361–378, https://doi.org/10.1111/mice.12263.
[37] R. Ali, Y.J. Cha, Subsurface damage detection of a steel bridge using deep learning and uncooled micro-bolometer, Constr. Build. Mater. 226 (2019) 376–387, https://doi.org/10.1016/j.conbuildmat.2019.07.293.
[38] S. Park, S. Bang, H. Kam, H. Kim, Patch-based crack detection in black box images using convolutional neural networks, J. Comput. Civ. Eng. 33 (2019) 1–11, https://doi.org/10.1061/(ASCE)CP.1943-5487.0000831.
[39] X. Zhang, D. Rajan, B. Story, Concrete crack detection using context-aware deep semantic segmentation network, Comput. Civ. Infrastruct. Eng. 34 (2019) 951–971, https://doi.org/10.1111/mice.12477.
[40] F.C. Chen, M.R. Jahanshahi, NB-CNN: deep learning-based crack detection using convolutional neural network and naïve Bayes data fusion, IEEE Trans. Ind. Electron. 65 (2018) 4392–4400, https://doi.org/10.1109/TIE.2017.2764844.
[41] N.A.M. Yusof, A. Ibrahim, M.H.M. Noor, N.M. Tahir, N.M. Yusof, N.Z. Abidin, M.K. Osman, Deep convolution neural network for crack detection on asphalt pavement, J. Phys. Conf. Ser. 1349 (2019), https://doi.org/10.1088/1742-6596/1349/1/012020.
[42] A. Kolesnikov, C.H. Lampert, Seed, expand and constrain: three principles for weakly-supervised image segmentation, in: Lect. Notes Comput. Sci. 9908 LNCS, 2016, pp. 695–711, https://doi.org/10.1007/978-3-319-46493-0_42.
[43] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, arXiv, 2017.
[44] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, H. Lu, Dual attention network for scene segmentation, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3146–3154.
[45] Z. Wang, N. Zou, D. Shen, S. Ji, Non-local U-Net for biomedical image segmentation, 2018, http://arxiv.org/abs/1812.04103.
[46] P. Krähenbühl, V. Koltun, Efficient inference in fully connected CRFs with Gaussian edge potentials, in: Adv. Neural Inf. Process. Syst. 24 (NIPS 2011), 2011, pp. 1–9.
[47] S. Zheng, S. Jayasumana, P.H.S. Torr, Conditional random fields as recurrent neural networks, 2015, https://doi.org/10.1109/ICCV.2015.179.
[48] M.T.T. Teichmann, R. Cipolla, Convolutional CRFs for semantic segmentation, (n.d.).
[49] Y.S. Aurelio, G.M. de Almeida, C.L. de Castro, A.P. Braga, Learning from imbalanced data sets with weighted cross-entropy function, Neural Process. Lett. 50 (2019) 1937–1949, https://doi.org/10.1007/s11063-018-09977-1.
[50] T. Light, P. Detection, Road crack detection and segmentation for autonomous driving, 2012.
[51] P. Whitt, Pixlr Editor tools, in: P. Whitt (Ed.), Beginning Pixlr Editor: Learn to Edit Digital Photos Using this Free Web-Based App, Apress, Berkeley, CA, 2017, pp. 19–52, https://doi.org/10.1007/978-1-4842-2698-8_2.
[52] B.C. Russell, A. Torralba, K.P. Murphy, W.T. Freeman, LabelMe: a database and web-based tool for image annotation, Int. J. Comput. Vis. 77 (2008) 157–173, https://doi.org/10.1007/s11263-007-0090-8.
