2017 - Alipour Et Al - Load Capacity Rating of Bridge Populations Through Machine Learning Application of Decision Trees
Mohamad Alipour, S.M.ASCE1; Devin K. Harris, Ph.D., A.M.ASCE2; Laura E. Barnes, Ph.D.3;
Osman E. Ozbulut, Ph.D., A.M.ASCE4; and Julia Carroll5
Abstract: The functionality of the U.S. transportation infrastructure system is dependent upon the health of an aging network of over
600,000 bridges, and agencies responsible for maintaining these bridges rely on the process of load rating to assess the adequacy of individual
structures. This paper presents a new approach for safety screening and load-capacity evaluation of large bridge populations that seeks to
uncover heretofore unseen patterns within the National Bridge Inventory database and establish relationships between select bridge attributes
and their load-capacity status. Decision-tree and random-forest classification models were trained on the national concrete slab bridge data
set of over 40,000 structures. The resulting models were validated on an independent data set and then compared with a number of existing
judgment-based schemes found in an extensive survey of the current state of practice in the United States. The proposed approach offers a method that provides guidance for improved allocation of resources by informing maintenance decisions through rapid identification of candidate bridges that require further scrutiny for either possible load restriction or restriction removal. DOI: 10.1061/(ASCE)BE.1943-5592.0001103. © 2017 American Society of Civil Engineers.
Author keywords: National Bridge Inventory (NBI); Data-driven; Load rating; Load posting; Decision trees; Random forests.
[Flowchart summary: if the rating factor RF < 1, refined methods or load testing may optionally be applied; if RF remains below 1, posting or repair/rehabilitation is required and no permit rating is given. If RF ≥ 1, posting is not required and the bridge may be rated for permit loads.]
Fig. 1. (a) Bridge load-rating and posting process; (b) load-posting signage (image by author)
management strategies and safety-related decisions ranging from permitting and load posting to rehabilitation, or an even more conservative approach of terminating operations and replacing the structure.

At the national level, it is of primary interest to the Federal Highway Administration (FHWA), the agency responsible for establishing and overseeing bridge safety requirements, to improve the fidelity of the current rating and performance evaluation procedures to optimize the administration of federal funds and advance the level of safety in the national highway system (OIG 2006). Considering the shortcomings of the current procedures, responsible authorities are concerned with the level of errors involved in the current load-posting status of bridges within the National Bridge Inventory (NBI) database (OIG 2006). In addition to safety-related concerns, bridge owners are hesitant to impede commerce or increase operational costs and travel time by imposing unnecessary load limits. To overcome these issues, it has been recommended that state DOTs and local agencies develop data-driven, risk-based approaches for oversight and updating the load-posting status of their in-service inventory (OIG 2006).

Within this context, this study leverages emerging machine-learning techniques to correlate select attributes of the in-service structures available within the NBI database to predict the load-posting status of each structure. This process serves to reveal hidden patterns, uncover possibly causal relationships, and identify structures that may be misclassified. The proposed approach is expected to inform maintenance and management decisions for in-service bridges, with the potential to optimize the allocation of resources based on structural vulnerability and/or the opportunity for improved traffic flow. The two major outcomes of the proposed system, if implemented in practice, are as follows:
1. Providing data-driven load postings for bridges with missing or incomplete design information and structural plans, based on objective data analysis rather than subjective judgment.
2. Rapid screening of the entire inventory to find candidate bridges for further analysis and possible required load posting or posting removal, where misclassification is suspected.

Note that, for bridges with available design information, the safe load-carrying capacities are calculated using structural analysis, with the same principles assumed in the design process (NCHRP 2014); however, the subjectivity of the process is amplified dramatically for structures with missing or incomplete design information, which are currently rated based on engineering judgment. Furthermore, although the study presented herein has been limited to a specific class of structures (RC slab bridges), the proposed approach is generic, allowing for extrapolation across other bridge types, and is expected to promote further discussion on the topic within the national and international bridge community.

Review of Current Practice in the United States

An examination of the statistics on the methods used in the United States to load rate highway bridges shows the primary approach to be analytical methods. In 2014, among bridges longer than 6 m (20 ft), over 80% of the bridges in the NBI were load rated using analytical methods, 0.6% were load tested, and the remainder were load rated using engineering judgment or not rated at all (Harris et al. 2015). In many of these cases, information such as design or construction plans is not available for load ratings, preventing the use of analytical methods. Article 6.1.4 of the MBE states that, for bridges lacking sufficient design details, a physical inspection of the bridge by a qualified inspector and evaluation by a qualified engineer may be sufficient to establish an approximate load rating; load tests may also be considered (AASHTO 2011). Further investigation of the NBI data shows that those bridges lacking design plans are usually short-span, local, and low-traffic bridges, for which resource-intensive load tests are usually deemed impractical considering the relative importance of the bridge. Without the benefit of a load test, little direction is given as to how to implement a judgment-based rating. The MBE states that a safe load capacity can be estimated based on design live load, current condition of the structure, and live-load history; additionally, a concrete bridge with missing structural details need not be posted for restricted loading if it has been carrying normal traffic for an appreciable period of time and shows no distress (AASHTO 2011). Although the fundamental premise of these MBE prescriptions is rationally sound, their application is challenged by the lack of objective and quantitative criteria and by the intrinsic dependence on engineering judgment.

Several state manuals do, however, list the variables that should be taken into consideration during the evaluation. A few examples are UDOT (2014), CODOT (2011), and FDOT (2014). These include year of construction; design vehicle; live loads (past, present, and future); measurable structural dimensions; condition of load-carrying components; redundancy of load path; changes since original construction; comparable structures of known design; traffic characteristics; and performance of the bridge under current traffic, such as evidence of distress or excessive movement under load. These manuals do not explicitly explain the process of using these variables to arrive at a posting load. It is notable that the Massachusetts bridge manual states that engineering judgment alone is not acceptable as a rating method and requires that comprehensive field measurements, nondestructive testing, and a material testing program be performed for structures with unknown structural details (MassDOT 2013).

A number of the state DOTs have outlined systematic procedures, usually in the form of flowcharts or tables, to help engineers carry out judgment-based load postings. Oregon, Washington, Idaho, and Kentucky provide either a table or clauses for load rating bridges without plans based solely on condition ratings, in which a structure with a condition rating of 4 or less is posted (ODOT 2015; WSDOT 2015; ITD 2014; KYTC 2015). The Nebraska Department of Roads (NDOR) uses the same method, but indicates a condition rating of 3 to be used as the load-posting threshold (NDOR 2010). Pennsylvania provides a tabulated rating method that is based on condition ratings, average daily traffic (ADT), and a specific description of signs of distress on the structure (PennDOT 2010). The Texas DOT manual presents a flowchart to help rate concrete bridges without plans, which is based on the observation of signs of distress in the inspection reports, as well as structure age and condition ratings (TxDOT 2013) (Fig. 2). Note that IR and OR refer to inventory and operational level load ratings, respectively.

Machine learning seeks to mimic and automate the human learning process, finding patterns in data without explicitly being programmed (Witten et al. 2011). These algorithms have been increasingly used in a wide range of disciplines (Gullien et al. 2015; Hergenroeder et al. 2014), including civil engineering (Amiri et al. 2016; Saitta et al. 2009; Melhem and Cheng 2003; Farrar and Worden 2012; Jootoo and Lattanzi 2016). Specifically, the area of infrastructure maintenance and management, which naturally involves large data sets, has welcomed the use of these methods (Melhem and Cheng 2003; Li and Burgueño 2010; Kim and Yoon 2010; Melhem et al. 2003; Bektas et al. 2013; Morcous 2005). Li and Burgueño (2010) built models on Michigan's database of bridges to predict abutment condition ratings, whereas Huang (2010) used artificial neural network classification on Wisconsin's concrete bridges to predict deck condition ratings based on geometrical, functional, and environmental descriptors. Similarly, decision-tree classification algorithms have been used by a number of researchers to model infrastructure performance (Melhem and Cheng 2003). In a conceptually similar approach, the authors of this paper used multiple linear regression and neural networks on the Virginia database of RC slab bridges to estimate load ratings (Harris et al. 2015) and decision trees and random forests on the national database to study the feasibility of predicting load postings (Alipour et al. 2016). The present investigation contributes to the existing push toward data-driven performance assessment by reformulating the problem of predicting bridge load postings on a national level into a data-driven framework. Although possible discrepancies and inconsistencies in load-posting policies and practices in different states are expected to affect the performance of the framework, the collective study of the national inventory is expected to supply a data set with sufficient size and diversity for efficient knowledge extraction.
Fig. 2. Flowchart for load rating concrete bridges without plans (adapted from TxDOT 2013)
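The condition-rating baselines described above reduce to a one-line rule. The sketch below, written for illustration only, applies the posting threshold to the minimum of the component condition ratings; taking the minimum across deck, superstructure, and substructure is an assumption of this sketch, since the state manuals define exactly which rating to check.

```python
# Sketch of the judgment-based baseline shared by Oregon, Washington, Idaho,
# and Kentucky: a bridge with a condition rating of 4 or less is posted.
# Checking the minimum of the three component ratings is an assumption here.
def baseline_posting(deck, superstructure, substructure, threshold=4):
    """True if the bridge is flagged for posting under the condition-rating rule."""
    return min(deck, superstructure, substructure) <= threshold

print(baseline_posting(5, 6, 4))               # True: a rating of 4 triggers posting
print(baseline_posting(5, 6, 4, threshold=3))  # False under Nebraska's threshold of 3
```

Nebraska's variant is obtained simply by lowering `threshold` to 3, which illustrates how thin the quantitative content of these judgment-based schemes is.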
1. Data were collected predominantly from the NBI in 2014 and underwent a careful preprocessing step to obtain a format suitable for modeling.
2. Models were trained on the data using two popular classification algorithms (decision trees and random forests), which learn the patterns and relationships between the bridge descriptors and their load-posting status.
3. These models were then evaluated in terms of accuracy performance.
4. The developed models were tuned to maximize their performance.
5. Once finalized, the models were evaluated on an unseen validation set to report a realistic estimate of the expected future performance. The performance was also compared with the available DOT practice as a baseline.
6. Finally, the models were applied to the data set for bridges with missing plans, and the current (judgment-based) and predicted (using the models) load postings were compared.

Data Collection and Preprocessing

The data set used for this investigation was derived from the NBI database for the year 2014, which includes 610,749 structures (FHWA 2014). This study focused on highway bridges; hence, a preliminary filter was used to exclude all types of structures other than highway bridges (e.g., culverts, railroad bridges, and tunnels). Also excluded were temporary structures, those currently closed, and newly constructed and unopened bridges, which are structures in unusual service conditions and therefore known to have outdated or missing information. The resulting database included 456,219 observations. By filtering the data based on NBI Item 43A (kind of material and/or design = concrete or concrete continuous) and 43B (type of design and/or construction = slab), this study focused on RC slab bridges, which resulted in a total of 64,134 bridges (14.1%). Furthermore, 6,413 bridges reconstructed during their service life (thus their condition is not consistent with their age) and another 9,002 of those with precast decks were also excluded to obtain a consistent population.

An examination of the remaining records revealed quality issues such as typos and entry mistakes. Other examples are inconsistencies between interrelated variables, such as deck width and number of lanes, or span length and total length. In the absence of access to documentation on individual bridges, an investigation of the causes of the aforementioned inconsistencies was not possible. Therefore, the basic process of validation using domain expertise and metadata and elimination of suspicious and inexplicable instances was followed in this paper. For details on the use of domain expertise to ensure acceptable data quality, as well as a variety of systematic data-cleansing techniques, the reader is referred to Dasu and Johnson (2003). Once data cleansing was finished, all of these filters accounted for a further reduction of the population to 47,385 structures.

Of these remaining structures, 5,151 bridges are currently either load rated based on judgment or are not load rated at all. Based on a previous study (Harris et al. 2015), evidence suggested that the majority of these bridges likely lack sufficient as-built structural drawings and documents necessary for a proper load rating. In these cases, load rating and posting decisions are typically based on subjective engineering judgment rather than objective analytical capacity calculations. This data set, referred to as bridges without plans, is not used in modeling, but is later used in an application of the constructed models. From the remaining data, which are bridges with plans (and thus have analytical calculation-based load ratings and postings), an independent validation set of identical size (5,151) was randomly chosen and kept aside. This validation data set was never seen in any of the modeling or analysis steps and was exclusively used to assess the performance of the finalized models. The rest of the data (37,083 instances), referred to as the training and test set in Table 1, are later split into a training set to create models and a test set to evaluate them (see the "Performance Evaluation" section). The number of observations used in each step is summarized in Table 1.

The NBI database includes 116 items, some with multiple entries and descriptors for each bridge; however, many of the descriptors describe nonstructural properties of a bridge, such as bridge name and identification numbers, geographical information, hydrological features, properties of the wearing surface, safety
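The preliminary filtering described above can be sketched as a pair of predicates applied to the raw records. The dictionary keys below are illustrative stand-ins for NBI Items 43A/43B and the service-status fields, not the official NBI field codes.

```python
# Sketch: the preliminary NBI filters, applied to a toy list of records.
# Field names are illustrative assumptions, not the real NBI column names.
records = [
    {"id": "A", "item_43a": "concrete", "item_43b": "slab", "temporary": False, "closed": False},
    {"id": "B", "item_43a": "steel", "item_43b": "girder", "temporary": False, "closed": False},
    {"id": "C", "item_43a": "concrete continuous", "item_43b": "slab", "temporary": True, "closed": False},
]

def is_rc_slab(rec):
    # Item 43A: concrete or concrete continuous; Item 43B: slab
    return rec["item_43a"] in ("concrete", "concrete continuous") and rec["item_43b"] == "slab"

def in_normal_service(rec):
    # Exclude temporary and closed structures (unusual service conditions)
    return not (rec["temporary"] or rec["closed"])

kept = [r["id"] for r in records if is_rc_slab(r) and in_normal_service(r)]
print(kept)  # ['A']
```

In the study the same logic is applied at the scale of the full 610,749-record inventory; only the predicates change, not the shape of the pipeline.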
[Figure: the National Bridge Inventory (NBI) data are partitioned into bridges with design plans, which are split into training, test, and validation sets, and bridges without design plans.]
attributes. Deck, superstructure, and substructure condition ratings and deck geometry evaluation ratings, which were initially on a scale of 1–9, were recategorized as poor (rating <5), fair (5), and good (>5), following the recommendation of the MBE (AASHTO 2011). Design loads were also recategorized into three groups of heavy, light, and other vehicles based on their equivalent truck weight. Three new attributes were also derived using the information in the NBI: superstructure continuity, existence of water (over water versus over a highway/railroad), and urban (yes or no). Two more attributes, sine(skew) and log(ADT), were created using mathematical transformations of the current attributes skew angle and ADT, because the alternative representations were found to perform better than the original attributes. Finally, two more attributes were imported from external sources to represent climatic and economic conditions of the locations of the selected bridges. Each bridge instance was assigned to one of nine climatic zones as classified by the National Centers for Environmental Information: northwest, west, southwest, south, southeast, northeast, central, east north central, and west north central (Karl and Koss 1984). The economic condition for each bridge instance was characterized by the ratio of the state's gross domestic product (GDP) to its respective number of bridges (categorized into three groups as high, low, and moderate), thus attempting to represent a possible disparity in the maintenance and management of bridges. Although this attribute is expected to provide an indirect indication of the relative resources available to the states for maintenance and upkeep, future study is required to design more accurate economic attributes based on different infrastructure funding mechanisms. Table 2 summarizes the 24 resulting attributes (including the class) used in modeling and the basic statistics for each attribute.

Table 2. Attributes used in modeling and basic statistics
Item | Attribute name–NBI item number | Unit | Type | Range/value | Mean | Standard deviation
Input attributes:
1 | Age–27a | Year | Numeric | [1, 138] | 42.5 | 22.7
2 | Deck geometry evaluation–68 | — | Nominal | Good (54.1%), fair (24.4%), poor (21.6%) | — | —
3 | Maintenance responsibility–21a | — | Nominal | State (43.2%), county (41.2%), city | — | —

Modeling Techniques

Decision Trees

Decision trees are a popular family of classification models that have been used in various domains and applications to approximate discrete target functions. The learned functions are composed of a series of successive decisions, represented by branches, which ultimately terminate in a target class, represented by leaves. Each node in the tree represents an attribute of the observation, and each branch of the tree represents one of the possible values of the attribute. A particular observation is classified by beginning at the root node of the tree and traversing the entire tree, testing the attribute specified by the corresponding node of the tree, with the process being repeated for each subtree (Mitchell 1997).

In this work, the C4.5 algorithm was used for tree induction, which is a top-down recursive splitting algorithm (Quinlan 1993). Assume a collection of observations from the training data as S, with c classes and p_i the proportion of S belonging to class i; the entropy of S is then

E(S) = −∑_{i=1}^{c} p_i log2(p_i)    (1)

Information gain (G) is then defined as the reduction in entropy from the original S using the attribute A. This can be seen in Eq. (2), where the first term is the entropy of S and the second term is the sum of entropies of the subsets S_v made by A, weighted by the fraction of instances of S that belong to S_v. In this equation, Values(A) is the set of all possible values of attribute A, and S_v is the subset of S where attribute A has the value of v:

G(S, A) = E(S) − ∑_{v ∈ Values(A)} (|S_v| / |S|) E(S_v)    (2)

The tree will thus be formed by recursively splitting the data at the nodes using the attribute with the maximum information gain. The stopping criterion is the minimum number of instances per leaf (final node), which can determine the size of the tree; a larger minimum leaf size results in a smaller tree, whereas a smaller minimum leaf size allows for more complexity. Further details on the theory and construction of decision trees, such as alternative impurity measures and pruning methods, can be found in the works by Witten et al. (2011) and Mitchell (1997).

Decision trees offer a number of advantages that make them a suitable choice for this investigation. They can be visualized and are relatively transparent and easy to construct and understand. They are also very easy to interpret and can be translated into simple if–then rules. These characteristics make them a suitable choice for applications involving nonmachine-learning experts, such as bridge owners, infrastructure managers, and maintenance officials. In addition, decision-tree learning is naturally able to handle both numerical and categorical data and is not sensitive to outliers. In this work, binary splits were used and the minimum leaf size was varied to change the size of the trees constructed (Witten et al. 1999).

Random Forests

To provide further comparison, C4.5 decision-tree learning was also evaluated against an ensemble learning technique called random forests (Breiman 2001). This method has gained significant traction because it is invariant under scaling and various other feature transformations, and it is robust to the inclusion of irrelevant features (Hastie et al. 2009). In random forests, a large number of decision trees are constructed on randomized samples, obtained with replacement, of the same size as the training data, each with a few randomly selected attributes, and an instance is classified by taking the majority vote among all the trees. The decision trees are usually grown to the fullest, and the optimal number of trees and attributes is usually problem-dependent; for a typical classification problem with p attributes, √p has been recommended as the number of random attributes (Hastie et al. 2009). As a result, all attributes will have the opportunity to be used in a number of trees and contribute to the model, thus adding to its accuracy and stability. In this study, random forests were trained on the data while varying the number of attributes to be used in each tree and the number of trees in the forest.
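The entropy and information-gain criteria of Eqs. (1) and (2) can be sketched in a few lines. This is a minimal illustration of the splitting criterion only, not the C4.5 implementation used in the study (which was Weka's).

```python
import math
from collections import Counter

def entropy(labels):
    """Eq. (1): E(S) = -sum_i p_i * log2(p_i), over the class proportions of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Eq. (2): G(S, A) = E(S) - sum_v (|S_v|/|S|) * E(S_v), where attribute A
    partitions S into the subsets S_v by its value v."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    weighted = sum(len(sv) / n * entropy(sv) for sv in subsets.values())
    return entropy(labels) - weighted

# A perfectly informative attribute recovers all the entropy of the class.
labels = ["posted", "posted", "unposted", "unposted"]
condition = ["poor", "poor", "good", "good"]
print(entropy(labels))                      # 1.0 bit for a balanced binary class
print(information_gain(condition, labels))  # 1.0: the split leaves pure leaves
```

At each node, C4.5 evaluates `information_gain` for every candidate attribute and splits on the maximizer; the recursion stops when a subset falls below the minimum leaf size.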
Performance Evaluation

In each classification task, there is a training and testing phase to build and tune the model and a validation phase to evaluate the model's performance on unseen data. In this research, a holdout testing method was used in which two-thirds of the data were randomly selected as the training set and the rest as the test set. Furthermore, as previously described, a separate validation set was also randomly selected and kept intact until modeling was finalized. The models were constructed on the training set, and optimum values of the parameters in the models were found by evaluating performance on the test set (Table 1). Once the models were fully tuned and finalized, they were tested on this validation set to determine a reliable estimate of the future error.

When evaluating a classifier on the test set, the model provides a confusion matrix as its output, which summarizes the number of correctly and incorrectly classified instances in each of the classes involved. Consistent with the regular convention in binary-class problems, such as the present problem, posted bridges were assumed positive and unposted bridges negative. Hence, the confusion matrix includes the true positives (TPs) and true negatives (TNs) as the total number of instances correctly classified as posted and unposted, respectively. In contrast, it also includes false positives (FPs), which are the unposted bridges incorrectly classified as posted, and false negatives (FNs), which are posted bridges incorrectly classified as unposted (Fig. 4).

Accuracy is defined as the ratio of all correctly classified instances to all data (Fig. 4) and is generally the most widely used criterion to evaluate a learner. It should be noted, however, that this is an insufficient measure for the class-imbalanced problem at hand (see the next section); by simply guessing the class of all instances as not posted, an accuracy of 91.1% will be achieved, which is equal to the ratio of all unposted bridges to the total number of instances under study. In other words, the misclassification of the minority class (posted), which is actually the main focus of this study, will not contribute to more than 8.9% of this measure.
FPR error produces the opposite effect, resulting in unnecessary conservatism and loss of resources.

Although both the FPR and FNR errors are used to assess the models, it is recognized that neither can be the sole criterion for determining the best model, and a single-figure criterion is needed for that purpose. The literature on classification, and class-imbalanced problems in particular, suggests that the area under the receiver operating characteristic (ROC) curve, denoted as AUC, can be used as an efficient single-figure estimate of a classifier's overall accuracy over the full range of FPR–FNR trade-offs in the test set. A ROC curve is defined as a two-dimensional curve plotted in a coordinate system with the ratio of TPs over all positives (TP + FN) on the y-axis and the FPR on the x-axis. For a classifier, the curve is constructed by plotting points corresponding to classification results with different decision thresholds in this coordinate system. Once the curve is plotted for a classifier, the AUC can be measured and reported as a unified performance measure (Fawcett 2006). Further details in this regard can be found in the literature (Huang and Ling 2005; Chawla 2005).

Class Imbalance Treatment

One of the challenges frequently encountered in classification tasks is that the class attribute may have an imbalanced distribution, meaning that one label (minority) of the class attribute is much less frequent than the other (majority). Often, the detection and prediction of the minority group is more of interest. Nevertheless, classification algorithms generally tend to focus on maximizing the correct classifications (accuracy) rather than on the minority group. Hence, when a model is trained on a class-imbalanced data set, good accuracy (and thus very low FPR) is achieved, whereas most of the minority instances are incorrectly classified as majority (unacceptably high FNR). Resampling and cost-sensitive learning are two common remedies for class-imbalanced classification (Visa and Ralescu 2005; Chawla et al. 2002). In resampling methods, the ratio of the minority and majority is adjusted toward balance by randomly undersampling the majority (removing a number of majority instances), randomly oversampling the minority [adding artificial minority instances using the synthetic minority oversampling technique (SMOTE) (Chawla et al. 2002)], or a combination thereof. In this investigation, four different resampling scenarios were tested: two of them resampled the data to a 1:1 class ratio, and the other two to a 3:1 class ratio. One of the 1:1 scenarios was achieved by undersampling the majority to equal the minority (R1), whereas the other first undersampled the majority to 3 times the minority and then added enough SMOTE samples to make the classes equal (R4). One of the 3:1 scenarios used undersampling (R2), whereas the other used SMOTE to achieve this ratio (R3). A summary of the four resampling scenarios, together with the resulting number of observations used in each scenario, is given in Table 3. It should be noted that, in all of these scenarios, the resampling is performed only on the training set; the test set was not resampled, to achieve a more reliable evaluation.

A second option in dealing with class-imbalanced data is the use of cost-sensitive classifiers. This methodology is based on the idea that the relative costs of FP and FN errors are introduced in model training by internal instance reweighting, such that the model is constructed by minimizing the misclassification cost rather than raw accuracy, thus making the base classifier cost-sensitive (Witten et al. 2011). In this investigation, a cost-sensitive meta classifier was used, which takes as input a cost matrix and the base classifier to be used, which in this case was either a decision tree or a random forest. At this stage of the investigation, the true costs of the problem are unknown and would need to be derived from detailed risk analysis considering bridge owners' feedback. However, in this investigation, the approach was tested using three cases of relative misclassification costs (Table 4).
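One common realization of cost-sensitive classification is threshold moving: for a 2×2 cost matrix with FP cost c_fp and FN cost c_fn, predicting "posted" has the lower expected cost whenever P(posted) > c_fp / (c_fp + c_fn). The sketch below shows only that decision rule, not the instance-reweighting internals of the meta classifier used in the study; the cost pairs echo the FNR-cost cases listed in Table 4.

```python
# Sketch: the posting decision threshold implied by a 2x2 cost matrix.
# Predicting "posted" minimizes expected cost when P(posted) exceeds this value.
def posting_threshold(c_fp, c_fn):
    return c_fp / (c_fp + c_fn)

scenarios = {"C0": (1, 1), "C1": (1, 5), "C2": (1, 10), "C3": (1, 15)}
for name, (c_fp, c_fn) in scenarios.items():
    print(f"{name}: predict 'posted' when P(posted) > {posting_threshold(c_fp, c_fn):.3f}")
# Raising the FN cost lowers the threshold, flagging more candidate posted bridges.
```

Either mechanism, reweighting or threshold moving, pushes the classifier toward catching more of the minority (posted) class at the expense of a higher FPR.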
ACC = (TP + TN) / (TP + TN + FP + FN)
FPR = FP / (FP + TN)
FNR = FN / (TP + FN)
Fig. 4. (a) Confusion matrix for the classification task; (b) performance criteria equations (ACC denotes accuracy)
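The ROC construction described in the "Performance Evaluation" section (TPR against FPR as the decision threshold is swept) can be sketched in plain Python. This minimal version assumes distinct scores; tied scores would need to be grouped.

```python
# Sketch: ROC points and trapezoidal AUC from classifier scores.
def roc_auc(scores, labels):
    """labels: 1 = posted (positive), 0 = unposted (negative). Returns (points, auc)."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score; each prefix corresponds to one threshold.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pts, tp, fp = [(0.0, 0.0)], 0, 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))  # (FPR, TPR)
    # Trapezoidal integration of TPR over FPR.
    auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
    return pts, auc

# A score that ranks every posted bridge above every unposted one gives AUC = 1.
_, auc = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
print(auc)  # 1.0
```

A classifier no better than chance traces the diagonal (AUC near 0.5), which is why AUC serves as the single-figure criterion across the FPR-FNR trade-off.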
Table 3. Resampling scenarios
Number | Resampling scenario | Minority [posted (%)] | Majority [unposted (%)] | Number of instances
R0 | Original data | 3,291 (8.9) | 33,792 (91.1) | 37,083
R1 | Majority undersampling 1:1 | 3,291 (50) | 3,291 (50) | 6,582
R2 | Majority undersampling 3:1 | 3,291 (25) | 9,873 (75) | 13,164
R3 | SMOTE minority oversampling 242.3% | 11,264 (25) | 33,792 (75) | 45,056
R4 | Majority undersampling 3:1 + SMOTE minority oversampling 200% | 9,873 (50) | 9,873 (50) | 19,746
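The undersampling scenarios R1 and R2 in Table 3 can be sketched directly; the class counts below reproduce the table's instance totals. The SMOTE-based scenarios (R3, R4) additionally synthesize artificial minority samples and are omitted here; in practice a package such as imbalanced-learn provides SMOTE.

```python
import random

# Sketch: random majority undersampling to a target class ratio (scenarios R1, R2).
def undersample(majority, minority, ratio, rng):
    """Keep ratio * len(minority) majority instances, plus the whole minority."""
    return rng.sample(majority, ratio * len(minority)) + minority

rng = random.Random(0)
majority = list(range(33_792))  # stand-ins for the unposted bridges
minority = list(range(3_291))   # stand-ins for the posted bridges

r1 = undersample(majority, minority, ratio=1, rng=rng)  # 1:1 -> 6,582 instances
r2 = undersample(majority, minority, ratio=3, rng=rng)  # 3:1 -> 13,164 instances
print(len(r1), len(r2))  # 6582 13164
```

As noted in the text, resampling is applied to the training pool only; the test and validation sets keep their natural class distribution.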
Results

This section presents the results of training, tuning, and testing several viable models. All modeling and analysis reported in this section were done using Weka 3.6. Weka is an open-source machine-learning software package developed at the University of Waikato, New Zealand, which offers efficient implementations of many classification algorithms, including the decision trees and random forests that were used in this work (Witten et al. 1999).

Table 4. Relative Misclassification Costs Used in Cost-Sensitive Classifiers
Number | Cost-sensitive scenario | FPR cost | FNR cost | Number of instances
C0 | Original data | 1 | 1 | 37,083
C1 | Low FNR cost | 1 | 5 | 37,083
C2 | Moderate FNR cost | 1 | 10 | 37,083
C3 | High FNR cost | 1 | 15 | 37,083

Scenarios R0, R2, and R3 are characterized by poor detection of the minority (posted) class, with FN errors of more than 50%. This means that over 50% of all posted bridges have been predicted as not posted. In contrast, R1 is the strongest scenario, with FNRs between 10 and 20%, and R4 as the second best. However, R1 has the highest FPR, and thus the worst performance in detecting the unposted bridges. Therefore, a question arises as to which model should be selected. The answer is given by the AUC criterion [Fig. 5(c)]. Interestingly, R4 possesses the highest AUC and thus the best overall predictive performance. Intuitively, both R2 and R3 have a 3:1 ratio of unposted to posted instances, thus giving a higher weight to unposted predictions and yielding a higher FNR. In addition, R3 is slightly better than R2 according to the AUC, which is in agreement with Chawla et al. (2002), who stated that SMOTE is expected to perform better than simple subsampling. Also, although R4 and R1 both have a 1:1 class ratio, R4 is derived from a 3-times larger sample size, thus constructing a more powerful model, and as such is selected as the preferred model. Another conclusion from Fig. 5(c) is that a minimum leaf size of 25 provides the best AUC and will thus be considered the optimum leaf size for the models developed in this work. Figs. 5(d–f) also present the results of the cost-sensitive cases for decision-tree induction. In a manner similar to the
Fig. 5. Decision-tree results for different leaf sizes (a and d) FNR; (b and e) FPR; (c and f) AUC
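The cost-sensitive scenarios in Table 4 penalize a false negative 5–15 times more heavily than a false positive. One standard way such a cost matrix changes a classifier's behavior is at the decision threshold: predict whichever label minimizes the expected misclassification cost given the model's estimated posting probability. The sketch below illustrates that mechanism only; the probabilities are hypothetical, and the paper's models were trained with Weka's cost-sensitive classifiers rather than this post hoc rule.

```python
def cost_sensitive_label(p_posted, fpr_cost=1.0, fnr_cost=10.0):
    """Pick the label minimizing expected misclassification cost.

    p_posted : model's estimated probability that the bridge needs posting.
    Predicting NOT POSTED risks a false negative (cost fnr_cost);
    predicting POSTED risks a false positive (cost fpr_cost).
    """
    expected_cost_not_posted = p_posted * fnr_cost      # miss a deficient bridge
    expected_cost_posted = (1.0 - p_posted) * fpr_cost  # restrict an adequate bridge
    return "POSTED" if expected_cost_posted <= expected_cost_not_posted else "NOT POSTED"

# With symmetric costs (scenario C0), a 0.3 posting probability stays unposted ...
print(cost_sensitive_label(0.3, fpr_cost=1, fnr_cost=1))   # NOT POSTED
# ... but under scenario C2 (FNR cost = 10) the same bridge is flagged.
print(cost_sensitive_label(0.3, fpr_cost=1, fnr_cost=10))  # POSTED
```

This is why raising the FNR cost drives the FNR curves in Figs. 5(d–f) down at the price of a higher FPR.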
for the best resampled (R4) and the best cost-sensitive (C2) tree models are presented in Table 5.

The resulting structure of a developed decision-tree model can shed light on the classification problem and the interaction of attributes within the problem. For schematic purposes, Fig. 6 illustrates a sample decision tree using R1 resampling and a minimum leaf size of 100, which can illuminate trends and relationships previously undetected.

Random-Forest Results

Random-forest models were constructed for the problem by varying the number of trees in each forest and the number of attributes in each tree.
Table 5. Results of Testing the Models on the Unseen Validation Set of 5,151 Bridges
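The rates reported for the validation set reduce to simple confusion-matrix ratios. Taking POSTED as the positive (safety-critical) class, FNR = FN/(FN + TP) and FPR = FP/(FP + TN). A minimal sketch with invented labels:

```python
def confusion_rates(y_true, y_pred, positive="POSTED"):
    """Return (FNR, FPR) with `positive` as the safety-critical class.

    FNR: fraction of posting-needed bridges the model misses (safety error).
    FPR: fraction of adequate bridges flagged (over-conservative error).
    """
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return fn / (fn + tp), fp / (fp + tn)

# Invented toy labels, for illustration only
y_true = ["POSTED", "POSTED", "NOT POSTED", "NOT POSTED", "NOT POSTED"]
y_pred = ["POSTED", "NOT POSTED", "NOT POSTED", "POSTED", "NOT POSTED"]
fnr, fpr = confusion_rates(y_true, y_pred)  # (0.5, 0.333...)
```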
[Fig. 6 shows a sample decision tree: the root splits on AGE (≤ 54 yrs vs. > 54 yrs), with subsequent splits on attributes including CLIMATE ZONE (SE vs. other), ECONOMIC INDEX, and KIND OF HIGHWAY, and leaves labeled POSTED or NOT POSTED.]
Sample decision path: if age is less than 54 years, substructure condition is Other, deck width is greater than 7.2 m, and design load is Heavy, then the structure is not to be posted.
* "POSTED" denotes a bridge that requires load posting.
Fig. 6. Sample decision tree with R1 resampling and minimum leaf size of 100; SE denotes southeast
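A root-to-leaf path in a trained decision tree is just a conjunction of attribute tests, so the sample path quoted under Fig. 6 can be written out directly. The function below encodes only that one path (attribute names and thresholds taken from the figure); it illustrates how a leaf decision reads and is not a complete classifier.

```python
def sample_decision_path(age_yrs, substructure_condition, deck_width_m, design_load):
    """Encode the single path quoted under Fig. 6:
    age < 54 AND substructure == 'Other' AND deck width > 7.2 m
    AND design load == 'Heavy'  ->  NOT POSTED.
    Any other combination falls outside this path and would be
    resolved by the remaining branches of the tree."""
    if (age_yrs < 54 and substructure_condition == "Other"
            and deck_width_m > 7.2 and design_load == "Heavy"):
        return "NOT POSTED"
    return "OTHER LEAF"  # handled elsewhere in the tree

print(sample_decision_path(40, "Other", 8.0, "Heavy"))  # NOT POSTED
```

This transparency — every prediction traceable to a short, human-readable rule — is a key reason the authors favor decision trees for screening bridge populations.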
negative class. This result is to be expected considering the class imbalance and is consistent with the decision-tree results. The lowest FNR and highest FPR belong to R1 and R4, respectively; however, the FPRs are all below 15% for R1 and 10% for R4. The AUC scores show a general decreasing trend with an increase in the number of attributes, which is an expected behavior with random forests, with 4, 5, and 6 attributes working best; this is again in line with the value of √p recommended by Breiman (2001). However, Fig. 7(c) shows that all scenarios are comparable in terms of overall performance because the largest difference between the AUCs of any two scenarios is approximately 0.5%. Although R2 and R0 show a negligible advantage in AUC, Models R1, R2, and R4 were selected as the most efficient models because they also have satisfactory FNR and FPR values. Cost-sensitive random-forest models exhibited similar trends, with AUC values comparable to the R2 and R0 models; but for these models, the FNRs ranged from 27 to 36%, whereas the FPRs ranged from 3.5 to 6%; consequently, none of those models were deemed suitable as the preferred model.

bridges with plans for which actual quantitative load ratings exist (ground truth). The proposed approach (Model M3 as an example) is compared with the approach shared by Oregon, Washington, Idaho, and Kentucky, in which any bridge with a condition rating of 4 or below is to be posted (ODOT 2015; WSDOT 2015; ITD 2014; KYTC 2015). The methods in the Texas and Pennsylvania load-rating manuals require knowledge of the appearance of the member and signs of distress, and thus cannot be directly used for comparison. As an alternative, assuming no signs of structural distress, the condition ratings in the Texas DOT flowchart were used to create another baseline (TxDOT 2013). Table 6 summarizes the performance comparison and shows that the proposed approach provides superior performance, despite an overall lower accuracy, with a significantly lower FNR while maintaining an acceptable FPR. As stated before, the FNR error is a matter of safety, as an insufficient bridge is allowed to remain open, whereas the FPR error corresponds to conservative misclassifications that lead to unnecessary restrictions.
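Breiman's (2001) heuristic referenced above sets the number of attributes considered per tree to roughly √p, where p is the number of candidate attributes. A one-line sketch; p = 25 is used purely for illustration, matching the largest subset tried in Fig. 7:

```python
import math

def default_attributes_per_tree(p):
    """Breiman's (2001) rule of thumb for random forests:
    use about sqrt(p) randomly chosen attributes, where p is the
    total number of candidate attributes."""
    return max(1, round(math.sqrt(p)))

# With 25 candidate attributes the heuristic lands at 5, consistent
# with the observation that forests using 4-6 attributes per tree
# performed best.
print(default_attributes_per_tree(25))  # 5
```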
[Fig. 7 shows line plots of FNR, FPR, and AUC versus the number of attributes in each tree (3–25) for resampling scenarios R0–R4.]
Fig. 7. Comparison of different resampling scenarios for random forests with 200 trees: (a) FNR; (b) FPR; (c) ROC
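The AUC used as the model-selection criterion throughout has a direct probabilistic reading: it equals the probability that a randomly chosen positive (posted) instance is scored higher than a randomly chosen negative one. A minimal pairwise (Mann–Whitney) sketch with invented scores:

```python
def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney U statistic: the fraction of
    (positive, negative) score pairs ranked correctly, with
    ties counting as half a correct ranking."""
    wins = sum((sp > sn) + 0.5 * (sp == sn)
               for sp in scores_pos for sn in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Invented classifier scores for posted (positive) and unposted bridges:
# 11 of the 12 pairs are ordered correctly, so AUC = 11/12.
print(auc([0.9, 0.8, 0.6], [0.7, 0.4, 0.3, 0.2]))
```

Unlike raw accuracy, this pairwise view is insensitive to the 37:1 class imbalance in the bridge data, which is why the paper relies on AUC rather than accuracy to compare models.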