
Data Division for Developing Neural Networks Applied to Geotechnical Engineering

Mohamed A. Shahin1; Holger R. Maier2; and Mark B. Jaksa3

Abstract: In recent years, artificial neural networks (ANNs) have been applied to many geotechnical engineering problems with some degree of success. In the majority of these applications, data division is carried out on an arbitrary basis. However, the way the data are divided can have a significant effect on model performance. In this paper, the issue of data division and its impact on ANN model performance is investigated for a case study of predicting the settlement of shallow foundations on granular soils. Four data division methods are investigated: (1) random data division; (2) data division to ensure statistical consistency of the subsets needed for ANN model development; (3) data division using self-organizing maps (SOMs); and (4) a new data division method using fuzzy clustering. The results indicate that the statistical properties of the data in the training, testing, and validation sets need to be taken into account to ensure that optimal model performance is achieved. It is also apparent from the results that the SOM and fuzzy clustering methods are suitable approaches for data division.

DOI: 10.1061/(ASCE)0887-3801(2004)18:2(105)

CE Database subject headings: Neural networks; Fuzzy sets; Settlement; Geotechnical engineering; Data processing; Maps.

1 Postdoctoral Research Associate, School of Civil and Environmental Engineering, Univ. of Adelaide, South Australia, 5005.
2 Senior Lecturer, School of Civil and Environmental Engineering, Univ. of Adelaide, South Australia, 5005.
3 Senior Lecturer, School of Civil and Environmental Engineering, Univ. of Adelaide, South Australia, 5005 (corresponding author). E-mail: mjaksa@civeng.adelaide.edu.au

Note. Discussion open until September 1, 2004. Separate discussions must be submitted for individual papers. To extend the closing date by one month, a written request must be filed with the ASCE Managing Editor. The manuscript for this paper was submitted for review and possible publication on March 12, 2002; approved on April 29, 2003. This paper is part of the Journal of Computing in Civil Engineering, Vol. 18, No. 2, April 1, 2004. ©ASCE, ISSN 0887-3801/2004/2-105–114/$18.00.

Introduction

Artificial neural networks (ANNs) (Zurada 1992; Fausett 1994) are a form of artificial intelligence which, by means of their architecture, attempt to simulate the biological structure of the human brain and nervous system. ANNs learn "by example," in which an actual measured data set of input variables and the corresponding outputs are presented to determine the rules that govern the relationship between the variables. ANNs are similar to conventional statistical models in the sense that model parameters (i.e., connection weights) are adjusted in a model calibration phase called "training" or "learning" so as to minimize the error between the actual measured outputs and those predicted by the ANN for a particular data set (the training set). Unlike conventional statistical models, ANN models generally have a large number of model parameters and can therefore overfit the calibration data (i.e., the data used during training), especially if the calibration data are noisy. One way to avoid overfitting in ANN models is to use the cross-validation technique (Stone 1974), in which the available data are divided into three sets: training, testing, and validation. The training set is used to adjust the connection weights. The testing set is used to check the performance of the network at various stages of learning, and training is stopped once the error in the testing set increases. The validation set is used to evaluate the performance of the model once training has been successfully accomplished. Generally, cross-validation is considered to be the most effective method to ensure that overfitting does not occur (Smith 1993).

In recent times, ANNs have been applied successfully to many prediction tasks in geotechnical engineering, as they have the ability to model nonlinear relationships between a set of input variables and corresponding outputs. A comprehensive list of the applications of ANNs in geotechnical engineering is given by Shahin et al. (2001). In the majority of these applications, the data are divided into the sets needed to develop ANN models (e.g., training, testing, and validation) on an arbitrary basis. However, recent studies have shown that the way the data are divided can have a significant impact on the results obtained (Tokar and Johnson 1999).

ANNs perform best when they do not extrapolate beyond the extreme values of the data used for calibration (Minns and Hall 1996; Tokar and Johnson 1999). Consequently, in order to develop the best ANN model, given the available data, the calibration data should contain all representative patterns that are present in the available data. For example, if the available data contain data samples (records) of extreme values that are excluded from the calibration set, the model cannot be expected to perform well, as the validation data will test the model's extrapolation ability, and not its interpolation ability. If all of the patterns that are present in the available data are represented in the calibration set, the toughest evaluation of the generalization ability of the model is if all representative patterns (and not just a subset) are also contained in the validation data. In addition, if cross-validation is used as the stopping criterion, the results obtained using the test set have to be representative of those obtained using the training set, as the test set is used to decide when to stop training or, for example, which model architecture or learning rate is optimal. Consequently, the statistical properties (e.g., mean and standard

JOURNAL OF COMPUTING IN CIVIL ENGINEERING © ASCE / APRIL 2004 / 105

J. Comput. Civ. Eng., 2004, 18(2): 105-114


deviation) of the various data subsets (e.g., training, testing, and validation) need to be similar to ensure that each subset represents the same statistical population (Masters 1993). If this is not the case, it may be difficult to judge the validity of ANN models (Maier and Dandy 2000).

This fact has been recognized for some time (Masters 1993; ASCE 2000; Maier and Dandy 2000), and several studies have used ad hoc methods to ensure that the data used for calibration and validation have the same statistical properties (e.g., Braddock et al. 1998; Tokar and Johnson 1999). However, it was not until recently that systematic approaches for data division were proposed in the literature. Bowden et al. (2002) used a genetic algorithm to minimize the difference between the means and standard deviations of the data in the training, testing, and validation sets. While this approach ensures that the statistical properties of the various data subsets are similar, there is still a need to choose which proportion of the data to use for training, testing, and validation. Kocjancic and Zupan (2000) and Bowden et al. (2002) used a self-organizing map (SOM) to cluster high-dimensional input and output data in two-dimensional space and divide the available data so that values from each cluster are represented in the various data subsets. This ensures that data in the different subsets are representative of each other, and SOMs have the additional advantage that there is no need to decide what percentage of the data to use for training, testing, and validation. The major shortcoming of this approach is that there are no guidelines for determining the optimum size and shape of the SOM (Giraudel and Lek 2001). This has the potential to have a significant impact on the results obtained, as the underlying assumption of the approach is that the data points (samples or records) in one cluster provide the same information in high-dimensional space. However, if the SOM is too small, there may be significant intracluster variation. Conversely, if the map is too large, too many clusters may contain single data points, making it difficult to choose representative subsets.

In this paper, a new data division approach is introduced and compared with existing approaches for the case study of prediction of the settlement of shallow foundations on granular soils. The new approach utilizes a fuzzy clustering technique, which overcomes the limitations of existing methods. Shi (2002) has recently used fuzzy clustering for the evaluation and validation of neural networks. However, to the writers' best knowledge, fuzzy clustering has yet to be used as a data division approach for ANNs. The specific objectives of this paper are:
1. To investigate the relationship between the statistical properties of the data subsets used to develop ANN models and model performance;
2. To introduce a new approach to data division for ANNs based on fuzzy clustering;
3. To compare the performance of the new approach with that of three existing approaches, namely random data division, data division to ensure statistical consistency between the various subsets, and data division using a SOM;
4. To investigate the relationship between the proportion of the data in each of the subsets used to develop the ANN models and their performance, in relation to the data division method that ensures statistical consistency between data sets; and
5. To investigate the impact of the number of data points used from each cluster for training on model performance, in relation to the SOM data division method.

Data Division Approaches Investigated

Approach 1: Random

A random approach is generally used in the field of geotechnical engineering for dividing the available data into the subsets needed for ANN model development, with no attention given to the statistical consistency of the data subsets. As a result, the performance of the trained model on the validation data is highly dependent on which data are contained in the validation set (e.g., whether or not the validation set contains extreme data), making it impossible to assess the true generalization ability of the model within the domain of the available data.

Another shortcoming of this approach is that the proportion of the data to be used for training, testing, and validation needs to be chosen a priori by the modeler. However, there are no firm guidelines in the literature to assist with this task, although some rules of thumb exist, such as using two-thirds of the data for model calibration (i.e., training and testing) and one-third for model validation (Hammerstrom 1993).

Approach 2: Statistically Consistent

As part of this approach, the available data are divided in a way that ensures that the statistical properties of the data in each of the subsets are as close to each other as possible and thus represent the same statistical population. In this work, a trial-and-error process is used to achieve this. The statistical parameters used include the mean, standard deviation, minimum, maximum, and range. To examine how representative the training, testing, and validation sets are of each other, t- and F-tests are carried out. The t-test examines the null hypothesis of no difference in the means of two data sets, and the F-test examines the null hypothesis of no difference in the standard deviations of the two sets. For a given level of significance, test statistics can be calculated to test the null hypotheses for the t- and F-tests, respectively. Traditionally, a level of significance equal to 0.05 is selected (Levine et al. 1999). Consequently, this level of significance is used in this research. This means that there is a confidence level of 95% that the training, testing, and validation sets are statistically consistent. A detailed description of these tests is given by Levine et al. (1999). The major shortcomings of this approach are that it is based on trial and error and that the proportion of the data to be used for training, testing, and validation needs to be chosen in advance by the modeler, as mentioned previously.

Approach 3: Self-Organizing Map

Self-organizing maps (SOMs) belong to the genre of unsupervised neural networks. The typical structure of a SOM consists of two layers: an input layer and a Kohonen layer (Fig. 1). The processing elements in the Kohonen layer are arranged in a one- or two-dimensional array. The input from each node in the input layer (x_i for i = 1, 2, ..., n) is fully connected to the Kohonen layer through connection weights (w_ji for j = 1, 2, ..., m). At the beginning of the self-organizing process, these weights are initialized randomly. At each node in the Kohonen layer, the input (x_i) is presented without providing the desired output, and a matching value is calculated. This value is typically the Euclidean distance (D_j) between the weights of each node and the corresponding input values, as shown in Eq. (1):

    D_j = Σ_{i=1}^{n} (x_i − w_ji)²,   j = 1, 2, ..., m                      (1)
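To make the competitive step concrete, the following Python sketch (an illustration added here, not code from the paper) computes the matching value of Eq. (1) for each Kohonen node and applies the weight update of Eq. (2) to the winning node. Updating only the winner, and the particular learning rate, are simplifying assumptions; a full SOM also updates the winner's topological neighbors.

```python
def winning_node(x, weights):
    """Return the index of the Kohonen-layer node whose weight vector is
    closest to input x, using the squared Euclidean distance of Eq. (1)."""
    distances = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in weights]
    return min(range(len(weights)), key=lambda j: distances[j])


def update_weights(x, weights, winner, eta=0.5):
    """Move the winning node's weights toward the input, following Eq. (2):
    w_ji(new) = w_ji(old) + eta * (x_i - w_ji(old)).
    Neighborhood updates are omitted for brevity."""
    weights[winner] = [w + eta * (xi - w) for xi, w in zip(x, weights[winner])]
```

After repeated presentations of the data, records that map to the same winning node form one cluster, which is the basis of the SOM data division described in the following section.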



Fig. 1. Typical structure of self-organizing map

The node that has the minimum Euclidean value is declared the winner. The weights of the winning node and its neighboring nodes, in terms of topology, are then updated to match the input values more closely, as shown in Eq. (2):

    w_ji(new) = w_ji(old) + η [x_i − w_ji(old)]                              (2)

where η = learning rate.

The process is repeated by presenting new input data records to the model. The connection weights are adjusted until they remain unchanged. The weights obtained represent the topological relationship between the input data, and the result of the preceding process is a map in which similar data records, in terms of topology, are clustered together. A full description of the self-organizing map process is given by Kohonen (1997).

Once clustering has been successfully accomplished, samples are chosen from each cluster to form the training, testing, and validation sets. Different approaches for achieving this have been suggested in the literature. Kocjancic and Zupan (2000) suggested using a fixed number of data samples from each cluster to form the data subsets needed for ANN model development. However, this still requires a subjective decision as to what proportion of data points from each cluster to allocate to the different data subsets. Bowden et al. (2002) suggested randomly selecting three samples from each cluster to form the ANN data subsets, one for each of the training, testing, and validation sets. In the instance when a cluster contains two records, one record is chosen for training and the other is chosen for testing. If a cluster contains only one record, this record is included in the training set. This approach overcomes the problem of having to decide how many data points from each cluster to allocate to the different data subsets. In addition, this approach utilizes the minimum number of data points for model development, thus increasing computational efficiency. However, it is unclear if better model performance could be achieved if all data points remaining in a cluster, after removal of the testing and validation values, were used for training rather than just one point from each cluster. Although Bowden et al. (2002) conducted a preliminary investigation into this issue and found that the inclusion of the additional training samples did not improve model performance, further investigation into this matter is needed.

In summary, the SOM data division method has a number of advantages, including:
1. There is no need to decide which proportion of the available data to use for training, testing, and validation.
2. The statistical properties of the resulting training, testing, and validation data are similar, provided that intracluster variation is sufficiently small.
3. Information is provided about whether "outliers" (not necessarily in the statistical sense) exist in the data set. For example, if a cluster contains only one data sample, this sample should be included in the training set. If it were to be included in the validation set, the trained ANN model could not be expected to perform well, as the validation data would fall outside the range of the training data.

The main disadvantage of this approach is that the different parameters that control the learning process in SOMs (i.e., learning rate, neighborhood size, and size and shape of the map) have to be selected in advance. Moreover, as mentioned previously, there are no precise rules for the optimum choice of these parameters.

Approach 4: Fuzzy Clustering

The fuzzy clustering algorithm attempts to minimize the following objective function (Kaufman and Rousseeuw 1990):

    C = Σ_{v=1}^{k} [ Σ_{i,j=1}^{n} u_iv² u_jv² d_ij ] / [ 2 Σ_{j=1}^{n} u_jv² ]   (3)

where k = number of clusters; d_ij = given distance between data points i and j; and u_iv = unknown membership function of data point i to cluster v.

The sum in the numerator ranges over all pairs of data points (i, j), and the membership functions are subject to the following constraints:

    u_iv ≥ 0   for i = 1, ..., n; v = 1, ..., k                              (4)

    Σ_{v=1}^{k} u_iv = 1   for i = 1, ..., n                                 (5)

The preceding constraints imply that memberships cannot be negative and that each data point has a constant total membership value, distributed over the clusters, normalized to 1. For hard clustering, a data point is assigned to the cluster that has the largest membership value.

The basic notion of fuzzy clustering for data division is similar to that underlying the SOM data division approach, in the sense that both are used to cluster similar data records together and, once the data are clustered, samples are chosen from the clusters to form the training, testing, and validation sets. However, the fuzzy clustering approach has a number of features that enable it to overcome the shortcomings of the SOM data division approach.

First, an analytical procedure can be used to determine the optimum number of clusters. This is achieved with the aid of the silhouette value s(i), which is a measure of how well individual data points lie within the cluster they have been assigned to at the end of the clustering process, and is given by (Kaufman and Rousseeuw 1990):

    s(i) = [b(i) − a(i)] / max{a(i), b(i)},   −1 ≤ s(i) ≤ 1                  (6)

where a(i) = average dissimilarity of data point i to all other data points in a cluster A; and b(i) = smallest average dissimilarity of data point i to all points in any cluster E different from A.

For an individual data point i in cluster A, if s(i) is close to 1, this implies that the "within" dissimilarity a(i) is smaller than the smallest "between" dissimilarity b(i); therefore, data point i can be deemed to have a strong membership to cluster A. By calculating the average silhouette width s̄(k) for the entire data set for different numbers of clusters, the optimum number of clusters can be determined by choosing the number of clusters that maximizes the value of s̄(k).
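As an illustration of how Eq. (6) can drive the choice of k, the sketch below (added here for illustration, not code from the paper or from FANNY) computes s(i) and the average silhouette width s̄(k) from hard cluster labels and a precomputed dissimilarity matrix:

```python
def silhouette(i, labels, dist):
    """Silhouette value s(i) of Eq. (6) for data point i.
    labels: hard cluster label of each point; dist: symmetric dissimilarity matrix."""
    own = [j for j, c in enumerate(labels) if c == labels[i] and j != i]
    a = sum(dist[i][j] for j in own) / len(own)  # "within" dissimilarity a(i)
    # b(i): smallest average dissimilarity to the points of any other cluster
    b = min(
        sum(dist[i][j] for j in members) / len(members)
        for c in set(labels) - {labels[i]}
        for members in [[j for j, cj in enumerate(labels) if cj == c]]
    )
    return (b - a) / max(a, b)


def average_silhouette(labels, dist):
    """Average silhouette width s_bar(k); the k that maximizes it is taken as optimum."""
    n = len(labels)
    return sum(silhouette(i, labels, dist) for i in range(n)) / n
```

In the procedure of the paper, this average would be evaluated for each candidate number of clusters, and the clustering with the largest s̄(k) retained.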



Second, guidelines can be developed to determine which data points from each cluster should be used for training, testing, and validation. Information about the degree of membership each data point has to the cluster it has been assigned to can be used to ensure that any significant intracluster variation is taken into account when assigning data points to their respective subsets. As part of the data division approach introduced in this paper, it is suggested to rank the data points in each cluster in order of increasing membership value. Next, each data point is assigned to one of ten equally spaced membership intervals (i.e., 0.0–0.1, 0.1–0.2, ..., 0.9–1.0); one data point from each membership interval is assigned to the testing set, another data point from that interval is assigned to the validation set, and the remaining data points from the same interval are assigned to the training set. By using this approach, the best possible representation of the available data is achieved in each of the three data subsets.

The detailed procedure for using fuzzy clustering for ANN data division introduced in this paper is as follows:
1. An initial number of clusters, not less than two, is chosen (the initial number of clusters can be assumed to be equal to 5% of the available data);
2. The available data are clustered using the fuzzy clustering technique, and the average silhouette width s̄(k) of the entire data set is calculated;
3. The number of clusters is increased by one and step 2 is repeated until s̄(k) remains constant or the number of clusters reaches 50% of the available data;
4. The number of clusters that results in the largest value of s̄(k) is considered optimum;
5. For the optimum number of clusters, the data records included in each cluster are ranked according to their membership values in incremental intervals of 0.1 between 0.0 and 1.0 (i.e., 0.0–0.1, 0.1–0.2, ..., 0.9–1.0); and
6. For each cluster and membership interval (e.g., cluster 1 and membership interval 0.0–0.1), two samples are chosen (one for the testing set and one for the validation set) and all remaining data samples are chosen for the training set. In the instance when two records are obtained, one record is chosen for training and the other is chosen for testing. If only one record is obtained, this record is included in the training set.

Case Study

In this research, the four approaches to data division discussed previously are applied to the case study of predicting the settlement of shallow foundations on granular soils. The predictive ANN models are based on the model developed by Shahin et al. (2002) and implemented using the PC-based software package Neuframe Version 4.0 (Neusciences 2000). The steps for developing ANN models outlined by Maier and Dandy (2000) are used as a guide in this research. These include determination of model inputs and outputs, division of the available data, determination of the optimal network architecture, weight optimization, and model validation.

Model Inputs

The model inputs used include the footing width (B); footing net applied pressure (q); soil compressibility, which can be represented by the average blow count (N) per 300 mm obtained using the standard penetration test over the depth of influence of the foundation; footing geometry (L/B); and footing embedment ratio (Df/B). These input variables are believed to have the greatest effect on the settlement of shallow foundations on granular soils, as discussed by Burland and Burbidge (1985). The model output is the foundation settlement (Sm).

Table 1. Data Ranges Used for Developing Artificial Neural Network Models

Model variables                          Mean    Standard deviation   Minimum   Maximum
Footing width, B (m)                     8.8     10.1                 0.8       60.0
Footing net applied load, q (kPa)        187.1   123.3                18.3      697.0
Average SPT blow count, N                24.6    13.5                 4.0       60.0
Footing geometry, L/B                    2.2     1.8                  1.0       10.5
Footing embedment ratio, Df/B            0.53    0.58                 0.0       3.4
Measured settlement, Sm (mm)             20.4    26.6                 0.6       121.0

Data Division

The database used for the development of the ANN models comprises a total of 189 individual cases (Shahin et al. 2002). Ranges of the data used for the input and output variables are summarized in Table 1. The available data are divided using the four approaches discussed previously.

Approach 1—Random

As part of this approach, the 189 individual cases are randomly divided into training, testing, and validation subsets. In total, 80% of the data (i.e., 152 individual cases) are used for calibration and 20% of the data (i.e., 37 individual cases) are used for validation. The calibration data are further divided into 70% for training (i.e., 106 individual cases) and 30% for testing (i.e., 46 individual cases).

Approach 2—Statistically Consistent

As part of this approach, the 189 individual cases are divided into three statistically consistent subsets. A number of different proportions of the available data are used for training, testing, and validation, in order to investigate the impact the proportion of the data used in the various subsets has on model performance (see objective 4). The different proportions investigated are summarized in Table 2.

Table 2. Proportions of Data Used for Training, Testing, and Validation

Validation set (%)    Remaining data
                      Training set (%)    Testing set (%)
10                    70                  30
                      80                  20
                      90                  10
20                    70                  30
                      80                  20
                      90                  10
30                    70                  30
                      80                  20
                      90                  10
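Steps 5 and 6 of the fuzzy clustering division procedure can be sketched for a single cluster as follows. This is a hypothetical Python helper added for illustration, not part of the paper or of FANNY; the paper does not specify which record within an interval goes to which subset, so taking the lowest-membership records first is an assumption here.

```python
def divide_cluster(records, memberships):
    """Assign one cluster's records to training/testing/validation subsets.
    records: record identifiers; memberships: membership value of each
    record to this cluster. Records are ranked by increasing membership,
    bucketed into ten 0.1-wide intervals, and one testing and one
    validation sample are drawn per interval (steps 5 and 6)."""
    train, test, valid = [], [], []
    intervals = {}
    for rec, u in sorted(zip(records, memberships), key=lambda p: p[1]):
        # membership interval index 0..9 (u = 1.0 stays in the top interval)
        intervals.setdefault(min(int(u * 10), 9), []).append(rec)
    for bucket in intervals.values():
        if len(bucket) == 1:            # lone record: training only
            train += bucket
        elif len(bucket) == 2:          # two records: training and testing
            train.append(bucket[0]); test.append(bucket[1])
        else:                           # one test, one validation, rest training
            test.append(bucket[0]); valid.append(bucket[1]); train += bucket[2:]
    return train, test, valid
```

Applying this helper to every cluster and concatenating the results yields the three subsets used for ANN model development.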



Approach 3—SOM

The PC-based software package Neuframe Version 4.0 (Neusciences 2000) is used to cluster the data using a SOM. The inputs (i.e., B, q, N, L/B, and Df/B) and corresponding output (Sm) of the predictive model are presented to the SOM as inputs (Fig. 1). As mentioned previously, there is no precise rule for determining the optimum size of the map. Consequently, a number of map sizes are investigated, including 5×5, 6×6, 7×7, and 8×8. For all map sizes, the default parameters (e.g., learning rate and neighborhood size) suggested in the software package (Neusciences 2000) are used, and training is continued for 10,000 iterations, as the connection weights remain stable after this point. A grid size of 8×8 is chosen, as it ensures that the maximum number of clusters are found from the training data (Bowden et al. 2002).

In order to investigate the impact of the number of data points used from each cluster for training on model performance (see objective 5), two different approaches for choosing the training data from each cluster are adopted. As part of the first approach, all data records remaining after the selection of the testing and validation data are used for training. As a result, a total of 110 records are used for training, 46 for testing, and 33 for validation. As part of the second approach, only one data point from each cluster is chosen for training. As a result, 54 records are used for training, 46 for testing, and 33 for validation, reducing the data used for training by approximately 50%.

Approach 4—Fuzzy Clustering

The software package FANNY (Kaufman and Rousseeuw 1990) is used to cluster the data using fuzzy clustering. Using the procedure outlined previously, 10–94 clusters are tried. The average silhouette width of the entire data set, s̄(k), is maximized when 16 clusters are used and is equal to 0.3. Sample membership values of the optimum clustering (i.e., number of clusters = 16) for some data records are shown in Table 3. Using the procedure outlined previously, samples are chosen for the training, testing, and validation sets; as a result, a total of 143 records are used for training, 25 for testing, and 21 for validation.

Table 3. Sample Membership Values of Data Records for Fuzzy Clustering

                          Membership values
Case record    Cluster 1    Cluster 2    ......    Cluster 16
1              0.1208       0.0387       ......    0.0441
2              0.0111       0.3835       ......    0.1003
3              0.0669       0.0423       ......    0.0452
.              .            .            ......    .
.              .            .            ......    .
188            0.037        0.1674       ......    0.0998
189            0.0222       0.1802       ......    0.1017

Architecture

In accordance with Shahin et al. (2002), multilayer perceptrons (Zurada 1992; Fausett 1994) are used for the development of the ANN models in this work. The optimum network geometry is obtained utilizing a trial-and-error approach in which ANNs are trained with one hidden layer and 1, 2, 3, 5, 7, 9, and 11 hidden layer nodes. It should be noted that one hidden layer can approximate any continuous function, provided that sufficient connection weights are used (Hornik et al. 1989). It should also be noted that 11 is the upper limit for the number of hidden layer nodes needed to map any continuous function for a network with five inputs, as proposed by Caudill (1988). By adopting the above procedure, the ANN models are found to be optimal when two hidden layer nodes are used.

Weight Optimization

In accordance with Shahin et al. (2002), the backpropagation algorithm (Rumelhart et al. 1986) is used for weight optimization. In order to obtain the optimal parameters that control the backpropagation algorithm, the network is trained with different combinations of momentum terms and learning rates. The momentum terms used are 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.8, whereas the learning rates used are 0.005, 0.02, 0.1, 0.2, 0.4, and 0.6. The ANN models are found to be optimal when a momentum term of 0.8 and a learning rate of 0.2 are used.

Validation

Once training has been successfully accomplished, the model is validated on the independent validation set. The coefficient of correlation, r; the root mean square error (RMSE); and the mean absolute error (MAE) are used to evaluate the performance of the trained model.

Results and Discussion

The statistics of the training, testing, and validation sets obtained when the data are divided in a purely random fashion (approach 1) and when the statistics of the subsets are taken into account (approach 2) are shown in Tables 4 and 5, respectively. It can be seen that when the data are divided in a purely random manner (approach 1, Table 4), there are some inconsistencies in the statistics between the various data subsets. This is confirmed by the results of the t- and F-tests, which showed that the hypotheses are rejected for most of the testing and validation sets and, consequently, the data in the three subsets generally do not belong to the same statistical population. However, it should be noted that this is not necessarily the case whenever the data are divided in a random manner, as only one random trial is performed in this work and there are many different possible ways in which the data can be divided into training, testing, and validation subsets. The results in Table 5 show that, when the data are divided in a way that takes into account the statistical properties of the various subsets (approach 2), the statistics are in much better agreement, as expected. This is confirmed by the outcomes of the t- and F-tests, which show that the hypotheses are accepted for all of the testing and validation sets and, consequently, the training, testing, and validation sets are generally representative of each other.

The performance of the models developed using the data sets whose statistics are shown in Tables 4 and 5 is shown in Table 6 (columns 2 and 3). It can be seen that there is a direct relationship between the consistency in the statistics between the training, testing, and validation sets and consistency in model performance. When the training, testing, and validation data are not representative of each other, there can be large discrepancies in the model performance obtained using the training, testing, and validation sets. Consequently, the results obtained using the validation set may not be truly representative of the performance of the trained model, as the validation set may contain extreme data points that



Table 4. Input and Output Statistics Obtained Using Random Data Division

Model variables and data sets     Mean    Standard deviation   Minimum   Maximum   Range
Footing width, B (m)
  Training set                      9.4        11.3               0.8       60.0     59.2
  Testing set                       9.2        10.3               0.9       41.2     40.3
  Validation set                    6.1         4.3               2.25      25.5     23.25
Footing net applied pressure, q (kPa)
  Training set                    161.3        98.0              18.3      697.0    678.7
  Testing set                     267.2       155.2              47.6      666.0    618.4
  Validation set                  161.2       101.5              71.8      507.6    435.7
Average SPT blow count, N
  Training set                     21.6        11.8               4.0       60.0     56.0
  Testing set                      28.6        15.7               4.0       60.0     56.0
  Validation set                   27.8        13.4               7.0       58.0     51.0
Footing geometry, L/B
  Training set                      1.9         1.5               1.0        9.9      8.9
  Testing set                       1.9         1.9               1.0       10.5      9.5
  Validation set                    3.3         1.9               1.0        8.1      7.1
Footing embedment ratio, Df/B
  Training set                      0.57        0.59              0.0        3.4      3.4
  Testing set                       0.52        0.64              0.0        3.0      3.0
  Validation set                    0.41        0.41              0.0        1.8      1.8
Measured settlement, Sm (mm)
  Training set                     20.7        28.7               0.6      121.0    120.4
  Testing set                      23.0        30.5               1.8      120.0    118.2
  Validation set                   16.1         9.7               4.1       43.0     38.9

Table 5. Input and Output Statistics Obtained Using Data Division to Ensure Statistical Consistency

Model variables and data sets     Mean    Standard deviation   Minimum   Maximum   Range
Footing width, B (m)
  Training set                      8.3         9.8               0.8       60.0     59.2
  Testing set                       9.3        10.9               0.9       55.0     54.1
  Validation set                    9.4        10.1               0.9       41.2     40.3
Footing net applied pressure, q (kPa)
  Training set                    188.4       129.0              18.3      697.0    678.7
  Testing set                     183.2       118.7              25.0      584.0    559.0
  Validation set                  187.9       114.6              33.0      575.0    542.0
Average SPT blow count, N
  Training set                     24.6        13.6               4.0       60.0     56.0
  Testing set                      24.6        12.9               5.0       60.0     55.0
  Validation set                   24.3        14.1               4.0       55.0     51.0
Footing geometry, L/B
  Training set                      2.1         1.7               1.0       10.5      9.5
  Testing set                       2.3         1.9               1.0        9.9      8.9
  Validation set                    2.1         1.8               1.0        8.0      7.0
Footing embedment ratio, Df/B
  Training set                      0.52        0.57              0.0        3.4      3.4
  Testing set                       0.49        0.52              0.0        3.0      3.0
  Validation set                    0.59        0.64              0.0        3.0      3.0
Measured settlement, Sm (mm)
  Training set                     20.0        27.2               0.6      121.0    120.4
  Testing set                      21.4        26.6               1.0      120.0    119.0
  Validation set                   20.4        25.2               1.3      120.0    118.7
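A rough check of the kind of statistical consistency shown in Table 5 can be sketched as follows. The pooled t statistic (for equal means) and variance-ratio F statistic (for equal spreads) below are compared with approximate 5% critical values for illustration; exact critical values depend on the degrees of freedom, and the settlement-like numbers are hypothetical.

```python
import numpy as np

def consistent(train, valid, t_crit=2.0, f_crit=1.8):
    """Rough two-sample check that two subsets could share the same mean
    and variance. Critical values are illustrative approximations."""
    n1, n2 = len(train), len(valid)
    v1, v2 = train.var(ddof=1), valid.var(ddof=1)
    # Pooled-variance t statistic for the difference in means
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (train.mean() - valid.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    # Variance-ratio F statistic (larger variance on top)
    f = max(v1, v2) / min(v1, v2)
    return bool(abs(t) < t_crit and f < f_crit)

# Hypothetical settlement values (mm), spanning the 0.6-121.0 mm range
train = np.tile([0.6, 5.0, 20.0, 121.0], 30)
valid = np.array([0.6, 5.0, 20.0, 121.0] * 5)
consistent(train, valid)         # same mean and spread: accepted
consistent(train, valid + 60)    # shifted mean: fails the t-test
```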


Table 6. Performance of Artificial Neural Network Models Using Data Subsets Obtained for Different Approaches to Data Division

Performance measures            Random     Statistical   Self-organizing   Fuzzy
and data sets                   division   division      map               clustering
Training
  Correlation coefficient, r     0.944      0.930         0.890             0.912
  RMSE (mm)                      9.35      10.01         11.58             10.62
  MAE (mm)                       6.23       6.87          7.93              7.43
Testing
  Correlation coefficient, r     0.845      0.929         0.942             0.967
  RMSE (mm)                     16.39      10.12         10.43             10.48
  MAE (mm)                      11.94       6.43          7.98              6.92
Validation
  Correlation coefficient, r     0.659      0.905         0.958             0.957
  RMSE (mm)                     10.57      11.04         10.12              9.59
  MAE (mm)                       8.85       8.78          7.12              6.13
Note: RMSE = root mean square error and MAE = mean absolute error.

Table 7. Performance of Artificial Neural Network Models for Different Data Proportions Using Statistical Data Division Approach

Data proportions     Correlation      Root mean square   Mean absolute
and sets             coefficient, r   error (mm)         error (mm)
10 (70-30)
  Training set         0.922            9.34               6.55
  Testing set          0.929           11.34               7.35
  Validation set       0.861           17.08               9.49
10 (80-20)
  Training set         0.939            9.26               6.63
  Testing set          0.876           13.82               7.96
  Validation set       0.909           12.72               9.07
10 (90-10)
  Training set         0.934            9.25               6.04
  Testing set          0.924           13.87              10.43
  Validation set       0.849           18.35               9.95
20 (70-30)
  Training set         0.930           10.01               6.87
  Testing set          0.929           10.12               6.43
  Validation set       0.905           11.04               8.78
20 (80-20)
  Training set         0.933            9.57               6.63
  Testing set          0.929           10.96               6.94
  Validation set       0.898           11.39               9.01
20 (90-10)
  Training set         0.918           10.67               7.51
  Testing set          0.945           10.46               6.89
  Validation set       0.878           12.52               9.49
30 (70-30)
  Training set         0.920           11.01               7.88
  Testing set          0.938           10.93               7.28
  Validation set       0.903           10.94               7.76
30 (80-20)
  Training set         0.926           10.68               7.12
  Testing set          0.903           11.52               7.71
  Validation set       0.887           11.55               7.83
30 (90-10)
  Training set         0.923           10.10               7.38
  Testing set          0.835           16.33              10.78
  Validation set       0.920           10.80               7.53

Consequently, the best model given the available data has not been developed. Similarly, if the results obtained using the testing set are not representative of those obtained using the training set, training may be ceased at a suboptimal time, or a suboptimal network geometry, learning rate, or momentum value may be chosen. However, when the training, testing, and validation sets are representative of each other, the performance of the model on each of the three subsets is very similar, indicating that the model has the ability to interpolate within the extremes contained in the available data.

The model performances obtained when different proportions of the available data are used for training, testing, and validation, in conjunction with the data division method that takes into account the statistical properties of the data (approach 2), are shown in Table 7. A code is used to distinguish between the various proportions of the available data used for training, testing, and validation. The code consists of three numbers. The first number represents the percentage of the data used in the validation set, whereas the second two numbers, placed between brackets and separated by a hyphen, are the percentages that divide the remaining data into training and testing sets, respectively. It can be seen from Table 7 that there is no clear relationship between the proportion of data used for training, testing, and validation and model performance. The best result is obtained when 20% of the data are used for validation and the remaining data are divided into 70% for training and 30% for testing. The results in Table 7 also indicate that there can be significant variation in the results obtained, depending on which proportion of the data is used for training, testing, and validation, even when the statistical properties of the data subsets are taken into account. This may be due to the difficulties in obtaining representative data sets for some of the proportions for training, testing, and validation investigated for the particular data set used.

The difficulties associated with deciding which proportion of the available data to use for training, testing, and validation can be overcome by using an SOM (approach 3) or fuzzy clustering (approach 4) to obtain appropriate data subsets. However, as discussed previously, two different approaches for choosing the training data from the clusters obtained from an SOM have been proposed in the literature, and both are investigated here. The results obtained indicate that it is better to use all of the data remaining after the testing and validation data have been removed from each cluster for training, rather than choosing only one data point from each cluster, as the RMSE increases from 10.43 to 14.43 mm and the MAE increases from 7.98 to 10.21 mm when the additional training data are discarded. However, there is a slight decrease in the coefficient of correlation, r, from 0.94 to 0.93 when the additional training data are included. Consequently, the subsequent discussion in relation to the SOM data division method (approach 3) is restricted to the case where all remaining data are used for training.

The statistics of the data in each of the subsets obtained using the SOM (approach 3) and fuzzy clustering (approach 4) data division methods are very similar (Tables 8 and 9), and the t- and F-tests indicated that the three data sets in Tables 8 and 9 may be considered to be representative of each other. The success of the SOM and fuzzy clustering data division methods is illustrated in Table 6 (columns 4 and 5), which compares the predictive results obtained using the four different approaches to data division investigated.
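The three performance measures reported in Tables 6 and 7 can be computed as follows; the settlement values in the example are hypothetical, chosen only to illustrate the calculation.

```python
import numpy as np

def performance(measured, predicted):
    """Coefficient of correlation r, RMSE, and MAE, as reported in
    Tables 6 and 7."""
    m = np.asarray(measured, float)
    p = np.asarray(predicted, float)
    r = np.corrcoef(m, p)[0, 1]              # coefficient of correlation
    rmse = np.sqrt(np.mean((p - m) ** 2))    # root mean square error
    mae = np.mean(np.abs(p - m))             # mean absolute error
    return r, rmse, mae

# Hypothetical measured vs. predicted settlements (mm)
r, rmse, mae = performance([4.1, 16.0, 43.0], [5.1, 15.0, 44.0])
```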

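The three-number proportion code used in Table 7 can be translated into subset sizes as in the sketch below, applied to the 189 available records for illustration; the rounding rule is an assumption, and the function name is illustrative.

```python
def split_sizes(code, n):
    """Subset sizes for a proportion code such as '20 (70-30)': the first
    number is the validation percentage, and the bracketed pair splits the
    remaining records into training and testing percentages."""
    val_pct, rest = code.split(' ', 1)
    tr_pct, te_pct = rest.strip('()').split('-')
    n_val = round(n * int(val_pct) / 100)
    n_train = round((n - n_val) * int(tr_pct) / 100)
    n_test = (n - n_val) - n_train
    return n_train, n_test, n_val

split_sizes('20 (70-30)', 189)   # (training, testing, validation) counts
```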


Table 8. Input and Output Statistics Obtained Using Self-Organizing Map

Model variables and data sets     Mean    Standard deviation   Minimum   Maximum   Range
Footing width, B (m)
  Training set                      7.9         9.0               0.8       60.0     59.2
  Testing set                      10.8        13.1               0.9       55.0     54.1
  Validation set                    8.8         8.8               1.1       33.5     32.4
Footing net applied pressure, q (kPa)
  Training set                    184.6       119.0              18.3      697.0    678.7
  Testing set                     204.6       133.9              52.0      666.0    614.0
  Validation set                  170.8       122.3              25.0      584.0    559.0
Average SPT blow count, N
  Training set                     24.0        12.8               4.0       60.0     56.0
  Testing set                      26.3        15.4               5.0       60.0     55.0
  Validation set                   24.0        13.0               6.0       50.0     44.0
Footing geometry, L/B
  Training set                      2.1         1.7               1.0       10.5      9.5
  Testing set                       2.1         1.9               1.0        9.9      8.9
  Validation set                    2.2         1.7               1.0        7.8      6.8
Footing embedment ratio, Df/B
  Training set                      0.57        0.6               0.0        3.4      3.4
  Testing set                       0.49        0.5               0.0        2.1      2.1
  Validation set                    0.42        0.43              0.0        1.8      1.8
Measured settlement, Sm (mm)
  Training set                     18.6        24.5               0.6      121.0    120.4
  Testing set                      22.7        27.4               1.3      116.0    114.7
  Validation set                   23.0        31.8               1.0      120.0    119.0

Table 9. Input and Output Statistics Obtained Using Fuzzy Clustering

Model variables and data sets     Mean    Standard deviation   Minimum   Maximum   Range
Footing width, B (m)
  Training set                      8.7        10.1               0.8       60.0     59.2
  Testing set                       8.6        10.1               1.0       42.7     41.7
  Validation set                    9.2        10.6               1.2       36.6     35.4
Footing net applied pressure, q (kPa)
  Training set                    180.2       120.0              18.3      697.0    678.7
  Testing set                     209.4       134.4              64.0      584.0    520.0
  Validation set                  207.0       164.4              47.6      584.0    536.4
Average SPT blow count, N
  Training set                     24.6        13.2               4.0       60.0     56.0
  Testing set                      23.4        12.4               6.0       50.0     44.0
  Validation set                   25.5        16.8               5.0       60.0     55.0
Footing geometry, L/B
  Training set                      2.2         1.9               1.0       10.5      9.5
  Testing set                       2.0         1.5               1.0        6.7      5.7
  Validation set                    1.7         1.1               1.0        5.2      4.2
Footing embedment ratio, Df/B
  Training set                      0.54        0.61              0.0        3.4      3.4
  Testing set                       0.50        0.40              0.0        1.4      1.4
  Validation set                    0.50        0.51              0.0        2.1      2.1
Measured settlement, Sm (mm)
  Training set                     20.0        25.8               0.6      121.0    120.4
  Testing set                      21.5        26.5               2.1       87.0     84.9
  Validation set                   22.1        32.5               1.3      120.0    118.7
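The cluster-based sampling used by both the SOM and fuzzy clustering methods, in which the testing and validation records are drawn from every cluster and all remaining records are kept for training, can be sketched as follows; the sampling fractions and cluster layout are illustrative, not those of the paper's data set.

```python
import numpy as np

def divide_by_cluster(labels, test_frac=0.13, val_frac=0.11, seed=0):
    """Draw testing and validation records from every cluster and keep
    all remaining records in that cluster for training."""
    rng = np.random.default_rng(seed)
    train, test, val = [], [], []
    for c in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == c))
        n_te = round(test_frac * len(members))
        n_va = round(val_frac * len(members))
        test.extend(members[:n_te])
        val.extend(members[n_te:n_te + n_va])
        train.extend(members[n_te + n_va:])   # all remaining data
    return train, test, val

labels = np.repeat(np.arange(16), 12)         # 16 clusters of 12 records
tr, te, va = divide_by_cluster(labels)
```

Because every cluster contributes to every subset, the three subsets span the same regions of the input space, which is what makes them statistically consistent.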



It can be seen that the results obtained for the SOM (approach 3) and fuzzy clustering (approach 4) data division methods are very close to those obtained for the statistically consistent data division method (approach 2) and significantly better than the results obtained for the purely random data division method (approach 1). It should be noted that the results presented for the data division method that takes into account the statistical properties of the subsets (approach 2) are for the proportion of training, testing, and validation data that gives the best results. Consequently, it appears that the SOM and fuzzy clustering methods are suitable approaches for dividing data into training, testing, and validation subsets. Moreover, as discussed previously, fuzzy clustering data division has the advantage over SOM data division that an optimum number of clusters can be obtained analytically; consequently, the fuzzy clustering data division approach removes the subjectivity associated with the SOM data division approach.

Summary and Conclusions

This paper investigated the effect of using four different data division methods for developing ANN models applied to a case study of settlement prediction of shallow foundations on granular soils. The methods were (1) random data division; (2) data division to ensure statistical consistency of the data subsets; (3) data division using self-organizing maps (SOMs); and (4) a new data division method based on fuzzy clustering. The first method involved dividing the data in an arbitrary way without paying attention to the statistical consistency of the data subsets. The second method involved dividing the data in such a way that the training, testing, and validation sets are statistically consistent and thus represent the same population. A study was also carried out using this method to investigate the relationship between the proportion of the data used for training, testing, and validation and model performance. The last two methods involved applying SOMs and fuzzy clustering to cluster the data into similar groups. Samples were then chosen from each of the clusters to form three statistically consistent data sets.

The coefficient of correlation, r, for the testing set for the ANN model that used the first data division method was found to be equal to 0.845, whereas the RMSE and MAE were found to be equal to 16.39 and 11.94 mm, respectively. On the other hand, r was found to be equal to 0.929 and the RMSE and MAE were found to be equal to 10.12 and 6.43 mm, respectively, when the second data division method was used, resulting in an increase in r of about 10% and reductions in the RMSE and MAE of approximately 38 and 46%, respectively. This indicates that there is a direct relationship between the consistency of the statistics between the training, testing, and validation sets and model performance. Consequently, the statistical properties of the various data subsets should be taken into account as part of any data division procedure to ensure that the best possible model is developed, given the available data. The proportion of the data used for training, testing, and validation using the second method also appeared to have an effect on model performance. However, there appeared to be no clear relationship between the proportion of the data used in each of the subsets and model performance. In the trials conducted, the optimal model performance was obtained when 20% of the data were used for validation and 70% of the remaining data were used for training and 30% for testing.

The difficulty in determining which proportions of the data to use in the training, testing, and validation sets was overcome by using the last two data division methods. These methods negated the need to choose which proportion of the data to use for training, testing, and validation and ensured that each of the subsets was representative of the available data. When the SOM method was used, r for the testing set was found to be equal to 0.942, whereas the RMSE and MAE were found to be equal to 10.43 and 7.98 mm, respectively. When the fuzzy clustering technique was used, r was found to be equal to 0.967, whereas the RMSE and MAE were found to be equal to 10.48 and 6.92 mm, respectively. This indicates that both methods work well for the case study considered, producing results that are comparable with the best results obtained when the second method was used. The results obtained using the SOM data division method also confirm those obtained by Bowden et al. (2002), which showed that ANN models developed using the SOM data division technique outperform the conventional method (approach 1). In addition, fuzzy clustering was found to overcome the problem of determining the optimum number of clusters associated with using SOMs and, consequently, to provide a systematic approach for data division for ANN models.

Notation

The following symbols are used in this paper:
a(i) = average dissimilarity of data point i to all data points in cluster A;
B = footing width;
b(i) = smallest average dissimilarity of point i to all points in any cluster E different from A;
C = objective function of fuzzy clustering;
Df = footing embedment depth;
Dj = Euclidean distance of node j;
dij = given distance between data points i and j;
k = number of fuzzy clusters;
L = footing length;
N = average blow count from standard penetration test;
q = footing net applied pressure;
r = coefficient of correlation;
Sm = measured settlement;
s(i) = silhouette value of data point i;
s̄(k) = average silhouette width of entire data set;
uiv = unknown membership function of data point i to cluster v;
wji = connection weight between nodes i and j;
xi = input from node i; and
η = learning rate.

References

ASCE Task Committee of Artificial Neural Networks in Hydrology. (2000). "Artificial neural networks in hydrology. I: Preliminary concepts." J. Hydrologic Eng., 5(2), 115–123.
Bowden, G. J., Maier, H. R., and Dandy, G. C. (2002). "Optimal division of data for neural network models in water resources applications." Water Resour. Res., 38(2), 2.1–2.11.
Braddock, R. D., Kremmer, M. L., and Sanzogni, L. (1998). "Feed-forward artificial neural network model for forecasting rainfall runoff." Environmetrics, 9, 419–432.
Burland, J. B., and Burbidge, M. C. (1985). "Settlement of foundations on sand and gravel." Proc., Inst. Civ. Eng., 78(1), 1325–1381.
Caudill, M. (1988). "Neural networks primer. Part III." AI Expert, 3(6), 53–59.



Fausett, L. V. (1994). Fundamentals of neural networks: Architecture, algorithms, and applications, Prentice-Hall, Englewood Cliffs, N.J.
Giraudel, J. L., and Lek, S. (2001). "A comparison of self-organizing map algorithm and some conventional statistical methods for ecological community ordination." Ecol. Modell., 146, 329–339.
Hammerstrom, D. (1993). "Working with neural networks." IEEE Spectrum, 30(7), 46–53.
Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer feedforward networks are universal approximators." Neural Networks, 2, 359–366.
Kaufman, L., and Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster analysis, Wiley, New York.
Kocjancic, R., and Zupan, J. (2000). "Modeling of the river flowrate: the influence of the training set selection." Chemom. Intell. Lab. Syst., 54, 21–34.
Kohonen, T. (1997). Self-organizing maps, Springer, Berlin.
Levine, D. M., Berenson, M. L., and Stephan, D. (1999). Statistics for managers using Microsoft Excel, Prentice-Hall, Upper Saddle River, N.J.
Maier, H. R., and Dandy, G. C. (2000). "Neural networks for the prediction and forecasting of water resources variables: a review of modeling issues and applications." Environ. Modell. Softw., 15, 101–124.
Masters, T. (1993). Practical neural network recipes in C++, Academic, San Diego.
Minns, A. W., and Hall, M. J. (1996). "Artificial neural networks as rainfall-runoff models." Hydrol. Sci. J., 41(3), 399–417.
Neusciences Corp. (2000). Neuframe version 4.0, Southampton, U.K.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning internal representation by error propagation." Parallel distributed processing, D. E. Rumelhart and J. L. McClelland, eds., MIT Press, Cambridge, Mass.
Shahin, M. A., Jaksa, M. B., and Maier, H. R. (2001). "Artificial neural network applications in geotechnical engineering." Australian Geomech., 36(1), 49–62.
Shahin, M. A., Maier, H. R., and Jaksa, M. B. (2002). "Predicting settlement of shallow foundations using neural networks." J. Geotech. Geoenviron. Eng., 128(9), 785–793.
Shi, J. J. (2002). "Clustering technique for evaluating and validating neural network performance." J. Comput. Civ. Eng., 16(2), 152–155.
Smith, M. (1993). Neural networks for statistical modeling, Van Nostrand Reinhold, New York.
Stone, M. (1974). "Cross-validatory choice and assessment of statistical predictions." J. R. Stat. Soc., B-36, 111–147.
Tokar, A. S., and Johnson, P. A. (1999). "Rainfall-runoff modeling using artificial neural networks." J. Hydrologic Eng., 4(3), 232–239.
Zurada, J. M. (1992). Introduction to artificial neural systems, West, St. Paul, Minn.
