Data Division for Developing Neural Networks Applied to Geotechnical Engineering
Mohamed A. Shahin; Holger R. Maier; and Mark B. Jaksa
Abstract: In recent years, artificial neural networks (ANNs) have been applied to many geotechnical engineering problems with some degree of success. In the majority of these applications, data division is carried out on an arbitrary basis. However, the way the data are divided can have a significant effect on model performance. In this paper, the issue of data division and its impact on ANN model performance is investigated for a case study of predicting the settlement of shallow foundations on granular soils. Four data division methods are investigated: (1) random data division; (2) data division to ensure statistical consistency of the subsets needed for ANN model development; (3) data division using self-organizing maps (SOMs); and (4) a new data division method using fuzzy clustering. The results indicate that the statistical properties of the data in the training, testing, and validation sets need to be taken into account to ensure that optimal model performance is achieved. It is also apparent from the results that the SOM and fuzzy clustering methods are suitable approaches for data division.

DOI: 10.1061/(ASCE)0887-3801(2004)18:2(105)

CE Database subject headings: Neural networks; Fuzzy sets; Settlement; Geotechnical engineering; Data processing; Maps.
proposed in the literature. Bowden et al. (2002) used a genetic algorithm to minimize the difference between the means and standard deviations of the data in the training, testing, and validation sets. While this approach ensures that the statistical properties of the various data subsets are similar, there is still a need to choose which proportion of the data to use for training, testing, and validation. Kocjancic and Zupan (2000) and Bowden et al. (2002) use a self-organizing map (SOM) to cluster high-dimensional input and output data in two-dimensional space and divide the available data so that values from each cluster are represented in the various data subsets. This ensures that data in the different subsets are representative of each other, and SOMs have the additional advantage that there is no need to decide what percentage of the data to use for training, testing, and validation. The major shortcoming of this approach is that there are no guidelines for determining the optimum size and shape of the SOM (Giraudel and Lek 2001). This has the potential to have a significant impact on the results obtained, as the underlying assumption of the approach is that the data points (samples or records) in one cluster provide the same information in high-dimensional space. However, if the SOM is too small, there may be significant intracluster variation. Conversely, if the map is too large, too many clusters may contain single data points, making it difficult to choose representative subsets.

In this paper, a new data division approach is introduced and compared with existing approaches for the case study of prediction of the settlement of shallow foundations on granular soils. The new approach utilizes a fuzzy clustering technique, which overcomes the limitations of existing methods. Shi (2002) has recently used fuzzy clustering for the evaluation and validation of neural networks. However, to the writers' best knowledge, fuzzy clustering has yet to be used as a data division approach for ANNs. The specific objectives of this paper are:
1. To investigate the relationship between the statistical properties of the data subsets used to develop ANN models and model performance;
2. To introduce a new approach to data division for ANNs based on fuzzy clustering;
3. To compare the performance of the new approach with that of three existing approaches, including random data division, data division to ensure statistical consistency between the various subsets, and data division using a SOM;
4. To investigate the relationship between the proportion of the data in each of the subsets used to develop the ANN models and their performance in relation to the data division method that ensures statistical consistency between data sets; and
5. To investigate the impact of the number of data points used from each cluster for training on model performance in relation to the SOM data division method.

within the domain of the available data. Another shortcoming of this approach is that the proportion of the data to be used for training, testing, and validation needs to be chosen a priori by the modeler. However, there are no firm guidelines in the literature to assist with this task, although some rules-of-thumb exist, such as using two thirds of the data for model calibration (i.e., training and testing) and one third for model validation (Hammerstrom 1993).

Approach 2: Statistically Consistent

As part of this approach, the available data are divided in a way that ensures that the statistical properties of the data in each of the subsets are as close to each other as possible and thus represent the same statistical population. In this work, a trial-and-error process is used to achieve this. The statistical parameters used include the mean, standard deviation, minimum, maximum, and range. To examine how representative the training, testing, and validation sets are of each other, t- and F-tests are carried out. The t-test examines the null hypothesis of no difference in the means of two data sets, and the F-test examines the null hypothesis of no difference in the standard deviations of the two sets. For a given level of significance, test statistics can be calculated to test the null hypotheses for the t- and F-tests, respectively. Traditionally, a level of significance equal to 0.05 is selected (Levine et al. 1999). Consequently, this level of significance is used in this research. This means that there is a confidence level of 95% that the training, testing, and validation sets are statistically consistent. A detailed description of these tests is given by Levine et al. (1999). The major shortcomings of this approach are that it is based on trial and error and that the proportion of the data to be used for training, testing, and validation needs to be chosen in advance by the modeler, as mentioned previously.

Approach 3: Self-Organizing Map

Self-organizing maps (SOMs) belong to the genre of unsupervised neural networks. The typical structure of an SOM consists of two layers: an input layer and a Kohonen layer (Fig. 1). The processing elements in the Kohonen layer are arranged in a one- or two-dimensional array. The input from each node in the input layer (x_i for i = 1, 2, ..., n) is fully connected to the Kohonen layer through connection weights (w_ji for j = 1, 2, ..., m). At the beginning of the self-organizing process, these weights are initialized randomly. At each node in the Kohonen layer, the input (x_i) is presented without providing the desired output, and a matching value is calculated. This value is typically the Euclidean distance (D_j) between the weights of each node and the corresponding input values, as shown in Eq. (1):

    D_j = sum_{i=1}^{n} (x_i - w_ji)^2,   j = 1, 2, ..., m        (1)
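The matching step in Eq. (1) and the selection of the best-matching node can be sketched in a few lines of NumPy. This is an illustrative sketch only (the function and variable names are not from the paper): it computes D_j for every Kohonen node, picks the node with the smallest matching value, and nudges that node's weights toward the input, which is the core of the self-organizing process described above.

```python
import numpy as np

def matching_values(x, W):
    # Eq. (1): D_j = sum_i (x_i - w_ji)^2, one value per Kohonen node.
    # x: input vector of length n; W: (m, n) array, one weight row per node.
    return ((x - W) ** 2).sum(axis=1)

def som_step(x, W, learning_rate=0.5):
    # Find the best-matching node and move its weights toward the input.
    D = matching_values(x, W)
    winner = int(np.argmin(D))
    W[winner] += learning_rate * (x - W[winner])
    return winner, W
```

In a full SOM, nodes in a neighborhood around the winner are also updated, with the neighborhood radius and learning rate decaying over the course of training.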
interval is assigned to the validation set, while the remaining data points from the same interval are assigned to the training set. By using this approach, the best possible representation of the available data is achieved in each of the three data subsets.

The detailed procedure for using fuzzy clustering for ANN data division introduced in this paper is as follows:
1. An initial number of clusters, not less than two, is chosen (the initial number of clusters can be assumed to be equal to 5% of the available data);
2. The available data are clustered using the fuzzy clustering technique and the average silhouette width s̄(k) of the entire data set is calculated;
3. The number of clusters is increased by one and step 2 is repeated until s̄(k) remains constant or the number of clusters reaches 50% of the available data;
4. The number of clusters that results in the largest value of s̄(k) is considered optimum;
5. For the optimum number of clusters, the data records included in each cluster are ranked according to their membership values in incremental intervals of 0.1 between 0.0 and 1.0 (i.e., 0.0–0.1, 0.1–0.2, ..., 0.9–1.0); and
6. For each cluster and membership interval (e.g., cluster 1 and membership interval 0.0–0.1), two samples are chosen (one for the testing set and one for the validation set), and all remaining data samples are chosen for the training set. In the instance when only two records are obtained, one record is chosen for training and the other is chosen for testing. If only one record is obtained, this record is included in the training set.

Case Study

In this research, the four approaches of data division discussed previously are applied to the case study of predicting the settlement of shallow foundations on granular soils. The predictive ANN models are based on the model developed by Shahin et al. (2002) and implemented using the PC-based software package Neuframe Version 4.0 (Neusciences 2000). The steps for developing ANN models outlined by Maier and Dandy (2000) are used as a guide in this research. These include determination of model inputs and outputs, division of the available data, determination of optimal network architecture, weight optimization, and model validation.

Model Inputs

The model inputs used include footing width (B); footing net applied pressure (q); soil compressibility, which can be represented by the average blow count (N) per 300 mm obtained using the standard penetration test over the depth of influence of the foundation; footing geometry (L/B); and footing embedment ratio (D_f/B). These input variables are believed to have the greatest effect on the settlement of shallow foundations on granular soils, as discussed by Burland and Burbidge (1985). The model output is foundation settlement (S_m).

Data Division

The database used for the development of the ANN models comprises a total of 189 individual cases (Shahin et al. 2002). Ranges of the data used for the input and output variables are summarized in Table 1. The available data are divided using the four approaches discussed previously.

Approach 1: Random

As part of this approach, the 189 individual cases are randomly divided into training, testing, and validation subsets. In total, 80% of the data (i.e., 152 individual cases) are used for calibration and 20% of the data (i.e., 37 individual cases) are used for validation. The calibration data are further divided into 70% for training (i.e., 106 individual cases) and 30% for testing (i.e., 46 individual cases).

Approach 2: Statistically Consistent

As part of this approach, the 189 individual cases are divided into three statistically consistent subsets. A number of different proportions of the available data are used for training, testing, and validation, in order to investigate the impact the proportion of the data used in the various subsets has on model performance (see objective 4). The different proportions investigated are summarized in Table 2.

Table 2. Proportions of Data Used for Training, Testing, and Validation

                         Remaining data
Validation set (%)   Training set (%)   Testing set (%)
       10                  70                 30
                           80                 20
                           90                 10
       20                  70                 30
                           80                 20
                           90                 10
       30                  70                 30
                           80                 20
                           90                 10
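The average silhouette width s̄(k) at the heart of steps 2–4 of the fuzzy clustering procedure described earlier can be sketched as follows. This is an illustrative sketch with our own function names, and for brevity it scores a crisp (hard) assignment of points to clusters, whereas the paper's FANNY implementation works with fuzzy membership values.

```python
import math

def avg_silhouette(points, labels):
    # s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance
    # from point i to the other members of its own cluster and b_i is the
    # smallest mean distance from i to the members of any other cluster.
    # The average of s(i) over all points is the silhouette width s̄(k).
    by_label = {}
    for i, lab in enumerate(labels):
        by_label.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in by_label[lab] if j != i]
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own) if own else 0.0
        b = min(sum(math.dist(points[i], points[j]) for j in members) / len(members)
                for other, members in by_label.items() if other != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)
```

Under the procedure above, this score would be computed for each candidate number of clusters (from roughly 5% to 50% of the available cases), and the cluster count giving the largest s̄(k) would be retained.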
Approach 3: SOM

The PC-based software package Neuframe Version 4.0 (Neusciences 2000) is used to cluster the data using an SOM. The inputs (i.e., B, q, N, L/B, and D_f/B) and corresponding output (S_m) of the predictive model are presented to the SOM as inputs (Fig. 1). As mentioned previously, there is no precise rule for determining the optimum size of the map. Consequently, a number of map sizes are investigated, including 5x5, 6x6, 7x7, and 8x8. For all map sizes, the default parameters (e.g., learning rate and neighborhood size) suggested in the software package (Neusciences 2000) are used, and training is continued for 10,000 iterations, as the connection weights remain stable after this point. A grid size of 8x8 is chosen, as it ensures that the maximum number of clusters is found from the training data (Bowden et al. 2002).

In order to investigate the impact of the number of data points used from each cluster for training on model performance (see objective 5), two different approaches for choosing training data from each cluster are adopted. As part of the first approach, all data records remaining after the selection of the testing and validation data are used for training. As a result, a total of 110 records are used for training, 46 for testing, and 33 for validation. As part of the second approach, only one data point from each cluster is chosen for training. As a result, 54 records are used for training, 46 for testing, and 33 for validation, resulting in a reduction in the data used for training of approximately 50%.

Approach 4: Fuzzy Clustering

The software package FANNY (Kaufman and Rousseeuw 1990) is used to cluster the data using fuzzy clustering. Using the procedure outlined previously, 10–94 clusters are tried. The average silhouette width of the entire data set, s̄(k), is maximized when 16 clusters are used and is equal to 0.3. Sample membership values of the optimum clustering (i.e., number of clusters = 16) for some data records are shown in Table 3. Using the procedure outlined previously, samples are chosen for the training, testing, and validation sets; as a result, a total of 143 records are used for training, 25 for testing, and 21 for validation.

Architecture

In accordance with Shahin et al. (2002), multilayer perceptrons (Zurada 1992; Fausett 1994) are used for the development of ANN models in this work. The optimum network geometry is obtained utilizing a trial-and-error approach in which ANNs are trained with one hidden layer and 1, 2, 3, 5, 7, 9, and 11 hidden layer nodes. It should be noted that one hidden layer can approximate any continuous function, provided that sufficient connection weights are used.

In order to obtain the optimal parameters that control the back-propagation algorithm, the network is trained with different combinations of momentum terms and learning rates. The momentum terms used are 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, and 0.8, whereas the learning rates used are 0.005, 0.02, 0.1, 0.2, 0.4, and 0.6. The ANN models are found to be optimal when a momentum term of 0.8 and a learning rate of 0.2 are used.

Validation

Once training has been successfully accomplished, the model is validated on the independent validation set. The coefficient of correlation, r; the root mean square error (RMSE); and the mean absolute error (MAE) are used to evaluate the performance of the trained model.

Results and Discussion

The statistics of the training, testing, and validation sets obtained when the data are divided in a purely random fashion (approach 1) and where the statistics of the subsets are taken into account (approach 2) are shown in Tables 4 and 5, respectively. It can be seen that when the data are divided in a purely random manner (approach 1, Table 4), there are some inconsistencies in the statistics between the various data subsets. This is confirmed by the results of the t- and F-tests, which showed that the hypotheses are rejected for most of the testing and validation sets and, consequently, the data in the three subsets generally do not belong to the same statistical population. However, it should be noted that this is not necessarily the case whenever the data are divided in a random manner, as only one random trial is performed in this work and there are many different possible ways in which the data can be divided into training, testing, and validation subsets. The results in Table 5 show that, when the data are divided in a way that takes into account the statistical properties of the various subsets (approach 2), the statistics are in much better agreement, as expected. This is confirmed by the outcomes of the t- and F-tests, which show that the hypotheses are accepted for all of the testing and validation sets and, consequently, the training, testing, and validation sets are generally representative of each other.

The performance of the models developed using the data sets whose statistics are shown in Tables 4 and 5 is shown in Table 6 (columns 2 and 3). It can be seen that there is a direct relationship between the consistency in the statistics between training, testing, and validation sets and consistency in model performance. When the training, testing, and validation data are not representative of each other, there can be large discrepancies in the model performance obtained using the training, testing, and validation sets. Consequently, the results obtained using the validation set may not be truly representative of the performance of the trained model, as the validation set may contain extreme data points that
Table 5. Input and Output Statistics Obtained Using Data Division to Ensure Statistical Consistency

Model variables and data sets            Mean   Standard deviation   Minimum   Maximum   Range
Footing width, B (m)
  Training set                            8.3          9.8              0.8      60.0     59.2
  Testing set                             9.3         10.9              0.9      55.0     54.1
  Validation set                          9.4         10.1              0.9      41.2     40.3
Footing net applied pressure, q (kPa)
  Training set                          188.4        129.0             18.3     697.0    678.7
  Testing set                           183.2        118.7             25.0     584.0    559.0
  Validation set                        187.9        114.6             33.0     575.0    542.0
Average SPT blow count, N
  Training set                           24.6         13.6              4.0      60.0     56.0
  Testing set                            24.6         12.9              5.0      60.0     55.0
  Validation set                         24.3         14.1              4.0      55.0     51.0
Footing geometry, L/B
  Training set                            2.1          1.7              1.0      10.5      9.5
  Testing set                             2.3          1.9              1.0       9.9      8.9
  Validation set                          2.1          1.8              1.0       8.0      7.0
Footing embedment ratio, D_f/B
  Training set                            0.52         0.57             0.0       3.4      3.4
  Testing set                             0.49         0.52             0.0       3.0      3.0
  Validation set                          0.59         0.64             0.0       3.0      3.0
Measured settlement, S_m (mm)
  Training set                           20.0         27.2              0.6     121.0    120.4
  Testing set                            21.4         26.6              1.0     120.0    119.0
  Validation set                         20.4         25.2              1.3     120.0    118.7
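The t- and F-tests used to check that subsets such as those in Table 5 are statistically consistent can be sketched as follows. This is an illustrative sketch with our own function names: it computes the classical pooled-variance t statistic for equality of means and the variance-ratio F statistic for equality of standard deviations; in practice each statistic would then be compared against the critical value for the chosen 0.05 level of significance (e.g., from statistical tables or a library such as scipy.stats).

```python
import math

def t_statistic(a, b):
    # Pooled-variance two-sample t statistic: tests the null hypothesis
    # that the means of the two data sets are equal.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))

def f_statistic(a, b):
    # Variance-ratio F statistic (larger variance on top): tests the null
    # hypothesis that the standard deviations of the two sets are equal.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return max(va, vb) / min(va, vb)
```

A t statistic near zero and an F statistic near one indicate that the two subsets are consistent in mean and spread, which is the condition approach 2 seeks for every variable in Table 5.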
were not used in the model calibration (training) phase. Consequently, the best model given the available data has not been developed. Similarly, if the results obtained using the testing set are not representative of those obtained using the training set, training may be ceased at a suboptimal time, or a suboptimal network geometry, learning rate, or momentum value may be chosen. However, when the training, testing, and validation sets are representative of each other, the performance of the model on each of the three subsets is very similar, indicating that the model has the ability to interpolate within the extremes contained in the available data.

The model performances obtained when different proportions of the available data are used for training, testing, and validation, in conjunction with the data division method that takes into account the statistical properties of the data (approach 2), are shown in Table 7. A code is used to distinguish between the various proportions of the available data used for training, testing, and validation. The code consists of three numbers. The first number represents the percentage of the data used in the validation set, whereas the second two numbers, placed between brackets and separated by a hyphen, are the percentages that divide the remaining data into training and testing sets, respectively. It can be seen from Table 7 that there is no clear relationship between the proportion of data used for training, testing, and validation and model performance. The best result is obtained when 20% of the data are used for validation and the remaining data are divided into 70% for training and 30% for testing. The results in Table 7 also indicate that there can be significant variation in the results obtained, depending on which proportion of the data is used for training, testing, and validation, even when the statistical properties of the data subsets are taken into account. This may be due to the difficulties in obtaining representative data sets for some of the proportions for training, testing, and validation investigated for the particular data set used.

The difficulties associated with deciding which proportion of the available data to use for training, testing, and validation can be overcome by using an SOM (approach 3) or fuzzy clustering (approach 4) for obtaining appropriate data subsets. However, as discussed previously, two different approaches for choosing the training data from the clusters obtained from an SOM have been proposed in the literature and are investigated here. The results obtained indicate that it is better to use all of the data remaining after the testing and validation data have been removed from each cluster for training, rather than choosing only one data point from each cluster, as the RMSE increases from 10.43 to 14.43 mm and the MAE increases from 7.98 to 10.21 mm when the additional training data are discarded. However, there is a slight decrease in the coefficient of correlation, r, from 0.94 to 0.93 when the additional training data are included. Consequently, the subsequent discussion in relation to the SOM data division method (approach 3) is restricted to the case where all remaining data are used for training.

The statistics of the data in each of the subsets obtained using the SOM (approach 3) and fuzzy clustering (approach 4) data division methods are very similar (Tables 8 and 9), and the t- and F-tests indicated that the three data sets in Tables 8 and 9 may be considered to be representative of each other. The success of the SOM and fuzzy clustering data division methods is illustrated in Table 6 (columns 4 and 5), which compares the predictive results obtained using the four different approaches to data division.

Table 6 (fragment). Performance of the Models Developed Using the Four Data Division Approaches

                                Approach 1   Approach 2   Approach 3   Approach 4
RMSE (mm)                         16.39        10.12        10.43        10.48
MAE (mm)                          11.94         6.43         7.98         6.92
Validation
  Correlation coefficient, r       0.659        0.905        0.958        0.957
  RMSE (mm)                       10.57        11.04        10.12         9.59
  MAE (mm)                         8.85         8.78         7.12         6.13

Note: RMSE = root mean square error and MAE = mean absolute error.

Table 7 (fragment). Performance of the Models Developed Using Different Proportions of the Data for Training, Testing, and Validation (Approach 2)

Code         Data set          r      RMSE (mm)   MAE (mm)
             Training set     0.939      9.26        6.63
             Testing set      0.876     13.82        7.96
             Validation set   0.909     12.72        9.07
10 (90-10)   Training set     0.934      9.25        6.04
             Testing set      0.924     13.87       10.43
             Validation set   0.849     18.35        9.95
20 (70-30)   Training set     0.930     10.01        6.87
             Testing set      0.929     10.12        6.43
             Validation set   0.905     11.04        8.78
20 (80-20)   Training set     0.933      9.57        6.63
             Testing set      0.929     10.96        6.94
             Validation set   0.898     11.39        9.01
20 (90-10)   Training set     0.918     10.67        7.51
             Testing set      0.945     10.46        6.89
             Validation set   0.878     12.52        9.49
30 (70-30)   Training set     0.920     11.01        7.88
             Testing set      0.938     10.93        7.28
             Validation set   0.903     10.94        7.76
30 (80-20)   Training set     0.926     10.68        7.12
             Testing set      0.903     11.52        7.71
             Validation set   0.887     11.55        7.83
30 (90-10)   Training set     0.923     10.10        7.38
             Testing set      0.835     16.33       10.78
             Validation set   0.920     10.80        7.53
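The three performance measures reported in Tables 6 and 7 can be computed as follows. This is an illustrative sketch with our own function name, not code from the paper; it evaluates the coefficient of correlation r, the RMSE, and the MAE between measured and predicted settlements.

```python
import math

def performance(measured, predicted):
    # r: Pearson coefficient of correlation between the two series.
    # RMSE: root mean square error; MAE: mean absolute error.
    n = len(measured)
    mm = sum(measured) / n
    mp = sum(predicted) / n
    cov = sum((m - mm) * (p - mp) for m, p in zip(measured, predicted))
    var_m = sum((m - mm) ** 2 for m in measured)
    var_p = sum((p - mp) ** 2 for p in predicted)
    r = cov / math.sqrt(var_m * var_p)
    rmse = math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)
    mae = sum(abs(m - p) for m, p in zip(measured, predicted)) / n
    return r, rmse, mae
```

Note that r is insensitive to a constant offset between predictions and measurements, which is why the paper reports RMSE and MAE alongside it.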
fuzzy clustering data division has the advantage over SOM data division that an optimum number of clusters can be obtained analytically and, consequently, the fuzzy clustering data division approach removes the subjectivity associated with the SOM data division approach.

Summary and Conclusions

The results obtained using the SOM data division method also confirm the results obtained by Bowden et al. (2002), which showed that the ANN models developed using the SOM data division technique outperform the conventional method (approach 1). In addition, fuzzy clustering was found to overcome the problem of determining the optimum size of clusters associated with using SOMs and, consequently, fuzzy clustering was found to provide a systematic approach for data division of ANN models.
References

influence of the training set selection." Chemom. Intell. Lab. Syst., 54, 21–34.
Kohonen, T. (1997). Self-organizing maps, Springer, Berlin.
Levine, D. M., Berenson, M. L., and Stephan, D. (1999). Statistics for managers using Microsoft Excel, Prentice-Hall, Upper Saddle River, N.J.
Maier, H. R., and Dandy, G. C. (2000). "Neural networks for the prediction and forecasting of water resources variables: a review of modeling issues and applications." Environ. Modell. Softw., 15, 101–124.
Masters, T. (1993). Practical neural network recipes in C++, Academic, San Diego.
Geoenviron. Eng., 128(9), 785–793.
Shi, J. J. (2002). "Clustering technique for evaluating and validating neural network performance." J. Comput. Civ. Eng., 16(2), 152–155.
Smith, M. (1993). Neural networks for statistical modeling, Van Nostrand Reinhold, New York.
Stone, M. (1974). "Cross-validatory choice and assessment of statistical predictions." J. R. Stat. Soc., B-36, 111–147.
Tokar, A. S., and Johnson, P. A. (1999). "Rainfall-runoff modeling using artificial neural networks." J. Hydrologic Eng., 4(3), 232–239.
Zurada, J. M. (1992). Introduction to artificial neural systems, West, St. Paul, Minn.