A Comparison Between Neural Networks and Decision Trees Based On Data From Industrial Radiographic Testing
Abstract
In this paper, we empirically compare the performance of neural nets and decision trees based on a data set for the detection of defects in welding seams. This data set was created by image feature extraction procedures working on digitized X-ray films. We introduce a framework for distinguishing classification methods. We found that a more detailed analysis of the error rate is necessary in order to judge the performance of the learning and classification method. However, the error rate cannot be the only criterion for the comparison between the different learning methods. It is a more complex selection process that involves further criteria, which we describe in this paper. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Machine learning; Decision tree induction; Neural networks; Performance analysis; Radiographic testing; Pipe inspection
48 P. Perner et al. / Pattern Recognition Letters 22 (2001) 47–54
given in Section 5. In contrast to most other work on performance analysis (e.g. Michie et al., 1994), we found that a more detailed analysis of the error rate is necessary in order to judge the performance of the learning and classification methods. However, the error rate cannot be the only criterion for the comparison between the different learning methods. It is a more complex selection process that involves further criteria such as explanation capability, the number of features involved in classification, etc. Finally, we give our conclusions in Section 6.

2. The application

The subject of this investigation is the in-service inspection of welds in pipes of austenitic steel. Pipe systems in a power plant are inspected routinely during the lifetime of the power plant by radiographic testing to ensure the integrity of the equipment. Double wall penetration with X-rays (Emax < 200 keV) and special radiographic films for NDT with lead screens are used for this inspection. The flaws to be looked for in the austenitic welds are longitudinal cracks due to intergranular stress corrosion cracking starting from the inner side of the tube.

All the data were collected during a Round Robin Test (Brast et al., 1997). The last step of this Round Robin Test was destructive testing (grinding) of the inspected welds. This gives the most advantageous situation, that the "truth" of all indications is known, which is not the normal case in industrial non-destructive testing. The radiographs are digitized with a spatial resolution of 70 µm and a gray level resolution of 16 bit per pixel by a special NDT film scanner. Afterwards they are stored and decomposed into various ROIs of 50 × 50 pixel size. The essential information in the ROIs is described by a set of features which is calculated by various image-processing methods.

These image-processing procedures are based on the assumption that the crack is roughly parallel to the direction of the weld. This assumption is reasonable because of the material and welding technique. It allows us to define a marked preferential direction. It is now feasible to search for features in gray level profiles perpendicular to the weld direction in the image. In Fig. 1, the ROIs of a crack, an undercut and of no disturbance are shown with the corresponding cross sections and profile plots.

Flaw indications in welds are imaged in a radiograph by local grey level discontinuities. Thus, it is reasonable to apply the well-known morphological edge finding operator (Klette and Zamperoni, 1992), the derivative of Gaussian operator (Pratt, 1991) and the Gaussian weighted image moment vector operator (Eua-Anant et al., 1996). These filters are developed to enhance small local gray value changes on an inhomogeneous background.

A one-dimensional FFT-filter for this special crack detection problem was designed (Zscherpel et al., 1995). This filter is based on the assumption that the preferential direction of the crack is positioned horizontally in the image. The second assumption, which was determined empirically, is that the half-power width of a crack indication is smaller than 300 µm. The filter consists of a columnwise FFT highpass Bessel operation that works with a cutoff frequency of 2 1/mm. Normally the half-power width of undercuts is greater, so that this filter suppresses them. This means that it is possible to distinguish between undercuts and cracks with this FFT-filter. A row-oriented lowpass that is applied to the output of this filter helps to eliminate noise and to point out the cracks more clearly.

Furthermore, a Wavelet filter was used (Strang, 1989). The scale representation of the image after the Wavelet transform makes it possible to suppress the noise in the image with a simple threshold operation without losing significant parts of the content of the image. The noise in the image is an interference of film and scanner noise and irregularities caused by the material of the weld.

The features which describe the content of the ROI are extracted from profile plots which run through the ROI perpendicular to the weld. In a single profile plot, the position of a local minimum is detected which is surrounded by two maxima that are as large as possible. This definition varies a little depending on the respective image processing routine. A template which is
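As a rough illustration of the columnwise highpass idea described above, the following NumPy sketch applies an ideal (brick-wall) highpass along each image column to suppress wide, slowly varying indications while keeping narrow ones. The Bessel characteristic of the original filter is not reproduced here; the ideal frequency mask, the function name and the synthetic ROI are our simplifying assumptions for illustration only.

```python
import numpy as np

def columnwise_highpass(image, pixel_mm, cutoff_per_mm=2.0):
    """Suppress slow columnwise gray-value variations (e.g. broad undercut
    indications) while keeping narrow crack indications.
    An ideal highpass stands in for the Bessel highpass of the paper."""
    rows = image.shape[0]
    freqs = np.fft.rfftfreq(rows, d=pixel_mm)      # cycles per mm along a column
    keep = (freqs >= cutoff_per_mm).astype(float)  # brick-wall highpass mask
    spectrum = np.fft.rfft(image, axis=0)
    return np.fft.irfft(spectrum * keep[:, None], n=rows, axis=0)

# Synthetic 50x50 ROI: gentle background trend plus a ~2-pixel-wide dark dip
# playing the role of a crack indication
roi = np.tile(np.linspace(100.0, 102.0, 50)[:, None], (1, 50))
roi[24:26, :] -= 30.0
filtered = columnwise_highpass(roi, pixel_mm=0.07)  # 70 µm pixel size
```

After filtering, the narrow dip still dominates each column while the smooth background trend is largely removed, which is the property the paper exploits to separate cracks from undercuts.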
Fig. 1. ROI of a crack, of an undercut and of a region without disturbance with corresponding cross sections and profile plots.
adapted to the current profile of the signal allows us to calculate various features S1, S2, L1, L2. Moreover, the half-power width and the respective gradients between the local extrema are calculated. To avoid statistical calculation errors, the calculation of the template features is averaged over all of the columns along an ROI. This procedure leads to 36 parameters.

The data set used in this experiment contains features for regions of interest from background, crack and undercut regions. The data set consists of altogether 1924 ROIs, with 1024 extracted from regions of no disturbance, 465 from regions with cracks and 435 from regions with undercuts.

3. Learning methods

We have used induction of decision trees and learning of neural nets for our problem.

Four types of neural networks are used here: a Backpropagation Network, a Radial Basis Function (RBF) Network, a Fuzzy ARTMAP Network and a Learning Vector Quantization (LVQ) Network (e.g. Zell, 1994). The learning algorithm of the network is based on the gradient descent method for error minimization. First, we tried the wrapper model for feature selection (Liu and Motoda, 1998). We did not obtain any significant result with this model. Therefore, we used a parameter significance analysis (Egmont-Petersen, 1994) for feature selection. Based on that model, we could reduce the parameters to the seven significant ones.

For decision tree induction we used our software package called Decision Master. An entropy minimization criterion is used for attribute selection (Quinlan, 1996) and a reduced-error pruning technique (Quinlan, 1987a) is used for tree pruning. Numerical attribute discretization is done based on methods described in Perner and Trautzsch (1998). Decision tree induction may also be looked upon as a method for attribute selection. During the learning phase, only the most relevant attributes are chosen from the whole set of attributes for the construction of the decision rules in the nodes. Therefore, we do not need to carry out feature selection before the learning process as in the case of neural nets.
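The entropy minimization criterion used for attribute selection can be illustrated on a toy example. The sketch below computes the information gain of a threshold split on one numerical attribute, which is the quantity such a criterion maximizes; the attribute values, labels and function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting a numerical attribute at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Toy data: one feature value per ROI with background/crack labels
values = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
labels = ["background", "background", "background", "crack", "crack", "crack"]
print(information_gain(values, labels, threshold=0.5))  # perfect split -> 1.0
```

The induction algorithm evaluates candidate thresholds for every attribute and places the attribute/threshold pair with the highest gain in the node, which is why the tree implicitly performs feature selection.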
4. Evaluation criterion

The evaluation criterion most used for a classifier is the error rate f_r = N_f / N, with N_f the number of falsely classified samples and N the whole number of samples. In addition to that, we use a contingency table in order to show the qualities of a classifier, see Table 1. Into the fields of the table are entered the real class distribution and the class distribution proposed by the classifier as well as the marginal distribution c_ij. The main diagonal is the number of correctly classified samples. The last row shows the number of samples assigned to the class shown in row 1 and the last line shows the class distribution proposed by the classifier.

From this table, we can calculate parameters that assess the quality of the classifier: the correctness

  p = sum_{i=1}^{m} c_ii / sum_{i=1}^{m} sum_{j=1}^{m} c_ij,

the quality p_ti = c_ii / sum_{j=1}^{m} c_ij according to a particular class i, and the class-specific quality p_ki = c_ii / sum_{j=1}^{m} c_ji, the number of correctly classified samples for one class. In addition to that, we use the other criteria shown in Table 2.

One of these criteria is the cost of classification, expressed by the number of features and the number of decisions used during classification. The other criterion is the time needed for learning. We also take the explanation capability of the classifier into consideration as another quality criterion. It is also important to know whether the classification method can learn the classification function (the mapping of the attributes to the classes) correctly based on the training data set. Therefore, we consider not only the error rate based on the test set but also the error rate based on the training data set. For the evaluation of the error rate, we used the test-and-train method instead of crossvalidation (Weiss and Kulikowski, 1991), since it would have been computationally expensive to evaluate the neural nets by crossvalidation.
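The measures above follow directly from a contingency table. The following sketch computes them for a two-class toy table; the function and variable names are ours, and the numbers are not data from this paper.

```python
import numpy as np

def classifier_qualities(c):
    """c[i][j]: number of samples assigned to class i whose real class is j.
    Returns correctness p, class-specific quality p_k (column-wise)
    and quality p_t (row-wise)."""
    c = np.asarray(c, dtype=float)
    diag = np.diag(c)
    p = diag.sum() / c.sum()      # p   = sum_i c_ii / sum_ij c_ij
    p_k = diag / c.sum(axis=0)    # p_ki = c_ii / sum_j c_ji
    p_t = diag / c.sum(axis=1)    # p_ti = c_ii / sum_j c_ij
    return p, p_k, p_t

# Two-class toy table; the error rate is f_r = N_f / N = 1 - p
table = [[45, 5],
         [5, 45]]
p, p_k, p_t = classifier_qualities(table)
print(round(p, 2), round(1 - p, 2))  # correctness 0.9, error rate 0.1
```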
Table 2
Criteria for comparison of learned classifiers

Generalization capability of the classifier:  Error rate based on the test data set
Representation of the classifier:             Error rate based on the design data set
Classification costs:                         Number of features used for classification;
                                              number of nodes or neurons
Explanation capability:                       Can a human understand the decision?
Learning performance:                         Learning time; sensitivity to class
                                              distribution in the sample set
In addition to that, the neural nets show a difference in class-specific recognition. Mostly, the class-specific error for the class crack is considerably better than for the decision trees (e.g. 0.96 for the backpropagation net in Table 5 compared with 0.74 for the decision tree in Table 4). Since a defect crack is more important than the class undercut (which is not a defect but a geometrical indication from the production process of the pipe), it would be better to have the lowest class-specific error for crack. This avoids false alarms and can save a lot of repair costs in real life.

Table 6
Contingency table of the classification result for the radial basis function net

Classification   Real class index
result           Background   Crack   Undercut   Sum
Background       198          2       1          201
Crack            1            100     7          108
Undercut         9            0       72         81
Sum              208          102     80         390
pk               0.95         0.98    0.90
pt               0.99         0.93    0.89

p = 0.95, k = 0.92

Table 7
Contingency table of the classification result for the LVQ

Classification   Real class index
result           Background   Crack   Undercut   Sum
Background       183          2       0          185
Crack            7            100     7          114
Undercut         18           0       73         91
Sum              208          102     80         390
pk               0.88         0.98    0.91
pt               0.99         0.87    0.80

5.3. Classification cost

Our decision tree produces decision surfaces that are orthogonal to the feature axes. This method should be able to approximate a non-linear decision surface, of course with a certain approximation error, which will lead to a higher error rate in classification. Therefore, we are not surprised that there are more features involved in the classification, see Table 9. The unpruned tree uses 14 features and the pruned tree uses 10 features. The learned decision tree is a very big and bushy tree with 450 nodes. The pruned tree reduces to 250 nodes but still remains a very big and bushy tree.

The neural nets work on only seven features, and e.g. for the backpropagation net we have 72 neurons. That reduces the effort for feature extraction and computation of the classification result drastically.

5.4. Learning performance
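The summary values reported with Table 6 can be reproduced from the contingency table itself. The sketch below recomputes p, pk and pt, and, on the assumption that the reported k is Cohen's kappa (our reading of the symbol, not stated in the text), checks that value as well.

```python
import numpy as np

# Table 6: rows = classification result, columns = real class
# (background, crack, undercut)
c = np.array([[198, 2, 1],
              [1, 100, 7],
              [9, 0, 72]], dtype=float)

n = c.sum()                          # 390 samples
p = np.trace(c) / n                  # overall correctness
p_k = np.diag(c) / c.sum(axis=0)     # class-specific quality (column-wise)
p_t = np.diag(c) / c.sum(axis=1)     # quality (row-wise)

# Cohen's kappa: agreement corrected for chance agreement
p_e = (c.sum(axis=0) * c.sum(axis=1)).sum() / n**2
kappa = (p - p_e) / (1 - p_e)

print(np.round(p, 2), np.round(p_k, 2), np.round(p_t, 2), np.round(kappa, 2))
```

Running this yields p = 0.95, pk = (0.95, 0.98, 0.90), pt = (0.99, 0.93, 0.89) and kappa = 0.92, matching the values printed under Table 6, which supports the kappa reading.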
It takes only a few seconds on a PC 486 until the decision tree has been learned for the data set of 1924 samples used. In contrast to that, neural network learning cannot be handled as easily. The learning time for the neural nets was 15 min for the backpropagation net with 60 000 learning steps, 18 min for the RBF net, 20 min for the LVQ net, and 45 min for the Fuzzy ARTMAP net on a workstation of type INDY Silicon Graphics (R4400 processor, 150 MHz) with the neural network simulator NeuralWorks Professional II/Plus version 5.0.

A disadvantage of neural nets is that feature selection is a preliminary step before learning. In our case it was done with a backpropagation net. The parameter significance was determined based on the contribution analysis method (see Section 3). From 42 features, only the seven most significant features were selected by the feature selection process. If the feature selection process is omitted, which is possible, then the training time and the size of the neural nets increase considerably.

Although the class distribution in the design data set was not uniform, both neural net-based learning and decision tree learning did well on the data set. It is possible that in the case of decision trees, the poorer class-specific error rate for crack results from the non-uniform sample distribution. Therefore, in one experiment we used a data set where the number of samples for each of the three classes was the same (420 samples). As a result we got an error rate that was nearly the same as the one shown in Table 3.

5.5. Explanation capability

One big advantage of decision trees is their explanation capability. A human can understand the rules and can control the decision-making process. A neural net based classifier is more or less a black box for humans. However, since the outcome of the learning process is a very bushy tree, this representation is not very readable for humans. Therefore, some researchers favor rule-based representations over tree-structured representations, since rules tend to be more modular and can be read in isolation of the rest of the knowledge base constructed by induction. A compromise is to use decision tree induction to build an initial tree and then derive rules from the tree, thus transforming an efficient but opaque representation into a transparent one (Quinlan, 1987b).

However, an investigation in Perner et al. (1996) showed that the model derived by decision tree induction does not always represent the model of a human expert. The decisions might not appear in the order a human would use them, especially in uncertain domains.

6. Conclusions

We compared decision tree induction to neural nets. For our empirical evaluation, we used a data set of X-ray images that were taken from welding seams. Thirty-six features described the defects contained in the welding seams. Image feature extraction was used to create the final data set for our experiment.

We found that special types of neural nets have slightly better performance than decision trees. In addition to that, not so many features are involved in the classification, and this is a good characteristic especially for image interpretation tasks. Image processing and feature extraction are mostly time-consuming and require special-purpose hardware for real-time processing. However, the explanation capability that exists for trees producing axis-parallel decision surfaces is an important advantage over neural nets. Moreover, the learning process is not so time-consuming, and it comprises the two processes necessary when building image interpretation systems: feature selection and learning the classifier. Furthermore, it is easy to handle.

All that shows that the decision on what kind of method should be used is not only a question of the resulting error rate. It is a more complex selection process that can only be based on the constraints of the application. The present paper provides some new aspects for the selection of a classification algorithm for a particular application.
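Returning to the compromise described in Section 5.5, deriving readable rules from an induced tree can be sketched as a depth-first traversal that emits one IF-THEN rule per leaf. The tiny hand-built tree and its feature names below are purely illustrative; they are not the tree learned from the weld data.

```python
# Each inner node: (feature, threshold, left_subtree, right_subtree);
# each leaf: a class label string.
tree = ("half_power_width", 300.0,
        ("gradient", 0.5, "background", "crack"),
        "undercut")

def tree_to_rules(node, conditions=()):
    """Depth-first traversal collecting the path conditions to every leaf."""
    if isinstance(node, str):  # leaf reached -> emit one rule
        return [f"IF {' AND '.join(conditions) or 'TRUE'} THEN {node}"]
    feature, threshold, left, right = node
    return (tree_to_rules(left, conditions + (f"{feature} <= {threshold}",))
            + tree_to_rules(right, conditions + (f"{feature} > {threshold}",)))

for rule in tree_to_rules(tree):
    print(rule)
```

Each rule can then be read in isolation, which is the modularity argument made above; Quinlan (1987b) additionally simplifies the extracted rules, which this sketch does not attempt.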
References

Brast, G., Maier, H.J., Knoch, P., Mletzko, U., 1997. Progress Report on a NDT Round Robin on Austenitic Circumferential Pipe Welds. Proceedings 23, MPA Seminar, Vol. 2. Stuttgart, Germany, pp. 42.1–42.16.

Egmont-Petersen, L., 1994. Contribution analysis of multi-layer perceptrons. Estimation of the input sources importance for the classification. In: Pattern Recognition in Practice IV, Proceedings International Workshop. Vlieland, The Netherlands, pp. 347–357.

Eua-Anant, N., Elshafiey, I., Upda, L., Gray, J.N., 1996. A novel image processing algorithm for enhancing the probability of detection of flaws in X-ray images. In: Review of Progress in Quantitative Non-destructive Evaluation, Vol. 15. Plenum Press, New York, pp. 903–910.

Klette, R., Zamperoni, P., 1992. Handbuch der Operatoren für die Bildbearbeitung. Vieweg Verlagsgesellschaft.

Liu, H., Motoda, H., 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, Dordrecht.

Michie, D., Spiegelhalter, D.J., Taylor, C.C., 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence, Chichester, UK.

Perner, P., Belikova, T.B., Yashunskaya, N.I., 1996. Knowledge acquisition by decision tree induction for interpretation of digital images in radiology. In: Perner, P., Wang, P., Rosenfeld, A. (Eds.), Advances in Structural and Syntactical Pattern Recognition, Lecture Notes in Computer Science, Vol. 1121. Springer, Berlin, pp. 208–219.

Perner, P., Trautzsch, S., 1998. On feature partitioning for decision tree induction. In: Amin, A., Dori, D., Pudil, P., Freeman, H. (Eds.), SSPR98 and SPR98. Springer, Berlin, pp. 475–482.

Pratt, W.K., 1991. Digital Image Processing. Wiley, New York. ISBN 0-471-85766-1.

Quinlan, J.R., 1996. Induction of decision trees. Machine Learning 1, 81–106.

Quinlan, J.R., 1987a. Simplifying decision trees. Int. J. Man-Machine Studies 27, 221–234.

Quinlan, J.R., 1987b. Generating production rules from decision trees. In: Proceedings of the 10th International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, pp. 304–307.

Strang, G., 1989. Wavelets and dilation equations: a brief introduction. SIAM Review, Vol. 31, pp. 613–627.

Weiss, S.M., Kulikowski, C.A., 1991. Computer Systems that Learn. Morgan Kaufmann, Los Altos, CA.

Zell, A., 1994. Simulation Neuronaler Netze. Addison-Wesley, Bonn, Paris.

Zscherpel, U., Nockemann, C., Mattis, A., Heinrich, W., 1995. Neue Entwicklungen bei der Filmdigitalisierung. DGZfP-Jahrestagung in Aachen, Tagungsband.