Article in Pattern Recognition Letters · January 2001
DOI: 10.1016/S0167-8655(00)00098-2


Pattern Recognition Letters 22 (2001) 47–54
www.elsevier.nl/locate/patrec

A comparison between neural networks and decision trees based on data from industrial radiographic testing

Petra Perner a,*, Uwe Zscherpel b, Carsten Jacobsen b

a Institute of Computer Vision and Applied Computer Sciences Leipzig, Arno-Nitzsche-Strasse 45, 04277 Leipzig, Germany
b Bundesanstalt für Materialforschung und -prüfung, Unter den Eichen 87, 12205 Berlin, Germany

Abstract

In this paper, we empirically compare the performance of neural nets and decision trees based on a data set for the detection of defects in welding seams. This data set was created by image feature extraction procedures working on digitized X-ray films. We introduce a framework for distinguishing classification methods. We found that a more detailed analysis of the error rate is necessary in order to judge the performance of the learning and classification method. However, the error rate cannot be the only criterion for comparing the different learning methods. It is a more complex selection process involving further criteria, which we describe in this paper. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Machine learning; Decision tree induction; Neural networks; Performance analysis; Radiographic testing; Pipe inspection

* Corresponding author. Tel.: +49-341-8665-669; fax: +49-341-8665-636.
E-mail address: ibaiperner@aol.com (P. Perner).

0167-8655/00/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(00)00098-2

1. Introduction

Developing an image interpretation system comprises two tasks: selecting the right features and constructing the classifier.

The selection of the right method for the classification is not easy and often depends on the preference of the system developer. This paper describes a first step towards a methodology for the selection of the appropriate classification method. Our investigation was not done on a standard academic data set where the data are usually nicely cleaned up. The basis for our investigation is an image database that contains X-ray images from welding seams. An image is decomposed into regions of interest and, for each region of interest, 36 features are calculated by an image processing procedure. A detailed description of this process is given in Section 2. In the resulting database, each entry describes a region of interest by means of 36 feature values and a class label determined by destructive testing after the X-ray penetration. The task is to classify the Regions of Interest (ROI) automatically into background or into defects such as crack and undercut.

Two different kinds of classifier were trained based on that data set: neural nets and decision trees. The different kinds of neural nets and decision trees are described in Section 3. Since the class is given, we are dealing with supervised learning. We introduce a framework for distinguishing learning and classification methods in Section 4. A detailed description of the performance analysis is

given in Section 5. In contrast to most other work on performance analysis (e.g. Michie et al., 1994), we found that a more detailed analysis of the error rate is necessary in order to judge the performance of the learning and classification methods. However, the error rate cannot be the only criterion for the comparison between the different learning methods. It is a more complex selection process that involves more criteria such as explanation capability, the number of features involved in the classification, etc. Finally, we will give our conclusions in Section 6.

2. The application

The subject of this investigation is the in-service inspection of welds in pipes of austenitic steel. Pipe systems in a power plant are inspected routinely during the lifetime of the power plant by radiographic testing to ensure the integrity of the equipment. Double wall penetration with X-rays (Emax < 200 keV) and special radiographic films for NDT with lead screens are used for this inspection. The flaws to be looked for in the austenitic welds are longitudinal cracks due to intergranular stress corrosion cracking starting from the inner side of the tube.

All the data were collected during a Round Robin Test (Brast et al., 1997). The last step of this Round Robin Test was a destructive testing (grinding) of the inspected welds. This gives the most advantageous situation, that the "truth" of all indications is known, which is not the normal case in industrial non-destructive testing. The radiographs are digitized with a spatial resolution of 70 µm and a gray level resolution of 16 bit per pixel by a special NDT film scanner. Afterwards they are stored and decomposed into various ROIs of 50 × 50 pixel size. The essential information in the ROIs is described by a set of features which is calculated from various image-processing methods.

These image-processing procedures are based on the assumption that the crack is roughly parallel to the direction of the weld. This assumption is reasonable because of the material and welding technique. It allows us to define a marked preferential direction. It is now feasible to search for features in gray level profiles perpendicular to the weld direction in the image. In Fig. 1, the ROIs of a crack, an undercut and of no disturbance are shown with the corresponding cross sections and profile plots.

Flaw indications in welds are imaged in a radiograph by local grey level discontinuities. Thus, it is reasonable to apply the well-known morphological edge finding operator (Klette and Zamperoni, 1992), the derivative of Gaussian operator (Pratt, 1991) and the Gaussian weighted image moment vector operator (Eua-Anant et al., 1996). These filters are developed to enhance small local gray value changes on an inhomogeneous background.

A one-dimensional FFT-filter for this special crack detection problem was designed (Zscherpel et al., 1995). This filter is based on the assumption that the preferential direction of the crack is positioned horizontally in the image. The second assumption, determined empirically, is that the half-power width of a crack indication is smaller than 300 µm. The filter consists of a columnwise FFT highpass Bessel operation that works with a cutoff frequency of 2 lp/mm. Normally the half-power width of undercuts is greater, so that this filter suppresses them. This means that it is possible to distinguish between undercuts and cracks with this FFT-filter. A row-oriented lowpass that is applied to the output of this filter helps to eliminate noise and to point out the cracks more clearly.

Furthermore, a Wavelet filter was used (Strang, 1989). The scale representation of the image after the Wavelet transform makes it possible to suppress the noise in the image with a simple threshold operation without losing significant parts of the content of the image. The noise in the image is an interference of film and scanner noise and irregularities caused by the material of the weld.

The features which describe the content of the ROI are extracted from profile plots which run through the ROI perpendicular to the weld.

In a single profile plot, the position of a local minimum is detected which is surrounded by two maxima that are as large as possible. This definition varies a little depending on the respective image processing routine. A template which is

Fig. 1. ROI of a crack, of an undercut and of a region without disturbance with corresponding cross-sections and profile plots.
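The columnwise FFT highpass described above can be sketched as follows. This is a minimal illustration, not the original filter: a Butterworth-style transfer function stands in for the Bessel highpass, the cutoff fraction and averaging width are illustrative assumptions, and the ROI is synthetic.

```python
import numpy as np

def columnwise_highpass(img, cutoff_frac=0.1, order=4):
    """FFT highpass applied to every column of the ROI; a Butterworth-style
    transfer function approximates the Bessel highpass (assumption)."""
    n = img.shape[0]
    freqs = np.fft.rfftfreq(n)  # normalized column frequencies, 0..0.5
    h = 1.0 / (1.0 + (cutoff_frac / np.maximum(freqs, 1e-12)) ** (2 * order))
    return np.fft.irfft(np.fft.rfft(img, axis=0) * h[:, None], n=n, axis=0)

def rowwise_lowpass(img, width=5):
    """Row-oriented moving average to suppress residual noise."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)

# Synthetic 50 x 50 ROI (hypothetical data): smooth periodic background
# plus a thin, low-intensity horizontal indication in row 25.
rows = np.arange(50)
roi = 110.0 + 5.0 * np.sin(2 * np.pi * rows / 50.0)[:, None] * np.ones((1, 50))
roi[25, :] -= 15.0
out = rowwise_lowpass(columnwise_highpass(roi))
```

The smooth background is suppressed by the highpass while the narrow indication survives, so the filtered response peaks at the row containing the simulated crack.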

adapted to the current profile of the signal allows us to calculate various features (S1, S2, L1, L2). Moreover, the half-power width and the respective gradients between the local extrema are calculated. To avoid statistical calculation errors, the calculation of the template features is averaged over all of the columns along an ROI. This procedure leads to 36 parameters.

The data set used in this experiment contains features for regions of interest from background, crack and undercut regions. The data set consists of altogether 1924 ROIs, with 1024 extracted from regions of no disturbance, 465 from regions with cracks and 435 from regions with undercuts.

3. Learning methods

We have used induction of decision trees and learning of neural nets for our problem.

Four types of neural networks are used here: a Backpropagation network, a Radial Basis Function (RBF) Network, a Fuzzy ARTMAP Network and a Learning Vector Quantization (LVQ) Network (e.g. Zell, 1994). The learning algorithm of the network is based on the gradient descent method for error minimization. First, we tried the wrapper model for feature selection (Lui and Motoda, 1998). We did not obtain any significant result with this model. Therefore, we used a parameter significance analysis (Egmont-Petersen, 1994) for feature selection. Based on that model, we can reduce the parameters to seven significant ones.

For decision tree induction we used our software package called Decision Master. An entropy minimization criterion is used for attribute selection (Quinlan, 1996) and a reduced-error pruning technique (Quinlan, 1987a) is used for tree pruning. Numerical attribute discretization is done based on methods described in Perner and Trautzsch (1998). Decision tree induction may also be looked upon as a method for attribute selection. During the learning phase, only the most relevant attributes are chosen from the whole set of attributes for the construction of decision rules in the nodes. Therefore, we do not need to carry out feature selection before the learning process as in the case of neural nets.
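The entropy-minimization criterion used for attribute selection can be illustrated with a small sketch. The helper names and the toy data below are our own and are not part of the Decision Master package:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label sample, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, threshold):
    """Entropy reduction from a binary split of a numeric attribute at
    `threshold`; an inducer evaluates candidate cuts and keeps the one
    with maximal gain (equivalently, minimal remaining entropy)."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    rem = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - rem

# A perfectly separating cut removes all class uncertainty (1 bit here).
gain = info_gain([1.0, 2.0, 8.0, 9.0],
                 ["crack", "crack", "undercut", "undercut"], 5.0)
```

Choosing the attribute and cut point with the largest gain at every node is what makes the induction act as an implicit feature selector: attributes that never win a split never enter the tree.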
50 P. Perner et al. / Pattern Recognition Letters 22 (2001) 47±54

4. Evaluation criterion

The evaluation criterion most used for a classifier is the error rate f_r = N_f / N, with N_f the number of falsely classified samples and N the whole number of samples. In addition to that, we use a contingency table in order to show the qualities of a classifier, see Table 1. Into the fields of the table are entered the real class distribution and the class distribution proposed by the classifier, as well as the marginal distribution c_ij. The main diagonal holds the numbers of correctly classified samples. The last column shows the number of samples assigned to each class and the last row shows the number of samples in each real class.

From this table, we can calculate parameters that assess the quality of the classifier.

The correctness

p = (Σ_{i=1}^m c_ii) / (Σ_{i=1}^m Σ_{j=1}^m c_ij)

is the number of correctly classified samples relative to the whole number of samples. That measure is the opposite of the error rate.

For the investigation of the classification quality, we measure the classification quality p_ki = c_ii / Σ_{j=1}^m c_ji according to a particular class i, and the class-specific quality p_ti = c_ii / Σ_{j=1}^m c_ij, the number of correctly classified samples for one class. In addition to that, we use the other criteria shown in Table 2.

One of these criteria is the cost for classification, expressed by the number of features and the number of decisions used during classification. The other criterion is the time needed for learning. We also take the explanation capability of the classifier into consideration as another quality criterion. It is also important to know if the classification method can learn the classification function (the mapping of the attributes to the classes) correctly based on the training data set. Therefore, we not only consider the error rate based on the test set, we also consider the error rate based on the training data set. For the evaluation of the error rate, we used the test-and-train method instead of crossvalidation (Weiss and Kulikowski, 1991), since it would have been computationally expensive to evaluate the neural nets by crossvalidation.

Table 1
Contingency table

Assigned class index    Real class index
                        1      i      ...    m      Sum
1                       c11    ...    ...    c1m
j                       ...    cji    ...    ...
...                     ...    ...    ...    ...
m                       cm1    ...    ...    cmm
Sum

Table 2
Criteria for comparison of learned classifiers

Generalization capability of the classifier    Error rate based on the test data set
Representation of the classifier               Error rate based on the design data set
Classification costs                           Number of features used for classification;
                                               number of nodes or neurons
Explanation capability                         Can a human understand the decision?
Learning performance                           Learning time;
                                               sensitivity to the class distribution in the sample set

5. Results

5.1. Error rate, generalization and representation ability

The error rate for the design data set and the test data set is shown in Table 3. The unpruned decision tree shows the best error rate for the design data set. This tree represents the data best. However, if we look for the error rate calculated based on the test data set, then we note that the RBF neural net and the Backpropagation network can do better. Their performance shows not such a big difference in the two error rates as that for the

decision tree. The representation and generalization ability is more balanced in the case of the neural networks, whereas the unpruned decision tree overfits the data. This is a typical characteristic of decision tree induction. Pruning techniques should reduce this behavior. The representation and generalization ability of the pruned tree shows a more balanced behavior, but it cannot outperform the results of the neural nets.

Table 3
Error rate for design data set and test data set

Name of classifier        Error rate on design    Error rate on test
                          data set in %           data set in %
Binary decision tree
  Unpruned                1.7                     9.8
  Pruned                  4.6                     9.5
n-ary decision tree
  Unpruned                8.6                     12.3
  Pruned                  8.6                     12.3
Backpropagation network   3.0                     6.0
RBF neural net            5.0                     5.0
Fuzzy-ARTMAP-Net          0.0                     9.0
LVQ                       1.0                     9.0

The behavior of the neural nets according to their representation and generalization ability is controlled during the learning process. The training phase was regularly interrupted and the error rate was determined based on the training set. When the error rate decreased after a maximum value, then the net was assumed to have reached its maximal generalization ability.

It is interesting to note that the Fuzzy-ARTMAP-Net and the LVQ have the same behavior with respect to their representation and generalization ability as the decision trees.

The observation for the decision tree suggests another methodology for using decision trees. In data mining, where we mine a large database for the underlying knowledge, it might be more appropriate to use decision tree induction, since it can represent the data well.

The performance of the RBF expressed by the error rate is 4% better than it is for the decision tree. The question is: How can we assess this result? Is 4% a significant difference? We believe the decision must be made based on the application. In some cases it might be necessary to have a 4% better error rate, whereas in other cases it might not have a significant influence.

5.2. Classification quality and class-specific quality

Table 4
Contingency table of the classification result for the decision tree

Classification    Real class index
result            Background    Crack    Undercut    Sum
Background        196           2        1           199
Crack             12            99       22          132
Undercut          0             1        58          59
Sum               208           102      80          390
p_k               0.94          0.97     0.73
p_t               0.98          0.74     0.98

p = 0.90, k = 0.85

Table 5
Contingency table of the classification result for the backpropagation net

Classification    Real class index
result            Background    Crack    Undercut    Sum
Background        194           5        1           200
Crack             1             97       3           100
Undercut          13            0        76          90
Sum               208           102      80          390
p_k               0.93          0.95     0.95
p_t               0.97          0.96     0.85

p = 0.94, k = 0.90

We obtain a clearer picture of the performance of the various methods if we look for the classification quality p_k and the class-specific classification quality p_t, see Tables 4–8. In the case of the decision tree, we observe that the class-specific quality for the class undercut is very good and can outperform the error rate obtained by neural nets. In contrast to that, the classification quality for decision trees is poorer. Samples from the classes "undercut" and "background" are falsely classified into the "crack" class. That results in a low classification quality of the class crack. In contrast

to that, the neural nets show a difference in class-specific recognition. Mostly, the class-specific error for the class crack is considerably better than for the decision trees (e.g. 0.96 for the backpropagation net in Table 5 compared with 0.74 for the decision tree in Table 4). Since a defect crack is more important than the class undercut (which is not a defect but a geometrical indication from the production process of the pipe), it would be better to have the lowest class-specific error for crack. This avoids false alarms and can save a lot of repair costs in real life.

Table 6
Contingency table of the classification result for the radial basis function net

Classification    Real class index
result            Background    Crack    Undercut    Sum
Background        198           2        1           201
Crack             1             100      7           108
Undercut          9             0        72          81
Sum               208           102      80          390
p_k               0.95          0.98     0.90
p_t               0.99          0.93     0.89

p = 0.95, k = 0.92

Table 7
Contingency table of the classification result for the LVQ

Classification    Real class index
result            Background    Crack    Undercut    Sum
Background        183           2        0           185
Crack             7             100      7           114
Undercut          18            0        73          91
Sum               208           102      80          390
p_k               0.88          0.98     0.91
p_t               0.99          0.87     0.80

p = 0.91, k = 0.86

Table 8
Contingency table of the classification result for the fuzzy-ARTMAP net

Classification    Real class index
result            Background    Crack    Undercut    Sum
Background        191           5        1           197
Crack             10            96       10          116
Undercut          7             1        69          77
Sum               208           102      80          390
p_k               0.92          0.94     0.86
p_t               0.97          0.83     0.90

p = 0.91, k = 0.86

5.3. Classification cost

Our decision tree produces decision surfaces that are orthogonal to the feature axes. This method should be able to approximate a non-linear decision surface, of course with a certain approximation error, which will lead to a higher error rate in classification. Therefore, we are not surprised that there are more features involved in the classification, see Table 9. The unpruned tree uses 14 features and the pruned tree uses 10 features. The learned decision tree is a very big and bushy tree with 450 nodes. The pruned tree reduces to 250 nodes but still remains a very big and bushy tree.

The neural nets work on only seven features and, e.g. for the backpropagation net, we have 72 neurons. That reduces the effort for feature extraction and the computation of the classification result drastically.

Table 9
Number of features and number of nodes

Name of classifier     Number of features    Number of nodes
Decision tree
  Unpruned             14                    450
  Pruned               10                    250
Backpropagation net    7                     72 neurons
RBF net                7
LVQ                    7                     51 neurons
Fuzzy-ARTMAP-Net       7

5.4. Learning performance

One of the advantages of induction of decision trees is the fact that they are easy to use. Feature selection and tree construction are done automatically without human interaction. It takes only a

few seconds on a PC 486 until the decision tree has been learned for the data set of 1924 samples used. In contrast to that, neural network learning cannot be handled as easily. The learning time for the neural nets was 15 min for the backpropagation net with 60 000 learning steps, 18 min for the RBF net, 20 min for the LVQ net, and 45 min for the Fuzzy-ARTMAP net on a workstation of type INDY Silicon Graphics (R4400 processor, 150 MHz) with the neural network simulator NeuralWorks Professional II/Plus version 5.0.

A disadvantage of neural nets is that the feature selection is a preliminary step before learning. In our case it was done with a backpropagation net. The parameter significance was determined based on the contribution analysis method (see Section 3). From the 42 features, only the seven most significant ones were selected by the feature selection process. If the feature selection process is omitted, which is possible, then the training time and the size of the neural nets increase considerably.

Although the class distribution in the design data set was not uniform, both methods, neural-net-based learning and decision tree learning, did well on the data set. It is possible that in the case of decision trees the poorer class-specific error rate for crack results from the non-uniform sample distribution. Therefore, in one experiment we used a data set where the number of samples for each of the three classes was the same (420 samples). As a result, we got an error rate that was nearly the same as the one shown in Table 3.

5.5. Explanation capability

One big advantage of decision trees is their explanation capability. A human can understand the rules and can control the decision-making process. A neural-net-based classifier is more or less a black box for humans. However, since the outcome of the learning process is a very bushy tree, this representation is not very readable for humans. Therefore, some researchers favor rule-based representations over tree-structured representations, since rules tend to be more modular and can be read in isolation of the rest of the knowledge base constructed by induction. A compromise is to use decision tree induction to build an initial tree and then derive rules from the tree, thus transforming an efficient but opaque representation into a transparent one (Quinlan, 1987b).

However, the investigation in Perner et al. (1996) showed that the model derived by decision tree induction does not always represent the model of a human expert. The decisions might not appear in the order a human would use them, especially in uncertain domains.

6. Conclusions

We compared decision tree induction to neural nets. For our empirical evaluation, we used a data set of X-ray images that were taken from welding seams. Thirty-six features described the defects contained in the welding seams. Image feature extraction was used to create the final data set for our experiment.

We found that special types of neural nets have a slightly better performance than decision trees. In addition to that, not so many features are involved in the classification, which is a good characteristic especially for image interpretation tasks, since image processing and feature extraction are mostly time-consuming and require special-purpose hardware for real-time processing. However, the explanation capability that exists for trees producing axis-parallel decision surfaces is an important advantage over neural nets. Moreover, the learning process is not so time-consuming; it comprises the two processes necessary when building image interpretation systems: feature selection and learning the classifier. Furthermore, it is easy to handle.

All that shows that the decision on what kind of method should be used is not only a question of the resulting error rate. It is a more complex selection process that can only be based on the constraints of the application. The present paper provides some new aspects for the selection of a classification algorithm for a particular application.

References

Brast, G., Maier, H.J., Knoch, P., Mletzko, U., 1997. Progress report on a NDT Round Robin on austenitic circumferential pipe welds. In: Proceedings 23. MPA Seminar, Vol. 2, Stuttgart, Germany, pp. 42.1–42.16.
Egmont-Petersen, L., 1994. Contribution analysis of multi-layer perceptrons. Estimation of the input sources importance for the classification. In: Pattern Recognition in Practice IV, Proceedings International Workshop, Vlieland, The Netherlands, pp. 347–357.
Eua-Anant, N., Elshafiey, I., Upda, L., Gray, J.N., 1996. A novel image processing algorithm for enhancing the probability of detection of flaws in X-ray images. In: Review of Progress in Quantitative Non-destructive Evaluation, Vol. 15. Plenum Press, New York, pp. 903–910.
Klette, R., Zamperoni, P., 1992. Handbuch der Operatoren für die Bildbearbeitung. Vieweg Verlagsgesellschaft.
Lui, H., Motoda, H., 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, Dordrecht.
Michie, D., Spiegelhalter, D.J., Taylor, C.C., 1994. Machine Learning, Neural Nets and Statistical Classification. Ellis Horwood Series in Artificial Intelligence, Chichester, UK.
Perner, P., Belikova, T.B., Yashunskaya, N.I., 1996. Knowledge acquisition by decision tree induction for interpretation of digital images in radiology. In: Perner, P., Wang, P., Rosenfeld, A. (Eds.), Advances in Structural and Syntactical Pattern Recognition. Lecture Notes in Computer Science, Vol. 1121. Springer, Berlin, pp. 208–219.
Perner, P., Trautzsch, S., 1998. On feature partitioning for decision tree induction. In: Amin, A., Dori, D., Pudil, P., Freeman, H. (Eds.), SSPR98 and SPR98. Springer, Berlin, pp. 475–482.
Pratt, W.K., 1991. Digital Image Processing. Wiley, New York. ISBN 0-471-85766-1.
Quinlan, J.R., 1987a. Simplifying decision trees. Int. J. Man-Machine Studies 27, 221–234.
Quinlan, J.R., 1987b. Generating production rules from decision trees. In: Proceedings of the 10th International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, pp. 304–307.
Quinlan, J.R., 1996. Induction of decision trees. Machine Learning 1, 81–106.
Strang, G., 1989. Wavelets and dilation equations: A brief introduction. SIAM Review 31, 613–627.
Weiss, S.M., Kulikowski, C.A., 1991. Computer Systems that Learn. Morgan Kaufmann, Los Altos, CA.
Zell, A., 1994. Simulation Neuronaler Netze. Addison-Wesley, Bonn, Paris.
Zscherpel, U., Nockemann, C., Mattis, A., Heinrich, W., 1995. Neue Entwicklungen bei der Filmdigitalisierung. In: DGZfP-Jahrestagung, Aachen, Tagungsband.
