Clark 2017
Abstract
extraction component and two artificial neural networks, specifically a multilayer perceptron
by LA TROBE UNIVERSITY on 10/16/17. For personal use only.
(MLP) and a self-organizing map (SOM). Leaves are detected in the specimen images using
deformable templates and evolutionary algorithms, and then a level set method is used to
separate the leaf outline from the background. Length and width and other shape
measurements are then automatically extracted, together with measurements of the teeth on the
leaf margin. The MLP is used to filter out poor quality data before passing the relatively
information-rich high quality data to the SOM. The SOM then performs unsupervised
classification, creating a topological map, which is then used for the identification of unknown
specimens. The system is thus able to identify plants from leaf shape alone, from images of
traditional herbarium specimens. In addition, as in our earlier work, the system is able to refer
difficult specimens to a botanist for further expert examination, in cases of uncertainty,
together with suggestions as to their identity. Thus a methodology is presented here to provide
a practical way for taxonomists to use a combination of different neural networks and image
processing as an automated tool for plant identification. A case study is provided using data
extracted from images of specimens of four species of the tree genus Tilia in the herbarium of the
Royal Botanic Gardens, Kew, UK. It was found that over half the leaves were identified
correctly to species level by the SOM using 21 automatically extracted leaf shape
parameters, without test set leaf quality filtering. A detailed comparison is carried out with
updated results from earlier studies using the MLP alone. After applying filters for leaf quality,
and referral, although the results for specimen identification from the SOM are good (65%),
those using the MLP were found to be much better (95%).
INTRODUCTION
Plant classification and identification are both important for those who need to
know which species they are dealing with, e.g., because some species contain
compounds of medical interest, and for those who are interested in establishing
levels of biodiversity, for instance to investigate changes in plant distributions due to
climate change. It is still common to carry out biological identification using a
printed "taxonomic key", although there is a trend towards computer-based or
computer-aided systems [1, 2]. Traditional keys are followed manually, the user
making choices from contrasting statements, usually concerning morphology,
1 Nature Inspired Computing and Engineering (NICE) Research Group, Department of Computer
Science, University of Surrey, Guildford, GU2 7XH, UK; j.y.clark@surrey.ac.uk. Partly funded by
the Leverhulme Trust (grant F/00 242/H): MORPHIDAS (Morphological Herbarium Image Data
Analysis) Project.
2 Signal, 5th Floor, 32-38 Leman Street, London E1 8EW.
3 Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AB, UK.
using the validation set is necessary to enable training termination before over-
The images referred to are actual photographs of herbarium specimens (see Fig. 1:
Tilia platyphyllos PLA0136); i.e., actual physical dried and pressed plant specimens,
mounted on stiff white cartridge paper/card, kept in paper folders and stored in
cabinets to use as voucher specimens for identification and other studies.
To date, the only published computer-based identification studies of Tilia are
ones that separate one species of Tilia (T. cordata) from 12 other species of woody
trees (in 12 other genera) by means of a neural network using leaf image data [27],
and those of the authors [8, 10, 11, 26, 21], who used a neural network for
identification of a number of Tilia species. It is usually much simpler to distinguish
between species in different genera (or other higher level taxa, for that matter). The
real challenge is to distinguish between closely related species in the same genus, as
is the aim here. The original work by Clark [8, 10, 11] involved computer-based
identification of the 19 Tilia species cultivated in Europe. However, character states
Biological Shape Analysis Downloaded from www.worldscientific.com
were then recorded in the traditional way, by manual observation and
measurement. In more recent work [12, 13, 21] and that presented here, information
is extracted almost totally automatically from images of complete herbarium
specimens.
A review of other plant species identification using image processing methods
is given in [5]. This includes leaf and flower shape analysis, leaf texture analysis and
vein analysis, as well as other species identification methodologies. Some work by
other authors similar to that presented here uses an MLP to distinguish between
species and varieties of the genus Banksia in the family Proteaceae [6]. Here,
software was used to automatically extract characters from leaves, such as area and
roundness, for identification purposes. However, the characters obtained were
extracted from images of single undamaged leaves as opposed to whole herbarium
specimens, making the character extraction more straightforward. In this study, leaf
images are automatically extracted from whole herbarium specimen images, and
since that is a much more difficult task, where leaves are often incomplete, folded, or
overlapping, a large number of specimens of each species are needed. Therefore, the
scope of this project is currently restricted to four species of Tilia, as these
were the only ones present in sufficient numbers in the Kew herbarium. The
work is an extension of earlier work involving similar data from the same specimens
and processed in the same way [21] using an MLP, but with the addition of a
methodology for removing poor data from the training set, and subsequent
production of a self-organizing map (SOM) for topological classification and
subsequent species identification. A detailed comparison is also made between the
effectiveness of the SOM technique and that of using the MLP alone.
In contrast to some earlier, similar studies [12], preliminary work here showed that adding
Gaussian noise did not improve the overall classification accuracy, so none was
added.
A list of sources of the material is available in Tables 14 and 15; the species
acronyms are given in Tables 5 through 8, Tables 10 through 13 and Figures 3, 4 and
5. The original photographs of herbarium specimens are available on request from
the Herbarium of the Royal Botanic Gardens, Kew.
Neural Networks
Multilayer Perceptron (MLP)
The initial data filtering stage, intended to enable removal of the least useful
training data, involved using a simple feed-forward MLP with one input layer, one
hidden layer, and one output layer. One input node corresponded to each character
(attribute) and one output node was assigned to represent each of the four target
species. Therefore there were 41 input nodes and 4 output nodes (as in earlier work,
[21]). The network performance was optimized by varying the number of hidden
nodes. There were no connections between nodes in the same layer, and no recurrent
connections.
The network architecture is shown in Fig. 2 (though the actual number of nodes
is different from that shown in the Figure). The input vectors were independently
normalized for each character over all training records between ±0.9. This was to
help reduce the training time required, and to minimize unintentional character
weighting.
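The per-character scaling can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation; the function names `fit_scaler` and `apply_scaler` and the toy data are assumptions made here. The minimum and maximum of each character are taken from the training set only and stored, so that exactly the same linear scaling can be reapplied to other records.

```python
import numpy as np

def fit_scaler(train, lo=-0.9, hi=0.9):
    """Record the per-character (column) minimum and maximum of the training set."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0), lo, hi

def apply_scaler(data, scaler):
    """Map each character linearly so that the training range becomes [lo, hi]."""
    col_min, col_max, lo, hi = scaler
    data = np.asarray(data, dtype=float)
    # Guard against constant characters (assumption: such columns map to lo).
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return lo + (data - col_min) * (hi - lo) / span

# Toy data: three training leaves, two characters (hypothetical values).
train = np.array([[10.0, 2.0], [20.0, 4.0], [30.0, 8.0]])
scaler = fit_scaler(train)
scaled_train = apply_scaler(train, scaler)
# A record from another data set is scaled with the training minimum and maximum.
scaled_other = apply_scaler(np.array([[25.0, 5.0]]), scaler)
```

Records outside the training range will fall outside ±0.9 under this sketch; how such values were handled in the study is not stated.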
The minimum and maximum values for each character over the entire training
set were used for identical scaling by normalization of the validation and test data
sets. The network weights were initialized to small random values between ±0.5 [3],
and the presentation order of input vectors (leaf data records) was randomized
between epochs (each epoch being one run through the complete training set). A bias
input of 1.0 was used. Further details on parameters and training algorithms used are
given in [9]. The error value reported in the results is the Squared Error Percentage
(E) [19], with corrections [8], given by:
(1)
where omax and omin are the maximum and minimum output values used in training,
here 0.9 and 0.1 respectively. N is the number of output nodes (equal to the number
of species), and P is the number of records (in this case leaf data records) in the data
set under consideration. opi is the actual output at output node i when input pattern p
is presented. tpi is the target (desired) output at output node i when pattern p is
presented.
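As an illustration of how these symbols combine, the sketch below computes a squared error percentage in one commonly used form, E = 100 (omax - omin) / (N P) multiplied by the sum over all patterns p and output nodes i of (opi - tpi)^2. This form is an assumption made here: the corrections of [8] are not reproduced in this section, so the value used in the study may differ. The function name is hypothetical.

```python
import numpy as np

def squared_error_percentage(outputs, targets, o_min=0.1, o_max=0.9):
    """Squared error percentage in an assumed, commonly used form.

    outputs, targets: arrays of shape (P, N), one row per record,
    one column per output node (species).
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    P, N = outputs.shape
    return 100.0 * (o_max - o_min) / (N * P) * np.sum((outputs - targets) ** 2)

# One record, four output nodes; the network fires the wrong node.
targets = np.array([[0.1, 0.9, 0.1, 0.1]])
outputs = np.array([[0.9, 0.1, 0.1, 0.1]])
error_wrong = squared_error_percentage(outputs, targets)
error_perfect = squared_error_percentage(targets, targets)
```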
Initially, training was carried out using a learning rate of 0.1, no momentum, a
variable number of nodes in the hidden layer and a fixed random seed. Neural
networks have a tendency to become 'overfitted' if too much training is carried out
and so they then perform badly when presented with previously-unseen data; i.e.,
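\[ \mathrm{MSE}_{correct} = \frac{1}{N_C} \sum_{p=1}^{N_C} \left( \frac{(o_p - t_p)^2}{(o_{max} - t_{max})^2} \right), \tag{2} \]

\[ \mathrm{MSE}_{wrong} = \frac{1}{N_W} \sum_{p=1}^{N_W} \left( \frac{(o_p - t_p)^2}{(o_{max} - t_{max})^2} \right), \tag{3} \]

\[ \mathrm{Threshold} = \frac{\mathrm{MSE}_{correct} + \mathrm{MSE}_{wrong}}{2} \tag{4} \]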
Then the Mean Squared Error (MSE), shown in Eq. 2, was calculated for those
leaves the network identified 'correctly', where NC is the total number of leaves
'correctly' identified, and similarly the Mean Squared Error (Eq. 3) for those leaves
that were 'wrongly' identified, where NW is the total number of leaves 'wrongly'
identified. The mean of these two values was then used as a threshold of trust
(Eq. 4). The principle here is that if the system cannot identify a leaf well even if it is
in the training set, then it cannot be expected to give a trustworthy identification of a
similar leaf in the independent test set either. These thresholds are determined
separately for each species, because some species will be easier to identify than
others. Then these thresholds of trust are considered in conjunction with the network
results when using the test dataset. Test records which result in a Mean Squared
Error (MSE) greater than the threshold are regarded as 'Bad' leaves and not counted
in the final analysis (see later).
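The threshold of trust and its use on test leaves can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the per-leaf MSE values are taken as already computed, training leaves are grouped by their labelled species, and a test leaf is compared with the threshold of the species it was identified as. The last two points are assumptions, since the text does not state them explicitly.

```python
from collections import defaultdict

def thresholds_of_trust(training_results):
    """training_results: iterable of (species, mse, correct) for training leaves.

    Returns, per species, the mean of the mean MSE of 'correctly' identified
    leaves and the mean MSE of 'wrongly' identified leaves.
    """
    sums = defaultdict(lambda: {True: [0.0, 0], False: [0.0, 0]})
    for species, mse, correct in training_results:
        group = sums[species][bool(correct)]
        group[0] += mse
        group[1] += 1
    thresholds = {}
    for species, groups in sums.items():
        (sum_c, n_c), (sum_w, n_w) = groups[True], groups[False]
        # Assumption: a species lacking either group gets no threshold.
        if n_c and n_w:
            thresholds[species] = (sum_c / n_c + sum_w / n_w) / 2.0
    return thresholds

def leaf_quality(mse, identified_as, thresholds):
    """'Bad' if the MSE exceeds the threshold for the species, else 'Good'."""
    return "Bad" if mse > thresholds[identified_as] else "Good"

# Hypothetical training results for one species.
training_results = [("COR", 0.01, True), ("COR", 0.03, True), ("COR", 0.10, False)]
thresholds = thresholds_of_trust(training_results)
quality_low = leaf_quality(0.05, "COR", thresholds)
quality_high = leaf_quality(0.07, "COR", thresholds)
```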
In this new study, as a method of preprocessing before using the SOM, these
thresholds of trust were determined, by using the MLP as described above. Then, the
usefulness of the principle is extended to help decide which of the training data
records are 'good' and which are 'bad'. In short, these thresholds are used to increase
the quality of the training set. Only those records whose MSE was below the
respective threshold, when showing the training set to the trained network, are
retained in the training set to use for classification using the Self-Organizing Map
(SOM). Here one refers to the resultant smaller file as a distilled training set -
because of the analogy to distilling liquids such as alcohol to produce a more
concentrated result.
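The distillation step amounts to a simple filter, sketched below under stated assumptions: the record layout (dictionaries with `species` and `mse` keys), the function name and all numerical values are hypothetical and serve only to show the principle.

```python
def distil_training_set(records, thresholds):
    """Keep only the training records whose MSE is below their species threshold."""
    return [r for r in records if r["mse"] < thresholds[r["species"]]]

# Hypothetical records: MSE obtained by showing each training leaf to the trained MLP.
training = [
    {"leaf": "COR-a", "species": "COR", "mse": 0.02},
    {"leaf": "COR-b", "species": "COR", "mse": 0.09},
    {"leaf": "PLA-a", "species": "PLA", "mse": 0.04},
]
thresholds = {"COR": 0.06, "PLA": 0.05}  # illustrative values only
distilled = distil_training_set(training, thresholds)
```

Only the retained records would then be passed on for training the SOM.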
neighborhood size reached zero, 1000 further iterations were performed for
consolidation. During training, the neighborhood region was not allowed to extend
beyond the boundary of the array.
For every sample x from the total I training records, the best-matching, or
‘winning’ neuron i in the Kohonen (competitive) layer at presentation t was chosen
as follows:
where j = Kohonen node 1, 2, .....N and the Euclidean norm is calculated from:
Training in the competitive layer was carried out at each presentation (t) as follows:
\[ E_{win} = \frac{1}{I} \sum_{j=1}^{I} \left\| x(t) - w_i \right\| \tag{8} \]
This is a measure of the average distance between a data record (taxon) and the
winning node in the multidimensional data space.
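A minimal sketch of this measure is given below, assuming the standard Kohonen choice of the winning node as the one whose weight vector is nearest to the input in Euclidean distance. The function names, the tiny three-node layer and the data are hypothetical and are not taken from the study.

```python
import numpy as np

def winning_node(x, weights):
    """Return (index, distance) of the node whose weight vector is nearest to x."""
    distances = np.linalg.norm(weights - x, axis=1)
    best = int(np.argmin(distances))
    return best, float(distances[best])

def mean_ewin(records, weights):
    """Average Euclidean distance between each record and its winning node."""
    return float(np.mean([winning_node(x, weights)[1] for x in records]))

# Toy competitive layer: 3 nodes, 2 characters per record.
weights = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
records = np.array([[0.1, 0.0], [0.9, 1.0], [0.0, 0.7]])
winner_index, winner_distance = winning_node(records[2], weights)
ewin = mean_ewin(records, weights)
```

A lower value indicates that the trained map lies closer to the data, which is the criterion used to compare runs.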
Tests were then performed as follows:
1) The random seed was set to an arbitrary fixed value. Several runs were carried
out using different period lengths, presented in Table 2. The run that resulted in
the lowest mean Ewin value, when the entire training set was again presented
to the trained network, was declared the winner. This ensured that the
topological map with the closest fit to the original data was chosen.
2) Having set the period to the optimized value at the end of the consolidation,
further tests were carried out in order to determine an optimized gain value
(learning rate). The results used for optimization are shown in Table 3.
3) Having determined the optimized period length and gain, the tests were
repeated using 30 different random seeds. A topological map (Fig. 3) was
constructed from the run with the lowest mean Ewin by presenting the entire
training set again to the trained network, and labelling the winning node for
each leaf data record accordingly, so that each grid square shows which
species resulted in the greatest number of winning nodes at that particular grid
position in the competitive layer. Fig. 4 shows more details, giving the actual
number of winning nodes for each species, while Fig. 5 displays the actual
number of winning nodes with test data.
Key:
cordata (COR) platyphyllos (PLA)
americana (AME) tomentosa (TOM)
no species
Fig. 3. 10 x 10 SOM topological map populated with training and validation set data
showing the winning species for each competitive layer node (grid location).
4) The validation set that had previously been used for the MLP training was then
presented (with a reduced number of attributes) to the trained SOM and the
tally for each named species added to the corresponding winning node for each
data record. Although the validation set was not actually used in training, this
was a way for the data available from the validation set to enhance the SOM
model before it was used for identification.
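The labelling in steps 3 and 4 can be sketched as a per-node tally, from which the winner-takes-all view is derived. This is an illustrative outline only: the function names and the example positions are hypothetical, the winning-node positions are taken as already computed, and ties between species are resolved here in favour of the first species in the list, which is an assumption.

```python
import numpy as np

SPECIES = ["COR", "PLA", "AME", "TOM"]

def build_tally(winners, labels, rows=10, cols=10):
    """Count, for each grid square, how many records of each species won there.

    winners: (row, col) position of the winning node for each record.
    labels: species code of each record.
    """
    tally = np.zeros((rows, cols, len(SPECIES)), dtype=int)
    for (r, c), label in zip(winners, labels):
        tally[r, c, SPECIES.index(label)] += 1
    return tally

def winner_takes_all(tally):
    """Species with the most records in each square; None where no record won."""
    grid = np.full(tally.shape[:2], None, dtype=object)
    for r, c in zip(*np.nonzero(tally.sum(axis=2))):
        grid[r, c] = SPECIES[int(np.argmax(tally[r, c]))]
    return grid

# Hypothetical winning positions for training and validation records.
tally = build_tally([(0, 0), (0, 0), (9, 9)], ["COR", "COR", "TOM"])
tally += build_tally([(0, 0)], ["PLA"])  # validation records are added to the same tally
grid = winner_takes_all(tally)
```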
Key:
cordata (COR) platyphyllos (PLA)
americana (AME) tomentosa (TOM)
Fig. 4. 10 x 10 SOM topological map populated with training set (and validation set) data
showing the number of training data records (leaves) for each species placed in each grid
location
Key:
cordata (COR) platyphyllos (PLA)
americana (AME) tomentosa (TOM)
Fig. 5. 10 x 10 SOM topological map populated with test set data showing the number of
test data records (leaves) for each species placed in each grid location.
Having constructed the SOM classification model, each single leaf record in the
test data set was presented in turn to the trained network model, and the
corresponding winning node in the competitive layer of the trained topological map
(considered as a grid) noted. At the position of this winning node, the total number
of previously recorded training set winners was counted for each species, taking into
account a neighborhood radius of 1 (that is, the winning node plus one square radius
surrounding it), thus considering a 3x3 window. Here the total was center weighted,
with each count for the actual winning node multiplied by 2; whereas those in the
surrounding 1 square neighborhood counted singly. Parts of a neighborhood
extending beyond the limit of the grid were ignored. The highest total was then said
to be the identification decision by the system for that particular test data leaf record
(see Table 9). The confidence value shown is different from that in the previous
study [21]. In this paper, it is the winning species tally divided by the total for all
species for that leaf in the target 3x3 window as shown in Eq. 9 below.
\[ \mathrm{Confidence} = \frac{\text{Number of winning nodes for chosen species in window}}{\text{Total number of winning nodes for all species in window}} \tag{9} \]
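The window count and confidence ratio can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name and the toy tally are hypothetical, and it is assumed here that the center-weighted totals are used in both the numerator and the denominator of the confidence ratio.

```python
import numpy as np

SPECIES = ["COR", "PLA", "AME", "TOM"]

def identify(tally, row, col, center_weight=2):
    """Identify a test leaf from the training tally around its winning node.

    tally: array of shape (rows, cols, species) holding training set winners.
    Squares of the 3x3 window that fall outside the grid are ignored.
    """
    rows, cols, _ = tally.shape
    totals = np.zeros(len(SPECIES), dtype=float)
    for r in range(max(0, row - 1), min(rows, row + 2)):
        for c in range(max(0, col - 1), min(cols, col + 2)):
            weight = center_weight if (r, c) == (row, col) else 1
            totals += weight * tally[r, c]
    if totals.sum() == 0:
        return None, 0.0  # assumption: no decision when the window is empty
    best = int(np.argmax(totals))
    return SPECIES[best], float(totals[best] / totals.sum())

# Toy tally: the winning node lies in a corner, so only four squares are counted.
tally = np.zeros((10, 10, 4), dtype=int)
tally[0, 0, 0] = 3  # three COR training leaves on the winning node itself
tally[0, 1, 1] = 2  # two PLA leaves on a neighboring node
tally[1, 1, 0] = 1  # one COR leaf on a neighboring node
species, confidence = identify(tally, 0, 0)
```

In this toy case COR scores 2 x 3 + 1 = 7 and PLA scores 2, so the decision is COR with a confidence of 7/9.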
are incorporated here in both the MLP results (Table 4, which supersedes the table in
the earlier paper) and the SOM results (Table 9). In addition, since the primary aim
of such an exercise is to identify actual specimens, rather than individual leaves, it is
of value to consider specimens as a whole, in which case the identifications of the
individual leaves contribute to the specimen identification. Here only the
identification decisions of all the 'Good' leaves on a specimen contribute to a winner-
takes-all identification decision for the whole specimen.
As in the previous study [21], the concept of referral is used. This means that if
all, or the majority of the leaves on a specimen are 'Bad', then any decision made by
the system is said to be only tentative, and not to be trusted, so the decision should
be 'Referred' to an experienced botanist. It is not surprising that
decisions made using 'Bad' leaves mostly have poor confidence levels. This principle
ensures that the system is used conservatively - that is, it is most useful if the user
can trust the system for the decisions that it says it can trust, so it is more sensible to
err on the side of caution, and refer all decisions of which the system is unsure,
whilst still maintaining a high level of automation.
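The specimen-level decision and referral rule can be sketched as below. The function name is hypothetical, and two points are assumptions made for this illustration: a specimen with equal numbers of 'Good' and 'Bad' leaves is treated as 'Referred', and a specimen with no 'Good' leaves takes its tentative identification from its 'Bad' leaves.

```python
from collections import Counter

def identify_specimen(leaves):
    """leaves: non-empty list of (identified_species, quality) pairs for one specimen.

    Returns (species, status), where status is 'Trusted' or 'Referred'.
    """
    good = [species for species, quality in leaves if quality == "Good"]
    bad = [species for species, quality in leaves if quality == "Bad"]
    votes = good if good else bad  # fall back to 'Bad' leaves only if none are 'Good'
    species = Counter(votes).most_common(1)[0][0]  # winner-takes-all
    status = "Trusted" if len(good) > len(bad) else "Referred"
    return species, status

# Hypothetical specimens: one with three 'Good' leaves, one with a single 'Bad' leaf.
trusted_example = identify_specimen([("COR", "Good"), ("COR", "Good"), ("COR", "Good")])
referred_example = identify_specimen([("AME", "Bad")])
```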
RESULTS
The MLP was exactly the same as that used in previous work [21] using the
same data, with optimized parameters of 36 hidden nodes and a learning rate
of 0.065. Since the optimized random seed used was also the same, the trained
network was identical to that obtained previously. Details of the parameter
optimization are given in [21].
A similar table to Table 4, showing the leaf and specimen identifications using
the MLP alone, was presented in the previous paper [21]. Although there were some
errors relating to Leaf Quality in the previously published table, these are now
corrected and the results are shown here in Table 4, which should be considered to
be a fully updated version. Each Image ID represents a different herbarium specimen
image; Leaf ID is a unique number for a given leaf on that specimen. For leaves, 'Ident as'
is the majority rule identification for that leaf; Quality is 'Good' if the Mean Squared
Error (MSE) for that leaf is less than or equal to the threshold of trust [21] for that
species, and 'Bad' if the MSE for that leaf is greater than the threshold of trust
for that species.
                     Leaves                Specimens
Image ID   Leaf ID   Ident as   Quality    Ident as   Trust/Refer   Confidence
COR0449    1         COR        Good       COR        Trusted       100
COR0434    1         AME        Bad        AME        Referred      100
COR0421    1         COR        Good       COR        Trusted       93.3
COR0421    2         COR        Good
COR0421    3         COR        Good
COR0426    1         PLA        Bad        PLA        Referred      100
COR0463    1         COR        Good       COR        Trusted       100
COR0463    2         COR        Good
Regarding the whole specimens, 'Ident as' is the majority rule species
identification decision, considering all the 'Good' leaves on that specimen (or 'Bad'
leaves, if there are no good leaves on the specimen); 'Trust/Refer' relates to whether
to trust the decision of the system, or to refer the final decision to a botanist, based
on the criteria described earlier. Species identifications in bold are those which are
considered 'correct', that is, they agree with the label of the herbarium folder in
which the specimen is kept.
A misidentification matrix for all leaves in the test set is shown in Table 5 with
the majority identifications underlined and the ‘correct’ identification shown in bold.
This is different from the similar table in the previous paper [21], because earlier the
figures were calculated from a population of results; whereas here one is using the
result of showing the test data set to the best optimized trained MLP network (as
determined by the MSE on the validation dataset). This is to enable a more practical
and realistic comparison between the results using the SOM and those obtained
using the MLP alone. Table 6 shows a similar MLP-based matrix, but instead
considering only those leaves said to be 'Good' using the threshold of trust described
earlier. Since a primary practical aim of using such a system is to identify actual
specimens, rather than just individual leaves, a misidentification matrix for all
specimens is provided in Table 7, and a similar table considering only 'Trusted'
specimens (where the number of constituent 'Good' leaves is greater than the number
of 'Bad' leaves - see earlier) is shown in Table 8.
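A misidentification matrix of this kind can be computed as sketched below. The function name is hypothetical and the example is illustrative: counts of 1, 6, 1 and 3 among 11 leaves are chosen because they reproduce the percentages in the AME row of Table 5, but the actual counts are not listed in this section.

```python
import numpy as np

SPECIES = ["COR", "PLA", "AME", "TOM"]

def misidentification_matrix(labels, identified):
    """Row-normalized percentages: rows are herbarium labels, columns are decisions."""
    counts = np.zeros((len(SPECIES), len(SPECIES)), dtype=float)
    for label, decision in zip(labels, identified):
        counts[SPECIES.index(label), SPECIES.index(decision)] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with no records are left as zeros rather than dividing by zero.
    return np.round(100.0 * counts / np.where(row_totals == 0, 1, row_totals), 1)

# Illustrative input: 11 leaves labelled AME with the decisions described above.
labels = ["AME"] * 11
identified = ["COR"] + ["PLA"] * 6 + ["AME"] + ["TOM"] * 3
matrix = misidentification_matrix(labels, identified)
ame_row = matrix[SPECIES.index("AME")]
```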
Table 5. ALL LEAVES test dataset leaf misidentification matrix (MLP only).
LEAVES Identification %
Label COR PLA AME TOM
COR 80.0 10.0 10.0 0.0
PLA 25.0 66.7 0.0 8.3
AME 9.1 54.5 9.1 27.3
TOM 0.0 12.5 12.5 75.0
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench
Table 6. GOOD LEAVES ONLY test dataset leaf misidentification matrix (MLP
only).
LEAVES Identification %
Label COR PLA AME TOM
COR 100.0 0.0 0.0 0.0
PLA 0.0 100.0 0.0 0.0
AME 0.0 0.0 100.0 0.0
TOM 0.0 0.0 14.3 85.7
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench
SPECIMENS Identification %
Regarding the SOM, optimization results are presented for period length
(Table 2) and gain parameters (Table 3). In each case, the number of iterations and
the Ewin error value at the point of the termination of training, after consolidation, are
provided. The optimized period length was found to be 32 and optimized gain was
0.40. Fig. 3 shows the resultant topological map with the winning species
for each node in the competitive layer. Fig. 4 shows the number of training data
records for each winning node, for each species.
Turning now to the detailed map of the trained classification model in Fig. 4,
showing the actual number of winning nodes for each species in each competitive
layer grid square, we can see that there is more overlap with respect to species in any
given square and its immediate neighborhood, which is obscured by the winner-takes-all
view shown in Fig. 3. The test set data map shown in Fig. 5 indicates the location
of the winning nodes when presenting the test set to the fully trained network, and
shows the positions of all the winning nodes for all the individual leaf records in the
test set. These positions on the grid can then be directly correlated with the same
positions on the main model produced from the training (and validation) sets shown
in Fig. 4. The position of the winning node for each test data record thus forms the
center of a 3x3 window which can be logically placed in position over the training
(and validation) set map, in which the numbers of training set winning nodes can be
counted separately for each of the four species. Center weighting is used because,
presumably, the actual winning node position is more important and relevant than
the neighboring positions - though the technique is designed to show relationships
by adjacent topology, so the immediate neighborhood should have relevance.
Now, one can consider the identifications shown in Table 9 resulting from
running the test dataset through the fully trained, consolidated and optimized SOM.
Table 9. Leaf and specimen identifications (SOM with leaf quality from MLP).
                     Leaves                Specimens
Image ID   Leaf ID   Ident as   Quality    Ident as   Trust/Refer   Confidence
COR0449    1         COR        Good       COR        Trusted       53.5
COR0434    1         AME        Bad        AME        Referred      52.6
COR0421    1         COR        Good       COR        Trusted       65.8
COR0421    2         COR        Good
COR0421    3         COR        Good
COR0426    1         PLA        Bad        PLA        Referred      38.7
COR0463    1         COR        Good       COR        Trusted       60.5
COR0463    2         COR        Good
COR0463    3         COR        Good
then 11 out of all the 24 specimens (45.8%) are identified correctly. If one considers
Table 10. ALL LEAVES test dataset leaf misidentification matrix (SOM).
Table 11. GOOD LEAVES ONLY test dataset leaf misidentification matrix (SOM).
Table 12. ALL SPECIMENS test dataset specimen misidentification matrix (SOM).
SPECIMENS Identification %
Table 13. TRUSTED SPECIMENS ONLY test dataset specimen misidentification matrix
(SOM).
SPECIMENS Identification %
Label COR PLA AME TOM
COR 100.0 0.0 0.0 0.0
PLA 0.0 100.0 0.0 0.0
AME 0.0 0.0 0.0 100.0
TOM 20.0 0.0 20.0 60.0
COR = Tilia cordata Mill. PLA = Tilia platyphyllos Scop.
AME = Tilia americana L. TOM = Tilia tomentosa Moench
CONCLUSIONS
This paper builds on earlier studies by the authors that used neural networks for
identification and classification of Tilia herbarium specimens from images. The
principle of using an MLP to help filter out relatively noisy/bad data from a training
set before feeding into a SOM seems useful, since this is clearly a way of removing
much potentially inhibitory, badly damaged leaf data. There is much noise in these
data because the leaf shape information was extracted automatically from images of
herbarium specimens.
In the previous study [21], the concept of automated referral was introduced, in
which it is accepted that some specimens are not identifiable by the system and these
are instead referred to an expert botanist. This is reasonable since here one is
attempting to identify specimens automatically from images, with minimal human
intervention, and also only using leaf outlines. The quality of herbarium specimens
can vary considerably, and if there are many leaves overlapping, it is difficult to
separate individual leaves by image processing. Furthermore, one aim of this study is
to show the use of existing herbaria as repositories of information and accumulated
expert knowledge resulting from many years of human experts studying and naming
the specimens. This study does indeed show that it is possible to use such data to
help identify some unknown specimens, without the help of expert knowledge. For
instance, a trainee who is not an expert in the plant group concerned could be
helped by this kind of system, and could then refer difficult identifications to an expert on the
group.
Table 14. Herbarium specimens used for testing (RBG, KEW Herbarium, UK).
Material continued in Table 15.
Image ID   Collection: Location                                                  Date
COR0449    Bois de Premieres (Cote d'Or), France                                 12 July 1875
COR0434    Kings Wood, Yatton, N. Somerset, UK                                   19 Sept 1906
COR0421    Little Malvern, Worcestershire, UK                                    21 June 1914?
COR0426    Egham, Surrey, UK                                                     19 August 1917
COR0463    Austria
COR0427    Western edge of Chalkney Wood, near Earls Colne, Essex, UK 52/872276  14 August 1994
PLA0151    Slopes of Chanctonbury Hill, West Sussex, UK                          7 July 1926
PLA0134    Near Wargrave, Berkshire, UK                                          5 August 1946
PLA0136    In garden in Malvern Road, Cheltenham, Gloucs, UK                     12 July 1945
PLA0169    Bruton, Somerset, UK                                                  July 1936
PLA0146    Near River Mole, Stoke D'Abermon, Surrey, UK                          21 July 1918
PLA0171    Oxford Road, Woodstock, UK (planted)                                  25 June 1945
AME1150    Quebec, Canada                                                        June 1825
AME1148    Lake region and Ontario, Canada                                       10 July 1877
AME1156    St. Louis, Missouri, USA                                              July 1841
AME1129    Noel, Missouri, USA                                                   8 Oct 1909
AME1136    Gordon Hills, Gibson County, Indiana, USA                             6 July 1915
AME1128    Noel, Missouri, USA                                                   25 April 1909
TOM0060    Hungary                                                               7 Oct 1927/1947?
TOM0713    Northern Syria                                                        Aug? 1908
TOM0101    Cultivated At RBG Kew, UK                                             Oct 1881
TOM0014    Northern Albania                                                      9 July 1918
TOM0097    Arboretum, RBG Kew, UK                                                16 Sept 1884
TOM0039    Therapia, Turkey                                                      July 1862
The general topological map (Fig. 3) shows the overall
classification view (of the training and validation datasets), with a winner-takes-all
view of each grid square, where the species indicated is the one with the most
winning nodes centered on that square. It can be seen that, as expected, there has
been some separation into areas relating to individual species, although this is not
perfect. Roughly speaking, there is a concentration of T. americana (AME) in the
bottom right of the map; T. cordata (COR) tends to be on the left side of the map; T.
platyphyllos (PLA) tends to be on the right side of the map, but there is also a
significant concentration at the top left; T. tomentosa (TOM) seems to be more
randomly distributed, although it avoids the top left of the map. It is not too
surprising that it is difficult to separate these species, given that they are closely
related, and one is only using automatically extracted characters from leaves on
whole herbarium specimens.
Table 15. Herbarium specimens used for testing (RBG, KEW Herbarium, UK).
Continuation of Table 14.
This does justify the technique used here, however, of including the immediate
neighborhood, with center weighting, when calculating identification counts, as this
should make the best use of the information included in this kind of incomplete
separation in the topological map.
In the SOM misidentification matrices (Tables 10-13), three of the
four species are mostly correctly placed, with at least half of the leaves, often more,
identified as the appropriate species. It is clear, however, that Tilia americana is not
identified well by the system. Only one leaf out of 11 is good; all the rest are bad, so
most specimens are ‘Referred’, and the one ‘Trusted’ specimen is incorrect. This is
likely to be due to the inherently noisy nature of the data, and inaccuracies in outlines
derived automatically, especially when leaves are overlapping in the specimen,
which is often the case because of the species' mostly large leaves. In addition, it is
possible that problems are caused by this species being the one with the fewest data
records in the training set - the number of leaves may be below a critical threshold
when compared with the other species. The system is still useful, though, because
almost all of the T. americana leaves in the test set were automatically classified as
'Bad' by the MLP stage of the process, and could therefore be discounted from the
final analysis. Another influence is likely to be the number of leaves that the system
can successfully extract from images of whole specimens - usually this is only about
2 or 3, and the decision of the system is thus likely to be heavily influenced by the
proportion of immature to mature leaves extracted.
Considering the incorrect 'Trusted' identifications made by the
system:
AME1150 is from Quebec, Canada, so it cannot be T. tomentosa. It is labelled
T. pubescens, which is a synonym of T. caroliniana, which in the strict sense
grows further south. However, its collection location suggests that it is
just possible that it is T. caroliniana subsp. heterophylla (Venténat) Pigott
(synonym: T. americana var. heterophylla (Venténat) Loudon), which does just
extend that far north, or even a hybrid T. americana × T. caroliniana ([7],
p. 261). Both of these are out of the scope of this study.
TOM0713 is a specimen from Syria, so it is almost certainly correctly labelled as T.
tomentosa. However, the leaves of this particular specimen are very similar in
outline shape to those of the American species, so it is not too surprising that
the system is confused.
TOM0039 is almost certainly correct, given that the specimen was collected
near Istanbul in Turkey. However, the leaf shape of T. tomentosa is very
variable, and this particular specimen has leaves of very similar shape to those
of T. cordata. So it is understandable that the system identified it to be that
species on leaf shape characters alone.
Also, many test specimens 'Referred' to a botanist had problems that would make
the decision of the system understandable:
COR0434 is a stump sprout of T. cordata. As stated in the previous paper [21],
it is extremely difficult, if not impossible, for a human to identify leaves on
shoots sprouting from Tilia tree stumps, as their leaf morphology is often
extremely different from the normal canopy leaves. Indeed, all existing keys to
identify Tilia species state that they are for identifying shoots from the canopy,
and also often specify flowering or fruiting shoots.
corrected name was therefore attached to the sheet. When this specimen is
shape that would become more apparent with a greater number of species, it is likely
ACKNOWLEDGMENTS
Grateful thanks are due to the Trustees of the Royal Botanic Gardens, Kew for
permission to study and photograph specimens in the Herbarium, and to include the
photographs here. Also, gratitude is due, as always, to Donald Pigott for earlier
encouragement and especially for writing his wonderful monograph on Tilia. Thanks
also go to Lilian Tang, Y. Hu and J. Jin for help with extraction algorithms, and to
Kulamagul Mahendrarajah for development of the Java-based shell for the MLP
program. Also, thanks are due to Scott Notley for work with the character extraction
software and preparation of datafiles. Many thanks also go to Katherine Clark for
help with the Figures.
REFERENCES
[5] Cope JS, Corney DPA, Clark JY, Remagnino P and Wilkin P (2012) Plant
species identification using digital morphometrics: a review. Expert Systems
with Applications 39: 7562-7573.
[7] Pigott D (2012) Lime Trees and Basswoods: A Biological Monograph of the
genus Tilia. Cambridge University Press, Cambridge, UK.
[11] Clark JY (2007) Plant identification from characters and measurements using
artificial neural networks. In: MacLeod N. (Ed.) Automated Taxon
Identification in Systematics (Systematics Association Special Volume):
207-224. CRC Press, Boca Raton, FL, USA.
[12] Clark JY, Corney DPA and Tang HL (2012) Automated plant identification
using artificial neural networks. IEEE Symposium on Computational
Intelligence in Bioinformatics and Computational Biology (CIBCB). San Diego,
[13] Corney DPA, Clark JY, Tang HL and Wilkin P (2012) Automatic extraction of
leaf characters from herbarium specimens. Taxon 61: 231-244.
[14] Jain AK, Zhong Y and Lakshmanan S (1996) Object matching using
deformable templates. IEEE Trans. Pattern Anal. Mach. Intell. 18: 267-278.
[17] Malladi R, Sethian JA and Vemuri BC (1995) Shape modeling with front
propagation: a level set approach. IEEE Trans. Pattern Anal. Mach. Intell. 17: 158-175.
[18] Ye L and Keogh E (2009) Time series shapelets: a new primitive for data
mining. Proc. of the 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. Paris: 947-956.
[19] Prechelt L (1994) Proben1 – A set of neural network benchmark problems and
benchmarking rules. Technical Report 21/94, Universität Karlsruhe, Germany.
[21] Clark JY, Corney D, Notley S and Wilkin P (2015) Image processing and
artificial neural networks for automated plant species identification from leaf
outlines. In: Lestrel PE (Ed.) Proc. 3rd Int Symp Biol Shape Anal (ISBSA)
World Scientific Pub Singapore and New Jersey, USA.
[22] Clark JY and Warwick K (1998) Artificial keys for botanical identification
using a multilayer perceptron neural network (MLP). Artificial Intelligence
Review 12: 105-115.
[26] Clark JY (2009) Neural networks and cluster analysis for unsupervised
classification of cultivated species of Tilia (Malvaceae). Bot. J Linnean Soc.
159: 300-314.
[30] Pigott CD (1997) Tilia. In The European Garden Flora, Walters, SM et al.
Cambridge University Press, Cambridge, UK: 205-212.
[31] Pigott CD (1997) Two proposals to maintain the names Tilia cordata and Tilia
platyphyllos (Tiliaceae) in their current use. Taxon 46: 351-353.