Acoustic Model Training for Speech Recognition over Mobile Networks
Abstract: The goal of this article is to present the training procedure of SphinxTrain and suitable modifications of it for obtaining accurate and robust speech recognition in a mobile GSM environment. Some modifications are based on effective preprocessing of the input data in combination with an optimal setting of the number of states per model, through adjustment of the number of tied states or the number of Gaussian mixtures. Another source of an increased recognition rate is the ‘optimal’ setting of the speech decoder. As this is a non-linear, mathematically not well tractable task containing both real and integer values, methods of evolution strategies can be successfully used (an 18.6% improvement in WER was observed compared to the original setting). All experiments and results were obtained on the Slovak speech database Mobildat, which contains recordings of 1100 speakers. The Sphinx4 recognition system was used for evaluation of the trained models.
Keywords: Sphinx4; SphinxTrain; ATK; ASR; speech recognition; HMM; Mobildat; evolution strategies.
Reference to this paper should be made as follows: Rozinaj, G. and Vojtko, J. (xxxx) ‘Acoustic
model training for speech recognition over mobile networks’, Int. J. Signal and Imaging Systems
Engineering, Vol. x, No. x, pp.xxx–xxx.
Biographical notes: Juraj Vojtko, born in 1981, received an MSc in telecommunications from the Slovak University of Technology in Bratislava, Slovakia in 2005. Since 2009, he has held an assistant professor position at the Institute of Telecommunications at the Faculty of Electrical Engineering and Information Technology of the Slovak University of Technology in Bratislava. He has also worked as a developer in the commercial sector in the area of communication and information systems since 2002. His research focuses on speech processing, specifically speech recognition and speaker identification and verification.
Juraj Kačur was born in Bratislava in 1976. He obtained his Master of Science degree in 2000 and his PhD in 2005 at the Faculty of Electrical Engineering and Information Technology of the Slovak University of Technology (FEI STU) in Bratislava. Since 2001, he has held an assistant professor position at the Institute of Telecommunications at FEI STU Bratislava. Between 2000 and 2001, he was with the Slovak Academy of Sciences, Department of Speech Analysis and Synthesis, where he participated in several projects. His research activities include signal processing, speech processing, speech recognition, speech detection, speaker identification, higher-order statistics, wavelet transforms, machine learning, ANNs and HMMs.
Gregor Rozinaj (M’97) received MSc and PhD in telecommunications from Slovak University of
Technology, Bratislava, Slovakia in 1981 and 1990, respectively. He has been a lecturer at the Institute
of Telecommunications of the Slovak University of Technology since 1981. From 1992–1994, he
worked on the research project devoted to speech recognition at Alcatel Research Center in Stuttgart,
Germany. From 1994–1996, he was employed as a researcher at the University of Stuttgart, Germany
working on a research project for automatic ship control. Since 1997, he has been a Head of the DSP
group at the Institute of Telecommunications of the Slovak University of Technology, Bratislava.
Since 1998, he has been an Associate Professor at the same institute. He is an author of 3 US and
European patents on digital speech recognition and 1 Czechoslovak patent on fast algorithms for DSP.
and the Massachusetts Institute of Technology. The first version, programmed in the C language, was introduced in 1987; Sphinx4 is the latest version (Sphinx Group, 2008, 2011; Carnegie Mellon University, 2008). The Java programming language makes it platform independent. It was created to provide the flexibility and modularity needed to perform a wide range of tasks related to the recognition of discrete or continuous speech. The Sphinx4 architecture has a modular design, so that the system programmer can change each module of the system to fit particular demands. The main blocks of Sphinx4 are the Frontend, the Decoder and the Knowledge base, which contains the lexicon (called the dictionary), the acoustic models (HMMs) and the language model. Blocks like the Frontend and the Decoder are independently replaceable, and all the necessary manipulations to do so are carried out by setting the desired configuration features in the configuration file. A brief description of the most important modules follows:

• FrontEnd – receives one or more input signals (from a file or a microphone) and outputs sequences of attributes (MFCC).

• Linguist – uses the dictionary and structures from the AcousticModel, performs the translation and provides the results in the form of a SearchGraph.

• Decoder – uses the SearchGraph and the attributes from both of the modules mentioned, provides the results to the application which started the process and controls the functionality of the modules.

Figure 1 shows the architecture of the system.

Figure 1 Architecture of Sphinx4 (Sphinx documentation (Carnegie Mellon University, 2008))

3.2 SphinxTrain

As the recognition is based on statistical speech modelling, each sound unit is modelled by a sequence of states that further model the probability distributions of the observed speech. The process of training is a very complex task that plays an important role. This task is performed by a separate tool, common to all Sphinx versions, called SphinxTrain (Carnegie Mellon University, 2000). It is an independent application written in the C language which computes the acoustic models. The package contains the source code of the SphinxTrain tools and control scripts written in Perl, which govern the running of SphinxTrain during the training phase. The training of new models is started by launching the runall.pl file from the scripts_pl folder; this Perl script launches the other scripts during the training process. Generally, the training process outlined by SphinxTrain goes through the following stages:

• verification of the input data
• statistical analysis of the input data
• vector quantisation (used only for discrete or semicontinuous models)
• training of the context independent (CI) HMM models
• training of the context dependent (CD) HMM phonemes – all triphones
• generation of the language questions and building of the decision trees needed in the tying process of the CD phonemes
• pruning of the decision trees and tying of the similar states
• training of the CD phoneme models with the reduced number of states (tied triphones)
• training of the tied triphone models with a gradually increasing number of Gaussians.

Figure 2a shows the SphinxTrain training process in general; only the stages in white boxes were used for our purpose.

Figure 2a Steps in the training process supported by SphinxTrain
Figure 2b A scheme showing the steps in our training process with data flow
In order for the training process to be successfully accomplished, several input data must be carefully prepared ahead and presented to the training script. The currently supported speech features are MFCC and PLP parameters and their time derivatives. Apart from the training data (extracted speech), several accompanying files and lists must be generated:

• list of all phonemes
• dictionary (lexicon)
• filler dictionary containing the non-speech elements
• orthographical transcription of all training records
• list of all filenames existing in the training set
• SphinxTrain configuration file.

The output of SphinxTrain consists of several files which together represent the acoustic models for the CI phonemes, the untied CD phonemes with 1 Gaussian mixture and finally, the tied CD phonemes with multiple Gaussian pdfs per state. However, one of the main drawbacks of the Sphinx system is that during the training phase, all models must exhibit the same structure, regardless of whether they are speech units like phonemes or phrases or even non-speech events. Each model has one terminating, non-emitting state to allow multiple models to be concatenated into more complex models of words or even sentences.

4 Training of HMM models on the MOBILDAT-SK using SphinxTrain

The outlined training procedure for the Context Dependent (CD) and Context Independent (CI) models in SphinxTrain has to be slightly adjusted to the structure of the Mobildat database and the Slovak language. Thus, in the following, a brief description of the training process is provided, together with some statistical information. In Figure 2b, there is a scheme graphically presenting the phases of the statistical model training from the acoustic signal.

The training was performed on the Slovak Mobildat database to suit the requirements for models to be successfully employed in GSM networks. The application of Mobildat to the training procedure defined by SphinxTrain was not as straightforward as it is in the case of Masper or Refrec, which are designed directly for its structure, which is rather complex and distributed (5 CDs). Thus, several modifications had to be made prior to the training, which can be summarised as follows: conversion of the SAMPA symbols, generation of the list of all utterances, selection of the training set, collection of the transcription information, transformation and storage of the data in the format required by the Sphinx tools, removal of records with unusable content, transformation of the Mobildat lexicon, generation of a list of phonemes together with the list of fillers (non-speech events) and finally, adaptation of the configuration file to suit the structure of Mobildat.

During the process, we created CI models with 1, 2, 4, 8, 16, 32, 64 and 128 Gaussians and CD models with 1, 2, 4, 8, 16, 32 and 64 Gaussian PDFs, just to verify and compare their performance. There were 51 phonemes and 3 types of filler models determined to model non-speech acoustic artefacts: SIL for silence (including the silence at the beginning and the end of recordings), FIL for the filled pauses and SPK for the speaker’s noise (Žgank and Maučec, 2010). The non-speech filler INT (intermittent noise) was ignored and mapped into the SIL model, as was STA (stationary noise). Records containing incomplete (truncated) or mispronounced words were rejected from the training process, unlike the records contaminated by GSM noise, which were preserved. Finally, the training set consisted of 41,739 records spoken by 880 persons. The dictionary stores the pronunciations of 14,909 words, some of which have multiple pronunciations. However, the SphinxTrain process can handle only one – the first; the remaining ones can be utilised only by the recogniser.

The initialisation of the models was performed over all training recordings by the so-called flat start initialisation. Speech files were described by MFCC parameters and their derivatives, which were divided into three streams for every processed block of the length of 25 ms. Thus there were 13 cepstral, 13 delta and 13 delta-delta coefficients, computed from 16-bit linear PCM waveforms. During the training, the Baum-Welch algorithm was used with the convergence ratio set to 0.04 and the maximum and minimum iteration numbers set at 30 and 3, respectively.
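The composition of the 39-element vectors (13 cepstral, 13 delta and 13 delta-delta coefficients) can be sketched with numpy. The simple symmetric-difference estimate of the time derivatives used below is one common variant, not necessarily the exact regression formula applied by SphinxTrain:

```python
import numpy as np

def add_derivatives(cep):
    """Stack cepstra with their time derivatives: (T, 13) -> (T, 39).

    Deltas are estimated by a symmetric difference over adjacent frames,
    with the edge frames repeated at the boundaries.
    """
    def delta(x):
        padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")  # repeat edge frames
        return (padded[2:] - padded[:-2]) / 2.0

    d = delta(cep)    # delta coefficients
    dd = delta(d)     # delta-delta (acceleration) coefficients
    return np.hstack([cep, d, dd])
```

Each 13-dimensional MFCC frame, computed per analysis block, thus becomes one 39-dimensional training vector.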
5 Optimal settings and functional modification

The main modifications to the original training process are represented by the 2 blocks marked by asterisks in the scheme shown in Figure 2b. The block ‘Model structure definition’, marked by *1, enables the setting of the optimal parameters of the models, and the block ‘Selection of training data’, marked by *2, represents the process of choosing the training set. Both blocks and the modifications are described below.
5.1 Optimal settings of the training process

Apart from the settings already mentioned, which were based on a ‘common sense’ approach, there are several more which may have a crucial impact on the overall accuracy: the structure of the models, i.e., the number of states per model and the number of states left after the tying process for the CD phonemes. These are language- and database-specific, so their values have to be determined by a series of experiments. In the case of the number of states per model, there are only 2 default options in the Sphinx systems, 3 or 5 states. A modification of the training script permitted us to set other values too, so we tested values from 1–7 states per model. The number of states after tying is both language- and database-specific. The tested range was set to 2000–18,000, where the initial guess was based on previous experiments with the HTK system, which determined the number of states using a different, more flexible, designer-driven approach. For this purpose, we needed to set the following variables in the SphinxTrain configuration file:

$CFG_STATESPERHMM = 5;
$CFG_INITIAL_NUM_DENSITIES_CD = 1;
$CFG_FINAL_NUM_DENSITIES_CD = 128;
$CFG_INITIAL_NUM_DENSITIES_CI = 1;
$CFG_FINAL_NUM_DENSITIES_CI = 64;
$CFG_N_TIED_STATES = 2000;

remove the restriction on the number of internal states, modify the scripts slave_convg.pl and norm_and_launchbw.pl and add the script split_gaussians.pl to the phase of context independent model training.

Tables 1–5 show the word recognition rates of the application words grouped by the number of states per model. The maximum accuracy per number of model states is graphically depicted in Figure 3.

Table 1 Word recognition rates for CD phonemes and 3 states per model for Application words

                        Number of Gaussians
Number of tied states   8      16     32     64     128
2000                    91.23  91.41  91.7   92.42  92.43
5000                    92.47  92.36  92.93  93.73  93.08
9000                    93.12  92.57  93.44  93.44  92.86
12,000                  92.11  92.72  93.08  93.51
15,000                  92.98  92.57  92.86  92.5
18,000                  92.27  92.5   92.79  93.37

Table 2 Word recognition rates for CD phonemes and 4 states per model for Application words

                        Number of Gaussians
Number of tied states   8      16     32     64     128
2000                    94.03  95.2   95.66  95.53  95.53
5000                    95.26  96.51  95.87  95.1   93.66
9000                    94.91  96.29  96.02  94.88  92.1
12,000                  95.05  96.51  95.88  95.45
15,000                  95.26  96.15  95.74  95.38
18,000                  95.12  95.8   96.09  95.31

Table 3 Word recognition rates for CD phonemes and 5 states per model for Application words

                        Number of Gaussians
Number of tied states   8      16     32     64     128
2000                    94.31  95.35  95.15  95.17  95.67
5000                    94.53  95.79  95.95  95.89  94.4
9000                    95.33  95.72  95.88  95.88  93.36
12,000                  95.33  95.79  96.03  95.89
15,000                  95.33  96.15  96.17  95.88
18,000                  95.55  96.79  96.3   95.8

Table 4 Word recognition rates for CD phonemes and 6 states per model for Application words

                        Number of Gaussians
Number of tied states   8      16     32     64     128
2000                    93.14  93.88  95.05  94.84  95.22
5000                    94.08  95.04  95.41  94.97  91.35
9000                    94.01  95.62  95.05  94.22  86.69
12,000                  94.59  95.4   95.13  94.02
15,000                  94.23  95.76  96.05  93.77
18,000                  94.59  95.26  95.33  93.71

Table 5 Word recognition rates for CD phonemes and 7 states per model for Application words

                        Number of Gaussians
Number of tied states   8      16     32     64     128
2000                    90.63  91.26  91.59  92.32  90.98
5000                    93.15  93.08  93.09  91.42  85.85
9000                    93.76  93.87  93.44  91.19  86.11
12,000                  93.67  93.27  92.91  89.34
15,000                  93.6   93.36  92.39  88.63
18,000                  93.73  93.49  92.23  89.7

Figure 3 Maximum accuracy for all tested numbers of internal emitting states in a model
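The grid of model structures behind Tables 1–5 can be enumerated mechanically. In the sketch below, the $CFG_* names are those from the SphinxTrain configuration file quoted above; actually writing each variant into sphinx_train.cfg and launching a training run is left out:

```python
from itertools import product

STATES_PER_MODEL = [3, 4, 5, 6, 7]                     # Tables 1-5
N_TIED_STATES = [2000, 5000, 9000, 12000, 15000, 18000]

def config_variant(n_states, n_tied):
    """Return the SphinxTrain configuration lines for one tested variant."""
    return [
        f"$CFG_STATESPERHMM = {n_states};",
        "$CFG_INITIAL_NUM_DENSITIES_CD = 1;",
        "$CFG_FINAL_NUM_DENSITIES_CD = 128;",
        "$CFG_INITIAL_NUM_DENSITIES_CI = 1;",
        "$CFG_FINAL_NUM_DENSITIES_CI = 64;",
        f"$CFG_N_TIED_STATES = {n_tied};",
    ]

# one configuration (and hence one training run) per tested combination
variants = [config_variant(s, t) for s, t in product(STATES_PER_MODEL, N_TIED_STATES)]
```

The 5 × 6 = 30 combinations correspond to the rows of Tables 1–5; the number of Gaussians is then grown inside each run up to $CFG_FINAL_NUM_DENSITIES_CD.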
5.2 Enhanced background model

As we mentioned above, Sphinx has the restriction that all models must have the same structure (x regular states plus 1 non-emitting state); Sphinx does not allow the construction of special models like the T-model. The T-model is used for modelling short pauses between words or spelled syllables, especially when it is not known whether a pause took place or not, as these events are not marked in the description files. To handle them automatically during the course of training, the T-models are appended to each word in the dictionary. This trick is not possible in Sphinx which, by default, automatically assumes the existence of pauses at the beginnings and the ends of recordings. If there are longer pauses in speech, they should be marked. In recordings containing ‘continual’ speech, ‘mistakes’ caused by long pauses are not relevant on average, but for spelled recordings this assumption is by far not correct. To remedy this inaccuracy, three options are possible. The first one is to automatically insert a silence model <sil> between any spelled items.

Original transcription:
…
<s> <spk> jedna päť deväť jedna šesť nula </s>(b30000c4)
…
<s> k ú <sil> t ypsilon </s>(b30000l2)
<s> <spk> e i gé qé í i en a el </s>(b30000l3)
…

Modified transcription:
…
<s> <spk> jedna <sil> päť <sil> deväť <sil> jedna <sil> šesť <sil> nula </s>(b30000c4)
…
<s> k <sil> ú <sil> t <sil> ypsilon </s>(b30000l2)
<s> <spk> e <sil> i <sil> gé <sil> qé <sil> í <sil> i <sil> en <sil> a <sil> el </s>(b30000l3)
…

The second one is simply to remove the recordings with spelled items from the training completely.

Original transcription:
…
<s> i ako ivan v ako viera e ako emil t ako tibor a ako alena </s>(b30000l1)
<s> k ú <sil> t ypsilon </s>(b30000l2)
<s> <spk> e i gé qé í i en a el </s>(b30000l3)
<s> <fil> dvetisíc osemsto sedemdesiatjeden slovenských korún </s>(b30000m1)
<s> jedno euro dvadsať centov </s>(b30000m2)
…

Modified transcription:
…
<s> i ako ivan v ako viera e ako emil t ako tibor a ako alena </s>(b30000l1)
<s> <fil> dvetisíc osemsto sedemdesiatjeden slovenských korún </s>(b30000m1)
<s> jedno euro dvadsať centov </s>(b30000m2)
…

The third one is to insert the SIL models only at the positions where silence was detected by Viterbi alignment, using the trained models from the previous training process. It is more elegant to do it in this way, though it requires a higher computational load. However, if the original models are not well trained, the results may not be better (as was observed during some experiments).

In the second approach, fewer phoneme models would be affected by background noise, but fewer physical realisations of the phonemes would be available for the training process. As this is a balance between accuracy and robustness, the correct scenario is a subject for experiments. Thus, four options were tested: the original SphinxTrain procedure (applied also to the spelled recordings), removal of the spelled recordings from the training, indiscriminative insertion of SIL models between the spelled items and insertion of SIL models only where the silence intervals were detected by the recognition process (Viterbi decoding). Tables 6 and 7 and Figures 4 and 5 show the results of those four experiments, revealing improvements in the recognition process by all suggested modifications, i.e., by ‘repairing mistakes’ caused by the existence of longer pauses in spelled utterances and the nonexisting T-model.

Table 6 Word recognition rates for CI phonemes and 3 states per model for Application words

                     Number of Gaussians
Modification type    8      16     32     64     128
Original procedure   76.43  79.35  79.71  81.65  82.68
Removed spelled      78.54  81.54  83.71  85.58  86.3
Insert SIL           83.48  85.49  87.87  89.46  89.75
Align                82.4   85.06  87.45  88.97  89.55

Table 7 Word recognition rates for CD phonemes and 3 states per model and 9000 tied states for Application words

                     Number of Gaussians
Modification type    4      8      16     32     64
Original procedure   91.81  93.12  92.57  93.44  93.44
Removed spelled      93.62  93.49  92.93  93.29  93.66
Insert SIL           93.87  93.94  94.3   94.95  95.31
Align                94.38  94.59  94.74  95.6   94.88

Figure 4 Word recognition rates for Context Independent phonemes and 3 states per model for Application words
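The first modification, the indiscriminative insertion of <sil> between spelled items, amounts to a simple transformation of the transcription lines. A minimal sketch, assuming that tokens in angle brackets are markers and everything else is a spelled item:

```python
def insert_sil(line):
    """Insert a <sil> model between adjacent spelled items in a transcription.

    Markers such as <s>, <spk>, <fil> and an already present <sil> are left
    untouched; <sil> is added only between two consecutive plain items.
    """
    head, tail = line.split("</s>", 1)
    out = []
    for tok in head.split():
        if out and not out[-1].startswith("<") and not tok.startswith("<"):
            out.append("<sil>")
        out.append(tok)
    return " ".join(out) + " </s>" + tail
```

Applied to the example above, insert_sil("<s> k ú <sil> t ypsilon </s>(b30000l2)") yields "<s> k <sil> ú <sil> t <sil> ypsilon </s>(b30000l2)", i.e., the modified transcription shown for option one.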
Figure 5 Word recognition rates for Context Dependent phonemes and 3 states per model for Application words

The new generation is selected from the μ + λ individuals in the current generation, i.e., from the set of parents and their offspring, etc. The basic operations are defined as follows. Each individual A is defined by the original data to be optimised x, a vector of dispersions σ and a vector of angles α.

A mutation represents a process of modification of an individual so that it slightly differs from its ‘original version’, i.e., its feature-optimised data are slightly changed by a stochastic process as in (1):

x′ = x + N(0, δ)   (1)

The mutation of individuals is performed after the recombination phase in the following steps: mutation of the angles, where the standard deviation is empirically about 5°, mutation of the dispersions (diagonal covariance matrix) and finally, mutation of the original data vector x based on the newly derived standard deviations and angles.

Generally, each part of an individual A can be recombined in a different way; thus, ω = (ωx, ωσ, ωα) represents a vector of the possible recombination methods for each part (data, dispersions and angles) and the vector ρ = (ρx, ρσ, ρα) determines the number of parents needed to produce a child.

As the optimisation process proceeds, the actual solution is supposed to get to the vicinity of the global extreme, so the mutations (dispersions) should naturally decrease in time to produce a stable result. However, if the problem being solved is hard to describe but easy to verify, it may be better to let the evolution algorithm itself control the standard deviations and the methods of recombination, as was previously outlined for the general case.
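A stripped-down version of the optimisation loop may clarify the scheme. The sketch below implements only the (μ + λ) selection with a Gaussian mutation whose dispersion declines over the generations; the correlated mutations (angles) and the recombination operators described above are omitted, and the fitness function is a toy stand-in for the real WER-based evaluation:

```python
import random

def es_optimise(fitness, dim, mu=4, lam=8, generations=10,
                sigma0=0.35, sigma_min=0.02, seed=1):
    """Minimal (mu + lambda) evolution strategy over [0, 1]^dim."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(mu)]
    for g in range(generations):
        # dispersion declines linearly over the generations (cf. 0.35 -> 0.02)
        sigma = sigma0 + (sigma_min - sigma0) * g / max(1, generations - 1)
        offspring = []
        for _ in range(lam):
            parent = rng.choice(population)
            # Gaussian mutation, clipped back to the feasible <0,1> range
            child = [min(1.0, max(0.0, x + rng.gauss(0.0, sigma)))
                     for x in parent]
            offspring.append(child)
        # (mu + lambda) selection: the best mu of parents plus offspring survive
        population = sorted(population + offspring, key=fitness, reverse=True)[:mu]
    return population[0]
```

With a real decoder in the loop, fitness would be the (negated) WER measured on the evaluation records, which is what makes each evaluation so expensive.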
6.2 The off-line optimisation process of the speech decoding stage

Once the HMM models had been trained, the optimisation process for the speech decoder was carried out using the test portion of the MOBILDAT-SK database, which contains 220 speakers. As our main target is a dialog application, we evaluated records containing looped digits, denoted as B1, which best approximate the pursued task (an unknown number of repetitions of words from a dictionary). This ensures the proper setting of the speech detection algorithm that is also involved in a real speech recognition application, as well as of several auxiliary grammar parameters. In the tests, we used context-dependent models of phonemes with three states and eight Gaussian mixtures. As the speech features, the MFCC, delta and acceleration coefficients were used (39 elements). The models were trained using the train section of the MOBILDAT-SK database (880 speakers) and the MASPER training procedure. To present the real number of variables involved in a decoding task and to highlight their significance, the ATK system (Young, 2004) was used.

In the next stage, we had to choose some of the most significant parameters that we believed could play an important role in the recognition process employed in a real dialog application. It is vital to limit their number to speed up the convergence. It should be noted that playing all eligible records takes 30 minutes and so, assessing the fitness of a single individual in a given generation takes about 30 minutes. The selected parameters for the optimisation process mainly govern the speech detection algorithm the ATK system (Young, 2004, 2005) uses and adjust some grammar-related issues. In Table 8, these features are listed together with their descriptions and ‘optimally’ set values.

Table 8 Selected features for the accuracy optimisation of the speech decoding process realised in the ATK system

Feature        Value    Description
SILSEQCOUNT    109      Number of frames classified as silence to detect the silence
SPCSEQCOUNT    22       Number of frames classified as speech to detect the beginning of speech
SPEECHTHRESH   9.51     Energy threshold for the speech detection, relative to the estimated silence level
SILENERGY      10       Mean energy of silence (background noise) in dB
SPCGLCHCOUNT   3        Number of frames below the threshold that can be ignored when counting SPCSEQCOUNT
SILGLCHCOUNT   10       Number of frames above the threshold that can be ignored when counting SILSEQCOUNT
SILMARGIN      17       Shift of the beginning and the end of the detected speech, given in number of frames
CMNTCONST      0.9984   Time constant for the cepstral mean adaptation
CMNMINFRAMES   26       Minimum number of frames after which a cepstral mean adaptation can take place
WORDPEN        –81      Log probability penalisation for a word insertion error

As all of these parameters have their physical interpretation and cover different ranges of acceptable values, they were transformed to the unified range <0,1> according to (4):

fe′ = (fe − lower_limit) / (upper_limit − lower_limit)   (4)

where fe is the original value of the feature, fe′ is the transformed feature and lower_limit and upper_limit are the limits of the feasible range. Then, in the process of recombination or mutation, this range had to be checked; if it was not met for some feature, the upper or lower limit was substituted, to preserve the physical meaning. Such normalisation enabled us to use only one standard deviation for the vector as a whole, because all features had approximately the same dispersion. Furthermore, we decided to automatically decrease the dispersion with the increasing number of generations to reach a stable solution; however, tests with fixed and inherited (optimised) dispersion were executed as well. Because of the time consumption, we tried to limit the number of generations and the number of individuals in each generation. To do so, we neglected some possible correlations among the optimised features. As we were not completely sure of the ‘ideal’ settings of the optimisation process, we tested several versions: constant, decreasing and optimised deviation, various methods for the parents’ selection and recombination, numbers of generations, etc.
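The transformation (4) and the subsequent range check can be sketched as follows (the limits passed in are illustrative, not the actual feasible ranges used):

```python
def to_unit_range(fe, lower_limit, upper_limit):
    """Transform a feature value into <0,1> according to (4)."""
    return (fe - lower_limit) / (upper_limit - lower_limit)

def from_unit_range(fe_norm, lower_limit, upper_limit):
    """Map back to the physical range; values pushed outside <0,1> by
    mutation or recombination are clipped to the nearest limit."""
    fe_norm = min(1.0, max(0.0, fe_norm))
    return lower_limit + fe_norm * (upper_limit - lower_limit)
```

The clipping in from_unit_range is what substitutes the upper or lower limit whenever an operator leaves the feasible range.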
In Figure 6, the mean errors in each generation are depicted for the constant deviation (0.3), a gradually declining deviation ranging from 0.35–0.02 and the deviation optimised by the evolution process, where in all cases the parents were selected stochastically according to their fitness.

Figure 6 The mean word error rates per generation for gradually declining, inherited and constant standard deviation

7 Conclusion

We suppose there is some theoretical maximal accuracy which can be achieved using the Mobildat-SK database (HMM models derived from the database), limited by its size, present triphones and conditions. We observed some ambiguity among the tested training parameters mentioned above (number of states per model, number of Gaussian mixtures and number of tied states) in the sense that a similar recognition accuracy can be obtained by different sets of parameters: the modification of one parameter causes a change in the others if the accuracy is to be kept unchanged. We found that the best results for this database are reached in the interval of 4–5 states per model with 16 or 32 Gaussians. Increasing the number of tied states above 5000 brings no significant improvement, but there also seems to be a reciprocal proportion between it and the number of Gaussian mixtures.

The second set of tests, which evaluated and compared the original training procedure to its three suggested and implemented modifications, proved that the original scenario is the least effective for MobilDat-SK. From Tables 6 and 7, it can be seen that in most tested cases, the scenario with automatically detected silence intervals based on ‘incorrectly trained models’ using Viterbi alignment for SIL insertion was the most beneficial. However, it required almost two runs of the training process, so it is more time consuming. On the other hand, the indiscriminative insertion of SIL models between all spelled items helped to increase the accuracy as well and for some settings it was even the best option; moreover, this was achieved without adding any extra computational load. Even though it is somewhat impractical, the removal of the spelled recordings still gives better averaged results than the original procedure.

Although the model training phase is vital for a successful recognition system, the proper adjustment of the decoding process plays an important role as well. It is relevant not only for achieving the declared accuracy usually observed in the off-line experiments, which provide better results, but also for speeding up the decoding process and reducing the use of time and computational resources. As there are many parameters to be optimised together, which are mutually related either positively or negatively to each other, a common sense approach applied to their manual adjustment may be far from the optimum.

By using the above-mentioned off-line method (evolution strategies) we achieved 18.72% WER (off-line tests using the standard evaluation procedure defined in MASPER for the looped digits test showed only 1.17% WER, which shows the huge difference between off-line and on-line experiments), which is about 5% lower than the WER reached when adjusting manually (generally above 23% WER). As presented, the utilisation of evolution strategies brought promising results at low cost (we tested only 98 individuals in a 10-dimensional space, which is less than 1.6 search points per dimension), in a way similar to what is reported in other areas.

From the tested settings of the evolution strategies (recombination methods, dispersion control), it can be concluded – see Figure 6 – that the gradual decrease of dispersion brought better final results within the range of tested generations, which follows the theoretical expectations. However, this approach showed slightly worse results in the early stages, which may be caused by the higher dispersion, which aimed to cover a wider area at the beginning. In the case of ‘unlimited’ evolution time, the deviation should be kept optimised as well and some kind of lifetime control introduced, in order to avoid getting stuck in local minima.

Features like WORDBEAM, MAXBEAM and GENBEAM that are involved in the pruning of the decoding process were not set during the optimisation process; the fitness of each individual was based solely on the recognition accuracy (WER). The above-mentioned parameters introduce some approximations to speed up the computation, but could cause a deterioration in accuracy. To strike the balance, a new fitness function must be used that utilises the computational load as well. This, however, would be highly dependent on the perplexity of the actual dialog system.

Usually, the incorporation of the Lamarckian evolution strategy (Mitchell, 1997) can bring even better results; that is, some individuals are additionally trained during the evolution to reach a local minimum in their vicinity. Usually this is done by the hill climbing algorithm and such ‘learned’ information is passed to the next generation (this opposes the Darwinian theory). Unfortunately, in the case of a costly evaluation of the fitness function, this option loses its benefits, as more individuals must be examined, which can be made up for by allowing more generations in the classical evolution strategy optimisation. Finally, it should be emphasised that the optimisation of the decoder using ES was done off-line and independently from the training process for the HMM models, and focused only on the achieved accuracy.
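The reported gain can be checked directly: 18.72% WER against the roughly 23% WER of the manual setting is about a 4.3-point absolute reduction, i.e., the 18.6% relative improvement quoted in the abstract:

```python
manual_wer = 23.0   # WER of the manually adjusted decoder (per cent)
es_wer = 18.72      # WER after optimisation by evolution strategies

absolute_gain = manual_wer - es_wer          # roughly 4.3 percentage points
relative_gain = absolute_gain / manual_wer   # roughly 0.186, i.e., 18.6%
```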