Syllable-Based Phonetic Transcription by Maximum Likelihood Methods
R.A.Sharman
discusses only the syllabification process in detail, since the decision tree methodology is well described elsewhere[5], whereas the syllabification algorithm proposed is novel. An experiment using a very large set of word-syllable-pronunciation strings was used to train the two models, and tests were then performed to determine the accuracy of the resulting transcription.

A Maximum Likelihood Model of Syllabification

The purpose of this step is to make explicit the hidden syllable boundaries in the observed words. These often, but not always, coincide with the morphological boundaries of the constituent parts of each word. However, so as not to confuse the question of the derivation of a word from its roots, prefixes and suffixes with the question of the pronunciation of the word in small discrete sections of vowels and consonants, the term morphology is not used here. Strictly speaking, the term syllable might be more accurately applied only after transcription to phonemes. However, we shall use it here to apply to such pronunciation units described orthographically. The purpose of such analysis is to obtain information which will be used by the phonetic transcription stage to make better judgements on the pronunciation of consonant and vowel clusters in particular.

For example, the consonant cluster ph in the word loophole might be pronounced /f/ by analogy with the same cluster in the word telephone. However, it might also be pronounced as /ph/ by analogy with the same cluster in the word tophat. The deciding factor is where the syllable boundary lies in the word. The most plausible structure for the word telephone is tele+phone, or possibly te+le+phone, and for tophat it is top+hat. So a possible syllable structure for the word loophole might be loop+hole, or alternatively loo+phole, or maybe looph+ole. The syllable model needs to determine what the true, but unobserved, syllable sequence is, given only the observed evidence of the orthographic characters. This can be modelled as a decoding problem in which a hidden sequence of states (syllables) gives rise to an observed sequence of symbols (letters). We need to discover the underlying sequence of states which gave rise to the observations. The complexity arises since the states and observations do not align in a simple way[11]. Syllable models of a similar type have been proposed for prosody[12] but not for transcription, whereas direct models of transcription have been attempted[4].

Let an orthographic word, W, be defined as a sequence of letters, w1, w2, ..., wn. Let a syllabic word, S, be defined as a sequence of syllables, s1, s2, ..., sm. The observed letter sequence, W, then arises from some hidden sequence of syllables, S, with conditional probability P(W|S). There are a finite number of such syllable sequences, of which the one given by max P(S|W), where the maximisation is taken over all possible syllable sequences, is the maximum likelihood solution and, intuitively, the most plausible analysis. By the well-known Bayes theorem, it is possible to rewrite this expression as:

    max_S P(S|W) = max_S [ P(W|S) P(S) / P(W) ]

In this equation it is interesting to interpret P(W|S) as a probability distribution capturing the facts of syllable division, while P(S) is a different distribution capturing the facts of syllable sequences. The latter model thus contains information such as which syllables form prefixes and suffixes, while the former captures some of the facts of word construction in the usage of the language. Note that the term P(W), which models the sequence of letters, is not required in the maximisation process, since it is not a function of S. Given the existence of these two distributions there is, in principle, a well-understood method of estimating the parameters and performing the decoding[7]. The estimation is provably capable of finding a local optimum[13], and is thus dependent on finding good initial conditions to train from. In this application the initial conditions are provided by supervised training data obtained from a dictionary.
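As a concrete illustration of this maximisation, the following minimal sketch enumerates every segmentation of a word into known syllables (so that P(W|S) = 1 for each candidate) and scores each with a bigram syllable model P(S). The syllable inventory and the probability values here are invented for illustration; they are not the trained models of the experiment described later.

```python
# Toy illustration of max over S of P(W|S) * P(S): enumerate every
# segmentation of a word into inventory syllables, then score each
# candidate with a bigram syllable model. All numbers are invented.

SYLLABLES = {"loop", "hole", "loo", "phole", "looph", "ole",
             "tele", "phone", "te", "le", "top", "hat"}

# P(next syllable | previous syllable); "<s>" marks the word start.
BIGRAM = {
    ("<s>", "loop"): 0.20, ("loop", "hole"): 0.50,
    ("<s>", "loo"): 0.05,  ("loo", "phole"): 0.10,
    ("<s>", "looph"): 0.01, ("looph", "ole"): 0.05,
}

def segmentations(word):
    """Yield every way of splitting word into inventory syllables."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in SYLLABLES:
            for rest in segmentations(word[i:]):
                yield [head] + rest

def score(syllables):
    """Bigram probability P(S); unseen bigrams get a small floor."""
    p, prev = 1.0, "<s>"
    for s in syllables:
        p *= BIGRAM.get((prev, s), 1e-6)
        prev = s
    return p

best = max(segmentations("loophole"), key=score)
print(best)  # ['loop', 'hole'] under these toy probabilities
```

For loophole the enumeration produces loop+hole, loo+phole and looph+ole, exactly the alternatives discussed above, and the bigram scores decide between them.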
A variety of expansions of the terms P(W|S) and P(S) can be derived, depending on the computational cost which is acceptable, and the amount of training data available. There is thus a family of models of increasing complexity which can be used in a methodical way to obtain better modelling, and thus more accurate processing.

The function P(W|S) can be simply modelled as an indicator distribution which has the value 0 everywhere, except when s_i = w_j ... w_k, for some j, k, when it has the value 1. This simply says that each syllable is spelled the same way as the letters which compose it. This points the way to a more sophisticated model of syllabification which incorporates spelling changes at syllable boundaries, but this will not be attempted here. Another application of the approach might be in a model of inflexional or derivational morphology where spelling changes are observed at morph boundaries.

The function P(S) can be modelled most simply as a bi-gram distribution, where the approximation is made that:

    P(s_i | s_1, ..., s_{i-1}) ≈ P(s_i | s_{i-1})

An efficient computational scheme

One complication exists before either the Viterbi decoding algorithm[7] for determining the desired syllable sequence, or the Forward-Backward parameter estimation algorithm[7], can be used. This is due to the combinatorial explosion of state sequences arising from the fact that potential syllables may overlap the same letter sequences, as shown in the example above with the word telephone. This leads to the decoding and training algorithms becoming O(n^3) rather than O(n^2) in computational complexity, as usual for this type of problem. The difficulty can be overcome by the use of a technique from context-free parsing[14], namely the use of a substring table. The method will be briefly described.

A word of length n can contain n^2/2 substrings, any of which may potentially be syllables of the word. Using the method of tabular layout familiar from the Cocke-Kasami-Younger (CKY) parsing algorithm, these substrings can be conveniently represented as a triangular table, T = {t_ij} (see diagram below). Where the table contains a non-zero element the index number of a unique syllable can be found. The first step in parsing the word is to generate all the possible substrings and check them against a table of possible syllables. Even for long words with 20 or 30 letters, this is not a prohibitive calculation. If the letter string is identified as a possible syllable then the unique identifying number of the syllable can be entered into the table.

[Diagram: triangular substring table over the letters "t e l e p h o n e"; dots are used as abbreviation in higher nodes for simplicity.]

The bigram sequence model can now be calculated by the following algorithm, which is an adaptation of the familiar CKY algorithm:

    for each letter w[i], i = 1, ..., n
        for each starting syllable position t[i,j], j = 1, ..., n+1-i
            for each ending syllable position t[i+j-1,k], k = 1, ..., n-i-j
                let x = t[i,j] and y = t[i+j-1,k]
                compute P(s(y) | s(x))
In this way it is possible to calculate all the possible syllable sequences which apply to the given word without being overwhelmed by a search for all possible syllable sequences.

A methodology for constructing a syllabifier

The following methodology can be used to build a practical implementation of the technique outlined above:

Collect a list of possible syllables. [...] programmed conversion, so that no difficulties were encountered in the current experiment.

The word entries were then divided into training data (220,000 words) and test data (5,000 words) by randomly extracting words. It was observed that the 220,000 words in the training text were composed of sequences of syllables taken from a set of 27,000 unique syllable types. An initial estimate of the syllable bi-gram model can be directly computed by observation. This initial model was able to decode the training data with 96% accuracy and the test data with 89% accuracy. This indicates the requirement for a smoothing technique to generalise the parameters of the bi-gram syllable model. Such [...]
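The substring table and the decoding over it can be sketched as follows. This is an illustrative reading of the scheme above, not the original implementation: the syllable inventory, the identifying numbers and the bigram probabilities are all invented, and a simple Viterbi-style dynamic programme stands in for the full decoding machinery.

```python
import math

# Hypothetical syllable inventory with identifying numbers, plus a toy
# bigram model. All names and numbers here are invented for illustration.
SYLLABLE_ID = {"te": 1, "le": 2, "tele": 3, "phone": 4, "lep": 5, "hone": 6}
BIGRAM = {("<s>", "tele"): 0.3, ("tele", "phone"): 0.6,
          ("<s>", "te"): 0.1, ("te", "le"): 0.2, ("le", "phone"): 0.3}

def substring_table(word):
    """CKY-style triangular table: t[i][j-1] holds the identifying
    number of the syllable spelled by word[i:i+j], or 0 otherwise."""
    n = len(word)
    return [[SYLLABLE_ID.get(word[i:i + j], 0) for j in range(1, n + 1 - i)]
            for i in range(n)]

def decode(word):
    """Viterbi-style pass over the table: best[i] maps the syllable
    ending at letter position i to (log-probability, backpointer)."""
    n = len(word)
    t = substring_table(word)
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)
    for i in range(n):
        if not best[i]:                 # no analysis reaches position i
            continue
        for j in range(1, n + 1 - i):
            if t[i][j - 1] == 0:        # substring is not a known syllable
                continue
            syl = word[i:i + j]
            for prev, (lp, _) in best[i].items():
                p = BIGRAM.get((prev, syl), 1e-6)  # floor unseen bigrams
                cand = lp + math.log(p)
                if syl not in best[i + j] or cand > best[i + j][syl][0]:
                    best[i + j][syl] = (cand, (i, prev))
    # trace back the best complete analysis from the end of the word
    syl = max(best[n], key=lambda s: best[n][s][0])
    syllables, pos = [], n
    while syl != "<s>":
        syllables.append(syl)
        pos, syl = best[pos][syl][1]
    return list(reversed(syllables))

print(decode("telephone"))  # ['tele', 'phone'] under these toy numbers
```

Because only the non-zero table entries are ever extended, overlapping candidates such as te+le and tele are handled without enumerating every letter-sequence split, which is the point of the substring table.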
Conclusions

A method of determining syllable boundaries has been shown. The method can be improved by the use of a tri-syllable model and by the use of more training data. Other extensions could be explored quite easily. The method does not find new syllable types; for this, some type of unsupervised clustering method is required. The method leaves unsolved the treatment of unusual or idiosyncratic textual conventions, notations, and numeric information. It seems that rule-based techniques will still be needed.

While the more serious questions still to be answered for TTS systems lie elsewhere, for example in prosody[10], the inability of systems to perform transcription with high accuracy makes this still an open question. The problem of transcription is also of interest in Speech Recognition[6], where there is a need to generate phonetic baseforms of words which are included in the recognisers' vocabulary. In this case the work required to generate a pronouncing dictionary for a large vocabulary in a new domain, including many technical terms and new jargon not previously seen, calls for an automatic, rather than manual, technique.

In the wider context the method applied here is another example of self-organising methods applied to Natural Language Processing. While these methods have found a fundamental place in speech processing (for example, speech recognition) they have yet to be seriously adopted for language processing. It is a possibility that many more specific tasks in language processing may be amenable to treatment by self-organising methods, with a consequent improvement in the reliability and ease of replication of the NLP systems which incorporate them.

References

1. J.Allen, M.S.Hunnicutt and D.Klatt, From Text to Speech, Cambridge University Press, Cambridge, 1987.
2. S.M.Lucas and R.I.Damper, Syntactic neural networks for bi-directional text-phonetics, pp 127-141 in Talking Machines, ed G.Bailly and C.Benoit, North Holland, 1991.
3. W.A.Ainsworth and N.P.Warren, Application of Multilayer Perceptrons in Text-to-Speech Synthesis Systems, pp 256-288 of Neural Networks for Vision, Speech and Natural Language, ed D.J.Myers and C.Nightingale, Chapman Hall, 1992.
4. S.Parfitt and R.A.Sharman, A bi-directional model of English Pronunciation, pp 801-804, Proceedings of EuroSpeech 91, Genoa, 1991.
5. L.R.Bahl, P.V.deSouza, P.S.Gopalakrishnan, D.Nahamoo and M.A.Picheny, Context-dependent Modelling of Phones in Continuous Speech using Decision Trees, IEEE ICASSP 1992.
6. F.Jelinek, et al., The development of a large-vocabulary discrete word Speech Recognition system, IEEE Trans. Speech and Signal Processing, 1985.
7. L.Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc IEEE, vol 77, no 2, pp 257-286, 1989.
8. S.R.Hertz, J.Kadin and K.J.Karplus, The Delta Rule Development System for Speech Synthesis from Text, Proc IEEE, vol 73, no 11, pp 1589-1601, 1985.
9. K.Church, Pronouncing proper names, ACL Chicago, 1985.
10. R.Collier, H.C.Van Leeuwen and L.F.Willems, Speech Synthesis Today and Tomorrow, Philips Journal of Research and Development, vol 47, no 1, pp 15-34, 1992.
11. S.G.Lawrence and G.Kaye, Alignment of phonemes with their corresponding orthography, Computer Speech and Language, vol 1, pp 153-165, 1986.
12. M.Giustiniani, A.Falaschi and P.Pierucci, Automatic inference of a Syllabic Prosodic Model, Eurospeech, pp 197-200, 1991.
13. B.Merialdo, On the locality of the Forward-Backward Algorithm, IEEE Transactions on Speech and Audio Processing, vol 1, no 2, pp 255-257, April 1993.
14. A.V.Aho and J.D.Ullman, The theory of parsing, translation and compiling, Prentice-Hall, 1972.