
Syllable-based Phonetic transcription by Maximum Likelihood Methods

R.A.Sharman

MP167, IBM(UK) Labs Ltd, Hursley Park, Winchester SO21 2JN, UK

Introduction

The transcription of orthographic words into phonetic symbols is one of the principal steps of a text-to-speech system[1]. In such a system a suitable phonetic pronunciation must be supplied, without human intervention, for every word in the text. No dictionary, however large, will contain all words, let alone proper names, technical terms and other textual items commonly found in unrestricted texts. Consequently, an automatic transcription component is usually considered essential. Hand-written rule sets, defining the transcription of a letter in its context to some sound, view the process as that of parsing with a context-sensitive grammar. This approach to transcription has been challenged more recently by a variety of methods such as Neural nets[2], Perceptrons[3], Markov Models[4], and Decision Trees[5]. Some approaches have used additional information such as prefixes and suffixes[8], or syllable boundaries[3], sometimes combined with the use of parts-of-speech to assist in the disambiguation of multiple pronunciations. In the phonetic transcription of proper names, special techniques can be employed to improve accuracy[9], such as detecting the language of origin of the name and using different spelling-to-sound rules. Each method has its own advantages and disadvantages in terms of computational speed, complexity and cost. However, none of these methods by itself is completely adequate.

The present method uses the two-step conversion process described elsewhere[1,3], in which the structure of the word plays a central role. First the orthographic word is divided into its syllables, and secondly the syllable sequence is converted to a phonetic string. This not only accords with linguistic intuition, but it also allows the two processes to be handled by different techniques, choosing the technique most suited to each step. The question of whether a morphological or syllabic decomposition of the word might produce better results is not further analysed here. (For the present study data was available for syllables and not for morphs, so in that sense the comparison could not be carried out by the techniques proposed.) The effects of other factors, such as part-of-speech tagging, domain-dependent information, and other information sources, were ignored, although these could be useful in practical systems.

The technique proposed for syllabification is based on the principle of Hidden Markov Modelling, well known in speech recognition[7]. This presupposes the existence of some training material containing words in both their orthographic and syllabic form. Using this data, a model of syllable sequences can be designed and trained to identify syllable boundaries. Once the most likely syllable division of the word has been found, the phonetic transcription can be produced by a variety of direct transcription methods, such as the one used here based on Decision Trees[5]. The training of such a method presupposes the existence of some training data containing words in both their syllabic and phonetic forms. Using the latter data, a Decision Tree can be trained to transcribe syllables in context into phone sequences. The advantage of using decision trees is that they not only learn general rules, but also capture idiosyncratic special cases automatically. The resulting process should perform transcription with high accuracy.

Such a two-stage approach has been shown to yield improvements[3], but only where perfect syllabification information is available; consequently a reliable syllabification technique is required.
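For illustration, the two-step conversion might be sketched as follows. The two components below are invented toy stand-ins, not the HMM syllabifier or decision-tree transcriber developed in the paper, and the phone symbols are arbitrary.

```python
# A minimal sketch of the two-step conversion: the word is first divided into
# syllables, then each syllable is transcribed in the context of the sequence.
# Both components here are invented placeholders for illustration only.

def text_to_phonemes(word, syllabify, transcribe):
    syllables = syllabify(word)  # step 1: orthography -> syllable sequence
    # step 2: syllable sequence -> phone string, transcribing each syllable
    # in the context of the whole sequence
    return [ph for i, s in enumerate(syllables) for ph in transcribe(s, syllables, i)]

# Toy stand-ins for the two steps (hypothetical symbols and segmentations):
def toy_syllabify(word):
    return {"telephone": ["tele", "phone"]}.get(word, [word])

def toy_transcribe(syl, context, i):
    return {"tele": ["t", "e", "l", "i"], "phone": ["f", "ou", "n"]}.get(syl, ["?"])

print(text_to_phonemes("telephone", toy_syllabify, toy_transcribe))
```

Separating the two steps in this way lets each be replaced independently, which is the point made in the text.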

The remainder of this paper discusses only the syllabification process in detail, since the decision tree methodology is well described elsewhere[5], whereas the syllabification algorithm proposed is novel. An experiment using a very large set of word-syllable-pronunciation strings was used to train the two models, and then tests were performed to determine the accuracy of the resulting transcription.

A Maximum Likelihood Model of syllabification

The purpose of this step is to make explicit the hidden syllable boundaries in the observed words. These often, but not always, coincide with the morphological boundaries of the constituent parts of each word. However, so as not to confuse the question of the derivation of a word from its roots, prefixes and suffixes with the question of the pronunciation of the word in small discrete sections of vowels and consonants, the term morphology is not used here. Strictly speaking, the term syllable might be more accurately applied only after transcription to phonemes. However, we shall use it here to apply to such pronunciation units described orthographically. The purpose of such analysis is to obtain information which will be used by the phonetic transcription stage to make better judgements on the pronunciation of consonant and vowel clusters in particular.

For example, the consonant cluster ph in the word loophole might be pronounced /f/ by analogy with the same cluster in the word telephone. However, it might also be pronounced as /p h/ by analogy with the same cluster in the word tophat. The deciding factor is where the syllable boundary lies in the word. The most plausible structure for the word telephone is tele+phone, or possibly te+le+phone, and for tophat is top+hat. So a possible syllable structure for the word loophole might be loop+hole, or alternatively loo+phole, or maybe looph+ole. The syllable model needs to determine what the true, but unobserved, syllable sequence is, given only the observed evidence of the orthographic characters. This can be modelled as a decoding problem in which a hidden sequence of states (syllables) gives rise to an observed sequence of symbols (letters). We need to discover the underlying sequence of states which gave rise to the observations. The complexity arises since the states and observations do not align in a simple way[11]. Syllable models of a similar type have been proposed for prosody[12] but not for transcription, whereas direct models of transcription have been attempted[4].

Let an orthographic word, W, be defined as a sequence of letters, w1, w2, ..., wn. Let a syllabic word, S, be defined as a sequence of syllables, s1, s2, ..., sm. The observed letter sequence, W, then arises from some hidden sequence of syllables, S, with conditional probability P(W|S). There are a finite number of such syllable sequences, of which the one given by max_S P(S|W), where the maximisation is taken over all possible syllable sequences, is the maximum likelihood solution and, intuitively, the most plausible analysis. By the well-known Bayes theorem, it is possible to rewrite this expression as:

    max_S P(S|W) = max_S [ P(W|S) P(S) / P(W) ]

In this equation it is interesting to interpret P(W|S) as a probability distribution capturing the facts of syllable division, while P(S) is a different distribution capturing the facts of syllable sequences. The latter model thus contains information such as which syllables form prefixes and suffixes, while the former captures some of the facts of word construction in the usage of the language. Note that the term P(W), which models the sequence of letters, is not required in the maximisation process, since it is not a function of S. Given the existence of these two distributions there is, in principle, a well-understood method of estimating the parameters and performing the decoding[7]. The estimation is provably capable of finding a local optimum[13], and is thus dependent on finding good initial conditions to train from. In this application the initial conditions are provided by supervised training data obtained from a dictionary.
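The maximum likelihood choice above can be made concrete with a deliberately brute-force sketch: enumerate every syllable sequence S whose spelling matches the word (these are exactly the S with P(W|S) = 1) and keep the one maximising a bi-gram estimate of P(S). The syllable inventory and probability values below are invented for the example.

```python
# Brute-force illustration of the maximum likelihood model: enumerate all
# segmentations consistent with the spelling, score each with P(S), pick the max.
# Inventory and probabilities are toy values, not estimated from real data.

SYLLABLES = {"te", "le", "tele", "phone", "top", "hat"}

# P(s_i | s_{i-1}); "<s>" marks the start of a word. Values are illustrative.
BIGRAM = {
    ("<s>", "tele"): 0.4, ("tele", "phone"): 0.5,
    ("<s>", "te"): 0.2, ("te", "le"): 0.3, ("le", "phone"): 0.2,
    ("<s>", "top"): 0.3, ("top", "hat"): 0.4,
}

def segmentations(word):
    """All syllable sequences that spell `word` exactly; these are the S
    for which the deterministic spelling model gives P(W|S) = 1."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        if word[:i] in SYLLABLES:
            for rest in segmentations(word[i:]):
                yield [word[:i]] + rest

def p_sequence(syls):
    """Bi-gram approximation P(S) = product over i of P(s_i | s_{i-1})."""
    p, prev = 1.0, "<s>"
    for s in syls:
        p *= BIGRAM.get((prev, s), 1e-6)  # small floor for unseen pairs
        prev = s
    return p

def most_likely(word):
    return max(segmentations(word), key=p_sequence)

print(most_likely("telephone"))  # tele+phone outscores te+le+phone
```

This enumeration is exponential in the worst case; the efficient table-based scheme described later avoids it.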

A variety of expansions of the terms P(W|S) and P(S) can be derived, depending on the computational cost which is acceptable and the amount of training data available. There is thus a family of models of increasing complexity which can be used in a methodical way to obtain better modelling, and thus more accurate processing.

The function P(W|S) can be simply modelled as a product of terms, one per syllable, which has the value 0 everywhere, except when s_i = w_j ... w_k for some j, k, when it has the value 1. This simply says that each syllable is spelled the same way as the letters which compose it. This points the way to a more sophisticated model of syllabification which incorporates spelling changes at syllable boundaries, but this will not be attempted here. Another application of the approach might be in a model of inflexional or derivational morphology where spelling changes are observed at morph boundaries.

The function P(S) can be modelled most simply as a bi-gram distribution, where the approximation is made that:

    P(s_i | s_1, ..., s_{i-1}) ≈ P(s_i | s_{i-1})

Such a simple model can capture many interesting effects of syllable placements adjacent to other syllables and adjacent to boundaries. However, it would not be expected that subtle effects of syllabification due to longer range effects, if they exist, could be captured this way.

An efficient computational scheme for syllabification

One complication exists before either the Viterbi decoding algorithm[7] for determining the desired syllable sequence, or the Forward-Backward parameter estimation algorithm[7], can be used. This is due to the combinatorial explosion of state sequences, arising from the fact that potential syllables may overlap the same letter sequences, as shown in the example above with the word telephone. This leads to the decoding and training algorithms becoming O(n³) rather than O(n²) in computational complexity, as usual for this type of problem. The difficulty can be overcome by the use of a technique from context-free parsing[14], namely the use of a substring table. The method will be briefly described.

A word of length n can contain n²/2 substrings, any of which may potentially be syllables of the word. Using the method of tabular layout familiar from the Cocke-Kasami-Younger (CKY) parsing algorithm, these substrings can be conveniently represented as a triangular table, T = {t_ij} (see diagram below). Where the table contains a non-zero element, the index number of a unique syllable can be found. The first step in parsing the word is to generate all the possible substrings and check them against a table of possible syllables. Even for long words with 20 or 30 letters, this is not a prohibitive calculation. If the letter string is identified as a possible syllable then the unique identifying number of the syllable can be entered into the table.

[Figure: a triangular table over the word telephone, with the letters t, e, l, e, p, h, o, n, e along the base and successively longer substrings (te, el, le, ...; tel, ele, lep, ...; tele, elep, leph, ...) in the rows above; dots abbreviate the higher nodes. Caption: "The computational structure used for finding the syllable sequence".]

The bigram sequence model can now be calculated by the following algorithm, which is an adaptation of the familiar CKY algorithm:

    for each letter w[i], i = 1, ..., n
        for each starting syllable position t[i,j], j = 1, ..., n+1-i
            for each ending syllable position t[i+j-1,k], k = 1, ..., n-i-j
                let x = t[i,j] and y = t[i+j-1,k]
                compute P(s(y) | s(x))
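The table construction and decoding described above might be realised as in the following sketch: a triangular substring table is built first, then a Viterbi-style dynamic programme runs over (position, last syllable) states, so only adjacent syllable pairs are scored rather than whole sequences. The inventory and probabilities are invented toy values.

```python
import math

# Sketch of the substring-table scheme: build the triangular table of
# substrings that are known syllables, then decode with a Viterbi-style
# dynamic programme instead of enumerating every syllable sequence.
# Inventory and bigram probabilities are illustrative only.

SYLLABLES = {"te", "le", "tele", "phone"}
BIGRAM = {("<s>", "tele"): 0.4, ("tele", "phone"): 0.5,
          ("<s>", "te"): 0.2, ("te", "le"): 0.3, ("le", "phone"): 0.2}

def build_table(word):
    """T[i][j] is true iff word[i:i+j+1] is a possible syllable, mirroring
    the CKY-style triangular table in the diagram."""
    n = len(word)
    return [[word[i:i + j + 1] in SYLLABLES for j in range(n - i)] for i in range(n)]

def viterbi_syllables(word):
    n, table = len(word), build_table(word)
    # best[(pos, last_syllable)] = (log-probability, predecessor state)
    best = {(0, "<s>"): (0.0, None)}
    for i in range(n):
        for (pos, prev), (lp, _) in list(best.items()):
            if pos != i:
                continue
            for j in range(n - i):
                if table[i][j]:
                    syl = word[i:i + j + 1]
                    score = lp + math.log(BIGRAM.get((prev, syl), 1e-9))
                    state = (i + j + 1, syl)
                    if state not in best or score > best[state][0]:
                        best[state] = (score, (pos, prev))
    # Pick the best state covering the whole word and trace back.
    end = max((s for s in best if s[0] == n), key=lambda s: best[s][0])
    path, state = [], end
    while state[1] != "<s>":
        path.append(state[1])
        state = best[state][1]
    return path[::-1]

print(viterbi_syllables("telephone"))
```

Because states are indexed by position and last syllable, the work grows with the number of table entries rather than the number of complete segmentations.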
In this way it is possible to calculate all the possible syllable sequences which apply to the given word without being overwhelmed by a search for all possible syllable sequences.

A methodology for constructing a syllabifier

The following methodology can be used to build a practical implementation of the technique outlined above:

1. Collect a list of possible syllables.

2. From the observed data of orthographic-syllabic word pairs, construct an initial estimate of P(S) = Π P(s_i | s_{i-1}). This is the bi-gram model of syllable sequences.

3. Using another list of words, not present in the initial training data, use the Forward-Backward algorithm to improve the estimates of the bi-gram model. (This step is optional if the original data is sufficiently large, since the hand-annotated text may be superior to the maximum likelihood solution generated by the Forward-Backward algorithm.)

To decode a given orthographic word into its underlying syllable sequence, first construct a table of the possible syllables in the manner given above. Use the variant of the parsing algorithm described above to obtain a value for the most likely syllable sequence which could have given rise to the observed spelling, in a way consistent with the Viterbi algorithm for strict HMMs.

Training and testing the model

A large collection of words was obtained for which orthography, syllable boundaries and pronunciations were available[11], ultimately from a machine-readable dictionary, the Collins English Dictionary. As described[11], the original data was extracted from a type-setting tape in which the words were listed in the usual forms with abbreviations, run-ons, and other typographical devices. These were first regularised by a combination of human and programmed conversion, so that no difficulties were encountered in the current experiment.

The word entries were then divided into training data (220,000 words) and test data (5,000 words) by randomly extracting words. It was observed that the 220,000 words in the training text were composed of sequences of syllables taken from a set of 27,000 unique syllable types. An initial estimate of the syllable bi-gram model can be directly computed by observation. This initial model was able to decode the training data with 96% accuracy and the test data with 89% accuracy. This indicates the requirement for a smoothing technique to generalise the parameters of the bi-gram syllable model. Such smoothing may reduce the accuracy of the model on the training data, but should improve it on the test data.

A further 100,000 words, not previously seen in the dictionary, were obtained from a corpus of 100 million words of newspaper articles (available on published CD-ROM from the Guardian and Independent newspapers). Numeric items, formatting words, and other textual items not suitable for this test were omitted. Assuming that no new syllable types are required to model this data, the training procedure described above was used to adapt the initial statistics obtained by direct inspection. The performance of the model on the training text was 94% and on the test data 92%. This indicates that some generalisation had occurred which made the model less specific to the initial training text, but more robust on the test text.

The effect of this syllable model on the overall pronunciation system is as follows. The basic decision tree transcription system, when working directly from orthography to phonemes, has a word-correct accuracy of 86% on training text and 78% on test data. (The result for training data is not 100%, as expected, because of smoothing and other generalisations in the decision tree construction process.) With the use of syllables as marked, and a new decision tree grown on the syllable-marked training data, the overall system has a word accuracy rate of 92% on the training data and 89% on the test data.
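Step 2 of the methodology, the initial bi-gram estimate computed by direct observation, amounts to relative-frequency counting over the annotated words. A sketch follows; the annotated training words are invented examples standing in for the dictionary data.

```python
from collections import Counter

# Illustrative sketch of the initial bi-gram estimate: count syllable
# transitions in (hypothetical) syllable-annotated words and normalise.

TRAINING = [["tele", "phone"], ["top", "hat"], ["tele", "graph"], ["loop", "hole"]]

def estimate_bigrams(annotated_words):
    """P(s_i | s_{i-1}) = count(s_{i-1}, s_i) / count(s_{i-1} as predecessor),
    with "<s>" standing for the start of a word."""
    pair, single = Counter(), Counter()
    for syls in annotated_words:
        prev = "<s>"
        for s in syls:
            pair[(prev, s)] += 1
            single[prev] += 1
            prev = s
    return {bg: c / single[bg[0]] for bg, c in pair.items()}

model = estimate_bigrams(TRAINING)
print(model[("<s>", "tele")])   # 2 of the 4 words start with "tele"
print(model[("tele", "phone")]) # "phone" follows "tele" in 1 of 2 cases
```

An estimate of this kind assigns zero probability to unseen transitions, which is exactly why the smoothing discussed above is needed on test data.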

Conclusions

A method of determining syllable boundaries has been shown. The method can be improved by the use of a tri-syllable model and by the use of more training data. Other extensions could be explored quite easily. The method does not find new syllable types; for this some type of unsupervised clustering method is required. The method leaves unsolved the treatment of unusual or idiosyncratic textual conventions, notations, and numeric information. It seems that rule-based techniques will still be needed.

While the more serious questions still to be answered for TTS systems lie elsewhere, for example in prosody[10], the inability of systems to perform transcription with high accuracy makes this still an open question. The problem of transcription is also of interest in Speech Recognition[6], where there is a need to generate phonetic baseforms of words which are included in the recogniser's vocabulary. In this case the work required to generate a pronouncing dictionary for a large vocabulary in a new domain, including many technical terms and new jargon not previously seen, calls for an automatic, rather than manual, technique.

In the wider context the method applied here is another example of self-organising methods applied to Natural Language Processing. While these methods have found a fundamental place in speech processing (for example, speech recognition), they have yet to be seriously adopted for language processing. It is a possibility that many more specific tasks in language processing may be amenable to treatment by self-organising methods, with a consequent improvement in the reliability and ease of replication of the NLP systems which incorporate them.

References

1. J.Allen, M.S.Hunnicutt and D.Klatt, From Text to Speech, Cambridge University Press, Cambridge, 1987.
2. S.M.Lucas and R.I.Damper, Syntactic neural networks for bi-directional text-phonetics, pp 127-141 in Talking Machines, ed G.Bailly and C.Benoit, North Holland, 1991.
3. W.A.Ainsworth and N.P.Warren, Application of Multilayer Perceptrons in Text-to-Speech Synthesis Systems, pp 256-288 of Neural Networks for Vision, Speech and Natural Language, ed D.J.Myers and C.Nightingale, Chapman Hall, 1992.
4. S.Parfitt and R.A.Sharman, A bi-directional model of English Pronunciation, pp 801-804, Proceedings of EuroSpeech 91, Genoa, 1991.
5. L.R.Bahl, P.V.deSouza, P.S.Gopalakrishnan, D.Nahamoo and M.A.Picheny, Context-dependent Modelling of Phones in Continuous Speech using Decision Trees, IEEE ICASSP 1992.
6. F.Jelinek et al., The development of a large-vocabulary discrete word Speech Recognition system, IEEE Trans. Speech and Signal Processing, 1985.
7. L.Rabiner, Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc IEEE, vol 77, no 2, pp 257-286, 1989.
8. S.R.Hertz, J.Kadin and K.J.Karplus, The Delta Rule Development System for Speech Synthesis from Text, Proc IEEE, vol 73, no 11, pp 1589-1601, 1985.
9. K.Church, Pronouncing proper names, ACL Chicago, 1985.
10. R.Collier, H.C.Van Leeuwen and L.F.Willems, Speech Synthesis Today and Tomorrow, Philips Journal of Research and Development, vol 47, no 1, pp 15-34, 1992.
11. S.G.Lawrence and G.Kaye, Alignment of phonemes with their corresponding orthography, Computer Speech and Language, vol 1, pp 153-165, 1986.
12. M.Giustiniani, A.Falaschi and P.Pierucci, Automatic inference of a Syllabic Prosodic Model, Eurospeech, pp 197-200, 1991.
13. B.Merialdo, On the locality of the Forward-Backward Algorithm, IEEE Transactions on Speech and Audio Processing, vol 1, no 2, pp 255-257, April 1993.
14. A.V.Aho and J.D.Ullman, The theory of parsing, translation and compiling, Prentice-Hall, 1972.

