Professional Documents
Culture Documents
05267587
05267587
84
The Radio and Electronic Engineer, Vol. 40. No. 2, August 1970 73
L C. KELLY
3. Speech Synthesis
Having considered how the various speech sounds
are produced it is now possible to see how this
knowledge can be applied to the synthesis of speech.
Most of the earliest speech synthesizers were
mechanical, and in fact constructed as long ago as
the 18th century. Alexander Bell in the late 1800s
1-5 3-0 made a speech synthesizer by making a cast of a
FREQUENCY IN kHz
human skull and moulding the vocal tract and cords
Fig. 4. Calculated amplitude spectrum for one of the vocal from rubber and similar materials. Mechanical
periods in the previous figure. models however are very difficult to control and
August 1970 75
L. C. KELLY
progress has only been made with the advent of electrical terms as a generator producing a periodic
electronic speech synthesizers. waveform rich in harmonics during voiced sounds,
Speech production can be thought of as a combina- and a random noise source during unvoiced sounds.
tion of two functions; firstly the generation of a Many synthesizers do not provide the mixed excitation
sound source and secondly the modification of the signal required for the voiced fricative sounds such
sound from this source by the vocal tract. These two as z and v. In practice it is found that there are
functions are given names taken from general circuit sufficient other perceptual clues for these sounds to be
theory, and are known as the excitation and system correctly identified in spite of this restriction. The
functions respectively. Most speech synthesizers system function (corresponding to the vocal tract) can
depend on this idea of the separation of the speech be represented by a linear, time-varying four-terminal
signal into an excitation function and a system network terminated in a resistance representing the
function. The excitation function can be represented in radiation resistance of the mouth. Control parameters
AMPLITUDE
CONTROL CIRCUITS
BAND-PASS FILTER
- ' yf
PITCH UNVOICED
4 0 0 0 Hz
FREQUENCY _ PULSE NOISE ENERGY
/
GENERATOR GENERATOR /
3 4 0 0 Hz
k
V i FIXED
1
V/UV SWITCH
RESONANT
t CIRCUIT
4 th FORMANT
3 5 0 0 Hz
* /
FIXED FREQUENCY
4th FORMANT AMPLITUDE A 4
VARIABLE
RESONANT
FREQUENCY F3 CIRCUITS
3NI FORMANT
PHOTO-
ELECTRIC AMPLITUDE A5 OUT
TAPE >f
READER
RANGE
AND
1500 Hz 3 rd FORMANT
DECODER
/ /
TO
3 3 0 0 Hz
FREQUENCY F2
2nd FORMANT
AMPLITUDE A 2
\/ \
RANGE
' / • 700 Hz 2 nd FORMANT
/ TO
2 6 0 0 Hz
FREQUENCY F,
1st FORMANT
AMPLITUDE A\
\f f
RANGE
lOOHz 1 st FORMANT
/
/ TO
1000 Hz
Fig. 6. Block diagram of a formant synthesizer. This synthesizer is normally controlled from a
punched paper tape generated by a digital computer.
76 The Radio and Electronic Engineer. Vol. 40, No. 2
SPEECH AND VOCODERS
are used to vary the frequency response or spectral term spectral envelope of the speech signal. Para-
envelope of the four-terminal network to produce meters are extracted from these measurements, and
synthetic speech. Speech synthesizers of this type have transmitted with considerably less bandwidth than
been known for some time and in some instances the required by the speech signal. At the receiver speech
network corresponding to the vocal tract has been a is synthesized from the transmitted parameters. It is
lumped approximation to an electrical transmission line the choice of these parameters that distinguishes the
with parameters corresponding'to the dimensions of the different types of vocoders.
vocal tract that can be varied. Other types of syn- 5.1. Formant Vocoders
thesizer have been simpler in that they approximate
the response of the vocal tract with simple resonant The formant vocoder analyser attempts to measure
circuits. These are usually given the name formant in real time the frequency and amplitude of spectral
synthesizers. A block diagram of a typical formant peaks of the speech signal (the formants) and transmits
synthesizer is shown in Fig. 6. these measurements as parameters to control a
synthesizer similar to the one shown in Fig. 6. At
4. Formant Synthesizer the same time the larynx vibration rate and the
Formant synthesizers are of two basic types. Those decision as to whether the speech is voiced or un-
in which the resonant circuits are cascaded are called voiced must also be measured and transmitted as
serial synthesizers, and those whose outputs are additional control parameters. It is generally recog-
combined in parallel are known as parallel syn- nized that good speech quality can be obtained from
thesizers. The particular synthesizer shown in Fig. 6 a formant synthesizer, but when connected to a real
was designed to operate from a punched paper tape time analyser the results obtained to date have always
generated by a digital computer. It is a parallel-type been inferior. This is mainly because of the difficulty
formant synthesizer and single tuned circuits are in measuring formant frequencies accurately. Figure 6
used to generate the formant peaks. The centre shows that the formant frequency ranges overlap and
frequencies of three of the tuned circuits corresponding this tends even more to aggravate the problem. The
to Fu F2, and F3 are dynamically controlled by advantage of the formant vocoder over most other
voltage analogues generated by the computer. The types of vocoder is the bandwidth compression that
bandwidths of the tuned circuits are fixed and are set can be obtained. Typically a formant vocoder needs
to a bandwidth of between 60 and 120 Hz. Analogue eight parameters and each of these can be band-
signals representing the amplitude of the formants limited to about 25 Hz so that a bandwidth reduction
are fed to the amplitude control circuits and at the of the order of 20: 1 can be obtained relative to the
same time a common excitation signal (pulses or original speech band. It should however be empha-
random noise) is fed to all controllers. The output sized that all the difficulties of real-time formant
signal from the amplitude control circuits is therefore analysis are not yet solved and that until they are it is
a wide-band (4000 Hz) excitation signal whose unlikely that formant vocoders will prove to be very
amplitude has been determined by the appropriate useful.
analogue voltage Au A2, or A3. These signals are
fed to the required voltage controlled tuned circuit. 5.2. Spectrum Channel Vocoder
These variable frequency resonators select the band This is the original channel vocoder invented in the
of the excitation signal in the region of their centre 1930s by Homer Dudley of Bell Telephone Labora-
frequency and their outputs are simply added to tories. In the channel vocoder the short-term spectral
provide the synthetic speech. Two other circuits are envelope of the speech signal is represented by samples
added as a refinement to this basic synthesizer; one is spread across the frequency axis. These samples are
a fixed resonant circuit corresponding to a fourth usually obtained from a bank of band-pass filters
formant, and the second a further wide-band filter that whose centre frequencies are spaced across the speech
is switched on only during unvoiced sounds to band. Normally between 10 and 20 filters are used
provide a further improvement in quality. The to give a corresponding number of samples of the
excitation signal is provided either from a 'voltage- spectral envelope; the outputs of the filters are then
controlled' pulse generator during voiced sounds or rectified and smoothed by low-pass filters to give a
from a random noise source during unvoiced sounds. set of time Varying- average signals representing the
short term spectral envelope. As in the formant
The speech quality obtainable from a synthesizer vocoder, pitch analysis has also to be performed.
of this type can be very good indeed if great care is The resulting signals are then transmitted to the
taken to provide the correct control parameters. vocoder synthesizer and are used to control the fre-
5. Vocoders quency response of what is in effect a time-varying
A vocoder is a device that performs measurements band-pass filter that is fed with a spectrally flat
on speech that are in some way related to the short- excitation signal. The block diagram of a typical
August 1970 11
L C. KELLY
Many schemes have been devised for the measure- tremely complicated to implement, and something
ment of fundamental frequency. Early vocoders simpler is desirable.
used a simple low-passfilterto extract the fundamental Yet another approach to pitch extraction uses the
component from the speech waveform; a frequency philosophy that one simple measurement of pitch is
meter was then used to derive an analogue of the pitch unlikely to be satisfactory, but by combining the
frequency. With a high-quality input signal and answers of several fairly simple measurements and
carefully spoken speech fairly good results could be taking a majority vote on the results a useful pitch
obtained. However more elaborate schemes are extractor can be obtained. Systems using this principle
necessary for greater tolerance to various input have been described by Gold5 of the MIT Lincoln
signals. One such scheme that gives better results Laboratory and also by Gill6 of J.S.R.U.
uses a 'tracking' low-pass, or band-pass filter to
follow changes in the fundamental frequency of the In Gill's system, measurements are made of the
speech waveform. Pitch extractors of this type can interval between major peaks in the speech waveform
work well over a restricted frequency range if they in four frequency bands in the range 0-600 Hz. These
have as their input a high quality speech signal of good period measurements are converted into voltage
signal/ noise ratio. In the last decade autocorrelation analogues, each analogue representing the logarithm
techniques have been used to measure voice pitch. Simple of the larynx vibration frequency over a range from
autocorrelation of the speech waveform, however, 37-5-600 Hz. The analogue voltages of the four
does not give particularly good results during fast channels are compared with each other and also with
pitch inflexions, or during rapid formant transitions. the last transmitted measurement three times every
Some pre-processing of the signal before autocorrela- 20 ms. If during this time a good measure of agree-
tion may be expected to improve measurements, and ment is obtained the majority measurement is used
Gill2 has reported a successful pitch extractor that for the next transmitted value of pitch. If good
autocorrelates 'envelopes' derived from the speech agreement is not obtained but the sound is judged to
waveform. More recently Sondhi3 has suggested the be voiced then a 'no-confidence' signal is transmitted
autocorrelation of centre clipped speech. Centre to the receiver causing it to continue using the previous
clipping the speech removes most of the zero crossings measurement.
from the waveform and this has the effect of com- The voiced-unvoiced decision is made by comparing
pressing or flattening the spectral envelope. This the energy in the band 200-600 Hz with that in the
centre-clipped or spectrally-flattened speech is then range 5000-7000 Hz. A high ratio of low frequency
fed to an autocorrelator whose task is made much to high frequency energy indicates a voiced sound
easier than a correlator that has to operate directly and a high ratio of high frequency to low frequency
on the speech waveform. energy an unvoiced sound. This decision is influenced
by the pitch measurement; when a confident pitch
A different approach to fundamental frequency measurement is obtained the low frequency/high
extraction has been described by Noll4 and is known frequency ratio can be much lower and still interpreted
as 'cepstrum'. The cepstrum is the Fourier transform as a voiced sound.
of the logarithm of the power spectrum of a signal.
Because a speech waveform is nearly periodic during At the receiver the pitch analogue signal is passed
a voiced sound it has an approximation to a line- through a 10 Hz low-pass filter, the cut-off frequency
spectrum; this spectrum has periodic ripples in it at of this filter is switched to a higher value at the
the 'line' spacing, which corresponds to the funda- beginning of each voiced sound in order to get the
mental frequency. Taking the logarithm of this analogue signal to its desired new value quickly.
spectrum compresses the peaks due to the formants This pitch extractor will operate with very few gross
and gives an equal weight to the ripples in the high- pitch errors or errors of voicing detection, on rapid
energy and low-energy regions. The Fourier transform conversational speech.
of this log-spectrum exhibits a large peak correspond- The formant vocoder and the channel vocoder are
ing to the pitch frequency.
the most well known of the analysis/synthesis systems
In a practical cepstrum pitch extractor, log-spectra but several other types of vocoder do exist and a few
are continuously generated and the Fourier transforms of these will now be described.
calculated; the positions of the resultant cepstral
peaks are then used to provide an estimate of the 5.5. Voice Excited Vocoders
pitch period. Since the logarithmic spectrum will Many workers have attempted to improve the
have a strong periodic component even when the speech quality of the channel .vpcoder. One useful
fundamental is missing, this method will give good solution which considerably eases the problem of
results for telephone quality speech. Cepstrum pitch pitch extraction and at the same time improves the
extractors, although apparently successful, are ex- received speech quality is the voice excited vocoder.
August 1970 79
L C. KELLY
The voice excited vocoder differs from the channel Synthetic speech produced by the autocorrelation
vocoder in two main ways. Firstly a baseband of vocoder will have a spectral envelope corresponding
natural speech is transmitted (usually in the range to the square of the spectrum of the original speech
200-800 Hz), and is used at the receiver instead of signal, since in the analyser the speech signal was
synthetic speech for the lower part of the spectrum, multiplied by itself. This results in making the
and secondly at the receiver the baseband is used after spectral peaks more pronounced than they normally
some processing to provide the vocoder excitation would be, and gives the synthetic speech a rather
signal for the frequency region above the baseband. unnatural 'bouncy' quality. This defect may be over-
Channel signals for the band above the baseband are come by using a complex equalizer which effectively
transmitted as in a normal channel vocoder. takes the square root of the input signal spectrum.
Generation of a suitable flat spectrum excitation Figure 8 shows a block diagram of an autocorrelation
signal from the baseband requires some form of non- vocoder.
linear processing, often called spectral flattening, and
there are several methods for doing this. A simple 5.7. Harmonic Vocoder
example of spectralflatteningis half-wave rectification The harmonic vocoder or harmoniphonefirstreport-
followed by a suitable weighting network to produce ed by Pirogov8 is yet another variation of the vocoder
a spectrum which is approximately flat in the required principle. In this vocoder the short term spectral
band. envelope of the speech signal is expanded into a
Generally speaking voice-excited vocoders provide Fourier series; there are several methods for achieving
much better quality speech than simple channel this, but a fairly straightforward way is by means of a
vocoders. Some of this improvement is obviously resistor matrix connected to the low-pass filtered
due to the fact that part of the natural speech is being channel signals of a normal channel-vocoder analyser.
directly transmitted, but the rest of the improvement The time-varying Fourier coefficients obtained from
obtained is probably due to the inherently more this matrix are then used after transmission to control
accurate excitation signal obtained from the spectral a speech synthesizer.
flattening process. The voice excited vocoder will The synthesizer takes the form of a set of wide-band
give a bandwidth compression of about 3: 1 com- non-dispersive, interconnected delay lines. This
pared with the 10: 1 obtained from a typical channel network has frequency responses of sin nDco, cos nDco,
vocoder. available at a series of tapping points; n takes the
value 0, 1, 2, etc., and D is the delay of one delay
5.6. Autocorrelation Vocoder section. The excitation signal is connected to this
The autocorrelation function of a signal is the network, and the signals at the tapping points are
Fourier transform of its power spectrum. It therefore multiplied by the appropriate Fourier coefficient and
follows that the short term spectrum of a speech are then summed in order to produce the synthetic
signal can be represented by a set of short term speech.
autocorrelation functions. This is the basis of the An interesting feature of this type of vocoder is
autocorrelation vocoder, originated by Schroeder.7 that each single coefficient affects the whole spectrum
In this type of vocoder, analysis is performed by of the synthetic speech, and not just a portion of it
multiplying the signal by itself at various delays (delay as in the channel vocoder. It is found in practice
increments must be less than half the reciprocal of the that the higher coefficients have a progressively
required speech bandwidth). The multiplier output smaller effect on the spectrum than the lower ones.
signals are then low-pass filtered to about 25 Hz to This would generally mean that errors in transmission
produce the short term autocorrelation channel of coefficients near the spectrum fundamental would
signals. These signals after transmission are used to have a more serious effect on the speech quality,
control a speech synthesizer. The synthesizer is a and therefore one would expect that these coefficients
time varying filter, whose impulse response is a would need to be transmitted more accurately than
replica of the short term autocorrelation function for those corresponding to the higher harmonics of the
the range of delays used in the analyser. Synthesis is spectrum shape.
achieved by 'sampling' the channel signals with the
excitation signal in a set of multipliers, and then 5.8. Chebyshev Vocoder
assembling the samples in the correct time relation- A further refinement of the harmonic vocoder
ship by use of a delay line. Since the correlation principle is the Chebyshev vocoder first reported by
function is even it,wjll be symmetrical about the Kulya.9 This vocoder expands the short term spectral
centre sample. This'means in practice that a delay envelope of the speech into an orthogonal series in
line that gives a complete reflexion at one end can be the same way as the harmoniphone; in this case
used, with a consequent halving of the delay-line length. however the functions used in the expansion are not
ANALYSER SYNTHESIZER
MULTIPLEXING
AND
TRANSMISSION
EXCITATION
SOURCE
SYNTHETIC
DELAY LINE SPEECH
Fig. 8. Block diagram of an autocorrelation vocoder
OUT
(after Schroeder).
COMPLETE
REFLEXION
sine and cosine waves but transformed Chebyshev delay networks that can be simply constructed from
functions. This may at first sight appear to be an RC networks known as Laguerre networks. Speech
unnecessary complication but it does in fact have two quality from a Chebyshev vocoder can be at least as
main advantages. Firstly, the Chebyshev functions good as from a channel vocoder with the same number
scan the short-term spectral envelope in such a way of control parameters, and there is some evidence to
that more accuracy is obtained at the low-frequency show that if we are restricted to a small number of
end of the spectrum. Since the resolution of the ear is coefficients (e.g. five) that the speech quality can be
believed to be distributed in a similar manner, this better than for the corresponding channel vocoder.
greater low-frequency accuracy would seem a desirable
feature. Secondly, we have seen that synthesis in Figure 9 shows a conventional 18-channel vocoder
terms of Fourier series requires the use of wide-band analyser, modified by the use of a matrix of 180
non-dispersive delay lines which are bulky and resistors, that converts it into a Chebyshev vocoder
difficult to design; it can be shown, however, that the analyser. Figure 10 shows the block diagram of the
use of Chebyshev functions requires a set of dispersive corresponding 10-coefficient synthesizer.
1 , "TO 1 -^T1 \
SPEECH INPUT
2 , 2 RESISTOR
Fig. 9. Block diagram of a
MATRIX 10-coefficient Chebyshev vo-
TRANSMISSION
PATH coder. This was obtained by
: ! i modifying an 18 channel
18 3 vocoder analyser with the
•T10 resistor matrix shown.
PITCH p(t)
EXTRACTOR
August 1970 81
L C. KELLY
/ pit)
TRANSMISSION
PATH