
GAIN NORMALIZATION IN A 4200 BPS

HOMOMORPHIC VOCODER
Jae H. Chung and Ronald W. Schafer
Georgia Institute of Technology
School of Electrical Engineering
Atlanta, GA 30332

Abstract

This paper describes a new technique for coding the gains in a vector excitation homomorphic vocoder. In this system, the excitation signal, which is obtained by analysis-by-synthesis, consists of a part derived from a Gaussian codebook and a part derived from the past excitation. The paper shows how the correlation between the two gain parameters of the excitation can be increased and how they can be jointly coded at a lower bit rate. This new approach makes it possible to reduce the bit rate of the homomorphic vocoder from 4800 bps to 4200 bps with essentially no degradation in speech quality.

1 Introduction

In the original definition of the homomorphic vocoder, an estimate of the time-varying vocal tract impulse response was extracted using the homomorphic filtering procedure depicted in Figure 1.[1] The upper part of the figure depicts the operations required to compute the cepstrum ĥ[n] of the vocal tract impulse response.[1][2] In Figure 1, v[n] is a window sequence (e.g., a Hamming window) which selects a short segment of the speech signal for analysis, and l[n] is a lifter of the form

    l[n] = { 2,  1 ≤ n < n0
           { 0,  otherwise        (n0 = pitch period)                (1)

which extracts the low-time part of the cepstrum as a representation of the vocal tract impulse response. The lower part of Figure 1 depicts the operations for computing the normalized (since ĥ[0] = 0) vocal tract impulse response h[n]. The original homomorphic vocoder also used the cepstrum as the basis for a voiced/unvoiced decision and to estimate the pitch period for voiced speech.[1] At the synthesizer, depicted in Figure 2, an excitation sequence consisting of isolated impulses or random noise was created, and this input was convolved with the estimated vocal tract impulse response to produce the synthetic speech output. The pitch period, the amplitude of the excitation, and the low-time cepstrum values comprise a parametric representation of the speech signal that can be encoded for digital transmission or storage.

The availability of increasingly powerful, inexpensive DSP microcomputers has made it possible to consider much more sophisticated methods for obtaining the excitation signal in vocoders. Multipulse[3], code-excited[4], and self-excited or vector excitation[5] LPC vocoders have been widely studied. These same analysis-by-synthesis methods have also been applied successfully to derive the excitation for a homomorphic vocoder at a bit rate of 4800 bps.[6] The performance of this 4800 bps vector-excited homomorphic vocoder is far superior to that of a pitch-excited homomorphic vocoder and fully comparable to 4800 bps LPC vocoders using analysis-by-synthesis vector excitation.

This paper describes a new method of coding the two gain parameters of a vector excitation homomorphic vocoder. The approach involves a time-varying gain normalization, which transforms the original uncorrelated gain parameters into highly correlated parameters that can be jointly quantized to achieve a significant reduction in bit rate over independent quantization of the two gain parameters. Using this technique, the bit rate of the homomorphic vocoder can be reduced to 4200 bps with little or no degradation when compared to the 4800 bps vocoder with independently quantized gains.

The paper is organized as follows: Section 2 gives a brief review of the analysis-by-synthesis method of obtaining the excitation signal; Section 3 introduces the gain normalization procedure; Section 4 describes a simple procedure for jointly quantizing the two gain parameters of the vector excitation, thereby reducing the bit rate from 4800 bps to 4200 bps; and Section 5 briefly summarizes some conclusions from the research.

322.4.1.  0942  CH2829-0/90/0000-0942 $1.00 © 1990 IEEE
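As background for the analysis described below, the low-time liftering of Eq. (1) and the reconstruction path of Figure 1 can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function name, frame length, FFT size, and pitch period n0 are assumed values.

```python
import numpy as np

def vocal_tract_response(speech_frame, n0, n_fft=512):
    """Estimate a normalized vocal tract impulse response by low-time
    liftering of the real cepstrum, in the spirit of Figure 1.
    n0 is the pitch period in samples; all sizes are illustrative."""
    # Real cepstrum of the windowed frame
    spectrum = np.fft.fft(speech_frame, n_fft)
    cepstrum = np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real
    # Lifter of Eq. (1): keep 2*c[n] for 1 <= n < n0, zero elsewhere.
    # Zeroing n = 0 forces h_hat[0] = 0, i.e., a normalized (unity log-gain)
    # impulse response.
    h_hat = np.zeros(n_fft)
    h_hat[1:n0] = 2.0 * cepstrum[1:n0]
    # Inverse characteristic system: exponentiate in the log-spectral domain
    h = np.fft.ifft(np.exp(np.fft.fft(h_hat))).real
    return h

# Example: a Hamming-windowed 30 msec frame at an 8 kHz sampling rate
x = np.hamming(240) * np.random.randn(240)
h = vocal_tract_response(x, n0=60)
```

Doubling the causal part of the real cepstrum, as the factor 2 in Eq. (1) does, yields the cepstrum of the minimum-phase impulse response with the liftered log-magnitude spectrum.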

2 The Excitation Model

Figure 3 shows a block diagram representation of the analysis-by-synthesis algorithm for determining the excitation signal e[n] for the homomorphic vocoder. The excitation model for a short excitation analysis frame (e.g., 5 msec or 40 samples at the 8 kHz sampling rate) is of the form

    e[n] = β1 f_γ1[n] + β2 e[n − γ2],   0 ≤ n ≤ L − 1,               (2)

and the corresponding perceptually weighted synthetic speech is

    ŷ[n] = β1 x1[n] + β2 x2[n],                                      (3)

where x1[n] = g[n] * f_γ1[n], x2[n] = g[n] * e[n − γ2], and g[n] = w[n] * h[n] is the perceptually weighted vocal tract impulse response. The excitation signal is composed of the following two parts: β1 f_γ1[n], where f_γ1[n] is a zero-mean Gaussian codebook sequence corresponding to index γ1 in the codebook, and β2 e[n − γ2], which represents a short segment of the past (previously computed) excitation beginning γ2 samples before the present excitation frame. Henceforth, β1 will be called the codebook gain and β2 will be called the self-excitation gain.

First, the parameters γ2 and β2 are chosen to minimize the mean-squared error

    E2 = Σn ( y[n] − β2 x2[n] )²,                                    (4)

where y[n] denotes the perceptually weighted original speech. For a given γ2, the value of β2 that minimizes the mean-squared error in (4) is given by

    β2 = ( Σn y[n] x2[n] ) / ( Σn x2²[n] ),                          (5)

where x2[n] = g[n] * e[n − γ2] = w[n] * h[n] * e[n − γ2]. The optimum values for γ2 and β2 are found by an exhaustive search with values of γ2 restricted to a finite range. Then the residual signal y1[n] = y[n] − β2 x2[n] is formed, and γ1 and β1 are chosen by an exhaustive search of the Gaussian codebook to minimize

    E1 = Σn ( y1[n] − β1 x1[n] )².                                   (6)

As before, the value of β1 that minimizes the mean-squared error for a given codebook sequence f_γ1[n] is

    β1 = ( Σn y1[n] x1[n] ) / ( Σn x1²[n] ),                         (7)

where x1[n] = g[n] * f_γ1[n] = w[n] * h[n] * f_γ1[n].

Note that the effect of convolving w[n] with the original speech and with the impulse response before synthesis is effectively to multiply the Fourier transform of the error between the original speech and the synthetic speech by the magnitude squared of the frequency response corresponding to w[n]. When w[n] is properly chosen, the weighting has the effect of concentrating the coding noise in the formant regions of the spectral envelope, thereby making the coding error less perceptible.[7] In the homomorphic vocoder, an appropriate weighting filter can be derived directly from the cepstrum.[6]
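The two-stage search of Eqs. (4)–(7) can be sketched as follows. This is an illustrative reimplementation (the function and variable names are ours, not the paper's); lags shorter than the frame length are handled by periodic extension of the past excitation, a common convention that the paper does not spell out.

```python
import numpy as np

def search_excitation(y, g, past_e, codebook, lag_range):
    """One frame of the analysis-by-synthesis excitation search.
    y        : perceptually weighted target speech (length L)
    g        : weighted vocal tract impulse response w[n] * h[n]
    past_e   : previously computed excitation (most recent sample last)
    codebook : rows are zero-mean Gaussian sequences f_gamma1[n]
    lag_range: candidate self-excitation lags gamma2"""
    L = len(y)

    def weighted(seq):
        # x[n] = g[n] * seq[n], truncated to the current frame
        return np.convolve(seq, g)[:L]

    # Stage 1: self-excitation lag gamma2 and gain beta2, Eqs. (4)-(5)
    best = (np.inf, None, 0.0)
    for lag in lag_range:
        # Segment starting `lag` samples back, tiled periodically if lag < L
        seg = np.resize(past_e[len(past_e) - lag:], L)
        x2 = weighted(seg)
        b2 = np.dot(y, x2) / np.dot(x2, x2)
        err = np.sum((y - b2 * x2) ** 2)
        if err < best[0]:
            best = (err, lag, b2)
    _, gamma2, beta2 = best
    seg = np.resize(past_e[len(past_e) - gamma2:], L)
    y1 = y - beta2 * weighted(seg)          # residual target of Eq. (6)

    # Stage 2: codebook index gamma1 and gain beta1, Eqs. (6)-(7)
    best = (np.inf, None, 0.0)
    for i, f in enumerate(codebook):
        x1 = weighted(f)
        b1 = np.dot(y1, x1) / np.dot(x1, x1)
        err = np.sum((y1 - b1 * x1) ** 2)
        if err < best[0]:
            best = (err, i, b1)
    _, gamma1, beta1 = best
    return gamma1, beta1, gamma2, beta2
```

With the paper's parameters, the caller would pass a 128-row, 40-column codebook and lag_range = range(32, 161).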

3 Gain Normalization

Figures 4(a) and 4(b) show |β1| and |β2| as a function of the frame index. In this example, the system of Figure 1 was used to compute 16 low-time cepstrum values with a vocal tract analysis frame spacing of 20 msec (160 samples). The cepstrum values were vector quantized using a codebook of 256 entries (8 bits/cepstrum).[6] The system of Figure 3 was used to obtain the excitation signal using a 40-dimensional Gaussian codebook with 128 entries (γ1 represented by 7 bits) and with an excitation signal memory (γ2) ranging from 32 to 160 samples. The excitation frame length and spacing were both 5 msec (40 samples).

The behaviors of |β1| and |β2| with time are distinctly different, but the difference is easily understood. First note from (1) that the impulse responses h[n] are all normalized automatically because ĥ[0] = 0. Then recall that β2 is the constant multiplier of a portion of the previously computed excitation that contributes to the excitation in the current frame. Notice in Figure 4(b) that |β2| remains fairly constant near unity, except for large spikes and abrupt dips toward zero. This is to be expected, since in steady-state regions the amplitude in an excitation analysis frame should be about the same as the amplitude in previous frames in that steady-state region. However, in a transition frame from voiced to unvoiced, past excitation amplitudes will be much larger than required, and therefore |β2| will have to be small to compensate. Likewise, in transitions from unvoiced to voiced, the immediate past excitation will be small, while a larger excitation will be required in the current frame. Therefore |β2| must be large to compensate.

In contrast, |β1| tends to track the energy envelope of the speech signal and is somewhat better behaved. Clearly, the amplitude of the residual signal y1[n] will be proportional to the amplitude of the original speech signal. Therefore, since the codebook sequences all have the same energy, |β1| will track the amplitude of y1[n] and therefore also the amplitude of the input speech signal.

Figure 4 shows that |β1| and |β2| are not highly correlated, and therefore it would seem that there is little to be gained by jointly quantizing them. However, recall that |β2| is generally close to unity, and |β1| tends to follow the amplitude of the speech signal and the excitation signal. This suggests that if |β2| is normalized by a function of the previous excitation energy, then the correlation with |β1| can be greatly increased. Indeed, Figure 5(a) shows the parameter α|β2|, where

    α = [ ( Σ_{n=0}^{L−1} e²[n − γ2] ) ( Σ_{n=0}^{L−1} e²[n − L] ) ]^{1/2}     (8)

with L representing the excitation frame length. That is, the gain-normalizing factor α is the geometric mean of the energy of the excitation segment beginning at γ2 and the energy of the just-previous excitation frame. This averaging gives a smoothly varying normalizing factor which, as can be seen from Figure 5(a), converts |β2| into a parameter that varies with time in much the same way that |β1| varies.
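Because α in Eq. (8) depends only on previously computed excitation, it can be recomputed identically at the encoder and the decoder. A minimal sketch (an illustrative reimplementation; the function name is ours):

```python
import numpy as np

def gain_normalizer(e_past, gamma2, L):
    """Gain-normalizing factor alpha of Eq. (8): the geometric mean of the
    energy of the excitation segment beginning gamma2 samples back and the
    energy of the immediately preceding excitation frame of length L."""
    # Segment at lag gamma2, tiled periodically if gamma2 < L
    seg = np.resize(e_past[len(e_past) - gamma2:], L)
    # Just-previous excitation frame
    prev = e_past[len(e_past) - L:]
    return np.sqrt(np.sum(seg ** 2) * np.sum(prev ** 2))

# Example: a constant-amplitude past excitation of 160 samples
alpha = gain_normalizer(np.ones(160), gamma2=80, L=40)  # sqrt(40 * 40) = 40.0
```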
Figure 6 shows the correlation between |β1| and α|β2| more clearly. Indeed, Figure 6 implies that |β1| is proportional to α|β2| to within a constant maximum percentage error. The straight line in Figure 6 is a least-squares fit to the data, which include 1536 frames from four different utterances by four different speakers. This linear fit to the log-log data is given by

    log |β1| = m log( α|β2| ) + c,                                   (9)

where m and c denote the fitted slope and intercept; (9) serves as an approximate relationship between the codebook gain |β1| and the normalized self-excitation gain α|β2|.

    Bit rate     Bits/frame                         Samples/frame
    (bits/sec)   β1   β2    γ1   γ2   cepstrum     excitation   cepstrum
    4800          4    4     7    7       8             40         160
    4200         5 (joint)   7    7       8             40         160

Table 1 Bit allocation for homomorphic vocoders. (In the 4200 bps system the two gains are coded jointly with 5 bits: 3-bit APCM of α|β2| plus one sign bit for each gain.)
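The bit-rate arithmetic behind Table 1 can be checked directly from the frame rates and allocations given in the text:

```python
# Excitation frames: 8000 Hz / 40 samples  = 200 frames/sec.
# Cepstrum frames:   8000 Hz / 160 samples = 50 frames/sec, 8 bits each.
EXC_RATE, CEP_RATE = 8000 // 40, 8000 // 160

# 4800 bps: gamma1 (7) + gamma2 (7) + beta1 (4) + beta2 (4) bits per frame
rate_4800 = EXC_RATE * (7 + 7 + 4 + 4) + CEP_RATE * 8

# 4200 bps: gains jointly coded as 3-bit APCM of alpha*|beta2| plus
# one sign bit each for beta1 and beta2 -> 5 bits total for the gains
rate_4200 = EXC_RATE * (7 + 7 + 3 + 1 + 1) + CEP_RATE * 8

print(rate_4800, rate_4200)  # prints "4800 4200"
```

The difference, 3 bits per excitation frame at 200 frames/sec, is exactly the 600 bps reduction reported below.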

4 Quantization for 4200 bps

The 4800 bps homomorphic vocoder that was previously reported used the bit allocation scheme given in the first row of Table 1. In the 4800 bps vocoder, each of the two gain parameters |β1| and |β2| was coded using a 3-bit APCM coder.[8] The results of the previous section suggest that the total bit allocation can be reduced by jointly coding |β1| and α|β2|. Clearly, many schemes can be found to take advantage of the correlation illustrated in Figure 6. One approach is simply to code α|β2| using 3-bit APCM. (Figure 5(b) shows an example of this 3-bit quantization.) This information, together with one bit each for the signs of β1 and β2, completes the representation for a total of five bits instead of eight. With an excitation frame rate of 200 frames/sec, this results in a reduction of 600 bps.

At the receiver, |β2| is derived from the quantized version of α|β2| by dividing by α, which is derivable using (8) from the past excitation. Then |β1| is obtained from the derived |β2| through (9). As an illustration, Figures 7(a) and 7(b) show the decoded |β1| and |β2|, respectively, for the corresponding parameters in Figures 4(a) and 4(b).

The performance of the 4200 bps homomorphic vocoder is virtually identical to that of the 4800 bps version. This is confirmed by careful listening tests and by the fact that, over a range of speakers and utterances, the signal-to-noise ratio decreases by less than 0.4 dB in going from 4800 to 4200 bps.
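The receiver-side recovery of the gains can be sketched as follows. This is an illustrative reimplementation: the APCM quantizer itself is omitted (the decoded value is passed in directly), and the fit coefficients of Eq. (9) are placeholders, not the paper's actual least-squares values.

```python
import numpy as np

def decode_gains(q_norm_gain, sign_b1, sign_b2, e_past, gamma2, L, fit):
    """Recover beta2 and beta1 from one quantized normalized gain.
    q_norm_gain : decoded alpha*|beta2| (output of the 3-bit APCM decoder)
    sign_b1/b2  : +1 or -1, the transmitted sign bits
    fit         : (slope, intercept) of the log-log fit of Eq. (9);
                  placeholder values, not the paper's fitted constants."""
    # Same normalizer as the encoder, Eq. (8), from past excitation only
    seg = np.resize(e_past[len(e_past) - gamma2:], L)
    prev = e_past[len(e_past) - L:]
    alpha = np.sqrt(np.sum(seg ** 2) * np.sum(prev ** 2))
    beta2 = sign_b2 * q_norm_gain / alpha
    # |beta1| from the approximate log-log relationship of Eq. (9)
    m, c = fit
    beta1 = sign_b1 * np.exp(m * np.log(q_norm_gain) + c)
    return beta1, beta2
```

Since the decoder's past excitation is built from previously decoded frames, encoder and decoder compute the same α and the division is exactly invertible up to the APCM quantization error.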

5 Conclusions

We have discussed a scheme for coding the codebook gain and the self-excitation gain in a vector-excitation homomorphic vocoder. The proposed method permits a significant reduction in bit rate for the homomorphic vocoder with virtually no loss of quality. Furthermore, the method is applicable to any vocoder using the analysis-by-synthesis method of excitation analysis. Future work will consider other variations on the normalization scheme as well as other methods of jointly quantizing the two gain parameters.

References

[1] A. V. Oppenheim, "Speech analysis-synthesis system based on homomorphic filtering," J. Acoust. Soc. Am., vol. 45, pp. 458-465, Feb. 1969.

[2] A. V. Oppenheim and R. W. Schafer, "Homomorphic analysis of speech," IEEE Trans. on Audio and Electroacoustics, pp. 118-123, June 1968.

[3] B. Atal and J. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 614-617, 1982.

[4] M. R. Schroeder and B. Atal, "Code-excited linear prediction (CELP): high-quality speech at very low bit rates," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 937-940, 1985.

[5] R. Rose and T. P. Barnwell, III, "Quality comparison of low complexity 4800 bps self excited and code excited vocoders," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 1637-1640, 1987.

[6] J. H. Chung and R. W. Schafer, "A 4.8 kbps homomorphic vocoder using analysis-by-synthesis excitation analysis," Proc. Intl. Conf. on Acoustics, Speech, and Signal Processing, pp. 144-147, 1989.

[7] B. S. Atal and M. R. Schroeder, "Predictive coding of speech signals and subjective error criteria," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 3, pp. 247-254, June 1979.

[8] N. S. Jayant, "Adaptive quantization with a one word memory," Bell System Tech. J., pp. 1119-1144, September 1973.

Figure 1 Homomorphic filtering for estimating the vocal tract impulse response.

Figure 2 Synthesizer for homomorphic vocoder.

Figure 3 Analysis-by-synthesis method for obtaining the excitation sequence e[n].


Figure 4 (a) Unquantized codebook gain |β1|. (b) Unquantized self-excitation gain |β2|. (Horizontal axis: excitation frame index.)

Figure 6 Illustration of correlation between |β1| and α|β2|. (Log-log axes; horizontal axis: normalized self-excitation gain.)

Figure 5 (a) Unquantized normalized self-excitation gain α|β2|. (b) Quantized normalized self-excitation gain. (Horizontal axis: excitation frame index.)

Figure 7 (a) Quantized codebook gain. (b) Quantized self-excitation gain.

