Entropy-Constrained Temporal Decomposition


Ki-Seung Lee, Member IEEE

Abstract

This paper presents a new temporal decomposition (TD) method based on a rate-distortion criterion.
The sequence of target vectors is efficiently determined using a dynamic programming procedure.
This procedure minimizes not only the spectral distance between the original and the interpolated
spectral sequences but also the overall entropy of the TD parameters in a given interval. An iterative
algorithm is also developed for designing a set of interpolation functions having minimum distortion
subject to an entropy constraint for the training spectral sequence. This iterative algorithm alternates
between updating the target positions of the training spectral sequence using an a priori set of
interpolation functions and updating the set of interpolation functions using these target sequences.
In experiments, the performance of the proposed TD method for speech coding is compared to that
of conventional coding methods. Simulation results show that the proposed method outperforms
a frame-by-frame vector quantization scheme in both objective and subjective
evaluations.

Index Terms

Temporal decomposition, Spectral interpolation, Rate-distortion criterion.

Ki-Seung Lee is with the Department of Electronic Engineering, Konkuk University, 1 Hwayang-dong, Gwangjin-gu, Seoul,
143-701, Korea. (e-mail : kseung@konkuk.ac.kr)

July 13, 2005 DRAFT



I. INTRODUCTION

Speech conveys communicative information including the meaning of the message being uttered and
the speaker identity. The former is due to a sequence of phonemic commands, which are a major
source of the articulatory movements [1]. The average number of phonemic commands per second,
generally known as the phoneme rate, is 10-15 sounds/sec [2]. It is also known that the articulatory
movements due to one phonemic command overlap with those due to the phonemic commands of its temporal
neighbors. These principles can be applied to representing a time sequence of speech parameters. One
example of this is the temporal decomposition (TD) technique [4]-[16].
In TD, a time sequence of speech parameters is decomposed into a series of time-overlapping
interpolation functions and an associated series of data vectors, which are referred to as "target
vectors". The speech parameters at a given time are given by a linear combination of the neighboring
target vectors and the underlying interpolation functions. In general, the target vectors are not
uniformly spaced, and the number of target vectors is relatively small compared to the number of
speech parameters produced by frame-by-frame analysis/synthesis schemes. Hence, the information
required to represent a speech signal in TD is smaller than that of frame-by-frame analysis/synthesis
schemes. This means that automatic speech recognition (ASR) or speech coding can be implemented with a
reduced number of speech parameters by applying TD. In practice, most previous works associated with TD
were dedicated to lowering the speech coding rate [4][7][8][11][13]-[16]. Another potential application
of TD is based on the assumption that the location of each target vector corresponds to a phonetic
event. Accordingly, TD can be used as a preprocessing step for ASR [5][9][10]. The method proposed here
mainly focuses on the coding aspects of TD.
The problem of TD can be formulated in terms of how to determine the interpolation functions and
how to find the locations of the optimum targets in the sense of satisfying a given criterion.
A TD technique based on SVD (Singular Value Decomposition) was proposed by Atal [4]. The underlying
principle of this work is that finding a set of target vectors can be formulated as estimating a set
of basis vectors which span the column space of the matrix whose columns correspond to the given
spectral feature vectors. This method was adopted to represent a sequence of Log Area Ratios (LAR),
which are one of the parameter sets representing the vocal tract transfer function. The author reported
an average bit rate of 690 bits/s while keeping the error in the log area parameters below 0.10.
Following Atal's pioneering work, a number of extensions have been explored. A linear
interpolation-based TD method was discussed in [6], where a breakpoint analysis is formulated as a
geometric interpretation in a multidimensional parameter space. To avoid the heavy computational load
of computing the SVD, simpler methods for estimating the target vectors have also been proposed.
Niranjan and Fallside have explored a TD-like algorithm in which the locations of the target vectors
were determined by finding the points of maximum spectral stability [5]. A similar approach was
employed in [8], where an initial set of target vectors was constructed by using a spectral stability
measure.
For estimating the interpolation functions, a minimum mean square error criterion was employed in
[4]. Kappers and Marcus devised an iterative algorithm allowing for variable-length interpolation
functions [7]. A model-based approach was also investigated to represent the interpolation functions.
In this approach, the underlying principle is that the interpolation functions vary smoothly as a
function of time. In [12], an interpolation function is approximated as a flat-top Gaussian function.
A simpler representation of the interpolation functions was discussed in [6], where the interpolation
function is represented by a straight line. In [16], prototype interpolation functions for various
lengths were constructed via off-line processing. In this procedure, a minimum mean square error
criterion was used, similar to [4]. An interpolation function is then selected according to the
interval between the two targets via on-line processing. A constrained TD method was also proposed by
Kim et al. [15], where the interpolation function is estimated to minimize the overall distortion
while maintaining the ordering property of the LSP parameters.
Previous TD techniques have been shown to be capable of producing acceptable approximations of
spectral parameters; their efforts have been mainly dedicated to minimizing the distortion between
the original spectral sequence and the approximated one. However, fundamental issues for application
to a coding paradigm, such as the number of bits used to represent the parameters and a quantitative
analysis of the approximation errors according to the bit allocation, have not been sufficiently
discussed in previous techniques. This is one of the reasons why TD techniques are not widely employed
in applications associated with speech coding. In [16], a TD method was introduced which minimizes the
number of bits while keeping the maximum spectral distortion below a given threshold. The performance
of this method was confirmed in the rate-distortion plane, where an average spectral distortion of
about 1.4 dB was achieved at an average bit rate of about 8 bits/frame. In this method, the number of
bits allocated to represent each interpolation function is fixed. This means that the overall bit
count of the method in [16] could be further reduced by employing an entropy coding scheme to represent
the interpolation functions. Moreover, since segmentation is not involved in designing the set of
interpolation functions, the target vectors and the interpolation functions are not jointly optimized
in the sense of a minimum mean square error criterion.


In this paper, a new TD method is presented which takes into account not only the spectral distortion
but also the overall entropy of the TD parameters. The underlying criterion is to minimize the weighted
sum of the spectral distance between the original and the interpolated spectral sequences and
the overall entropy of the TD parameters representing the spectral sequence. This method enables us to
implement a TD scheme which can control the relative importance of the spectral distortion and the
number of bits required to code the TD parameters. An iterative descent algorithm is also
introduced for designing a set of interpolation functions, where segmentation and re-estimation of the
interpolation functions are iteratively performed to minimize the weighted sum of the overall distortion
and the overall entropy.
The proposed TD method has been applied to compressing LSF (Line Spectrum Frequency) coefficients,
and experimental results are shown which provide an evaluation of the effectiveness of the
proposed TD technique. To confirm the usefulness of the proposed TD scheme, we performed experiments
on compressing the spectral parameters of a long-duration speech corpus which was originally used
for a corpus-based TTS (Text-To-Speech) synthesis system. An informal listening test was also carried
out on the speech signals reproduced by the proposed method and by conventional coding schemes.
The outline of this paper is as follows: Section II presents the proposed segmentation method, which
is implemented by a dynamic programming technique. An iterative method used for building a set of
interpolation functions is introduced in Section III. Both objective and subjective results, which
evaluate the effectiveness of the proposed temporal decomposition method, are presented in Section IV.
Section V concludes this work and suggests some future directions.

II. ENTROPY-CONSTRAINED SEGMENTATION

In TD, a sequence of $N$ spectral feature vectors is represented by $K$ target vectors and $K$ underlying
interpolation functions, as follows:

$$\hat{y}(n) = \sum_{k=1}^{K} a_k \phi_k(n) \tag{1}$$

where $a_k$ and $\phi_k(n)$ are the $k$-th target vector and the $k$-th interpolation function,
respectively. In general, the number of target vectors $K$ is much smaller than the number of spectral
vectors $N$ ($K \ll N$). In this work, target vectors are obtained by sampling the given spectral
feature vectors at specific points (target points). Assuming that the $k$-th target vector is given by
the $n_k$-th spectral feature vector and that the $k$-th interpolation function is defined on the
limited interval $[n_{k-1}, n_{k+1}]$, the approximated feature vectors $\hat{y}(n)$ can be represented by

$$\hat{y}(n) = a_k \phi^R_k(n - n_k) + a_{k+1} \phi^L_{k+1}(n - n_k) = y(n_k) \phi^R_k(n - n_k) + y(n_{k+1}) \phi^L_{k+1}(n - n_k) \tag{2}$$

where $n_k \le n \le n_{k+1}$. A graphical explanation of the proposed TD scheme is shown in Fig. 1. As
shown in this figure, $\phi^R_k(n)$ and $\phi^L_{k+1}(n)$ in (2) are the right side of the $k$-th
interpolation function and the left side of the $(k+1)$-th interpolation function, respectively. The
interval between the two targets uniquely defines the interpolation function, hence $\phi_k(n)$ can be
represented as $\phi_{N_k}(n)$, where $N_k = n_{k+1} - n_k$.

[Figure 1: targets $a_{k-1}$, $a_k$, $a_{k+1}$ at positions $n_{k-1}$, $n_k$, $n_{k+1}$, with the interpolation function halves $\phi^R_{k-1}(n)$, $\phi^L_k(n)$, $\phi^R_k(n)$, $\phi^L_{k+1}(n)$ spanning the intervals $N_{k-1}$ and $N_k$.]

Fig. 1. Spectral feature vector representation using the proposed TD scheme
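As a rough illustration of (2), the following Python sketch reconstructs the frames between two targets from a pair of interpolation halves. The function names and the straight-line halves are assumptions made for this example only (a straight line is the representation attributed to [6] in the Introduction); the interpolation functions of the proposed method are trained as described in Section III.

```python
def interpolate_segment(y_left, y_right, phi_right, phi_left):
    """Approximate the frames between two targets as in Eq. (2).

    y_left, y_right : target vectors y(n_k) and y(n_{k+1}) (lists of floats)
    phi_right[i]    : weight of the left target at offset i (right half of phi_k)
    phi_left[i]     : weight of the right target at offset i (left half of phi_{k+1})
    """
    if len(phi_right) != len(phi_left):
        raise ValueError("both halves must cover the same offsets")
    return [
        [wr * a + wl * b for a, b in zip(y_left, y_right)]
        for wr, wl in zip(phi_right, phi_left)
    ]


def linear_halves(length):
    """Hypothetical straight-line halves for offsets 0..length (illustration only)."""
    phi_left = [i / length for i in range(length + 1)]
    phi_right = [1.0 - w for w in phi_left]
    return phi_right, phi_left


# Two 2-dimensional targets that are 4 frames apart.
phi_r, phi_l = linear_halves(4)
approx = interpolate_segment([0.0, 8.0], [4.0, 0.0], phi_r, phi_l)
```

With these halves the first and last reconstructed frames coincide with the two targets, and the intermediate frames lie on the straight line between them.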
The problem of TD can be formulated in terms of how to determine the set of interpolation functions
$\Phi = \{\phi_k(n)\}_{k=1}^{K}$ and how to find the set of the locations of the optimum targets
$A = \{a_k\}_{k=1}^{K}$ in the sense of a given criterion. In previous works, an MMSE (Minimum Mean
Square Error) criterion is mostly used, where the optimum $\Phi^*$ and $A^*$ are obtained to minimize
the overall distortion between $y(n)$ and $\hat{y}(n)$. However, the overall entropy of the TD
parameters $\Phi$, $A$ is also an important factor in applications associated with data compression.
We employ a new criterion which considers not only the distortion between the original feature vectors
and the approximated ones, but also the overall entropy of the TD parameters. Accordingly, the optimum
$A^*$, $\Phi^*$ are given by the following Lagrangian formulation

$$A^*, \Phi^* = \arg\min_{A,\Phi} [D(A,\Phi) + \lambda R(A,\Phi)] \tag{3}$$

where the Lagrange multiplier $\lambda$ has an interpretation as the relative weight for the overall
entropy. $D(A,\Phi)$ and $R(A,\Phi)$ represent the overall distortion and the overall entropy,
respectively, which are given by

$$D(A,\Phi) = \sum_{n=1}^{N} \|y(n) - \hat{y}(n)\|^2 \tag{4}$$

$$R(A,\Phi) = \sum_{k=1}^{K} r(y(n_k)) + \sum_{k=1}^{K} r(\phi_k) \tag{5}$$

In the above equations, the overall entropy $R(A,\Phi)$ is the minimum number of bits needed to
represent all the information associated with the underlying TD parameters
$A = \{a_k\}_{k=1}^{K} = \{y(n_k)\}_{k=1}^{K}$ and $\Phi = \{\phi_k\}_{k=1}^{K}$. $r(x)$ represents the
individual entropy (self-information) of a random variable $x$,

$$r(x) = |\log_2(p_x(x))| \tag{6}$$

where $p_x(x)$ is the probability distribution of the random variable $x$. In practice, $r(x)$ can be
approximated by the number of bits representing $x$ when a variable-length coding scheme (e.g. Huffman
coding) is employed. Results from several experiments showed no remarkable difference in overall
performance between the two cases: using the real entropy computed by (6) and using the entropy
approximated by a variable-length code. Hence, we used the entropy measure (6).
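The measure in (6) can be estimated from relative frequencies. The sketch below is only an illustration under that assumption: the function name and the toy index stream are hypothetical, and the probabilities are plain empirical frequencies rather than the distributions used in the experiments.

```python
import math
from collections import Counter


def self_information_table(symbols):
    """Return r(x) = |log2 p(x)| for each distinct symbol, with p(x)
    estimated from the relative frequencies observed in `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: abs(math.log2(c / total)) for s, c in counts.items()}


# Hypothetical stream of codebook indices selected for the target vectors.
indices = [3, 3, 3, 3, 7, 7, 1, 2]
bits = self_information_table(indices)

# Average cost in bits per symbol if every symbol is charged r(x).
average_bits = sum(bits[s] for s in indices) / len(indices)
```

Frequently used symbols receive a small cost and rare ones a large cost, which is the property the Lagrangian criterion (3) exploits.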
To solve the minimization problem (3), we assume that the interpolation functions for all possible
lengths are given (the method for building a set of interpolation functions is presented in Section
III). With these interpolation functions, the optimum target points $\{y(n_k)\}_{k=1}^{K}$ are selected
from a given spectral sequence $\{y(n)\}_{n=0}^{N-1}$ in the sense of minimizing the overall distortion
and the overall entropy of the TD parameters. A dynamic programming technique is employed to find the
optimum target points $\{n_k\}_{k=1}^{K}$. It first finds the locally minimal path for all $n$ within a
talk spurt, then the globally minimal path is built by backtracking. The overall procedure for finding
an optimal set of target points $\{n^*_k\}_{k=1}^{K}$ is thus as follows

$$\begin{aligned} w(n) &= \arg\min_{1 \le k < n} [D(k) + d(k,n) + \lambda\{R(k) + r(k,n)\}] \\ D(n) &= D(w(n)) + d(w(n), n) \\ R(n) &= R(w(n)) + r(w(n), n) \end{aligned} \tag{7}$$

where $1 \le n \le N-1$ and $d(n_1, n_2)$ is the sum of the squared errors between the original spectral
feature vectors and the ones approximated by the proposed TD scheme, computed over the range
$[n_1, n_2]$:

$$d(n_1,n_2) = \sum_{n=n_1}^{n_2} \|y(n) - \{y(n_1)\phi^R_{l_{1,2}}(n-n_1) + y(n_2)\phi^L_{l_{1,2}}(n-n_1)\}\|^2 \tag{8}$$


where $l_{1,2}$ is the length between $n_1$ and $n_2$, and $r(n_1,n_2)$ is the overall entropy of the TD
parameters computed over the range $[n_1,n_2]$:

$$r(n_1,n_2) = |\log_2(p_l(l = n_2 - n_1 + 1))| + |\log_2(p_y(y = y(n_1)))| \tag{9}$$

where $p_l(l)$ and $p_y(y)$ are the probability distributions of the random variables $l$ and $y$,
respectively. As mentioned earlier, the entropy terms in (9) can be replaced by the number of bits when
a variable-length coding scheme is applied. In practice, the information needed to compute the overall
entropy is the probability distribution of the continuous random vector $y(n)$. This can be approximated
by a continuous probabilistic model (e.g. a Gaussian distribution). Considering that most applications
of TD involve speech coding, it is preferable to represent the spectral feature vectors by a set of
discrete vectors obtained by a vector quantization scheme [19]. Hence, the last term in (9) can
be replaced as follows,

$$|\log_2(p_y(y = y(n_1)))| \simeq |\log_2(p_y(y = Q[y(n_1)]))| \tag{10}$$

where $Q[x]$ represents the reproduced vector of $x$ when vector quantization is applied.
In (7), $R(n)$ is the accumulated entropy up to $n$ and, similarly, $D(n)$ is the accumulated distortion
up to $n$. The backtracking pointer $w(n)$ indicates the starting point of the path with the minimum
$D(n) + \lambda R(n)$. The optimal sequence of target points in reverse order is
$y(N), y(w(N)), y(w(w(N))), \ldots$
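The recursion (7) and the backtracking step can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: frames are indexed from 0, the first and last frames are forced to be targets, `max_len` is an assumed cap on the segment length, and the segment costs are passed in as functions. The straight-line distortion and the constant 8-bit rate below are hypothetical stand-ins for (8) and (9).

```python
def segment(y, seg_dist, seg_rate, lam, max_len):
    """Dynamic-programming search for target positions following Eq. (7).

    y        : list of feature vectors; frames 0 and len(y)-1 are forced targets
    seg_dist : seg_dist(y, n1, n2) -> distortion d(n1, n2)
    seg_rate : seg_rate(y, n1, n2) -> rate r(n1, n2) in bits
    lam      : Lagrange multiplier weighting rate against distortion
    max_len  : longest allowed gap n2 - n1 between consecutive targets
    """
    last = len(y) - 1
    D = [0.0] * len(y)   # accumulated distortion D(n)
    R = [0.0] * len(y)   # accumulated rate R(n)
    w = [None] * len(y)  # backtracking pointer w(n)
    for n in range(1, len(y)):
        best = None
        for k in range(max(0, n - max_len), n):
            d, r = seg_dist(y, k, n), seg_rate(y, k, n)
            cost = D[k] + d + lam * (R[k] + r)
            if best is None or cost < best[0]:
                best = (cost, k, d, r)
        _, k, d, r = best
        w[n], D[n], R[n] = k, D[k] + d, R[k] + r
    # Follow the pointers back from the last frame to recover the targets.
    targets = [last]
    while w[targets[-1]] is not None:
        targets.append(w[targets[-1]])
    return targets[::-1], D[last], R[last]


def linear_dist(y, n1, n2):
    """Squared error when frames n1..n2 are linearly interpolated (stand-in)."""
    total = 0.0
    for n in range(n1, n2 + 1):
        t = (n - n1) / (n2 - n1)
        for a, b, c in zip(y[n1], y[n2], y[n]):
            total += (c - ((1 - t) * a + t * b)) ** 2
    return total


def flat_rate(y, n1, n2):
    """Hypothetical constant cost in bits per segment (stand-in for Eq. (9))."""
    return 8.0


# A triangular one-dimensional track: up for three frames, then down again.
track = [[0.0], [1.0], [2.0], [3.0], [2.0], [1.0], [0.0]]
targets_low, d_low, r_low = segment(track, linear_dist, flat_rate, 1.0, 6)
targets_high, d_high, r_high = segment(track, linear_dist, flat_rate, 100.0, 6)
```

With a small weight the search keeps the peak of the triangle as a target and reproduces the track without error; with a large weight it drops that target and accepts the distortion in exchange for a lower rate, which is the trade-off controlled by $\lambda$.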

III. BUILDING A SET OF INTERPOLATION FUNCTIONS

An interpolation function for each length is constructed from the given training corpus in the off-line
training stage. The interpolation function for the length $N$ can be easily estimated from the training
vectors belonging to the segments of length $N$, if the training data are already segmented.
However, if only the feature parameter sequence is given and the sequence of the target vectors is
unknown, designing a set of interpolation functions is not trivial. Thus, using the training data of
the feature parameter sequence, a set of interpolation functions is designed by finding both the target
vector sequence and the interpolation functions so as to minimize $D(A,\Phi) + \lambda R(A,\Phi)$.
Hence, the optimum set of interpolation functions is given by

$$\Phi^* = \arg\min_{\Phi} [\min_{A} \{D(A,\Phi) + \lambda R(A,\Phi)\}] \tag{11}$$


An iterative method is employed to solve the problem (11), where segmentation and re-estimation are
iteratively performed. The overall procedure for building a set of interpolation functions is shown in
Fig. 2. The basic concept of the proposed training algorithm is, beginning with an initial set of target
vectors $\bar{A}$ and an initial set of interpolation functions $\bar{\Phi}$, to estimate parameters
$A$ and $\Phi$ such that

$$D(A,\Phi) + \lambda R(A,\Phi) \le D(\bar{A},\bar{\Phi}) + \lambda R(\bar{A},\bar{\Phi}) \tag{12}$$

The new parameters then become the initial parameters for the next iteration, and the process is
repeated until a convergence threshold is reached. A detailed description of each step is as follows:

Step-0. Initialization : Given a training set $\{y(n)\}_{n=1}^{N}$, where $N$ is the total number of
training patterns, a set of reproduced vectors $\{\tilde{y}(n)\}_{n=1}^{N}$ is built using a vector
quantization scheme, and an initial set of target vectors
$A^0 = \{\tilde{y}(n^0_1), \tilde{y}(n^0_2), \ldots, \tilde{y}(n^0_{K^0})\}$ is built using an adequate
method. Set the threshold $\epsilon$, $e^0 = \infty$ and $i = 0$.

Step-1. Re-estimation : For each length, find the optimal interpolation function having the minimum
overall squared error between the real feature vectors and the approximated ones.

$$\phi^{R*}_m, \phi^{L*}_m = \arg\min_{\phi^R_m, \phi^L_m} [D(A^i, \phi^R_m, \phi^L_m)] \tag{13}$$

where

$$D(A^i, \phi^R_m, \phi^L_m) = \sum_{k \in S^i_m} \sum_{n=n^i_k}^{n^i_{k+1}} \|y(n) - \phi^R_m(n - n^i_k)\tilde{y}(n^i_k) - \phi^L_m(n - n^i_k)\tilde{y}(n^i_{k+1})\|^2 \tag{14}$$

where $S^i_m = \{k \mid n^i_{k+1} - n^i_k + 1 = m\}$ is the set of targets whose distance to the next
target is $m$ samples. Accordingly, $D(A^i, \phi^R_m, \phi^L_m)$ is the overall squared error for the
interpolation function of length $m$, with the given set of targets at the $i$-th iteration. Taking the
partial derivatives of $D(A^i, \phi^R_m, \phi^L_m)$ with respect to $\phi^R_m$ and $\phi^L_m$, and
setting them to zero, yields the following matrix formulation:

$$\Phi^*_{j,m}(n) = R^{-1}_{j,m} P_{j,m} \tag{15}$$

where

$$\Phi^*_{j,m}(n) = \begin{bmatrix} \phi^{R*}_{j,m}(n) & \phi^{L*}_{j,m}(n) \end{bmatrix}^T$$

$$R_{j,m} = \begin{bmatrix} \sum_{k \in S_m} \tilde{y}_j^2(n_k) & \sum_{k \in S_m} \tilde{y}_j(n_k)\tilde{y}_j(n_{k+1}) \\ \sum_{k \in S_m} \tilde{y}_j(n_k)\tilde{y}_j(n_{k+1}) & \sum_{k \in S_m} \tilde{y}_j^2(n_{k+1}) \end{bmatrix}$$

$$P_{j,m} = \begin{bmatrix} \sum_{k \in S_m} \tilde{y}_j(n_k) y_j(n_k + n) & \sum_{k \in S_m} \tilde{y}_j(n_{k+1}) y_j(n_k + n) \end{bmatrix}^T$$

Here $n$ is the offset from $n_k$ within the segment, $m_{min} \le m \le m_{max}$, $m_{min}$ and
$m_{max}$ are the minimum and maximum lengths of the underlying interpolation functions, respectively,
$1 \le j \le p$, $p$ is the order of the feature vector, $\phi^{L*}_{j,m}(n)$ and $\phi^{R*}_{j,m}(n)$
are the left side and the right side of the optimal interpolation function of length $m$ for the $j$-th
component of the feature vector at the $i$-th iteration, and $y_j(n)$ is the $j$-th component of the
feature vector. Once the interpolation functions for all lengths have been estimated, construct the set
of interpolation functions for the next iteration,
$\Phi^{i+1} = \{\phi^{R*}_m, \phi^{L*}_m\}_{m=m_{min}}^{m_{max}}$.

Step-2. Segmentation : Given the interpolation functions constructed in Step-1, find the optimal target
positions that minimize $D(A, \Phi^{i+1}) + \lambda R(A, \Phi^{i+1})$. Let $A^{i+1}$ be the set of the
optimal target vectors, then

$$A^{i+1} = \arg\min_{A} [D(A, \Phi^{i+1}) + \lambda R(A, \Phi^{i+1})] \tag{16}$$

In this step, the dynamic programming technique described in Section II is employed.

Step-3. Convergence test : Given $A^{i+1}$ and $\Phi^{i+1}$, compute the overall cost
$e^{i+1} = D(A^{i+1}, \Phi^{i+1}) + \lambda R(A^{i+1}, \Phi^{i+1})$. If
$(e^i - e^{i+1})/e^i \le \epsilon$, stop with $\Phi^{i+1}$ describing the final set of interpolation
functions. Otherwise replace $i$ by $i + 1$ and go to Step-1.
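The re-estimation in Step-1 amounts to solving a 2x2 least-squares problem for every offset within a segment. The Python sketch below illustrates this for a single component of the feature vector; the function name, the 0-based indexing, and the toy data are assumptions made for this example, and the 2x2 system is solved directly by Cramer's rule.

```python
def reestimate(y, y_q, targets, m):
    """Least-squares interpolation weights for segments spanning m frames.

    y       : original scalar feature track (one component of the vector)
    y_q     : quantized track; target values are read from here
    targets : sorted frame indices of the current target positions
    m       : segment span in frames, counted as n_{k+1} - n_k + 1
    Returns (phi_R, phi_L) as lists over offsets 0..m-1, or None when no
    segment of that span exists.
    """
    pairs = [(a, b) for a, b in zip(targets, targets[1:]) if b - a + 1 == m]
    if not pairs:
        return None
    # Entries of the symmetric 2x2 matrix; they do not depend on the offset.
    s_aa = sum(y_q[a] * y_q[a] for a, b in pairs)
    s_ab = sum(y_q[a] * y_q[b] for a, b in pairs)
    s_bb = sum(y_q[b] * y_q[b] for a, b in pairs)
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("normal equations are singular for this length")
    phi_R, phi_L = [], []
    for off in range(m):
        # Right-hand side for this offset.
        p_a = sum(y_q[a] * y[a + off] for a, b in pairs)
        p_b = sum(y_q[b] * y[a + off] for a, b in pairs)
        phi_R.append((p_a * s_bb - p_b * s_ab) / det)
        phi_L.append((p_b * s_aa - p_a * s_ab) / det)
    return phi_R, phi_L


# Toy track made of two segments that both follow straight lines.
toy = [2.0, 3.0, 4.0, 2.5, 1.0]
phi_R, phi_L = reestimate(toy, toy, [0, 2, 4], 3)
```

Because both toy segments are exact straight lines between their targets, the estimated weights come out as the straight-line halves; on real LSF tracks the result would be the best fit in the least-squares sense over all segments of that length.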

An important implementation issue associated with the iterative algorithm is its initialization. In
practice, the initialization of the iterative algorithm affects not only its convergence rate but also
the final result. In this work, the set of target positions $\{n^0_k\}_{k=1}^{K^0}$ is initialized by
the use of the spectral feature transition rate (SFTR) [8], which is computed as the squared sum of the
gradient of the regression line for the spectral feature vectors. We assumed that the local minima of
the SFTR correspond to phonetically stable regions; hence an initial set of target positions is obtained
by taking the feature vectors at the local minima of the SFTR.
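One possible reading of this initialization is sketched below in Python. The window half-width, the treatment of the edge frames, and the tie-breaking on flat regions are assumptions made for this example; the exact SFTR definition is given in [8], not here.

```python
def sftr(y, half_width=2):
    """Spectral feature transition rate: sum over components of the squared
    slope of a least-squares line fitted over 2*half_width + 1 frames.
    Frames too close to the edges get None."""
    norm = sum(i * i for i in range(-half_width, half_width + 1))
    out = [None] * len(y)
    for n in range(half_width, len(y) - half_width):
        total = 0.0
        for j in range(len(y[n])):
            slope = sum(i * y[n + i][j]
                        for i in range(-half_width, half_width + 1)) / norm
            total += slope * slope
        out[n] = total
    return out


def local_minima(values):
    """Indices whose value is no larger than the left neighbour and strictly
    smaller than the right one, ignoring None entries."""
    return [n for n in range(1, len(values) - 1)
            if None not in (values[n - 1], values[n], values[n + 1])
            and values[n - 1] >= values[n] < values[n + 1]]


# A one-dimensional track: rising, steady, rising again.
ramp = [[0.0], [1.0], [2.0], [3.0], [3.0], [3.0], [3.0], [3.0], [4.0], [5.0], [6.0]]
rates = sftr(ramp, half_width=1)
initial_targets = local_minima(rates)
```

On this toy track the transition rate drops to zero over the steady region, and one frame from that region is returned as an initial target position.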


[Figure 2: flowchart. Initial interpolation functions and initial target boundaries; update the interpolation functions using the minimum mean square error criterion; update the target boundaries using dynamic programming; test whether $D(A,\Phi) + \lambda R(A,\Phi)$ has converged (No: repeat the two updates; Yes: end).]

Fig. 2. Procedure for building a set of interpolation functions

It should also be noted that since the re-estimation stage (Step-1) concerns minimization of the overall
distortion only, it cannot be guaranteed that the re-estimation stage always yields the optimal
interpolation functions in the sense of minimum overall entropy. To alleviate this problem, it would be
necessary to take the partial derivative of $R(A^i, \Phi^i)$ with respect to $\Phi^i$. However, it is
very difficult to represent $R(A^i, \Phi^i)$ as a function of $\Phi^i$. Our experiments showed that
although the interpolation functions resulting from the re-estimation stage do not always reduce the
overall entropy, the subsequent segmentation stage reduces the overall entropy drastically. Hence, in
our experiments, the overall iterative procedure yielded a nonincreasing sequence of entropies as well
as distortions.

IV. EXPERIMENTAL RESULTS

Speech was sampled at 8 kHz and windowed using a 30 msec Hamming window. It is known that LSF
(Line Spectrum Frequency) parameters interpolate well over time. Hence in this work, LSF was
used as the spectral feature parameter. The 10th order LSF parameters were calculated at 22.5 msec frame
intervals. Since the proposed algorithm requires a discrete representation of each feature vector, all
LSF vectors were quantized by a split vector quantization (SVQ) scheme [19]. Two sets of SVQ codebooks
were designed from the training corpus, which was also used for designing the set of interpolation
functions. For the two SVQ codebooks, the split patterns are 2-2-3-3 and 3-3-4, and the numbers of bits
allocated to the components are 6-6-9-7 bits and 7-7-8 bits, respectively. The minimum and the maximum
interpolation intervals between two target LSF vectors were set to 3 and 300 frames, respectively,
values which were obtained from several experiments.
The speech corpus consisted of 1,000 utterances, which correspond to 457 816 feature vectors. The
entire corpus was split into 500 utterances for training and 500 utterances for the actual test. Five
male and five female speakers participated in recording the speech corpus.
Since the dynamic programming procedure is performed within talk spurts, it is necessary to decompose
the entire speech signal into speech and silence regions. To this end, a VAD (Voice Activity Detection)
algorithm [23] based on short-time energy and zero-crossing rate was employed.

A. Objective Evaluation

In this subsection, we analyze the rate-distortion (R-D) characteristics of the proposed TD method
and of a conventional VQ-based method. To obtain the R-D curves for the proposed TD method, the weight
for the overall entropy, $\lambda$ in (3), was varied from 1 through 100. For the VQ-based method, the
number of bits allocated to each vector was set to 10 and 9 bits to obtain the R-D characteristics. A
standard VQ codebook design algorithm [21] was employed to obtain the set of discrete spectral vectors.
Note that entropy was not considered in the VQ training procedure. Hence the resulting set of discrete
spectral vectors is not optimized in the sense of the minimum entropy criterion. The individual entropy
of each target vector is computed from the frequency of its nearest-neighbor code vector.
The results are shown in Fig. 3 and Fig. 4. The R-D curves shown in these figures are obtained when
each target vector is represented by a 3-3-4 SVQ scheme and a 2-2-3-3 SVQ scheme, respectively.
The R-D values obtained by several frame-by-frame coding schemes are also shown in these
figures. These frame-by-frame coding schemes include 5-5 SVQ and general VQ (no
split). The results clearly indicate that the R-D characteristics of the proposed TD
method are superior to those of the frame-by-frame SVQ/VQ schemes. For example, an average
bit rate of 3.0 bits/frame was obtained by the proposed TD method at an average RMSE (Root Mean
Square Error) of 4.0. This corresponds to the average RMSE of the 5-5 SVQ scheme when 10 bits are
allocated. Hence, at an average RMSE of 4.0, the overall bit rate can be reduced by a factor of about
three (from 10 to 3.0 bits/frame) by applying the proposed TD method.
Regarding the method of representing the target vectors, the difference in performance between the two
SVQ schemes is not remarkable. The 3-3-4 SVQ scheme performs slightly better than
the 2-2-3-3 SVQ scheme. This is partly due to the fact that the 2-2-3-3 SVQ scheme yields
a higher average entropy of the target vectors (27.2 bits/frame) than the 3-3-4 SVQ scheme
(21.5 bits/frame).
In practice, the average entropy of the target vectors is very close to the total number of bits
allocated to an individual vector (27.2 bits/frame vs. 28 bits/frame for 2-2-3-3 SVQ and 21.5 bits/frame
vs. 22 bits/frame for 3-3-4 SVQ). This is because the distributions of the code vectors in the two
codebooks are nearly uniform. This implies that the choice of which vectors are selected as target
vectors matters little for reducing the overall entropy.
The major contributions to reducing the overall entropy are using interpolation functions which
are frequently used (i.e. which have lower entropy) and reducing the total number of target vectors.
The total number of target vectors can be reduced by using longer interpolation functions as often
as possible. This can be achieved by increasing the value of $\lambda$. However, using longer
interpolation functions tends to increase the overall distortion. This is because, although the
interpolation functions approximate the real LSF trajectories very well, increasing the number of target
vectors is more effective for reducing the overall distortion. In practice, it was observed that shorter
interpolation functions tend to have lower entropy. This effect was clearer when a small value of
$\lambda$ was employed. This implies that most of the frequently used interpolation functions are the
shorter ones. Therefore, it can be said that the two major contributions to reducing the overall entropy
have a trade-off relationship.
To further improve the overall performance of the proposed TD method, one should consider how to
lower the overall entropy of the target vectors. One possible method for achieving further improvements
is to employ an entropy-constrained vector quantization (ECVQ) scheme [22]. A part of the codebook
design algorithm [22] employed in the ECVQ scheme could also be included in the proposed interpolation
function design procedure. In this case, the algorithm would update not only the interpolation functions
but also all the code vectors at each iteration, resulting in interpolation functions and code vectors
that are jointly optimized in the sense of minimizing both the overall distortion and the overall
entropy. We will focus on this issue in an attempt to implement the next version of the TD algorithm.

B. Subjective Evaluation Using the G.723 Speech Codec

It is also important to evaluate the quality of the reproduced speech signals when all
LSF vectors are represented by the proposed TD method. To this end, we also performed several
experiments in which the qualities of the speech signals reproduced by several LSF coding methods
were compared through informal listening tests. Differences in perceptual quality were not noticeable
when quantization was carried out only on the LSF vectors (keeping the remaining parts of the speech
signals, including LP residuals and gains, unchanged). Hence, it is desirable that the proposed TD
method be employed as a part of a speech coding scheme. The G.723 ITU standard speech coding scheme [24]
was taken as the baseline speech coding algorithm for this purpose.

[Figure 3: panel "RD-curve for 334-split VQ"; horizontal axis: Rate (bits/Frame); vertical axis: Distortion (RMSE); reference points: 5-5 SVQ (9 bits), 5-5 SVQ (10 bits), VQ (9 bits), VQ (10 bits).]

Fig. 3. Rate-distortion characteristics of the several methods

[Figure 4: panel "RD-curve for 2233-split VQ"; horizontal axis: Rate (bits/Frame); vertical axis: Distortion (RMSE); reference points: 5-5 SVQ (9 bits), 5-5 SVQ (10 bits), VQ (9 bits), VQ (10 bits).]

Fig. 4. Rate-distortion characteristics of the several methods
Two subjective listening tests were conducted. The first one was designed to evaluate the absolute
quality of the reproduced speech signals using the MOS (Mean Opinion Score) test. In this test, 18
listeners participated and were asked to score the quality of the reproduced speech signals. The quality
rating scale is shown in Table I. The test data set consisted of 10 pairs of sentences which
were taken from the database. None of the listeners had significant prior knowledge of the contents of
the test sentences. The quality evaluation was carried out on the speech signals reproduced by the
following three coding schemes: (1) the original G.723 codec; (2) the modified G.723 codec where the LSF
coding parts of the G.723 codec are replaced by the proposed TD method; (3) the modified G.723 codec
where the LSF coding parts are removed (the original LSF sequences are maintained). Note that although
the coding methods for the other speech parameters (e.g. long-term prediction coefficients) are
identical for cases (1) and (2), the parameters associated with the excitation signals, including the
adaptive and fixed codebook parameters, may differ from each other. This is because the approximated LSF
vectors from method (1) and method (2) may be different.
The results are presented in Table II. The speech signals reproduced by the modified G.723 codec,
where the proposed TD method is employed to compress the LSF sequences, were rated about 0.1 higher
than those of the original G.723 codec in the case of $\lambda \le 10$. The entropies of the LSF vectors
from the proposed TD method are 9.77 bits/frame and 6.75 bits/frame, which correspond to $\lambda = 1$
and $\lambda = 10$, respectively. Considering that the number of bits allocated to the LSF vectors in
the original G.723 codec is 24 bits/frame, the proposed TD method achieves better quality even though
its overall bit rate is lower than that of the conventional coding scheme. The listeners indicated that
the speech signals reproduced by the modified G.723 codec with the proposed TD method sounded smoother
and more comfortable. This is mainly because the proposed method leads to smoothly evolving spectral
contours over time.
In Table II, the overall entropies of the LSF vectors for λ = 50 and λ = 100 are 3.188 and 2.359
bits/frame, respectively. In these cases, since the overall entropy is emphasized more than the overall
distortion, the quality of the reproduced speech signals could be expected to degrade seriously.
However, the MOS differences from the original G. 723 codec are less than 0.1. There are several reasons
why the quality is not seriously degraded even when the overall distortion is given less emphasis. One
is the robustness of the G. 723 coding scheme against LSF quantization noise. Another is that, even with
a small number of bits, the proposed TD method produces a perceptually good approximation to the
original LSF streams.
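The trade-off described above can be tabulated directly from the figures quoted in the text and in Table II. The short sketch below is only an arithmetic check, not part of the original experiments: it assumes that all four entropies are expressed in bits/frame and compares them with the 24 bits/frame of the original codec. The names `td_results` and `summarize` are hypothetical. Note that an entropy is a lower bound on the achievable rate, so the savings shown are optimistic.

```python
# Figures quoted in the text and in Table II; nothing here is newly measured.
G723_LSF_BITS = 24.0        # bits/frame allocated to LSF vectors in the original codec
MOS_ORIGINAL_CODEC = 3.39   # "G. 723 without modification" in Table II

# lambda -> (entropy of the TD parameters in bits/frame, MOS from Table II)
# Assumption: the entropies for lambda = 50 and 100 are also in bits/frame.
td_results = {
    1: (9.77, 3.48),
    10: (6.75, 3.47),
    50: (3.188, 3.34),
    100: (2.359, 3.30),
}


def summarize(results):
    """Return (lambda, entropy, fractional LSF bit saving, MOS change) per setting."""
    rows = []
    for lam, (entropy, mos) in sorted(results.items()):
        saving = 1.0 - entropy / G723_LSF_BITS
        rows.append((lam, entropy, saving, mos - MOS_ORIGINAL_CODEC))
    return rows


for lam, entropy, saving, delta in summarize(td_results):
    print(f"lambda={lam:>3}: {entropy:6.3f} bits/frame, "
          f"{100 * saving:5.1f}% fewer LSF bits, MOS change {delta:+.2f}")
```

Under these assumptions, every setting stays within 0.1 MOS of the original codec while using well under half of its LSF bit budget.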

TABLE I

QUALITY RATING SCALE FOR EACH MOS TEST

Description    Rating

Excellent      5
Good           4
Fair           3
Poor           2
Bad            1

TABLE II

MOS TEST RESULT

Coding method                             Average rating

Original (No coding)                      4.33
G. 723 without modification               3.39
G. 723 with original LSF                  3.56
G. 723 with the proposed TD (λ = 1)       3.48
G. 723 with the proposed TD (λ = 10)      3.47
G. 723 with the proposed TD (λ = 50)      3.34
G. 723 with the proposed TD (λ = 100)     3.30

One of the objectives of our work is to reduce the number of bits needed to represent a large speech
corpus that serves as the database of a corpus-based, waveform-concatenating TTS (text-to-speech)
synthesis system. We aimed to implement this TTS system on hand-held, battery-powered devices such as
mobile phones and PDAs (personal digital assistants). The original size of the database is about 86 400
kbytes. Hence, without compression, it is impossible to implement the TTS system on hand-held devices
with small memory. In this work, the G. 723 codec was employed to compress the waveforms in the TTS
database, and the proposed TD method was used as an alternative coding method for the LSF sequences.
Accordingly, the set of interpolation functions employed in the subjective listening tests was
constructed from the database of the underlying TTS system.
The second listening test was designed to evaluate the quality of the synthetic speech produced by a
TTS system when the waveforms in its database are compressed and reproduced by the coding scheme. To
this end, we prepared two sets of synthetic speech signals: one from the underlying TTS system using
the uncompressed (raw) database, and the other from the same TTS system using the compressed database.
The qualities of the two sets were then compared. This test is closely related to the comparison
category rating (CCR) test [25], which has been used to evaluate low bit rate speech coding methods. In
this test, the listeners rate the quality of the second stimulus relative to the first using a two-sided
rating scale, as shown in Table III. The listeners were asked to judge whether each stimulus was better
or worse than the other. Each stimulus pair consisted of the synthetic speech signals from the
underlying TTS system using the two different databases, uncompressed and compressed.
The coding method employed in this test is the modified G. 723 codec, in which the proposed TD method
is employed to compress the LSF sequences. The length between two neighboring targets (i.e., the length
of the interpolation function) was represented by Huffman coding. The original size of the database is
86 400 kbytes. Using the 6.3 kbps G. 723 codec, the size is reduced to 4252.5 kbytes. The proposed TD
method reduces this further, to a final database size of 3808 kbytes. We adjusted λ until a good
compromise between bit rate and sound quality was reached; the value finally chosen was λ = 50.
Eighteen listeners participated in this test. The five test utterances were chosen from the same
category as the utterances in the database (newspaper reading). However, none of the test utterances
was included in the database.
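The database sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below is an assumption-laden check, not a description of the original experiment: it assumes that the raw database is stored as 16-bit linear PCM sampled at 8 kHz (128 kbps), which is not stated in the text but reproduces the quoted 4252.5 kbytes, and it uses the 30 ms frame length of the G. 723 codec. The function names are hypothetical.

```python
RAW_KBYTES = 86_400.0   # original database size quoted in the text
TD_KBYTES = 3_808.0     # final database size quoted in the text
G723_KBPS = 6.3         # bit rate of the G. 723 codec used in the text
PCM_KBPS = 128.0        # assumption: 8 kHz sampling x 16 bits/sample
FRAME_SEC = 0.030       # frame length of the G. 723 codec (30 ms)


def coded_size(raw_kbytes, raw_kbps, coded_kbps):
    """Size after coding, assuming size scales linearly with bit rate."""
    return raw_kbytes * coded_kbps / raw_kbps


def equivalent_rate(raw_kbytes, raw_kbps, coded_kbytes):
    """Average bit rate (kbps) implied by a coded database size."""
    return raw_kbps * coded_kbytes / raw_kbytes


g723_kbytes = coded_size(RAW_KBYTES, PCM_KBPS, G723_KBPS)
td_kbps = equivalent_rate(RAW_KBYTES, PCM_KBPS, TD_KBYTES)
bits_saved_per_frame = (G723_KBPS - td_kbps) * 1000.0 * FRAME_SEC

print(f"G. 723 database size : {g723_kbytes:.1f} kbytes")
print(f"TD equivalent rate   : {td_kbps:.3f} kbps")
print(f"bits saved per frame : {bits_saved_per_frame:.2f}")
```

Under the PCM assumption, the final size corresponds to a saving of just under 20 bits per 30 ms frame, which is of the same order as the 24 bits/frame that the original codec spends on the LSF vectors.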
The average CCR was -0.308, the maximum CCR was +0.08 and the minimum CCR was -0.88 (Table IV). For
all five sentences, the magnitude of the CCR is less than 1. This means that the quality degradation is
not serious when the database is represented by the reduced parameters. The listeners indicated that
the distortions caused by database compression were perceptually indistinguishable because of the
inherent artifacts of TTS synthesis, including annoying discontinuities at unit boundaries and artifacts
caused by prosody modifications. This implies that the level of the distortions caused by the proposed
coding method is below that of the artifacts caused by the underlying TTS system. We also performed the
same experiment on utterances synthesized using the database compressed by the original G. 723 codec.
The average CCR in this case was almost identical to that of the modified G. 723 codec with the proposed
TD method. Consequently, the proposed TD method can compress a TTS database efficiently without a
noticeable loss of perceptual quality relative to the original G. 723 codec.
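The summary statistics of the CCR test follow directly from the per-sentence ratings listed in Table IV. The sketch below simply recomputes them from those five values; the variable names are hypothetical and no new data are introduced.

```python
from statistics import mean

# Per-sentence CCR values as listed in Table IV.
ccr = {
    "utterance 1": -0.38,
    "utterance 2": -0.25,
    "utterance 3": -0.11,
    "utterance 4": +0.08,
    "utterance 5": -0.88,
}

average = mean(list(ccr.values()))
worst = min(ccr.items(), key=lambda kv: kv[1])   # most negative rating
best = max(ccr.items(), key=lambda kv: kv[1])    # most positive rating
within_one_point = all(abs(v) < 1.0 for v in ccr.values())

print(f"average CCR : {average:+.3f}")
print(f"minimum CCR : {worst[1]:+.2f} ({worst[0]})")
print(f"maximum CCR : {best[1]:+.2f} ({best[0]})")
print(f"all |CCR| < 1: {within_one_point}")
```

The five tabulated ratings average to -0.308 and all lie within one rating point of "About the Same" on the scale of Table III.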

TABLE III

QUALITY RATING SCALE FOR A COMPARISON CATEGORY RATING (CCR) TEST

Description       Rating

Much Better        2
Better             1
About the Same     0
Worse             -1
Much Worse        -2

TABLE IV

CCR FOR EACH TEST SENTENCE

Sentence        Rating

utterance 1     -0.38
utterance 2     -0.25
utterance 3     -0.11
utterance 4     +0.08
utterance 5     -0.88

Average         -0.308

However, degraded quality was often observed in the synthetic speech signals when the LSF streams were
represented by the proposed TD method with a large λ (e.g. ≥ 100). In this case, the listeners indicated
that the utterances synthesized using the compressed database often sounded “ambiguous” and “unclear”.
This is because the LSF vectors are excessively smoothed by interpolation, so that the detailed
structure of the LSF trajectories is lost. Moreover, the utterances synthesized using the highly
compressed database sounded noisier than those synthesized from the original database. This noise comes
in part from the increased distortion caused by the coarser representation of the LSF vectors. Another
possible explanation for the noisy quality is that the artifacts associated with discontinuities at unit
joining points are emphasized when the reproduced speech signals have large distortions. This means that
the concatenation distortion between neighboring units increases in proportion to the distortion of the
units themselves. Since these artifacts are mainly caused by concatenating the tail and the head of
neighboring units, it is important to reduce the distortion at unit boundaries. One possible way to do
this is to synchronize the target positions with the beginning and end frames of the units. In this
case, since the LSF vectors at unit boundaries always correspond to targets, only a small amount of
distortion can be expected at unit boundaries, even when a higher λ is applied.

V. CONCLUDING REMARKS AND FUTURE WORK

We have proposed a new TD method based on a rate-distortion criterion. Using this criterion, it was
possible to implement an optimal method for minimizing the bit rate with adjustable spectral distortion.
The major contributions of this work concern two central problems in TD: finding the locations of the
optimal target vectors, and building a set of optimal interpolation functions, both from the standpoint
of minimizing the overall distortion and the overall entropy. A dynamic programming procedure was
proposed to find the locations of the optimal targets, in which a global path is constructed by
minimizing the weighted sum of the distortion and the entropy. To build a set of optimal interpolation
functions, an iterative algorithm was proposed in which segmentation and re-estimation of the
interpolation functions are performed alternately.
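The idea of selecting targets by minimizing a weighted sum of distortion and entropy can be illustrated with a small dynamic programming sketch. This is a simplified illustration under stated assumptions, not the procedure developed in this paper: plain linear interpolation between two target frames stands in for the trained interpolation functions, the targets are taken to be the original frames themselves, and the rate of a segment is the negative base-2 logarithm of an assumed probability of its length. The names `segment_distortion`, `select_targets` and `length_prob` are hypothetical.

```python
import math

import numpy as np


def segment_distortion(x, a, b):
    """Squared error when frames a..b are linearly interpolated from x[a] and x[b]."""
    n = b - a
    w = (np.arange(n + 1) / n)[:, None]          # interpolation weights from 0 to 1
    approx = (1.0 - w) * x[a] + w * x[b]
    return float(np.sum((x[a:b + 1] - approx) ** 2))


def select_targets(x, length_prob, lam):
    """Choose target positions minimizing distortion + lam * rate by dynamic programming.

    length_prob maps an allowed segment length to its assumed probability.
    Frame 0 and the last frame are always targets.
    """
    num_frames = len(x)
    best = [math.inf] * num_frames               # best[b]: minimum cost of a path ending at b
    prev = [-1] * num_frames                     # back-pointer to the previous target
    best[0] = 0.0
    for b in range(1, num_frames):
        for length, p in length_prob.items():
            a = b - length
            if a < 0 or p <= 0.0 or math.isinf(best[a]):
                continue
            cost = best[a] + segment_distortion(x, a, b) - lam * math.log2(p)
            if cost < best[b]:
                best[b], prev[b] = cost, a
    if math.isinf(best[-1]):
        raise ValueError("last frame unreachable with the given segment lengths")
    targets = [num_frames - 1]
    while targets[-1] != 0:                      # backtrack along the optimal path
        targets.append(prev[targets[-1]])
    return targets[::-1], best[-1]


# Toy two-dimensional trajectory with breakpoints at frames 0, 4 and 8.
tri = np.array([0, 1, 2, 3, 4, 3, 2, 1, 0], dtype=float)
x = np.stack([tri, 2.0 * tri], axis=1)
length_prob = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}   # assumed uniform length model

targets, cost = select_targets(x, length_prob, lam=1.0)
print(targets, cost)
```

On this toy trajectory the sketch places targets at the three breakpoints, since any other segmentation either adds distortion or spends more bits on additional segments. Larger values of `lam` favor fewer, longer segments at the expense of distortion, which mirrors the role of λ in the experiments.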
The effectiveness of the proposed TD method was confirmed by both objective and subjective evaluations.
The proposed algorithm showed rate-distortion characteristics superior to those of conventional split
vector quantization schemes. We also performed informal listening tests designed to evaluate the
perceptual quality of the reproduced speech signals when the proposed TD method was applied to compress
the LSF streams. For λ ≤ 10, the speech signals reproduced with the proposed TD method received higher
MOS ratings than those of the original G. 723 codec. We extended this work to compress a large speech
corpus employed in a text-to-speech synthesis system. Since there are no delay limitations in
compressing a TTS database, this is a suitable application for TD. The results show that the quality of
the utterances synthesized by the TTS system with the compressed database is comparable to that of the
original synthesized speech signals.
In this work, each interpolation function is characterized only by its length. If each interpolation
function were characterized in more detail, using parameters such as the VQ indices for the beginning
and end LSF vectors, the voicing states and the adaptive/fixed codebook indices, the overall performance
could be further improved.
The distortion measure employed in this work is the Euclidean distance between two vectors. The major
reason for using this simple measure is its relatively low computational complexity. However, this
measure is not closely related to models of human hearing. Applying an adaptive weight to each LSF
component, as is already done in most low bit rate speech coding algorithms, would further improve the
quality of the speech signals reproduced by the proposed TD method. The entropy measure employed in this
work is also simple: it is given by the negative base-2 logarithm of the probability of the incoming
symbol. The representation of each parameter by variable-length coding schemes was not sufficiently
discussed in this work. Hence, it is necessary to study suitable binary code representations for the TD
parameters. The coding efficiency would be further increased by such an improved coding method.
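The gap between an entropy measure and an actual variable-length code can be made concrete with a small example. The sketch below builds a binary Huffman code for a list of segment lengths and compares its average code length with the empirical entropy. The segment lengths are made-up illustrative data, not measurements from this work, and the function names are hypothetical.

```python
import heapq
import math
from collections import Counter


def entropy_bits(counts):
    """Empirical entropy in bits/symbol of a symbol-count table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a binary Huffman code."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items: (weight, unique tie-breaker, {symbol: depth in the subtree}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def average_length(counts, lengths):
    """Average number of code bits per symbol."""
    total = sum(counts.values())
    return sum(counts[s] * lengths[s] for s in counts) / total


# Hypothetical interpolation-function lengths (in frames) for eight segments.
segment_lengths = [2, 2, 2, 2, 3, 3, 4, 5]
counts = Counter(segment_lengths)
code_lengths = huffman_code_lengths(counts)

print("entropy     :", entropy_bits(counts), "bits/segment")
print("Huffman mean:", average_length(counts, code_lengths), "bits/segment")
```

In this example the probabilities are powers of 1/2, so the Huffman code meets the entropy exactly. In general the average Huffman code length lies between the entropy and the entropy plus one bit, which is the kind of overhead a practical code for the TD parameters would need to account for.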

REFERENCES
[1] G. Fant, Speech Sounds and Features, Cambridge, MA: MIT Press, 1973.
[2] J. B. Allen, “How do humans process and recognize speech?,” IEEE Trans. on Speech and Audio Processing, vol. 2, No. 4,
pp. 567-577, Oct. 1994.
[3] M. Ismail and K. Ponting, “Between recognition and synthesis-300 bits/second speech coding,” in Proc.
EUROSPEECH-97, vol. 1, pp. 441-444, 1997.
[4] B. S. Atal, “Efficient coding of LPC parameters by temporal decomposition,” in Proc. IEEE Int. Conf. Acoust., Speech,
Signal Processing, pp. 81-84, 1983.
[5] M. Niranjan and F. Fallside, “Temporal decomposition: A framework for enhanced speech recognition,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Processing, pp. 655-658, 1989.
[6] P. J. Dix and G. Bloothooft, “A breakpoint analysis procedure based on temporal decomposition,” IEEE Trans. on Speech
and Audio Processing, vol. 2, No. 1, part 1, pp. 9-17, Jan. 1994.
[7] V. D.-Kappers A. M. and S. M. Marcus, “Temporal decomposition of speech,” Speech Communication, vol. 8, pp. 125-135,
1989.
[8] A.C.R. Nandasena and M. Akagi, “Spectral stability based event localizing temporal decomposition,” in Proc. IEEE Int.
Conf. Acoust., Speech, Signal Processing, pp. 957-960, 1998.
[9] F. Bimbot, G. Chollet, and P. Deleglise, “Temporal decomposition and acoustic-phonetic decoding of speech,” in Proc.
IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 445-448, 1988.
[10] P. Deleglise, F. Bimbot, C. Montacie, and G. Chollet, “Temporal decomposition and acoustic-phonetic decoding for the
automatic recognition of continuous speech,” in 9th International Conference on Pattern Recognition, pp. 839-841, 1988.
[11] Y. M. Cheng and D. O’Shaughnessy, “Short-term temporal decomposition and its properties for speech compression,” IEEE
Trans. Signal Processing, vol. 39, No. 6, pp. 1282-1290, 1991.
[12] S. Ghaemmaghami and M. Deriche, “Adaptive-width approximation of events in temporal decomposition based speech
coding,” IEE Electronics Letters, vol. 32, No. 24, pp. 2189-2191, 1996.
[13] S. Ghaemmaghami, M. Deriche, and S. Sridharan, “Hierarchical temporal decomposition; A novel approach to efficient
compression of spectral characteristics of speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 2567-
2570, 1998.
[14] S. Ghaemmaghami and S. Sridharan, “Very low rate speech coding using temporal decomposition,” IEE Electronics
Letters, vol. 35, No. 6, pp. 456-457, 1999.
[15] S.-J. Kim and Y.-H. Oh, “Efficient quantization method for LSF parameters based on restricted temporal decomposition,”
IEE Electronics Letters, vol. 35, No. 12, pp. 962-964, 1999.
[16] K.-S. Lee, “Temporal decomposition based on a rate-distortion criterion,” IEEE Signal Processing Letters, vol. 11, No. 1,
pp. 33-35, 2004.
[17] K.-S. Lee and R. V. Cox, “TTS based very low bit rate speech coder,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal
Processing, pp. 181-184, 1999.
[18] Y. Shiraki and M. Honda, “LPC speech coding based on variable length segment quantization,” IEEE Trans. Acoust.,
Speech, Signal Processing, vol. 36, No. 9, pp. 1437-1444, 1988.
[19] K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPC parameters at 24 bits/frame,” IEEE Trans. on Speech
and Audio Processing, vol. 1, pp. 3-14, Jan. 1993.
[20] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.
[21] Y. Linde, A. Buzo and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. on Communications, vol. 28,
pp. 84-95, Jan. 1980.
[22] P. A. Chou, T. Lookabaugh, and R. M. Gray, “Entropy-constrained vector quantization,” IEEE Trans. on Acoust., Speech,
Signal Processing, vol. 37, No. 1, pp. 31-42, 1989.
[23] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1987.
[24] ITU-T Rec. G. 723, “Dual rate speech coder for multimedia telecommunication transmitting at 6.3 and 5.3 kbps,” 1995.
[25] W. B. Kleijn and J. Haagen, “Waveform interpolation for coding and synthesis,” in Speech Coding and Synthesis (W. Kleijn
and K. Paliwal, eds.), ch. 4, pp. 175-207, Elsevier, 1995.