
5 Linear Prediction

Linear prediction is one of the most important tools in speech processing. It can be utilized in many
ways, but for speech processing its most important property is the ability to model the vocal
tract. It can be shown that the lattice-structured tube model of the vocal tract is an all-pole filter, that is,
a filter that has only poles. One can also think of it this way: the lack of zeros restricts the filter to
emphasizing certain frequencies, which in this case are the formant frequencies of the vocal tract. In reality the
vocal tract is not composed of lossless uniform tubes, but in practice, modeling the vocal tract with
an all-pole filter works fine. Linear prediction (LP) is a useful method for estimating the parameters
of this all-pole filter from a recorded speech signal.
Let us first study an example of the usefulness of LP in this respect. Figure 1 presents a 30 ms
window of the vowel [a] with a sampling frequency of 16 kHz. Its amplitude spectrum can be found
in figure 2, showing the fundamental frequency and its harmonics (dense peaks) and the formants
(broad peaks of the spectral envelope). The same figure also shows the amplitude response of a 20th
degree LP model, which models the broad-peak envelope very well.

[Figure: waveform plot, amplitude (-0.8 to 0.8) against sample index (0 to 500).]

Figure 1: Hanning-windowed waveform of vowel [a].

5.1 Background of Linear Prediction


The term 'linear prediction' refers to the prediction of the output of a linear system based on its input
x(n) and previous outputs y(n-1), y(n-2), ...:

\hat{y}(n) = -\sum_{k=1}^{p} a(k)\, y(n-k) + \sum_{k=0}^{q} b(k)\, x(n-k)    (1)

[Figure: magnitude (dB, -80 to 40) against frequency (0 to 8000 Hz).]

Figure 2: Amplitude spectrum and LP spectrum.
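A minimal sketch (with a synthetic stand-in signal, not the actual recording used in the figures) of how a comparison like figure 2 can be computed with NumPy/SciPy; the LP coefficients come from the autocorrelation method derived in section 5.2 below:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

fs = 16000
n = np.arange(int(0.03 * fs))                      # 30 ms frame, 480 samples
frame = np.sin(2 * np.pi * 120 * n / fs) + 0.01 * np.random.randn(len(n))
frame *= np.hanning(len(frame))                    # Hanning window, cf. figure 1

p = 20                                             # degree of the LP model
r = np.array([np.dot(frame[k:], frame[:len(frame) - k]) for k in range(p + 1)])
a = solve_toeplitz(r[:p], -r[1:])                  # a(1), ..., a(p)

w, H = freqz([1.0], np.concatenate(([1.0], a)), worN=1024)
freq_hz = w * fs / (2 * np.pi)                     # frequency axis in Hz
fft_db = 20 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
lp_db = 20 * np.log10(np.abs(H) + 1e-12)           # LP envelope (gain omitted)
```

Plotting `fft_db` and `lp_db` on the same axes reproduces the kind of comparison shown in figure 2; the overall level of the LP curve depends on the gain, which is omitted here.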


The notation ŷ(n) refers to the estimate, or prediction, of y(n). The idea is that once we know the
input x(n) and the output y(n), we would like to predict the behaviour of the unknown system H(z),
as illustrated in figure 3. In the figure the output y(n) has been delayed, so that we can't use the
real output. The problem is now to determine the constants a(k) and b(k) in such a way that ŷ(n)
approximates the real output y(n) as accurately as possible.
The following terms describe the model:

autoregressive (AR) model: The output ŷ(n) is predicted using only previous outputs and the
current input, which means that b(k) = 0 for k ≠ 0, and only the a(k) and b(0) must be determined.
This corresponds to an all-pole filter.

moving average (MA) model: In this model the prediction is based only on the input, which gives
a(k) = 0. This model corresponds to an FIR filter.

autoregressive moving average (ARMA) model: This is the general model of equation (1), corresponding
to a general linear recursive filter. Each case can be realized directly as a digital filter, as sketched below.
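A minimal sketch (coefficient values are illustrative only) of the three model classes realized as digital filters with scipy.signal.lfilter, which implements y(n) = Σ b(k) x(n-k) - Σ a(k) y(n-k) for a(0) = 1:

```python
import numpy as np
from scipy.signal import lfilter

x = np.random.randn(1000)      # some input signal

b = [0.5, 0.2]                 # input coefficients b(0), b(1)
a = [1.0, -0.7, 0.1]           # A(z) = 1 + a(1) z^-1 + a(2) z^-2

y_arma = lfilter(b, a, x)      # ARMA: general recursive filter, eq. (1)
y_ma   = lfilter(b, [1.0], x)  # MA:  a(k) = 0           -> FIR filter
y_ar   = lfilter([0.5], a, x)  # AR:  b(k) = 0 for k != 0 -> all-pole filter
```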

In speech processing the AR model is preferred for the following reasons:

- the input (the excitation signal at the vocal cords) is unknown
- the parameters a(k) are computationally easy to determine
- as shown before, the vocal tract is theoretically an all-pole filter (excluding nasal sounds)
- an AR model of sufficiently high degree can also approximate an ARMA model
- a stable all-pole model can be used to represent the amplitude response of any system with any
  desired precision (however, the degree of the required all-pole model may be considerably high)

[Figure: block diagram in which x(n) drives the unknown system H(z) producing y(n); x(n) is fed to a
block B(z), and y(n), delayed by z^{-1}, is fed to a block A(z); together these form the prediction ŷ(n).]

Figure 3: Prediction of the output of the unknown system H(z) based on input and previous outputs.
In speech processing H(z) corresponds to the vocal tract and the input x(n) is usually unavailable.
Consider an all-pole system with transfer function

H(z) = \frac{G}{A(z)}    (2)

where

A(z) = 1 + a(1) z^{-1} + \cdots + a(p) z^{-p}

and G denotes the gain. The transfer function is the ratio of the output Y(z) and the input X(z) in
z-transformed form,

\frac{Y(z)}{X(z)} = \frac{G}{A(z)}

which implies

A(z)\, Y(z) = G\, X(z).    (3)

Taking the inverse z-transform of (3) yields the time domain relation

y(n) + \sum_{k=1}^{p} a(k)\, y(n-k) = G\, x(n)

which is

y(n) = G\, x(n) - \sum_{k=1}^{p} a(k)\, y(n-k)    (4)

E
where  is the input,   is the response and Q
S  T are the coefficients of the filter 768 .
& &
In other words the output of the all-pole system can be predicted perfectly if the input and the
previous outputs are known. In practice the prediction is never perfect since the systems are not linear
nor all-pole type and there is generally some noise in the output. Moreover in speech processing the
input  is unknown. Nevertheless, the vocal tract (as well as any other system) can be modeled by
using all-pole model and in this case the model works really well.
So by getting rid of dependence on the input  in equation (4) we end up with the following
model that will be used in from now on:

  AG !#"% $'& ) (*  +(*
The hat over  refers to estimate of  .
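A minimal numeric check (with made-up coefficients) of the claim in equation (4): if the input and the true previous outputs are known, the output of an all-pole system can be predicted exactly.

```python
import numpy as np
from scipy.signal import lfilter

G = 0.8
a = np.array([1.0, -1.2, 0.5])   # A(z) = 1 - 1.2 z^-1 + 0.5 z^-2 (stable poles)
x = np.random.randn(200)         # known excitation
y = lfilter([G], a, x)           # output of the all-pole system G / A(z)

y_hat = np.zeros_like(y)
for n in range(len(y)):
    # y(n) = G x(n) - sum_{k=1}^{p} a(k) y(n-k), using the true past outputs
    past = sum(a[k] * y[n - k] for k in range(1, 3) if n - k >= 0)
    y_hat[n] = G * x[n] - past

assert np.allclose(y, y_hat)     # the prediction is exact
```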

Our goal is to determine the parameters a(1), ..., a(p) so that ŷ(n) is close to the
recorded speech y(n) in some frame of the signal, i.e., so that the prediction error is minimized.
Once the parameters have been determined, we may, according to equation (2), use the following model
of the vocal tract:

H(z) = \frac{G}{A(z)}

where the gain G can be estimated, for example, from the energy of the prediction error. (There is a
much more elegant way to estimate G, but we are now mainly interested in A(z).)

5.2 Autocorrelation Method


The parameters a(1), ..., a(p) are to be determined so that the sum of squared errors

E = \sum_{n} \left( y(n) - \hat{y}(n) \right)^2

is minimized over all indices. In practice the sum is finite due to the finiteness of the signal, but it
is useful to think that the frame is infinitely long and only a few samples are nonzero. In the following,
the output y(n) will be denoted by s(n) (s referring to speech).

So we have a windowed speech signal s(n) where only a finite number of samples are nonzero.
With the given prediction coefficients a(1), ..., a(p), the energy of the prediction error can be
written as

E = \sum_{n=-\infty}^{\infty} e(n)^2
  = \sum_{n=-\infty}^{\infty} \left[ s(n) - \hat{s}(n) \right]^2
  = \sum_{n=-\infty}^{\infty} \left[ s(n) + \sum_{k=1}^{p} a(k)\, s(n-k) \right]^2

where p is the length of the prediction filter and ŝ(n) is the estimate of s(n) (the prediction in this case).
With the convention a(0) = 1, the energy of the prediction error can be written as

E = \sum_{n=-\infty}^{\infty} \left[ \sum_{k=0}^{p} a(k)\, s(n-k) \right]^2
Let us minimize E by choosing suitable coefficients a(1), ..., a(p). A necessary condition for the
optimality of the choice of a(m) is that the partial derivative of E with respect to the variable a(m) equals
zero. Notice that E depends on the variables a(1), ..., a(p), so it could be written as E(a(1), ..., a(p)),
but we omit this to keep the notation short.

So let's differentiate! The partial derivative with respect to a(m) (m = 1, ..., p) is

\frac{\partial E}{\partial a(m)}
  = \sum_{n} \frac{\partial}{\partial a(m)} \left[ \sum_{k=0}^{p} a(k)\, s(n-k) \right]^2
  = \sum_{n} 2 \left[ \sum_{k=0}^{p} a(k)\, s(n-k) \right] \frac{\partial}{\partial a(m)} \sum_{k=0}^{p} a(k)\, s(n-k)
  = \sum_{n} 2 \left[ \sum_{k=0}^{p} a(k)\, s(n-k) \right] s(n-m)

where the differentiation rule D f(x)^2 = 2 f(x)\, D f(x) has been utilized. By regrouping this we get

\sum_{n=-\infty}^{\infty} \left[ \sum_{k=0}^{p} a(k)\, s(n-k) \right] s(n-m)
  = \sum_{k=0}^{p} a(k) \sum_{n=-\infty}^{\infty} s(n-k)\, s(n-m)
  = \sum_{k=0}^{p} a(k)\, \phi(k, m)

where

\phi(k, m) = \sum_{n=-\infty}^{\infty} s(n-k)\, s(n-m)

is in fact the autocorrelation of the signal s(n) with delay k - m, which is

\sum_{n=-\infty}^{\infty} s(n)\, s(n - (k - m)).

Why? Making the substitution n \to n + m in the sum yields

\phi(k, m) = \sum_{n=-\infty}^{\infty} s(n-k)\, s(n-m)
  = \sum_{n=-\infty}^{\infty} s(n + m - k)\, s(n)
  = \sum_{n=-\infty}^{\infty} s(n)\, s(n - (k - m)).

Moreover, the term \phi(k, m) depends only on the value k - m, so it can be denoted by a one-variable
autocorrelation function:

r(k - m) = \phi(k, m)
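A minimal sketch of computing the autocorrelation values r(0), ..., r(p) of a windowed frame s(n) (zero outside the frame), exactly as defined above:

```python
import numpy as np

def autocorr(s, p):
    """r(k) = sum_n s(n) s(n - k) for k = 0, ..., p (finite windowed frame)."""
    s = np.asarray(s, dtype=float)
    return np.array([np.dot(s[k:], s[:len(s) - k]) for k in range(p + 1)])
```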

By setting the derivatives to zero, we obtain:

\sum_{k=0}^{p} a(k)\, r(k-1) = 0
\sum_{k=0}^{p} a(k)\, r(k-2) = 0
  \vdots
\sum_{k=0}^{p} a(k)\, r(k-p) = 0

which can also be written in the form (by remembering that a(0) = 1 and r(k) = r(-k))

\sum_{k=1}^{p} a(k)\, r(k-1) = -r(1)
\sum_{k=1}^{p} a(k)\, r(k-2) = -r(2)
  \vdots
\sum_{k=1}^{p} a(k)\, r(k-p) = -r(p)

which in turn can be reformulated with matrices as:

\begin{pmatrix}
r(0) & r(1) & r(2) & \cdots & r(p-1) \\
r(1) & r(0) & r(1) & \cdots & r(p-2) \\
r(2) & r(1) & r(0) & \cdots & r(p-3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r(p-1) & r(p-2) & r(p-3) & \cdots & r(0)
\end{pmatrix}
\begin{pmatrix} a(1) \\ a(2) \\ a(3) \\ \vdots \\ a(p) \end{pmatrix}
= - \begin{pmatrix} r(1) \\ r(2) \\ r(3) \\ \vdots \\ r(p) \end{pmatrix}

Notice that the coefficient matrix is symmetric (due to r(k) = r(-k)) and Toeplitz (since \phi(k, m)
depends only on k - m), which is crucial when deriving a fast computational method to find the
coefficients a(1), ..., a(p). A direct way to solve this system is sketched below.
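A minimal sketch of solving the normal equations above with SciPy. scipy.linalg.solve_toeplitz exploits the Toeplitz structure (it uses a Levinson-type recursion internally, foreshadowing section 5.2.1):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(s, p):
    """Return a(1), ..., a(p) for a windowed frame s (autocorrelation method)."""
    s = np.asarray(s, dtype=float)
    r = np.array([np.dot(s[k:], s[:len(s) - k]) for k in range(p + 1)])
    # coefficient matrix: symmetric Toeplitz with first column r(0..p-1);
    # right-hand side: -r(1..p), as in the matrix equation above
    return solve_toeplitz(r[:p], -r[1:])
```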
5.2.1 Levinson-Durbin Recursion
Recap: at this point we have derived the equations (the so-called normal equations) for the prediction
coefficients a(1), ..., a(p), based on the minimization of the prediction error. The coefficients
could now be solved by inverting the autocorrelation matrix, but this is computationally rather demanding.
To help us, Mr. Levinson and Mr. Durbin have developed an efficient algorithm for solving a
symmetric Toeplitz-type group of equations.

The basic idea is to solve the matrix equation in steps, that is, by increasing the length of the
coefficient vector and by calculating each new solution based on the previous one.

The optimal coefficients satisfy

\sum_{k=0}^{p} a(k)\, r(k) = E

where E is the sum of squares of the prediction error (more information can be found, for instance, in
the book T. W. Parsons, Voice and Speech Processing, McGraw-Hill, Inc., 1987). By using this, the
group of equations boils down to

\begin{pmatrix}
r(0) & r(1) & r(2) & \cdots & r(p) \\
r(1) & r(0) & r(1) & \cdots & r(p-1) \\
r(2) & r(1) & r(0) & \cdots & r(p-2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r(p) & r(p-1) & r(p-2) & \cdots & r(0)
\end{pmatrix}
\begin{pmatrix} 1 \\ a(1) \\ a(2) \\ \vdots \\ a(p) \end{pmatrix}
= \begin{pmatrix} E \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}

The matrix on the left is still symmetric and Toeplitz.
Assume that we have already solved the equation when p = 2. Now, let us see how this helps us to
solve a(1), a(2), a(3) when p = 3, where a subscript on the coefficients refers to the degree of the
equation. So this is what we have already solved:

\begin{pmatrix} r(0) & r(1) & r(2) \\ r(1) & r(0) & r(1) \\ r(2) & r(1) & r(0) \end{pmatrix}
\begin{pmatrix} 1 \\ a_2(1) \\ a_2(2) \end{pmatrix}
= \begin{pmatrix} E_2 \\ 0 \\ 0 \end{pmatrix}

The structure of the matrix yields

\begin{pmatrix} r(0) & r(1) & r(2) \\ r(1) & r(0) & r(1) \\ r(2) & r(1) & r(0) \end{pmatrix}
\begin{pmatrix} a_2(2) \\ a_2(1) \\ 1 \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ E_2 \end{pmatrix}

thus: symmetric Toeplitz matrices (and only them) have the nice property that when the coefficient
vector and the result vector are twisted upside down (switch the last and the first element, switch
the second last and the second, and so on), the equation is still satisfied. Let us now try the
following kind of solution to the bigger group of equations:

\begin{pmatrix}
r(0) & r(1) & r(2) & r(3) \\
r(1) & r(0) & r(1) & r(2) \\
r(2) & r(1) & r(0) & r(1) \\
r(3) & r(2) & r(1) & r(0)
\end{pmatrix}
\left[
\begin{pmatrix} 1 \\ a_2(1) \\ a_2(2) \\ 0 \end{pmatrix}
+ k_3 \begin{pmatrix} 0 \\ a_2(2) \\ a_2(1) \\ 1 \end{pmatrix}
\right]
= \begin{pmatrix} E_2 \\ 0 \\ 0 \\ q \end{pmatrix}
+ k_3 \begin{pmatrix} q \\ 0 \\ 0 \\ E_2 \end{pmatrix}
where q = \sum_{k=0}^{2} a_2(k)\, r(3-k).

For this to be a solution, we only require that all the elements, except the first one, in the vector
on the right side are equal to zero. It will be so if

q + k_3 E_2 = 0,

in other words

k_3 = -\frac{1}{E_2} \sum_{k=0}^{2} a_2(k)\, r(3-k).

We notice that

E_3 = E_2 \left( 1 - k_3^2 \right).

Justification:

E_3 = E_2 + k_3 q
    = E_2 + k_3 (-k_3 E_2)
    = E_2 \left( 1 - k_3^2 \right).

We have thus found that by trying a vector that is the sum of the lower degree solution and its
twisted version multiplied by a constant, we get a solution to the problem of the higher degree. The same
deduction works in general when increasing the size from m-1 to m. Thus, the results are

k_m = -\frac{1}{E_{m-1}} \sum_{k=0}^{m-1} a_{m-1}(k)\, r(m-k)

E_m = E_{m-1} \left( 1 - k_m^2 \right)

and

a_m(k) = a_{m-1}(k) + k_m\, a_{m-1}(m-k), \quad 1 \le k \le m-1, \qquad a_m(m) = k_m.

Because E_m \ge 0 (E_m is the prediction error of the m-th degree filter), it follows that

|k_m| \le 1.

The values k_m are called reflection coefficients.

The Levinson-Durbin recursion is started with the condition

E_0 = r(0)

which may be thought of as the error of the 0th degree predictor (no prediction at all).

There exist also other methods and variations for solving the coefficients, but the Levinson-Durbin
recursion is the most commonly used one. Besides, calculating the coefficients in this way guarantees
that the absolute values of the reflection coefficients are always at most 1, yielding a stable filter.
A direct implementation of the recursion is sketched below.
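A minimal sketch of the Levinson-Durbin recursion exactly as derived above. Input: autocorrelations r(0..p); output: coefficients a(1..p), reflection coefficients k_1..k_p, and the final prediction error energy E_p.

```python
import numpy as np

def levinson_durbin(r, p):
    r = np.asarray(r, dtype=float)
    a = np.zeros(p + 1)
    a[0] = 1.0                               # convention a(0) = 1
    E = r[0]                                 # E_0 = r(0): 0th degree predictor
    ks = []
    for m in range(1, p + 1):
        # k_m = -(1/E_{m-1}) sum_{k=0}^{m-1} a_{m-1}(k) r(m-k)
        k = -np.dot(a[:m], r[m:0:-1]) / E
        a_prev = a.copy()
        # a_m(j) = a_{m-1}(j) + k_m a_{m-1}(m-j); the j = m case gives a_m(m) = k_m
        a[1:m + 1] = a_prev[1:m + 1] + k * a_prev[m - 1::-1]
        E *= (1.0 - k * k)                   # E_m = E_{m-1}(1 - k_m^2)
        ks.append(k)
    return a[1:], np.array(ks), E
```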

The degree p of the model is chosen by considering that one pole corresponds to one formant,
and because there is approximately one formant per kilohertz, the degree is usually about the same as the
sampling frequency in kHz. For instance, when the sampling frequency is 8 kHz, the degree of the
model is 8. In practice, to compensate for the inaccuracies in the model (the AR assumption and others), the
degree is usually chosen to be a little higher. For instance, with a sampling frequency of 8 kHz a
reasonable model degree is 10 or 12, and with a 16 kHz sampling frequency the degree should be 18
or 20.

The LP analysis method discussed above is perhaps the most important method in speech processing.
In speech coding, for instance, it is used to code the excitation and vocal tract contributions
separately; in speech recognition it gives information about the spectrum of the speech (and in
this way about the phoneme); and in speech synthesis it enables controlling the vocal tract and excitation
separately. In Matlab the LPC (or LP) analysis is implemented by the command lpc.
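A hypothetical end-to-end use of the sketches above on one analysis frame, with the model degree chosen from the sampling frequency as just described (the frame here is a random stand-in, and levinson_durbin is the helper from section 5.2.1, not a library API):

```python
import numpy as np

fs = 16000
p = fs // 1000 + 4                      # 16 kHz -> degree 20, 8 kHz -> 12
frame = np.random.randn(480) * np.hanning(480)   # stand-in for a 30 ms frame

r = np.array([np.dot(frame[k:], frame[:len(frame) - k]) for k in range(p + 1)])
a, ks, E = levinson_durbin(r, p)        # recursion from section 5.2.1
G = np.sqrt(E)                          # crude gain estimate from error energy
# vocal tract model: H(z) = G / (1 + a(1) z^-1 + ... + a(p) z^-p)
# (Matlab's lpc(x, p) similarly returns [1, a(1), ..., a(p)] and the error variance)
```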
