
Chapter 3
LPC ANALYSIS AND SYNTHESIS
3.1 INTRODUCTION
Analysis of speech signals is performed to obtain their spectral
information. Speech analysis is employed in a variety of systems, such
as voice recognition systems and digital speech coding systems.
Widely accepted methods of analyzing speech signals make use of
linear predictive coding (LPC). Linear prediction is a good tool for the
analysis of speech signals. In linear prediction the human vocal tract
is modeled as an infinite impulse response system that produces the
speech signal. Voiced regions of speech have a resonant structure and
a high degree of similarity over time shifts that are multiples of the
pitch period; for this type of speech, LPC modeling produces an
efficient representation. In LPC the current sample of a speech signal
is estimated by a linear combination of weighted past samples of the
speech signal. The series of weights, or coefficients, are the LPC
coefficients, which are used as filter coefficients in the encoding and
decoding processes during coding. Nowadays, in many voice
recognition and speech coding systems, LPC analysis techniques are
used to generate the required spectral information of the speech
signal. Voice recognition systems use LPC techniques to produce
observation vectors (LPC coefficients), which are then used to
recognize the uttered words. Voice recognition systems have applications in
various industries, such as the telephone industry and consumer
electronics. For example, voice recognition is used in mobile telephony
to provide hands-free or voice dialing.
LPC analysis is usually conducted at the transmitting end for each
frame of the speech signal to obtain information such as the
voiced/unvoiced decision for the frame, its pitch, and the parameters
needed to build the filter for the current frame. This frame information
is transmitted to the receiving end, where the receiver performs LPC
synthesis using it. In LPC analysis the input speech signal, sampled at
8000 samples per second, is divided into frames of 160 samples, i.e.,
each frame represents 20 msec of the input speech signal. The reason
for framing is that speech is a non-stationary signal whose properties
change with time, which makes the direct application of the Discrete
Fourier Transform (DFT) or autocorrelation techniques to the whole
signal impractical. For most phonemes, however, the properties of the
speech signal remain invariant over a short period of time
(5-100 msec), so traditional signal processing methods can be applied
successfully within such an interval; most speech processing is done
in this manner. This short segment of the signal is called a frame, and
the frame length used here is 20 msec. Framing breaks the dependency
between samples at the frame boundaries. To reduce this loss of
dependency, adjacent frames are overlapped, with an overlap of 50%
on both sides. This, in turn, results in signal discontinuities at the
beginning and at the end of
each frame. To reduce these discontinuities, each frame is multiplied
by a window [25-30].
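As a rough illustration of this framing step, the following Python sketch (using NumPy) splits a signal sampled at 8000 samples per second into 160-sample frames with 50% overlap; the signal here is random noise standing in for real speech, and the helper name split_into_frames is only illustrative.

import numpy as np

FS = 8000                # sampling rate, samples per second
FRAME_LEN = 160          # 20 msec frames at 8 kHz
HOP = FRAME_LEN // 2     # 50% overlap between adjacent frames

def split_into_frames(speech):
    """Slice a 1-D speech signal into overlapping 20 msec frames."""
    n_frames = 1 + (len(speech) - FRAME_LEN) // HOP
    return np.stack([speech[i * HOP : i * HOP + FRAME_LEN]
                     for i in range(n_frames)])

speech = np.random.randn(FS)          # one second of stand-in "speech"
frames = split_into_frames(speech)    # shape (99, 160)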
3.2 WINDOWING
A window is a function that is zero everywhere except over the region
of interest. The purpose of windowing is to smooth the estimated power
spectrum and to avoid abrupt transitions in the frequency response
between adjacent frames. Windowing a speech signal involves
multiplying it, on a frame-by-frame basis, by a window whose length
equals the frame length. Multiplying a frame by a finite-length window
is equivalent to convolving the power spectrum of the frame with the
frequency response of the window. This causes the side lobes in the
frequency response of the window to have an averaging effect on the
power spectrum of the frame. Commonly used windows include the
rectangular, Hamming, Hanning and Blackman windows, of which the
Hamming and Hanning windows are the most widely used in speech
analysis. The rectangular window has high frequency resolution because
of its narrow main lobe, but it also has the largest spectral leakage.
This leakage is due to its large side lobes and makes the spectrum of
the speech signal noisier, which tends to offset the benefit of the
high frequency resolution; hence the rectangular window is not widely
used in speech analysis. The Hamming, Hanning and Blackman windows have lower
frequency resolution but much less spectral leakage, so they are widely
used in speech analysis. These windows taper smoothly towards the ends
and are close to one in the middle; the smoother ends and broader
middle section produce less distortion in the signal. In this thesis a
Hamming window of 160 samples, equal to the frame length, is used.
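A short continuation of the earlier sketch shows this windowing step: each 160-sample frame is multiplied element-wise by a Hamming window of the same length (the frames array is again random data standing in for real speech).

import numpy as np

FRAME_LEN = 160
window = np.hamming(FRAME_LEN)            # tapers smoothly towards the frame ends

frames = np.random.randn(99, FRAME_LEN)   # stand-in for the framed speech
windowed = frames * window                # broadcast over all frames at once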
Window length is another important parameter that affects the
smoothing. If the window is too long, it gives better frequency
resolution, but the spectral properties of the speech signal change
over such long durations. The frame must therefore be short enough
that the speech signal within it can be considered stationary, and so
the window duration also needs to be short. Making the window short,
however, has some disadvantages [22, 30-31]:
- The frame rate increases, which means more information is processed
  than necessary, thereby increasing the computational complexity.
- The spectral estimates become less reliable because of the
  stochastic nature of the speech signal.
- The pitch frequency typically lies between 80 and 500 Hz, which
  means that a pitch pulse occurs roughly every 2 to 12 msec.
- If the window is short compared to the pitch period, a pitch pulse
  will sometimes be present in the frame and sometimes not.
3.3 CHOOSING THE ORDER OF THE FILTER
Linear predictive coding is a time-domain technique that models the
speech signal as a linear combination of weighted, delayed past speech
samples. The LPC order is an important parameter in linear prediction
and affects the quality of the synthesized speech, since it determines
how many weighted past samples are used to estimate the current speech
sample. In this thesis an LPC order of 10 is chosen, which means that
the past 10 speech samples are used to estimate the current sample;
the resulting model is therefore called a 10th-order LPC model. In
linear prediction the number of prediction coefficients required to
suitably model the speech signal depends on the spectral content of
the source. Each formant, or peak in the spectrum, requires two poles
to represent it, and each pole requires one linear predictive
coefficient. In human speech roughly one formant is generally observed
per 1000 Hz of bandwidth, so the best LPC order depends on the
bandwidth of the sampled speech signal. In narrowband speech coding
the speech signal is band-limited to around 4 kHz using a low-pass
filter, so there are about four formants in its spectrum. Modeling
these four formants requires eight complex poles, so the filter order
must be at least eight; in practice two additional poles are included
to further minimize the residual energy. A total of ten poles is
therefore used to represent a narrowband speech signal, and the LPC
order is chosen as 10.
For an LPC order of 10, the number of LPC coefficients is 11, and the
first term of the 10th-order polynomial is always taken to be 1, which
is a very important assumption in LPC analysis [31-32].
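The following sketch illustrates this choice numerically: a 10th-order analysis of one windowed frame (random data standing in for speech) produces eleven coefficients whose first term is 1. It uses the autocorrelation method with scipy.linalg.solve_toeplitz; the helper name lpc_coefficients is only illustrative.

import numpy as np
from scipy.linalg import solve_toeplitz

ORDER = 10    # 10th-order LPC model

def lpc_coefficients(frame, order=ORDER):
    """Autocorrelation-method LPC; returns [1, -a_1, ..., -a_p]."""
    # autocorrelation estimates r(0) .. r(p)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # solve the Toeplitz normal equations for the predictor weights a_j
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))     # eleven values, leading term fixed at 1

frame = np.hamming(160) * np.random.randn(160)
coeffs = lpc_coefficients(frame)           # len(coeffs) == 11, coeffs[0] == 1.0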
3.4 LINEAR PREDICTIVE MODELING OF SPEECH SIGNALS
Linear prediction analysis is one of the most powerful speech analysis
methods. In it, the short-term correlation that exists between the
samples of a speech signal (the formant structure) is modeled and
removed using a low-order filter.
3.4.1 Source Filter Model of Speech Production
The source filter model of speech production is used as a means
for the analysis of speech signals. The block diagram of a source filter
model [31] is shown in Fig 3.1.
[Fig 3.1: an impulse train generator, driven by the pitch period, and a
random noise generator feed a voiced/unvoiced switch; the selected
excitation x(n), scaled by the gain G, drives a time-varying filter
defined by the LPC coefficients to produce the output speech.]

Fig 3.1 Source filter model of speech production
The excitation in this model is a train of impulses for voiced
segments of speech and random noise for unvoiced segments of speech.
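A minimal numerical sketch of Fig 3.1 is given below, assuming a 100 Hz pitch for the voiced case and purely illustrative filter coefficients (they are not taken from a real analysis):

import numpy as np
from scipy.signal import lfilter

FS = 8000
FRAME_LEN = 160
G = 0.5                                    # gain applied to the excitation

def excitation(voiced, pitch_hz=100.0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        x = np.zeros(FRAME_LEN)
        x[::int(FS / pitch_hz)] = 1.0      # one impulse per pitch period
        return x
    return np.random.randn(FRAME_LEN)

# placeholder time-varying filter: output = G*x(n) + 0.9*s(n-1) - 0.2*s(n-2)
a_poly = np.array([1.0, -0.9, 0.2])
voiced_speech = lfilter([G], a_poly, excitation(voiced=True))
unvoiced_speech = lfilter([G], a_poly, excitation(voiced=False))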
The combined spectral contributions of the glottal flow, the vocal
tract and the radiation at the lips are represented by a time-varying
filter with a steady-state system function given by
H(z) = \frac{S(z)}{X(z)} = \frac{G \left( 1 + \sum_{j=1}^{M} b_j z^{-j} \right)}{1 - \sum_{i=1}^{N} a_i z^{-i}}    (3.1)
Equation (3.1) represents the transfer function of a filter with both
poles and zeros, where S(z) is the Z-transform of the vocal tract
output and X(z) is the Z-transform of the vocal tract input. If the
order of the denominator is sufficiently high, H(z) can be approximated
by an all-pole model given by
H(z) = \frac{G}{A(z)} = \frac{G}{1 - \sum_{j=1}^{p} a_j z^{-j}}    (3.2)

where p is the order of the filter and

A(z) = 1 - \sum_{j=1}^{p} a_j z^{-j}    (3.3)
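As a side note, the magnitude response of the all-pole model G/A(z), whose peaks correspond to the formants, can be inspected with scipy.signal.freqz; the denominator coefficients below are placeholders rather than values estimated from speech.

import numpy as np
from scipy.signal import freqz

G = 1.0
a_poly = np.array([1.0, -0.9, 0.2])            # placeholder A(z) coefficients
freqs, response = freqz([G], a_poly, worN=512, fs=8000)
envelope_db = 20 * np.log10(np.abs(response))  # spectral envelope in dB versus Hz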
When equation (3.2) is transformed into the sampled time domain, it is
given by

s(n) = G\, x(n) + \sum_{j=1}^{p} a_j\, s(n-j)    (3.4)
Equation (3.4) represents the LPC difference equation. It states that
the value of the present speech sample s(n) is obtained by summing
the present input G x (n) and a weighted sum of the past speech
sample values. If \hat{a}_j represents the estimate of a_j, then the
error signal is the difference between the input and encoded speech
signals and is given by equation (3.5):

e(n) = s(n) - \sum_{j=1}^{p} \hat{a}_j\, s(n-j)    (3.5)
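Since equation (3.5) is simply the analysis filter A(z) = 1 - \sum_j \hat{a}_j z^{-j} applied to the speech frame, the error (residual) can be computed with scipy.signal.lfilter once the estimates are available. The sketch below reuses the autocorrelation-method helper sketched in section 3.3, with a random frame standing in for real speech.

import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc_polynomial(frame, order=10):
    """Return [1, -a_1, ..., -a_p] estimated by the autocorrelation method."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    return np.concatenate(([1.0], -solve_toeplitz(r[:order], r[1:order + 1])))

frame = np.hamming(160) * np.random.randn(160)
A = lpc_polynomial(frame)
residual = lfilter(A, [1.0], frame)   # e(n) = s(n) - sum_j a_j s(n-j)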
The estimates are now determined by minimizing the mean squared
error given by equation (3.6)
E\{ e^2(n) \} = E\left\{ \left[ s(n) - \sum_{j=1}^{p} \hat{a}_j\, s(n-j) \right]^2 \right\}    (3.6)
Setting the partial derivative of equation (3.6) with respect to
\hat{a}_j to zero for j = 1, \ldots, p gives

E\left\{ \left[ s(n) - \sum_{j=1}^{p} \hat{a}_j\, s(n-j) \right] s(n-i) \right\} = 0 \quad \text{for } i = 1, \ldots, p    (3.7)

That is, e(n) is orthogonal to s(n-i) for i = 1, \ldots, p.
Equation (3.7) can be rearranged as

\sum_{j=1}^{p} \hat{a}_j\, \phi_n(i, j) = \phi_n(i, 0)    (3.8)

where

\phi_n(i, j) = E\{ s(n-i)\, s(n-j) \}    (3.9)
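With the expectation in (3.9) replaced by a finite sum over the frame, equation (3.8) becomes a small p-by-p linear system. The sketch below forms the \phi_n(i, j) estimates from sample products and solves the system directly with np.linalg.solve (a random frame again stands in for real speech, so the resulting values are only illustrative).

import numpy as np

def solve_normal_equations(s, p=10):
    """Estimate a_1 .. a_p from equation (3.8) using sample correlations."""
    n = len(s)
    def phi(i, j):
        # phi_n(i, j): sum over the frame of s(m - i) * s(m - j)
        return np.dot(s[p - i : n - i], s[p - j : n - j])
    Phi = np.array([[phi(i, j) for j in range(1, p + 1)]
                    for i in range(1, p + 1)])          # left-hand side of (3.8)
    c = np.array([phi(i, 0) for i in range(1, p + 1)])  # right-hand side of (3.8)
    return np.linalg.solve(Phi, c)                      # the estimates a_hat_j

frame = np.hamming(160) * np.random.randn(160)
a_hat = solve_normal_equations(frame)                   # ten predictor weights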