Automatic speech recognition using cepstral and Itakura-Saito distances for vocal command

Conference Paper · January 2005


Third International Conference on Systems, Signals & Devices, March 21-24, 2005, Sousse, Tunisia
Volume III: Communication and Signal Processing

Automatic speech recognition using cepstral and Itakura-Saito distances for vocal command

Zied SAKKA, Abdennaceur KACHOURI, Ahmed BEN AISSA and Mounir SAMET.

LETI, Electronic and Information Technology Laboratory, National Engineering School of Sfax, BP W, 3038 Sfax, Tunisia.
Phone: (+216) 4 274 088 Fax: (+216) 4 275 595.
E-mail: sakka_zied@yahoo.fr, Abdennaceur.Kachouri@enis.rnu.tn, mounir.samet@yahoo.fr

Abstract

Speech recognition systems keep evolving and now reach significant performance. Nevertheless, the amount of computation remains large and complex, particularly in the classification phase. In this paper we are interested in the robustness of the LPCC, MFCC and PLPCC parameterization techniques and in classification by a simple measure of distance or distortion, namely the cepstral distance and the Itakura-Saito distance.

Keywords: Speech recognition, cepstral distance, Itakura-Saito distance.

1. Introduction

A vocal command is an order conveyed by sound, that is, by the voice.
The performance of an automatic speech recognition system depends on the effectiveness of the adopted parameterization technique and on the robustness of the classification technique used. Given the complexity of the speech signal, its redundancy and its inter- and intra-speaker variability, automatic speech recognition remains a difficult problem to solve. For a vocal command to succeed, every step of the recognition system must be handled carefully, in particular the parameterization techniques. Indeed, the more robust, discriminating and relevant the parameters are, the better the system performs. That is why our study focuses on parameterization techniques in order to improve the quality of acoustic modelling. To this end we present different cepstral parameterization techniques, namely LPCC (Linear Predictive Cepstral Coefficients), MFCC (Mel Frequency Cepstral Coefficients) and PLPCC (Perceptual Linear Predictive Cepstral Coefficients). As a classification method we adopted a distortion measure computed as a distance, namely the cepstral distance and the Itakura-Saito distance [1][3][7].
The acoustic-phonetic database used is TI46.

2. Speaker recognition system

The basic elements of a speaker recognition system are shown in Fig.1. An input utterance from an unknown speaker is analyzed to extract speaker characteristic features. The measured features are compared with prototype features obtained from known speaker models.
In the identification mode, a speech sample from an unknown speaker is analyzed and compared with models of known speakers. The unknown speaker is identified as the speaker whose model best matches the input speech sample. In the “closed set” identification mode, the number of decision alternatives is equal to the size of the population. In the “open set” identification mode, a reference model for the unknown speaker may not exist. In this case, an additional alternative, “the unknown does not match any of the models”, is required.
In the verification mode, the unknown speaker’s speech sample is compared with the model for the speaker whose identity is claimed. If the match is good enough, as indicated by passing a threshold test, the identity claim is verified.
Crucial to the operation of a speaker recognition system is the establishment and maintenance of speaker models. One or more enrolment sessions are required in which training utterances are obtained from known speakers. Features are extracted from the training utterances and compiled into models. Many speaker recognition systems include an updating

9973-959-01-9/ © 2005 / 9885 IEEE


facility in which test utterances are used to adapt speaker models and decision thresholds [2][6].

[Block diagram. Blocks: speech wave, feature extraction, similarity with each of the reference templates 1, 2, ..., N, maximum selection, identification result]
Fig.1: Structure of speaker recognition system.

3. Techniques of parameterization

In this part we describe the three main parameterization techniques that have attracted considerable interest in automatic speech recognition systems. These techniques are LPCC (Linear Predictive Cepstral Coefficients), MFCC (Mel Frequency Cepstral Coefficients) and PLPCC (Perceptual Linear Predictive Cepstral Coefficients) [4][5][7].

3.1. Parameterization by the LPCC coefficients

The LPCC coefficients are computed with the algorithm of Fig.2. First, the signal is pre-emphasized by a high-pass filter, then a Hamming window is applied to it. Next, we determine the autocorrelation coefficients as the inverse Fourier transform of the energy spectrum |FFT|². Finally, once the autocorrelation coefficients are obtained, we compute the coefficients of the all-pole autoregressive model with the Durbin recursion, and the cepstral recursion then yields the cepstral coefficients [4][5].

[Block diagram. Blocks: pre-emphasis, Hamming window, FFT, |·|², IFFT, Durbin recursion, cepstral recursion]
Fig.2: Algorithm of the LPCC technique.

3.2. Parameterization by the MFCC coefficients

This technique consists of an analysis by a filter bank distributed on a Mel scale. After pre-emphasis, the log energy of the signal weighted by a Hamming window is determined by fast Fourier transform. Filtering is carried out in the spectral domain by multiplying the obtained log energy by Hamming filters. The inverse Fourier transform of the logarithm of the spectral density then yields the cepstral coefficients called MFCC.

[Block diagram. Blocks: pre-emphasis, Hamming window, FFT, |·|², Log(·), IFFT, Durbin recursion, cepstral recursion]
Fig.3: Algorithm of the MFCC technique.

3.3. Parameterization by the PLPCC coefficients

PLP analysis is an improvement of LPC analysis. It takes into account the following three aspects:
- Integration of critical bands: the spectral density is mapped onto the Mel scale, then convolved with a function representing a critical-band filter.
- Pre-emphasis by an equal-loudness curve: the perceived intensity of a pure tone of constant acoustic intensity varies with the frequency of that tone. To simulate this phenomenon in PLP analysis, we multiply the spectral density resulting from the preceding step by a weighting function.
- Stevens' law: the two preceding treatments are insufficient to establish the correspondence between measured intensity and perceived intensity (loudness). Stevens' law gives a relation between loudness and intensity:

$$\text{Loudness} = (\text{intensity})^{0.33} \qquad (1)$$

[Block diagram. Blocks: pre-emphasis, Hamming window, FFT, |·|², Mel-scale, (·)^0.33, IFFT, Durbin recursion, cepstral recursion]
Fig.4: Algorithm of the PLPC technique.

4. Classification and decision

Once the parameters of the word are obtained, they must be compared with those stored in the reference dictionary. This comparison is made by measuring their distortion with a simple distance calculation. We chose two types of distance, namely the cepstral distance and the Itakura-Saito distance. Depending on whether this distance is above or below an empirically estimated threshold, the word is judged to be the correct one or not. This threshold varies with the parameterization technique and with the type of distance chosen.

4.1. Cepstral distance

The cepstral distance is given by the following equation, where c and c' are cepstral coefficients:

$$d_{cep}^{2} = \sum_{l=1}^{\infty} \left[ c(l) - c'(l) \right]^{2} \qquad (2)$$

Fig.5 shows the cepstral distance function.

[Plot titled "cepstral distance function": x-axis "coefficients" from -10 to 10, y-axis "distance" from 0 to 100]
Fig.5: The cepstral distance function.

4.2. Itakura-Saito Distance

This distance is described by the following equation, where f and f' are two spectral densities:

$$d_{IS}(f, f') = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ \frac{f}{f'} - \ln\frac{f}{f'} - 1 \right] d\theta \qquad (3)$$

Fig.6 below illustrates the Itakura-Saito function.

[Plot titled "Itakura-Saito function": x-axis "coefficients" from 0 to 5, y-axis "distance" from 0 to 4]
Fig.6: The function of Itakura-Saito.

5. Evaluation

The purpose of this paper is to evaluate the different methods of parameterization of the speech signal in order to build a robust system for word
recognition that best suits our application, and to implement a simple classification method. We used the TI46 database as the basis of our work. The database consists of 15 repetitions of each of 46 words produced by different speakers, that is, 690 tokens. Our system is required to recognize three words: "erase", "repeat" and "stop". We first tested the three parameterization techniques LPCC, MFCC and PLPCC with classification by the cepstral distance. The confusion matrices for each of these techniques are given below.

Table 1: Confusion matrix for the LPCC parameterization technique with cepstral distance.
          Stop      Repeat    Erase
Stop      86.66%    0%        13.33%
Repeat    0%        80%       20%
Erase     0%        20%       80%

Table 2: Confusion matrix for the MFCC parameterization technique with cepstral distance.
          Stop      Repeat    Erase
Stop      100%      0%        0%
Repeat    0%        93.33%    6.66%
Erase     0%        20%       80%

Table 3: Confusion matrix for the PLPCC parameterization technique with cepstral distance.
          Stop      Repeat    Erase
Stop      86.66%    0%        13.33%
Repeat    0%        80%       20%
Erase     0%        26.66%    73.33%

We also tested the PLP parameterization technique, applied to spectral coefficients, with the Itakura-Saito distance.

Table 4: Confusion matrix for the PLPC parameterization technique with Itakura-Saito distance.
          Stop      Repeat    Erase
Stop      100%      0%        0%
Repeat    0%        93.33%    6.33%
Erase     0%        6.33%     93.33%

6. Conclusion

In this paper, we have tested the LPCC and MFCC parameterization techniques with classification by the cepstral distance, and the PLPC technique with both the cepstral and the Itakura-Saito distances.
The results, namely the recognition and confusion rates, are satisfactory. The MFCC and LPCC techniques, with classification by the cepstral distance, are more robust than PLPCC, and their recognition rates are more stable. The PLPC technique, for its part, works better with the Itakura-Saito distance than with the cepstral distance. Finally, we can rank these techniques in the following order:
- PLPC with classification by the Itakura-Saito distance.
- MFCC and LPCC with classification by the cepstral distance.
- PLPC with classification by the cepstral distance.

REFERENCES

[1] Calliope, "La parole et son Traitement Automatique", edited by J.P. Tubach, Masson, 1989.
[2] Z. Sakka, A. Mezghani, A. Kachouri and M. Samet, "Identification Automatique du Locuteur", Quatrièmes Journées Scientifiques JS'2003, 21-22 May 2003, Ecole de l'Aviation de Borj El-Amri, Tunisia.
[3] A. Bouzid, K. Ouni and N. Ellouze, "Dynamic time warping applied to vocal audiometry", Smart Systems and Devices, pp. 463-466, Hammamet (Tunisia), 27-30 March 2001.
[4] V. Mäkinen, "Front-end Feature Extraction with Mel-scaled Cepstral Coefficients", Laboratory of Computational Engineering, Helsinki University of Technology, 14.9.2000.
[5] H. Gabzelli, Z. Lachiri and N. Ellouze, "Robustesse des paramètres LPCC, MFCC, PLPC au bruit industriel", Conférence Internationale: Sciences Electroniques, Technologies de l'Information et Télécommunications, SETIT, Sousse (Tunisia), 2004.
[6] Z. Sakka, A. Kachouri, A. Mezghani and M. Samet, "Reconnaissance du locuteur par la technique de quantification vectorielle", Conférence Internationale: Signaux, Circuits et Systèmes, SCS, Monastir (Tunisia), 2004.
[7] Lynn D. Wilcox and Marcia A. Bush, "Speech Recognition", The Electrical Engineering Handbook, ISBN 0849385741.
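As an illustration of the LPCC chain described in Section 3.1 (pre-emphasis, Hamming window, autocorrelation, Durbin recursion, cepstral recursion), a minimal Python sketch is given here. It is not the implementation used in this work. The function names, the pre-emphasis factor 0.97 and the default orders are assumptions made for the example, and the autocorrelation is computed directly in the time domain rather than through FFT, squared modulus and IFFT as in Fig.2 (the two are equivalent when the FFT is sufficiently zero-padded).

```python
import math

def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter; alpha = 0.97 is an assumed value."""
    return [signal[0]] + [signal[i] - alpha * signal[i - 1]
                          for i in range(1, len(signal))]

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
            for i in range(n)]

def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one windowed frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Durbin recursion for the all-pole model.

    Convention assumed here: s[n] is predicted by sum_k a[k] * s[n-k].
    Returns (a, err); a[0] is unused, err is the residual energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral recursion: c[n] = a[n] + sum_{k<n} (k/n) * c[k] * a[n-k]."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

def lpcc(frame, order=12, n_ceps=12, alpha=0.97):
    """LPCC vector of one speech frame (orders are assumed defaults)."""
    emphasized = pre_emphasize(frame, alpha)
    window = hamming(len(emphasized))
    windowed = [s * w for s, w in zip(emphasized, window)]
    r = autocorrelation(windowed, order)
    a, _ = levinson_durbin(r, order)
    return lpc_to_cepstrum(a, n_ceps)
```

For a first-order model with a[1] = 0.5, the recursion gives c(n) = 0.5^n / n, which is the known cepstrum of a single real pole and serves as a quick sanity check.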

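The classification of Section 4 (cepstral distance of Eq. (2), Itakura-Saito distance of Eq. (3), then comparison with an empirical threshold) can be sketched in Python as below. This is a hedged illustration under stated assumptions: the infinite sum of Eq. (2) is truncated to the available coefficients, the integral of Eq. (3) is approximated by the mean over equally spaced frequency samples, and the function names, the reference dictionary layout and the threshold values are hypothetical.

```python
import math

def cepstral_distance_sq(c, c_ref):
    """Eq. (2), truncated to the coefficients actually available."""
    if len(c) != len(c_ref):
        raise ValueError("coefficient vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(c, c_ref))

def itakura_saito(f, f_ref):
    """Eq. (3) approximated on equally spaced frequency samples.

    f and f_ref are strictly positive spectral densities sampled at the
    same frequencies; the mean replaces (1 / 2 pi) times the integral.
    """
    if len(f) != len(f_ref):
        raise ValueError("spectra must have the same length")
    ratios = [x / y for x, y in zip(f, f_ref)]
    return sum(r - math.log(r) - 1.0 for r in ratios) / len(ratios)

def classify(features, references, distance, threshold):
    """Return (label, distance) of the closest reference.

    The label is None when even the best distance exceeds the threshold,
    i.e. the word is rejected. The threshold is empirical and depends on
    the parameterization and on the distance, as stated in Section 4.
    """
    best_label, best_d = None, math.inf
    for label, ref in references.items():
        d = distance(features, ref)
        if d < best_d:
            best_label, best_d = label, d
    if best_d > threshold:
        return None, best_d
    return best_label, best_d
```

Note that the Itakura-Saito measure is not symmetric: exchanging f and f' generally changes its value, so the order of the test and reference spectra has to be fixed once and kept.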

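The percentages in the confusion matrices of Section 5 are consistent with counts out of 15 utterances per word (for instance 14/15 gives 93.33% and 1/15 gives 6.66% after truncation). The short Python sketch below shows this arithmetic. The counts are an assumption inferred from the percentages of Table 2 and from the 15 repetitions per word of TI46; the paper itself reports only percentages, and the function names are hypothetical.

```python
def row_percentages(counts):
    """Convert each row of a count matrix into percentages of its row total."""
    return [[100.0 * c / sum(row) for c in row] for row in counts]

def mean_recognition_rate(counts):
    """Share of correctly recognized utterances (diagonal over total), in %."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return 100.0 * correct / total

# Assumed counts (rows and columns ordered Stop, Repeat, Erase), chosen so
# that the row percentages reproduce Table 2 (MFCC with cepstral distance).
mfcc_counts = [
    [15, 0, 0],
    [0, 14, 1],
    [0, 3, 12],
]

mfcc_percent = row_percentages(mfcc_counts)
mfcc_mean_rate = mean_recognition_rate(mfcc_counts)
```

Under these assumed counts, 41 of the 45 utterances are recognized correctly, so the mean recognition rate for the MFCC technique is 41/45, about 91.1%.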