International Journal of Innovative Research in Advanced Engineering (IJIRAE) ISSN: 2349-2163

Volume 1 Issue 8 (September 2014) www.ijirae.com




Speaker Identification based on GFCC using GMM

Md. Moinuddin, M. Tech. Student, E&CE Dept., PDACE
Arunkumar N. Kanthi, Asst. Professor, E&CE Dept., PDACE

Abstract: The performance of a conventional speaker identification system degrades drastically in the presence of noise. The ability of the human ear to identify a speaker in noisy environments motivates the use of an auditory-based feature called the gammatone frequency cepstral coefficient (GFCC). The GFCC is based on the gammatone filter bank, which models the basilar membrane as a series of overlapping band-pass filters. A speaker identification system using GFCC features and GMMs has been developed and analysed on the TIMIT and NTIMIT databases, and its performance is compared with that of a baseline system using traditional MFCC features. The results show that the GFCC features give good recognition performance not only in a clean speech environment but also in a noisy environment.

Keywords: Auditory-based feature, Gammatone Frequency Cepstral Coefficient (GFCC), MFCC, GMM, EM algorithm

I. INTRODUCTION

Speaker identification determines which of the enrolled speakers a given utterance has come from. The utterance can be constrained to a known phrase (text-dependent) or totally unconstrained (text-independent). The task consists of feature extraction, speaker modeling and decision making. Typically, the extracted speaker features are Mel-frequency cepstral coefficients (MFCCs). For speaker modeling, Gaussian mixture models (GMMs) are widely used to describe the feature distributions of individual speakers [5]. Recognition decisions are usually made based on the likelihood of observing the feature frames given a speaker model. The poor performance of MFCCs in noisy or mismatched conditions can be attributed to the use of triangular filters for modelling the auditory critical bands. To model the cochlear filter more accurately, gammatone filters are used instead of the triangular filters, and the extracted features are called gammatone frequency cepstral coefficients (GFCCs).

II. THE SYSTEM MODEL
A speaker identification system consists of two parts: a front-end and a back-end. The front-end of the system is a feature extractor, while the back-end consists of a classifier and a reference database.
Figure 1: Architecture of the speaker identification system (front-end: train utterances and test utterance → GFCC extractor; back-end: GMM modeling → database on the training path, ML classifier → identification result on the testing path)


The main task of the front-end is to extract features from the speech signal. The aim is to represent the characteristics of the signal sufficiently while reducing redundancy. Features are extracted frame by frame: one feature vector is calculated for every frame.
After feature extraction, the sequence of feature vectors is passed to the back-end of the speaker identification system. Based on the feature vectors, the back-end selects the most likely speaker out of all the possibilities in the reference database.
After training, the statistical models are stored in the database. When an unknown utterance is presented, its feature vectors are obtained; the classifier computes the log likelihood under each stored model and decides on the most likely speaker.
III. GFCC EXTRACTION

The GFCC features are based on the GammaTone Filter Bank (GTFB). The feature vectors are calculated from the
spectra of a series of windowed speech frames. The figure below shows the block diagram of GFCC extraction.






Figure 2: Block diagram of GFCC extraction (speech utterance → pre-emphasis → framing & windowing → DFT & |·|² → GTFB → logarithmic compression → DCT → GFCC features)

Pre-emphasis stage:

The high-frequency components of a speech signal have low amplitude compared with the low-frequency components, due to the radiation effect of the lips.
In order to spectrally flatten the speech signal, i.e. to obtain similar amplitudes for all frequency components, the speech signal is passed through a pre-emphasis filter: a first-order FIR digital filter that effectively removes the spectral contribution of the lips. The speech after pre-emphasis sounds much sharper.
The transfer function of the pre-emphasis filter is given by

H(z) = 1 - a z^{-1}    (1)

where a is a constant with a typical value of 0.97.

Fig 3: Pre-emphasis operation
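
As a minimal sketch (assuming the signal is a 1-D NumPy array and a = 0.97), this filter can be applied as:

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """First-order FIR pre-emphasis, eq. (1): y[n] = x[n] - a * x[n-1]."""
    # Keep the first sample unchanged; difference the rest against the previous sample.
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```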

Framing & Windowing:

Speech is a non-stationary signal, i.e. its statistical characteristics vary with time. Since the glottal system cannot change instantaneously, speech can be considered time-invariant over short segments (20-30 ms). The speech signal is therefore split into frames of 20 ms.

When the signal is framed, the edges of each frame must be treated carefully, since abrupt truncation adds spectral leakage. A windowing function is therefore used to taper the edges. The chosen window should have a narrow main lobe and well-attenuated side lobes, which makes the Hamming window the preferred choice. The Hamming window is given by

w(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right),   0 \le n \le N-1    (2)

Fig 4: Windowing operation

As a consequence of windowing, the samples are not assigned the same weight in the subsequent computations; for this reason it is sensible to use an overlap between frames (10 ms).
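
A minimal sketch of framing and windowing, assuming a 16 kHz sampling rate (the paper does not state one) with 20 ms frames and a 10 ms hop:

```python
import numpy as np

def frame_and_window(signal, fs=16000, frame_ms=20, hop_ms=10):
    """Split a signal (at least one frame long) into overlapping frames
    and taper each with a Hamming window, eq. (2)."""
    frame_len = int(fs * frame_ms / 1000)   # 320 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 160 samples -> 10 ms overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)          # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.stack([signal[t * hop : t * hop + frame_len]
                       for t in range(n_frames)])
    return frames * window                  # one windowed frame per row
```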

DFT:

Each windowed frame is transformed using the Discrete Fourier Transform and the magnitude is taken, because the phase does not carry any speaker-specific information.

Fig 5: DFT operation
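
Following the DFT & |·|² block of Figure 2, this step reduces to one line over the frames from above (the 512-point FFT size is an assumption):

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Magnitude-squared DFT of each windowed frame (one frame per row)."""
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
```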

Gammatone filter bank stage:

The gammatone filter bank consists of a series of band-pass filters, which models the frequency selectivity of the basilar membrane [4]. The impulse response of the m-th filter is given by

g_m(t) = a t^{n-1} e^{-2\pi b_m t} \cos(2\pi f_m t + \phi),   1 \le m \le M    (3)

where a is a constant (usually equal to 1), n is the filter order (here n = 4), \phi is the phase shift, f_m is the center frequency and b_m is the attenuation factor of the filter, which is related to the bandwidth of the filter and determines the decay rate of the impulse response.

Fig 6: Frequency response of a 64-channel gammatone filter bank


The center frequency f_m of the m-th gammatone filter is spaced uniformly on the ERB scale between the lower and upper frequencies f_L and f_H of the filter bank [4]:

f_m = -C + (f_H + C) \exp\left( -\frac{m}{M} \ln\frac{f_H + C}{f_L + C} \right),   C = 9.26449 \times 24.7    (4)

The bandwidth of each filter is described by an Equivalent Rectangular Bandwidth (ERB). The ERB is a psychoacoustic measure of the width of the auditory filter at each point along the cochlea:

ERB(f) = 24.7 \left( \frac{4.37 f}{1000} + 1 \right)    (5)

The attenuation factor of each filter is then given in terms of the ERB at its center frequency as

b_m = 1.019 \, ERB(f_m)    (6)

The FFT magnitude coefficients are binned by correlating them with each gammatone filter, i.e. each FFT coefficient is multiplied by the gain of the corresponding filter and the results are accumulated. Thus, each bin holds the spectral magnitude in that filter-bank channel:

E_m = \sum_{k} |X(k)|^2 \, G_m(k)    (7)


Fig 7: Filter bank processing
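
A sketch of this stage under stated assumptions: a 16 kHz sampling rate, a 512-point FFT, and the common approximation that the magnitude response of an n-th order gammatone filter centered at f_m is [1 + ((f - f_m)/b_m)^2]^{-n/2}:

```python
import numpy as np

def gammatone_filterbank_energies(power_spec, fs=16000, n_fft=512,
                                  n_filters=64, f_low=50.0, f_high=8000.0):
    """Bin an FFT power spectrum (frames x bins) into gammatone channels."""
    C = 9.26449 * 24.7                       # EarQ * minBW, as in Slaney [4]
    m = np.arange(1, n_filters + 1)
    # Center frequencies spaced on the ERB scale, eq. (4).
    fc = -C + (f_high + C) * np.exp(-(m / n_filters)
                                    * np.log((f_high + C) / (f_low + C)))
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # eq. (5)
    b = 1.019 * erb                          # eq. (6)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # Approximate 4th-order gammatone magnitude response at each FFT bin.
    gains = (1.0 + ((freqs[None, :] - fc[:, None]) / b[:, None]) ** 2) ** -2
    return power_spec @ gains.T              # eq. (7): frames x n_filters
```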

Logarithmic compression & Discrete Cosine Transform (DCT) stage:

The logarithm is applied to each filter output to simulate the loudness perceived by humans at a given signal intensity, and to separate the excitation (source) produced by the vocal cords from the filter that represents the vocal tract. Since the log-power spectrum is real, the Discrete Cosine Transform (DCT) is applied to the filter outputs, which produces highly uncorrelated features. The envelope of the vocal tract changes slowly and thus appears at low quefrencies (lower-order cepstrum), while the periodic excitation appears at high quefrencies (higher-order cepstrum). The cepstral coefficients are obtained as

c_j = \sqrt{\frac{2}{M}} \sum_{m=1}^{M} \log(E_m) \cos\left( \frac{\pi j}{M} (m - 0.5) \right),   1 \le j \le J    (8)


Fig 8: Logarithm and DCT operation
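
A sketch of these two stages using SciPy's orthonormal DCT-II, which matches eq. (8) up to the scaling of the first coefficient; keeping 23 coefficients follows the experimental setup below:

```python
import numpy as np
from scipy.fft import dct

def gfcc(filterbank_energies, n_coeffs=23):
    """Log-compress the filter-bank energies and decorrelate with a DCT."""
    log_energies = np.log(filterbank_energies + 1e-12)  # guard against log(0)
    # DCT-II along the filter axis; keep the lowest-quefrency coefficients.
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]
```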
IV. GAUSSIAN MIXTURE MODEL

The task is to classify the feature vectors extracted from an utterance. Each speaker is represented by a speaker model

\lambda = \{ w_i, \mu_i, \Sigma_i \},   i = 1, \ldots, M

where \mu_i is the mean vector, \Sigma_i is the covariance matrix and w_i is the mixture weight of the i-th component.

The Gaussian Mixture Model (GMM) expresses the probability density function of a random variable as a weighted sum of components, each of which is described by a Gaussian density [5]. The feature vectors extracted from the speech of an enrolled speaker are modelled as

p(x|\lambda) = \sum_{i=1}^{M} w_i \, g(x|\mu_i, \Sigma_i)    (9)

where x is a D-dimensional random vector and

g(x|\mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right)

is the component density.

Let X = \{ x_1, \ldots, x_T \} be the set of training feature vectors. Training a GMM requires the computation of w_i, \mu_i and \Sigma_i from the feature vectors belonging to a speaker. Maximum likelihood estimation is used to estimate these parameters.
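
As a direct sketch of eq. (9) for a single feature vector (full covariances, illustrative helper names):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, weights, means, covs):
    """log p(x | lambda) as the weighted sum of Gaussian densities, eq. (9)."""
    density = sum(w * multivariate_normal.pdf(x, mean=mu, cov=cov)
                  for w, mu, cov in zip(weights, means, covs))
    return np.log(density)
```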



Maximum Likelihood Estimation

ML estimation aims to maximize the likelihood p(X|\lambda) of the GMM over the given set of feature vectors X = \{ x_1, \ldots, x_T \}. Assuming the vectors to be independent,

\log p(X|\lambda) = \sum_{t=1}^{T} \log \sum_{i=1}^{M} w_i \, g(x_t|\mu_i, \Sigma_i)    (10)

Since the logarithm cannot be moved inside the summation, direct maximization is not possible. However, the estimates can be obtained iteratively using the Expectation-Maximization (EM) algorithm [6].

Expectation Maximization Algorithm

1. Initialize:
   the means \mu_i by clustering the feature vectors through the k-means algorithm;
   the mixture weights to be equally likely, by setting each weight to w_i = 1/M;
   each covariance matrix \Sigma_i to the identity matrix.

2. Expectation step: evaluate the responsibilities

\gamma_t(i) = \frac{w_i \, g(x_t|\mu_i, \Sigma_i)}{\sum_{j=1}^{M} w_j \, g(x_t|\mu_j, \Sigma_j)}

3. Maximization step: update the parameters using the current responsibilities

\mu_i^{new} = \frac{1}{N_i} \sum_{t=1}^{T} \gamma_t(i) \, x_t

\Sigma_i^{new} = \frac{1}{N_i} \sum_{t=1}^{T} \gamma_t(i) (x_t - \mu_i^{new})(x_t - \mu_i^{new})^T

w_i^{new} = \frac{N_i}{T},   where   N_i = \sum_{t=1}^{T} \gamma_t(i)

4. Evaluate the log likelihood \log p(X|\lambda) of eq. (10) and check for the convergence of the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
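
In practice the procedure above is available off the shelf. A minimal sketch with scikit-learn, whose GaussianMixture likewise initializes the means with k-means and refines them with EM (diagonal covariances are a common choice for speaker GMMs, though the paper does not state its covariance type):

```python
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, n_components=32):
    """Fit one GMM per speaker on a (n_frames x n_dims) GFCC matrix."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag',  # assumption, see lead-in
                          max_iter=100)
    return gmm.fit(features)
```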

V. SPEAKER IDENTIFICATION

Speaker identification is performed by finding the speaker model that has the maximum a posteriori probability for the given set of test feature vectors X = \{ x_1, \ldots, x_T \}:

\hat{S} = \arg\max_{1 \le s \le S} p(\lambda_s|X)

By Bayes' rule we have

\hat{S} = \arg\max_{1 \le s \le S} \frac{p(X|\lambda_s) \, p(\lambda_s)}{p(X)}

Assuming the speakers to be equally likely, i.e. p(\lambda_s) = 1/S, and noting that p(X) is independent of the speaker model, the above equation simplifies to

\hat{S} = \arg\max_{1 \le s \le S} p(X|\lambda_s)

Assuming the feature vectors to be occurrences of independent random variables and taking the logarithm, we get

\hat{S} = \arg\max_{1 \le s \le S} \sum_{t=1}^{T} \log p(x_t|\lambda_s)
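
A sketch of this decision rule over the trained models (score_samples returns per-frame log densities, so their sum is the total log likelihood):

```python
import numpy as np

def identify_speaker(test_features, speaker_models):
    """Return the index s maximizing sum_t log p(x_t | lambda_s)."""
    scores = [model.score_samples(test_features).sum()
              for model in speaker_models]
    return int(np.argmax(scores))
```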


VI. EXPERIMENTAL RESULTS

Speech Database:
Experiments are conducted on the TIMIT and NTIMIT speech databases. TIMIT consists of read speech recorded in a quiet environment without channel distortion; it has 630 speakers (438 male and 192 female) with 10 utterances per speaker, each 3 seconds long on average. NTIMIT was created by transmitting all TIMIT utterances over actual telephone channels.
The performance of the speaker identification system is evaluated using 23-dimensional GFCC features and baseline 13-dimensional MFCC features for different orders of GMM.




1. For each database, 8 of the 10 utterances are used for training (about 24 s) and 2 for testing (about 6 s).
   i. Logarithmic compression: (table of identification accuracies)
   ii. Cubic root compression: (table of identification accuracies)

2. For each database, 9 of the 10 utterances are used for training (about 27 s) and 1 for testing (about 3 s).
   i. Logarithmic compression: (table of identification accuracies)
   ii. Cubic root compression: (table of identification accuracies)


VII. CONCLUSION

The results show that the gammatone frequency cepstral coefficient (GFCC) features capture speaker characteristics better than the conventional MFCC features and give good recognition performance not only in a clean speech environment (TIMIT) but also in a noisy environment (NTIMIT). Further, the modified MFCC (MMFCC) and modified GFCC (MGFCC) features, obtained by replacing the logarithm with the cubic root, show a drop in identification performance on both clean and noisy speech.

REFERENCES

[1] E. B. Tazi, A. Benabbou, and M. Harti, "Efficient Text Independent Speaker Identification Based on GFCC and CMN Methods", ICMCS 2012, pp. 90-95.
[2] He Xu, Lin Lin, Xiaoying Sun and Huanmei Jin, "A New Algorithm for Auditory Feature Extraction", CSNT 2012, pp. 229-232.
[3] Feng Song H. and Xiao Cao, "An Auditory Feature Extraction Method for Robust Speaker Recognition", ICCT 2012, pp. 1067-1071.
[4] M. Slaney, "An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank", Apple Technical Report No. 35, Advanced Technology Group, Apple Computer Inc., 1993.
[5] Douglas A. Reynolds and Richard C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models", IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
[6] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer Science+Business Media, 2006.
[7] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Pearson Education (Singapore) Pte. Ltd., 2005.
