Comparative Study of Speaker Recognition System Using VQ and GMM

International Journal of Advanced in Management, Technology and Engineering Sciences ISSN NO : 2249-7455

COMPARATIVE STUDY OF SPEAKER RECOGNITION SYSTEM USING VQ AND GMM

A. Ramanjaneyulu1, V. Srinivas2, P. HemaKumar3

1 P.G. Scholar, Dept of ECE, Swaranandhra Inst of Engg & Tech, Narsapur, Andhra Pradesh
2,3 Dept of ECE, Swaranandhra Inst of Engg & Tech, Narsapur, Andhra Pradesh
ramanjaneyulu.addala@gmail.com

ABSTRACT:
Recognition rates of speaker recognition systems have improved with recent advances in speaker recognition techniques, but the modeling stage still leaves room for improvement. Recognizing an unknown speaker is difficult in an uncontrolled environment, where noise is added to the input signal and the trained set contains a large number of speakers. The main aim of this work is therefore to develop a text-independent speaker recognition system using MFCC and GMM, with an adaptive filter as a pre-processing stage. Feature vectors are extracted from the input voice using MFCC, and the GMM is trained with the EM algorithm. The adaptive filter reduces the noise in the speech input before it is passed to the feature extraction phase. The system is developed as a text-independent speaker recognition system with 50 speakers and uses a locally recorded database for training. The performance of the proposed system with the adaptive filter is evaluated on the basis of log likelihood scores.

Keywords: Adaptive Filter, Mel Frequency Cepstral Coefficients (MFCC), Gaussian Mixture Model (GMM)

I. INTRODUCTION
Nowadays, speaker recognition applications are widely used in several fields. Speaker recognition is the process of recognizing the person who is speaking from speech recordings, which carry information specific to each speaker. It allows a speaker to use his or her voice for identity verification in applications such as voice operators, shopping, telephone transactions, information or database access, voice mail, remote access to computers and security checks for confidential information areas.
Speaker recognition is the process of automatically identifying a person on the basis of individual speech alone. The objective is to extract, characterize and recognize the information about the speaker's identity. The speech signal changes continuously as a function of time and has significant energy from 0 Hz to 5 kHz. Many advances have made it possible to use the speaker's voice to verify identity in security applications and to control access to services such as voice mail, information services and database access services. The problem of speaker recognition can be classified as speaker identification or speaker verification. Speaker identification determines which of the known speakers in the trained database is speaking, whereas speaker verification checks that the speaker is the person he or she claims to be. Speaker identification can be further divided into two classes, open set and closed set. In open set speaker identification, the system decides which known speaker the unknown speech sample resembles most and, if no satisfactory match is found, concludes that the speech lies outside the database. In closed set speaker identification, the speaker's data is guaranteed to be present in the database.
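The difference between closed set and open set identification can be illustrated with a short sketch. It assumes that each enrolled speaker has already been assigned a matching score where higher means a better match; the speaker names, scores and threshold are hypothetical values chosen only for illustration.

```python
def identify(scores, threshold=None):
    """Pick the best-scoring enrolled speaker.

    scores: dict mapping speaker name -> matching score (higher is better).
    threshold: None for closed set identification (always return the best
    match); a number for open set identification (return None when even the
    best match scores below the threshold).
    """
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None  # utterance judged to lie outside the database
    return best


# Hypothetical log likelihood scores for one test utterance.
scores = {"spk_a": -41.2, "spk_b": -38.7, "spk_c": -55.0}

closed_set_result = identify(scores)                 # best match is always returned
open_set_result = identify(scores, threshold=-30.0)  # best match may be rejected
print(closed_set_result, open_set_result)
```

In the closed set case the function always returns one of the enrolled speakers, whereas in the open set case the threshold allows the system to report that no enrolled speaker matches.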
Speaker identification can also be subdivided into text-dependent and text-independent identification. A text-dependent system uses the same test phrase every time, whereas a text-independent system places no restriction on the test phrase: the speaker can say any word and the system can still identify the person. Text-independent systems must therefore use features that depend only on the speaker and are independent of the uttered word. The goal of this work is to improve the speaker recognition task in a noisy environment. Speaker recognition has two phases. The first is referred to as the enrolment or training phase, and the second as the testing phase. In the training phase, each registered speaker provides speech utterances so that the system can build or train a reference model for that speaker. In speaker verification systems, a speaker threshold is additionally computed from the training samples. During the testing (operational) phase, the input speech is matched with the stored reference model(s) and a recognition decision is made. This technique makes it possible to use the speaker's voice to verify their identity.

II. GAUSSIAN MIXTURE MODELING


Modeling techniques are of two types: template modeling and statistical modeling. In this paper we use a parametric statistical model, the Gaussian Mixture Model (GMM), for the identification of speakers.
GMM is a popular parametric model in speaker recognition systems. When feature vectors are displayed in a d-dimensional feature space after clustering, they resemble a Gaussian distribution. Each cluster can be represented by a probability distribution, and the features by a probability density function. The use of the Gaussian mixture density for speaker recognition is motivated by two facts.

Volume 7 Issue 11 2017 40 http://ijamtes.org/



They are:
First, the individual Gaussian components can be interpreted as representing a set of acoustic classes, and these acoustic classes reflect vocal tract information.
Second, a GMM is able to represent a large class of sample distributions as a linear combination of Gaussian functions.
The figure below illustrates what a GMM is:

Fig 1: GMM model showing a feature space and corresponding Gaussian model

The Gaussian mixture density is a weighted sum of N component densities. It is parameterized by the mixture weights, mean vectors and covariance matrices, which are collectively denoted

$$\lambda = \{p_i, \mu_i, \Sigma_i\}, \qquad i = 1, \ldots, N.$$
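For reference, the weighted sum described above is usually written as follows. This is the standard textbook form of a Gaussian mixture density, given here as a reading aid; the symbol $b_i$ for the component densities is introduced only for this illustration, and $d$ is the dimension of the feature space.

$$p(x \mid \lambda) = \sum_{i=1}^{N} p_i \, b_i(x), \qquad \sum_{i=1}^{N} p_i = 1,$$

$$b_i(x) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right\}.$$

Here $p_i$ are the mixture weights, $\mu_i$ the mean vectors and $\Sigma_i$ the covariance matrices of the $N$ components.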

A GMM is a statistical model whose parameters are estimated by maximum likelihood (ML) or maximum a posteriori (MAP) estimation using the Expectation Maximization (EM) algorithm. EM is an iterative method, and each iteration re-estimates the parameters to improve the likelihood of the model. The following re-estimation formulas can be used:

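$$\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid x_t, \lambda)$$

$$\bar{\mu}_i = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)}$$

$$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t^2}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)} - \bar{\mu}_i^2$$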
The N-component Gaussian mixture density forming a GMM is denoted $p(x \mid \lambda)$.
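The EM re-estimation described above can be sketched in a few lines of code. The example below is a minimal illustration and not the implementation used in this work: it assumes one-dimensional features, two components, synthetic data and a small variance floor (`var_floor`), all of which are hypothetical choices made to keep the sketch short.

```python
import math
import random


def gauss(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    """log p(X | lambda) for independent samples xs."""
    return sum(
        math.log(sum(w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in xs
    )


def em_step(xs, weights, means, variances, var_floor=1e-3):
    """One EM iteration for a 1-D GMM; returns the re-estimated parameters."""
    n, T = len(weights), len(xs)
    # E-step: a posteriori probability of each component for each sample.
    post = []
    for x in xs:
        joint = [weights[i] * gauss(x, means[i], variances[i]) for i in range(n)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: re-estimate weights, means and variances from the posteriors.
    new_w, new_m, new_v = [], [], []
    for i in range(n):
        occ = sum(post[t][i] for t in range(T))
        mu = sum(post[t][i] * xs[t] for t in range(T)) / occ
        second = sum(post[t][i] * xs[t] ** 2 for t in range(T)) / occ
        new_w.append(occ / T)
        new_m.append(mu)
        new_v.append(max(second - mu * mu, var_floor))  # floor is an assumption
    return new_w, new_m, new_v


# Synthetic two-cluster data standing in for real feature values.
random.seed(0)
xs = [random.gauss(-2.0, 0.5) for _ in range(200)]
xs += [random.gauss(3.0, 1.0) for _ in range(200)]

params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # initial weights, means, variances
history = [log_likelihood(xs, *params)]
for _ in range(20):
    params = em_step(xs, *params)
    history.append(log_likelihood(xs, *params))

print("log likelihood: first %.1f, last %.1f" % (history[0], history[-1]))
```

The list `history` records the log likelihood after every iteration. Plotting it against the iteration index gives a curve that rises and then flattens as EM converges.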

Given a sequence of T independent training vectors $X = \{x_1, \ldots, x_T\}$, the log likelihood score is calculated for these vectors and the maximum likelihood is sought. The log likelihood can be computed as

$$\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(x_t \mid \lambda)$$

The speaker recognition system identifies the unknown speaker by comparing the log likelihood score of the unknown speaker with the log likelihood scores of the trained speakers.
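A minimal sketch of this scoring step is given below. It assumes diagonal covariance matrices and selects the enrolled model with the highest log likelihood for the test frames; the two speaker models and the test frames are hypothetical numbers, not data from the experiments reported here.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_likelihood(frames, gmm):
    """Sum over frames of log sum_i p_i b_i(x_t), using log-sum-exp for stability.

    gmm is a list of (weight, mean_vector, variance_vector) tuples.
    """
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def identify(frames, models):
    """Return the best-scoring speaker name and all log likelihood scores."""
    scores = {name: log_likelihood(frames, gmm) for name, gmm in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical two-component models for two enrolled speakers.
models = {
    "speaker_a": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [2.0, 2.0], [0.5, 0.5])],
    "speaker_b": [(0.5, [5.0, -1.0], [1.0, 1.0]), (0.5, [7.0, 0.0], [1.0, 2.0])],
}
test_frames = [[0.1, -0.2], [1.9, 2.1], [0.3, 0.4]]  # hypothetical feature vectors

best, scores = identify(test_frames, models)
print(best)
```

Working in the log domain avoids numerical underflow when many frames are scored, which is why the sketch sums per-frame log likelihoods rather than multiplying likelihoods.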

III. PROPOSED SYSTEM


The main objective of this paper is to develop an improved speaker recognition system that achieves a higher recognition rate. The input utterance is passed through an adaptive filter, which suppresses the noise and outputs a signal with reduced noise. The system using MFCC and GMM with an adaptive filter as a pre-processing stage is expected to perform better than the system using MFCC and VQ with an adaptive filter as a pre-processing stage.


Input speech signal -> Pre-processing stage -> Feature extraction phase -> Gaussian Mixture Model -> Decision (accept or reject) -> Identification result

Figure 2: Proposed Block diagram of Speaker Recognition Systems

Note: the pre-processing stage may be an LMS, NLMS or RLS adaptive filter.
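As an illustration of one of the candidate pre-processing filters, the sketch below implements a basic NLMS noise canceller. It assumes the classic two-input arrangement, with a primary input carrying speech plus noise and a separate reference input correlated with the noise. The configuration, filter length and step size used in this work are not stated, so `taps`, `mu`, `eps` and the synthetic signals are hypothetical.

```python
import math
import random


def nlms_cancel(primary, reference, taps=8, mu=0.1, eps=1e-6):
    """Normalized LMS noise canceller.

    The filter estimates the noise in `primary` from `reference`; the error
    signal (primary minus estimated noise) is returned as the cleaned output.
    """
    w = [0.0] * taps    # adaptive filter weights
    buf = [0.0] * taps  # most recent reference samples, newest first
    cleaned = []
    for d, r in zip(primary, reference):
        buf = [r] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))  # noise estimate
        e = d - y                                   # cleaned sample
        norm = eps + sum(b * b for b in buf)        # input power normalization
        w = [wi + (mu / norm) * e * bi for wi, bi in zip(w, buf)]
        cleaned.append(e)
    return cleaned


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic signals: a sinusoid stands in for speech, white noise for the reference.
random.seed(1)
n = 4000
clean = [math.sin(2.0 * math.pi * 0.01 * t) for t in range(n)]
ref = [random.gauss(0.0, 1.0) for _ in range(n)]
noise = [0.8 * ref[t] + (0.3 * ref[t - 1] if t > 0 else 0.0) for t in range(n)]
noisy = [c + v for c, v in zip(clean, noise)]

cleaned = nlms_cancel(noisy, ref)
half = n // 2  # compare after the filter has had time to adapt
print(mse(noisy[half:], clean[half:]), mse(cleaned[half:], clean[half:]))
```

LMS differs from this sketch only in omitting the division by the input power, and RLS replaces the gradient update with a recursive least squares update at higher computational cost.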

In the proposed method, MFCC is again used for feature extraction and the Gaussian Mixture Model is used for feature matching. Feature extraction, however, produces the unique and exact features of a speaker only in a controlled environment. In the presence of external or surrounding noise, MFCC yields undesired features for that speaker. To overcome this, the speech signal is passed through a pre-processing stage, namely one particular adaptive filter, before the utterance is applied to the feature extraction phase.
In this implementation, the existing Vector Quantization technique is replaced by the Gaussian Mixture Model for feature matching. Adopting the Gaussian Mixture Model instead of Vector Quantization improves the performance of the system with considerable accuracy, since the Gaussian Mixture Model is a stochastic model whereas Vector Quantization is a template model. The performance measure used in this system is the log likelihood score obtained from the Gaussian Mixture Model.

Training signal -> Adaptive filters -> MFCC -> DB of MFCC
Test signal -> Adaptive filters -> MFCC -> Likelihood scores -> Result

Figure 3: Proposed Framework for speaker recognition

In simple words, the speech signal for training is passed through one of the adaptive filters (LMS, NLMS or RLS), then the Mel Frequency Cepstral Coefficients (MFCC) of each signal are calculated and stored in the database. We call this the training stage. After that, the unknown signal, a pre-captured speech signal from the database sampled at 8000 Hz, is passed through the same LMS, NLMS or RLS adaptive filter, and its MFCCs are calculated. Finally, the log likelihood score of the unknown speaker is calculated, by means of which the unknown speaker can be identified.
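The "Mel" in MFCC refers to a perceptual frequency scale on which the analysis filters are equally spaced. The sketch below shows how filter centre frequencies can be placed for the 8000 Hz sampling rate mentioned above. The conversion formula with constants 2595 and 700 is a widely used convention rather than one stated in this paper, and the number of filters (20) is a hypothetical choice.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel (common 2595 / 700 convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, sample_rate):
    """Centre frequencies in Hz of n_filters filters equally spaced in mel
    between 0 Hz and the Nyquist frequency (sample_rate / 2)."""
    top = hz_to_mel(sample_rate / 2.0)
    step = top / (n_filters + 1)
    return [mel_to_hz(step * k) for k in range(1, n_filters + 1)]


centers = mel_center_frequencies(n_filters=20, sample_rate=8000)
print([round(c) for c in centers])
```

The centre frequencies are close together at low frequencies and spread out towards the Nyquist frequency. A full MFCC computation would additionally frame and window the signal, take the magnitude spectrum, apply triangular filters at these centres, take logarithms and apply a discrete cosine transform.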


MFCC provides efficient features in a clean environment, so we do not modify the architecture of MFCC. Instead, we process the input speech signal before it goes to the feature extraction (MFCC) phase. The proposed pre-processing technique is an adaptive filter, which reduces or cancels the noise in a noisy signal. The input speech signal is applied directly to the adaptive filter, and the cleaned (filtered) signal is passed to the feature extraction phase, so that MFCC gives improved features of the unknown person even in the presence of surrounding noise.
The features of the unknown test speaker are then compared with the features of the reference model (trained database) using the feature matching algorithm. The log likelihood score of each speaker can also be derived. Each time, the likelihood score is computed for the two speakers. The likelihood scores are thus used to identify the true speaker.

IV. RESULTS
We implemented three different speaker recognition systems: a system using MFCC and VQ without an adaptive filter, a system using MFCC and VQ with an adaptive filter, and a system using MFCC and GMM with an adaptive filter. The experiments were conducted in a noisy environment using the single word "hello", with 50 speakers in the trained database. Experiments with the adaptive filter were conducted to reduce the noise in the speech signal, and the filtered signal was compared with the signal obtained without the adaptive filter in the same noisy environment.

Figure 4: Speech signal without using Adaptive Filter

Figure 5: Speech signal using Adaptive Filter

The speaker recognition system consists of two main modules: a feature extraction module and a modeling module. Feature extraction is performed with MFCC in both the training and testing phases. The matching process for recognizing the person is carried out with one of two techniques, Vector Quantization (VQ) or the Gaussian Mixture Model (GMM).
The existing speaker recognition system was first tested with an unknown speaker in the uncontrolled environment. No recognition rate is achieved in this environment because there is no adaptive filter. The results for three repetitions of the same word by the same person are shown in Table 1.

Table 1: Output of a Speaker Recognition System in Noisy Environment Using VQ Without Adaptive Filter

Speaker utterance of the same word    Euclidean Distance    Recognition Rate (%)    Result
1st time                              1.09                  -                       Speaker not recognized
2nd time                              1.11                  -                       Speaker not recognized
3rd time                              1.21                  -                       Speaker not recognized


Next, the adaptive filter cleans the signal, and the cleaned signal is passed through the feature extraction phase, which gives the Mel Frequency Cepstral Coefficients (MFCC). A code book is then generated for each VQ model, and finally the Euclidean distance is computed for each speaker against the corresponding VQ code book.
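The VQ matching step can be sketched as follows. The example assumes that a code book has already been trained for each speaker and scores a test utterance by the average Euclidean distance from each feature vector to its nearest codeword; the code books and test vectors are hypothetical two-dimensional values used only for illustration.

```python
import math


def distortion(frames, codebook):
    """Average Euclidean distance from each frame to its nearest codeword."""
    return sum(min(math.dist(x, c) for c in codebook) for x in frames) / len(frames)


def identify(frames, codebooks):
    """Return the speaker whose code book gives the smallest distortion,
    together with the distortion for every speaker."""
    distances = {name: distortion(frames, cb) for name, cb in codebooks.items()}
    return min(distances, key=distances.get), distances


# Hypothetical code books with two codewords per speaker.
codebooks = {
    "speaker_a": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_b": [[3.0, 3.0], [4.0, 4.0]],
}
test_frames = [[0.1, 0.0], [1.0, 1.1]]  # hypothetical feature vectors

best, distances = identify(test_frames, codebooks)
print(best, distances)
```

A smaller distance indicates a closer match, which is the opposite convention to the log likelihood score used with GMM, where a larger value indicates a closer match.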
A recognition rate of 76.6% was achieved in the noisy environment using the adaptive filter. With this filter, the speaker recognition system shows an improvement in Euclidean distance and performs better than the previous system. The improvement is shown in Table 2; Figures 6 and 7 show the DET curve and the plot of FAR and FRR.

Table 2: Output of Speaker Recognition System in Noisy Environment Using VQ With Adaptive Filter

Speaker utterance of the same word    Euclidean Distance    Recognition Rate (%)
1st time                              0.79                  70.0
2nd time                              0.77                  76.6
3rd time                              0.80                  66.6

Figure 6: Plot for DET Curve

Figure 7: Plot for FAR and FRR
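The false acceptance rate (FAR) and false rejection rate (FRR) plotted in such figures are obtained by sweeping a decision threshold over the matching scores. A minimal sketch is given below, assuming that higher scores indicate a better match; the genuine and impostor scores are hypothetical and are not the data behind Figures 6 and 7.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) at a given threshold.

    A trial is accepted when its score is at or above the threshold.
    FAR is the fraction of impostor trials accepted and FRR is the
    fraction of genuine trials rejected.
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


# Hypothetical normalized matching scores.
genuine = [0.91, 0.85, 0.78, 0.95, 0.60]
impostor = [0.30, 0.72, 0.41, 0.25, 0.80]

for threshold in (0.5, 0.75, 0.9):
    far, frr = far_frr(genuine, impostor, threshold)
    print("threshold %.2f: FAR %.2f, FRR %.2f" % (threshold, far, frr))
```

Raising the threshold lowers FAR and raises FRR. A DET curve plots one rate against the other over all thresholds, and the point where the two rates are equal is commonly reported as the equal error rate.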

The speaker recognition system was then tested in the noisy environment with GMM, with the speech signal filtered by the adaptive filter. A recognition rate of 97.3% was achieved in the noisy environment using GMM with the adaptive filter, a significant improvement over VQ with the adaptive filter. The results for three repetitions of the same utterance by a speaker are shown in Table 3, and Figure 8 shows the plot of log likelihood versus iteration for one of the speakers in the trained database.


Table 3: Output of Speaker Recognition System in Noisy Environment Using GMM With Adaptive Filter

Speaker utterance of the same word    Log Likelihood Ratio    Recognition Rate (%)
1st time                              4.21                    96.1
2nd time                              4.16                    95.0
3rd time                              4.26                    97.3

[Plot: observed data log-likelihood (vertical axis, -360 to -160) versus iteration (horizontal axis, 0 to 200)]

Figure 8: Plot of iteration vs log likelihood for one of the speakers in the trained database.

V. PERFORMANCE EVALUATION
The performance of each system was evaluated in terms of recognition rate. The recognition rate improves from one system to the next, and the speaker recognition system using GMM with the adaptive filter gives a better recognition rate than the systems using VQ with and without the adaptive filter.
Existing systems improve the identification and verification of persons in a controlled environment. The proposed systems were tested under noisy conditions, namely an uncontrolled staff room in the college.
Table 4 compares the performance of the different systems in terms of recognition rate.

Table 4: Performance Comparison of Different Speaker Recognition Systems in Noisy Environment

Speaker Recognition System    Recognition Rate (%)    Result
VQ without adaptive filter    -                       Speaker not recognized
VQ with adaptive filter       76.6                    Speaker identified
GMM with adaptive filter      97.3                    Speaker identified

Table 4 thus compares the performance of all three speaker recognition systems in noisy conditions. The system using GMM with the adaptive filter gives the best performance.


VI. CONCLUSION
In this paper, speaker recognition systems were developed to perform authentication and authorization for security applications. Three speaker recognition systems were developed, namely systems using MFCC and VQ with and without an adaptive filter and a system using MFCC and GMM with an adaptive filter, and their performance was compared in noisy conditions. The system using GMM with the adaptive filter performs better than the VQ systems with and without the adaptive filter, achieving a recognition rate of 97.3%. Based on these results, we conclude that the proposed system gives accurate results for the identification or verification of speakers in an uncontrolled environment such as an office room in a company.

