Senior Design Project Report

Text Independent Speaker Verification System

Khushboo Modi
khushboo@seas.upenn.edu

Project Advisor:
Professor Lawrence Saul
lsaul@cis.upenn.edu
Abstract:
User identification and verification are very important aspects of any security system today, as fraudsters find ever more ways to break into even the most complex security measures. Biometric recognition systems are in demand because they rely on human features that are unique to a person and cannot easily be forged, such as the face, fingerprints, and voice. Like a fingerprint, a person's voice has particular unique features, and using this voiceprint, the person's identity can be verified.
The goal of my project is to design and implement a text-independent speaker verification system. This means that regardless of what the user speaks, the system should be able to verify whether he is the person he claims to be. Such a system would be useful in banks, at ATMs, and in telephone-based applications, where there is no way to identify a user by fingerprint or face.
Related Work:
Speech recognition is not a new subject; however, it is a growing industry, and new methods of tapping this human quality are continuously being developed. A lot of research has been done on text-independent speaker verification systems using Gaussian mixture models, and my project is a simple implementation of that approach. I will be using published papers on this topic to assist me in my goal.

Technical Approach:
The objective of this project is to implement a single-speaker verification system.
Statistically speaking, it is a hypothesis test between two hypotheses:
    p(Y|H0) / p(Y|H1) ≥ θ : accept H0
    p(Y|H0) / p(Y|H1) < θ : accept H1

where θ is the decision threshold, and

    H0: Y is from the hypothesized speaker S
    H1: Y is not from the hypothesized speaker S

[Figure taken from "A Tutorial on Text-Independent Speaker Verification".]


The output of front-end processing is a sequence of feature vectors X = {x_1, x_2, ..., x_T}, where x_t is a feature vector indexed at discrete time t ∈ [1, 2, ..., T]. These features are then used to compute the likelihood ratio between H0 and H1. The log of the likelihood ratio above would then be:

    Λ(X) = log p(X|H0) - log p(X|H1)
We need to generate two models for this test to work: the speaker model and the background model.
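For concreteness, here is a minimal sketch of this decision rule in Python, assuming the two models are Gaussian mixtures fitted with scikit-learn (an illustration only; the actual system uses HTK and Perl scripts). GaussianMixture.score() returns the average per-frame log-likelihood:

    from sklearn.mixture import GaussianMixture

    # Hypothetical models, fitted on (n_frames, n_features) MFCC arrays:
    # speaker_gmm = GaussianMixture(n_components=600, covariance_type='diag').fit(spk_X)
    # background_gmm = GaussianMixture(n_components=600, covariance_type='diag').fit(bg_X)

    def log_likelihood_ratio(X, speaker_gmm, background_gmm):
        """Lambda(X) = log p(X|H0) - log p(X|H1), averaged over the T frames."""
        return speaker_gmm.score(X) - background_gmm.score(X)

    def decide(X, speaker_gmm, background_gmm, theta=0.0):
        # Accept H0 (the claimed speaker) when the ratio clears the threshold.
        return "H0" if log_likelihood_ratio(X, speaker_gmm, background_gmm) >= theta else "H1"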
I have planned three stages for implementing this system: the Training Phase, the Tuning Phase, and the Testing Phase.
- Training Phase: generate the background model.
- Tuning Phase: generate the individual speaker models.
- Testing Phase: test the system using new wave files from test speakers.
I'm using a Gaussian mixture model for the likelihood function, and so the mixture density for the likelihood function, for a D-dimensional feature vector x, is:

    p(x | λ) = Σ_{i=1}^{M} w_i p_i(x)

where the mixture weights satisfy Σ_i w_i = 1 and each p_i(x) is a D-variate Gaussian density with mean vector μ_i and covariance matrix Σ_i:

    p_i(x) = (1 / ((2π)^(D/2) |Σ_i|^(1/2))) exp(-(1/2) (x - μ_i)' Σ_i^{-1} (x - μ_i))

The GMM parameters (mixture weights, means, and variances) are calculated using the Expectation-Maximization (EM) algorithm. It is an iterative process that monotonically increases the likelihood of the estimated model for the observed feature vectors, such that for iterations k and k+1:

    p(X | λ^(k+1)) ≥ p(X | λ^(k))

On each iteration, the weight, mean, and variance parameters are re-estimated as:

    w_i = (1/T) Σ_{t=1}^{T} Pr(i | x_t)
    μ_i = (Σ_t Pr(i | x_t) x_t) / (Σ_t Pr(i | x_t))
    σ_i^2 = (Σ_t Pr(i | x_t) x_t^2) / (Σ_t Pr(i | x_t)) - μ_i^2

where Pr(i | x_t) = w_i p_i(x_t) / Σ_{j=1}^{M} w_j p_j(x_t) is the posterior probability of component i, and x_t^2 and σ_i^2 are taken element-wise (diagonal covariances).
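As an illustration of these updates, here is a compact numpy sketch of one possible EM loop for a diagonal-covariance GMM; it is a re-implementation for exposition, not the HTK training code used in the project:

    import numpy as np

    def em_gmm_diag(X, M, n_iter=20, seed=0):
        """Sketch of EM for a diagonal-covariance GMM (illustrative, not HTK).

        X: (T, D) array of feature vectors; M: number of Gaussians.
        Returns the mixture weights w, means mu, and variances var.
        """
        rng = np.random.default_rng(seed)
        T, D = X.shape
        mu = X[rng.choice(T, size=M, replace=False)]    # init means on samples
        var = np.tile(X.var(axis=0), (M, 1))            # init on global variance
        w = np.full(M, 1.0 / M)
        prev_ll = -np.inf
        for _ in range(n_iter):
            # E-step: log of w_i * p_i(x_t), computed stably in the log domain.
            log_p = (np.log(w)[:, None]
                     - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)[:, None]
                     - 0.5 * np.sum((X[None, :, :] - mu[:, None, :]) ** 2
                                    / var[:, None, :], axis=2))
            log_px = np.logaddexp.reduce(log_p, axis=0)  # log p(x_t | lambda)
            resp = np.exp(log_p - log_px)                # Pr(i | x_t), shape (M, T)
            # M-step: the weight, mean, and variance re-estimates shown above.
            n_i = resp.sum(axis=1)
            w = n_i / T
            mu = (resp @ X) / n_i[:, None]
            var = (resp @ X ** 2) / n_i[:, None] - mu ** 2
            var = np.maximum(var, 1e-6)                  # variance floor
            # EM guarantees p(X | lambda^(k+1)) >= p(X | lambda^(k)),
            # up to the small perturbation introduced by the variance floor.
            ll = log_px.sum()
            assert ll >= prev_ll - 1e-6
            prev_ll = ll
        return w, mu, var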

Data Collection:
To implement this system, test data is required. I have recorded clips from 25 speakers. Each speaker's data set consists of 15 speech clips of varying lengths. I split this data set into three categories: Training, Tuning, and Testing. These correspond to the three phases of the project, and the data is required in each stage. So I used 9 of the 15 clips for training, 3 more for tuning, and the rest for testing the application.
To record the clips, I used a microphone and recording software called GoldWave. One factor that affected the results was the distance between the microphone and the speaker's mouth: too close or too far, and the results were skewed. I realized this at a later stage, and so had to ask a few speakers to record more test clips.
Training Phase:
In the Training Phase, the background model is created. The background model is basically a large pool of all the sample data: just one large Gaussian mixture model. I converted the wave files into a different format so that they can be used for this analysis.
The wave file is a continuous signal, which must be broken down into discrete parameter vectors. Each vector covers about 10 ms, because we assume that over this duration the signal is stationary. This is not strictly true, but it is a reasonable approximation to make.
The format I've used is MFCC, which stands for Mel Frequency Cepstral Coefficients. The conversion can be done as follows (Logan, "Mel Frequency Cepstral Coefficients for Music Modeling"; a code sketch follows below):
1. Divide the signal into frames.
2. For each frame, obtain the amplitude spectrum.
3. Take the logarithm.
4. Convert to a Mel (perceptually-based) spectrum.
5. Take the discrete cosine transform (DCT).
However, instead of doing this manually, I used the HTK Toolkit to automate the process. Once the files are in the correct format, it is important to discard all the silence and keep only the speech samples. I then generate mfcc.speech files.
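For reference, here is a rough numpy/scipy sketch of the five steps. The frame sizes, filter count, and filterbank-then-log ordering are common conventions assumed here, not necessarily HTK's exact configuration:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_sketch(signal, sr, n_fft=512, n_filters=26, n_ceps=13):
        """Rough sketch of the five MFCC steps above (HTK was used in practice)."""
        # 1. Divide the signal into overlapping 25 ms frames, stepped every 10 ms.
        flen, fstep = int(0.025 * sr), int(0.010 * sr)
        n_frames = 1 + max(0, (len(signal) - flen) // fstep)
        idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
        frames = signal[idx] * np.hamming(flen)
        # 2. Amplitude spectrum of each frame.
        spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
        # 3./4. Mel-spaced triangular filterbank, then the logarithm.
        # (Implementations vary on whether the log precedes the mel warping;
        # this is the common filterbank-then-log ordering.)
        hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        mel_pts = np.linspace(0.0, hz2mel(sr / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(n_filters):
            lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
            fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
            fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
        log_mel = np.log(spec ** 2 @ fbank.T + 1e-10)
        # 5. DCT; keep the first n_ceps coefficients as the MFCCs.
        return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]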

One of the feature vectors extracted is energy, which corresponds to the loudness or softness of the speaker's voice. To avoid bad results due to this, I removed the energy vector from the speech files.
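A minimal sketch of that step, assuming the per-frame features are loaded into a numpy array with the energy term as the last column (HTK's _E qualifier appends log energy to each vector; the index would change for a different layout):

    import numpy as np

    def drop_energy(features):
        """Remove the energy coefficient from (T, D) per-frame features.

        Assumes HTK's _E qualifier, which appends log energy as the last
        element of each vector; adjust the index for other layouts.
        """
        return np.asarray(features)[:, :-1]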
Now the speech files can be combined to generate the background model file. This model must then be trained. We must decide on the number of Gaussians to work with. To make that decision, one looks at the log-likelihood values at the end of the training process and compares them.
For example:
Number of samples    Log-likelihood in loop 4    Number of Gaussians
25973                -607367.312500              250
25973                -603886.625000              300
25973                -600375.312500              350
25973                -597373.687500              400

The optimal number of Gaussians is the point just before the log-likelihood value drops for the first time, because up to that point the likelihood is still increasing.
During my earlier training runs, the optimal number of Gaussians was 300, which gave the best log-likelihood value at the time. However, as the number of samples increased, I continued testing with higher numbers of Gaussians and finally achieved the best results at 600 Gaussians. As the system is scaled up for use by a large number of speakers, this number will increase substantially. I keep the number of Gaussians fixed across the background model and the speaker models.
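The comparison can be sketched with scikit-learn standing in for the HTK training tools (illustrative only; bg_X is an assumed variable holding the pooled background features):

    from sklearn.mixture import GaussianMixture

    # bg_X: hypothetical (n_samples, n_features) array of pooled background MFCCs.
    for m in (250, 300, 350, 400):
        gmm = GaussianMixture(n_components=m, covariance_type='diag',
                              max_iter=4, random_state=0).fit(bg_X)
        # score() is the mean per-sample log-likelihood; scale by the sample
        # count to compare against totals like those in the table above.
        print(m, gmm.score(bg_X) * len(bg_X))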
Tuning Phase:
In the Tuning Phase, the individual speaker models are generated. The process for generating these models is very similar to that for generating the background model, with a few minor changes.
I take the mfcc.speech files (the MFCC files with the energy feature removed) and use them to generate a model for that speaker. I keep the number of Gaussians the same as in the background model: in this case, 600 Gaussians.
The purpose of this system is to test whether a given voiceprint belongs to the person the speaker claims to be. To achieve this, I needed to devise a method of calculating a threshold value that would make it easy to distinguish the speaker from an imposter. An imposter is a user who claims to be somebody else in order to cheat the system.
To do this, I used three test files from each user. I compared each file of a speaker to the speaker model and, based on the matching of the features, calculated the likelihood value of that test recording belonging to the speaker. For each speaker, I compared not only the speaker's own test files but also files from the other speakers in the background model. This provided a range of values useful for calculating a threshold.
Below is a sample of the data obtained from running the above test.
Dat file (rows are test speech files, columns are speaker models):

Speech file    dip         divye       jiten       khush       madhu
Dip13           1.080225   -0.376633   -0.286860   -0.437426   -0.294811
Dip14           0.764772   -0.397726   -0.269392   -0.447361   -0.301577
Dip15           0.673584   -0.447390   -0.342350   -0.469791   -0.422424
Divye13        -0.576429    0.964666   -0.398272   -0.618855   -0.204668
Divye14        -0.478092    0.995680   -0.295307   -0.503201   -0.371454
Divye15        -0.508654    1.180914   -0.337530   -0.587896   -0.246685
jiten13        -0.507383   -0.323767    1.373800   -0.480520   -0.425753
jiten14        -0.276649   -0.396345    0.844433   -0.477240   -0.399639
jiten15        -0.407593   -0.397972    1.095037   -0.497236   -0.397400
khush13        -0.326227   -0.366068   -0.286724    0.927004   -0.230051
khush14        -0.265522   -0.360636   -0.359411    1.067671   -0.389227
khush15        -0.475126   -0.412575   -0.435125    1.201353   -0.447935
madhu13        -0.323267   -0.377714   -0.310669   -0.461508    1.254961
madhu14        -0.241459   -0.454681   -0.335117   -0.370653    1.299156
madhu15        -0.275469   -0.405884   -0.288155   -0.447096    0.842831

Each speech file belongs to some speaker, and the diagonal values are the results of comparing a speaker's test file to that same speaker's model. The most important point I noticed in the tuning-phase results was that the likelihood value of a test file is positive when the file actually belongs to the claimed speaker and negative when the file belongs to an imposter.
I decided that the threshold had to be some function based on the average of the likelihood values of the speaker's files, while also taking the imposter values into account.
The threshold function I used is:

    θ = μ + xσ

where μ is the mean and σ is the standard deviation of all the likelihood values, and x is an integer whose value can be varied. I varied x, starting with x = 2. Using this threshold function, I computed the thresholds of all the speakers in my background model.
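A minimal sketch of this computation (the function name is mine; the report's Perl scripts do the same arithmetic). With x = 4, it reproduces the khush threshold shown in the Testing Phase summary below: -0.424741 + 4 × 0.148498 ≈ 0.169252.

    import numpy as np

    def speaker_threshold(likelihoods, x=2):
        """Threshold = mu + x * sigma over all tuning-phase likelihood values
        for one speaker model (the speaker's own files plus imposter files)."""
        mu, sigma = np.mean(likelihoods), np.std(likelihoods)
        return mu + x * sigma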

Testing Phase:
Once we have all the threshold values and speaker models, it is time to test the remaining files. This will help us determine whether the analysis done above is accurate enough. Using the threshold values calculated in the Tuning Phase, I tested the remaining speaker files. To ensure that the system is accurate while verifying users, we need to test the threshold values in two ways: for false alarms and for false rejections. If the likelihood value of an imposter file is higher than the threshold for the speaker being tested, then the system will validate the imposter as the speaker; this is a false alarm. On the other hand, sometimes a speaker's own file may not have a likelihood value higher than the threshold, and so the speaker is falsely identified as an imposter; this is a false rejection. An optimal threshold value would minimize both of these counts, keeping the error rate low. I maintain a summary file for each user, generated when the testing scripts are run, recording the likelihood values along with the mean, variance, and standard deviation of the results. The summary file also tracks the number of false alarms and false rejections. For example:

mean = -0.424741
var = 0.022052
stdev = 0.148498
threshold for khush is 0.169252
number of false alarms with threshold 0.169252 are 1
number of false rejections with threshold 0.169252 are 0
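A sketch of how these two counts could be tallied for one speaker model (the names are hypothetical; in the project the summary files are produced by Perl scripts):

    def error_counts(own_scores, imposter_scores, threshold):
        """Tally errors for one speaker model at a given threshold.

        A false alarm is an imposter file scoring above the threshold; a
        false rejection is the speaker's own file failing to clear it.
        """
        false_alarms = sum(1 for s in imposter_scores if s > threshold)
        false_rejections = sum(1 for s in own_scores if s <= threshold)
        return false_alarms, false_rejections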
As mentioned above, I started by keeping the value of x = 2. This threshold gave a very high error rate, allowing many imposters to be validated as other speakers; however, there were very few false rejections. So I experimented by raising the value of x to 3 and then finally to 4. Currently, I have fixed the value of x at 4; however, with an increase in the number of speakers, this would vary.

The User Interface:

In order to make this system user-friendly, I have developed a GUI application that is simple and hides the layer of complexity from the user. There are two parts: one for training a new speaker and one for testing a returning user. It is important that these features run in a very short time while the application is being demonstrated. I have incorporated a recorder in the GUI, so that no separate recording software is required.
In theory, a new speaker would be added to the system offline. The background model need not contain all the users that are added to the system, but if there were a huge discrepancy between the actual number of users and the user data in the background sample pool, the results would be skewed. For the purpose of demonstration, however, the background model is not modified while adding a new user. The entire procedure is automated using Perl scripts. Once the user records a voice clip and selects either to be added to the system or to be identified as a particular speaker, all the processing is run and the result is shown on the screen.
A new speaker is added by the following procedure:
- Speaker records a voice clip.
- The voice clip is converted into a speech file of the correct format.
- Using this data, the speaker model is generated.
- Using the same speech file and the tuning files of the existing users, the threshold value for the speaker is generated.

A speaker's identity is verified by the following procedure (sketched in code after the list):
- User records a voice clip.
- The voice clip is converted into a speech file.
- User selects his username from a drop-down menu.
- Based on the user's selection, the likelihood value of the speech file is compared with the threshold value of the selected identity.
- If the likelihood value is higher than the threshold value, the user is identified as the speaker.
- If the likelihood value is below the threshold value, the user is identified as an imposter.
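Putting the pieces together, here is a hypothetical end-to-end sketch of that decision. All names are illustrative stand-ins (per-speaker GMMs, the tuned μ + xσ thresholds, and the background GMM), and it assumes, as the sign pattern in the tuning table suggests, that likelihood values are computed relative to the background model:

    def verify(clip_features, claimed_user, models, thresholds, background):
        """Score the clip against the claimed speaker's model, normalized by
        the background model, and compare with that speaker's threshold."""
        score = (models[claimed_user].score(clip_features)
                 - background.score(clip_features))
        return "speaker" if score > thresholds[claimed_user] else "imposter"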

Conclusion:
The aim of this project was to implement an application that verifies a speaker's identity using the speaker's voiceprint: the characteristics that distinguish the speaker from other speakers. I wanted to implement a simple application using algorithms already in existence.
Data collection was a very important aspect of this project. It was a challenge to figure out how many speakers I should use; I initially had about 10, but then increased that number to 25. It was also important to figure out what kind of data to work with. Should I have multiple files, or just one with a lot of speech? How many files for the testing and tuning phases? I decided the details of data collection after a lot of trial and error.
One of the challenges was understanding how the Hidden Markov Model Toolkit (HTK) worked. It was important to extract the features I needed for my experiments and to be able to manipulate the data in the right way. One of the features extracted from the voice recording is energy. This energy corresponds to the loudness of the speaker's voice and would skew the results if taken into account, so I had to figure out how to remove the energy vector from the feature vectors that HTK generated.
While the application gives fairly accurate results, it works well only under certain environmental conditions. I recorded most of the data in a room with very little disturbance in the background. This is meant to be a single-speaker verification system, so no other speakers should be heard in the background. Also, the microphone used for all the test speakers is the same, placed at a fixed distance from the speaker's mouth while recording the clip. Using a different microphone, or adjusting the distance between the microphone and the speaker's mouth, causes the results to be skewed. So the application works under this scenario, but not necessarily under other circumstances. I would have liked to make it robust to these variations, but I was not successful.
Overall, I enjoyed working on this project, since it was a topic that interested me. A blessing in disguise was my lack of information and awareness in this field, as it forced me to read and learn a lot on my own. I also learnt how to work on a large project with very little structure. It was important to set deadlines for myself and to keep working towards the end goal. There were times when everything went wrong, and it was important not to give up. I am glad that I was able to achieve the goals I set for myself.

References:
Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." http://ciir.cs.umass.edu/music2000/papers/logan_abs.pdf
Schmidt, Regina. "Identity Confirmed, Access Permitted: The Basics on Voice Authentication, Security and Consumer Use of an Emerging Biometric." BiometriTech, 3 Sep. 2003. http://www.biometritech.com/features/090303nu.htm
Bimbot, Frédéric, et al. "A Tutorial on Text-Independent Speaker Verification." EURASIP Journal on Applied Signal Processing, 2004. http://www.ll.mit.edu/IST/pubs/040401_Bimbot.pdf
Reynolds, Douglas A., Thomas F. Quatieri, and Robert B. Dunn. "Speaker Verification Using Adapted Gaussian Mixture Models." Digital Signal Processing, 2000. http://www.ll.mit.edu/IST/pubs/000101_Reynolds.pdf
"The HTK Book." http://anacardier.eecs.tulane.edu/documentation/htkbook/
