
Speaker Recognition System

Submitted by:
Shehroz Jajja 2019-EE-081
Ahmad Daniyal 2019-EE-084
Noman Khalid 2019-EE-166
Ammar Akhtar 2019-EE-167

Group No. 03

Supervised by: Dr. Kashif Javed

Department of Electrical Engineering


University of Engineering and Technology Lahore
Contents

List of Figures

1 Introduction
1.1 Speaker Recognition System
1.2 Mel Frequency Cepstral Coefficients (MFCC)
1.3 Support Vector Machines

2 Methodology
2.1 Audio Segmentation
2.2 Extracting Features from Audio Samples
2.3 Train and Test Data
2.4 Classification

List of Figures

1.1 Unsplash Photo
1.2 MFCC Flow Chart
1.3 Possible Hyperplanes
1.4 Support Vectors

Chapter 1

Introduction

1.1 Speaker Recognition System


In today's era of data technology, audio information plays an important role in the growing volume of data, creating a need for methods that demystify this content and draw meaningful insights from it. Voice recognition is one such methodology: it aims to recognize the person speaking the words rather than the words themselves. As technology has evolved, voice recognition has become increasingly embedded in our everyday lives through voice-driven applications in everyday digital appliances. Voice recognition is mainly classified into two tasks: speaker identification and speaker verification. Speaker identification determines which registered speaker, from among a set of known speakers, produced a given utterance. Speaker verification accepts or rejects the identity claim of a speaker.

Figure 1.1: Unsplash Photo

Speech is a universal form of communication. Speaker Recognition (SR) is the process of identifying the speaker according to the vocal features of the given speech [5]. This is different from speech recognition, where the identification process is confined to the content rather than the speaker.


1.2 Mel Frequency Cepstral Coefficients (MFCC)


Machine learning (ML) extracts features from raw data and creates a dense representation of the content [4]. Done correctly, this forces us to learn the core information without the noise when making inferences. One popular audio feature extraction method is Mel-frequency cepstral coefficients (MFCC), which yields 39 features. The feature count is small enough to force us to learn the essential information of the audio: 12 of the parameters relate to the amplitude of frequencies, which provides enough frequency channels to analyze the audio. MFCC feature extraction is a leading approach for speech feature extraction, and current research aims to identify further performance enhancements. One recent MFCC implementation is the Delta-Delta MFCC, which improves speaker verification. The MFCC feature extraction technique basically consists of windowing the signal, applying the DFT, taking the log of the magnitude, warping the frequencies onto a Mel scale, and then applying the inverse DCT [3]. It is one of the most important methods for extracting features from an audio signal and is used widely when working with audio signals. Below is the flow of extracting the MFCC features.

Figure 1.2: MFCC Flow Chart

The main objectives are:

• Remove vocal fold excitation (F0) — the pitch information.

• Make the extracted features independent.

• Adjust to how humans perceive loudness and frequency of sound.

• Capture the dynamics of phones (the context).

Additional help was taken from these websites: [1] and [2].
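The following is a minimal sketch of the pipeline above, using only NumPy and SciPy. The frame size, hop length, and filter counts are illustrative assumptions, not the settings actually used in this project.

import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(y, sr, n_fft=512, hop=160, n_mels=26, n_ceps=12):
    # 1. Windowing: split the signal into overlapping Hamming-windowed frames.
    window = np.hamming(n_fft)
    frames = np.array([y[i:i + n_fft] * window
                       for i in range(0, len(y) - n_fft, hop)])
    # 2. DFT: magnitude spectrum of each frame.
    mag = np.abs(np.fft.rfft(frames, n_fft))
    # 3. Mel warping: triangular filters spaced evenly on the Mel scale.
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0),
                                    n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 4. Log of the Mel filterbank energies.
    log_e = np.log(mag @ fbank.T + 1e-10)
    # 5. Inverse transform (DCT-II) decorrelates; keep 12 coefficients.
    return dct(log_e, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]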



1.3 Support Vector Machines


The support vector machine is another simple algorithm that every machine learning practitioner should have in their arsenal. It is highly preferred by many because it produces significant accuracy with less computational power. The Support Vector Machine (SVM) can be used for both regression and classification tasks, but it is most widely used for classification.
The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points.

Figure 1.3: Possible Hyperplanes

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find the plane with the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. The dimension of the hyperplane depends on the number of features: if there are 2 input features, the hyperplane is just a line; if there are 3, it becomes a two-dimensional plane. It becomes difficult to visualize when the number of features exceeds 3.

Figure 1.4: Support Vectors
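To make the margin idea concrete, here is a toy sketch on synthetic 2-D data; the two point clouds are invented purely for illustration. For a linear kernel, the fitted classifier exposes the hyperplane w.x + b = 0 and the support vectors that define the margin.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
class_a = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# For a linear kernel, coef_ and intercept_ give the separating hyperplane.
print('w =', clf.coef_[0], 'b =', clf.intercept_[0])
print('number of support vectors:', len(clf.support_vectors_))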


Chapter 2

Methodology

2.1 Audio Segmentation


First of all, each member of the group recorded data, with one more recording to be done by each member, so in total we have audio data for 8 speakers. Then, using the XTrans software, we segmented the recordings into ".tdf" files. With that, our audio data is segmented; a sketch of reading the segments back is given below.
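The snippet below pairs each segment's waveform with its speaker label. The assumed column layout (tab-separated rows carrying start time, end time, and speaker) is a guess at the ".tdf" format and may need adjusting against the actual XTrans export.

import librosa

def load_segments(wav_path, tdf_path, start_col=2, end_col=3, spk_col=4):
    # Load the full recording once, at its native sample rate.
    y, sr = librosa.load(wav_path, sr=None)
    segments = []
    with open(tdf_path, encoding='utf-8') as f:
        for line in f:
            if line.startswith(';') or not line.strip():
                continue  # skip comment and blank lines
            cols = line.rstrip('\n').split('\t')
            try:
                start, end = float(cols[start_col]), float(cols[end_col])
                speaker = cols[spk_col]
            except (ValueError, IndexError):
                continue  # skip the header row and malformed lines
            # Slice the waveform for this segment, keyed by speaker label.
            segments.append((speaker, y[int(start * sr):int(end * sr)]))
    return segments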

2.2 Extracting Features from Audio Samples


The idea is to correctly identify the speaker by using the Gaussian mixture model. The first step in dealing with an audio sample is to extract features from it, in order to identify components of the audio signal. We use Mel frequency cepstral coefficients (MFCC) to extract features from the audio samples. MFCC maps the signal onto a non-linear Mel scale that mimics human hearing and produces feature vectors, each of which describes the power spectral envelope of a single frame. A sketch of this step follows.
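This sketch assumes the segments were exported as WAV files; the 13 static coefficients plus their delta and delta-delta derivatives match the 39-feature count mentioned in Section 1.2.

import numpy as np
import librosa

def segment_features(wav_path, sr=16000, n_mfcc=13):
    # Load one segment and compute its static MFCCs.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Delta and delta-delta coefficients capture the dynamics across frames.
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    # 13 + 13 + 13 = 39 features per frame, transposed to (frames, 39).
    return np.vstack([mfcc, delta, delta2]).T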

2.3 Train and Test Data


A total of 64 minutes of recordings, 8 minutes per speaker, is loaded using the librosa Python library. Then, using the train_test_split function, we divide the data into a 70 percent training set and a 30 percent test set. This data is further used for classification, as sketched below.
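A minimal sketch of this split, with random placeholder arrays standing in for the real feature matrix and speaker labels:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 8 speakers, hypothetical 39-dimensional frame features.
X = np.random.randn(8000, 39)
y = np.repeat(np.arange(8), 1000)

# 70/30 split, stratified so every speaker appears in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)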

2.4 Classification
Classification is done over the 8 speakers. Since the kernel is linear, it can easily classify the data; in our case the accuracy of the support vector machine comes out to be 95.87 percent. The SVM is imported from the sklearn Python library. A 10-second audio clip recorded with a microphone is used as a test to check the speaker ID returned by the classification.
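A sketch of this step, again with placeholder data; on the real MFCC features, this is where the reported 95.87 percent accuracy was measured.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Placeholder features and labels standing in for the real 8-speaker data.
X = np.random.randn(8000, 39)
y = np.repeat(np.arange(8), 1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

# Linear-kernel SVM, as used in the report.
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, clf.predict(X_test)))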

Bibliography

[1] https://medium.com/analytics-vidhya/speaker-identification-using-machine-learning-3080ee202920, last accessed 2022.

[2] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html, last accessed 2022.

[3] https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset, last accessed 2022.

[4] https://jonathan-hui.medium.com/speech-recognition-feature-extraction-mfcc-plp-5455f5a69dd9, last accessed 2022.

[5] https://www.sciencedirect.com/topics/engineering/speaker-recognition-system, last accessed 2022.
