
SPEAKER RECOGNITION USING MFCC TECHNIQUE

submitted in partial fulfillment of the requirements of the degree

Bachelor of Technology
in
ELECTRONICS AND COMMUNICATION ENGINEERING

submitted by

Sk. Rehana -O180789


V. Susmitha -O180796
S. Nagamani -O180420
Y. Venkatalakshmi -O180421

Under the guidance of

Mr. BALA NAGI REDDY

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

RAJIV GANDHI UNIVERSITY OF KNOWLEDGE TECHNOLOGIES-ONGOLE


ANDHRA PRADESH

2018-2024

Approval Sheet

This report entitled “Speaker Recognition Using MFCC Technique” by Sk. Rehana, V. Susmitha,
S. Nagamani, and Y. Venkatalakshmi is hereby approved for the degree of Bachelor of Technology in
Electronics and Communication Engineering at Rajiv Gandhi University of Knowledge
Technologies-Ongole Campus, Andhra Pradesh.

Examiner (s):

Supervisor (s):

Chairman:

Date: ............................
Place: ...........................

RAJIV GANDHI UNIVERSITY OF KNOWLEDGE TECHNOLOGIES
RGUKT, ONGOLE
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

CERTIFICATE
This is to certify that this major project entitled “SPEAKER RECOGNITION USING
MFCC TECHNIQUE”, submitted by Sk. Rehana (O180789), V. Susmitha (O180796), S.
Nagamani (O180420), and Y. Venkatalakshmi (O180421) in partial fulfillment of
the requirement for the award of Bachelor of Technology in Electronics and
Communication Engineering, is a bona fide work carried out by them under my supervision and
guidance.

Head of the department Project Guide


Mr. G. BALA NAGI REDDY Mr. G. BALA NAGI REDDY
Assistant Professor Assistant Professor

DECLARATION

We declare that this written submission represents our ideas in our own words, and where
others' ideas or words have been included, we have adequately cited and referenced the original
sources. We also declare that we have adhered to all principles of academic honesty and integrity
and have not misrepresented, fabricated, or falsified any idea/data/fact/source in our submission.
We understand that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been properly cited or from
which proper permission has not been taken when needed.

Name of the student Signature

Sk. Rehana -O180789


V. Susmitha -O180796
S. Nagamani -O180420
Y. Venkatalakshmi -O180421

Date:

ACKNOWLEDGEMENT

We would like to express our sincere gratitude to Mr. G. BALA NAGI REDDY, our project
guide, for his valuable suggestions and keen interest throughout the progress of our course and research.

We are grateful to Mr. G. BALA NAGI REDDY, HOD of Electronics & Communication
Engineering, for providing excellent computing facilities and a congenial atmosphere for
progressing with our project.

At the outset, we would like to thank Rajiv Gandhi University of Knowledge Technologies,
Ongole for providing all the necessary resources for the successful completion of our course work.
Last but not least, we thank our teammates and other students for their physical and moral
support.

With Sincere Regards,


Sk. Rehana
V.Susmitha
S. Nagamani
Y. Venkatalakshmi

ABSTRACT

Speaker recognition is the process of automatically recognizing who is speaking on the basis
of individual information included in speech waves. This technique makes it possible to use the
speaker’s voice to verify their identity and control access to services such as voice dialing, banking
by telephone, telephone shopping, database access services, information services, voice mail,
security control for confidential information areas, and remote access to computers.

Speech processing has emerged as one of the important application areas of digital signal
processing. The objective of automatic speaker recognition is to extract, characterize and recognize
information about the speaker's identity. This paper presents a comparison of the MFCC and
vector quantization techniques for speaker recognition. Feature vectors carrying the speaker's
identity characteristics are extracted from speech using mel-frequency cepstral coefficients, and
the vector quantization technique is then applied. Vector quantization uses a codebook to characterize
the short-time spectral coefficients of a speaker. These coefficients are used to identify an unknown
speaker from a given set of speakers. The effectiveness of these methods is examined from the
viewpoint of robustness against utterance variation, such as differences in content, temporal
variation, and changes in utterance speed.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
APPENDIX

CHAPTER 1: INTRODUCTION

1.1 Evolution of Speech Recognition Systems
1.2 Principles of Speaker/Voice Recognition
1.3 Challenges with Speech Recognition Systems
CHAPTER 2: REVIEW OF LITERATURE
2.1 Tools to Overcome Challenges Faced with Speech Recognition Systems
2.2 Voice Activity Detector
CHAPTER 3: METHODOLOGY
3.1 Speech Feature Extraction
3.2 Mel-Frequency Cepstrum Coefficients Processor
3.2.1 Frame Blocking
3.2.2 Windowing
3.2.3 Fast Fourier Transform (FFT)
3.2.4 Mel-Frequency Wrapping
3.2.5 Cepstrum
CHAPTER 4: SPEECH FEATURE MATCHING
4.1 Introduction
4.2 Speech Data
4.3 Speech Processing
4.4 Vector Quantization (VQ)
4.5 Mel-Frequency Cepstrum Coefficients
CHAPTER 5: SIMULATION AND EVALUATION
5.1 Software Requirements
5.2 Voice Training
5.3 Voice Feature
5.4 Voice Testing
5.5 Result
CHAPTER 6: ADVANTAGES AND DISADVANTAGES
6.1 Advantages
6.2 Disadvantages
6.3 Applications
CHAPTER 7: CONCLUSION AND REFERENCES
Future Scope
References

LIST OF FIGURES
Figure 1 Basic Structure of Speaker Recognition System
Figure 2 An Example of Speech Signal
Figure 3 Block Diagram of MFCC Processor
Figure 4 An Example of Mel-spaced Filter Bank
Figure 5 Conceptual Diagram Illustrating Vector Quantization Codebook Formation
Figure 6 Voice Training
Figure 7.a Voice Testing Input Signal for User-1
Figure 7.b Output of Voice Testing for User-1
Figure 8.a Voice Testing Input Signal for User-2
Figure 8.b Output of Voice Testing for User-2
APPENDIX
KEYWORDS

Speaker Identification: Speaker identification is the process of determining from which of the
registered speakers a given utterance comes.

Speaker Verification: Speaker verification is the process of accepting or rejecting the identity
claimed by a speaker. Most of the applications in which voice is used to confirm the identity of a
speaker are classified as speaker verification.

Feature Extraction: Feature extraction is the process of obtaining different features such as power,
pitch, and vocal tract configuration from the speech signal.

Feature Matching: Speech-recognition engines match a detected word to a known word.

Mel Frequency Cepstral Coefficient (MFCC): The MFCC feature extraction technique basically
includes windowing the signal, applying the DFT, taking the log of the magnitude, and then
warping the frequencies on a Mel scale, followed by applying the inverse DFT.

Vector Quantization: Vector quantization (VQ) is an efficient coding technique for quantizing signal
vectors. It has been widely used in signal and image processing, such as pattern recognition and
speech and image coding.

CHAPTER 1
INTRODUCTION

Speaker recognition is also called voiceprint recognition and voice biometric recognition.
Speaker recognition, as a kind of biometric authentication technology, automatically
identifies a speaker based on the voice parameters that reflect the
speaker's physiological and behavioural characteristics in the voice waveform.

Biometric identification technology is a technology that uses human physiology or
behaviour to identify identity. Identity authentication based on biometric identification technology
is a demand of a highly informatized society and economic globalization, and it is an essential
technology in the government and commercial fields.

The biggest advantage of biometric identification technology is that using the human
body's own characteristics as the basis for identity recognition saves users the trouble of
remembering text passwords or carrying documents. Common biometric technologies include iris
recognition, fingerprint recognition, palmprint recognition, gait recognition, and speaker recognition.
However, apart from speaker recognition, the other authentication methods need to be
combined with professional, high-cost collection equipment and require close contact. Voice is the
most direct way for humans to communicate.

The development of voice-based identity authentication technology conforms to human
communication habits and meets people's requirements for convenience. Speaker recognition is a
technology based on voice for identity authentication, and it is also a kind of biometric recognition.
Compared with other identity authentication methods based on human body characteristics,
voiceprint recognition has many advantages: firstly, the collection of voiceprint signals does not
require costly hardware equipment, and only a common microphone is needed to complete audio
recording; secondly, voiceprint recognition raises fewer personal privacy concerns and is more
humane, and is therefore easier for users to accept psychologically; thirdly, voiceprint recognition
does not require information to be collected on site, making remote recognition more convenient;
furthermore, if a dynamic password is used, the voiceprint is more difficult to copy or imitate.

Speech recognition systems nowadays use many interdisciplinary technologies ranging from
pattern recognition and signal processing to natural language processing, implemented in a unified
statistical framework. Such systems find a wide area of applications in areas like signal processing
problems and many more. The objective of this paper is to present the concepts of speech
recognition systems, starting from their evolution up to the advancements that have now been adopted
to make them more robust and accurate. This paper gives a detailed study of the mechanism, the
challenges, and the tools to overcome those challenges, with a concluding note that, with the
advancement of these technologies, the world is surely going to experience revolutionary changes in
the near future.

1.1 EVOLUTION OF SPEECH RECOGNITION SYSTEMS

In 1784, a scholar created the first acoustic-mechanical speech machine in Vienna. After that, in
1879, Thomas Edison invented the first dictation machine. Continuing the chain, the next
speech recognition system was developed at Bell Laboratories in 1952; it was capable of recognizing
spoken digits with 90% accuracy, but it could only recognize numbers spoken by its inventor. In
the 1970s, researchers at Carnegie Mellon developed the Harpy system, which was able to recognize
over 1000 words and could recognize different pronunciations and some phrases. Speech recognition
continued in the 1980s with the introduction of the Hidden Markov Model, which used a more
mathematical approach of analyzing sound waves and led to many breakthroughs. Building on the
Hidden Markov Model, the IBM Tangora in 1986 used it to predict the upcoming phonemes in
speech. In 2006, the NSA (National Security Agency) started using speech recognition systems to
segment keywords in recorded speech.

1.2. Principles of Speaker/Voice Recognition

Speaker recognition can be classified into identification and verification. Speaker/voice
identification is the process of determining which registered speaker provides a given
utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity
claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification
systems.

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1):
feature extraction and feature matching. Feature extraction is the process that extracts a small
amount of data from the voice signal that can later be used to represent each speaker.
Feature matching involves the actual procedure to identify the unknown speaker by comparing
extracted features from his/her voice input with the ones from a set of known speakers.

Figure 1: Basic structure of speaker recognition systems

All speaker recognition systems operate in two distinct phases. The first is referred to as
the enrollment or training phase, while the second is referred to as the operation
or testing phase. In the training phase, each registered speaker has to provide samples of
their speech so that the system can build or train a reference model for that speaker.
In the case of speaker verification systems, a speaker-specific threshold is also computed
from the training samples. During the testing (operational) phase (see Figure 1), the input speech is
matched with the stored reference model(s) and a recognition decision is made.

Speaker recognition is a difficult task and is still an active research area. Automatic speaker
recognition works on the premise that a person's speech exhibits characteristics that are unique
to the speaker. However, this task is challenged by the high variability of input speech signals.
The principal source of variation is the speakers themselves. Speech signals in the training
and testing sessions can differ greatly due to many factors, such as voices changing with
time, health conditions (e.g., the speaker has a cold), speaking rates, etc. There are also other factors,
beyond speaker variability, that present a challenge to speaker recognition technology. Examples
are acoustical noise and variations in recording environments (e.g., the speaker uses different
telephone handsets/microphones).

1.3 Challenges with Speech Recognition Systems

Speech is an essential mode of communication with computers as well as with human beings. Speech
recognition has a wide range of applicability in the domains of computer science, medical science,
etc. Developing a real-time speech recognizer may be affected by factors ranging from an adverse
environment to the anatomy of the human body, involving human aspects too. Some of the key
challenges faced by speech recognition systems are discussed below:

◆ Noisy Environment
Studies have shown that the drawback affecting most speech recognition systems is
environmental noise and its adverse effect on system performance.

◆ Intensive Use of Computer Power
Running the statistical models needed for speech recognition requires the computer's processor to
perform a lot of heavy work. One of the reasons for this is the necessity to remember each stage of
the word recognition search, in case the system needs to backtrack to come up with the right word.

◆ Accent
The speaking accent differs according to social and personal situations (e.g., physiological and
cultural aspects). Indeed, compared to native speech recognition, performance degrades when
recognizing accented and non-native speech. Studies have shown that humans vary in
accent when speaking to parents and when speaking to friends.

◆ Speed of Speech
Speech recognition systems find it difficult to separate segments of continuous, speedy speech
signals. The speed of speech may vary and depends upon the situation and physical
stress. The pace of speaking may affect pronunciation through phoneme reduction and time
expansions and compressions.

◆ Recognition of Punctuation Marks

It has been observed that during the conversion of speech to on-screen text, punctuation marks are
not recognized as such; instead, the corresponding words are recognized. Many different strategies
have been devised to overcome this challenge, involving dictating these punctuation marks, and so
on, but a better solution is yet to come.

◆ Homophones
Homophones are words that have different meanings but sound the same when pronounced
(e.g., “there” and “their”, “be” and “bee”). In speech recognition systems, it is very difficult at
the word level to recognize which one is the intended word. Current observations show
that speech recognition systems on average achieve 94 to 99 percent accuracy across
different accents.

CHAPTER 2
REVIEW OF LITERATURE

2.1 Tools to Overcome Challenges Faced with Speech Recognition Systems

A number of noise reduction techniques have been engineered to extenuate the effect of
noise on system performance, and they often require an estimate of the noise statistics. One technique
that has been engineered is to design a database that may be used for the evaluation of feature
extraction at the front end, with a defined Hidden Markov Model employed at the back end.

2.2 Voice Activity Detector

A Voice Activity Detector is a useful technique for enhancing the performance of speech
recognition systems employed in noisy environmental conditions. The Voice Activity Detector is used
in speech recognition systems during the feature extraction process, thereby enhancing the speech
seen by the system. The detector depends on pitch detection, energy thresholds, periodicity measures,
and spectrum analysis. One major challenge that the detector faces is making a decision about the
extraction of the feature vector (FV); the selection of a feature vector for signal detection and of a
strong decision rule is a challenging problem, affecting the performance rate of the speech
recognition system.

CHAPTER 3
METHODOLOGY
3.1 Speech Feature Extraction

Speaker recognition can be classified into speaker identification and speaker verification.
Speaker identification is the process of determining from which of the registered speakers a given
utterance comes. Speaker verification is the process of accepting or rejecting the identity claimed by
a speaker. Most of the applications in which voice is used to confirm the identity of a speaker are
classified as speaker verification.

In the speaker identification task, a speech utterance from an unknown speaker is analyzed
and compared with speech models of known speakers. The unknown speaker is identified as the
speaker whose model best matches the input utterance. In speaker verification, an identity is
claimed by an unknown speaker, and an utterance of this unknown speaker is compared with a
model for the speaker whose identity is being claimed. If the match is good enough, that is, above a
threshold, the identity claim is accepted. A high threshold makes it difficult for impostors to be
accepted by the system, but with the risk of falsely rejecting valid users. Conversely, a low
threshold enables valid users to be accepted consistently, but with the risk of accepting impostors.
To set the threshold at the desired level of customer rejection (false rejection) and impostor
acceptance (false acceptance), data showing distributions of customer and impostor scores are
necessary.

The fundamental difference between identification and verification is the number of decision
alternatives. In identification, the number of decision alternatives is equal to the size of the
population, whereas in verification there are only two choices, acceptance or rejection, regardless of
the population size. Therefore, speaker identification performance decreases as the size of the
population increases, whereas speaker verification performance approaches a constant independent
of the size of the population, unless the distribution of physical characteristics of speakers is
extremely biased.

The purpose of this module is to convert the speech waveform to some type of parametric
representation (at a considerably lower information rate) for further analysis and processing. This
is often referred to as the signal-processing front end.

Figure 2: An example of a speech signal

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a
speech signal is shown in Figure 2. When examined over a sufficiently short period of time
(between 5 and 100 ms), its characteristics are fairly stationary. However, over longer periods of time
(on the order of 1/5 second or more), the signal characteristics change to reflect the different speech
sounds being spoken. Therefore, short-time spectral analysis is the most common way to
characterize the speech signal. A wide range of possibilities exists for parametrically representing the
speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC) and Mel-
Frequency Cepstrum Coefficients (MFCC), among others. MFCC is perhaps the best known and most
popular, and it will be used in this project.

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency;
filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to
capture the phonetically important characteristics of speech. This is expressed in the mel-frequency
scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
The process of computing MFCCs is described in more detail next.

Feature extraction techniques: Feature extraction is the most important part of speech
recognition, since it plays an important role in separating one speech sample from another. The
utterance can be processed with a wide range of feature extraction techniques proposed and
successfully exploited for the speech recognition task. But an extracted feature should meet some
criteria while dealing with the speech signal, such as:

i. The extracted speech features should be easy to measure
ii. They should not be susceptible to mimicry
iii. They should show little fluctuation from one speaking environment to another
iv. They should be stable over time
v. They should occur frequently and naturally in speech

3.2 Mel-Frequency Cepstrum Coefficients Processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is
typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to
minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can
capture all frequencies up to 5 kHz, which cover most of the energy of sounds generated by
humans. As discussed previously, the main purpose of the MFCC processor is to mimic the
behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to the
variations mentioned above than the speech waveforms themselves.

Figure 3 shows the steps involved in MFCC feature extraction. The MFCC is the
most evident example of a feature set that is extensively used in speech recognition. As the
frequency bands are positioned logarithmically in MFCC, it approximates the human auditory
response more closely than any other system. The technique of computing MFCCs is based on short-
term analysis, and thus one MFCC vector is computed from each frame. In order to extract the
coefficients, the speech sample is taken as the input and a Hamming window is applied to minimize
the discontinuities of the signal. Then the DFT is applied and the mel filter bank is applied to the
resulting spectrum. According to the mel frequency warping, the width of the triangular filters
varies, and the log total energy in a critical band around the center frequency is included. After
warping, a number of coefficients are obtained. Finally, the inverse Discrete Fourier Transform is
used for the cepstral coefficient calculation.

Figure 3. Block diagram of the MFCC processor

3.2.1 Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent
frames separated by M samples (M < N). The first frame consists of the first N samples. The second
frame begins M samples after the first frame, and overlaps it by N - M samples.

Similarly, the third frame begins 2M samples after the first frame (or M samples after the second
frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted
for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to
about 30 ms of windowing and facilitates the fast radix-2 FFT) and M = 100.
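
As a minimal MATLAB sketch of this step (the signal, its length, and the values of N and M here are illustrative assumptions, not values prescribed by the project):

% Frame blocking: split a speech vector into overlapping frames
N = 256;                          % samples per frame
M = 100;                          % frame shift; adjacent frames overlap by N - M
speech = randn(10000, 1);         % placeholder for a recorded speech vector
numFrames = floor((length(speech) - N)/M) + 1;
frames = zeros(N, numFrames);     % each column holds one frame
for k = 1:numFrames
    frames(:, k) = speech((k-1)*M + (1:N));
end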

3.2.2 Windowing

The next step in the processing is to window each individual frame so as to minimize the
signal discontinuities at the beginning and end of each frame. The concept here is to minimize the
spectral distortion by using the window to taper the signal to zero at the beginning and end of each
frame.
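
A sketch of this step, continuing the frame-blocking example above (the closed-form Hamming window is written out so that no toolbox function is needed; hamming(N) would give the same result):

% Hamming window, tapering each frame toward zero at both ends
w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));
windowed = frames .* repmat(w, 1, size(frames, 2));   % apply the window to every frame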

3.2.3 Fast Fourier Transform (FFT)

The next processing step is the Fast Fourier Transform, which converts each frame of N
samples from the time domain into the frequency domain. The FFT is a fast algorithm to implement
the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x(n)} as:

X(k) = Σ x(n) e^(-j2πkn/N), with the sum taken over n = 0, 1, ..., N-1, for k = 0, 1, ..., N-1

The result obtained after this step is often referred to as the signal's spectrum or periodogram.
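
Continuing the sketch from the previous steps, the periodogram of each windowed frame can be computed as:

% Power spectrum (periodogram) of each windowed frame
spec = abs(fft(windowed)).^2;   % N x numFrames power spectrum
spec = spec(1:N/2 + 1, :);      % keep only the non-negative frequency half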

3.2.4 Mel-Frequency Wrapping

As mentioned above, psychophysical studies have shown that human perception of the
frequency contents of sounds for speech signals does not follow a linear scale. Thus for each tone
with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the
‘Mel’ scale. The Mel-frequency scale is linear frequency spacing below 1000 Hz and a logarithmic
spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual
hearing threshold, is defined as 1000 Mel. Therefore we can use the following approximate formula
to compute the Mel for a given frequency f in Hz :

Mel(f) = 2595 * log10(1 + f/700)
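
In MATLAB this formula is a one-liner; as a quick check, the reference point above maps to roughly 1000 mels:

mel = @(f) 2595*log10(1 + f/700);   % Hz -> mel, per the formula above
mel(1000)                           % approximately 1000, as expected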

One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each
desired mel-frequency component (see Figure 4). That filter bank has a triangular band-pass
frequency response, and the spacing as well as the bandwidth is determined by a constant mel-
frequency interval.

The mel spectrum thus consists of the output power of these filters when S(ω), the spectrum of the
input signal, is applied at the input. The number of mel spectrum coefficients, K, is typically chosen
as 20.

Note that this filter bank is applied in the frequency domain; therefore it simply amounts to applying
the triangle-shaped windows of Figure 4 to the spectrum. The inverse Fourier transform is then used
for the cepstral coefficient calculation.
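
A hedged sketch of such a triangular mel-spaced filter bank follows, continuing the earlier sketches (N and spec are carried over; the sampling rate fs and the filter count K are illustrative assumptions; in the actual project the supplied melfb function plays this role):

% Triangular mel filter bank: K filters with equally mel-spaced edges
fs = 16000; K = 20;                               % assumed sampling rate and filter count
mel = @(f) 2595*log10(1 + f/700);                 % Hz -> mel
melinv = @(m) 700*(10.^(m/2595) - 1);             % mel -> Hz
edges = melinv(linspace(0, mel(fs/2), K + 2));    % K + 2 edge frequencies in Hz
bins = floor(N*edges/fs) + 1;                     % FFT bin index of each edge
H = zeros(K, N/2 + 1);
for k = 1:K
    for b = bins(k):bins(k+1)                     % rising edge of triangle k
        H(k, b) = (b - bins(k)) / max(bins(k+1) - bins(k), 1);
    end
    for b = bins(k+1):bins(k+2)                   % falling edge of triangle k
        H(k, b) = (bins(k+2) - b) / max(bins(k+2) - bins(k+1), 1);
    end
end
melSpec = H * spec;                               % K x numFrames mel spectrum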

Figure 4: An example of a mel-spaced filter bank

3.2.5 Cepstrum
In this final step, we convert the log mel spectrum back to time. The result is called the
mel-frequency cepstrum coefficients (MFCCs). The cepstral representation of the speech spectrum
provides a good representation of the local spectral properties of the signal for the given frame
analysis. Because the mel spectrum coefficients (and so their logarithms) are real numbers, we can
convert them to the time domain using the Discrete Cosine Transform (DCT). The cepstrum can be
seen as information about the rate of change in the different spectrum bands.
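
A sketch of this final step, continuing the example above (dct here is the Signal Processing Toolbox function; keeping 12 coefficients and dropping the 0th is a common convention, not a requirement of this project):

% DCT of the log mel spectrum yields the MFCC vectors, one per frame
mfccs = dct(log(melSpec + eps));   % eps guards against log(0)
mfccs = mfccs(2:13, :);            % keep, e.g., 12 coefficients, dropping the first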

CHAPTER 4
SPEECH FEATURE MATCHING
4.1 Introduction

The problem of speaker recognition belongs to a much broader topic in science and
engineering called pattern recognition. The goal of pattern recognition is to classify objects of
interest into one of a number of categories or classes. The objects of interest are generically called
patterns, and in our case they are sequences of acoustic vectors extracted from input speech
using the techniques described in the previous section. The classes here refer to the individual
speakers. Since the classification procedure in our case is applied to extracted features, it can also
be referred to as feature matching.

Furthermore, if there exists some set of patterns for which the individual classes are
already known, then one has a problem in supervised pattern recognition. This is exactly our case,
since during the training session we label each input voice with an ID (S1 to SN). These patterns
comprise the training set and are used to derive a classification algorithm. The remaining patterns
are then used to test the classification algorithm; these patterns are collectively referred to as the test
set. If the correct classes of the individual patterns in the test set are also known, then one can
evaluate the performance of the algorithm.

The state-of-the-art in feature matching techniques used in speaker recognition includes
Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ).
In this project, the VQ approach is used, due to its ease of implementation and high accuracy. VQ
is a process of mapping vectors from a large vector space to a finite number of regions in that space.
Each region is called a cluster and can be represented by its center, called a codeword. The
collection of all codewords is called a codebook. Figure 5 shows a conceptual diagram illustrating
this recognition process. In the figure, only two voices and two dimensions of the acoustic space are
shown. The circles refer to the acoustic vectors from voice 1, while the triangles are from voice 2.
In the training phase, a speaker-specific VQ codebook is generated for each known speaker
by clustering his/her training acoustic vectors. The resulting codewords (centroids) are shown in
Figure 5 as black circles and black triangles for voices 1 and 2, respectively. The distance from a
vector to the closest codeword of a codebook is called the VQ distortion.

In the recognition phase, an input utterance of an unknown voice is “vector-quantized” using each
trained codebook, and the total VQ distortion is computed. The speaker corresponding to the VQ
codebook with the smallest total distortion is identified.
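
A minimal sketch of this matching rule (testVecs and codebooks are assumed variables: a D x T matrix of test acoustic vectors and a cell array of D x K codebook matrices, one per speaker):

% Pick the speaker whose codebook gives the smallest total VQ distortion
bestId = 0; bestDist = inf;
for s = 1:numel(codebooks)
    C = codebooks{s};
    total = 0;
    for t = 1:size(testVecs, 2)
        d = sum((C - repmat(testVecs(:, t), 1, size(C, 2))).^2, 1);
        total = total + min(d);            % VQ distortion of this vector
    end
    if total < bestDist
        bestDist = total; bestId = s;      % remember the best codebook so far
    end
end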

1. Whole-word matching:

The engine compares the incoming digital audio signal against a pre-recorded template of the
word. This technique takes much less processing than sub-word matching, but it requires that the
user (or someone) pre-record every word that will be recognized - sometimes several thousand
words. Whole-word templates also require large amounts of storage (between 50 and 512 bytes per
word) and are practical only if the recognition vocabulary is known when the application is
developed.

2. Sub-word matching:

The engine looks for sub-words (usually phonemes) and then performs further pattern recognition
on those. This technique takes more processing than whole-word matching, but it requires much
less storage (between 5 and 20 bytes per word). In addition, the pronunciation of a word can be
guessed from English text without requiring the user to speak the word beforehand.

4.2 Speech Data

Our goal is to train a voice model (or, more specifically, a VQ codebook in the MFCC vector space)
for each of the voice commands S1 to SN using the corresponding sound file in the TRAIN folder.
After this training step, the system has knowledge of the voice characteristics of each (known) voice
of the speaker. Next, in the testing phase, the system should be able to identify the (assumed unknown)
voice/speaker of each sound file recorded and saved in the TEST folder (only the last live sound).

4.3 Speech Processing

In this phase you are required to write a Matlab function that reads a sound file and turns it into a
sequence of MFCCs (acoustic vectors) using the speech processing steps described previously. Many
of those tasks are already provided by either standard or our supplied Matlab functions. The Matlab
functions that you would need to use are: wavread, hamming, fft, dct and melfb (a supplied function).
Type help function-name at the Matlab prompt for more information about a function.

Now cut the speech signal (a vector) into frames with overlap (refer to the frame-blocking section
in the theory part). The result is a matrix where each column is a frame of N samples from the
original speech signal. Then apply the Windowing and FFT steps to transform the signal into the
frequency domain; this process is used in many different applications and is referred to in the
literature as the Windowed Fourier Transform (WFT) or Short-Time Fourier Transform (STFT).
The result is often called the spectrum or periodogram.

The last step in speech processing is converting the power spectrum into mel-frequency
cepstrum coefficients. The supplied function melfb facilitates this task. Finally, put all the pieces
together into a single Matlab function, mfcc, which performs the MFCC processing.

The result of the last section is that we transform speech signals into vectors in an acoustic
space. In this section, we will apply the VQ-based pattern recognition technique to build speaker
reference models from those vectors in the training phase, so that any sequence of acoustic vectors
uttered by an unknown speaker can then be identified.

Now write a Matlab function that trains a VQ codebook using the LBG algorithm described
before. Use the supplied utility function disteu to compute the pairwise Euclidean distances
between the codewords and the training vectors in the iterative process.
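
The supplied disteu is not reproduced in this report; a minimal stand-in with the behaviour described above might look like the following (the name disteu_sketch, saved as disteu_sketch.m, marks it as an illustration rather than the actual supplied function):

function d = disteu_sketch(A, B)
% Pairwise Euclidean distances between the columns of A (D x M) and B (D x N)
d = zeros(size(A, 2), size(B, 2));
for i = 1:size(A, 2)
    for j = 1:size(B, 2)
        d(i, j) = norm(A(:, i) - B(:, j));
    end
end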

4.4 Vector Quantization (VQ):

In vector quantization, an ordered set of signal samples or parameters can be efficiently
coded by matching the input vector to a similar pattern or codevector (codeword) in a predefined
codebook.
The main objective of data compression is to reduce the bit rate for transmission or data
storage while maintaining the necessary fidelity of the data. The feature vector may represent a
number of different possible speech coding parameters, including linear predictive coding (LPC)
coefficients and cepstrum coefficients. VQ can be considered a generalization of scalar
quantization to the quantization of a vector. The VQ encoder encodes a given set of k-dimensional
data vectors with a much smaller subset. The subset C is called a codebook and its elements Ci are
called codewords, codevectors, reproducing vectors, prototypes or design samples. Only the index i
is transmitted to the decoder. The decoder has the same codebook as the encoder, and decoding is
performed by a table look-up procedure. The commonly used vector quantizers are based on the
nearest neighbour, and are called Voronoi or nearest neighbour vector quantizers.
Both the classical K-means algorithm and the LBG algorithm belong to this class of nearest
neighbour quantizers.

Figure 5: Conceptual diagram illustrating vector quantization codebook formation (one voice
can be discriminated from another based on the location of the centroids)

A key component of pattern matching is the measurement of dissimilarity between two
feature vectors. A measurement of dissimilarity satisfies three metric properties: the positive
definiteness property, the symmetry property and the triangular inequality property. Each metric has
three main characteristics: computational complexity, analytical tractability and feature evaluation
reliability. The metrics used in speech processing are derived from the Minkowski metric.
The city block metric, Euclidean metric and Manhattan metric are special cases of the
Minkowski metric. These metrics are essential in the distortion measure computation
functions. A distortion measure is one which satisfies only the positive definiteness property of
the measurement of dissimilarity. There are many kinds of distortion measures, including the
Euclidean distance, the Itakura distortion measure and the likelihood distortion measure, and so on.
The Euclidean metric is commonly used because it fits the physical meaning of distance or
distortion. In some applications division calculations are not required.

To avoid calculating divisions, the squared Euclidean metric is employed instead of the
Euclidean metric in pattern matching. The quadratic metric is an important generalization of the
Euclidean metric. The weighted cepstral distortion measure is a kind of quadratic metric. The
key feature of the weighted cepstral distortion is that it equalizes the importance of each dimension
of the cepstrum coefficients. In speech recognition, the weighted cepstral distortion can be used to
equalize the performance of the recognizer across different talkers. The Itakura-Saito distortion
measure computes the distortion between two input vectors using their spectral densities.

The performance of the vector quantizer can be evaluated by a distortion measure D, which
is a non-negative cost D(Xj, X̂j) associated with quantizing any input vector Xj with a
reproduction vector X̂j. Usually, the Euclidean distortion measure is used. The performance of a
quantizer is always qualified by the average distortion Dv = E[D(Xj, X̂j)] between the input
vectors and the final reproduction vectors, where E represents the expectation operator. Normally,
the performance of the quantizer is considered good if the average distortion is small.
Another important factor in VQ is the codeword search problem. As the vector dimension
increases, the search complexity increases exponentially; this is a major limitation of
VQ codeword search, and it limits the fidelity of coding for real-time transmission. A full search
algorithm is applied in VQ encoding and recognition; it is a time-consuming process when the
codebook size is large.
In the codeword search problem, assigning one codeword to the test vector means finding
the codeword with the smallest distortion to the test vector among all codewords. Given a
codeword Ct and the test vector X in the k-dimensional space, the distortion of the squared Euclidean
metric can be expressed as follows:

D(X, Ct) = Σ (xi - cti)^2, summed over i = 1, ..., k, where Ct = {ct1, ct2, ..., ctk} and X = {x1, x2, ..., xk}.
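
For instance, with small example vectors (the values are illustrative only):

% Squared Euclidean distortion, exactly as in the formula above
x  = [1.0; 2.0; 3.0];     % test vector X
ct = [1.5; 1.0; 2.5];     % codeword Ct
D  = sum((x - ct).^2);    % D(X, Ct) = 0.25 + 1 + 0.25 = 1.5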

There are three ways of generating and designing a good codebook, namely the random
method, pair-wise nearest neighbour clustering and the splitting method. A wide variety of
distortion functions, such as the squared Euclidean distance, Mahalanobis distance, Itakura-Saito
distance and relative entropy, have been used for clustering. There are three major procedures in
VQ, namely codebook generation, the encoding procedure and the decoding procedure. The LBG
algorithm is an efficient VQ clustering algorithm. This algorithm is based either on a known
probabilistic model or on a long training sequence of data.
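
A compact sketch of the LBG splitting procedure follows (the training matrix, the target codebook size 2^p, the perturbation constant and the fixed number of refinement passes are illustrative assumptions):

% LBG: grow a codebook by repeated splitting plus K-means-style refinement
train = randn(12, 500);                    % D x T training vectors (placeholder)
p = 4; epsv = 0.01;                        % final size 2^p; splitting perturbation
C = mean(train, 2);                        % start from the global centroid
for split = 1:p
    C = [C*(1 + epsv), C*(1 - epsv)];      % split every codeword in two
    for iter = 1:10                        % a few refinement passes
        K = size(C, 2);
        idx = zeros(1, size(train, 2));
        for t = 1:size(train, 2)           % nearest codeword for each vector
            d = sum((C - repmat(train(:, t), 1, K)).^2, 1);
            [~, idx(t)] = min(d);
        end
        for k = 1:K                        % recenter each non-empty cluster
            if any(idx == k)
                C(:, k) = mean(train(:, idx == k), 2);
            end
        end
    end
end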

As an extension to this, let us see how speaker recognition behaves under stressed conditions.

Any condition that causes a speaker to vary his or her speech production from the normal or
neutral condition is called a stressed speech condition. Stressed speech is induced by emotion, high
workload, sleep deprivation, frustration and environmental noise. Under stressed conditions, the
characteristics of the speech signal differ from those of the normal or neutral condition. Due to these
changes in speech signal characteristics, the performance of a speaker recognition system may
degrade under stressed speech conditions. Firstly, six speech features (mel-frequency cepstral
coefficients (MFCC), linear prediction (LP) coefficients, linear prediction cepstral coefficients
(LPCC), reflection coefficients (RC), arc-sin reflection coefficients (ARC) and log-area ratios
(LAR)), which are widely used for speaker recognition, are analyzed for an evaluation of their
characteristics under stressed conditions. Secondly, a Vector Quantization (VQ) classifier and a
Gaussian Mixture Model (GMM) are used to evaluate speaker recognition results with the different
speech features. This analysis helps select the best feature set for speaker recognition under stressed
conditions. Finally, four novel VQ-based compensation techniques are proposed and evaluated for
the improvement of speaker recognition under stressed conditions. The compensation techniques are
speaker and stressed information based compensation (SSIC), compensation by removal of stressed
vectors (CRSV), cepstral mean normalization (CMN) and a combination of MFCC and sinusoidal
amplitude (CMSA) features. Speech data from the SUSAS database corresponding to four different
stressed conditions (Angry, Lombard, Question and Neutral) are used for the analysis of speaker
recognition under stressed conditions.

Cepstral features such as MFCC (Reynolds 1995; Picone 1993), cepstral coefficients (Hunt 1983)
and LP cepstral coefficients have been used by various authors for speaker recognition. MFCC
features, the most effective among the three, are estimated from the spectral energies in
different mel-frequency bands. In this way, the information associated with the spectral peaks
(amplitudes and bandwidths) is not explicitly used. LP coefficients and LP-derived features
(Rabiner and Juang 1993), such as Reflection Coefficients (RC) (Mathur and Story 2003), Arc-sin
Reflection Coefficients (ARC) (Campbell 1997) and Log-Area Ratios (LAR) (Furui 1981), are also
used for speaker recognition. All the above features are spectral domain features extracted from a
pre-emphasized speech signal. Pre-emphasis equalizes the inherent spectral tilt of speech. The
formants appear as spectral peaks in the vocal tract spectrum of a pre-emphasized speech signal.

The amplitudes, frequencies and bandwidths of the formants carry significant speaker
information. The spectral domain features used for speaker recognition do not use this information
explicitly. In Chagnolleau and Durou (2002), the authors have shown that the performance of
speaker recognition can be improved if different features are combined in a
suitable manner. A new set of features which can eliminate the above limitations would be useful
for improving the performance of speaker recognition systems.

Various compensation techniques have been used to improve speech recognition under
stressed conditions. Hansen has proposed parameter-based compensation techniques for recognition
improvements (Pellom and Hansen 1998; Cairns and Hansen 1994; Hansen et al. 1994; Hansen and
Cairns 1995; Hansen and Womack 1996; Bou-Ghazale and Hansen 1998; Ghazale and Hansen
2000). The features are enhanced to address noise, and an adaptive mel-cepstral compensation
algorithm is used to equalize the impact of stress. Roe (1987) has proposed a traditional template-
based system in which input spectra are vector quantized and replaced by the closest spectral
codeword in the VQ codebook for the subsequent distortion calculation. The basic assumption of
another technique (Chen 1987) was that the spectral distortion induced by unusual speaking efforts
could be compensated by a simple linear transformation of the cepstrum. Although this assumption
may be unduly optimistic, it was observed that the statistics of cepstral vectors did display some
systematic modification in various speaking styles. Spectral compensation in a stochastic model-
based recognizer was found to be effective in dealing with various articulation effects. There has
been almost negligible effort toward the analysis and improvement of the performance of speaker
recognition systems under stressed conditions. Feature compensation techniques may be
experimented with for enhancing the performance of speaker recognition systems.

Speech features and classifiers:

Speech features:
Speech feature extraction is one of the important blocks of the speaker recognition problem. A
survey of automatic speaker recognition techniques is presented by Atal (1976). It has been shown
that cepstral coefficients are the best features for a speaker recognition system. Reynolds (1995)
experimented with a telephone-speech database to evaluate the performance of different
features such as mel-frequency and linear-frequency filter bank cepstral coefficients, LP cepstral
coefficients and perceptual linear prediction (PLP) cepstral coefficients. The best results are reported
for higher-order LP features and band-limited filter bank features. Campbell (1997) has provided a
comparison of different features such as reflection coefficients, LAR, ARC, line spectrum pair
(LSP) frequencies and MFCC. An evaluation of the relative performance of different features such
as the LP cepstrum, autoregressive moving average (ARMA) cepstrum, LAR and RC was carried
out under different noise conditions (Behera and Dandapat 2002). LP-derived cepstral coefficients
have been used for a speaker-specific mapping in text-independent speaker recognition (Misra et al.
2003). All these results show the wide acceptance of cepstral coefficients for speaker recognition
applications.
The integration of different feature sets for speaker recognition has been proposed by
Chagnolleau and Durou (2002). This approach, named time-frequency principal components
(TFPC), seeks to find the time-dependent spectral vectors of the speech signal. In this paper,
MFCC and sinusoidal model features are used for the analysis of speaker recognition under stressed
conditions. Amplitude normalization is applied to the speech signal before feature extraction
to bring every utterance to the same level.

4.5 Mel-Frequency Cepstral Coefficients (MFCC)

Mel Frequency Cepstral Coefficients (MFCC) (Rabiner and Juang 1993; Murty and
Yegnanarayana 2006) are the most widely used features for speech and speaker recognition
applications. MFCCs are estimated based on the human perception of critical bandwidths. For the
calculation of MFCCs, filters spaced linearly at low frequencies and logarithmically at high
frequencies are used to capture the phonetically important characteristics of speech. The mel-
frequency scale is based on a linear frequency spacing below 1000 Hz and a logarithmic spacing
above 1000 Hz. For the calculation of the mel-frequency spectrum, a bank of filters, each having a
triangular bandpass frequency response, is used. The process of computing MFCCs is described as
follows. After framing the speech signal, the next step in the processing is to window each
individual frame so as to minimize the signal discontinuities at the beginning and end of each
frame. In this work, a Hanning window is used. The next processing step is the computation of the
Fast Fourier Transform, which converts each frame of N samples from the time domain into the
frequency domain. This frequency domain representation is referred to as the spectrum or
periodogram. Psychophysical studies show that human perception of the sound spectrum does not
follow a linear scale (Shaughnessy 1987).
Thus for each tone with an actual frequency f, a subjective pitch is measured on the mel
scale. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold,
is defined as 1000 mels. The following formula (Umesh et al. 1999) is used to compute the mels for
a given frequency f in Hz:
mel(f) = 2595 * log10(1 + f/700)

For the calculation of the mel-frequency spectrum, a bank of filters with triangular bandpass
frequency responses is used. A constant mel-frequency interval is used as the bandwidth of the
filters. The mel spectrum, or the output spectrum of S(ω), consists of the output power of these
filters. The number of filters in the filter bank is chosen as 20 in this work. In the final step, MFCCs
are estimated by evaluating the Discrete Cosine Transform (DCT) of the log mel spectrum. In this
work, a set of MFCCs is computed from each 20 ms speech frame, with an overlap of 10 ms.
This set of MFCCs (MFCC1 to MFCCM) is used as a feature vector for the speaker.
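
In samples, these settings translate as follows (the 16 kHz sampling rate is an assumption made for illustration; the recordings analyzed may use a different rate):

fs = 16000;             % assumed sampling rate
N  = round(0.020*fs);   % 20 ms frame -> 320 samples
M  = round(0.010*fs);   % 10 ms shift -> 160-sample hop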

CHAPTER 5
SIMULATION AND EVALUATION

5.1 Software Requirements:

MATLAB is a high-level language and interactive environment for numerical
computation, visualization, and programming. MATLAB includes built-in mathematical
functions fundamental to solving engineering and scientific problems, and an interactive
environment ideal for iterative exploration, design, and problem solving. An image
processing example is used to show how to get started using MATLAB.

Includes:

• Interactively importing and visualizing image data from files and webcams
• Iteratively developing an image processing algorithm

• Automating your work with scripts

• Sharing your results with others by automatically creating reports

Matlab Advantages:

• Implement and test your algorithms easily.


• Develop the computational codes easily.
• Debug easily.
• Use a large database of built-in algorithms.
• Process still images and create simulation videos easily.
• Symbolic computation can be easily done.
• Call external libraries.

5.2 Voice Training:
clear all;
close all;
clc;
% Create a recorder object (16 kHz, 8-bit, 2 channels; these settings
% must match the ones used in the testing script)
recorder=audiorecorder(16000,8,2);
% Record the user's voice for 5 seconds
try
% If a database already exists, append the new user's feature to it
load database
disp('Please record your voice');
drawnow();
pause(1);
recordblocking(recorder,5);
play(recorder);
data=getaudiodata(recorder);
plot(data);
% Feature extraction
f=Voicefeatures(data);
% Save the user's data
uno=input('Enter the user number: ');
F=[F;f];
c=[c;uno];
save database
catch
% First run: no database file exists yet, so create one
disp('Please record your voice');
drawnow();
pause(1);
recordblocking(recorder,5);
play(recorder);
data=getaudiodata(recorder);
plot(data);
% Feature extraction
f=Voicefeatures(data);
% Save the user's data
uno=input('Enter the user number: ');
F=f;
c=uno;
save database F c
end
msgbox('Your voice is registered');

5.3 Voice Feature:

function [xPitch]=Voicefeatures(data)
% Extract a single scalar feature: the index of the largest real-valued
% FFT component of the first channel of the recording
F=fft(data(:,1));
plot(real(F));
m=max(real(F));
xPitch=find(real(F)==m,1);
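
Note that this feature is simply the index of the dominant real-valued FFT component of the first channel, a single scalar used as a crude proxy for the dominant spectral component of the speaker's voice. It is far simpler than the MFCC-plus-VQ pipeline described in Chapters 3 and 4, and serves only as a lightweight demonstration feature for the simulation.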

5.4 Voice Testing:

clc;
clear all;
close all;
% Create a recorder object (same settings as in the training script)
recorder=audiorecorder(16000,8,2);
% Record the user's voice for 5 seconds
disp('Please record your voice');
drawnow();
pause(1);
recordblocking(recorder,5);
play(recorder);
data=getaudiodata(recorder);
plot(data);
% Feature extraction
pre=Voicefeatures(data);
% Classify: distance from the test feature to every registered feature
load database
D=zeros(1,length(F));
for i=1:length(F)
D(i)=abs(F(i)-pre);
end
% Pick the registered feature with the smallest distance
[~,ind]=min(D);
detected_class=c(ind);
matched_pitch=F(ind);
% Accept only if the test feature is within 100 FFT bins of the match
if ((matched_pitch+100)>pre && pre>(matched_pitch-100))
disp('The detected class is:');
detected_class
msgbox('Welcome Back');
else
disp('User not registered');
msgbox('You are not registered');
end
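
The acceptance rule above treats a test utterance as matching a registered user only when its feature falls within 100 FFT bins of the closest stored feature. This fixed tolerance is a simple design choice of the demonstration script; a threshold derived from enrollment data, as discussed for speaker verification in Chapter 3, would be the more principled option.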

5.5 Result:
Voice Training

Figure 6. Voice Training


The above figure shows the voice training stage: the user's voice is registered into the system
while the user speaks. Through this, the user's voice is trained into the system.
Voice Testing:
USER-1

Figure 7.a. Voice testing input signal for user-1

Figure 7.b. Output of voice testing for user-1


After registering the user's voice, we test whether user-1's voice is registered or not. The recorded
voice is recognized and compared with the registered voice; if a match is present, a popup with
"Welcome Back" is shown.
USER-2

Figure 8.a Voice testing input signal for user-2

Figure 8.b. Output of voice testing for user-2

Figures 8.a and 8.b show the testing of user-2's voice: the voice input is tested against the registered
voices, and if it does not match, a popup with "You are not Registered" is shown.

CHAPTER 6

ADVANTAGES AND DISADVANTAGES

6.1 ADVANTAGES:

1. MFCC is good at error detection and is able to produce robust features when the signal is
affected by noise.

2. It enables hands-free technology.

3. Time saving, ease of use, and accuracy.

6.2 DISADVANTAGES:

1. Errors and misinterpretation of words.

2. Language input and the requirement of language skills.

3. Filtering background noise is a difficult task, one that can at times be difficult even for humans
to accomplish.

6.3 APPLICATIONS:

After nearly sixty years of research, speech recognition technology has reached a relatively
high level. However, most state-of-the-art ASR systems run on desktops with
powerful microprocessors, ample memory and an ever-present power supply. In recent years, with
the rapid evolution of hardware and software technologies, ASR has become more and more
practical as an alternative human-to-machine interface, needed for the following application
areas:
Stand-alone consumer devices such as wrist watches, toys and hands-free mobile phones in cars,
where people are unable to use other interfaces or where big input platforms like keyboards are not
available.
Single-purpose command and control systems such as voice dialing for cellular, home, and
office phones, where multi-function computers (PCs) would be redundant.

Some of the applications of speaker verification systems are:

◆ Time and Attendance Systems.


◆ Access Control Systems.
◆ Telephone-Banking.
◆ Biometric login to telephone aided shopping systems.
◆ Information and Reservation services.
◆ Security control for confidential information.
◆ Forensic purposes.
◆ Voice-based telephone dialing.

The key focus of the last application mentioned above is to aid the physically challenged in
executing a mundane task like telephone dialing. Here the user initially trains the system by uttering
the digits from 0 to 9. Once the system has been trained, it can recognize the digits uttered by the
user who trained it. This system also adds some inherent security, as a system based on the cepstral
approach is speaker dependent. The algorithm is run on a particular speaker and the MFCC
coefficients are determined. When the algorithm is then applied to a different speaker, the mismatch
is clearly observed. Thus the inherent security provided by the system is confirmed.
Systems have also been designed which incorporate both speech and speaker recognition.
Typically the user then has two levels of checking: she/he has to initially speak the right password
to gain access to the system, and the system not only verifies whether the correct password has
been said but also checks the authenticity of the speaker. The ultimate goal is to have a system
which performs speech, iris and fingerprint recognition to implement access control. Speech is one
of the natural forms of communication. A person's voice contains various parameters that convey
information such as emotion, gender, attitude, health and identity. Speaker recognition technologies
have wide application areas. The aim of this paper is to point out some specific areas where speaker
recognition techniques can be used. Here we discuss three main areas: authentication, surveillance
and forensic speaker recognition.

CHAPTER 7

CONCLUSION AND FUTURE SCOPE

The goal of this project was to create a speaker recognition system and apply it to the speech
of an unknown speaker, by investigating the extracted features of the unknown speech and then
comparing them with the stored extracted features of each known speaker in order to identify the
unknown speaker.

Feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients) rather than a
plain FFT approach. The MFCC and FFT methods, used as primitive methods for recognizing an
unknown speaker uttering the same word in the training and testing phases, give good results, but
they are not as efficient as the vector quantization method, which recognizes the speaker with a
high degree of accuracy whether the same or a different word is uttered.

Vector quantization is the most accurate method, in which a VQ codebook is generated
by clustering the training feature vectors of each speaker and then stored in the speaker database. In
this method, the LBG algorithm is used to do the clustering (i.e., training a codebook for each
speaker). In the recognition stage, a distortion measure based on minimizing the
Euclidean distance is used when matching an unknown speaker against the speaker database.

During this project, we have found that the VQ-based clustering approach provides a
faster speaker identification process than the MFCC-only or FFT-only approach; hence it
is the best choice for building a highly efficient speaker recognition system.

We have data from 11 speakers, all speaking the same word, "one". We processed this
data through the above method and obtained a codebook for each and every speaker, which we use
as a reference for matching. After saving these codebooks, we took another speech recording of the
same speakers and ran it through our test function in MATLAB to test whether our code and
process were able to identify them, and finally our system was able to detect and identify each and
every speaker with good accuracy.

This method of feature extraction is very accurate and useful for various functions such as security,
PIN verification and the other purposes stated above. We can create a database of various users, and
that data can be used for identification purposes, which improves security considerably. Hence this
MFCC method should be implemented in various domains for identification, and in this project it
proved a better choice for recognition than an HMM-based model.

FUTURE SCOPE:

This project focused on "isolated word recognition". We feel that the idea can be
extended to "continuous word recognition" and ultimately to a language-independent
recognition system based on algorithms which make these systems robust.

The detection used in this work is based only on the frame energy in the MFCC front end, which is
not good for a noisy environment with low SNR. The error rate in determining the beginning and
ending of speech segments greatly increases, which directly influences the recognition
performance of the pattern recognition part. So, we should try to use some more effective way to do
detection. One such method could be a statistical approach that finds a distribution which can
separate the noise and speech from each other.

REFERENCES:

1. https://www.ijert.org/recognition-of-speaker-using-vector-quantization-and-mfcc
2. https://www.researchgate.net/figure/Block-Diagram-of-Speaker-Identification-System_fig3_350126849
3. www.mathworks.com

