
SPEECH PROCESSING

Roll# 17-5152

SPEAKER RECOGNITION USING MFCC AND VQ

OVERVIEW

Speaker Recognition is the process of automatically recognizing who is speaking on the basis of
individual information included in speech waves. This technique makes it possible to use the
speaker's voice to verify his/her identity and to provide controlled access to services such as
voice-based biometrics, database access services, voice-based dialing, voice mail and remote
access to computers. Speaker Recognition is one of the most useful biometric recognition
techniques in a world where insecurity is a major threat. Many organizations such as banks,
institutions and industries currently use this technology to provide greater security for their
vast databases.

The signal-processing front end that extracts the feature set is an important stage in any speech
recognition system. Despite the vast efforts of researchers, the optimum feature set has still not
been settled. There are many types of features, derived in different ways, that have a good
impact on the recognition rate. This thesis presents one technique for extracting a feature set
from a speech signal that can be used in speech recognition systems. The key is to convert the
speech waveform to some type of parametric representation (at a considerably lower information
rate) for further analysis and processing; this is often referred to as the signal-processing
front end. A wide range of possibilities exists for parametrically representing the speech signal
for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum
Coefficients (MFCC), and others.

MFCC is perhaps the best known and most popular of these, and it is used in this project.
MFCCs are based on the known variation of the human ear's critical bandwidths with frequency:
filters spaced linearly at low frequencies and logarithmically at high frequencies are used
to capture the phonetically important characteristics of speech. Another key characteristic of
speech is quasi-stationarity, i.e. it is stationary over short intervals, which is studied and
analyzed using short-time, frequency-domain analysis. To this end, we first made a comparative
study of the MFCC approach.
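The linear/logarithmic filter spacing comes from mapping frequency onto the mel scale. A common form of this mapping (one widely used formula; the report does not state which variant it uses) can be sketched as:

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the mel scale (common 2595*log10 formula).

    The mapping is roughly linear below ~1 kHz and logarithmic above,
    which is what motivates the MFCC filter spacing."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

With this formula, 1000 Hz maps to roughly 1000 mel, which is the conventional calibration point of the scale.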

Speaker Recognition mainly involves two modules, namely feature extraction and
feature matching. Feature extraction is the process that extracts a small amount of data from
the speaker's voice signal that can later be used to represent that speaker. Feature matching
involves the actual procedure to identify the unknown speaker by comparing the extracted
features from his/her voice input with the ones already stored in our speech database.
In feature extraction we find the Mel-Frequency Cepstrum Coefficients, which are
based on the known variation of the human ear's critical bandwidths with frequency;
these are vector quantized using the LBG algorithm, resulting in a speaker-specific codebook.
In feature matching we find the VQ distortion between the input utterance of an unknown
speaker and the codebooks stored in our database. Based on this VQ distortion we decide
whether to accept or reject the unknown speaker's identity. The system I implemented in this
work is 80% accurate in recognizing the correct speaker.
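The matching step described above can be sketched as follows. This is an illustrative implementation, not the report's exact code: the function names, the Euclidean distance, and the optional rejection threshold are assumptions.

```python
import numpy as np

def avg_vq_distortion(features, codebook):
    """Average distance from each feature vector to its nearest codeword.

    features: (T, D) array of MFCC vectors from the test utterance.
    codebook: (K, D) array of centroids for one enrolled speaker.
    """
    # Pairwise Euclidean distances (T, K), then minimum over codewords.
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(features, codebooks, threshold=None):
    """Pick the speaker whose codebook yields the lowest distortion.

    codebooks: dict mapping speaker name -> (K, D) codebook.
    If a threshold is given and even the best distortion exceeds it,
    the claimed identity is rejected (returns None)."""
    scores = {name: avg_vq_distortion(features, cb)
              for name, cb in codebooks.items()}
    best = min(scores, key=scores.get)
    if threshold is not None and scores[best] > threshold:
        return None  # reject: unknown speaker
    return best
```

The accept/reject decision thus reduces to comparing the smallest average distortion against a tunable threshold.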

Two purpose-built speech databases with added noise, recorded at sampling frequencies of 8000 Hz
and 11025 Hz, are used to check the accuracy of the developed speaker identification system in
non-ideal conditions. An analysis based on different experiments on these databases also shows
that the number of vectors in the VQ codebook and the sampling frequency influence the
identification accuracy significantly.

We have built an ASR system using three methods, namely:


(1) MFCC approach
(2) FFT approach
(3) Vector quantization

MFCC approach:
Obtaining the acoustic characteristics of the speech signal is referred to as
feature extraction. Feature extraction is used in both the training and
recognition phases. It comprises the following steps:
1. Frame Blocking
2. Windowing
3. FFT (Fast Fourier Transform)
4. Mel-Frequency Wrapping
5. Cepstrum (Mel Frequency Cepstral Coefficients)
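The five steps above can be sketched end to end. This is a minimal illustrative pipeline in numpy; the frame length, hop size, filter count and cepstral order are typical textbook choices, not values fixed by this report:

```python
import numpy as np

def mfcc(signal, fs=8000, frame_len=256, hop=100, n_filters=20, n_ceps=13):
    """Illustrative MFCC extraction following the five steps above."""
    # 1. Frame blocking: split the signal into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])

    # 2. Windowing: taper each frame with a Hamming window.
    frames = frames * np.hamming(frame_len)

    # 3. FFT: magnitude spectrum of each frame.
    spec = np.abs(np.fft.rfft(frames, axis=1))

    # 4. Mel-frequency wrapping: triangular filters spaced evenly in mel,
    #    i.e. linearly at low and logarithmically at high frequencies.
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel2hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for j in range(n_filters):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        for k in range(lo, c):
            fbank[j, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[j, k] = (hi - k) / max(hi - c, 1)
    mel_energy = np.log(spec @ fbank.T + 1e-10)

    # 5. Cepstrum: DCT-II of the log filterbank energies gives the MFCCs.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1)
                 / (2 * n_filters))
    return mel_energy @ dct.T  # shape (n_frames, n_ceps)
```

Each row of the result is one MFCC vector; these are the acoustic vectors that the VQ stage below clusters into a codebook.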

Using VQ:
A speaker recognition system must be able to estimate probability distributions of the
computed feature vectors. Storing every single vector generated during training is
impossible, since these distributions are defined over a high-dimensional space. It is often
easier to start by quantizing each feature vector to one of a relatively small number of
template vectors, a process called vector quantization. VQ takes a large set of feature
vectors and produces a smaller set of vectors that represent the centroids of the
distribution. The technique thus consists of extracting a small number of representative
feature vectors as an efficient means of characterizing the speaker-specific features,
avoiding the need to store every vector generated during training.
Using this training data, the features are clustered to form a codebook for each speaker.
In the recognition stage, the data from the tested speaker is compared to the codebook of
each speaker

and the difference is measured. These differences are then used to make the recognition
decision.
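The codebook training step can be sketched with the LBG (binary-split) algorithm named earlier. This is a minimal illustrative version; the codebook size, split factor and iteration count are assumed typical values, not the report's settings:

```python
import numpy as np

def lbg_codebook(features, size=16, eps=0.01, n_iter=20):
    """Train a VQ codebook by the LBG binary-split algorithm.

    features: (T, D) array of training vectors for one speaker.
    size: desired number of codewords (a power of two)."""
    # Start from the global centroid of all training vectors.
    codebook = features.mean(axis=0, keepdims=True)
    while codebook.shape[0] < size:
        # Split every centroid into a perturbed pair, doubling the codebook.
        codebook = np.concatenate([codebook * (1 + eps),
                                   codebook * (1 - eps)])
        # Refine the doubled codebook with k-means (Lloyd) iterations.
        for _ in range(n_iter):
            d = np.linalg.norm(features[:, None, :] - codebook[None, :, :],
                               axis=2)
            nearest = d.argmin(axis=1)
            for j in range(codebook.shape[0]):
                members = features[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
    return codebook
```

One such codebook is stored per enrolled speaker; at recognition time the utterance's vectors are scored against each codebook and the lowest-distortion speaker wins.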
The problem of speaker recognition belongs to a much broader topic in science and
engineering known as pattern recognition. The goal of pattern recognition is to classify objects
of interest into one of a number of categories or classes. The objects of interest are generically
called patterns, and in our case they are the sequences of acoustic vectors extracted from an
input speech signal using the techniques described in the previous section. The classes here
refer to individual speakers. Since the classification procedure in our case is applied to
extracted features, it can also be referred to as feature matching.
