Omar's Research
Supervisor signature:
Date:
Page 1 of 25
TABLE OF CONTENTS
2.1 Introduction ...................................................... 11
Chapter 4: Conclusion
References
Abstract
The speech signal primarily carries the linguistic message, but it also contains
speaker-specific information. It is generated by acoustically exciting the cavities
of the mouth and nose, and can be used to recognize (identify or verify) a person.
This project deals with the speaker identification task: finding the identity of a
person from his/her speech, among a group of persons already enrolled during the
training phase.
Speaker recognition is the process of recognizing a speaker from the individual's
speech biometrics. The voice characteristics of every speaker are different and can
therefore be used to construct a model. This model is later used to recognize an
enrolled speaker from the list of available speakers. The project discusses the
linear predictive coding (LPC) technique for extracting voice characteristics,
together with the Gaussian mixture model (GMM). Further, an in-depth analysis of
these techniques is made to identify their advantages and limitations. Work in the
field of speaker recognition systems has a wide range of applications.
ACKNOWLEDGMENTS
Foremost, we would like to express our sincere gratitude to our supervisor,
Assist. Prof. Dr. Ahmad Kamil Hasan, for his continuous support, patience,
motivation, and immense knowledge. His guidance helped us throughout the research
and writing of this project, and we appreciate his time and effort.
We thank our close friends and fellows for their generous help and encouragement
in our research.
Last but not least, we would like to thank our families: our parents, sisters, and
brothers, for encouraging and supporting us spiritually throughout our lives.
CHAPTER ONE
1.1 Introduction
The fundamental difference between identification and verification is the number
of decision alternatives. In identification, the number of decision alternatives is
equal to the size of the population, whereas in verification there are only two
choices, acceptance or rejection, regardless of the population size. Therefore,
speaker identification performance decreases as the size of the population
increases, whereas speaker verification performance approaches a constant
independent of the size of the population, unless the distribution of physical
characteristics of speakers is extremely biased [3].
There is also a case called "open set" identification, in which a reference model for
an unknown speaker may not exist. In this case, an additional decision alternative,
"the unknown does not match any of the models", is required.
Verification can be considered a special case of the "open set" identification mode
in which the known population size is one. In either verification or identification, an
additional threshold test can be applied to determine whether the match is
sufficiently close to accept the decision, or if not, to ask for a new trial [4].
The speech signal conveys information about the identity of the speaker. The area
of speaker identification is concerned with extracting the identity of the person
speaking the utterance. As speech interaction with computers becomes more
pervasive in activities such as the telephone, financial transactions and information
retrieval from speech databases, the utility of automatically identifying a
speaker based solely on vocal characteristics increases.
This project emphasizes text-dependent speaker identification, which deals with
detecting a particular speaker from a known population. The system prompts the
user to provide a speech utterance, then identifies the user by comparing the
codebook of that utterance with those stored in the database, and lists the
speakers most likely to have given that utterance [5].
The speech signal is recorded for N speakers, and then the features are extracted.
Feature extraction is done by means of LPC coefficients. The GMM is trained by
applying these features as input parameters, and the features are stored in
templates for further comparison. Here the GMM corresponds to the output, and the
input is the extracted features of the speaker to be identified. The GMM does the
adjustment, and the best match is found to identify the speaker.
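The matching step described above can be sketched as follows. This is a simplified illustration in which each enrolled speaker is summarized by a single diagonal Gaussian (a stand-in for the full GMM); `score` and `identify` are hypothetical helper names:

```python
import numpy as np

def score(features, model):
    """Average log-likelihood of a feature sequence under one speaker's
    diagonal-Gaussian model (mean vector, variance vector)."""
    mu, var = model
    ll = -0.5 * (np.log(2 * np.pi * var) + (features - mu) ** 2 / var)
    return ll.sum(axis=1).mean()

def identify(features, models):
    """Closed-set identification: return the index of the enrolled
    speaker whose model best matches the test features."""
    return int(np.argmax([score(features, m) for m in models]))

# Two enrolled speakers with different mean feature vectors.
models = [(np.zeros(2), np.ones(2)), (np.full(2, 5.0), np.ones(2))]
test = np.full((3, 2), 5.0)   # an utterance resembling speaker 1
```

In the full system each per-speaker model would be a trained GMM, and the score its log-likelihood accumulated over all feature frames.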
Feature extraction techniques transform the speech signal into acoustic feature
vectors carrying the essential characteristics of the signal, from which the
identity of the speaker can be recognized by voice. The aim of feature extraction
is to reduce the dimension of the acoustic feature vectors by removing unwanted
information and emphasizing the speaker-specific information.
[Figure 1] Speaker identification system.
1.3 Classification of Speaker Identification Systems
1.4 Objective of this Project
Chapter Two gives a general overview of human speech production, then introduces
speaker recognition, feature extraction, and speaker identification using the
linear predictive coding (LPC) model, and finally explains the Gaussian mixture
model. Chapter Three presents speaker identification using LPC and GMM and gives
the experimental results.
CHAPTER TWO
2.1 Introduction
In speaker verification, an identity is claimed by an unknown speaker and an
utterance of this unknown speaker is compared with a model for the speaker
whose identity is being claimed. If the match is good enough, that
is, above a threshold, the identity claim is accepted. A high threshold makes it
difficult for impostors to be accepted by the system, but with the risk of falsely
rejecting valid users. Conversely, a low threshold enables valid users to be accepted
consistently, but with the risk of accepting impostors. To set the threshold at the
desired level of customer rejection (false rejection) and impostor acceptance (false
acceptance), data showing distributions of customer and impostor scores are
necessary.
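The trade-off described above can be sketched numerically; the match-score values below are made up purely for illustration:

```python
import numpy as np

def error_rates(customer_scores, impostor_scores, threshold):
    """False rejection rate (customers scoring below the threshold) and
    false acceptance rate (impostors scoring at or above it)."""
    frr = float(np.mean(np.asarray(customer_scores) < threshold))
    far = float(np.mean(np.asarray(impostor_scores) >= threshold))
    return frr, far

# Hypothetical match scores for valid users and impostors.
customers = [0.9, 0.8, 0.85, 0.6]
impostors = [0.3, 0.5, 0.45, 0.7]
frr_low, far_low = error_rates(customers, impostors, 0.4)     # lenient threshold
frr_high, far_high = error_rates(customers, impostors, 0.75)  # strict threshold
```

Raising the threshold from 0.4 to 0.75 drives the false acceptance rate down at the cost of rejecting some valid users, which is exactly the trade-off the threshold must balance against the score distributions.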
2.2 Classification of Speaker Identification
Both text-dependent and text-independent methods have a serious weakness: these
security systems can easily be circumvented, because someone can play back the
recorded voice of a registered speaker uttering key words or sentences into the
microphone and be accepted as the registered speaker. Another problem is that
people often do not like text-dependent systems, because they do not like to
utter their identification number, such as their social security number, within
the hearing of other people. To cope with these problems, some methods use a
small set of words, such as digits, as key words, and each user is prompted to
utter a given sequence of key words that is randomly chosen every time the system
is used. Yet even this method is not reliable enough, since it can be
circumvented with advanced electronic recording equipment that can reproduce key
words in a requested order.
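The randomly prompted key-word sequence described above might be generated as follows; the digit vocabulary and the sequence length are arbitrary choices for illustration:

```python
import random

def prompt_sequence(keywords, length, seed=None):
    """Pick a random sequence of key words to prompt at each session,
    so that a fixed pre-recorded utterance cannot simply be replayed."""
    rng = random.Random(seed)
    return [rng.choice(keywords) for _ in range(length)]

digits = [str(d) for d in range(10)]
prompt = prompt_sequence(digits, 4, seed=42)  # e.g. a 4-digit challenge
```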
For almost all recognition systems, training is the first step. In a speaker
identification system we call this step the enrollment phase, and the following
step the recognition phase. Its purpose is to obtain the speaker models, or
voiceprints, for the speaker database. The first phase of verification systems is
also enrollment, as shown in Fig. 3. In this phase we extract the most useful
features from the speech signal for speaker identification (SI) and train models
to obtain optimal system parameters.
Extracted features are intended to be informative and non-redundant, facilitating
the subsequent learning and generalization steps, and in some cases leading to
better human interpretation. Feature extraction is related to dimensionality
reduction [4].
2.4 Linear Predictive Coding
One of the most powerful speech analysis techniques is the method of linear
predictive analysis. This method has become the predominant technique for
estimating the basic speech parameters, e.g., pitch, formants, spectra and vocal
tract area functions, and for representing speech for low-bit-rate transmission
or storage. The importance of this method lies both in its ability to provide
extremely accurate estimates of the speech parameters and in its relative speed
of computation. The basic idea behind LPC analysis is that a speech sample can be
approximated as a linear combination of past speech samples. By minimizing the
sum of the squared differences (over a finite interval) between the actual speech
samples and the linearly predicted ones, a unique set of predictor coefficients
can be determined.
It is assumed that the variations with time of the vocal tract shape can be
approximated with sufficient accuracy by a succession of stationary shapes. It is
possible to define an all-pole transfer function H(z) that produces the output
speech s(n) given the input excitation u(n) (either an impulse train or white
noise):

H(z) = G / (1 − Σ_{k=1}^{p} a_k z^{−k})    (1)

Thus, the linear filter is completely specified by the scale factor G (gain
factor) and the p predictor coefficients a_1, …, a_p. The number of coefficients
p required to represent any speech segment adequately is determined by many
factors, such as the length of the vocal tract, the coupling of the nasal
cavities, the place of the excitation and the nature of the glottal flow
function.
A major advantage of the all-pole model of speech production is that it allows
one to determine the filter parameters in a straightforward manner by solving a
set of linear equations. In the all-pole model, the speech sample s(n) at the nth
sampling instant is related to the excitation u(n) by the following equation:
s(n) = Σ_{k=1}^{p} a_k s(n − k) + G u(n)    (2)
where u(n) is the nth sample of the excitation and G is the gain factor.
Equation (2) is the LPC difference equation, which shows that the present output
value is the sum of the weighted present input, G u(n), and a weighted sum of the
past output samples. If the excitation u(n) is white noise, the best estimate
ŝ(n) of a speech sample based on the previous p samples is given by:
ŝ(n) = Σ_{k=1}^{p} a_k s(n − k)    (3)
where ŝ(n) is the predicted value of s(n) and the a_k are the predictor
coefficients. The prediction error between the actual speech sample and the
predicted sample is defined as:

e(n) = s(n) − ŝ(n) = s(n) − Σ_{k=1}^{p} a_k s(n − k)    (4)
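The prediction of Eq. (3) and its error can be checked numerically with a short sketch; the signal and the coefficient value below are arbitrary:

```python
import numpy as np

def lpc_predict(s, a):
    """Predicted signal s_hat(n) = sum_k a_k s(n-k) and the prediction
    error e(n) = s(n) - s_hat(n); samples before n = 0 are taken as zero."""
    s = np.asarray(s, dtype=float)
    pred = np.zeros_like(s)
    for n in range(len(s)):
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                pred[n] += a[k - 1] * s[n - k]
    return pred, s - pred

# First-order predictor a_1 = 1 on a ramp: each sample is predicted
# as the previous one, leaving a constant error of 1.
pred, err = lpc_predict([1, 2, 3, 4], [1.0])
```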
which is the output of a system whose transfer function is:

A(z) = E(z) / S(z) = 1 − Σ_{k=1}^{p} a_k z^{−k}    (6)

where A(z) is the transfer function of the prediction-error filter, i.e., the
inverse filter for the system H(z). To determine the filter coefficients a_k, the
mean squared prediction error is minimized over a short segment of speech of N
samples. The average squared prediction error becomes:
E_m = Σ_{n=0}^{N−1} e²(n) = Σ_{n=0}^{N−1} [ s(n) − Σ_{k=1}^{p} a_k s(n − k) ]²
Setting the derivatives of E_m with respect to each a_k to zero yields a set of
linear equations involving the short-time autocorrelation:

R(i) = Σ_{n=0}^{N−1−i} s(n) s(n + i)    (10)
In matrix form, the resulting equations are:

[ R(0)    R(1)   ⋯  R(p−1) ] [ a_1 ]   [ R(1) ]
[ R(1)    R(0)   ⋯  R(p−2) ] [ a_2 ] = [ R(2) ]
[  ⋮        ⋮    ⋱    ⋮    ] [  ⋮  ]   [  ⋮   ]
[ R(p−1)  R(p−2) ⋯  R(0)   ] [ a_p ]   [ R(p) ]
Equation (12) is the DFT of the sequence, Equation (13) gives the logarithm of
the absolute value of the DFT of the input, and Equation (14) gives the real
cepstral coefficients of the input sequence. The real cepstrum is mainly used as
a feature vector, as an improvement over the direct use of LPC-based cepstral
features, for a given speaker in the speaker identification process.
The p×p autocorrelation matrix above has the form of a Toeplitz matrix, which is
symmetric and has the same values along the lines parallel to the main diagonal.
This type of equation is called a Yule-Walker equation. Since the autocorrelation
matrix is positive definite, the equation for the autocorrelation method can be
efficiently solved by the Levinson-Durbin recursion.
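A minimal sketch of the autocorrelation method and the Levinson-Durbin recursion follows; it is a simplified illustration (real systems first pre-emphasize and window each frame):

```python
import numpy as np

def autocorr(s, p):
    """Short-time autocorrelation R(i) = sum_n s(n) s(n+i), i = 0..p (Eq. 10)."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    return np.array([np.dot(s[:N - i], s[i:]) for i in range(p + 1)])

def levinson_durbin(R, p):
    """Solve the Toeplitz Yule-Walker system for the predictor
    coefficients a_1..a_p by the Levinson-Durbin recursion."""
    a = np.zeros(p)
    E = R[0]                                   # prediction-error energy
    for m in range(1, p + 1):
        # Reflection coefficient for order m.
        k = (R[m] - np.dot(a[:m - 1], R[m - 1:0:-1])) / E
        prev = a[:m - 1].copy()
        a[m - 1] = k
        a[:m - 1] = prev - k * prev[::-1]      # update lower-order coefficients
        E *= 1.0 - k * k
    return a

# For R(i) = 0.5**i (an AR(1)-like sequence), the order-2 solution
# reduces to a_1 = 0.5, a_2 = 0.
a = levinson_durbin(np.array([1.0, 0.5, 0.25]), 2)
```

The returned a_k are the predictor coefficients of Eq. (2); in production code the same Toeplitz system can be solved with `scipy.linalg.solve_toeplitz`.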
2.5 Gaussian Mixture Model (GMM)
The Gaussian mixture model is a probabilistic model for representing the presence
of subpopulations within an overall population, without requiring that an
observed data set identify the subpopulation to which an individual observation
belongs. Formally, a mixture model corresponds to the mixture distribution that
represents the probability distribution of observations in the overall
population. The probability density function of each Gaussian component can be
defined as:
N(x; μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−½ (x − μ)^T Σ^{−1} (x − μ))

where
μ — the mean vector
Σ — the covariance matrix
D — the dimension of the feature vector x
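The Gaussian density above, and the weighted sum that forms the GMM density, can be evaluated directly; this is a small illustrative sketch, checked against the 1-D standard normal:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, Sigma) for a
    D-dimensional feature vector x."""
    x = np.asarray(x, float)
    mu = np.asarray(mu, float)
    cov = np.asarray(cov, float)
    D = x.size
    diff = x - mu
    norm = (2 * np.pi) ** (-D / 2) / np.sqrt(np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def gmm_pdf(x, weights, means, covs):
    """GMM density: a weighted sum of component Gaussian densities."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))
```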
It is also important to note that because the component Gaussians act together to
model the overall feature density, full covariance matrices are not necessary
even if the features are not statistically independent. A linear combination of
diagonal-covariance Gaussians is capable of modeling the correlations between
feature-vector elements; the effect of using a set of M full-covariance Gaussians
can be equally obtained by using a larger set of diagonal-covariance Gaussians.
GMMs are often used in biometric systems, most notably in speaker recognition
systems, due to their capability of representing a large class of sample
distributions. One of the powerful attributes of the GMM is its ability to form
smooth approximations to arbitrarily shaped densities.
There are several methods to estimate the statistical parameters of the GMM
model. The most popular are the maximum likelihood (ML) and maximum a posteriori
(MAP) estimators.
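A minimal one-dimensional sketch of ML estimation via the EM algorithm is given below. The quantile-based initialization and the fixed iteration count are simplifying assumptions; practical systems use library implementations with convergence checks and variance floors:

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=50):
    """Maximum-likelihood fit of a K-component 1-D GMM by EM."""
    x = np.asarray(x, dtype=float)
    w = np.full(K, 1.0 / K)                        # mixture weights
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # spread initial means
    var = np.full(K, np.var(x))
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per sample.
        dens = w / np.sqrt(2 * np.pi * var) \
             * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        Nk = resp.sum(axis=0)
        w = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return w, mu, var

# Two well-separated clusters are recovered by the fit.
x = np.array([0.0, 0.1, -0.1, 0.05, 10.0, 10.1, 9.9, 10.05])
w, mu, var = em_gmm_1d(x, 2)
```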
Figure 6 compares the densities obtained using a unimodal Gaussian model and a
GMM. Plot (a) shows the histogram of a single feature from a speaker recognition
system (a single cepstral value from a 25-second utterance by a male speaker);
plot (b) shows a unimodal Gaussian model of this feature distribution; plot (c)
shows a GMM and its ten underlying component densities. The GMM not only provides
a smooth overall distribution fit; its components also capture the multimodal
nature of the density.
Chapter 3
There are two inputs to the speaker identification system: the first is the
identity claim, which may be provided by a keyed-in identification number; the
second is the speech utterance itself.
where I is the number of frames in the speech signal. The typical chosen values
of N and M are 280 samples (about 17.5 msec) and 100 samples (about 6 msec),
respectively. The frame window used to minimize the signal discontinuities at the
beginning and end of each frame is defined as:

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),    0 ≤ n ≤ N − 1
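The framing and windowing described above can be sketched as follows. A 16 kHz sampling rate is inferred from the quoted durations, and the Hamming window is a common choice assumed here:

```python
import numpy as np

def frame_signal(s, N=280, M=100):
    """Split the signal into overlapping frames of N samples,
    advancing M samples per frame."""
    n_frames = 1 + (len(s) - N) // M
    return np.array([s[i * M:i * M + N] for i in range(n_frames)])

def hamming(N):
    """Hamming window applied to each frame to reduce edge discontinuities."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frames = frame_signal(np.zeros(1000))      # e.g. a 62.5 ms signal at 16 kHz
windowed = frames * hamming(280)           # window each frame (broadcast)
```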
matching similarity. The previous procedures are repeated for all unknown
speakers, and the system is checked to see whether it identifies each speaker or
not. The system is then tested to find the identification rate, which is defined
as:

identification rate = (number of correctly identified utterances / total number
of test utterances) × 100%
Table (1) shows the identification rate for different numbers of speakers using
LPC and GMM. It is clear from this table that the identification rate decreases
as the number of speakers increases.
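The identification rate defined above is simply the percentage of correctly identified trials:

```python
def identification_rate(predicted, actual):
    """Percentage of test utterances whose predicted speaker
    matches the true speaker."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Three of four hypothetical trials identified correctly -> 75%.
rate = identification_rate([1, 2, 3, 4], [1, 2, 3, 5])
```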
Chapter 4
Conclusion
4.1 Conclusion
References
[5] S. Chen and Y. Luo, "Speaker verification using LPC and support
vector machine," Proc. Int. MultiConference ..., vol. I, pp. 18-21, 2009.
Republic of Iraq
Ministry of Higher Education and Scientific Research
University of Technology
Electromechanical Engineering Department
Energy and Renewable Energies Branch
BY
Omar Assam Abbas
Ameer Abd-AlKareem
Ahmed Moneim Zidan
SUPERVISOR
Dr. Ahmad Kamil Hasan