
Speaker Verification: Spoofing and Countermeasures

Mid-Term Report submitted for Major Project Evaluation

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

P Sai Sreeja       B150747EC
B Lekhana          B150462EC
P Anand Kumar      B150493EC
P Samhitha         B150540EC
Durga Bhavani M    B150861EC

Project Guide: Dr. Waquar Ahmad

Speaker Verification: Spoofing and Countermeasures

Abstract— Various measurable physiological and behavioural traits which are distinctive have been investigated for biometric recognition. Speech is a primary method of communication in which physiological and behavioural characteristics carry individual differences, i.e. distinctive features of the vocal tract shape and intonation can be captured and utilized for Automatic Speaker Verification (ASV). Based on a speech sample, the ASV system's function is to accept or reject a claimed identity. Even though biometric authentication has advanced significantly, such systems remain vulnerable to spoofing attacks: one can synthetically produce an individual's speech and use it for authentication purposes. Our project focuses on fundamental recognition performance, as opposed to security against spoofing, by modifying some existing features such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC) and Perceptual Linear Prediction (PLP), and by conducting a study on impersonation, replay, speech synthesis and voice conversion spoofing attacks.

I. INTRODUCTION

Speech can be used as a biometric feature to distinguish between human beings, as each speaker has a different style of speech delivery. The size and shape of the vocal tract and the size of the larynx fall under the category of physiological structures of a speaker, and these cause the differences between speakers in speech production. Speaker modelling is essential for many tasks, which include speaker recognition, speaker diarization, speaker change detection, speaker clustering, etc. Speaker recognition refers to recognizing a person based on voice samples of that particular person. On the other hand, speaker diarization deals with finding who spoke when, which is useful to find the speech of a particular speaker in a conversation of multiple speakers [2].

Our project focuses on the speaker knowledge utilized for authenticating a person from a speech signal, which is referred to as speaker recognition. Based on the task objective it can be categorized as speaker verification (SV) and speaker identification (SI). In speaker verification, a claim is associated with each test example. On the contrary, in speaker identification the speaker does not make any explicit claim; the system rather attempts to find the best match of the test speech against the set of available trained speaker models. SV systems are more practically acclaimed for deployment purposes, while SI systems are used in forensic applications. Based on the text, SV systems can be classified as text-dependent and text-independent. In a text-dependent system the text is fixed, which makes the speaker utter the same text during training and testing, whereas in a text-independent system there is no restriction on the speech content of the users during training and testing.

II. PROBLEM STATEMENT

A study of existing feature extraction algorithms in automatic speaker verification and the implementation of a new feature extraction algorithm, including a detailed analysis of the extraction algorithms and conclusions drawn on the basis of their equal error rates.

III. MOTIVATION

With the development of various techniques suitable for modelling speaker characteristics, the area of speaker verification has gone through major improvements in recent decades. These advancements in the field of SV have opened doors towards practically deployable systems for person authentication. In recent years a smart home security application has been developed for controlling different household activities using short utterances, combining speaker and speech recognition, and a speech biometric attendance system has been developed for marking attendance over an online telephone-based network. These works showcase SV as an emerging area for deployable systems with real-world applications.

The use of short utterances for SV not only reduces the required testing time, but also provides comfort to the speakers as they are less burdened by speaking. In this regard, text-dependent SV is found to be favourable for a deployable system because of the small amount of training and test time involved. However, there is a very high chance of spoofing by unauthorized users, as the fixed phrases are global across all users of the system. In the case of text-dependent SV, less speaker variability is captured because only a small phrase is considered for speaker modelling. On the contrary, text-independent SV captures the speaker characteristics in a more generic manner and is less susceptible to attackers. Further, as there is no constraint on what the users speak, it is more convenient for the users of such systems in practice. All these salient features together project text-independent speaker verification as the better choice for a robust, implementable system.

IV. LITERATURE SURVEY

Speaker verification is the process of accepting or rejecting an identity claim by comparing two speech samples, in which one is used as a reference of the identity and the other is collected during the test from the person who makes the claim.

Feature extraction maps every interval of speech to a multidimensional feature space. A speech interval is typically 10-30 ms of the speech waveform and is referred to as a frame of speech. Feature vectors are then compared to speaker models by matching. The match score obtained for each feature vector or sequence of feature vectors measures the similarity of the input feature vectors to the models (or feature vector patterns) of the claimed speaker; a decision is then made to either accept or reject the claim according to the match score or sequence of match scores.

Automatic speaker verification contains four steps:
1) voice recording
2) feature extraction
3) matching
4) decision

A. FEATURE EXTRACTION METHODS

A speech signal carries voice tone, prosodic and language content. Speakers can be separated by short-term spectral, prosodic and high-level features. Short-term spectral features are extracted from short frames of 20-30 milliseconds duration. They describe the short-term spectral envelope, which is an acoustic correlate of voice timbre. MFCCs, LPCCs and PLP are popular spectral features; there are also features such as IMFCC and RFCC. Prosodic features are extracted from longer segments such as syllables and word-like units to characterise speaking style. These features, such as pitch, energy and duration, are less sensitive to channel effects. However, due to their sparsity, the extraction of prosodic features requires relatively large amounts of training data, and pitch extraction algorithms are generally unreliable in noisy environments. High-level features are extracted from a lexicon or other discrete tokens to represent speaker behaviour.

Feature extraction methods:
1) MFCC (popular)
2) LPCC
3) RFCC
4) IMFCC

1) MFCC FEATURE EXTRACTION METHOD: MFCC is a representation of the real cepstrum of a windowed short-time signal derived from the Fast Fourier Transform (FFT) of that signal. The difference from the real cepstrum is that a nonlinear frequency scale is used, which approximates the behaviour of the auditory system. It is an audio feature extraction technique that extracts parameters from speech similar to those used by humans for hearing speech (the Mel scale).
Feature extraction in ASR (Automatic Speaker Recognition) is the computation of a sequence of feature vectors which provides a compact representation of the given speech signal.

STEPS IN MFCC FEATURE EXTRACTION METHOD

1) The speech signal is divided into time frames containing an arbitrary number of samples. A pre-emphasis FIR filter is used to enhance the higher frequencies of the input speech signal and to make the signal less susceptible to finite-precision effects:

y(n) = x(n) − α x(n − 1)

where x(n) is the input speech signal and 0.9 ≤ α ≤ 1. The speech signal is divided into frames of 20-30 ms duration, over which the signal is assumed to be stationary, with 50 percent overlap between two successive frames in order to avoid any loss of information.

2) Each time frame is then windowed with a Hamming window to eliminate discontinuities at the edges. The coefficients W(n) of a Hamming window of length M are computed according to the formula:

W(n) = 0.54 − 0.46 cos(2πn/M)

3) After windowing, the Fast Fourier Transform (FFT) is calculated to convert each time frame to a frequency-domain representation. The magnitude of the windowed signal is used to obtain the power spectrum.

4) The Mel-scaled filter bank, made of triangular filters, is applied to the Fourier-transformed frame. This scale is approximately linear up to 1 kHz and logarithmic at greater frequencies. The relation between the frequency of speech and the Mel scale is

Mel(f) = 2595 log10(1 + f(Hz)/700)

In a Mel-scale filter bank the higher-frequency filters have greater bandwidth than the lower-frequency filters, but their temporal resolutions are the same (Fig. 1).

5) The last step is to calculate the Discrete Cosine Transform (DCT) of the outputs of the filter bank. The DCT coefficients are ranked according to significance, and the 0th coefficient is not considered since it is unreliable. MFCCs are computed for each speech frame. The set of coefficients obtained for a frame is called an acoustic vector; it represents the phonetically important characteristics of speech and is very useful for further analysis and processing in speaker verification. For example, we can take an audio clip of 2 seconds, which gives approximately 128 frames each containing 128 samples (window size = 16 ms); the first 20 to 40 frames already give a good estimation of speech [1].

Fig. 1. Mel-scale filter bank
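To make the five steps concrete, the following is a minimal NumPy sketch of the pipeline described above. It is an illustrative sketch rather than the exact configuration used in this work; the pre-emphasis coefficient, the 26-filter Mel bank and the 13 retained coefficients are assumed defaults.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_len=0.025, frame_step=0.0125,
         n_fft=512, n_filters=26, n_ceps=13, alpha=0.97):
    """MFCCs for a 1-D signal (assumes the signal is at least one frame long)."""
    # 1) Pre-emphasis: y(n) = x(n) - alpha * x(n-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing with 50 percent overlap (frame_step = frame_len / 2)
    flen, fstep = int(frame_len * fs), int(frame_step * fs)
    n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx]

    # 2) Hamming window: W(n) = 0.54 - 0.46 cos(2*pi*n / M)
    frames = frames * np.hamming(flen)

    # 3) FFT magnitude -> power spectrum
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # 4) Triangular Mel filter bank (linear below ~1 kHz, logarithmic above)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    log_energies = np.log(power @ fbank.T + np.finfo(float).eps)

    # 5) DCT of the log filter-bank energies; drop the unreliable 0th coefficient
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```

Each row of the returned matrix is one acoustic vector in the sense used above, i.e. the MFCCs of one 20-30 ms frame.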
2) LPCC FEATURE EXTRACTION METHOD: LPC is a powerful speech analysis method in which the speech sample at the current time can be approximated as a linear combination of past speech samples. The idea behind LPC is to minimize the squared difference between the original and the estimated speech signal over a finite interval. With the help of LPC we can derive the LPC coefficients, which are then converted into cepstral coefficients; the LPC coefficients are obtained by the autocorrelation method. The cepstral coefficients obtained from either linear prediction (LP) analysis or a filter bank approach are treated as standard front-end features, and when speech is recorded in a clean environment, systems built on these features reach a high level of accuracy. These spectral features represent phonetic information, as they are derived directly from spectra, and they emphasize the contribution of all frequency components of a speech signal. LPCCs are generally used to capture emotion-specific information obtained through vocal tract features. The cepstrum may be obtained using linear prediction analysis of a speech signal.

Fig. 2. MFCC and LPCC block diagram

The basic idea behind linear predictive analysis is that the nth speech sample can be estimated by a linear combination of its previous p samples, as shown in the following equation:

s(n) ≈ a_1 s(n−1) + a_2 s(n−2) + a_3 s(n−3) + ··· + a_p s(n−p)

where a_1, a_2, a_3, ... are assumed to be constant over a speech analysis frame. These are known as predictor coefficients or linear predictive coefficients, and they are used to predict the speech samples. The difference between the actual and predicted speech samples is known as the prediction error, given by

e(n) = s(n) − ŝ(n) = s(n) − Σ_{k=1}^{p} a_k s(n−k)

where e(n) is the prediction error, s(n) the original speech signal, ŝ(n) the predicted speech signal, and a_k the predictor coefficients. To obtain a unique set of predictor coefficients, the sum of squared differences between the actual and predicted speech samples is minimized, as shown in the equation below:

E = E{e²[n]} = E{ (s[n] − Σ_{k=1}^{p} a_k s[n−k])² }

where n is the number of samples in an analysis frame. For obtaining the LP coefficients, E is differentiated with respect to each a_k and the result is equated to zero: ∂E/∂a_k = 0, for k = 1, 2, 3, ..., p.

After finding the a_k, we can find the cepstral coefficients using the following recursion:

C_0 = log_e p

C_m = a_m + Σ_{k=1}^{m−1} (k/m) C_k a_{m−k},   for 1 < m < p

C_m = Σ_{k=m−p}^{m−1} (k/m) C_k a_{m−k},   for m > p
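A small NumPy sketch of this procedure is given below. It assumes the autocorrelation (Levinson-Durbin) method mentioned above for the predictor coefficients and then applies the cepstral recursion as written; the order p, the number of cepstra and the sign convention for the predictor coefficients are illustrative assumptions.

```python
import numpy as np

def lpc_autocorrelation(frame, p):
    """Levinson-Durbin recursion on the frame autocorrelation -> predictor coefficients a_1..a_p."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + p]
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, p + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err   # reflection coefficient
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        err *= (1.0 - k * k)
    # Return coefficients in the convention s(n) ~ sum_k a_k s(n-k)
    return -a[1:], err

def lpcc(a, n_ceps):
    """Cepstral coefficients C_1..C_n_ceps from predictor coefficients, using the recursion above."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(p)                       # C_0 = log_e p, as defined in the text
    for m in range(1, n_ceps + 1):
        acc = sum((k / m) * c[k] * a[m - k - 1] for k in range(max(1, m - p), m))
        c[m] = (a[m - 1] if m <= p else 0.0) + acc
    return c[1:]

# Example usage on one windowed frame (hypothetical variable `frame`):
# a_k, _ = lpc_autocorrelation(frame, p=12)
# cepstra = lpcc(a_k, n_ceps=12)
```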
3) IMFCC FEATURE EXTRACTION METHOD: This extraction method follows the path opposite to that of the human auditory system. The basic idea is to retrieve the information lost or missed by the original MFCC: the structure of the filter bank is inverted. In doing so, the higher frequency range is covered by more closely spaced filters, and a small number of widely spaced filters is used in the lower frequency range. Such a feature set is named Inverted Mel Frequency Cepstral Coefficients (IMFCC). The procedure for IMFCC is much the same as that of MFCC but uses the reversed filter bank structure. The nature of IMFCC is thus reciprocal compared with MFCC, and it is defined as one of the characteristic properties of the audio system.
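As a rough illustration of the inverted filter-bank idea, the sketch below derives IMFCC-style coefficients by flipping a Mel filter bank along the frequency axis. The inputs are assumed to be the power spectrum and filter-bank arrays built in the MFCC sketch above; this flipping is one possible reading of the inversion, not a reference implementation.

```python
import numpy as np
from scipy.fftpack import dct

def imfcc_from_power(power, mel_fbank, n_ceps=13):
    """IMFCC-style coefficients from a frame power spectrum and a Mel filter bank.

    `power` is (n_frames, n_fft//2 + 1) and `mel_fbank` is (n_filters, n_fft//2 + 1),
    e.g. the arrays computed inside the MFCC sketch above. Reversing the bank along
    the frequency axis places the narrow, densely spaced filters at high frequencies,
    which is one way to realise the inverted filter-bank structure described here.
    """
    inv_bank = mel_fbank[::-1, ::-1]                 # inverted (reversed) filter bank
    log_e = np.log(power @ inv_bank.T + np.finfo(float).eps)
    return dct(log_e, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```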
4) RFCC FEATURE EXTRACTION METHOD:

Fig. 3. Filter banks of a) RFCC b) LPCC c) MFCC d) IMFCC

V. PHASE INFORMATION EXTRACTION

A speech signal can be processed by short-time Fourier analysis. Given a speech signal x(n), the short-time Fourier transform (STFT) is given as follows:

X(ω) = |X(ω)| e^{jψ(ω)}

where |X(ω)| is the short-time magnitude spectrum and ψ(ω) is the short-time phase spectrum. The square of the magnitude spectrum, |X(ω)|², is usually called the power spectrum.

As natural phase information is missing in reconstructed or converted speech, the phase spectrum can be used to derive features that provide evidence of converted speech. In this study, two different features are derived from the phase spectrum, as follows.

A. COSINE NORMALISATION OF PHASE SPECTRUM

As the original phase spectrum is not continuous in the frequency domain, we first unwrap the phase spectrum into a continuous function of frequency. After unwrapping, the range of the spectrum can vary widely, which makes it difficult to model the phase information. A cosine function is therefore applied to the unwrapped phase spectrum to normalize the range into [−1.0, 1.0]. Then the discrete cosine transform (DCT) is applied to the cosine-normalized phase spectrum to reduce dimensionality. Finally we keep 12 cepstral coefficients, excluding the 0th coefficient. This feature is called cos-phase.
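A minimal sketch of the cos-phase feature follows, assuming NumPy/SciPy and the same framing as the MFCC sketch; the 12-coefficient choice follows the text, while the window and FFT sizes are illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def cos_phase(frames, n_fft=512, n_ceps=12):
    """Cos-phase feature: unwrap the STFT phase, cosine-normalise it to [-1, 1],
    apply the DCT and keep 12 coefficients (the 0th is discarded)."""
    spectrum = np.fft.rfft(frames * np.hamming(frames.shape[1]), n_fft)
    phase = np.unwrap(np.angle(spectrum), axis=1)   # continuous phase spectrum
    normalised = np.cos(phase)                      # range restricted to [-1, 1]
    ceps = dct(normalised, type=2, axis=1, norm='ortho')
    return ceps[:, 1:n_ceps + 1]
```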
B. FREQUENCY DERIVATIVE OF PHASE SPECTRUM

The next feature is the frequency derivative of the phase spectrum, obtained via the group delay function (GDF), which is a measure of the nonlinearity of the phase spectrum and is defined as the negative derivative of the phase spectrum with respect to ω:

τ(ω) = [X_R(ω) Y_R(ω) + Y_I(ω) X_I(ω)] / |X(ω)|²

where Y(ω) is the STFT of n·x(n), and X_R(ω), X_I(ω) and Y_R(ω), Y_I(ω) are the real and imaginary parts of X(ω) and Y(ω), respectively. To capture the fine structure of the group delay phase spectrum, in practice a modified group delay function is adopted: a smoothed power spectrum is used instead of the original power spectrum, and two variables are introduced to emphasize the fine structure of the phase spectrum. The modified group delay function is described as follows:

τ_γ(ω) = [X_R(ω) Y_R(ω) + Y_I(ω) X_I(ω)] / |S(ω)|^{2γ}

τ_{α,γ}(ω) = [τ_γ(ω)/|τ_γ(ω)|] · |τ_γ(ω)|^{α}

where |S(ω)|² is the smoothed power spectrum, τ_{α,γ}(ω) is the modified group delay phase spectrum, and α and γ are the two variables that put the phase spectrum into a favourable form. After the modified group delay phase spectrum is obtained, the discrete cosine transform (DCT) is applied. We keep 12 cepstral coefficients, excluding the 0th coefficient. The 12-dimensional feature is used for model training and detection, and this feature is called MGDF-phase [5].
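The modified group delay feature can be sketched in the same style. The code below uses the fact that Y(ω) is the STFT of n·x(n) and a simple moving-average smoothing for the power spectrum; the α, γ values and the smoothing length are illustrative assumptions, not values taken from this report.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.ndimage import uniform_filter1d

def mgdf_phase(frames, n_fft=512, alpha=0.4, gamma=0.9, n_ceps=12):
    """Modified group delay (MGDF-phase) feature for a matrix of speech frames."""
    win = np.hamming(frames.shape[1])
    n = np.arange(frames.shape[1])
    X = np.fft.rfft(frames * win, n_fft)                    # STFT of x(n)
    Y = np.fft.rfft(frames * win * n, n_fft)                 # STFT of n*x(n)
    S = uniform_filter1d(np.abs(X) ** 2, size=11, axis=1)    # smoothed power spectrum |S(w)|^2
    num = X.real * Y.real + Y.imag * X.imag
    tau_g = num / (S ** gamma + np.finfo(float).eps)         # |S(w)|^(2*gamma) = (power)^gamma
    tau = np.sign(tau_g) * (np.abs(tau_g) ** alpha)          # tau_{alpha,gamma}
    return dct(tau, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```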
VI. METHODOLOGY

Performance evaluation and classifier description: after extracting the features from the speech, we need a classifier to score the match between the natural and the synthetic speech, so a classifier is used. Basically two types of classifiers are considered: 1) GMM and 2) LBP-SVM. In the different studies we came across, the GMM technique gives reasonably good accuracy on the database we have used, i.e. ASV 2017, so we chose this classifier for benchmarking the various features.

Gaussian Mixture Model (GMM)-Universal Background Model (UBM):

A Gaussian Mixture Model (GMM) is a parametric model; in a GMM we evaluate the probability of the data under all of the Gaussian component densities. Speaker verification is one of the application areas where GMMs are used. GMM parameters are estimated from training data using the Expectation-Maximization (EM) algorithm, or by Maximum a Posteriori (MAP) estimation from a well-trained prior model. The EM algorithm is iterative in nature. GMMs are generally used for text-independent speaker identification; there is no need for a vocabulary database or a large phoneme inventory. Capturing the general characteristics of a population and then adapting them to an individual speaker is the basic idea of the UBM. The UBM is the model used to compare the person-independent feature characteristics against the person-specific feature model during the accept/reject decision; it can also be seen as a GMM trained on a large set of speakers. First, a likelihood score (or ratio) for an unknown speech sample is found; then the match score is formed between the speaker-specific model and the universal background model, where the speaker-specific GMM is trained on samples from a particular speaker and the UBM is used as a prior model in the MAP parameter estimation. The GMM shows the best performance compared with the other known methods.

A Gaussian mixture model is a weighted sum of M component Gaussian densities, as given by the equation

p(x | λ) = Σ_{i=1}^{M} w_i g(x | µ_i, Σ_i)

where x is a D-dimensional continuous-valued data vector (i.e. measurements or features) and

g(x | µ_i, Σ_i) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp( −(1/2) (x − µ_i)' Σ_i^{−1} (x − µ_i) )

with mean vector µ_i and covariance matrix Σ_i. The mixture weights satisfy the constraint Σ_{i=1}^{M} w_i = 1. The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities. These parameters are collectively represented by the notation

λ = { w_i, µ_i, Σ_i },   i = 1, ..., M.

For a sequence of T training vectors X = {x_1, ..., x_T}, the GMM likelihood, assuming independence between the vectors, can be written as

p(X | λ) = Π_{t=1}^{T} p(x_t | λ)

so for an utterance with T frames the log-likelihood of a speaker model s is

L_s(X) = Σ_{t=1}^{T} log p(x_t | λ_s)
For speaker identification, the value of L_s(X) is computed for all speaker models s enrolled in the system, and the owner of the model that generates the highest value is returned as the identified speaker. During the training phase, the feature vectors are used to train the model with the Expectation-Maximization (EM) algorithm: an iterative update of each of the parameters in λ, with a consecutive increase of the log-likelihood at each step.
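As a concrete illustration of this training and scoring scheme, the sketch below fits per-class GMMs with EM and scores an utterance by its average per-frame log-likelihood, using scikit-learn's GaussianMixture. The 512-component setting echoes the text; the diagonal covariances and iteration count are assumptions, and the feature matrices are taken to come from any of the extractors above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=512):
    """Fit a GMM with EM on an (n_frames, n_dims) feature matrix."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', max_iter=200)
    return gmm.fit(features)

def log_likelihood(gmm, features):
    """Average per-frame log-likelihood of the utterance under the model."""
    return gmm.score(features)       # sklearn's score() averages log p(x_t | lambda) over frames

def llr_score(features, gmm_natural, gmm_converted):
    """Log-likelihood ratio of the kind used below for the natural/converted speech decision."""
    return log_likelihood(gmm_converted, features) - log_likelihood(gmm_natural, features)
```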
EQUAL ERROR RATE:

In an automatic speaker verification (ASV) system, the equal error rate (EER) is a measure used to evaluate system performance. Usually a large number of testing samples is needed to calculate the EER. In order to estimate the EER without experiments on testing samples, a method of model-based EER estimation has been proposed which computes likelihood scores directly from the client speaker models. However, the distribution of the likelihood scores computed in this way is significantly biased with respect to the distribution of likelihood scores obtained from testing samples. We therefore design and manipulate the speaker models of the client speakers and the impostors so that the distribution of the computed likelihood scores is closer to the distribution of likelihood scores obtained from testing samples; then a more reliable EER can be calculated from the speaker models.
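A small sketch of how an EER can be computed empirically from sets of genuine (client) and impostor scores, for example log-likelihood ratios produced by the GMMs above, is given below; the linear threshold sweep and the averaging at the FAR/FRR crossing are common conventions, not something fixed by this report.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: the operating point where the false acceptance rate equals the false rejection rate."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejections
    idx = np.argmin(np.abs(far - frr))            # threshold where the two rates cross
    return 0.5 * (far[idx] + frr[idx]), thresholds[idx]

# Example usage with hypothetical score arrays:
# eer, thr = equal_error_rate(np.array(genuine), np.array(impostor))
# print(f"EER = {100 * eer:.2f}% at threshold {thr:.3f}")
```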
The natural/converted speech decision is made using a log-scale likelihood ratio as follows:

Λ(C) = log p(C | λ_converted) − log p(C | λ_natural)

where C is the feature vector sequence of a speech signal, λ_converted is the GMM model for converted speech, and λ_natural is the GMM model for natural speech. Under the three different situations we have the same natural speech model λ_natural, but three different converted speech models λ_converted. The number of Gaussian components of the GMM is set to 512. The equal error rate (EER) is reported as the evaluation criterion.

Feature        Equal Error Rate (%)
MFCCs          16.80
cos-phase       6.60
MGDF-phase      9.13

Source: 2006 NIST Speaker Recognition Evaluation Test Set

VII. FUTURE WORK

Our future work includes identifying the best possible feature extraction method and improving its performance. Based on these data, the speaker verification system is to be trained and tested with a Gaussian mixture model (GMM). Other blocks in the SV system will be studied and implemented. The vulnerability of the SV system towards different spoofing attacks will be studied. Countermeasures for spoofing attacks will be proposed and implemented, the system will be trained accordingly, and a performance check will be conducted.

VIII. WORK DONE

We evaluated the performance of different feature extraction methods, such as MFCC, LFCC and RFCC, on speech samples from the ASV 2017 database and calculated the equal error rate using a GMM classifier trained with the EM (expectation-maximization) algorithm.
A. RESULTS

ASV Database 2017

Feature extraction    Mixtures    EER score on development trials (%)
MFCC trial 1              32       25.56
MFCC trial 2             128       13.53
MFCC trial 3             256       12.57
LFCC trial 1              32       16.38
LFCC trial 2             128       12.29
LFCC trial 3             256        7.56
IMFCC trial 1             32       15.46
IMFCC trial 2            128        9.76
IMFCC trial 3            256        6.80
RFCC                     512        8.40

REFERENCES

[1] Seiichi Nakagawa, Longbiao Wang, Shinji Ohtsuka, "Speaker Identification and Verification by Combining MFCC and Phase Information," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, May 2012.
[2] Gökay Diken, Zekeriya Tüfekçi, Lütfü Sarıbulut, Ulus Çevik, "A Review on Feature Extraction for Speaker Recognition under Degraded Conditions," IETE Technical Review, 2016, DOI: 10.1080/02564602.2016.1185976.
[3] Frédéric Bimbot, Jean-François Bonastre, et al., EURASIP Journal on Applied Signal Processing, 2004:4, 430-451, Hindawi Publishing Corporation, 2004.
[4] Douglas A. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, vol. 17, pp. 91-108, 1995; ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, 5-7 April 1994.
[5] Longbiao Wang, Yuta Kawakami, et al., "Relative Phase Information for Detecting Human Speech and Spoofed Speech," INTERSPEECH 2015.
[6] C. J. Kaufman, Rocky Mountain Research Lab., Boulder, CO, private communication, May 1995.
