ABSTRACT
Voice is one of the most popular and reliable biometric modalities used in automatic personal identification systems. This paper presents a speaker identification system using cepstral-based speech features. The commonly used cepstral features, Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), and Real Cepstral Coefficients (RCC), are employed in the speaker identification system.
Cont.
The experimental results show that the identification accuracy with MFCC is superior to both LPCC and RCC. This paper introduces a new approach to control and drive a DC motor using voice recognition based on MFCC. The DC motor is used to control a wheelchair.
INTRODUCTION
Speech is one of the natural forms of communication, and recent developments have made it possible to use it in security systems. The task is to use a speech sample to select the identity of the person who produced the speech from among a population of speakers. This paper makes it possible to use a speaker's voice to verify their identity and control access to a DC motor. The MFCC algorithm for speech recognition is more accurate than Linear Predictive Coding (LPC).
CONT.
The external DC motor is connected through an interface between the computer and the hardware circuit. The hardware circuit mainly consists of a microcontroller (ARM), a MAX232 IC, and a driver IC (L293D). Voice is one of the most popular and reliable biometric technologies used in automatic personal identification systems. Speech recognition systems are used for a variety of applications such as multimedia browsing tools, access control, security, and finance.
Cont
The physiological structure of the vocal tract differs from person to person, and this lets us differentiate one person's voice from another's. The difference in vocal tract structure is reflected in the frequency spectrum of the speech signal. MFCC is a filter-bank-based approach, but it is implemented using a time-frequency analysis technique: time analysis is first done through a framing operation, and frequency analysis is then done by passing each frame through a filter bank.
Cont
The filters are designed so that they resemble human auditory frequency perception. MFCC is presently the most widely used feature set for speaker recognition (it was first proposed for speech recognition). In the MFCC filter bank, the low frequencies are given more importance than the high frequencies. Speech recognition performance degrades significantly under varying environmental conditions in many application areas; recognition accuracy can be improved by removing noise.
Cont
A speech recognition system includes
1. a feature extraction method
2. a feature matching method.
MFCC is used here for feature extraction. Automatic speaker identification (ASI) systems are further classified into text-independent and text-dependent methods.
TEXT-INDEPENDENT SYSTEM
A text-independent ASI system does not rely on a specific text being spoken in either the training or the testing phase. It relies on the long-term statistical characteristics of speech to effect a successful identification.
TEXT-DEPENDENT SYSTEM
In a text-dependent ASI system, a fixed utterance, such as a password, card number, or PIN code, is used in both the training and testing phases, and the system relies on specific features of the test utterance to effect a match. A text-dependent system requires less training than a text-independent one and provides a practical solution in real applications.
The source-filter model of speech production assumes that the speech segment centered at time t0 is produced when an excitation signal, e(n,t0), is passed through a linear filter, h(n,t0), which models the vocal tract. That is, for a small segment of time over which the properties of the speech signal are assumed stationary, the speech signal is the convolution of an excitation sequence (the quickly varying part) with a vocal-system impulse response (the slowly varying part):
Cont..
s(n,t0) = e(n,t0) * h(n,t0)    (1)
To extract the vocal-tract-specific characteristics, it is desirable to filter out the excitation component from the filter component. The convolution makes it difficult to separate the two parts; therefore cepstral analysis is introduced. The cepstral coefficients are generally derived either through linear predictive (LP) analysis or mel filter-bank analysis.
The signal is transformed from the time domain to the frequency domain by applying the Fourier transform. According to the convolution theorem (Eq. 2), the convolution in Eq. (1) becomes a multiplication:
S(w,t0) = E(w,t0) * H(w,t0)    (3)
When the spectrum is represented logarithmically, the components become additive due to the property of the logarithm:
log|S(w,t0)| = log|E(w,t0)| + log|H(w,t0)|    (4)
Taking the inverse Fourier transform of the log spectrum then acts on the two components individually. Here cs(m,t0) is called the cepstrum, or real cepstral coefficient, of s(n,t0), and the domain of cs(m,t0) is called the quefrency domain. The vocal tract characteristics are encoded into the lower quefrencies, so the excitation can be removed by keeping only the lower cepstral coefficients.
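The cepstrum computation described above can be sketched in Python (rather than the paper's MATLAB; the function name, the naive DFT, and the small log floor are our own choices, for illustration only):

```python
import cmath
import math

def real_cepstrum(signal):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum.  The
    slowly varying vocal-tract part concentrates in the low-quefrency
    coefficients, so keeping only those removes the excitation."""
    N = len(signal)
    # forward DFT (a naive O(N^2) version, for clarity)
    spectrum = [sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N)) for k in range(N)]
    # small floor avoids log(0) for bins with zero magnitude
    log_mag = [math.log(abs(s) + 1e-12) for s in spectrum]
    # inverse DFT of the log-magnitude spectrum
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * m / N)
                for k in range(N)).real / N for m in range(N)]
```

A real implementation would use an FFT; the structure (DFT, log magnitude, inverse DFT) is the point here.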
Mel-frequency cepstral coefficients are computed like the real cepstral coefficients except that the frequency scale is warped to correspond to the mel scale. This mapping is usually done using the equation Mel(f) = 2595 * log10(1 + f/700). The calculation of the mel-frequency cepstral coefficients is illustrated in Fig.
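The mel warping equation above can be written directly in Python (the function names are ours; the inverse mapping is included because it is what places the filter-bank centre frequencies):

```python
import math

def hz_to_mel(f_hz):
    """Warp a frequency in Hz onto the mel scale: Mel(f) = 2595*log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when placing filter-bank centre frequencies."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction, 1000 Hz maps to approximately 1000 mel.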
MFCC CALCULATION
The speech is first pre-emphasized with a pre-emphasis filter 1 - a*z^-1 to spectrally flatten the signal, where "a" is between 0.9 and 1; here a is approximated as 31/32. In the time domain, the relationship between the output S'n and the input Sn of the pre-emphasis block is S'n = Sn - a*Sn-1.
Cont
S'n = Sn - a*Sn-1 = Sn - (31/32)*Sn-1 = Sn - (Sn-1 - Sn-1/32)
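A minimal Python sketch of this pre-emphasis step (the function name is ours; the point of a = 31/32 is that on fixed-point hardware the multiply reduces to a subtract and a right shift, Sn-1 - Sn-1/32):

```python
def pre_emphasize(samples, a_num=31, a_den=32):
    """Pre-emphasis filter s'[n] = s[n] - (31/32)*s[n-1].
    The first sample is passed through unchanged (s[-1] taken as 0)."""
    out = []
    prev = 0.0
    for s in samples:
        out.append(s - (a_num / a_den) * prev)
        prev = s
    return out
```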
Then the pre-emphasized speech is separated into short segments called sub-frames. The frame length is set to 10 ms (80 samples, i.e. an 8 kHz sampling rate) to guarantee stationarity inside the frame, with no overlap between frames. A Hamming window is used, mainly to reduce the edge effect, so an 80-point window is used.
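The framing and windowing steps above can be sketched as follows (Python rather than the paper's MATLAB; the function name is ours):

```python
import math

def frame_and_window(samples, frame_len=80):
    """Split the signal into non-overlapping 80-sample (10 ms at 8 kHz)
    sub-frames and apply an 80-point Hamming window to each frame
    to reduce edge effects."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```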
Cont
As the window size becomes smaller, the short-time spectrum gives a poorer frequency resolution but a better estimate of the overall spectral envelope. A 128-point FFT is used, and only 64 coefficients are needed because of the symmetry property of the FFT of a real signal. A non-overlapping rectangular filter bank is used in place of the usual overlapping triangular one: the output characteristic of a rectangular filter is either a "1" or a "0",
Cont
thus the multiply operations are changed to "add" or "do not add". For a 128-point FFT, the rectangular filter bank reduces to 23 equally spaced rectangular filters; experiments indicate that 23 filters produce the highest recognition accuracy, and the rectangular filters require only 120 additions. fn and fn+1 represent the original 160-point frames with 50% overlap, and sfn and sfn+1 represent the new 80-point non-overlapped sub-frames.
Cont
We add the filter bank outputs sFn,k and sFn+1,k to generate the power coefficients Sn,k. We have reduced almost half of the computation by moving the overlap operation to the end of the spectrum calculation; the following DCT and delta calculations are the same. There are still 26 features in each frame. This extraction algorithm reduces the total number of multiplications.
Cont
We can calculate the mel-frequency cepstrum from the output power of the filter bank using
cn = sum_{k=1..23} log(Sk) * cos[n*(k - 1/2)*pi/23],  n = 1, ..., 12
where Sk is the output power of the k-th filter of the filter bank. We can also calculate the logged energy of each frame as one of the coefficients.
The logged energy is calculated without any windowing or pre-emphasis. Up to now we have obtained 13 cepstral coefficients. To enhance the performance of the speech recognition
Cont
system, time derivatives are added to the basic static parameters. The delta coefficients are obtained from the standard regression formula:
d(t) = sum_{p=1..P} p * (c(t+p) - c(t-p)) / (2 * sum_{p=1..P} p^2)
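The delta computation can be sketched in Python (the function name and the clamping of edge frames are our own choices; the regression width P = 2 is a common default, not stated in the paper):

```python
def delta(coeffs, width=2):
    """Delta (time-derivative) features via the regression formula
    d(t) = sum_{p=1..P} p*(c[t+p] - c[t-p]) / (2*sum p^2),
    with out-of-range frame indices clamped to the sequence edges."""
    T = len(coeffs)
    denom = 2 * sum(p * p for p in range(1, width + 1))
    out = []
    for t in range(T):
        num = sum(p * (coeffs[min(t + p, T - 1)] - coeffs[max(t - p, 0)])
                  for p in range(1, width + 1))
        out.append(num / denom)
    return out
```

On a linear ramp the interior deltas equal the slope, which is a quick sanity check.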
After all the calculations, the total number of MFCC features per frame is 26 (13 static coefficients and their 13 deltas).
LPCC features are obtained by converting the linear predictive coding (LPC) coefficients into a set of cepstral coefficients. Note that while there is a finite number of LPCs, the number of cepstral coefficients is infinite; however, the cepstrum is a decaying sequence, so a finite number of coefficients is sufficient to approximate it.
Cont.
FEATURE MATCHING
Feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). The VQ approach has been used here for its ease of implementation and high accuracy.
A DTW recognizer compares the incoming speech with stored templates. The template with the lowest distance measure from the input pattern is the recognized word, and the best match (lowest distance measure) is found using dynamic programming. This is called a Dynamic Time Warping (DTW) word recognizer.
Vector quantization of the voice features avoids the problem of time warping. Here we use the splitting method to initialize the codebook of each speaker's features. Once the final codebooks are obtained, we use them to decide who the speaker is: a feature vector from an unknown speaker is first matched against the database.
Cont
Then the system computes the Euclidean distance between the feature and every speaker's codebook:
D = min[d(X, Y)]    (7)
where X is the unknown speaker's feature and Y is a codebook vector. The distance between X and every codevector of a speaker's codebook is computed, and the minimal value is taken as the distance D. A threshold value is set; if every speaker's distance is more than the threshold, we judge again using the next feature.
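The matching rule of Eq. (7) can be sketched in Python (function names and the dictionary layout of the codebooks are our own; a real system would loop over successive feature vectors when no speaker passes the threshold, as described above):

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def identify(feature, codebooks, threshold):
    """D = min over codevectors of d(X, Y) for each speaker's codebook.
    Returns the best-matching speaker if its distance is below the
    threshold, otherwise None (meaning: judge again with the next feature)."""
    best_name, best_dist = None, float('inf')
    for name, codebook in codebooks.items():
        d = min(euclidean(feature, cv) for cv in codebook)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist < threshold else None
```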
Cont
DK = (1/K) * sum_{k=1..K} min[d(Xk, Y)]    (8)
If k reaches the maximum value we set and no speaker's distance is less than the threshold, the speaker is rejected.
Cont
If only one speaker's distance is less than the threshold, the system judges that it is that speaker. If several speakers' distances are less than the threshold, the candidates with the minimal distances are passed through a GMM judgment model, and the number of candidate speakers is set according to the local conditions.
Cont
The speaker-based VQ codebook generation works as follows. Given a set of I training feature vectors, {a1, a2, ..., aI}, characterizing the variability of a speaker, find a partitioning of the feature vector space, {S1, S2, ..., SM}, such that the whole feature space is represented as S = S1 U S2 U ... U SM. Each partition Si forms a convex, non-overlapping region, and every vector inside Si is represented by the corresponding centroid vector bi of Si. The partitioning is done in such a way that the average distortion is minimized.
The LBG algorithm requires an initial codebook. The initial codebook is obtained by the splitting method. In this method, an initial codevector is set as the average of the entire training sequence. This codevector is then split into two. The iterative algorithm is run with these two vectors as the initial codebook. The final two codevectors are split into four and the process is repeated until the desired number of codevectors is obtained.
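The splitting procedure above can be sketched in Python (rather than the paper's MATLAB; the function name, the perturbation factor eps, and the fixed number of refinement passes are our own simplifications of the LBG iteration):

```python
def lbg_codebook(vectors, size, eps=0.01):
    """LBG codebook by the splitting method: start from the mean of the
    whole training set, split every codevector into a perturbed pair,
    then refine with k-means-style passes until the desired size is reached."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    codebook = [mean]
    while len(codebook) < size:
        # split each codevector into two slightly perturbed copies
        codebook = [[c * (1 + s) for c in cv] for cv in codebook for s in (eps, -eps)]
        for _ in range(10):  # fixed number of refinement passes, for simplicity
            clusters = [[] for _ in codebook]
            for v in vectors:
                i = min(range(len(codebook)),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(v, codebook[j])))
                clusters[i].append(v)
            # move each codevector to its cluster centroid (keep it if empty)
            codebook = [[sum(v[d] for v in cl) / len(cl) for d in range(dim)]
                        if cl else cv
                        for cl, cv in zip(clusters, codebook)]
    return codebook
```

Because each pass doubles the codebook, the target size is naturally a power of two, matching the splitting method described in the text.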
Cont
The VQ partitioning is shown for two speakers: the circles refer to speaker 1 and the triangles to speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker. The figure shows the use of different numbers of centroids for the same data field. After calculation of the MFCC features and VQ codebooks, the Euclidean distance is calculated for nearest speech matching.
Cont.
The speech signal is captured by a microphone connected to the computer. The software that calculates the MFCC and the VQ (LBG algorithm) and recognizes the input speech from the microphone can be written in MATLAB 7.5. On the hardware side, an ARM microcontroller is used to make the DC motor respond to the commands; the microcontroller is programmed in Embedded C. The interfacing between the computer and the microcontroller is done over RS-232, and the driver IC L293D is used to drive the DC motor.
RESULTS
The code has been developed using the MFCC and VQ algorithms in MATLAB 7.5 on the Windows Vista platform, and the supporting hardware has also been implemented. The interfacing between hardware and software is done using an RS-232 cable (MAX-232 IC). The external DC motor can be driven in the forward or reverse direction, and can also be stopped, by giving speech commands; the MFCC features of the input are compared against the stored database at the time of speech recognition.
CONCLUSION
In this paper, MFCC and VQ techniques are used in speech recognition to control a DC motor drive through an ARM microcontroller, in order to control the movement of a wheelchair. The code developed in MATLAB using MFCC and VQ can also be used to control and drive stepper motors, servo motors, etc.