Speech Recognition
Contents

1 Introduction
  1.1 Introduction
  1.2 Necessity
  1.3 Objective
  1.4 Motivation
  1.5 Organization
2 Literature Survey
  2.1 Literature Survey
    2.1.1 Introduction of Speech
    2.1.2 Generation of Speech
    2.1.3 Classification of Speech
  2.2 Basic Speech Recognition System
    2.2.1 General Flow of Speech Recognition System
  2.3 Feature Extraction Techniques
    2.3.1 Linear Predictive Coding (LPC)
    2.3.2 Mel-Frequency Cepstral Coefficient (MFCC)
    2.3.3 Linear Prediction Cepstral Coefficient (LPCC)
  2.4 Speech Recognition Techniques
    2.4.1 Acoustic Phonetic Approach
    2.4.2 Pattern Recognition Approach
    2.4.3 Artificial Intelligence Approach (Knowledge Based Approach)
  2.5 Performance of System
    2.5.1 Gaps of Research
3 System Design
  3.1 System Architecture
  3.2 Preprocessing
    3.2.1 Short Time Energy (STE)
    3.2.2 Zero Crossing Rate (ZCR)
    3.2.3 Start Point End Point Detection
    3.2.4 Start Point End Point Detection Based on ZCR
  3.3 Feature Extraction
    3.3.1 MFCC
  3.4 Reference Database Template
  3.5 Dynamic Time Warping
    3.5.1 Restrictions on the Warping Function
    3.5.2 DTW Conditions
    3.5.3 Working of DTW Algorithm
  3.6 Design Logic

List of Figures

4.8 Optimal Paths, w
4.9 Optimal Path Distance

List of Tables
Chapter 1
Introduction
1.1 Introduction
Speech recognition, also known as Automatic Speech Recognition (ASR) or computer speech recognition, is the process of converting an acoustic speech signal into a machine-readable form. Speech is clearly one of the most important means of communication between humans, and it is the primacy of this medium that motivates research efforts to make speech a viable mode of human-computer interaction. Speech recognition is therefore a popular and active area of research, used to translate words spoken by humans into a form that computers can recognize. Speech signal identification is the process of converting a speech waveform into features that are useful for further processing. Many algorithms and techniques exist for this; their effectiveness depends on how well the features capture time, frequency, and energy information in a set of coefficients for cepstrum analysis. The human voice conveys much information, such as the gender, emotion, and identity of the speaker. The objective of voice recognition is to determine which speaker is present based on the spoken utterance. The voice is converted into digital form by sampling the signal, producing digital data representing the level of the signal at every discrete time step. There are many types of features, derived in different ways, each with an impact on the recognition rate. This project presents one technique for extracting a feature set from a speech signal that can be used in speech recognition systems. A speech recognition system performs two fundamental operations: signal modelling and pattern matching. Signal modelling is the process of converting the speech signal into a set of parameters. Pattern matching is the task of finding the parameter set stored in memory that most closely matches the parameter set obtained from the input speech signal. The digitized speech samples are processed using MFCC to produce voice features. The resulting feature coefficients are then passed to DTW, which selects the pattern that best matches the input frame against the database, minimizing the error between them. The most popular cepstrum-based methods for comparing patterns and measuring their similarity are MFCC and DTW. Both techniques can be implemented in MATLAB. This report presents the findings of a voice recognition study using the MFCC and DTW techniques.
1.2 Necessity
Speech recognition is a fascinating application of digital signal processing (DSP) that has
many real-world applications. Speech recognition can be used to automate many tasks
that previously required hands-on human interaction, such as recognizing simple spoken
commands to perform something like turning on lights or shutting a door. To increase
recognition rate, techniques such as neural networks, dynamic time warping and Hidden
Markov models have been used. Recent technological advances have made recognition
of more complex speech patterns possible. For example, there are fairly accurate speech
recognition software products on the market that take speech at normal conversational
pace and convert it to text so no typing is needed to create a document. Despite these
breakthroughs, however, current efforts are still far away from 100% recognition of natural
human speech. Much more research and development in this area are needed before DSP
even comes close to achieving the speech recognition ability of a human being. Therefore,
we consider this a challenging and worthwhile project that can be rewarding in many ways.
1.3 Objective
Speech recognition is an advanced form of decision making in which the input originates from the spoken word of a human user. Ideally, this is the only input that is required. There are many ways in which speech recognition can be implemented. It basically means talking to a computer, having it recognize what we are saying, and doing this in real time. The process fundamentally functions as a pipeline that converts audio into recognized speech. Speech recognition systems emerge as efficient alternatives for devices where typing becomes difficult because of small screen limitations.
1.4 Motivation
Speech recognition is a popular and active area of research, used to translate words spoken by humans into a form that computers can recognize. With the rapid development of computer hardware, software, and information technology, speech recognition is gradually becoming a key technology in computer information processing. Products built on speech recognition technology are also widely used in voice-activated telephone exchanges, information query networks, medical services, banking services, industrial control, and many other aspects of society and people's lives.
1.5 Organization
The first chapter of this report provides essential background and an introduction to speech signal processing, speech recognition, and speech recognition techniques. The remainder of this report comprises Chapter 2, Literature Survey; Chapter 3, System Architecture; Chapter 4, Implementation and Results; and Chapter 5, Conclusion.
Chapter 2
Literature Survey
• Isolated word: It accepts single words or single utterances at a time. The system has a "Listen and Non-Listen" state. "Isolated utterance" might be a better name for this class.
• Connected word: Connected word systems are similar to isolated-word systems but allow separate utterances to be "run together" with a minimum pause between them.
• Spontaneous speech: This is speech that is natural-sounding and not rehearsed. An SR system with spontaneous-speech ability should be able to handle a variety of natural speech features, such as words being run together.
• Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
• For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration), while adult males tend to have low pitch (slow vibration).
• For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but remain constantly open.
• The shape of your vocal tract determines the sound that you make.
• As you speak, your vocal tract changes its shape, producing different sounds.
• The shape of the vocal tract changes relatively slowly (on the scale of 10 ms to 100 ms).
• The amount of air coming from your lungs determines the loudness of your voice.
A speech frame is classified as either a:
1. Voiced signal
2. Unvoiced signal
This is accomplished by dividing the speech signal into short frames and computing the average power of each frame.
Voiced signal: The speech in a particular frame is declared voiced if its average power exceeds a threshold level chosen by the user. Voiced signals tend to be louder, like the vowels /A/, /E/, /I/, /O/, /U/.
Unvoiced signal: The speech in a particular frame is declared unvoiced if its average power does not exceed a threshold level chosen by the user. Unvoiced signals tend to be more abrupt, like the stop consonants /P/, /T/, /K/.
Pitch: For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice.
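As a concrete illustration of the frame-based voiced/unvoiced decision described above, here is a minimal Python sketch. The frame length and power threshold are assumed values; a real system would tune them to the recording conditions.

```python
import numpy as np

def label_frames(signal, frame_len=256, power_threshold=0.01):
    """Split a signal into short frames and label each frame 'voiced'
    or 'unvoiced' by comparing its average power to a user-chosen
    threshold, as described above."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        avg_power = np.mean(frame ** 2)  # average power of the frame
        labels.append("voiced" if avg_power > power_threshold else "unvoiced")
    return labels

# A loud (vowel-like) frame followed by a very quiet (unvoiced-like) frame.
sig = np.concatenate([0.5 * np.sin(np.linspace(0, 50, 256)),
                      0.001 * np.ones(256)])
print(label_frames(sig))  # → ['voiced', 'unvoiced']
```

In practice the threshold is often set relative to the background-noise level estimated from the first few frames of the recording.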
2.2 Basic Speech Recognition System
Speech recognition applications are becoming more and more useful nowadays. Various
interactive speech aware applications are available in the market. But they are usually
meant for and executed on the traditional general-purpose computers. With growth in the
needs for embedded computing and the demand for emerging embedded platforms, it is
required that the speech recognition systems (SRS) are available on them too. Personal
Digital Assistants (PDAs) and other handheld devices are becoming more and more pow-
erful and affordable as well. It has become possible to run multimedia on these devices.
Speech recognition systems emerge as efficient alternatives for such devices where typ-
ing becomes difficult. Speech Recognition is the process of automatically recognizing
a certain word spoken by a particular speaker based on individual information included in the speech waves. This technique makes it possible to use the speaker's voice to verify his/her identity and provide controlled access to services such as voice-based biometrics, database access services, voice-based dialling, voice mail, and remote access to computers.
Speech recognition basically means talking to a computer, having it recognize what we
are saying, and lastly, doing this in real time. There are many types of features, which are
derived differently and have good impact on the recognition rate. This project presents
one of the techniques to extract the feature set from a speech signal, which can be used
in speech recognition systems. Speech recognition system performs two fundamental op-
erations: signal modelling and pattern matching. Signal modelling represents process of
converting speech signal into a set of parameters. Pattern matching is the task of finding
parameter set from memory which closely matches the parameter set obtained from the
input speech signal. [2]
• Feature Extraction: Converting the sound waves into a parametric representation is a major part of any speech recognition approach. Both static and dynamic features of speech are used for the recognition task, because the vocal tract is not completely characterized by static parameters alone. Feature extraction is the most important part of speech recognition, as it distinguishes one speech sound from another.
• Network Training: The network training phase involves training the network according to the contents of the database. This includes choosing appropriate recognition techniques for the subsequent recognition process. The network works as a classifier with two learning methods: supervised and unsupervised learning. Classifiers are grouped on the basis of whether they accept continuous or binary inputs, and whether they employ supervised or unsupervised training.
• Testing or Decoding: Once all these details are given correctly, the decoder identifies the most likely match for the given input and returns the recognized word. Speech recognition engines match a detected word to a known word using one of the matching techniques and return the recognized word as the output of the system.
of the given input signal. Feature extraction is usually performed in three stages. The first stage is called speech analysis, or the acoustic front end. It performs some kind of spectro-temporal analysis of the signal and generates raw features describing the envelope of the power spectrum of short speech intervals. The second stage compiles an extended feature vector composed of static and dynamic features. Finally, the last stage (which is not always present) transforms these extended feature vectors into more compact and robust vectors that are then supplied to the recognizer. Although there is no real consensus on what the optimal feature set should look like, one usually wants features to have the following properties: they should allow an automatic system to discriminate between different though similar-sounding speech sounds; they should allow for the automatic creation of acoustic models for these sounds without the need for an excessive amount of training data; and they should exhibit statistics that are largely invariant across speakers and speaking environments.
Table 2.1: Feature extraction methods [1]

Method: Principal Component Analysis (PCA)
Property: Linear map; fast; eigenvector-based
Comments: Traditional eigenvector-based method, also known as Karhunen-Loève expansion; good for Gaussian data
The following figure shows the steps involved in MFCC feature extraction.
Figure 2.6: Speech Recognition Technique Classifications [2]
signal over time. Because of speaker and coarticulation effects, the acoustic properties of phonetic units are highly variable; nevertheless, this approach assumes that the rules governing the variability are straightforward and can readily be learned by a machine. The steps in the acoustic phonetic approach are as follows. The first step is the spectral analysis of speech, which describes the broad acoustic properties of the different phonetic units. The next step is segmentation and labelling of the speech, which results in a phoneme-lattice characterization of the speech. The last step is the determination of a valid word or string of words from the phonetic label sequences produced by the segmentation and labelling. This approach has not been widely used in commercial applications [3].
• Hidden Markov Model (HMM): A hidden Markov model is characterized by a finite-state Markov model and a set of output distributions. The transition parameters of the Markov chain model temporal variability, while the output distribution parameters model spectral variability. These two types of variability are essential for speech recognition. Hidden Markov modelling is more general and has a firmer mathematical foundation than the template-based approach. Compared to the knowledge-based approach, HMMs enable easy incorporation of knowledge sources into an organized architecture. A drawback of HMMs is that they do not provide much insight into the recognition process. To improve the performance of an HMM system, the errors of the system are analysed, but this is quite difficult. Nevertheless, judicious incorporation of knowledge has significantly improved HMM-based systems.
• Dynamic Time Warping (DTW): Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed [4]. DTW has been applied to video, audio, and graphics; in fact, any data which can be turned into a linear representation can be analysed with DTW. In general, DTW allows a computer to find an optimal match between two time series, where one time series may be "warped" non-linearly by stretching or shrinking it along its time axis. This warping between the two time series can then be used to find corresponding regions between them or to determine their similarity [4]. Continuity is less important in DTW than in other pattern-matching algorithms. The figure shows an example of how one time series is "warped" to match another.
In Figure 2.7, each vertical line connects a point in one time series to its correspondingly similar point in the other time series. The lines have similar values on the y-axis but have been separated so that the vertical lines between them can be viewed more easily. If both of the time series in Figure 2.7 were identical, all of the lines would be straight vertical lines, because no warping would be necessary to 'line up' the two time series. The warp-path distance is a measure of the difference between the two time series after they have been warped together, measured by the sum of the distances between each pair of points connected by the vertical lines in Figure 2.7. Thus, two time series that are identical except for localized stretching of the time axis will have a DTW distance of zero. The principle of DTW is to compare two dynamic patterns and measure their similarity by calculating a minimum distance between them [4].
• Word Error Rate (WER): Word error rate is a familiar measurement of the performance of a speech recognition or machine translation system. A general difficulty in performance measurement is that the recognized word sequence can have a different length from the reference word sequence. The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level [9]. The word error rate can be computed as:
WER = (S + D + I) / N   (2.2)

where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
N is the number of words in the reference.

WRR = 1 − WER = (N − S − D − I) / N = (H − I) / N   (2.3)

where H = N − S − D is the number of correctly recognized words.

Recognition accuracy = (Correctly recognized words / Total recognized words) × 100   (2.4)
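Equation (2.2) can be computed directly with a word-level Levenshtein distance. The following Python sketch counts substitutions, deletions, and insertions by dynamic programming; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance: WER = (S + D + I) / N,
    computed by dynamic programming over the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit operations turning ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                               # i deletions
    for j in range(m + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[n][m] / n

print(word_error_rate("turn on the light", "turn off the light"))  # → 0.25
```

One substitution ("on" → "off") over a four-word reference gives WER = 1/4.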
Table 2.2: Feature extraction methods [1]
• In previous work, ZCR, start-point and end-point detection, and STE were used, but without a window filter such as a Hamming, Hanning, or Blackman window to improve efficiency. Here a Hamming window filter is used. [7]
• Previous isolated-word recognition used MFCC with an ML feature classifier, but here the DTW (Dynamic Time Warping) algorithm is used for the feature matching process, because it is most useful for isolated word recognition and can be used together with MFCC. [12]
• In previous work, results based on the Euclidean distance were improved by using start-point and end-point detection, and are further improved here with a larger database sample. [4]
• Among feature extraction techniques such as PCA, LDA, and LPC, MFCC is the best of the above-mentioned techniques. Its frequency bands are positioned logarithmically (on the Mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. [3]
Chapter 3
System Design
3.1 System Architecture
The block diagram of the speech recognition system is shown in Figure 3.1, and a detailed description is given below. The speech input is taken using a microphone and is in analog form.
3.2 Preprocessing
It consists of the following:
3.2.1 Short Time Energy (STE)
The energy content of a set of samples is approximated by the sum of the squares of the samples. To calculate the STE, the speech signal is sampled using a rectangular window function of width ω samples, where ω << n. Within each window, the energy is computed as follows [7]:

e = Σ_{i=0}^{ω} x_i²   (3.1)

3.2.2 Zero Crossing Rate (ZCR)

The zero-crossing rate counts how often the signal changes sign within the window:

z = Σ_{i=0}^{ω} |sgn(x_i) − sgn(x_{i−1})| / 2   (3.2)
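Equations (3.1) and (3.2) translate directly into code. Here is a minimal Python sketch for a single analysis window; the function names are illustrative.

```python
import numpy as np

def short_time_energy(window):
    """Equation (3.1): sum of the squared samples in the window."""
    return float(np.sum(window ** 2))

def zero_crossing_rate(window):
    """Equation (3.2): half the sum of |sgn(x_i) - sgn(x_{i-1})|,
    i.e. the number of sign changes in the window."""
    signs = np.sign(window)
    return float(np.sum(np.abs(np.diff(signs))) / 2)

w = np.array([0.5, -0.5, 0.5, -0.5])   # alternating signs: 3 crossings
print(short_time_energy(w))            # → 1.0
print(zero_crossing_rate(w))           # → 3.0
```

Voiced frames typically show high STE and low ZCR, while unvoiced frames show the opposite, which is what the endpoint-detection logic below exploits.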
3.2.4 Start point End point detection based on ZCR
A threshold value is found by several observations of the signal. The part of the signal from the start to the start-point found by end-point detection, and the part from the end-point to the end of the signal, are checked for the zero-crossing rate. After comparing the zero-crossings with the threshold, the relevant part of the frame is selected and the start-point and end-point are adjusted. This is done according to the following conditions:
• If ZCR > 3 × threshold, then the end-point shifts one frame to the right, provided that the previous end-point is not in the last frame.
3.3 Feature Extraction
The goal is to find a set of properties of an utterance that have acoustic correlates in the speech signal, that is, parameters that can somehow be computed or estimated through processing of the signal waveform. Such parameters are termed features. Feature extraction is the parameterization of the speech signal. It typically includes converting the signal to digital form, measuring some important characteristics of the signal such as energy or frequency response, augmenting these measurements with some perceptually meaningful derived measurements, and statistically conditioning these numbers to form observation vectors. [2] The different feature extraction techniques are as follows:
• Mel Frequency Cepstral Coefficient (MFCC)
3.3.1 MFCC
MFCC is the most popular feature extraction technique for speech recognition. It approximates the human auditory system's response more closely than other techniques because its frequency bands are placed logarithmically. The overall process of MFCC is shown in Figure 3.2 below.
Step-1: Pre-Emphasis
This step passes the signal through a filter that emphasizes the higher frequencies, increasing the energy of the signal at high frequencies:

y[n] = x[n] − a · x[n−1]

Let us take a = 0.95, which means that 95% of any one sample is presumed to originate from the previous sample.
Step-2: Framing
This is the process of segmenting the speech samples obtained from analog-to-digital conversion (ADC) into small frames with lengths in the range of 20 to 40 ms. The voice signal is divided into frames of N samples. Adjacent frames are separated by M samples (M < N). Typical values are M = 100 and N = 256.
Step-3: Windowing
A Hamming window is used as the window shape, considering the next block in the feature extraction processing chain and integrating all the closest frequency lines. If the window is defined as W(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, Y[n] is the output signal, X(n) is the input signal, and W(n) is the Hamming window, then the result of windowing the signal is:

Y(n) = X(n) · W(n)

The Hamming window equation is given as:

W(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1   (3.5)
Step-4: Fast Fourier Transform
This step converts each frame of N samples from the time domain into the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal tract impulse response H[n] in the time domain into a multiplication in the frequency domain. This is expressed by the equation below [4]:

Y(ω) = FFT[h(t) ∗ x(t)] = H(ω) · X(ω)   (3.6)
Figure 3.3: Mel scale filter bank, from (Young et al., 1997) [4]
This figure shows the set of triangular filters that are used to compute a weighted sum of spectral components, so that the output of the process approximates a Mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at the centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters [7]. Each filter output is then the sum of its filtered spectral components. After that, equation 2.1 is used to compute the Mels for a given frequency f in Hz: mel(f) = 2595 · log10(1 + f / 700).
Step-6: Discrete Cosine Transform
This is the process of converting the log Mel spectrum back into the time domain using the Discrete Cosine Transform (DCT). The result of the conversion is called the Mel-Frequency Cepstral Coefficients. The set of coefficients is called an acoustic vector. Therefore, each input utterance is transformed into a sequence of acoustic vectors.
Step-7: Energy and Spectrum
As speech signals are random, there is a need to add features related to the change in cepstral features over time. For this purpose, energy and spectrum features are computed over a small interval (frame) of the speech signal. Mathematically, the energy in a frame for a signal x, in a window from time sample t1 to time sample t2, is represented as:

ENERGY = Σ_{t=t1}^{t2} x²[t]   (3.7)
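The steps above can be sketched end-to-end in Python with numpy alone. This is a simplified illustration, not the exact implementation used in this project: the sampling rate, number of filters, and number of coefficients are assumed values, while the framing parameters follow the N = 256, M = 100 choice mentioned in Step-2.

```python
import numpy as np

def mel(f):
    # Mel scale: mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, N=256, M=100, a=0.95, n_filters=20, n_ceps=12):
    # Step 1: pre-emphasis, y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Step 2: framing into N-sample frames separated by M samples
    starts = range(0, len(emphasized) - N + 1, M)
    frames = np.array([emphasized[s:s + N] for s in starts])
    # Step 3: Hamming window, W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1))
    frames = frames * np.hamming(N)
    # Step 4: power spectrum via FFT
    spectrum = np.abs(np.fft.rfft(frames, N)) ** 2
    # Step 5: triangular Mel filter bank, equally spaced on the Mel scale
    mel_points = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((N + 1) * inv_mel(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, N // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):       # rising edge of triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):      # falling edge of triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    log_energies = np.log(spectrum @ fbank.T + 1e-10)
    # Step 6: DCT of the log filter-bank energies -> cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energies @ dct.T

# One second of a 440 Hz tone at 8 kHz -> one 12-coefficient vector per frame.
feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(8000) / 8000))
print(feats.shape)  # → (78, 12)
```

The sequence of rows returned here corresponds to the "sequence of acoustic vectors" mentioned in Step-6; these are the vectors stored in the reference database and later compared with DTW.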
Table 3.1: Training Requirement
All the pre-recorded words used to train the system are stored in the database. During testing, this database is consulted again. The words are recorded for the purpose of operating a robot by speech.
Figure 3.4: Distance Grid [13]
• Monotonicity
This property states that the alignment path does not go back in the "time" index. Thus, it guarantees that features are not repeated in the alignment, i.e. i_{s−1} ≤ i_s and j_{s−1} ≤ j_s.
• Continuity
This property states that the alignment path does not jump in the "time" index, i.e. i_s − i_{s−1} ≤ 1 and j_s − j_{s−1} ≤ 1.
Figure 3.6: Continuity Conditions [13]
• Warping Window
This property states that a good alignment path is unlikely to wander too far from the diagonal: |i_s − j_s| ≤ r, where r > 0 is the window length.
• Slope Constraints
The alignment path should be neither too steep nor too shallow.
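The restrictions above can be checked mechanically on a candidate alignment path. Below is a small illustrative helper (a hypothetical function, using 0-based indices; the slope constraint is omitted for brevity).

```python
def valid_warping_path(path, r):
    """Check monotonicity, continuity, and the warping window on a
    path given as a list of (i, j) index pairs."""
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if i1 < i0 or j1 < j0:           # monotonicity: no going back
            return False
        if i1 - i0 > 1 or j1 - j0 > 1:   # continuity: no jumps
            return False
    return all(abs(i - j) <= r for i, j in path)  # warping window

print(valid_warping_path([(0, 0), (1, 1), (2, 1), (3, 2)], r=2))  # → True
print(valid_warping_path([(0, 0), (2, 1)], r=2))                  # → False
```

The second call fails because the step from (0, 0) to (2, 1) skips an index, violating continuity.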
Figure 3.9: DTW Conditions [13]
3.5.3 Working of DTW Algorithm
Consider two sequences of feature vector in an n-dimensional space, time series A and
time series B. The two sequences are aligned on the sides of a grid, with one on the top
and other on the left hand side. Both sequences start on the bottom left of the grid. To
compute DTW, following are the steps:
• Start with the first row: g(1, 1) = d(1, 1), and g(i, 1) = g(i−1, 1) + d(i, 1) for the remaining cells.
• Move to the second row: g(i, 2) = min(g(i, 1), g(i−1, 1), g(i−1, 2)) + d(i, 2). Bookkeep for each cell the index of the neighbouring cell that contributes the minimum score (red arrows).
• Carry on from left to right and from bottom to top for the rest of the grid: g(i, j) = min(g(i, j−1), g(i−1, j−1), g(i−1, j)) + d(i, j).
• Trace back the best path through the grid, starting from g(n, m) and moving towards g(1, 1) by following the red arrows.
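The steps above can be sketched as a short Python function that fills the grid g and traces the path back. The local distance d(i, j) is taken here as the absolute difference between scalar samples; for MFCC vectors it would be, say, a Euclidean distance.

```python
import numpy as np

def dtw(a, b):
    """Fill the cumulative-distance grid using
    g(i, j) = min(g(i, j-1), g(i-1, j-1), g(i-1, j)) + d(i, j),
    then trace the best path back from g(n, m) to g(1, 1)."""
    n, m = len(a), len(b)
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])    # local distance d(i, j)
            g[i, j] = min(g[i, j - 1], g[i - 1, j - 1], g[i - 1, j]) + d
    # Trace back the optimal path from (n, m) to (1, 1).
    path, i, j = [(n, m)], n, m
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 1 and s[1] >= 1),
                   key=lambda s: g[s])
        path.append((i, j))
    return g[n, m], path[::-1]

dist, path = dtw([1, 2, 3, 3], [1, 2, 2, 3])
print(dist)  # → 0.0 (the sequences align perfectly after warping)
```

Because the second sequence repeats the value 2 and the first repeats the value 3, the path stretches each series where needed and the total warped distance is zero.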
3.6 Design Logic
The decision is made using the DTW matching technique to select the best match between the reference file and the test file. The decision is based on two criteria: the minimum distance and the maximum correlation between the two sequences. The reference MFCC vector selected is the one with the minimum distance to (or maximum correlation with) the test MFCC vector.
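The minimum-distance decision can be sketched as follows. The reference database here is a hypothetical dict mapping word labels to feature sequences, and the distance function is a placeholder standing in for the DTW distance of Section 3.5.

```python
def recognize(test_features, reference_db, distance):
    """Minimum-distance decision: compare the test features against
    every reference template and pick the reference with the
    smallest distance."""
    best_word, best_dist = None, float("inf")
    for word, ref_features in reference_db.items():
        d = distance(ref_features, test_features)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist

# Toy 1-D "feature" sequences standing in for MFCC vector sequences.
db = {"ready": [1, 2, 3], "stop": [5, 5, 1]}
simple_dist = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))  # placeholder metric
print(recognize([1, 2, 4], db, simple_dist))  # → ('ready', 1)
```

Swapping `simple_dist` for the DTW distance computed in Section 3.5.3 gives the matching scheme this design uses.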
Chapter 4
System Performance Analysis
• Processor: Intel(R) Pentium(R) Dual CPU T2390 @ 1.86 GHz
• RAM: 2 GB
Step-1: Input
• Filtering
Figure 4.1: Input File: ready1.wav
Figure 4.3: Speech Signal after Silence Removal, Including Start Point and End Point Only
• Windowing
• MFCC feature vector: Applying the above process to the given template produces an MFCC feature vector table. The reference database for the given template is now created.
Figure 4.4: Speech Signal after Pre-Emphasis And Framing
Figure 4.6: Reference Database (Template)
Step-5: DTW Pattern Matching
Reference file (r): ready1.wav
Testing file (t): test.wav
All the pre-processing is applied to the test file, and then the MFCC features of the test file are calculated for comparison with the reference file using the DTW comparison technique. DTW computes the following parameters to measure the similarity between the test and reference files.
• Optimal Path, w
Figure 4.8: Optimal Paths, w