Speech Recognition
Contents

1 Introduction
  1.1 Introduction
  1.2 Necessity
  1.3 Objective
  1.4 Motivation
  1.5 Organization
2 Literature Survey
  2.1 Literature Survey
    2.1.1 Introduction of Speech
    2.1.2 Generation of Speech
    2.1.3 Classification of Speech
  2.2 Basic Speech Recognition System
    2.2.1 General Flow of Speech Recognition System
  2.3 Feature Extraction Techniques
    2.3.1 Linear Predictive Coding (LPC)
    2.3.2 Mel-Frequency Cepstral Coefficient (MFCC)
    2.3.3 Linear Prediction Cepstral Coefficient (LPCC)
  2.4 Speech Recognition Techniques
    2.4.1 Acoustic Phonetic Approach
    2.4.2 Pattern Recognition Approach
    2.4.3 Artificial Intelligence Approach (Knowledge Based Approach)
  2.5 Performance of System
    2.5.1 Gaps of Research
3 System Design
  3.1 System Architecture
  3.2 Preprocessing
    3.2.1 Short Time Energy (STE)
    3.2.2 Zero Crossing Rate (ZCR)
    3.2.3 Start Point End Point Detection
    3.2.4 Start Point End Point Detection Based on ZCR
  3.3 Feature Extraction
    3.3.1 MFCC
  3.4 Reference Database Template
  3.5 Dynamic Time Warping
    3.5.1 Restrictions on the Warping Function
    3.5.2 DTW Conditions
    3.5.3 Working of DTW Algorithm
  3.6 Design Logic

List of Figures

4.8 Optimal Paths, w
4.9 Optimal Path Distance

List of Tables
Chapter 1
Introduction
1.1 Introduction
Speech recognition, also known as Automatic Speech Recognition (ASR) or computer speech recognition, is the process of converting an acoustic speech signal into a machine-readable form. Speech is clearly one of the most important means of communication between humans, and it is the primacy of this medium that motivates research efforts to make speech a viable mode of human-computer interaction. Speech recognition is therefore a popular and active area of research, used to translate words spoken by humans into a form that computers can recognize. Speech signal identification is the process of converting a speech waveform into features that are useful for further processing. Many algorithms and techniques exist for this; their effectiveness depends on how well the features capture time, frequency, and energy information in a set of coefficients for cepstrum analysis. The human voice conveys much information, such as the gender, emotion, and identity of the speaker. The objective of voice recognition is to determine which speaker is present based on the spoken utterance. The voice is converted into digital form by sampling the signal, producing digital data representing the level of the signal at every discrete time step. There are many types of features, derived in different ways, each with an impact on the recognition rate. This project presents one technique for extracting a feature set from a speech signal that can be used in speech recognition systems. A speech recognition system performs two fundamental operations: signal modelling and pattern matching. Signal modelling is the process of converting the speech signal into a set of parameters. Pattern matching is the task of finding the parameter set stored in memory that most closely matches the parameter set obtained from the input speech signal. The digitized speech samples are processed using MFCC to produce voice features. The resulting feature coefficients are then passed to DTW, which selects the pattern that best matches the input frame against the database, minimizing the error between them. The most popular cepstrum-based methods for comparing patterns and measuring their similarity are MFCC and DTW. Both techniques can be implemented in MATLAB. This report presents the findings of a voice recognition study using the MFCC and DTW techniques.
1.2 Necessity
Speech recognition is a fascinating application of digital signal processing (DSP) that has
many real-world applications. Speech recognition can be used to automate many tasks
that previously required hands-on human interaction, such as recognizing simple spoken
commands to perform something like turning on lights or shutting a door. To increase
recognition rate, techniques such as neural networks, dynamic time warping and Hidden
Markov models have been used. Recent technological advances have made recognition
of more complex speech patterns possible. For example, there are fairly accurate speech
recognition software products on the market that take speech at normal conversational
pace and convert it to text so no typing is needed to create a document. Despite these
breakthroughs, however, current efforts are still far away from 100% recognition of natural
human speech. Much more research and development in this area are needed before DSP
even comes close to achieving the speech recognition ability of a human being. Therefore,
we consider this a challenging and worthwhile project that can be rewarding in many ways.
1.3 Objective
Speech recognition is an advanced form of decision making in which the input originates from the spoken word of a human user. Ideally, this is the only input that is required. There are many ways in which speech recognition can be implemented. It basically means talking to a computer, having it recognize what we are saying, and doing this in real time. The process fundamentally functions as a pipeline that converts audio into recognized speech. Speech recognition systems emerge as efficient alternatives for devices where typing becomes difficult because of small screen limitations.
1.4 Motivation
Speech recognition is a popular and active area of research, used to translate words spoken by humans into a form that computers can recognize. With the rapid development of computer hardware, software, and information technology, speech recognition is gradually becoming a key technology in computer information processing. Products built on speech recognition technology are also widely used in voice-activated telephone exchanges, information query networks, medical services, banking services, industrial control, and many other aspects of society and people's lives.
1.5 Organization
The first chapter of this report provides essential background and an introduction to speech signal processing, speech recognition, and speech recognition techniques. The remainder of this report comprises Chapter 2, Literature Survey; Chapter 3, System Architecture; Chapter 4, Implementation and Results; and Chapter 5, Conclusion.
Chapter 2
Literature Survey
• Isolated word: It accepts single words or single utterances at a time. The system has a "Listen and Non-Listen" state. "Isolated utterance" might be a better name for this class.
• Connected word: Connected word systems are similar to isolated-word systems but allow separate utterances to be "run together" with a minimum pause between them.
• Spontaneous speech: This is speech that is natural-sounding and not rehearsed. An SR system with spontaneous-speech ability should be able to handle a variety of natural speech features, such as words being run together.
• Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
• For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration), while adult males tend to have low pitch (slow vibration).
• For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but remain constantly open.
• The shape of your vocal tract determines the sound that you make.
• As you speak, your vocal tract changes its shape, producing different sounds.
• The shape of the vocal tract changes relatively slowly (on the scale of 10 ms to 100 ms).
• The amount of air coming from your lungs determines the loudness of your voice.
A speech frame is classified as either a:
1. Voiced signal
2. Unvoiced signal
This is accomplished by dividing the speech signal into short frames and computing the average power of each frame.
Voiced signal: The speech in a particular frame is declared voiced if its average power exceeds a threshold level chosen by the user. Voiced signals tend to be louder, like the vowels /A/, /E/, /I/, /O/, /U/.
Unvoiced signal: The speech in a particular frame is declared unvoiced if its average power does not exceed a threshold level chosen by the user. Unvoiced signals tend to be more abrupt, like the stop consonants /P/, /T/, /K/.
Pitch: For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice.
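As a concrete illustration of the frame-based voiced/unvoiced decision described above, here is a minimal Python sketch. The frame length and power threshold are assumed values; a real system would tune them to the recording conditions.

```python
import numpy as np

def label_frames(signal, frame_len=256, power_threshold=0.01):
    """Split a signal into short frames and label each frame 'voiced'
    or 'unvoiced' by comparing its average power to a user-chosen
    threshold, as described above."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        avg_power = np.mean(frame ** 2)  # average power of the frame
        labels.append("voiced" if avg_power > power_threshold else "unvoiced")
    return labels

# A loud (vowel-like) frame followed by a very quiet (unvoiced-like) frame.
sig = np.concatenate([0.5 * np.sin(np.linspace(0, 50, 256)),
                      0.001 * np.ones(256)])
print(label_frames(sig))  # → ['voiced', 'unvoiced']
```

In practice the threshold is often set relative to the background-noise level estimated from the first few frames of the recording.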
2.2 Basic Speech Recognition System
Speech recognition applications are becoming more and more useful nowadays. Various
interactive speech aware applications are available in the market. But they are usually
meant for and executed on the traditional general-purpose computers. With growth in the
needs for embedded computing and the demand for emerging embedded platforms, it is
required that the speech recognition systems (SRS) are available on them too. Personal
Digital Assistants (PDAs) and other handheld devices are becoming more and more pow-
erful and affordable as well. It has become possible to run multimedia on these devices.
Speech recognition systems emerge as efficient alternatives for such devices where typ-
ing becomes difficult. Speech Recognition is the process of automatically recognizing
a certain word spoken by a particular speaker based on individual information included in the speech waves. This technique makes it possible to use the speaker's voice to verify his/her identity and provide controlled access to services such as voice-based biometrics, database access services, voice-based dialling, voice mail, and remote access to computers.
Speech recognition basically means talking to a computer, having it recognize what we
are saying, and lastly, doing this in real time. There are many types of features, which are
derived differently and have good impact on the recognition rate. This project presents
one of the techniques to extract the feature set from a speech signal, which can be used
in speech recognition systems. Speech recognition system performs two fundamental op-
erations: signal modelling and pattern matching. Signal modelling represents process of
converting speech signal into a set of parameters. Pattern matching is the task of finding
parameter set from memory which closely matches the parameter set obtained from the
input speech signal. [2]
• Feature Extraction: Converting the sound waves into a parametric representation is a major part of any speech recognition approach. Both static and dynamic features of speech are used for the recognition task, because the vocal tract is not completely characterized by static parameters alone. Feature extraction is the most important part of speech recognition, as it distinguishes one speech sound from another.
• Network Training: The network training phase involves training the network according to the contents of the database. This includes choosing appropriate recognition techniques for the subsequent recognition process. The network works as a classifier with two learning methods: supervised and unsupervised learning. Classifiers are grouped on the basis of whether they accept continuous or binary inputs, and whether they employ supervised or unsupervised training.
• Testing or Decoding: Once all these details are given correctly, the decoder identifies the most likely match for the given input and returns the recognized word. Speech recognition engines match a detected word to a known word using one of the matching techniques and return the recognized word as the output of the system.
of the given input signal. Feature extraction is usually performed in three stages. The first stage is called speech analysis, or the acoustic front end. It performs some kind of spectro-temporal analysis of the signal and generates raw features describing the envelope of the power spectrum of short speech intervals. The second stage compiles an extended feature vector composed of static and dynamic features. Finally, the last stage (which is not always present) transforms these extended feature vectors into more compact and robust vectors that are then supplied to the recognizer. Although there is no real consensus on what the optimal feature set should look like, one usually wants features to have the following properties: they should allow an automatic system to discriminate between different though similar-sounding speech sounds; they should allow for the automatic creation of acoustic models for these sounds without the need for an excessive amount of training data; and they should exhibit statistics that are largely invariant across speakers and speaking environments.
Table 2.1: Feature extraction methods [1]

Method: Principal Component Analysis (PCA)
Property: Linear map; fast; eigenvector-based
Comments: Traditional eigenvector-based method, also known as Karhunen-Loève expansion; good for Gaussian data
The following figure shows the steps involved in MFCC feature extraction.
Figure 2.6: Speech Recognition Technique Classifications [2]
signal over time. Because of speaker and coarticulation effects, the acoustic properties of phonetic units are highly variable; nevertheless, this approach assumes that the rules governing the variability are straightforward and can readily be learned by a machine. The steps in the acoustic phonetic approach are as follows. The first step is the spectral analysis of speech, which describes the broad acoustic properties of the different phonetic units. The next step is segmentation and labelling of the speech, which results in a phoneme-lattice characterization of the speech. The last step is the determination of a valid word or string of words from the phonetic label sequences produced by the segmentation and labelling. This approach has not been widely used in commercial applications [3].
• Hidden Markov Model (HMM): A hidden Markov model is characterized by a finite-state Markov model and a set of output distributions. The transition parameters of the Markov chain model temporal variability, while the output distribution parameters model spectral variability. These two types of variability are essential for speech recognition. Hidden Markov modelling is more general and has a firmer mathematical foundation than the template-based approach. Compared to the knowledge-based approach, HMMs enable easy incorporation of knowledge sources into an organized architecture. A drawback of HMMs is that they do not provide much insight into the recognition process. To improve the performance of an HMM system, the errors of the system are analysed, but this is quite difficult. Nevertheless, judicious incorporation of knowledge has significantly improved HMM-based systems.
• Dynamic Time Warping (DTW): Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed [4]. DTW has been applied to video, audio, and graphics; in fact, any data which can be turned into a linear representation can be analysed with DTW. In general, DTW allows a computer to find an optimal match between two time series, where one time series may be "warped" non-linearly by stretching or shrinking it along its time axis. This warping between the two time series can then be used to find corresponding regions between them or to determine their similarity [4]. Continuity is less important in DTW than in other pattern-matching algorithms. The figure shows an example of how one time series is "warped" to match another.
In Figure 2.7, each vertical line connects a point in one time series to its correspondingly similar point in the other time series. The lines have similar values on the y-axis but have been separated so that the vertical lines between them can be viewed more easily. If both of the time series in Figure 2.7 were identical, all of the lines would be straight vertical lines, because no warping would be necessary to 'line up' the two time series. The warp-path distance is a measure of the difference between the two time series after they have been warped together, measured by the sum of the distances between each pair of points connected by the vertical lines in Figure 2.7. Thus, two time series that are identical except for localized stretching of the time axis will have a DTW distance of zero. The principle of DTW is to compare two dynamic patterns and measure their similarity by calculating a minimum distance between them [4].
• Word Error Rate (WER): Word error rate is a familiar measurement of the performance of a speech recognition or machine translation system. A general difficulty in performance measurement is that the recognized word sequence can have a different length from the reference word sequence. The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level [9]. The word error rate can be computed as:
WER = (S + D + I) / N   (2.2)

where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
N is the number of words in the reference.

WRR = 1 − WER = (N − S − D − I) / N = (H − I) / N   (2.3)

where H = N − S − D is the number of correctly recognized words.

Recognition accuracy = (Correctly recognized words / Total recognized words) × 100   (2.4)
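Equation (2.2) can be computed directly with a word-level Levenshtein distance. The following Python sketch counts substitutions, deletions, and insertions by dynamic programming; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance: WER = (S + D + I) / N,
    computed by dynamic programming over the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit operations turning ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                               # i deletions
    for j in range(m + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[n][m] / n

print(word_error_rate("turn on the light", "turn off the light"))  # → 0.25
```

One substitution ("on" → "off") over a four-word reference gives WER = 1/4.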
Table 2.2: Feature extraction methods [1]
• In previous work, ZCR, start-point and end-point detection, and STE were used, but without a window filter such as a Hamming, Hanning, or Blackman window to improve efficiency. Here a Hamming window filter is used. [7]
• Previous isolated-word recognition used MFCC with an ML feature classifier, but here the DTW (Dynamic Time Warping) algorithm is used for the feature matching process, because it is most useful for isolated word recognition and can be used together with MFCC. [12]
• In previous work, results based on the Euclidean distance were improved by using start-point and end-point detection, and are further improved here with a larger database sample. [4]
• Among feature extraction techniques such as PCA, LDA, and LPC, MFCC is the best of the above-mentioned techniques. Its frequency bands are positioned logarithmically (on the Mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. [3]
Chapter 3
System Design
3.1 System Architecture
The block diagram of the speech recognition system is shown in Figure 3.1, and a detailed description is given below. The speech input is taken using a microphone and is in analog form.
3.2 Preprocessing
It consists of the following:
3.2.1 Short Time Energy (STE)
The energy content of a set of samples is approximated by the sum of the squares of the samples. To calculate the STE, the speech signal is sampled using a rectangular window function of width ω samples, where ω << n. Within each window, the energy is computed as follows [7]:

e = Σ_{i=0}^{ω} x_i²   (3.1)

3.2.2 Zero Crossing Rate (ZCR)

The zero-crossing rate counts how often the signal changes sign within the window:

z = Σ_{i=0}^{ω} |sgn(x_i) − sgn(x_{i−1})| / 2   (3.2)
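Equations (3.1) and (3.2) translate directly into code. Here is a minimal Python sketch for a single analysis window; the function names are illustrative.

```python
import numpy as np

def short_time_energy(window):
    """Equation (3.1): sum of the squared samples in the window."""
    return float(np.sum(window ** 2))

def zero_crossing_rate(window):
    """Equation (3.2): half the sum of |sgn(x_i) - sgn(x_{i-1})|,
    i.e. the number of sign changes in the window."""
    signs = np.sign(window)
    return float(np.sum(np.abs(np.diff(signs))) / 2)

w = np.array([0.5, -0.5, 0.5, -0.5])   # alternating signs: 3 crossings
print(short_time_energy(w))            # → 1.0
print(zero_crossing_rate(w))           # → 3.0
```

Voiced frames typically show high STE and low ZCR, while unvoiced frames show the opposite, which is what the endpoint-detection logic below exploits.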
3.2.4 Start point End point detection based on ZCR
A threshold value is found by several observations of the signal. The part of the signal from the start to the start-point found by end-point detection, and the part from the end-point to the end of the signal, are checked for the zero-crossing rate. After comparing the zero-crossings with the threshold, the relevant part of the frame is selected and the start-point and end-point are adjusted. This is done according to the following conditions:
• If ZCR > 3 × threshold, then the end-point shifts one frame to the right, provided that the previous end-point is not in the last frame.
3.3 Feature Extraction
The goal is to find a set of properties of an utterance that have acoustic correlates in the speech signal, that is, parameters that can somehow be computed or estimated through processing of the signal waveform. Such parameters are termed features. Feature extraction is the parameterization of the speech signal. It typically includes converting the signal to digital form, measuring some important characteristics of the signal such as energy or frequency response, augmenting these measurements with some perceptually meaningful derived measurements, and statistically conditioning these numbers to form observation vectors. [2] The different feature extraction techniques are as follows:
• Mel Frequency Cepstral Coefficient (MFCC)
3.3.1 MFCC
MFCC is the most popular feature extraction technique for speech recognition. It approximates the human auditory system's response more closely than other techniques because its frequency bands are placed logarithmically. The overall process of MFCC is shown in Figure 3.2 below.
Step-1: Pre-Emphasis
This step passes the signal through a filter that emphasizes the higher frequencies, increasing the energy of the signal at high frequencies:

y[n] = x[n] − a · x[n−1]

Let us take a = 0.95, which means that 95% of any one sample is presumed to originate from the previous sample.
Step-2: Framing
This is the process of segmenting the speech samples obtained from analog-to-digital conversion (ADC) into small frames with lengths in the range of 20 to 40 ms. The voice signal is divided into frames of N samples. Adjacent frames are separated by M samples (M < N). Typical values are M = 100 and N = 256.
Step-3: Windowing
A Hamming window is used as the window shape, considering the next block in the feature extraction processing chain and integrating all the closest frequency lines. If the window is defined as W(n), 0 ≤ n ≤ N − 1, where N is the number of samples in each frame, Y[n] is the output signal, X(n) is the input signal, and W(n) is the Hamming window, then the result of windowing the signal is:

Y(n) = X(n) · W(n)

The Hamming window equation is given as:

W(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1   (3.5)
Step-4: Fast Fourier Transform
This step converts each frame of N samples from the time domain into the frequency domain. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal tract impulse response H[n] in the time domain into a multiplication in the frequency domain. This is expressed by the equation below [4]:

Y(ω) = FFT[h(t) ∗ x(t)] = H(ω) · X(ω)   (3.6)
Figure 3.3: Mel scale filter bank, from (Young et al., 1997) [4]
This figure shows the set of triangular filters that are used to compute a weighted sum of spectral components, so that the output of the process approximates a Mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at the centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters [7]. Each filter output is then the sum of its filtered spectral components. After that, equation 2.1 is used to compute the Mels for a given frequency f in Hz: mel(f) = 2595 · log10(1 + f / 700).
Step-6: Discrete Cosine Transform
This is the process of converting the log Mel spectrum back into the time domain using the Discrete Cosine Transform (DCT). The result of the conversion is called the Mel-Frequency Cepstral Coefficients. The set of coefficients is called an acoustic vector. Therefore, each input utterance is transformed into a sequence of acoustic vectors.
Step-7: Energy and Spectrum
As speech signals are random, there is a need to add features related to the change in cepstral features over time. For this purpose, energy and spectrum features are computed over a small interval (frame) of the speech signal. Mathematically, the energy in a frame for a signal x, in a window from time sample t1 to time sample t2, is represented as:

ENERGY = Σ_{t=t1}^{t2} x²[t]   (3.7)
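The steps above can be sketched end-to-end in Python with numpy alone. This is a simplified illustration, not the exact implementation used in this project: the sampling rate, number of filters, and number of coefficients are assumed values, while the framing parameters follow the N = 256, M = 100 choice mentioned in Step-2.

```python
import numpy as np

def mel(f):
    # Mel scale: mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, N=256, M=100, a=0.95, n_filters=20, n_ceps=12):
    # Step 1: pre-emphasis, y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Step 2: framing into N-sample frames separated by M samples
    starts = range(0, len(emphasized) - N + 1, M)
    frames = np.array([emphasized[s:s + N] for s in starts])
    # Step 3: Hamming window, W(n) = 0.54 - 0.46 cos(2*pi*n / (N-1))
    frames = frames * np.hamming(N)
    # Step 4: power spectrum via FFT
    spectrum = np.abs(np.fft.rfft(frames, N)) ** 2
    # Step 5: triangular Mel filter bank, equally spaced on the Mel scale
    mel_points = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((N + 1) * inv_mel(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, N // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):       # rising edge of triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):      # falling edge of triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    log_energies = np.log(spectrum @ fbank.T + 1e-10)
    # Step 6: DCT of the log filter-bank energies -> cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energies @ dct.T

# One second of a 440 Hz tone at 8 kHz -> one 12-coefficient vector per frame.
feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(8000) / 8000))
print(feats.shape)  # → (78, 12)
```

The sequence of rows returned here corresponds to the "sequence of acoustic vectors" mentioned in Step-6; these are the vectors stored in the reference database and later compared with DTW.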
Table 3.1: Training Requirement
All the pre-recorded words used to train the system are stored in the database. During testing, this database is consulted again. The words are recorded for the purpose of operating a robot by speech.
Figure 3.4: Distance Grid [13]
• Monotonicity
This property states that the alignment path does not go back in the "time" index. Thus, it guarantees that features are not repeated in the alignment, i.e. i_{s−1} ≤ i_s and j_{s−1} ≤ j_s.
• Continuity
This property states that the alignment path does not jump in the "time" index, i.e. i_s − i_{s−1} ≤ 1 and j_s − j_{s−1} ≤ 1.
Figure 3.6: Continuity Conditions [13]
• Warping Window
This property states that a good alignment path is unlikely to wander too far from the diagonal: |i_s − j_s| ≤ r, where r > 0 is the window length.
• Slope Constraints
The alignment path should be neither too steep nor too shallow.
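The restrictions above can be checked mechanically on a candidate alignment path. Below is a small illustrative helper (a hypothetical function, using 0-based indices; the slope constraint is omitted for brevity).

```python
def valid_warping_path(path, r):
    """Check monotonicity, continuity, and the warping window on a
    path given as a list of (i, j) index pairs."""
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        if i1 < i0 or j1 < j0:           # monotonicity: no going back
            return False
        if i1 - i0 > 1 or j1 - j0 > 1:   # continuity: no jumps
            return False
    return all(abs(i - j) <= r for i, j in path)  # warping window

print(valid_warping_path([(0, 0), (1, 1), (2, 1), (3, 2)], r=2))  # → True
print(valid_warping_path([(0, 0), (2, 1)], r=2))                  # → False
```

The second call fails because the step from (0, 0) to (2, 1) skips an index, violating continuity.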
Figure 3.9: DTW Conditions [13]
3.5.3 Working of DTW Algorithm
Consider two sequences of feature vector in an n-dimensional space, time series A and
time series B. The two sequences are aligned on the sides of a grid, with one on the top
and other on the left hand side. Both sequences start on the bottom left of the grid. To
compute DTW, following are the steps:
• Start with the first row: g(1, 1) = d(1, 1), and g(i, 1) = g(i−1, 1) + d(i, 1) for the remaining cells.
• Move to the second row: g(i, 2) = min(g(i, 1), g(i−1, 1), g(i−1, 2)) + d(i, 2). Bookkeep for each cell the index of the neighbouring cell that contributes the minimum score (red arrows).
• Carry on from left to right and from bottom to top for the rest of the grid: g(i, j) = min(g(i, j−1), g(i−1, j−1), g(i−1, j)) + d(i, j).
• Trace back the best path through the grid, starting from g(n, m) and moving towards g(1, 1) by following the red arrows.
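The steps above can be sketched as a short Python function that fills the grid g and traces the path back. The local distance d(i, j) is taken here as the absolute difference between scalar samples; for MFCC vectors it would be, say, a Euclidean distance.

```python
import numpy as np

def dtw(a, b):
    """Fill the cumulative-distance grid using
    g(i, j) = min(g(i, j-1), g(i-1, j-1), g(i-1, j)) + d(i, j),
    then trace the best path back from g(n, m) to g(1, 1)."""
    n, m = len(a), len(b)
    g = np.full((n + 1, m + 1), np.inf)
    g[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])    # local distance d(i, j)
            g[i, j] = min(g[i, j - 1], g[i - 1, j - 1], g[i - 1, j]) + d
    # Trace back the optimal path from (n, m) to (1, 1).
    path, i, j = [(n, m)], n, m
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 1 and s[1] >= 1),
                   key=lambda s: g[s])
        path.append((i, j))
    return g[n, m], path[::-1]

dist, path = dtw([1, 2, 3, 3], [1, 2, 2, 3])
print(dist)  # → 0.0 (the sequences align perfectly after warping)
```

Because the second sequence repeats the value 2 and the first repeats the value 3, the path stretches each series where needed and the total warped distance is zero.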
3.6 Design Logic
The decision is made using the DTW matching technique to select the best match between the reference file and the test file. The decision is based on two criteria: the minimum distance and the maximum correlation between the two sequences. The reference MFCC vector selected is the one with the minimum distance to (or maximum correlation with) the test MFCC vector.
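The minimum-distance decision can be sketched as follows. The reference database here is a hypothetical dict mapping word labels to feature sequences, and the distance function is a placeholder standing in for the DTW distance of Section 3.5.

```python
def recognize(test_features, reference_db, distance):
    """Minimum-distance decision: compare the test features against
    every reference template and pick the reference with the
    smallest distance."""
    best_word, best_dist = None, float("inf")
    for word, ref_features in reference_db.items():
        d = distance(ref_features, test_features)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word, best_dist

# Toy 1-D "feature" sequences standing in for MFCC vector sequences.
db = {"ready": [1, 2, 3], "stop": [5, 5, 1]}
simple_dist = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))  # placeholder metric
print(recognize([1, 2, 4], db, simple_dist))  # → ('ready', 1)
```

Swapping `simple_dist` for the DTW distance computed in Section 3.5.3 gives the matching scheme this design uses.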
Chapter 4
System Performance Analysis
• Processor: Intel(R) Pentium(R) Dual CPU T2390 @ 1.86 GHz
• RAM: 2 GB
Step-1: Input
• Filtering
Figure 4.1: Input File: ready1.wav
Figure 4.3: Speech Signal after Silence Removal, Including Start Point and End Point Only
• Windowing
• MFCC feature vector: Applying the above process to the given template produces an MFCC feature vector table. The reference database for the given template is now created.
Figure 4.4: Speech Signal after Pre-Emphasis And Framing
Figure 4.6: Reference Database (Template)
Step-5: DTW Pattern Matching
Reference file (r): ready1.wav
Testing file (t): test.wav
All the pre-processing is applied to the test file, and then the MFCC features of the test file are calculated for comparison with the reference file using the DTW comparison technique. DTW computes the following parameters to measure the similarity between the test and reference files.
• Optimal Path, w
Figure 4.8: Optimal Paths, w