
Int. J. Computational Systems Engineering, Vol. 1, No. 1, 2012, p.25

A Hindi speech recognition system for connected words using HTK

Kuldeep Kumar* and R.K. Aggarwal


Department of Computer Engineering,
National Institute of Technology,
Kurukshetra-136119, Haryana, India
E-mail: kuldeepgargkkr@gmail.com
E-mail: rka15969@gmail.com
*Corresponding author

Ankita Jain
Department of Electronics and Communication Engineering,
National Institute of Technology,
Kurukshetra-136119, Haryana, India
E-mail: ankitajain.08@gmail.com
Abstract: A speech recognition system converts speech sounds into the corresponding text: the uttered speech is first decoded by the machine and the corresponding text is then displayed. This paper aims to build a connected-words speech recognition system for the Hindi language. The system has been developed using the hidden Markov model toolkit (HTK), which uses hidden Markov models (HMMs) for recognition. The system has been trained to recognise any sequence of words selected from a vocabulary of 102 words. Initially, Mel frequency cepstral coefficients (MFCCs) are used to extract the features from the speech files. Then, the system is trained to estimate the HMM parameters using word-level acoustic models. The training data has been collected from 12 speakers, including both males and females. The test data used for evaluating the system performance has been collected from five speakers. Experimental results show that the presented system provides an overall word accuracy of 87.01%, a word error rate of 12.99%, and a word correction rate of 90.93%. The work has been evaluated through a comparative analysis with existing similar works, and an improvement has been reported.
Keywords: hidden Markov models; HMMs; speech recognition system; automatic speech
recognition; ASR; hidden Markov model toolkit; HTK; Hindi; connected words; Mel frequency
cepstral coefficient; MFCC.
Reference to this paper should be made as follows: Kumar, K., Aggarwal, R.K. and Jain, A.
(2012) ‘A Hindi speech recognition system for connected words using HTK’, Int. J.
Computational Systems Engineering, Vol. 1, No. 1, pp.25–32.
Biographical notes: Kuldeep Kumar is with the Department of Computer Engineering, National
Institute of Technology (NIT), Kurukshetra, Haryana, India. He received his BTech in Computer
Engineering (with honours) from University Institute of Engineering and Technology (UIET),
Kurukshetra University, Kurukshetra, India in 2009. He has also worked as a summer
trainee in the Department of Computer Engineering, Institute of Technology, Banaras Hindu
University (IT-BHU), Varanasi, India. He has several publications in reputed international
journals/conferences. His current areas of interest include speech processing, semantic web,
software engineering, and statistical models.
R.K. Aggarwal is an Associate Professor in the Department of Computer Engineering at National
Institute of Technology (NIT), Kurukshetra, Haryana, India. He has published more than
30 papers in various conferences and journals. He has delivered several invited talks, keynote
addresses and chaired the sessions in reputed conferences. His research interests include speech
processing, soft computing, statistical modelling, and spirituality. He is a life member of
Computer Society of India (CSI) and Indian Society for Technical Education (ISTE). He has
been involved in various academic, administrative and social affairs of many organisations
having more than 20 years of experience in this field.
Ankita Jain is an Assistant Professor in the Department of Electronics and Communication
Engineering, National Institute of Technology (NIT), Kurukshetra, Haryana, India. She is
involved in various academic, research and social affairs. She has many papers in
national/international journals and conferences. Her areas of interest include speech processing
and communication systems.

Copyright © 2012 Inderscience Enterprises Ltd.



1 Introduction

Speech interfacing provides a convenient and user-friendly way of man-machine communication; in its absence, man-machine interaction remains cumbersome. Conventionally, the transfer of information between man and machine is carried out via keyboards and pointing devices such as the mouse or touchpad for input, and visual display units, monitors, plotters or printers for output. However, it is not convenient for a common man to use these devices, as they require a certain amount of skill and are time consuming too. On the other hand, speech interfacing offers high-bandwidth information and relative ease of use. It also permits the user's hands and eyes to be busy with a task, which is particularly valuable when the user is in motion or in natural field settings. One can speak faster than one can type. Similarly, speech output is more impressive and understandable than text output. Speech interfacing involves two distinct areas: speech synthesis and automatic speech recognition (ASR). Speech synthesis is the process of converting text input into the corresponding speech output, i.e., it acts as a text-to-speech converter. Conversely, speech recognition is the way of converting spoken sounds into text conveying the same information as those sounds. Of these two tasks, speech recognition is more difficult, but it has a variety of applications such as interactive voice response systems, applications for physically challenged persons and others (Aggarwal and Dave, 2011). There are many public-domain software tools available for research work in the field of speech recognition, such as Sphinx from Carnegie Mellon University (SPHINX, 2011), the hidden Markov model toolkit (HTK, 2011) and the large vocabulary continuous speech recognition (LVCSR) engine Julius from Japan (Julius, 2011). This paper aims to develop and implement a speech recognition system for the Hindi language using the HTK open source toolkit.

1.1 Motivation

While speech interfacing is an expedient way of communication, it is fruitful only when the common man is able to reap its benefits. Most international organisations such as Microsoft and IBM, and speech products such as Microsoft SAPI and Dragon NaturallySpeaking, as well as research groups working in this field, concentrate on European languages, especially English. This restricts their usage to the small fraction of the population having literacy in such languages. However, some highly populated countries like India have their own native languages spoken by a large portion of the population. A person having no proficiency in European languages cannot use these technologies. Local relevance and the lack of an effective Hindi speech recognition system have inspired the authors to develop such a system to narrow the gap.

1.2 Paper contribution

This paper presents a Hindi speech recognition system developed for connected words. To extract the features from the speech sound, Mel frequency cepstral coefficients (MFCCs) are used. Hidden Markov models (HMMs) are used to train and recognise the speech. The hidden Markov model toolkit (HTK), developed in 1989 at the Speech Vision and Robotics Group of the Cambridge University Engineering Department (CUED), is used to accomplish this. HTK provides tools for both the training and testing phases. Initially, HTK training tools are used to train the HMMs using training utterances from a speech database. Features are extracted from these training utterances and then used to model the system. Finally, HTK recognition tools are used to transcribe the unknown utterances; they use the system model generated during the training phase to test the system.

Apart from the introduction in Section 1, the paper is organised as follows. Some of the related works are presented in Section 2. Section 3 presents the architecture and functioning of the developed speech recognition system. Section 4 deals with the implementation work. Experimental results are given in Section 5. Section 6 compares the presented work with previous works. Finally, the paper is concluded in Section 7.

2 Related works

This section presents some of the reported works available in the literature that are similar to the presented work. Among others, some of the works providing ASR systems for South-Asian languages are Al-Qatab and Ainon (2010), Gupta (2006) and Pruthi et al. (2000).

Pruthi et al. (2000) have developed a speaker-dependent, real-time, isolated word recogniser for Hindi. Linear predictive cepstral coefficients (LPCCs) were used for feature extraction, and recognition was carried out using discrete HMMs. The system was designed for two male speakers. The recognition vocabulary consists of the Hindi digits (0, pronounced as 'shoonya', to 9, pronounced as 'nau').

An isolated-word speech recognition system for the Hindi language has been designed by Gupta (2006). The system uses continuous density hidden Markov models (CDHMMs) and consists of word-based acoustic units. Again, the system vocabulary contains the Hindi digits. The recogniser gives good results when tested on the sounds used for training the model; for other sounds too, the results are satisfactory.

The work in Al-Qatab and Ainon (2010) discusses the development and implementation of an Arabic speech system using HTK. The system can recognise both continuous speech and isolated words. It uses an Arabic dictionary built manually from the speech sounds of 13 speakers. MFCCs were used to extract the speech feature vectors. The vocabulary consists of 33 words.

This paper shows the design and implementation of a Hindi speech system. The system uses a vocabulary of 102 words. The developed system gives good recognition results in both speaker-dependent and speaker-independent environments.

3 System architecture

The developed speech system mainly consists of three modules: the acoustic analysis module, the training module and the testing module. Initially, data preparation is carried out. Multiple occurrences of the words available in the vocabulary are recorded. These speech signals, captured by a transducer, are pre-processed to convert them into digital form. However, these acoustic signals cannot be processed directly by the speech recognition system; they have to be represented in a more compact and efficient form, which is achieved using acoustic analysis. The training module is used to generate the system model, which is used by the testing module during system testing. The architecture of the developed speech recognition system is shown in Figure 1.

3.1 Pre-processing

Analog speech signals captured by a transducer such as a microphone or telephone must be digitised according to the Nyquist theorem. According to this theorem, the signal must be sampled at more than twice the highest frequency required in the analysis. The frequency equal to twice the highest frequency is called the Nyquist rate. In general, a sampling rate between 8 kHz and 20 kHz is used for speech recognition applications. For the telephone channel, an 8 kHz sampling rate is recommended, while a sampling rate of 16 kHz is used for normal microphones. For quantisation, 16-bit, 24-bit or 32-bit float representations are used depending upon the application.

3.2 Acoustic analysis module

A speech recognition system cannot process the digital waveforms directly; they have to be represented in a more compact and efficient way. For this, initially, the digitised input is spectrally flattened using filters, and then essential features having acoustic correlation with the speech input are extracted using feature extraction.

Figure 1 Developed speech system architecture

[Figure 1 is a block diagram. In the testing module, the spoken word passes through acoustic analysis (pre-processing, pre-emphasis, feature extraction) to produce parameterised waveforms O; the recognition component combines the language model P(W), the acoustic models P(O|W) and the pronunciation dictionary (pronunciation corresponding to sound) to output the speech transcription. In the training module, the same acoustic analysis is applied to the corpus, followed by HMM re-estimation, acoustic model generation, pronunciation dictionary construction and language model generation, which together form the generated system model.]

3.2.1 Pre-emphasis

Pre-emphasis ensures that, in the frequency domain, all the formants of the speech signal have similar amplitude so that they get equal importance in subsequent processing stages. Typically, the speech signal produced by human beings has a spectral slope of approximately –6 dB/octave for voiced sounds due to the following reasons:

• the glottal pulse introduces a slope of –12 dB/octave
• the lip radiation introduces a slope of +6 dB/octave.

Therefore, a resultant slope of approximately –6 dB/octave exists in recorded voiced speech sounds. For this reason, high-frequency formants have smaller amplitude with respect to the lower-frequency formants. Pre-emphasis is performed to remove this –6 dB/octave slope and to make the signal spectrally flat (Deng and O'Shaughnessy, 2003).

3.2.2 Feature extraction

Feature extraction is the way of finding a set of properties of an utterance having acoustic correlations with the speech signal; such parameters are termed features. A feature extractor is expected to discard irrelevant information while keeping the useful information. To do this, successive portions of the speech signal are considered for processing; the length of each portion is called the window size. The data acquired in a window is called a frame. Typically, the frame size ranges from 10 to 25 milliseconds, with an overlap of about 50%–70% between consecutive frames. The data in this analysis interval is multiplied by a windowing function. Different types of windows, such as rectangular, Hamming, Hanning, Bartlett, Blackman or Gaussian, can be used. After that, features are extracted on a frame-by-frame basis.

There are several ways to extract features from each frame, such as LPCC (Markel and Gray, 1976), MFCC (Davis and Mermelstein, 1980), perceptual linear prediction (PLP) (Hermansky, 1990), wavelets (Sharma et al., 2008), temporal patterns (TRAPs) (Hermansky and Sharma, 1999), RASTA (relative spectral transform) processing (Hermansky and Morgan, 1994) and others. HTK 3.4 supports only LPCC, MFCC and PLP.

3.3 System training module

The training module generates the system model. The designed system architecture uses HMMs to generate the acoustic and language models. The system model generated during training is used during the testing phase to recognise an unknown utterance. The various phases used in system training are:

3.3.1 Acoustic model generation

To recognise unknown utterances, some reference models are needed for comparison. These reference models are known as acoustic models. Using these models, the most probable sounds are identified that are produced when the unknown words are spoken. The two kinds of acoustic models are the word model and the phoneme model. In the word model, words are modelled as a whole. This model is used for small-vocabulary systems: if a new word is added to the vocabulary, the system has to be trained for the new word as well. In the phoneme model, on the other hand, parts of words called phones are modelled. This model is used for large-vocabulary systems. In a phoneme-model-based system, adding a new word to the vocabulary is manageable, as the sounds corresponding to the phone sequence of the newly added word may already be known to the system.

An acoustic model can be implemented using various approaches such as HMMs (Rabiner and Juang, 1986), artificial neural networks (ANNs) (Wilinski et al., 1998), dynamic Bayesian networks (DBNs) (Deng, 2006), support vector machines (SVMs) (Guo and Li, 2003), hybrid methods (i.e., combinations of two or more approaches) and others. HMMs have been used in some form or another in virtually every state-of-the-art speech and speaker recognition system.

3.3.2 Language model generation

The language model predicts the probability of a word occurring in a context. In some cases, there are words which are phonetically similar but have different meanings, called homophones, such as जल (water) or ज़ल (related to burning), हंस (name of a bird) or हँस (laugh), आदि (start) or आदी (addicted). Handling such homophones is a critical issue for any ASR, as they normally increase acoustic confusability. Different languages have different numbers of homophones. For example, French admits a large number of homophones; hence, in this particular case, automatic transcription is a very challenging task. Homophones are also common in English (like I or eye, week or weak, principal or principle) but rare in Hindi. The speech system recognises these words based on the context in which they occur; the language model provides this context to the speech recognition system. There are mainly four approaches to language modelling, viz. the grammar-based approach, the stochastic approach, uniform modelling and the finite state approach (Madhav, 2005).

3.4 System testing module

The testing module is used to recognise unknown utterances. Initially, the unknown speech input is converted into a form that can be processed easily by the system. To achieve this, acoustic analysis of the speech input is carried out. The parameterised features thus obtained are used by the recognition component to give the transcription corresponding to the sounds.

3.4.1 Recognition component

The recognition component recognises the test samples based on the acoustic properties of the words (Young et al., 2009).
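The pre-emphasis and framing steps described in Sections 3.2.1 and 3.2.2 can be sketched in a few lines of NumPy. This is an illustrative sketch only (HTK's HCopy performs these steps internally); the 0.97 coefficient, 25 ms window and 10 ms shift are the values this paper uses later in Section 4.3.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1],
    compensating the -6 dB/octave spectral slope of voiced speech."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * np.hamming(frame_len)

# One second of a synthetic 440 Hz tone stands in for recorded speech.
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 440 * t)
frames = frame_and_window(pre_emphasis(speech))
print(frames.shape)  # (98, 400): 98 frames of 25 ms, shifted by 10 ms
```

Each row of `frames` is then passed to the spectral analysis stage (filter bank and cepstral computation) to yield one feature vector per frame.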

Consider that each word is represented as a sequence of speech vectors or observations O defined as:

O = o1 o2 o3 … oT    (1)

where oi is the speech vector observed at time i. The recogniser finds the most probable sequence of words W given the speech vectors O, i.e.,

Ŵ = arg max_{W ∈ L} P(W | O)    (2)

The probability P(W | O) is computed using Bayes' rule, given as:

P(W | O) = P(O | W) P(W) / P(O)    (3)

Given an acoustic observation sequence O, the recogniser finds the sequence of words W which maximises the probability P(O | W) P(W). The term P(W) is the prior probability of the word sequence, which is estimated by the language model. P(O | W) is the observation likelihood, estimated using the acoustic model. The term P(O) is constant and hence can be ignored.

4 Implementation

This section describes the implementation of the speech system developed based upon the system architecture presented in the previous section.

4.1 System description

The Hindi speech recognition system based on MFCCs has been developed using HTK v3.4. The system has been developed in the Ubuntu 10.04 operating environment, which is a Linux platform. The system has been designed to recognise a set of connected words from the set of 102 unique words. Initially, HMM learning has been performed using the HTK training tools; training utterances and their associated transcriptions were used for HMM learning. Then, unknown utterances were transcribed using the HTK recognition tools.

4.2 Database preparation

For training and testing of the speech system, data needs to be collected from a certain number of speakers. A unidirectional microphone was used for recording, keeping a distance of approximately 5–10 cm between the mouth of the speaker and the microphone.

The system was trained using the voices of 12 people (seven male and five female) in the age group 18–23 years. Recording of the speech sounds was done using the Linux command brec. Sounds were recorded at a sampling rate of 16,000 Hz on a mono channel. 16 bits per sample were used, which divides the amplitude range of a sample into 65,536 (2^16) possible values. Speech files were stored in .wav format. Using these specifications, each speaker was asked to utter each word of the vocabulary four times, giving a total of 4,896 (12 × 4 × 102) speech files. Data was recorded in the room environment.

To prepare the test data, the voices of five people (three males and two females) in the age group 17–25 years were recorded with the same specification used for training. Out of the five speakers, three were among those used for training the system. From these five speakers, a total of 190 speech samples were recorded.

4.3 Acoustic analysis

Once the data was recorded and digitised, each speech file was labelled with the corresponding word. The developed system uses manual labelling; WaveSurfer has been used for labelling the speech files. Once the files were labelled, the training data was parameterised into a sequence of features. For this purpose, the HTK tool HCopy was used. The system uses MFCCs, which were derived from fast Fourier transform (FFT)-based log spectra. The input speech was processed at a frame rate of 10 ms with a Hamming window of 25 ms. The reason for choosing the Hamming window is that its side lobes are lower than those of other windows, which avoids leakage, i.e., it does not pull in significant energy from distant frequencies. The acoustic parameters were 39 MFCCs: 12 Mel cepstral coefficients plus log energy, together with their delta (first-order derivative) and acceleration (second-order derivative) coefficients. Table 1 shows the properties of the speech files and the values of the various parameters used for acoustic analysis.

Table 1 Values of various parameters used for acoustic analysis

S. no. | Parameter | Value
1 | Input file format | .wav
2 | Sampling rate | 16,000 Hz
3 | Bit rate (bits per sample) | 16
4 | Type of channel used | mono
5 | Window size | 250,000.0 (25 ms)
6 | Frame periodicity | 100,000.0 (10 ms)
7 | Window used | Hamming
8 | Number of filter-bank channels | 26
9 | Target kind | MFCC_0_D_A (MFCC with energy, delta (Δ) and acceleration (ΔΔ) coefficients)
10 | Number of MFCC coefficients | 12
11 | Pre-emphasis coefficient | 0.97
12 | Length of cepstral liftering | 22
13 | Energy normalisation | true
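The parameters in Table 1 map directly onto an HCopy configuration file. The sketch below is illustrative, not the authors' actual file; the parameter names follow the HTK conventions, and the time values are in HTK's 100 ns units (so 625.0 corresponds to a 16 kHz sampling period).

```
# Illustrative HCopy configuration reflecting Table 1
SOURCEFORMAT  = WAV
TARGETKIND    = MFCC_0_D_A   # 12 MFCCs + energy, with delta and acceleration
SOURCERATE    = 625.0        # 16,000 Hz sampling (100 ns units)
TARGETRATE    = 100000.0     # 10 ms frame shift
WINDOWSIZE    = 250000.0     # 25 ms analysis window
USEHAMMING    = T            # Hamming window
PREEMCOEF     = 0.97         # pre-emphasis coefficient
NUMCHANS      = 26           # filter-bank channels
NUMCEPS       = 12           # cepstral coefficients
CEPLIFTER     = 22           # cepstral liftering
ENORMALISE    = T            # energy normalisation
```

HCopy reads such a file via its -C option and writes one parameterised feature file per input .wav file.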

4.4 System training

This section describes the methodology used for training the implemented system.

4.4.1 Language modelling – task definition

The developed speech system uses a grammar-based approach for language modelling. In the grammar-based approach, the structure of the language is defined in terms of grammar rules. The grammar is specified using extended Backus-Naur form (EBNF). From the grammar, a task network can be generated in which the nodes are the words in the dictionary; the network describes the sequences of words that can be recognised by the system. The HTK tool HParse has been used to obtain the task network. A dictionary describing the pronunciations of the words was also made; the information regarding the correspondence of each HMM with a particular grammar variable was also kept in the dictionary.

4.4.2 Acoustic modelling

HMMs have been used for acoustic modelling. For the acoustic modelling of each word, a prototype HMM was specified. The HMM prototype specifies the model topology: the number of states, the transition parameters, and the output distribution parameters. These prototypes were then re-estimated using the training data. Apart from the models of the vocabulary words, a model for silence (sil) must be included. The prototypes use different numbers of states for each word, depending upon the number of phone units and the duration of the word. This system uses 5–16 state HMMs in which the first and last are non-emitting states. The prototype models were initialised using the HTK tool HInit, which allows fast and precise convergence of the training algorithm. Then the HMM parameters were re-estimated using the HTK tool HRest, which estimates the optimum values of the HMM parameters. Parameters were re-estimated repeatedly using the training data until re-estimation converged. Within each HRest iteration, convergence is indicated by the change measure (convergence factor); the process was repeated until the absolute value of the convergence factor stopped decreasing from one HRest iteration to the next. A system vocabulary word takes approximately two to five iterations to converge.

4.5 System testing

Once the system models have been generated, the system can be used to recognise an unknown utterance; this is called testing. During testing, the unknown utterance spoken by the speaker is first converted into a series of acoustic vectors using the HTK tool HCopy. These acoustic vectors are then processed by the Viterbi algorithm. The HTK tool HVite is used to perform Viterbi-based speech recognition, which uses the token passing algorithm (Young et al., 1989). HVite takes as input a network describing the allowable word sequences (generated using the HTK tool HParse), a dictionary describing how each word is pronounced, and the acoustic models generated during HMM training.

4.6 Performance analysis

In order to analyse the system performance, HTK provides the tool HResults, which is used to compute the accuracy of the system. It compares the machine transcription of the test utterances with the corresponding reference transcription files.

The percentage of words correctly recognised is evaluated as:

%correct = (N − D − S) / N × 100 = H / N × 100    (4)

where N is the number of words in the test set, D is the number of deletions, S is the number of substitutions and H is the number of correctly recognised labels.

The accuracy is computed as:

%accuracy = (N − D − S − I) / N × 100 = (H − I) / N × 100    (5)

where I is the number of insertions. The performance of a speech recognition system can also be evaluated by measuring the word error rate (WER), defined as:

WER = (S + I + D) / N × 100 = 100 − %accuracy    (6)

5 Experimental results

The system was tested using the test data prepared separately by a set of five speakers. Each speaker was asked to utter some words of the vocabulary at least once. Also, some test data was collected from the training data: out of the five speakers used for testing, three were those used for collecting the training data. Thus, the test data contains three types of sounds – sounds used for training the system, sounds spoken by speakers whose other sound files were used for training the system, and sounds of speakers that did not participate in training. The test data was recorded in the room environment. The recognition results are shown in Table 2. The overall word accuracy and word error rate of the system are 87.01% and 12.99% respectively; the word correction rate is 90.93%.

5.1 Experiments in different noisy environments

Experiments have been performed in various noise environments: in open space, in a lab room, in the room environment, in a class room, and in a market. For each environment except the room environment, testing was done using a data set of 50 words. The results are given in Figure 2, which shows that as the noise level increases, the recognition performance of the system degrades.
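The grammar-based task definition of Section 4.4.1 can be illustrated with a small HParse-style grammar. This is a hypothetical sketch: the five words listed are example entries only, not the paper's actual 102-word vocabulary.

```
$word = ek | do | teen | char | paanch;
( SENT-START < $word > SENT-END )
```

HParse compiles such an EBNF-style grammar into the word network that HVite later searches; the angle brackets denote one or more repetitions of `$word`, which is what allows connected-word sequences of arbitrary length.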
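A prototype definition of the kind described in Section 4.4.2 can be generated programmatically. The sketch below emits a 5-state, left-to-right HTK-style prototype with 39-dimensional Gaussian output distributions (zero means, unit variances, to be replaced by HInit/HRest); the exact prototypes used by the authors are not given in the paper, so this is an illustrative assumption.

```python
def htk_prototype(name, num_states=5, vec_size=39):
    """Emit a hypothetical HTK prototype HMM for one vocabulary word:
    states 1 and num_states are non-emitting, transitions are
    left-to-right (stay in a state or advance by one)."""
    lines = [f'~o <VecSize> {vec_size} <MFCC_0_D_A>',
             f'~h "{name}"', '<BeginHMM>', f'<NumStates> {num_states}']
    for s in range(2, num_states):  # emitting states only
        lines += [f'<State> {s}',
                  f'<Mean> {vec_size}', ' '.join(['0.0'] * vec_size),
                  f'<Variance> {vec_size}', ' '.join(['1.0'] * vec_size)]
    # Left-to-right transition matrix.
    trans = [['0.0'] * num_states for _ in range(num_states)]
    trans[0][1] = '1.0'
    for s in range(1, num_states - 1):
        trans[s][s], trans[s][s + 1] = '0.6', '0.4'
    lines += [f'<TransP> {num_states}'] + [' '.join(row) for row in trans]
    lines.append('<EndHMM>')
    return '\n'.join(lines)

print(htk_prototype('proto'))
```

Words with more phone units would simply be given a larger `num_states` (up to 16 in this system) before re-estimation.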

Table 2 Recognition results

Speaker ID | Type of sound | Spoken words (N) | Recognised words (H) | Deletions (D) | Insertions (I) | Substitutions (S) | % word correct | % word accuracy | WER (%)
Speaker 1 | Seen | 17 | 16 | 0 | 0 | 1 | 94.1 | 94.1 | 5.9
Speaker 2 | Seen | 16 | 16 | 0 | 1 | 0 | 100.0 | 93.8 | 6.2
Speaker 3 | Seen | 15 | 14 | 0 | 1 | 1 | 93.3 | 86.7 | 13.3
Average for seen sounds, seen speakers: 95.8% correct, 91.6% accuracy, 8.4% WER
Speaker 1 | Unseen | 26 | 23 | 1 | 1 | 2 | 88.5 | 84.6 | 15.4
Speaker 2 | Unseen | 24 | 22 | 0 | 1 | 2 | 91.7 | 87.5 | 12.5
Speaker 3 | Unseen | 31 | 27 | 0 | 1 | 4 | 87.1 | 83.9 | 16.1
Average for unseen sounds, seen speakers: 89.1% correct, 85.4% accuracy, 14.6% WER
Speaker 4 | Unseen | 41 | 36 | 2 | 3 | 3 | 87.8 | 80.5 | 19.5
Speaker 5 | Unseen | 20 | 17 | 1 | 0 | 2 | 85.0 | 85.0 | 15.0
Average for unseen sounds, unseen speakers: 86.4% correct, 82.7% accuracy, 17.3% WER
Overall system performance: 90.93% correct, 87.01% accuracy, 12.99% WER
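The per-speaker figures in Table 2 follow directly from equations (4)–(6); a small helper reproduces them.

```python
def recognition_scores(n, d, s, i):
    """Compute %correct, %accuracy and WER as in equations (4)-(6).
    n: words in test set, d: deletions, s: substitutions, i: insertions."""
    h = n - d - s                      # correctly recognised labels
    correct = 100.0 * h / n            # equation (4)
    accuracy = 100.0 * (h - i) / n     # equation (5)
    wer = 100.0 * (s + i + d) / n      # equation (6) = 100 - %accuracy
    return round(correct, 1), round(accuracy, 1), round(wer, 1)

# Speaker 1, seen sounds (first row of Table 2): N=17, D=0, S=1, I=0
print(recognition_scores(17, 0, 1, 0))  # (94.1, 94.1, 5.9)
```

Applying the same function to the other rows reproduces the remaining entries of Table 2.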

Figure 2 Performance versus environment

[Figure 2 is a bar chart plotting %correct, %accuracy and WER (0–100%) for the room environment, open space, lab room, class room and market environment.]

6 Comparative analysis

In this section, the presented work has been compared with the existing similar works. Pruthi et al. (2000) developed a speaker-dependent, real-time, isolated word recogniser for Hindi. To extract the features from the speech data, LPCCs were used; recognition was carried out using HMMs. The system was designed for two male speakers. The recognition vocabulary consists of the Hindi digits (0, pronounced as 'shoonya', to 9, pronounced as 'nau'). Although the system gives good performance, the design is speaker-specific and uses a very small vocabulary; it is also an isolated word recogniser. The presented system is a speaker-independent connected word system and uses a larger vocabulary.

Al-Qatab and Ainon (2010) discuss the development and implementation of an Arabic speech system using HTK. The system can recognise both continuous speech and isolated words. It uses an Arabic dictionary built manually from the sounds of 13 speakers. MFCCs were used to extract the speech feature vectors. The vocabulary consists of 33 words. Although the system performance is good, the vocabulary size is small; the presented system uses a larger vocabulary than this one.

Gupta (2006) describes an isolated word speech recognition system for the Hindi language. The system uses continuous density HMMs and consists of word-based acoustic units. Again, the word vocabulary contains the Hindi digits. The recogniser gives good results when tested on the sounds used for training the model; for other sounds too, the results are satisfactory. The system is highly efficient, but it is an isolated word recogniser and the vocabulary size is too small. The developed system recognises connected words with a larger vocabulary. Table 3 shows the tabular comparison of the developed system with the reported works.

Table 3 Tabular comparison of performance

Parameter | Pruthi et al. (2000) | Al-Qatab and Ainon (2010) | Gupta (2006) | Our work
Vocabulary size | 10 | 33 | 10 | 102
Speaker independency | No | Yes | Yes | Yes
Acoustic model | Word | Phone | Word | Word
Connected word recognition | No | Yes | No | Yes
% accuracy | 84.38% | 97.99% | - | 94.34%
Word error rate | 15.62% | 2.01% | - | 5.66%

7 Concluding remarks and future work

This paper presents a speech recognition system for the Hindi language. The presented system recognises connected words using acoustic word models. The system has been trained using a vocabulary of 102 words. HTK, which uses HMMs for recognition, has been used to develop the system. Acoustic features are extracted using MFCCs. For HMM modelling, different numbers of states are selected depending upon the number of phone units and the duration of the words. Training data has been collected from

12 different speakers. To evaluate the system performance, the system has also been tested in the room environment, open space, lab room, class room and market environments using a set of five speakers. It has been observed from the performed experiments that the accuracy and WER of the proposed system are 87.01% and 12.99% respectively; the word correction rate was found to be 90.93%. Based on the comparative analysis performed in the paper, it has been found that the system performs well with a larger vocabulary size compared to the other reported similar works. Future work involves the development of the system for a larger vocabulary and improving the accuracy of the system using noise compensation/speech enhancement techniques.

References

Aggarwal, R.K. and Dave, M. (2011) 'Discriminative techniques for Hindi speech recognition system', Communication in Computer and Information Science (Information Systems for Indian Languages), Vol. 139, No. 2, pp.261–266, Springer-Verlag, Berlin Heidelberg.

Al-Qatab, B.A.Q. and Ainon, R.N. (2010) 'Arabic speech recognition using hidden Markov model toolkit (HTK)', International Symposium in Information Technology (ITSim), 15–17 June, Kuala Lumpur, Vol. 2, pp.557–562.

Davis, S. and Mermelstein, P. (1980) 'Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences', IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 28, No. 4, pp.357–366.

Deng, L. (2006) 'Dynamic speech models: theory, applications, and algorithms', Synthesis Lectures on Speech and Audio Processing, Vol. 2, No. 1, pp.1–118.

Deng, L. and O'Shaughnessy, D. (2003) Speech Processing – A Dynamic and Optimization-Oriented Approach, Marcel Dekker Inc., New York.

Guo, G. and Li, S.Z. (2003) 'Content-based audio classification and retrieval by support vector machines', IEEE Transactions on Neural Networks, Vol. 14, No. 1, pp.209–215.

Gupta, R. (2006) Speech Recognition for Hindi, Master's Project Report, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India.

Hermansky, H. (1990) 'Perceptual linear predictive (PLP) analysis of speech', Journal of the Acoustical Society of America, Vol. 87, No. 4, pp.1738–1752.

Hermansky, H. and Morgan, N. (1994) 'RASTA processing of speech', IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, pp.578–589.

Hermansky, H. and Sharma, S. (1999) 'Temporal patterns (TRAPs) in ASR of noisy speech', Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing.

HTK (2011) Hidden Markov Model Toolkit, available at http://htk.eng.cam.ac.uk (accessed on 10 January 2011).

Julius (2011) Julius: An Open Source LVCSR Engine, available at http://julius.sourceforge.jp (accessed on 10 January 2011).

Madhav, P. (2005) Data Driven Feature Extraction and Parameterization for Speech Recognition, Master's thesis, IIT Kanpur, India.

Markel, J.D. and Gray, A.H. (1976) Linear Prediction of Speech, Springer-Verlag, New York.

Pruthi, T., Saksena, S. and Das, P.K. (2000) 'Swaranjali: isolated word recognition for Hindi language using VQ and HMM', International Conference on Multimedia Processing and Systems (ICMPS), 13–15 August, IIT Madras, India.

Rabiner, L.R. and Juang, B.H. (1986) 'An introduction to hidden Markov models', IEEE ASSP Magazine, Vol. 3, No. 1, pp.4–16.

Sharma, A., Shrotriya, M.C., Farooq, O. and Abbasi, Z.A. (2008) 'Hybrid wavelet-based LPC features for Hindi speech recognition', International Journal of Information and Communication Technology, Vol. 1, Nos. 3/4, pp.373–381, Inderscience.

SPHINX (2011) Sphinx, available at http://cmusphinx.sourceforge.net/html/cmusphinx.php (accessed on 10 January 2011).

Wilinski, P., Solaiman, B., Hillion, A. and Czarnecki, W. (1998) 'Towards the border between neural and Markovian paradigms', IEEE Transactions on Systems, Man and Cybernetics, Vol. 28, No. 2, pp.146–159.

Young, S.J., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P. (2009) The HTK Book, Microsoft Corporation and Cambridge University Engineering Department, Cambridge.

Young, S.J., Russell, N.H. and Thornton, J.H.S. (1989) Token Passing: A Conceptual Model for Connected Speech Recognition Systems, Technical Report, Department of Engineering, Cambridge University, Cambridge, UK.
