
DEVELOPMENT AND SUITABILITY OF INDIAN LANGUAGES SPEECH
DATABASE FOR BUILDING WATSON BASED ASR SYSTEM

Dipti Pandey (1), Tapabrata Mondal (2), S.S. Agrawal (3), Srinivas Bangalore (4)
(1) KIIT College of Engineering, Sohna Road, Gurgaon
(2) Jadavpur University, Kolkata
(3) KIIT College of Engineering, Gurgaon
(4) AT&T Labs Research, Florham Park, NJ
dips.pande@gmail.com, tapabratamondal@gmail.com, dr.shyamsagrawal@gmail.com, srini@research.att.com

Abstract - In this paper, we discuss our efforts in the development of Indian spoken language corpora for building large vocabulary speech recognition systems using the WATSON toolkit. The paper demonstrates that these corpora can be reduced to a varying degree for various phonemes by comparing the similarity among the phonemes of different languages. We also discuss the design and methodology of collection of the speech databases and the challenges we faced during database creation. The experiments have been conducted on commonly known Indian languages by training the ASR system with the WATSON toolkit and evaluating it with Sclite. The results of these experiments show that different Indian languages have great similarity in their phoneme structures and phoneme sequences, and we have exploited these features to create the speech recognition system. We have also developed an algorithm for bootstrapping the phonemes of one language into another by mapping the phonemes of the different languages. The performance of the Hindi and Bangla ASR systems built with these databases has been compared.

Keywords: Speech Recognition, Speech Databases, Indian Languages.

1. INTRODUCTION
Researchers are currently striving hard to improve the accuracy of speech processing techniques for various applications. In recent years, several researchers have focused on the development of suitable speech databases for Indian languages for building speech recognition systems: Samudravijaya et al. [1], R.K. Agarwal [2], Chourasia et al. [3], Shweta Sinha & S.S. Agarwal [4], Srinivas Bangalore [5], Ahuja et al. [6], and Maya Ingle and Manohar Chandwani [7].
In this paper, our goal is to develop a speech recognition system that uses Indian language corpora through the WATSON toolkit. For developing the large vocabulary speech recognition system, we concentrate on those languages which have great similarities with one another. The work could benefit a large number of people working in the field of speech recognition, as we exploit our research on the comparison of phonemes across languages. Indian languages are basically phonetic in nature, and there exists a one-to-one correspondence between orthography and pronunciation for all the sounds, barring a few exceptions.

2. ASR SYSTEM ARCHITECTURE
The architecture of the speech recognition system is shown in Fig. 1. It contains two modules: the Training Module and the Testing Module. The Training Module generates the system model against which the test data is compared to obtain the performance percentage. The Testing Module compares the test data with the trained models and yields the 1-best hypothesis.
First, the Pronunciation Dictionary is created using the G2P model (Section 5.1), which is trained with 30,000 linguistically correct words. Based on these linguistically correct words, English phonemes for the different Hindi graphemes have been generated. For creating the G2P model, we have used Moses [8]. The pronunciation dictionary, along with the mapping dictionary (Section 3.3), represents the different possibilities of pronouncing a word. The Language Model has been created using a large set of text data to capture all the possibilities of occurrence of a phoneme in a word, or of a word in a sentence, and thus to strengthen the Acoustic Models (Section 3.1).
In the Testing Module, Sclite [9] is used for evaluating the 1-best hypothesis of each word. The average of the accuracies of the different words gives the overall accuracy of the speech recognition system as a word accuracy percentage. The insertion, deletion and substitution errors can also be computed.

Fig 1. ASR System Architecture

3. BUILDING ASR SYSTEMS
Typically, an ASR system comprises three major constituents: the acoustic models, the language model and the phonetic lexicon.

3.1. Acoustic Models: In this experiment, context-independent as well as context-dependent models of Hindi and Bangla have been created by borrowing phonemes from English. Context-independent models are basically mono-phone models, taking each phone as an individual sound unit. Context-dependent models additionally take into account the probability of occurrence of one phone relative to its neighbouring phones. The data used for creating the acoustic models for Hindi and Bangla are shown in Table 1 and Table 2 respectively.
We have trained the HMM models using the Watson toolkit [10]. For parameterization, Mel Frequency Cepstral Coefficients (MFCC) have been computed. At recognition time, various words are hypothesized against the speech signal. To compute the likelihood of a word, the 1-best hypothesis for each word of the text data has been taken with the help of Sclite. The combined likelihood of all the phonemes represents the likelihood of the word in the acoustic models.

3.2. Language Model: For the language model, a very large set of text data is required so that all the possibilities of occurrence of a word in the Indian languages can be captured. The text data taken for the language models is shown in Table 3 and Table 4.
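As an illustration of the parameterization step (not the Watson toolkit front end itself), the following minimal sketch computes 13-dimensional MFCC features for an utterance. It assumes the librosa library and a hypothetical file name utterance.wav.

# Minimal MFCC extraction sketch; the Watson toolkit's own front end
# is not described in the paper, so librosa is used here for illustration.
import librosa

# Hypothetical input file; any 16 kHz mono recording would do.
signal, sample_rate = librosa.load("utterance.wav", sr=16000)

# 13 cepstral coefficients per 25 ms frame with a 10 ms shift,
# a common configuration for HMM-based acoustic models.
mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sample_rate,
    n_mfcc=13,
    n_fft=400,       # 25 ms window at 16 kHz
    hop_length=160,  # 10 ms frame shift
)
print(mfcc.shape)    # (13, number_of_frames)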
Table 1. Corpus used for Hindi Acoustic Models
  Corpus                     Number of Sentences   Speakers (Male/Female)
  General Messages           1260                  3 Male, 2 Female
  Health & Tourism Corpus    41282                 2 Male, 2 Female
  News Feeds                 800                   8 Male, 4 Female
  Philosophical Data         1000                  3 Male, 2 Female

Table 2. Corpus used for Bengali Acoustic Models
  Corpus                       Number of Sentences   Speakers (Male/Female)
  Shruti Bangla Speech Corpus  7383                  2 Male, 4 Female
  TDIL Data                    1000                  1 Male
  Health & Tourism Corpus      41282                 2 Male, 2 Female

Table 3. Text data used for Hindi Language Model
  Corpus                     Number of Sentences   Total Words   Unique Words
  General Messages           1260                  65300         54324
  Health & Tourism Corpus    41282                 90140         67522
  News Feeds                 800                   7727          3351
  Philosophical Data         1000                  135400        64360
  Wikipedia                  19020                 415818        175265

Table 4. Text data used for Bangla Language Model
  Corpus                       Number of Sentences   Total Words   Unique Words
  Shruti Bangla Speech Corpus  7383                  22012         10054
  TDIL Data                    1000                  25240         6720
  Health & Tourism Corpus      41282                 675915        91033
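The paper does not specify the estimator used for the language model, so the following is only an illustrative sketch of how word-occurrence possibilities can be captured from text: a bigram model with add-one smoothing over a toy romanized corpus.

# Illustrative bigram language model with add-one smoothing.
# The actual language model was built with the Watson toolkit;
# this sketch only shows the underlying idea.
from collections import Counter

sentences = [
    "mausam aaj accha hai",          # toy romanized Hindi text
    "aaj mausam bahut accha hai",
]

unigrams, bigrams = Counter(), Counter()
for sentence in sentences:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words[:-1], words[1:]))

vocabulary_size = len(unigrams)

def bigram_probability(previous_word, word):
    """P(word | previous_word) with add-one smoothing."""
    return (bigrams[(previous_word, word)] + 1) / (unigrams[previous_word] + vocabulary_size)

print(bigram_probability("aaj", "accha"))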
Balancing the acoustic model with the language model: The assumptions made by both the acoustic model and the language model cause them to assign non-zero probabilities to sequences that could never occur. The acoustic model does not take into account the correlation between consecutive frames, while the language model exploits dependencies between only 2-3 consecutive words. Secondly, the acoustic model assigns probabilities to sequences of continuous random variables, while the language model assigns probabilities to sequences of words.

3.3. Lexicon Model: The lexicon model is a dictionary which maps words to phoneme sequences. In this experiment, we have developed a pronunciation dictionary, a mapping dictionary and a grouping of the phones.
Pronunciation Dictionary: It contains the lexicon entry for each individual word, transcribing how the word can be pronounced with the help of English phoneme symbols.
Mapping Dictionary: It maps each phoneme of a particular language onto English phonemes. This mapping is shown in Appendix 1 (vowels) and Appendix 2 (consonants). In this way, we have created the sounds of Hindi and Bangla using English phonemes.
Grouping of Phones: The phones are grouped on the basis of place and manner of articulation (Appendix 1 & 2). In case the engine is unable to decide on a particular phone, it can still find the correct phone with the help of this grouping, by looking into the category to which the phoneme belongs.

4. HINDI & BANGLA PHONE SETS
To represent the sounds of the acoustic space, a set of phonemes [11] is required, which can come either from a particular language or from the sounds of a combination of languages. The IPA [12] has defined phone sets for labelling speech databases for the sounds of a large number of languages (including Hindi), but there are some sounds which are not included in the IPA and which are nevertheless needed for speech recognition. In a continuous speech recognition task, the purpose of defining a phonetic space is to form a well-defined phone set which can represent all the sounds that exist in a language. We have therefore used a set of phonemes from which all the sounds can be produced, either individually or by clustering these phonemes.
Some phonemes exist in the text data only, but not in the audio files. As such a phoneme is pronounced by speakers in a different way, these variabilities have been captured. For example, व (/v/) is written in Bangla text as well, but it is pronounced as ব /b/.

4.1. Challenges: While dealing with the Indian phone sets, the following challenges were faced.
Nasal Sounds: Handling nasal sounds is a real task, especially when a vowel is followed by a consonant. For example, in such a case the vowel is followed by the consonant न. We have clustered the respective vowel and consonant, using their 3-HMM states, in order to obtain a strong recognition system which is able to recognize almost all the phonemes.
OOV (Out-of-vocabulary) problem: During our experiments, the OOV problem occurred frequently. OOV refers to words in the test speech that are not present in the dictionary. To handle this, we have added such phones to the vocabulary.
Clusters of Sounds: Some sounds in Hindi are clusters of two or more different phonemes. To define these sounds, we have taken the 3-HMM states of the constituent phonemes and clustered them to obtain a new sound. Examples of clustering are shown in Table 5.

Table 5: Examples of clustering of sounds
  Phoneme   Clustering of sounds
            ao,2 ao,3 n,2
            uh,2 uh,3 n,2
            iy,2 iy,3 n,2
  ञ         y,2 y,3 n,2
            t,2 t,3 r,2
  झ         j,3 h,2 h,3

Phonemes not common in Hindi & Bengali: Some phonemes exist in Hindi but not in Bengali, and vice-versa. The list of these phones is given in Table 6. As we have dealt with both languages, we have trained the ASR individually for each language with its own phoneme set and computed their accuracies separately.

Table 6: List of phonemes not common in Hindi & Bengali
  Common in Hindi & Bangla   47 phonemes
  Only in Hindi              10 phonemes: व /v/, क़ /q/, ञ /ɲ/, य /j/, ष /ʂ/, ख़ /x/, ग़ /ɣ/, ज़ /z/, झ़ /ʒ/, फ़ /f/
  Only in Bangla             3 phonemes: রং /ŋ/, ঐ /oj/, ঔ /ow/
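To make the lexicon-model idea of Section 3.3 concrete, here is a small hedged sketch of a mapping dictionary together with an articulation-based grouping used as a fallback. The entries are an invented subset for illustration only; the full Hindi and Bangla mappings are those listed in Appendix 1 and Appendix 2.

# Illustrative mapping dictionary (native grapheme/phoneme -> English
# phoneme label) and articulation-based grouping with a coarse fallback.
# The entries below are a small invented subset, not the Appendix tables.
mapping_dictionary = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh",   # velar stops
    "अ": "AX", "इ": "I", "उ": "U",               # short vowels
}

# Phones grouped by place/manner of articulation; if the engine cannot
# decide on a particular phone, it can fall back to the whole group.
phone_groups = {
    "velar_stop": ["k", "kh", "g", "gh"],
    "short_vowel": ["AX", "I", "U"],
}

def lookup(native_phone, fallback_group):
    """Return the mapped English phoneme, or the candidate list of its group."""
    if native_phone in mapping_dictionary:
        return mapping_dictionary[native_phone]
    return phone_groups[fallback_group]

print(lookup("ख", "velar_stop"))   # -> 'kh'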
5. METHODOLOGY TO DEVELOP TEXT CORPORA
5.1 Grapheme to Phoneme Conversion (G2P): For analyzing the text corpus and the distribution of the basic recognition units (the phones, di-phones, syllables, etc.), the text corpus has to be phonetized. G2P converters are tools that convert the text corpus into its phonetic equivalent. The phonetic nature of the Indian languages reduces the effort of building individual mapping tables and rules for the lexical representation. These rules and the mapping tables (Appendix 1 & 2) together comprise the Grapheme to Phoneme converter.
We have used Moses [8] for G2P conversion, training it with 4280 unique words whose phonetic equivalents were set up by linguists. The remaining graphemes are given as input and their phoneme equivalents are taken as output.
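Because the orthography of Indian languages is largely phonetic, a first-pass G2P can be approximated by a direct grapheme-to-phoneme table before any statistical training. The sketch below shows only that idea; the paper's actual converter was trained with Moses, and the table entries here are a small illustrative subset in the spirit of Appendix 1 and 2.

# First-pass grapheme-to-phoneme conversion by table lookup, exploiting the
# near one-to-one orthography of Indian languages.  The table is a tiny
# illustrative subset; exceptional cases are what the trained model handles.
grapheme_to_phoneme = {
    "न": "n", "म": "m", "क": "k", "त": "t",
    "अ": "AX", "आ": "AA", "इ": "I", "ा": "AA", "ी": "II",
}

def phonetize(word, unknown="?"):
    """Map each grapheme of a word to its English phoneme label."""
    return " ".join(grapheme_to_phoneme.get(ch, unknown) for ch in word)

print(phonetize("नमक"))   # -> 'n m k' (inherent schwa not modelled here)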
5.2 Rapid Bootstrapping: The language adaptation technology enables us to rapidly bootstrap a speech recognizer in a new target language.
Converting the phonemes of one language into another: In this experiment, we have developed an algorithm to convert each phoneme of a particular language into the corresponding Hindi phoneme, so that we can use more data for Hindi taken from other Indian languages. For this, each phoneme of the source language is mapped individually to the respective phoneme in Hindi; if it matches a particular Hindi phoneme, the corresponding Hindi character is emitted as the converted phoneme. Thus, for the text data of a particular Indian language, we obtain the data in Hindi phonemes, and further processing can then be done.
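A minimal sketch of this phoneme-mapping step is given below. The cross-language table is hypothetical (a few Bangla-to-Hindi correspondences of the kind mentioned in Section 4), not the full mapping used in the experiments.

# Sketch of the phoneme-bootstrapping idea of Section 5.2: map each phoneme
# (here, each character) of a source-language text onto its Hindi counterpart.
# The mapping below is a tiny hypothetical excerpt, not the full table.
bangla_to_hindi = {
    "ক": "क", "ত": "त", "ন": "न", "ম": "म",
    "আ": "आ", "ই": "इ", "ব": "व",   # Bangla ব may stand for व or ब
}

def bootstrap_to_hindi(text):
    """Replace every mapped source character; keep unmapped characters as-is."""
    return "".join(bangla_to_hindi.get(ch, ch) for ch in text)

print(bootstrap_to_hindi("আমার নাম"))   # converted character by character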
6. COLLECTION OF AUDIO DATA
In this section, the steps involved in building the speech corpora are discussed. Two channels, a head-mounted microphone and a mobile phone, have been used to record the data simultaneously.
6.1 Speaker Selection & Transcription of Audio Files: Speech data was collected from native speakers of the different languages who were comfortable speaking and reading the particular language, so that the training data sufficiently captures the diversity attributable to gender, age and dialect.
6.2 Transcription Corrections: Although care was taken to record the speech with minimal background noise and pronunciation mistakes, some errors still remained in the recordings. These errors had to be identified manually by listening to the speech. The pronunciation mistakes were carefully identified and, where possible, the corresponding changes were made in the transcriptions so that the utterance and the transcription correspond to each other. The idea behind this was to make the utmost use of the data and to let it serve as a corpus for further related research work.
6.3 Data Statistics: The system has been trained with 70% of the overall corpus, and the remaining 30% has been used as test data in the case of open-set speech recognition. In the closed set, the overall data is used for training and some data from the same set is used as test data.

7. ASR EVALUATION RESULTS
We have carried out several experiments to assess the relevance of our approach. Two individual recognition engines, for Hindi and for Bengali, have been developed, each trained with the corpus of its own language.
7.1. Overall Performance of Hindi and Bangla ASR: The overall performance of the Hindi and Bangla ASR when using 70% of the data as the training set and the remaining 30% as the test set is shown in Table 7 and Table 8 respectively.

Table 7: Overall Performance of Hindi ASR
  Task Name    Num Phrases   Beam Width   Word Accuracy (%)   Clock Time
  Output 170   174           170          51.5                221.00
  Output 190   174           190          57.2                311.81
  Output 210   174           210          61.0                431.27
  Output 230   174           230          61.8                589.81
  Output 250   174           250          62.1                781.11

Table 8: Overall Performance of Bangla ASR
  Task Name    Num Phrases   Beam Width   Word Accuracy (%)   Clock Time
  Output 170   174           170          47.3                81.04
  Output 190   174           190          51.8                108.23
  Output 210   174           210          54.3                144.43
  Output 230   174           230          54.2                195.63
  Output 250   174           250          54.9                266.24

7.2 Hindi Speech Recognition: For the Hindi recognition engine, we have both trained and tested the system with the Hindi database. For testing, we have used both the closed set and the open set. The accuracies of the open set (where the test set is outside the training data) and the closed set (where the test set is drawn from the training data) for Hindi are shown in Fig. 2.

Fig 2. Word accuracy percentage of Hindi ASR
7.3 Bangla Speech Recognition: For Bangla recognition, we have trained the system with Bengali data and then tested it with the same subset as well as a different subset, giving the accuracies of both the closed and the open set. The performance of the Bangla recognition engine is shown in Fig. 3.

Fig 3: Word accuracy percentage of Bangla ASR

This shows that the best way to improve the accuracy is to add more and more speakers to the training set. The evaluation of the experiments was made according to the recognition accuracy, computed using the word error rate (WER), which aligns the recognized words against the correct words and counts the number of substitutions (S), deletions (D) and insertions (I) together with the number of words in the correct sentence (N):

WER = 100 * (S + D + I) / N
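The following sketch computes the total number of substitution, deletion and insertion errors by dynamic-programming alignment of a hypothesis against a reference, and the resulting WER; these are the same quantities that Sclite reports, but the code is an independent illustration, not the Sclite implementation.

# Word error rate by edit-distance alignment: the minimum total number of
# substitutions, deletions and insertions against a reference of N words.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimal edit cost between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution,
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("mausam aaj accha hai", "mausam aaj hai"))  # 25.0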
7.4 Using the Transliterated Data: In this experiment, 50 Bangla sentences were taken. After their transliteration into Hindi [13], the sentences were recorded by a native Hindi speaker and used as an additional test set. The parallel Bangla and Hindi sentences were then tested with the same Bangla ASR. As the acoustic model is the same for both languages, we can use both sets of parallel files as test sets with the same system. The accuracy of the Bengali corpus and its Hindi transliteration, when tested with the same Bengali ASR, is shown in Table 9.

Table 9. Accuracies of the original Bangla sentences and their transliterated version
  Language   Sentences from Testing Set   Accuracy (%)
  Hindi      50                           57.4
  Bangla     50                           64.2

The system was trained on the original Bangla sentences, so it gives better accuracy for the Bangla sentences than for their Hindi transliterated version. Since the difference in accuracy is not large, the transliteration is effective. Thus, we can increase the text corpus of a particular language by using transliterated data obtained from any other language.

7.5 Comparison of Hindi & Bengali ASR Models: In this experiment, we have built the acoustic models for the Bangla corpus and its Hindi transliterated corpus individually, keeping the language model the same. Thus, for the Hindi ASR we have used 50 Hindi sentences and 50 Bangla-to-Hindi transliterated sentences as test sentences, and similarly with the Bangla ASR. The accuracies observed in these cases are shown in Table 10.

Table 10: Comparison of Hindi & Bangla ASR
  Language   Testing Sentences   Accuracy (%)
  Hindi      50                  74.2
  Bangla     50                  65.6

As we have a larger Hindi corpus than Bangla corpus, the accuracy of the Hindi ASR is better. It can thus be concluded that the accuracy improves as the corpus grows.

8. IMPROVING ASR ACCURACIES USING MORE DATA FROM VARIOUS LANGUAGES
We have collected data from various commonly known Indian languages and transliterated all of it into the Hindi alphabet, so that we can capture the many variations which occur in the different Indian languages. With this data, both the acoustic and the language model are improved. Since the Indian languages are phonetically rich and have a close correspondence between orthography and pronunciation, we can capture all the possibilities of the phonemes. The experiments show that the ASR accuracy can also be improved by using a larger corpus built in this manner.

9. CONCLUSION & FUTURE WORK
In this paper, we discussed the design and development of speech databases for two Indian languages, Hindi and Bangla, and their suitability for developing ASR systems using the WATSON tool. The simple methodology of database creation presented here can serve as a catalyst for the creation of speech databases in all the other Indian languages.
Some of the conclusions of our study are:
- Female speakers perform better when the system is trained with a female voice database alone.
- The accuracy of the system is better when it is trained with a variety of speakers and speaking styles, as compared to simply increasing the corpus from a limited number of speakers.
- Native speakers perform better than non-native speakers in all conditions.
- As the beam width increases, the word accuracy also increases.
- The word accuracy also increases with the clock time.

We hope that the ASRs created using the databases developed in this work will serve as baseline systems for further research on improving the accuracies in each of the languages. Our future work is focused on tuning these models and testing them using language and acoustic models built from a much larger corpus with a large number of speakers.
10. ACKNOWLEDGEMENT
We would like to acknowledge the help and support received from Mr. Anirudhha of IIIT Hyderabad in conducting these experiments. We are also thankful to Prof. Michael Carl of CBS, Copenhagen, and to the KIIT management, in particular Dr. Harsh V. Kamrah and Mrs. Neelima Kamrah, for providing the necessary facilities, financial help and encouragement, and to DeitY for providing a fellowship to one of the authors, Dipti Pandey.

11. REFERENCES
[1] Samudravijaya K, P.V.S. Rao, and S.S. Agrawal, "Hindi speech database," Proc. Int. Conf. on Spoken Language Processing (ICSLP 2000), Beijing, China, October 2000, CDROM paper 00192.pdf.
[2] K. Kumar and R.K. Agarwal, "Hindi Speech Recognition System Using HTK," International Journal of Computing and Business Research, Vol. 2, No. 2, 2011, ISSN (Online): 2229-6166.
[3] Chourasia, K. Samudravijaya, and Chandwani, "Phonetically rich Hindi sentences corpus for creation of speech database," Proc. O-COCOSDA 2005, pp. 132-137.
[4] Shweta Sinha, S.S. Agrawal, and Jesper Olsen, "Mobile speech Hindi database," O-COCOSDA 2011, Hsinchu, Taiwan.
[5] www.mastar.jp/wfdtr/presentation/2_Dr.Bangalore.pdf
[6] Ahuja, R., Bondale, N., Furtado, X., Krishnan, S., Poddar, P., Rao, P.V.S., Raveendran, R., Samudravijaya K, and Sen, A., "Recognition and Synthesis in the Hindi Language," Proceedings of the Workshop on Speech Technology, IIT Madras, pp. 3-19, Dec. 1992.
[7] Vishal Chourasia, Samudravijaya K, Maya Ingle, and Manohar Chandwani, "Hindi speech recognition under noisy conditions," J. Acoust. Soc. India, 54(1), pp. 41-46, January 2007.
[8] http://www.statmt.org/moses/manual/manual.pdf
[9] http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm
[10] http://www.research.att.com/projects/WATSON/?fbid=2tgRMa1CfjG
[11] S.S. Agrawal, K. Samudravijaya, and Karunesh Arora, "Text and Speech Corpora Development in Indian Languages," Proceedings of ICSLT-O-COCOSDA 2004, New Delhi, India.
[12] www.madore.org/~david/misc/linguistic/ipa/
[13] http://en.wikipedia.org/wiki/Devanagari_transliteration

APPENDIX
The characterization of the Hindi and Bangla phonemes has been done as follows:

Appendix 1: Characterization of Vowels
  Category              Hindi   Bengali   IPA    English phoneme representation
  Monophthongs (Short)  अ       অ         /ə/    AX
                        इ       ই         /i/    I
                        उ       উ         /u/    U
                        ऋ       ঋ         -      RR
  Monophthongs (Long)   आ       আ         /aː/   AA
                        ई       ঈ         /iː/   II
                        ऊ       ঊ         /uː/   UU
                        ए       এ         /e/    E
                        ओ       ও         /o/    O
  Diphthongs            ऐ       ঐ         /æ/    AI
                        औ       ঔ         /ɔː/   AU

Appendix 2: Characterization of Consonants
  Category                   Hindi   Bengali   IPA     English phoneme representation
  Unaspirated (Unvoiced)     क       ক         /k/     k
                             च       চ         /tʃ/    c
                             ट       ট         /ʈ/     tt
                             त       ত         /t/     t
                             प       প         /p/     p
  Aspirated (Unvoiced)       ख       খ         /kʰ/    kh
                             छ       ছ         /tʃʰ/   ch
                             ठ       ঠ         /ʈʰ/    tth
                             थ       থ         /tʰ/    th
                             फ       ফ         /pʰ/    ph
  Unaspirated (Voiced)       ग       গ         /g/     g
                             ज       জ         /dʒ/    j
                             ड       ড         /ɖ/     dd
                             द       দ         /d/     d
                             ब       ব         /b/     b
  Aspirated (Voiced)         घ       ঘ         /gʰ/    gh
                             झ       ঝ         /dʒʰ/   jh
                             ढ       ঢ         /ɖʱ/    ddh
                             ध       ধ         /dʰ/    dh
                             भ       ভ         /bʰ/    bh
  Nasals                     ड़       ড়         /ɽ/     ddn
                             ञ       ঞ         /ɲ/     ny
                             ण       ন         /ɳ/     nn
                             न       ন         /n/     n
                             म       ম         /m/     m
  Semivowels/Approximants    य       য         /j/     y
                             र       র         /r/     r
                             ल       ল         /l/     l
                             व       ব         /v/     w
  Sibilants                  श       শ         /ʃ/     sh
                             ष       ষ         /ʂ/     sh^
                             स       স         /s/     s
  Glottal                    ह       হ         /h/     h
