ASR Building Using Sphinx
Sphinx
2
Installing the Sphinx Trainer
Download the Sphinx III trainer from
http://172.16.16.93/ASR/SphinxTrain-0.9.1-beta.tar.gz
(Source:
http://www.speech.cs.cmu.edu/SphinxTrain/SphinxTrain-0.9.1-beta.
)
Untar and install SphinxTrain (as root)
$tar -xvzf SphinxTrain-0.9.1-beta.tar.gz
$cd SphinxTrain-0.9.1-beta
$./configure
$make test
$make install
4
CMU-Cambridge Statistical Language Modeling Toolkit
Download the CMU-SLM toolkit from
http://172.16.16.93/ASR/CMU-Cam_Toolkit_v2.tar.gz
(Source http://mi.eng.cam.ac.uk/~prc14/CMU-Cam_Toolkit_v2.tar.gz)
Untar the tgz and install as root
$tar -xvzf CMU-Cam_Toolkit_v2.tar.gz
$cd CMU-Cam_Toolkit_v2/src
$make install
(On little-endian machines, first enable the BYTESWAP flag in the Makefile, as described in the toolkit README)
5
Before getting started…
Download the speech data and the phonetizer:
Available at http://172.16.16.93/ASR/TEL_Landline.tgz
Available at http://172.16.16.93/ASR/IT3-Phonetizer
6
Before getting started… (contd.)
NIST Scorer (for scoring the decoder performance)
Available at http://172.16.16.93/ASR/nist.tar.gz
7
Speech Databases: format
Language/          // Tamil, Telugu or Marathi data
    Cellphone/
        ID-****/   // 4-digit userid
    Landline/
    …
8
Directory structure
9
Training and Testing Datasets
10
Wav file collection
11
Wav file collection (contd.)
12
Acoustic Model Training
Create a new directory (training workspace)
$mkdir TASK_NAME
13
Directory wav/
Copy the Training/*.raw files to this directory (refer slide 11)
14
Directory etc/
Contents to be put in the etc/ directory:
etc/langname.transcription: Copy the Training/transcript file (refer slide 11)
etc/langname.filler: Should contain the silence specifiers:
<s> SIL
</s> SIL
<sil> SIL
15
Directory etc/ (contd.)
etc/langname.phone : Should contain the phoneset, one phone per line
Append 'SIL' as a phone
(example: http://172.16.16.93/ASR/TELUGU.phone)
16
etc/langname.dic
etc/langname.dic : Should contain the phone split of each word entry. Proceed as follows -
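In practice the phone splits come from the IT3-Phonetizer; purely for illustration, a dictionary in this format (word followed by its phones, one entry per line) could be assembled as below. The greedy splitting rules and the toy phone map are hypothetical placeholders, not the phonetizer's actual rules.

```python
# Sketch: build Sphinx-style langname.dic lines from a word list.
# The phone map here is a hypothetical toy example; real splits
# should come from the IT3-Phonetizer.

def phonetize(word, phone_map):
    """Greedy longest-match split of a word into phones (illustrative only)."""
    phones, i = [], 0
    while i < len(word):
        for length in range(min(3, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in phone_map:
                phones.append(phone_map[chunk])
                i += length
                break
        else:
            raise ValueError("no phone rule for %r" % word[i:])
    return phones

def build_dic(words, phone_map):
    """Return langname.dic lines: WORD PH1 PH2 ..."""
    return ["%s %s" % (w, " ".join(phonetize(w, phone_map)))
            for w in sorted(set(words))]

# Hypothetical toy phone map for demonstration:
toy_map = {"a": "A", "b": "B", "ba": "BA"}
print(build_dic(["aba", "ba"], toy_map))  # ['aba A BA', 'ba BA']
```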
18
Some modifications
In the file bin/make_feats, replace the final command line with the following:
bin/wave2feat -verbose -c $1 -raw -di wav -ei raw -do feat -eo feat -srate 8000 -nfft 256 -lowerf 130 -upperf 3400 -nfilt 31 -ncep 13 -dither
19
Training Checklist
Make sure the entries in langname.fileids are in the same order as the filenames in langname.transcription (check the first few files)
Ensure that the same transliteration is used in all three files: langname.transcription, langname.dic and langname.phone
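The first checklist item can be automated. A minimal sketch, assuming the standard Sphinx convention that each transcription line ends with its utterance ID in parentheses:

```python
import re

def check_order(fileids_lines, transcription_lines):
    """Verify that the IDs in langname.fileids appear in the same order
    as the trailing (fileid) tags in langname.transcription.
    Assumes the Sphinx 'words ... (fileid)' line format; if fileids
    contain subdirectory paths, compare basenames instead."""
    ids = [line.strip() for line in fileids_lines if line.strip()]
    tags = []
    for line in transcription_lines:
        m = re.search(r"\(([^()]+)\)\s*$", line)
        if m:
            tags.append(m.group(1))
    return ids == tags

fileids = ["file0001", "file0002"]
trans = ["<s> this is a test </s> (file0001)",
         "<s> another sentence </s> (file0002)"]
print(check_order(fileids, trans))  # True
```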
20
Steps involved in AM training
STEP 0: Verify
./scripts_pl/00.verify/verify_all.pl
21
Steps in AM Training (contd.)
STEP 4: Context-Dependent (CD) untied training
./scripts_pl/04.cd_schmm_untied/slave_convg.pl
STEP 5: Build the decision trees
./scripts_pl/05.buildtrees/slave.treebuilder.pl
22
Steps in AM Training (contd.)
STEP 7: CD tied training
./scripts_pl/07.cd-schmm/slave_convg.pl
23
Training the Language Model
Ideally the LM should be trained on a large unbiased corpus; as a practical approximation, we train it on a corpus derived from the testing and training transcriptions.
Statistical language modeling computes the smoothed trigram, bigram and unigram probabilities from the corpus.
Concatenate the test and training transcriptions, one sentence per line.
24
Training the LM ….contd
Run the following commands on the corpus (eg.
corpus.txt) in the directory
/CMU-Cam_Toolkit_v2/bin
`cat corpus.txt |./text2wfreq >corpus.wfreq`;
`cat corpus.wfreq |./wfreq2vocab > corpus.vocab`;
`cat corpus.txt |./text2idngram –vocab
corpus.vocab >corpus.idngram`;
`./idngram2lm -idngram corpus.idngram -vocab
corpus.vocab -arpa corpus.lm`;
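To illustrate what the pipeline above is estimating, here is an unsmoothed maximum-likelihood bigram sketch; idngram2lm additionally applies discounting and back-off, which this toy version omits.

```python
from collections import Counter

def bigram_mle(corpus_lines):
    """Unsmoothed MLE bigram probabilities P(w2|w1), to illustrate the
    counts the toolkit derives from the corpus. Real LMs smooth these."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        words = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(words[:-1])                  # history counts
        bigrams.update(zip(words[:-1], words[1:]))   # pair counts
    return {bg: bigrams[bg] / unigrams[bg[0]] for bg in bigrams}

probs = bigram_mle(["how can i go", "how do i go"])
print(probs[("how", "can")])  # 0.5: 'how' is followed by 'can' in 1 of 2 lines
```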
25
Pronunciation Dictionary
The decoder should be provided the phone split of
all the unigrams of the LM.
langname.dic (refer slide 3; http://172.16.16.93/ASR/TELUGU.phone)
26
Running the decoder
HMM= ${TASK}/model_parameters/langname.s2models
28
Evaluating the output
Use the original transcription (e.g. test.txt) of the testing files to evaluate the output of the decoder:
this is a test sentence (file0001)
Modify the output of the decoder to the above format, i.e. remove the scores at the end (e.g. output.txt)
Modify and run scorer.sh (refer slide 7) as follows:
NIST : path of the NIST directory
REF : the testing transcription (test.txt)
HYP : the decoder output (output.txt)
score.rpt : the performance report of the decoder
Run the script: $./scorer.sh
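Stripping the scores can be scripted. A sketch, assuming the sphinx2 match-file convention of 'words (fileid score)'; adjust the pattern if your decoder emits scores differently:

```python
import re

def strip_scores(line):
    """Normalize a decoder hypothesis line to 'words (fileid)'.
    Assumes a trailing '(fileid score)' tag with an integer score."""
    return re.sub(r"\(\s*(\S+)\s+-?\d+\s*\)\s*$", r"(\1)", line.rstrip())

print(strip_scores("this is a test sentence (file0001 -2874516)"))
# this is a test sentence (file0001)
```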
29
Interpreting the NIST report
The scorer aligns the decoder output with the
reference transcript of the test utterances
It computes the mean word error rate (w.e.r) per
utterance by penalizing the insertions, deletions
and substitutions in alignment
The report also gives the w.e.r per speaker and
indicates the good and the bad speakers in the test
set
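The metric the scorer reports can be sketched directly: WER is the Levenshtein edit distance over words, divided by the reference length.

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    via Levenshtein alignment over words, as the NIST scorer computes it."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One deletion ('a') and one substitution ('sentence' -> 'sentences'):
print(wer("this is a test sentence", "this is test sentences"))  # 0.4
```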
30
Forced Alignment
A technique to improve the Acoustic Model
Download sphinx2-align (refer slide 7) and modify the parameter paths accordingly:
TASK : Training directory
HMM : ${TASK}/model_parameters/TELUGU.s2models
CTLFILE : The list of all the training files to be aligned
TACTLFN : Transcript to be aligned. The format is -
*align_all* // This should be the first line
this is sentence one // Remove <s>, </s> & filenames
DICT : ${TASK}/etc/langname.dic
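The TACTLFN transcript described above can be generated from the training transcription. A sketch, assuming the Sphinx 'words ... (fileid)' transcription format:

```python
import re

def make_align_transcript(transcription_lines):
    """Build the forced-alignment transcript: first line is *align_all*,
    then one sentence per line with <s>, </s>, <sil> and the trailing
    (filename) tag removed, per the format on the slide above."""
    out = ["*align_all*"]
    for line in transcription_lines:
        line = re.sub(r"\([^()]*\)\s*$", "", line)   # drop (filename)
        line = re.sub(r"</?s>|<sil>", "", line)      # drop silence markers
        out.append(" ".join(line.split()))
    return out

trans = ["<s> this is sentence one </s> (file0001)"]
print(make_align_transcript(trans))  # ['*align_all*', 'this is sentence one']
```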
31
Forced Alignment (contd.)
Arguments for $S2batch:
-osentfn : output file
-datadir : directory containing the raw files
-logfn : logfile for the alignment
Schematic figure shown alongside
33
The Language
Identify the kinds of templates and the various entities that
recur in the domain
Ex: Considering a Tourist domain
Template1: How can I go to the <Location>?
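Templates like the one above can be expanded into a limited-domain corpus for LM training. A sketch; the entity lists are illustrative:

```python
from itertools import product
import re

def expand_templates(templates, entities):
    """Expand templates such as 'How can I go to the <Location>?' by
    substituting every value of each <Entity> slot, producing corpus
    sentences for limited-domain LM training."""
    corpus = []
    for tpl in templates:
        slots = re.findall(r"<(\w+)>", tpl)
        if not slots:
            corpus.append(tpl)
            continue
        for values in product(*(entities[s] for s in slots)):
            line = tpl
            for slot, val in zip(slots, values):
                line = line.replace("<%s>" % slot, val, 1)
            corpus.append(line)
    return corpus

templates = ["How can I go to the <Location>?"]
entities = {"Location": ["museum", "station"]}   # illustrative values
print(expand_templates(templates, entities))
# ['How can I go to the museum?', 'How can I go to the station?']
```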
34
Components for the limited domain ASR
AM : Existing AMs built for the languages
35
Biasing the decoder to LM
To exploit the limited domain, increase the langwt
parameter of the sphinx2-test to increase the speed
and accuracy of the decoder.
36