ASR Building Using Sphinx
Sphinx
2
Installing the Sphinx Trainer
Download the Sphinx III trainer from
http://172.16.16.93/ASR/SphinxTrain-0.9.1-beta.tar.gz
(Source:
http://www.speech.cs.cmu.edu/SphinxTrain/SphinxTrain-0.9.1-beta.
)
Untar and install SphinxTrain (as root)
$tar -xvzf SphinxTrain-0.9.1-beta.tar.gz
$cd SphinxTrain-0.9.1-beta
$./configure
$make test
$make install
4
CMU-Cambridge Statistical Language Modeling Toolkit
Download the CMU-SLM toolkit from
http://172.16.16.93/ASR/CMU-Cam_Toolkit_v2.tar.gz
(Source http://mi.eng.cam.ac.uk/~prc14/CMU-Cam_Toolkit_v2.tar.gz)
Untar the tgz and install as root
$tar -xvzf CMU-Cam_Toolkit_v2.tar.gz
$cd CMU-Cam_Toolkit_v2/src
$make install
(On little-endian machines, first enable the BYTESWAP flag in the Makefile, as described in the toolkit README)
5
Before getting started…
Download the speech data and the phonetizer:
Available at http://172.16.16.93/ASR/TEL_Landline.tgz
Available at http://172.16.16.93/ASR/IT3-Phonetizer
6
Before getting started… (contd.)
NIST Scorer (for scoring the decoder performance)
Available at http://172.16.16.93/ASR/nist.tar.gz
7
Speech Databases: format
Language/          // Tamil, Telugu or Marathi data
    Cellphone/
        ID-****/   // 4-digit userid
    Landline/
    …
8
Directory structure
9
Training and Testing Datasets
10
Wav file collection
11
Wav file collection (contd.)
12
Acoustic Model Training
Create a new directory (training workspace)
$mkdir TASK_NAME
13
Directory wav/
Copy the Training/*.raw files to this directory (refer slide 11)
14
Directory etc/
Contents to be put in the etc/ directory:
etc/langname.transcription: Copy the Training/transcript file (refer slide 11)
etc/langname.filler: Should contain the silence specifiers:
<s> SIL
</s> SIL
<sil> SIL
15
Directory etc/ (contd.)
etc/langname.phone : Should contain the phoneset, one phone per line
Append 'SIL' as a phone
(example: http://172.16.16.93/ASR/TELUGU.phone)
16
etc/langname.dic
etc/langname.dic : Should contain the phone split of each word entry. Proceed as follows -
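In practice the phone splits come from the IT3-Phonetizer; purely for illustration, a dictionary in this format (word followed by its phones, one entry per line) could be assembled as below. The greedy splitting rules and the toy phone map are hypothetical placeholders, not the phonetizer's actual rules.

```python
# Sketch: build Sphinx-style langname.dic lines from a word list.
# The phone map here is a hypothetical toy example; real splits
# should come from the IT3-Phonetizer.

def phonetize(word, phone_map):
    """Greedy longest-match split of a word into phones (illustrative only)."""
    phones, i = [], 0
    while i < len(word):
        for length in range(min(3, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in phone_map:
                phones.append(phone_map[chunk])
                i += length
                break
        else:
            raise ValueError("no phone rule for %r" % word[i:])
    return phones

def build_dic(words, phone_map):
    """Return langname.dic lines: WORD PH1 PH2 ..."""
    return ["%s %s" % (w, " ".join(phonetize(w, phone_map)))
            for w in sorted(set(words))]

# Hypothetical toy phone map for demonstration:
toy_map = {"a": "A", "b": "B", "ba": "BA"}
print(build_dic(["aba", "ba"], toy_map))  # ['aba A BA', 'ba BA']
```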
18
Some modifications
In the file bin/make_feats, replace the final command line with the following:
bin/wave2feat -verbose -c $1 -raw -di wav -ei raw -do feat -eo feat -srate 8000 -nfft 256 -lowerf 130 -upperf 3400 -nfilt 31 -ncep 13 -dither
19
Training Checklist
Make sure the entries in langname.fileids are in the same order as the filenames in langname.transcription (check the first few files)
Ensure that the same transliteration is used in all three files: langname.transcription, langname.dic and langname.phone
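The first checklist item can be automated. A minimal sketch, assuming the standard Sphinx convention that each transcription line ends with its utterance ID in parentheses:

```python
import re

def check_order(fileids_lines, transcription_lines):
    """Verify that the IDs in langname.fileids appear in the same order
    as the trailing (fileid) tags in langname.transcription.
    Assumes the Sphinx 'words ... (fileid)' line format; if fileids
    contain subdirectory paths, compare basenames instead."""
    ids = [line.strip() for line in fileids_lines if line.strip()]
    tags = []
    for line in transcription_lines:
        m = re.search(r"\(([^()]+)\)\s*$", line)
        if m:
            tags.append(m.group(1))
    return ids == tags

fileids = ["file0001", "file0002"]
trans = ["<s> this is a test </s> (file0001)",
         "<s> another sentence </s> (file0002)"]
print(check_order(fileids, trans))  # True
```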
20
Steps involved in AM training
STEP 0: Verify
./scripts_pl/00.verify/verify_all.pl
21
Steps in AM Training (contd.)
STEP 4: Context-Dependent (CD) untied training
./scripts_pl/04.cd_schmm_untied/slave_convg.pl
STEP 5: Build the decision trees
./scripts_pl/05.buildtrees/slave.treebuilder.pl
22
Steps in AM Training (contd.)
STEP 7: CD tied training
./scripts_pl/07.cd-schmm/slave_convg.pl
23
Training the Language Model
Ideally the LM should be trained on a large unbiased corpus; as a practical approximation, we train it on a corpus derived from the testing and training transcriptions.
Statistical language modeling computes the smoothed trigram, bigram and unigram probabilities from the corpus.
Concatenate the test and training transcriptions, one sentence per line.
24
Training the LM ….contd
Run the following commands on the corpus (eg.
corpus.txt) in the directory
/CMU-Cam_Toolkit_v2/bin
`cat corpus.txt |./text2wfreq >corpus.wfreq`;
`cat corpus.wfreq |./wfreq2vocab > corpus.vocab`;
`cat corpus.txt |./text2idngram –vocab
corpus.vocab >corpus.idngram`;
`./idngram2lm -idngram corpus.idngram -vocab
corpus.vocab -arpa corpus.lm`;
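To illustrate what the pipeline above is estimating, here is an unsmoothed maximum-likelihood bigram sketch; idngram2lm additionally applies discounting and back-off, which this toy version omits.

```python
from collections import Counter

def bigram_mle(corpus_lines):
    """Unsmoothed MLE bigram probabilities P(w2|w1), to illustrate the
    counts the toolkit derives from the corpus. Real LMs smooth these."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        words = ["<s>"] + line.split() + ["</s>"]
        unigrams.update(words[:-1])                  # history counts
        bigrams.update(zip(words[:-1], words[1:]))   # pair counts
    return {bg: bigrams[bg] / unigrams[bg[0]] for bg in bigrams}

probs = bigram_mle(["how can i go", "how do i go"])
print(probs[("how", "can")])  # 0.5: 'how' is followed by 'can' in 1 of 2 lines
```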
25
Pronunciation Dictionary
The decoder should be provided the phone split of
all the unigrams of the LM.
langname.dic (refer slide 3; http://172.16.16.93/ASR/TELUGU.phone)
26
Running the decoder
HMM= ${TASK}/model_parameters/langname.s2models
28
Evaluating the output
Use the original transcription (e.g. test.txt) of the testing files to evaluate the output of the decoder:
this is a test sentence (file0001)
Modify the output of the decoder to the above format, i.e. remove the scores at the end (e.g. output.txt)
Modify and run scorer.sh (refer slide 7) as follows:
NIST : path of the NIST directory
REF : the testing transcription (test.txt)
HYP : the decoder output (output.txt)
score.rpt : the performance report of the decoder
Run the script: $./scorer.sh
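Stripping the scores can be scripted. A sketch, assuming the sphinx2 match-file convention of 'words (fileid score)'; adjust the pattern if your decoder emits scores differently:

```python
import re

def strip_scores(line):
    """Normalize a decoder hypothesis line to 'words (fileid)'.
    Assumes a trailing '(fileid score)' tag with an integer score."""
    return re.sub(r"\(\s*(\S+)\s+-?\d+\s*\)\s*$", r"(\1)", line.rstrip())

print(strip_scores("this is a test sentence (file0001 -2874516)"))
# this is a test sentence (file0001)
```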
29
Interpreting the NIST report
The scorer aligns the decoder output with the
reference transcript of the test utterances
It computes the mean word error rate (w.e.r) per
utterance by penalizing the insertions, deletions
and substitutions in alignment
The report also gives the w.e.r per speaker and
indicates the good and the bad speakers in the test
set
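The metric the scorer reports can be sketched directly: WER is the Levenshtein edit distance over words, divided by the reference length.

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    via Levenshtein alignment over words, as the NIST scorer computes it."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One deletion ('a') and one substitution ('sentence' -> 'sentences'):
print(wer("this is a test sentence", "this is test sentences"))  # 0.4
```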
30
Forced Alignment
A technique to improve the Acoustic Model
Download sphinx2-align (refer slide 7) and modify the parameter paths accordingly:
TASK : Training directory
HMM : ${TASK}/model_parameters/TELUGU.s2models
CTLFILE : The list of all the training files to be aligned
TACTLFN : Transcript to be aligned. The format is -
*align_all* // This should be the first line
this is sentence one // Remove <s>, </s> & filenames
DICT : ${TASK}/etc/langname.dic
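The TACTLFN transcript described above can be generated from the training transcription. A sketch, assuming the Sphinx 'words ... (fileid)' transcription format:

```python
import re

def make_align_transcript(transcription_lines):
    """Build the forced-alignment transcript: first line is *align_all*,
    then one sentence per line with <s>, </s>, <sil> and the trailing
    (filename) tag removed, per the format on the slide above."""
    out = ["*align_all*"]
    for line in transcription_lines:
        line = re.sub(r"\([^()]*\)\s*$", "", line)   # drop (filename)
        line = re.sub(r"</?s>|<sil>", "", line)      # drop silence markers
        out.append(" ".join(line.split()))
    return out

trans = ["<s> this is sentence one </s> (file0001)"]
print(make_align_transcript(trans))  # ['*align_all*', 'this is sentence one']
```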
31
Forced Alignment (contd.)
Arguments for $S2batch:
-osentfn : output file
-datadir : directory containing the raw files
-logfn : logfile for the alignment
Schematic figure shown alongside
33
The Language
Identify the kinds of templates and the various entities that
recur in the domain
Ex: Considering a Tourist domain
Template1: How can I go to the <Location>?
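Templates like the one above can be expanded into a limited-domain corpus for LM training. A sketch; the entity lists are illustrative:

```python
from itertools import product
import re

def expand_templates(templates, entities):
    """Expand templates such as 'How can I go to the <Location>?' by
    substituting every value of each <Entity> slot, producing corpus
    sentences for limited-domain LM training."""
    corpus = []
    for tpl in templates:
        slots = re.findall(r"<(\w+)>", tpl)
        if not slots:
            corpus.append(tpl)
            continue
        for values in product(*(entities[s] for s in slots)):
            line = tpl
            for slot, val in zip(slots, values):
                line = line.replace("<%s>" % slot, val, 1)
            corpus.append(line)
    return corpus

templates = ["How can I go to the <Location>?"]
entities = {"Location": ["museum", "station"]}   # illustrative values
print(expand_templates(templates, entities))
# ['How can I go to the museum?', 'How can I go to the station?']
```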
34
Components for the limited domain ASR
AM : Existing AMs built for the languages
35
Biasing the decoder to LM
To exploit the limited domain, increase the langwt
parameter of the sphinx2-test to increase the speed
and accuracy of the decoder.
36