Public reporting burden for this collection of information is estimated to average 1 hour per response, including time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed. The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
NAVAL POSTGRADUATE SCHOOL
MONTEREY, CALIFORNIA
THESIS
DIPHONE-BASED SPEECH RECOGNITION USING
NEURAL NETWORKS.
by
Mark E. Cantrell
June, 1996
Dan C. Boger
Robert B. McGhee
Approved for public release; distribution is unlimited
REPORT DOCUMENTATION PAGE
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing
data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any
other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information
Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction
Project (0704-0188) Washington DC 20503.
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE: June 1996
3. REPORT TYPE AND DATES COVERED: Master's Thesis
4. TITLE AND SUBTITLE: DIPHONE-BASED SPEECH RECOGNITION USING NEURAL NETWORKS
5. FUNDING NUMBERS
6. AUTHOR(S): Cantrell, Mark E.
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES):
Naval Postgraduate School
Monterey CA 93943-5000
8. PERFORMING ORGANIZATION REPORT NUMBER
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
10. SPONSORING/MONITORING AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES
The views expressed in this thesis are those of the author and do not reflect the official policy or position of
the Department of Defense or the U.S. Government.
12a. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution is unlimited.
12b. DISTRIBUTION CODE
13. ABSTRACT (maximum 200 words)
Speaker-independent automatic speech recognition (ASR) is a problem of long-standing interest to the Department of
Defense. Unfortunately, existing systems are still too limited in capability for many military purposes. Most
large-vocabulary systems use phonemes (individual speech sounds, including vowels and consonants) as recognition units.
This research explores the use of diphones (pairings of phonemes) as recognition units. Diphones are acoustically easier to
recognize because coarticulation effects between the diphone's phonemes become recognition features, rather than
confounding variables as in phoneme recognition. Also, diphones carry more information than phonemes, giving the
lexical analyzer two chances to detect every phoneme in the word. Research results confirm these theoretical advantages.
In testing with 4490 speech samples from 163 speakers, 70.2% of 157 test diphones were correctly identified by one
trained neural network. In the same test, the correct diphone was one of the top three outputs 89.0% of the time. During
word recognition tests, the correct word was detected 85% of the time in continuous speech. Of those detections, the
correct diphone was ranked first 41.6% of the time and among the top six 74% of the time. In addition, new methods of
pitch-based frequency normalization and network feedback-based time alignment are introduced. Both of these techniques
improved recognition accuracy on male and female speech samples from all eight dialect regions in the U.S. In one test
set, frequency normalization reduced errors by 34%. Similarly, feedback-based time alignment reduced another network's
test set errors from 32.8% to 11.0%.
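
The claim that diphones give the lexical analyzer "two chances to detect every phoneme" can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes a diphone is an ordered pair of adjacent phonemes and that each word is padded with a boundary symbol `#` so the first and last phonemes also fall in two pairs. The function names, the boundary symbol, and the ARPAbet-style phoneme labels are all hypothetical choices made for illustration. The last helper shows the arithmetic behind the reported drop in test set errors from 32.8% to 11.0%, read as a relative reduction.

```python
def to_diphones(phonemes, boundary="#"):
    """Pair each phoneme with its successor.

    Padding with a boundary symbol is an assumption of this sketch,
    not something stated in the abstract.
    """
    padded = [boundary, *phonemes, boundary]
    return list(zip(padded, padded[1:]))


def coverage(phonemes, boundary="#"):
    """Return, for each phoneme position, how many diphones contain it."""
    diphones = to_diphones(phonemes, boundary)
    counts = []
    for i in range(len(phonemes)):
        pos = i + 1  # position of phoneme i in the padded sequence
        # Diphone k spans padded positions k and k + 1.
        counts.append(sum(1 for k in range(len(diphones)) if k <= pos <= k + 1))
    return counts


def relative_error_reduction(before, after):
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


if __name__ == "__main__":
    word = ["S", "P", "IY", "CH"]  # illustrative labels for "speech"
    print(to_diphones(word))
    print(coverage(word))
    print(round(relative_error_reduction(32.8, 11.0), 3))
```

Under these assumptions every phoneme position is covered by exactly two diphones, so a missed diphone still leaves one more opportunity to recover each of its phonemes. The error figures quoted in the abstract correspond to roughly a two-thirds relative reduction.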
14. SUBJECT TERMS
automatic speech recognition, diphone, neural network, speaker independent,
continuous speech
15. NUMBER OF PAGES: 35
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT: Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified
20. LIMITATION OF ABSTRACT: UL
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. 239-18 298-102