
Encyclopedia of Applied Linguistics

Automated Speech Recognition


Journal: The Encyclopedia of Applied Linguistics

Manuscript ID: Draft

Manuscript type: article

Date Submitted by the Author: n/a

Complete List of Authors: Levis, John (Iowa State University, Applied Linguistics);
Suvorov, Ruslan (Iowa State University, Applied Linguistics)

Keywords: CALL, Computational Linguistics, language learning technology, pronunciation, Speech Recognition

John Wiley & Sons


Automatic Speech Recognition

Definition

Automatic speech recognition (ASR) is an independent, machine-based process of decoding and transcribing oral speech. A typical ASR system receives acoustic input from the speaker through a microphone, analyzes it using some pattern, model, or algorithm, and produces an output, usually in the form of text. It is important to distinguish speech recognition from speech identification (or understanding), the latter being the process of determining the meaning of an utterance rather than its transcription.

Historical Overview

Pioneering work on ASR dates to the early 1950s. The first ASR system, developed at Bell Telephone Laboratories by Davis, Biddulph, and Balashek (1952), could recognize isolated digits from 0 to 9 for a single speaker. This system used spectral analysis to compare the acoustic patterns of spoken digits with templates previously created by the same speaker. In 1956, Olson and Belar created a "phonetic typewriter" that could recognize ten discrete syllables, but it was also speaker-dependent and required extensive training. The performance of these early ASR systems was lackluster because they used acoustic approaches that only recognized basic units of speech clearly enunciated by a single speaker.

An early attempt to construct speaker-independent recognizers by Forgie and Forgie (1959) was also the first to use a computer. In the following years, researchers experimented with time-normalization techniques (such as Dynamic Time Warping, or DTW) to minimize differences in the speech rates of different talkers
and to reliably detect the starts and ends of speech (e.g., Martin, Nelson, & Zadell, 1964; Vintsyuk, 1968), and Reddy (1966) attempted to develop a system capable of recognizing continuous speech by dynamically tracking phonemes.

The early 1970s were marked by several milestones: a focus on the recognition of isolated words or discrete utterances, the development of large-vocabulary speech recognizers, and experiments to create truly speaker-independent systems. During this period, the first commercial ASR system, the VIP-100, appeared and won a US National Award. This success prompted the Advanced Research Projects Agency (ARPA) of the US Department of Defense to fund the Speech Understanding Research (SUR) project from 1971 to 1976. The goal of SUR was to create a system capable of understanding the connected speech of several speakers, drawn from a 1,000-word vocabulary, in a low-noise environment with an error rate of less than ten percent. Of six systems, the most viable were Hearsay II, HWIM (Hear What I Mean), and Harpy (the only system that completely achieved SUR's goal). These systems had a profound impact on ASR research and development by demonstrating the benefits of data-driven statistical models over template-based approaches and by helping move ASR research toward statistical modeling methods such as Hidden Markov Modeling (HMM).

HMM became the primary focus of ASR research in the 1980s and was implemented in almost every speech recognizer by the end of the decade. This period was also characterized by the re-introduction of artificial neural network (ANN) models, which had been abandoned since the 1950s due to numerous practical problems. Considerable effort was also made to construct systems for large-vocabulary
continuous speech recognition. During this time, ASR was introduced into public telephone networks, and portable speech recognizers were offered to the public. Commercialization continued in the 1990s, when ASR was integrated into products ranging from PC-based dictation systems to air-traffic-control training systems.

During the 1990s, ASR research focused on extending speech recognition to large vocabularies for dictation, on spontaneous speech recognition, and on speech processing in noisy environments. This period was also marked by systematic evaluations of ASR technologies based on word or sentence error rates (Junqua & Haton, 1996). More importantly, steps toward applications mimicking human-to-human communication, in which systems converse with human speakers (e.g., Pegasus and How May I Help You, or HMIHY), began during this period and continued after 2000.

The 2000s witnessed further progress in ASR, including the development of new algorithms and modeling techniques, advances in noisy speech recognition, and the integration of speech recognition into mobile technologies. A recent trend is research on visual speech recognition, in which visual information, particularly lip movements, improves ASR performance, especially in noisy environments (Liew & Wang, 2009).

Classification of ASR Systems

Speech recognition systems can be classified in several ways. Classified according to the speech data in the training database, ASR systems are speaker-dependent (when the system has to be trained for each individual speaker) or speaker-independent (when the training database contains numerous speech
examples from different speakers so the system can accurately recognize any new speaker). Classified according to the type of utterance, there are isolated word recognition systems, which identify words uttered in isolation, and continuous speech recognition systems, which are capable of recognizing whole sentences without pauses between words.

ASR systems can also be classified based on their approaches to speech recognition. Three main approaches to ASR differ in speed, accuracy, and complexity: (a) pattern matching, (b) statistical models, and (c) neural networks.

Pattern matching, the first technique used for ASR, was dominant in the late 1960s and the 1970s. It compares the speaker's input with pre-stored acoustic templates or patterns. Pattern matching operates well at the word level for recognition of phonetically distinct items in small vocabularies, but is less effective for larger-vocabulary recognition. Another limitation of pattern matching is its inability to match and align input speech signals with pre-stored acoustic models of different lengths. Although pattern-matching techniques are still used in some ASR products, more powerful statistical approaches such as Hidden Markov Modeling have largely replaced them.

Hidden Markov Modeling (HMM) was first introduced in ASR in the 1970s and gained greater popularity in the 1980s. Since then, HMM has become the predominant statistical method for ASR. Unlike pattern matching, HMM is based on complex statistical and probabilistic analyses. Hidden Markov Models represent each language unit (e.g., a phoneme or a word) as a sequence of states, with transition probabilities between each state, and probability distributions that define
the expected observed features for each state. The model with the highest probability is taken to represent the correct language unit.

Figure 1. A simple three-state Markov model with transition probabilities aij (Englund, 2004)

The main strength of HMM is that it can describe the probability of states and represent their order and variability through matching techniques such as the Baum-Welch and Viterbi algorithms. In other words, this statistical method can adequately analyze both the temporal and spectral variations of speech signals and can recognize and efficiently decode continuous speech input. However, HMMs require extensive training, a large amount of memory, and huge computational power for model parameter storage and likelihood evaluation (Burileanu, 2008).
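The Viterbi decoding step mentioned above can be sketched in a few lines. The three-state model below (initial, transition, and emission probabilities) is invented purely for illustration; a real recognizer estimates such parameters from training data, typically with the Baum-Welch algorithm, and works with continuous acoustic features rather than a handful of discrete observations.

```python
import math

def viterbi(init, trans, emit, observations):
    """Most likely hidden-state sequence for a discrete-observation HMM."""
    n_states = len(init)
    # delta[s] = best log-probability of any state path ending in s so far
    delta = [math.log(init[s]) + math.log(emit[s][observations[0]])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + math.log(trans[p][s]))
            pointers.append(best_prev)
            new_delta.append(delta[best_prev]
                             + math.log(trans[best_prev][s])
                             + math.log(emit[s][obs]))
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best-scoring final state.
    path = [max(range(n_states), key=lambda s: delta[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy three-state model; every number here is invented for illustration.
init = [0.8, 0.1, 0.1]
trans = [[0.6, 0.3, 0.1],
         [0.1, 0.6, 0.3],
         [0.1, 0.1, 0.8]]
emit = [[0.7, 0.2, 0.1],   # rows: states; columns: discrete observations
        [0.2, 0.6, 0.2],
        [0.1, 0.2, 0.7]]
path = viterbi(init, trans, emit, [0, 0, 1, 2, 2])  # -> [0, 0, 1, 2, 2]
```

Working in log space, as here, is the standard way to avoid numerical underflow when many small probabilities are multiplied.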
Neural Networks, also called Artificial Neural Networks (ANN), are modeled on the human neural system. A network consists of interconnected processing elements (units) combined in layers, with different weights that are determined on the basis of the training data. A typical ANN takes an acoustic input, processes it through the units, and produces an output (i.e., a recognized text). To correctly classify and recognize the input, a network uses the values of the weights.
Figure 2. A simple artificial neural network

ANNs, first introduced in the late 1950s, have witnessed considerable advances since the mid-1980s. Their main advantage lies in the classification of static patterns (including noisy acoustic data), which is particularly useful for recognizing isolated speech units. However, pure ANN-based systems are not effective for continuous speech recognition, so ANNs are often integrated with HMM in a hybrid approach.
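The forward pass just described (an input vector propagated through weighted layers of units) can be sketched minimally as follows. The layer sizes and weight values here are hand-picked assumptions for illustration only; in a real system the weights are learned from training data, for example by backpropagation.

```python
import math

def forward(inputs, layers):
    """Propagate an input vector through fully connected layers.

    layers: list of (weights, biases); weights[j][i] connects input i
    to unit j. Each unit applies a logistic (sigmoid) activation.
    """
    activations = inputs
    for weights, biases in layers:
        activations = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activations)) + b)))
            for row, b in zip(weights, biases)
        ]
    return activations

# Toy network: 3 acoustic features -> 2 hidden units -> 2 output classes.
# All weights below are invented, not learned.
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, -0.1]),  # hidden layer
    ([[1.2, -0.7], [-1.1, 0.9]], [0.05, -0.05]),          # output layer
]
scores = forward([0.9, 0.1, 0.3], layers)
predicted = max(range(len(scores)), key=scores.__getitem__)  # best-scoring class
```

In a hybrid ANN/HMM recognizer, per-class scores like these are typically converted to scaled likelihoods and handed to the HMM decoder rather than used as a final decision.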
Advantages of ASR

Automatic speech recognition has a number of advantages. The most frequently cited benefits of effective ASR technologies include (a) simplicity, (b) increased productivity or efficiency, (c) mobility, and (d) cost savings. First, ASR systems can simplify many aspects of human life because speech input does not require any specialized skills, as do typing, writing, or other types of
manual text-recording activities. Next, speech can be used to input information several times faster than typing or handwriting, which leads to increased productivity. Additionally, ASR technology can be used while the speaker is moving or engaged in other activities. Finally, using a speech recognizer can save costs because it requires fewer clerical workers.

Issues in ASR

Challenges

Developing an effective ASR system is a difficult task that poses a number of challenges. These include speech variability (e.g., intra- and inter-speaker variability such as different voices, accents, styles, contexts, and speech rates), recognition units (e.g., words and phrases, syllables, phonemes, diphones, and triphones), language complexity (e.g., vocabulary size and difficulty), ambiguity (e.g., homophones, word boundaries, syntactic and semantic ambiguity), and environmental conditions (e.g., background noise or several people speaking simultaneously). Despite impressive successes in the field, these challenges continue to make all existing ASR technologies prone to errors.

Errors

Errors in automatic speech recognition can be classified in several ways: errors in discrete speech recognition, errors in continuous speech recognition, and errors in word spotting (Rodman, 1999). Errors in discrete speech recognition include deletion errors (when a system ignores an utterance because the speaker failed to pronounce it loudly enough), insertion errors (when a system perceives
noise as a speech unit), substitution errors (when a recognizer identifies an utterance incorrectly, e.g., We are thinking instead of We are sinking), and rejection errors (when the speaker's word is rejected by a system, for instance, because it has not been included in the vocabulary). Errors in continuous speech recognition can also involve deletion, insertion, substitution, and rejection. In addition, this group contains splits, when one speech unit is mistakenly recognized as two or more units (e.g., euthanasia for youth in Asia), and fusions, when two or more speech units are perceived by a system as one unit (e.g., deep end as depend). Finally, errors in word spotting include false rejects, when a word in the input is missed, and false alarms, when a word is misidentified.

According to another classification, errors in ASR systems can be direct, intent, or indirect (Halverson, Horn, Karat, & Karat, 1999). A direct error appears when a human misspeaks or stutters. An intent error occurs when the speaker decides to restate what has just been said. Finally, an indirect error is made when an ASR system misrecognizes the speaker's input.
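The deletion, insertion, and substitution errors described above are exactly what the standard word error rate (WER) metric counts when ASR output is scored against a reference transcript. A minimal sketch, using dynamic-programming (Levenshtein) alignment; the substitution example reuses the sinking/thinking pair from the text above:

```python
def word_error_counts(reference, hypothesis):
    """Minimum substitutions + deletions + insertions needed to turn
    the hypothesis word list into the reference word list, plus the
    reference length (the WER denominator)."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = min edits aligning first i reference words with first j hypothesis words
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i                      # all deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    return cost[-1][-1], len(ref)

edits, n = word_error_counts("we are thinking", "we are sinking")
wer = edits / n  # 1 substitution over 3 reference words = 1/3
```

Note that a fusion such as deep end recognized as depend is scored by this alignment as one substitution plus one deletion, which is one reason WER is a coarser instrument than the error typology above.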


Applications of ASR

Automatic speech recognition is multidisciplinary in nature. State-of-the-art ASR systems require knowledge from disciplines such as linguistics, computer science, signal processing, acoustics, communication theory, statistics, physiology, and psychology.

ASR has many applications in computer system interfaces (e.g., voice control of computers, data entry, dictation), education (e.g., toys, games, language
translators, language learning software), healthcare (e.g., systems for creating various medical reports, aids for blind and visually impaired patients), telecommunications (e.g., phone-based interactive voice response systems for banking and information services), manufacturing (e.g., quality control monitoring on an assembly line), the military (e.g., voice control of fighter aircraft), and consumer products and services (e.g., car navigation systems, household appliances, mobile devices). ASR products include Dragon NaturallySpeaking, Embedded ViaVoice, Loquendo, LumenVox, VoCon, and Nuance Recognizer.

ASR in Applied Linguistics

An important unsolved problem in using ASR in applied linguistics research and applications is ASR's weakness in recognizing nonnative speech. Because the goal of ASR has been to automatically and accurately recognize words in the speech of particular groups of speakers, better ASR systems have developed through better modeling of more narrowly defined types of native speech (Van Compernolle, 2001). This means that ASR systems lack the flexibility needed to successfully recognize speech outside narrowly defined norms. In one study, Derwing, Munro, and Carbonaro (2000) tested Dragon NaturallySpeaking's ability to identify errors in the speech of very advanced L2 speakers of English. While human listeners successfully transcribed between 95% and 99.7% of the words, the program's recognition rates were a respectable 90% for native English speakers but only 71-72% for the nonnative speakers, a result mirrored for nonnative speech by Coniam (1999).
Other attempts to build robust recognition systems for nonnative speech have also been less successful than is desirable. Machovikov, Stolyarov, Chernov, Sinclair, and Machovikova (2002) created a system whose goal was to recognize mispronunciations of the numbers 1-10 spoken by 33 nonnative speakers of Russian. Agreement between native Russian listeners and the system ranged from 55% (for '6') to 85% (for '7'). Even in this very limited task, the ASR system was not very successful in recognizing the mispronunciation of isolated words from a limited database.

The major reason for this gap is that machines and humans do not listen in the same way. Scharenborg (2007), in discussing the limitations of ASR, compared research in human speech recognition and ASR. She argued that human listeners are superior because they can use more information from the speech signal to decide which words are intended, whereas ASR systems must rely on a far more limited source of information, the acoustic signal itself. The chasm between the acoustic features and the information available to human listeners explains the weaknesses of ASR systems in recognizing nonnative speech, especially in adjusting for accented speech.

Automatic Rating of Pronunciation

One hope for ASR systems is to identify pronunciation errors in nonnative speech. There are two options for automatic rating: giving a global pronunciation rating or identifying specific errors. To reach these goals, ASR systems need to identify word boundaries, accurately align speech to intended targets, and compare
the segments produced with those that should have been produced. A variety of researchers have developed systems meant to provide global evaluations of pronunciation (e.g., Neumeyer, Franco, Digalakis, & Weintraub, 2000; Witt & Young, 2000) using automatic measures including speech rate, duration, and spectral analyses. All of these studies have found that automatic measures do not approach human ratings, though a combination of automatic measures may improve ratings.
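One way to combine automatic measures, sketched here as a simple weighted sum over normalized per-utterance measures. The measure names, normalization, and weight values below are hypothetical assumptions for illustration; the cited studies fit their combinations against human ratings rather than setting weights by hand.

```python
def combined_score(measures, weights):
    """Weighted combination of per-utterance automatic measures.

    measures: dict of measure name -> value, assumed pre-normalized to [0, 1]
    weights:  dict of measure name -> relative weight (hypothetical here;
              in published work these are fit against human ratings)
    """
    total = sum(weights.values())
    return sum(weights[name] * measures[name] for name in weights) / total

# Hypothetical normalized measures for one learner utterance.
measures = {"speech_rate": 0.7, "duration_match": 0.55, "acoustic_likelihood": 0.6}
weights = {"speech_rate": 1.0, "duration_match": 1.0, "acoustic_likelihood": 2.0}
rating = combined_score(measures, weights)  # 0.6125 on a 0-1 scale
```

The point of the sketch is only that a linear combination of several weak measures can track human ratings better than any single measure does.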
ASR systems are also not accurate at precisely identifying specific errors in articulation, sometimes identifying correct speech as containing errors while failing to identify errors that actually occur. Neri, Cucchiarini, Strik, and Boves (2002) found that as few as 25% of pronunciation errors were detected by their ASR system, while some correct productions were identified as errors. Truong, Neri, de Wet, Cucchiarini, and Strik (2005) studied whether an ASR system could identify mispronunciations of three sounds typically mispronounced by learners of Dutch. Errors were successfully detected for one of the three sounds, but the system was less successful for the other two.

Feedback on Speaking

Although ASR systems are not yet capable of precisely identifying spoken errors or of providing adequate global evaluations of pronunciation, there are areas in which ASR has been used by applied linguists cognizant of its limitations: language assessment, feedback on spoken liveliness, and dialogue systems in language learning software.
One spoken language test has used ASR technology successfully. The Versant English Language Test, previously called SET-10 and PhonePass, uses ASR technology to recognize correct answers provided over the telephone. Spoken language tasks are constructed to keep possible answers phonetically distinct from one another, so the ASR system needs only to recognize an answer that is phonetically distinct from incorrect ones. The validity and reliability of the test compare favorably to other human-rated spoken language tests.

A second use of ASR that works within these limitations is feedback on spoken liveliness. Hincks and Edlund's (2009) recognizer used automatic measures of pitch range variation to provide feedback to learners of English giving oral presentations. Using overlapping 10-second measures of pitch range variation, learners were given feedback on how much 'liveliness' their voices projected. By increasing pitch range variation, learners were able to control the movement of the feedback display and thus increase the amount of engagement in their speech.
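A windowed pitch-variation measure of this kind can be sketched as follows. It assumes an F0 (pitch) track has already been extracted by a pitch tracker; only the overlapping 10-second windows come from the description above, while the frame period, step size, and the synthetic tracks are assumptions for illustration (the actual system's feature and display design differ).

```python
def liveliness(f0_track, frame_period=0.01, window_s=10.0, step_s=5.0):
    """Pitch range (max - min of voiced frames, in Hz) per overlapping window.

    f0_track: per-frame F0 values in Hz, 0.0 where unvoiced,
    one frame every `frame_period` seconds.
    """
    win = int(window_s / frame_period)
    step = int(step_s / frame_period)
    ranges = []
    for start in range(0, max(1, len(f0_track) - win + 1), step):
        voiced = [f for f in f0_track[start:start + win] if f > 0]
        ranges.append(max(voiced) - min(voiced) if voiced else 0.0)
    return ranges

# Two synthetic 20-second tracks: a monotone voice and a livelier one
# that alternates between 120 Hz and 160 Hz every second.
flat = [120.0] * 2000
lively = [120.0 + 40.0 * ((i // 100) % 2) for i in range(2000)]
assert max(liveliness(lively)) > max(liveliness(flat))
```

Feeding each window's range back to the speaker in near real time is what lets the display respond as the speaker varies their pitch.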
A third use of ASR technology is in spoken CALL dialogue systems. If a software program for practicing spoken language provides the first line of a dialogue, learners give one of two responses. If these responses are dissimilar, the ASR system can recognize which sentence has been spoken (even with pronunciation errors or missing words). The computer can then respond, allowing the learner to respond again from a menu of possible responses. O'Brien (2006)
gives a review of a number of such programs. In one study of Tell Me More, a language learning software program that incorporates ASR into speaking and pronunciation practice, Cordier, Cooksey, Summers, Tucker, and White (2007) found mixed responses to ASR. Many comments were positive, but students used the ASR features less than other technology features, suggesting that they liked the idea of instant feedback more than the way it actually worked in practice.
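The branching just described can be sketched as picking whichever scripted response is most similar to the (possibly error-filled) recognized transcript. This uses generic string similarity and invented example sentences; it is not how any particular product implements its matching, but it shows why the scripted responses must be kept dissimilar.

```python
from difflib import SequenceMatcher

def pick_response(transcript, expected_responses):
    """Choose the scripted response most similar to the ASR transcript.

    Tolerates recognition errors and missing words, provided the
    scripted responses are dissimilar enough from each other.
    """
    def similarity(candidate):
        return SequenceMatcher(None, transcript.lower(), candidate.lower()).ratio()
    return max(expected_responses, key=similarity)

options = ["I would like a ticket to Berlin", "No thanks, I am just looking"]
# Transcript with dropped words still maps to the intended response:
best = pick_response("would like ticket Berlin", options)  # options[0]
```

If the two options were near-paraphrases of each other, a few recognition errors would be enough to flip the match, which is exactly the failure the phonetically-distinct design guards against.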
Future Directions for ASR in Applied Linguistics

Automatic speech recognition holds great promise for applied linguistics, although this promise has not yet been realized. First, the ubiquity of mobile devices that use ASR-based applications will allow L2 learners to practice their L2 speaking skills and receive feedback on their pronunciation. Further progress in ASR will result in interactive language learning systems capable of providing authentic interaction opportunities, especially for learners who lack access to native speakers. These systems will also eventually be able to give learners specific, corrective feedback on their pronunciation errors. Additionally, the development of noise-resistant ASR technologies will allow language learners to use ASR-based products in noise-prone environments such as classrooms, transportation, and other public places. Finally, the performance of ASR systems will improve as visual speech recognition (based, for instance, on webcam capture of learners' lip movements and facial expressions) becomes more effective and widespread.
Recommended Readings

Holmes, J., & Holmes, W. (2001). Speech synthesis and recognition (2nd ed.). London, UK: Taylor & Francis.

Junqua, J.-C., & Haton, J.-P. (1996). Robustness in automatic speech recognition: Fundamentals and applications. Boston, MA: Kluwer Academic Publishers.

Neri, A., Cucchiarini, C., Strik, H., & Boves, L. (2002). The pedagogy-technology interface in computer assisted pronunciation training. Computer Assisted Language Learning, 15(5), 441-467.

Rodman, R. D. (1999). Computer speech technology. Norwood, MA: Artech House.

Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication, 49, 336-347.
References

Ainsworth, W. A. (1988). Speech recognition by machine. London, UK: Peter Peregrinus Ltd.

Bourlard, H. A., & Morgan, N. (1994). Connectionist speech recognition: A hybrid approach. Boston, MA: Kluwer Academic Publishers.

Burileanu, D. (2008). Spoken language interfaces for embedded applications. In D. Gardner-Bonneau & H. E. Blanchard (Eds.), Human factors and voice interactive systems (2nd ed., pp. 135-161). Norwell, MA: Springer.

Coniam, D. (1999). Voice recognition software accuracy with second language speakers of English. System, 27, 49-64.

Cordier, D., Cooksey, R., Summers, R., Tucker, R., & White, J. (2007). Speech recognition for language learning: Student feedback, usability, and human-computer interaction. The International Journal of Technology, Knowledge and Society, 5, 29-41.

Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic recognition of spoken digits. The Journal of the Acoustical Society of America, 24(6), 637-642.

Derwing, T. M., Munro, M. J., & Carbonaro, M. (2000). Does popular speech recognition software work with ESL speech? TESOL Quarterly, 34, 592-603.

Furui, S. (2001). Digital speech processing, synthesis, and recognition (2nd ed.). New York, NY: Marcel Dekker.

Halverson, C. A., Horn, D. A., Karat, C., & Karat, J. (1999). The beauty of errors: Patterns of error correction in desktop speech systems. In M. A. Sasse & C. Johnson (Eds.), Human-computer interaction - INTERACT '99 (pp. 133-140). Edinburgh: IOS Press.

Hincks, R., & Edlund, J. (2009). Promoting increased pitch variation in oral presentations with transient visual feedback. Language Learning and Technology, 13, 32-50.

Holmes, J., & Holmes, W. (2001). Speech synthesis and recognition (2nd ed.). London, UK: Taylor & Francis.

Junqua, J.-C., & Haton, J.-P. (1996). Robustness in automatic speech recognition: Fundamentals and applications. Boston, MA: Kluwer Academic Publishers.

Lai, J., Karat, C.-M., & Yankelovich, N. (2008). Conversational speech interfaces and technologies. In A. Sears & J. A. Jacko (Eds.), The human-computer interaction handbook: Fundamentals, evolving technologies, and emerging applications (2nd ed., pp. 381-391). New York, NY: Lawrence Erlbaum.

Liew, A., & Wang, S. (2009). Visual speech recognition: Lip segmentation and mapping. Hershey, PA: Medical Information Science Reference.

Machovikov, A., Stolyarov, K., Chernov, M., Sinclair, I., & Machovikova, I. (2002). Computer-based training system for Russian word pronunciation. Computer Assisted Language Learning, 15, 201-214.

Markowitz, J. A. (1996). Using speech recognition. Upper Saddle River, NJ: Prentice Hall PTR.

Martin, T. B., Nelson, A. L., & Zadell, H. J. (1964). Speech recognition by feature abstraction techniques (Technical Report AL-TDR-64-176). Air Force Avionics Lab.

Neri, A., Cucchiarini, C., Strik, H., & Boves, L. (2002). The pedagogy-technology interface in computer assisted pronunciation training. Computer Assisted Language Learning, 15(5), 441-467.

Neumeyer, L., Franco, H., Digalakis, V., & Weintraub, M. (2000). Automatic scoring of pronunciation quality. Speech Communication, 30, 83-93.

O'Brien, M. (2006). Teaching pronunciation and intonation with computer technology. In L. Ducate & N. Arnold (Eds.), Calling on CALL: From theory and research to new directions in foreign language teaching (pp. 127-148). San Marcos, TX: CALICO Monograph Series.

Peinado, A. M., & Segura, J. C. (2006). Speech recognition over digital channels: Robustness and standards. England: John Wiley & Sons Ltd.

Rabiner, L., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice Hall.

Rabiner, L. R., Juang, B.-H., & Lee, C.-H. (1996). An overview of automatic speech recognition. In C.-H. Lee, F. K. Soong, & K. K. Paliwal (Eds.), Automatic speech and speaker recognition: Advanced topics (pp. 1-30). Boston, MA: Kluwer Academic Publishers.

Reddy, D. (1966). An approach to computer speech recognition by direct analysis of the speech wave (Technical Report No. C549). Stanford, CA: Stanford University.

Rodman, R. D. (1999). Computer speech technology. Norwood, MA: Artech House.

Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication, 49, 336-347.

Torkkola, K. (1994). Stochastic models and artificial neural networks for automatic speech recognition. In E. Keller (Ed.), Fundamentals of speech synthesis and speech recognition (pp. 149-169). England: John Wiley & Sons Ltd.

Truong, K., Neri, A., de Wet, F., Cucchiarini, C., & Strik, H. (2005). Automatic detection of frequent pronunciation errors made by L2 learners. In Proceedings of InterSpeech (pp. 1345-1348). Lisbon, Portugal.

Van Compernolle, D. (2001). Recognizing speech of goats, wolves, sheep and ... non-natives. Speech Communication, 35, 71-79.

Versant Language Test. Retrieved July 26, 2010, from www.ordinate.com

Versant English Test. Test description and validation summary. Retrieved July 26, 2010, from www.ordinate.com/technology/VersantEnglishTestValidation.pdf

Vintsyuk, T. K. (1968). Speech discrimination by dynamic programming. Kibernetika, 4(2), 81-88.

Witt, S., & Young, S. (2000). Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication, 30, 95-108.