
Automated Evaluation of Pronunciation for English Learners

Yeonjung Hong, Korea University
2021.08.22
Table of Contents

1. Research Background
2. Automated Evaluation of Pronunciation
   1. Phoneme Evaluation
   2. Rhythm Evaluation
   3. Intonation Evaluation
   4. Stress Evaluation
3. Effectiveness of Automated Phoneme Evaluation System
4. Conclusion
5. Future Studies
1. Research Background

Research Background
CALL & CAPT

• Computer-Assisted Language Learning (CALL)
  • Early CALL (1960s): reading/writing-focused, following the advent of computers
  • Emergence of CAPT (1990s): expansion into the speaking domain as ASR technology matured

• Computer-Assisted Pronunciation Training (CAPT)
  • Has developed from "listen & repeat" drills to automated pronunciation evaluation over the last 30 years
  • Research topic 1: automated evaluation - segmental level, suprasegmental level, spontaneous speech
  • Research topic 2: effectiveness on learning - testing learning effectiveness with CAPT, analysis of effective feedback types, learners' satisfaction levels
Research Background
Limitations of previous research

1. Lack of systems providing criterion-based pronunciation scores

• Effective pronunciation learning requires quantitative feedback on each of 4 criteria: the segmental (phoneme) and suprasegmental (rhythm, intonation, stress) levels
(Neri et al., 2002; Lee et al., 2015; McGregor and Reed, 2018; Lee, 2019; Perez-Ramon et al., 2020)

• Few previous systems provide score feedback on all 4 criteria:
  - Holistic scores only (Zechner et al., 2009; Black et al., 2015)
  - Hard-to-interpret feedback such as oscillograms and spectrograms (Auralog, 2002; Rosetta Stone, 2013; Tell Me More, 2013)
  - Segmental (phoneme) scores only (Russell et al., 2000; Franco et al., 2010)
  - Visualized graphs of rhythm, stress, and intonation only, with no scores (Kommissarchik and Kommissarchik, 2000; BetterAccent, 2002)

2. Automated pronunciation evaluation research with low educational practicality

• State-of-the-art automated pronunciation evaluation is based on deep learning and requires heavy computing resources (Zhang et al., 2020; Lin et al., 2020)
• Even then, it still yields only holistic scores
• Without practicality in classrooms, CALL/CAPT is meaningless (Neri et al., 2002)
Research Background
Goal & Hypotheses

• Research Goal
  • Develop automated evaluation systems for phoneme, rhythm, intonation, and stress —> feedback for effective pronunciation learning
  • Run experiments to test the effectiveness of the automated pronunciation evaluation system —> technology that considers educational practicality
• Hypotheses
  • Evaluation of English segmentals
    • An ASR-based phoneme evaluation system will provide human-evaluator-level scores when it is modeled with native speakers' phoneme information
  • Evaluation of English suprasegmentals
    • It will provide human-evaluator-level scores when rhythm is modeled on duration, intonation on pitch, and stress on energy and pitch
  • Effectiveness of the automated English phoneme scoring system
    • English learners' pronunciation will improve after practicing with the automated English phoneme evaluation system multiple times
    • English learners' satisfaction will depend on the pronunciation score unit: word, syllable, or phoneme
2. Automated Evaluation of Pronunciation

Intro
Overview

[Diagram] Every evaluation system in this chapter follows the same three steps:
• STEP 1. Modeling: train a scoring model on training data
• STEP 2. Scoring: run the model on test data (non-native speech) to predict scores
• STEP 3. Validation: measure model performance as machine-human score agreement against human scores on the same test data
Intro
Database - English Read by Japanese (ERJ)

1. Structure
• Read speech of English spoken by Japanese college students (Minematsu et al., 2004)
• Provides human raters' scores on the L2 speech: 1-5 Likert-scale scores on phoneme, rhythm, stress, and intonation
• English L1 (American) read speech is also included (identical sentences to the L2 set)

Structure of the L2 speech:

| Criteria | Phoneme | Rhythm | Intonation | Stress |
|---|---|---|---|---|
| Evaluators | 5 | 4 | 4 | 2 |
| Speech (utterances) | 5674 | 950 | 950 | 1900 |
| Items: Word | 300 | 0 | 0 | 77 |
| Items: Sentence | 719 | 120 | 53 | 0 |
| Items: Total | 1019 | 120 | 53 | 77 |
| Speakers: Male | 95 | 8 | 8 | 8 |
| Speakers: Female | 95 | 10 | 10 | 10 |
| Speakers: Total | 190 | 18 | 18 | 18 |

Structure of the L1 speech: 1271 utterances in total, 53 word items, 17 speakers (6 male, 11 female).

2. Usage
• Model training and testing
• The correlation between automatic scores and human scores is highest with Japanese-L1/English-L2 data (Wang et al., 2018)
Intro
Inter-rater score agreement metrics

| Metric | Min | Max | Interpretation |
|---|---|---|---|
| Pearson Correlation Coefficient (PCC) | -1 | 1 | Negative correlation (-1), no correlation (0), positive correlation (1) |
| Standard Mean Difference (SMD) | 0 | Inf | \|(between-group mean difference)/(std)\|; 0 means no between-group mean difference |
| Quadratic Weighted Kappa (QWK) | -1 | 1 | Total disagreement (-1), random agreement (0), total agreement (1) |
| Exact Percentage Agreement (EPA) | 0 | 100 | Percentage of exact matches between raters' scores |
| Adjacent Percentage Agreement (APA) | 0 | 100 | Percentage of exact matches or 1-point-different matches between raters' scores |
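For concreteness, here is a minimal sketch of the five metrics in Python (scipy/scikit-learn), assuming two aligned vectors of 1-5 Likert scores; the pooled-standard-deviation choice in |SMD| is an assumption, since the table does not specify which std is used.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def agreement_metrics(a, b):
    """All five rater-agreement metrics for two aligned score vectors."""
    a, b = np.asarray(a), np.asarray(b)
    pcc, p_value = pearsonr(a, b)
    # |SMD|: |between-group mean difference| / std (pooled std assumed here)
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    smd = abs(a.mean() - b.mean()) / pooled_std
    # QWK penalizes disagreements by the squared score distance
    qwk = cohen_kappa_score(a, b, weights="quadratic")
    epa = 100 * np.mean(a == b)              # exact matches
    apa = 100 * np.mean(np.abs(a - b) <= 1)  # exact or 1-point-apart matches
    return {"PCC": pcc, "p": p_value, "|SMD|": smd,
            "QWK": qwk, "EPA": epa, "APA": apa}
```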
2.1. Phoneme Evaluation

Method
Overview

[Diagram]
• STEP 1. Modeling: train an ASR-based phoneme scoring model on an English L1 speech DB and a pronunciation dictionary
• STEP 2. Scoring: predict phoneme scores for the ERJ L2 speech (5,674 utterances)
• STEP 3. Validation: machine-human agreement against 5 human raters' scores
Method
Modeling

1. Feature extraction
- Framing: 10 ms shift, 25 ms windowing
- Feature 1: MFCC per frame (speaker-independent feature)
- Feature 2: i-vector per utterance (speaker-dependent feature)

2. DNN-HMM based acoustic model training
- Word-to-pronunciation mapping is based on the CMU dictionary
- Librispeech: US L1 audiobook speech, 590k utterances from 2.5k speakers (982 hrs)
- IN: frame-wise features (MFCC + i-vector); OUT: 69 phonemes (24 consonants + 15 vowels x 3 stress levels, e.g. AH0, AH1, AH2, ..., NG, R, Z)
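A minimal sketch of the frame-level feature extraction (25 ms window, 10 ms shift) using librosa follows; the i-vector extractor is typically a separate toolkit (e.g. Kaldi) and is omitted here.

```python
import librosa

def frame_features(wav_path, sr=16000, n_mfcc=13):
    """Frame-wise MFCCs with a 25 ms window and 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),        # 25 ms window
                                hop_length=int(0.010 * sr))   # 10 ms shift
    return mfcc.T  # shape: (num_frames, n_mfcc)
```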
Method
Scoring

1. Forced alignment of ERJ L2 speech & text
- ERJ L2 speech with phoneme scores: 5,674 utterances
- Forced alignment using the DNN-HMM acoustic model, yielding frame-wise (10 ms) phoneme labels, e.g. "He is running." -> H IY1 IH1 Z R AH1 N IH0 NG

2. Percentage conversion of frame-wise phoneme log-likelihood
- Log-likelihood ranges from minus infinity to 0
- Log-likelihood distributions differ phoneme by phoneme in L1 speech
- Normalize against L1 phoneme log-likelihood statistics: e.g. IH1 frame-wise log-likelihoods (-60.72, -50.45, -23, -120) are converted to percentage scores (76, 82, 93, 51) based on the L1 IH1 log-likelihood statistics
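The slide does not spell out the exact conversion; one plausible reading, assumed here, maps each frame log-likelihood through the cumulative distribution of that phoneme's L1 log-likelihoods (the Gaussian assumption and the example statistics are hypothetical).

```python
from scipy.stats import norm

# hypothetical per-phoneme L1 log-likelihood statistics; real values would
# come from forced-aligned L1 speech
L1_STATS = {"IH1": {"mean": -45.0, "std": 25.0}}

def percent_score(phoneme, frame_llh):
    """Map a frame-wise log-likelihood to 0-100 via the L1 distribution."""
    s = L1_STATS[phoneme]
    return 100.0 * norm.cdf(frame_llh, loc=s["mean"], scale=s["std"])
```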
Method
Validation

1. Normalize phoneme scores per utterance into 1 to 5
- For comparison with the human raters' 1-5 Likert-scale scores
- E.g. IH1 frame-wise percentages (76, 82, 93, 51) -> IH1 mean 75.5 -> across-phoneme percentage mean 83 -> converted to a 1-5 Likert score of 4

2. Machine-Human agreement
- Human-Human agreement: agreement among the 5 human raters' scores, to test the credibility of the human scores
- Machine-Human agreement: to test the credibility of the machine scores
- Use the 5 score-agreement metrics

Result
Score agreement

Human-Human:

| Metric | H1-H2 | H1-H3 | H1-H4 | H1-H5 | H2-H3 | H2-H4 | H2-H5 | H3-H4 | H3-H5 | H4-H5 | H-H Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Num of utt | 5674 | 1890 | 1890 | 945 | 1890 | 1890 | 945 | 1890 | 945 | 945 | |
| PCC* | 0.69 | 0.59 | 0.55 | 0.49 | 0.54 | 0.57 | 0.51 | 0.49 | 0.46 | 0.54 | 0.54 |
| \|SMD\| | 0.06 | 0.09 | 0.46 | 0.67 | 0.15 | 0.41 | 0.35 | 0.56 | 0.68 | 0.12 | 0.35 |
| QWK | 0.69 | 0.59 | 0.49 | 0.39 | 0.53 | 0.53 | 0.49 | 0.41 | 0.36 | 0.54 | 0.50 |
| EPA | 47.02 | 51.96 | 35.56 | 33.23 | 46.24 | 36.82 | 37.78 | 32.22 | 32.70 | 41.48 | 39.50 |
| APA | 92.12 | 94.34 | 83.60 | 81.69 | 92.91 | 85.87 | 86.03 | 81.01 | 81.48 | 87.09 | 86.61 |

Human-Machine:

| Metric | H1-M | H2-M | H3-M | H4-M | H5-M | H-M Avg |
|---|---|---|---|---|---|---|
| Num of utt | 5674 | 5674 | 1890 | 1890 | 945 | |
| PCC* | 0.54 | 0.55 | 0.44 | 0.53 | 0.42 | 0.50 |
| \|SMD\| | 0.19 | 0.13 | 0.22 | 0.26 | 0.18 | 0.20 |
| QWK | 0.49 | 0.51 | 0.39 | 0.50 | 0.38 | 0.45 |
| EPA | 29.34 | 31.81 | 28.89 | 33.92 | 30.26 | 30.84 |
| APA | 73.37 | 74.09 | 76.93 | 77.99 | 74.39 | 75.35 |

Higher values mean higher agreement for PCC, QWK, EPA, and APA; lower |SMD| means higher agreement.

*PCC is statistically significant (P<0.001) in every condition.
Summary

| Metric | Human-Human Avg | Human-Machine Avg |
|---|---|---|
| PCC* | 0.54 | 0.50 |
| \|SMD\| | 0.35 | 0.20 |
| QWK | 0.50 | 0.45 |
| EPA | 39.50 | 30.84 |
| APA | 86.61 | 75.35 |

>> Hypothesis
- Models using the raw log-likelihood of the users' speech perform poorly (Hu et al., 2015).
- An ASR-based phoneme evaluation system will show human-level performance if native speakers' log-likelihood statistics are used to normalize the raw log-likelihood of the users.

>> Results
- Human raters' scores are credible.
- Human-Machine agreement is similar to Human-Human agreement.

>> Conclusion
- The ASR-based phoneme evaluation system shows human-level performance when native speakers' log-likelihood statistics are used to normalize the raw log-likelihood of the users.
- The credibility of SpeechPro (MediaZen Inc., 2020), the commercialized pronunciation evaluation system, is thereby supported.
2.2. Rhythm Evaluation

Method
Overview

[Diagram]
• STEP 1. Modeling: select rhythm features from ERJ L2 speech and train a multiple linear regression model on the ERJ mean rhythm scores
• STEP 2. Scoring: predict rhythm scores for the ERJ L2 test set, ERJ L1 speech, and Japanese speech (JNAS)
• STEP 3. Validation: human-machine agreement against 4 human raters' scores, plus a comparison of predicted rhythm scores across Jap L1, Eng L1, and Eng L2
Method
Modeling

1. Rhythm feature selection

List up 27 English rhythm features from previous studies
(Cucchiarini et al., 2000; Ramus et al., 2000; Yamashita et al., 2005; Arias et al., 2010; Honig et al., 2012; Prince, 2014; Kim, 2020)

| Group | Type - Unit |
|---|---|
| Duration | mean syllable; mean phone; mean vowel; mean consonant |
| Global Interval Proportions (= gross statistics on segmental durations) | std syllable; varco syllable; std consonant; varco consonant; std vowel; varco vowel; ratio vowel |
| Isochrony | mean stressed syllable; std stressed syllable; mean unstressed syllable; std unstressed syllable |
| Pairwise Variability Index (= local changes in durations) | raw syllable; raw vowel; raw consonant; normalized syllable; normalized vowel; normalized consonant |
| Pause | mean; std; ratio |
| Speech rate (= number of units per second) | word; syllable; phone |
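For concreteness, a few of these features computed from forced-alignment interval durations, using the standard definitions from the rhythm-metrics literature (the x100 scaling is conventional rather than specified on the slide).

```python
import numpy as np

def varco(durations):
    """Variation coefficient of interval durations: 100 * std / mean."""
    d = np.asarray(durations, dtype=float)
    return 100 * d.std(ddof=1) / d.mean()

def rpvi(durations):
    """Raw Pairwise Variability Index: mean |d_k - d_{k+1}|."""
    d = np.asarray(durations, dtype=float)
    return np.mean(np.abs(np.diff(d)))

def npvi(durations):
    """Normalized PVI: pairwise differences scaled by local means, x100."""
    d = np.asarray(durations, dtype=float)
    return 100 * np.mean(np.abs(np.diff(d)) / ((d[:-1] + d[1:]) / 2))

def vowel_ratio(vowel_durations, all_durations):
    """Proportion of speech time spent in vocalic intervals (%V)."""
    return sum(vowel_durations) / sum(all_durations)
```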
Method
Modeling

2. Rhythm feature extraction from ERJ L2
- ERJ L2 speech with rhythm scores: 950 utterances
- Forced alignment with the DNN-HMM acoustic model
- Rhythm feature extraction using the alignment information

3. Modeling the rhythm scoring system
- Multiple linear regression model: the 27 rhythm features are the input, and the rhythm score is the output
- Trainset: pairs of the 27 rhythm features and the mean of the 4 human raters' rhythm scores for the 950 utterances

4. Re-select the rhythm features showing the best performance (see the sketch below)
- Analyze which rhythm features perform best in the MLR
- Re-select features using RFECV (Recursive Feature Elimination with Cross-Validation)
- Finalize the model with the re-selected features
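A minimal sketch of this pipeline with scikit-learn; the random arrays are placeholders for the real feature/score pairs.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# placeholders: 27 rhythm features per utterance and the mean of the
# 4 human raters' rhythm scores (950 utterances in the thesis)
X = np.random.rand(950, 27)
y = np.random.uniform(1, 5, size=950)

# RFECV drops the weakest features one at a time under cross-validation
selector = RFECV(LinearRegression(), step=1, cv=5).fit(X, y)
print("features kept:", selector.support_.sum())  # 23 of 27 in the thesis

model = LinearRegression().fit(X[:, selector.support_], y)
predicted_rhythm_scores = model.predict(X[:, selector.support_])
```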


Method
Scoring & Validation

1. ERJ L2 rhythm score prediction
- Select the top 10% of the trainset showing the smallest standard deviations among human raters (Loukina et al., 2018)
- Forced alignment with the English DNN-HMM acoustic model
- Feature extraction
- Rhythm score prediction
- Human-Machine agreement test —> validation of the rhythm evaluation model's performance

2. ERJ L1 rhythm score prediction
- ERJ L1: 1,271 utterances
- Forced alignment with the English DNN-HMM acoustic model
- Feature extraction
- Rhythm score prediction
- Hypothesis: higher rhythm scores than ERJ L2

3. JNAS rhythm score prediction
- JNAS: adult Japanese L1 read-speech DB (88,156 utterances)
- Forced alignment with a Japanese DNN-HMM acoustic model
- Feature extraction
- Rhythm score prediction
- Hypothesis: lower rhythm scores than ERJ L2 (because it is an English rhythm scoring system)

Steps 2 and 3 support an analysis of the rhythm features and the rhythm scoring model by comparing Eng L1 / Eng L2 / Jap L1 data.
Result
Feature Selection

23 features are selected out of the 27. The 4 eliminated features are: nPVI-Consonant, Varco-Consonant, Varco-Syllable, and Vowel ratio.

| Feature | RFECV rank | Pearson r | P | One-way ANOVA F | P |
|---|---|---|---|---|---|
| global_interval_proportion-std_consonant | 1 | -0.08 | p<0.01 | 14564.50 | p<0.01 |
| global_interval_proportion-std_syllable | 1 | -0.03 | 0.28 | 13943.87 | p<0.01 |
| global_interval_proportion-std_vowel | 1 | -0.06 | p<0.1 | 14471.65 | p<0.01 |
| global_interval_proportion-varco_consonant | 3 | 0.06 | 0.13 | 7453.53 | p<0.01 |
| global_interval_proportion-varco_syllable | 2 | -0.20 | p<0.1 | 13850.67 | p<0.01 |
| global_interval_proportion-varco_vowel | 1 | -0.16 | p<0.1 | 13153.29 | p<0.01 |
| global_interval_proportion-vowel_ratio | 5 | -0.17 | p<0.1 | 14676.94 | p<0.01 |
| isochrony-mean_between_stressed_syllable | 1 | -0.10 | p<0.001 | 14527.06 | p<0.01 |
| isochrony-mean_between_unstressed_syllable | 1 | -0.13 | p<0.001 | 14120.54 | p<0.01 |
| isochrony-std_between_stressed_syllable | 1 | -0.17 | p<0.001 | 14017.65 | p<0.01 |
| isochrony-std_between_unstressed_syllable | 1 | -0.16 | p<0.01 | 12523.05 | p<0.01 |
| mean_duration-consonant | 1 | -0.16 | p<0.001 | 13863.99 | p<0.01 |
| mean_duration-phone | 1 | 0.02 | p<0.001 | 13657.17 | p<0.01 |
| mean_duration-syllable | 1 | 0.01 | p<0.001 | 6990.42 | p<0.01 |
| mean_duration-vowel | 1 | -0.19 | p<0.001 | 14125.74 | p<0.01 |
| npvi-consonant | 4 | -0.22 | p<0.01 | 14490.86 | p<0.01 |
| npvi-syllable | 1 | -0.20 | 0.58 | 14635.00 | p<0.01 |
| npvi-vowel | 1 | -0.14 | 0.66 | 14525.74 | p<0.01 |
| pause-mean | 1 | -0.04 | p<0.001 | 13657.17 | p<0.01 |
| pause-ratio | 1 | -0.06 | p<0.001 | 14376.72 | p<0.01 |
| pause-std | 1 | -0.17 | p<0.001 | 14017.65 | p<0.01 |
| rpvi-consonant | 1 | -0.16 | p<0.001 | 12523.05 | p<0.01 |
| rpvi-syllable | 1 | -0.16 | 0.17 | 11828.71 | p<0.01 |
| rpvi-vowel | 1 | 0.06 | p<0.1 | 7946.00 | p<0.01 |
| srate-phone | 1 | -0.05 | p<0.001 | 9257.23 | p<0.01 |
| srate-syllable | 1 | -0.09 | p<0.001 | 9591.81 | p<0.01 |
| srate-word | 1 | -0.08 | p<0.001 | 18285.23 | p<0.01 |
Result
Score agreement

| Metric | H1-H2 | H1-H3 | H1-H4 | H2-H3 | H2-H4 | H3-H4 | H-H Avg | H1-M | H2-M | H3-M | H4-M | H-M Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Num of items | 950 | 950 | 950 | 950 | 950 | 950 | | 950 | 950 | 950 | 950 | |
| PCC* | 0.53 | 0.55 | 0.46 | 0.45 | 0.56 | 0.42 | 0.50 | 0.24 | 0.43 | 0.29 | 0.37 | 0.33 |
| \|SMD\| | 0.06 | 0.40 | 0.18 | 0.33 | 0.24 | 0.57 | 0.30 | 0.35 | 0.29 | 0.03 | 0.50 | 0.29 |
| QWK | 0.53 | 0.50 | 0.45 | 0.42 | 0.55 | 0.35 | 0.47 | 0.18 | 0.37 | 0.24 | 0.28 | 0.27 |
| EPA | 38.21 | 40.11 | 39.05 | 34.00 | 39.47 | 30.95 | 36.96 | 24.21 | 26.84 | 24.63 | 25.47 | 25.29 |
| APA | 84.95 | 86.63 | 81.26 | 83.79 | 85.37 | 75.79 | 82.96 | 62.84 | 71.68 | 67.89 | 64.63 | 66.76 |

Higher values mean higher agreement for PCC, QWK, EPA, and APA; lower |SMD| means higher agreement.

*PCC is statistically significant (P<0.001) in every condition.
Summary

| Metric | Human-Human Avg | Human-Machine Avg |
|---|---|---|
| PCC* | 0.50 | 0.33 |
| \|SMD\| | 0.30 | 0.29 |
| QWK | 0.47 | 0.27 |
| EPA | 36.96 | 25.29 |
| APA | 82.96 | 66.76 |

>> Hypothesis
- A multiple-linear-regression rhythm scoring model trained with rhythm features shown to be suitable in previous studies will show human-level performance.

>> Results
- Human raters' scores are credible.
- Human-Machine agreement is similar to Human-Human agreement.

>> Conclusion
- The rhythm scoring model based on multiple linear regression is credible, but further validation is required by comparing its predicted scores on Eng L1 and Jap L1 speech.
Result
ERJ L1 vs ERJ L2 vs JNAS

Mean and variance of model-predicted scores:

| Testset | N of utterances | ALL features (27) mean | var | BEST features (23) mean | var |
|---|---|---|---|---|---|
| ERJ L1 | 1271 | 3.780 | 0.287 | 3.805 | 0.268 |
| ERJ L2 | 95 | 3.394 | 0.134 | 3.395 | 0.145 |
| JNAS | 88156 | 3.831 | 0.125 | 3.786 | 0.115 |

Rankings of model-predicted scores:

| Rank | ALL mean | ALL var | BEST mean | BEST var |
|---|---|---|---|---|
| 1 | JNAS | L1 | L1 | L1 |
| 2 | L1 | L2 | JNAS | L2 |
| 3 | L2 | JNAS | L2 | JNAS |

• Hyp1: L1 > L2 —> TRUE
• Hyp2: L2 > JNAS —> FALSE
• The rankings depend on how the rhythm features are grouped
Result
ERJ L1 vs ERJ L2 vs JNAS

Mean and variance of model-predicted scores by feature set:

| Testset | gip (7) mean | var | duration (4) mean | var | npvi (3) mean | var | rpvi (3) mean | var | srate (3) mean | var | pause (3) mean | var | isochrony (4) mean | var |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERJ L1 | 3.536 | 0.045 | 3.576 | 0.065 | 3.412 | 0.004 | 3.450 | 0.016 | 3.530 | 0.040 | 3.410 | 0.030 | 3.462 | 0.023 |
| ERJ L2 | 3.387 | 0.033 | 3.398 | 0.051 | 3.393 | 0.010 | 3.396 | 0.018 | 3.383 | 0.027 | 3.350 | 0.065 | 3.378 | 0.036 |
| JNAS | 3.641 | 0.006 | 3.638 | 0.011 | 3.452 | 0.001 | 3.567 | 0.003 | 3.618 | 0.011 | 3.293 | 0.037 | 3.303 | 0.013 |

Rankings of model-predicted scores by feature set (mean / var):

| Rank | gip (7) | duration (4) | npvi (3) | rpvi (3) | srate (3) | pause (3) | isochrony (4) |
|---|---|---|---|---|---|---|---|
| 1 | JNAS / L1 | JNAS / L1 | JNAS / L2 | JNAS / L2 | JNAS / L1 | L1 / L2 | L1 / L2 |
| 2 | L1 / L2 | L1 / L2 | L1 / L1 | L1 / L1 | L1 / L2 | L2 / JNAS | L2 / L1 |
| 3 | L2 / JNAS | L2 / JNAS | L2 / JNAS | L2 / JNAS | L2 / JNAS | JNAS / L1 | JNAS / JNAS |

Interpretation 1: Global Interval Proportion, Duration, nPVI, rPVI, Speech rate —> nativeness (L1 speech, whether English or Japanese, scores high)
Interpretation 2: Pause, Isochrony —> English rhythm specifically
Interpretation 3: With all the features combined, the effects neutralize and the resulting ranking is L1 > JNAS > L2 (BEST features)
Summary

>> Hypothesis
- A rhythm scoring model trained with English rhythm scores will predict scores in the following order: Eng L1 > Eng L2 > Jap L1.

>> Results
- The overall result is Eng L1 > Jap L1 > Eng L2, confirming Eng L1 > Eng L2.
- Trained on sentence-internal pause and isochrony features, the model yields Eng L1 > Eng L2 > Jap L1.
- Trained on duration-related features of language units, it yields Jap L1 > Eng L1 > Eng L2, rejecting the hypothesized order.

>> Conclusion
- Eng L1 > Eng L2: the rhythm scoring model is credible.
- Sentence-internal pause and isochrony are "English"-specific rhythm features.
- Duration-related features of language units are generic "L1" (nativeness) rhythm features.
- In particular, isochrony captures the averaged durations of consecutive stressed/unstressed syllables, reflecting the intrinsic rhythmic difference between more stress-timed English and less stress-timed Japanese.
2.3. Intonation Evaluation

Method
Overview

[Diagram]
• STEP 1. Modeling: extract features from ERJ L1 and L2 speech and train a multiple linear regression model on the ERJ mean intonation scores
• STEP 2. Scoring: predict intonation scores for the ERJ L2 test set
• STEP 3. Validation: human-model agreement against 4 human raters' scores
Method
Modeling

1. Intonation feature extraction from ERJ L2 & L1 speech
- ERJ L2 speech with intonation scores: 950 utterances
- ERJ L1 speech reading the same sentences as ERJ L2 (1 audio per sentence)
- Dynamic Time Warping (DTW) over the MFCCs of the L1 and L2 speech
- Align the pitch sequences (interpolated + smoothed) of L1 and L2 based on the DTW alignment
- Extract 8 similarity features from the 2 aligned pitch sequences (Yamashita et al., 2005; Arias et al., 2010; Yarra and Ghosh, 2018):
  • Trend similarity correlation (= normalized dot product(A, B) / (std A x std B))
  • Trend similarity Euclidean distance
  • Trend similarity correlation of the derivative
  • Trend similarity Euclidean distance of the derivative
  • F_PD
  • F_SD
  • F_ND1
  • F_ND2
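A rough sketch of the alignment and the first similarity feature using librosa; YIN pitch tracking stands in for whatever pitch extractor was actually used, and the interpolation/smoothing step is omitted.

```python
import numpy as np
import librosa

def pitch_trend_similarity(y_l1, y_l2, sr=16000):
    """DTW-align two utterances on MFCCs, then correlate their pitch tracks."""
    m1 = librosa.feature.mfcc(y=y_l1, sr=sr)
    m2 = librosa.feature.mfcc(y=y_l2, sr=sr)
    _, wp = librosa.sequence.dtw(X=m1, Y=m2)   # frame-level warping path
    wp = wp[::-1]                              # path is returned end-to-start
    f0_1 = librosa.yin(y_l1, fmin=50, fmax=500, sr=sr)
    f0_2 = librosa.yin(y_l2, fmin=50, fmax=500, sr=sr)
    a = f0_1[np.minimum(wp[:, 0], len(f0_1) - 1)]
    b = f0_2[np.minimum(wp[:, 1], len(f0_2) - 1)]
    # trend similarity correlation: normalized dot product / (std_a * std_b)
    a, b = a - a.mean(), b - b.mean()
    return float(np.dot(a, b) / (len(a) * a.std() * b.std()))
```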
Method
Modeling

2. Modeling the intonation scoring system
- Multiple linear regression model: the 8 intonation features are the input, and the intonation score is the output
- Trainset: pairs of the 8 intonation features and the mean of the 4 human raters' intonation scores for the 950 utterances
Method
Scoring & Validation

1. ERJ L2 intonation score prediction
- Select the top 10% of the trainset showing the smallest standard deviations among human raters (Loukina et al., 2018)
- Extract the pitch-similarity features between L1 and L2
- Predict intonation scores
- Human-Machine agreement test

2. Visualization of L1 and L2 intonation
- Plot the two DTW-aligned pitch contours
- Compare the pitch contours against the machine/human scores
Result
Pitch comparison plots

[Figure: DTW-aligned pitch contours; red = learner, green = native]

| Sentence | Model, H1, H2, H3, H4 |
|---|---|
| That's from my brother who lives in London. | 5, 3, 4, 3, 4 |
| Fred ate the beans. | 2, 1, 2, 1, 1 |
| The play ended, happily. | 5, 2, 4, 3, 5 |
Result
Score agreement

| Metric | H1-H2 | H1-H3 | H1-H4 | H2-H3 | H2-H4 | H3-H4 | H-H Avg | H1-M | H2-M | H3-M | H4-M | H-M Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Num of items | 950 | 950 | 950 | 950 | 950 | 950 | | 950 | 950 | 950 | 950 | |
| PCC* | 0.45 | 0.49 | 0.54 | 0.35 | 0.47 | 0.38 | 0.45 | 0.22 | 0.17 | 0.16 | 0.19 | 0.19 |
| \|SMD\| | 0.89 | 0.34 | 0.61 | 0.61 | 0.14 | 0.35 | 0.49 | 0.86 | 0.04 | 0.57 | 0.11 | 0.39 |
| QWK | 0.32 | 0.46 | 0.45 | 0.30 | 0.44 | 0.33 | 0.38 | 0.16 | 0.17 | 0.15 | 0.18 | 0.17 |
| EPA | 26.11 | 37.89 | 28.95 | 31.16 | 32.84 | 23.79 | 30.12 | 22.95 | 32.63 | 28.95 | 27.79 | 28.08 |
| APA | 69.26 | 85.37 | 70.42 | 79.68 | 79.47 | 70.84 | 75.84 | 65.89 | 79.37 | 74.95 | 67.89 | 72.03 |

Higher values mean higher agreement for PCC, QWK, EPA, and APA; lower |SMD| means higher agreement.

*PCC is statistically significant (P<0.001) in every condition.
Summary
Discussion

| Metric | Human-Human Avg | Human-Machine Avg |
|---|---|---|
| PCC* | 0.45 | 0.19 |
| \|SMD\| | 0.49 | 0.39 |
| QWK | 0.38 | 0.17 |
| EPA | 30.12 | 28.08 |
| APA | 75.84 | 72.03 |

>> Hypothesis
- A multiple-linear-regression intonation scoring model trained with L1-L2 pitch-similarity features will show human-level performance.

>> Results
- Human raters' scores are credible.
- Human-Machine agreement is similar to Human-Human agreement.
- The pitch-contour comparison also supports the model's validity.

>> Conclusion
- The multiple-linear-regression intonation scoring model trained with L1-L2 pitch-similarity features shows human-level performance.
2.4. Stress Evaluation

Intro
Overview

Two approaches are compared:
1. ASR-based stress scoring system
2. Multiple-linear-regression-based stress scoring system
Method-1
Overview

[Diagram]
• STEP 1. Modeling: train an ASR-based vowel stress recognition model on an English L1 speech DB and a pronunciation dictionary
• STEP 2. Scoring: predict stress scores for the ERJ L2 test set
• STEP 3. Validation: human-model agreement against the mean of 2 human raters' scores
Method-1
Modeling

1. DNN-HMM based acoustic model
- The same acoustic model used for the phoneme scoring system (IN: frame-wise MFCC + i-vector features; OUT: 69 phonemes = 24 consonants + 15 vowels x 3 stresses)
- Word-to-pronunciation mapping is based on the CMU dictionary
- Stress marks: 0 = no stress, 1 = primary stress, 2 = secondary stress

2. Vowel stress recognition model
- Multiply out the reference pronunciation by applying the 3 stress types {0, 1, 2} to every vowel
- Build the vowel stress recognizer by combining this expanded pronunciation dictionary with the DNN-HMM acoustic model
- E.g. "About" (AH B AW T, 2 vowels) yields 3 x 3 = 9 candidate pronunciations (AH0 B AW0 T, AH0 B AW1 T, ..., AH2 B AW2 T), any of which can be recognized
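A sketch of the lexicon expansion; the vowel set is the standard CMU/ARPABET inventory.

```python
from itertools import product

# ARPABET vowels in the CMU dictionary (15, as counted above)
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def stress_variants(pron):
    """Expand 'AH0 B AW1 T' into all stress combinations of its vowels."""
    slots = []
    for p in pron.split():
        base = p.rstrip("012")
        slots.append([base + s for s in "012"] if base in VOWELS else [p])
    return [" ".join(combo) for combo in product(*slots)]

print(len(stress_variants("AH0 B AW1 T")))  # 3 x 3 = 9, as in "About"
```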
Method-1
Scoring & Validation

1. ERJ L2 stress score prediction
- ERJ L2 speech with stress scores: 1,900 utterances
- Predict stress scores using the DNN-HMM based English vowel stress recognition system
- Compare the vowel stresses between the reference pronunciation and the hypothesis (recognized) pronunciation
- Measure the ratio of stress-symbol matches, e.g.
  (Ref) AH0 B AW1 T vs. (Hyp) AH1 B AW0 T
  => (num of stress matches) / (total num of vowels) = 0/2 = 0
- Human-Machine agreement test
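The stress-match ratio itself is a short function over the vowel positions:

```python
def stress_match_ratio(ref, hyp):
    """Fraction of vowels whose stress mark matches the reference."""
    ref_vowels = [p for p in ref.split() if p[-1] in "012"]
    hyp_vowels = [p for p in hyp.split() if p[-1] in "012"]
    matches = sum(r == h for r, h in zip(ref_vowels, hyp_vowels))
    return matches / len(ref_vowels)

print(stress_match_ratio("AH0 B AW1 T", "AH1 B AW0 T"))  # 0/2 = 0.0
```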
Result-1
Score agreement

| Metric | H1-H2 | H-H Avg | 3-way (0,1,2): H1-M | H2-M | Avg | 2-way (unstressed, stressed): H1-M | H2-M | Avg |
|---|---|---|---|---|---|---|---|---|
| Num of items | 1900 | | 1900 | 1900 | | 1900 | 1900 | |
| PCC* | 0.45 | 0.45 | 0.07 (P=0.002) | 0.07 (P=0.003) | 0.07 | 0.04 (P=0.09) | 0.005 (P=0.84) | 0.02 |
| \|SMD\| | 0.69 | 0.69 | 0.62 | 0.03 | 0.33 | 0.11 | 0.53 | 0.32 |
| QWK | 0.35 | 0.35 | 0.04 | 0.06 | 0.05 | 0.04 | 0.00 | 0.02 |
| EPA | 34.00 | 34.00 | 16.74 | 9.11 | 12.92 | 25.37 | 13.32 | 19.34 |
| APA | 75.32 | 75.32 | 52.95 | 49.68 | 51.32 | 64.11 | 49.74 | 56.92 |

Higher values mean higher agreement for PCC, QWK, EPA, and APA; lower |SMD| means higher agreement.

*PCC is statistically significant (P<0.001) except in the conditions where a P-value is shown.
Summary-1
Discussion

| Metric | Human-Human | Human-Machine 3-way | Human-Machine 2-way |
|---|---|---|---|
| PCC* | 0.45 | 0.07 | 0.02 |
| \|SMD\| | 0.69 | 0.33 | 0.32 |
| QWK | 0.35 | 0.05 | 0.02 |
| EPA | 34.00 | 12.92 | 19.34 |
| APA | 75.32 | 51.32 | 56.92 |

>> Hypothesis
- The ASR-based stress scoring system will show human-level performance.
- The {unstressed, stressed} 2-way stress scoring will perform better than the {0, 1, 2} 3-way stress scoring.

>> Results
- Human raters' scores are credible.
- Human-Machine agreement is not equivalent to Human-Human agreement.
- The 2-way stress scoring is better than the 3-way scoring.

>> Conclusion
- The ASR-based stress scoring system does not show human-level performance.
Method-2
Overview

[Diagram]
• STEP 1. Modeling: extract features from ERJ L1 and L2 speech and train a multiple linear regression model on the ERJ mean stress scores
• STEP 2. Scoring: predict stress scores for the ERJ L2 test set
• STEP 3. Validation: machine-human agreement against the mean of 2 human raters' scores
Method-2
Modeling

1. Stress feature extraction from ERJ L2 & L1 speech
- ERJ L2 speech with stress scores: 1,900 utterances
- ERJ L1 speech reading the same sentences as ERJ L2 (1 audio per sentence)
- Dynamic Time Warping (DTW) over the MFCCs of the L1 and L2 speech
- Align the pitch (interpolated + smoothed) and energy (interpolated) sequences of L1 and L2 based on the DTW alignment
- Extract 3 similarity features from the aligned pitch and energy sequences (Arias et al., 2010):
  • Pitch trend similarity correlation (= normalized dot product(A, B) / (std A x std B))
  • Energy trend similarity correlation
  • Energy-pitch KL divergence (= alpha x energy trend similarity - (1 - alpha) x pitch trend similarity)
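A sketch of the energy track and the combined feature: the energy is computed as frame-wise RMS (an assumption), and the combination formula is taken verbatim from the slide with alpha as a tunable weight.

```python
import librosa

def energy_track(y):
    """Frame-wise RMS energy (interpolation/smoothing omitted for brevity)."""
    return librosa.feature.rms(y=y)[0]

def energy_pitch_feature(energy_sim, pitch_sim, alpha=0.5):
    """Combined feature as written on the slide; alpha is an assumed weight."""
    return alpha * energy_sim - (1 - alpha) * pitch_sim
```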
Method-2
Scoring & Validation

1. ERJ L2 stress score prediction
- Select the top 10% of the trainset showing the smallest standard deviations among human raters (Loukina et al., 2018)
- Extract the pitch and energy similarity features between L1 and L2
- Predict stress scores
- Human-Machine agreement test
Result-2
Score agreement

| Metric | H1-H2 | H-H Avg | H1-M | H2-M | H-M Avg |
|---|---|---|---|---|---|
| Num of items | 1900 | | 1900 | 1900 | |
| PCC* | 0.45 | 0.45 | 0.17 | 0.10 | 0.13 |
| \|SMD\| | 0.69 | 0.69 | 0.34 | 0.43 | 0.39 |
| QWK | 0.35 | 0.35 | 0.16 | 0.09 | 0.12 |
| EPA | 34.00 | 34.00 | 36.53 | 29.16 | 32.84 |
| APA | 75.32 | 75.32 | 87.00 | 76.16 | 81.58 |

Higher values mean higher agreement for PCC, QWK, EPA, and APA; lower |SMD| means higher agreement.

*PCC is statistically significant (P<0.001) in every condition.
Summary-2
Discussion

| Metric | Human-Human Avg | Human-Machine Avg |
|---|---|---|
| PCC* | 0.45 | 0.13 |
| \|SMD\| | 0.69 | 0.39 |
| QWK | 0.35 | 0.12 |
| EPA | 34.00 | 32.84 |
| APA | 75.32 | 81.58 |

>> Hypothesis
- A multiple-linear-regression stress scoring model trained with L1-L2 pitch and energy similarity features will show human-level performance.

>> Results
- Human raters' scores are credible.
- Human-Machine agreement is similar to Human-Human agreement.

>> Conclusion
- The multiple-linear-regression stress scoring model trained with L1-L2 pitch and energy similarity features shows human-level performance.
3. Effectiveness of Automated Phoneme Evaluation

Intro
Overview

1. Goal
Examine the effectiveness of the score feedback of the automated pronunciation evaluation system for non-native English learners

2. Participants
17 Korean undergraduate students

3. Experiment
Practice pronouncing English words with real-time pronunciation scores from the automated English pronunciation evaluation system
3 types of pronunciation score feedback: word score, syllable score, phoneme score

4. Survey
User experience of the pronunciation evaluation system: subjective evaluation by the participants
Method
Participants

• Affiliation & number: 17 undergraduates from the GWNU Gangneung campus
  • Enrolled in the English I class (3-credit, regular English conversation class)
  • Class participation points were given to volunteers for the experiment
• Gender: 14 male, 3 female
• Year: 12 first-year, 1 second-year, 1 third-year, 3 fourth-year
• Proficiency: at the beginning of the semester, every student took a mock TOEIC speaking test: 1 novice, 9 intermediate-low, 6 intermediate-mid, 1 advanced
• Major: 15 engineering, 2 humanities
Method
Experiment - Stimuli preparation

• A list of English phonemes Korean English learners find hard to pronounce (Bauman, 2006; Barrass, 2017)
• Phonemes used in the experiment:
  (1) /r/ vs /l/  (2) lax /ɪ/ vs tense /i/  (3) consonant cluster /str/
• Prepare tongue-twister-style 3-word sequences containing each target phoneme (pair):

| /r/ vs /l/ | Lax /ɪ/ vs tense /i/ | Consonant cluster /str/ |
|---|---|---|
| Working while walking | Cheap ship trip | Strange strategic statistics |
Method
Experiment - Automatic Pronunciation Evaluation System

[Screenshot: the word to practice, a recording button, playback of the recording, and the real-time score]

• Cond1: word score feedback (mean of phoneme scores per word)
• Cond2: phoneme score feedback (individual phoneme scores)
• Cond3: syllable score feedback (mean of phoneme scores per syllable)

Visual feedback for each score range:
• 80 and above: High
• 60-80: Intermediate
• Below 60: Low

Practice items: /r/ vs /l/ "Working while walking"; lax /ɪ/ vs tense /i/ "Cheap ship trip"; consonant cluster /str/ "Strange strategic statistics"
Method
Experiment - Procedure

1. The participants are guided to practice pronouncing the words at the provided URL during the last 15 minutes of each hour (in the order: word score type, phoneme score type, syllable score type).

2. Right before the 1st practice, the professor demonstrates how to record and how to interpret the score feedback, using the test sentence "This is a test."

3. The participants access the URL and log in with their student IDs on their own PCs. (The class is online, so they have to find their own private places with a strong Internet connection and no noise.)

4. The participants must practice a word sequence at least 5 times in the 15 minutes.

5. After the practice, they answer 21 questions on their user experience of the system (Google Forms).
Method
Experiment - Survey for user experience

ID Question Response type


1 Using the automatic pronunciation evaluation system enhances opportunities to speak English. 1 to 5 Likert scale
2 Using the automatic pronunciation evaluation system enhances opportunities to learn pronunciation. 1 to 5 Likert scale
3 I enjoy practicing English pronunciation with the automatic pronunciation evaluation system. 1 to 5 Likert scale
4 Using the automatic pronunciation evaluation system promotes my motivation to speak English. 1 to 5 Likert scale
5 Using the automatic pronunciation evaluation system improved my English pronunciation. 1 to 5 Likert scale
6 I feel embarrassed while practicing English with teachers and classmates. 1 to 5 Likert scale
7 I feel embarrassed while practicing English with the automatic pronunciation evaluation system. 1 to 5 Likert scale
8 I am afraid of making mistakes while practicing English with teachers and classmates. 1 to 5 Likert scale
9 I am afraid of making mistakes while practicing English with the automatic pronunciation evaluation system. 1 to 5 Likert scale
10 I would like to use the automatic pronunciation evaluation system for additional learning. 1 to 5 Likert scale
11 The information of the automatic pronunciation evaluation system is clear. 1 to 5 Likert scale
12 The feedback was helpful in improving my English speaking. 1 to 5 Likert scale
13 The voice recording playback was helpful. 1 to 5 Likert scale
14 Word score feedback was helpful. 1 to 5 Likert scale
15 Phoneme score feedback was helpful. 1 to 5 Likert scale
16 Syllable score feedback was helpful. 1 to 5 Likert scale
17 Which feedback type (word, phoneme, syllable) was the most helpful? Word, Phone, Syllable
18 The automatic pronunciation evaluation system would have been more helpful if a model pronunciation recording was provided. 1 to 5 Likert scale
19 Do you have any suggestions regarding the use of the automatic pronunciation evaluation system? Open Response
20 Did you encounter any problems while using the automatic pronunciation evaluation system for practicing your English pronunciation? If so, please describe them. Open Response
21 Do you have any other comments or feedback about the automatic pronunciation evaluation system? Open Response

1 indicates “Strongly Disagree” while 5 indicates “Strongly Agree”


Result
Overview

RQ0: Does the automated pronunciation evaluation system have a positive effect on pronunciation score enhancement and learner satisfaction?

| RQ | X | Y | Objective assessment | Subjective assessment |
|---|---|---|---|---|
| RQ1 | Feedback type | Pronunciation improvement | Y/N, how | Y/N, how |
| RQ2 | Feedback type | Participation | Y/N, how | Y/N, how |
| RQ3 | English proficiency | Pronunciation improvement | Y/N, how | Y/N, how |
| RQ4 | English proficiency | Participation | Y/N, how | Y/N, how |

• Score feedback type: word, syllable, or phoneme score
• English proficiency: mock TOEIC speaking score
• Score improvement: (obj) 5th-trial score minus 1st-trial score; (subj) survey responses
• Participation: (obj) number of practices; (subj) survey responses
Result
Feedback type on pronunciation improvement (obj)

[Figure: score distributions across the 5 trials]
• In general, pronunciation scores increase across the 5 practice trials.
• In the 5th trial, the mean increases while the standard deviation decreases compared to the 1st trial.
Result
Feedback type on pronunciation improvement (obj)

The effect of feedback type (one-way ANOVA):

| | F-value | P-value |
|---|---|---|
| Trial 1 score | 5.469423 | 0.007247*** |
| Trial 5 score | 8.764018 | 0.00057*** |
| Improvement | 1.156243 | 0.323274 |

Trial 1 vs trial 5 scores (t-test):

| Type | t | df | p | Cohen's d | Pearson's r |
|---|---|---|---|---|---|
| all | -2.46 | 32 | 0.020** | 0.431 | 0.398 |
| word | -2.10 | 32 | 0.046** | 0.378 | 0.349 |
| phoneme | -0.95 | 32 | 0.349 | 0.17 | 0.166 |
| syllable | -2.16 | 32 | 0.039** | 0.38 | 0.356 |

***p<0.01, **p<0.05, *p<0.1

• Feedback type has no statistically significant effect on score improvement, but it has a statistically significant effect on both the trial 1 and trial 5 scores.
• In general, scores in trial 1 and trial 5 are significantly different.
• In the phoneme score type, however, there is no significant difference between trial 1 and trial 5 (Kim et al., 2020).

(A sketch of these tests appears below.)
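A sketch of these tests with scipy, assuming a paired t-test (inferred from df = 32) and hypothetical score arrays.

```python
import numpy as np
from scipy import stats

# hypothetical per-condition scores (17 learners x 3 feedback types)
word, syllable, phoneme = np.random.rand(3, 17) * 100

# one-way ANOVA: does feedback type affect the score?
f_stat, p_anova = stats.f_oneway(word, syllable, phoneme)

# trial 1 vs trial 5 within the same learners (df = 32 on the slide)
trial1, trial5 = np.random.rand(2, 33) * 100
t_stat, p_ttest = stats.ttest_rel(trial1, trial5)
cohens_d = (trial5.mean() - trial1.mean()) / np.sqrt(
    (trial1.var(ddof=1) + trial5.var(ddof=1)) / 2)
```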
Result
Feedback type on participation (obj)

Practice count for each feedback type:

| Type | Mean | Std |
|---|---|---|
| Word | 51.59 | 29.39 |
| Syllable | 34.18 | 15.01 |
| Phoneme | 39.18 | 16.82 |

One-way ANOVA (practice count ~ feedback type):

| | F-value | P-value |
|---|---|---|
| # practices | 2.987379 | 0.05987* |

***p<0.01, **p<0.05, *p<0.1

• Feedback type has an effect on learners' participation.
• The number of practices follows the order: word > phoneme > syllable.
Result
Feedback type on participation (obj)

| # practices & trial 1 score | Pearson's r | P-value |
|---|---|---|
| All | -0.170 | 0.513 |
| Word | -0.432* | 0.083 |
| Syllable | 0.062 | 0.812 |
| Phone | -0.024 | 0.928 |

| # practices & score improvement | Pearson's r | P-value |
|---|---|---|
| All | 0.134 | 0.607 |
| Word | 0.440* | 0.077 |
| Syllable | 0.143 | 0.584 |
| Phone | -0.145 | 0.579 |

***p<0.01, **p<0.05, *p<0.1

• Learners practice more when their trial 1 score is low.
• This negative correlation between the trial 1 score and the number of practices is clearest in the word score type.
—> With less information in the feedback (lower granularity), learners make more effort to overcome a low score through repeated practice.
• In general, the number of practices and the score improvement are positively correlated (most clearly in the word score type), though the correlation is negative in the phoneme score type.
Result
English proficiency on pronunciation improvement (obj)

| TOEIC score & trial 1 score | Pearson's r | P-value |
|---|---|---|
| All | 0.305 | 0.233 |
| Word | 0.309 | 0.228 |
| Syllable | 0.201 | 0.439 |
| Phone | 0.137 | 0.601 |

| TOEIC score & trial 5 score | Pearson's r | P-value |
|---|---|---|
| All | 0.420* | 0.093 |
| Word | 0.210 | 0.418 |
| Syllable | 0.515** | 0.035 |
| Phone | 0.109 | 0.677 |

| TOEIC score & score improvement | Pearson's r | P-value |
|---|---|---|
| All | -0.017 | 0.949 |
| Word | -0.249 | 0.336 |
| Syllable | 0.252 | 0.329 |
| Phone | -0.054 | 0.836 |

***p<0.01, **p<0.05, *p<0.1

• The higher the TOEIC score, the higher the scores in both trial 1 and trial 5.
• The higher the TOEIC score, the smaller the pronunciation improvement (Derwing and Munro, 2005; Lee et al., 2015).
• In the syllable score feedback type, there is a positive correlation between TOEIC scores and trial 5 scores.
• In the syllable score feedback type, there is a positive correlation between TOEIC scores and score improvement.
—> The more English knowledge a learner has, the better they utilize syllable score information.
Result
English proficiency on participation (obj)

| TOEIC score & practice count | Pearson's r | P-value |
|---|---|---|
| All | -0.09 | 0.73 |
| Word | -0.11 | 0.68 |
| Syllable | 0.15 | 0.56 |
| Phone | -0.15 | 0.55 |

***p<0.01, **p<0.05, *p<0.1

• The higher the TOEIC score, the smaller the number of practices.
• Except in the syllable score feedback type.
Result
Mean & Std of responses to each survey question (17 Likert-scale questions)
ID Question Response type Mean Std
1 Using the automatic pronunciation evaluation system enhances opportunities to speak English. 1 to 5 Likert scale 4.22 0.73
2 Using the automatic pronunciation evaluation system enhances opportunities to learn pronunciation. 1 to 5 Likert scale 4.28 0.75
3 I enjoy practicing English pronunciation with the automatic pronunciation evaluation system. 1 to 5 Likert scale 4.11 0.68
4 Using the automatic pronunciation evaluation system promotes my motivation to speak English. 1 to 5 Likert scale 4.17 0.62
5 Using the automatic pronunciation evaluation system improved my English pronunciation. 1 to 5 Likert scale 4.06 0.73
6 I feel embarrassed while practicing English with teachers and classmates. 1 to 5 Likert scale 1.78 0.65
7 I feel embarrassed while practicing English with the automatic pronunciation evaluation system. 1 to 5 Likert scale 1.50 0.62
8 I am afraid of making mistakes while practicing English with teachers and classmates. 1 to 5 Likert scale 2.67 1.08
9 I am afraid of making mistakes while practicing English with the automatic pronunciation evaluation system. 1 to 5 Likert scale 2.06 1.16
10 I would like to use the automatic pronunciation evaluation system for additional learning. 1 to 5 Likert scale 4.00 0.84
11 The information of the automatic pronunciation evaluation system is clear. 1 to 5 Likert scale 3.83 0.92
12 The feedback was helpful in improving my English speaking. 1 to 5 Likert scale 3.94 0.80
13 The voice recording playback was helpful. 1 to 5 Likert scale 4.11 0.76
14 Word score feedback was helpful. 1 to 5 Likert scale 4.33 0.69
15 Phoneme score feedback was helpful. 1 to 5 Likert scale 4.22 0.81
16 Syllable score feedback was helpful. 1 to 5 Likert scale 4.22 0.73
17 Which feedback type (word, phoneme, syllable) was the most helpful? Word, Phone, Syllable
18 The automatic pronunciation evaluation system would have been more helpful if a model pronunciation recording was provided. 1 to 5 Likert scale 4.61 0.61
19 Do you have any suggestions regarding the use of the automatic pronunciation evaluation system? Open Response
20 Did you encounter any problems while using the automatic pronunciation evaluation system for practicing your English pronunciation? If so, please describe them. Open Response
21 Do you have any other comments or feedback about the automatic pronunciation evaluation system? Open Response

• Generally positive opinions on the system's effectiveness for learning pronunciation
• More mental comfort when practicing with the system than with classmates/teachers
• The highest-rated opinion is that the system would be more helpful with a model pronunciation recording
Result
Feedback type on pronunciation improvement (subj)

Q5. Using the automatic pronunciation evaluation system improved my English pronunciation. (correlation with objective score improvement)

| qID | Feedback type | Pearson's r | P-value |
|---|---|---|---|
| q5 | all | -0.300** | 0.033 |
| q5 | word | -0.426* | 0.088 |
| q5 | syllable | -0.204 | 0.433 |
| q5 | phone | -0.299 | 0.243 |

***p<0.01, **p<0.05, *p<0.1

• Those with lower improvement agreed more that the system is helpful for their pronunciation improvement.
• There is a discrepancy between the actual improvement and the perceived improvement.
Result
Feedback type on pronunciation improvement (subj)

Q17. Which feedback type (word, phoneme, syllable) was the most helpful?

| Feedback type | % |
|---|---|
| word | 11% |
| syllable | 33% |
| phone | 56% |

• The most helpful feedback, as reported, follows the order: phoneme > syllable > word.
—> The higher the information content (granularity) of the feedback, the more helpful it is perceived to be.
—> This is not the same order as the objective score improvement (the score improvement of the phoneme feedback type is the lowest).
Result
Feedback type on participation (subj)

Q1. Using the automatic pronunciation evaluation system enhances opportunities to speak English. (correlation with practice count)

| qID | Feedback type | Pearson's r | P-value |
|---|---|---|---|
| q1 | all | -0.198 | 0.164 |
| q1 | word | -0.530** | 0.029 |
| q1 | syllable | -0.145 | 0.579 |
| q1 | phone | -0.080 | 0.759 |

Q4. Using the automatic pronunciation evaluation system promotes my motivation to speak English.

| qID | Feedback type | Pearson's r | P-value |
|---|---|---|---|
| q4 | all | 0.022 | 0.879 |
| q4 | word | -0.418* | 0.095 |
| q4 | syllable | 0.016 | 0.951 |
| q4 | phone | 0.282 | 0.273 |

Q14. Word score feedback was helpful.

| qID | Feedback type | Pearson's r | P-value |
|---|---|---|---|
| q14 | all | 0.150 | 0.295 |
| q14 | word | -0.170 | 0.515 |
| q14 | syllable | 0.012 | 0.965 |
| q14 | phone | 0.438* | 0.079 |

***p<0.01, **p<0.05, *p<0.1

• Those with fewer practices agreed more that the system enhances the motivation/opportunity to speak English.
• This was statistically significant in the word score feedback type.
—> The sooner they stopped practicing, the more positive the perception.
• Those with fewer practices in the word feedback type agreed more that the word score feedback was helpful.
—> Feedback satisfaction and the number of practices are negatively correlated.
Result
English proficiency on pronunciation improvement (subj)

Q5. Using the automatic pronunciation evaluation system improved my English pronunciation. (correlation with TOEIC score)

| qID | Pearson's r | P-value |
|---|---|---|
| q5 | 0.241* | 0.089 |

Q12. The feedback was helpful in improving my English speaking.

| qID | Pearson's r | P-value |
|---|---|---|
| q12 | -0.269* | 0.057 |

***p<0.01, **p<0.05, *p<0.1

• Those with higher proficiency agreed more that their English pronunciation improved by using the system.
• Those with lower proficiency agreed more that the feedback helped improve their English speaking.
—> Those with lower proficiency consider explicit feedback more helpful.
Result
English proficiency on satisfaction/participation (subj)

Q3. I enjoy practicing English pronunciation with the automatic pronunciation evaluation system. (correlation with TOEIC score)

| qID | Pearson's r | P-value |
|---|---|---|
| q3 | 0.336** | 0.016 |

***p<0.01, **p<0.05, *p<0.1

• Those with higher proficiency enjoyed practicing English pronunciation with the system more.
—> The higher the proficiency, the higher the participation/satisfaction level.
Result
Others

Q8. I am afraid of making mistakes while practicing English with teachers and classmates. (correlation with practice count)

| qID | Feedback type | Pearson's r | P-value |
|---|---|---|---|
| q8 | all | 0.320** | 0.022 |
| q8 | word | 0.277 | 0.282 |
| q8 | syllable | 0.423* | 0.090 |
| q8 | phone | 0.366 | 0.149 |

***p<0.01, **p<0.05, *p<0.1

• The number of practices increases for learners who are more afraid of making mistakes in face-to-face speaking practice.
—> The greater the mental comfort the system provides, the greater the number of practices.

Q19. Do you have any suggestions regarding the use of the automatic pronunciation evaluation system?

| Response | % |
|---|---|
| no | 61% |
| model pronunciation | 28% |
| more specific feedback | 5% |
| anonymous real-time ranking | 5% |

Q20. Did you encounter any problems while using the automatic pronunciation evaluation system for practicing your English pronunciation? If so, please describe them.

| Response | % |
|---|---|
| no | 78% |
| no model pronunciation | 17% |
| frequent recording failures | 5% |

• There is no general dissatisfaction with the system, but a third of the participants asked for model pronunciation.
—> Learners also want implicit feedback.

Q21. Do you have any other comments or feedback about the automatic pronunciation evaluation system?

| Response | % |
|---|---|
| no | 61% |
| satisfied | 11% |
| model pronunciation | 17% |
| improvement of microphone sound quality | 5% |
| recording system should be fixed | 5% |
Summary
Discussion

RQ0: Does the automated pronunciation evaluation system have a positive effect on pronunciation score enhancement and learner satisfaction?
Ans: Yes

RQ1: Does the subword score type have an effect on pronunciation improvement with the automated pronunciation evaluation system?
Ans: Yes
- Obj: in the order word > syllable > phoneme; statistically significant except for phoneme
- Subj: in the order phoneme > syllable > word (the higher the granularity of the feedback, the higher the learner satisfaction)

RQ2: Does the subword score type have an effect on participation/satisfaction with the automated pronunciation evaluation system?
Ans: Yes
- Obj: in the order word > phoneme > syllable; the lower the initial score, the higher the participation level
- Subj: the participants with lower participation showed higher satisfaction levels

RQ3: Does English proficiency have an effect on pronunciation improvement with the automated pronunciation evaluation system?
Ans: Yes
- Obj: those with higher proficiency showed a smaller degree of improvement, despite the highest final scores.
  But they were more capable of utilizing explicit feedback information in advanced pronunciation tasks.
- Subj: those with higher proficiency agreed more that the automated pronunciation evaluation system helped improve their pronunciation.

RQ4: Does English proficiency have an effect on participation/satisfaction with the automated pronunciation evaluation system?
Ans: Yes
- Obj: the higher the proficiency, the smaller the number of practices.
- Subj: the higher the proficiency, the higher the satisfaction with the system.
4. Conclusion

Conclusion
• Research hypotheses
  • Evaluation of English segmentals
    • An ASR-based phoneme evaluation system will provide human-evaluator-level scores when it is modeled with native speakers' phoneme information —> TRUE
  • Evaluation of English suprasegmentals
    • It will provide human-evaluator-level scores when rhythm is modeled on duration, intonation on pitch, and stress on energy and pitch —> TRUE
  • Effectiveness of the automated English phoneme scoring system
    • English learners' pronunciation improves after practicing with the automated English phoneme evaluation system multiple times —> TRUE
    • English learners' satisfaction depends on the pronunciation score unit: word, syllable, or phoneme —> TRUE
5. Future Studies

Future Studies
• Experiment on the effectiveness of automated suprasegmental evaluation for pronunciation learning
• Develop an end-to-end ASR-based automated phoneme evaluation system
• Develop a pronunciation evaluation system for free (spontaneous) speech rather than read-aloud speech
References

- Arias, J. P., Yoma, N. B., & Vivanco, H. (2010). Automatic intonation assessment for computer aided language learning. Speech Communication, 52(3), 254-267.
- Barrass, J. (2017). The Intelligibility of Korean English Pronunciation from a Lingua Franca Perspective. University of Oxford.
- Bauman, N. R. (2006). A Catalogue of Errors Made by Korean Learners of English. KOTESOL International Conference 2006 Abstract & Paper.
- Black, M. P., Bone, D., Skordilis, Z. I., Gupta, R., Xia, W., Papadopoulos, P., ... & Narayanan, S. S. (2015). Automated evaluation of non-native English pronunciation quality: combining knowledge- and data-driven features at multiple time scales. In Sixteenth Annual Conference of the International Speech Communication Association.
- Cucchiarini, C., Strik, H., & Boves, L. (2000). Quantitative assessment of second language learners' fluency by means of automatic speech recognition technology. The Journal of the Acoustical Society of America, 107(2), 989-999.
- Franco, H., Bratt, H., Rossier, R., Rao Gadde, V., Shriberg, E., Abrash, V., & Precoda, K. (2010). EduSpeak®: A speech recognition and pronunciation scoring toolkit for computer-aided language learning applications. Language Testing, 27(3), 401-418.
- Hönig, F., Batliner, A., & Nöth, E. (2012). Automatic assessment of non-native prosody: annotation, modelling and evaluation.
- Hu, W., Qian, Y., Soong, F. K., & Wang, Y. (2015). Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers. Speech Communication, 67, 154-166.
- Kim, J. E., Cho, Y., Cho, Y., Hong, Y., Kim, S., & Nam, H. (2020). The effects of L1-L2 phonological mappings on L2 phonological sensitivity: evidence from self-paced listening. Studies in Second Language Acquisition, 42(5), 1041-1076.
- Kim, M. (2020). A study of rhythm improvements and relevant linguistic factors in the pronunciation of English learners. Studies in Foreign Language Education, 34(1), 237-261.
- Kommissarchik, J., & Komissarchik, E. (2000). Better Accent Tutor - Analysis and visualization of speech prosody. Proceedings of InSTILL 2000, 86-89.
- Lee, J., Jang, J., & Plonsky, L. (2015). The effectiveness of second language pronunciation instruction: A meta-analysis. Applied Linguistics, 36(3), 345-366.
- Lee, O. (2019). Suprasegmental instruction and the improvement of EFL learners' listening comprehension. English Language & Literature Teaching, 25(4), 41-60.
- Lin, B., Wang, L., Feng, X., & Zhang, J. (2020). Automatic scoring at multi-granularity for L2 pronunciation. Proc. Interspeech 2020, 3022-3026.
- Loukina, A., Zechner, K., Bruno, J., & Klebanov, B. B. (2018). Using exemplar responses for training and evaluating automated speech scoring systems. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 1-12).
- McGregor, A., & Reed, M. (2018). Integrating pronunciation into the English language curriculum: A framework for teachers. CATESOL Journal, 30(1), 69-94.
- Minematsu, N., Tomiyama, Y., Yoshimoto, K., Shimizu, K., Nakagawa, S., Dantsuji, M., & Makino, S. (2004). Development of English speech database read by Japanese to support CALL research. In Proc. ICA (Vol. 1, No. 2004, pp. 557-560).
- Neri, A., Cucchiarini, C., & Strik, H. (2002). Feedback in computer assisted pronunciation training: When technology meets pedagogy.
- Pérez-Ramón, R., Cooke, M., & Lecumberri, M. L. G. (2020). Is segmental foreign accent perceived categorically? Speech Communication, 117, 28-37.
- Prince, J. B. (2014). Contributions of pitch contour, tonality, rhythm, and meter to melodic similarity. Journal of Experimental Psychology: Human Perception and Performance, 40(6), 2319.
