
SURVEY ON AUTOMATIC PRONUNCIATION ANALYSIS

USING ARTIFICIAL INTELLIGENCE TECHNOLOGIES

Mr. Rajeshram V
Department of Computer Science and Engineering
M.Kumarasamy College of Engineering, Thalavapalayam, Karur, Tamilnadu, India - 639113
rajeshram107@gmail.com

Abishek R
Department of Computer Science and Engineering
M.Kumarasamy College of Engineering, Thalavapalayam, Karur, Tamilnadu, India - 639113
abishekshanthy@gmail.com

Ajay Vishwa R
Department of Computer Science and Engineering
M.Kumarasamy College of Engineering, Thalavapalayam, Karur, Tamilnadu, India - 639113
ajayvishwaram111@gmail.com

Hemandh M S
Department of Computer Science and Engineering
M.Kumarasamy College of Engineering, Thalavapalayam, Karur, Tamilnadu, India - 639113
hemandhnandhini@gmail.com

ABSTRACT
The Automatic Pronunciation Analysis (APA) system is an innovative language learning tool designed to revolutionize the way individuals acquire and perfect their pronunciation skills. Built upon advanced speech recognition and machine learning technologies, it offers a real-time, personalized, and effective solution for learners of all levels. The system operates by capturing the user's spoken language input and meticulously analysing it against the desired target pronunciation. Leveraging deep neural networks, acoustic modelling, and phonetic analysis, it can swiftly identify and assess pronunciation errors, enabling users to pinpoint areas for improvement. The APA provides immediate, constructive feedback, highlighting specific mispronunciations and offering visual representations for enhanced learning. Key features of the APA include its adaptability to a wide range of languages and dialects, making it a versatile tool for learners from diverse linguistic backgrounds. It can be integrated seamlessly into language learning applications and e-learning platforms, or used as a standalone tool for self-improvement or by educators and speech therapists. In this survey, we explore the APA's architecture, capabilities, and its transformative potential in the realm of language learning and education.

KEYWORDS: Artificial intelligence, Acoustic modelling, Pronunciation, Speech processing, Language processing

1. INTRODUCTION
In an increasingly interconnected world, effective communication in a global context has become a critical skill. Language proficiency and accurate pronunciation are integral components of this skill, enabling individuals to engage with diverse cultures, forge international relationships, and succeed in today's multicultural environments. For language learners and educators, the quest for impeccable pronunciation has long been a fundamental challenge, and pronunciation analysis represents a groundbreaking solution to it. Language learning has evolved with advancements in technology, and the APA stands at the forefront of this evolution. This innovative tool harnesses the power of artificial intelligence, speech recognition, and machine learning to provide real-time, personalized feedback on pronunciation accuracy. This introduction sets the stage for a comprehensive exploration of the APA, its underlying technologies, and its potential to transform the way we approach language learning. By seamlessly integrating cutting-edge speech analysis techniques, it offers a new dimension in pronunciation improvement, empowering individuals to enhance their linguistic proficiency with unprecedented precision and efficiency. As we delve into the intricacies of speech processing, it becomes evident that this tool has the potential to redefine language learning paradigms and facilitate effective cross-cultural communication on a global scale.

The pursuit of linguistic proficiency is a journey fraught with challenges, chief among them being the acquisition and refinement of accurate pronunciation. Pronunciation, the art of articulating words and sounds correctly, is pivotal not only for clear and effective communication but also for building cultural empathy and understanding. Incorrect pronunciation can lead to misunderstandings and hinder one's ability to connect with speakers of a different language or dialect. The traditional approach to pronunciation improvement often involves human teachers and extensive practice, which, while effective, can be time-consuming and costly. Furthermore, access to experienced language instructors is not always readily available, limiting the opportunity for focused, personalized learning.
This is where the APA steps in as a transformative solution. By harnessing the power of state-of-the-art speech recognition technology, the APA bridges the gap between traditional language learning and cutting-edge artificial intelligence. It provides users with an opportunity to receive real-time feedback on their pronunciation, allowing them to identify and rectify mistakes as they happen. This immediate feedback loop not only accelerates the learning process but also boosts learners' confidence by enabling them to track their progress in a tangible and visible manner. The APA is versatile, accommodating a wide array of languages and dialects, thereby catering to the needs of learners from various linguistic backgrounds. Additionally, its adaptability for integration into language learning applications and e-learning platforms, and as a stand-alone tool for self-improvement or professional guidance, has far-reaching implications for the field of education. Educators, speech therapists, and students alike can leverage this tool to enhance their teaching methodologies and language learning experiences. Beyond its practical applications, the APA has the potential to reshape the landscape of language research and cognitive science. It can serve as a valuable resource for linguists studying phonetics and pronunciation patterns, providing a wealth of data for scientific investigation. Furthermore, the system's data-driven approach can be employed in automated language assessment, offering an objective means to evaluate pronunciation skills and language proficiency. As we embark on this journey to explore the intricacies of the APA, we delve into the realm of transformative education, cross-cultural understanding, and the ever-evolving relationship between technology and language. This exploration is a testament to the APA's promise in enabling individuals to overcome linguistic barriers, fostering effective communication, and ultimately enriching lives through the power of pronunciation perfection. Fig 1 shows the basic steps for acoustic data handling.

Figure 1: SPEECH PROCESSING STEPS

2. RELATED WORK
Cummings, Alycia, et al. [1] examined how intervention dose frequency affects phonological acquisition and generalization in preschool children with speech sound disorders (SSD). Generalization data for the intervention phoneme in untreated words is reported as the average production accuracy of all post-intervention speech probes. For a phoneme to be considered generalized, there needed to be at least a 10% accuracy increase from pre- to post-intervention. Three of the four children in both dose frequency conditions met this criterion value (Table 3). Based on these results, both frequency groups demonstrated generalization of the treated phoneme in untreated words. That being said, the individual Tau-U effect size calculations revealed that only two children (Child 3 and 4) in the 2x/week condition demonstrated statistically significant gains in their treated phoneme accuracy in untreated words. Thus, the less frequent intervention program appeared to better support the generalization of the treated phoneme in untreated contexts than did the more frequent intervention program.

Acquah, Emmanuel O., et al. [2] analysed 26 studies looking into the impact of digital games on language learning and related outcomes from 2014 to 2018. The majority of the research used mixed methods, targeted English language learning using computers, and was conducted in East Asia or the Middle East. It was found that digital learning games (DLGs) can be used as effective L2 learning tools that motivate players to learn and interact. Digital game-based language learning (DGBLL) can be a fun, engaging, and challenging way to learn, and provides differentiation and learner autonomy. From the included studies, 70% of the reported outcomes were entirely positive, which is evidence of the positive outcomes of DGBLL on primary through high school-age children. More specifically, researchers reported outcomes from DLGs were positive 62% of the time for language acquisition, 81% for affective/psychological states, 88% for contemporary competences and 62% for participatory behaviors. While the articles did not mention outcomes related to cultural competences or building global networks, DLGs provide the possibility of expanding the classroom outside of four walls by bridging schools and enabling cross-cultural communication. In order for DLGs to be implemented successfully, it is essential to know how certain factors can influence the outcomes. The review found DLGs produced positive outcomes with and without teacher facilitation, but more insight into how teachers can implement DGBLL is needed, since none of the studies directly analyzed implementation.

Ping Li et al. [3] charted an overall picture of what digital language learning (DLL) has evolved into, what impacts it has created, and what future promises it may hold. They also attempted to provide theoretical perspectives from psychology, education, linguistics, and neuroscience to understand the cognitive, social, affective, and neural dimensions of DLL. DLL has enormous potential given the new generations of 'digital natives' and the interest in digital applications and blended learning in the foreseeable future. But significant work remains to be done to understand the mechanisms by which DLL might simulate language learning in its natural, authentic context and consequently enhance its learning success. Significant gaps also exist between academic knowledge of student learning and the industry's commercial product design. Quick knowledge transfer from academia to industry is needed, which is currently hindered by many factors, including bureaucracies at different levels, and such problems are exacerbated by the different paces adopted by academia and industry. To mend such gaps, academics need to work more closely with industry and with policy makers, which will facilitate and accelerate the development of both knowledge discovery and knowledge transfer.

Zou, Di, et al. [4] reviewed 21 SSCI publications on digital game-based vocabulary learning (DGVL) from five perspectives: the general publication situation, digital games for vocabulary learning, theoretical
frameworks, research issues and findings, and the subsequent implications. It was found that DGVL has gathered increasing attention from linguists and educators, although the total number of studies on the topic is small at the current stage. A large proportion of studies have been conducted from the education dimension, and several education theories have been used as the theoretical foundations of research on DGVL. Ten different types of digital games (simulation, tutorial, role-playing, motion-sensing, 3D virtual, adventure, card, board, and serious games, as well as gamified digital books) were investigated, and the results generally showed positive effects of the games in promoting short-term and long-term vocabulary learning. In addition to facilitating vocabulary learning, digital games were found to be conducive to reading and listening comprehension, as well as pronunciation improvement. Game-players were also viewed as having higher motivation, better engagement and more interactions than students who learned through other approaches, in addition to being less stressed. Moreover, despite the limited number of studies on DGVL, insightful findings and meaningful implications were reported.

Ballard et al. [5] presented Apraxia World, a speech therapy game designed to give children more independence and make therapy practice more enjoyable. Apraxia World is unique among speech therapy games in that players control the game using traditional joystick and button inputs, while speech input is used to collect in-game assets necessary to complete the level. The game also supports pronunciation feedback provided by caregivers or an automatic evaluation framework. To validate the game design and speech therapy delivery approach, the authors evaluated the long-term home use and clinical benefit of Apraxia World over a multi-month period. Children reported enjoying the game, even over the long play period. Game personalization through in-game purchases of costumes, weapons, and avatars proved to be a widely popular aspect of the game.

Tejedor-Garcia et al. [6] described and analysed a novel learning game for pronunciation training in which players can challenge each other. The mobile game application turned out to be a useful resource for English pronunciation training. It relied on speech technologies (ASR and TTS), which proved particularly useful for increasing the amount of game intensity, immediate feedback, and model pronunciations available to the students. It was also based on a specific cycle of pronunciation activities following the minimal pairs paradigm. Native Spanish speakers played the game in a competition for English as a foreign language pronunciation training, where a performance and motivation analysis was done to examine the effects of challenging. The study addresses an important issue, given the current drive by educators to discover new ways to motivate students and encourage effective uptake. Although collaborative and competitive strategies in second language learning continue to invite discussion and disagreement about which should be included as an effective element of motivation and performance for students, the authors observed that the explicitly competitive structure of the game resulted in more positive effects on student performance and motivation than a previous version of the game, in terms of a higher number of activities (game intensity) and greater playing regularity.

Bashori, Muzakki, et al. [7] presented a study to gain more insight into the kind of linguistic gains students made; English was the target language. The vocabulary test consisted of three different parts, which may provide relevant information on the receptive and productive aspects of vocabulary learning. In addition, a subset of the students was asked to pronounce target words, as pronunciation might be an aspect that is affected directly through ASR systems. The authors also employed an open-source software package, the Automated Phonetic Transcription Comparison Tool (APTct), to help analyze learners' speech. Two preliminary studies evaluated these ASR-based websites and investigated to what extent they affect learners' cognitive and affective domains. The results revealed that the websites were evaluated positively and helped learners improve their vocabulary knowledge, reduce their speaking anxiety, and enhance their language enjoyment.

Shi, Jiatong, et al. [8] proposed the context-aware GOP (CaGOP) scoring model, which injects two context-related factors into the model: a transition factor and a duration factor. The transition factor is represented using the frame-wise posterior-probability entropy. The duration factor is represented by the duration mismatch, which is computed using a duration model with a self-attention network. Earlier GOP-like methods do not fully consider the context information within and between phonetic segments. Within segments, the scoring strategies depend on forced alignments, which split the entire sequence into phonetic segments corresponding to reference phonemes. Based on the facts of speech production, the vocal tract changes gradually during the production of different phonemes; therefore, a hard assignment of phonemes in the time domain includes the transitions between phonemes within the force-aligned segments. And as GOP-based models do not consider context information between phonetic segments, they do not capture pronunciation phenomena arising from contextual phonemes. To tackle this issue, the duration factor is introduced as the prosody context for pronunciation scoring. Its computation includes two steps: first, a context-dependent duration model predicts the durations for the given phoneme sequences; then, the duration factor is taken as the phonetic duration mismatch between the reference and the test utterances.

Baevski, Alexei, et al. [9] presented wav2vec 2.0, a framework for self-supervised learning of speech representations which masks latent representations of the raw waveform and solves a contrastive task over quantized speech representations. Their experiments show the large potential of pre-training on unlabeled data for speech processing: when using only 10 minutes of labeled training data, or 48 recordings of 12.5 seconds on average, they achieve a WER of 4.8/8.2 on the test-clean/other sets of Librispeech. Neural networks benefit from large quantities of labeled training data; however, in many settings labeled data is much harder to come by than unlabeled data, and current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance, which is not available for the vast majority of the nearly 7,000 languages spoken worldwide. Learning purely from labeled examples also does not resemble language acquisition in humans: infants learn language by listening to the adults around them, a process that requires learning good representations of speech.
Li-Wei Chen et al. [10] described different fine-tuning strategies for wav2vec 2.0 on speech emotion recognition (SER). These strategies produce state-of-the-art performance on IEMOCAP, a well-studied corpus. The authors verify the presence of domain shift in SER and demonstrate that addressing it improves performance. They describe an algorithm for learning contextualized emotion representation and show its advantage in fine-tuning a wav2vec 2.0 model for SER. They believe these techniques can be generalized to other tasks, can provide a basis for research on the utility of contextualized emotion representation, and intend to continue exploring the usefulness of this approach in a multi-modal setting. Table 1 provides the overall details about the literature survey papers.

NO | TECHNIQUES | MERITS | DEMERITS
1 | Percent consonants correct (PCC) | Provides a score based on correct consonant sounds | Manual intervention may be needed
2 | Digital learning games (DLGs) framework | Analyzes the behaviors of students | Expert knowledge needed
3 | Digital language learning (DLL) | Provides a theoretical synthesis and analytical framework | Theoretical and practical problems can occur
4 | Short-term and long-term vocabulary learning | Several education theories have been used | Large number of analyses are required
5 | Template matching (TM) framework | Increases practice frequency | There is no automatic evaluation
6 | Scoring system | Novel learning game for pronunciation training | Computational complexity is high
7 | Novo Learning (NOVO) system | Targeted vocabulary and pronunciation skills | Statistical based analysis
8 | Context-aware Goodness of Pronunciation (CaGOP) scoring model | Improvement for phonetic duration prediction | Time complexity can be high
9 | wav2vec constructions | Masks latent representations of the raw waveform | Fewer labeled datasets are used
10 | Task adaptive pretraining (TAPT) | Learning contextualized emotion representation | Only classifies the tone of speech

TABLE 1: SURVEY PAPERS FOR STUDENT SKILLS ANALYSIS

3. EXISTING METHODOLOGIES
Automatic pronunciation in the context of TTS involves converting written text into spoken language with correct pronunciation, intonation, and natural-sounding speech.

RULE BASED ALGORITHM
A rule-based system for pronunciation mistake detection is designed to identify and correct mispronunciations in speech or text-to-speech (TTS) applications by applying predefined linguistic rules. These rules encompass various aspects of phonology and phonetics, including phoneme pronunciation, stress patterns, syllable divisions, and contextual variations. The system maintains comprehensive dictionaries mapping words to their correct pronunciations and employs phonetic transcription, such as the International Phonetic Alphabet (IPA), to represent words accurately. When a user input or TTS output deviates from these rules, the system flags it as a pronunciation mistake and offers feedback or correction suggestions based on the established rules. It is a valuable tool for improving the correctness and naturalness of speech synthesis, but it may have limitations when dealing with complex or rare pronunciation variations. Combining rule-based systems with data-driven techniques can enhance overall performance.
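As a concrete illustration, the dictionary-lookup check described above amounts to comparing a recognized phoneme sequence against a reference entry. The following is a minimal sketch under stated assumptions: the lexicon entries and ARPAbet-style phoneme symbols are illustrative placeholders, not taken from any system in this survey.

```python
# Minimal sketch of a rule-based pronunciation check: a dictionary maps
# words to reference phoneme sequences, and a recognized sequence is
# compared position by position. Entries are illustrative only.
LEXICON = {
    "water": ["W", "AO", "T", "ER"],
    "think": ["TH", "IH", "NG", "K"],
}

def check_pronunciation(word, spoken_phonemes):
    """Return a list of (position, expected, got) mismatches."""
    reference = LEXICON[word.lower()]
    mistakes = []
    # Walk the longer of the two sequences so that insertions and
    # deletions at the end are also flagged (as None on one side).
    length = max(len(reference), len(spoken_phonemes))
    for i in range(length):
        expected = reference[i] if i < len(reference) else None
        got = spoken_phonemes[i] if i < len(spoken_phonemes) else None
        if expected != got:
            mistakes.append((i, expected, got))
    return mistakes

# The speaker substituted "T" for "TH": flagged at position 0.
errors = check_pronunciation("think", ["T", "IH", "NG", "K"])
```

A production system would align the two sequences with an edit-distance procedure rather than positional comparison, so that a single insertion does not cascade into many spurious mismatches.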
Machine learning methods
Machine learning can be defined as artificial intelligence algorithms that infer and predict from data to mimic the way humans learn. Various machine learning algorithms are capable of solving classification, regression and clustering tasks. In this work, popular methods suitable for classification problems are emphasized: support vector machine (SVM), k-nearest neighbour (k-NN), decision tree (DT) and naïve Bayes.

SVM is a supervised learning approach. Kernel functions can be used depending on the type of data, so both linear and nonlinear classification can be performed. The aim is to separate all the data with a hyperplane; however, if the data cannot be fully separated, it cannot be classified with a single plane, and different kernel functions are therefore used. A margin is determined around the hyperplane, and whether this margin is large or small directly affects the classification performance. The margin can be controlled with the "C" hyperparameter: the larger C is, the narrower the margin, and if the model is overfit, C needs to be reduced. In this work, a linear kernel with C = 0.02 is used.
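The geometry described above can be made concrete: a linear SVM classifies with the sign of w·x + b, and the margin width is 2/||w||, so a larger weight norm (which a large C encourages) means a narrower margin. A small sketch with hand-picked weights, not a trained model:

```python
import math

# Linear SVM decision function: sign(w . x + b). The weights and bias
# here are hand-picked for illustration, not learned from data.
w = [2.0, 1.0]
b = -1.0

def decide(x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Margin width is 2 / ||w||: the larger the weight norm, the narrower
# the margin, which is the behaviour a large C hyperparameter pushes
# the training procedure toward.
margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))

print(decide([1.0, 1.0]))  # score = 2 + 1 - 1 = 2, so class +1
print(decide([0.0, 0.0]))  # score = -1, so class -1
```

In practice one would fit w and b with a library solver rather than by hand; the point of the sketch is only the hyperplane/margin relationship the text describes.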
k-NN is based on determining the class of an unknown sample according to its nearest "k" neighbours in the training set. A distance measurement is performed between the test data and the training data to find the "k" nearest neighbours, and the class of the tested sample is then decided from their labels.
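This procedure fits in a few lines of pure Python; the toy 2-D feature points, Euclidean distance, and majority vote below are illustrative choices, not parameters from the surveyed work:

```python
import math
from collections import Counter

def knn_classify(train, test_point, k=3):
    """train: list of (feature_vector, label) pairs. Returns the
    majority label among the k nearest training points by Euclidean
    distance."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], test_point))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two toy clusters standing in for "correct" and "incorrect" tokens.
train = [((0.0, 0.0), "correct"), ((0.1, 0.2), "correct"),
         ((1.0, 1.0), "incorrect"), ((0.9, 1.1), "incorrect")]
print(knn_classify(train, (0.2, 0.1), k=3))  # majority of 3 nearest -> "correct"
```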
The basic purpose of decision trees is to divide the data set into smaller subgroups that are more visually understandable within the framework of certain rules (decision rules). Since the output of the algorithm is a flowchart that looks like a tree, it is called a decision tree. There are four basic structures in a decision tree: the root node, internal nodes, branches and leaves (terminal nodes). The root node is where the classification process starts. If the observations are in a homogeneous structure, they will naturally be in the same class and the classification process ends without branching the root node. With heterogeneous observations, the root node divides into two or more branches according to the attribute that best separates the observations into classes, creating new nodes. The last non-branching node of the tree is the terminal node and represents the class to which the observations are assigned.

Naïve Bayes classification is based on Bayes' theorem. It is used to estimate the probability that a particular set of features belongs to a particular class, and it selects the decision with the highest probability. Each attribute is considered independent of the other attributes in the class. Naïve Bayes classifiers are extremely fast compared to more complex methods.
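Under the independence assumption just described, the score of a class is its prior times the product of per-attribute likelihoods. A minimal sketch with made-up priors and likelihoods for two pronunciation classes (all probabilities here are illustrative placeholders):

```python
# Naive Bayes: score(class) = P(class) * product over attributes of
# P(attribute | class), treating attributes as independent given the
# class. The probability tables below are illustrative only.
PRIORS = {"correct": 0.6, "incorrect": 0.4}
LIKELIHOODS = {
    "correct": {"stress_ok": 0.9, "vowel_ok": 0.8},
    "incorrect": {"stress_ok": 0.3, "vowel_ok": 0.2},
}

def classify(attributes):
    """attributes: dict of boolean features. Returns the class with the
    highest posterior score."""
    best_label, best_score = None, -1.0
    for label, prior in PRIORS.items():
        score = prior
        for attr, present in attributes.items():
            p = LIKELIHOODS[label][attr]
            score *= p if present else (1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify({"stress_ok": True, "vowel_ok": True}))    # "correct" wins
print(classify({"stress_ok": False, "vowel_ok": False}))  # "incorrect" wins
```

Real implementations work in log space to avoid underflow when many attributes are multiplied, but the selection rule is the same.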
4. PROPOSED METHODOLOGIES
The proposed system aims to address the importance of precise pronunciation in effective communication, language learning, and text-to-speech applications. This system is built on the foundation of Convolutional Neural Networks (CNNs) and comprises
several key components. Initially, it involves the
collection and preprocessing of a diverse dataset of
spoken language, containing both correctly pronounced
words and phrases, accompanied by their corresponding
phonetic transcriptions. This data is then meticulously
preprocessed, which includes converting audio into a
standardized format, segmenting it into individual
words or phrases, and extracting pertinent acoustic
features like Mel-frequency cepstral coefficients
(MFCCs), pitch, and energy. The dataset is further
enriched through labeling, with audio data being
annotated as having either correct or incorrect
pronunciation, using the provided phonetic
transcriptions as a reference for correctness. To
facilitate model development and assessment, the
dataset is divided into training, validation, and test
subsets. The heart of the system lies in the design of a
CNN architecture optimized for pronunciation
assessment. This CNN model is trained using the
labeled dataset, teaching it to distinguish between
correct and incorrect pronunciations based on the
acoustic features. The system also incorporates a
feedback mechanism, offering users insights into
detected pronunciation mistakes, such as highlighting mispronounced words or providing suggestions for correct pronunciation. Furthermore, a user-friendly interface enables real-time pronunciation checking, allowing users to input spoken language or text and receive immediate feedback on their pronunciation quality. The system is engineered to be versatile, supporting multiple languages and dialects and constantly improving its capabilities by incorporating user feedback and undergoing periodic model retraining. Ultimately, this proposed system represents a comprehensive solution for enhancing pronunciation accuracy and promoting effective language learning and communication.

Fig 2 shows the proposed architecture, and the proposed work is described as follows:
• Data Collection and Preprocessing: Gather a dataset of spoken language with corresponding phonetic transcriptions, which serve as the reference for correct pronunciation. Preprocess the audio data by converting it to suitable formats, such as WAV files, and segmenting it into words or phrases.
• Feature Extraction: Extract acoustic features from the audio data. Common features include Mel-frequency cepstral coefficients (MFCCs), pitch, and energy.
• Labeling and Ground Truth: Label the audio data with correct and incorrect pronunciation labels, using the phonetic transcriptions as the reference for correctness.
• CNN Architecture: Design a CNN architecture for pronunciation assessment. Consider using 1D CNNs, as they are well-suited for sequential audio data. Configure the network to have convolutional layers for feature extraction, followed by fully connected layers for classification.
• Training the Model: Train the CNN using the labelled dataset. The model learns to distinguish between correct and incorrect pronunciations based on the extracted acoustic features. Fine-tune hyperparameters and the model architecture as needed.
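To make the 1D-CNN design concrete, the following is a pure-Python sketch of the forward pass of one convolutional layer with ReLU, global average pooling, and a linear decision step over a short feature track. The frame values, kernel, and final weights are illustrative placeholders, not a trained model:

```python
def conv1d(sequence, kernel):
    """Valid 1-D convolution (cross-correlation, as in CNN layers)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def relu(xs):
    return [max(0.0, x) for x in xs]

def global_avg_pool(xs):
    return sum(xs) / len(xs)

# Toy per-frame feature track (e.g. one MFCC coefficient over time).
frames = [0.2, 0.9, 0.4, 0.1, 0.8, 0.3]
kernel = [0.5, -0.5]  # illustrative learned filter (frame difference)
feature_map = relu(conv1d(frames, kernel))
pooled = global_avg_pool(feature_map)

# A one-weight "fully connected" layer plus threshold stands in for the
# correct/incorrect classification head.
w, b = 2.0, -0.1
label = "correct" if w * pooled + b >= 0 else "incorrect"
```

A real model would stack several such layers with learned kernels (e.g. in a deep learning framework) and end in a softmax over the two classes, but the shapes and data flow are as above: a length-6 input and a width-2 kernel give a length-5 feature map, which pooling reduces to a single value for the classifier.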
Learning 34.5-6 (2021): 751-777.

ACCURACY (%)
[5] BALLARD, KIRRIE J., CONSTANTINA
MARKOULLI, and PENELOPE MONROE. "A
Longitudinal Evaluation of Tablet-Based Child Speech
Therapy with Apraxia World." (2021).
80 [6] Tejedor-Garcia, Cristian, et al. "Using challenges to
enhance a learning game for pronunciation training of
70 English as a second language." IEEE Access 8 (2020):
74250-74266.
60 [7] Bashori, Muzakki, et al. "‘Look, I can speak correctly’:
learning vocabulary and pronunciation through websites
50 equipped with automatic speech recognition
technology." Computer Assisted Language
40 Learning (2022): 1-29.
[8] Shi, Jiatong, Nan Huo, and Qin Jin. "Context-aware
goodness of pronunciation for computer-assisted
30
pronunciation training." arXiv preprint
arXiv:2008.08647 (2020).
20
[9] Baevski, Alexei, et al. "wav2vec 2.0: A framework for
self-supervised learning of speech
10 representations." Advances in neural information
processing systems 33 (2020): 12449-12460.
0 [10]Chen, Li-Wei, and Alexander Rudnicky. "Exploring
SVM KNN NAIVES CNN
BAYES Wav2vec 2.0 Fine Tuning for Improved Speech Emotion
Recognition." ICASSP 2023-2023 IEEE International
Fig 3: Performance chart Conference on Acoustics, Speech and Signal Processing
From the above graph, proposed algorithm provides the (ICASSP). IEEE, 2023.
improved accuracy rate than the existing machine [11] Almelhes, Sultan A. "A Review of Artificial
learning algorithms Intelligence Adoption in Second-Language
Learning." Theory and Practice in Language Studies 13.5
6. CONCLUSION (2023): 1259-1269.
In survey, the proposed "Automatic [12] Ngo, Thuy Thi-Nhu, Howard Hao-Jan Chen, and Kyle
Pronunciation Checker with Convolutional Neural Kuo-Wei Lai. "The effectiveness of automatic speech
Networks" system holds significant promise in recognition in ESL/EFL pronunciation: A meta-
addressing the critical need for accurate pronunciation analysis." ReCALL (2023): 1-18.
in language learning, communication, and text-to- [13] Zakiyyah, Fina, Arso Setyaji, and Sukma Nur Ardini.
speech applications. Leveraging the power of "The analysis of pronunciation application based on the
Convolutional Neural Networks (CNNs), this system concept of Artificial Intelligence." UNCLLE
offers a comprehensive solution for assessing and (Undergraduate Conference on Language, Literature, and
improving pronunciation quality. By meticulously Culture). Vol. 2. No. 01. 2022.
collecting and preprocessing diverse spoken language [14] Nugraha, Sandy Vitra, Lalu Ari Irawan, and Tina Orel
data, including phonetic transcriptions, the system lays Frank. "Segmental Aspects of Pronunciation Errors
a robust foundation for pronunciation assessment. Its Produced by ELE Students in Classroom Settings." Journal
CNN architecture is tailored to effectively differentiate of Language and Literature Studies 2.2 (2022): 88-98.
between correct and incorrect pronunciations based on [15] Yuzawa, Nobuo. "An Analysis of Two English
extracted acoustic features, such as MFCCs, pitch, and Textbooks for Elementary School in Japan: Focusing on
energy. Moreover, the incorporation of a feedback Teaching Pronunciation." Journal of the Faculty of
mechanism and a user-friendly interface ensures real- International Studies, Utsunomiya University 53 (2022):
time pronunciation checking, providing users with 103-116.
valuable insights and corrective guidance. [16] Peura, Liisa, Maarit Mutta, and Marjut Johansson.
REFERENCES
[1] Cummings, Alycia, Kristen Giesbrecht, and Janet Hallgrimson. "Intervention dose frequency: Phonological generalization is similar regardless of schedule." Child Language Teaching and Therapy 37.1 (2021): 99-115.
[2] Acquah, Emmanuel O., and Heidi T. Katz. "Digital game-based L2 learning outcomes for primary through high-school students: A systematic literature review." Computers & Education 143 (2020): 103667.
[3] Li, Ping, and Yu-Ju Lan. "Digital language learning (DLL): Insights from behavior, cognition, and the brain." Bilingualism: Language and Cognition 25.3 (2022): 361-378.
[4] Zou, Di, Yan Huang, and Haoran Xie. "Digital game-based vocabulary learning: where are we and where are we going?" Computer Assisted Language Learning.
[13] Zakiyyah, Fina, Arso Setyaji, and Sukma Nur Ardini. "The analysis of pronunciation application based on the concept of Artificial Intelligence." UNCLLE (Undergraduate Conference on Language, Literature, and Culture). Vol. 2. No. 01. 2022.
[14] Nugraha, Sandy Vitra, Lalu Ari Irawan, and Tina Orel Frank. "Segmental Aspects of Pronunciation Errors Produced by ELE Students in Classroom Settings." Journal of Language and Literature Studies 2.2 (2022): 88-98.
[15] Yuzawa, Nobuo. "An Analysis of Two English Textbooks for Elementary School in Japan: Focusing on Teaching Pronunciation." Journal of the Faculty of International Studies, Utsunomiya University 53 (2022): 103-116.
[16] Peura, Liisa, Maarit Mutta, and Marjut Johansson. "Playing with Pronunciation: A study on robot-assisted French pronunciation in a learning game." Nordic Journal of Digital Literacy 2 (2023): 100-115.
[17] Al-Jarf, Reima. "Proper noun pronunciation inaccuracies in English by Educated Arabic speakers." British Journal of Applied Linguistics (BJAL) 4.1 (2022): 14-21.
[18] Adam, Nuflihin Surya, Agus Hidayat, and Muhammad Ridho Kholid. "An Analysis of Pronunciation in Word Stress Towards Students of Sixth Semester of English Education at UIN Raden Intan Lampung." Journal of Linguistics and Social Sciences 1.2 (2023): 59-65.
[19] Maiza, Masfa. "An analysis of students' pronunciation errors." JOEEL: Journal of English Education and Literature 1.1 (2020): 18-23.
[20] Rafael, Agnes Maria Diana. "An analysis on pronunciation errors made by first semester students of English department STKIP CBN." Loquen: English Studies Journal 12.1 (2019): 1-10.
[21] Ngo, Thuy Thi-Nhu, Howard Hao-Jan Chen, and Kyle Kuo-Wei Lai. "The effectiveness of automatic speech recognition in ESL/EFL pronunciation: A meta-analysis." ReCALL (2023): 1-18.
[22] Astina, Nurhamdah. "The analysis of teaching English pronunciation at young learners." Inspiring: English Education Journal 3.1 (2020): 1-16.
[23] Kibria, Shafkat, et al. "Acoustic analysis of the speakers' variability for regional accent-affected pronunciation in Bangladeshi bangla: a study on Sylheti accent." IEEE Access 8 (2020): 35200-35221.
[24] Ambalegin, Ambalegin. "Phonological Analysis of English Vowel Pronunciation." KnE Social Sciences (2021): 28-45.
[25] Cengiz, Behice Ceyda. "Computer-Assisted Pronunciation Teaching: An Analysis of Empirical Research." Participatory Educational Research 10.3 (2023): 72-88.
[26] Du, Minghao, et al. "An Automatic Depression Recognition Method from Spontaneous Pronunciation Using Machine Learning." Proceedings of the 2022 9th International Conference on Biomedical and Bioinformatics Engineering. 2022.
[27] Han, Xue, and Trip Huwan. "The modular design of an English pronunciation level evaluation system based on machine learning." Security and Communication Networks 2022 (2022).
[28] Du, Minghao, et al. "An Automatic Depression Recognition Method from Spontaneous Pronunciation Using Machine Learning." Proceedings of the 2022 9th International Conference on Biomedical and Bioinformatics Engineering. 2022.
[29] Han, Xue, and Trip Huwan. "The modular design of an English pronunciation level evaluation system based on machine learning." Security and Communication Networks 2022 (2022).
[30] Rukwong, Niyada, and Sunee Pongpinigpinyo. "An Acoustic Feature-Based Deep Learning Model for Automatic Thai Vowel Pronunciation Recognition." Applied Sciences 12.13 (2022): 6595.
[31] Gong, Yuan, et al. "Transformer-based multi-aspect multi-granularity non-native English speaker pronunciation assessment." ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.
[32] Kim, Eesung, et al. "Automatic pronunciation assessment using self-supervised speech representation learning." arXiv preprint arXiv:2204.03863 (2022).
[33] Xu, Yushu. "English speech recognition and evaluation of pronunciation quality using deep learning." Mobile Information Systems 2022 (2022): 1-12.
[34] Tang, Hui. "An automatic correction system of singing intonation based on deep learning." International Journal of Information and Communication Technology 22.4 (2023): 422-437.
[35] Cámara-Arenas, Enrique, et al. "Automatic pronunciation assessment vs. automatic speech recognition: A study of conflicting conditions for L2-English." (2023).
[36] Bi, Yanjing, et al. "A RTL Implementation of Heterogeneous Machine Learning Network for French Computer Assisted Pronunciation Training." Applied Sciences 13.10 (2023): 5835.
[37] Zhao, Xiaoda, and Xiaoyan Jin. "Standardized evaluation method of pronunciation teaching based on deep learning." Security and Communication Networks 2022 (2022).
[38] Alashban, Adal A., and Yousef A. Alotaibi. "A Deep Learning Approach for Identifying and Discriminating Spoken Arabic Among Other Languages." IEEE Access 11 (2023): 11613-11628.
[39] Korzekwa, Daniel, et al. "Computer-assisted pronunciation training—Speech synthesis is almost all you need." Speech Communication 142 (2022): 22-33.
[40] Wang, Xiaoman, and Lu Yuan. "Machine-learning based automatic assessment of communication in interpreting." Frontiers in Communication 8 (2023): 1047753.
[41] Lu, Yijia, et al. "Decoding lip language using triboelectric sensors with deep learning." Nature Communications 13.1 (2022): 1401.
[42] Harere, Ahmad Al, and Khloud Al Jallad. "Mispronunciation Detection of Basic Quranic Recitation Rules using Deep Learning." arXiv preprint arXiv:2305.06429 (2023).
[43] Wei, Xing, et al. "Automatic Speech Recognition and Pronunciation Error Detection of Dutch Non-native Speech: cumulating speech resources in a pluricentric language." Speech Communication 144 (2022): 1-9.
[44] Wu, Yanping, et al. "Implementation of a System for Assessing the Quality of Spoken English Pronunciation Based on Cognitive Heuristic Computing." Computational Intelligence and Neuroscience 2022 (2022).
[45] Wu, Yanping, et al. "Implementation of a System for Assessing the Quality of Spoken English Pronunciation Based on Cognitive Heuristic Computing." Computational Intelligence and Neuroscience 2022 (2022).
[46] Çalik, Şükrü Selim, Ayhan Küçükmanisa, and Zeynep Hilal Kilimci. "Deep Learning-Based Pronunciation Detection of Arabic Phonemes." 2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA). IEEE, 2022.
[47] Mathad, Vikram C., et al. "Consonant-vowel transition models based on deep learning for objective evaluation of articulation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2022): 86-95.
[48] Sheoran, Kavita, et al. "Pronunciation Scoring With Goodness of Pronunciation and Dynamic Time Warping." IEEE Access 11 (2023): 15485-15495.
[49] Malakar, Mousumi, Ravindra B. Keskar, and Ajit Zadgaonkar. "A hierarchical automatic phoneme recognition model for Hindi-Devanagari consonants using machine learning technique." Expert Systems (2023): e13288.
[50] Liu, Nian. "Automatic English Pronunciation Evaluation Algorithm Based on Sequence Matching and Feature Fusion." Mathematical Problems in Engineering 2022 (2022).