This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3300972


A Deep Diacritics-Based Recognition Model for Arabic Speech: Quranic Verses as Case Study
Sarah S. Alrumiah and Amal A. Alshargabi
Department of Information Technology, College of Computer, Qassim University, Buraydah, Saudi Arabia

Corresponding author: Sarah S. Alrumiah (e-mail: saraalrumih@gmail.com).


This work was supported by the Deanship of Scientific Research at Qassim University under grant number COC-2022-1-2-J-30493.

ABSTRACT Arabic is the language of more than 422 million people worldwide. Although classic Arabic is the language of the Quran, which 1.9 billion Muslims are required to recite, Arabic speech recognition resources remain limited. In classic Arabic, diacritics affect the pronunciation of a word, and a change in a diacritic can change the meaning of a word. However, most Arabic speech recognition models discard the diacritics. This work aims to recognize classic Arabic speech while considering diacritics by converting audio signals to diacritized text using Deep Neural Network (DNN)-based models. DNN-based models recognize speech end to end and thus avoid the phonetics dependency of traditional speech recognition systems. Three models were developed to recognize Arabic speech: (i) Time Delay Neural Network (TDNN)-Connectionist Temporal Classification (CTC), (ii) Recurrent Neural Network (RNN)-CTC, and (iii) transformer. A 100-hour dataset of Quran recordings has been used. Based on the results, the RNN-CTC model obtained state-of-the-art results with the lowest word error rate of 19.43% and a 3.51% character error rate. The RNN-CTC model recognizes speech character by character, which proved more reliable than the transformers' whole-sentence recognition behaviour. The model performed well with clear, unstressed recordings of short sentences. Moreover, the RNN-CTC model effectively recognized out-of-the-dataset sounds. The findings recommend continuing the efforts to enhance diacritics-based Arabic speech recognition models using clear and unstressed recordings to obtain better performance. Moreover, pretraining large speech models could yield more accurate recognition. The outcomes can be used to enhance existing classic Arabic speech recognition solutions by supporting diacritics recognition.

INDEX TERMS Deep learning, speech recognition, RNN, TDNN, transformers.

I. INTRODUCTION
Although more than 422 million people speak Arabic [1], the Arabic speech recognition field still needs improvement [2], [3]. Arabic is a complicated language due to its richness. The Arabic language can be classified into three classes, as illustrated in Fig. 1: (i) classic Arabic, used in the Quran, Hadith, and old Arabic poetry, (ii) modern Arabic, which is a modified classic Arabic used in news, formal communications, and modern books, and (iii) dialectal Arabic, which is modern Arabic altered with regional speaking additions [4]. Moreover, diacritics, i.e., vocal symbols attached to letters that affect their pronunciation, are used far more heavily in classic Arabic than in modern or dialectal Arabic. Diacritics are important in Arabic speech, as mispronouncing a character, i.e., a letter or a diacritic, can change the word's meaning. Even though some Arabic words contain the same letters, a difference in diacritics can change their meaning. For instance, the terms "الجَنَّة", "الجُنَّة", and "الجِنَّة" mean "Heaven", "Protector", and "Jinns", respectively.

FIGURE 1. Arabic language taxonomy with examples.

Researchers have made some efforts to recognize Arabic speech, especially modern [5]-[8] and dialectal [9]-[12] Arabic. Researchers have also developed an Arabic poetry meter recognition model [13]. However, only some efforts were made to recognize classic Arabic letters [14], [15], digits [7], [16], [17], and one-word commands or isolated words [16], [18], [19].


Although classic Arabic is mostly used in education and Quran recitation, there is a lack of continuous classic Arabic speech recognition models. Additionally, diacritics are highly associated with classic Arabic. Therefore, classic Arabic speech recognition models and systems should be able to recognize diacritics in speech. Although recognizing diacritics negatively affected Arabic speech recognition performance, as reported in [20], recognizing diacritics is still important and requires extensive efforts to train and develop accurate classic Arabic speech recognition models.
A speech recognition model converts audio signals to text using either a traditional approach or an end-to-end Deep Neural Network (DNN)-based approach [21]. Traditional speech recognition depends on phonetics and pronunciation dictionaries to convert speech to text [22], [23]. Traditional speech recognition consists of three parts: (i) an acoustic model, (ii) a pronunciation dictionary, and (iii) a language model. The Hidden Markov Model (HMM) is the most used acoustic model in the traditional approach. However, the traditional speech recognition approach has limitations, such as its phonetic dependency. Therefore, the end-to-end DNN-based speech recognition approach was proposed [24]. End-to-end speech recognition models can recognize speech using a DNN without the need for a predefined pronunciation dictionary. End-to-end speech recognition consists of an encoder, a decoder, and an alignment method, such as Connectionist Temporal Classification (CTC). Few classic Arabic speech recognition models have been developed using traditional [20], [25] and end-to-end [3], [14], [17], [19], [26] speech recognition models.
Additionally, as there is no standard classic Arabic pronunciation dictionary, using an end-to-end speech recognition approach is preferable. Thus, this effort aims to recognize classic Arabic speech using DNNs following the end-to-end approach to convert audio signals to diacritized text. A dataset of more than 100 hours of classic Arabic speech with its diacritized transcripts was used to train the proposed models. Three DNN-based models were implemented and compared: (i) Time Delay Neural Network (TDNN)-CTC-based, (ii) Recurrent Neural Network (RNN)-CTC-based, and (iii) transformer-based.
This work converts classic Arabic speech input to diacritized text using the end-to-end speech recognition approach. The main contribution of this work is recognizing diacritized continuous classic Arabic speech using three DNN models that, to the best of our knowledge, have not been used with a large classic Arabic dataset. This effort also investigates the best-performing model's behaviour. Moreover, the best-performing model's performance approaches that of state-of-the-art models fine-tuned on Arabic datasets.
The rest of the paper is structured as follows: Section 2 discusses the classic Arabic speech recognition related works. Section 3 illustrates the work's methodology. Section 4 presents the experiments' details. Sections 5 and 6 discuss the results and the findings, and Section 7 concludes the study.

II. RELATED WORK
Humans verbally communicate through sounds. Recognizing sounds and speech is the key to understanding others' sayings [21]. Therefore, speech recognition technology uses computational power to recognize human speech by converting it to a machine-readable format that can then be converted to text to perform specific actions. Speech recognition is applied in different fields and applications. Traditional speech recognition consists of three independent components: (i) an acoustic model, (ii) a pronunciation dictionary, and (iii) a language model [21]. Statistical models, such as HMM and the Gaussian Mixture Model (GMM), are used as acoustic models in traditional speech recognition systems. However, traditional speech recognition had several limitations, e.g., the requirement for a predefined pronunciation dictionary. Therefore, DNNs have been developed to recognize audio signals directly into text without the need for a predefined pronunciation lexicon, forming an end-to-end speech recognition system.
Moreover, speech recognition systems can recognize (i) letters, (ii) isolated words, such as digits, commands, or single words, or (iii) continuous speech. Traditional and end-to-end speech recognition methods have been used with the classic Arabic language. The following subsections discuss the few classic Arabic speech recognition-related efforts, the gaps, and emerging speech recognition models.

A. TRADITIONAL CLASSIC ARABIC SPEECH RECOGNITION
Regarding the importance of Arabic diacritics in classic Arabic, the diacritics' effect on recognizing classic Arabic speech was studied in [20]. Eight traditional speech recognition models, namely (i) GMM-SI, (ii) GMM-SAT, (iii) GMM-MPE, (iv) GMM-MMI, (v) SGMM, (vi) SGMM-bMMI, (vii) DNN, and (viii) DNN-MPE, were trained with 23-hour continuous speech datasets containing 4754 sentences. The authors used two versions of the same dataset: a diacritized dataset (supporting six diacritics only) and a non-diacritized dataset. The DNN-MPE model reported the lowest Word Error Rates (WERs) of 4.68% (without diacritics) and 5.53% (with diacritics). Even though recognizing diacritics increased the WER by about 1%, recognizing diacritics in classic Arabic speech is still important and should be further improved.
On the other hand, other researchers used parts of the traditional speech recognition pipeline to convert graphemes to phonemes of diacritized classic Arabic words [25]. The joint multigram model was used to predict the phonemes of classic Arabic words and recorded a 42.5% WER. Although dealing with Arabic diacritics is challenging, interested researchers continued their efforts using advanced (end-to-end) methods that are explained in the following section.


B. END-TO-END CLASSIC ARABIC SPEECH RECOGNITION
An end-to-end diacritized classic Arabic speech recognition system was discussed in [26]. The authors compared the performance of three different speech recognition models on a single-speaker Arabic corpus. The corpus consisted of audio recordings of 51 thousand words and diacritized texts. The authors built and trained one traditional speech recognition model and two end-to-end models (a CTC-based model and a Convolutional Neural Network (CNN)-Long Short-Term Memory (LSTM)-attention-based model). As a result, the CNN-LSTM-attention-based model outperformed the traditional and CTC-based models by achieving a 28.48% WER compared to 33.72% and 31.10% WERs, respectively. Thus, Arabic speech recognition applications that value diacritics should consider end-to-end models.
Regarding the limited research attention on classic Arabic, especially in speech recognition, Arabic alphabet learning models were designed in [14] to correct mispronunciation. The work in [14] was divided into two parts: (i) alphabet recognition and (ii) pronunciation quality classification. After training two CNN-based DNNs and an RNN-based DNN, namely (i) DCNN, (ii) AlexNet, and (iii) Bidirectional LSTM (BLSTM), AlexNet outperformed the other models in both parts with 98.41% alphabet recognition accuracy and 99.14% pronunciation classification accuracy.
Furthermore, non-diacritized Arabic digits and single-word commands were collected to train SVM, LSTM, and KNN speech recognition models [19]. LSTM reported the best recognition accuracy of 98.12%, while KNN was the fastest in training duration. Moreover, an LSTM-based speech recognition model was also trained to recognize Arabic digits and recorded 69% testing accuracy [17].

C. CLASSIC ARABIC SPEECH RECOGNITION RESEARCH
Despite the wide spread of Arabic speakers around the world, classic Arabic speech recognition efforts are limited [14]. Isolated classic Arabic words received the most attention from the research community. Thus, continuous classic Arabic speech recognition is left undeveloped and lacks both datasets and development efforts. Moreover, Arabic speech recognition generally still requires researchers' attention compared with other languages such as English [27]. However, speech recognition models applied to other languages could be suitable for Arabic with slight tuning. Besides, as there is no standard Arabic pronunciation dictionary, an end-to-end speech recognition approach is preferred for Arabic speech. Furthermore, most Arabic speech recognition models discarded the diacritics. However, diacritics affect the pronunciation of a word, where a change in a diacritic can change the word's meaning.

D. EMERGING SPEECH RECOGNITION DEVELOPMENTS
The rapid development of transformer-based speech recognition models, their performance in streaming environments, and their fast training have directed researchers to use transformers in end-to-end speech recognition [28]. Transformer models are sequence-to-sequence models that use self-attention mechanisms instead of RNNs [29]. The authors in [29] aimed to transcribe audio recordings of dialectal Arabic using three different methods: (i) transformer-based, (ii) HMM-DNN-based, and (iii) manual transcription. The manual transcription was done by native and expert linguist participants given a set of audio recordings. The findings in [29] stated that the end-to-end transformer-based model supplemented with CTC outperformed the HMM-DNN-based and manual transcription methods. In addition, using the transformer model without a language model recorded a lower WER, hence achieving better recognition results.

III. METHODOLOGY
This section discusses the followed methodology and the materials used to conduct this study. The dataset, data processing, speech recognition models, and evaluation techniques used to recognize classic Arabic speech are presented in the following sections.

A. DATASET
A total of 72,735 classic Arabic, i.e., Quran, audio recordings of more than 100 hours have been collected from the "EveryAyah" [30] dataset. Each audio file has been stored in the waveform file format (.wav) and consists of a Quran recitation recited by expert reciters. Furthermore, the collected audio files have been recorded in both optimal and noisy environments. Thus, some cleaning steps have been performed, e.g., eliminating unclear audio recordings. Each audio file was mapped to its transcript, i.e., textual form, in a CSV file.
Data splitting has been performed to avoid overfitting. Overfitting is an issue that occurs when the model fits the training data very well but poorly recognizes new and unseen data [31]. Thus, overfitting affects the model's reproducibility. Therefore, the dataset was split into three subsets, (i) training, (ii) validation, and (iii) testing sets, as presented in Tab. 1.

TABLE I. THE CLASSIC ARABIC SPEECH DATASET DETAILS
Set               Number of audio files    Number of voices
Training set      57507                    102
Validation set    7332                     13
Testing set       7896                     14
Total             72735                    129

B. DATA PROCESSING


The dataset needed some preprocessing steps before being used to train and test the developed models, such as transforming the data into a machine-readable format. Two data types are input to our classic Arabic speech recognition models: (i) audio data and (ii) textual data (transcripts). These raw data need to be processed into a machine-readable format. The audio signals are converted to Mel-spectrograms. Spectrograms digitally visualize audio signals' time, frequency, and amplitude using the Short-Time Fourier Transform (STFT), which combines several Fast Fourier Transforms (FFTs) of overlapping audio segments over time [32]. The FFT is a Fourier Transform algorithm that converts the time domain of a non-periodic segmented signal into the frequency domain. Non-periodic signals represent real-life non-stationary signals, such as audio signals.
Moreover, the Mel scale is applied to the spectrograms, generating Mel-spectrograms that mimic how the human hearing system detects different frequencies. Thus, audio signals are converted to Mel-spectrograms before being input into the speech recognition models. Fig. 2 illustrates a sample of an audio signal converted to a Mel-spectrogram.

FIGURE 2. From audio signal to Mel-spectrogram.

Furthermore, as we use character-based speech recognition models, each character in Arabic is treated as a class. The characters in our case are the Arabic letters, diacritics, other symbols that affect letters' pronunciation, and a space character. Each character is vectorized with a specific index. Thus, character sequences in any verse, i.e., transcript, are converted to a sequence of indices, as exemplified in Fig. 3.

FIGURE 3. From character sequences in a word to indices.

C. SPEECH RECOGNITION MODELS
This section discusses the different speech recognition models used to recognize classic Arabic speech. The end-to-end speech recognition models consist of an encoder, an alignment technique, and a decoder. We applied three character-based speech recognition models with different DNN architectures in the encoder to find the best-performing model on our data. The TDNN-based, RNN-based, and transformer-based speech recognition models were used and trained with the diacritized classic Arabic speech. However, a greedy decoder was applied in each model. The greedy decoder decodes the aligned encoder output, i.e., the sequence of character indices with the highest probabilities, to text, i.e., a sequence of characters [33]. The greedy decoder was implemented to support outputs with arbitrary character combinations and to avoid the dependency on language vocabularies that comes with a beam search decoder; a minimal sketch of this decoding step is given below. Fig. 4 overviews the followed steps to recognize classic Arabic speech. Therefore, the following subsections discuss the encoders' structures of the implemented speech recognition models.

FIGURE 4. Overview of the classic Arabic speech recognition workflow.
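The following is a minimal sketch of the greedy CTC decoding described above, assuming the encoder emits a per-frame log-probability tensor and that index 62 is the CTC blank; the tensor shapes and the hypothetical idx_to_char mapping are illustrative, not the authors' exact implementation.

```python
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, idx_to_char: dict, blank_id: int = 62) -> str:
    """Greedy CTC decoding: pick the most likely class per frame,
    collapse repeated indices, and drop the blank symbol.

    log_probs: (time_steps, num_classes) log-probabilities from the encoder.
    idx_to_char: mapping from class index to Arabic character (hypothetical).
    """
    best_indices = log_probs.argmax(dim=-1).tolist()    # most likely class per frame
    decoded, previous = [], blank_id
    for idx in best_indices:
        # Skip blanks and collapse consecutive repeats of the same class.
        if idx != blank_id and idx != previous:
            decoded.append(idx_to_char[idx])
        previous = idx
    return "".join(decoded)
```

Unlike beam search with a language model, this keeps the output free of any fixed vocabulary, which is what allows arbitrary letter-diacritic combinations to be emitted.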
1) Time Delay Neural Network Speech Recognition Model with Connectionist Temporal Classification
TDNNs have been widely used as acoustic models in traditional and hybrid speech recognition systems [34]. However, with the developments in end-to-end speech recognition models, combinations of CNN and TDNN have been proposed and reported state-of-the-art results [35]. Jasper (Just Another SPEech Recognizer) is a TDNN-CTC end-to-end speech recognition model developed by NVIDIA in 2019 [35]. Jasper uses the CTC loss and has a block architecture consisting of blocks and convolutional sub-blocks; each sub-block contains four layers. Blocks in Jasper are connected using residual connections. The data in neural networks with residual connections flow along different paths and may skip some layers to reach the last layer, instead of the one-path sequential data flow in feedforward neural networks [36].


The residual connection in Jasper applies a 1x1 convolution followed by a batch normalization layer, whose output is added to the output of the batch normalization of the last sub-block. The summation is then passed to the activation function and dropout layers to produce the block's output.
Jasper achieved state-of-the-art results on English speech datasets. However, Jasper has high computational power and memory requirements due to its use of many parameters, i.e., over 200 million parameters. Thus, a smaller speech recognition model called QuartzNet was proposed based on the Jasper architecture with fewer parameters and lower computational power requirements [37]. QuartzNet implements depthwise separable convolutions by replacing Jasper's 1D convolutions with 1D time-channel separable convolutions.
Depthwise separable convolutions deal with the spatial (height and width) and depth (channel) dimensions [37]. They speed up the network and reduce its complexity by splitting the kernel into two smaller kernels: (i) a depthwise convolution and (ii) a pointwise convolution. The depthwise convolution is applied to each channel individually across a number of time frames (time steps), while the pointwise convolution operates on each time frame independently across all channels; a minimal sketch of this factorization is given below. The components of the used QuartzNet are illustrated in Fig. 5. This work applied QuartzNet as a TDNN-CTC speech recognition model to recognize classic Arabic speech. QuartzNet has never been used with classic Arabic speech before.

FIGURE 5. The QuartzNet model architecture [37].
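As an illustration of the time-channel separable convolution just described, the following PyTorch sketch factorizes a standard 1D convolution into a depthwise convolution (one filter per channel over time) followed by a pointwise 1x1 convolution across channels; the layer sizes are illustrative and are not QuartzNet's actual configuration.

```python
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    """1D time-channel separable convolution: depthwise over time, pointwise over channels."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        # Depthwise: groups == in_channels, so each channel is convolved with its own kernel.
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Pointwise: 1x1 convolution mixing information across channels per time frame.
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g., Mel-spectrogram features.
        return self.pointwise(self.depthwise(x))

# Example: 128 Mel bins in, 256 feature channels out, kernel spanning 33 time steps.
x = torch.randn(4, 128, 400)                        # (batch, mel_bins, frames)
y = TimeChannelSeparableConv1d(128, 256, 33)(x)     # -> (4, 256, 400)
```

Ignoring biases, this uses roughly 128*33 + 128*256 weights instead of the 128*256*33 of a full Conv1d(128, 256, 33), which is where QuartzNet's parameter savings over Jasper come from.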
2) Recurrent Neural Network Speech Recognition Model with Connectionist Temporal Classification
RNNs have been used in end-to-end speech recognition models to transform audio spectrograms into text transcriptions, such as Deep Speech [38]. The Deep Speech model consists of 5 layers of hidden units; the first three layers and the last layer are non-recurrent, while the fourth layer is an RNN with forward and backward passes. Moreover, the Deep Speech model uses the CTC loss to align the encoder's output to character sequences. However, speech recognition models with a single recurrent layer in the encoder cannot deal with large and continuous speech datasets, which limits their capabilities [39]. Therefore, an updated version, Deep Speech 2, was proposed with multiple CNN and RNN layers. Deep Speech 2 with one CNN layer, 5 GRU layers, and one fully connected layer achieved the lowest WER compared to the other proposed combinations of CNN and RNN layers. Therefore, in this work, we constructed our RNN-CTC speech recognition model based on the enhanced Deep Speech 2 architecture proposed in [40], which outperformed the recognition performance of the original Deep Speech 2.
The architecture of our RNN-CTC speech recognition model consists of 4 CNN layers (1 traditional CNN and 3 ResidualCNN), 5 Bidirectional Gated Recurrent Unit (BiGRU) layers, a fully connected layer, and a linear classification layer, as presented in Fig. 6; a minimal sketch of this layer stack follows. The audio features are extracted using the CNN layers, whereas the predictions for each frame, considering the previous frames, are performed in the BiGRU layers. GRU is an RNN variant that uses fewer computational resources compared to LSTM.
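The following PyTorch sketch mirrors the layer stack just described (a CNN front-end, stacked bidirectional GRUs, a fully connected layer, and a linear classifier over 63 classes); it is a simplified approximation under assumed feature sizes, not the authors' exact ResidualCNN implementation.

```python
import torch
import torch.nn as nn

class RNNCTCEncoder(nn.Module):
    """Simplified Deep Speech 2-style encoder: CNN feature extractor + BiGRU stack + classifier."""
    def __init__(self, n_mels: int = 128, hidden: int = 512, n_classes: int = 63):
        super().__init__()
        # CNN front-end over the (1, mel, time) spectrogram "image".
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.GELU(),
        )
        feat_dim = 32 * (n_mels // 2)                       # channels * reduced Mel bins
        self.birnn = nn.GRU(feat_dim, hidden, num_layers=5,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU(), nn.Dropout(0.1))
        self.classifier = nn.Linear(hidden, n_classes)      # 62 characters + 1 CTC blank

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                                   # (batch, 32, n_mels/2, time/2)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)       # (batch, time, features)
        x, _ = self.birnn(x)
        return self.classifier(self.fc(x)).log_softmax(-1)   # (batch, time, 63) log-probs for CTC

logits = RNNCTCEncoder()(torch.randn(2, 1, 128, 200))        # e.g., two Mel-spectrograms
```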
3) Transformer Speech Recognition Model
Transformers were first proposed to enhance machine translation using attention mechanisms [41]. Attention mechanisms are applied in sequence-to-sequence modelling to allow modelling dependencies regardless of the distances within the input or output sequences. Attention layers have been implemented in RNN models [42]. However, transformers, recurrent-free models based solely on attention layers to map input and output dependencies, were proposed and reported state-of-the-art results using fewer resources and less computation power compared to RNN-based models [41]. Recently, transformers have been applied in speech recognition and recorded competitive performance compared with other sequence-to-sequence models [43], [44]. The main advantages of transformers are (i) their fast learning ability with low memory usage compared with RNN models and (ii) their capability to capture long dependencies. However, transformers require large amounts of data to obtain good results.
In speech recognition, CNN layers are added to the transformer architecture to extract the audio features [44]. The flattened audio feature vector and its d-dimensional positional encoding are the input to the transformer's encoder.


In contrast, the character encoding, which converts the output character sequences to d-dimensional vectors, together with their positional encoding, is the input to the transformer's decoder. Additionally, the encoder's output is input to the second multi-head attention sub-layer in the decoder. Fig. 7 illustrates the speech transformer architecture; a minimal sketch of this input pipeline is given below.

FIGURE 7. The speech transformer architecture.
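The sketch below illustrates the input pipeline just described, assuming sinusoidal positional encodings and PyTorch's built-in nn.Transformer; the dimensions are illustrative, attention masks are omitted for brevity, and this is not the exact speech-transformer implementation of [44].

```python
import math
import torch
import torch.nn as nn

def positional_encoding(length: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding of shape (length, d_model)."""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

d_model, n_chars = 256, 64                        # 62 characters + '<' and '>' markers
audio_proj = nn.Linear(128, d_model)              # project Mel features to d_model
char_embed = nn.Embedding(n_chars, d_model)       # character embedding for the decoder
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=4, num_decoder_layers=4,
                             batch_first=True)
classifier = nn.Linear(d_model, n_chars)

mel = torch.randn(2, 300, 128)                    # (batch, frames, Mel bins)
chars = torch.randint(0, n_chars, (2, 40))        # (batch, target characters)
src = audio_proj(mel) + positional_encoding(300, d_model)
tgt = char_embed(chars) + positional_encoding(40, d_model)
logits = classifier(transformer(src, tgt))        # (2, 40, 64) class scores per output position
```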

FIGURE 6. The architecture of the Recurrent Neural Network speech recognition model with Connectionist Temporal Classification, where N represents the number of Bidirectional Gated Recurrent Unit layers.

D. EVALUATION METHOD
The WER was measured to evaluate the speech recognition models' performance. WER is a metric derived from the Levenshtein distance to measure the accuracy of a speech recognition model [45]. WER is calculated based on the number of deleted (D), inserted (I), and substituted (S) words that appear in the recognized text, as shown in (1):

WER = (D + S + I) / N    (1)

where N is the number of words in the target (reference) text. WER measures the speech recognition model's performance in terms of word recognition, considering word deletions, insertions, and substitutions. The word accuracy for each speech recognition model was calculated by subtracting the WER value from 1. Moreover, the Character Error Rate (CER) was calculated for the best-performing speech recognition model, i.e., the model with the lowest WER, to measure character recognition considering character deletions, insertions, and substitutions. The character accuracy was also calculated for the best-performing speech recognition model by subtracting the CER value from 1.


Additionally, a similarity score for each recognized verse was calculated by finding the longest contiguous matching sub-sequence between the recognized verse and the target verse; a minimal sketch of these metrics is given below. The similarity score ranges between 1 (identical) and 0 (dissimilar).
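The following sketch shows one way to compute these three metrics, assuming word-level and character-level Levenshtein distances and Python's difflib for matching sub-sequences; it is an illustration consistent with the definitions above, not the authors' evaluation script.

```python
from difflib import SequenceMatcher

def levenshtein(ref: list, hyp: list) -> int:
    """Edit distance counting deletions, insertions, and substitutions."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(list(reference), list(hypothesis)) / len(reference)

def similarity(reference: str, hypothesis: str) -> float:
    # Ratio built from contiguous matching blocks: 1.0 = identical, 0.0 = dissimilar.
    return SequenceMatcher(None, reference, hypothesis).ratio()
```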

IV. EXPERIMENTS SETUP
This section discusses the experimental details of implementing the DNN-based models to recognize classic Arabic speech.

A. DATA PROCESSING
The Arabic speech dataset went through processing steps before being fed to the speech recognition models. The performed preprocessing steps were:

1) Data Splitting
Data splitting is a technique used to split the data into two or three subsets to avoid overfitting the model on the data. The model is trained on a subset of the data and tested on another subset that it has not seen before. Our dataset was split into three sets: (i) training, (ii) validation, and (iii) testing sets. To ensure that the voices in the validation and testing sets differ from those in the training set, we randomly selected some reciters to appear only in the validation and testing sets, i.e., the training set did not contain any recording of those selected reciters; a sketch of this reciter-based split is given below. Based on this reciter separation, the splitting ratios were approximately 79%, 10%, and 11% for the training, validation, and testing sets, respectively.
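A minimal sketch of such a speaker-disjoint split is shown below, assuming a pandas DataFrame with hypothetical `reciter`, `audio_path`, and `transcript` columns; the held-out counts and file name are illustrative, and the actual assignment in this work was random per reciter.

```python
import random
import pandas as pd

def split_by_reciter(df: pd.DataFrame, n_val: int = 13, n_test: int = 14, seed: int = 0):
    """Hold out whole reciters so validation/testing voices never appear in training."""
    rng = random.Random(seed)
    reciters = sorted(df["reciter"].unique())
    held_out = rng.sample(reciters, n_val + n_test)
    val_reciters, test_reciters = set(held_out[:n_val]), set(held_out[n_val:])

    val = df[df["reciter"].isin(val_reciters)]
    test = df[df["reciter"].isin(test_reciters)]
    train = df[~df["reciter"].isin(val_reciters | test_reciters)]
    return train, val, test

# Example usage with the audio-to-transcript mapping CSV described in Section III.A.
df = pd.read_csv("everyayah_mapping.csv")           # hypothetical file name
train_df, val_df, test_df = split_by_reciter(df)
```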
2) Transforming Audio Files to Mel-Spectrograms
Each audio file was transformed from its waveform (time-series signal) to a Mel-spectrogram using the torchaudio.transforms library in Python. The torchaudio.transforms library helps in transforming raw audio files into other representations, such as Mel-spectrograms. Each raw .wav audio file was transformed to a Mel-spectrogram tensor with the default 128 Mel filterbanks. Mel filterbanks mimic the filterbanks in human ears. Then, normalization was applied to eliminate nulls in the Mel-spectrogram tensors.
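The snippet below sketches this transformation with torchaudio, assuming 128 Mel bins as stated above; the sample rate, the file name, and the log-and-standardize normalization are assumptions standing in for whatever exact settings the authors used.

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

def wav_to_mel(path: str) -> torch.Tensor:
    """Load a .wav recitation and return a normalized log-Mel-spectrogram tensor."""
    waveform, sample_rate = torchaudio.load(path)           # (channels, samples)
    if sample_rate != 16000:                                 # resample if needed (assumed rate)
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    mel = mel_transform(waveform)                            # (channels, 128, frames)
    mel = torch.log(mel + 1e-9)                              # avoid log(0) "nulls"
    return (mel - mel.mean()) / (mel.std() + 1e-9)           # standardize

spec = wav_to_mel("001001.wav")                              # hypothetical EveryAyah file name
```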
3) Characters Vectorization: Mapping Arabic Characters to Numerical Indices
The Arabic characters, including letters, diacritics, symbols, and spaces, have been mapped to numerical indices, i.e., character embedding. After removing symbols that do not affect letters' pronunciation, such as mandatory and optional recitation stop symbols, we remained with a total of 62 characters. Thus, as our speech recognition models are character-based, those 62 characters represent the 62 classes our models deal with when recognizing speech. Fig. 8 illustrates the mapping of those 62 characters. Therefore, every textual sentence (sequence of characters) was converted to a vector (sequence of numerical indices), as sketched below.

FIGURE 8. Arabic characters and their mapped indices.
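A minimal sketch of this character-to-index vectorization follows; the character inventory shown is a small illustrative subset, not the full 62-character set of Fig. 8.

```python
# Illustrative subset: a few Arabic letters, diacritics, and the space character.
CHARS = [" ", "ا", "ب", "ت", "ج", "ن", "ه", "ة", "ل",
         "\u064E",  # fatha
         "\u064F",  # damma
         "\u0650",  # kasra
         "\u0651"]  # shadda

char_to_idx = {c: i for i, c in enumerate(CHARS)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

def encode(verse: str) -> list[int]:
    """Convert a diacritized verse into a sequence of class indices."""
    return [char_to_idx[c] for c in verse if c in char_to_idx]

def decode(indices: list[int]) -> str:
    return "".join(idx_to_char[i] for i in indices)

print(encode("الجَنة"))   # -> [1, 8, 4, 9, 5, 7]
```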
B. SPEECH RECOGNITION MODELS
This section illustrates the experimental details of each speech recognition model trained and tested to recognize classic Arabic speech. This work implemented three speech recognition models: (i) TDNN-CTC, (ii) RNN-CTC, and (iii) transformers. Each model's details are explained in the following subsections.

1) Time Delay Neural Network Speech Recognition Model with Connectionist Temporal Classification
Our TDNN model consisted of an encoder, a decoder, and the CTC loss. We used the default QuartzNet [37] encoder architecture provided by NeMo [46] with two fixed blocks (the first and last blocks) and 15 repeated blocks; each block contained five sub-blocks. The encoder in the QuartzNet model is based on the Jasper model [35] and consists of seven Jasper layers (6 residual layers and one traditional layer). In contrast, the decoder is a linear classification layer that converts the encoder's output to 63 class probabilities (62 characters and one blank character for the CTC loss). Then the CTC loss is calculated, and the highest-probability class is mapped to its character representation using a greedy decoder; a minimal sketch of the CTC loss computation is given after Tab. 2. Tab. 2 illustrates the TDNN model summary. Our TDNN model used the Novograd optimization method. Novograd is an adaptive layer-wise stochastic optimization method that normalizes gradients and decouples weight decay per layer [47].

TABLE II. TIME DELAY NEURAL NETWORK SPEECH RECOGNITION MODEL SUMMARY
Name                        Type                  Parameters
Encoder                     QuartzNet encoder     1.2 M
Decoder                     QuartzNet decoder     64.6 K
Loss                        CTC loss              0
Optimizer                   Novograd              0
Trainable parameters                              1.2 M
Non-trainable parameters                          0
Total parameters                                  1.2 M
M = Million, K = Thousand
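For reference, the sketch below shows how a CTC loss over 63 classes (blank index 62) is typically computed in PyTorch on per-frame log-probabilities; it is a generic illustration with hypothetical shapes, not NeMo's internal training loop.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=62, zero_infinity=True)

# Hypothetical shapes: 2 utterances, 200 encoder frames, 63 classes (62 characters + blank).
log_probs = torch.randn(200, 2, 63, requires_grad=True).log_softmax(-1)  # (time, batch, classes)
targets = torch.randint(0, 62, (2, 45))          # encoded diacritized transcripts (no blanks)
input_lengths = torch.tensor([200, 180])         # valid encoder frames per utterance
target_lengths = torch.tensor([45, 38])          # valid characters per transcript

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back into the encoder during training
```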


2) Recurrent Neural Network Speech Recognition Model with Connectionist Temporal Classification
Our RNN model consisted of an encoder, a decoder, and the CTC loss. We used the improved version of Deep Speech 2 implemented in [40]. The encoder in the RNN model consists of 4 CNN layers (1 traditional CNN and 3 ResidualCNN), 5 BiGRU layers, and a fully connected layer. The decoder is a linear classification layer that converts the encoder's output to 63 class probabilities (62 characters and one blank character for the CTC loss). Then the CTC loss is calculated, and the highest-probability class is mapped to its character representation using a greedy decoder. Tab. 3 illustrates the RNN model summary. Our RNN model used the AdamW optimization method. AdamW is a stochastic optimization method modified from Adam that decouples the weight decay [48].

TABLE III. RECURRENT NEURAL NETWORK SPEECH RECOGNITION MODEL SUMMARY
Name                        Type                  Parameters
Encoder                     RNN encoder           23.7 M
Decoder                     Sequential decoder    0
Loss                        CTC loss              0
Optimizer                   AdamW                 0
Trainable parameters                              23.7 M
Non-trainable parameters                          0
Total parameters                                  23.7 M
M = Million, K = Thousand

3) Transformer Speech Recognition Model
Our transformer model consisted of an encoder and a decoder. We used the speech transformer architecture proposed in [44] with different numbers of encoder and decoder layers to explore the transformer's performance. Audio feature embedding and character embedding were applied to the dataset's audios and textual verses before being fed to the encoder and decoder. The audio feature embedding method extracts audio features from the Mel-spectrograms. In contrast, the character embedding method maps each character to a numerical representation. The transformer's encoder consists of two sub-layers: (i) multi-head attention and (ii) feedforward network layers. The transformer's decoder consists of three sub-layers: (i) masked multi-head attention, (ii) multi-head attention, and (iii) feedforward network layers. The decoder's output is then fed to a linear classification layer with a softmax activation function that converts the transformer's decoder output to 64 class probabilities (62 characters and two special characters, < and >, to notify the model of the beginning and ending of each verse). A categorical cross-entropy loss was calculated instead of the CTC loss to map the highest-probability class to its character representation using a greedy decoder. Tab. 4 illustrates the transformer model summary. Our transformer models used the Adam optimization method. Adam is a stochastic optimization method based on adaptive estimation of the first- and second-order moments [49].

TABLE IV. TRANSFORMER SPEECH RECOGNITION MODEL SUMMARY
Name                        Type                             Parameters
Encoder                     Transformer encoder              3.1 M
Decoder                     Transformer decoder              0.8 M
Loss                        Categorical cross-entropy loss   0
Optimizer                   Adam                             0
Trainable parameters                                         3.9 M
Non-trainable parameters                                     0
Total parameters                                             3.9 M
M = Million, K = Thousand

Tab. 5 presents the experimental details of each speech recognition model trained and validated with the same training and validation set sizes specified in Tab. 1. Due to the limited storage and power resources, different parameter settings were applied to each model. For instance, the batch size differs from one model to another based on the limited memory storage. Moreover, the learning rates were chosen after being tuned for each model in several experiments. The TDNN-CTC model took the longest duration, with approximately two days of training, as illustrated in Tab. 6.
On the other hand, we explored different numbers of transformer encoder and decoder layers. Based on [44], the transformer with six encoder and six decoder layers reported the best results. However, we also trained a four-encoder, four-decoder transformer and a four-encoder, one-decoder transformer, inspired by [50], to explore the effect of the number of layers.
Furthermore, the early stopping technique was applied to the RNN-CTC and transformer models. Early stopping is an optimization technique that ends the model training early when the model's performance stops improving on the validation data [51]; a minimal sketch is given after Tab. 5. Without early stopping, the model would overfit the training data. Fig. 9 illustrates the points where early stopping should be applied to the best-performed speech recognition model, as the model seemed to overfit the training data. Early stopping should be applied when the model's learning rate and training loss decrease after a peak while the validation loss increases. The red circles in Fig. 9 identify the points where early stopping should be performed to avoid overfitting.

TABLE V. THE EXPERIMENTAL DETAILS OF THE SPEECH RECOGNITION MODELS
Model                                  Epochs    Batch Size    Learning Rate    Optimizer
TDNN-CTC                               100       32            0.01             Novograd
RNN-CTC                                100       8             0.0005           AdamW
Transformer (4 encoders, 4 decoders)   100       64            0.00001          Adam
Transformer (6 encoders, 6 decoders)   100       64            0.00001          Adam
Transformer (4 encoders, 1 decoder)    100       64            0.00001          Adam
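The early stopping criterion described above can be sketched as follows, assuming a patience counter over epoch-level validation losses; the patience value and loss sequence are illustrative, not the settings used in the experiments.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss        # validation loss improved; reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1             # validation loss got worse (possible overfitting)
        return self.bad_epochs >= self.patience

# Example: stop the loop once the validation loss keeps rising while the training loss falls.
stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([1.2, 0.9, 0.8, 0.85, 0.9, 1.0]):
    if stopper.step(val_loss):
        print(f"early stop at epoch {epoch}")
        break
```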


TABLE VI. THE TRAINING DETAILS OF THE SPEECH RECOGNITION MODELS
Model                                  Epochs     GPU Type    Trainable Parameters    Training Duration (Hours)
TDNN-CTC                               100        P400        1.2 M                   44
RNN-CTC                                100 (1)    V100        23.7 M                  39.5
Transformer (4 encoders, 4 decoders)   100 (2)    V100        3.9 M                   2.5
Transformer (6 encoders, 6 decoders)   100 (3)    V100        3.9 M                   2.5
Transformer (4 encoders, 1 decoder)    100 (4)    V100        3.9 M                   2.5
(1) Early stop at epoch 29. (2) Early stop at epoch 11. (3) Early stop at epoch 6. (4) Early stop at epoch 16.

FIGURE 9. The points where the learning rate and training loss dropped while the validation loss increased for the best-performed speech recognition model. The red circles identify the points where early stopping should be performed to avoid overfitting.

V. RESULTS
This section discusses the experiments' outcomes regarding recognizing diacritized classic Arabic speech using the three speech recognition models and analyzes the best-performed model's recognition results. Besides, the effectiveness of the best-performed classic Arabic speech recognition model is explored using a sample of out-of-the-dataset audios.
Three DNN-based speech recognition models have been trained and tested to recognize diacritized classic Arabic speech: (i) TDNN-CTC-based, (ii) RNN-CTC-based, and (iii) transformer-based recognizers. Tab. 7 illustrates the testing WER for each model. Moreover, Fig. 10 presents a sample of the models' recognized verses. Correctly recognized characters are in green, while misrecognized characters are identified in red font colour.
From Fig. 10, it is noticeable that the TDNN-CTC and RNN-CTC models are character-based recognition models, i.e., they recognize character by character, thus helping in identifying character-based mistakes. However, the transformer-based recognition model appeared to be a sentence-based recognition model, i.e., the model either recognizes the sentence and outputs the correct sentence or produces another unrelated sentence. Moreover, an early stopping technique was applied to the transformer models to avoid overfitting, as we noticed that the transformer-based models overfit a specific sentence after a number of training epochs. However, the behaviour of the transformer-based models was directed towards recognizing the whole sentence instead of recognizing character by character, i.e., the aim of this work. Therefore, based on the WER results, the RNN-CTC speech recognition model is the best-performed model as it reported the lowest WER.

TABLE VII. DIACRITICS-BASED ARABIC SPEECH RECOGNITION MODELS PERFORMANCE
Model                                        WER
TDNN-CTC                                     45.73%
RNN-CTC*                                     19.43%
Base transformer (4 encoders, 1 decoder)     114.69%
Transformer* (4 encoders, 1 decoder)         111.20%
Transformer* (4 encoders, 4 decoders)        95.03%
* With early stopping

FIGURE 10. A sample of the diacritics-based Arabic speech recognition models' outcomes.

Based on Tab. 7, the RNN-CTC speech recognition model is considered the best-performed model as it reported the highest recognition performance with the lowest WER compared with the other two recognition models. Furthermore, Tab. 8 shows the RNN-CTC model's recognition performance on the validation and testing sets. The RNN-CTC model recorded a low CER, indicating a 96.49% character accuracy on the testing set. However, around 44.22% of the recordings in the testing set were misrecognized with different similarity scores to their target sentence, as shown in Tab. 9. The similarity scores in Tab. 9 represent the character similarity between the recognized and target verses. Similarity scores close to 1 indicate high similarity and few generated mistakes, while similarity scores below 0.5 indicate low similarity and major mistakes. Fig. 11 presents a sample of generated sentences with different similarity scores.

TABLE VIII. BEST-PERFORMED MODEL RECOGNITION PERFORMANCE
Set           Size (audio files)    WER       Words accuracy (100-WER)    CER      Characters accuracy (100-CER)
Validation    7332                  13.39%    86.61%                      2.39%    97.61%
Testing       7896                  19.43%    80.57%                      3.51%    96.49%

TABLE IX. SIMILARITY SCORES OF THE MISRECOGNIZED VERSES IN THE TESTING SET
Similarity score (rounded to 1 decimal place)    Number of verses
1.0                                              2300
0.9                                              1003
0.8                                              144
0.7                                              29
0.6                                              13
0.5                                              3
Total misrecognized verses                       3492


FIGURE 11. A sample of generated verses with different similarity scores.

On the other hand, 72.58% of the dataset's audios were correctly recognized, while the remaining 27.42% of audios were misrecognized. Analyzing the misrecognized speech helps in detecting the factors affecting the recognition performance. Verses that have been misrecognized more than 100 times are considered the model's most misrecognized recordings, because the dataset contained 129 different recordings of each verse. Tab. 10 illustrates the top 5 least and most misrecognized verses. Out of 129 recordings of each verse, the most misrecognized verse was correctly recognized only two times, while the most recognized verse was correctly recognized 122 times. Moreover, a verse may be misrecognized an average of 35 times, i.e., 35 is the mean number of times the model misrecognizes a verse.

TABLE X. TOP FIVE LEAST AND MOST MISRECOGNIZED VERSES (verse, verse length, misrecognized frequency)
Least misrecognized verses (top recognized verses):
- "مِن شَرِّ مَا خَلَقَ" (length 20, misrecognized 7 times)
- "وَفَاكِهَةً وَأَبًّا" (length 23, misrecognized 8 times)
- "وَمَا أَدْرَاكَ مَا الْحُطَمَةُ" (length 33, misrecognized 8 times)
- "وَمَا أَدْرَاكَ مَا الْعَقَبَةُ" (length 33, misrecognized 9 times)
- "جَزَاءً وِفَاقًا" (length 18, misrecognized 9 times)
Most misrecognized verses (least recognized verses):
- "جَزَاؤُهُمْ عِندَ رَبِّهِمْ جَنَّاتُ عَدْنٍ تَجْرِي مِن تَحْتِهَا الْأَنْهَارُ خَالِدِينَ فِيهَا أَبَدًا رَضِيَ اللَّهُ عَنْهُمْ وَرَضُوا عَنْهُ ذَلِكَ لِمَنْ خَشِيَ رَبَّهُ" (length 188, misrecognized 127 times)
- "وَمَا أُمِرُوا إِلَّا لِيَعْبُدُوا اللَّهَ مُخْلِصِينَ لَهُ الدِّينَ حُنَفَاءَ وَيُقِيمُوا الصَّلَاةَ وَيُؤْتُوا الزَّكَاةَ وَذَلِكَ دِينُ الْقَيِّمَةِ" (length 164, misrecognized 126 times)
- "إِنَّا أَنذَرْنَاكُمْ عَذَابًا قَرِيبًا يَوْمَ يَنظُرُ الْمَرْءُ مَا قَدَّمَتْ يَدَاهُ وَيَقُولُ الْكَافِرُ يَا لَيْتَنِي كُنتُ تُرَابًا" (length 141, misrecognized 122 times)
- "يَوْمَ يَقُومُ الرُّوحُ وَالْمَلَائِكَةُ صَفًّا لَا يَتَكَلَّمُونَ إِلَّا مَنْ أَذِنَ لَهُ الرَّحْمَنُ وَقَالَ صَوَابًا" (length 128, misrecognized 119 times)
- "إِنَّ الَّذِينَ فَتَنُوا الْمُؤْمِنِينَ وَالْمُؤْمِنَاتِ ثُمَّ لَمْ يَتُوبُوا فَلَهُمْ عَذَابُ جَهَنَّمَ وَلَهُمْ عَذَابُ الْحَرِيقِ" (length 135, misrecognized 115 times)

On the other side, the most recognized speaker was Ibrahim Alakhdar, with only 190 out of 1692 of his recordings misrecognized, as shown in Fig. 12. In contrast, the least recognized speaker was Abdulbasit Abdulsamad, with 1172 misrecognized recordings out of 1692 total recordings.

FIGURE 12. The frequency of the model's misrecognized verses for each reciter.

Furthermore, to test the model's effectiveness with out-of-the-dataset voices, a sample of Quran recitation recordings of six verses was collected from out-of-the-dataset voices of different genders and age groups. The recordings of the top two most and least misrecognized sentences were collected from online-available Quran recordings. In addition, recordings of two sentences that have been misrecognized 35 times, i.e., the mean value, have also been collected from ordinary people. Online-available recordings of a man, a woman, a boy, and a girl reciting these six sentences were collected, along with the authors' recordings, to test the model's effectiveness. Tab. 11 presents the RNN-CTC model's performance in recognizing out-of-the-dataset voices. A sample of the sentences recognized from ordinary speakers is illustrated in Fig. 13, with the model-generated mistakes highlighted in yellow. From Fig. 14, we can derive that the model performance has not been affected by the gender or the age group of the reciter.

TABLE XI. BEST-PERFORMED MODEL'S PERFORMANCE WITH ORDINARY PEOPLE RECITATIONS
Number of recordings    WER       Words accuracy (100-WER)    CER       Characters accuracy (100-CER)    Number of misrecognized verses
30                      38.90%    61.1%                       10.62%    89.38%                           20

FIGURE 13. A sample of the recognized verses recited by ordinary people.


FIGURE 14. The frequency of the model's misrecognized verses for each ordinary reciter.

VI. DISCUSSION
According to the results, the main findings are:

A. Speech recognition models can convert audio speech to diacritized text after training them with diacritized target text labels.
Both the TDNN-CTC and RNN-CTC models converted audio speech to diacritized text, as illustrated earlier in Fig. 10. The TDNN-CTC and RNN-CTC models are character-based recognition models, i.e., they recognize character by character, thus helping in identifying character-based mistakes. The characters in this case include letters and diacritics. On the other hand, the transformer-based models failed to recognize character by character and recognized the whole sentence instead.

B. The RNN-based speech recognition model outperformed the transformer-based and TDNN-based models when trained with diacritized classic Arabic speech.
Based on the results, our RNN-CTC speech recognition model outperformed the TDNN-CTC and transformer recognition models with the lowest WER of 19.43%. The RNN-CTC speech recognition model recognized classic Arabic speech and converted it to diacritized text with word and character testing accuracies of 80.57% and 96.49%, respectively.
Moreover, the performance of our RNN-CTC model is very close to that of state-of-the-art models fine-tuned on Arabic datasets. Whisper [52], the large transformer-based speech recognition model, reported a 34.28% WER when trained with an Arabic dataset [53] and reached a 13.4% WER and 2.8% CER when trained with the 10-hour diacritics-based single-speaker classic Arabic dataset [3]. On the other hand, wav2vec reported a 16% WER and 3% CER on a diacritics-based single-speaker Arabic dataset.
Therefore, considering the advantage of our multi-speaker-trained RNN-CTC model, reaching a 19.43% WER and 3.5% CER is competitive with the state-of-the-art models' performance.

C. The sound and audio recording quality highly affect the recognition model performance, as stressed sounds are likely to be misrecognized.
After analyzing the verses misrecognized by the RNN-CTC speech recognition model, the model tends to misrecognize lengthy speech more than short speech. Additionally, the model performance highly depends on the stress during speaking and the quality of the audio recordings, as the model recognized Ibrahim Alakhdar's unstressed recordings better than the stressed recordings of Abdulbasit Abdulsamad. Stressed recordings are recordings with longer and louder sounds. Moreover, our model was not affected by the age or gender of the reciter.

VII. CONCLUSION
Despite the high population of Arabic speakers, Arabic speech recognition efforts are still underdeveloped. Specifically, continuous classic Arabic speech recognition has received the least attention. On the other hand, the majority of Arabic speech recognition works are based on the traditional speech recognition structure, relying on a pronunciation dictionary that links each word with its phonetic representation. Creating and building pronunciation dictionaries for any language or purpose requires huge efforts from different experts, e.g., linguists and phoneticians. However, with technology's growth, DNN-based end-to-end speech recognition structures have been developed to overcome the limitations of traditional speech recognition systems. Despite the rapid development of end-to-end speech recognition-based solutions, limited classic Arabic speech recognition-related solutions have been developed.
Furthermore, most Arabic speech recognition models discarded the diacritics. However, diacritics affect the pronunciation of a word, where a change in a diacritic can change the word's meaning. Therefore, this work contributed to recognizing diacritized classic Arabic speech using DNN-based speech recognition models. This work went through two phases: (i) data processing and (ii) classic Arabic speech recognition. Three DNN-based models have been trained and tested with classic Arabic recordings to convert them into diacritized text in the speech recognition phase. After comparing the performance of the three DNNs, (i) TDNN-CTC, (ii) RNN-CTC, and (iii) transformer speech recognition models, the RNN-CTC model obtained the best results with the lowest WER of 19.43%. Based on the results, the diacritics-based Arabic speech recognition model performs well with clear, unstressed recordings of short sentences. The longer the spoken sentence, the more mistakes the model could generate. Moreover, the best-performed recognition model effectively recognized out-of-the-dataset sounds.


The work's outcomes highly contribute to classic Arabic speech recognition efforts and solutions, given the lack of DNN-based continuous classic Arabic speech recognition developments. This work's outcomes also enhance the existing smart classic Arabic speech recognition solutions by recognizing diacritics. We believe that this effort will open opportunities regarding recognizing classic Arabic speech. Moreover, the trained models could be retrained, i.e., using transfer learning, to build other Arabic speech recognition solutions in different fields, such as education. The main contribution of this work was training and comparing the performance of three DNN models with diacritized classic Arabic speech. We highly encourage interested researchers to contribute to developing smart Arabic solutions. We also plan in the near future to continue our developments by contributing to improving the recognition performance.

ACKNOWLEDGMENT
The authors gratefully acknowledge Qassim University, represented by the Deanship of Scientific Research, for the financial support for this research under the number (COC-2022-1-2-J-30493) during the academic year 1444 AH / 2022 AD.

REFERENCES
[1] S. Boukil, M. Biniz, F. Eladnani, L. Cherrat and A. E. Elmoutaouakkil, "Arabic Text Classification Using Deep Learning Technics," International Journal of Grid and Distributed Computing, vol. 11, no. 9, pp. 103-114, 2018.
[2] B. H. A. Ahmed and A. S. Ghabayen, "Arabic Automatic Speech Recognition Enhancement," 2017 Palestinian International Conference on Information and Communication Technology (PICICT), Gaza, Palestine, IEEE, pp. 98-102, 2017.
[3] H. Aldarmaki and A. Ghannam, "Diacritic Recognition Performance in Arabic ASR," arXiv preprint arXiv:2302.14022, pp. 1-5, 2023.
[4] W. Algihab, N. Alawwad, A. Aldawish and S. AlHumoud, "Arabic Speech Recognition with Deep Learning: A Review," in International Conference on Human-Computer Interaction, Springer, vol. 11578, pp. 15-31, 2019.
[5] M. Menacer et al., "An enhanced automatic speech recognition system for Arabic," The Third Arabic Natural Language Processing Workshop - EACL 2017, Valencia, Spain, pp. 1-9, 2017.
[6] A. Ahmed, Y. Hifny, K. Shaalan and S. Toral, "End-to-End Lexicon Free Arabic Speech Recognition Using Recurrent Neural Networks," Computational Linguistics, Speech and Image Processing for Arabic Language, pp. 231-248, 2018.
[7] A. Ouisaadane and S. Safi, "A comparative study for Arabic speech recognition system in noisy environments," International Journal of Speech Technology, vol. 24, pp. 761-770, 2021.
[8] E. R. Rady, A. Hassen, N. M. Hassan and M. Hesham, "Convolutional Neural Network for Arabic Speech Recognition," Egyptian Journal of Language Engineering, vol. 8, no. 1, pp. 27-38, 2021.
[9] E. Alsharhan and A. Ramsay, "Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition," Language Resources and Evaluation, vol. 54, pp. 975-998, 2020.
[10] M. Eldesouki, N. Gopee, A. Ali and K. Darwish, "FarSpeech: Arabic Natural Language Processing for Live Arabic Speech," Interspeech 2019, Graz, Austria, pp. 2372-2373, 2019.
[11] A. Ali et al., "The MGB-5 Challenge: Recognition and Dialect Identification of Dialectal Arabic Speech," 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1026-1033, 2019.
[12] H. Q. Jaber and H. A. Abdulbaqi, "Real time Arabic speech recognition based on convolution neural network," Journal of Information and Optimization Sciences, vol. 42, pp. 1657-1663, 2021.
[13] A. K. Al-Talabani, "Automatic Recognition of Arabic Poetry Meter from Speech Signal using Long Short-term Memory and Support Vector Machine," The Scientific Journal of Koya University, vol. 8, no. 1, pp. 50-54, 2020.
[14] N. Ziafat et al., "Correct Pronunciation Detection of the Arabic Alphabet Using Deep Learning," Applied Sciences, vol. 11, no. 6, pp. 1-19, 2021.
[15] F. Alqadheeb, A. Asif and H. F. Ahmad, "Correct Pronunciation Detection for Classical Arabic Phonemes Using Deep Learning," 2021 International Conference of Women in Data Science at Taif University (WiDSTaif), pp. 1-6, 2021.
[16] N. Zerari, S. Abdelhamid, H. Bouzgou and C. Raymond, "Bidirectional deep architecture for Arabic speech recognition," Open Computer Science, vol. 9, no. 1, pp. 92-102, 2019.
[17] A. S. Mahfoudh Ba Wazir and J. Huang Chuah, "Spoken Arabic Digits Recognition Using Deep Learning," 2019 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), pp. 339-344, 2019.
[18] S. R. Shareef and Y. F. Irhayim, "A Review: Isolated Arabic Words Recognition Using Artificial Intelligent Techniques," Journal of Physics: Conference Series, pp. 1-14, 2021.
[19] L. T. Benamer and O. A. S. Alkishriwo, "Database for Arabic Speech Commands Recognition," Third Conference for Engineering Sciences and Technology (CEST-2020), Alkhoms, Libya, pp. 1-9, 2020.
[20] S. Abed, M. Alshayeji and S. Sultan, "Diacritics Effect on Arabic Speech Recognition," Arabian Journal for Science and Engineering, vol. 44, pp. 9043-9056, 2019.
[21] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh and K. Shaalan, "Speech Recognition Using Deep Neural Networks: A Systematic Review," IEEE Access, vol. 7, pp. 19143-19165, 2019.
[22] S. Wang and G. Li, "Overview of end-to-end speech recognition," Journal of Physics: Conference Series, vol. 1187, no. 5, pp. 1-4, 2019.
[23] A. A. Abdelhamid, H. A. Alsayadi, I. Hegazy and Z. T. Fayed, "End-to-End Arabic Speech Recognition: A Review," The 19th Conference of Language Engineering (ESOLEC'19), Alexandria, Egypt, pp. 1-14, 2020.
[24] C.-C. Chiu et al., "State-of-the-Art Speech Recognition with Sequence-to-Sequence Models," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, IEEE, pp. 4774-4778, 2018.
[25] E. H. Cherifi and M. Guerti, "Arabic grapheme-to-phoneme conversion based on joint multi-gram model," International Journal of Speech Technology, vol. 24, pp. 173-182, 2021.
[26] H. A. Alsayadi, A. A. Abselhamid, I. Hegazy and Z. Fayed, "Arabic speech recognition using end-to-end deep learning," IET Signal Processing, vol. 15, no. 8, pp. 521-534, 2020.
[27] F. Al-Anzi and D. AbuZeina, "Literature Survey of Arabic Speech Recognition," 2018 International Conference on Computing Sciences and Engineering (ICCSE), pp. 1-6, 2018.
[28] D. Obukhov, "Breakthroughs in speech recognition achieved with the use of transformers," 2021. Towards Data Science [Online]. Accessed: Jun. 8, 2022. Available: https://towardsdatascience.com/breakthroughs-in-speech-recognition-achieved-with-the-use-of-transformers-6aa7c5f8cb02.
[29] A. Hussein, S. Watanabe and A. Ali, "Arabic speech recognition by end-to-end, modular systems and human," Computer Speech and Language, vol. 71, no. 101272, pp. 1-17, 2022.
[30] Quran.com, "Everyayah dataset," 2009. [Online]. Accessed: May 8, 2022. Available: https://everyayah.com/.
[31] X. Ying, "An overview of overfitting and its solutions," Journal of Physics: Conference Series, vol. 1168, no. 022022, pp. 1-7, 2019.
[32] L. Roberts, "Understanding the Mel Spectrogram," 2020. Analytics Vidhya [Online]. Accessed: Jun. 8, 2022. Available: https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53.
[33] Siddhant, "Decoding connectionist temporal classification," 2019. GitHub [Online]. Accessed: Jun. 8, 2022. Available: https://sid2697.github.io/Blog_Sid/algorithm/2019/11/04/Beam-search.html.
[34] M. Sugiyama, H. Sawai and A. Waibel, "Review of TDNN (time delay neural network) architectures for speech recognition," in 1991 IEEE International Symposium on Circuits and Systems (ISCAS), vol. 1, Singapore, pp. 582-585, IEEE, 1991.

VOLUME XX, 2017 7

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3300972

IEEE International Symposium on Circuits and Systems (ISCAS), vol. [43] N.-Q.Phametal.,“Very deep self-attention networks for end-to-end
1, Singapore, pp. 582–585, IEEE, 1991. speech recognition,” Computation and Language, pp. 1–5, 2019.
[35] J. Li et al., “Jasper: An End-to-End convolutional neural acoustic [44] L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence
model,” in Interspeech 2019, pp. 71–75, 2019. sequence-to-sequence model for speech recognition,” in 2018 IEEE
[36] K. Huang, Y. Wang, M. Tao, and T. Zhao, “Why do deep residual International Conference on Acoustics, Speech and Signal Processing
networks generalize better than deep feedforward networks? — a (ICASSP), Calgary, AB, Canada, pp. 5884–5888, IEEE, 2018.
neural tangent kernel perspective,” in Proceedings of the 34th [45] V. I. Levenshtein, “Binary codes capable of correcting deletions,
International Conference on Neural Information Processing Systems, insertions and reversals,” Soviet Physics Doklady, vol. 10, p. 707,
vol. 227 of NIPS’20, Vancouver, BC, Canada, pp. 1–12, Curran 1966.
Associates Inc., 2020. [46] O. Kuchaiev et al., “Nemo: a toolkit for building ai applications using
[37] S. Kriman et al., “Quartznet: Deep automatic speech recognition with neural modules,” arXiv preprint, vol. arXiv:1909.09577, 2019.
1D Time-Channel Separable Convolutions,” in ICASSP 2020 - 2020 [47] B. Ginsburg et al., “Training deep networks with stochastic gradient
IEEE International Conference on Acoustics, Speech and Signal normalized by layerwise adaptive second moments,” in The
Processing (ICASSP), Barcelona, Spain, pp. 6124–6128, IEEE, 2020. International Conference on Learning Representations (ICLR), pp. 1–
[38] A. Y. Hannun et al., “Deep Speech: Scaling up end-to-end speech 7, 2019.
recognition,” ArXiv, vol. abs/1412.5567, pp. 1–12, 2014. [48] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,”
[39] D. Amodei et al., “Deep Speech 2: End-to-end speech recognition in in The International Conference on Learning Representations (ICLR),
English and Mandarin,” in Proceedings of the 33rd International pp. 1–18, 2019.
Conference on International Conference on Machine Learning - [49] D. P. Kingma and J. Ba, “Adam: A method for stochastic
Volume 48, ICML’16, New York, NY, USA, pp. 173—-182, optimization,” in The International Conference on Learning
JMLR.org, 2016. Representations (ICLR), pp. 1–15, 2015.
[40] M. Nguyen, “Building an end-to-end speech recognition model in [50] A. Nandan, “Automatic speech recognition with transformer,” 2021.
PyTorch,” 2020. AssemblyAI, [Online]. Accessed: Jun. 8, 2022. Keras, [Online]. Accessed: Jun. 8, 2022. Available:
Available: https://www.assemblyai.com/blog/end-to-end-speech- https://keras.io/examples/audio/transformer_asr/.
recognition-pytorch/. [51] L. Prechelt, Neural Networks: Tricks of the Trade: Second Edition, ch.
[41] A. Vaswani et al., “Attention is all you need,” in Proceedings of the Early Stopping — But When?, pp. 53–67. Berlin, Heidelberg:
31st International Conference on Neural Information Processing Springer Berlin Heidelberg, 2012.
Systems, NIPS’17, Long Beach, California, USA, pp. 6000–6010, [52] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I.
Curran Associates Inc., 2017. Sutskever, “Robust Speech Recognition via Large-Scale Weak
[42] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint Supervision,” OpenIA, pp. 1-28, 2022.
CTC-attention based end-to-end speech recognition with a deep CNN [53] Z. Alyafeai, “whisperar,” 2022. GitHub, [Online]. Accessed: Jun. 24,
encoder and RNN-LM,” in Interspeech 2017, pp. 949–953, 2017. 2023. Available: https://github.com/ARBML/whisperar.

Sarah S. Alrumiah received her B.Sc. and M.Sc. degrees in Information Technology from the Department of Information Technology, College of Computer, Qassim University, Saudi Arabia, in 2020 and 2022, respectively. Her research interests include Artificial Intelligence, Natural Language Processing and Understanding, and Brain-Computer Interface. One of her works was awarded the silver medal at Geneva Invention in 2022.

Amal A. Alshargabi received the master's and Ph.D. degrees from Universiti Teknologi Mara (UiTM), Malaysia. She is currently an Assistant Professor with the College of Computer, Qassim University. Her research interests include program comprehension, empirical software engineering, and machine learning.
