A Deep Diacritics-Based Recognition Model For Arab
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3300972
ABSTRACT Arabic is the language of more than 422 million people worldwide. Although classic Arabic is the language of the Quran, which 1.9 billion Muslims are required to recite, Arabic speech recognition remains limited. In classic Arabic, diacritics affect the pronunciation of a word; a change in a single diacritic can change the meaning of a word. Nevertheless, most Arabic speech recognition models discard diacritics. This work aims to recognize classic Arabic speech while considering diacritics by converting audio signals to diacritized text using Deep Neural Network (DNN)-based models. DNN-based models recognize speech directly, avoiding the phonetic dependency of traditional speech recognition systems. Three models were developed to recognize Arabic speech: (i) Time Delay Neural Network-Connectionist Temporal Classification (CTC), (ii) Recurrent Neural Network (RNN)-CTC, and (iii) transformer. A 100-hour dataset of Quran recordings was used. Based on the results, the RNN-CTC model obtained state-of-the-art results with the lowest word error rate of 19.43% and a 3.51% character error rate. The RNN-CTC model recognizes speech character by character, which is more reliable than the transformers' whole-sentence recognition behaviour. The model performed well with clear, unstressed recordings of short sentences. Moreover, the RNN-CTC model effectively recognized out-of-dataset sounds. The findings recommend continuing the efforts to enhance diacritics-based Arabic speech recognition models using clear and unstressed recordings to obtain better performance. Moreover, pretraining large speech models could yield accurate recognition. The outcomes can be used to enhance existing classic Arabic speech recognition solutions by supporting diacritics recognition.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
education and Quran recitation, there is a lack of continuous classic Arabic speech recognition models. Additionally, diacritics are highly associated with classic Arabic. Therefore, classic Arabic speech recognition models and systems should be able to recognize diacritics in speech. Although recognizing diacritics negatively affected the Arabic speech recognition model performance reported in [20], recognizing diacritics is still important and requires extensive efforts to train and develop accurate classic Arabic speech recognition models.

A speech recognition model converts audio signals to text using either a traditional approach or an end-to-end Deep Neural Network (DNN)-based approach [21]. Traditional speech recognition depends on phonetics and pronunciation dictionaries to convert speech to text [22], [23]. It consists of three parts: (i) an acoustic model, (ii) a pronunciation dictionary, and (iii) a language model. The Hidden Markov Model (HMM) is the most used acoustic model in the traditional approach. However, the traditional approach has limitations, such as its phonetic dependency. Therefore, the end-to-end DNN-based speech recognition approach was proposed [24]. End-to-end speech recognition models can recognize speech using a DNN without the need for a predefined pronunciation dictionary. End-to-end speech recognition consists of an encoder, a decoder, and an alignment method, such as Connectionist Temporal Classification (CTC). Few classic Arabic speech recognition models have been developed using traditional [20], [25] and end-to-end [3], [14], [17], [19], [26] speech recognition models.

Additionally, as there is no standard classic Arabic pronunciation dictionary, using an end-to-end speech recognition approach is preferable. Thus, this effort aims to recognize classic Arabic speech using DNNs following the end-to-end approach to convert audio signals to diacritized text. A dataset of more than 100 hours of classic Arabic speech with its diacritized transcripts was used to train the proposed models. Three DNN-based models were implemented and compared: (i) Time Delay Neural Network (TDNN)-CTC-based, (ii) Recurrent Neural Network (RNN)-CTC-based, and (iii) transformer-based.

This work converts classic Arabic speech input to diacritized text using the end-to-end speech recognition approach. The main contribution of this work is recognizing diacritized continuous classic Arabic speech using three DNN models that, to the best of our knowledge, have not been used with a large classic Arabic dataset. This effort also investigates the best-performing model's performance, which reaches that of state-of-the-art models fine-tuned on Arabic datasets.

The rest of the paper is structured as follows: Section 2 discusses the classic Arabic speech recognition related works. Section 3 illustrates the work's methodology. Section 4 presents the experiments' details. Sections 5 and 6 discuss the results and the findings, and Section 7 concludes the study.

II. RELATED WORK
Humans verbally communicate through sounds. Recognizing sounds and speech is the key to understanding others' sayings [21]. Therefore, speech recognition technology uses computational power to recognize human speech by converting it to a machine-readable format that can be converted to text to perform specific actions. Speech recognition is applied in different fields and applications. Traditional speech recognition consists of three independent components: (i) an acoustic model, (ii) a pronunciation dictionary, and (iii) a language model [21]. Statistical models, such as the HMM and the Gaussian Mixture Model (GMM), are used as acoustic models in traditional speech recognition systems. However, traditional speech recognition has several limitations, e.g., the requirement for a predefined pronunciation dictionary. Therefore, DNNs have been developed to recognize audio signals directly into text without a predefined pronunciation lexicon, forming an end-to-end speech recognition system. Moreover, speech recognition systems can recognize (i) letters, (ii) isolated words, such as digits, commands, or single words, or (iii) continuous speech. Traditional and end-to-end speech recognition methods have been used with the classic Arabic language. The following subsections discuss the few classic Arabic speech recognition efforts, the gaps, and emerging speech recognition models.

A. TRADITIONAL CLASSIC ARABIC SPEECH RECOGNITION
Given the importance of diacritics in classic Arabic, the diacritics' effect on recognizing classic Arabic speech was studied in [20]. Eight traditional speech recognition models, namely (i) GMM-SI, (ii) GMM SAT, (iii) GMM MPE, (iv) GMM MMI, (v) SGMM, (vi) SGMM-bMMI, (vii) DNN, and (viii) DNN-MPE, were trained with a 23-hour continuous speech dataset containing 4754 sentences. The authors used two versions of the same dataset: a diacritized dataset (supporting six diacritics only) and a non-diacritized dataset. The DNN-MPE model reported the lowest Word Error Rates (WERs) of 4.68% (without diacritics) and 5.53% (with diacritics). Even though the WER with diacritics increases by about 1%, recognizing diacritics in classic Arabic speech is still important and should be further improved.

On the other hand, other researchers used parts of the traditional speech recognition pipeline to convert graphemes to phonemes of diacritized classic Arabic words [25]. The joint multigram model was used to predict phonemes of classic Arabic words and recorded a 42.5% WER. Although dealing with Arabic diacritics is challenging, interested researchers continued their efforts using advanced (end-to-end) methods, which are explained in the following section.
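The encoder-plus-CTC structure described above can be sketched in a few lines. The following is a minimal illustration using PyTorch's `torch.nn.CTCLoss`; all sizes (63 classes, frame counts, target lengths) are illustrative assumptions, not the models' actual settings:

```python
import torch
import torch.nn as nn

# Illustrative sizes: 63 classes = 62 characters + 1 CTC blank (index 0).
num_classes, T, N = 63, 50, 2   # classes, encoder time frames, batch size

# Stand-in for an encoder output: per-frame log-probabilities over classes.
log_probs = torch.randn(T, N, num_classes, requires_grad=True).log_softmax(2)

# Unaligned diacritized character-index targets with their true lengths.
targets = torch.randint(1, num_classes, (N, 30))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([30, 25])

# CTC marginalizes over all frame-to-character alignments, so no
# predefined pronunciation dictionary or pre-alignment is needed.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back into the encoder parameters
```

This is exactly the dictionary-free training signal that distinguishes end-to-end models from the traditional acoustic-model/lexicon pipeline.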
The dataset needed some preprocessing steps before being used to train and test the developed models, such as transforming the data into a machine-readable format. Two data types are input to our classic Arabic speech recognition models: (i) audio data and (ii) textual data (transcripts). Those raw data need to be processed into a machine-readable format. The audio signals will be converted to Mel-spectrograms. Spectrograms digitally visualize audio signals' time, frequency, and amplitude using the Short-Time Fourier Transform (STFT), which combines several Fast Fourier Transforms (FFTs) of overlapped audio segments over time [32]. The FFT is a Fourier Transform algorithm that converts a non-periodic segmented signal from the time domain into the frequency domain. Non-periodic signals represent real-life non-stationary signals, such as audio signals.

Moreover, the Mel scale is applied to the spectrograms, generating Mel-spectrograms that mimic how the human hearing system detects different frequencies. Thus, audio signals will be converted to Mel-spectrograms before being input into the speech recognition models. Fig. 2 illustrates a sample of audio signals converted to a Mel-spectrogram.

Furthermore, as we use character-based speech recognition models, each character in Arabic will be treated as a class. The characters in our case are the Arabic letters, diacritics, other symbols that affect letters' pronunciation, and a space character. Each character will be vectorized with a specific index. Thus, character sequences in any verse, i.e., transcript, will be converted to a sequence of indices, as exemplified in Fig. 3.

FIGURE 3. From character sequences in a word to indices.

C. SPEECH RECOGNITION MODELS
This section discusses the different speech recognition models used to recognize classic Arabic speech. The end-to-end speech recognition models consist of an encoder, an alignment technique, and a decoder. We applied three character-based speech recognition models with different DNN architectures in the encoder to find the best-performing model for our data. The TDNN-based, RNN-based, and transformer-based speech recognition models were trained with the diacritized classic Arabic speech. A greedy decoder was applied in each model. The greedy decoder decodes the aligned encoder's output, i.e., a sequence of characters' indices with high probabilities, to text, i.e., a sequence of characters [33]. The greedy decoder was implemented to support outputs with different character combinations and to avoid the dependency on language vocabularies that comes with the beam search decoder. Fig. 4 overviews the steps followed to recognize classic Arabic speech. The following subsections discuss the encoders' structures of the implemented speech recognition models.

FIGURE 4. Overview of the classic Arabic speech recognition workflow.

1) Time Delay Neural Network Speech Recognition Model with Connectionist Temporal Classification
TDNNs have been widely used as acoustic models in traditional and hybrid speech recognition systems [34]. However, with the developments in end-to-end speech recognition models, combinations of CNN and TDNN have been proposed and reported state-of-the-art results [35]. Jasper (Just Another SPEech Recognizer) is a TDNN-CTC end-to-end speech recognition model developed by NVIDIA in 2019 [35]. Jasper uses CTC loss and has a block architecture consisting of blocks and convolutional sub-blocks; each sub-block contains four layers. Blocks in Jasper are connected using residual connections. Data in neural networks with residual connections flow along different paths and may skip some layers to reach the last layer, instead of the one-path sequential data flow in feedforward neural networks [36]. The residual connection in Jasper launches a
1x1 convolution, then a batch normalization layer whose output is added to the output of the batch normalization of the last sub-block. The summation is then passed to the activation function and dropout layers to produce the block's output.

Jasper achieved state-of-the-art results on English speech datasets. However, Jasper has high computational power and memory requirements due to its many parameters, i.e., over 200 million. Thus, a smaller speech recognition model called QuartzNet was proposed based on the Jasper architecture with fewer parameters and lower computational power requirements [37]. QuartzNet implements depthwise separable convolutions by replacing Jasper's 1D convolutions with 1D time-channel separable convolutions.

Depthwise separable convolutions deal with the spatial (height and width) and depth (channels) dimensions [37]. They speed up the network and reduce its complexity by splitting the kernel into two smaller kernels: (i) a depthwise convolution and (ii) a pointwise convolution. The depthwise convolution is applied individually to each channel across a number of time frames (time steps), while the pointwise convolution operates independently on each time frame across all channels. The components of the used QuartzNet are illustrated in Fig. 5.

This work applied QuartzNet as a TDNN-CTC speech recognition model to recognize classic Arabic speech. QuartzNet has never been used with classic Arabic speech.

2) Recurrent Neural Network Speech Recognition Model with Connectionist Temporal Classification
RNNs have been used in end-to-end speech recognition models to transform audio spectrograms into text transcriptions, such as Deep Speech [38]. The Deep Speech model consists of 5 layers of hidden units; the first three layers and the last layer are non-recurrent, while the fourth layer is an RNN with forward and backward passes. Moreover, the Deep Speech model uses CTC loss to align the encoder's output to character sequences. However, speech recognition models with a single recurrent layer in the encoder cannot deal with large and continuous speech datasets, which limits their capabilities [39]. Therefore, an updated version, Deep Speech 2, was proposed with multiple CNN and RNN layers. Deep Speech 2 with one CNN layer, 5 GRU layers, and one fully connected layer achieved the lowest WER compared to the other proposed CNN and RNN layer combinations. Therefore, in this work, we constructed our RNN-CTC speech recognition model based on the enhanced Deep Speech 2 architecture proposed in [40], which outperformed the recognition performance of the original Deep Speech 2.

The architecture of our RNN-CTC speech recognition model consists of 4 CNN layers (1 traditional CNN and 3 ResidualCNN), 5 Bidirectional Gated Recurrent Unit (BiGRU) layers, a fully connected layer, and a linear classification layer, as presented in Fig. 6. The audio features are extracted using those CNN layers, whereas the predictions of each frame, considering the previous frames, are performed in the BiGRU layers. The GRU is an RNN variant that uses fewer computational resources than the LSTM.
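The overall CNN-plus-BiGRU encoder shape can be sketched as follows. This is a simplified PyTorch sketch, not the exact layer configuration of Fig. 6: the single convolution, hidden size, and kernel settings are assumptions made for brevity:

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Simplified Deep Speech 2-style encoder: a CNN feature extractor
    followed by bidirectional GRU layers and a per-frame classifier."""
    def __init__(self, n_mels=128, n_classes=63, hidden=256, n_gru=5):
        super().__init__()
        # 2D convolution over (mel, time) treats the spectrogram as an image.
        self.cnn = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        feat = 32 * (n_mels // 2)
        self.gru = nn.GRU(feat, hidden, num_layers=n_gru,
                          bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel):            # mel: (batch, 1, n_mels, time)
        x = torch.relu(self.cnn(mel))  # (batch, 32, n_mels/2, time/2)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.gru(x)             # BiGRU sees past and future frames
        return self.classifier(x)      # per-frame class scores for CTC

model = SpeechEncoder()
scores = model(torch.randn(2, 1, 128, 200))  # batch of 2 Mel-spectrograms
print(scores.shape)                          # torch.Size([2, 100, 63])
```

The per-frame scores feed a CTC loss during training and a greedy decoder at inference, matching the character-by-character recognition behaviour described above.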
3) Transformer Speech Recognition Model
Transformers were first proposed to enhance machine
translation using attention mechanisms [41]. Attention
mechanisms are applied in sequence-to-sequence modelling
to allow modelling dependencies regardless of the distances
of the input or output sequences. Attention layers have been
implemented in RNN models [42]. However, the
transformers, recurrent-free models based solely on attention
layers to map input and output dependencies, were proposed
and reported state-of-the-art results using lower resources
and computation power compared to RNN-based models
[41]. Recently, transformers have been applied in speech
recognition and recorded competitive performance
compared with other sequence-to-sequence models [43],
[44]. The main advantages of transformers are their (i) fast learning ability with low memory usage compared with RNN models and (ii) capability to capture long dependencies. However, transformers require large amounts of data to obtain good results.
In speech recognition, CNN layers are added to the
transformer architecture to extract the audio features [44]. The
flattened audio feature vector and its d-dimensional positional
encoding will be the input to the transformer's encoder. In contrast, the characters encoding that converts the output
FIGURE 5. The QuartzNet model architecture [37].
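A hedged sketch of such a speech transformer using PyTorch's built-in `nn.Transformer`: the CNN front-end, head count, layer counts, and dimensions are assumptions for illustration, and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

# Illustrative dimensions for a speech-transformer sketch.
d_model, n_classes = 256, 64   # 62 characters + '<' and '>' markers

# CNN front-end projects Mel-spectrogram frames into d_model features.
frontend = nn.Conv1d(128, d_model, kernel_size=3, stride=2, padding=1)
char_embed = nn.Embedding(n_classes, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=4, num_decoder_layers=4,
                             batch_first=True)
classifier = nn.Linear(d_model, n_classes)

mel = torch.randn(2, 128, 200)               # (batch, n_mels, time)
tgt = torch.randint(0, n_classes, (2, 20))   # shifted character indices

src = frontend(mel).transpose(1, 2)          # (batch, 100, d_model)
# Causal mask: each output character attends only to earlier characters.
mask = transformer.generate_square_subsequent_mask(tgt.size(1))
out = transformer(src, char_embed(tgt), tgt_mask=mask)
logits = classifier(out)                     # (batch, 20, n_classes)
```

Because the decoder conditions on the whole preceding character context, decoding proceeds sentence-by-sentence rather than the frame-by-frame CTC style.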
D. EVALUATION METHOD
WER was measured to evaluate the speech recognition
model's performance. WER is a metric derived from
Levenshtein distance to measure the accuracy of a speech
recognition model [45]. WER is calculated based on the
number of deleted (D), inserted (I), and substituted (S) words
that appear in the recognized text, as shown in (1):

WER = (D + S + I) / N    (1)

where N is the number of words in the target (reference)
text. WER measures the speech recognition model's
performance in terms of word recognition considering words'
deletion, insertion, and substitution. The words' accuracy for
each speech recognition model was calculated by subtracting
the WER value from 1. Moreover, the Character Error Rate
(CER) was calculated for the best-performed speech
recognition model, i.e., the model with the lowest WER, to
measure the character recognition considering characters'
deletion, insertion, and substitution. The characters' accuracy
was also calculated for the best-performed speech recognition model by subtracting the CER value from 1. Additionally, a similarity score for each recognized verse was calculated by finding the longest contiguous matching sub-sequence

FIGURE 6. The architecture of the Recurrent Neural Network speech recognition model with Connectionist Temporal Classification, where N represents the number of Bidirectional Gated Recurrent Unit layers.
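Equation (1) and its character-level variant can be computed with a small self-contained sketch; `difflib.SequenceMatcher` is used here as one plausible way to score a longest-matching-subsequence similarity (the paper does not name its exact implementation):

```python
from difflib import SequenceMatcher

def edit_distance(ref, hyp):
    """Levenshtein distance counting deletions, insertions, substitutions."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(reference, hypothesis):
    """WER = (D + S + I) / N over words, as in Equation (1)."""
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

def cer(reference, hypothesis):
    """Same formula applied over characters instead of words."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

ref = "the quick brown fox"
hyp = "the quick brown box"
print(wer(ref, hyp))                            # 0.25 (1 substitution / 4 words)
print(cer(ref, hyp))                            # ~0.053 (1 character / 19)
print(SequenceMatcher(None, ref, hyp).ratio())  # similarity score in [0, 1]
```

Word accuracy is then 1 - WER and character accuracy is 1 - CER, as described above.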
1) Data Splitting
Data splitting is a technique that splits the data into two or three subsets to avoid overfitting the model on the data. The model is trained on one subset and tested on another subset it has not seen before. Our dataset was split into three sets: (i) training, (ii) validation, and (iii) testing. To ensure that the sounds in the validation and testing sets differ from those in the training set, we randomly selected some reciters to exist only in the validation and testing sets, i.e., the training set did not contain any recording of those selected reciters. Based on this reciter separation, the splitting ratios were approximately 79%, 10%, and 11% for the training, validation, and testing sets, respectively.

2) Transforming Audio Files to Mel-Spectrograms
Each audio file was transformed from its waveform (time-series signal) to a Mel-spectrogram using the torchaudio.transforms library in Python. The torchaudio.transforms library helps transform raw audio files into other representations, such as Mel-spectrograms. Each raw .wav audio file was transformed into a Mel-spectrogram tensor with the default 128 Mel filterbanks. Mel filterbanks mimic the filterbanks in human ears. Then, normalization was applied to eliminate nulls in the Mel-spectrogram tensors.

3) Characters Vectorization: Mapping Arabic Characters to Numerical Indices
The Arabic characters, including letters, diacritics, symbols, and spaces, have been mapped to numerical indices, i.e., character embedding. After removing symbols that do not affect letters' pronunciation, such as mandatory and optional recitation stop symbols, we were left with a total of 62 characters. Thus, as our speech recognition models are character-based, those 62 characters represent the 62 classes our models deal with when recognizing speech. Fig. 8 illustrates the mapping of those 62 characters. Therefore, every textual sentence (sequence of characters) was converted to a vector (sequence of numerical indices).

FIGURE 8. Arabic characters and their mapped indices.

B. SPEECH RECOGNITION MODELS
This section illustrates the experimental details of each speech recognition model trained and tested to recognize classic Arabic speech. This work implemented three speech recognition models: (i) TDNN-CTC, (ii) RNN-CTC, and (iii) transformer. Each model's details are explained in the following subsections.

1) Time Delay Neural Network Speech Recognition Model with Connectionist Temporal Classification
Our TDNN model consisted of an encoder, a decoder, and the CTC loss. We used the default QuartzNet [37] encoder architecture provided by NeMo [46] with two fixed blocks (the first and last blocks) and 15 repeated blocks; each block contained five sub-blocks. The encoder in the QuartzNet model is based on the Jasper model [35] and consists of seven Jasper layers (6 residual layers and one traditional layer). In contrast, the decoder is a linear classification layer that converts the encoder's output to 63 classes' probabilities (62 characters and one blank character for the CTC loss). The CTC loss is then calculated, and the highest-probability class is mapped to its character representation using a greedy decoder. Tab. 2 summarizes the TDNN model. Our TDNN model used the Novograd optimization method. Novograd is an adaptive layer-wise stochastic optimization method that normalizes gradients and decouples weight decay per layer [47].

TABLE II
TIME DELAY NEURAL NETWORK SPEECH RECOGNITION MODEL SUMMARY
Name       Type               Parameters
Encoder    QuartzNet encoder  1.2 M
Decoder    QuartzNet decoder  64.6 K
Loss       CTC loss           0
Optimizer  Novograd           0
Trainable parameters       1.2 M
Non-trainable parameters   0
Total parameters           1.2 M
M = Million, K = Thousand
linear classification layer that converts the encoder's output to 63 classes' probabilities (62 characters and one blank character for the CTC loss). The CTC loss is then calculated, and the highest-probability class is mapped to its character representation using a greedy decoder. Tab. 3 summarizes the RNN model. Our RNN model used the AdamW optimization method. AdamW is a stochastic optimization method modified from Adam that decouples the weight decay [48].

TABLE III
RECURRENT NEURAL NETWORK SPEECH RECOGNITION MODEL SUMMARY
Name       Type                Parameters
Encoder    RNN encoder         23.7 M
Decoder    Sequential decoder  0
Loss       CTC loss            0
Optimizer  AdamW               0
Trainable parameters       23.7 M
Non-trainable parameters   0
Total parameters           23.7 M
M = Million, K = Thousand

3) Transformer Speech Recognition Model
Our transformer model consisted of an encoder and a decoder. We used the speech transformer architecture proposed in [44] with different numbers of encoder and decoder layers to explore the transformer's performance. Audio feature embedding and character embedding were applied to the dataset's audio and textual verses before being fed to the encoder and decoder. The audio feature embedding method extracts audio features from Mel-spectrograms, whereas the character embedding method maps each character to a numerical representation.

The transformer's encoder consisted of two sub-layers: (i) multi-head attention and (ii) feedforward network layers. The transformer's decoder consisted of three sub-layers: (i) masked multi-head attention, (ii) multi-head attention, and (iii) feedforward network layers. The decoder's output is then fed to a linear classification layer with a softmax activation function that converts the transformer's decoder output to 64 classes' probabilities (62 characters and two special characters, < and >, to notify the model of the beginning and ending of each verse). A categorical cross-entropy loss was calculated instead of the CTC loss to map the highest-probability class to its character representation using a greedy decoder. Tab. 4 summarizes the transformer model. Our transformer models used the Adam optimization method. Adam is a modified stochastic optimization method based on adaptive estimation of the first- and second-order moments [49].

TABLE IV
TRANSFORMER SPEECH RECOGNITION MODEL SUMMARY
Trainable parameters       3.9 M
Non-trainable parameters   0
Total parameters           3.9 M
M = Million, K = Thousand

Tab. 5 presents the experimental details of each speech recognition model trained and validated with the same training and validation set sizes specified in Tab. 1. Due to the limited storage and power resources, different parameter settings were applied to each model. For instance, the batch size differs from one model to another based on the limited memory storage. Moreover, the learning rates were chosen after being tuned for each model in several experiments. The TDNN-CTC model took the longest, with approximately two days of training, as illustrated in Tab. 6.

On the other hand, we explored different numbers of transformer encoder and decoder layers. Based on [44], the transformer with six encoder and six decoder layers reported the best results. However, we also trained a transformer with four encoder and four decoder layers and one with four encoder layers and a single decoder layer, inspired by [50], to explore the effect of the number of layers.

Furthermore, the early stopping technique was applied to the RNN-CTC and transformer models. Early stopping is an optimization technique that ends model training early when the model's performance stops improving on the validation data [51]. Without early stopping, the model will overfit the training data. Fig. 9 illustrates the points where early stopping should be applied to the best-performing speech recognition model, as the model seemed to overfit the training data. Early stopping should be applied when the model's learning rate and training loss decrease after a peak and the validation loss increases. The red circles in Fig. 9 identify the points where early stopping should be performed to avoid overfitting.

TABLE V
THE EXPERIMENTAL DETAILS OF THE SPEECH RECOGNITION MODELS
Model                                 Epochs  Batch Size  Learning Rate  Optimizer
TDNN-CTC                              100     32          0.01           Novograd
RNN-CTC                               100     8           0.0005         AdamW
Transformer (4 encoders, 4 decoders)  100     64          0.00001        Adam
Transformer (6 encoders, 6 decoders)  100     64          0.00001        Adam
Transformer (4 encoders, 1 decoder)   100     64          0.00001        Adam
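Early stopping is commonly implemented with a patience counter on the validation loss; the paper does not specify its exact criterion, so the following is a sketch under that assumption:

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss   # improvement: remember it, reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1   # no improvement on validation data
        return self.bad_epochs >= self.patience   # True -> stop training

# Simulated validation losses: the model improves, then starts overfitting.
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([1.0, 0.8, 0.7, 0.72, 0.75, 0.80, 0.85]):
    if stopper.step(loss):
        print(f"early stop at epoch {epoch}")   # epoch 5
        break
```

Stopping at the divergence point keeps the checkpoint from the epoch where validation loss was lowest, which is what the circled points in Fig. 9 indicate.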
On the other side, the most recognized speaker was Ibrahim Alakhdar, with only 190 out of 1692 of his recordings were

FIGURE 13. A sample of the recognized verses recited by ordinary people.
developments. This work's outcomes also enhance existing smart classic Arabic speech recognition solutions by recognizing diacritics. We believe this effort will open new opportunities in classic Arabic speech recognition. Moreover, the trained models could be retrained via transfer learning to build Arabic speech recognition solutions for other fields, such as education. The main contribution of this work was training and comparing the performance of three DNN models on diacritized classic Arabic speech. We encourage interested researchers to contribute to developing smart Arabic solutions, and we plan to continue improving the recognition performance in the near future.

ACKNOWLEDGMENT
The authors gratefully acknowledge Qassim University, represented by the Deanship of Scientific Research, for the financial support of this research under project number (COC-2022-1-2-J-30493) during the academic year 1444 AH / 2022 AD.