
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 203 (2022) 733–738
www.elsevier.com/locate/procedia

International Workshop on Modeling and Analysis of Complex Networks with Applications (MACNA)
August 9-11, 2022, Niagara Falls, Canada
Arabic Named Entity Recognition in Arabic Tweets Using BERT-based Models
Brahim Ait Benali a,*, Soukaina Mihi a, Nabil Laachfoubi a, Addi Ait Mlouk b

a Faculty of Sciences and Techniques, IR2M Laboratory, Hassan First University of Settat, Morocco
b Department of Information Technology, Division of Scientific Computing, Uppsala University, Sweden

Abstract

With the large amount of unstructured data being broadcast every day, building powerful methods enabling information retrieval and extraction becomes necessary. However, named entity recognition, which classifies text into predefined labels, is a difficult task, further challenged by the Arabic language's particular characteristics and complex nature. This work trains six BERT-based models (Bidirectional Encoder Representations from Transformers) and uses a BiLSTM-CRF architecture for the NER task on dialectal Arabic. Our fine-tuning approach yields new state-of-the-art results on publicly available dialectal Arabic social media datasets.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Named entity recognition; BERT; Transfer learning; Conditional random field; BiLSTM; Dialectal Arabic; Natural language processing

1. Introduction
Recent years have seen a tremendous revolution in developing a wide range of bidirectional transformer-based models, especially for natural languages such as Arabic, to achieve state-of-the-art results on Arabic NER.

* Corresponding author. Tel.: +0-000-000-0000; fax: +0-000-000-0000.
E-mail address: b.aitbenali@uhp.ac.ma
1877-0509 © 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the Conference Program Chairs.
10.1016/j.procs.2022.07.109

These models are powerful transfer learning methods for improving natural language processing (NLP) tasks such as text classification, question answering, and named entity recognition (NER).

Named Entity Recognition (NER) involves identifying portions of text that mention named entities (NEs) and categorizing them into predetermined classes, such as persons (PERS), organizations (ORG), and places (LOC) [1]. NER is not a trivial task: categorization is very sensitive to the text's semantics and context. In this work, the following BERT-based models are used in the experiments.

• MARBERT and ArBERT: released by [2], these two models are built on the BERT architecture, except that MARBERT drops the next sentence prediction (NSP) objective because it is trained on tweets, which are typically short. ArBERT was trained on Arabic datasets, mainly books and articles written in Modern Standard Arabic (MSA), totaling 6.5 billion tokens. In contrast, MARBERT was trained on dialectal Arabic (DA) and MSA tweets totaling 15.6 billion tokens.
• AraBERTv02 [3]: trained with the BERT-Base configuration (12 layers, 12 attention heads, 768 hidden units per layer, 110M parameters) on an Arabic corpus consisting mainly of newspaper articles and Internet texts (8.6 billion tokens).
• QARiB† (QCRI Arabic and Dialectal BERT): trained on a collection of Arabic tweets and MSA sentences totaling 14 billion tokens.
• Arabic BERT [4]: trained on approximately 8.2 billion words of both MSA and dialectal Arabic.
• mBERT (BERT multilingual base model, cased): pre-trained on the full Wikipedia of 104 languages, including Arabic [5]. A sketch of how such checkpoints can be loaded is given after this list.
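For reproducibility, the six checkpoints above can be loaded through the Hugging Face transformers library. The identifiers in the following sketch are the Hub names we believe correspond to the cited models; they are assumptions to be verified against the respective repositories, not identifiers given in the original papers.

```python
# Sketch: candidate Hugging Face Hub identifiers for the six BERT variants.
# These names are assumptions and should be checked against each model's repository.
from transformers import AutoModel, AutoTokenizer

CANDIDATE_CHECKPOINTS = {
    "MARBERT": "UBC-NLP/MARBERT",
    "ArBERT": "UBC-NLP/ARBERT",
    "AraBERTv02": "aubmindlab/bert-base-arabertv02",
    "QARiB": "qarib/bert-base-qarib",
    "Arabic BERT": "asafaya/bert-base-arabic",
    "mBERT": "bert-base-multilingual-cased",
}

def load_encoder(name: str):
    """Load a tokenizer/encoder pair for one of the candidate checkpoints."""
    checkpoint = CANDIDATE_CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    return tokenizer, model
```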
In this paper, we use a BERT transformer to improve Arabic NER by employing BERT as an embedding feature. The BERT model is combined with a BiLSTM-CRF architecture. As a result, the approach achieves higher performance than state-of-the-art Arabic NER systems based on Bi-LSTM with word and character embeddings.

2. Proposed Approach

The overall architecture of our proposed approach is shown in Figure 1; it is based on a BERT embedding layer, a bidirectional long short-term memory (BiLSTM) layer, and a conditional random field (CRF) layer. The working mechanism of the proposed approach is explained in the following subsections.

2.1. Input representation layer: the BERT contextual language model

BERT was first introduced as a contextual language model by [5] in 2019, using a multi-layer bidirectional transformer encoder that considers both left and right contexts. BERT is pre-trained on two unsupervised tasks: masked language modeling and Next Sentence Prediction (NSP). In the masked language model, 15% of the tokens in the input sequence are randomly masked, and the model learns to predict the masked tokens from their bidirectional context; the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary. Next Sentence Prediction, in turn, aims to capture the relationship between two sentences. For each input sequence, the first token is a special classification token [CLS], and this token's corresponding final hidden state is taken as a representation of the whole sequence. BERT is particularly well suited to languages with a very rich morphological system, like Arabic. Hence, we use a pre-trained BERT model as the input context encoding layer of our proposed ANER system.
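As an illustration of how BERT can serve as the input encoding layer, the following minimal sketch extracts contextual token embeddings from a pre-trained Arabic BERT checkpoint with the Hugging Face transformers library; the checkpoint name and the use of the last hidden layer are illustrative assumptions, not details fixed by the paper.

```python
# Minimal sketch: contextual token embeddings from a pre-trained BERT encoder.
# Assumes the Hugging Face transformers library; the checkpoint name is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "UBC-NLP/MARBERT"  # assumed Hub identifier for MARBERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

sentence = "مرحبا بكم في المغرب"  # example Arabic sentence
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = encoder(**inputs)

# (batch, seq_len, 768) contextual embeddings to be fed to the BiLSTM-CRF layers
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```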

† https://github.com/qcri/QARiB

2.2. BiLSTM

Recurrent neural networks (RNNs) are widely used for many NLP tasks, and long short-term memory (LSTM) [6] is a variant of RNNs. LSTM overcomes the gradient explosion and vanishing problems that arise when RNNs model long-term dependencies. BiLSTM [7] provides contextual information from the input sequence through two hidden LSTM layers, one running forward and one backward. The context vector is generated by concatenating the two hidden LSTM states: $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$. In this work, BiLSTM (Figures 2 and 3) is used to retrieve context-related information.

Fig. 1. BERT-BiLSTM-CRF model architecture

The key LSTM transition functions are:

$$\begin{bmatrix} \tilde{c}_t \\ f_t \\ o_t \\ i_t \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} \left( W^{\top} \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b \right)$$

$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}$$

$$h_t = o_t \odot \tanh(c_t)$$

where $W$ and $b$ are trainable parameters; $\sigma(\cdot)$ is the sigmoid function; $i_t$, $o_t$, and $f_t$ are the input, output, and forget gates, respectively; $\odot$ denotes the element-wise (Hadamard) product; and $x_t$ is the input vector of the current time step.
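As a minimal sketch of the BiLSTM layer described above (hidden size 200 per direction, matching Section 3.3), a bidirectional PyTorch LSTM produces forward and backward hidden states whose concatenation forms the context vector $h_t$; the tensor shapes are illustrative.

```python
# Sketch: a bidirectional LSTM over BERT embeddings, producing h_t = [forward ; backward].
import torch
import torch.nn as nn

bert_dim, lstm_dim = 768, 200  # BERT hidden size, LSTM hidden size per direction

bilstm = nn.LSTM(input_size=bert_dim, hidden_size=lstm_dim,
                 num_layers=1, batch_first=True, bidirectional=True)

token_embeddings = torch.randn(16, 128, bert_dim)  # (batch, seq_len, bert_dim)
context, _ = bilstm(token_embeddings)

# Each time step is the concatenation of forward and backward states: 2 * lstm_dim
print(context.shape)  # torch.Size([16, 128, 400])
```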

Fig. 2. The main cell structure of LSTM
Fig. 3. BiLSTM model



2.3. CRF

A conditional random field (CRF) [8] layer is used in all our models to exploit the dependencies between adjacent tags. Given an input sequence $s = [s_1, s_2, \ldots, s_T]$, let $y = [y_1, y_2, \ldots, y_T]$ be its gold tag sequence and $Y(s)$ the set of all valid tag sequences. The probability of $y$ is computed as:

$$p(y \mid X) = \frac{e^{S(X, y)}}{\sum_{y' \in Y(s)} e^{S(X, y')}}$$

During training, the log-probability of the correct tag sequence is maximized. During decoding, the best tag sequence is the one with the maximum score:

$$y^{*} = \underset{y' \in Y(s)}{\arg\max}\; S(X, y')$$
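The sketch below assembles the three layers into a BERT-BiLSTM-CRF tagger. It assumes the third-party pytorch-crf package (torchcrf) for the CRF log-likelihood and Viterbi decoding; the layer sizes and dropout follow Section 3.3, and the rest is an illustrative reconstruction under those assumptions rather than the authors' exact implementation.

```python
# Sketch: BERT-BiLSTM-CRF tagger; assumes the pytorch-crf package (pip install pytorch-crf).
import torch.nn as nn
from torchcrf import CRF
from transformers import AutoModel

class BertBiLstmCrf(nn.Module):
    def __init__(self, checkpoint: str, num_tags: int,
                 lstm_dim: int = 200, dropout: float = 0.5):
        super().__init__()
        self.bert = AutoModel.from_pretrained(checkpoint)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_dim,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.emissions = nn.Linear(2 * lstm_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        embeddings = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        context, _ = self.bilstm(self.dropout(embeddings))
        scores = self.emissions(context)
        mask = attention_mask.bool()
        if tags is not None:
            # Negative log-likelihood of the gold tag sequence (training)
            return -self.crf(scores, tags, mask=mask, reduction="mean")
        # Viterbi decoding of the best tag sequence (inference)
        return self.crf.decode(scores, mask=mask)
```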

3. Experiment and results

3.1. Evaluation metrics

The experiments in this paper use precision, recall, and F1 score as metrics to evaluate the performance of the model. They are computed as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \times 100\% \qquad \text{Recall} = \frac{TP}{TP + FN} \times 100\%$$

where $TP$ is the number of named entities correctly identified, $FP$ the number of entities incorrectly identified, and $FN$ the number of named entities not identified. F1 is the harmonic mean of precision ($P$) and recall ($R$); the balanced F-score is the most commonly used:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100\%$$
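As a worked illustration of these formulas, the following sketch computes precision, recall, and the balanced F1 score from entity-level counts; the counts themselves are placeholders, not values from the paper.

```python
# Sketch: entity-level precision, recall, and F1 from TP/FP/FN counts (placeholder values).
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 100 * precision, 100 * recall, 100 * f1

# Example with hypothetical counts: 80 correct, 30 spurious, 40 missed entities
print(precision_recall_f1(tp=80, fp=30, fn=40))  # (72.7..., 66.6..., 69.5...)
```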

3.2. Datasets

We evaluate our model on the TWEETS dataset. We use the training/test split provided in [9], where the training set contains 3,646 randomly selected tweets posted between May 3 and May 7, 2012. The tweets were retrieved from Twitter using the query “lang:ar”. The test set consists of 1,423 tweets randomly selected from tweets posted between November 23 and November 27, 2011. The same dataset was used for evaluation by [10]. Both sets follow the ACE tagging guidelines of the Linguistic Data Consortium [11]. Table 1 shows the statistics of the data.

Table 1. Twitter evaluation data statistics.

Dataset          Tokens    PER    LOC    ORG
Twitter Train    55k       788    713    449
Twitter Test     26k       464    587    316

3.3. Experiment Setting

All experiments were performed on the Google Colab platform‡ with a Tesla T4 GPU. The model was implemented using the PyTorch API, and the Adam optimizer was used. The maximum input sequence length is 128, the LSTM hidden dimension is 200, the batch size is 16, and the learning rate is 1e-4. The dropout rate is set to 0.5 to avoid overfitting.
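The hyperparameters above can be summarized in a small configuration sketch; the optimizer call follows standard PyTorch usage, and the placeholder module stands in for the full tagger rather than reproducing the authors' code.

```python
# Sketch of the training configuration reported above (PyTorch, Adam optimizer).
import torch
import torch.nn as nn

CONFIG = {
    "max_seq_length": 128,
    "lstm_hidden_dim": 200,
    "batch_size": 16,
    "learning_rate": 1e-4,
    "dropout": 0.5,
}

model = nn.Linear(8, 8)  # placeholder module standing in for the BERT-BiLSTM-CRF tagger
optimizer = torch.optim.Adam(model.parameters(), lr=CONFIG["learning_rate"])
```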

3.4. Analysis of results

To evaluate the performance of pre-trained BERT models, we conducted six experiments using various publicly available pre-trained BERT models. Table 2 shows the results obtained with each model. Based on these results, MARBERT yields a significant improvement (best results are in bold).

Table 2. Results with different choices of pre-trained BERT-based models.

BERT-based models     F1-Score
mBERT [5]             41.97
QARiB§                45.12
Arabic BERT [4]       62.45
AraBERT [3]           65.70
ArBERT [2]            66.12
MARBERT [2]           67.40

Furthermore, the performance of the BERT-BiLSTM-CRF model (with MARBERT as the embedding) is compared with other available models on the Tweet corpus; Table 3 summarizes the results (best results are in bold).
Table 3. Comparison of our system with other models on the TWEETS dataset.

Systems                    Loc F1    Org F1    Pers F1    Overall Avg. F1
K. Darwish 2014 [10]       76.7      55.6      55.8       65.2
A. Zirikly 2015 [12]       61.0      41.8      68.9       59.5
C. Helwe 2019 [13]         65.3      39.7      61.3       59.2
B. Ait Benali 2020 [14]    72.4      46.7      73.7       65.6
Our system                 77.9      56.5      72.6       67.4

We explored the application of the BERT transformer to improve Arabic NER by employing BERT as an embedding feature. The Bidirectional Encoder Representations from Transformers (BERT) model is combined with the BiLSTM-CRF architecture. A significant performance improvement was obtained with the BiLSTM-CRF network using BERT embeddings, without any character embedding features. As a result, the approach achieves higher performance than state-of-the-art Arabic NER systems based on Bi-LSTM with word and character embeddings. Hence, applying these recent techniques to the Arabic language can be very useful, since annotated resources are scarce while unlabeled textual data is abundant.

‡ https://colab.research.google.com
§ https://github.com/qcri/QARiB

4. Conclusion

This paper obtains contextual word vectors from the BERT model and passes them to a BiLSTM-CRF to build the BERT-BiLSTM-CRF model. When tested on the tweet corpus and evaluated against previous models, the proposed model reaches an F1 of 67.40%, the highest performance. Furthermore, a significant performance improvement was found with the BiLSTM-CRF network using BERT embeddings, without any character embedding features. In future work, we aim to investigate a suitable way of integrating an attention layer to enhance the model's overall performance.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] B. Ait ben ali, S. Mihi, I. El Bazi, and N. Laachfoubi, "A Recent Survey of Arabic Named Entity Recognition on
Social Media," Rev. d'Intelligence Artif., vol. 34, no. 2, pp. 125–135, 2020, doi: https://doi.org/10.18280/ria.340202.
[2] M. Abdul-Mageed, A. R. Elmadany, and E. M. B. Nagoudi, "ARBERT & MARBERT: Deep bidirectional
transformers for Arabic," ACL-IJCNLP 2021 - 59th Annu. Meet. Assoc. Comput. Linguist. 11th Int. Jt. Conf. Nat. Lang.
Process. Proc. Conf., no. i, pp. 7088–7105, 2021, doi: 10.18653/v1/2021.acl-long.551.
[3] W. Antoun, F. Baly, and H. Hajj, "AraBERT: Transformer-based Model for Arabic Language Understanding,"
2020.
[4] A. Safaya, M. Abdullatif, and D. Yuret, "KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech
Identification in Social Media," 14th Int. Work. Semant. Eval. SemEval 2020 - co-located 28th Int. Conf. Comput.
Linguist. COLING 2020, Proc., no. Ml, pp. 2054–2059, 2020, doi: 10.18653/v1/2020.semeval-1.271.
[5] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for
language understanding," NAACL HLT 2019 - 2019 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang.
Technol. - Proc. Conf., vol. 1, no. Mlm, pp. 4171–4186, 2019.
[6] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997,
doi: 10.1162/neco.1997.9.8.1735.
[7] A. Graves and J. Schmidhuber, "Framewise Phoneme Classification with Bidirectional LSTM Networks," Proc. IEEE Int. Joint Conf. on Neural Networks (IJCNN), pp. 2047–2052, 2005.
[8] J. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. Eighteenth Int. Conf. on Machine Learning (ICML), pp. 282–289, 2001.
[9] K. Darwish, "Named entity recognition using cross-lingual resources: Arabic as an example," ACL 2013 - 51st
Annu. Meet. Assoc. Comput. Linguist. Proc. Conf., vol. 1, pp. 1558–1567, 2013.
[10] K. Darwish and W. Gao, "Simple effective microblog named entity recognition: Arabic as an example," Proc. 9th
Int. Conf. Lang. Resour. Eval. Lr. 2014, no. c, pp. 2513–2517, 2014.
[11] N. Alshammari and S. Alanazi, "An Arabic dataset for disease named entity recognition with multi-annotation
schemes," Data, vol. 5, no. 3, pp. 1–8, 2020, doi: 10.3390/data5030060.
[12] A. Zirikly and M. Diab, "Named Entity Recognition for Arabic Social Media," Assoc. Comput. Linguist., pp. 176–
185, 2015, doi: 10.3115/v1/w15-1524.
[13] C. Helwe and S. Elbassuoni, "Arabic named entity recognition via deep co-learning," Artif. Intell. Rev., vol. 52, no.
1, pp. 197–215, 2019, doi: 10.1007/s10462-019-09688-6.
[14] B. Ait Ben Ali, S. Mihi, I. El Bazi, and N. Laachfoubi, "Arabic named entity recognition based on tree-based pipeline optimization tool," J. Theor. Appl. Inf. Technol., vol. 98, no. 15, pp. 2963–2976, 2020.
