
MULTILINGUAL FILL MASK

PRINCE KESHRI, IIIT RAICHUR

Abstract

This research paper investigates the performance of masked language models (MLMs) in word prediction tasks across diverse languages. We utilized five pre-trained MLMs: 'bert-base-uncased', 'facebook/roberta-base', and 'distilroberta-base', trained on English datasets; 'bert-base-chinese', trained on Chinese datasets; and 'flax-community/roberta-hindi', trained on Hindi datasets. Our objective was to evaluate the efficacy of these models in predicting masked words within sentences in their respective languages.

To conduct the experiment, we compiled three distinct datasets, one each for English, Hindi, and Chinese. Each model was evaluated on its ability to accurately predict masked words within the sentences from the corresponding dataset. We employed the mask token technique, masking a single word in each sentence and tasking the models with predicting the masked word.

This study contributes to advancing our understanding of MLMs' effectiveness in multilingual word prediction tasks and offers valuable insights for future research in natural language processing and cross-lingual modeling.

1 Introduction

Masked language models (MLMs) have revolutionized natural language processing (NLP) by excelling in various language understanding tasks. Models like BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly optimized BERT approach) have been trained on vast amounts of text data and fine-tuned for specific NLP tasks.

One key task where MLMs shine is word prediction. In this task, the model is presented with a sentence containing masked words, and it predicts the most probable words to fill in the blanks. This task serves as a fundamental benchmark for evaluating the model's comprehension of language semantics and context.

In this study, we examine the performance of five pre-trained MLMs ('bert-base-uncased', 'facebook/roberta-base', 'distilroberta-base', 'bert-base-chinese', and 'flax-community/roberta-hindi') in word prediction tasks across three languages: English, Chinese, and Hindi. Each model is trained on language-specific datasets, enabling us to evaluate their performance across diverse linguistic contexts. Our research aims to:

1. Assess the accuracy and effectiveness of MLMs in predicting masked words across different languages.

2. Compare the performance of MLMs trained on datasets in different languages.

3. Analyze factors influencing model performance, including dataset characteristics and linguistic complexity.

4. Provide insights into the strengths and limitations of MLMs in multilingual word prediction tasks.

Through this study, we seek to advance our understanding of MLMs' capabilities in multilingual settings and offer valuable insights for researchers and practitioners in the field of NLP and cross-lingual modeling.

2 Methodology

2.1 Model Selection

1. Selection Criteria:

• Choose the MLMs based on popularity, availability, and pre-training on language-specific datasets.

2. Selected Models:

• 'bert-base-uncased'[1]: BERT-base-uncased is a transformer model pretrained on English text using masked language modeling (MLM). It randomly masks 15 percent of input words and predicts them bidirectionally, aiding contextual understanding. The model does not differentiate between upper and lower case. It learns representations of English suitable for tasks like fill-in-the-blank.

• 'facebook/roberta-base'[2]: RoBERTa-base is a case-sensitive transformer model pretrained on English text via masked language modeling (MLM). It masks 15 percent of input words bidirectionally, learning contextual representations. This approach differs from RNNs and autoregressive models like GPT, enhancing bidirectional understanding. The model extracts features for downstream tasks, enabling tasks like sentence classification using its learned English representations.

• 'distilroberta-base'[3]: DistilRoBERTa-base is a case-sensitive transformer-based language model distilled from RoBERTa-base, with 6 layers, 768 dimensions, and 12 heads, totaling 82M parameters. It is designed for masked language modeling and downstream tasks like sequence classification and question answering. With half the parameters of RoBERTa-base, it is twice as fast. It is not intended for generating text; it is aimed at tasks that use the entire sentence context for decision-making.

• 'bert-base-chinese'[4]: 'bert-base-chinese' is a Fill-Mask model pretrained specifically for the Chinese language. It applies random input masking to word pieces independently, following the BERT paper. Developed by HuggingFace, it is intended for masked language modeling tasks in Chinese. While its capabilities are aligned with BERT, it is tailored for the Chinese language, making it suitable for various NLP tasks in Chinese text processing.

• 'flax-community/roberta-hindi'[5]: RoBERTa Hindi is a transformer model pretrained on a large corpus of Hindi data, including the OSCAR, mC4, and IndicGLUE datasets. It employs masked language modeling (MLM) for tasks like sentence completion. Trained on a Google Cloud TPUv3-8, it dynamically masks tokens during pretraining. The model has been evaluated across various NLP tasks and provides accurate results for tasks like summarization, classification, and named entity recognition in Hindi text.

2.2 Dataset Preparation

3. Data Collection:

• Gather datasets containing sentences in English, Chinese, and Hindi.

4. Data Pre-processing:

• Clean the datasets, normalize formatting, and ensure linguistic coherence.

2.3 Experimental Setup

Set up the Google Colab environment by connecting to the T4 GPU runtime.

Word Prediction Task:

• For each model, perform word prediction experiments on the respective language dataset.

• Utilize the mask token technique provided by the Hugging Face Transformers library (a minimal sketch is shown at the end of this section).

Availability:

• All selected models are readily available through the Hugging Face Transformers library, facilitating easy integration into the experimental setup.
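To make the word prediction task concrete, the following is a minimal sketch of one way to apply the mask token technique with the fill-mask pipeline from the Hugging Face Transformers library. The example sentence and the top-k value are illustrative rather than drawn from our evaluation datasets; the model identifiers are those listed in Section 2.1.

```python
# Minimal sketch: querying fill-mask models for a masked word.
# The example sentence is illustrative; in the experiments each sentence
# comes from the language-specific dataset with one word replaced by the
# model's mask token.
from transformers import pipeline

english_models = ["bert-base-uncased", "facebook/roberta-base", "distilroberta-base"]

for model_name in english_models:
    fill_mask = pipeline("fill-mask", model=model_name)
    # Each tokenizer defines its own mask token ([MASK] for BERT, <mask> for RoBERTa).
    mask = fill_mask.tokenizer.mask_token
    sentence = f"The capital of France is {mask}."
    predictions = fill_mask(sentence, top_k=5)  # top-5 candidate fillers with scores
    best = predictions[0]
    print(f"{model_name}: {best['token_str'].strip()!r} (probability {best['score']:.4f})")
```

The Chinese and Hindi models ('bert-base-chinese' and 'flax-community/roberta-hindi') are queried in the same way on sentences from their respective datasets.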

3 Experimental Results

In this section, we present the experimental results obtained from evaluating the performance of the selected masked language models (MLMs) in word prediction tasks across the English, Chinese, and Hindi languages.

3.1 Performance Metrics

We evaluated the performance of each model based on the following metrics:

• Accuracy: the proportion of correctly predicted masked words, where a prediction counts as correct when row['Predicted Masked Word'] == row['Actual Masked Word'] (a minimal sketch of this computation is given at the end of this section).

• Probability Score and F1 Score: reported for each model alongside accuracy in Table 1.

Table 1: Model Comparison with Accuracy, Probability Score, and F1 Score

Model                          Accuracy (%)   Probability Score   F1 Score
bert-base-uncased              38.78          0.3596              99.51
facebook/roberta-base          38.78          0.3229              78.72
distilroberta-base             42.86          0.2393              99.6
bert-base-chinese              3.70           0.4142              83.28
flax-community/roberta-hindi   1.43           0.3159              78.72

3.2 Analysis

In this subsection, we analyze the performance and characteristics of the five language models.

3.3 bert-base-uncased

• Description: A base model in the BERT (Bidirectional Encoder Representations from Transformers) family, trained on English text using masked language modeling (MLM). It provides competitive performance on various English NLP tasks due to its bidirectional nature and large-scale pretraining.

3.4 facebook/roberta-base

• Description: A base model in the RoBERTa (Robustly Optimized BERT Approach) family, pretrained on English text with enhancements over BERT's training methodology. It achieves similar accuracy to BERT but with more robust training and optimization techniques.

3.5 distilroberta-base

• Description: A distilled version of RoBERTa-base, featuring fewer layers and parameters for faster inference while maintaining competitive performance. Despite its reduced complexity, it achieves slightly higher accuracy than the larger RoBERTa-base on the evaluated tasks.

3.6 bert-base-chinese

• Description: A BERT model pretrained specifically for the Chinese language, using the same masked language modeling (MLM) objective as bert-base-uncased but trained on Chinese text. Despite a lower accuracy score, it demonstrates the model's applicability to non-English languages.

3.7 flax-community/roberta-hindi

• Description: A RoBERTa-based model pretrained on a large corpus of Hindi text, enabling cross-lingual transfer learning for Hindi NLP tasks. While its accuracy is low, it showcases the model's potential for handling languages other than English.

Overall, these models represent a diverse range of languages and architectures, each optimized for specific linguistic contexts. They demonstrate the effectiveness of pretrained language models in various NLP tasks, highlighting the importance of choosing the right model for the specific language and task at hand.
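As noted in Section 3.1, the accuracy figures in Table 1 follow an exact-match rule between the predicted and the actual masked word. Below is a minimal sketch of that computation over a results table; the pandas DataFrame and its sample rows are illustrative, with the column names taken from the condition quoted in Section 3.1.

```python
# Minimal sketch of the exact-match accuracy described in Section 3.1.
# The sample rows are illustrative; the real table would hold one row per
# evaluation sentence with the gold word and the model's top prediction.
import pandas as pd

results = pd.DataFrame(
    {
        "Actual Masked Word": ["paris", "dog", "happy"],
        "Predicted Masked Word": ["paris", "cat", "happy"],
    }
)

# A prediction counts as correct when it exactly matches the actual word.
correct_predictions = results["Actual Masked Word"] == results["Predicted Masked Word"]
accuracy = correct_predictions.mean() * 100  # expressed as a percentage, as in Table 1

print(f"Accuracy: {accuracy:.2f}%")  # -> Accuracy: 66.67%
```

In our setup, a results table of this form can be assembled per model from the fill-mask predictions before computing the accuracy reported in Table 1.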

4 Conclusion

The analysis of the five language models demonstrates their effectiveness and versatility in natural language processing (NLP) tasks. Each model, whether it is bert-base-uncased, facebook/roberta-base, distilroberta-base, bert-base-chinese, or flax-community/roberta-hindi, offers unique advantages and capabilities.

The bert-base-uncased and facebook/roberta-base models, both belonging to the BERT family, excel in handling English text with competitive accuracy and probability scores. The distilroberta-base model, in turn, showcases the benefits of model distillation, providing comparable performance with fewer parameters.

The bert-base-chinese model demonstrates the applicability of pretrained models to non-English languages, specifically Chinese, despite its lower accuracy compared to the English models. Similarly, the flax-community/roberta-hindi model highlights the potential for cross-lingual transfer learning in Hindi NLP tasks, albeit with lower accuracy.

In conclusion, our research contributes to advancing our understanding of MLMs' capabilities in multilingual settings and provides valuable insights for researchers and practitioners in the field of natural language processing and cross-lingual modeling.

5 Limitations

In this paper, our focus primarily revolves around leveraging the resources readily available within the open-source ecosystem, encompassing both open-source models and the accessible Google Colab environment. It is worth noting that the Google Colab environment offers a generous allocation of 15 GB of free disk space and grants access to T4 GPU resources, which significantly facilitate computational tasks and model training procedures.

6 References

[1] https://huggingface.co/google-bert/bert-base-uncased

[2] https://huggingface.co/facebook/roberta-base

[3] https://huggingface.co/distilbert/distilroberta-base

[4] https://huggingface.co/google-bert/bert-base-chinese

[5] https://huggingface.co/flax-community/roberta-hindi
