2 Methodology

2.1 Model Selection

1. Selection Criteria:

• Choose the MLMs based on popularity, availability, and pre-training on language-specific datasets.

… Chinese language. It applies random input masking to word pieces independently, following the BERT paper. Developed by HuggingFace, it is intended for masked language modeling tasks in Chinese. While its capabilities are aligned with BERT, it is tailored for the Chinese language, making it suitable for various NLP tasks in Chinese text processing.
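As a brief, hedged illustration of the masked-word-prediction task the selected MLMs are used for, the sketch below queries one candidate model through HuggingFace's fill-mask pipeline; the example sentence is an assumption chosen for illustration, not the exact query used in our experiments.

# Minimal sketch: querying a candidate MLM with HuggingFace's fill-mask
# pipeline. bert-base-chinese uses the BERT-style [MASK] placeholder;
# the example sentence is an illustrative assumption.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")
predictions = fill_mask("今天天气很[MASK]。")

for p in predictions:
    # Each candidate comes with the filled token and its probability score.
    print(p["token_str"], round(p["score"], 4))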
3 Experimental Results

In this section, we present the experimental results obtained from evaluating the performance of the selected masked language models (MLMs) in word prediction tasks across English, Chinese, and Hindi languages.

3.1 Performance Metrics

We evaluated the performance of each model based on the following metrics:

• Accuracy: The proportion of correctly predicted masked words. A prediction is counted as correct when row['Actual Masked Word'] == row['Predicted Masked Word'], as illustrated in the sketch after this list.

• Probability score and F1 score: reported alongside accuracy for each model in Table 1.
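To make the accuracy condition above concrete, here is a minimal sketch of how it could be computed with pandas; the DataFrame source and file name are assumptions for illustration, not the original evaluation script.

# Sketch of the accuracy metric, assuming the predictions have been
# collected into a DataFrame with the two columns named in the metric
# definition above. The results file name is a placeholder.
import pandas as pd

results = pd.read_csv("predictions.csv")

# A prediction is correct when the predicted word equals the actual word.
correct = results["Actual Masked Word"] == results["Predicted Masked Word"]

accuracy = correct.mean() * 100  # expressed as a percentage, as in Table 1
print(f"Accuracy: {accuracy:.2f}%")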
Table 1: Model comparison with accuracy, probability score, and F1 score.

Model                          Accuracy (%)   Probability Score   F1 Score
bert-base-uncased              38.78          0.3596              99.51
facebook/roberta-base          38.78          0.3229              78.72
distilroberta-base             42.86          0.2393              99.6
bert-base-chinese              3.70           0.4142              83.28
flax-community/roberta-hindi   1.43           0.3159              78.72

3.2 Analysis

In this subsection, we analyze the performance and characteristics of five different language models:

3.3 bert-base-uncased

• Description: A base model in the BERT (Bidirectional Encoder Representations from Transformers) family, trained on English text using masked language modeling (MLM). It provides competitive performance on various English NLP tasks due to its bidirectional nature and large-scale pretraining.

3.4 facebook/roberta-base

• Description: A base model in the RoBERTa (Robustly Optimized BERT Approach) family, pretrained on English text with enhancements over BERT's training methodology. It achieves similar accuracy to BERT but with more robust training and optimization techniques.

3.5 distilroberta-base

• Description: A distilled version of RoBERTa-base, featuring fewer layers and parameters for faster inference while maintaining competitive performance. Despite its reduced complexity, it achieves slightly higher accuracy than the larger RoBERTa-base on the evaluated tasks.

3.6 bert-base-chinese

• Description: A BERT model pretrained specifically for the Chinese language, using the same masked language modeling (MLM) objective as bert-base-uncased but trained on Chinese text. Despite a lower accuracy score, it demonstrates the model's applicability to non-English languages.

3.7 flax-community/roberta-hindi

• Description: A RoBERTa-based model pretrained on a large corpus of Hindi text, enabling cross-lingual transfer learning for Hindi NLP tasks. While the accuracy is low, it showcases the model's potential for handling languages other than English.
Overall, these models represent a diverse range of languages and architectures, each optimized for specific linguistic contexts. They demonstrate the effectiveness of pretrained language models in various NLP tasks, highlighting the importance of choosing the right model for the specific language and task at hand.
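As a rough, hedged sketch of how such a comparison can be set up, the loop below runs the same kind of fill-mask query through each of the five models discussed above, using each tokenizer's own mask token. The example sentences are illustrative assumptions, and the hub id "roberta-base" is used here for the model referred to as facebook/roberta-base.

# Sketch of a cross-model comparison across the five evaluated MLMs.
# The (prefix, suffix) pairs around the masked position are assumptions.
from transformers import pipeline

test_sentences = {
    "bert-base-uncased": ("The weather today is very ", "."),
    "roberta-base": ("The weather today is very ", "."),
    "distilroberta-base": ("The weather today is very ", "."),
    "bert-base-chinese": ("今天天气很", "。"),
    "flax-community/roberta-hindi": ("आज मौसम बहुत ", " है।"),
}

for model_name, (prefix, suffix) in test_sentences.items():
    fill_mask = pipeline("fill-mask", model=model_name)
    # Build the query with the model's own mask token ([MASK] or <mask>).
    sentence = prefix + fill_mask.tokenizer.mask_token + suffix
    top = fill_mask(sentence)[0]  # highest-probability prediction
    print(f"{model_name}: {top['token_str']!r} (score={top['score']:.4f})")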
4 Conclusion uncased
5 Limitations
In this paper, our focus is primarily on resources readily available within the open-source ecosystem, encompassing both open-source models and the free Google Colab environment. The Google Colab environment offers 15 GB of free disk space and access to T4 GPU resources, which facilitate the computational tasks and model training procedures in this work.
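As a small, hedged illustration of the environment described above, the snippet below reports the GPU and free disk space available in a Colab session; it assumes PyTorch is installed, as it is in the standard Colab images.

# Quick check of the Colab resources mentioned above: GPU type and free disk.
import shutil
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. a T4 on free Colab
else:
    print("No GPU available")

free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk space: {free_gb:.1f} GB")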