
BARTPHO: PRE-TRAINED SEQUENCE-TO-SEQUENCE MODELS FOR VIETNAMESE

NGUYEN LUONG TRAN, DUONG MINH LE AND DAT QUOC NGUYEN (VINAI RESEARCH, VIETNAM)

INTRODUCTION

Problems

• Pre-trained sequence-to-sequence models (e.g. BART, T5) obtain SOTA performance on generative NLP tasks.
• For Vietnamese, only multilingual pre-trained seq2seq models (mBART [2], mT5, ...) have been used; there is no existing public monolingual seq2seq model for Vietnamese.
• Monolingual models are preferable, as dedicated language-specific models still outperform multilingual ones. Moreover, in Vietnamese, white space is used not only to mark word boundaries but also to separate the syllables that constitute words, for example:
  – Syllable-level text: “Chúng tôi là những nghiên cứu viên” (“We are researchers”)
  – Word-segmented text: “Chúng_tôi (We) là (are) những nghiên_cứu_viên (researchers)”

Contributions

1. Presenting BARTpho with two versions, BARTpho_syllable and BARTpho_word, the first large-scale monolingual seq2seq models pre-trained for Vietnamese.
2. Showing the effectiveness of BARTpho in comparison with mBART on Vietnamese downstream tasks: text summarization, capitalization and punctuation restoration.
3. Publicly releasing our models at: https://github.com/VinAIResearch/BARTpho

SUMMARIZATION TASK

We formulate the summarization task as a monolingual translation problem and fine-tune our BARTpho and the baseline mBART on the Vietnamese single-document summarization dataset VNDS [4]. We find that this dataset contains duplicate articles, so we filter the duplicates and conduct experiments on both the original and the filtered datasets.

Model                 Filtered validation set      Filtered test set
                      R-1     R-2     R-L          R-1     R-2     R-L     Human
mBART                 60.06   28.69   38.85        60.03   28.51   38.74   21/100
BARTpho_syllable      60.29   29.07   39.02        60.41   29.20   39.22   37/100
BARTpho_word          60.55   29.89   39.73        60.51   29.65   39.75   42/100

Model                 Original validation set      Original test set
                      R-1     R-2     R-L          R-1     R-2     R-L
fastAbs [⋆]           _       _       _            54.52   23.01   37.64
viBERT2viBERT [∗]     _       _       _            59.75   27.29   36.79
PhoBERT2PhoBERT [∗]   _       _       _            60.37   29.12   39.44
mT5 [∗]               _       _       _            58.05   26.76   37.38
mBART                 60.39   29.19   39.18        60.35   29.13   39.21
BARTpho_syllable      60.89   29.98   39.59        60.88   29.90   39.64
BARTpho_word          61.10   30.34   40.05        61.14   30.31   40.15

[∗] and [⋆] denote the best-performing models among those experimented with in previous works [3, 4].
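Under this monolingual-translation formulation, producing a summary at inference time is a single encode-and-generate call. The sketch below uses the Hugging Face transformers library; the checkpoint id vinai/bartpho-word, the generation hyper-parameters, and the assumption that the checkpoint has already been fine-tuned on VNDS (article -> abstract) pairs are all illustrative, not values reported on this poster. Note that BARTpho_word expects word-segmented input (e.g. from a Vietnamese word segmenter such as RDRSegmenter), while BARTpho_syllable takes raw syllable-level text.

```python
# Minimal sketch: Vietnamese abstractive summarization as monolingual translation.
# Checkpoint id and generation settings are illustrative assumptions; in practice
# the model would first be fine-tuned on VNDS (article -> abstract) pairs.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "vinai/bartpho-word"  # assumed Hugging Face Hub id of the released model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Word-segmented input article (BARTpho_word operates on word-segmented text).
article = "Chúng_tôi là những nghiên_cứu_viên ..."

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    **inputs,
    num_beams=4,        # beam search, a common choice for summarization
    max_length=256,     # assumed cap on summary length
    early_stopping=True,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```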

CAPITALIZATION AND PUNCTUATION RESTORATION TASK

We follow the sequence-to-sequence approach to evaluate and compare our BARTpho and mBART on the Vietnamese capitalization and punctuation restoration tasks. The dataset used in this experiment was generated automatically from the TED-2020 v1 dataset.

Model                 Capitalization      Punctuation restoration
                                          Comma     Period    Question    Overall
mBART                 91.28               67.26     92.19     85.71       78.71
BARTpho_syllable      91.98               67.95     91.79     88.15       79.09
BARTpho_word          92.41               68.39     92.05     87.82       79.29

BARTPHO PRETRAINING

• Based on BART [1], including two pre-training steps (see the toy sketch below):
  – Corrupting the input text with a noising function
  – Training a seq2seq model to reconstruct the original input text
• Pre-training data: 20GB of Vietnamese texts
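To make the corrupt-and-reconstruct objective concrete, the toy sketch below masks random spans of a Vietnamese sentence in the spirit of BART-style text infilling; the masking rate, span lengths and mask symbol are illustrative assumptions rather than BARTpho's actual noising setup.

```python
import random

# Toy BART-style corruption: replace short random token spans with one <mask> token.
# Masking rate, span lengths and the mask symbol are illustrative assumptions only.
MASK = "<mask>"

def corrupt(tokens, mask_prob=0.3, max_span=3, seed=0):
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            out.append(MASK)               # a whole span is hidden behind a single mask
            i += rng.randint(1, max_span)  # skip 1..max_span original tokens
        else:
            out.append(tokens[i])
            i += 1
    return out

original = "Chúng tôi là những nghiên cứu viên".split()
noisy = corrupt(original)

# Denoising seq2seq training pair:
#   encoder input : the corrupted text  -> " ".join(noisy)
#   decoder target: the original text   -> " ".join(original)
print(" ".join(noisy), "=>", " ".join(original))
```

The pre-trained model is then fine-tuned on downstream seq2seq tasks such as the two reported above.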

REFERENCES

[1] M. Lewis et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: ACL. 2020.
[2] Y. Liu et al. “Multilingual Denoising Pre-training for Neural Machine Translation”. In: Transactions of the ACL 8 (2020).
[3] H. Nguyen et al. “VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization?” In: arXiv preprint arXiv:2110.04257v1 (2021).
[4] V.-H. Nguyen et al. “VNDS: A Vietnamese Dataset for Summarization”. In: NICS. 2019.

BARTPHO ARCHITECTURE

• Using the standard sequence-to-sequence Transformer architecture and employing the GeLU activation function
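For a concrete picture of a standard sequence-to-sequence Transformer with GeLU activations, the sketch below instantiates a randomly initialised BART-style encoder-decoder with the transformers library; every hyper-parameter shown (vocabulary size, hidden size, layer and head counts, FFN width) is an illustrative assumption in the spirit of BART [1], not the released BARTpho configuration.

```python
# Minimal sketch of a standard seq2seq Transformer with GeLU activations, built
# with Hugging Face transformers. All sizes below are illustrative assumptions,
# not the released BARTpho configuration.
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=40000,            # assumed; determined by the syllable/word vocabulary
    d_model=1024,                # hidden size (assumed)
    encoder_layers=12,
    decoder_layers=12,
    encoder_attention_heads=16,
    decoder_attention_heads=16,
    encoder_ffn_dim=4096,
    decoder_ffn_dim=4096,
    activation_function="gelu",  # GeLU activations, as stated above
    max_position_embeddings=1024,
)

model = BartForConditionalGeneration(config)   # randomly initialised encoder-decoder
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```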
