Faster than training from scratch — Fine-tuning the English GPT-2 in any language with Hugging Face and fastai v2 (practical case with Portuguese)

Pierre Guillou
Jul 14 · 40 min read

The 3 main steps of fine-tuning the English GPT-2 to Portuguese with Hugging Face and fastai v2 (image edited from fast.ai NLP)

In this tutorial, instead of training from scratch, we will see how to fine-tune, in just
over a day, on one GPU and with a little more than 1 GB of training data, an English
pre-trained transformer-based language model to any other language. As a
practical case, we fine-tune the English pre-trained GPT-2 to Portuguese by
wrapping the Transformers and Tokenizers libraries of Hugging Face into fastai v2.
We thus create a new language model: GPorTuguese-2, a language model for
Portuguese text generation (and more NLP tasks…).

Other posts in the GPT-2 series: (NLP & fastai) GPT-2 | Byte-level BPE, an universal
tokenizer but… | GPT-2 use cases: beyond Text Generation | Fast pipeline to localize any
transformer-based model to any language | How to generate texts with a transformer-
based language model

Texts generated by GPorTuguese-2 on Covid-19, Netflix, Artificial Intelligence and… unicorns


Examples of texts generated by GPorTuguese-2 (Portuguese GPT-2 small) on Covid-19, Netflix, Artificial Intelligence and… unicorns

Acknowledgment
This tutorial was made possible thanks to the computing power of the AI Lab
(University of Brasilia), to which I am attached as an Associate Researcher in NLP, and
to the participation of its directors, Professors Fabricio Ataides Braz and Nilton Correia
da Silva, in the definition of the NLP strategy. Thank you so much!

AI Lab (University of Brasilia, Brazil)

And special thanks to Sylvain Gugger for his tutorial on Transformers and fastai v2
which is the basis of this tutorial.

I would also like to mention the Nama.ai R&D team and its CEO Rodrigo Scotti, who are
taking part in AI research in Brazil to improve online services through the use of
generative NLP models.

Table of contents

Texts generated by GPorTuguese-2 on Covid-19, Netflix, Artificial Intelligence and… unicorns

Acknowledgment

Notebooks, Web App and model download

Results

About the need for language models not just in English… and how to do it in real life

Why use fastai v2 over the Hugging Face libraries to fine-tune a pre-trained transformer-based language model?

About the choice of GPT-2

Main coding steps to fine-tune a Hugging Face language model with fastai v2

Model sharing and uploading in the Hugging Face model hub

Text Generation by our Portuguese GPT-2

Conclusion

Annex | Other articles about fine-tuning GPT-2 to another language

Notebooks, Web App and model download


The main code of the tutorial is published in this post, organized by paragraph.

To obtain the complete code, simply download the notebook finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb (nbviewer version). However, as this notebook is very detailed, use this fast notebook finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2_FAST.ipynb (nbviewer version) if you just want to run the code without explanation.

In addition, our GPorTuguese-2 (Portuguese GPT-2 small), a language model for
Portuguese text generation (and more NLP tasks…), is testable online in the
Hugging Face model hub, with all usage information, at this address:

pierreguillou/gpt2-small-portuguese · Hugging Face


GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small…

huggingface.co
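
For a quick taste before diving into the tutorial, here is a minimal usage sketch (assuming the transformers library is installed) that loads the published model from the model hub and generates a short Portuguese text; the prompt and sampling parameters below are illustrative only, not from the original post.

# Minimal usage sketch: load GPorTuguese-2 from the Hugging Face model hub
# and generate a short Portuguese text (sampling parameters are illustrative).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = GPT2LMHeadModel.from_pretrained("pierreguillou/gpt2-small-portuguese")

input_ids = tokenizer("A inteligência artificial", return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=60, do_sample=True, top_k=40, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))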

Results
Analysis of results
In a little more than a day (we used only one NVIDIA V100 32GB GPU; with a
Distributed Data Parallel (DDP) training mode, we could have divided this time by
three, down to about 10 hours, with just 2 GPUs), we got a loss of 3.17, an accuracy
of 37.99% and a perplexity of 23.76 (see the validation results table below and the
explanations about perplexity at the end of this paragraph). Happy!

+------------+------+----------+------------+----------+-----------+
| after | loss | accuracy | perplexity | time | cumulative|
| ... epochs | | (%) | | by epoch | time |
+------------+------+----------+------------+----------+-----------+
| 0 | 9.95 | 9.90 | 20950.94 | 00:00:00 | 00:00:00 |
| 1 | 3.64 | 32.52 | 38.12 | 5:48:31 | 5:48:31 |
| 2 | 3.30 | 36.29 | 27.16 | 5:38:18 | 11:26:49 |
| 3 | 3.21 | 37.46 | 24.71 | 6:20:51 | 17:47:40 |
| 4 | 3.19 | 37.74 | 24.21 | 6:06:29 | 23:54:09 |
| 5 | 3.17 | 37.99 | 23.76 | 6:16:22 | 30:10:31 |
+------------+------+----------+------------+----------+-----------+
Fine-tuning of GPT-2 into Portuguese
Table of training and validation results

After a huge gain at the end of the first epoch (see the validation results graph
below), the validation accuracy continues to improve until the end of training, but
more slowly (it reaches nearly 40%, which is considered a good performance for a
language model — check the notebooks nn-vietnamese.ipynb and nn-turkish.ipynb from
Jeremy Howard of fastai).


Validation loss and accuracy of the pre-trained English GPT-2 of Hugging Face fine-tuned to Portuguese by fastai v2

The perplexity evolution graph on the validation dataset confirms that the fine-tuning
of the vocab and position embedding matrices in the first epoch brought a very
significant gain.

Validation perplexity of the pre-trained English GPT-2 of Hugging Face fine-tuned to Portuguese by fastai v2

Our results validate the importance of first training the embedding
matrices (vocab and position) before fine-tuning the 3 layers groups (each
with 4 decoder blocks).

About our fine-tuning strategy


Our Transfer Learning and fine-tuning approach to get a Portuguese GPT-2 from an English one is validated by the results obtained.

Indeed, the fact that our model quickly reaches a high performance comes from our
fine-tuning strategy on a pre-trained model, i.e. the reuse of its vocab and position
embedding matrices (all token vectors common to the English and Portuguese
vocabs were kept) and of its model weights learned on an English corpus (the 40 GB
WebText).

Surely, this strategy worked because the language rules of English and Portuguese are not that different (these rules are encoded in the pre-trained model's embedding matrices and weights)!
About the perplexity of our model
To get an idea of the performance of our GPT-2 fine-tuned to Portuguese, we would
need to train the same GPT-2 model on the same Portuguese dataset but from scratch
(with randomized position and vocab embedding and model parameters (weights)).

However, we can already compare our perplexity of 23.76 to, for example, the 25.6
from the Transformers Tutorial, about which Sylvain Gugger writes “25.6 as perplexity is
kind of amazing” (zero-shot perplexity of the English GPT-2 with its BBPE tokenizer on the
WikiText2 corpus), or to the 29.41 from the original GPT-2 paper (zero-shot
perplexity of the English GPT-2 with a BPE tokenizer (not a BBPE one) on the WikiText2
corpus).

Looks good!
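
As a quick sanity check of the numbers above: the perplexity reported by fastai is simply the exponential of the (cross-entropy) validation loss, so the loss of 3.17 in the results table corresponds to a perplexity of about 23.8.

# Perplexity is the exponential of the cross-entropy loss reported in the
# validation results table above.
import math
print(math.exp(3.17))   # ~23.81, consistent with the reported perplexity of 23.76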


Perplexity table from the original GPT-2 paper (Language Models are Unsupervised Multitask Learners)

About the need for language models not just in English… and how
to do it in real life
Even if English is today the most spoken language in the world (around 1.2 billion
people), the world is multilingual (for example, there are 34 languages having 45
million or more total speakers in the 2019 edition of Ethnologue, a language reference
published by SIL International).

It is therefore necessary to have natural language models trained in all existing
languages, and not just in English, since these models constitute the essential basis for
training models capable of performing a particular NLP task
(classification, Q&A, summarization, entity search, etc.).

This is a color coded diagram to indicate the percentage of English speakers of nearly all the world’s countries. A
few small islands have not been accounted for. (image source: List of countries by English-speaking population in
Wikipedia)

However, while it is extremely simple and free to download a language model trained
in English, via the Transformers library of Hugging Face for example, it is often
much more difficult to find online a model trained in another language.

Option 1 | Fast pipeline to localize any transformer-based model to any language

The easiest way to get these language-specific language models would be to use a
pipeline of existing pre-trained transformer-based models like the following one:

Fast pipeline to localize any transformer-based model (here, a language model) to any language, for
example in Portuguese (image edited from fast.ai NLP)

For example, to obtain a Portuguese GPT-2, we could download from the Transformers
library of Hugging Face the OpenAI GPT-2 pre-trained in English and the MarianMT
translator (we could also use BART or T5 for the translation) in order to create the
following pipeline:

(input) Portuguese to English (MarianMT)
>> English pre-trained language model (GPT-2)
>> (output) English to Portuguese (MarianMT)

So, for free and with only a few lines of code, we can get any language model in any
language, and even any task-oriented NLP model (classification, Q&A, synthesis, entity
searches, etc.) using the same pipeline. Not bad!

You will find the code of this pipeline and examples of its use for text generation in the
post “Fast pipeline to localize any transformer-based model to any language”.
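
For illustration only, here is a rough sketch of such a pipeline with the Transformers library; the Marian checkpoint names and the ">>pt<<" target-language prefix are assumptions to be checked against the model hub (the post linked above has the actual code).

# Rough sketch of the translate -> generate -> translate-back pipeline.
# The Marian checkpoint names and the ">>pt<<" prefix are assumptions;
# check the Hugging Face model hub for the exact Portuguese/English pairs.
from transformers import (MarianMTModel, MarianTokenizer,
                          GPT2LMHeadModel, GPT2TokenizerFast)

def translate(texts, model_name):
    tok = MarianTokenizer.from_pretrained(model_name)
    mt = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True)
    return [tok.decode(t, skip_special_tokens=True) for t in mt.generate(**batch)]

# (input) Portuguese -> English
prompt_en = translate(["A inteligência artificial vai mudar"],
                      "Helsinki-NLP/opus-mt-ROMANCE-en")[0]

# English text generation with the pre-trained GPT-2
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
ids = gpt2_tok(prompt_en, return_tensors="pt").input_ids
out = gpt2.generate(ids, max_length=100, do_sample=True, top_k=50)
text_en = gpt2_tok.decode(out[0], skip_special_tokens=True)

# (output) English -> Portuguese
print(translate([">>pt<< " + text_en], "Helsinki-NLP/opus-mt-en-ROMANCE")[0])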

However, the problem with this simple solution is that we depend on the quality
of training of 2 pre-trained NLP models, which greatly increases the risk of losing
the linguistic singularities and nuances of the desired language.

Option 2 | Fine-tuning of an existing pre-trained model



Therefore, it often becomes necessary to train your own language model.

Nevertheless, training from scratch a powerful transformer-based language model
like GPT-2 or GPT-3 of OpenAI, BART of Facebook or T5 of Google requires tens
or even hundreds of GB of text, which is impossible or difficult to find, or requires
gigantic computing power that only a few companies in the world have. For example,

GPT-2 Extra-Large (1.5 billion parameters) was trained on 40 GB of WebText on 32
Cloud TPU v3s for 1 week (cost of $43,008)

CamemBERT, the BERT in French, was trained on 38GB of raw text on 256 GPUs
(32 GB Tesla V100) for 1 day

RoBERTa was pre-trained for 24 hours on 1,024 (full size, 32GB) V100s… and we
are not talking about T5 or GPT-3 (175 billion parameters) whose computational
cost was estimated at $4.6 million! (“We are waiting for OpenAI to reveal more
details about the training infrastructure and model implementation. But to put things
into perspective, GPT-3 175B model required 3.14E23 FLOPS of computing for
training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud
pricing we could find, this will take 355 GPU-years and cost $4.6M for a single
training run.”)

NLP models through time, with their number of parameters (Image credit: TensorFlow blog)

Thus, as it is easy to download a few GB of texts from an online language corpus
(Wikipedia, OSCAR, Common Crawl for example) and to rent an NVIDIA V100 GPU for
$1.24 an hour (GCP, AWS, Azure for example), it is more realistic for the majority of
people and organizations wishing to use a language model in a language other than
English to fine-tune, on a few GB of texts, a model already pre-trained in English (i.e.
fine-tuning a model obtained by Transfer Learning), using Deep Learning frameworks such
as TensorFlow+Keras or PyTorch+fastai.

This tutorial shows how to implement this second option, and you will find examples of
use for text generation in the paragraph “Text Generation by our Portuguese GPT-2”
at the end of this tutorial.

Why use fastai v2 over the Hugging Face libraries to fine-tune a pre-trained transformer-based language model?
Tokenizers and Transformers from Hugging Face
The Tokenizers and Transformers libraries from Hugging Face (HF) are today the most
up-to-date NLP (Natural Language Processing) libraries, used all over the world
(the library versions we used are from July 2020: transformers 3.0.0 and
tokenizers 0.8.0).

Hugging Face

According to the HF official documentation, they were designed with two strong goals
in mind:


1. be as easy and fast to use as possible

2. provide state-of-the-art models with performances as close as possible to the original models

However, as written in the Philosophy paragraph of the Quickstart HF page:

the Transformers library is NOT a modular toolbox of building blocks for neural nets. If
you want to extend/build-upon the library, just use regular Python/PyTorch modules and
inherit from the base classes of the library to reuse functionalities like model
loading/saving.

Indeed, reading the new Hugging Face tutorials from June 2020 confirms that
plain PyTorch must be used in order to train from scratch or fine-tune a pre-trained
model from the Transformers library.

For example, the new Training and fine-tuning tutorial explains how to fine-tune in
native PyTorch. It is very helpful, but how do you apply the 1cycle policy fine-tuning
method, for example? Or how do you easily freeze or unfreeze some layers groups, as in
fastai v2 with the functions learn.unfreeze() and learn.freeze_to(), instead of writing
full PyTorch code?

fastai v2
Therefore, despite the ready-to-run Python scripts published by Hugging Face (for example,
run_language_modeling.py for fine-tuning the library models for language modeling
on a text file (GPT, GPT-2, BERT, RoBERTa)), when it becomes necessary to fine-tune a
pre-trained model to another language and/or to another task, we need
easy fine-tuning methods on top of regular Python/PyTorch modules in order to apply
modern Transfer Learning and fine-tuning techniques.

Since fastai v2 provides all of these powerful fine-tuning techniques, it is a
prime candidate library for training transformer-based language models pre-trained
with the Tokenizers and Transformers libraries of Hugging Face.


fastai v2

Here is a non-exhaustive list of the fastai v2 fine-tuning techniques based on Transfer Learning (a short sketch of how they map to fastai calls follows the list):

Learning rate finder (method that helps finding the best learning rate to train the
model)

Mixed precision training (some of the operations will be done in FP16, others in
FP32 in order to speed up the training)

Gradual unfreezing (layers groups are defined allowing to decide the layers to be
trained)

1cycle policy (the 1cycle policy was introduced by Leslie N. Smith et al. in Super-
Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. It
schedules the learning rate with a cosine annealing)

Differential learning rates (a specific learning rate is setup by layers group)

Distributed training (training distributed on different GPUs in order to speed up


the training)
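
To give a flavour of how these techniques translate into code (the actual calls are detailed later in this tutorial), here is a minimal sketch assuming a learn Learner has already been created as shown further below.

# Minimal sketch of the fastai v2 fine-tuning techniques used in this tutorial
# (assuming `learn` is a Learner built as shown later in the post).
learn = learn.to_fp16()                   # mixed precision training
learn.lr_find()                           # learning rate finder
learn.freeze_to(-2)                       # gradual unfreezing: train only the last 2 layers groups
learn.fit_one_cycle(1,                    # 1cycle policy...
                    slice(1e-3/(2.6**4), 1e-3))  # ...with differential learning rates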

About the choice of GPT-2

In order to demonstrate the feasibility of fine-tuning Hugging Face models via fastai
v2, we wanted to choose an emblematic model of the Transformer revolution in
NLP since 2017.


The original transformer model is made up of an encoder and a decoder (image credit: The Illustrated GPT-2)

Thus, between the 2 historic transformer-based models GPT-2 and BERT, we
chose GPT-2 because it strongly influenced minds beyond the
circle of Deep Learning specialists in early 2019 by writing texts of a quality
close to that of humans. Today “exceeded” in number of parameters and performance
by more recent models like BART, T5 and of course GPT-3 (175 billion parameters!), it
remains a reference and a model used in research and applications.

(1/2) OpenAI GPT-2 is a transformer-based language model using only decoder blocks (image credit: The
illustrated GPT-2)


(2/2) OpenAI GPT-2 is a transformer-based language model using only decoder blocks (note: we use an input sequence of 1024, not 4000 — image credit: The Illustrated GPT-2)

Note: for those who want to better understand how GPT-2 works, read the following
posts:

The Illustrated GPT-2 (Visualizing Transformer Language Models)

NLP & fastai | GPT-2

About the version of GPT-2: there are 3 versions of the GPT-2 model (look at the
transformers documentation for more details). Here, we use the small version, the
one with the smallest number of weights (124 million, not 117 million as written in the
original paper), but you can change the model used by changing the content of
pretrained_weights (if it's not a GPT-2 model, you'll of course need to change the
classes used for the model and the tokenizer).
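
As a minimal sketch, switching size only means changing the checkpoint name passed to from_pretrained; 'gpt2-medium' and 'gpt2-large' are alternative checkpoints available on the Hugging Face model hub, while only 'gpt2' (the small version) is used in this tutorial.

# Sketch: the GPT-2 checkpoint is selected by name; 'gpt2' is the small
# version used in this tutorial ('gpt2-medium' or 'gpt2-large' would load
# a bigger model with the same classes).
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)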

We used the English pre-trained GPT-2 small and its Byte-level BPE tokenizer in this tutorial (image credit:
The illustrated GPT-2)

English pre-trained GPT-2 small

12-layer, 768-hidden, 12-heads

124M parameters, file of 548 MB

Download time: about 10 minutes

English pre-trained Byte-level BPE tokenizer

Byte-level BPE


vocab of 50,257 tokens

Note: to understand better what is a Byte-level BPE tokenizer, read this post: Byte-level
BPE, an universal tokenizer but…
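
As a side illustration of why a language-specific tokenizer matters (not from the original post, and runnable only once tokenizer_en and the Portuguese tokenizer_pt of the sections below are loaded): the English BBPE tokenizer splits Portuguese text into noticeably more sub-tokens than a BBPE tokenizer trained on Portuguese.

# Sketch: compare how the English and the Portuguese BBPE tokenizers split
# the same Portuguese sentence (tokenizer_en and tokenizer_pt are built in
# section 3 below); the example sentence is illustrative.
text = "O aprendizado profundo mudou o processamento de linguagem natural."
print(len(tokenizer_en.tokenize(text)))  # noticeably more sub-tokens...
print(len(tokenizer_pt.tokenize(text)))  # ...than with the Portuguese vocab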

Main coding steps to fine-tune a Hugging Face language model with fastai v2

You will find in the tutorial notebook the code, detailed explanations and results for
the 6 main coding steps to fine-tune a Hugging Face language model with fastai v2. We
copied/pasted the key parts into this post in order to focus on them.

1. Initialization
2. Download Wikipedia in Portuguese
3. Download a GPT-2 English pre-trained model and train a GPT-2
tokenizer with a vocab in Portuguese
3.1 Get the pre-trained GPT-2 Tokenizer & Model (pre-trained with
an English corpus) from the Transformers library (Hugging Face)
3.2 Train a Byte-level BPE (BBPE) Tokenizer on the Portuguese
Wikipedia corpus by using the Tokenizers library (Hugging Face)
3.3 Import the tokenizer Portuguese config files into the pre-
trained GPT-2 Tokenizer
4. Create a fastai tokenizer and update the embedding matrix of the
GPT-2 English pre-trained model
4.1 GPT2TokenizerFast (imported GPT2 tokenizer) --> fastai
Tokenizer
4.2 Change vocab embedding in the GPT-2 pre-trained model to adapt
to the Portuguese vocab
5. Create fastai v2 Datasets and Dataloaders
6. Fine-tuning the model
6.1 Splitter (get layers groups)
6.2 Learner
6.2.1 Freeze all layers but the last layers group (wte, wpe
embedding matrices and last LayerNorm)
6.2.2 Freeze all layers but the last 2 layers groups
6.2.3 Freeze all layers but the last 3 layers groups
6.2.4 Unfreeze all layers

However, these 6 main steps can be summarized in 3 main ones:


Fine-tuning the English GPT-2 to Portuguese with Hugging Face and fastai v2 in 3 main steps (image
edited from fast.ai NLP)

1. Initialization & download (download of Portuguese Wikipedia and GPT-2 English


pre-trained model and tokenizer)

2. GPT-2 tokenizer with a Portuguese vocab (train a GPT-2 tokenizer with a vocab
in Portuguese, wrap it into a fastai v2 tokenizer and update the embedding matrix
of the GPT-2 English pre-trained model according to the new Portuguese vocab:
keep the embedding vectors of the common tokens between English and
Portuguese vocabs)

3. Fine-tune on Portuguese Wikipedia the GPT-2 model with fastai v2 training


functionalities

Let’s start our journey to GPT-2 fine-tuned into Portuguese!

1. Initialization

# libraries installation
# fastai v2: read https://dev.fast.ai/#Installing
# tokenizers: !pip install tokenizers
# transformers: !pip install transformers

# import fastai v2
from fastai2.text.all import *
from nlputils_fastai2 import *

# import tokenizers and transformers
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
from tokenizers import ByteLevelBPETokenizer

# setup new path_data and create the lang folder
lang = 'pt'
name = f'{lang}wiki'
config = Config()
data_path = config['data_path']
path_data = data_path/name
path_data.mkdir(exist_ok=True, parents=True)

2. Download Wikipedia in Portuguese

In Wikimedia Downloads, you will find the dump of the Portuguese Wikipedia, which has
1,037,991 articles at the date of the study (07/03/2020).


By selecting those with a minimum text length of 1,800, we downloaded 20% of
these articles (204,315 files), which represent about 200 million words for a total
size of 1.6 GB.

This dataset size has to be compared to the 40 GB of WebText (text extracted from
the Internet, excluding Wikipedia) used by OpenAI to train the English GPT-2 from
scratch (see “About the English dataset used to train GPT-2” at the end of this paragraph).

We use 25 times less training data to obtain a GPT-2 in Portuguese than was used to obtain the GPT-2 in English.
Note: all the following methods come from the file nlputils_fastai2.py from fastai. We
also tried to use the nlp library of Hugging Face to download the Portuguese
Wikipedia, but we faced an unsolved issue (see the notebook).

# download Portuguese Wikipedia
get_wiki(path_data, lang)

# create one text file by article
dest = split_wiki(path_data, lang)

# get all articles in one text file and one csv file
get_one_clean_file(dest, lang)
get_one_clean_csv_file(dest, lang)

Note: the text file (all the articles in one file) will be used to train the Portuguese
tokenizer, and the csv file will facilitate the tests of the study.


First articles from downloaded Portuguese Wikipedia

About the English dataset used to train GPT-2

(source) The resulting dataset, WebText, contains the text subset of these 45 million links.
To extract the text from HTML responses we use a combination of the Dragnet (Peters
& Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a
preliminary version of WebText which does not include links created after Dec 2017 and
which after de-duplication and some heuristic based cleaning contains slightly over 8
million documents for a total of 40 GB of text. We removed all Wikipedia documents from
WebText since it is a common data source for other datasets and could complicate analysis
due to overlapping training data with test evaluation tasks.

3. Download a GPT-2 English pre-trained model and train a GPT-2 tokenizer with a vocab in Portuguese

We are following 3 steps in order to get a GPT-2 tokenizer with a vocab in Portuguese:

1. Get the pre-trained GPT-2 Tokenizer & Model (pre-trained with an English
corpus) from the Transformers library (Hugging Face): it will give us the
tokenizer structure we need and the pre-trained model weights (it’s better to start
training our GPT-2 model in Portuguese from weights already trained even in
another language than from random values)

2. Train a Byte-level BPE (BBPE) Tokenizer on the Portuguese Wikipedia corpus


by using the Tokenizers library (Hugging Face): this will give us the vocabulary
files in Portuguese of our GPT-2 tokenizer.

3. Import the tokenizer Portuguese config files ( vocab.json , merges.txt ) into the
pre-trained GPT-2 Tokenizer: it will give us a GPT-2 tokenizer structure with the
vocab in Portuguese.

One relevant point is that we trained our Portuguese Byte-level BPE tokenizer on
Portuguese Wikipedia (here, 1.6 GB) in only 2min 7s. Thanks Hugging Face!

# 1. Get the pre-trained GPT-2 Tokenizer & Model (pre-trained with an
# English corpus) from the Transformers library (Hugging Face)
pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tokenizer_en.pad_token = tokenizer_en.eos_token
# get the English pre-trained model as well (its vocab embedding is updated in step 4)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)

# 2. Train a Byte-level BPE (BBPE) tokenizer on the Portuguese
# Wikipedia corpus by using the Tokenizers library (Hugging Face)

# 2.1 Get the GPT-2 tokenizer_en vocab size
ByteLevelBPE_tokenizer_pt_vocab_size = tokenizer_en.vocab_size
ByteLevelBPE_tokenizer_pt_vocab_size

# 2.2 ByteLevelBPETokenizer represents a Byte-level BPE
# as introduced by OpenAI with their GPT-2 model
from tokenizers import ByteLevelBPETokenizer
ByteLevelBPE_tokenizer_pt = ByteLevelBPETokenizer()

# 2.3 Get the list of paths to corpus files
# and customize training with the <|endoftext|> special GPT-2 token
paths = [str(path_data/'all_texts_ptwiki.txt')]
ByteLevelBPE_tokenizer_pt.train(files=paths,
                                vocab_size=ByteLevelBPE_tokenizer_pt_vocab_size,
                                min_frequency=2,
                                special_tokens=["<|endoftext|>"])
# Get sequence length max of 1024
ByteLevelBPE_tokenizer_pt.enable_truncation(max_length=1024)

# 2.4 Save the tokenizer
ByteLevelBPE_tokenizer_pt_rep = 'ByteLevelBPE_tokenizer_pt'
path_to_ByteLevelBPE_tokenizer_pt_rep = path_data/ByteLevelBPE_tokenizer_pt_rep
if not (path_to_ByteLevelBPE_tokenizer_pt_rep).exists():
    path_to_ByteLevelBPE_tokenizer_pt_rep.mkdir(exist_ok=True, parents=True)
ByteLevelBPE_tokenizer_pt.save_model(str(path_to_ByteLevelBPE_tokenizer_pt_rep))

# 3. Import the tokenizer config files in Portuguese into the
# pre-trained GPT-2 Tokenizer
tokenizer_pt = GPT2TokenizerFast.from_pretrained(
    str(path_to_ByteLevelBPE_tokenizer_pt_rep),
    pad_token='<|endoftext|>')
# Get sequence length max of 1024
tokenizer_pt.model_max_length = 1024

4. Create a fastai tokenizer and update the embedding matrix of the GPT-2 English pre-trained model

Now let's see how we can use fastai v2 to fine-tune this model on Wikipedia in
Portuguese, using all the fastai v2 training and fine-tuning utilities.

We will follow these 2 steps:



1. GPT2TokenizerFast (imported GPT-2 tokenizer) → fastai Tokenizer: to process the data to train a model, we need to build a fastai tokenizer from the GPT-2 tokenizer with vocab in Portuguese.

2. Change vocab embedding ( wte matrix) in the GPT-2 pre-trained model to adapt to the Portuguese vocab: as the vocab embedding matrix ( wte ) of the pre-trained GPT-2 model corresponds to the English vocabulary, we'll keep the embedding vectors of the common tokens between the English and Portuguese vocabs.

(text from Sylvain Gugger Transformers Tutorial) To process this data to train a model,
we need to build a Transform that will be applied lazily. In a fastai Transform you can
define:

an encodes method that is applied when you call the transform (a bit like the
forward method in a nn.Module )

a decodes method that is applied when you call the decode method of the
transform, if you need to decode anything for showing purposes (like converting
ids to a text here)

a setups method that sets some inner state of the Transform (not needed here)

# 1. GPT2TokenizerFast (imported GPT-2 tokenizer) → fastai Tokenizer
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

tokenizer_fastai_en = TransformersTokenizer(tokenizer_en)
tokenizer_fastai_pt = TransformersTokenizer(tokenizer_pt)

# 2. Change vocab embedding in the GPT-2 pre-trained model to adapt
# to the Portuguese vocab

# Get weights of the old wte
old_wgts = model.transformer.get_input_embeddings().weight.clone().detach()

# Get the mean embedding vector of the old wte
wgts_m = old_wgts.mean(0)

# Initialize vocab size and weights of the new wte
new_vocab_size = tokenizer_fastai_pt.tokenizer.vocab_size
new_wgts = old_wgts.new_zeros(new_vocab_size, old_wgts.size(1))

# Get the new wte keeping the embedding vectors of tokens
# in common in the 2 vocabs
# A token present in the new vocab but not in the old one
# gets the mean embedding vector of the old wte
old_vocab = tokenizer_fastai_en.tokenizer.get_vocab()
new_vocab = tokenizer_fastai_pt.tokenizer.get_vocab()
same_tokens_list = list()
different_tokens_list = list()

for w, idx_new in new_vocab.items():
    idx_old = old_vocab.get(w, -1)
    if idx_old >= 0:
        new_wgts[idx_new] = old_wgts[idx_old]
        same_tokens_list.append((w, idx_new))
    else:
        new_wgts[idx_new] = wgts_m
        different_tokens_list.append((w, idx_new))

# setup in model the new wte
new_wte = nn.Embedding(new_vocab_size, old_wgts.size(1))
new_wte.weight.data = new_wgts
model.transformer.set_input_embeddings(new_wte)

# save new_wgts
torch.save(new_wgts, path_data/'new_wte_wgts.pt')
# save same_tokens_list and different_tokens_list
torch.save(same_tokens_list, path_data/'same_tokens_list.pt')
torch.save(different_tokens_list, path_data/'different_tokens_list.pt')

# Change the lm_head weights with the new embedding matrix
model.lm_head.weight = model.transformer.wte.weight

Portuguese embedding wte matrix setup done!

We kept 12,948 embedding vectors from the English wte (~25%).

We did not keep 37,309 embedding vectors (~75%) from the English wte (instead, we
used the mean vector of the old wte).

15 first tokens IN common between the 2 vocabs:


[(‘ĠQuit’, 40195), (‘Smith’, 32470), (‘Ġomit’, 39040), (‘oc’, 574), (‘ym’, 18252),
(‘Ġactual’, 9443), (‘ck’, 911), (‘ĠPremier’, 16558), (‘Ġeste’, 987), (‘ĠInd’, 3438),
(‘Ġbol’, 4203), (‘phen’, 35836), (‘ĠParticip’, 36689), (‘ĠZeus’, 19316), (‘Ġnan’,
39770)]


15 first Portuguese tokens NOT in common between the 2 vocabs:


[(‘PSDB’, 23151), (‘Ġenvio’, 19270), (‘Ġocupação’, 5938), (‘Ġdocumentada’,
30011), (‘Ġduros’, 36706), (‘visto’, 44422), (‘ĠSiro’, 43061), (‘Ġdestacavam’,
47397), (‘Ġarqui’, 49060), (‘ĠArte’, 5977), (‘ĠValor’, 29721), (‘Ġalinhados’,
38446), (‘Ġnúmeros’, 4626), (‘Ġpênis’, 31686), (‘cisa’, 29710)]

5. Create fastai v2 Datasets and Dataloaders


(text from Sylvain Gugger Transformers Tutorial) You can then group your data with this
Transform using a TfmdLists. It has an s in its name because it contains the training
and validation datasets.

We indicate the indices of the training dataset and the validation dataset with splits
(here, 80% of the indices randomly chosen, then all the remaining indices).

We specify dl_type=LMDataLoader in the TfmdLists for when we will convert this
TfmdLists to DataLoaders: we will use an LMDataLoader since we have a language
modeling problem, not the usual fastai TfmdDL.

# train = 80%
# validation = 20%
num = int(0.8*len(df))
idxs = np.random.randint(0, len(df), len(df))
idxs_train = idxs[:num]
idxs_val = idxs[num:]

# We gather all texts in one numpy array
# (since it will be easier to use this way with fastai)
all_texts = np.concatenate([df.iloc[idxs_train].text.values,
                            df.iloc[idxs_val].text.values])

splits = [list(idxs_train), list(idxs_val)]
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer_pt),
                splits=splits, dl_type=LMDataLoader)

(text from Sylvain Gugger Transformers Tutorial) The fastai v2 library expects the data
to be assembled in a DataLoaders object (something that has a training and validation
dataloader). We can get one by using the dataloaders method. We just have to specify
a batch size and a sequence length:

Let's use a batch size of 8 (a higher value gives a "CUDA out of memory" error on
our single GPU).

Since the GPT-2 model was trained with sequences of size 1024, we use this
sequence length (it's a stateless model, so the perplexity would change if we used a
shorter sequence length).

bs,sl = 8,1024
dls = tls.dataloaders(bs=bs, seq_len=sl)

6. Fine-tuning the model


(text from Sylvain Gugger Transformers Tutorial) The Hugging Face model will return a
tuple in outputs, with the actual predictions and some additional activations (should
we want to use them in some regularization scheme). To work inside the fastai v2
training loop, we will need to drop those using a Callback: we use those to alter the
behavior of the training loop.

Here we need to write the event after_pred and replace self.learn.pred (which
contains the predictions that will be passed to the loss function) by just its first
element. In callbacks, there is a shortcut that lets you access any of the underlying
Learner attributes, so we can write self.pred[0] instead of self.learn.pred[0]. That
shortcut only works for read access, not write, so we have to write self.learn.pred on
the right side (otherwise we would set a pred attribute in the Callback).

class DropOutput(Callback):
def after_pred(self): self.learn.pred = self.pred[0]

6.1 Splitter (get the layers groups)

The model has 2 main layers groups (or parameters groups): transformer and
lm_head. As we can read in The Illustrated GPT-2, the lm_head is a copy of the vocab
embedding matrix wte, used to get (after a softmax) the probability of each token in
the vocab. Therefore, we only need to split the transformer layers group to get all
the layers (a quick inspection sketch follows the list below).

transformer

( wte ) vocab embedding (vocab tokens → embedding)


( wpe ) position embedding (token positions in the input sequence → embedding)

12 decoder blocks (attention heads)

lm_head

LayerNorm
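
A quick way to check this structure on the loaded model (a sketch; the printed shapes reflect the 50,257-token vocab kept in this tutorial):

# Sketch: inspect the layers we are about to group with the splitter.
print(len(model.transformer.h))   # 12 decoder blocks
print(model.transformer.wte)      # Embedding(50257, 768): vocab embedding
print(model.transformer.wpe)      # Embedding(1024, 768): position embedding
print(model.transformer.ln_f)     # final LayerNorm
print(model.lm_head)              # Linear(768 -> 50257), tied to wte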

Now, we can create our layers groups, which will allow us to use all the fastai v2 fine-
tuning techniques. Moreover, we decided to follow the fine-tuning method shown
for text classification training in the notebook 10_nlp.ipynb by creating 4 layers
groups: 3 groups of 4 decoder blocks each and one embedding group
with the wte and wpe matrices.

def splitter(model):
    "Split a GPT2 `model` in 4 groups for differential learning rates."

    # First layers group: decoder blocks from 0 to 3
    modules = []
    for i in range(4): modules.append(model.transformer.h[i])
    groups = [nn.Sequential(*modules)]

    # Second layers group: decoder blocks from 4 to 7
    modules = []
    for i in range(4, 8, 1): modules.append(model.transformer.h[i])
    groups = L(groups + [nn.Sequential(*modules)])

    # Third layers group: decoder blocks from 8 to 11
    modules = []
    for i in range(8, 12, 1): modules.append(model.transformer.h[i])
    groups = L(groups + [nn.Sequential(*modules)])

    # Fourth layers group: embedding matrices wte and wpe
    # + the LayerNorm at the model output
    groups = L(groups + [nn.Sequential(model.transformer.wte,
                                       model.transformer.wpe,
                                       model.transformer.ln_f)])

    return groups.map(params)

6.2 Learner

(text from Sylvain Gugger Transformers Tutorial) Now, we are ready to create our
Learner , which is a fastai object grouping data, model and loss function and handles

model training or inference. Since we are in a language model setting, we pass



accuracy and perplexity as metrics, and we need to use the callback we just defined.
Lastly, we use mixed precision to save every bit of memory we can (and if you have a
modern GPU, it will also make training faster).

learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                splitter=splitter,
                cbs=[DropOutput],
                metrics=[accuracy, Perplexity()]).to_fp16()

We can check how good the model is without any fine-tuning step by running
learn.validate() . In 53min 2s, we got:

validation loss: 9.949938774108887

validation accuracy: 0.09898579120635986

validation perplexity: 20950.939453125

Not so bad: nearly 10% accuracy without any fine-tuning! It means we start our
journey to a GPT-2 in Portuguese with a language model that already has a strong
knowledge of language rules (weights) and a basic knowledge of Portuguese (25% of
its vocab embedding matrix).

Now that we have a Learner , we will use during training all the fastai v2 fine-tuning
techniques seen for text classification training (see the notebook 10_nlp.ipynb about
"NLP Deep Dive: RNNs") to take advantage of the Transfer Learning of the GPT-2 pre-
trained embedding matrices and model from Hugging Face Transformers:

Learning rate finder (method that helps finding the best learning rate to train the
model)

Mixed precision training (some of the operations will be done in FP16, others in
FP32 in order to speed up the training)

Gradual unfreezing (the model has 4 layers groups created by our method
splitter : the embedding one and the 3 groups of 4 decoder blocks each)

1cycle policy with the method fit_one_cycle() (The 1cycle policy was introduced
by Leslie N. Smith et al. in Super-Convergence: Very Fast Training of Neural
Networks Using Large Learning Rates. It schedules the learning rate with a cosine

annealing from lr_max/div to lr_max then lr_max/div_final (pass an array to


lr_max if you want to use differential learning rates) and the momentum with
cosine annealing according to the values in moms . The first phase takes pct_start

of the training. You can optionally pass additional cbs and reset_opt .)

Differential learning rates (a different learning rate for each layers group: the
biggest one for the embedding group and the smallest one for the first 4 decoder
blocks)

6.2.1 Freeze all layers but the last layers group (do not freeze wte , wpe

embedding matrices and last LayerNorm )

learn.freeze()
learn.summary()

GPT2LMHeadModel (Input shape: ['8 x 1024'])


================================================================
Layer (type) Output Shape Param # Trainable
================================================================
Embedding 8 x 1024 x 768 38,597,376 True
________________________________________________________________
Embedding 8 x 1024 x 768 786,432 True
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False

________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False

________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False

________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False

________________________________________________________________
Conv1D 8 x 1024 x 2304 1,771,776 False
________________________________________________________________
Conv1D 8 x 1024 x 768 590,592 False
________________________________________________________________
Dropout 8 x 12 x 1024 x 102 0 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 False
________________________________________________________________
Conv1D 8 x 1024 x 3072 2,362,368 False
________________________________________________________________
Conv1D 8 x 1024 x 768 2,360,064 False
________________________________________________________________
Dropout 8 x 1024 x 768 0 False
________________________________________________________________
LayerNorm 8 x 1024 x 768 1,536 True
________________________________________________________________
Linear 8 x 1024 x 50257 38,597,376 True
________________________________________________________________

Total params: 163,037,184


Total trainable params: 77,982,720
Total non-trainable params: 85,054,464

Optimizer used: <function Adam at 0x7fce2f8dae60>


Loss function: FlattenedLoss of CrossEntropyLoss()

Model frozen up to parameter group number 3

Callbacks:
- DropOutput
- ModelToHalf
- TrainEvalCallback
- Recorder
- ProgressCallback
- MixedPrecision

The learn.summary() method gives almost the right numbers. In fact, it counts the
weights of the wte matrix (vocab embedding matrix) twice because they are duplicated
in the weights of the output linear layer (a quick check of this weight tying is sketched
after the list below).

The real numbers are:

Total params: 163,037,184 - 38,597,376 = 124,439,808 (about 124 million)

Total trainable params: 77,982,720 - 38,597,376 = 39,385,344 (about 40 million)

Total non-trainable params: 85,054,464 (about 85 million)
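
A quick way to verify this weight tying and the real parameter count (a sketch; PyTorch's parameters() iterator yields each shared tensor only once):

# Sketch: the output linear layer shares its weight tensor with the wte
# embedding matrix, so learn.summary() counts those 38,597,376 weights twice.
print(model.lm_head.weight is model.transformer.wte.weight)   # True
print(sum(p.numel() for p in model.parameters()))             # 124,439,808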


Now, let’s choose the best learning rate to launch the fine-tuning of the Portuguese
GPT-2 thanks to the fastai v2 learning rate finder.

learn.lr_find()

Results from learn.lr_find() before starting training the Portuguese GPT-2

The learning rate finder curve suggests a learning rate minimum of 6e-3. Let's use 2e-3, which seems to give the steepest decrease in loss according to the previous graph.
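If you prefer a numeric value to reading the curve by eye, the learning rate finder also returns suggestions. This is a hedged sketch: the exact return type and field names depend on the fastai v2 release (in mid-2020 it was a SuggestedLRs namedtuple).

suggested = learn.lr_find()
print(suggested)  # e.g. SuggestedLRs(lr_min=..., lr_steep=...)

# The value chosen above by reading the curve
lr = 2e-3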

learn.freeze()
learn.fit_one_cycle(1, 2e-3)

epoch  train_loss  valid_loss  accuracy  perplexity  time
0      3.803344    3.640777    0.325177  38.121441   5:48:31

In just one epoch, our model went

from an accuracy of 9.90% to 32.52%

from a perplexity of 20950.94 to 38.12

Not too bad!

We can trace the training and validation loss curves thanks to the fastai v2 loss plotting
function in order to visually verify the strong improvement of our model (i.e. the strong
reduction in training and validation losses).

learn.recorder.plot_loss()

Evolution of training and validation losses during the first fine-tuning epoch of the Portuguese GPT-2

Now, we can pass -2 to freeze_to to freeze all except the last two layer groups (as a reminder, learn.freeze() is equivalent to learn.freeze_to(-1), and learn.unfreeze() to learn.freeze_to(0)).

6.2.2 Freeze all layers but the last 2 layer groups

learn.freeze_to(-2)
learn.summary()

Again, the learn.summary() method gives almost the right numbers. In fact, it counts
twice the weights of the wte matrix (vocab embedding matrix) because they are
duplicated in the weights of the output linear layer.

The real numbers are:

Total params: 163,037,184 - 38,597,376 = 124,439,808 (about 124 million)

Total trainable params: 106,334,208 - 38,597,376 = 67,736,832 (about 68 million)

Total non-trainable params: 56,702,976 (about 57 million)

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-3/(2.6**4),1e-3))

train_loss  valid_loss  accuracy  perplexity  time
3.453913    3.301886    0.362879  27.163816   5:38:18

Good! Our model goes on learning. It went

from an accuracy of 32.52% to 36.29%

from a perplexity of 38.12 to 27.16

We can plot the training and validations losses curves.

learn.recorder.plot_loss()

Evolution of training and validation losses during the second fine-tuning epoch of the Portuguese GPT-2

Let's go on by passing -3 to freeze_to to freeze all except the last three layer groups.

6.2.3 Freeze all layers but the last 3 layer groups

learn.freeze_to(-3)
learn.summary()

The learn.summary() method gives almost the right numbers. In fact, it counts twice
the weights of the wte matrix (vocab embedding matrix) because they are duplicated
in the weights of the output linear layer.

The real numbers are:

Total params: 163,037,184 - 38,597,376 = 124,439,808 (about 124 million)

Total trainable params: 134,685,696 - 38,597,376 = 96,088,320 (about 96 million)

Total non-trainable params: 28,351,488 (about 28 million)

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-4/(2.6**4),5e-4))

train_loss  valid_loss  accuracy  perplexity  time
3.333389    3.207390    0.374579  24.714487   6:20:51

Yep! Our model is still learning: it went

from an accuracy of 36.29% to 37.46%

from a perplexity of 27.16 to 24.71

We can plot the training and validation losses curves.

learn.recorder.plot_loss()

Evolution of training and validation losses during the third fine-tuning epoch of the Portuguese GPT-2

Let's finish our work by unfreezing all layer groups, which means all parameters of the Portuguese GPT-2 model.

6.2.4 Unfreeze all layers

learn.unfreeze()
learn.summary()

One more time, the learn.summary() method gives almost the right numbers. In fact, it
counts twice the weights of the wte matrix (vocab embedding matrix) because they
are duplicated in the weights of the output linear layer.

The real numbers are:

Total params: 163,037,184 - 38,597,376 = 124,439,808 (about 124 million)

Total trainable params: 163,037,184 - 38,597,376 = 124,439,808 (about 124 million)

Total non-trainable params: 0

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-4/(2.6**4),1e-4))

epoch  train_loss  valid_loss  accuracy  perplexity  time
0      3.288433    3.186721    0.377380  24.208906   6:06:29
1      3.232569    3.167864    0.379885  23.756687   6:16:22

GPUuuuuuuu! Our model (a bit but still) goes on learning: it went

from an accuracy of 37.46% to 37.99%

from a perplexity of 24.71 to 23.76

We can plot the training and validation losses curves.

learn.recorder.plot_loss()

Training and validation loss evolution during the fourth and fifth epochs

Following the fastai v2 text classification fine-tuning strategy, and given our very good results (37.99% accuracy and 23.76 perplexity), we decided to stop fine-tuning the Portuguese GPT-2 at the end of these 5 epochs.
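For reference, here is a compact recap of the whole gradual unfreezing schedule used above. It is a sketch assembled from the calls shown in this section (not a verbatim copy of the notebook); the 2.6**4 divisor spreads discriminative learning rates across the layer groups, ULMFiT-style.

# Epoch 1: train only the last (unfrozen) layer group
learn.freeze()
learn.fit_one_cycle(1, 2e-3)

# Epoch 2: unfreeze the last 2 layer groups, with discriminative learning rates
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-3/(2.6**4), 1e-3))

# Epoch 3: unfreeze the last 3 layer groups
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-4/(2.6**4), 5e-4))

# Epochs 4 and 5: unfreeze all layer groups (all ~124 million parameters)
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-4/(2.6**4), 1e-4))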

Model sharing and uploading in the Hugging Face model hub


Let’s see now how we can share our Portuguese GPT-2 on the Hugging Face model hub
(source: Model sharing and uploading). You will find all the code corresponding to our
tokenizer and model in the tutorial notebook.
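As a hedged sketch of what the sharing step typically looked like with the Transformers version of mid-2020 (the folder name is illustrative, and learn.model / tokenizer_pt are assumed to be the underlying Hugging Face GPT2LMHeadModel and GPT-2 tokenizer from the notebook):

# Save the fine-tuned model and tokenizer into a local folder
path_to_model = "gpt2-small-portuguese"      # illustrative folder name
learn.model.save_pretrained(path_to_model)   # assumes learn.model is the HF GPT2LMHeadModel
tokenizer_pt.save_pretrained(path_to_model)  # assumes tokenizer_pt is the HF GPT-2 tokenizer

# With the Transformers CLI of that time, uploading then looked roughly like:
#   transformers-cli login
#   transformers-cli upload gpt2-small-portuguese/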

Thus, our model now has a page on huggingface.co/models 🔥


Anyone can load it with the following code:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = AutoModelWithLMHead.from_pretrained("pierreguillou/gpt2-small-portuguese")

Check our Hugging Face model page to get more information.
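For a quick test of the published model, the generic Transformers text-generation pipeline can also be used (a hedged usage sketch, not code from the notebook; the prompt and sampling settings are illustrative):

from transformers import pipeline

# Build a text-generation pipeline on top of the published Portuguese GPT-2
text_generator = pipeline("text-generation", model="pierreguillou/gpt2-small-portuguese")

# Generate a short continuation of an illustrative Portuguese prompt
print(text_generator("A inteligência artificial", max_length=50, do_sample=True, top_k=40))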

Text Generation by our Portuguese GPT-2


Now that we have a GPT-2 in Portuguese, we can use it for different tasks in NLP (Text
Generation, Reading Comprehension, Translation, Summary) as showed in the post
“GPT-2 use cases: beyond Text Generation”.

For now, let’s use it to generate new texts, which allows us to check that it works
properly and also have a little fun.

Text Generation techniques


At each step of text generation, GPT-2 provides a vector of 50,257 probabilities (one for each token of its vocabulary, whose size is 50,257). To decide how to choose the output token from these probabilities, there are at least 5 methods: Greedy, Beam Search, Sampling with temperature, Top-k sampling and Top-p (nucleus) sampling.

In this tutorial, we will test only 2 of these text generation methods: Top-k sampling
and Top-p (nucleus) sampling.

Note: to get more information on text generation techniques for transformer-based language models, read the article “How to generate text: using different decoding methods for language generation with Transformers” from Patrick von Platen (Hugging Face, 03/18/2020).
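To make the difference between these decoding methods concrete, here is a minimal, hedged sketch using the generic generate() API of Transformers (model_pt, tokenizer_pt and input_ids are assumed to be the fine-tuned Portuguese model, its tokenizer and an encoded prompt, as in the notebook):

# Greedy decoding: always pick the most probable next token (deterministic)
greedy = model_pt.generate(input_ids, max_length=100, do_sample=False)

# Top-k sampling: sample the next token among the 40 most probable ones
topk = model_pt.generate(input_ids, max_length=100, do_sample=True, top_k=40)

# Top-p (nucleus) sampling: sample among the smallest set of tokens whose
# cumulative probability exceeds 0.95, with a softened distribution
topp = model_pt.generate(input_ids, max_length=100, do_sample=True,
                         top_p=0.95, top_k=50, temperature=0.7)

print(tokenizer_pt.decode(greedy[0], skip_special_tokens=True))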

(Use case 1) Top-k sampling

Our use case 1 follows the same method used by OpenAI on page 20 of the paper Language Models are Unsupervised Multitask Learners, by choosing the Top-k sampling text generation technique with k = 40.

This text generation method is implemented in the model.generate() function of a Transformers model thanks to the following argument:

top_k (int): the number of highest-probability vocabulary tokens to keep for top-k filtering. Between 1 and infinity. Defaults to 50.

(Use case 2) Top-p (nucleus) sampling


Our use case 2 follows the top-p (nucleus) sampling method with Top-p sampling
(top_p = 0.95), top-k sampling (top_k = 50), temperature (temperature = 0.7) and
repetition penalty (repetition_penalty = 1.2).

This text generation method is implemented in the model.generate() function of a Transformers model thanks to the following arguments:

top_p (float): the cumulative probability threshold for nucleus sampling: only the highest-probability vocabulary tokens whose cumulative probability reaches top_p are kept. Must be between 0 and 1. Defaults to 1.

top_k (int): the number of highest-probability vocabulary tokens to keep for top-k filtering. Between 1 and infinity. Defaults to 50.

temperature (float): the value used to modulate the next token probabilities. Must be strictly positive. Defaults to 1.0.

repetition_penalty (float): the parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Defaults to 1.0.

Text n°1 | Famous OpenAI generated text about unicorns


When GPT-2 was published in the article “Better Language Models and Their Implications” (02/14/2019), the media focused, among its various NLP capabilities, on text generation, because of the now famous text generated about unicorns from this short paragraph: “In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.”

Famous text on unicorns generated by English GPT-2 from OpenAI (sources: sample 1 and page 20 from “Language
Models are Unsupervised Multitask Learners”)

Get translated famous unicorn text in Portuguese

By using the MarianMT English-to-Portuguese translator available in the Transformers library of Hugging Face, we got the Portuguese version of this text:
“Em um achado chocante, o cientista descobriu um rebanho de unicórnios vivendo
em um vale remoto, anteriormente inexplorado, nas Montanhas dos Andes. Ainda
mais surpreendente para os pesquisadores foi o fato de que os unicórnios falavam
inglês perfeito.”

from transformers import MarianMTModel, MarianTokenizer

# English source text, prefixed with the Brazilian Portuguese target language code
src_text = [
    '>>pt_BR<< In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.',
]

# English-to-Romance-languages translation model from Helsinki-NLP
model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer_en_pt = MarianTokenizer.from_pretrained(model_name)
print(tokenizer_en_pt.supported_language_codes)
model_en_pt = MarianMTModel.from_pretrained(model_name)

# Translate and decode
translated = model_en_pt.generate(**tokenizer_en_pt.prepare_translation_batch(src_text))
tgt_text = [tokenizer_en_pt.decode(t, skip_special_tokens=True) for t in translated]
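Before generating, the translated Portuguese paragraph has to be encoded into input ids for the fine-tuned model. The notebook does this with its own variable names; the sketch below is hedged: tokenizer_pt and model_pt are assumed to be the fine-tuned Portuguese tokenizer and GPT-2 model, and max_length is an illustrative value.

import torch

# Encode the translated Portuguese prompt (the only element of tgt_text)
input_ids = tokenizer_pt.encode(tgt_text[0], return_tensors="pt")

# Illustrative generation length; the notebook may use another value
max_length = 500

# Run generation on GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model_pt.to(device)
input_ids = input_ids.to(device)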

Get generated text

Use case 1 (Top-k sampling)

The code is:

# set top_k = 40 and num_return_sequences = 3
sample_outputs = model_pt.generate(input_ids,
                                   pad_token_id=50256,
                                   do_sample=True,
                                   max_length=max_length,
                                   min_length=max_length,
                                   top_k=40,
                                   num_return_sequences=3)

for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer_pt.decode(sample_output.tolist())))
    print('\n---')

The best text among the 3 generated is:

Num achado chocante, o cientista descobriu uma manada de unicórnios


vivendo num vale remoto, anteriormente inexplorado, nas Montanhas
dos Andes. Ainda mais surpreendente para os pesquisadores foi o fato
de que os unicórnios falavam inglês perfeito. "Não é mais estranho
que a nossa forma tivesse o inglês com dois de suas asas como se o
macho fosse inglês — o que é interessante. Mas a sua natureza inata
seria estranha para o inglês", acredita eles.

Em 2015, cientistas realizaram uma nova análise sobre as formas dos


unicórnios. De acordo com especialistas na área, os membros
superiores do grupo foram provavelmente derivados de outra espécie
de escorpião — uma espécie com características semelhantes. Uma nova
equipe de cientistas calculou que uma fêmea unicornada da Eurásia
seria originalmente uma humana. "Isto significa que o ancestral do
unicórnio, um híbrido de um esquilo e um escorpião macho não nasceu.

Um estudo recente estimou que cerca de 12% do corpo humano é


composto por membros de qualquer um dos grupos mais diverso de
animais extintos, incluindo o ser humano e o unicórnio-do-sul. "A


análise dos dados mostra que a maioria dos membros do gênero é
composta por um exito e um exito macho que compartilham uma única
espécie de corpo. Em alguns casos, estes membros compartilham um
mesmo conjunto (a linhagem) de partes em dois espécimes." O estudo
indica que os membros de "P. rubi" são semelhantes em aparência e
morfologia aos membros humanos modernos, como as fêmeas modernas e
machos robustos. "Como é evidente com os membros de "P. rubi", os
ancestrais e o ancestral eram similares na forma e na composição das
semelhanças em um organismo."

O DNA do "P. rubi," chamado por sua forma em inglês de


"sonoroplasto", revela que o "sonoroplasto" inclui três genes de
alto nível e quatro genes relativamente reduzidos (e ausentes) e um
gene de baixo nível (e ausentes) com uma concentração de cloroplasto
em cada núcleo. A "sonoroplasto" se assemelha à "P. rubi" em
características morfológicas e comportamentais, embora as diferenças
na morfologia sejam menores. "A espécie "P. rubi" apresenta cinco
pares de cromossomos separados (com 6 pares se aproximando e 8 pares
se afastando) e um "sonoroplasto de base" (com 12 pares se
aproximando e 15 pares se afastando), sugerindo que o membro tenha
uma composição semelhante ao ancestral "P. rubi".

Use case 2 (Top-p nucleus sampling)

The code is:

# set top_p = 0.95, top_k = 50, temperature = 0.7, repetition_penalty = 1.2 and num_return_sequences = 3
sample_outputs = model_pt.generate(input_ids,
                                   pad_token_id=50256,
                                   do_sample=True,
                                   max_length=max_length,
                                   min_length=max_length,
                                   repetition_penalty=1.2,
                                   temperature=0.7,
                                   top_k=50,
                                   top_p=0.95,
                                   num_return_sequences=3)

for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\n\n{}".format(i+1, tokenizer_pt.decode(sample_output.tolist())))
    print('\n---')

The best text among the 3 generated is:

Num achado chocante, o cientista descobriu uma manada de unicórnios


vivendo num vale remoto, anteriormente inexplorado, nas Montanhas
dos Andes. Ainda mais surpreendente para os pesquisadores foi o fato


de que os unicórnios falavam inglês perfeito. Eles não sabiam onde
exatamente eram falantes nativos do idioma, e acreditaram que eles
simplesmente migraram das terras altas da região de Mendoza ao norte
como consequência do declínio populacional que ocorreu na
Cordilheira das Cobras.

Em 2004, o Departamento de Antropologia da Universidade do Colorado


anunciou que havia encontrado uma fêmea no vale do rio Orinoco na
Bolívia, mas essa fêmea foi morta durante a investigação. No
entanto, no início de 2006, as autoridades locais anunciaram que
havia identificado uma fêmea encontrada em uma área próxima à
Cordilheira dos Andes, no Vale do Cauca. A equipe de pesquisadores
relatou que esta fêmea era chamada de "El Maria" ou "El Maria".

O estudo revelou que o grupo de unicórnios habitava um ecossistema


bastante diverso, com espécies endêmicas incluindo espécies como as
tiláceas gigantescas (que são encontradas principalmente nos países
subdesenvolvidos) e as quelupus ("Erica azoricae").

Um dos principais objetivos do estudo da espécie é determinar se os


europeus teriam colonizado a região entre a década de 1940 e 1960 e
se estes últimos grupos étnicos sobreviveram até hoje. Os cientistas
acreditam que as populações desses grupos poderiam ter sido muito
maiores antes disso; por exemplo, a teoria sugere que a população
europeia provavelmente teria introduzido os humanos primitivos na
América Central depois que os espanhóis invadiram a região, embora
isso seja controverso.

O gênero "El Maria" tem um ancestral comum, os "Looney-do-the-Bone",


um pequeno grupo de "Looney-da-Daíndia" encontrados apenas no leste
dos Estados Unidos, Canadá e México. O gênero possui parentesco
próximo ao gênero "Lontrapyrus", também conhecido como lontras
negras. Acredita-se que esses indivíduos tenham migrado para o leste
dos Andes, atravessando regiões montanhosas do sul de América
Central e América Central.

Os membros desta família são geralmente confundidos com os lontras


brancos.

As fêmeas têm cerca de seis centímetros de comprimento, pesando de 9


quilogramas e medindo 11 cm de largura. A cabeça é branca, com
manchas escuras pretas escuras sobre seus flancos. As patas
posteriores podem ser amarelas, enquanto sua cauda pode estar preta
ou branca, dependendo da cor utilizada na identificação. As costas
apresentam quatro dedos dorsais bem desenvolvidas.

Text n°2 | Recent text on the coronavirus disease (Covid-19)


Among all the links presented by Google News for the keyword covid-19, we selected the article “Vacina contra coronavírus feita pela Rússia entra em ultima fase de testes” (uol, 07/13/2020) and copied/pasted its first paragraph as input for our GPorTuguese-2 model:

A Rússia está mais perto de se tornar o primeiro país a iniciar a


distribuição de uma vacina contra o coronavírus para a população. O
país anunciou hoje que concluiu parte dos testes clínicos
necessários para comprovar a eficácia da imunização desenvolvida por
iniciativa do governo russo. A expectativa é de que a distribuição
comece já em agosto.

Get generated text

Use case 1 (Top-k sampling)

The best text among the 3 generated is:

A Rússia está mais perto de se tornar o primeiro país a iniciar a


distribuição de uma vacina contra o coronavírus para a população. O
país anunciou hoje que concluiu parte dos testes clínicos
necessários para comprovar a eficácia da imunização desenvolvida por
iniciativa do governo russo. A expectativa é de que a distribuição
comece já em agosto.

Entre os primeiros casos confirmados de COVID-19 na Rússia estava um


homem russo com idade entre 50 e 49 anos, segundo o jornal " Moscow"
e o primeiro caso em 28 de fevereiro, a primeira no Hospital Pulkai.

No entanto, os primeiros casos foram mais sérios devido sua


"maturidade sem gravidade" e a falta de uma pessoa estar disponível
para testes de suas condições de vida. Entre os pacientes que foram
considerados estão funcionários de um hospital ou enfermeiros na
cidade de Moscou (ver abaixo) ou médicos. Depois de serem testados
negativos após um teste positivo, o paciente se recupera
completamente.

Os russos também anunciaram que serão realizados testes de


coronavírus de outros países, como a França, que inicialmente
acreditava que os vírus da gripe tinham sido transmitido pela Europa
(a expectativa é de 20 a 50 casos por dia). A situação foi resolvida
em 24 de fevereiro, quando o Ministério da Saúde confirmou sua
conclusão de que a COVID-19 é transmitida de via aérea.

Até ao dia do seu primeiro caso, a Rússia tinha o menor número de


funcionários e médicos com doença grave antes de o vírus ter se
tornado um vírus no país. Os números de funcionários com doença
grave não chegaram a ser confirmados, segundo o Ministério da Saúde,
devido a sua falta de apoio.

O vírus que está em curso no país é transmitido pela primeira vez


nos Estados Unidos, onde foi isolado em 14 de fevereiro. Em Portugal
o Ministério dos Negócios Estrangeiros declarou oficialmente em 11
de fevereiro que a COVID-19 está presente no país, mas não anunciou
nenhum impacto na prática. No entanto, o Ministério do Trabalho
declarou no dia seguinte que o coronavírus existe "em todo os países

que não têm regulamentação" e que nenhum "aedes", que era


identificado em 11 de fevereiro, tinha entrado na corrente
sanguínea.

Em 9 de março, autoridades russas confirmaram que o paciente não


está hospitalizado e que está em quarentena no Hospital Pulkai.

Mais dois casos de COVID-19 foram anunciados em 9 de março, mas


foram considerados casos de "propaganda" e não de risco.

Em 9 de maio, autoridades russas confirmaram os confirmados em todo


país.

Use case 2 (Top-p nucleus sampling)

The best text among the 3 generated is:

A Rússia está mais perto de se tornar o primeiro país a iniciar a


distribuição de uma vacina contra o coronavírus para a população. O
país anunciou hoje que concluiu parte dos testes clínicos
necessários para comprovar a eficácia da imunização desenvolvida por
iniciativa do governo russo. A expectativa é de que a distribuição
comece já em agosto.

A Organização Mundial de Saúde (OMS) estima que, no final de


dezembro de 2015, havia pelo menos 50 milhões de pessoas infectadas
com o vírus na Rússia e Ucrânia. Estimativas semelhantes foram
feitas pela OMS sobre os casos registrados desde a década passada.
Em março de 2018, o Ministério da Saúde confirmou a descoberta de um
novo coronavírus e recomendou aos cidadãos russos manter a
disposição para evitar viagens ao exterior após o início das aulas
médicas ou exames médicos.

Em novembro de 2016, o Ministro da Saúde russo confirmou que todos


os indivíduos infectados poderiam ser testados em março; assim como
seus familiares e amigos, eles podem fazer teste clínico em uma
unidade cirúrgica do hospital da cidade em que são colocados, caso
necessário. Também foi anunciado no mesmo dia que o Ministério do
Trabalho revelou que as autoridades russas estão trabalhando em
conjunto visando reduzir o número de mortes causadas pelas
epidemias. A agência informou que a Agência Nacional de Vigilância
Sanitária Russa (Anvisa) começou a monitorar a pandemia através de
máscaras faciais nos hospitais.

Em abril de 2019, o Ministério da Saúde divulgou que 582 mil pessoas


haviam sido infetadas com o vírus no país entre janeiro de 2019 e
maio de 2020. Cerca de 370 mil desses pacientes estariam diretamente
relacionados à doença.

A Rússia também tem planos promissores para produzir vacinas que


sejam eficazes contra o coronavírus, incluindo a vacina anti-SIDA e
antivirais (ver Lista Vermelha da OMS).

Em 1º de julho de 2017, a Organização Mundial de Saúde lançou uma


nota oficial alertando que "um grande aumento pode vir da
necessidade de medidas preventivas necessárias" antes do início das
aulas médicas em escolas públicas nas cidades ucranianas.

Em outubro de 2017, a Secretaria Municipal de Saúde ucraniana


publicou uma nota oficial informando que os profissionais
responsáveis pela coordenação da vacinação deveriam estar
preparados, bem como suas famílias e amigos durante a realização de
exames adicionais para determinar sua saúde mental.

Em 30 de junho de 2019, o Ministério da Saúde lançou um comunicado


afirmando que três grupos escolares teriam dificuldades de
administrar adequadamente o vacina contra o coronavírus na Rússia.

Conclusion
We were, first of all, pleasantly surprised by the efficiency of fine-tuning to Portuguese an English pre-trained transformer-based language model like GPT-2 small.

In about 1 day using 1 GPU and a little over 1 GB of Portuguese texts, we managed to
obtain a GPorTuguese-2 capable of generating contextual Portuguese texts of a
level comparable to that of the GPT-2 used by OpenAI in 2019.

Happy.

The next step would be to apply our fine-tuning method to more recent NLP models like GPT-3, BART, T5 or Reformer. Shall we?

Annex | Other articles about fine-tuning GPT-2 to another language
Train a GPT-2 Transformer to write Harry Potter Books! (03/02/2020)

How to train a new language model from scratch using Transformers and
Tokenizers (02/14/2020)

How To Fine-Tune GPT-2 So You Can Generate Long-Form Creative Writing (11/19/2019)

Ensinando português ao GPT-2 (08/19/2019)

Russian GPT-2

About the author: Pierre Guillou is an AI consultant in Brazil and France, Deep Learning and NLP researcher at the AI Lab (UnB), and professor of Artificial Intelligence (UnB). Please contact him via his Linkedin profile.
