Faster Than Training From Scratch - Fine-Tuning The English GPT-2 in Any Language With Hugging Face and Fastai v2 (Practical Case With Portuguese)
The 3 main steps of fine-tuning the English GPT-2 to Portuguese with Hugging Face and fastai v2 (image edited from fast.ai NLP)
In this tutorial, instead of training from scratch, we will see how to fine-tune, in just over a day, on one GPU and with a little more than 1 GB of training data, an English pre-trained transformer-based language model to any other language. As a practical case, we fine-tune the English pre-trained GPT-2 to Portuguese by wrapping the Transformers and Tokenizers libraries of Hugging Face into fastai v2. We thus create a new language model: GPorTuguese-2, a language model for Portuguese text generation (and more NLP tasks…).
Other posts in the GPT-2 series: (NLP & fastai) GPT-2 | Byte-level BPE, an universal
tokenizer but… | GPT-2 use cases: beyond Text Generation | Fast pipeline to localize any
transformer-based model to any language | How to generate texts with a transformer-
based language model
Examples of texts generated by GPorTuguese-2 (Portuguese GPT-2 small) on Covid-19, Netflix, Artificial Intelligence and… unicorns
Acknowledgment
This tutorial was made possible thanks to the computing power of the AI Lab (University of Brasilia), to which I am attached as an Associate Researcher in NLP, and to the participation in the definition of the NLP strategy of its directors, Professors Fabricio Ataides Braz and Nilton Correia da Silva. Thank you so much!
And special thanks to Sylvain Gugger for his tutorial on Transformers and fastai v2
which is the basis of this tutorial.
I would also like to mention the Nama.ai R&D team and its CEO Rodrigo Scotti, who are taking part in AI research in Brazil to improve online services through the use of generative NLP models.
Table of contents
Acknowledgment
Results
About the need for language models not just in English… and how to do it in real life
Main coding steps to fine-tune a Hugging Face language model with fastai v2
Conclusion
huggingface.co
Results
Analysis of results
In a little more than a day (we used only one GPU NVIDIA V100 32GB; with a Distributed Data Parallel (DDP) training mode and just 2 GPUs, we could have divided this time by three, down to about 10 hours), we got a loss of 3.17, an accuracy of 37.99% and a perplexity of 23.76 (see the validation results table below and the explanations about perplexity at the end of this paragraph). Happy!
+------------+------+----------+------------+----------+-----------+
| after | loss | accuracy | perplexity | time | cumulative|
| ... epochs | | (%) | | by epoch | time |
+------------+------+----------+------------+----------+-----------+
| 0 | 9.95 | 9.90 | 20950.94 | 00:00:00 | 00:00:00 |
| 1 | 3.64 | 32.52 | 38.12 | 5:48:31 | 5:48:31 |
| 2 | 3.30 | 36.29 | 27.16 | 5:38:18 | 11:26:49 |
| 3 | 3.21 | 37.46 | 24.71 | 6:20:51 | 17:47:40 |
| 4 | 3.19 | 37.74 | 24.21 | 6:06:29 | 23:54:09 |
| 5 | 3.17 | 37.99 | 23.76 | 6:16:22 | 30:10:31 |
+------------+------+----------+------------+----------+-----------+
Fine-tuning of GPT-2 into Portuguese
Table of training and validation results
After a huge gain at the end of the first epoch (see the validation results graph below), the validation accuracy keeps improving until the end of training, but more slowly (it reaches nearly 40%, which is considered a good performance for a language model — check the notebooks nn-vietnamese.ipynb and nn-turkish.ipynb from Jeremy Howard of fastai).
Validation loss and accuracy of the pre-trained English GPT-2 of Hugging Face fine-tuned to Portuguese with fastai v2
The perplexity evolution graph of the validation dataset confirms that the fine-tuning
of the vocab and position embedding matrix in the first epoch brought a very
significant gain.
Validation perplexity of the pre-trained English GPT-2 of Hugging Face fine-tuned to Portuguese with fastai v2
Our results validate the importance of first training the embedding matrices (vocab and position) before fine-tuning the 3 layer groups of decoder blocks (each with 4 decoder blocks).
However, we can already compare our 23.76 perplexity to, for example, the 25.6 from the Transformers Tutorial, about which Sylvain Gugger writes “25.6 as perplexity is kind of amazing” (zero-shot perplexity of the English GPT-2 with its BBPE tokenizer on the WikiText2 corpus), or to the 29.41 from the original GPT-2 paper (zero-shot perplexity of the English GPT-2 with a BPE tokenizer, not a BBPE one, on the WikiText2 corpus).
Looks good!
Perplexity table from the original GPT-2 paper (Language Models are Unsupervised Multitask Learners)
About the need for language models not just in English… and how
to do it in real life
Even if English is today the most spoken language in the world (around 1.2 billion
people), the world is multilingual (for example, there are 34 languages having 45
million or more total speakers in the 2019 edition of Ethnologue, a language reference
published by SIL International).
This is a color coded diagram to indicate the percentage of English speakers of nearly all the world’s countries. A
few small islands have not been accounted for. (image source: List of countries by English-speaking population in
Wikipedia)
While it is easy to find, for example, a language model pre-trained in English via the Transformers library of Hugging Face, it is often much more difficult to find online a model trained in another language.
Fast pipeline to localize any transformer-based model (here, a language model) to any language, for
example in Portuguese (image edited from fast.ai NLP)
For example, to obtain a Portuguese GPT-2, we could download from the Transformers
library of Hugging Face the OpenAI GPT-2 pre-trained in English and the MarianMT
translator (we could also use BART or T5 for the translation) in order to create the
following pipeline:
So, for free and with only a few lines of code, we can get any language model in any
language, and even any task-oriented NLP model (classification, Q&A, synthesis, entity
searches, etc.) using the same pipeline. Not bad!
We will find the code of this pipeline and examples of use for text generation in the
post “Fast pipeline to localize any transformer-based model to any language”.
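As a rough sketch of this pipeline (the model names, the helper function and its parameters below are illustrative assumptions, not the code of that post), it could look like:

from transformers import MarianMTModel, MarianTokenizer, pipeline

def localize_generation(prompt_pt, max_length=100):
    # 1. translate the Portuguese prompt to English (assumed MarianMT checkpoint)
    tok_pt_en = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')
    mod_pt_en = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')
    batch = tok_pt_en.prepare_translation_batch([prompt_pt])
    prompt_en = tok_pt_en.decode(mod_pt_en.generate(**batch)[0], skip_special_tokens=True)

    # 2. generate a continuation with the English pre-trained GPT-2
    generator = pipeline('text-generation', model='gpt2')
    text_en = generator(prompt_en, max_length=max_length)[0]['generated_text']

    # 3. translate the generated English text back to Portuguese
    tok_en_pt = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')
    mod_en_pt = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')
    batch = tok_en_pt.prepare_translation_batch(['>>pt_BR<< ' + text_en])
    return tok_en_pt.decode(mod_en_pt.generate(**batch)[0], skip_special_tokens=True)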
However, the problem with this simple solution is that we depend on the quality of 2 pre-trained NLP models, which greatly increases the risk of losing the linguistic singularities and nuances of the desired language.

Therefore, it often becomes necessary to train one's own language model.
CamemBERT, the BERT in French, was trained on 38GB of raw text on 256 GPUs
(32 GB Tesla V100) for 1 day
RoBERTa was pre-trained for 24 hours on 1,024 (full size, 32GB) V100s… and we
are not talking about T5 or GPT-3 (175 billion parameters) whose computational
cost was estimated at $4.6 million! (“We are waiting for OpenAI to reveal more
details about the training infrastructure and model implementation. But to put things
into perspective, GPT-3 175B model required 3.14E23 FLOPS of computing for
training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud
pricing we could find, this will take 355 GPU-years and cost $4.6M for a single
training run.”)
NLP models through time, with their number of parameters (Image credit: TensorFlow blog)
This tutorial shows how to implement this second option; you will find examples of use for text generation in the paragraph “Text Generation by our Portuguese GPT-2” at the end of this tutorial.
Hugging Face
According to the HF official documentation, its libraries (Transformers and Tokenizers) were designed with two strong goals in mind:
the Transformers library is NOT a modular toolbox of building blocks for neural nets. If
you want to extend/build-upon the library, just use regular Python/PyTorch modules and
inherit from the base classes of the library to reuse functionalities like model
loading/saving.
Indeed, reading the new Hugging Face tutorials from June 2020 confirms that plain PyTorch must be used in order to train from scratch or fine-tune a pre-trained model of the Transformers library.
For example, the new Training and fine-tuning tutorial explains how to fine-tune a model in native PyTorch. It is very helpful, but how do you apply, for example, the 1cycle policy fine-tuning method? Or how do you easily freeze or unfreeze some layer groups, as in fastai v2 with the functions learn.unfreeze() and learn.freeze_to(), instead of writing full PyTorch code?
fastai v2
Therefore, despite the runnable .py files published by Hugging Face (for example, run_language_modeling.py for fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa)), when it becomes necessary to fine-tune a pre-trained model to another language and/or to another task, we need easy fine-tuning methods on top of regular Python/PyTorch modules in order to apply Transfer Learning and modern fine-tuning techniques.
fastai v2 brings, among others, the following training and fine-tuning techniques:
Learning rate finder (method that helps finding the best learning rate to train the
model)
Mixed precision training (some of the operations will be done in FP16, others in
FP32 in order to speed up the training)
Gradual unfreezing (layers groups are defined allowing to decide the layers to be
trained)
1cycle policy (the 1cycle policy was introduced by Leslie N. Smith et al. in Super-
Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. It
schedules the learning rate with a cosine annealing)
The original transformer model is made up of an encoder and a decoder (image credit: The illustrated GPT-2)
Thus, between the 2 historic transformer-based models, GPT-2 and BERT, we chose the GPT-2 model because, in early 2019, it strongly influenced minds beyond the circle of Deep Learning specialists by writing texts of a quality level close to that of humans. Today “exceeded” in number of parameters and performance by more recent models like BART, T5 and of course GPT-3 (175 billion parameters!), it remains a reference and a model used in research and applications.
(1/2) OpenAI GPT-2 is a transformer-based language model using only decoder blocks (image credit: The
illustrated GPT-2)
(2/2) OpenAI GPT-2 is a transformer-based language model using only decoder blocks (note: we use an input sequence of 1024, not 4000 — image credit: The illustrated GPT-2)
Note: for those who want to better understand how GPT-2 works, read the following posts:
About the version of GPT-2: there are 3 versions of the GPT-2 model (look at the transformers documentation for more details). Here, we use the small version, the one with the smallest number of weights (124 million, not 117 million as written in the original paper), but you can change the model used by changing the content of pretrained_weights (if it's not a GPT-2 model, you'll need to change the classes used for the model and the tokenizer, of course).
We used the English pre-trained GPT-2 small and its Byte-level BPE tokenizer in this tutorial (image credit:
The illustrated GPT-2)
Byte-level BPE
Note: to better understand what a Byte-level BPE tokenizer is, read this post: Byte-level BPE, an universal tokenizer but…
Main coding steps to fine-tune a Hugging Face language model with fastai v2

1. Initialization
2. Download Wikipedia in Portuguese
3. Download a GPT-2 English pre-trained model and train a GPT-2
tokenizer with a vocab in Portuguese
3.1 Get the pre-trained GPT-2 Tokenizer & Model (pre-trained with
an English corpus) from the Transformers library (Hugging Face)
3.2 Train a Byte-level BPE (BBPE) Tokenizer on the Portuguese
Wikipedia corpus by using the Tokenizers library (Hugging Face)
3.3 Import the tokenizer Portuguese config files into the pre-
trained GPT-2 Tokenizer
4. Create a fastai tokenizer and update the embedding matrix of the
GPT-2 English pre-trained model
4.1 GPT2TokenizerFast (imported GPT2 tokenizer) --> fastai
Tokenizer
4.2 Change vocab embedding in the GPT-2 pre-trained model to adapt
to the Portuguese vocab
5. Create fastai v2 Datasets and Dataloaders
6. Fine-tuning the model
6.1 Splitter (get layers groups)
6.2 Learner
6.2.1 Freeze all layers but the last layers group (wte, wpe
embedding matrices and last LayerNorm)
6.2.2 Freeze all layers but the last 2 layers groups
6.2.3 Freeze all layers but the last 3 layers groups
6.2.4 Unfreeze all layers
Fine-tuning the English GPT-2 to Portuguese with Hugging Face and fastai v2 in 3 main steps (image
edited from fast.ai NLP)
2. GPT-2 tokenizer with a Portuguese vocab (train a GPT-2 tokenizer with a vocab
in Portuguese, wrap it into a fastai v2 tokenizer and update the embedding matrix
of the GPT-2 English pre-trained model according to the new Portuguese vocab:
keep the embedding vectors of the common tokens between English and
Portuguese vocabs)
1. Initialization
# libraries installation
# fastai v2: read https://dev.fast.ai/#Installing
# tokenizers: !pip install tokenizers
# transformers: !pip install transformers
# import fastai v2
from fastai2.text.all import *
from nlputils_fastai2 import *
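2. Download Wikipedia in Portuguese

The download itself relies on the Wikipedia helpers of nlputils_fastai2. As a sketch (assuming helpers named get_wiki and split_wiki in the spirit of the fast.ai course nlputils; the names and paths below are illustrative):

lang = 'pt'
path_data = Path('data')
dest = path_data/f'{lang}wiki'
get_wiki(dest, lang)     # download and extract the Portuguese Wikipedia dump
split_wiki(dest, lang)   # split the dump into one file per article

This gives a corpus of roughly 1.6 GB of raw Portuguese text.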
This dataset size has to be compared to the 40 GB of WebText (text extracted from the Internet, excluding Wikipedia) used by OpenAI to train the English GPT-2 from scratch (see “About the English dataset used to train GPT-2” at the end of this paragraph).
# get all articles in one text file and one csv file
get_one_clean_file(dest,lang)
get_one_clean_csv_file(dest,lang)
Note: the text file (all the articles in one file) will allow the training of the Portuguese
tokenizer and the csv one will facilitate the tests of the study.
(source) The resulting dataset, WebText, contains the text subset of these 45 million links.
To extract the text from HTML responses we use a combination of the Dragnet (Peters
& Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a
preliminary version of WebText which does not include links created after Dec 2017 and
which after de-duplication and some heuristic based cleaning contains slightly over 8
million documents for a total of 40 GB of text. We removed all Wikipedia documents from
WebText since it is a common data source for other datasets and could complicate analysis
due to overlapping training data with test evaluation tasks.
1. Get the pre-trained GPT-2 Tokenizer & Model (pre-trained on an English corpus) from the Transformers library (Hugging Face): it will give us the tokenizer structure we need and the pre-trained model weights (it's better to start training our GPT-2 model in Portuguese from weights already trained, even in another language, than from random values).

2. Train a Byte-level BPE (BBPE) tokenizer on the Portuguese Wikipedia corpus by using the Tokenizers library (Hugging Face): it will give us the Portuguese vocabulary files (vocab.json, merges.txt).

3. Import the tokenizer Portuguese config files (vocab.json, merges.txt) into the pre-trained GPT-2 Tokenizer: it will give us a GPT-2 tokenizer structure with the vocab in Portuguese.
One relevant point is that we trained our Portuguese Byte-level BPE tokenizer on
Portuguese Wikipedia (here, 1.6 GB) in only 2min 7s. Thanks Hugging Face!
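As a sketch of that training step (assuming the ByteLevelBPETokenizer API of the Tokenizers library; the file and folder names below are illustrative):

from tokenizers import ByteLevelBPETokenizer

tokenizer_bbpe = ByteLevelBPETokenizer()
tokenizer_bbpe.train(
    files=[str(path_data/'all_texts_ptwiki.txt')],  # the "one text file" built above
    vocab_size=50257,                               # same vocab size as the English GPT-2
    min_frequency=2,
    special_tokens=['<|endoftext|>'],
)
# writes vocab.json and merges.txt, the two config files imported in step 3.3
# (depending on the tokenizers version, saving is done with save_model() or save())
tokenizer_bbpe.save_model(str(path_data/'ByteLevelBPE_tokenizer_pt'))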
from transformers import GPT2TokenizerFast

pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tokenizer_en.pad_token = tokenizer_en.eos_token
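The corresponding pre-trained model weights are loaded the same way (a sketch; GPT2LMHeadModel is the Transformers class for GPT-2 with a language modeling head):

from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained(pretrained_weights)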
4. Create a fastai tokenizer and update the embedding matrix of the GPT-2
English pre-trained model
Now let’s see how we can use fastai v2 to fine-tune this model on Wikipedia in
Portuguese, using all the fastai v2 training and fine-tuning utilities.
(text from Sylvain Gugger's Transformers Tutorial) To process this data to train a model, we need to build a Transform that will be applied lazily. In a fastai Transform you can define (see the sketch after this list):
an encodes method that is applied when you call the transform (a bit like the
forward method in a nn.Module )
a decodes method that is applied when you call the decode method of the
transform, if you need to decode anything for showing purposes (like converting
ids to a text here)
a setups method that sets some inner state of the Transform (not needed here)
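A minimal sketch of such a Transform, along the lines of Sylvain Gugger's tutorial (the tutorial's exact class may differ in details):

class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))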
tokenizer_fastai_en = TransformersTokenizer(tokenizer_en)
tokenizer_fastai_pt = TransformersTokenizer(tokenizer_pt)
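The code that builds the new vocab embedding matrix is summarized below as a sketch (not the tutorial's exact code; it assumes the tokenizers expose a get_vocab() method): tokens shared by the English and Portuguese vocabs keep their pre-trained vectors, the other tokens start from the mean embedding vector.

old_wgts = model.transformer.wte.weight.clone().detach()  # English vocab embeddings
wgts_m = old_wgts.mean(0)                                  # mean embedding vector
new_wgts = old_wgts.new_zeros(tokenizer_pt.vocab_size, old_wgts.size(1))

old_vocab = tokenizer_en.get_vocab()  # token -> id (English)
new_vocab = tokenizer_pt.get_vocab()  # token -> id (Portuguese)

same_tokens_list, different_tokens_list = [], []
for tok, idx_new in new_vocab.items():
    idx_old = old_vocab.get(tok, -1)
    if idx_old >= 0:
        new_wgts[idx_new] = old_wgts[idx_old]  # keep the pre-trained vector
        same_tokens_list.append((tok, idx_new))
    else:
        new_wgts[idx_new] = wgts_m             # initialize with the mean vector
        different_tokens_list.append((tok, idx_new))

# put the new matrix into the model; the lm_head linear layer shares these weights
model.transformer.wte.weight = nn.Parameter(new_wgts)
model.lm_head.weight = model.transformer.wte.weight
model.config.vocab_size = tokenizer_pt.vocab_size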
# save new_wgts
torch.save(new_wgts, path_data/'new_wte_wgts.pt')

# save same_tokens_list and different_tokens_list
torch.save(same_tokens_list, path_data/'same_tokens_list.pt')
torch.save(different_tokens_list, path_data/'different_tokens_list.pt')
We indicate the indices of the training dataset and the validation dataset with splits
(here, 80% of the indices randomly chosen, then all the remaining indices).
# train = 80%, validation = 20%
num = int(0.8*len(df))
idxs = np.random.permutation(len(df))  # shuffled indices, without repetition
idxs_train = idxs[:num]
idxs_val = idxs[num:]
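The texts are then wrapped into fastai TfmdLists with the tokenizer Transform (a sketch following Sylvain Gugger's tutorial; the 'text' column name of the dataframe is an assumption):

splits = [list(idxs_train), list(idxs_val)]
tls = TfmdLists(df['text'].values, TransformersTokenizer(tokenizer_pt),
                splits=splits, dl_type=LMDataLoader)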
(text from Sylvain Gugger Transformers Tutorial) The fastai v2 library expects the data
to be assembled in a DataLoaders object (something that has a training and validation
dataloader). We can get one by using the dataloaders method. We just have to specify
a batch size and a sequence length:
Let's use a batch size of 8 (a higher value gives a “CUDA out of memory” error on our single GPU).
Since the GPT-2 model was trained with sequences of size 1024, we use this sequence length (it's a stateless model, so the perplexity would change if we used a shorter sequence).
bs,sl = 8,1024
dls = tls.dataloaders(bs=bs, seq_len=sl)
Here we need to write the event after_pred and replace self.learn.pred (which contains the predictions that will be passed to the loss function) by just its first element. In callbacks, there is a shortcut that lets you access any of the underlying Learner attributes, so we can write self.pred[0] instead of self.learn.pred[0]. That shortcut only works for read access, not write, so we have to write self.learn.pred when assigning (otherwise we would set a pred attribute in the Callback).
class DropOutput(Callback):
def after_pred(self): self.learn.pred = self.pred[0]
The model has 2 main layer groups (or parameter groups): transformer and lm_head. As we can read in The illustrated GPT-2, lm_head is a copy of the vocab embedding matrix wte, used to get, after the softmax, the probability of each token in the vocab. Therefore, we only need to split the transformer layer group to get all the layers.
(model structure printout: transformer, lm_head with the last LayerNorm)
Now, we can create our layer groups, which will allow us to use all the fastai v2 fine-tuning techniques. Moreover, we decided to follow the fine-tuning method shown for text classification training in the notebook 10_nlp.ipynb by creating 4 layer groups: 3 groups of 4 decoder blocks each, and one embedding group with the wte and wpe matrices (and the last LayerNorm).
def splitter(model):
    "Split a GPT-2 `model` into 4 groups for differential learning rates."
    groups = [nn.Sequential(*model.transformer.h[:4]),    # decoder blocks 0-3
              nn.Sequential(*model.transformer.h[4:8]),   # decoder blocks 4-7
              nn.Sequential(*model.transformer.h[8:12]),  # decoder blocks 8-11
              nn.Sequential(model.transformer.wte, model.transformer.wpe,
                            model.transformer.ln_f)]      # embeddings + last LayerNorm
    return L(groups).map(params)
6.2 Learner
(text from Sylvain Gugger's Transformers Tutorial) Now we are ready to create our Learner, which is a fastai object grouping data, model and loss function, and which handles accuracy and perplexity as metrics; we also need to use the callback we just defined. Lastly, we use mixed precision to save every bit of memory we can (and, if you have a modern GPU, it will also make training faster).
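As a sketch of that Learner creation (following the tutorial, with the splitter and the metrics described in this post; the exact arguments may differ slightly):

learn = Learner(dls, model,
                loss_func=CrossEntropyLossFlat(),
                splitter=splitter,
                cbs=[DropOutput],
                metrics=[accuracy, Perplexity()]).to_fp16()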
We can check how good the model is without any fine-tuning step by running learn.validate(). In 53min 2s, we got a loss of 9.95, an accuracy of 9.90% and a perplexity of 20950.94 (the first row of the results table above).

Not bad: nearly 10% accuracy without any fine-tuning! It means we start our journey to a GPT-2 in Portuguese with a language model that already has a strong knowledge of language rules (its weights) and a basic knowledge of Portuguese (25% of its vocab embedding matrix).
Now that we have a Learner , we will use during training all the fastai v2 fine-tuning
techniques seen for text classification training (see the notebook 10_nlp.ipynb about
"NLP Deep Dive: RNNs") to take advantage of the Transfer Learning of the GPT-2 pre-
trained embedding matrices and model from Hugging Face Transformers:
Learning rate finder (method that helps finding the best learning rate to train the
model)
Mixed precision training (some of the operations will be done in FP16, others in
FP32 in order to speed up the training)
Gradual unfreezing (the model has 4 layers groups created by our method
splitter : the embedding one and the 3 groups of 4 decoder blocks each)
1cycle policy with the method fit_one_cycle() (the 1cycle policy was introduced by Leslie N. Smith et al. in Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. It schedules the learning rate with a cosine annealing over the course of the training. You can optionally pass additional cbs and reset_opt.)
Differential learning rates (each layer group with a different learning rate: the biggest one for the embedding group, and the smallest one for the first 4 decoder blocks)
6.2.1 Freeze all layers but the last layers group (do not freeze the wte, wpe embedding matrices and the last LayerNorm)
learn.freeze()
learn.summary()
================================================================
Layer (type)         Output Shape         Param #    Trainable
================================================================
Dropout              8 x 1024 x 768       0          False
LayerNorm            8 x 1024 x 768       1,536      False
Conv1D               8 x 1024 x 3072      2,362,368  False
Conv1D               8 x 1024 x 768       2,360,064  False
Dropout              8 x 1024 x 768       0          False
LayerNorm            8 x 1024 x 768       1,536      False
Conv1D               8 x 1024 x 2304      1,771,776  False
Conv1D               8 x 1024 x 768       590,592    False
Dropout              8 x 12 x 1024 x 1024 0          False
________________________________________________________________
(... the same pattern of frozen layers repeats for each of the 12 decoder blocks ...)
________________________________________________________________
LayerNorm            8 x 1024 x 768       1,536      True
Linear               8 x 1024 x 50257     38,597,376 True
________________________________________________________________

Callbacks:
  - DropOutput
  - ModelToHalf
  - TrainEvalCallback
  - Recorder
  - ProgressCallback
  - MixedPrecision
The learn.summary() method gives almost the right numbers. In fact, it counts twice
the weights of the wte matrix (vocab embedding matrix) because they are duplicated
in the weights of the output linear layer.
Now, let’s choose the best learning rate to launch the fine-tuning of the Portuguese
GPT-2 thanks to the fastai v2 learning rate finder.
learn.lr_find()
Results from learn.lr_find() before starting to train the Portuguese GPT-2

The learning rate finder curve suggests a learning rate minimum of 6e-3. Let's use 2e-3, which seems to give the sharpest decrease in validation loss according to the previous graph.
learn.freeze()
learn.fit_one_cycle(1, 2e-3)

epoch   train_loss   valid_loss   accuracy   perplexity   time
0       3.803344     3.640777     0.325177   38.121441    5:48:31
We can trace the training and validation loss curves thanks to the fastai v2 loss plotting
function in order to visually verify the strong improvement of our model (i.e. the strong
reduction in training and validation losses).
learn.recorder.plot_loss()
Evolution of training and validation losses during the first fine-tuning epoch of the Portuguese GPT-2
Now, we can pass -2 to freeze_to to freeze all except the last two layer groups (note that learn.freeze() is equivalent to learn.freeze_to(-1)).
learn.freeze_to(-2)
learn.summary()
Again, the learn.summary() method gives almost the right numbers. In fact, it counts twice the weights of the wte matrix (vocab embedding matrix) because they are duplicated in the weights of the output linear layer.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-3/(2.6**4),1e-3))

epoch   train_loss   valid_loss   accuracy   perplexity   time
0       3.453913     3.301886     0.362879   27.163816    5:38:18
learn.recorder.plot_loss()
Evolution of training and validation losses during the second fine-tuning epoch of the Portuguese GPT-2
Let's go on by passing -3 to freeze_to to freeze all except the last three layer groups.
learn.freeze_to(-3)
learn.summary()
The learn.summary() method gives almost the right numbers. In fact, it counts twice
the weights of the wte matrix (vocab embedding matrix) because they are duplicated
in the weights of the output linear layer.
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-4/(2.6**4),5e-4))

epoch   train_loss   valid_loss   accuracy   perplexity   time
0       3.333389     3.207390     0.374579   24.714487    6:20:51
learn.recorder.plot_loss()
Evolution of training and validation losses during the third fine-tuning epoch of the Portuguese GPT-2
Let's finish our work by unfreezing all layer groups, which means all parameters of the Portuguese GPT-2 model.
learn.unfreeze()
learn.summary()
One more time, the learn.summary() method gives almost the right numbers. In fact, it
counts twice the weights of the wte matrix (vocab embedding matrix) because they
are duplicated in the weights of the output linear layer.
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-4/(2.6**4),1e-4))

epoch   train_loss   valid_loss   accuracy   perplexity   time
0       3.288433     3.186721     0.377380   24.208906    6:06:29
1       3.232569     3.167864     0.379885   23.756687    6:16:22
learn.recorder.plot_loss()
Evolution of training and validation losses during the fourth and fifth fine-tuning epochs of the Portuguese GPT-2
Following the fastai v2 text classification fine-tuning strategy, and given our very good results (37.99% accuracy and 23.76 perplexity), we decided to stop fine-tuning the Portuguese GPT-2 at the end of these 5 epochs.
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
model = AutoModelWithLMHead.from_pretrained("pierreguillou/gpt2-small-portuguese")
For now, let's use it to generate new texts, which allows us to check that it works properly and also to have a little fun.

In this tutorial, we test only 2 of these text generation methods: Top-k sampling and Top-p (nucleus) sampling.

Our use case 1 follows the same method used by OpenAI on page 20 of the paper Language Models are Unsupervised Multitask Learners, by choosing the Top-k sampling text generation technique with a value of 40 (a sketch of such a generate() call is shown after the parameter list below).
top_k (int): the number of highest probability vocabulary tokens to keep for top-k filtering. Between 1 and infinity. Defaults to 50.

top_p (float): the cumulative probability threshold for nucleus (top-p) filtering. Must be between 0 and 1. Defaults to 1.0.

temperature (float): the value used to modulate the next token probabilities. Must be strictly positive. Defaults to 1.0.

repetition_penalty (float): the parameter for repetition penalty. Between 1.0 and infinity. 1.0 means no penalty. Defaults to 1.0.
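As a rough sketch of such generate() calls (the prompt and the parameter values below are illustrative, not the tutorial's exact code):

prompt = "Quem era Jim Henson? Jim Henson era um"  # illustrative Portuguese prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# use case 1: Top-k sampling with k=40, as in the OpenAI paper
outputs = model.generate(input_ids, do_sample=True, max_length=200, top_k=40)

# use case 2: Top-p (nucleus) sampling, optionally combined with top_k and temperature
outputs = model.generate(input_ids, do_sample=True, max_length=200,
                         top_k=0, top_p=0.92, temperature=0.7, repetition_penalty=1.2)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))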
Famous text on unicorns generated by English GPT-2 from OpenAI (sources: sample 1 and page 20 from “Language
Models are Unsupervised Multitask Learners”)
from transformers import MarianTokenizer, MarianMTModel

src_text = [
    '>>pt_BR<< In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.',
]

model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
tokenizer_en_pt = MarianTokenizer.from_pretrained(model_name)
print(tokenizer_en_pt.supported_language_codes)

model_en_pt = MarianMTModel.from_pretrained(model_name)
translated = model_en_pt.generate(**tokenizer_en_pt.prepare_translation_batch(src_text))
tgt_text = [tokenizer_en_pt.decode(t, skip_special_tokens=True) for t in translated]
Conclusion
We are, first of all, pleasantly surprised by the efficiency of fine-tuning to Portuguese an English pre-trained transformer-based language model like GPT-2 small.

In about 1 day, using 1 GPU and a little over 1 GB of Portuguese texts, we managed to obtain GPorTuguese-2, a model capable of generating contextual Portuguese texts of a level comparable to that of the GPT-2 used by OpenAI in 2019.

Happy.

The next step would be to apply our fine-tuning method to more recent NLP models like GPT-3, BART, T5 or Reformer. Let's do it?
About the author: Pierre Guillou is an AI consultant in Brazil and France, a Deep Learning and NLP researcher at the AI Lab (UnB), and a professor of Artificial Intelligence (UnB). Please contact him via his LinkedIn profile.