
[Notes] Improving Language Understanding by Generative Pre-Training
Exercise: Reconstructing the Language Model from the Fine-Tuned Model

Ceshine Lee · Published in Veritable · Aug 7, 2018



Introduction
The field of NLP has been creating some buzz this year (in case you haven't heard, NLP's ImageNet moment has arrived). When I finally found some time to read some of the new papers, I felt this one by OpenAI [1] was interesting enough to dig deeper into:


Improving Language Understanding with Unsupervised Learning
We've obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system…
blog.openai.com

It expands on the work by Howard and Ruder [2]. The main differences include:

1. Use transformer networks [3] instead of LSTMs to better capture long-term linguistic structure.

2. Demonstrate the effectiveness of the approach on a wider range of tasks.

3. Include auxiliary training objectives (e.g. language modeling) in addition to the task objective when fine-tuning.

(The transformer network used is a variant called the Transformer decoder, proposed in [4].)


General Model Structure. Taken from [1]

The Exercise
The author open-sourced a GitHub repo containing the code (using TensorFlow) and the pre-trained model weights needed to reproduce the results (currently only the ROCStories Cloze dataset [5, 6] is supported as the fine-tuning target). According to the paper, the pre-trained language model is trained on the BooksCorpus dataset [7].

Honestly, the code could use some work to improve its readability, and the low-level TensorFlow APIs it uses are notoriously hard to comprehend and modify (by the way, there is a PyTorch implementation available). So I gave myself an exercise: reconstruct the pre-trained language model. Since a neural network can be entirely ruined by very subtle bugs or mistakes, you need to understand the code and the model to ensure the final model produces the correct results.

The reconstructed language model can be used to:

1. Inspect what the language model has learned. Feed it some texts and see
what prediction it makes.

2. Re-train the language model with a different corpus.

The rest of this post covers the key ideas and code chunks that are essential to completing this exercise. The full code is published as a fork of the original OpenAI repo (check out the notebook in the root folder first):

ceshine/finetune-transformer-lm
finetune-transformer-lm - Code and model for the paper
"Improving Language Understanding by Generative Pre-Training"
github.com

Target Task: Story Cloze Test

Because the model published by the author specifically targets this kind of task, we need to understand what it is and how the model handles it first.

‘Story Cloze Test’ is a new commonsense reasoning framework for evaluating story understanding, story generation, and script learning. This test requires a system to choose the correct ending to a four-sentence story.

Taken from [5]

Model Structure for Multiple Choice Tasks. Taken from [1]

This is a binary choice problem, so we’ll create two sequences and feed
them to the same transformer network. Three new special tokens are
added to the vocabulary from the pre-trained model — <start>, <delim>,
and <extract>.
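To make the input layout concrete, here is a rough sketch of how one story becomes two candidate sequences (illustrative pseudo-tokens only, with made-up endings; the real code works with integer token ids, as shown in the transform_roc function below):

# Illustrative only: the endings below are invented for the example.
context = "karen was assigned a roommate her first year of college".split()
ending1 = "she and her roommate became good friends".split()
ending2 = "karen moved out the next day".split()

seq1 = ["<start>"] + context + ["<delim>"] + ending1 + ["<extract>"]
seq2 = ["<start>"] + context + ["<delim>"] + ending2 + ["<extract>"]
# Both sequences go through the same transformer; the hidden state at
# <extract> is what the classifier uses to score each ending.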

Tokenization

We used a bytepair encoding (BPE) vocabulary with 40,000 merges

Byte pair encoding [8] starts by treating individual characters as tokens, and then iteratively merges the most frequent token pair, N times in total. The resulting token vocabulary breaks rare words into several chunks consisting of more common character sequences.
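Here is a toy sketch of the merge procedure (not the repo's implementation, just the idea): start from characters and repeatedly merge the most frequent adjacent pair.

from collections import Counter

def learn_bpe_merges(words, n_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

# e.g. learn_bpe_merges(["lower", "lowest", "low", "newer", "wider"], 10)
# starts with merges ('w', 'e'), then ('l', 'o'), and so on.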

It is implemented in the TextEncoder class. With 40,000 merges, most common words are already combined into a single token. You probably don't need to worry about this part unless you want to train the model on other languages. The two things we'll be using are the TextEncoder.encode() method and the TextEncoder.decoder attribute.
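A minimal usage sketch (the file paths follow the defaults in the original repo's train.py; adjust them if your copy differs):

from text_utils import TextEncoder

# Assumes the encoder/BPE files shipped with the pre-trained model are present.
text_encoder = TextEncoder(
    "model/encoder_bpe_40000.json", "model/vocab_40000.bpe")

tokens = text_encoder.encode(
    ["Karen was assigned a roommate her first year of college."],
    verbose=False)
print(tokens[0])                                     # list of BPE token ids
print([text_encoder.decoder[t] for t in tokens[0]])  # the corresponding sub-word strings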

Transformation
After tokenization, the resulting tokens need to be arranged into NumPy arrays that are ready to be fed into the neural network as inputs. This is done in the transform_roc function:

def transform_roc(X1, X2, X3):
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    delimiter = encoder['_delimiter_']
    for i, (x1, x2, x3) in enumerate(zip(X1, X2, X3)):
        x12 = [start] + x1[:max_len] + [delimiter] + \
              x2[:max_len] + [clf_token]
        x13 = [start] + x1[:max_len] + [delimiter] + \
              x3[:max_len] + [clf_token]
        l12 = len(x12)
        l13 = len(x13)
        xmb[i, 0, :l12, 0] = x12
        xmb[i, 1, :l13, 0] = x13
        mmb[i, 0, :l12] = 1
        mmb[i, 1, :l13] = 1
    xmb[:, :, :, 1] = np.arange(
        n_vocab+n_special, n_vocab+n_special+n_ctx)
    return xmb, mmb

X1 is the context (4 sentences) tokens, X2 is the ending-1 tokens, and X3 is the ending-2 tokens. xmb contains both tokens and position indices and is shaped (batch_size, choice, sequence_length, token/position). mmb contains the masks that indicate whether the sequence has ended at each position (0 if ended), which will be used when calculating the language model losses.
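A quick shape check makes this concrete (a sketch; x1_tokens, x2_tokens and x3_tokens stand for the already-encoded token lists):

xmb, mmb = transform_roc(x1_tokens, x2_tokens, x3_tokens)
print(xmb.shape)  # (batch_size, 2, n_ctx, 2): [example, choice, position, token-or-position-index]
print(mmb.shape)  # (batch_size, 2, n_ctx): 1.0 up to each sequence's length, 0.0 afterwards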

Position indices are used in the transformer network to incorporate order information. Later, the indices are mapped to embeddings, as opposed to the sinusoidal function used in the original transformer:

We used learned position embeddings instead of the sinusoidal version proposed in the original work.
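Inside the network, both columns of xmb index into the same embedding matrix we, and the token and position embeddings are simply summed. A sketch of that lookup, which is roughly what the repo's embed() helper does:

# X is shaped (batch, n_ctx, 2): token ids in [..., 0], position ids in [..., 1].
# The position ids point at the last n_ctx rows of `we`, so a single gather
# fetches both embeddings and a sum over axis 2 combines them.
e = tf.gather(we, X)     # (batch, n_ctx, 2, n_embd)
h = tf.reduce_sum(e, 2)  # token embedding + learned position embedding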

Modifying the Transformation

The transformation function for the reconstructed language model:

def transform_texts(list_of_texts):
    tokens = TEXT_ENCODER.encode(list_of_texts, verbose=False)
    n_batch = len(tokens)
    xmb = np.zeros((n_batch, N_CTX, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, N_CTX), dtype=np.float32)
    for i, x in enumerate(tokens):
        x1 = x[:N_CTX]
        l1 = len(x1)
        print(f"length: {l1}")
        xmb[i, :l1, 0] = x1
        mmb[i, :l1] = 1
    xmb[:, :, 1] = np.arange(N_VOCAB, N_VOCAB+N_CTX)
    return xmb, mmb

This is really straightforward. It removes the extra choice dimension and takes out the special tokens. (The tokenization is done inside the function for convenience.)
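Usage then looks like this (a sketch):

xmb, mmb = transform_texts(
    ["Karen was assigned a roommate her first year of"])
print(xmb.shape)  # (1, N_CTX, 2): token ids in [..., 0], position ids in [..., 1]
print(mmb.shape)  # (1, N_CTX): 1.0 up to the sequence length, 0.0 afterwards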

The Model

Transformer Network
We're not going to cover the transformer in this post. The transformer blocks can be copied directly into the language model without any modification. Check out these two great resources if you're interested in the inner workings of the transformer:

The Illustrated Transformer
In the previous post, we looked at Attention - a ubiquitous method in modern deep learning models. Attention is a…
jalammar.github.io


The Annotated Transformer
nlp.seas.harvard.edu

Modifying the Input Layer

Code from the supervised model:

we = tf.get_variable(
    "we",
    [n_vocab+n_special+n_ctx, n_embd],
    initializer=tf.random_normal_initializer(stddev=0.02))
we = dropout(we, embd_pdrop, train)
X = tf.reshape(X, [-1, n_ctx, 2])
M = tf.reshape(M, [-1, n_ctx])

As we do not need the special tokens anymore, we need to remove n_special from the embedding matrix initialization.

There's no other modification needed, as X and M are reshaped and agnostic to the choice dimension.
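For reference, a sketch of what the modified initialization looks like in the reconstructed model (see the fork for the exact code):

# Same as the supervised version, minus the n_special rows: only the real
# vocabulary plus the N_CTX learned position embeddings remain.
we = tf.get_variable(
    "we",
    [N_VOCAB + N_CTX, N_EMBD],
    initializer=tf.random_normal_initializer(stddev=0.02))
we = dropout(we, embd_pdrop, train)
X = tf.reshape(X, [-1, N_CTX, 2])
M = tf.reshape(M, [-1, N_CTX])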

Modifying the Last Layer

The classifier can be completely removed without looking into its details (but I encourage you to, as it involves some clever tensor manipulations). What we need to modify is the language modeling part:

lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
lm_logits = tf.matmul(lm_h, we, transpose_b=True)
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

The above code uses the hidden outputs from positions 0 to L-1 (L is the length of the longest sequence) to generate predictions for positions 1 to L. In contrast, the reconstructed model uses positions 0 to L to generate predictions for positions 1 to L+1, so we can generate more text based on the existing input:

lm_h = tf.reshape(h, [-1, N_EMBD])
lm_logits = tf.reshape(
    tf.matmul(lm_h, we[:N_VOCAB, :], transpose_b=True),
    [-1, N_CTX, N_VOCAB])
lm_logits_truncated = tf.reshape(
    lm_logits[:, :-1],
    [-1, N_VOCAB])
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits_truncated,
    labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

But of course, we cannot calculate losses for predictions we don't know the answers to, so we need lm_logits_truncated when calculating the losses.

Another change I made is to use only the rows of the embedding matrix that actually correspond to tokens when calculating the logits — using we[:N_VOCAB, :] instead of we. With the latter, the model would include the learned position embeddings and generate predictions for position indices, which does not make any sense. I don't think we need those embeddings in the supervised model, either.

Loading the Pre-trained Weights

The pre-trained weights are stored as NumPy arrays. What we need to do is remove the initialization of the special token embeddings from the original code.

This is the part where things can most easily go wrong, because all (TensorFlow) variables must be initialized in the same shape and in the same order as in the pre-trained model. If you mess up the variables when reconstructing the language model, the weights won't be loaded properly.
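As a rough sketch of the idea (not the repo's exact loader, which also re-assembles the raw .npy shards into per-variable arrays first): once you have one array per variable, loading is an ordered assignment, and the order and shapes must line up exactly.

import tensorflow as tf

def load_pretrained(sess, arrays):
    # `arrays` is a list of NumPy arrays, one per variable, in creation order.
    variables = tf.trainable_variables()
    assert len(variables) == len(arrays), "variable count mismatch"
    for var, value in zip(variables, arrays):
        assert tuple(var.get_shape().as_list()) == value.shape, var.name
        sess.run(var.assign(value))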

Inspecting the Reconstructed Language Model

Feeding the example four-sentence stories into the model, we can get the next-token predictions:


Take position 9 as an example: the story up to this position is "Karen was assigned a roommate her first year of …". The correct next token is college, and the model correctly predicted it. The second and third most probable tokens according to the model are school and grad, both of which are quite reasonable choices. So it appears that the reconstructed model works! (You might want to run more tests to be sure.)
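For reference, the predictions above can be reproduced with something like the following (a sketch; sess, X, M and lm_logits are the session and graph tensors of the reconstructed model, and TEXT_ENCODER is the tokenizer from earlier):

import numpy as np

xmb, mmb = transform_texts(
    ["Karen was assigned a roommate her first year of"])
logits = sess.run(lm_logits, {X: xmb, M: mmb})  # (1, N_CTX, N_VOCAB)

position = 9                      # the hidden state here scores the *next* token
row = logits[0, position]
probs = np.exp(row - row.max())
probs /= probs.sum()              # softmax over the real vocabulary only
top3 = np.argsort(-probs)[:3]
print([TEXT_ENCODER.decoder[i] for i in top3])  # sub-word strings, e.g. for college, school, grad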

Thank You!
I skipped some moderately important details in this post, mainly because I ran out of energy (and thus the will to make it even longer). If you find anything not explained clearly enough, or missing entirely, please feel free to leave me a note. I'd be happy to correct it.

References:

1. Radford, A., & Salimans, T. (2018). Improving Language Understanding by Generative Pre-Training.

2. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification.

3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need.

4. Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., & Shazeer, N. (2018). Generating Wikipedia by Summarizing Long Sequences.

5. Story Cloze Test and ROCStories Corpora.

6. Mostafazadeh, N., Roth, M., Louis, A., Chambers, N., & Allen, J. F. (2017). LSDSem 2017 Shared Task: The Story Cloze Test.

7. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books.

8. Sennrich, R., Haddow, B., & Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units.
