
[Notes] Improving Language Understanding by Generative Pre-Training
Exercise: Reconstructing the Language Model from the Fine-Tuned Model

Ceshine Lee · Published in Veritable · Aug 7, 2018



Introduction
The field of NLP has been creating some buzz this year (in case you haven't heard, NLP's ImageNet moment has arrived). When I finally found some time to read some of the new papers, I felt this one by OpenAI [1] was interesting enough to dig deeper into:


Improving Language Understanding with Unsupervised Learning
We've obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system…
blog.openai.com

It expands on the work by Howard and Ruder [2]. The main differences include:

1. Use transformer networks [3] instead of LSTMs to better capture long-term linguistic structure.

2. Demonstrate the effectiveness of the approach on a wider range of tasks.

3. Include auxiliary training objectives (e.g. language modeling) in addition to the task objective when fine-tuning.

(The transformer network used is a variant called the Transformer decoder, proposed in [4].)


General Model Structure. Taken from [1]

The Exercise
The author open-sourced a GitHub repo containing the code (using TensorFlow) and the pre-trained model weights needed to reproduce the results (currently only the ROCStories Cloze dataset [5, 6] is supported as the fine-tuning target). According to the paper, the pre-trained language model is trained on the BooksCorpus dataset [7].

Honestly, the code could use some work to improve its readability, and the low-level TensorFlow APIs it uses are notoriously hard to comprehend and modify (by the way, there is a PyTorch implementation available). So I gave myself an exercise: reconstruct the pre-trained language model. Since a neural network can be entirely ruined by very subtle bugs or mistakes, you need to understand the code and the model to ensure the final model produces the correct results.

The reconstructed language model can be used to:

1. Inspect what the language model has learned. Feed it some texts and see
what prediction it makes.

2. Re-train the language model with a different corpus.

The rest of this post covers the key ideas and code chunks that are essential to completing this exercise. The full code is published as a fork of the original OpenAI repo (check out the notebook in the root folder first):

ceshine/finetune-transformer-lm
finetune-transformer-lm - Code and model for the paper
"Improving Language Understanding by Generative Pre-Training"
github.com

Target Task: Story Cloze Test

Because the model published by the author specifically targets this kind of task, we need to understand what it is and how the model handles it first.

‘Story Cloze Test’ is a new commonsense reasoning framework for evaluating story understanding, story generation, and script learning. This test requires a system to choose the correct ending to a four-sentence story.

Taken from [5]

Model Structure for Multiple Choice Tasks. Taken from [1]

This is a binary choice problem, so we’ll create two sequences and feed
them to the same transformer network. Three new special tokens are
added to the vocabulary from the pre-trained model — <start>, <delim>,
and <extract>.
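To make the input layout concrete, here is a rough sketch of how one story becomes two candidate sequences (illustrative pseudo-tokens only, with made-up endings; the real code works with integer token ids, as shown in the transform_roc function below):

# Illustrative only: the endings below are invented for the example.
context = "karen was assigned a roommate her first year of college".split()
ending1 = "she and her roommate became good friends".split()
ending2 = "karen moved out the next day".split()

seq1 = ["<start>"] + context + ["<delim>"] + ending1 + ["<extract>"]
seq2 = ["<start>"] + context + ["<delim>"] + ending2 + ["<extract>"]
# Both sequences go through the same transformer; the hidden state at
# <extract> is what the classifier uses to score each ending.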

Tokenization

We used a bytepair encoding (BPE) vocabulary with 40,000 merges

Byte pair encoding [8] starts by treating individual characters as tokens, and then iteratively merges the most frequent token pair, N times in total. The resulting token vocabulary breaks rare words into several chunks consisting of more common character sequences.
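Here is a toy sketch of the merge procedure (not the repo's implementation, just the idea): start from characters and repeatedly merge the most frequent adjacent pair.

from collections import Counter

def learn_bpe_merges(words, n_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as a tuple of characters
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

# e.g. learn_bpe_merges(["lower", "lowest", "low", "newer", "wider"], 10)
# starts with merges ('w', 'e'), then ('l', 'o'), and so on.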

It is implemented in the TextEncoder class. With 40,000 merges, most common words are already combined into a single token. You probably don't need to worry about this part unless you want to train the model on other languages. The two things we'll be using are the TextEncoder.encode() method and the TextEncoder.decoder attribute.
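A minimal usage sketch (the file paths follow the defaults in the original repo's train.py; adjust them if your copy differs):

from text_utils import TextEncoder

# Assumes the encoder/BPE files shipped with the pre-trained model are present.
text_encoder = TextEncoder(
    "model/encoder_bpe_40000.json", "model/vocab_40000.bpe")

tokens = text_encoder.encode(
    ["Karen was assigned a roommate her first year of college."],
    verbose=False)
print(tokens[0])                                     # list of BPE token ids
print([text_encoder.decoder[t] for t in tokens[0]])  # the corresponding sub-word strings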

Transformation
After tokenization, the resulting tokens need to be arranged into NumPy arrays that are ready to be fed into the neural network as inputs. This is done in the transform_roc function:

def transform_roc(X1, X2, X3):
    n_batch = len(X1)
    xmb = np.zeros((n_batch, 2, n_ctx, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, 2, n_ctx), dtype=np.float32)
    start = encoder['_start_']
    delimiter = encoder['_delimiter_']
    for i, (x1, x2, x3) in enumerate(zip(X1, X2, X3)):
        x12 = [start] + x1[:max_len] + [delimiter] + \
              x2[:max_len] + [clf_token]
        x13 = [start] + x1[:max_len] + [delimiter] + \
              x3[:max_len] + [clf_token]
        l12 = len(x12)
        l13 = len(x13)
        xmb[i, 0, :l12, 0] = x12
        xmb[i, 1, :l13, 0] = x13
        mmb[i, 0, :l12] = 1
        mmb[i, 1, :l13] = 1
    xmb[:, :, :, 1] = np.arange(
        n_vocab+n_special, n_vocab+n_special+n_ctx)
    return xmb, mmb

X1 is the context (4 sentences) tokens, X2 is the ending-1 tokens, and X3 is the ending-2 tokens. xmb contains both tokens and position indices and is shaped (batch_size, choice, sequence_length, token/position). mmb contains the masks that indicate whether the sequence has ended at each position (0 if ended), which will be used when calculating the language model losses.
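A quick shape check makes this concrete (a sketch; x1_tokens, x2_tokens and x3_tokens stand for the already-encoded token lists):

xmb, mmb = transform_roc(x1_tokens, x2_tokens, x3_tokens)
print(xmb.shape)  # (batch_size, 2, n_ctx, 2): [example, choice, position, token-or-position-index]
print(mmb.shape)  # (batch_size, 2, n_ctx): 1.0 up to each sequence's length, 0.0 afterwards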

Position indices are used in the transformer network to incorporate order information. Later, the indices are mapped to embeddings, as opposed to the sinusoidal function used in the original transformer:

We used learned position embeddings instead of the sinusoidal version proposed in the original work.
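Inside the network, both columns of xmb index into the same embedding matrix we, and the token and position embeddings are simply summed. A sketch of that lookup, which is roughly what the repo's embed() helper does:

# X is shaped (batch, n_ctx, 2): token ids in [..., 0], position ids in [..., 1].
# The position ids point at the last n_ctx rows of `we`, so a single gather
# fetches both embeddings and a sum over axis 2 combines them.
e = tf.gather(we, X)     # (batch, n_ctx, 2, n_embd)
h = tf.reduce_sum(e, 2)  # token embedding + learned position embedding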

Modifying the Transformation

The transformation function for the reconstructed language model:

def transform_texts(list_of_texts):
    tokens = TEXT_ENCODER.encode(list_of_texts, verbose=False)
    n_batch = len(tokens)
    xmb = np.zeros((n_batch, N_CTX, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, N_CTX), dtype=np.float32)
    for i, x in enumerate(tokens):
        x1 = x[:N_CTX]
        l1 = len(x1)
        print(f"length: {l1}")
        xmb[i, :l1, 0] = x1
        mmb[i, :l1] = 1
    xmb[:, :, 1] = np.arange(N_VOCAB, N_VOCAB+N_CTX)
    return xmb, mmb

This is really straightforward. It removes the extra choice dimension and takes out the special tokens. (The tokenization is done inside the function for convenience.)
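Usage then looks like this (a sketch):

xmb, mmb = transform_texts(
    ["Karen was assigned a roommate her first year of"])
print(xmb.shape)  # (1, N_CTX, 2): token ids in [..., 0], position ids in [..., 1]
print(mmb.shape)  # (1, N_CTX): 1.0 up to the sequence length, 0.0 afterwards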

The Model

Transformer Network
We're not going to cover the transformer in this post. The transformer blocks can be copied directly into the language model without any modification. Check out these two great resources if you're interested in the inner workings of the transformer:

The Illustrated Transformer
In the previous post, we looked at Attention - a ubiquitous method in modern deep learning models. Attention is a…
jalammar.github.io


The Annotated Transformer
nlp.seas.harvard.edu

Modifying the Input Layer

Code from the supervised model:

we = tf.get_variable(
    "we",
    [n_vocab+n_special+n_ctx, n_embd],
    initializer=tf.random_normal_initializer(stddev=0.02))
we = dropout(we, embd_pdrop, train)
X = tf.reshape(X, [-1, n_ctx, 2])
M = tf.reshape(M, [-1, n_ctx])

As we do not need the special tokens anymore, we need to remove n_special from the embedding matrix initialization.

There's no other modification needed, as X and M are reshaped and agnostic to the choice dimension.
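For reference, a sketch of what the modified initialization looks like in the reconstructed model (see the fork for the exact code):

# Same as the supervised version, minus the n_special rows: only the real
# vocabulary plus the N_CTX learned position embeddings remain.
we = tf.get_variable(
    "we",
    [N_VOCAB + N_CTX, N_EMBD],
    initializer=tf.random_normal_initializer(stddev=0.02))
we = dropout(we, embd_pdrop, train)
X = tf.reshape(X, [-1, N_CTX, 2])
M = tf.reshape(M, [-1, N_CTX])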

Modifying the Last Layer

The classifier can be completely removed without looking into its details (but I encourage you to, as it involves some clever tensor manipulations). What we need to modify is the language modeling part:

lm_h = tf.reshape(h[:, :-1], [-1, n_embd])
lm_logits = tf.matmul(lm_h, we, transpose_b=True)
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits, labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

The above code uses the hidden outputs from positions 0 to L-1 (L is the length of the longest sequence) to generate predictions for positions 1 to L. In contrast, the reconstructed model uses positions 0 to L to generate predictions for positions 1 to L+1, so we can generate more text based on the existing input:

lm_h = tf.reshape(h, [-1, N_EMBD])
lm_logits = tf.reshape(
    tf.matmul(lm_h, we[:N_VOCAB, :], transpose_b=True),
    [-1, N_CTX, N_VOCAB])
lm_logits_truncated = tf.reshape(
    lm_logits[:, :-1],
    [-1, N_VOCAB])
lm_losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=lm_logits_truncated,
    labels=tf.reshape(X[:, 1:, 0], [-1]))
lm_losses = tf.reshape(
    lm_losses, [shape_list(X)[0], shape_list(X)[1]-1])
lm_losses = tf.reduce_sum(
    lm_losses*M[:, 1:], 1)/tf.reduce_sum(M[:, 1:], 1)

But of course, we cannot calculate losses for predictions we don't know the answers to, so we need lm_logits_truncated when calculating the losses.

Another change I made is to use only the rows of the embedding matrix that actually correspond to tokens when calculating the logits — using we[:N_VOCAB, :] instead of we. With the latter, the model would include the learned position embeddings and generate predictions for position indices, which does not make any sense. I don't think we need those embeddings in the supervised model, either.

Loading the Pre-trained Weights

The pre-trained weights are stored as NumPy arrays. What we need to do is remove the initialization of the special token embeddings from the original code.

This is the part where things can most easily go wrong, because all (TensorFlow) variables must be initialized in the same shape and in the same order as in the pre-trained model. If you mess up the variables when reconstructing the language model, the weights won't be loaded properly.
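As a rough sketch of the idea (not the repo's exact loader, which also re-assembles the raw .npy shards into per-variable arrays first): once you have one array per variable, loading is an ordered assignment, and the order and shapes must line up exactly.

import tensorflow as tf

def load_pretrained(sess, arrays):
    # `arrays` is a list of NumPy arrays, one per variable, in creation order.
    variables = tf.trainable_variables()
    assert len(variables) == len(arrays), "variable count mismatch"
    for var, value in zip(variables, arrays):
        assert tuple(var.get_shape().as_list()) == value.shape, var.name
        sess.run(var.assign(value))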

Inspecting the Reconstructed Language Model

Feeding the example four-sentence stories into the model, we can get the next-token predictions:


Take position 9 as an example: the story up to this position is "Karen was assigned a roommate her first year of …". The correct next token is college, and the model correctly predicted it. The second and third most probable tokens according to the model are school and grad, both of which are quite reasonable choices. So it appears that the reconstructed model works! (You might want to run more tests to be sure.)
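For reference, the predictions above can be reproduced with something like the following (a sketch; sess, X, M and lm_logits are the session and graph tensors of the reconstructed model, and TEXT_ENCODER is the tokenizer from earlier):

import numpy as np

xmb, mmb = transform_texts(
    ["Karen was assigned a roommate her first year of"])
logits = sess.run(lm_logits, {X: xmb, M: mmb})  # (1, N_CTX, N_VOCAB)

position = 9                      # the hidden state here scores the *next* token
row = logits[0, position]
probs = np.exp(row - row.max())
probs /= probs.sum()              # softmax over the real vocabulary only
top3 = np.argsort(-probs)[:3]
print([TEXT_ENCODER.decoder[i] for i in top3])  # sub-word strings, e.g. for college, school, grad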

Thank You!
I skipped some moderately important details in this post, mainly because I ran out of energy (and thus the will to make it even longer). If you find anything not explained clearly enough, or missing entirely, please feel free to leave me a note. I'd be happy to correct it.

References:

1. Radford, A., & Salimans, T. (2018). Improving Language Understanding by Generative Pre-Training.

2. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification.

3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need.

4. Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., & Shazeer, N. (2018). Generating Wikipedia by Summarizing Long Sequences.

5. Story Cloze Test and ROCStories Corpora.

6. Mostafazadeh, N., Roth, M., Louis, A., Chambers, N., & Allen, J. F. (2017). LSDSem 2017 Shared Task: The Story Cloze Test.

7. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books.

8. Sennrich, R., Haddow, B., & Birch, A. (2015). Neural Machine Translation of Rare Words with Subword Units.
