[Notes] Improving Language Understanding by Generative Pre-Training
by Ceshine Lee | Veritable | Medium
https://medium.com/the-artificial-impostor/notes-improving-language-understanding-by-generative-pre-training-4c9d4214369c
Introduction
The field of NLP has been creating some buzz this year (in case you haven't
heard, NLP's ImageNet moment has arrived). When I finally found some time
to read some of the new papers, I felt this paper by OpenAI [1] was interesting
enough for me to dig deeper into:
It expands on the work of Howard and Ruder [2]. The main differences
include:
The Exercise
The authors open-sourced a GitHub repo containing the code (using
TensorFlow) and the pre-trained model weights needed to reproduce the results
(currently only the ROCStories Cloze dataset [5, 6] is supported as the
fine-tuning target). According to the paper, the pre-trained language model is
trained on the BooksCorpus dataset [7].
Honestly, the code could use some work on readability, and the
low-level TensorFlow APIs it uses are notoriously hard to comprehend and
modify (by the way, there is a PyTorch implementation available). So I gave
myself an exercise: reconstruct the pre-trained language model. Since a
neural network can be entirely ruined by very subtle bugs or mistakes, you
need to understand the code and the model to ensure the final model produces
the correct results.
1. Inspect what the language model has learned: feed it some text and see
what predictions it makes.
The rest of this post will cover some key ideas and code chunks that are
essential to complete this exercise. The full code is published as a fork of
the original OpenAI repo (check out the notebook in the root folder first):
ceshine/finetune-transformer-lm: Code and model for the paper
"Improving Language Understanding by Generative Pre-Training"
(github.com)
Story Cloze Test: a new commonsense reasoning framework for evaluating
story understanding, story generation, and script learning. This test requires a
system to choose the correct ending to a four-sentence story.
Taken from [5]
This is a binary choice problem, so we’ll create two sequences and feed
them to the same transformer network. Three new special tokens are
added to the vocabulary from the pre-trained model — <start>, <delim>,
and <extract>.
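
To make that construction concrete, here is a minimal sketch (my own helper, not the repo's exact code; the token-id arguments are placeholders) of how each candidate ending is paired with the story context:

def build_choice_sequences(context_ids, ending1_ids, ending2_ids,
                           start_id, delim_id, extract_id):
    # <start> story context <delim> candidate ending <extract>
    seq1 = [start_id] + context_ids + [delim_id] + ending1_ids + [extract_id]
    seq2 = [start_id] + context_ids + [delim_id] + ending2_ids + [extract_id]
    # Both sequences go through the same transformer; the hidden state at the
    # <extract> position of each one feeds a linear classifier that picks the ending.
    return seq1, seq2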
Tokenization
The tokenizer follows the byte-pair-encoding (BPE) idea: start from a base
vocabulary of characters, then iteratively merge the most common token pairs
N times. The resulting token vocabulary breaks rare words into several chunks
consisting of more common character sequences.
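
As a toy illustration of the merge procedure (standard BPE on a tiny word list, not the exact encoder shipped with the repo):

from collections import Counter

def toy_bpe_merges(words, n_merges):
    # represent each word as a tuple of symbols, starting from single characters
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(n_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_symbol = best[0] + best[1]
        # re-segment the vocabulary using the new merged symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

# toy_bpe_merges(["low", "low", "lower", "lowest"], n_merges=3) learns merges such as
# ('l', 'o') and ('lo', 'w'), so rarer words end up split into more common chunks.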
Transformation
After tokenization, the resulting tokens need to be arranged into NumPy
arrays that can be fed into the neural network as inputs. In the original repo
this is done in the transform_roc function; the simplified transform_texts
variant used in my reconstruction is shown below:
def transform_texts(list_of_texts):
    # tokenize inside the function for convenience
    tokens = TEXT_ENCODER.encode(list_of_texts, verbose=False)
    n_batch = len(tokens)
    xmb = np.zeros((n_batch, N_CTX, 2), dtype=np.int32)
    mmb = np.zeros((n_batch, N_CTX), dtype=np.float32)
    for i, x in enumerate(tokens):
        x1 = x[:N_CTX]
        l1 = len(x1)
        print(f"length: {l1}")
        xmb[i, :l1, 0] = x1   # channel 0: token ids
        mmb[i, :l1] = 1       # mask marks the real (non-padding) positions
    # channel 1: position ids, offset past the token vocabulary
    xmb[:, :, 1] = np.arange(N_VOCAB, N_VOCAB + N_CTX)
    return xmb, mmb
This is really straightforward. Compared to transform_roc, it removes the
extra choice dimension and takes out the special tokens. (The tokenization is
done inside the function for the sake of convenience.)
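
For example (assuming TEXT_ENCODER, N_CTX, and N_VOCAB are already defined as in the notebook):

xmb, mmb = transform_texts(["Karen was assigned a roommate her first year of"])
print(xmb.shape)       # (1, N_CTX, 2): token ids in channel 0, position ids in channel 1
print(int(mmb.sum()))  # number of real (non-padding) positions in the sequence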
The Model
Transformer Network
We're not going to cover the transformer architecture in this post. The
transformer blocks can be directly copied into the language model without any
modification. Check out these two great resources if you're interested in the
inner workings of the transformer:
# "we" holds the token, special-token, and position embeddings in one matrix
we = tf.get_variable(
    "we",
    [n_vocab + n_special + n_ctx, n_embd],
    initializer=tf.random_normal_initializer(stddev=0.02))
we = dropout(we, embd_pdrop, train)
# X: (batch, n_ctx, 2) token ids + position ids; M: mask over real positions
X = tf.reshape(X, [-1, n_ctx, 2])
M = tf.reshape(M, [-1, n_ctx])
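
Downstream, the two channels of X are looked up in the same matrix and summed; that is how the token and position embeddings get combined. Roughly (treat the exact lines as my paraphrase rather than the repo's code):

e = tf.gather(we, X)           # (batch, n_ctx, 2, n_embd): token and position embeddings
h = tf.reduce_sum(e, axis=2)   # summed into one input state per position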
The above code uses the hidden outputs from positions 0 to L-1 (L is the
length of the longest sequence) to generate predictions for positions 1 to L. In
contrast, the reconstructed model uses positions 0 to L to generate
predictions for positions 1 to L+1, so we can generate more text based on the
existing input.
But of course, we cannot calculate losses for the predictions we don't
know the answers to, so we need a lm_logits_truncated when calculating the
losses.
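
A minimal sketch of the idea (variable names are mine, not necessarily the repo's): compute logits for every position so the text can be extended, but drop the last position when computing the loss.

lm_h = tf.reshape(h, [-1, n_embd])                 # hidden states for positions 0..L
lm_logits = tf.matmul(lm_h, we, transpose_b=True)  # a next-token prediction at every
                                                   # position, including one past the input
# the loss can only use positions whose next token is actually known
lm_logits_truncated = tf.reshape(
    lm_logits, [-1, n_ctx, n_vocab + n_special + n_ctx])[:, :-1]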
The output projection should use a truncated version of the embedding matrix
instead of we. If we used the latter, the model would also score the learned
position embeddings and generate predictions for position indices, which
does not make any sense. I don't think we need those embeddings in the
supervised model, either.
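
In code terms, that means projecting the hidden states onto only the token rows of the embedding matrix, along the lines of the following (the exact slice is my assumption; whether to also keep the special-token rows is a judgment call):

# score only real tokens, so the softmax never assigns probability to position indices
lm_logits = tf.matmul(lm_h, we[:n_vocab], transpose_b=True)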
This is the part where things can most easily go wrong, because all (TensorFlow)
variables must be initialized with the same shapes and in the same order as in
the pre-trained model. If you mess up the variables when reconstructing
the language model, the weights won't be loaded properly.
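
The loading itself is order-based, roughly like the sketch below (the file name and storage format here are placeholders, not necessarily what the repo uses):

import joblib
import tensorflow as tf

# list of numpy arrays saved from the pre-trained model, in variable-creation order
init_params = joblib.load("model/pretrained_params.pkl")
model_vars = tf.trainable_variables("model")   # variables of the reconstructed graph

# shapes and order must match the pre-trained graph exactly; otherwise zip()
# silently pairs arrays with the wrong variables
assert len(model_vars) == len(init_params)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([var.assign(param) for var, param in zip(model_vars, init_params)])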
Take position 9 as an example: the story up to this position is "Karen was
assigned a roommate her first year of ...". The correct next token is college,
and the model predicted it correctly. The second and third most probable
tokens according to the model are school and grad, both quite
reasonable choices. So it appears that the reconstructed model works! (You
might want to run more tests to be sure.)
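
The check above can be reproduced with something like this (a sketch; lm_logits_values is assumed to be the logits array fetched from the session for this story, and DECODER an id-to-token mapping):

import numpy as np

position = 9  # the token right after "... her first year of"
top3_ids = np.argsort(lm_logits_values[position])[::-1][:3]
print([DECODER[token_id] for token_id in top3_ids])
# expected something like ['college', 'school', 'grad']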
Thank You!
I skipped some moderately important details in this post, mainly because I
ran out of energy and thus the will to make it even longer. If you
find anything not explained clearly enough, or missing entirely, please feel
free to leave me a note. I'd be happy to correct it.
References:
1. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving
Language Understanding by Generative Pre-Training.
6. Mostafazadeh, N., Roth, M., Louis, A., Chambers, N. W., & Allen, J. F.
(2017). LSDSem 2017 Shared Task: The Story Cloze Test.