Language Models are Unsupervised Multitask Learners | by Skepsis Reviews | Medium
The key idea proposed in this paper is that modern machine learning systems end up as “narrow specialists” rather than “broad generalists” because of the way supervised learning systems are trained. Supervised systems not only require tremendous human effort to label their training sets, but are also narrow by construction, and thus pose a natural stopping point for the development of truly generalized systems.
This paper develops a Language Model (LM) called GPT-2, a 1.5-billion-parameter transformer network that achieves state-of-the-art performance on 7 of the 8 LM benchmarks it is tested on.
Approach
At its core, the approach taken in this paper is language modeling, which is usually framed as unsupervised distribution estimation over a set of examples {x₁, x₂, …, xₙ}, each composed of a variable-length sequence of symbols {s₁, s₂, …, sₙ}. Because language has a natural sequential ordering, it is common to factorize the joint probability of a sequence as a product of conditional probabilities:

p(x) = p(s₁) · p(s₂ | s₁) · … · p(sₙ | s₁, …, sₙ₋₁)
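This factorization can be sketched in code. The toy conditional probabilities below are made up purely for illustration, and the bigram simplification conditions only on the previous token, whereas a real LM conditions on the entire prefix:

```python
import math

# Toy bigram "model": conditional probabilities p(next | prev) over a tiny
# vocabulary. These numbers are invented purely for illustration.
COND_PROB = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4,
}

def sequence_log_prob(tokens):
    """Score a sequence as a product of conditional probabilities,
    p(s1..sn) = prod_i p(s_i | s_<i), accumulated in log space to
    avoid numerical underflow on long sequences."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        # Unseen pairs get a tiny floor probability instead of zero.
        log_p += math.log(COND_PROB.get((prev, tok), 1e-9))
        prev = tok
    return log_p

# exp(log 0.5 + log 0.2 + log 0.4) = 0.04
print(sequence_log_prob(["the", "cat", "sat"]))
```

Working in log space is standard practice: multiplying many probabilities below 1 underflows quickly, while summing their logarithms does not.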
It turns out that, given sufficient data, the unsupervised model will converge, so an appropriate dataset becomes the key bottleneck to solving this problem.
Many of the Natural Language Processing datasets available online are too restrictive for this task: some contain only dialogue, others are built from a single article or text. A generalized model requires a diverse and comprehensive dataset, which has to be created by pulling web data. A problem quickly emerged, however: while web-scraped data was orders of magnitude larger than existing datasets, it had significant quality issues and far too much content that was simply “unintelligible.”
So the researchers created their own dataset with a methodology that focuses on quality and human readability:
1. Scrape Reddit for all outbound media links that received at least 3 karma (this ensures that at least some users found the link insightful, reducing readability and quality problems)
2. Remove all Wikipedia links (due to data-duplication problems that may arise) and all links newer than 2018 (a precautionary measure; future versions of the dataset will contain more recent links)
3. Go through each link and extract the text from the HTML responses using well-
established content extractors
4. Run some basic heuristic cleaning algorithms to sanitize the remaining data
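The filtering in steps 1 and 2 can be sketched as follows. The link records and their field names are hypothetical, and the real pipeline used dedicated content extractors (Dragnet and Newspaper) for step 3 rather than anything shown here:

```python
# Minimal sketch of the WebText link-filtering steps described above.
# The record format ("url", "karma", "year") is an assumption made for
# illustration, not the paper's actual data schema.

def keep_link(link):
    """Apply the karma, Wikipedia, and recency filters (steps 1-2)."""
    if link["karma"] < 3:
        return False                      # step 1: require >= 3 karma
    if "wikipedia.org" in link["url"]:
        return False                      # step 2: drop Wikipedia links
    if link["year"] >= 2018:
        return False                      # step 2: drop too-recent links
    return True

links = [
    {"url": "https://example.com/a", "karma": 5, "year": 2016},
    {"url": "https://en.wikipedia.org/wiki/Cat", "karma": 10, "year": 2015},
    {"url": "https://example.com/b", "karma": 1, "year": 2017},
]
kept = [l for l in links if keep_link(l)]
print([l["url"] for l in kept])  # only the first link survives all filters
```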
The result of this procedure is a dataset that contains slightly over 8 million
documents for a total of 40 GB of text coming from 45 million links. This dataset is
now known as WebText.
This dataset is comprehensive and contains examples from which the unsupervised model can learn to accomplish most well-defined NLP tasks.
Examples of naturally occurring demonstrations of English-to-French and French-to-English translation found throughout the WebText training set, showing that an unsupervised model can learn even machine translation tasks.
The actual model itself largely follows the details of the OpenAI GPT model. The
sizes of the different models tested are as follows:
Architecture hyperparameters for the 4 model sizes, GPT-2 refers to the 1542M model
Results
The model was evaluated on multiple benchmarks across a variety of tasks. Here is a quick summary of the performance of the different GPT-2 sizes. On most benchmarks, it performs remarkably well compared to supervised systems, despite not being specifically trained for each task.
GPT results with no training or fine-tuning performed for any of these benchmarks.
Let’s take a look at one of the results from this table, namely the CBT-CN and CBT-NE results, which stand for the Children’s Book Test (Common Nouns and Named Entities).
The CBT was created to examine the performance of LMs on different categories of
words: named entities, nouns, verbs, and prepositions. Rather than reporting
perplexity as an evaluation metric, CBT reports accuracy on an automatically
constructed cloze test where the task is to predict which of 10 possible choices for
an omitted word is correct.
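This cloze evaluation can be sketched as scoring each candidate completion with the LM and picking the highest-scoring one. The `lm_score` function below is a hypothetical stand-in; a real language model would return the log-probability of the full sentence:

```python
# Sketch of CBT-style cloze scoring: substitute each of the candidate
# words into the blank and keep the completion the LM scores highest.

def lm_score(sentence):
    # Hypothetical scorer standing in for a real LM's log-probability.
    # Here we simply reward one hard-coded completion for illustration.
    return 1.0 if "crown" in sentence else 0.0

def answer_cloze(context_with_blank, candidates, blank="XXXXX"):
    """Return the candidate whose filled-in sentence scores highest."""
    scored = [
        (lm_score(context_with_blank.replace(blank, c)), c)
        for c in candidates
    ]
    return max(scored)[1]  # tuple comparison: highest score wins

choice = answer_cloze(
    "The king placed the XXXXX on his head.",
    ["table", "crown", "river"],
)
print(choice)  # -> crown
```

The actual CBT task offers 10 candidates per question; three are used here to keep the sketch short.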
On the Natural Questions dataset, the probability GPT-2 assigns to its generated answers is well calibrated: GPT-2 reaches 63.1% accuracy on the 1% of questions it is most confident in. The 30 most confident answers generated by GPT-2 on development set questions are shown below:
Overall, the GPT-2 shows a strong ability to answer a wide variety of questions successfully, and even some of the answers it gets wrong are stylistically plausible: asked for Sweden’s most common blood type, it answered A. Though incorrect, A is still a valid blood type and a reasonable answer.
Generalization vs Memorization
It should be noted that many popular text datasets contain a non-trivial amount of
near-duplicate data. This can affect the results because if the model has seen part of
the testing set in the training set, it incentivizes the model to simply “memorize”
parts of the data instead of learning and generalizing.
This is a real concern for a dataset as large as WebText. Overlap is measured by 8-gram similarity: the overlap percentage is the fraction of 8-grams (sequences of 8 word tokens) in one dataset that also appear in another. By this measure, common NLP datasets overlap their own training sets by 5.9% on average.
The percentage of test-set 8-grams overlapping with training sets.
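The 8-gram overlap measure can be computed directly. The paper stores training-set 8-grams in Bloom filters for scalability; a plain Python set suffices for this sketch:

```python
# Sketch of the 8-gram overlap measurement: the fraction of one dataset's
# 8-grams that also occur in another. (At WebText scale the paper uses
# Bloom filters; an in-memory set works for a small illustration.)

def ngrams(tokens, n=8):
    """All length-n token windows of a sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_pct(train_tokens, test_tokens, n=8):
    """Percentage of test-set n-grams that also appear in training."""
    train = ngrams(train_tokens, n)
    test = ngrams(test_tokens, n)
    if not test:
        return 0.0
    return 100.0 * len(test & train) / len(test)

train = "a b c d e f g h i j".split()
test = "a b c d e f g h z y".split()  # shares exactly one 8-gram with train
print(overlap_pct(train, test))       # 1 of 3 test 8-grams overlaps
```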
Because the overlap between WebText and these test sets is relatively low, removing duplicate entries that could artificially raise performance through memorization has little effect on GPT-2’s test results.
Overall, the analysis suggests that data overlap between WebText training data and
specific evaluation datasets provides a small but consistent benefit to reported
results. However, for most datasets, there aren’t any significantly larger overlaps
than those already existing between standard training and test sets, so it suffices to
take the results from WebText training at face value.
Conclusion
When a large language model is trained on a sufficiently large and diverse dataset, it is able to perform well across many domains and datasets. GPT-2 shows impressive
results on 7 out of 8 benchmarks. The diversity of tasks the model is able to perform
suggests that high-capacity models trained to maximize the likelihood of a
sufficiently varied text corpus may be a new avenue forward towards the ultimate
goal of true generalized natural language processing systems.
Related Work
Below is a list of papers that influenced the development of this one, with a brief summary of what each tries to accomplish.
This paper is extremely influential in NLP for introducing the transformer, the model architecture that the GPT-2 is based on.
Transformer architecture
This is the paper that Children’s Book Test is based on. The goal of this new test was
to see how well language models capture meaning in children’s books. Unlike
standard language modeling benchmarks, it distinguishes the task of predicting
syntactic function words from that of predicting lower-frequency words, which
carry greater semantic content and meaning. Several models were used to develop
the current benchmark, with the following results:
It should be noted that the GPT-2 ultimately outperformed all the models tested in
this paper without specific training.
This paper describes in greater detail the OpenAI GPT model that the GPT-2 transformer network is based on, and it is written by many of the same authors.
Notice how this architecture is split into several segments that each tackle a specific task; this is a step toward the generalized approach of the GPT-2. In contrast to previous approaches, it uses task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.
This paper introduces a new benchmark of natural questions on which the GPT-2 is tested. The resulting baseline is a good starting point for researchers who want to build better models for Natural Questions and for other question-answering datasets with similar characteristics. The baseline model uses a BERT architecture and requires a supervised training setup, unlike the GPT-2.
This earlier work again relies on an RNN architecture to solve large-scale language modeling tasks. The researchers perform an exhaustive study of techniques such as character-level convolutional neural networks and Long Short-Term Memory (LSTM) networks on the One Billion Word Benchmark. The paper showed early promise for large-scale language modeling and paved the way for models such as the GPT-2 and its predecessors.
Model Architecture used in the paper for large language modeling tasks
The GPT-2 architecture took a lot of inspiration from this model, with the notable upgrade of transformer networks in place of LSTM cells.