
Language Models are Unsupervised Multitask Learners

Skepsis Reviews · May 5, 2020


In a recent NLP paper published by OpenAI, a novel approach to traditional NLP problems is proposed: training a language model on a large, diverse dataset with no task-specific supervision, which allows for better generalization and strong performance on several classic NLP benchmarks such as the CoQA dataset.

The key idea proposed in this paper is that modern machine learning systems suffer from being “narrow specialists” rather than “broad generalists” because of the way supervised learning systems are trained. Supervised learning systems not only require tremendous human resources to label their training sets, but are also extremely limited in scope, and thus impose a natural ceiling on the development of truly general systems.

This paper develops a language model (LM) called GPT-2, a 1.5-billion-parameter transformer network that achieves state-of-the-art performance on 7 of 8 LM benchmark test sets.

Approach
At its core, the approach taken in this paper is language modeling, which is usually framed as unsupervised distribution estimation from a set of examples {x₁, x₂, …, xₙ}, each composed of a variable-length sequence of symbols {s₁, s₂, …, sₙ}. Because language has a natural sequential ordering, it is common to factorize the joint probability of a sequence as a product of conditional probabilities:

p(x) = p(s₁) · p(s₂ | s₁) · … · p(sₙ | s₁, …, sₙ₋₁)
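To make the factorization concrete, here is a minimal Python sketch (not from the paper) that scores a short token sequence by summing the logs of the conditional probabilities; the tiny cond_prob table is a hypothetical stand-in for a trained language model.

```python
import math

# Toy stand-in for a trained language model: returns p(token | context).
# The probability table is hypothetical and exists only to illustrate the
# chain-rule factorization p(s1..sn) = prod_i p(s_i | s_1..s_{i-1}).
def cond_prob(token, context):
    table = {
        (): {"the": 0.4, "a": 0.3, "cats": 0.2, "sat": 0.1},
        ("the",): {"cats": 0.5, "sat": 0.3, "the": 0.1, "a": 0.1},
        ("the", "cats"): {"sat": 0.7, "the": 0.1, "a": 0.1, "cats": 0.1},
    }
    return table[tuple(context)][token]

def sequence_log_prob(tokens):
    # Sum log-probabilities rather than multiplying raw probabilities,
    # which avoids numerical underflow on long sequences.
    return sum(math.log(cond_prob(tok, tokens[:i])) for i, tok in enumerate(tokens))

tokens = ["the", "cats", "sat"]
print(math.exp(sequence_log_prob(tokens)))  # 0.4 * 0.5 * 0.7 = 0.14
```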


Traditionally, learning to perform a single task can be framed as estimating a conditional probability p(output|input). Since we want a more general model, we also need to specify which task should be performed on the given input, so we instead model p(output|input, task). In this sense, the single-task supervised model, which does not condition on the task, is a special case of the generalized model. The question then becomes whether the unsupervised model will ever reach a point of convergence.
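The paper's observation is that the task itself can be expressed in natural language, so p(output|input, task) collapses back into ordinary next-token prediction over a single text stream. Below is a hypothetical sketch of that idea; only the "TL;DR:" summarization cue is taken from the paper, and the other templates are purely illustrative.

```python
# Sketch of folding the task specification into the text stream itself, so a
# single language model can be steered toward different tasks at inference
# time. The template strings are illustrative, not the paper's exact prompts,
# except "TL;DR:", which the paper uses to induce summarization.
TEMPLATES = {
    "translate_en_fr": "Translate English to French: {text} =>",
    "summarize": "{text}\nTL;DR:",
    "question_answer": "Q: {text}\nA:",
}

def build_prompt(task: str, text: str) -> str:
    return TEMPLATES[task].format(text=text)

print(build_prompt("translate_en_fr", "The cat sat on the mat."))
print(build_prompt("summarize", "A long article about language models..."))
```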

The answer turns out to be yes: as long as there is sufficient data, the unsupervised model will converge. An appropriately large and diverse dataset therefore becomes the key bottleneck.

Many of the NLP datasets available online are too restrictive for this task, for example containing only dialogue or text drawn from a single article or source. A generalized model requires a diverse and comprehensive dataset, which has to be built by pulling text from the web. A problem quickly emerged, however, when web scraping for text data: while the data collected from web scraping was orders of magnitude larger than existing datasets, it had significant quality issues and far too much content that was “unintelligible.”

So the researchers ended up creating their own dataset with a methodology that focuses on quality and human readability:

1. Scrape Reddit for all outbound links that received at least 3 karma (this is done to ensure that at least some users found the link useful or interesting, acting as a lightweight filter for quality and readability)

2. Remove all Wikipedia links (to avoid data duplication with common evaluation sets) and all links created after December 2017 (a precautionary cut-off; future versions of the dataset will contain more recent links)

3. Go through each link and extract the text from the HTML responses using well-
established content extractors

4. Run de-duplication and some basic heuristic cleaning to sanitize the remaining text

The result of this procedure is a dataset that contains slightly over 8 million
documents for a total of 40 GB of text coming from 45 million links. This dataset is
now known as WebText.
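As a rough illustration of steps 1 and 2 above (a sketch under assumptions, not OpenAI's actual pipeline), the link filtering could look something like this; the Link record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse

# Hypothetical record for a scraped Reddit outbound link; the field names are
# illustrative and do not correspond to any real Reddit API schema.
@dataclass
class Link:
    url: str
    karma: int
    posted: date

CUTOFF = date(2017, 12, 31)  # the preliminary WebText cut-off date

def keep(link: Link) -> bool:
    domain = urlparse(link.url).netloc.lower()
    return (
        link.karma >= 3                      # step 1: lightweight quality filter
        and "wikipedia.org" not in domain    # step 2: avoid overlap with eval sets
        and link.posted <= CUTOFF            # step 2: date cut-off
    )

links = [
    Link("https://en.wikipedia.org/wiki/Language_model", 50, date(2017, 5, 1)),
    Link("https://example.com/blog/translation-tricks", 7, date(2017, 8, 3)),
    Link("https://example.com/low-effort-meme", 1, date(2017, 9, 9)),
]
print([l.url for l in links if keep(l)])  # only the second link survives
```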

This comprehensive dataset contains enough naturally occurring examples for the unsupervised model to learn how to accomplish most well-defined NLP tasks.

Examples of naturally occurring English-to-French and French-to-English translations found throughout the WebText training set, showing how an unsupervised model can pick up even machine translation from raw text.

The model itself largely follows the architecture of the original OpenAI GPT model. The sizes of the different models tested are as follows:

Architecture hyperparameters for the four model sizes; GPT-2 refers to the 1542M model
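For readers who want to experiment, the released GPT-2 weights can be sampled with the Hugging Face transformers library, a tool that post-dates this paper and is not part of it; a rough sketch:

```python
# Rough sketch of sampling from the released GPT-2 weights with the Hugging
# Face transformers library (not part of the paper; "gpt2" is the library's
# identifier for the smallest model variant).
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Language models are", return_tensors="pt")
output = model.generate(input_ids, max_length=40, do_sample=True, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```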

Results
The model was tested against multiple benchmarks across several different tasks. Here is a quick summary of the performance of the different GPT-2 model sizes. On most benchmarks it performs remarkably well compared to prior state-of-the-art systems, despite not being specifically trained for any of these tasks.

GPT-2 results with no task-specific training or fine-tuning performed for any of these benchmarks.

Let’s take a look at one of the results from this table, namely the CBT-CN and CBT-NE columns, which stand for the Children’s Book Test (Common Nouns and Named Entities). The CBT was created to examine the performance of LMs on different categories of words: named entities, nouns, verbs, and prepositions. Rather than reporting perplexity as an evaluation metric, CBT reports accuracy on an automatically constructed cloze test, where the task is to predict which of 10 possible choices for an omitted word is correct.
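A minimal sketch of that cloze evaluation, under assumptions: each of the 10 candidates is substituted into the blank (marked XXXXX in the CBT data) and the candidate whose completed sentence the language model scores highest is chosen. The lm_log_prob function here is a dummy stand-in for a real model.

```python
# Sketch of CBT-style cloze scoring: substitute each candidate into the blank,
# score the full sentence with a language model, and pick the most probable.
# `lm_log_prob` is a stub standing in for a real model's log-likelihood.
def lm_log_prob(sentence: str) -> float:
    # Hypothetical scorer: reward a hand-picked phrase so the example runs
    # end to end; a real evaluation would sum per-token log-probabilities.
    return 0.0 if "the cat chased the mouse" in sentence else -10.0

def answer_cloze(question: str, candidates: list) -> str:
    # `question` contains a blank marked with "XXXXX", as in the CBT data.
    return max(candidates, key=lambda c: lm_log_prob(question.replace("XXXXX", c)))

question = "After dinner, the cat chased the XXXXX around the kitchen."
candidates = ["mouse", "piano", "cloud", "river", "letter",
              "window", "garden", "doctor", "ocean", "shadow"]
print(answer_cloze(question, candidates))  # -> "mouse"
```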

As seen in the figure, performance steadily improves as model size increases, closing the majority of the gap to human performance on this test and surpassing the previous benchmark set in 2016. GPT-2 achieves new state-of-the-art results of 93.3% on common nouns and 89.1% on named entities.

A potential way to test what information is contained within a language model is to evaluate how often it generates the correct answer to factoid-style questions. In these evaluations, GPT-2 answers 4.1% of questions correctly by the exact match metric commonly used on reading comprehension datasets like SQuAD. It is noteworthy that the smallest GPT model answers less than 1% of questions correctly, 5.3 times fewer than GPT-2, showing that model size is a strong factor in performance on this task.
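For reference, here is a sketch of the exact match metric using the normalization conventionally applied in SQuAD-style evaluation (lowercasing, stripping punctuation and articles); this is a common reimplementation, not the paper's exact evaluation script.

```python
import re
import string

# Exact-match scoring in the style commonly used for SQuAD-like evaluation:
# normalize both strings (lowercase, drop punctuation and articles, collapse
# whitespace) and check for equality.
def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, truth: str) -> bool:
    return normalize(prediction) == normalize(truth)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # True
print(exact_match("Blood type A", "blood type O"))      # False
```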

The probability GPT-2 assigns to its generated answers is well calibrated and GPT-2
has an accuracy of 63.1% on the 1% of questions it is most confident in. The 30 most
confident answers generated by GPT-2 on development set questions are shown
below:

The 30 most confident answers generated by GPT-2

Overall, GPT-2 shows a strong ability to answer a wide variety of questions correctly, and even some of its wrong answers are reasonable in kind. For example, it answers “A” for Sweden’s most common blood type; the answer is wrong, but “A” is still a valid blood type and a plausible guess.

Generalization vs Memorization
It should be noted that many popular text datasets contain a non-trivial amount of near-duplicate data. This can affect results: if the model has seen parts of the test set during training, it is incentivized to simply “memorize” those parts instead of learning and generalizing.

This is a real concern with a dataset as large as WebText. Overlap is measured with 8-gram similarity: the overlap percentage is the fraction of a test set’s 8-grams (sequences of eight tokens) that also appear in a training set. By this measure, the test sets of common NLP datasets overlap the WebText training data by a few percent on average, and many overlap their own training splits by even more, roughly 5.9% on average.
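A simplified sketch of the 8-gram overlap measurement; the paper uses Bloom filters over lowercased, tokenized text to make this tractable at WebText scale, while plain Python sets suffice to illustrate the metric.

```python
import re

# Simplified 8-gram overlap check. The paper builds Bloom filters over
# 8-grams of lowercased, alphanumeric-tokenized text for scalability; plain
# sets are enough to illustrate the metric on small inputs.
def ngrams(text: str, n: int = 8) -> set:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_percentage(test_text: str, train_text: str, n: int = 8) -> float:
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    train_grams = ngrams(train_text, n)
    return 100.0 * len(test_grams & train_grams) / len(test_grams)

train = "the quick brown fox jumps over the lazy dog near the river bank"
test = "a quick brown fox jumps over the lazy dog near the river today"
print(f"{overlap_percentage(test, train):.1f}% of test 8-grams appear in train")
```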

The percentage of test-set 8-grams overlapping with training sets.

Because the overlap with the WebText training set is relatively low, removing the duplicated entries that could artificially raise performance through memorization does not have much of an effect on GPT-2’s test results.

Overall, the analysis suggests that data overlap between the WebText training data and specific evaluation datasets provides a small but consistent benefit to reported results. However, for most datasets the overlaps are not significantly larger than those already existing between standard training and test sets, so it is reasonable to take the results of WebText training at face value.

Conclusion
When a large language model is trained on a sufficiently large and diverse dataset, it is able to perform well across many domains and datasets. GPT-2 achieves state-of-the-art results on 7 out of 8 tested language modeling benchmarks. The diversity of tasks the model is able to perform suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus may be a promising avenue toward the ultimate goal of truly general natural language processing systems.

Related Work
Below is a list of papers that influenced the development of this one, with a brief summary of what each tries to accomplish.

1. Attention Is All You Need

This paper is extremely influential in NLP for introducing the transformer, the model architecture that GPT-2 is based on.

Transformer architecture

The transformer is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and has proven extremely effective compared to its predecessor, the LSTM.
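To give a flavor of the mechanism, here is a minimal NumPy sketch of scaled dot-product attention, the generic textbook formulation rather than the paper's full multi-head implementation.

```python
import numpy as np

# Minimal scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
# This is the generic single-head formulation; the actual transformer stacks
# multiple heads plus feed-forward layers, residual connections, and layer norm.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```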

2. The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations

This is the paper that the Children’s Book Test is based on. The goal of this new test was to see how well language models capture meaning in children’s books. Unlike standard language modeling benchmarks, it distinguishes the task of predicting syntactic function words from that of predicting lower-frequency words, which carry greater semantic content and meaning. Several models were used to establish the original benchmark, with the following results:

It should be noted that GPT-2 ultimately outperformed all the models tested in this paper without any task-specific training.

3. Improving Language Understanding by Generative Pre-Training

This paper describes in greater detail the original OpenAI GPT model that the GPT-2 transformer network is based on, and it is written by many of the same authors.

The network architecture that ultimately became the GPT-2

Notice how the input is transformed into several task-specific formats that all feed the same underlying network; this foreshadows the more generalized approach of GPT-2. In contrast to previous approaches, the model uses task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture.

4. A BERT Baseline for the Natural Questions

This paper provides a baseline for the Natural Questions benchmark, on which GPT-2 is also evaluated. The result is a baseline that constitutes a good starting point for researchers wanting to create better models for Natural Questions and for other question answering datasets with similar characteristics. The baseline model uses a BERT architecture and requires a supervised training setup, unlike GPT-2.

5. Exploring the Limits of Language Modeling

This paper is an earlier work that relies on RNN architectures to tackle large-scale language modeling. The researchers perform an exhaustive study of techniques such as character-level Convolutional Neural Networks and Long Short-Term Memory networks on the One Billion Word Benchmark. The work showed early promise for large-scale language modeling and helped pave the way for models such as GPT-2 and its predecessors.

Model Architecture used in the paper for large language modeling tasks

The GPT-2 architecture took a lot of inspiration from this model, with the notable upgrade of replacing LSTM cells with transformer blocks.

