Seminar Report - Transformer Model
CHAPTER 1
INTRODUCTION
Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing
knowledge gained while solving one problem and applying it to a different but related problem.
For example, knowledge gained while learning to recognize cars could apply when trying to
recognize trucks. This area of research bears some relation to the long history of psychological
literature on transfer of learning, although practical ties between the two fields are limited. From
the practical standpoint, reusing or transferring information from previously learned tasks for the
learning of new tasks has the potential to significantly improve the sample efficiency of
a reinforcement learning agent. This idea of reusing learned knowledge leads us to the Transformer model.
1.2 Background
Before transformers, most state-of-the-art NLP systems relied on gated RNNs, such
as LSTM and gated recurrent units (GRUs), with added attention mechanisms. Transformers are
built on these attention technologies without using an RNN structure, highlighting the fact that
attention mechanisms alone can match the performance of RNNs with attention.
Imagine yourself back in the days when you tried to ride a bicycle for the first time. It was
difficult and took time. You needed to learn everything from scratch: How to keep the balance,
how to steer the wheel, how to brake. Now back to the present: Imagine you want to learn how to
ride a motorcycle. You don’t need to start from zero. It is much easier for you to learn how to
keep the balance or use the brakes. Even though you are in a different setting, you
can transfer the skills learned from riding a bicycle. That’s also the essence of transfer learning.
“Transfer learning is [...] the improvement of learning in a new task through the transfer of
knowledge from a related task that has already been learned.”
Having learned how to keep the balance on a bicycle improves your learning of how to keep the
balance on a motorcycle. Similarly, an algorithm that has learned how to recognize dogs can be
trained to recognize cats with relative ease by transferring certain abstract concepts.
In brief, machine learning is the general term for when computers learn from data without being
explicitly programmed. Instead, machine learning algorithms recognize patterns in the data and
make predictions once new data arrives.
So far, conventional machine learning algorithms have been built to learn specific tasks. They
are designed to work in isolation and this works well both in theory and practice. But training
algorithms from scratch also has drawbacks. As specialized algorithms, they reach high
performance only in their specific area of expertise. No matter how state-of-the-art they are, they
are only state-of-the-art for one specific task. If tasked with a new problem, they would not know
what to do and would make wrong predictions. Recall the bicycle example again: Imagine you have
learned how to ride a bicycle. Without any transfer of skills, even a world champion in trick-cycling would
have to start from scratch when learning how to ride a motorcycle. The world champion would
be a rookie again. Similarly, models have to be rebuilt from scratch in conventional machine
learning. Since model training requires time and money, many problems aren’t profitable with a
traditional learning approach. Besides, most machine learning algorithms require vast amounts of
data. Deep learning models, in particular, most often need millions of data points to generate
meaningful results. These data needs are often difficult to satisfy in practice. That is also one of
the primary reasons why machine learning has mainly been a privilege of large companies:
Smaller enterprises just haven’t had the required resources to continuously feed and train
machine learning algorithms from scratch.
Transfer learning is a technique that enables algorithms to learn a new task by using pre-trained
models. Let’s see how conventional machine learning and transfer learning compare:
In traditional learning, a machine learning algorithm works in isolation. When given a large
enough dataset, it learns how to perform a specific task. However, when tasked to solve a new
problem, it cannot resort to any previously gained knowledge. Instead, a conventional algorithm
needs a second dataset to begin a new learning process.
In transfer learning, the learning of new tasks relies on previously learned tasks. The algorithm
can store and access knowledge, so the model is general instead of specific. Transformers suit this
approach well. Earlier neural networks typically had to be trained on large labelled datasets. By
finding patterns between elements mathematically, transformers reduce that need, making available the
trillions of images and petabytes of text data on the web and in corporate databases. In addition,
the math that transformers use lends itself to parallel processing, so these models can run fast.
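The contrast between training from scratch and reusing a pre-trained model can be sketched in a few lines. This is only a toy illustration under stated assumptions: `PRETRAINED_W`, `features`, `fine_tune` and the four-example data set are hypothetical names and values invented here, and a real transfer learning setup would reuse a much larger network. The point is that the reused layer stays frozen while only a small new "head" is trained on the new task.

```python
import math

# Hypothetical "pre-trained" weights, assumed to come from an earlier task.
PRETRAINED_W = [[1.0, -1.0], [0.5, 0.5]]

def features(x):
    """Frozen layer: its weights are reused and never updated."""
    return [math.tanh(sum(w * a for w, a in zip(row, x))) for row in PRETRAINED_W]

def predict(head, x):
    """Logistic output of the new head on top of the frozen features."""
    z = sum(h * f for h, f in zip(head, features(x)))
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(head, data):
    """Average binary cross-entropy over the data set."""
    total = 0.0
    for x, y in data:
        p = predict(head, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def fine_tune(data, steps=200, lr=0.5):
    """Train only the new head with plain stochastic gradient descent."""
    head = [0.0, 0.0]
    for _ in range(steps):
        for x, y in data:
            error = predict(head, x) - y
            f = features(x)
            head = [h - lr * error * fi for h, fi in zip(head, f)]
    return head

# A small, made-up data set for the new task: (input, label) pairs.
new_task = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([1.5, 0.5], 1), ([0.5, 1.5], 0)]

loss_before = mean_loss([0.0, 0.0], new_task)
head = fine_tune(new_task)
loss_after = mean_loss(head, new_task)
```

Only the two numbers in `head` are learned here, which is why a handful of examples is enough; the knowledge stored in the frozen layer is simply reused.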
The next chapter introduces the main concepts that make the Transformer model effective and
reliable. The remaining chapters are organised as follows:
In Chapter 2, we cover the important concepts needed before jumping into the
transformer model itself, such as sequential processing, the attention mechanism and
self-attention in transformer models.
In Chapter 3, we look at the Transformer model itself and how it functions.
In Chapter 4, we look at some of the applications, where the models can be used,
what training means, and the advantages and disadvantages of the
transformer model.
CHAPTER 2
UNDERSTANDING CONCEPTS
The Transformer architecture follows an encoder-decoder structure, but does not rely on
recurrence or convolutions in order to generate an output. Before turning to the
architecture of the Transformer, let’s get familiar with some important concepts that make
the model easier to understand.
Gated RNNs process tokens sequentially, maintaining a state vector that contains a
representation of the data seen after every token. To process the n-th token, the model combines
the state representing the sentence up to token n-1 with the information of the new token to create
a new state, representing the sentence up to token n. Theoretically, the information from one
token can propagate arbitrarily far down the sequence, if at every point the state continues to
encode contextual information about the token. In practice this mechanism is flawed: the
vanishing gradient problem leaves the model's state at the end of a long sentence without
precise, extractable information about preceding tokens. The dependency of token computations
on results of previous token computations also makes it hard to parallelize computation on
modern deep learning hardware. This can make the training of RNNs inefficient.
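The sequential dependency described above can be illustrated with a deliberately simplified recurrence. This is a sketch, not a real gated RNN: the scalar weights `w_state` and `w_input` and the two-dimensional token vectors are assumptions made up for the example. Each step needs the state produced by the previous step, so the loop cannot be run in parallel across tokens, and the contribution of the first token to the state shrinks as more tokens are processed.

```python
import math

def rnn_step(state, token_vec, w_state, w_input):
    """One toy recurrent update: tanh(w_state * state + w_input * token)."""
    return [math.tanh(w_state * s + w_input * x) for s, x in zip(state, token_vec)]

def run_rnn(tokens, w_state=0.5, w_input=1.0):
    """Process tokens strictly in order; step n needs the state from step n-1."""
    state = [0.0] * len(tokens[0])
    states = []
    for token_vec in tokens:  # inherently sequential loop
        state = rnn_step(state, token_vec, w_state, w_input)
        states.append(state)
    return states

# Four made-up token vectors; only the first one activates dimension 0.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
states = run_rnn(tokens)

# Trace of the first token in dimension 0 after each step.
first_token_trace = [s[0] for s in states]
```

In this toy run the values in `first_token_trace` get smaller at every step, a simple analogue of how information about early tokens fades in a long sequence.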
Consider translating an English sentence that contains the phrase “European Economic Area” into
French. One bad way to do this would be to go through each word in the English
sentence and try to spit out its French equivalent, one word at a time. That wouldn’t work well
for several reasons, but for one, some words in the French translation are flipped: it’s “European
Economic Area” in English, but “la zone économique européenne” in French. Also, French is a
language with gendered words. The adjectives “économique” and “européenne” must be in
feminine form to match the feminine object “la zone.”
Attention is a mechanism that allows a text model to “look at” every single word in the original
sentence when making a decision about how to translate words in the output sentence. Fig. 2.2.1.1,
a visualization from the original attention paper, is a sort of heat map that shows where the model is “attending”
when it outputs each word in the French sentence. As you might expect, when the model outputs
the word “européenne,” it’s attending heavily to both the input words “European” and
“Economic.” And how does the model know which words it should be “attending” to at each
time step? It’s something that’s learned from training data. By seeing thousands of examples of
French and English sentences, the model learns what types of words are interdependent. It learns
how to respect gender, plurality, and other rules of grammar. The attention mechanism has been
an extremely useful tool for natural language processing since its discovery in 2015, but in its
original form, it was used alongside recurrent neural networks. So, the innovation of the 2017
Transformers paper was, in part, to ditch RNNs entirely. That’s why the 2017 paper was called
“Attention is all you need.”
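One row of such a heat map can be reproduced with a small sketch. The numbers below are invented for illustration: `encoder_states` and `decoder_query` are hypothetical two-dimensional vectors, and the scoring function is a plain dot product chosen for brevity, which is an assumption and not necessarily the scoring used in the original attention paper. In a trained model these vectors would be learned from data.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return the attention weights and the weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

source_words = ["European", "Economic", "Area"]
# Hypothetical encoder states, one per source word (made up for this example).
encoder_states = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0]]
# Hypothetical decoder state at the step that outputs one French word.
decoder_query = [1.0, 0.0]

weights, context = attend(decoder_query, encoder_states, encoder_states)
heat_map_row = dict(zip(source_words, weights))
```

With these made-up vectors the largest weights fall on "European" and "Economic", which mirrors the kind of pattern the heat map shows; the weights themselves are what would be drawn as one row of the figure.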
The type of “vanilla” attention we just talked about helped align words across English and
French sentences, which is important for translation. But what if you’re not trying to translate
words but instead build a model that understands underlying meaning and patterns in language–a
type of model that could be used to do any number of language tasks? In general, what makes
neural networks powerful and exciting and cool is that they often automatically build up
meaningful internal representations of the data they’re trained on. When you inspect the layers of
a vision neural network, for example, you’ll find sets of neurons that “recognize” edges, shapes,
and even high-level structures like eyes and mouths. A model trained on text data might
automatically learn parts of speech, rules of grammar, and whether words are synonymous. The
better the internal representation of language a neural network learns, the better it will be at any
language task. And it turns out that attention can be a very effective way of doing just this, if it’s
turned on the input text itself.
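Turning attention on the input itself can be sketched as follows. This is a minimal version under a strong simplifying assumption: the queries, keys and values are all the raw input vectors `x`, whereas a real Transformer first applies learned linear projections to obtain them. The input vectors are made up for the example.

```python
import math

def softmax(row):
    """Normalise scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        # Compare this token with every token in the same sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        # New representation: weighted mix of all token vectors.
        out.append([sum(wj * v[i] for wj, v in zip(w, x)) for i in range(d)])
    return out

# Three made-up token vectors standing in for word embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x)
```

Every output vector in `y` is a mixture of all input vectors, so each token's new representation carries context from the whole sequence, and all rows can be computed independently of one another.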
CHAPTER 3
In the previous chapter, we looked at Attention – a ubiquitous method in modern deep learning
models. Attention is a concept that helped to improve the performance of neural machine
translation applications. In this chapter, we will look at The Transformer Model – a model that
uses attention to boost the speed with which these models can be trained. The Transformer
outperforms the Google Neural Machine Translation model in specific tasks. The biggest
benefit, however, comes from how the Transformer lends itself to parallelization. In fact,
Google Cloud recommends the Transformer as a reference model for
its Cloud TPU offering. So let’s try to break the model apart and look at how it functions.
The encoder consists of a stack of N = 6 identical layers, each composed of two sublayers:
1. The first sublayer implements a multi-head self-attention mechanism. The
multi-head mechanism implements h heads, each of which receives a (different) linearly
projected version of the queries, keys and values, to produce h outputs in parallel
that are then combined to generate a final result.
2. The second sublayer is a fully connected feed-forward network, consisting of two linear
transformations with Rectified Linear Unit (ReLU) activation in between:
FFN(x) = ReLU(x W1 + b1) W2 + b2
The six layers of the Transformer encoder apply the same linear transformations to all of the
words in the input sequence, but each layer employs different weight (W1,W2) and bias (b1,b2)
parameters to do so.
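The feed-forward sublayer can be sketched directly from its formula. The sketch assumes a row-vector convention, in which each word vector x multiplies the weight matrix from the left, and uses tiny made-up matrices; the helper names `linear` and `ffn` are chosen here for illustration.

```python
def relu(v):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, a) for a in v]

def linear(x, w, b):
    """Compute x W + b for a row vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def ffn(x, w1, b1, w2, b2):
    """Two linear transformations with a ReLU activation in between."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Made-up parameters: model width 2, hidden width 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

# The same parameters are applied to every position of the sequence.
sequence = [[1.0, 2.0], [1.0, 2.0], [-1.0, 0.0]]
outputs = [ffn(x, w1, b1, w2, b2) for x in sequence]
```

Because the same weights are applied to every position, identical input vectors give identical outputs, and the positions can be processed in parallel. A different encoder layer would simply use its own `w1`, `b1`, `w2` and `b2`.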
Furthermore, each of these two sublayers has a residual connection around it. Each sublayer is
also succeeded by a normalization layer, layernorm(.), which normalizes the sum computed
between the sublayer input, x, and the output generated by the sublayer itself, sublayer(x):
layernorm(x+sublayer(x))
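The residual connection followed by normalization can be sketched as below. Two simplifications are assumed: the learned gain and bias parameters of layer normalization are omitted, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sublayer.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add_and_norm(x, sublayer):
    """Compute layernorm(x + sublayer(x))."""
    summed = [a + b for a, b in zip(x, sublayer(x))]  # residual connection
    return layer_norm(summed)

def toy_sublayer(x):
    """Hypothetical sublayer used only for this illustration."""
    return [2.0 * a for a in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

The residual sum keeps the original input available to later layers, while the normalization keeps the scale of the values stable from layer to layer.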
The decoder also consists of a stack of N = 6 identical layers that are, each, composed of three
sublayers:
1. The first sublayer receives the previous output of the decoder stack, augments it with
positional information, and implements multi-head self-attention over it. While the
encoder is designed to attend to all words in the input sequence, regardless of their
position in the sequence, the decoder is modified to attend only to the preceding words.
Hence, the prediction for a word at position, i, can only depend on the known outputs for
the words that come before it in the sequence. In the multi-head attention mechanism
(which implements multiple, single attention functions in parallel), this is achieved by
introducing a mask over the values produced by the scaled multiplication of
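matrices Q and K. This masking is implemented by suppressing the matrix values that
would, otherwise, correspond to illegal connections, typically by setting them to
negative infinity before the softmax so that they receive zero attention weight.
2. The second sublayer implements a multi-head attention mechanism similar to the one
in the first sublayer of the encoder. Here, the queries come from the previous decoder
sublayer, while the keys and values come from the output of the encoder, which allows
the decoder to attend to all the words in the input sequence.
3. The third sublayer implements a fully connected feed-forward network, which is similar to
the one implemented in the second sublayer of the encoder.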
Furthermore, the three sublayers on the decoder side also have residual connections
around them, and are succeeded by a normalization layer.
Positional encodings are also added to the input embeddings of the decoder, in the same
manner as previously explained for the encoder.
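The masking used in the decoder's first sublayer can be sketched as follows. The example assumes that position i may attend to itself and to earlier positions of the (shifted) decoder input, and that illegal connections are suppressed by setting their scores to negative infinity before the softmax; the query and key vectors are made up, and learned projections are left out.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Normalise scores into weights; a score of -inf gets weight 0."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(q, k):
    """Scaled dot-product attention weights with a mask on future positions."""
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if j > i:
                scores.append(NEG_INF)  # illegal connection to a later position
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
        weights.append(softmax(scores))
    return weights

# Made-up decoder vectors used as both queries and keys.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(vectors, vectors)
```

The resulting weight matrix is lower triangular: the first position can only attend to itself, and no position places any weight on positions that come after it.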
The softmax function is used as the activation function in the output layer of neural network
models that predict a multinomial probability distribution. That is, softmax is used as the
activation function for multi-class classification problems where class membership is required on
more than two class labels.
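A minimal sketch of the softmax function is given below. The three logits are made-up scores for three hypothetical classes; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Map raw scores to a probability distribution over the classes."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up output-layer scores for three classes.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# The predicted class is the one with the highest probability.
predicted = probs.index(max(probs))
```

The outputs are all positive and sum to 1, so they can be read as class probabilities; in a Transformer decoder the classes would be the words of the output vocabulary.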
CHAPTER 4
Transformer models are applied to tasks such as:
Machine translation
Document summarization
Document generation
Named entity recognition (NER)
Biological sequence analysis
Video understanding
Some of the Transformer-based models, which are built using the concept and architecture of the
transformer, include:
GPT-2, GPT-3
BERT
XLNet
RoBERTa
T5
DALL-E and many more
In 2020, it was shown that the transformer architecture, more specifically GPT-2, could be tuned
to play chess. Transformers have also been applied to image processing, with results competitive
with convolutional neural networks.
4.2 Advantages
Pre-trained Transformer models bring a range of benefits to the development process of machine
learning models. The main benefits of the transfer learning they enable are the saving of resources and
improved efficiency when training new models. They can also help with training models when only unlabelled
datasets are available, as the bulk of the model will already be pre-trained.
4.3 Disadvantages
The Transformer model, when used through transfer learning, has two main limitations: negative
transfer and overfitting.
If the transfer learning ends up with a decrease in the performance or accuracy of the new model,
then it is called negative transfer. Transfer learning only works if the initial and target problems
of both models are similar enough. If the first round of training data required for the new task is
too far from the data of the old task, then the trained models might perform worse than expected.
In transfer learning, developers cannot confidently remove network layers in search of an optimal
model. Removing the first layers affects the dense layers, because the number of
trainable parameters changes. The dense layers can be a good starting point for reducing layers, but
working out how many layers and neurons to remove without causing the model to overfit
is time-consuming and challenging. Overfitting happens when the new model learns details and
noise from the training data that negatively impact its outputs.
If developers can overcome these limits of transfer learning, they can address two of the biggest
challenges faced while training AI models: data requirements and training
time.
CHAPTER 5
CONCLUSION
Transformers are taking the world of Natural Language Processing by storm. But their
architectures are relatively complex and it takes quite some time to understand them sufficiently.
In this report, we discussed the Transformer, the first sequence transduction model based entirely
on attention, which replaces the recurrent layers most commonly used in encoder-decoder
architectures with multi-headed self-attention. For translation tasks, the Transformer can be
trained significantly faster than architectures based on recurrent or convolutional layers. The future
scope is to extend the Transformer to problems involving input and output modalities other than
text and to investigate local, restricted attention mechanisms to efficiently handle large inputs
and outputs such as images, audio and video. Making generation less sequential is another
research goal.
BIBLIOGRAPHY
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of
neural machine translation architectures. CoRR, abs/1703.03906, 2017.
[4] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv
preprint arXiv:1610.02357, 2016.
[5] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. CoRR,
abs/1412.3555, 2014.
[6] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention
networks. In International Conference on Learning Representations, 2017.
[8] https://jalammar.github.io/illustrated-transformer/