Seminar Report - Transformer Model

Transformer Model



Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing
knowledge gained while solving one problem and applying it to a different but related problem.
For example, knowledge gained while learning to recognize cars could apply when trying to
recognize trucks. This area of research bears some relation to the long history of psychological
literature on transfer of learning, although practical ties between the two fields are limited. From
the practical standpoint, reusing or transferring information from previously learned tasks for the
learning of new tasks has the potential to significantly improve the sample efficiency of
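a reinforcement learning agent. Transformer models, the subject of this report, are among the
architectures most often pre-trained once and then transferred to new tasks in this way.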

1.1 What is Transformer Model?

A Transformer, or Transformer model, is a deep learning model that adopts the
mechanism of self-attention, differentially weighting the significance of each part of the input
data. It is a type of neural network used primarily in the fields of natural language processing
(NLP) and computer vision (CV).

1.2 Background

Before transformers, most state-of-the-art NLP systems relied on gated RNNs, such
as LSTM and gated recurrent units (GRUs), with added attention mechanisms. Transformers are
built on these attention technologies without using an RNN structure, highlighting the fact that
attention mechanisms alone can match the performance of RNNs with attention.

1.3 Why Transformer Models?

Imagine yourself back in the days when you tried to ride a bicycle for the first time. It was
difficult and took time. You needed to learn everything from scratch: How to keep the balance,
how to steer the wheel, how to brake. Now back to the present: Imagine you want to learn how to
ride a motorcycle. You don’t need to start from zero. It is much easier for you to learn how to
keep the balance or use the brakes. Even though you are in a different setting, you
can transfer the skills learned from riding a bicycle. That’s also the essence of transfer learning.

Fig: 1.3.1 Idea of Transfer Learning

1.4 A More Formal Definition of Transfer Learning

“Transfer learning is [...] the improvement of learning in a new task through the transfer of
knowledge from a related task that has already been learned.”

Having learned how to keep the balance on a bicycle improves your learning of how to keep the
balance on a motorcycle. Similarly, an algorithm that has learned how to recognize dogs can be
trained to recognize cats with relative ease by transferring certain abstract concepts.

1.5 How Conventional Machine Learning Algorithms Work

In brief, machine learning is the general term for when computers learn from data without being
explicitly programmed. Instead, machine learning algorithms recognize patterns in the data and
make predictions once new data arrives. If you are new to the field, we recommend that you first
read about the different disciplines of artificial intelligence.

So far, conventional machine learning algorithms have been built to learn specific tasks. They
are designed to work in isolation and this works well both in theory and practice. But training
algorithms from scratch also has drawbacks. As specialized algorithms, they reach high
performance only in their specific area of expertise. No matter how state-of-the-art they are, they
are only state-of-the-art for a specific thing. If tasked with a new problem, they would not know
what to do and make wrong predictions. Recall the bicycle example again: Imagine you have
learned how to ride a bicycle. Even if you were a world champion in trick-cycling, you would
have to start from scratch when learning how to ride a motorcycle. The world champion would
be a rookie again. Similarly, models have to be rebuilt from scratch in conventional machine
learning. Since model training requires time and money, many problems aren’t profitable with a
traditional learning approach. Besides, most machine learning algorithms require vast amounts of
data. Deep learning models, in particular, most often need millions of data points to generate
meaningful results. These data needs are often difficult to satisfy in practice. That is also one of
the primary reasons why machine learning has mainly been a privilege to large companies:
Smaller enterprises just haven’t had the required resources to continuously feed and train
machine learning algorithms from scratch.

1.6 Comparison of Transfer Learning With Traditional Machine Learning

Transfer learning is a technique that enables algorithms to learn a new task by using pre-trained
models. Let’s see how conventional machine learning and transfer learning compare:

Fig: Traditional ML vs. Transfer Learning

In traditional learning, a machine learning algorithm works in isolation. When given a large
enough dataset, it learns how to perform a specific task. However, when tasked to solve a new
problem, it cannot resort to any previously gained knowledge. Instead, a conventional algorithm
needs a second dataset to begin a new learning process.

In transfer learning, the learning of new tasks relies on previously learned tasks. The algorithm
can store and access knowledge. The model is general instead of specific. Transformers fit this
approach well: by finding patterns between elements mathematically, they eliminate the need for
large, hand-labeled training datasets, making available the trillions of images and petabytes of
text data on the web and in corporate databases. In addition, the math that transformers use lends
itself to parallel processing, so these models can run fast. The next chapter introduces the main
concepts that make the transformer model work.

1.7 Report Organization

• In Chapter 1, we looked at the introduction to transformers, the definition of
transformer models, the background study and why to use transformer-based models.

• In Chapter 2, we will cover the important concepts needed before turning to the
transformer model itself: sequential processing, the attention mechanism and
self-attention in transformer models.

• In Chapter 3, we will study the architecture of the Transformer in brief.

• In Chapter 4, we will look at some of the applications, where the models can be
used, what training means, and the advantages and disadvantages of the
transformer model.

• Chapter 5 contains the conclusion of this report.


The Transformer architecture follows an encoder-decoder structure, but does not rely on
recurrence and convolutions in order to generate an output. But before jumping into the
architecture of transformer let’s get familiar with some of the important concepts to understand
the transformer model better.

2.1 Sequential processing

Gated RNNs process tokens sequentially, maintaining a state vector that contains a
representation of the data seen after every token. To process the n-th token, the model combines
the state representing the sentence up to token n−1 with the information of the new token to create
a new state, representing the sentence up to token n. Theoretically, the information from one
token can propagate arbitrarily far down the sequence, if at every point the state continues to
encode contextual information about the token. In practice this mechanism is flawed: the
vanishing gradient problem leaves the model's state at the end of a long sentence without
precise, extractable information about preceding tokens. The dependency of token computations
on results of previous token computations also makes it hard to parallelize computation on
modern deep learning hardware. This can make the training of RNNs inefficient.
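
The sequential dependency described above can be illustrated with a minimal sketch. It assumes a plain (ungated) tanh recurrence rather than an LSTM or GRU, and the names rnn_step, run_rnn, W_h and W_x, as well as the toy weights, are hypothetical choices made only for this example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(state, token_vec, W_h, W_x):
    """Combine the state up to token n-1 with token n to get the state up to token n."""
    h = matvec(W_h, state)
    x = matvec(W_x, token_vec)
    return [math.tanh(a + b) for a, b in zip(h, x)]

def run_rnn(tokens, W_h, W_x, state_size):
    """Process the tokens strictly one after another."""
    state = [0.0] * state_size
    for token_vec in tokens:
        # Step n needs the result of step n-1, so the loop cannot be parallelized.
        state = rnn_step(state, token_vec, W_h, W_x)
    return state

# Hypothetical toy weights: 2-dimensional state and 2-dimensional token vectors.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

final_state = run_rnn(tokens, W_h, W_x, state_size=2)
print(final_state)
```

Because each call to rnn_step consumes the state produced by the previous call, the work for a sentence of n tokens forms a chain of n dependent steps, which is the parallelization obstacle mentioned in the text.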

2.2 Attention Mechanism

Attention is a neural network structure that you’ll hear about all over the place in machine
learning these days. In fact, the title of the 2017 paper that introduced Transformers wasn’t
called, We Present You the Transformer. Instead it was called Attention is All You Need.
Attention was introduced in the context of translation two years earlier, in 2015.

Consider translating an English sentence that mentions the “European Economic Area” into
French. One bad way to do this would be to go through each word in the English sentence and
try to spit out its French equivalent, one word at a time. That wouldn’t work well for several
reasons, but for one, some words in the French translation are flipped: it’s “European
Economic Area” in English, but “la zone économique européenne” in French. Also, French is a
language with gendered words. The adjectives “économique” and “européenne” must be in
feminine form to match the feminine noun “la zone.”

Attention is a mechanism that allows a text model to “look at” every single word in the original
sentence when making a decision about how to translate words in the output sentence. Here’s a
nice visualization from that original attention paper:

Fig: Figure from the paper, “Neural Machine Translation by Jointly Learning to Align and Translate”

As shown in the figure above, it’s a sort of heat map that shows where the model is “attending”
when it outputs each word in the French sentence. As you might expect, when the model outputs
the word “européenne,” it’s attending heavily to both the input words “European” and
“Economic.” And how does the model know which words it should be “attending” to at each
time step? It’s something that’s learned from training data. By seeing thousands of examples of
French and English sentences, the model learns what types of words are interdependent. It learns
how to respect gender, plurality, and other rules of grammar. The attention mechanism has been
an extremely useful tool for natural language processing since its discovery in 2015, but in its
original form, it was used alongside recurrent neural networks. So, the innovation of the 2017
Transformers paper was, in part, to ditch RNNs entirely. That’s why the 2017 paper was called
“Attention is all you need.”
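
The idea of “attending” to source words can be sketched as a softmax-weighted average. The example below is only a simplified illustration: it scores each source word with a dot product (the original attention paper learned its scores with a small network), and the 2-dimensional word vectors, the query vector and the function name attend are hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, source_vectors):
    """Return (weights, context) for one output step over all source positions."""
    scores = [sum(q * s for q, s in zip(query, vec)) for vec in source_vectors]
    weights = softmax(scores)
    dim = len(source_vectors[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, source_vectors))
               for i in range(dim)]
    return weights, context

# Hypothetical vectors for three source words.
source = {"European": [1.0, 0.0], "Economic": [0.8, 0.2], "Area": [0.0, 1.0]}
# Hypothetical decoder state while producing the word "européenne".
query = [2.0, 0.0]

weights, context = attend(query, list(source.values()))
for word, w in zip(source, weights):
    print(f"{word}: {w:.2f}")
```

Each row of the heat map in the figure corresponds to one such list of weights, computed for one output word.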

2.3 Self - Attention

The most impactful piece of the Transformer is a twist on attention called “self-attention.”
Self-attention, also called intra-attention, is an attention mechanism relating different positions of a
single sequence in order to compute a representation of the same sequence. It has been shown to
be very useful in machine reading, abstractive summarization, and image description generation.

The type of “vanilla” attention we just talked about helped align words across English and
French sentences, which is important for translation. But what if you’re not trying to translate
words but instead build a model that understands underlying meaning and patterns in language–a
type of model that could be used to do any number of language tasks? In general, what makes
neural networks powerful and exciting and cool is that they often automatically build up
meaningful internal representations of the data they’re trained on. When you inspect the layers of
a vision neural network, for example, you’ll find sets of neurons that “recognize” edges, shapes,
and even high-level structures like eyes and mouths. A model trained on text data might
automatically learn parts of speech, rules of grammar, and whether words are synonymous. The
better the internal representation of language a neural network learns, the better it will be at any
language task. And it turns out that attention can be a very effective way of doing just this, if it’s
turned on the input text itself.

For example, take these two sentences:

I. “Server, can I have the check?”

II. “Looks like I just crashed the server.”

• The word “server” here means two very different things, which we humans can easily
disambiguate by looking at surrounding words. Self-attention allows a neural network to
understand a word in the context of the words around it. So when a model processes the
word “server” in the first sentence, it might be “attending” to the word “check,” which
helps disambiguate a human server from a metal one.
• In the second sentence, the model might attend to the word “crashed” to
determine that this “server” refers to a machine. Self-attention helps neural networks
disambiguate words, do part-of-speech tagging and entity resolution, learn semantic roles
and a lot more.
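
A minimal sketch of self-attention on the second sentence is given below. It assumes scaled dot-product scoring with the queries, keys and values all equal to the input embeddings (a real Transformer first applies learned linear projections, which are omitted here). The 2-dimensional embeddings are hypothetical: the first coordinate stands for "person-like" and the second for "machine-like" meaning.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    outputs, all_weights = [], []
    for q in X:
        # Every token scores every token of the same sequence, including itself.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, X)) for i in range(d)])
        all_weights.append(w)
    return outputs, all_weights

# Hypothetical embeddings: "server" starts out ambiguous between the two senses.
sentence = ["crashed", "the", "server"]
X = [[0.0, 2.0], [0.1, 0.1], [1.0, 1.0]]

outputs, weights = self_attention(X)
print(dict(zip(sentence, weights[2])))  # where "server" attends
print(outputs[2])                       # context-aware representation of "server"
```

In this toy setting "server" attends more to "crashed" than to "the", and its new representation leans towards the machine-like coordinate.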


In the previous chapter, we looked at Attention – a ubiquitous method in modern deep learning
models. Attention is a concept that helped to improve the performance of neural machine
translation applications. In this chapter, we will look at The Transformer Model – a model that
uses attention to boost the speed with which these models can be trained. The Transformer
outperforms the Google Neural Machine Translation model in specific tasks. The biggest
benefit, however, comes from how The Transformer lends itself to parallelization. It is in fact
Google Cloud’s recommendation to use The Transformer as a reference model to use
their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.

Fig 3.1: The Transformer Model

3.1 The Encoder

The encoder consists of a stack of N = 6 identical layers, where each layer is composed of two
sublayers:

1. The first sublayer implements a multi-head self-attention mechanism. The multi-head
mechanism implements h heads, each of which receives a (different) linearly
projected version of the queries, keys and values; the heads produce h outputs in parallel
that are then combined to generate a final result.

2. The second sublayer is a fully connected feed-forward network, consisting of two linear
transformations with Rectified Linear Unit (ReLU) activation in between:

FFN(x) = ReLU(x·W1 + b1)·W2 + b2

The six layers of the Transformer encoder apply the same linear transformations to all of the
words in the input sequence, but each layer employs different weight (W1,W2) and bias (b1,b2)
parameters to do so.
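
The feed-forward sublayer can be sketched as follows, written with the row-vector convention FFN(x) = ReLU(x·W1 + b1)·W2 + b2. The helper names and the toy weights (model size 2, inner size 3) are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def ffn(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between."""
    hidden = relu(add(vecmat(x, W1), b1))
    return add(vecmat(hidden, W2), b2)

# Hypothetical toy parameters for one layer.
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, 0.5]]
b1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, 0.1]

# The same weights are applied to every position of the sequence independently.
sequence = [[1.0, 2.0], [3.0, 0.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
print(outputs)
```

Each encoder layer would hold its own W1, b1, W2 and b2, while all positions within a layer share them.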

Furthermore, each of these two sublayers has a residual connection around it. Each sublayer is
also succeeded by a normalization layer, layernorm(.), which normalizes the sum computed
between the sublayer input, x, and the output generated by the sublayer itself, sublayer(x):

layernorm(x + sublayer(x))
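
The residual connection followed by normalization, layernorm(x + sublayer(x)), can be sketched as below. This is a simplified version: the learned gain and bias parameters of layer normalization are omitted, the eps value is an assumed small constant, and toy_sublayer is a hypothetical stand-in for the attention or feed-forward sublayer.

```python
import math

def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def add_and_norm(x, sublayer):
    """Compute layernorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # Hypothetical stand-in for multi-head attention or the feed-forward network.
    return [0.5 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print(out)
```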


An important consideration to keep in mind is that the Transformer architecture cannot
inherently capture any information about the relative positions of the words in the sequence,
since it does not make use of recurrence. This information has to be injected by
adding positional encodings to the input embeddings. The positional encoding vectors are of
the same dimension as the input embeddings, and are generated using sine and cosine functions
of different frequencies. They are then simply added to the input embeddings in order
to inject the positional information.
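
A sketch of the sinusoidal positional encoding is given below. It follows the formulation of the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function names and the all-zero toy embeddings are assumptions made for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sine on even dimensions, cosine on odd dimensions, at decreasing frequencies."""
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_positions(embeddings):
    """Add the positional encoding element-wise to each input embedding."""
    d_model = len(embeddings[0])
    return [[e + p for e, p in zip(emb, positional_encoding(pos, d_model))]
            for pos, emb in enumerate(embeddings)]

# Hypothetical all-zero embeddings, so the result shows the encodings themselves.
embeddings = [[0.0] * 4 for _ in range(3)]
encoded = add_positions(embeddings)
for pos, vec in enumerate(encoded):
    print(pos, [round(v, 3) for v in vec])
```

Because every position receives a different vector, two identical words at different positions no longer look the same to the attention layers.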

3.2 The Decoder

The decoder shares several similarities with the encoder.

The decoder also consists of a stack of N = 6 identical layers, each composed of three
sublayers:

1. The first sublayer receives the previous output of the decoder stack, augments it with
positional information, and implements multi-head self-attention over it. While the
encoder is designed to attend to all words in the input sequence, regardless of their
position in the sequence, the decoder is modified to attend only to the preceding words.
Hence, the prediction for a word at position, i, can only depend on the known outputs for
the words that come before it in the sequence. In the multi-head attention mechanism
(which implements multiple single attention functions in parallel), this is achieved by
introducing a mask over the values produced by the scaled multiplication of
matrices Q and K. This masking is implemented by suppressing the matrix values that
would otherwise correspond to illegal connections, typically by setting them to −∞ before
the softmax so that they receive zero attention weight.

Multi-head attention:

Fig 3.2.1: The Multi-Head Attention in the Decoder

2. The second sublayer implements a multi-head attention mechanism, which is similar to
the one implemented in the first sublayer of the encoder. On the decoder side, this multi-head
mechanism receives the queries from the previous decoder sublayer, and the keys
and values from the output of the encoder (encoder-decoder attention). This allows the
decoder to attend to all of the words in the input sequence.

3. The third sublayer implements a fully connected feed-forward network, which is similar to
the one implemented in the second sublayer of the encoder.

Furthermore, the three sublayers on the decoder side also have residual connections
around them, and are succeeded by a normalization layer.

Positional encodings are also added to the input embeddings of the decoder, in the same
manner as previously explained for the encoder.
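
The masking used in the first decoder sublayer can be sketched as follows. The example assumes the common approach of setting disallowed scores to minus infinity before the softmax; the score matrix is a hypothetical 3 x 3 example standing in for the scaled product of Q and K.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Suppress scores[i][j] for j > i, so position i cannot see later positions."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]

# Hypothetical attention scores for a 3-word output sequence.
scores = [[0.5, 1.0, 2.0],
          [1.0, 0.2, 0.3],
          [0.1, 0.4, 0.9]]

weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 3) for w in row])
```

The resulting weight matrix is lower-triangular: the first word attends only to itself, the second to the first two words, and so on.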

3.2.1 What does softmax activation do?

The softmax function is used as the activation function in the output layer of neural network
models that predict a multinomial probability distribution. That is, softmax is used as the
activation function for multi-class classification problems where class membership is required on
more than two class labels.
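
As a small illustration, softmax maps a vector of raw scores z to probabilities exp(z_i) / sum_j exp(z_j). The three-word vocabulary and the scores in the sketch below are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtracting the maximum avoids overflow without changing the result
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer scores over a three-word vocabulary.
vocab = ["le", "la", "zone"]
logits = [1.0, 3.0, 0.5]

probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
print(dict(zip(vocab, probs)), predicted)
```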

3.3 Sum Up: The Transformer Model

The Transformer model runs as follows:
1. Each word forming an input sequence is transformed into a d_model-dimensional
embedding vector.
2. Each embedding vector representing an input word is augmented by adding to it
(element-wise) a positional encoding vector of the same d_model length, hence
introducing positional information into the input.
3. The augmented embedding vectors are fed into the encoder block, consisting of the two
sublayers explained above. Since the encoder attends to all words in the input sequence,
irrespective of whether they precede or succeed the word under consideration, the
Transformer encoder is bidirectional.
4. The decoder receives as input its own predicted output word at time-step t–1.
5. The input to the decoder is also augmented by positional encoding, in the same manner as
on the encoder side.
6. The augmented decoder input is fed into the three sublayers comprising the decoder
block explained above. Masking is applied in the first sublayer, in order to stop the
decoder from attending to succeeding words. At the second sublayer, the decoder also
receives the output of the encoder, which now allows the decoder to attend to all of the
words in the input sequence.
7. The output of the decoder finally passes through a fully connected layer, followed by a
softmax layer, to generate a prediction for the next word of the output sequence.
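
The way the decoder consumes its own previous predictions can be sketched as a simple loop. The sketch assumes greedy decoding (always picking the most probable next word; other strategies such as beam search exist), and toy_next_word_probs is a hypothetical lookup table standing in for the whole encoder, decoder and softmax stack. The start and end markers are likewise assumed names.

```python
def greedy_decode(encoder_output, next_word_probs, start="<s>", end="</s>", max_len=10):
    """Feed the decoder its own previous outputs until it predicts the end marker."""
    output = [start]
    while len(output) < max_len:
        # Returns a dict mapping each candidate word to its probability.
        probs = next_word_probs(encoder_output, output)
        word = max(probs, key=probs.get)
        output.append(word)
        if word == end:
            break
    return output[1:]

def toy_next_word_probs(encoder_output, decoded_so_far):
    # Hypothetical stand-in for the model: depends only on the last decoded word.
    table = {
        "<s>": {"la": 0.7, "zone": 0.2, "</s>": 0.1},
        "la": {"zone": 0.8, "la": 0.1, "</s>": 0.1},
        "zone": {"</s>": 0.9, "la": 0.1},
    }
    return table[decoded_so_far[-1]]

result = greedy_decode(encoder_output=None, next_word_probs=toy_next_word_probs)
print(result)
```

In a real Transformer the encoder output is computed once per input sentence, while the decoder is run again for every generated word.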


4.1 Applications of Transformer Model

The transformer has had great success in natural language processing (NLP), for example in the
tasks of machine translation and time series prediction. Many pretrained models such as GPT-2,
GPT-3, BERT, XLNet, and RoBERTa demonstrate the ability of transformers to perform a
wide variety of such NLP-related tasks, and have the potential to find real-world applications.
These may include:

• Machine translation
• Document summarization
• Document generation
• Named entity recognition (NER)
• Biological sequence analysis
• Video understanding

Some of the models built on the concept and architecture of the
transformer include:

• BERT
• GPT-2, GPT-3
• XLNet
• T5
• DALL-E and many more

In 2020, it was shown that the transformer architecture, more specifically GPT-2, could be tuned
to play chess. Transformers have been applied to image processing with results competitive
with convolutional neural networks.

4.2 Advantages
Pre-trained Transformer models bring a range of benefits to the development of machine learning
models. The main benefits of transfer learning include the saving of resources and improved
efficiency when training new models. It can also help with training models when only unlabelled
datasets are available, as the bulk of the model will be pre-trained.

• Saving on training data
• Efficiently training multiple models
• Leveraging knowledge to solve new challenges
• Simulated training to prepare for real-world tasks

4.3 Disadvantages
The Transformer model, when used through transfer learning, has two main limitations or
disadvantages:

1. The problem of Negative Transfer

If the transfer learning ends up with a decrease in the performance or accuracy of the new model,
then it is called negative transfer. Transfer learning only works if the initial and target problems
of both models are similar enough. If the first round of training data required for the new task is
too far from the data of the old task, then the trained models might perform worse than expected.

2. The problem of Overfitting

In transfer learning, developers cannot remove network layers with confidence when searching for
an optimal model. If they remove the first layers, this affects the dense layers, because the number of
trainable parameters changes. The dense layers can be a good place to reduce the network, but
analyzing how many layers and neurons to remove so that the model does not overfit
is time-consuming and challenging. Overfitting happens when the new model learns details and
noise from the training data that negatively impact its outputs.

If developers can overcome the limits of transfer learning, they can then solve two of the biggest,
if not all, challenges faced while training AI models, which are data requirement and training


Transformers are taking the world of Natural Language Processing by storm, but their
architectures are relatively complex and take quite some time to understand sufficiently.
In this report, we reviewed the Transformer, the first sequence transduction model based entirely
on attention, which replaces the recurrent layers most commonly used in encoder-decoder
architectures with multi-headed self-attention. For translation tasks, the Transformer can be
trained significantly faster than architectures based on recurrent or convolutional layers. The future
scope is to extend the Transformer to problems involving input and output modalities other than
text and to investigate local, restricted attention mechanisms to efficiently handle large inputs
and outputs such as images, audio and video. Making generation less sequential is another
research goal.

