DAA Final Report
Semester Project
13/05/2024
Title: Analysis of the Transformer Architecture and Attention Mechanism
Team Members:
In order to understand the importance of the Transformer neural network architecture, we must look at the background of its invention and the reasons that drove its development. To do this, we first start with the Encoder-Decoder architecture developed for sequence-to-sequence (seq2seq) models.
Encoder-Decoder Architecture:
The Encoder-Decoder architecture, introduced in the paper "Sequence to Sequence Learning with Neural Networks" by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, addresses the need for a model capable of translating sequences between domains. Before its advent, CNNs were used for spatial data such as images, while RNNs handled sequential data such as time series or text. This architecture, also known as the Sequence-to-Sequence (Seq2Seq) model, combines an encoder and a decoder: the encoder processes the input sequence, while the decoder generates the output sequence. Its components are described below, followed by a minimal code sketch after the list:
Encoder:
Processes the input sequence with an RNN (e.g., LSTM or GRU).
Compresses the entire input into a fixed-size context vector that is passed to the decoder.
Decoder:
Takes the context vector from the encoder and generates the output sequence.
Utilizes an RNN (e.g., LSTM or GRU) to produce the output sequence.
Generates words one at a time based on the previous word and the current hidden state.
Continues until the end-of-sequence token or the maximum length is reached.
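To make the data flow concrete, below is a minimal PyTorch sketch of such a seq2seq model; the GRU cells, the greedy decoding loop, and all names and sizes are illustrative assumptions rather than the exact model from Sutskever et al.

import torch
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    # Minimal encoder-decoder: the encoder compresses the whole source sequence
    # into one fixed-size context vector, which the decoder then unrolls.
    def __init__(self, src_vocab, trg_vocab, hidden=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.trg_embed = nn.Embedding(trg_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, trg_vocab)

    def forward(self, src_ids, max_len=30, bos_id=1):
        # the encoder's final hidden state is the fixed-size context vector
        _, context = self.encoder(self.src_embed(src_ids))
        token = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
        hidden, outputs = context, []
        for _ in range(max_len):
            # generate one token at a time from the previous token and hidden state
            step_out, hidden = self.decoder(self.trg_embed(token), hidden)
            token = self.out(step_out[:, -1]).argmax(dim=-1, keepdim=True)  # greedy choice
            outputs.append(token)
        return torch.cat(outputs, dim=1)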
The Encoder-Decoder architecture, while effective for many sequence-to-sequence tasks, faces limitations, particularly with longer sequences. One significant constraint is that the context vector generated by the encoder must encapsulate all input-sequence information within a single fixed-size representation; in practice, quality degrades noticeably once inputs exceed roughly 30 words. For longer sequences the context vector struggles to retain all pertinent details, leading to loss of context or information compression. Consequently, the decoder may have difficulty generating accurate outputs for lengthy input sequences. This limitation is particularly consequential in tasks such as translation or summarization, where maintaining the semantic integrity of the entire input sequence is paramount.
To mitigate this fixed-context-vector bottleneck, Bahdanau et al. introduced an attention mechanism that lets the decoder look back at all encoder hidden states and weight them at every output step. Their architecture nevertheless faced challenges, primarily due to its reliance on LSTM networks for sequence modeling. LSTMs process inputs sequentially, causing computational inefficiencies during training, particularly with longer sequences. Moreover, while the attention mechanism effectively captured local dependencies, it struggled to integrate global context information efficiently. As a consequence, the model encountered difficulties in handling longer input sequences, leading to suboptimal performance, especially in tasks like translating lengthy sentences.
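To illustrate the mechanism they introduced, the sketch below implements additive (Bahdanau-style) attention, which scores every encoder hidden state against the current decoder state and forms a weighted context vector; the layer sizes and names are assumptions for illustration only.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Bahdanau-style scoring: score(s, h_j) = v^T tanh(W s + U h_j)
    def __init__(self, hidden=256):
        super().__init__()
        self.w = nn.Linear(hidden, hidden)   # projects the decoder state s
        self.u = nn.Linear(hidden, hidden)   # projects each encoder state h_j
        self.v = nn.Linear(hidden, 1)

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (B, H), encoder_states: (B, S, H)
        scores = self.v(torch.tanh(self.w(decoder_state).unsqueeze(1) + self.u(encoder_states)))
        weights = torch.softmax(scores, dim=1)           # (B, S, 1) attention weights
        context = (weights * encoder_states).sum(dim=1)  # (B, H) weighted context vector
        return context, weights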
The Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. in June 2017 (arXiv:1706.03762), revolutionized natural language processing by offering an alternative to traditional sequential models like LSTMs. Unlike LSTM-based models, which process input sequences token by token, the Transformer uses a self-attention mechanism. This mechanism allows each word in the input sequence to attend to all other words simultaneously, enabling the model to capture relationships between distant words more effectively. By considering the entire input sequence at once, the Transformer can assign higher attention weights to relevant words during output-token generation, improving its contextual understanding and prediction accuracy. Additionally, the Transformer employs multi-head attention, enabling it to capture various types of dependencies and attend to different parts of the input sequence concurrently. This enhances the model's representational capacity and lets it capture intricate patterns in the data more efficiently.
Algorithm:
A Transformer consists of two main components: an encoder and a decoder. The encoder takes the input sequence and generates a representation of it, and the decoder uses this representation to generate the output sequence. Note that the paper stacks six encoder layers on top of each other in the encoder block, and the same number of decoder layers in the decoder block. Following are the key components of the algorithm:
1. Embedding Layer:
The embedding layer converts input tokens into dense vectors of fixed size, representing each word in the input sequence. This layer essentially maps the input tokens to their corresponding embedding vectors in a continuous vector space. The layer holds vocab_size * dmodel parameters, where vocab_size is the size of the vocabulary and dmodel is the dimension of the model; looking up a sequence of N tokens costs O(N * dmodel).
Implementation:
import copy
import math
import torch
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, vocab_size, dmodel=512) -> None:
        # dmodel -> embedding dimension
        super(Embedding, self).__init__()
        self.vocab_size = vocab_size
        self.dmodel = dmodel
        self.embed_layer = nn.Embedding(self.vocab_size, self.dmodel)

    def forward(self, x):
        embed_out = self.embed_layer(x)
        # "In the embedding layers, we multiply those weights by sqrt(dmodel)" -> paper, page 5
        return embed_out * math.sqrt(self.dmodel)
2. Positional Encoding:
Since self-attention by itself has no notion of token order, the Transformer adds a positional encoding to each embedding. Sine and cosine functions of different frequencies are used, so each position receives a unique, deterministic pattern that the model can exploit to reason about absolute and relative positions. Building the encoding table costs O(max_seq_len * dmodel).
Implementation:
class PositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, d_model=512) -> None:
        # d_model -> embedding model dimension
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model
        pos = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        # we know a^-x equals 1/a^x, so 10000^(-2i/d_model) = 1/10000^(2i/d_model)
        frequency = torch.pow(10000, -torch.arange(0, d_model, 2, dtype=torch.float) / self.d_model)
        pe = torch.zeros((max_seq_len, d_model))
        pe[:, 0::2] = torch.sin(pos * frequency)   # even dimensions
        pe[:, 1::2] = torch.cos(pos * frequency)   # odd dimensions
        self.register_buffer('pe', pe)

    def forward(self, embed_vect):
        # slice the table to the current sequence length and add it to the embeddings
        pe = self.pe[:embed_vect.size(1)]
        return embed_vect + pe
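A quick sanity check of the class above (the dimensions are arbitrary): position 0 receives sin(0) = 0 in the even columns and cos(0) = 1 in the odd ones, and adding the encoding leaves the embedding shape unchanged.

pe = PositionalEncoding(max_seq_len=10, d_model=8)
dummy_embeddings = torch.zeros(2, 10, 8)   # (batch, sequence length, d_model)
print(pe.pe[0])                            # tensor([0., 1., 0., 1., 0., 1., 0., 1.])
print(pe(dummy_embeddings).shape)          # torch.Size([2, 10, 8])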
3. Attention Layer:
The attention layer is the core of the Transformer. Queries, keys, and values are obtained from the input through learned linear projections, and scaled dot-product attention is computed as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Multi-head attention runs this computation n_head times in parallel on d_model/n_head-dimensional projections and concatenates the results, letting the model attend to different kinds of relationships simultaneously. The time complexity of self-attention over a sequence of length N is O(N^2 * dmodel).
Implementation:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_head=8, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        self.n_head = n_head
        self.dropout = nn.Dropout(p=dropout_rate)
        self.head_dim = d_model // n_head
        self.softmax_layer = nn.Softmax(dim=-1)
        self.w_key = nn.Linear(d_model, d_model)
        self.w_query = nn.Linear(d_model, d_model)
        self.w_value = nn.Linear(d_model, d_model)
        self.output_project = nn.Linear(d_model, d_model)

    def attention(self, key, query, value, mask=None):
        # scaled dot-product attention
        # query = (BS, NH, S/T, HD), key.transpose(-2, -1) = (BS, NH, HD, S/T)
        # attention score size: encoder self-attention = (BS, NH, S, S),
        # decoder self-attention = (BS, NH, T, T), encoder-decoder attention = (BS, NH, T, S)
        attention_score = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim)
        # apply masking (masked_fill is not in-place, so the result must be reassigned)
        if mask is not None:
            attention_score = attention_score.masked_fill(mask == torch.tensor(False), float("-inf"))
        attention_weights = self.dropout(self.softmax_layer(attention_score))
        return torch.matmul(attention_weights, value)

    def forward(self, key, query, value, mask=None):
        batch_size = query.size(0)
        # project and split into heads: (BS, S, d_model) -> (BS, NH, S, HD)
        key = self.w_key(key).view(batch_size, -1, self.n_head, self.head_dim).transpose(1, 2)
        query = self.w_query(query).view(batch_size, -1, self.n_head, self.head_dim).transpose(1, 2)
        value = self.w_value(value).view(batch_size, -1, self.n_head, self.head_dim).transpose(1, 2)
        attention_out = self.attention(key, query, value, mask)
        # merge heads back: (BS, NH, S, HD) -> (BS, S, d_model)
        attention_out = attention_out.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.output_project(attention_out)
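A brief shape check for the layer above (batch size and sequence length are arbitrary): self-attention returns a tensor of the same shape as its input, since the heads are concatenated back to d_model.

mha = MultiHeadAttention(d_model=512, n_head=8)
x = torch.randn(2, 10, 512)                # (batch, sequence length, d_model)
print(mha(key=x, query=x, value=x).shape)  # torch.Size([2, 10, 512])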
4. Position-Wise Feed-Forward Network:
This layer applies a fully connected feed-forward network to each position in the
sequence separately and identically. It introduces non-linearity into the transformer
architecture, enabling the model to capture complex relationships between words and
their positions. The time complexity of the feed-forward network is typically O(N * d^2),
where N is the sequence length and d is the dimensionality of the input. This complexity
arises from the fact that the feed-forward network applies two linear transformations
(typically represented as fully connected layers) to each position in the sequence.
Implementation:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model=512, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        hidden_width = 4   # inner dimension is 4 * d_model, as in the paper (2048 for d_model = 512)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.linear1 = nn.Linear(d_model, d_model * hidden_width)
        self.linear2 = nn.Linear(d_model * hidden_width, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # FFN(x) = linear2(dropout(relu(linear1(x))))
        return self.linear2(self.dropout(self.relu(self.linear1(x))))
5. Sublayer:
The sublayer wrapper implements the residual connection followed by layer normalization, LayerNorm(x + Sublayer(x)), which the paper applies around every attention and feed-forward sub-layer.
Implementation:
class SubLayer(nn.Module):
    def __init__(self, d_model=512) -> None:
        super(SubLayer, self).__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sub_layer_x):
        return self.norm(x + sub_layer_x)
6. Encoder Layer:
Each encoder layer combines a multi-head self-attention sub-layer with a position-wise feed-forward network, each wrapped in a residual connection and layer normalization, enabling the model to capture complex relationships between words and their positions. The time complexity of the encoder layer is dominated by the multi-head self-attention layer, which is typically O(N^2 * d), where N is the sequence length and d is the dimensionality of the input.
Implementation:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, multi_head_attention_layer, position_wise_feedforward_layer, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        self.multi_head_attention_layer = copy.deepcopy(multi_head_attention_layer)
        self.dropout1 = nn.Dropout(p=dropout_rate)
        self.sublayer1 = SubLayer(d_model)
        self.position_wise_feedforward_layer = copy.deepcopy(position_wise_feedforward_layer)
        self.dropout2 = nn.Dropout(p=dropout_rate)
        self.sublayer2 = SubLayer(d_model)

    def forward(self, vec_representation, src_mask=None):
        # compute self-attention
        attention_score = self.multi_head_attention_layer(key=vec_representation, query=vec_representation, value=vec_representation, mask=src_mask)
        attention_score = self.dropout1(attention_score)
        # residual connection + layer norm
        attention_out = self.sublayer1(vec_representation, attention_score)
        # pass through the position-wise feed-forward network
        position_wise_feedforward_out = self.position_wise_feedforward_layer(attention_out)
        position_wise_feedforward_out = self.dropout2(position_wise_feedforward_out)
        # residual connection + layer norm
        encoder_out = self.sublayer2(attention_out, position_wise_feedforward_out)
        return encoder_out
7. Encoder:
The encoder is composed of a stack of identical encoder layers. Each layer consumes the output of the previous one, so the stack captures increasingly abstract, hierarchical representations of the input sequence. The time complexity of the encoder is the per-layer complexity multiplied by the number of stacked layers.
Implementation:
def get_clone(module, num_layer):
    # helper used by the encoder and decoder blocks: deep-copies a layer
    # num_layer times so every copy gets its own parameters
    return nn.ModuleList([copy.deepcopy(module) for _ in range(num_layer)])

class EncoderBlock(nn.Module):
    def __init__(self, encoder_layer, num_layer=6) -> None:
        super().__init__()
        self.encoder_layer = encoder_layer
        self.encoder_layer_list = get_clone(self.encoder_layer, num_layer)

    def forward(self, src_embedding, src_mask=None):
        encoder_out = src_embedding
        for encoder_layer in self.encoder_layer_list:
            encoder_out = encoder_layer(encoder_out, src_mask)
        return encoder_out
8. Decoder Layer:
Each decoder layer contains three sub-layers: masked multi-head self-attention over the already-generated target tokens, encoder-decoder attention in which the queries come from the decoder while the keys and values come from the encoder output, and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection and layer normalization. Its cost is likewise dominated by attention: O(T^2 * d) for self-attention and O(T * S * d) for encoder-decoder attention, where T and S are the target and source lengths.
Implementation:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, multi_head_attention_layer, position_wise_feedforward_layer, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        self.decoder_attention_layer = copy.deepcopy(multi_head_attention_layer)
        self.dropout1 = nn.Dropout(p=dropout_rate)
        self.sublayer1 = SubLayer(d_model)
        self.encoder_decoder_attention_layer = copy.deepcopy(multi_head_attention_layer)
        self.dropout2 = nn.Dropout(p=dropout_rate)
        self.sublayer2 = SubLayer(d_model)
        self.position_wise_feedforward_layer = position_wise_feedforward_layer
        self.dropout3 = nn.Dropout(p=dropout_rate)
        self.sublayer3 = SubLayer(d_model)

    def forward(self, enc, dec, src_mask=None, target_mask=None):
        # masked self-attention over the decoder input
        decoder_attention_out = self.decoder_attention_layer(key=dec, query=dec, value=dec, mask=target_mask)
        decoder_attention_out = self.dropout1(decoder_attention_out)
        decoder_attention_out = self.sublayer1(dec, decoder_attention_out)
        # encoder-decoder attention: queries from the decoder, keys/values from the encoder output
        enc_dec_attention_out = self.encoder_decoder_attention_layer(key=enc, query=decoder_attention_out, value=enc, mask=src_mask)
        enc_dec_attention_out = self.dropout2(enc_dec_attention_out)
        enc_dec_attention_out = self.sublayer2(decoder_attention_out, enc_dec_attention_out)
        # position-wise feed-forward network (third sub-layer uses dropout3/sublayer3)
        ffn_out = self.position_wise_feedforward_layer(enc_dec_attention_out)
        ffn_out = self.dropout3(ffn_out)
        ffn_out = self.sublayer3(enc_dec_attention_out, ffn_out)
        return ffn_out
9. Decoder Block:
The decoder block is a stack of identical decoder layers, enabling the model to generate
the output sequence token by token while attending to the input sequence. Each
decoder layer consists of a multi-head self-attention layer, an encoder-decoder
attention layer, and a position-wise feed-forward network. The time complexity of the
decoder block is determined by the number of decoder layers stacked and the
complexity of each layer.
Implementation:
class DecoderBlock(nn.Module):
    def __init__(self, decoder_layer, num_layer=6) -> None:
        super().__init__()
        self.decoder_layer = decoder_layer
        self.decoder_layer_list = get_clone(self.decoder_layer, num_layer)
        self.layer_norm = nn.LayerNorm(self.decoder_layer.d_model)

    def forward(self, encoder_out_vec, decoder_embedding, src_mask=None, target_mask=None):
        dec_out = decoder_embedding
        for decoder_layer in self.decoder_layer_list:
            dec_out = decoder_layer(enc=encoder_out_vec, dec=dec_out, src_mask=src_mask, target_mask=target_mask)
        return dec_out
10. Transformer:
The full Transformer ties the components together: source and target embeddings with positional encoding, a six-layer encoder block, a six-layer decoder block, and a final generator that projects the decoder output onto the target vocabulary.
Implementation:
class Transformers(nn.Module):
    def __init__(self, src_seq_len, trg_seq_len, d_model, num_head, dropout_rate=0.2) -> None:
        super().__init__()
        # note: src_seq_len / trg_seq_len double as vocabulary sizes (for the embeddings
        # and the generator) and as maximum sequence lengths (for the positional encodings)
        self.src_seq_len = src_seq_len
        self.trg_seq_len = trg_seq_len
        self.d_model = d_model
        self.num_head = num_head
        self.src_embedding = Embedding(self.src_seq_len, self.d_model)
        self.src_pe = PositionalEncoding(self.src_seq_len, self.d_model)
        self.trg_embedding = Embedding(self.trg_seq_len, self.d_model)
        self.trg_pe = PositionalEncoding(self.trg_seq_len, self.d_model)
        self.multi_head_attention = MultiHeadAttention(d_model, num_head, dropout_rate)
        self.position_wise_feedforward = PositionWiseFeedForward(self.d_model, dropout_rate)
        self.encoder_layer = EncoderLayer(d_model, self.multi_head_attention, self.position_wise_feedforward, dropout_rate)
        self.decoder_layer = DecoderLayer(d_model, self.multi_head_attention, self.position_wise_feedforward, dropout_rate)
        self.encoder_block = EncoderBlock(self.encoder_layer, num_layer=6)
        self.decoder_block = DecoderBlock(self.decoder_layer, num_layer=6)
        self.decoder_out_gen = DecoderGenerator(d_model, self.trg_seq_len)

    def forward(self, src_token_id, target_token_id, src_mask=None, target_mask=None):
        encode_out = self.encode(src_token_id, src_mask)
        decode_out = self.decode(encode_out, target_token_id, src_mask, target_mask)
        return decode_out

    def encode(self, src_token_id, src_mask):
        embed = self.src_embedding(src_token_id)
        pe_out = self.src_pe(embed)
        encoder_out = self.encoder_block(pe_out, src_mask)
        return encoder_out

    def decode(self, enc_out, trg_token_ids, src_mask=None, target_mask=None):
        # target tokens go through the target embedding and positional encoding
        embed = self.trg_embedding(trg_token_ids)
        pe_out = self.trg_pe(embed)
        decoder_out = self.decoder_block(enc_out, pe_out, src_mask, target_mask)
        decoder_out = self.decoder_out_gen(decoder_out)
        return decoder_out
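The listing above references a DecoderGenerator that is not shown in the report; a minimal sketch is given below, assuming it is the paper's final linear projection over the target vocabulary followed by log-softmax. The end-to-end call that follows uses hypothetical vocabulary sizes and a lower-triangular target mask (True where attention is allowed, matching the masked_fill convention in the attention layer).

class DecoderGenerator(nn.Module):
    # Final projection: d_model -> target vocabulary size, followed by log-softmax.
    def __init__(self, d_model, trg_vocab_size):
        super().__init__()
        self.projection = nn.Linear(d_model, trg_vocab_size)

    def forward(self, decoder_out):
        return torch.log_softmax(self.projection(decoder_out), dim=-1)

# End-to-end sanity check with hypothetical sizes.
model = Transformers(src_seq_len=1000, trg_seq_len=1200, d_model=512, num_head=8)
src_ids = torch.randint(0, 1000, (2, 12))           # (batch, source length)
trg_ids = torch.randint(0, 1200, (2, 9))            # (batch, target length)
target_mask = torch.tril(torch.ones(9, 9)).bool()   # causal mask: True = may attend
print(model(src_ids, trg_ids, src_mask=None, target_mask=target_mask).shape)
# torch.Size([2, 9, 1200])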
Putting the pieces together, the attention cost of the "Attention Is All You Need" Transformer is roughly O((L_encoder + L_decoder) * N^2 * d_model), where L_encoder and L_decoder are the number of layers in the encoder and decoder respectively, N is the sequence length, and d_model is the embedding dimensionality. The number of heads h does not add a further factor, because the h heads each operate on d_model/h dimensions and together cost O(N^2 * d_model) per layer.
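As a back-of-the-envelope illustration (assuming the paper's base configuration of 6 encoder and 6 decoder layers with d_model = 512, and counting only the dominant self-attention term), the quadratic growth in sequence length is easy to see numerically:

def attention_ops(num_layers=12, n=100, d_model=512):
    # dominant term: each layer's attention costs on the order of n^2 * d_model multiply-adds
    return num_layers * n * n * d_model

print(f"{attention_ops(n=100):.1e}")    # ~6.1e+07 operations
print(f"{attention_ops(n=1000):.1e}")   # ~6.1e+09, i.e. 100x more work for 10x the length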
Conclusion:
The analysis of the Transformer architecture presented in the "Attention Is All You Need" paper has yielded valuable insights into its revolutionary impact on natural language processing. The project elucidated the novel self-attention mechanism employed by Transformers, highlighting its ability to capture complex dependencies across input sequences more effectively than traditional sequential models such as LSTMs, as explained above.
Honest Assessment
The Transformer's self-attention mechanism and multi-head attention posed challenges; fully grasping their intricacies took considerable effort.
Future work on this project is to investigate how the complexity of the multi-head attention mechanism can be reduced: although the attention mechanism is essential, its quadratic cost leads to excessive computation for long inputs. One related work we found is the article "Interactive Multi-Head Self-Attention with Linear Complexity" by Hankyul Kang, Ming-Hsuan Yang, and Jongbin Ryu, which proposes a method to reduce the complexity of multi-head self-attention in Transformer architectures. The authors introduce an interactive approach that enables efficient computation of attention scores by iteratively refining attention weights through interactions between different heads. By incorporating these interactive mechanisms, the proposed method achieves linear time complexity with respect to the input sequence length, making it computationally efficient even for long sequences. This offers a promising way to improve the scalability of Transformer models while preserving their ability to capture complex dependencies across input sequences, and the article can be studied further as part of our future work.
In retrospect, expanding the scope of the project to include a broader range of Transformer
variants and their applications could have provided a more holistic view. Additionally,
integrating discussions on potential challenges and limitations faced in real-world deployment
scenarios would have added practical relevance to the analysis. Overall, refining the project's
scope and incorporating practical implementation aspects would enhance its value and
applicability in understanding and leveraging Transformer architectures effectively.
Application:
An application of the Transformer architecture explained above can be found in the following Jupyter notebook, where the architecture is used to train a model that translates English sentences into Hindi:
https://colab.research.google.com/drive/1Etkh5grys2mKqNxPy09l9qzHynxL5-GU?usp=sharing#scrollTo=nsWsVfjHGcjh
References:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... &
Polosukhin, I. (2017). Attention is all you need. In Advances in neural information
processing systems (pp. 6000-6010).
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural
networks. In Advances in neural information processing systems (pp. 3104-3112).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473.
Kang, H., Yang, M. H., & Ryu, J. (2020). Interactive Multi-Head Self-Attention with Linear
Complexity. arXiv preprint arXiv:2006.03236.