DAA Final Report


Design and Analysis of Algorithm

Semester Project

Malik Shahzaib Khan (406702)
Muhammad Bilal Khan (406977)
Muhammad Faiq Qazi (406483)

13/05/2024
Title: Analysis of the Transformer Architecture and Attention Mechanism

Team Members:

 Malik Shahzaib Khan (406702)
 Muhammad Bilal Khan (406977)
 Muhammad Faiq Qazi (406483)

Real World Problem:

Before the Transformer design, standard sequence-to-sequence approaches, such as recurrent
neural networks (RNNs), had difficulty capturing long-range dependencies in sequences. These
older approaches struggled to keep context across long sequences, resulting in limitations in
tasks like machine translation, text summarization, and language understanding. Our project
aims to address this issue by analyzing and utilizing the Transformer architecture for a variety
of natural language processing applications. Specifically, we want to examine how the
Transformer model's attention mechanism helps capture long-range dependencies in sequences,
allowing for more effective processing of the input data.

Motivation and Purpose:

To understand the importance of the Transformer neural network architecture, we must first
look at the background of its invention and the reasons that drove its development. To do this,
we begin with the Encoder-Decoder architecture developed for sequence-to-sequence (seq2seq)
models.

Encoder-Decoder Architecture:

The Encoder-Decoder architecture, introduced in the paper "Sequence to Sequence Learning
with Neural Networks" by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, addresses the need for
a model capable of translating sequences between domains. Before its advent, CNNs were
utilized for spatial data like images, and RNNs handled sequential data such as time series or
text. This architecture, also known as Sequence-to-Sequence (Seq2Seq) modelling, combines an
encoder and a decoder: the encoder processes the input sequence, while the decoder generates
the output sequence. Following are the components of this architecture (a minimal code sketch
follows the list below):
 Encoder:
    Processes the input sequence into a fixed-size context vector.
    Converts each word into a dense vector representation.
    Uses an RNN (e.g., LSTM or GRU) to capture word semantics and context.
    The final hidden state of the RNN serves as the context vector.

 Decoder:
    Takes the context vector from the encoder and generates the output sequence.
    Utilizes an RNN (e.g., LSTM or GRU) to produce the output sequence.
    Generates words one at a time based on the previous word and the current hidden state.
    Continues until the end-of-sequence token or the maximum length is reached.
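
To make the components above concrete, the following is a minimal sketch of an LSTM-based
encoder and decoder in PyTorch. The class names and hyperparameters (Seq2SeqEncoder,
Seq2SeqDecoder, embed_dim, hidden_dim) are illustrative assumptions, not code from the
original paper.

import torch
import torch.nn as nn

class Seq2SeqEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # (batch, src_len) -> (batch, src_len, embed_dim)
        embedded = self.embedding(src_tokens)
        # the final hidden/cell states act as the fixed-size context vector
        _, (hidden, cell) = self.lstm(embedded)
        return hidden, cell

class Seq2SeqDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden, cell):
        # generate one step at a time from the previous token and current hidden state
        embedded = self.embedding(prev_token)                 # (batch, 1, embed_dim)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        logits = self.out(output)                             # (batch, 1, vocab_size)
        return logits, hidden, cell

In practice the decoder is called in a loop, feeding each predicted token back in as prev_token
until the end-of-sequence token is produced.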

Disadvantage of the Encoder-Decoder Architecture:

The Encoder-Decoder architecture, while effective for many sequence-to-sequence tasks, faces
limitations, particularly with longer sequences. One significant constraint is that the context
vector produced by the encoder must encapsulate all of the input sequence's information in a
single fixed-size representation, which in practice works well only for relatively short sequences
(around 30 words). For longer sequences, the context vector may struggle to retain all pertinent
details, potentially leading to loss of context or information compression. Consequently, the
decoder may have difficulty in accurately generating outputs for lengthy input sequences. This
limitation is particularly consequential in tasks such as translation or summarization, where
maintaining the semantic integrity of the entire input sequence is paramount.

Solution to Encoder-Decoder Architecture by Attention:

Bahdanau et al. addressed the Encoder-Decoder architecture's struggle with comprehending
long sequences by introducing an attention mechanism in their paper "Neural Machine
Translation by Jointly Learning to Align and Translate" ([1409.0473] Neural Machine
Translation by Jointly Learning to Align and Translate). Published in 2014, this mechanism
allowed the model to focus selectively on relevant segments of the input sequence, overcoming
the limited capacity of the context vector in the encoder-decoder architecture. By dynamically
aligning input and output sequences, the Bahdanau attention mechanism offered a way to
handle longer sequences effectively. This improvement led to enhanced performance in tasks
like neural machine translation, as demonstrated by the attention weights assigning higher
importance to specific input embeddings during the decoding process.
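
As a rough illustration of how the decoder weighs the encoder states, the following is a minimal
sketch of additive (Bahdanau-style) attention in PyTorch; the module name and dimensions are
assumptions for illustration, not the authors' reference implementation.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.w_decoder = nn.Linear(hidden_dim, hidden_dim)
        self.w_encoder = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden_dim); encoder_outputs: (batch, src_len, hidden_dim)
        # score_i = v^T tanh(W1 * s_{t-1} + W2 * h_i) for every encoder position i
        scores = self.v(torch.tanh(
            self.w_decoder(decoder_state).unsqueeze(1) + self.w_encoder(encoder_outputs)
        ))                                                 # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)             # attention weights over source positions
        context = (weights * encoder_outputs).sum(dim=1)   # (batch, hidden_dim) context vector
        return context, weights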

Problems with Bahdanau Attention:

The architecture introduced by Bahdanau et al. faced challenges primarily due to its reliance on
LSTM networks for sequence modeling. LSTMs process inputs sequentially, causing
computational inefficiencies during training, particularly with longer sequences. Moreover,
while the attention mechanism effectively captured local dependencies, it struggled to
integrate global context information efficiently. As a consequence, the model encountered
difficulties in handling longer input sequences, leading to suboptimal performance, especially in
tasks like translating lengthy sentences.

Transformer – Solution to the above Problems:

The Transformer architecture, introduced in the paper "Attention is All You Need" by Vaswani
et al. in June 2017 ([1706.03762] Attention Is All You Need), revolutionized natural language
processing by offering an alternative to traditional sequential models like LSTMs. Unlike
LSTM-based models, which process input sequences token by token, the Transformer utilizes a
self-attention mechanism. This mechanism allows each word in the input sequence to attend to
all other words simultaneously, enabling the model to capture complex relationships between
distant words more effectively. By considering the entire input sequence at once, the
Transformer can assign higher attention weights to relevant words during output token
generation, improving its contextual understanding and prediction accuracy. Additionally, the
Transformer employs multi-head attention, enabling it to capture various types of dependencies
and attend to different parts of the input sequence concurrently. This enhances the model's
representational capacity and enables it to capture intricate patterns in the data more
efficiently.

Explanation of the Transformer’s Algorithm:

A Transformer consists of two main components: an encoder and a decoder. The encoder takes
the input sequence and generates a representation of it, and the decoder uses this
representation to generate the output. Note that in the paper the encoder block stacks six
encoder layers on top of each other, and the decoder block stacks the same number of decoder
layers. Following are the key components of the algorithm:

 Embedding Layer
 Positional Encoding
 Attention Mechanism
 Position-Wise Feed-Forward Network
 Sub-layering
 Encoder Layer
 Encoder Block
 Decoder Layer
 Decoder Block
 Decoder Generator
 Main Transformer Block

1. Embedding Layer:

The embedding layer converts input tokens into dense vectors of fixed size, representing each
word in the input sequence. This layer essentially maps the input tokens to their corresponding
embedding vectors in a continuous vector space. The time complexity of this layer is
O(vocab_size * d_model), where vocab_size is the size of the vocabulary and d_model is the
dimension of the model.

Implementation:

import copy   # used later when cloning encoder/decoder layers
import math

import torch
import torch.nn as nn

class Embedding(nn.Module):
    def __init__(self, vocab_size, dmodel=512) -> None:
        # dmodel -> embedding model dimension
        super(Embedding, self).__init__()
        self.vocab_size = vocab_size
        self.dmodel = dmodel
        self.embed_layer = nn.Embedding(self.vocab_size, self.dmodel)

    def forward(self, x):
        embed_out = self.embed_layer(x)
        # "In the embedding layers, we multiply those weights by sqrt(dmodel)" -> page 5 of the paper
        return embed_out * math.sqrt(self.dmodel)

2. Positional Encoding:

Positional encoding injects information about the position of tokens into the embedding
vectors. This allows the model to distinguish between different positions in the sequence, which
is crucial for understanding word order. In positional encoding, for each position in the input
sequence (N), a positional encoding vector of dimensionality (d) is generated. This positional
encoding vector is then added element-wise to the corresponding embedding vector of the token
at that position. Since this operation is performed for each token in the sequence, the time
complexity becomes O(N * d).
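
For reference, the sinusoidal encoding defined in the paper is

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

where pos is the token position and i indexes pairs of embedding dimensions; the implementation
below precomputes exactly these values using a frequency vector.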

Implementation:

class PositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, d_model=512) -> None:
        # d_model -> embedding model dimension
        super(PositionalEncoding, self).__init__()
        self.d_model = d_model
        pos = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        # we know a^-x equals 1/a^x
        frequency = torch.pow(10000, -torch.arange(0, d_model, 2, dtype=torch.float) / self.d_model)
        pe = torch.zeros((max_seq_len, d_model))
        pe[:, 0::2] = torch.sin(pos * frequency)
        pe[:, 1::2] = torch.cos(pos * frequency)
        self.register_buffer('pe', pe)

    def forward(self, embed_vect):
        # add the positional encodings for the first seq_len positions to the embeddings
        pe = self.pe[:embed_vect.size(1)]
        return embed_vect + pe

3. Attention Layer:

The attention mechanism in transformers is a sophisticated method for weighing the
importance of different words in a sequence when processing each word. It allows the model to
focus on relevant parts of the input sequence while generating the output sequence, enabling it
to capture long-range dependencies effectively. In the multi-head attention layer, the input
sequence is transformed into three sets of vectors: Query, Key, and Value. These vectors are
then used to compute attention scores, which determine the relevance of each word to every
other word in the sequence. This process involves calculating dot products between the Query
and Key vectors, followed by scaling and applying a softmax function to obtain attention
weights. Finally, the weighted sum of the Value vectors is computed to produce the output. The
time complexity of the multi-head attention layer is O(N^2 * d), where N is the sequence length
and d is the dimensionality of the input. This complexity arises from the pairwise dot product
calculation between the Query and Key vectors for each word in the sequence (resulting in
O(N^2) operations) and the subsequent weighted sum operation (involving O(N * d) operations).
The dimensionality of the input (d) also contributes to the overall complexity. Despite its higher
computational cost, the attention mechanism is crucial for capturing dependencies across
distant words in the input sequence, making it a fundamental component of transformer
models.
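
For reference, the scaled dot-product attention computed by each head, as defined in the paper, is

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where d_k = d_model / n_head is the per-head dimensionality; the multi-head layer concatenates
the outputs of all heads and applies a final linear projection.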

Implementation:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_head=8, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        self.n_head = n_head
        self.dropout = nn.Dropout(p=dropout_rate)
        self.head_dim = int(d_model / n_head)
        self.softmax_layer = nn.Softmax(dim=-1)
        self.w_key = nn.Linear(d_model, d_model)
        self.w_query = nn.Linear(d_model, d_model)
        self.w_value = nn.Linear(d_model, d_model)
        self.output_project = nn.Linear(d_model, d_model)

    def attention(self, key, query, value, mask=None):
        # calculate the attention scores
        # query = (BS,NH,S/T,HD), key.transpose(-2,-1) = (BS,NH,HD,S/T)
        # attention score size: encoder self-attention = (BS,NH,S,S),
        # decoder self-attention = (BS,NH,T,T), encoder-decoder attention = (BS,NH,T,S)
        attention_score = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # apply masking: blocked positions (mask == False/0) get -inf before the softmax
        if mask is not None:
            attention_score = attention_score.masked_fill(mask == 0, float("-inf"))

        # pass through the softmax layer
        attention_weight = self.softmax_layer(attention_score)

        # multiply with the value vectors; final shape of score = (BS,NH,S/T,HD)
        score = torch.matmul(attention_weight, value)
        return score

    def forward(self, key, query, value, mask=None):
        batch_size = key.size(0)
        # project with the weight matrices
        # size of key/query/value = (BS,S/T,ED), where BS = batch size,
        # S = source sequence length, T = target sequence length, ED = embedding dimension,
        # NH = number of heads, HD = head dimension
        key, query, value = self.w_key(key), self.w_query(query), self.w_value(value)

        # split the vectors by the number of heads and transpose
        # size of key/query/value = (BS,NH,S/T,HD)
        key = key.view(batch_size, -1, self.n_head, self.head_dim).transpose(1, 2)
        query = query.view(batch_size, -1, self.n_head, self.head_dim).transpose(1, 2)
        value = value.view(batch_size, -1, self.n_head, self.head_dim).transpose(1, 2)

        # size of attention_score = (BS,NH,S/T,HD)
        attention_score = self.attention(key, query, value, mask)
        attention_score = self.dropout(attention_score)

        # concatenate the heads: (BS,NH,S/T,HD) -> (BS,S/T,NH,HD) -> (BS,S/T,ED)
        attention_score = attention_score.transpose(1, 2)
        attention_score = attention_score.reshape(batch_size, -1, self.head_dim * self.n_head)

        # pass through the final linear projection
        attention_out = self.output_project(attention_score)
        return attention_out

4. Position-Wise Feed-Forward Network:

This layer applies a fully connected feed-forward network to each position in the
sequence separately and identically. It introduces non-linearity into the transformer
architecture, enabling the model to capture complex relationships between words and
their positions. The time complexity of the feed-forward network is typically O(N * d^2),
where N is the sequence length and d is the dimensionality of the input. This complexity
arises from the fact that the feed-forward network applies two linear transformations
(typically represented as fully connected layers) to each position in the sequence.
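
For reference, the position-wise transformation defined in the paper is

\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2

where W_1 expands from d_model to the inner dimension (4 * d_model in the implementation
below) and W_2 projects back to d_model.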

Implementation:

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model=512, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        hidden_width = 4  # inner dimension = 4 * d_model, as in the paper
        self.dropout = nn.Dropout(p=dropout_rate)
        self.linear1 = nn.Linear(d_model, d_model * hidden_width)
        self.linear2 = nn.Linear(d_model * hidden_width, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.linear2(self.dropout(self.relu(self.linear1(x))))

5. Sublayer:

Sub-layering in transformers involves stacking multiple sub-layers, such as multi-head
attention and feed-forward networks, within each encoder and decoder block. This enhances the
model's ability to capture both global and local dependencies within the input sequence. The
time complexity of sub-layering can be analyzed by considering the time complexity of each
individual sub-layer being stacked within the encoder and decoder blocks; the overall complexity
then depends on the number of sub-layers and their respective complexities.

Implementation:

class SubLayer(nn.Module):
    def __init__(self, d_model=512) -> None:
        super(SubLayer, self).__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sub_layer_x):
        # residual connection followed by layer normalization
        return self.norm(x + sub_layer_x)

6. Encoder Layer:

The encoder layer consists of two sub-layers: multi-head self-attention and a position-wise
feed-forward network. In the multi-head self-attention layer, each token attends to all other
tokens in the sequence, capturing dependencies within the input sequence. The position-wise
feed-forward network introduces non-linearity into the model, enabling it to capture complex
relationships between words and their positions. The time complexity of the encoder layer is
dominated by the complexity of the multi-head self-attention layer, which is typically
O(N^2 * d), where N is the sequence length and d is the dimensionality of the input.

Implementation:

class EncoderLayer(nn.Module):
    # NOTE: the __init__ below is reconstructed (it was missing from the report) to match the
    # attributes used in forward() and the constructor call in the Transformers class below.
    def __init__(self, d_model, multi_head_attention_layer,
                 position_wise_feedforward_layer, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        self.multi_head_attention_layer = copy.deepcopy(multi_head_attention_layer)
        self.dropout1 = nn.Dropout(p=dropout_rate)
        self.sublayer1 = SubLayer(d_model)
        self.position_wise_feedforward_layer = position_wise_feedforward_layer
        self.dropout2 = nn.Dropout(p=dropout_rate)
        self.sublayer2 = SubLayer(d_model)

    def forward(self, vec_representation, src_mask=None):
        # compute self-attention
        attention_score = self.multi_head_attention_layer(
            key=vec_representation, query=vec_representation,
            value=vec_representation, mask=src_mask)
        attention_score = self.dropout1(attention_score)
        # residual connection + layer norm
        attention_out = self.sublayer1(vec_representation, attention_score)
        # pass through the position-wise feed-forward network
        position_wise_feedforward_out = self.position_wise_feedforward_layer(attention_out)
        position_wise_feedforward_out = self.dropout2(position_wise_feedforward_out)
        # residual connection + layer norm
        encoder_out = self.sublayer2(attention_out, position_wise_feedforward_out)
        return encoder_out

7. Encoder:

The encoder is composed of a stack of identical encoder layers. Each encoder layer
processes the input sequence independently and captures different levels of
abstraction. Stacking multiple encoder layers helps in capturing hierarchical
representations of the input sequence. The time complexity of the encoder is
determined by the number of encoder layers stacked and the complexity of each
layer.

Implementation:

class EncoderBlock(nn.Module):
    def __init__(self, encoder_layer, num_layer=6) -> None:
        super().__init__()
        self.encoder_layer = encoder_layer
        self.encoder_layer_list = get_clone(self.encoder_layer, num_layer)

    def forward(self, src_embedding, src_mask=None):
        encoder_out = src_embedding
        for encoder_layer in self.encoder_layer_list:
            encoder_out = encoder_layer(encoder_out, src_mask)
        return encoder_out
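
The get_clone helper used above (and again in the decoder block below) is not shown in the
report's listings. A plausible definition, assumed here, simply deep-copies a layer into an
nn.ModuleList:

import copy
import torch.nn as nn

def get_clone(module, num_layer):
    # assumed helper: create num_layer independent deep copies of a layer
    return nn.ModuleList([copy.deepcopy(module) for _ in range(num_layer)])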

8. Decoder Layer:

The decoder layer is similar to the encoder layer but includes an additional sub-layer for
encoder-decoder attention. This allows the decoder to attend over all positions in the input
sequence while generating the output sequence. The time complexity of the decoder layer is
similar to that of the encoder layer, dominated by the complexity of the multi-head attention
mechanism.

Implementation:

class DecoderLayer(nn.Module):
    def __init__(self, d_model, multi_head_attention_layer,
                 position_wise_feedforward_layer, dropout_rate=0.2) -> None:
        super().__init__()
        self.d_model = d_model
        self.decoder_attention_layer = copy.deepcopy(multi_head_attention_layer)
        self.dropout1 = nn.Dropout(p=dropout_rate)
        self.sublayer1 = SubLayer(d_model)

        self.encoder_decoder_attention_layer = copy.deepcopy(multi_head_attention_layer)
        self.dropout2 = nn.Dropout(p=dropout_rate)
        self.sublayer2 = SubLayer(d_model)

        self.position_wise_feedforward_layer = position_wise_feedforward_layer
        self.dropout3 = nn.Dropout(p=dropout_rate)
        self.sublayer3 = SubLayer(d_model)

    def forward(self, enc, dec, src_mask=None, target_mask=None):
        # masked self-attention over the decoder inputs
        decoder_attention_out = self.decoder_attention_layer(key=dec, query=dec,
                                                             value=dec, mask=target_mask)
        decoder_attention_out = self.dropout1(decoder_attention_out)
        decoder_attention_out = self.sublayer1(dec, decoder_attention_out)

        # encoder-decoder attention: queries come from the decoder, keys/values from the encoder
        enc_dec_attention_out = self.encoder_decoder_attention_layer(
            key=enc, query=decoder_attention_out, value=enc, mask=src_mask)
        enc_dec_attention_out = self.dropout2(enc_dec_attention_out)
        enc_dec_attention_out = self.sublayer2(decoder_attention_out, enc_dec_attention_out)

        # position-wise feed-forward network
        ffn_out = self.position_wise_feedforward_layer(enc_dec_attention_out)
        ffn_out = self.dropout3(ffn_out)
        ffn_out = self.sublayer3(enc_dec_attention_out, ffn_out)

        return ffn_out

9. Decoder Block:

The decoder block is a stack of identical decoder layers, enabling the model to generate
the output sequence token by token while attending to the input sequence. Each
decoder layer consists of a multi-head self-attention layer, an encoder-decoder
attention layer, and a position-wise feed-forward network. The time complexity of the
decoder block is determined by the number of decoder layers stacked and the
complexity of each layer.
Implementation:
class DecoderBlock(nn.Module):
    def __init__(self, decoder_layer, num_layer=6) -> None:
        super().__init__()
        self.decoder_layer = decoder_layer
        self.decoder_layer_list = get_clone(self.decoder_layer, num_layer)
        # defined but not applied in forward (kept as in the original listing)
        self.layer_norm = nn.LayerNorm(self.decoder_layer.d_model)

    def forward(self, encoder_out_vec, decoder_embedding, src_mask=None, target_mask=None):
        dec_out = decoder_embedding
        for decoder_layer in self.decoder_layer_list:
            dec_out = decoder_layer(enc=encoder_out_vec, dec=dec_out,
                                    src_mask=src_mask, target_mask=target_mask)
        return dec_out

10. Decoder Generator:

In the decoder generator, the output sequence is generated token by token based on the model's
predictions. This layer completes the generation process of the output sequence. The time
complexity of the decoder generator is typically O(V * d), where V is the size of the vocabulary
and d is the dimension of the model's output, which is typically equal to the embedding
dimension.
Implementation:
class DecoderGenerator(nn.Module):
    def __init__(self, d_model, target_vocab_size) -> None:
        super().__init__()
        self.linear = nn.Linear(d_model, target_vocab_size)
        self.softmax_layer = nn.LogSoftmax(dim=-1)

    def forward(self, target_vec_rep):
        # project to the vocabulary size and return log-probabilities
        return self.softmax_layer(self.linear(target_vec_rep))

11. Transformer Block:


The transformer block integrates all the components mentioned above, including the
embedding layer, positional encoding, attention mechanism, position-wise feed-forward
network, and sub-layering. It forms the core building block of the transformer
architecture. The time complexity of the transformer block is determined by the
complexity of each component and the number of layers stacked within the block.
Implementation:

class Transformers(nn.Module):
    def __init__(self, src_seq_len, trg_seq_len, d_model, num_head, dropout_rate=0.2) -> None:
        super().__init__()
        # note: src_seq_len / trg_seq_len double as the source / target vocabulary sizes in this
        # implementation, since they are passed to the embedding and generator layers
        self.src_seq_len = src_seq_len
        self.trg_seq_len = trg_seq_len
        self.d_model = d_model
        self.num_head = num_head

        self.src_embedding = Embedding(self.src_seq_len, self.d_model)
        self.src_pe = PositionalEncoding(self.src_seq_len, self.d_model)

        self.trg_embedding = Embedding(self.trg_seq_len, self.d_model)
        self.trg_pe = PositionalEncoding(self.trg_seq_len, self.d_model)

        self.multi_head_attention = MultiHeadAttention(d_model, num_head, dropout_rate)
        self.position_wise_feedforward = PositionWiseFeedForward(self.d_model, dropout_rate)

        self.encoder_layer = EncoderLayer(d_model, self.multi_head_attention,
                                          self.position_wise_feedforward, dropout_rate)
        self.decoder_layer = DecoderLayer(d_model, self.multi_head_attention,
                                          self.position_wise_feedforward, dropout_rate)

        self.encoder_block = EncoderBlock(self.encoder_layer, num_layer=6)
        self.decoder_block = DecoderBlock(self.decoder_layer, num_layer=6)
        self.decoder_out_gen = DecoderGenerator(d_model, self.trg_seq_len)

    def forward(self, src_token_id, target_token_id, src_mask=None, target_mask=None):
        encode_out = self.encode(src_token_id, src_mask)
        decode_out = self.decode(encode_out, target_token_id, src_mask, target_mask)
        return decode_out

    def encode(self, src_token_id, src_mask):
        embed = self.src_embedding(src_token_id)
        pe_out = self.src_pe(embed)
        encoder_out = self.encoder_block(pe_out, src_mask)
        return encoder_out

    def decode(self, enc_out, trg_token_ids, src_mask=None, target_mask=None):
        # use the target-side embedding and positional encoding for the decoder input
        embed = self.trg_embedding(trg_token_ids)
        pe_out = self.trg_pe(embed)
        decoder_out = self.decoder_block(enc_out, pe_out, src_mask, target_mask)
        decoder_out = self.decoder_out_gen(decoder_out)
        return decoder_out
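
As a quick sanity check, the assembled model can be exercised on dummy token ids; the sizes
below are illustrative assumptions, not values used in the project.

# illustrative usage of the Transformers class defined above (hypothetical sizes)
model = Transformers(src_seq_len=100, trg_seq_len=100, d_model=512, num_head=8)
src = torch.randint(0, 100, (2, 10))   # batch of 2 source sequences, 10 tokens each
trg = torch.randint(0, 100, (2, 12))   # batch of 2 target sequences, 12 tokens each
log_probs = model(src, trg)            # shape (2, 12, 100): log-probabilities over the target vocabulary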

Therefore, the total time complexity of attention in the "Attention Is All You Need" Transformer
is O((L_encoder + L_decoder) * N^2 * d), where L_encoder and L_decoder are the number of
layers in the encoder and decoder respectively, N is the sequence length, and d is the embedding
dimensionality. The h attention heads each operate on a slice of dimensionality d/h, so together
they still cost O(N^2 * d) per layer.

Significance and Examples:

 Revolutionizing NLP Compared to Previous Models: Transformers, through the self-attention
mechanism, revolutionize NLP by efficiently capturing long-range dependencies and linguistic
structures, surpassing traditional models like RNNs and LSTMs.
 Democratizing AI Through Transfer Learning: Transformer-based models like BERT and
GPT democratize AI by offering accessible pretrained models for fine-tuning across
diverse domains, reducing the reliance on extensive labeled datasets.
 MultiModal Capability of Transformers: Transformers exhibit multimodal capability,
processing text, images, and other data types seamlessly, opening avenues for
applications like image-text matching and multimodal sentiment analysis.
 Acceleration of Generative AI: Transformers accelerate generative AI development by
enabling the creation of sophisticated language models capable of generating human-
like text and responses, fueling advancements in conversational AI and creative
applications.

Applications and Examples:

 Machine Translation: Transformer-based models achieve state-of-the-art results in translation
tasks through self-attention.
 Document Summarization: Models like BERT and GPT produce high-quality summaries
by distilling key information from documents.
 Document Generation: Transformers generate coherent text for tasks like story writing
and dialogue generation by predicting the next word based on context.
 Named Entity Recognition (NER): Models like BERT excel in recognizing named entities
by leveraging contextual understanding, achieving high accuracy in classification tasks.

Conclusion:

The analysis of the Transformer architecture presented in the "Attention Is All You Need" paper
has yielded valuable insights into its revolutionary impact on natural language processing. The
project successfully elucidated the novel self-attention mechanism employed by Transformers,
highlighting its ability to capture complex dependencies across input sequences more
effectively than traditional sequential models such as the RNNs and LSTMs discussed earlier.

Honest Assessment

While the project provided a comprehensive understanding of the Transformer architecture,
there were certain limitations encountered during the analysis. The complexity of the
Transformer's self-attention mechanism and multi-head attention posed challenges in grasping
their intricacies fully.

Future Work and Improvements

The future work for this project is to investigate how the complexity of the multi-head attention
mechanism can be reduced, because although the attention mechanism is necessary, its
complexity is high, which leads to excessive computation costs. One work we found related to
this is the article "Interactive Multi-Head Self-Attention with Linear Complexity" by Hankyul
Kang, Ming-Hsuan Yang, and Jongbin Ryu, where they propose a method to reduce the
complexity of multi-head self-attention mechanisms in Transformer architectures. The authors
introduce an interactive approach that enables efficient computation of attention scores by
iteratively refining attention weights through interactions between different heads. By
incorporating interactive mechanisms, the proposed method achieves linear time complexity
with respect to the input sequence length, making it computationally efficient even for long
sequences. This approach offers a promising solution to enhance the scalability of Transformer
models while maintaining their effectiveness in capturing complex dependencies across input
sequences. This article can be studied further for our future work.

Changes to Original Requirements

In retrospect, expanding the scope of the project to include a broader range of Transformer
variants and their applications could have provided a more holistic view. Additionally,
integrating discussions on potential challenges and limitations faced in real-world deployment
scenarios would have added practical relevance to the analysis. Overall, refining the project's
scope and incorporating practical implementation aspects would enhance its value and
applicability in understanding and leveraging Transformer architectures effectively.

Application:

The application of the Transformer architecture explained above can be found in the Jupyter
notebook linked below, where the architecture is used to train a model that translates English
sentences to Hindi sentences:

https://colab.research.google.com/drive/1Etkh5grys2mKqNxPy09l9qzHynxL5-GU?usp=sharing#scrollTo=nsWsVfjHGcjh

References:

 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... &
Polosukhin, I. (2017). Attention is all you need. In Advances in neural information
processing systems (pp. 6000-6010).
 Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural
networks. In Advances in neural information processing systems (pp. 3104-3112).
 Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473.
 Kang, H., Yang, M. H., & Ryu, J. (2020). Interactive Multi-Head Self-Attention with Linear
Complexity. arXiv preprint arXiv:2006.03236.
