Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context


PUBLISHED IN ACL 2019

GOOGLE AI RESEARCHERS

TRANSFORMER-XL
ATTENTIVE LANGUAGE MODELS BEYOND
A FIXED-LENGTH CONTEXT

PRESENTER
SALMAN YOUNUS & BILAL SHABIR
LANGUAGE MODELING

Language modeling (LM) is the use of various statistical and probabilistic
techniques to determine the probability of a given sequence of words
occurring in a sentence.
PREDICT THE LAST WORD IN THE TEXT

“THE CLOUDS ARE IN THE ***”

In this example, the previous words are used to predict the next word of the sentence.
Hence, the model needs to remember the previous words.
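
As a toy illustration of the definition above (this sketch is not from the slides), a language model assigns a probability to a sentence by chaining next-word probabilities; the bigram counts below are a deliberately simple stand-in for a neural model:

```python
from collections import Counter

# Minimal bigram language-model sketch (illustrative only): estimate
# P(next word | previous word) from counts, then score a short sentence
# with the chain rule.
corpus = "the clouds are in the sky and the birds are in the sky".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])
vocab = set(corpus)

def p_next(prev, word):
    """P(word | prev) with add-one smoothing over the toy vocabulary."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

sentence = "the clouds are in the sky".split()
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= p_next(prev, word)   # chain rule over consecutive words
print(prob)                      # higher probability = more plausible sentence
```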
RECURRENT NEURAL NETWORKS

 Recurrent Neural Networks (RNNs) are a type of neural network
where the output from the previous step is fed as input to the
current step.

 The main and most important feature of an RNN is its hidden state,
which remembers some information about the sequence.
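
A minimal sketch of the hidden-state idea, assuming a vanilla RNN cell with illustrative names and sizes (not from the slides); note that the loop is strictly sequential, which is what later makes parallelization hard:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: the new hidden state mixes the current
    input with the previous hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions, illustrative only.
d_in, d_hidden, seq_len = 8, 16, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

xs = rng.normal(size=(seq_len, d_in))   # a sequence of word vectors
h = np.zeros(d_hidden)                  # initial hidden state
for x_t in xs:                          # strictly sequential: step t needs step t-1
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                          # (16,): a summary of the sequence so far
```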
RECURRENT NEURAL NETWORKS HAVE A
LONG-TERM DEPENDENCY ISSUE!
PREDICT THE LAST WORD IN THE TEXT

“I GREW UP IN FRANCE… I SPEAK FLUENT ****.”


LONG SHORT-TERM MEMORY NETWORKS

 Long Short-Term Memory networks – usually just called
“LSTMs” – are a special kind of RNN, capable of learning
long-term dependencies.

 But it is hard to parallelize their computation, because sentences
must still be processed word by word.
BIRTH OF “TRANSFORMER” IN 2017

 It is a deep learning model used primarily in the field of
natural language processing.

 Transformers are designed to handle sequential data, such as
natural language, for tasks such as translation and text
summarization.

 However, unlike RNNs, Transformers do not require that the
sequential data be processed in order.

 Due to this feature, the Transformer allows for much more
parallelization than RNNs and therefore reduced training times.
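
To make the parallelization point concrete, here is a rough sketch of single-head, unmasked scaled dot-product self-attention; all weight names and sizes are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a whole segment at once.
    Every position attends to every other position in one matrix product;
    no step-by-step recurrence is needed."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                               # toy sizes, illustrative only
X = rng.normal(size=(seq_len, d_model))                # embeddings for one segment
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                       # (6, 16): all positions at once
```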
CHALLENGE WITH TRANSFORMERS

CURRENTLY IMPLEMENTED WITH A FIXED-LENGTH CONTEXT


THE FIXED-LENGTH CONTEXT IN THE TRANSFORMER
CAUSES A
CONTEXT FRAGMENTATION ISSUE
LIMITATIONS OF FIXED LENGTH CONTEXT IN TRANSFORMER

 It is not able to model dependencies that are longer than a
fixed length.

 The segments usually do not respect the sentence boundaries,
resulting in context fragmentation, which leads to inefficient
optimization.
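
A small illustrative sketch (not from the paper) of how chopping a token stream into fixed-length segments ignores sentence boundaries, which is the context fragmentation problem:

```python
# Split a token stream into fixed-length training segments.  The segment
# length here is chosen arbitrarily for the example.
tokens = ("the clouds are in the sky . i grew up in france . "
          "i speak fluent french .").split()

segment_length = 6
segments = [tokens[i:i + segment_length]
            for i in range(0, len(tokens), segment_length)]

for seg in segments:
    print(seg)
# Each segment is modeled independently, so a word near the start of a
# segment cannot see the words that came just before it in the previous one.
```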
TRANSFORMER-XL

 Transformer-XL heavily relies on the vanilla
Transformer (Al-Rfou et al.) but introduces two
innovative techniques to overcome the vanilla model’s
shortcomings:
 Segment-level Recurrence
 Relative Positional Encodings
SEGMENT-LEVEL RECURRENCE

TRANSFORMER VS TRANSFORMER-XL
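
The original slides illustrate segment-level recurrence with figures; as a rough stand-in, the sketch below shows the core idea as described in the paper: hidden states from the previous segment are cached (with no gradient flowing back) and reused as extra keys and values when the current segment attends. Names and sizes are illustrative assumptions.

```python
import numpy as np

def attend_with_memory(h_segment, memory, W_q, W_k, W_v):
    """Segment-level recurrence: keys/values come from the cached previous
    segment *and* the current segment; queries come from the current one."""
    context = np.concatenate([memory, h_segment], axis=0)  # [mem + cur, d_model]
    Q, K, V = h_segment @ W_q, context @ W_k, context @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seg_len, d_model = 4, 16                          # toy sizes, illustrative only
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

memory = np.zeros((0, d_model))                   # empty memory for the first segment
for step in range(3):                             # three consecutive segments
    h_segment = rng.normal(size=(seg_len, d_model))
    out = attend_with_memory(h_segment, memory, W_q, W_k, W_v)
    memory = h_segment.copy()                     # cache for reuse; no gradient flows back
    print(step, out.shape)                        # (4, 16) at every step
```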
CHALLENGE WITH POSITION EMBEDDING

 How can we keep the positional information coherent when we reuse the states?
 The original positional encoding handles each segment separately and, as a result, tokens from
different segments have the same positional encoding.
 For example, the first token of the first and the second segments will have the same encoding, although
their position and importance are different.
Segment 1: positions 0, 1      Segment 2: positions 0, 1

This confusion might mislead the network.
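
A tiny sketch of the collision described above: absolute positions restart at 0 in every segment, while relative offsets from the current token to earlier tokens (including cached ones) remain well defined. The indices are illustrative only:

```python
segment_length = 4

# Absolute encoding: the position index restarts inside each segment.
segment_1_positions = list(range(segment_length))         # [0, 1, 2, 3]
segment_2_positions = list(range(segment_length))         # [0, 1, 2, 3] - same again

# Relative view: distance from the current (query) token to each key token,
# including keys cached from the previous segment.
query_index = 5                                            # 2nd token of segment 2, counted globally
key_indices = range(0, query_index + 1)                    # previous segment + current prefix
relative_offsets = [query_index - k for k in key_indices]  # [5, 4, 3, 2, 1, 0]
print(relative_offsets)
```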


RELATIVE POSITIONAL ENCODINGS
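
The slides name relative positional encodings without detail, so the following is a hedged sketch of the score decomposition used in Transformer-XL: a content term plus a relative-position term, each with a learned global bias (u and v). A real implementation vectorizes this with a relative-shift trick instead of an explicit loop; names and sizes here are illustrative:

```python
import numpy as np

def relative_attention_scores(E, R, W_q, W_kE, W_kR, u, v):
    """Sketch of Transformer-XL's relative attention score for one head:
    content term + relative-position term + two learned global biases."""
    L, d = E.shape
    q = E @ W_q                                    # queries from content embeddings
    kE = E @ W_kE                                  # content keys
    scores = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1):                     # causal: only attend to j <= i
            r = R[i - j] @ W_kR                    # key from the relative offset i - j
            scores[i, j] = (q[i] + u) @ kE[j] + (q[i] + v) @ r
    return scores

rng = np.random.default_rng(0)
L, d = 5, 8                                        # toy sizes, illustrative only
E = rng.normal(size=(L, d))                        # token (content) embeddings
R = rng.normal(size=(L, d))                        # embeddings for relative offsets 0..L-1
W_q, W_kE, W_kR = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
u, v = rng.normal(size=d) * 0.1, rng.normal(size=d) * 0.1
print(relative_attention_scores(E, R, W_q, W_kE, W_kR, u, v).shape)  # (5, 5)
```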
THANKS
