
Ram Krishna Hari

Ganpati mahesh janki

Salutations again and again to Krishna, Vasudeva, Hari, the Supreme Self, Govinda, the destroyer of the afflictions of those who bow to him.


Ram. O Ashutosh, you are a generous giver; relieve the distress of this humble devotee.

From RNNs to LLMs: The Evolutionary Journey

Rahul Mishra
22.04.2024
Why do we need to represent text?

Numerical Input Requirement

Feature Extraction
Semantic Understanding
Generalization

Dimensionality Reduction

Vectorization
Representing words as discrete symbols
Problem with words as discrete symbols
Count-based Methods

A term-document matrix
What about word vectors?
Problem with raw count-based methods
Two common solutions for word weighting
Sparse vs Dense Vectors
Sparse vs Dense Vectors
Distributional Hypothesis
Represent words by their “usage”
Directly Learning Low dimensional dense vectors
Getting short dense vectors
Word2vec
Word2vec
Predict if candidate word c is a “neighbour”
Skip-Gram Classifier
Similarity is computed using dot product
Turning dot products into probabilities
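As a rough illustration of the two points above, here is a minimal sketch of the skip-gram classifier decision: similarity is a dot product between target and candidate context embeddings, and a sigmoid turns it into a probability. The toy vectors and helper names are hypothetical, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy embeddings (in practice these are learned parameters).
w_target  = np.array([0.2, -0.1, 0.4])   # embedding of the target word w
c_context = np.array([0.3,  0.0, 0.5])   # embedding of the candidate context word c

# Similarity via dot product, squashed to a probability:
# P(+ | w, c) = sigma(c . w), and P(- | w, c) = 1 - P(+ | w, c).
score = np.dot(c_context, w_target)
p_neighbour = sigmoid(score)
print(f"P(c is a neighbour of w) = {p_neighbour:.3f}")
```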
In a nutshell
Skip-Gram Training
Skip-Gram Training
Learning Vectors
Recap SGD
Word2vec Embedding
Analogical Relations
Analogical Relations
Historical Semantics
Cultural Bias
Problem with Word2vec and similar methods
Do we have a simple solution?

ELMo
ULMFiT
BERT
GPT
Neural Networks
NNs to the rescue
Artificial Neural Net
Running several logistic regressions
Running several logistic regressions
Running several logistic regressions
Computing NN output: Forward Pass
Activation Functions

tanh: a = (e^z − e^(−z)) / (e^z + e^(−z))
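A minimal sketch of the tanh activation defined above, with sigmoid and ReLU for comparison; this assumes NumPy, and in practice one would simply call np.tanh.

```python
import numpy as np

def tanh(z):
    # tanh(z) = (e^z - e^-z) / (e^z + e^-z); equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(tanh(z))        # matches np.tanh(z)
```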
Why deep learning?

[Figure: performance vs. amount of training data. Traditional ML algorithms plateau with small training data, while small, medium and large DNNs keep improving as data grows. Drivers of the trend: data, computation, algorithms.]

Why not a standard network?

[Figure: a standard fully connected network mapping inputs x^<1>, x^<2>, …, x^<T_x> to outputs y^<1>, y^<2>, …, y^<T_y>.]

Problems:
- Inputs, outputs can be different lengths in different examples.
- Doesn't share features learned across different positions of text.
RNN

‣ Idea: We can process a sequence of vectors x by applying a recurrence formula at every time step t.

‣ Note: The same parameters, weights and function are used at each step.
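A minimal NumPy sketch of that recurrence, with hypothetical dimensions and randomly initialised weights; the point is that the same weights and the same function are reused at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 3                                  # hypothetical input / hidden sizes
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_xh = rng.normal(scale=0.1, size=(d_h, d_x))
b_h  = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    # Same weights, same function at every time step.
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

xs = [rng.normal(size=d_x) for _ in range(5)]    # a toy sequence of 5 vectors
h = np.zeros(d_h)                                # h^<0>
for x_t in xs:
    h = rnn_step(h, x_t)                         # recurrence applied at each step t
print(h)
```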
Training RNN
Forward Propagation
Training RNN: Backprop in RNN
Training RNN: Backprop in RNN
Backpropagation Through Time

[Figure: an RNN unrolled through time. Starting from a^<0>, each step combines the previous activation a^<t−1> with the input x^<t> to produce a^<t> and a prediction ŷ^<t>; the error is backpropagated through every time step.]
Backpropagation Through Time
Vanishing and Exploding Gradient


The cat, that already ate……….., was full

The cats, that already ate……….., were full

The verb ("was" vs. "were") agrees with a noun that appears many steps earlier, so the network must carry information across a long span; with vanishing gradients, plain RNNs struggle to learn such long-range dependencies.
Vanishing Gradient
Vanishing Gradient
Exploding Gradient
Another Problem
Full GRU

c̃^<t> = tanh(W_c [c^<t−1>, x^<t>] + b_c)

Γ_u = σ(W_u [c^<t−1>, x^<t>] + b_u)

c^<t> = Γ_u ∗ c̃^<t> + (1 − Γ_u) ∗ c^<t−1>

The cat, which ate already, was full.
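A minimal NumPy sketch of the GRU step given by the equations above; the dimensions, random weight initialisation and helper names (gru_step, sigmoid) are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_x, d_c = 4, 3                                            # hypothetical input / memory-cell sizes
rng = np.random.default_rng(0)
W_c = rng.normal(scale=0.1, size=(d_c, d_c + d_x)); b_c = np.zeros(d_c)
W_u = rng.normal(scale=0.1, size=(d_c, d_c + d_x)); b_u = np.zeros(d_c)

def gru_step(c_prev, x_t):
    concat = np.concatenate([c_prev, x_t])                 # [c^<t-1>, x^<t>]
    c_tilde = np.tanh(W_c @ concat + b_c)                  # candidate memory c~^<t>
    gamma_u = sigmoid(W_u @ concat + b_u)                  # update gate Γ_u
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev      # c^<t>

c = np.zeros(d_c)
for x_t in [rng.normal(size=d_x) for _ in range(5)]:
    c = gru_step(c, x_t)
print(c)
```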
LSTM in pictures
c̃^<t> = tanh(W_c [a^<t−1>, x^<t>] + b_c)

Γ_u = σ(W_u [a^<t−1>, x^<t>] + b_u)    (update gate)
Γ_f = σ(W_f [a^<t−1>, x^<t>] + b_f)    (forget gate)
Γ_o = σ(W_o [a^<t−1>, x^<t>] + b_o)    (output gate)

c^<t> = Γ_u ∗ c̃^<t> + Γ_f ∗ c^<t−1>
a^<t> = Γ_o ∗ tanh(c^<t>)

[Figure: a single LSTM cell combining the update, forget and output gates, and three cells chained through time: a^<0>, c^<0> → a^<1>, c^<1> → a^<2>, c^<2> → a^<3>, c^<3>, with inputs x^<1>, x^<2>, x^<3> and softmax outputs y^<1>, y^<2>, y^<3>.]
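A minimal NumPy sketch of the LSTM step pictured above; as with the GRU sketch, the sizes, random weights and helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_x, d_a = 4, 3                                   # hypothetical input / hidden sizes
rng = np.random.default_rng(0)
shape = (d_a, d_a + d_x)
W_c, W_u, W_f, W_o = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_c = b_u = b_f = b_o = np.zeros(d_a)

def lstm_step(a_prev, c_prev, x_t):
    concat = np.concatenate([a_prev, x_t])        # [a^<t-1>, x^<t>]
    c_tilde = np.tanh(W_c @ concat + b_c)         # candidate cell state
    g_u = sigmoid(W_u @ concat + b_u)             # update gate
    g_f = sigmoid(W_f @ concat + b_f)             # forget gate
    g_o = sigmoid(W_o @ concat + b_o)             # output gate
    c_t = g_u * c_tilde + g_f * c_prev            # c^<t>
    a_t = g_o * np.tanh(c_t)                      # a^<t>
    return a_t, c_t

a, c = np.zeros(d_a), np.zeros(d_a)
for x_t in [rng.normal(size=d_x) for _ in range(5)]:
    a, c = lstm_step(a, c, x_t)
print(a)
```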
LSTM in pictures
LSTM in pictures
Applications: Sequence tagging
Applications: Sequence Classification
Applications: Sequence Classification
Applications: Sequence Classification
Applications: as Encoder
Bidirectional-RNN/LSTM

He said, “Teddy bears are on sale!”

He said, “Teddy Roosevelt was a great President!”
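The two "Teddy" sentences show that the meaning of a word can depend on words that come after it, which motivates bidirectional models. Below is a rough sketch of a bidirectional RNN pass, assuming toy dimensions and illustrative weight matrices (Wf, Wb); a real implementation would also include biases and learn the weights.

```python
import numpy as np

d_x, d_h = 4, 3
rng = np.random.default_rng(0)
Wf = rng.normal(scale=0.1, size=(d_h, d_h + d_x))   # forward-direction weights
Wb = rng.normal(scale=0.1, size=(d_h, d_h + d_x))   # backward-direction weights

def step(W, h_prev, x_t):
    return np.tanh(W @ np.concatenate([h_prev, x_t]))

xs = [rng.normal(size=d_x) for _ in range(6)]        # toy sequence

h_fwd, h = [], np.zeros(d_h)
for x_t in xs:                                       # left-to-right pass
    h = step(Wf, h, x_t); h_fwd.append(h)

h_bwd, h = [], np.zeros(d_h)
for x_t in reversed(xs):                             # right-to-left pass
    h = step(Wb, h, x_t); h_bwd.append(h)
h_bwd.reverse()

# The representation of position t now sees both left and right context.
reps = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(reps[0].shape)                                 # (2 * d_h,)
```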


The problem of long sequences

[Figure: an encoder-decoder RNN. The encoder reads x^<1> … x^<T_x> into a single state, from which the decoder generates ŷ^<1> … ŷ^<T_y>.]

Jane s'est rendue en Afrique en septembre dernier, a apprécié la culture et a rencontré beaucoup de gens merveilleux; elle est revenue en parlant comment son voyage était merveilleux, et elle me tente d'y aller aussi.

Jane went to Africa last September, and enjoyed the culture and met many wonderful people; she came back raving about how wonderful her trip was, and is tempting me to go too.
Attention with RNN/LSTM
Attention with RNN/LSTM
LSTM with Self Attention

Distribution over input words

"I love coffee" -> "Me gusta el café"

Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
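A rough sketch of Bahdanau-style additive attention for one decoder step, producing the "distribution over input words" mentioned above. The parameter names (W_enc, W_dec, v) and the random encoder/decoder states are illustrative assumptions; only the score-softmax-weighted-sum pattern is the point.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_enc, d_dec, d_att = 3, 3, 4                  # hypothetical sizes
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(d_att, d_enc))
W_dec = rng.normal(scale=0.1, size=(d_att, d_dec))
v     = rng.normal(scale=0.1, size=d_att)

enc_states = rng.normal(size=(5, d_enc))       # encoder states for 5 source words
s_prev     = rng.normal(size=d_dec)            # previous decoder state

# Additive scoring: e_i = v . tanh(W_enc h_i + W_dec s_prev)
scores  = np.array([v @ np.tanh(W_enc @ h + W_dec @ s_prev) for h in enc_states])
alphas  = softmax(scores)                      # distribution over input words
context = alphas @ enc_states                  # weighted sum fed to the decoder
print(alphas, context)
```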
TagLM (Pre-ELMo) Peters et al. 2017
TagLM (Pre-ELMo) Peters et al. 2017
ELMo: Embeddings from Language Models

Best Paper award at NAACL 2018


ELMo: Embeddings from Language Models
ELMo
ELMo layers
Motivation for a New Architecture
Motivation for a New Architecture
Motivation for a New Architecture
Motivation for a New Architecture
Motivation for a New Architecture

• Vanishing gradient problem

• Parallel Processing

• No Recurrent Connections

• Better Capturing Long-Term Dependencies


Attention is All You Need: Transformers
Self-Attention
Self-Attention
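A minimal sketch of scaled dot-product self-attention, the core operation of the Transformer: every token attends to every other token in a single parallel matrix product, with no recurrent connections. The dimensions and weight matrices here are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k = 8, 4                            # hypothetical dimensions
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

X = rng.normal(size=(5, d_model))              # 5 token embeddings

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_k))      # (5, 5) attention weights
output  = weights @ V                          # contextualised token representations
print(output.shape)                            # (5, 4)
```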
Rough Idea of Transformers
Rough Idea of Transformers
Is it complex? Yes and No
Limitations of Transformers

• Computationally Expensive

• Lack of interpretability

• Limited sequence length

BERT
BERT
BERT
BERT Fine Tuning
BERT Next Sentence Prediction
BERT Sentence Pair Encoding
LLMs
LLMs
BERT, 2019
BERT, 2019
BERT, 2019
T5, text-to-text, 2020
GPT/GPT 2, 2019
GPT 2, Training
GPT 3
GPT 3: In Context Learning
GPT 3: In Context Learning
GPT 3: In Context Learning
Fine-tuning LLM
Fine-tuning LLM
Fine-tuning LLM
Overall Pipeline
Multi-task Learning
Instruction Finetuning
Compute perspective
Alignment
Knobs of Instruction Finetuning
Knobs of Instruction Finetuning
FLAN
Instruction Finetuning Benefits

What is the Problem?
