
Ram Krishna Hari

Ganpati mahesh janki

Salutations again and again to Krishna, Vasudeva, Hari, the Supreme Self, Govinda, the destroyer of the afflictions of those who bow to him.


Ram. O Ashutosh, you are a generous giver; relieve the distress of this humble devotee.

From RNNs to LLMs: The Evolutionary Journey

Rahul Mishra
22.04.2024
Why do we need to represent text?

Numerical Input Requirement

Feature Extraction
Semantic Understanding
Generalization

Dimensionality Reduction

Vectorization
Representing words as discrete symbols
Problem with words as discrete symbols
Count-based Methods

A term-document matrix
What about word vectors?
Problem with raw count-based methods
Two common solutions for word weighting
Sparse vs Dense Vectors
Sparse vs Dense Vectors
Distributional Hypothesis
Represent words by their “usage”
Directly Learning Low dimensional dense vectors
Getting short dense vectors
Word2vec
Word2vec
Predict if candidate word c is a “neighbour”
Skip-Gram Classifier
Similarity is computed using dot product
Turning dot products into probabilities
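As a rough illustration of the two points above, here is a minimal sketch of the skip-gram classifier decision: similarity is a dot product between target and candidate context embeddings, and a sigmoid turns it into a probability. The toy vectors and helper names are hypothetical, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy embeddings (in practice these are learned parameters).
w_target  = np.array([0.2, -0.1, 0.4])   # embedding of the target word w
c_context = np.array([0.3,  0.0, 0.5])   # embedding of the candidate context word c

# Similarity via dot product, squashed to a probability:
# P(+ | w, c) = sigma(c . w), and P(- | w, c) = 1 - P(+ | w, c).
score = np.dot(c_context, w_target)
p_neighbour = sigmoid(score)
print(f"P(c is a neighbour of w) = {p_neighbour:.3f}")
```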
In a nutshell
Skip-Gram Training
Skip-Gram Training
Learning Vectors
Recap SGD
Word2vec Embedding
Analogical Relations
Analogical Relations
Historical Semantics
Cultural Bias
Problem with Word2vec and similar methods
Do we have a simple solution?

ELMo
ULMFiT
BERT
GPT
Neural Networks
NNs to the rescue
Artificial Neural Net
Running several logistic regressions
Running several logistic regressions
Running several logistic regressions
Computing NN output: Forward Pass
Activation Functions

tanh: a = (e^z − e^(−z)) / (e^z + e^(−z))
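A minimal sketch of the tanh activation defined above, with sigmoid and ReLU for comparison; this assumes NumPy, and in practice one would simply call np.tanh.

```python
import numpy as np

def tanh(z):
    # tanh(z) = (e^z - e^-z) / (e^z + e^-z); equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(tanh(z))        # matches np.tanh(z)
```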
Why deep learning?

[Figure: performance vs. amount of training data. Traditional ML algorithms plateau with small training data, while small, medium and large DNNs keep improving as data grows. Drivers of the trend: data, computation, algorithms.]

Why not a standard network?

[Figure: a standard fully connected network mapping inputs x^<1>, x^<2>, …, x^<T_x> to outputs y^<1>, y^<2>, …, y^<T_y>.]

Problems:
- Inputs, outputs can be different lengths in different examples.
- Doesn't share features learned across different positions of text.
RNN

‣ Idea: We can process a sequence of vectors x by applying a recurrence formula at every time step t.

‣ Note: The same parameters, weights and function are used at each step.
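A minimal NumPy sketch of that recurrence, with hypothetical dimensions and randomly initialised weights; the point is that the same weights and the same function are reused at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 3                                  # hypothetical input / hidden sizes
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_xh = rng.normal(scale=0.1, size=(d_h, d_x))
b_h  = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    # Same weights, same function at every time step.
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

xs = [rng.normal(size=d_x) for _ in range(5)]    # a toy sequence of 5 vectors
h = np.zeros(d_h)                                # h^<0>
for x_t in xs:
    h = rnn_step(h, x_t)                         # recurrence applied at each step t
print(h)
```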
Training RNN
Forward Propagation
Training RNN: Backprop in RNN
Training RNN: Backprop in RNN
Backpropagation Through Time

[Figure: an RNN unrolled through time. Starting from a^<0>, each step combines the previous activation a^<t−1> with the input x^<t> to produce a^<t> and a prediction ŷ^<t>; the error is backpropagated through every time step.]
Backpropagation Through Time
Vanishing and Exploding Gradient


The cat, that already ate……….., was full

The cats, that already ate……….., were full

The verb ("was" vs. "were") agrees with a noun that appears many steps earlier, so the network must carry information across a long span; with vanishing gradients, plain RNNs struggle to learn such long-range dependencies.
Vanishing Gradient
Vanishing Gradient
Exploding Gradient
Another Problem
Full GRU

c̃^<t> = tanh(W_c [c^<t−1>, x^<t>] + b_c)

Γ_u = σ(W_u [c^<t−1>, x^<t>] + b_u)

c^<t> = Γ_u ∗ c̃^<t> + (1 − Γ_u) ∗ c^<t−1>

The cat, which ate already, was full.
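A minimal NumPy sketch of the GRU step given by the equations above; the dimensions, random weight initialisation and helper names (gru_step, sigmoid) are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_x, d_c = 4, 3                                            # hypothetical input / memory-cell sizes
rng = np.random.default_rng(0)
W_c = rng.normal(scale=0.1, size=(d_c, d_c + d_x)); b_c = np.zeros(d_c)
W_u = rng.normal(scale=0.1, size=(d_c, d_c + d_x)); b_u = np.zeros(d_c)

def gru_step(c_prev, x_t):
    concat = np.concatenate([c_prev, x_t])                 # [c^<t-1>, x^<t>]
    c_tilde = np.tanh(W_c @ concat + b_c)                  # candidate memory c~^<t>
    gamma_u = sigmoid(W_u @ concat + b_u)                  # update gate Γ_u
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev      # c^<t>

c = np.zeros(d_c)
for x_t in [rng.normal(size=d_x) for _ in range(5)]:
    c = gru_step(c, x_t)
print(c)
```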
LSTM in pictures
c̃^<t> = tanh(W_c [a^<t−1>, x^<t>] + b_c)

Γ_u = σ(W_u [a^<t−1>, x^<t>] + b_u)    (update gate)
Γ_f = σ(W_f [a^<t−1>, x^<t>] + b_f)    (forget gate)
Γ_o = σ(W_o [a^<t−1>, x^<t>] + b_o)    (output gate)

c^<t> = Γ_u ∗ c̃^<t> + Γ_f ∗ c^<t−1>
a^<t> = Γ_o ∗ tanh(c^<t>)

[Figure: a single LSTM cell combining the update, forget and output gates, and three cells chained through time: a^<0>, c^<0> → a^<1>, c^<1> → a^<2>, c^<2> → a^<3>, c^<3>, with inputs x^<1>, x^<2>, x^<3> and softmax outputs y^<1>, y^<2>, y^<3>.]
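A minimal NumPy sketch of the LSTM step pictured above; as with the GRU sketch, the sizes, random weights and helper names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_x, d_a = 4, 3                                   # hypothetical input / hidden sizes
rng = np.random.default_rng(0)
shape = (d_a, d_a + d_x)
W_c, W_u, W_f, W_o = (rng.normal(scale=0.1, size=shape) for _ in range(4))
b_c = b_u = b_f = b_o = np.zeros(d_a)

def lstm_step(a_prev, c_prev, x_t):
    concat = np.concatenate([a_prev, x_t])        # [a^<t-1>, x^<t>]
    c_tilde = np.tanh(W_c @ concat + b_c)         # candidate cell state
    g_u = sigmoid(W_u @ concat + b_u)             # update gate
    g_f = sigmoid(W_f @ concat + b_f)             # forget gate
    g_o = sigmoid(W_o @ concat + b_o)             # output gate
    c_t = g_u * c_tilde + g_f * c_prev            # c^<t>
    a_t = g_o * np.tanh(c_t)                      # a^<t>
    return a_t, c_t

a, c = np.zeros(d_a), np.zeros(d_a)
for x_t in [rng.normal(size=d_x) for _ in range(5)]:
    a, c = lstm_step(a, c, x_t)
print(a)
```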
LSTM in pictures
LSTM in pictures
Applications: Sequence tagging
Applications: Sequence Classification
Applications: Sequence Classification
Applications: Sequence Classification
Applications: as Encoder
Bidirectional-RNN/LSTM

He said, “Teddy bears are on sale!”

He said, “Teddy Roosevelt was a great President!”
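The two "Teddy" sentences show that the meaning of a word can depend on words that come after it, which motivates bidirectional models. Below is a rough sketch of a bidirectional RNN pass, assuming toy dimensions and illustrative weight matrices (Wf, Wb); a real implementation would also include biases and learn the weights.

```python
import numpy as np

d_x, d_h = 4, 3
rng = np.random.default_rng(0)
Wf = rng.normal(scale=0.1, size=(d_h, d_h + d_x))   # forward-direction weights
Wb = rng.normal(scale=0.1, size=(d_h, d_h + d_x))   # backward-direction weights

def step(W, h_prev, x_t):
    return np.tanh(W @ np.concatenate([h_prev, x_t]))

xs = [rng.normal(size=d_x) for _ in range(6)]        # toy sequence

h_fwd, h = [], np.zeros(d_h)
for x_t in xs:                                       # left-to-right pass
    h = step(Wf, h, x_t); h_fwd.append(h)

h_bwd, h = [], np.zeros(d_h)
for x_t in reversed(xs):                             # right-to-left pass
    h = step(Wb, h, x_t); h_bwd.append(h)
h_bwd.reverse()

# The representation of position t now sees both left and right context.
reps = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(reps[0].shape)                                 # (2 * d_h,)
```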


The problem of long sequences

[Figure: an encoder-decoder RNN. The encoder reads x^<1> … x^<T_x> into a single state, from which the decoder generates ŷ^<1> … ŷ^<T_y>.]

Jane s'est rendue en Afrique en septembre dernier, a apprécié la culture et a rencontré beaucoup de gens merveilleux; elle est revenue en parlant comment son voyage était merveilleux, et elle me tente d'y aller aussi.

Jane went to Africa last September, and enjoyed the culture and met many wonderful people; she came back raving about how wonderful her trip was, and is tempting me to go too.
Attention with RNN/LSTM
Attention with RNN/LSTM
LSTM with Self Attention

Distribution over input words

"I love coffee" -> "Me gusta el café"

Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
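A rough sketch of Bahdanau-style additive attention for one decoder step, producing the "distribution over input words" mentioned above. The parameter names (W_enc, W_dec, v) and the random encoder/decoder states are illustrative assumptions; only the score-softmax-weighted-sum pattern is the point.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_enc, d_dec, d_att = 3, 3, 4                  # hypothetical sizes
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(d_att, d_enc))
W_dec = rng.normal(scale=0.1, size=(d_att, d_dec))
v     = rng.normal(scale=0.1, size=d_att)

enc_states = rng.normal(size=(5, d_enc))       # encoder states for 5 source words
s_prev     = rng.normal(size=d_dec)            # previous decoder state

# Additive scoring: e_i = v . tanh(W_enc h_i + W_dec s_prev)
scores  = np.array([v @ np.tanh(W_enc @ h + W_dec @ s_prev) for h in enc_states])
alphas  = softmax(scores)                      # distribution over input words
context = alphas @ enc_states                  # weighted sum fed to the decoder
print(alphas, context)
```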
TagLM (Pre-ELMo) Peters et al. 2017
TagLM (Pre-ELMo) Peters et al. 2017
ELMo: Embeddings from Language Models

Best Paper award at NAACL 2018


ELMo: Embeddings from Language Models
ELMo
ELMo layers
Motivation for a New Architecture
Motivation for a New Architecture
Motivation for a New Architecture
Motivation for a New Architecture
Motivation for a New Architecture

• Vanishing gradient problem

• Parallel Processing

• No Recurrent Connections

• Better Capturing Long-Term Dependencies


Attention is All You Need: Transformers
Self-Attention
Self-Attention
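A minimal sketch of scaled dot-product self-attention, the core operation of the Transformer: every token attends to every other token in a single parallel matrix product, with no recurrent connections. The dimensions and weight matrices here are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k = 8, 4                            # hypothetical dimensions
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

X = rng.normal(size=(5, d_model))              # 5 token embeddings

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_k))      # (5, 5) attention weights
output  = weights @ V                          # contextualised token representations
print(output.shape)                            # (5, 4)
```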
Rough Idea of Transformers
Rough Idea of Transformers
Is it complex? Yes and No
Limitations of Transformers

• Computationally Expensive

• Lack of interpretability

• Limited sequence length

BERT
BERT
BERT
BERT Fine Tuning
BERT Next Sentence Prediction
BERT Sentence Pair Encoding
LLMs
LLMs
BERT, 2019
BERT, 2019
BERT, 2019
T5, text-to-text, 2020
GPT/GPT 2, 2019
GPT 2, Training
GPT 3
GPT 3: In Context Learning
GPT 3: In Context Learning
GPT 3: In Context Learning
Fine-tuning LLM
Fine-tuning LLM
Fine-tuning LLM
Overall Pipeline
Multi-task Learning
Instruction Finetuning
Compute perspective
Alignment
Knobs of Instruction Finetuning
Knobs of Instruction Finetuning
FLAN
Instruction Finetuning Benefits

What is the Problem?
