Lec RNNs 2 LLMs - 1
Rahul Mishra
22.04.2024
Why do we need to represent text?
Dimensionality Reduction
Vectorization
Representing words as discrete symbols
Problem with words as discrete symbols
Count-based Methods
A term-document matrix
What about word vectors?
Problem with raw count based methods
Two common solutions for the word weighting
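The two weighting schemes usually meant here are tf-idf and PPMI. A minimal tf-idf sketch over a toy term-document count matrix (the counts and terms are made up for illustration, not taken from the lecture):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
counts = np.array([
    [10, 0, 3],   # "battle"
    [ 0, 8, 1],   # "good"
    [ 5, 5, 5],   # "the" -- frequent everywhere, should be down-weighted
], dtype=float)

tf = np.log1p(counts)                  # dampened term frequency
df = (counts > 0).sum(axis=1)          # document frequency of each term
idf = np.log(counts.shape[1] / df)     # inverse document frequency
tfidf = tf * idf[:, None]              # weight each term-document cell

print(tfidf)   # the row for "the" is zeroed out, since it occurs in every document
```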
Sparse vs Dense Vectors
Sparse vs Dense Vectors
Distributional Hypothesis
Represent words by their “usage”
Directly Learning Low dimensional dense vectors
Getting short dense vectors
Word2vec
Word2vec
Predict if candidate word c is a “neighbour”
Skip-Gram Classifier
Similarity is computed using dot product
Turning dot products into probabilities
In a nutshell
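In a nutshell, the skip-gram classifier scores a (target, candidate) pair by the dot product of their vectors and turns that score into a probability with a sigmoid. A small numpy sketch, with made-up 4-dimensional vectors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical embeddings for a target word w and a candidate context word c.
w = np.array([0.2, -0.1, 0.4, 0.3])   # target word vector
c = np.array([0.1,  0.0, 0.5, 0.2])   # candidate "neighbour" vector

score = np.dot(w, c)                  # similarity as a dot product
p_neighbour = sigmoid(score)          # P(+ | w, c): c really is a neighbour
p_not = 1.0 - p_neighbour             # P(- | w, c)

print(p_neighbour, p_not)
```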
Skip-Gram Training
Skip-Gram Training
Learning Vectors
Recap SGD
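As a reminder, SGD repeatedly nudges the parameters against the gradient of the loss on a single example (or a small batch). A generic one-step sketch, with a placeholder gradient function standing in for the skip-gram loss gradient:

```python
def sgd_step(theta, grad_fn, example, lr=0.05):
    """One SGD update: theta <- theta - lr * dL/dtheta on a single example."""
    return theta - lr * grad_fn(theta, example)

# Toy usage: minimise (theta - 3)^2, ignoring the "example" argument.
grad = lambda theta, _ex: 2 * (theta - 3.0)
theta = 0.0
for _ in range(100):
    theta = sgd_step(theta, grad, None, lr=0.1)
print(theta)   # approaches 3.0
```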
Word2vec Embedding
Analogical Relations
Analogical Relations
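The classic illustration is solving analogies by vector arithmetic, e.g. vector("king") - vector("man") + vector("woman") ≈ vector("queen"). A hedged sketch using gensim's pretrained Google News vectors (the model name and download step are gensim's conventions, not part of the lecture; the model is a large download):

```python
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained word2vec KeyedVectors
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically [('queen', ~0.71)]
```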
Historical Semantics
Cultural Bias
Problem with Word2vec and similar methods
Do we have a simple solution?
ELMo
ULMfit
BERT
GPT
Neural Networks
NNs to the rescue
Artificial Neural Net
Running several logistic regressions
Running several logistic regressions
Running several logistic regressions
Computing NN output: Forward Pass
Activation Functions
tanh: $a = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
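A one-layer forward pass using this activation, as a minimal numpy sketch (shapes and weight values are illustrative only):

```python
import numpy as np

def tanh(z):
    # tanh(z) = (e^z - e^-z) / (e^z + e^-z), as in the formula above
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# Forward pass of one dense layer: a = tanh(W x + b), with made-up sizes.
x = np.array([0.5, -1.0, 2.0])       # 3 input features
W = np.random.randn(4, 3) * 0.1      # 4 hidden units
b = np.zeros(4)
z = W @ x + b                        # pre-activation
a = tanh(z)                          # activations in (-1, 1)
print(a)
```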
Why deep learning?
[Figure: performance vs. amount of data, with curves for large and medium DNNs]
[Sequence notation: input $x^{<1>}, x^{<2>}, \ldots, x^{<T_x>}$ and output $y^{<1>}, y^{<2>}, \ldots, y^{<T_y>}$]
Problems:
- Inputs, outputs can be different lengths in different examples.
‣ Idea: We can process a sequence of vectors x by applying a recurrence formula at every time step t (sketched below).
‣ Note: The same parameters, weights, and function are used at each step.
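A minimal numpy sketch of that recurrence, assuming the common form h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b) with made-up sizes; the same weights are applied at every time step:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    # One application of the recurrence: h_t = f(h_{t-1}, x_t), here f = tanh(...)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

hidden, inp = 5, 3                        # hypothetical sizes
W_hh = np.random.randn(hidden, hidden) * 0.1
W_xh = np.random.randn(hidden, inp) * 0.1
b = np.zeros(hidden)

h = np.zeros(hidden)                      # initial hidden state
for x_t in np.random.randn(7, inp):       # a length-7 input sequence
    h = rnn_step(h, x_t, W_hh, W_xh, b)   # same parameters at every time step
print(h)
```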
Training RNN
Forward Propagation
Training RNN: Backprop in RNN
Training RNN: Backprop in RNN
Backpropagation Through Time
[Figure: unrolled RNN computation graph for backpropagation through time]

GRU:
$\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$
$\tilde{c}^{<t>} = \tanh(W_c[c^{<t-1>}, x^{<t>}] + b_c)$
$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$

LSTM (update gate, forget gate, output gate):
$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$
$\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$
$\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$
$\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$
$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$
$a^{<t>} = \Gamma_o * \tanh(c^{<t>})$
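A direct transcription of the LSTM equations above into numpy (sigmoid gates, concatenated $[a^{<t-1>}, x^{<t>}]$ input; the weight values and sizes are random placeholders, not trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM step following the gate equations above."""
    concat = np.concatenate([a_prev, x_t])                  # [a^<t-1>, x^<t>]
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"]) # candidate cell state
    g_u = sigmoid(params["Wu"] @ concat + params["bu"])     # update gate
    g_f = sigmoid(params["Wf"] @ concat + params["bf"])     # forget gate
    g_o = sigmoid(params["Wo"] @ concat + params["bo"])     # output gate
    c_t = g_u * c_tilde + g_f * c_prev                      # new cell state
    a_t = g_o * np.tanh(c_t)                                # new hidden state
    return a_t, c_t

# Hypothetical sizes: hidden = 4, input = 3.
h, d = 4, 3
params = {k: np.random.randn(h, h + d) * 0.1 for k in ("Wc", "Wu", "Wf", "Wo")}
params.update({k: np.zeros(h) for k in ("bc", "bu", "bf", "bo")})

a, c = np.zeros(h), np.zeros(h)
a, c = lstm_step(a, c, np.random.randn(d), params)
print(a, c)
```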
Distribution over input words
• Parallel Processing
• Computationally Expensive
• Lack of interpretability
BERT
BERT
BERT
BERT Fine Tuning
BERT Next Sentence Prediction
BERT Sentence Pair Encoding
LLMs
LLMs
BERT, 2019
BERT, 2019
BERT, 2019
T5, text-to-text, 2020
GPT/GPT 2, 2019
GPT 2, Training
GPT 3
GPT 3: In Context Learning
GPT 3: In Context Learning
GPT 3: In Context Learning
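In-context learning means the task is specified entirely in the prompt, typically as a few demonstrations followed by a new query, with no gradient updates. An illustrative few-shot prompt (the sentiment examples are made up, not from the lecture):

```python
# A toy few-shot prompt: the model sees demonstrations and completes the last line.
prompt = """Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: A delightful film with a stellar cast.
Sentiment: positive

Review: I couldn't stop checking my watch.
Sentiment:"""
# The model is expected to continue with " negative" purely from the context.
```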
Fine-tuning LLM
Fine-tuning LLM
Fine-tuning LLM
Overall Pipeline
Multi-task Learning
Instruction Finetuning
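Instruction finetuning trains the model on many tasks phrased as natural-language instructions, each paired with an input and a target output. One hypothetical training record (the fields and example are illustrative, not taken from FLAN itself):

```python
# One made-up instruction-tuning example; real datasets (e.g. FLAN) mix
# thousands of tasks phrased as instructions.
example = {
    "instruction": "Translate the following sentence to German.",
    "input": "The weather is nice today.",
    "output": "Das Wetter ist heute schön.",
}
# Training minimises the usual language-modelling loss on `output`,
# conditioned on the instruction and input.
```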
Compute perspective
Alignment
Knobs of Instruction Finetuning
Knobs of Instruction Finetuning
FLAN
Instruction Finetuning Benefits