
VISUAL RECOGNITION – PART 2

Lecture 6: Transformer-based Language Modelling


Transformers for Image Classification
http://jalammar.github.io/illustrated-transformer/
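As a refresher from the linked post: each token's output (the 𝑧 values referenced on the next slide) is an attention-weighted sum of value vectors. A minimal single-head sketch in plain PyTorch (function and variable names are ours, not from the lecture):

    import torch

    def self_attention(X, Wq, Wk, Wv):
        # X: (n_tokens, d_model); Wq, Wk, Wv: (d_model, d_k) learned projections.
        # Returns Z: (n_tokens, d_k), one output vector z_i per input token.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / K.shape[-1] ** 0.5    # (n, n): each query vs. every key
        weights = torch.softmax(scores, dim=-1)  # each row sums to 1
        return weights @ V                       # z_i = sum_j weights[i, j] * v_j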

Position Encoding in Transformers

Will changing the order of the input sequence affect the respective 𝑧 values?

Self-attention is permutation-equivariant, so we need to add position information to every token to preserve the sequence order.
https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
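The fix proposed in the cited paper is a fixed sinusoidal encoding added to each token embedding. A sketch of that formulation (max_len and d_model are illustrative):

    import math
    import torch

    def sinusoidal_positions(max_len, d_model):
        # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    # Added to the token embeddings before the first encoder layer:
    # x = token_embeddings + sinusoidal_positions(seq_len, d_model)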

Machine Translation : Self Attention + Cross Attention
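In an encoder-decoder translation model, each decoder block combines self-attention over the target prefix with cross-attention, whose queries come from the decoder and whose keys/values come from the encoder output. A minimal sketch using torch.nn.MultiheadAttention (sizes are illustrative):

    import torch
    import torch.nn as nn

    d_model, n_heads = 512, 8
    self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    src = torch.randn(1, 12, d_model)  # encoder output for the source sentence
    tgt = torch.randn(1, 7, d_model)   # decoder states for the target prefix

    h, _ = self_attn(tgt, tgt, tgt)    # queries, keys, values all from the target
    out, _ = cross_attn(h, src, src)   # queries from decoder; keys/values from encoder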


Representation Learning with Self-Supervision + Transformers: BERT

Bi-directional modeling is done with the help of [MASK] tokens

Mask a percentage of input tokens at random (e.g. 15%)

Predict the masked tokens using a Transformer encoder architecture
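A sketch of the masking step (15% as above; for brevity this always substitutes [MASK], whereas full BERT replaces 10% of the chosen tokens with random ids and leaves 10% unchanged):

    import torch

    def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
        labels = input_ids.clone()
        chosen = torch.rand(input_ids.shape) < mask_prob
        labels[~chosen] = -100             # -100: ignored by cross-entropy loss
        masked_ids = input_ids.clone()
        masked_ids[chosen] = mask_token_id
        return masked_ids, labels          # encoder predicts labels at masked spots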

Sentence Embeddings: the final hidden state of the special [CLS] token (or a pooling over the token outputs) is commonly used as a fixed-length sentence embedding

Generative Modeling with Self-Supervision + Transformers: GPT
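Where BERT masks tokens and looks both ways, GPT trains a decoder-only Transformer on next-token prediction, using a causal mask so each position attends only to its past. A sketch of the mask (additive -inf form, as in common implementations):

    import torch

    def causal_mask(n):
        # Position i may attend to positions j <= i only.
        return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

    # Applied to the attention scores before the softmax:
    # scores = Q @ K.T / d_k ** 0.5 + causal_mask(n)
    # Training objective: predict token t+1 from tokens 1..t.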
Vision Transformer (ViT)

Transformers can replace CNNs in image recognition!


Vision Transformer Steps:

• Split the image into fixed-size patches
• Linearly embed each patch
• Add position embeddings
• Feed the resulting sequence of vectors to a standard Transformer encoder

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE


ViT : Patch Creation

With patch size 𝑃, an image x ∈ ℝ^(H×W×C) is reshaped into a sequence of N = HW/P² flattened patches x_p1, …, x_pN, each of dimension P²·C. For example, a 224×224×3 image with P = 16 yields N = 196 patches of dimension 768.
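A sketch of the patchify step with plain tensor ops (shapes follow the example above):

    import torch

    img = torch.randn(3, 224, 224)                 # (C, H, W)
    P = 16
    # Cut into non-overlapping P x P patches, then flatten each patch.
    patches = img.unfold(1, P, P).unfold(2, P, P)  # (C, 14, 14, P, P)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * P * P)
    print(patches.shape)                           # torch.Size([196, 768])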



ViT : Patch Input Embedding
The input sequence z0 is built by projecting each flattened patch x_pi with a learnable embedding matrix E, prepending a learnable [class] token x_class, and adding learnable position embeddings E_pos:

z0 = [x_class; x_p1·E; x_p2·E; …; x_pN·E] + E_pos,  with E ∈ ℝ^((P²·C)×D) and E_pos ∈ ℝ^((N+1)×D)
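A sketch of building z0 from the patches produced above (D, the latent width, is illustrative; E, the [class] token, and E_pos are learnable, as on the slide):

    import torch
    import torch.nn as nn

    N, patch_dim, D = 196, 768, 512
    E = nn.Linear(patch_dim, D)                         # patch embedding projection
    cls_token = nn.Parameter(torch.zeros(1, D))         # learnable [class] token
    E_pos = nn.Parameter(torch.randn(N + 1, D) * 0.02)  # learnable position embeddings

    def embed(patches):                                 # patches: (N, patch_dim)
        tokens = torch.cat([cls_token, E(patches)], dim=0)
        return tokens + E_pos                           # z0: (N + 1, D)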



ViT : Encoder & Final MLP Head

The encoder output at the [class] token is passed through an MLP head to produce class probabilities (e.g. Cat, Dog, Horse, Pattern).
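A sketch of this final stage, reusing embed() and patches from the previous sketches (depth and widths are illustrative; norm_first=True gives pre-norm blocks, though PyTorch's stock encoder layer only approximates the paper's exact block):

    import torch.nn as nn

    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                   norm_first=True, batch_first=True),
        num_layers=6,
    )
    mlp_head = nn.Sequential(nn.LayerNorm(512), nn.Linear(512, 4))  # 4 example classes

    z0 = embed(patches).unsqueeze(0)   # (1, N + 1, 512)
    z_L = encoder(z0)                  # same shape after all encoder blocks
    logits = mlp_head(z_L[:, 0])       # classify from the [class] token only
    probs = logits.softmax(dim=-1)     # the class probabilities shown above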



Vision Transformer – Attention Maps
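Maps like these can be approximated by reading the attention from the [class] token to the patch tokens and reshaping it onto the patch grid (a rough sketch, not necessarily the exact method behind the slide; attn is assumed captured via a forward hook, and attention-rollout-style aggregation across layers gives cleaner maps):

    # attn: (n_heads, N + 1, N + 1) attention weights from one encoder layer
    cls_to_patches = attn.mean(dim=0)[0, 1:]   # average heads; [class] -> patches
    attn_map = cls_to_patches.reshape(14, 14)  # back onto the 14 x 14 patch grid
    # Upsample to the image size (e.g. 224 x 224) to overlay on the input.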

