
Towards Unified Architecture and Pre-training for AI
Han Hu (胡瀚)
Microsoft Research Asia
@DataFun
December 17th, 2022
Architecture Convergence (2020-)
• Architecture convergence of CV and NLP
[Figure: NLP architecture: Transformer (2017); CV architecture: ViT/Swin (2020-2021). Both are stacks of Transformer blocks (e.g., 16×).]


Pre-training Convergence (2021-)
• Pre-training convergence of CV and NLP
• Masked language (image) modeling

NLP Pre-training: BERT/GPT (2018) CV Pre-training: BEiT/iBoT/MAE/SimMIM (2021)


The Grand Unification
[Figure: NLP/Speech, CV, and social computing.]


Human brain: Universal architecture
• Universal architecture (innate): the neocortex
[Figure: neocortex regions: decision/planning/consciousness, motor, somatosensory, gustatory, olfactory, auditory, visual, language.]
Source: Cognitive Neuroscience – The Biology of the Mind
Human brain: Universal pre-training
• Unified learning approach: compare the difference between prediction and ground-truth input
• The thalamus plays a key role

Credit: David Eagleman


Practical benefit of convergence
• Technology and knowledge sharing
• CV technologies => NLP technologies (2012 - 2018)
• NLP technologies => CV technologies (2020 - )
• Promote multi-modality applications
• Zero-shot recognition (CLIP, 2021), Text-to-image generation (DALL-E, 2021)
• Decreasing costs and increasing benefits
• E.g., chip design only needs to optimize for Transformers
Encouraging Large models for both NLP and CV
[Chart: model size (parameters, log scale) vs. year, 2018 onward. NLP models: ELMo (94M), BERT-L (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17B), GPT-3 (175B), Megatron-Turing (530B). CV models: ViT (632M), BiT (940M), ViT-G (1.8B), SwinV2 (3B). Human brain (~200T) shown for reference.]
Towards Converged AI Architectures
Transformer (2017)
ViT / Swin Transformer (V1, 2021, V2, 2022)
Mainstream models for different AI problems

Vision: CNN; Language: Transformer; Social computing: Graph Networks


Model evolution for NLP or sequential data

1995: LSTM (Jürgen Schmidhuber)
2013: Deep LSTM (Geoffrey Hinton & Microsoft)
2014: GRU (Google)
2015: RNN + attention (CIFAR)
2017: Transformer (Google)
Ashish Vaswani et al. Attention is all you need. NeurIPS 2017


Model evolution for CV

1980: CNN (Kunihiko Fukushima, Yann LeCun)
AlexNet, GoogLeNet, VGGNet, ResNet, DenseNet, …


Can NLP and CV use the same architectures?
• Adapting CNN or convolution for NLP

CNN-based: 2017.5 ConvSeq2Seq (FAIR); 2019.2 Dynamic Convolution (FAIR); 2019.4 Deformable Convolution (MSRA)
Transformer-based: 2017.6 Transformer (Google Brain), which came to dominate
Can NLP and CV use the same architectures?
• Adapting Transformers or attention for CV

1980: CNN (Kunihiko Fukushima, Yann LeCun)
2017.11: attention models: NLNet (FAIR), RelationNet (MSRA)
2019.4: LR-Net (MSRA)
2020.10: ViT (Google)
2021.03: vision Transformers: Swin Transformer (MSRA) and many others
Vision Transformer (ViT)
• New record on ImageNet-1K classification
• Key merits over CNN: Dynamics, long-term relationship modeling
[Chart: ImageNet-1K classification: ViT sets a new record, +0.1% over the prior best.]

Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Swin Transformer
• New records on dense visual tasks
[Charts: ADE20K semantic segmentation (Swin: +3.3%) and COCO object detection (Swin: +2.7%).]

Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV2021
Swin Transformer
✓ Swin Transformer = Transformer + visual priors
✓ Shifted Non-overlapping Windows
✓ Training suites for vision Transformers and quick open-sourcing

Swin Transformer block

Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV2021
Transformer + visual priors
• Hierarchy (scale invariance): "I am a fluffy cat." cannot be rescaled to "I am a fluffy fluffy cat cat." (invalid); language has no scale variation
• Locality (spatial smoothness): "I like the green grass."; language has no spatial smoothness
• Translation invariance: "I am a fluffy cat." vs. "Fluffy cat is me."; language is sensitive to absolute locations
From sliding windows to shifted windows
• Non-overlapping windows (3x speed up in latency)
• Shifted window configurations in the next layer

[Figure: sliding-window attention (CNN/LR-Net) vs. shifted-window attention (Swin, ICCV 2021): queries attend within non-overlapping windows in layer L, and the window grid is shifted in layer L+1.]
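The shift can be implemented as a cyclic roll of the feature map before the usual non-overlapping window partition, so no extra window shapes are needed. A minimal PyTorch sketch of the idea; shapes and names are illustrative, not the official Swin code:

```python
# Minimal sketch: non-overlapping window partition and the cyclic shift
# applied between consecutive layers (illustrative, not the official code).
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows,
    returning (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

B, H, W, C, ws = 2, 8, 8, 32, 4
x = torch.randn(B, H, W, C)

# Layer L: self-attention runs independently inside each regular window.
windows_l = window_partition(x, ws)                      # (B * 4, 16, C)

# Layer L+1: cyclically shift by half a window, then partition again, so
# tokens that sat on window borders in layer L now share a window.
x_shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
windows_l1 = window_partition(x_shifted, ws)             # (B * 4, 16, C)
```

In the real model, an attention mask handles the regions wrapped around by the roll, and the shift is reversed after attention.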
Vision Transformer recipes
A general design principle to apply Transformers

Generic Transformer unit + domain-specific priors (defined by graphs)
Graphormer, NeurIPS 2021


Why did the Transformer become the choice?
• General modeling capability
• All concepts (concrete or abstract) and their relationships can be modeled by a graph
• Modeling arbitrary relationships via projected verification, which is hard for CNNs
[Figure: attention as a weighted sum: the input (e.g., "player") is projected into query, key, and value embeddings; query-key similarity weights the value set to form the output feature.]
Source: Towards Universal Learning Machine: Attention modeling, VALSE Webinar (2019.7)
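Concretely, the diagram boils down to scaled dot-product attention: project the input into query/key/value embeddings, score pairwise similarity between queries and keys, and output a similarity-weighted sum of the values. A minimal single-head sketch with illustrative dimensions:

```python
import torch
import torch.nn.functional as F

def attention(x, w_q, w_k, w_v):
    """Single-head attention: similarity between projected queries and keys,
    softmax-normalized, is used to weight the projected values."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # query/key/value embeddings, (N, d)
    sim = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5    # (N, N) pairwise relation scores
    attn = F.softmax(sim, dim=-1)
    return attn @ v                                         # weighted sum of the value set

N, d = 6, 16                                                # e.g., 6 tokens or image regions
x = torch.randn(N, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = attention(x, w_q, w_k, w_v)                           # (6, 16)
```

Because the relation weights are computed from the data itself, any pairwise relationship between concepts can in principle be expressed, which is the generality argued for above.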
What’s Next?
Swin Transformer V2
Scaling up capacity is the primary driving force for
the remarkable progress of NLP

[Chart: zero-shot accuracy on LAMBADA vs. parameters (millions).]
Computer vision models are relatively small
[Chart: model size vs. year, as above. The largest NLP model (Megatron-Turing, 530B) is ~300× larger than the largest CV model (ViT-G, 1.8B), and the human brain (~200T) is ~100,000× larger.]
Swin Transformer V2

• The world’s largest dense vision model (3 billion parameters)


• New records on broad vision tasks (Nov. 2021)
• COCO object detection (63.1/54.4 box/mask AP)
• ADE20K semantic segmentation (59.9 mIoU)
• Kinetics400 video classification (86.8% top-1)
• ImageNet-V2 image classification (84.0% top-1)

[Figure: example tasks: detection (region-level, e.g., "car"), segmentation (pixel-level), video classification (video-level, e.g., "riding a bike"), image classification (image-level, e.g., "horse").]
Compared to previous billion-scale vision models

Swin V2: 3B parameters, 70M images
ViT-G: 1.8B parameters, 3B images
CoAtNet-7: 2.4B parameters, 3B images

Compared to these, Swin V2 is:
• 25% larger
• trained on 40× fewer labelled images
• 10× lower in training cost
• applicable to richer tasks
The key to SwinV2's label and cost efficiency
• Classification (e.g., ViT-G / CoAtNet-7, Google): supervision is a category label ("Penguin"); 100,000 classes ≈ 10 bits per image
• Vision-language contrastive learning (e.g., CLIP; BASIC-3B, Google; Florence / Bletchley, Microsoft): supervision is a description ("A penguin stands on the snow"); 1,000,000 sentences ≈ 20 bits per image
• Masked image modeling (e.g., BEiT/SimMIM; Swin V2, Microsoft): supervision is the raw 224×224 pixels; >> 100,000 bits per image
Applicable to richer tasks with higher resolution
• Issue: a large resolution discrepancy between pre-training (e.g., 224×224 for image classification) and fine-tuning (e.g., 1000×2000 for object detection)
• Solution: replace SwinV1's discrete relative position bias with SwinV2's continuous relative position bias, where a small MLP maps relative coordinates [Δx̂, Δŷ] to the bias
Han Hu et al. Local Relation Networks for Image Recognition, ICCV 2019
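A rough sketch of the continuous bias described above: a tiny MLP maps (log-spaced) relative coordinates to a per-head bias, so window sizes not seen during pre-training can still be handled when fine-tuning at higher resolution. This is simplified; the normalization and constants in the official SwinV2 code may differ.

```python
import torch
import torch.nn as nn

class ContinuousPositionBias(nn.Module):
    """Sketch of a continuous relative position bias: a tiny MLP maps
    log-spaced relative coordinates (dx, dy) to a per-head bias."""
    def __init__(self, num_heads, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_heads))

    def forward(self, window_size):
        ws = window_size
        coords = torch.stack(torch.meshgrid(
            torch.arange(ws), torch.arange(ws), indexing="ij"), dim=-1)  # (ws, ws, 2)
        coords = coords.reshape(-1, 2).float()                           # (ws*ws, 2)
        rel = coords[:, None, :] - coords[None, :, :]                    # (N, N, 2) offsets
        # Log-spacing keeps extrapolation to larger windows mild.
        rel = torch.sign(rel) * torch.log1p(rel.abs())
        return self.mlp(rel).permute(2, 0, 1)                            # (heads, N, N)

bias = ContinuousPositionBias(num_heads=4)(window_size=7)   # works for any window size
print(bias.shape)                                           # torch.Size([4, 49, 49])
```

Because the bias is a function of coordinates rather than a lookup table indexed by discrete offsets, transferring to larger windows degrades gracefully.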
Towards better training stability
• Vision model training can be very unstable, with a ~30,000× larger activation value range
[Figure: activation amplitudes across one block. With SwinV1's pre-norm (LayerNorm before each sub-layer, unnormalized outputs added to the residual), example amplitudes grow 1 → 11 → 62 through the block; with SwinV2's res-post-norm plus cosine attention (LayerNorm applied to each sub-layer's output before the residual add), they stay around 1 → 2 → 3.]
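A minimal sketch of the two changes, assuming a simplified single-head layout; the learnable temperature parameterization here is common practice and may differ from the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineAttention(nn.Module):
    """Attention on L2-normalized queries/keys (cosine similarity), scaled by a
    learnable temperature, so logits stay bounded regardless of feature magnitude."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.logit_scale = nn.Parameter(torch.zeros(1))   # learnable temperature (log-space)

    def forward(self, x):                                 # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = (sim * self.logit_scale.exp()).softmax(dim=-1)
        return self.proj(attn @ v)

class ResPostNormBlock(nn.Module):
    """Res-post-norm: normalize each sub-layer's output *before* the residual add,
    so activation amplitudes do not accumulate with depth (SwinV1 used pre-norm)."""
    def __init__(self, dim):
        super().__init__()
        self.attn, self.norm1 = CosineAttention(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.norm1(self.attn(x))
        x = x + self.norm2(self.mlp(x))
        return x

out = ResPostNormBlock(dim=96)(torch.randn(2, 49, 96))    # (2, 49, 96)
```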
The gap is still large
[Chart: model size vs. year, as before. Even with SwinV2 at 3B parameters, the largest NLP model (Megatron-Turing, 530B) is still ~180× larger, and the human brain (~200T) ~70,000× larger.]
Tutel: ML system to support sparse MoE models
at scale
• Single layer: 4.96x speedup on 16-GPU, 5.75x speedup on 2,048-GPU
• End-to-end models:
• Collaborated with Meta to deliver a ~1.4x speedup for Meta's 1.1T model
• A Swin-MoE baseline

Meta (Facebook)’s 1.1 trillion–parameter MoE language model


[Figure: BERT/GPT (NLP) and BEiT/MAE/SimMIM (CV) mapped onto the thalamus analogy.]

Towards Converged Pre-training


BERT/GPT (2018/2019)
BEiT/MAE/SimMIM (2021)
Pre-training is the basis for deep learning to be
applicable to various visual tasks

Supervised pre-training on ImageNet → pre-trained models → transfer via fine-tuning on downstream tasks:
• Fine-grained classification
• Object detection
• Semantic segmentation
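As a concrete illustration of this recipe, a minimal sketch with torchvision; the ResNet-50 backbone and the 200-class downstream task are placeholder choices:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Start from an ImageNet-pre-trained backbone (placeholder choice of model).
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# Swap the classification head for the downstream task,
# e.g. a fine-grained dataset with 200 classes.
model.fc = nn.Linear(model.fc.in_features, 200)

# Fine-tune end-to-end (or freeze the backbone and train only the new head).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```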
Visual pre-training

2006.11: AutoEncoder, self-supervised (UToronto)
2012.6: AlexNet, image classification (UToronto)
2019.11: MoCo, contrastive self-supervised (FAIR)
2021.1: CLIP, multi-modal (OpenAI)
2021.6: BEiT, MIM-based self-supervised (MSRA)
2021.11: MAE (FAIR) / SimMIM (MSRA)


Generative Pre-training: Masked Image Modeling

Task: Predict the masked area


Three Questions
• How to make MIM pre-training as simple as possible?

• Could MIM benefit from larger-scale data?

• Why does MIM pretraining work well, and could MIM inspire other
pretraining approaches?
SimMIM: A Simple Framework on MIM
[Figure: SimMIM pipeline: randomly mask the input image, encode it with a backbone (e.g., ViT, Swin, ResNet), predict raw pixels with a one-layer prediction head, and apply an ℓ1 loss on the masked area.]

➢ Masking strategy: Random masking with relatively large patch size (e.g., 32x32)
➢ Prediction head: An extremely lightweight prediction head (e.g., a linear layer)
➢ Prediction target: A simple raw pixel regression task
➢ Encoder architecture: ViT, Swin and ResNet could all benefit from SimMIM
➢ Sparse decoder (MAE) or dense decoder (SimMIM)?
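A minimal sketch of the recipe above (random large-patch masking, a one-layer head, ℓ1 loss on masked pixels only). The stride-32 placeholder encoder and the shapes are assumptions; SimMIM replaces masked patches with a learnable mask token, which this sketch simplifies to zeroing pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def simmim_loss(encoder, head, images, patch_size=32, mask_ratio=0.6):
    """images: (B, 3, H, W). Mask random large patches, encode the masked image,
    regress raw pixels with a lightweight head, l1 loss on masked pixels only."""
    B, C, H, W = images.shape
    gh, gw = H // patch_size, W // patch_size
    # Random mask over the patch grid (1 = masked / invisible).
    mask = (torch.rand(B, gh, gw, device=images.device) < mask_ratio).float()
    pixel_mask = mask.repeat_interleave(patch_size, 1).repeat_interleave(patch_size, 2)
    pixel_mask = pixel_mask.unsqueeze(1)                    # (B, 1, H, W)

    # Zeroing masked pixels stands in for the learnable mask token used in practice.
    masked_images = images * (1.0 - pixel_mask)
    features = encoder(masked_images)                       # (B, D, H/32, W/32) assumed
    pred = F.pixel_shuffle(head(features), patch_size)      # back to (B, 3, H, W)
    loss = (F.l1_loss(pred, images, reduction="none") * pixel_mask).sum()
    return loss / (pixel_mask.sum() * C + 1e-8)

# Placeholder encoder with overall stride 32 (stands in for ViT/Swin/ResNet),
# and a one-layer prediction head that regresses a 32x32x3 patch per position.
encoder = nn.Sequential(nn.Conv2d(3, 128, kernel_size=32, stride=32))
head = nn.Conv2d(128, 3 * 32 * 32, kernel_size=1)
loss = simmim_loss(encoder, head, torch.randn(2, 3, 224, 224))
loss.backward()
```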
SimMIM: Masking Strategy
• Simple random masking works well
• Large patch size/High mask ratio matters
• Visual signals are redundant spatially and
exhibit strong locality
• A metric to evaluate different strategies?
[Figure: mask visualizations: raw image, Square(32), Block-wise(16), Random(8), Random(16), Random(32).]

SimMIM: Averaged Distance
AvgDist* is a more explicit metric for MIM than patch size/mask ratio
*Average of the minimum distance from the invisible area to the visible area.

[Figure: example masks; the red line indicates the minimum distance from an invisible patch to the visible area.]
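A small sketch of how AvgDist can be computed on the patch grid (Euclidean distance here; the paper's exact distance convention may differ):

```python
import torch

def avg_dist(mask):
    """mask: (H, W) boolean patch grid, True = masked (invisible).
    Average, over masked patches, of the distance on the grid to the
    nearest visible patch. Assumes both sets are non-empty."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(mask.shape[0]), torch.arange(mask.shape[1]),
        indexing="ij"), dim=-1).float()              # (H, W, 2) patch coordinates
    masked_xy = coords[mask]                         # (M, 2)
    visible_xy = coords[~mask]                       # (V, 2)
    dists = torch.cdist(masked_xy, visible_xy)       # (M, V) pairwise distances
    return dists.min(dim=1).values.mean()            # nearest visible patch, then average

mask = torch.rand(14, 14) < 0.6                      # e.g., random masking at a 0.6 ratio
print(avg_dist(mask))
```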
SimMIM: Encoder
• ViT, Swin and even ResNet could all benefit from SimMIM

➢ ViT-Base: +2.0 improvement over the supervised baseline; +0.6 over the previous SOTA
➢ ResNet-50 ×4: +0.9 improvement over the supervised baseline
SimMIM: System-level Comparison
• ViT, Swin and ResNet could all benefit from SimMIM
Swin-Base/Large/Huge on ImageNet-1K classification:
➢ Pre-trained only on ImageNet-1K
➢ +0.7 / +1.9 / +2.4 improvements compared to the supervised counterparts
➢ Avoids the severe overfitting issue of supervised training at this scale
[Chart: ImageNet top-1 accuracy vs. model size (M), supervised vs. SimMIM.]

MIM enables Swin-Giant (3B):
➢ Pre-trained on only 70M in-house images, 40× less data than previous practice
➢ 90.2% top-1 with SwinV2-G (70M images) vs. 90.45% with ViT-G (JFT-3B dataset)
[Chart: ImageNet top-1 accuracy vs. pre-training dataset size (M).]
SimMIM: Visualizations

[Figure: raw image, random mask, a mask not fully covering the object, and a mask fully covering the object.]
Three Questions
• How to make MIM pre-training as simple as possible?

• Could MIM benefit from larger-scale data?

• Why does MIM pretraining work well, and could MIM inspire other
pretraining approaches?
Could MIM benefit from large-scale data?

Scaling Laws for Neural Language Models (OpenAI)


➢ Scaling law in terms of Compute C, Dataset Size D, Parameters N
➢ Performance has a power-law relationship with each individual factor
when not bottlenecked by the other two
Could MIM benefit from large-scale data?
[Plots: pre-training loss in the overfitting vs. non-overfitting regimes, against model size × training iterations (relative compute), millions of images, and millions of parameters. Fitted power laws: L = (C / 3.8×10^19)^(-0.0176) and L = (N / 1.57×10^17)^(-0.0181).]

• Performance has a power-law relationship with Relative Compute C and Parameters N


• But not with Dataset Size D
• With a fixed model size and fixed computational cost, performance does not increase with the dataset size

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu. On Data Scaling in Masked Image Modeling. Tech report 2022
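To make the fitted power laws above concrete, a tiny sketch that evaluates them; the coefficients are read off the plot, and the input units are treated loosely here:

```python
# Evaluate the fitted power laws from the plot (coefficients taken from the slide).
def loss_from_compute(C):        # C: relative compute (model size x iterations)
    return (C / 3.8e19) ** -0.0176

def loss_from_params(N):         # N: number of parameters
    return (N / 1.57e17) ** -0.0181

# Doubling compute or parameters lowers the predicted loss by only ~1.2%,
# reflecting the small exponents (2 ** -0.0176 ~= 0.988).
print(loss_from_compute(1e9), loss_from_compute(2e9))
print(loss_from_params(3e9), loss_from_params(6e9))
```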
Three Questions
• How to make MIM pre-training as simple as possible?

• Could MIM benefit from larger-scale data?

• Why does MIM pretraining work well, and could MIM inspire other
pretraining approaches?
A simple approach that generally improves various pre-training approaches

• ViT-B performance
Method IN-1K ADE20K
Image classification (DeiT) 81.8 -> 83.0 (+1.2) 47.0 -> 48.0 (+1.0)
Instance contrastive learning (DINO) 82.8 -> 83.8 (+1.0) 46.2 -> 47.7 (+1.5)
CLIP (400M data) 82.9 -> 84.9 (+2.0) 49.5 -> 52.8 (+3.3)

Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo. Contrastive Learning Rivals Masked Image
Modeling in Fine-tuning via Feature Distillation. Tech report 2022
On Larger Models

Method IN-1K ADE20K


CLIP (ViT-L) 86.1 -> 87.7 (224) -> 89.0 (336) 53.5 -> 55.7 (+2.2)
Swin V2-G (3B) 89.2 -> 89.4 (256) 59.9 -> 61.4 (+1.5)
Takeaways
• Architecture convergence
• NLP: Transformers
• CV: ViT / Swin Transformer V1&V2

• Pre-training convergence
• NLP: BERT / GPT
• CV: BEiT / MAE / SimMIM
• Understanding of MIM

Still a long way to go!
