Towards Unified Neural Architectures and Pre-training Methods for AI
Han Hu (胡瀚)
Microsoft Research Asia
@DataFun
December 17th, 2022
Architecture Convergence (2020-)
• Architecture convergence of CV and NLP
[Figure: stacks of Transformer blocks (16×, 8×, 4×) aligned with brain regions for olfactory, auditory, and language processing]
Source: Cognitive Neuroscience – The Biology of the Mind
Human brain: Universal pre-training
• Unified learning approach: compare the difference between the prediction and the ground-truth input
• The thalamus plays a key role
[Chart: model size (parameters, millions) by year, 2018–2022 — NLP models from ELMo (94M) to Megatron-Turing (530B); CV models from ViT (632M) to SwinV2 (3B)]
Towards Converged AI Architectures
Mainstream models for different AI problems:
[Timeline: CNN (1980, Kunihiko Fukushima; Yann LeCun) → deep LSTM (2013) → RNN + attention (2015) → Transformer (2017.6, Google Brain) → ViT / Swin Transformer (V1 2021, V2 2022); Transformer-based models now dominate]
Can NLP and CV use the same architectures?
• Adapting Transformers or attention for CV
[Timeline: attention models for vision — NLNet (FAIR, 2017.11) and RelationNet (MSRA) → Vision Transformers → Swin Transformer (MSRA, 2021.03) and many others]
Vision Transformer (ViT)
• New record on ImageNet-1K classification
• Key merits over CNN: dynamic (input-dependent) weighting and long-range relationship modeling
[Chart: ImageNet-1K classification — ViT sets a new record, +0.1% over the prior best]
Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Swin Transformer
• New records on dense visual tasks
[Charts: ADE20K semantic segmentation +3.3% and COCO object detection +2.7%, both with Swin]
Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV2021
Swin Transformer
✓ Swin Transformer = Transformer + visual priors
✓ Shifted Non-overlapping Windows
✓ Training suites for vision Transformers and quick open-sourcing
Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV2021
Transformer + visual priors
Visual priors: hierarchy (scale invariance), locality (spatial smoothness), and translation invariance.
[Figure: sliding-window attention (CNN/LR-Net) vs. shifted non-overlapping windows (Swin, ICCV 2021) — a query q in layer L and the shifted window it attends within in layer L+1]
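To make the shifted-window idea concrete, here is a minimal sketch (PyTorch; the tensor layout and helper names are illustrative, not the official Swin implementation) of non-overlapping window partitioning and the cyclic shift applied between consecutive layers:

```python
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows,
    returning (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shifted_window_partition(x, window_size, shift):
    """Cyclically shift the feature map before partitioning, so that layer L+1
    attends across the window boundaries of layer L (Swin's shifted windows)."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(x, window_size)

# Toy usage: 56x56 feature map, 7x7 windows, shift of window_size // 2
feat = torch.randn(2, 56, 56, 96)
win_l  = window_partition(feat, window_size=7)         # layer L
win_l1 = shifted_window_partition(feat, 7, shift=3)    # layer L+1
print(win_l.shape, win_l1.shape)                       # both (128, 49, 96)
```

Shifting by half the window size lets layer L+1 mix information across the window boundaries of layer L while every window keeps the same size, so the same batched attention computation can be reused.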
Vision Transformer recipes
A general design principle for applying Transformers: Transformer + domain-specific priors (defined by graphs).
[Figure: attention over an input graph, with a node 𝑣_i labeled "player"]
Source: Towards Universal Learning Machine: Attention modeling, VALSE Webinar (2019.7)
What’s Next?
Swin Transformer V2
Scaling up model capacity has been the primary driving force behind the remarkable progress of NLP.
Computer vision models are relatively small.
[Chart: parameters (millions) by year, 2018–2022. NLP models: ELMo (94M), BERT-L (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17B), GPT-3 (175B), Megatron-Turing (530B). CV models: ViT (632M), BiT (940M), ViT-G (1.8B), SwinV2 (3B). Human brain: ~200T. The largest CV models are roughly 300× smaller than the largest NLP model and 100,000× smaller than the human brain]
Swin Transformer V2
[Figure: example recognition outputs — car, horse, "riding a bike"]
Swin V2 compared with ViT-G:
• 25% larger
• 40× fewer labelled images
• 10× lower training cost
• applicable to richer tasks
[Diagram: Swin V1 block (pre-norm: Layer Norm → Attention, Layer Norm → MLP) vs. Swin V2 block (res-post-norm: Cosine Attention → Layer Norm, MLP → Layer Norm); the annotated numbers are activation magnitudes between 𝑥^{l−1} and 𝑥^l]
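A hedged sketch of the two Swin V2 block changes in the diagram — scaled cosine attention and residual post-normalization (PyTorch; single head, no relative position bias, simplified relative to the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineAttention(nn.Module):
    """Single-head scaled cosine attention: similarities are cosine(q, k) / tau
    with a learnable temperature, keeping attention logits in a bounded range."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.log_tau = nn.Parameter(torch.zeros(1))  # learnable temperature

    def forward(self, x):                            # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / self.log_tau.exp().clamp(min=0.01)
        return self.proj(attn.softmax(dim=-1) @ v)

class SwinV2Block(nn.Module):
    """Res-post-norm: LayerNorm is applied to each sub-layer's output before the
    residual addition, instead of to its input (pre-norm) as in Swin V1."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.attn, self.norm1 = CosineAttention(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.norm1(self.attn(x))   # post-norm on the attention branch
        x = x + self.norm2(self.mlp(x))    # post-norm on the MLP branch
        return x
```

Both changes target training stability at large scale: cosine similarities and post-normalization prevent activation magnitudes from exploding in deep, wide models.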
[Chart: parameters (millions) by year, 2018–2022, as before — with SwinV2 (3B), the largest CV model is now ~180× smaller than the largest NLP model (Megatron-Turing, 530B) and ~70,000× smaller than the human brain]
Tutel: an ML system to support sparse MoE models at scale
• Single layer: 4.96× speedup on 16 GPUs, 5.75× speedup on 2,048 GPUs
• End-to-end models:
  • Collaborated with Meta to deliver a ~1.4× speedup for Meta's 1.1T-parameter model
  • A Swin-MoE baseline
[Diagram: supervised pre-training on ImageNet, or masked image modeling (BEiT/MAE/SimMIM) → pre-trained models → transfer / fine-tuning on downstream tasks:]
• Fine-grained classification
• Object detection
• Semantic segmentation
Visual pre-training
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
Three Questions
• How to make MIM pre-training as simple as possible?
• Could MIM benefit from large-scale data?
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
SimMIM: A Simple Framework for MIM
[Diagram: input image with random mask → encoder → lightweight prediction head → ℓ1 loss on the raw pixels of masked patches]
➢ Masking strategy: Random masking with relatively large patch size (e.g., 32x32)
➢ Prediction head: An extremely lightweight prediction head (e.g., a linear layer)
➢ Prediction target: A simple raw pixel regression task
➢ Encoder architecture: ViT, Swin and ResNet could all benefit from SimMIM
➢ Sparse decoder (MAE) or dense decoder (SimMIM)?
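A minimal sketch of the SimMIM objective described by the bullets above (PyTorch; the encoder, patch geometry, and class names are placeholders rather than the released implementation): masked patches are replaced by a learnable mask token, a linear head regresses raw pixels, and the ℓ1 loss is averaged only over masked patches.

```python
import torch
import torch.nn as nn

class SimMIMHead(nn.Module):
    """Masked image modeling with a linear prediction head and an l1 pixel loss."""
    def __init__(self, encoder, encoder_dim, patch_size=32, in_chans=3):
        super().__init__()
        self.encoder = encoder                          # e.g. a ViT/Swin backbone (placeholder)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, encoder_dim))
        self.head = nn.Linear(encoder_dim, patch_size * patch_size * in_chans)

    def forward(self, patches, pixel_targets, mask):
        # patches: (B, N, D) patch embeddings; pixel_targets: (B, N, P*P*3)
        # mask: (B, N) with 1 for masked patches
        x = torch.where(mask.unsqueeze(-1).bool(),
                        self.mask_token.expand_as(patches), patches)
        x = self.encoder(x)                             # contextualize visible + mask tokens
        pred = self.head(x)                             # raw-pixel regression
        loss = (pred - pixel_targets).abs()             # l1 loss
        loss = (loss.mean(dim=-1) * mask).sum() / mask.sum().clamp(min=1)
        return loss
```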
SimMIM: Masking Strategy
• Simple random masking works well
• Large patch size/High mask ratio matters
• Visual signals are redundant spatially and
exhibit strong locality
• A metric to evaluate different strategies?
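For illustration, a sketch of the random large-patch masking described above (the 192×192 input size, 32×32 mask patch size, and 0.6 mask ratio are assumed example values, not necessarily the exact defaults):

```python
import torch

def random_patch_mask(batch_size, img_size=192, mask_patch_size=32, mask_ratio=0.6):
    """Randomly mask a fixed ratio of large (e.g. 32x32) patches per image.
    Returns a (B, num_patches) 0/1 mask, where 1 means masked."""
    num_patches = (img_size // mask_patch_size) ** 2   # e.g. 6 * 6 = 36
    num_masked = int(mask_ratio * num_patches)
    scores = torch.rand(batch_size, num_patches)
    idx = scores.argsort(dim=1)[:, :num_masked]        # lowest scores get masked
    mask = torch.zeros(batch_size, num_patches)
    mask.scatter_(1, idx, 1.0)
    return mask

mask = random_patch_mask(4)
print(mask.shape, mask.sum(dim=1))   # torch.Size([4, 36]), 21 masked patches per image
```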
ViT-Base:
➢ +2.0 over the supervised baseline
➢ +0.6 over the previous SOTA
ResNet-50×4:
➢ +0.9 over the supervised baseline
SimMIM: System-level Comparison
• ViT, Swin and ResNet could all benefit from SimMIM
[Chart: ImageNet top-1 accuracy vs. pre-training dataset size (M) — MIM enables Swin-Giant (3B): 90.2 with SwinV2-G using only 70M images, vs. 90.45 with ViT-G on the JFT-3B dataset]
➢ Pre-trained only on 70M in-house data
➢ 40× less data than previous practice
SimMIM: Visualizations
[Figure panels: raw image | random mask | mask not fully covering the object | mask fully covering the object]
Three Questions
• How to make MIM pre-training as simple as possible?
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
Could MIM benefit from large-scale data?
Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu. On Data Scaling in Masked Image Modeling. Tech report 2022
Three Questions
• How to make MIM pre-training as simple as possible?
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
A simple approach (feature distillation) that generally improves various pre-training approaches
• ViT-B performance
Method | IN-1K top-1 | ADE20K mIoU
Image classification (DeiT) | 81.8 → 83.0 (+1.2) | 47.0 → 48.0 (+1.0)
Instance contrastive learning (DINO) | 82.8 → 83.8 (+1.0) | 46.2 → 47.7 (+1.5)
CLIP (400M data) | 82.9 → 84.9 (+2.0) | 49.5 → 52.8 (+3.3)
Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo. Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. Tech report 2022.
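A hedged sketch of the feature-distillation idea from the cited report (PyTorch; the whitening, projection, and loader are simplified assumptions, not the authors' exact recipe): a frozen copy of the pre-trained model serves as the teacher, and a student of the same architecture is trained to regress its normalized features before fine-tuning.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student, teacher, images, proj):
    """Distill a frozen pre-trained teacher (e.g. a DINO- or CLIP-trained ViT)
    into a same-architecture student: teacher features are whitened with a
    parameter-free LayerNorm and regressed with a smooth-l1 loss."""
    with torch.no_grad():
        t_feat = teacher(images)                           # (B, N, D) token features
        t_feat = F.layer_norm(t_feat, t_feat.shape[-1:])   # whitened regression target
    s_feat = proj(student(images))                         # student + linear projection
    return F.smooth_l1_loss(s_feat, t_feat)

# Usage sketch (the backbone loader below is hypothetical; any ViT-B returning
# (B, N, D) token features would work):
# teacher = load_pretrained_vit_b()          # hypothetical loader, weights frozen
# student = copy.deepcopy(teacher)           # same architecture, trainable
# proj = nn.Linear(768, 768)
# loss = feature_distillation_loss(student, teacher, images, proj)
```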
On Larger Models
• Pre-training convergence
• NLP: BERT / GPT
• CV: BEiT / MAE / SimMIM
• Understanding of MIM