Towards Unified Neural Architectures and Pre-training Methods for AI
Han Hu (胡瀚)
Microsoft Research Asia
@DataFun
December 17th, 2022
Architecture Convergence (2020-)
• Architecture convergence of CV and NLP
[Figure: stacks of Transformer blocks (16×, 8×, 4×) aligned with brain regions for olfactory, auditory, and language processing]
Source: Cognitive Neuroscience – The Biology of the Mind
Human brain: Universal pre-training
• Unified learning approach: compare the difference between the prediction and the ground-truth input
• The thalamus plays a key role
[Chart: model size (parameters, millions) by year, 2018–2022 — NLP models from ELMo (94M) to Megatron-Turing (530B); CV models from ViT (632M) to SwinV2 (3B)]
Towards Converged AI Architectures
Mainstream models for different AI problems:
[Timeline: CNN (1980, Kunihiko Fukushima; Yann LeCun) → deep LSTM (2013) → RNN + attention (2015) → Transformer (2017.6, Google Brain) → ViT / Swin Transformer (V1 2021, V2 2022); Transformer-based models now dominate]
Can NLP and CV use the same architectures?
• Adapting Transformers or attention for CV
[Timeline: attention models for vision — NLNet (FAIR, 2017.11) and RelationNet (MSRA) → Vision Transformers → Swin Transformer (MSRA, 2021.03) and many others]
Vision Transformer (ViT)
• New record on ImageNet-1K classification
• Key merits over CNN: dynamic (input-dependent) weighting and long-range relationship modeling
[Chart: ImageNet-1K classification — ViT sets a new record, +0.1% over the prior best]
Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Swin Transformer
• New records on dense visual tasks
[Charts: ADE20K semantic segmentation +3.3% and COCO object detection +2.7%, both with Swin]
Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV2021
Swin Transformer
✓ Swin Transformer = Transformer + visual priors
✓ Shifted Non-overlapping Windows
✓ Training suites for vision Transformers and quick open-sourcing
Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV2021
Transformer + visual priors
Visual priors: hierarchy (scale invariance), locality (spatial smoothness), and translation invariance.
[Figure: sliding-window attention (CNN/LR-Net) vs. shifted non-overlapping windows (Swin, ICCV 2021) — a query q in layer L and the shifted window it attends within in layer L+1]
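To make the shifted-window idea concrete, here is a minimal sketch (PyTorch; the tensor layout and helper names are illustrative, not the official Swin implementation) of non-overlapping window partitioning and the cyclic shift applied between consecutive layers:

```python
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows,
    returning (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def shifted_window_partition(x, window_size, shift):
    """Cyclically shift the feature map before partitioning, so that layer L+1
    attends across the window boundaries of layer L (Swin's shifted windows)."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    return window_partition(x, window_size)

# Toy usage: 56x56 feature map, 7x7 windows, shift of window_size // 2
feat = torch.randn(2, 56, 56, 96)
win_l  = window_partition(feat, window_size=7)         # layer L
win_l1 = shifted_window_partition(feat, 7, shift=3)    # layer L+1
print(win_l.shape, win_l1.shape)                       # both (128, 49, 96)
```

Shifting by half the window size lets layer L+1 mix information across the window boundaries of layer L while every window keeps the same size, so the same batched attention computation can be reused.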
Vision Transformer recipes
A general design principle for applying Transformers: Transformer + domain-specific priors (defined by graphs).
[Figure: attention over an input graph, with a node 𝑣_i labeled "player"]
Source: Towards Universal Learning Machine: Attention modeling, VALSE Webinar (2019.7)
What’s Next?
Swin Transformer V2
Scaling up model capacity has been the primary driving force behind the remarkable progress of NLP.
Computer vision models are relatively small.
[Chart: parameters (millions) by year, 2018–2022. NLP models: ELMo (94M), BERT-L (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17B), GPT-3 (175B), Megatron-Turing (530B). CV models: ViT (632M), BiT (940M), ViT-G (1.8B), SwinV2 (3B). Human brain: ~200T. The largest CV models are roughly 300× smaller than the largest NLP model and 100,000× smaller than the human brain]
Swin Transformer V2
[Figure: example recognition outputs — car, horse, "riding a bike"]
Swin V2 compared with ViT-G:
• 25% larger
• 40× fewer labelled images
• 10× lower training cost
• applicable to richer tasks
[Diagram: Swin V1 block (pre-norm: Layer Norm → Attention, Layer Norm → MLP) vs. Swin V2 block (res-post-norm: Cosine Attention → Layer Norm, MLP → Layer Norm); the annotated numbers are activation magnitudes between 𝑥^{l−1} and 𝑥^l]
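A hedged sketch of the two Swin V2 block changes in the diagram — scaled cosine attention and residual post-normalization (PyTorch; single head, no relative position bias, simplified relative to the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineAttention(nn.Module):
    """Single-head scaled cosine attention: similarities are cosine(q, k) / tau
    with a learnable temperature, keeping attention logits in a bounded range."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.log_tau = nn.Parameter(torch.zeros(1))  # learnable temperature

    def forward(self, x):                            # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / self.log_tau.exp().clamp(min=0.01)
        return self.proj(attn.softmax(dim=-1) @ v)

class SwinV2Block(nn.Module):
    """Res-post-norm: LayerNorm is applied to each sub-layer's output before the
    residual addition, instead of to its input (pre-norm) as in Swin V1."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.attn, self.norm1 = CosineAttention(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.norm1(self.attn(x))   # post-norm on the attention branch
        x = x + self.norm2(self.mlp(x))    # post-norm on the MLP branch
        return x
```

Both changes target training stability at large scale: cosine similarities and post-normalization prevent activation magnitudes from exploding in deep, wide models.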
[Chart: parameters (millions) by year, 2018–2022, as before — with SwinV2 (3B), the largest CV model is now ~180× smaller than the largest NLP model (Megatron-Turing, 530B) and ~70,000× smaller than the human brain]
Tutel: an ML system to support sparse MoE models at scale
• Single layer: 4.96× speedup on 16 GPUs, 5.75× speedup on 2,048 GPUs
• End-to-end models:
  • Collaborated with Meta to deliver a ~1.4× speedup for Meta's 1.1T-parameter model
  • A Swin-MoE baseline
[Diagram: supervised pre-training on ImageNet, or masked image modeling (BEiT/MAE/SimMIM) → pre-trained models → transfer / fine-tuning on downstream tasks:]
• Fine-grained classification
• Object detection
• Semantic segmentation
Visual pre-training
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
Three Questions
• How to make MIM pre-training as simple as possible?
• Could MIM benefit from large-scale data?
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
SimMIM: A Simple Framework for MIM
[Diagram: input image with random mask → encoder → lightweight prediction head → ℓ1 loss on the raw pixels of masked patches]
➢ Masking strategy: Random masking with relatively large patch size (e.g., 32x32)
➢ Prediction head: An extremely lightweight prediction head (e.g., a linear layer)
➢ Prediction target: A simple raw pixel regression task
➢ Encoder architecture: ViT, Swin and ResNet could all benefit from SimMIM
➢ Sparse decoder (MAE) or dense decoder (SimMIM)?
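A minimal sketch of the SimMIM objective described by the bullets above (PyTorch; the encoder, patch geometry, and class names are placeholders rather than the released implementation): masked patches are replaced by a learnable mask token, a linear head regresses raw pixels, and the ℓ1 loss is averaged only over masked patches.

```python
import torch
import torch.nn as nn

class SimMIMHead(nn.Module):
    """Masked image modeling with a linear prediction head and an l1 pixel loss."""
    def __init__(self, encoder, encoder_dim, patch_size=32, in_chans=3):
        super().__init__()
        self.encoder = encoder                          # e.g. a ViT/Swin backbone (placeholder)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, encoder_dim))
        self.head = nn.Linear(encoder_dim, patch_size * patch_size * in_chans)

    def forward(self, patches, pixel_targets, mask):
        # patches: (B, N, D) patch embeddings; pixel_targets: (B, N, P*P*3)
        # mask: (B, N) with 1 for masked patches
        x = torch.where(mask.unsqueeze(-1).bool(),
                        self.mask_token.expand_as(patches), patches)
        x = self.encoder(x)                             # contextualize visible + mask tokens
        pred = self.head(x)                             # raw-pixel regression
        loss = (pred - pixel_targets).abs()             # l1 loss
        loss = (loss.mean(dim=-1) * mask).sum() / mask.sum().clamp(min=1)
        return loss
```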
SimMIM: Masking Strategy
• Simple random masking works well
• Large patch size/High mask ratio matters
• Visual signals are redundant spatially and
exhibit strong locality
• A metric to evaluate different strategies?
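For illustration, a sketch of the random large-patch masking described above (the 192×192 input size, 32×32 mask patch size, and 0.6 mask ratio are assumed example values, not necessarily the exact defaults):

```python
import torch

def random_patch_mask(batch_size, img_size=192, mask_patch_size=32, mask_ratio=0.6):
    """Randomly mask a fixed ratio of large (e.g. 32x32) patches per image.
    Returns a (B, num_patches) 0/1 mask, where 1 means masked."""
    num_patches = (img_size // mask_patch_size) ** 2   # e.g. 6 * 6 = 36
    num_masked = int(mask_ratio * num_patches)
    scores = torch.rand(batch_size, num_patches)
    idx = scores.argsort(dim=1)[:, :num_masked]        # lowest scores get masked
    mask = torch.zeros(batch_size, num_patches)
    mask.scatter_(1, idx, 1.0)
    return mask

mask = random_patch_mask(4)
print(mask.shape, mask.sum(dim=1))   # torch.Size([4, 36]), 21 masked patches per image
```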
ViT-Base:
➢ +2.0 over the supervised baseline
➢ +0.6 over the previous SOTA
ResNet-50×4:
➢ +0.9 over the supervised baseline
SimMIM: System-level Comparison
• ViT, Swin and ResNet could all benefit from SimMIM
[Chart: ImageNet top-1 accuracy vs. pre-training dataset size (M) — MIM enables Swin-Giant (3B): 90.2 with SwinV2-G using only 70M images, vs. 90.45 with ViT-G on the JFT-3B dataset]
➢ Pre-trained only on 70M in-house data
➢ 40× less data than previous practice
SimMIM: Visualizations
[Figure panels: raw image | random mask | mask not fully covering the object | mask fully covering the object]
Three Questions
• How to make MIM pre-training as simple as possible?
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
Could MIM benefit from large-scale data?
Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, Han Hu. On Data Scaling in Masked Image Modeling. Tech report 2022
Three Questions
• How to make MIM pre-training as simple as possible?
• Why does MIM pre-training work well, and could MIM inspire other pre-training approaches?
A simple approach (feature distillation) that generally improves various pre-training approaches
• ViT-B performance
Method | IN-1K top-1 | ADE20K mIoU
Image classification (DeiT) | 81.8 → 83.0 (+1.2) | 47.0 → 48.0 (+1.0)
Instance contrastive learning (DINO) | 82.8 → 83.8 (+1.0) | 46.2 → 47.7 (+1.5)
CLIP (400M data) | 82.9 → 84.9 (+2.0) | 49.5 → 52.8 (+3.3)
Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo. Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. Tech report 2022.
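A hedged sketch of the feature-distillation idea from the cited report (PyTorch; the whitening, projection, and loader are simplified assumptions, not the authors' exact recipe): a frozen copy of the pre-trained model serves as the teacher, and a student of the same architecture is trained to regress its normalized features before fine-tuning.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student, teacher, images, proj):
    """Distill a frozen pre-trained teacher (e.g. a DINO- or CLIP-trained ViT)
    into a same-architecture student: teacher features are whitened with a
    parameter-free LayerNorm and regressed with a smooth-l1 loss."""
    with torch.no_grad():
        t_feat = teacher(images)                           # (B, N, D) token features
        t_feat = F.layer_norm(t_feat, t_feat.shape[-1:])   # whitened regression target
    s_feat = proj(student(images))                         # student + linear projection
    return F.smooth_l1_loss(s_feat, t_feat)

# Usage sketch (the backbone loader below is hypothetical; any ViT-B returning
# (B, N, D) token features would work):
# teacher = load_pretrained_vit_b()          # hypothetical loader, weights frozen
# student = copy.deepcopy(teacher)           # same architecture, trainable
# proj = nn.Linear(768, 768)
# loss = feature_distillation_loss(student, teacher, images, proj)
```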
On Larger Models
• Pre-training convergence
• NLP: BERT / GPT
• CV: BEiT / MAE / SimMIM
• Understanding of MIM