Guest Lecture - Transfer Learning


Fall 2020 Applied Machine Learning

Guest Lecture

Transfer Learning
Jin Sun

Oct. 2020 1
Today - Transfer Learning
● Overview
● Fine-tuning
● Domain Transfer / Adaptation
● Multi-Task Learning
● Lifelong Learning

2
Overview
Standard Machine Learning Assumption

Train
Data Task

Test

3
Fine-tuning

Train
Data 1 Task 1

Train (Fine-tune)

Data 2 Task 2

Test

4
Domain Adaptation

Data 1 Train

Task

Data 2 Test

5
Multi-task Learning

Task 1
Train / Test

Data

Train / Test Task 2

6
Lifelong Learning

Train Train
Data Task 1 Data Task 2

Test Test

Time t Time t+1

7
Why Transfer Learning Is Important
Neural networks are generally hard to train

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings
of the IEEE conference on computer vision and pattern recognition. 2016.

8
Why Transfer Learning Is Important
Not enough data

http://seansoleyman.com/effect-of-dataset-size-on-image-classification-accuracy/

9
Why Transfer Learning Is Important
Multiple tasks serve as implicit regularization:

The other task’s loss function acts as a regularization term for the current task.

The model is forced not to overfit too much to any one task.

Otherwise it would fail on the remaining tasks.

[Assumption] The tasks share some network structure.


Task 1

Net

Task 2
10
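As a rough sketch of this intuition (the notation below is illustrative, not from the slides): with shared parameters θ_shared and task-specific heads θ_1, θ_2, the joint objective

L_total(θ_shared, θ_1, θ_2) = L_task1(θ_shared, θ_1) + λ · L_task2(θ_shared, θ_2)

forces θ_shared to work for both losses, so each task’s loss effectively regularizes the other; λ trades off the two tasks.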
Can a non-pretrained network be useful?

11
Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. "Deep image prior." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
Today - Transfer Learning
● Overview
● Fine-tuning
● Domain Transfer / Adaptation
● Multi-Task Learning
● Lifelong Learning

12
Fine-tuning

Train
Data 1 Task 1

Pretrained

Train (Fine-tune)

Data 2 Task 2

Test
Final model

13
Fine-tuning Steps
1. Get a good, recent pre-trained neural network

ResNet-18/50/101, Inception, VGG

2. Determine how many layers to keep

The most common choice is to keep everything up to the second-to-last layer, but you
might need something different!

3. Add new task-dependent layers

You might have to change the type of the output layer or the number of outputs.

4. Re-train the network using the new data and hyperparameters (a minimal training
sketch follows the PyTorch example on the next slide)


14
Example: Fine-tuning in Pytorch
https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
import torch.nn as nn
from torchvision import models

model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 2)  # replace the final layer with a new 2-class head

15
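To carry out step 4 (re-training), a minimal sketch could look like the following. The data loader, loss choice, and hyperparameters here are placeholders you would replace with your own, not part of the tutorial:

import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_ft.parameters(), lr=1e-3, momentum=0.9)  # small lr for fine-tuning

model_ft.train()
for inputs, labels in train_loader:      # train_loader: your new-task DataLoader (placeholder)
    optimizer.zero_grad()
    outputs = model_ft(inputs)           # forward pass through the modified ResNet
    loss = criterion(outputs, labels)
    loss.backward()                      # gradients flow into all unfrozen layers
    optimizer.step()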
Understand Model Structure in PyTorch
Each model is an instance of a class.

Layers can be accessed by their names:


model.conv1, model.layer1, model.fc…

16
Understand Model Structure in PyTorch
It is very flexible: you can partially ‘borrow’ layers from a pre-trained model and
build a new, more complex one.

import torch.nn as nn
from torchvision import models

class MyNet(nn.Module):
    def __init__(self, num_in_channels):
        super().__init__()
        resnet = models.resnet18(pretrained=True)

        # You can change the input layer too, if you are not using images as input
        self.firstconv = nn.Conv2d(num_in_channels, 64, kernel_size=(7, 7), stride=(2, 2),
                                   padding=(3, 3), bias=False)

        self.encoder1 = resnet.layer1                             # borrowed pre-trained block (outputs 64 channels)
        self.myencoder = nn.Conv2d(64, 128, kernel_size=(3, 3))   # new layer of your own

    def forward(self, x):
        x = self.firstconv(x)
        x = self.encoder1(x)
        x = self.myencoder(x)
        return x
17
Understand Model Structure in PyTorch
Or you can unpack all modules into a list and build another sequential model for
simpler access.

resnet = models.resnet18(pretrained=True)
model = nn.Sequential(*(list(resnet.children())[:K])) # use K to control how many layers to keep

Then you can use the new model like any other sequential model.
For example, the model can now be indexed:

model[0], model[1], ...

18
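For instance, the truncated model can be used directly as a feature extractor. A minimal sketch (the value of K and the input size are illustrative):

import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(pretrained=True)
K = 6                                                    # keep conv1 .. layer2 (illustrative choice)
feature_extractor = nn.Sequential(*list(resnet.children())[:K])
feature_extractor.eval()

x = torch.randn(1, 3, 224, 224)                          # a dummy image batch
with torch.no_grad():
    features = feature_extractor(x)                      # shape [1, 128, 28, 28] for K=6
print(features.shape)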
Which Layers To Keep
Depends on what information is important for the new task

Low level (edges) → Mid level (object parts) → High level (objects)

New categories:
E.g., Beverages

19
Which Layers To Keep
Depends on what information is important for the new task

Low level (edges) → Mid level (object parts) → High level (objects)

Different tasks:
E.g., Dense predictions

20
Which Layers To Keep
Depends on what information is important for the new task

Low level (edges) → Mid level (object parts) → High level (objects)

??

21
Visualization of VGG Learned Filters

https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
22
Freeze or Not-Freeze

Leave the pretrained layers unfrozen (fine-tune everything) only when you have enough new-task data; otherwise, freeze them.

Pretrained Net Task Net

To implement in PyTorch:
for param in model.parameters():
    param.requires_grad = False

23
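A common follow-up (a sketch of standard practice, assuming a ResNet-style model whose new head is model.fc): after freezing the backbone, re-enable gradients only for the new head and pass just the trainable parameters to the optimizer.

import torch.optim as optim

# Freeze the pretrained backbone
for param in model.parameters():
    param.requires_grad = False

# Re-enable gradients for the new task head only (here assumed to be model.fc)
for param in model.fc.parameters():
    param.requires_grad = True

# Optimize only the parameters that still require gradients
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()),
                      lr=1e-3, momentum=0.9)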
Freeze or Not-Freeze (Softer Version)
Different Learning Rates for Different Layers

Pretrained Net Task Net

To implement in PyTorch:
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.task_net.parameters(), 'lr': 1e-4}
], lr=1e-6, momentum=0.9)  # pretrained base gets the small default lr, the new task head a larger one

24
The Limitation of Fine-tuning
A pre-trained model captures the statistics of the dataset that it was trained
on.

In vision: ImageNet.

Take a look at the ImageNet dataset:

If your new task’s data distribution is vastly different from it
(for example, non-visual data, or data from another spectrum),

… a pretrained model acts as a good (?) initialization, at best.
25


Train from scratch if you have infinite resources

Object detection performance on the COCO dataset:

He, Kaiming, Ross Girshick, and Piotr Dollár. "Rethinking ImageNet pre-training." Proceedings of the IEEE International Conference on Computer Vision. 2019.
26
Today - Transfer Learning
● Overview
● Fine-tuning
● Domain Transfer / Adaptation
● Multi-Task Learning
● Lifelong Learning

27
Domain Transfer / Adaptation

Data 1 Train

Distribution Mismatch! Task

Data 2 Test

28
Domain Shift Problem

Slides from “What you saw is not what you get” Domain adaptation for deep learning, Kate Saenko 29
Domain Transfer / Adaptation

Slides from “What you saw is not what you get” Domain adaptation for deep learning, Kate Saenko 30
Domain Transfer / Adaptation

Test on San Francisco


Slides from “What you saw is not what you get” Domain adaptation for deep learning, Kate Saenko 31
Dataset Bias

If you work with computer vision datasets long enough, given an image, you can guess
pretty accurately which dataset it came from!

Torralba, Antonio, and Alexei A. Efros. "Unbiased look at dataset bias." CVPR (2011): 1521-1528. 32
Dataset Bias

Torralba, Antonio, and Alexei A. Efros. "Unbiased look at dataset bias." CVPR (2011): 1521-1528. 33
Domain Shift Not Just in Computer Vision
Email spam detection

Speech recognition

Recommendation system

Many machine learning systems face domain shift issues.

34
Domain Alignment

B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012

Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML 2015
35
Adversarial Domain Alignment

Tzeng, Eric, et al. "Adversarial discriminative domain adaptation." Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2017.
Slides from “What you saw is not what you get” Domain adaptation for deep learning, Kate Saenko 36
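A rough sketch of the adversarial idea (ADDA-style), not the authors' code: a domain discriminator learns to tell source features from target features, while the target encoder is trained to fool it so the two feature distributions align. Here source_encoder, target_encoder, and the two data loaders are placeholders.

import torch
import torch.nn as nn
import torch.optim as optim

feat_dim = 512                                     # feature dimensionality (placeholder)
discriminator = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

opt_d = optim.Adam(discriminator.parameters(), lr=1e-4)
opt_t = optim.Adam(target_encoder.parameters(), lr=1e-4)     # source_encoder stays fixed

for (xs, _), (xt, _) in zip(source_loader, target_loader):   # placeholder DataLoaders
    fs = source_encoder(xs).detach()               # source features (no gradient)
    ft = target_encoder(xt)                        # target features

    # 1) Discriminator step: label source as 1, target as 0
    d_loss = bce(discriminator(fs), torch.ones(fs.size(0), 1)) + \
             bce(discriminator(ft.detach()), torch.zeros(ft.size(0), 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Target-encoder step: try to make target features look like source (label 1)
    g_loss = bce(discriminator(ft), torch.ones(ft.size(0), 1))
    opt_t.zero_grad()
    g_loss.backward()
    opt_t.step()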
Another Perspective of Domain Adaptation

Data 1 (large, fully labeled, low cost) → Train

Task

Data 2 (small, partially / not labeled, high cost) → Fine-tune / Test

37
38
39
UnrealCV
https://unrealcv.org/

40
Today - Transfer Learning
● Overview
● Fine-tuning
● Domain Transfer / Adaptation
● Multi-Task Learning
● Lifelong Learning

41
Multi-Task Learning
Task 1
Train / Test

Data

Train / Test Task 2

Many tasks are related to each other. For example: object detection and
segmentation; depth estimation and surface normal estimation.

42
It makes sense to share some network structures so each task doesn’t need to
learn everything from scratch independently.

https://www.researchgate.net/figure/a-Learning-process-of-multi-task-learning-Several-target-tasks-share-part-of-hidden_fig4_327435148
43
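A minimal sketch of this shared-trunk idea in PyTorch (the backbone choice, head sizes, and class counts are illustrative, not from the slides): one shared encoder feeds two task-specific heads, and the two task losses are simply summed.

import torch
import torch.nn as nn
from torchvision import models

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes_task1=10, num_classes_task2=5):
        super().__init__()
        resnet = models.resnet18(pretrained=True)
        # Shared trunk: everything except the final fc layer
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.head1 = nn.Linear(512, num_classes_task1)   # task 1 head
        self.head2 = nn.Linear(512, num_classes_task2)   # task 2 head

    def forward(self, x):
        feats = self.backbone(x).flatten(1)              # shared features
        return self.head1(feats), self.head2(feats)

model = MultiTaskNet()
criterion = nn.CrossEntropyLoss()
x = torch.randn(4, 3, 224, 224)                          # dummy batch
y1, y2 = torch.randint(0, 10, (4,)), torch.randint(0, 5, (4,))
out1, out2 = model(x)
loss = criterion(out1, y1) + criterion(out2, y2)         # joint loss over both tasks
loss.backward()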
Dynamics of Multi-Task Learning
● Features might be easier to learn from Task B than Task A.
● The model is forced not to overfit to any single task.
● Learning a feature space that also works for another task benefits the original
task on unseen data.
● Implicit data augmentation: we get more information from our data.

44
Cross-stitch Networks

Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
45
Multi-Task Learning Using Uncertainty

Kendall, A., Gal, Y., & Cipolla, R. (2017). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics.

46
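A rough sketch of the weighting scheme from Kendall et al.: each task loss is scaled by a learned uncertainty term, total = Σ exp(-s_i)·L_i + s_i with s_i = log σ_i². This is a simplified version (the exact form in the paper differs slightly between regression and classification losses):

import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    # Learnable task weighting: total = sum_i exp(-s_i) * L_i + s_i
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))   # s_i, one per task

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Usage sketch: weight the per-task losses instead of summing them directly
weighting = UncertaintyWeighting(num_tasks=2)
loss1, loss2 = torch.tensor(0.8), torch.tensor(2.3)            # placeholder per-task losses
total_loss = weighting([loss1, loss2])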
Taskonomy

http://taskonomy.stanford.edu/

CVPR 2018 Best Paper 47


Taskonomy - Tasks

48
Taskonomy - Motivation
Question:

There are many seemingly independent tasks in computer vision. How does
one know whether they are indeed related or not?

Goal:

Learn the dependency relationships between various vision tasks.

Result:

Found transfer learning dependencies across a dictionary of twenty-six
2D, 2.5D, 3D, and semantic tasks.
49
Taskonomy - Task Dependent Models

Each task is trained using an encoder-decoder structure.

50
Taskonomy - Task Transfer Modeling

Features extracted from the encoders of other tasks are used for each task.

Nth order means N other tasks are used.

51
Taskonomy - Affinity Matrix

52
Taskonomy Generation
Given the affinity matrix, build a graph where tasks are nodes and transfers
are edges.

Find a subgraph that maximizes performance at minimum cost (amount of
supervision used) -> Boolean Integer Programming

53
Taskonomy - Main Result

Live version

54
Taskonomy - Summary
Many vision tasks are more correlated with each other than originally believed.

The number of labeled samples needed to solve N tasks can be greatly
reduced if we transfer features from one task to another.

Open Questions:

1. The results are model- and data-specific.
2. What about non-visual tasks (for example, robot manipulation)?
3. The lifelong learning scenario, where models are constantly updated.

55
Taskonomy - Website and API
http://taskonomy.stanford.edu/api/

https://github.com/StanfordVL/taskonomy/tree/master/taskbank

Data is huge: 4.5 million images, 11 TB!

https://github.com/StanfordVL/taskonomy/tree/master/data

56
Today - Transfer Learning
● Overview
● Fine-tuning
● Domain Transfer / Adaptation
● Multi-Task Learning
● Lifelong Learning

57
Lifelong Learning

Train Train
Data Task 1 Data Task 2

Test Test

Time t Time t+1

Both data and model are changing over time.

How to learn new tasks without forgetting about old tasks?


58
iCaRL: Incremental Classifier and Representation Learning
Goal:

Learn from new data with a single model that has only limited access to old training data.

Key Ideas:

Keep an exemplar set of previously seen classes.

Combine losses from training on both new and old data (see the sketch below).

Rebuffi, Sylvestre-Alvise, et al. "iCaRL: Incremental classifier and representation learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
59
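A rough sketch of the "combine new and old" idea (not the authors' code; exemplar selection and other details from the paper are omitted): each batch mixes new-class samples with stored exemplars, and a classification loss on the current labels is combined with a distillation-style loss that keeps the old-class outputs close to the previous model's.

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def incremental_loss(model, old_model, x, y_onehot, num_old_classes):
    # x: batch mixing new-class samples and stored exemplars (selection omitted)
    # y_onehot: float one-hot targets over all classes seen so far
    logits = model(x)
    # Classification term: fit the targets of the current batch
    cls_loss = bce(logits, y_onehot)
    # Distillation term: keep old-class outputs close to the previous model's predictions
    with torch.no_grad():
        old_targets = torch.sigmoid(old_model(x)[:, :num_old_classes])
    dist_loss = bce(logits[:, :num_old_classes], old_targets)
    return cls_loss + dist_loss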
Wu, Zuxuan, et al. "ACE: Adapting to changing environments for semantic segmentation." Proceedings
of the IEEE International Conference on Computer Vision. 2019. 60
Summary - Transfer Learning
● Overview
● Fine-tuning
● Domain Transfer / Adaptation
● Multi-Task Learning
● Lifelong Learning

61
