
Lecture 1: Introduction to Deep Learning

“Deep Learning”, Spring 2021: Lecture 1, Introduction


Overview
This class is about:
• deep learning
• applications in computer vision and graphics
• applications in natural language processing

It will include:
• 12 lectures
• 12 seminars
• 4 homeworks
• 1 project
• 3.5 guest lectures
2006

Computer vision = 60%


0.6^12 ≈ 0.0022
2014

0.989^12 ≈ 0.875
2014



How to tell a cat from a dog?

The cat or domestic cat (Felis catus) is a small carnivorous mammal.[1][2] It is the only domesticated species in the family Felidae.[4] The cat is either a house cat, kept as a pet; or a feral cat, freely ranging and avoiding human contact.[5] A house cat is valued by humans for companionship and for its ability to hunt rodents. About 60 cat breeds are recognized by various cat registries…

The domestic dog (Canis lupus familiaris when considered a subspecies of the wolf or Canis familiaris when considered a distinct species)[4] is a member of the genus Canis (canines), which forms part of the wolf-like canids,[5] and is the most widely abundant terrestrial carnivore.[6][7][8][9][10] The dog and the extant gray wolf are sister taxa[11][12][13] as modern wolves are not closely related to the wolves that were first domesticated,[12][13] which implies that the direct ancestor of the dog is extinct…
How to tell a cat from a dog: ML approach
• Does it have pointed ears?
• Does it have floppy ears?
• Does it have a curvy tail?
• Does it have an elongated muzzle?
• Does it have legs longer than 25 cm?
• …
→ ML classifier

Easier questions, but still very hard (feature engineering needed!)
Can pixels work as features?

Natural feature mapping (raw pixels):
• Highly non-smooth w.r.t. jitter
• Requires lots of training samples

(feature space plot: f(x1) vs f(x2))
Haar features for face detection

Viola-Jones features:
• Smooth w.r.t. jitter
• Fewer training examples needed
• (also fast to compute)

(feature space plot: f(x1) vs f(x2))

[Viola Jones, CVPR’01]


Haar features

(figure: example Haar rectangle features A, B, C, D)

[Viola Jones, CVPR’01]


Viola-Jones detector

Haar feature extractor → linear classifier + thresholding → “face” / “background”

• Non-shallow, learnable representation (AdaBoost greedy algorithm)
• Cascaded detector for speed
• One of the most impactful papers in CV history

[Viola Jones, CVPR’01]


From face detection to pedestrian detection

Frontal faces: good industry-grade performance by Viola-Jones
vs.
Pedestrians: the Viola-Jones detector is not good enough

Improving pedestrian detection

[Dollar et al. BMVC09]


Improved pedestrian detector
Hand-crafted “channel features” → Haar features → classifier (trained using boosting) → “pedestrian” / “background”

[Dollar, Tu, Perona, Belongie. Integral Channel Features. BMVC09]
“Deep learning” is not only depth
Hand-crafted “channel features” → Haar features → classifier (trained using boosting) → “pedestrian” / “background”

• Previous CV systems were already “deep”: they used multiple layers of representation with some success
• But they were not called “deep learning”
Deep learning definition

My caveat: the number of modules >= 3 (logreg is not DL)



Then, what is “deep learning”?
End-to-end joint learning of all layers:
• multiple assemblable blocks (“modules”)
• each block is piecewise-differentiable
• gradient-based learning
from examples
• gradients computed
by backpropagation
Deep learning “revolution”
(2012? – now): rapid
engineering improvements
following these principles
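As a concrete illustration of these principles, here is a minimal PyTorch sketch (my own toy example, not from the lecture): stacked differentiable modules trained jointly, end to end, by backpropagation.

import torch
import torch.nn as nn

# stacked, assemblable differentiable blocks
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))   # toy training examples
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)   # forward pass through all blocks
    loss.backward()               # gradients computed by backpropagation
    opt.step()                    # joint gradient-based update of every block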
(Pre)-history of deep learning

chart from [Kamath et al. 2019]


Perceptron: the first “neural net”
[Rosenblatt 1957]: an “artificial neuron”

loop over examples i:
    y = H(wᵀxᵢ);
    w = w + ½ · xᵢ · (yᵢ − y);
end
• Non-gradient-based training
• Converges to a linear separator of the training data if one exists.
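The same rule as a runnable NumPy sketch (with assumed toy data; H is the Heaviside step, labels are 0/1):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.0, -2.0]) > 0).astype(float)   # linearly separable labels

w = np.zeros(2)
for _ in range(20):                      # passes over the training examples
    for xi, yi in zip(X, y):
        pred = float(w @ xi > 0)         # y = H(w^T x_i)
        w = w + 0.5 * xi * (yi - pred)   # update only on mistakes
print(w)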
Perceptron as a computational graph
“operations, layers, transforms”

* H

“units, neurons, activations, blobs”



Logistic regression

Similar diagram/computational graph:

* σ



Training logistic regression

* σ



Multi-layer perceptron idea

* σ * σ

• First layer: parallel logistic regressions
• Each predicts the presence of some feature in the input
• Second layer is a logistic regression that “weighs” the outputs of the first layer (see the sketch below)
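A minimal NumPy sketch of this idea (shapes and weights made up for illustration):

import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = np.random.randn(5)                        # input features
W1, b1 = np.random.randn(3, 5), np.zeros(3)   # 3 parallel "feature detectors"
w2, b2 = np.random.randn(3), 0.0              # second-layer logistic regression

h = sigmoid(W1 @ x + b1)   # first layer: presence of learned features
y = sigmoid(w2 @ h + b2)   # second layer: weighs the first layer's outputs
print(y)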



Choosing non-linearity
To get a more powerful model we need non-linearity:

* NL * sm

Possible elementwise non-linearities (see the sketch below):
• Heaviside
• Sigmoid (logistic) / tanh
• More recently: ReLU(x) = max(0, x)
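These non-linearities written out as elementwise NumPy functions (a small sketch):

import numpy as np

heaviside = lambda x: (x > 0).astype(float)      # Heaviside step
sigmoid   = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid (logistic)
tanh      = np.tanh
relu      = lambda x: np.maximum(x, 0.0)         # ReLU(x) = max(0, x)

print(relu(np.linspace(-3, 3, 7)))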
Universal approximation theorem

M σ M

UAT: given a non-polynomial non-linearity, a single-hidden-layer neural network can approximate any continuous function on any compact subset of R^d up to arbitrary precision [Cybenko 1989].
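In formula form, the approximating family is a one-hidden-layer network of width N (a standard way of writing the statement, added here for reference):

f(x) ≈ Σᵢ₌₁..N cᵢ · σ(wᵢᵀ x + bᵢ),

i.e. for any continuous f and any ε > 0 there exist a finite N and parameters cᵢ, wᵢ, bᵢ achieving error below ε on the compact set.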

Caveat 1: the width of the network can be a very quickly growing function of the space dimension and the approximation accuracy. Deeper architectures are exponentially narrower for some classes of functions [Rolnick & Tegmark 2018].

Caveat 2: there are no guarantees on extrapolation beyond the compact set where the approximation is computed. Thus, designing a proper parameterization of your space is useful.
Computing deeper derivatives

f3(;w3) z
f1(;w1) f2(;w2) f4(;w4)
x0 x1 x2 x3
z=f4( f3( f2( f1(x; w1); w2); w3); w4)
• We can implement derivatives of the elementary functions
• But how to compute a derivative through the whole composition, e.g. ∂z/∂w₁?



Recap: chain rule
x1 y1
f g
z
x2 y2
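For this diagram, with z = g(y₁, y₂) and yⱼ = fⱼ(x₁, x₂), the multivariate chain rule reads:

∂z/∂xᵢ = Σⱼ (∂z/∂yⱼ) · (∂yⱼ/∂xᵢ)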





Computing deeper derivatives

f3(;w3) z
f1(;w1) f2(;w2) f4(;w4)
x0 x1 x2 x3

z=f4( f3( f2( f1(x; w1); w2); w3); w4)

Now, we are ready to compute all derivatives!



Sequential computation: backpropagation
The derivatives ∂z/∂xᵢ (and hence ∂z/∂wᵢ) can be computed sequentially, from the last module to the first:

f3(;w3) z
f1(;w1) f2(;w2) f4(;w4)
x0 x1 x2 x3
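A toy NumPy sketch of this sequential (reverse-order) computation; the chain is my own example with each module fᵢ(x; wᵢ) = σ(wᵢ · x):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x0, ws):
    acts = [x0]                                # x0, x1, ..., xN
    for w in ws:
        acts.append(sigmoid(w * acts[-1]))     # x_i = f_i(x_{i-1}; w_i)
    return acts

def backward(acts, ws):
    grads_w = [0.0] * len(ws)
    dz_dx = 1.0                                # dz/dx_N = 1 since z = x_N
    for i in reversed(range(len(ws))):         # from the last module to the first
        s = sigmoid(ws[i] * acts[i])
        local = s * (1 - s)                    # sigmoid derivative at the pre-activation
        grads_w[i] = dz_dx * local * acts[i]   # dz/dw_i
        dz_dx = dz_dx * local * ws[i]          # dz/dx_{i-1}, passed further back
    return grads_w

ws = [0.5, -1.0, 2.0, 0.3]
print(backward(forward(1.5, ws), ws))          # dz/dw_1 ... dz/dw_4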
General approach: layer abstraction

Each layer is defined by:
• a forward pass: y = forward(x), using the layer parameters w
• a backward pass: dzdx = backward(dzdy, x, y), also computing dzdw



OOP pseudocode of deep learning
abstract class Layer {
    params w, dzdw;
    virtual y = forward(x);
    virtual dzdx = backward(dzdy, x, y);  // should compute dzdw as well

    void update(tau) {
        w = w - tau*dzdw;  // gradient step on the layer parameters
    }
};

Efficient implementations have to use vector/matrix instructions and handle minibatches efficiently!
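A concrete instance of this interface in Python (a sketch, not the course's reference implementation): a parameter-free ReLU layer.

import numpy as np

class ReLULayer:
    def forward(self, x):
        self.x = x                     # cache the input for the backward pass
        return np.maximum(x, 0.0)

    def backward(self, dzdy):
        return dzdy * (self.x > 0)     # dz/dx: gradient passes only where x > 0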
Computing the partial derivatives

Options for computing the partial derivatives:
• finite differences
• deriving gradients analytically
Debugging is hard! Gradient checking is a good idea (see the sketch below).
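A minimal gradient-checking sketch (my own toy function, not from the lecture): compare the analytic gradient against central finite differences.

import numpy as np

def f(w):
    return np.sum(w ** 3)              # toy scalar function of the parameters

def analytic_grad(w):
    return 3 * w ** 2                  # hand-derived gradient of f

def numeric_grad(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e.flat[i] = eps
        g.flat[i] = (f(w + e) - f(w - e)) / (2 * eps)   # central difference
    return g

w = np.random.randn(5)
print(np.max(np.abs(analytic_grad(w) - numeric_grad(f, w))))   # tiny if the gradient is correct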



Deep learning packages
• All packages facilitate stacking layers and
defining new layers
• Differ on languages/levels of granularity
• Modern packages (PyTorch, TensorFlow, MXNet) come with lots of predefined blocks
of different granularity
• While implementing new layers is seldom
needed for non-research tasks, it is crucial
for understanding the intuitions behind deep
networks



Zoo of layers in modern packages

Multiplicative layer, convolutional layer, ReLU layer, sigmoid layer, softmax layer, normalization layer, max-pooling layer, data providers;
copy layer, split layer, cat layer, merge layer, log-loss layer, softmax loss layer, hinge loss layer, L2-loss layer, contrastive loss layer.
Example: multitask learning
* sml

* NL * cp
* sml

Typical use case (sketched below):
• Two related tasks
• Limited labeled data for the main task
• Lots of labeled data for the auxiliary task
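A PyTorch sketch of such a network (an assumed toy architecture: a shared trunk with two classifier heads, one per task):

import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self, d_in, d_hidden, n_main, n_aux):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head_main = nn.Linear(d_hidden, n_main)   # main task (little data)
        self.head_aux = nn.Linear(d_hidden, n_aux)     # auxiliary task (lots of data)

    def forward(self, x):
        h = self.trunk(x)                              # shared representation
        return self.head_main(h), self.head_aux(h)

net = TwoHeadNet(20, 64, 5, 100)
main_logits, aux_logits = net(torch.randn(8, 20))
# the two losses are summed (optionally weighted) and backpropagated through the shared trunk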



Deep learning and overfitting

* sml

* NL * sml

• Overfitting is severe for deep models (why?)
• Progress in deep learning was not happening until really large datasets arrived



Recap: regularization
Strategies to avoid overfitting (a.k.a. regularizing learning):
• Pick a “simpler” model (e.g. ConvNets)
• Stop optimization early (always keep checking the validation loss/error)
• Impose smoothness (weight decay; see the sketch below)
• Inject noise (equivalent to smoothness)
• Bag (average) multiple models
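For example, weight decay adds an L2 penalty 0.5·λ‖w‖² to the loss, which turns the SGD step into the following (a sketch; τ and λ are assumed hyperparameters):

import numpy as np

def sgd_step_with_decay(w, dzdw, tau=0.1, lam=1e-4):
    # the penalty's gradient is lam * w, so every step shrinks the weights towards zero
    return w - tau * (dzdw + lam * w)

w = np.array([1.0, -2.0, 0.5])
print(sgd_step_with_decay(w, dzdw=np.zeros_like(w)))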



Dropout regularization
Regularization with a special type of noise:

* NL * sm

• At training time, a random mask defines which units are active and which ones are dropped. The values of the active units are divided by (1 − drop-out probability).

[Srivastava et al. 2011]


How to implement dropout?
Define it as a layer!
Forward propagation (train-time only): multiply the input by the random mask and rescale by 1/(1 − p).
Backward propagation: multiply the incoming gradient by the same mask (with the same rescaling).
This layer is switched off at test time.
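Putting the recipe together as a layer (a sketch of inverted dropout, consistent with the “divide by (1 − drop-out probability)” rule above):

import numpy as np

class DropoutLayer:
    def __init__(self, p=0.5):
        self.p = p                         # drop-out probability

    def forward(self, x, train=True):
        if not train:                      # the layer is switched off at test time
            return x
        self.mask = (np.random.rand(*x.shape) > self.p) / (1.0 - self.p)
        return x * self.mask

    def backward(self, dzdy):
        return dzdy * self.mask            # gradient flows only through the kept units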



Deep learning: recap
End-to-end joint learning of all layers:
• multiple assemblable blocks
• each block is piecewise-differentiable
• gradient-based optimization
• gradients computed by
backpropagation

Wait! What about real neural networks?
“Real” neurons (simplified model)

McCulloch-Pitts model:
The reality is much more messy



Biological layers and parallel computation



Real vs Artificial: state of deep learning
Human brain:
• 100 billion neurons
• an average neuron is connected to 1000–10000 other neurons
• 100 trillion synapses
• 10–25% is in the visual cortex
Artificial nets:
• up to billions of neurons
• up to billions of parameters
• typically tens of millions of parameters
Importance of non-linearity

* * sm

still a single matrix multiplication

To get a more powerful model we need non-linearity:

* NL * sm
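A quick NumPy check of this point (my own toy matrices): two stacked linear maps collapse into one, while an elementwise non-linearity in between breaks the collapse.

import numpy as np

W1, W2, x = np.random.randn(4, 3), np.random.randn(2, 4), np.random.randn(3)

print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))   # True: still a single matrix
relu = lambda t: np.maximum(t, 0)
print(W2 @ relu(W1 @ x))                           # no longer expressible as one matrix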



The winner: convolutional networks

Operations:
generalized convolutions
pooling (image resizing)
elementwise non-linearity
matrix multiplication
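A minimal PyTorch sketch combining exactly these operations (an assumed toy architecture, not a specific network from the lecture):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # generalized convolution
    nn.ReLU(),                                    # elementwise non-linearity
    nn.MaxPool2d(2),                              # pooling (image resizing)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # matrix multiplication
)
print(net(torch.randn(1, 3, 32, 32)).shape)       # torch.Size([1, 10])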
Representations



Left-to-right = “smarter”: feature visualizations for Layer 1, Layer 2 and Layer 5 [Zeiler & Fergus 14]
Transfer learning

(diagram: a task where we have a lot of data → the final problem)
Learning intermediate representations

• The essence of modern “deep learning”
• Is essential for intelligence
• Can be done via supervised, unsupervised and other types of learning
• Had been done all along before the “deep learning” revolution



Gazoob world

[Tenenbaum et al. Science 2011]


Training logistic regression

* σ log



Logistic regression: simplifying training

* sml

Softmax loss = log loss over softmax/logistic
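Written out for reference (the standard formula), the softmax loss for an example x with label y and class scores sₖ = wₖᵀx is:

ℓ(x, y) = −log( exp(s_y) / Σₖ exp(sₖ) ) = −s_y + log Σₖ exp(sₖ)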


Multinomial logistic regression
Training:
“Loss”
* sml

Test:

* sm



Dropout idea: ensemble interpretation
Pseudo-ensemble training:

* NL * sm

A derived model: subsampling

* NL * sm

Goal: “train” an exponential number of such reduced models [Srivastava et al. 2011]
Note: high-level vision is part-based
• Does it have pointed ears?
• Does it have floppy ears?
• Does it have a curvy tail?
• Does it have an elongated muzzle?
• Does it have legs longer than 25 cm?
• ….



Dropout idea: ensemble interpretation
Training a very big ensemble of models:

[Srivastava et al. 2011]


Dropout idea: ensemble interpretation

average
….

• The approximation is not exact
• …but it works well in practice

[Srivastava et al. 2011]


Example: “leaky ReLU”
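For reference, the standard leaky ReLU: identity for positive inputs, a small slope α for negative ones (the value below is an assumed typical choice).

import numpy as np

def leaky_relu(x, alpha=0.01):     # alpha is an assumed typical value
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [-0.02 -0.005 0. 1.5]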



Stay tuned…
Natash, wake up!
We have differentiated everything:
matrix multiplication,
softmax,
ReLU, Natash.
And plugged it into a neural network!
We have differentiated absolutely everything, Natash!
