
Lecture 1: Introduction to Deep Learning

“Deep Learning”, Spring 2021: Lecture 1, Introduction


Overview
This class is about:
• deep learning
• applications in computer vision and graphics
• applications in natural language processing

It will include:
• 12 lectures
• 12 seminars
• 4 homeworks
• 1 project
• 3.5 guest lectures
2006

Computer vision = 60%


0.6^12 ≈ 0.0022
2014

0.989^12 ≈ 0.875
2014



How to tell a cat from a dog?

The cat or domestic cat (Felis catus) is a small carnivorous mammal.[1][2] It is the only domesticated species in the family Felidae.[4] The cat is either a house cat, kept as a pet; or a feral cat, freely ranging and avoiding human contact.[5] A house cat is valued by humans for companionship and for its ability to hunt rodents. About 60 cat breeds are recognized by various cat registries…

The domestic dog (Canis lupus familiaris when considered a subspecies of the wolf or Canis familiaris when considered a distinct species)[4] is a member of the genus Canis (canines), which forms part of the wolf-like canids,[5] and is the most widely abundant terrestrial carnivore.[6][7][8][9][10] The dog and the extant gray wolf are sister taxa[11][12][13] as modern wolves are not closely related to the wolves that were first domesticated,[12][13] which implies that the direct ancestor of the dog is extinct…
How to tell a cat from a dog: ML approach
• Does it have pointed ears?
• Does it have floppy ears?
• Does it have a curvy tail?
• Does it have an elongated muzzle?
• Does it have legs longer than 25 cm?
• …
→ ML classifier

Easier questions, but still very hard (feature engineering needed!)
Can pixels work as features?

Natural feature mapping (raw pixels):
• Highly non-smooth w.r.t. jitter
• Requires lots of training samples

(feature space plot: f(x1) vs f(x2))
Haar features for face detection

Viola-Jones features:
• Smooth w.r.t. jitter
• Fewer training examples needed
• (also fast to compute)

(feature space plot: f(x1) vs f(x2))

[Viola Jones, CVPR’01]


Haar features

(figure: example Haar rectangle features A, B, C, D)

[Viola Jones, CVPR’01]


Viola-Jones detector

Haar feature extractor → linear classifier + thresholding → “face” / “background”

• Non-shallow, learnable representation (AdaBoost greedy algorithm)
• Cascaded detector for speed
• One of the most impactful papers in CV history

[Viola Jones, CVPR’01]


From face detection to pedestrian detection

Frontal faces: good industry-grade performance by Viola-Jones
vs.
Pedestrians: the Viola-Jones detector is not good enough

Improving pedestrian detection

[Dollar et al. BMVC09]


Improved pedestrian detector
Hand-crafted “channel features” → Haar features → classifier (trained using boosting) → “pedestrian” / “background”

[Dollar, Tu, Perona, Belongie. Integral Channel Features. BMVC09]
“Deep learning” is not only depth
Hand-crafted “channel features” → Haar features → classifier (trained using boosting) → “pedestrian” / “background”

• Previous CV systems were already “deep”: they used multiple layers of representation with some success
• But they were not called “deep learning”
Deep learning definition

My caveat: the number of modules >= 3 (logreg is not DL)



Then, what is “deep learning”?
End-to-end joint learning of all layers:
• multiple assemblable blocks (“modules”)
• each block is piecewise-differentiable
• gradient-based learning
from examples
• gradients computed
by backpropagation
Deep learning “revolution”
(2012? – now): rapid
engineering improvements
following these principles
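As a concrete illustration of these principles, here is a minimal PyTorch sketch (my own toy example, not from the lecture): stacked differentiable modules trained jointly, end to end, by backpropagation.

import torch
import torch.nn as nn

# stacked, assemblable differentiable blocks
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))   # toy training examples
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)   # forward pass through all blocks
    loss.backward()               # gradients computed by backpropagation
    opt.step()                    # joint gradient-based update of every block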
(Pre)-history of deep learning

chart from [Kamath et al. 2019]


Perceptron: the first “neural net”
[Rosenblatt 1957]: an “artificial neuron”

loop over examples i:
    y = H(wᵀxᵢ);
    w = w + ½ · xᵢ · (yᵢ − y);
end
• Non-gradient-based training
• Converges to a linear separator of the training data if one exists.
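The same rule as a runnable NumPy sketch (with assumed toy data; H is the Heaviside step, labels are 0/1):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.0, -2.0]) > 0).astype(float)   # linearly separable labels

w = np.zeros(2)
for _ in range(20):                      # passes over the training examples
    for xi, yi in zip(X, y):
        pred = float(w @ xi > 0)         # y = H(w^T x_i)
        w = w + 0.5 * xi * (yi - pred)   # update only on mistakes
print(w)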
Perceptron as a computational graph
“operations, layers, transforms”

* H

“units, neurons, activations, blobs”



Logistic regression

Similar diagram/computational graph:

* σ



Training logistic regression

* σ



Multi-layer perceptron idea

* σ * σ

• First layer: parallel logistic regressions
• Each predicts the presence of some feature in the input
• Second layer is a logistic regression that “weighs” the outputs of the first layer (see the sketch below)
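A minimal NumPy sketch of this idea (shapes and weights made up for illustration):

import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = np.random.randn(5)                        # input features
W1, b1 = np.random.randn(3, 5), np.zeros(3)   # 3 parallel "feature detectors"
w2, b2 = np.random.randn(3), 0.0              # second-layer logistic regression

h = sigmoid(W1 @ x + b1)   # first layer: presence of learned features
y = sigmoid(w2 @ h + b2)   # second layer: weighs the first layer's outputs
print(y)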



Choosing non-linearity
To get a more powerful model we need non-linearity:

* NL * sm

Possible elementwise non-linearities (see the sketch below):
• Heaviside
• Sigmoid (logistic) / tanh
• More recently: ReLU(x) = max(0, x)
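These non-linearities written out as elementwise NumPy functions (a small sketch):

import numpy as np

heaviside = lambda x: (x > 0).astype(float)      # Heaviside step
sigmoid   = lambda x: 1.0 / (1.0 + np.exp(-x))   # sigmoid (logistic)
tanh      = np.tanh
relu      = lambda x: np.maximum(x, 0.0)         # ReLU(x) = max(0, x)

print(relu(np.linspace(-3, 3, 7)))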
Universal approximation theorem

M σ M

UAT: given a non-polynomial non-linearity, a single-hidden-layer neural network can approximate any continuous function on any compact subset of R^d up to arbitrary precision [Cybenko 1989].
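In formula form, the approximating family is a one-hidden-layer network of width N (a standard way of writing the statement, added here for reference):

f(x) ≈ Σᵢ₌₁..N cᵢ · σ(wᵢᵀ x + bᵢ),

i.e. for any continuous f and any ε > 0 there exist a finite N and parameters cᵢ, wᵢ, bᵢ achieving error below ε on the compact set.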

Caveat 1: the width of the network can be a very quickly growing function of the space dimension and the approximation accuracy. Deeper architectures are exponentially narrower for some classes of functions [Rolnick & Tegmark 2018].

Caveat 2: there are no guarantees on extrapolation beyond the compact set where the approximation is computed. Thus, designing a proper parameterization of your space is useful.
Computing deeper derivatives

f3(;w3) z
f1(;w1) f2(;w2) f4(;w4)
x0 x1 x2 x3
z=f4( f3( f2( f1(x; w1); w2); w3); w4)
• We can implement derivatives of the elementary functions
• But how to compute a derivative through the whole composition, e.g. ∂z/∂w₁?



Recap: chain rule
x1 y1
f g
z
x2 y2
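For this diagram, with z = g(y₁, y₂) and yⱼ = fⱼ(x₁, x₂), the multivariate chain rule reads:

∂z/∂xᵢ = Σⱼ (∂z/∂yⱼ) · (∂yⱼ/∂xᵢ)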





Computing deeper derivatives

f3(;w3) z
f1(;w1) f2(;w2) f4(;w4)
x0 x1 x2 x3

z=f4( f3( f2( f1(x; w1); w2); w3); w4)

Now, we are ready to compute all derivatives!



Sequential computation: backpropagation
The derivatives ∂z/∂xᵢ (and hence ∂z/∂wᵢ) can be computed sequentially, from the last module to the first:

f3(;w3) z
f1(;w1) f2(;w2) f4(;w4)
x0 x1 x2 x3
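A toy NumPy sketch of this sequential (reverse-order) computation; the chain is my own example with each module fᵢ(x; wᵢ) = σ(wᵢ · x):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x0, ws):
    acts = [x0]                                # x0, x1, ..., xN
    for w in ws:
        acts.append(sigmoid(w * acts[-1]))     # x_i = f_i(x_{i-1}; w_i)
    return acts

def backward(acts, ws):
    grads_w = [0.0] * len(ws)
    dz_dx = 1.0                                # dz/dx_N = 1 since z = x_N
    for i in reversed(range(len(ws))):         # from the last module to the first
        s = sigmoid(ws[i] * acts[i])
        local = s * (1 - s)                    # sigmoid derivative at the pre-activation
        grads_w[i] = dz_dx * local * acts[i]   # dz/dw_i
        dz_dx = dz_dx * local * ws[i]          # dz/dx_{i-1}, passed further back
    return grads_w

ws = [0.5, -1.0, 2.0, 0.3]
print(backward(forward(1.5, ws), ws))          # dz/dw_1 ... dz/dw_4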
General approach: layer abstraction

Each layer is defined by:
• a forward pass: y = forward(x), using the layer parameters w
• a backward pass: dzdx = backward(dzdy, x, y), also computing dzdw



OOP pseudocode of deep learning
abstract class Layer {
    params w, dzdw;
    virtual y = forward(x);
    virtual dzdx = backward(dzdy, x, y);  // should compute dzdw as well

    void update(tau) {
        w = w - tau*dzdw;  // gradient step on the layer parameters
    }
};

Efficient implementations have to use vector/matrix instructions and handle minibatches efficiently!
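A concrete instance of this interface in Python (a sketch, not the course's reference implementation): a parameter-free ReLU layer.

import numpy as np

class ReLULayer:
    def forward(self, x):
        self.x = x                     # cache the input for the backward pass
        return np.maximum(x, 0.0)

    def backward(self, dzdy):
        return dzdy * (self.x > 0)     # dz/dx: gradient passes only where x > 0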
Computing the partial derivatives

Options for computing the partial derivatives:
• finite differences
• deriving gradients analytically
Debugging is hard! Gradient checking is a good idea (see the sketch below).
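A minimal gradient-checking sketch (my own toy function, not from the lecture): compare the analytic gradient against central finite differences.

import numpy as np

def f(w):
    return np.sum(w ** 3)              # toy scalar function of the parameters

def analytic_grad(w):
    return 3 * w ** 2                  # hand-derived gradient of f

def numeric_grad(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e.flat[i] = eps
        g.flat[i] = (f(w + e) - f(w - e)) / (2 * eps)   # central difference
    return g

w = np.random.randn(5)
print(np.max(np.abs(analytic_grad(w) - numeric_grad(f, w))))   # tiny if the gradient is correct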



Deep learning packages
• All packages facilitate stacking layers and
defining new layers
• Differ on languages/levels of granularity
• Modern packages (PyTorch, TensorFlow, MXNet) come with lots of predefined blocks
of different granularity
• While implementing new layers is seldom
needed for non-research tasks, it is crucial
for understanding the intuitions behind deep
networks



Zoo of layers in modern packages

Multiplicative layer, convolutional layer, ReLU layer, sigmoid layer, softmax layer, normalization layer, max-pooling layer, data providers;
copy layer, split layer, cat layer, merge layer, log-loss layer, softmax loss layer, hinge loss layer, L2-loss layer, contrastive loss layer.
Example: multitask learning
* sml

* NL * cp
* sml

Typical use case (sketched below):
• Two related tasks
• Limited labeled data for the main task
• Lots of labeled data for the auxiliary task
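A PyTorch sketch of such a network (an assumed toy architecture: a shared trunk with two classifier heads, one per task):

import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self, d_in, d_hidden, n_main, n_aux):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head_main = nn.Linear(d_hidden, n_main)   # main task (little data)
        self.head_aux = nn.Linear(d_hidden, n_aux)     # auxiliary task (lots of data)

    def forward(self, x):
        h = self.trunk(x)                              # shared representation
        return self.head_main(h), self.head_aux(h)

net = TwoHeadNet(20, 64, 5, 100)
main_logits, aux_logits = net(torch.randn(8, 20))
# the two losses are summed (optionally weighted) and backpropagated through the shared trunk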



Deep learning and overfitting

* sml

* NL * sml

• Overfitting is severe for deep models (why?)
• Progress in deep learning was not happening until really large datasets arrived



Recap: regularization
Strategies to avoid overfitting (a.k.a. regularizing learning):
• Pick a “simpler” model (e.g. ConvNets)
• Stop optimization early (always keep checking the validation loss/error)
• Impose smoothness (weight decay; see the sketch below)
• Inject noise (equivalent to smoothness)
• Bag (average) multiple models
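For example, weight decay adds an L2 penalty 0.5·λ‖w‖² to the loss, which turns the SGD step into the following (a sketch; τ and λ are assumed hyperparameters):

import numpy as np

def sgd_step_with_decay(w, dzdw, tau=0.1, lam=1e-4):
    # the penalty's gradient is lam * w, so every step shrinks the weights towards zero
    return w - tau * (dzdw + lam * w)

w = np.array([1.0, -2.0, 0.5])
print(sgd_step_with_decay(w, dzdw=np.zeros_like(w)))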



Dropout regularization
Regularization with a special type of noise:

* NL * sm

• At training time, a random mask defines which units are active and which ones are dropped. The values of the active units are divided by (1 − drop-out probability).

[Srivastava et al. 2011]


How to implement dropout?
Define it as a layer!
Forward propagation (train-time only): multiply the input by the random mask and rescale by 1/(1 − p).
Backward propagation: multiply the incoming gradient by the same mask (with the same rescaling).
This layer is switched off at test time.
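Putting the recipe together as a layer (a sketch of inverted dropout, consistent with the “divide by (1 − drop-out probability)” rule above):

import numpy as np

class DropoutLayer:
    def __init__(self, p=0.5):
        self.p = p                         # drop-out probability

    def forward(self, x, train=True):
        if not train:                      # the layer is switched off at test time
            return x
        self.mask = (np.random.rand(*x.shape) > self.p) / (1.0 - self.p)
        return x * self.mask

    def backward(self, dzdy):
        return dzdy * self.mask            # gradient flows only through the kept units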



Deep learning: recap
End-to-end joint learning of all layers:
• multiple assemblable blocks
• each block is piecewise-differentiable
• gradient-based optimization
• gradients computed by
backpropagation

Wait! What about real neural networks?
“Real” neurons (simplified model)

McCulloch-Pitts model:
The reality is much more messy



Biological layers and parallel computation



Real vs Artificial: state of deep learning
Human brain:
• 100 billion neurons
• an average neuron is connected to 1000–10000 other neurons
• 100 trillion synapses
• 10–25% is in the visual cortex
Artificial nets:
• up to billions of neurons
• up to billions of parameters
• typically tens of millions of parameters
Importance of non-linearity

* * sm

still a single matrix multiplication

To get a more powerful model we need non-linearity:

* NL * sm
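A quick NumPy check of this point (my own toy matrices): two stacked linear maps collapse into one, while an elementwise non-linearity in between breaks the collapse.

import numpy as np

W1, W2, x = np.random.randn(4, 3), np.random.randn(2, 4), np.random.randn(3)

print(np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))   # True: still a single matrix
relu = lambda t: np.maximum(t, 0)
print(W2 @ relu(W1 @ x))                           # no longer expressible as one matrix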



The winner: convolutional networks

Operations:
generalized convolutions
pooling (image resizing)
elementwise non-linearity
matrix multiplication
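A minimal PyTorch sketch combining exactly these operations (an assumed toy architecture, not a specific network from the lecture):

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # generalized convolution
    nn.ReLU(),                                    # elementwise non-linearity
    nn.MaxPool2d(2),                              # pooling (image resizing)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # matrix multiplication
)
print(net(torch.randn(1, 3, 32, 32)).shape)       # torch.Size([1, 10])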
Representations



Left-to-right = “smarter”: feature visualizations for Layer 1, Layer 2 and Layer 5 [Zeiler & Fergus 14]
Transfer learning

(diagram: a task where we have a lot of data → the final problem)
Learning intermediate representations

• The essence of modern “deep learning”
• Is essential for intelligence
• Can be done via supervised, unsupervised and other types of learning
• Had been done all along before the “deep learning” revolution



Gazoob world

[Tenenbaum et al. Science 2011]


Training logistic regression

* σ log



Logistic regression: simplifying training

* sml

Softmax loss = log loss over softmax/logistic
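Written out for reference (the standard formula), the softmax loss for an example x with label y and class scores sₖ = wₖᵀx is:

ℓ(x, y) = −log( exp(s_y) / Σₖ exp(sₖ) ) = −s_y + log Σₖ exp(sₖ)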


Multinomial logistic regression
Training:
“Loss”
* sml

Test:

* sm



Dropout idea: ensemble interpretation
Pseudo-ensemble training:

* NL * sm

A derived model: subsampling

* NL * sm

Goal: “train” an exponential number of such reduced models [Srivastava et al. 2011]
Note: high-level vision is part-based
• Does it have pointed ears?
• Does it have floppy ears?
• Does it have a curvy tail?
• Does it have an elongated muzzle?
• Does it have legs longer than 25 cm?
• ….



Dropout idea: ensemble interpretation
Training a very big ensemble of models:

[Srivastava et al. 2011]


Dropout idea: ensemble interpretation

average
….

• The approximation is not exact
• …but it works well in practice

[Srivastava et al. 2011]


Example: “leaky ReLU”
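For reference, the standard leaky ReLU: identity for positive inputs, a small slope α for negative ones (the value below is an assumed typical choice).

import numpy as np

def leaky_relu(x, alpha=0.01):     # alpha is an assumed typical value
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [-0.02 -0.005 0. 1.5]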



Stay tuned…
Natash, wake up!
We have differentiated everything:
matrix multiplication,
softmax,
ReLU, Natash.
And plugged it into a neural network!
We have differentiated absolutely everything, Natash!
