
Deep Learning and Vision

Jon Shlens
Google Research

28 April 2017
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
The hubris of artificial intelligence

http://dspace.mit.edu/handle/1721.1/6125
‘Simple’ problems proved most difficult.

• For decades we tried to write down every possible rule for everyday tasks —> impossible.

• Everyday tasks we consider blindingly obvious have been exceedingly difficult for computers.

cat?
Machine learning applied everywhere.

• The last decade has shown that if we teach computers to perform a task, they can perform far better than hand-coded rules.

machine translation
speech recognition
face recognition
time series analysis
molecular activity prediction
image recognition
road hazard detection
object detection
optical character recognition
motor planning
motor activity planning
syntax parsing
language understanding
…
The computer vision competition:

Large-scale academic competition focused on predicting 1000 object classes (~1.2M images).

classes
• electric ray
• barracuda
• coho salmon
• tench
• goldfish
• sawfish
• smalltooth sawfish
• guitarfish
• stingray
• roughtail stingray
• ...

Imagenet: A large-scale hierarchical image database


J Deng et al (2009)
History of techniques in ImageNet Challenge

ImageNet 2010
• Locality constrained linear coding + SVM (NEC & UIUC)
• Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT features + LI2C (Nanyang Technological Institute)
• SIFT features + k-Nearest Neighbors (Laboratoire d'Informatique de Grenoble)
• Color features + canonical correlation analysis (National Institute of Informatics, Tokyo)

ImageNet 2011
• Compressed Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT bag-of-words + VQ + SVM (University of Amsterdam & University of Trento)
• SIFT + ? (ISI Lab, Tokyo University)

ImageNet 2012
• Deep convolutional neural network (University of Toronto)
• Discriminatively trained DPMs (University of Oxford)
• Fisher-based SIFT features + SVM (ISI Lab, Tokyo University)
Examples of artificial vision in action

• Good fine-grain classification (e.g. distinguishing hibiscus from dahlia).

• Good generalization (two very different images both recognized as “meal”).

• Sensible errors (e.g. confusing a snake and a dog).

** Trained a model for whole image recognition using the Inception-v3 architecture.


Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Deep convolutional neural networks

ImageNet Classification with Deep Convolutional Neural Networks
A Krizhevsky, I Sutskever, G Hinton (2012)

• Multi-layer perceptrons trained with back-propagation are ideas known since the 1980s.

Backpropagation applied to handwritten zip code recognition
Y LeCun et al (1990)
Convolutional neural networks, revisited.

ImageNet Classification with Deep Convolutional Neural Networks
A Krizhevsky, I Sutskever, G Hinton (2012)

• Winning network contained 60M parameters.

• Achieving scale in compute and data is critical:

• large academic data sets

• SIMD hardware (e.g. GPUs, SSE instruction sets)
What is deep learning?

“Deep learning” = artificial neural networks

Loosely based on (what little) we know about the brain

• Hierarchical composition of simple mathematical functions

“cat”

Untangling invariant object recognition
J DiCarlo and D Cox (2007)
A toy model of a neuron: “perceptron”

Simplify the neuron to a sum over weighted inputs and a nonlinear activation function:

y = f(\sum_i w_i x_i + b)

• no spikes

• no recurrence or feedback *

• no dynamics or state *

• no biophysics

f(z) = \max(0, z)

The perceptron: a probabilistic model for information storage and organization in the brain.
F Rosenblatt (1958)
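To make the toy model concrete, here is a minimal NumPy sketch of a single perceptron with a ReLU activation; the inputs, weights and bias are made-up values, not from the slides.

```python
import numpy as np

def relu(z):
    # f(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    # y = f(sum_i w_i * x_i + b)
    return relu(np.dot(w, x) + b)

# toy example with made-up numbers
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, 0.4])    # weights
b = 0.2                          # bias
print(perceptron(x, w, b))       # 1.68
```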
Employing a network for a task.

• A network is a hierarchical composition of nonlinear functions: y = f(f(\cdots)).

• The output of the network is a real-valued vector, with one node per label (e.g. cat, dog, car, truck, cow, bicycle).
Example: how to classify with a network

Step 1: Convert the network output to a probability distribution with the softmax function:

P(j) = \frac{\exp(y_j)}{\sum_k \exp(y_k)}

[bar plots: the raw network outputs y and the resulting probabilities P(j) over the labels cat, dog, car, truck, cow, bicycle]
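A minimal NumPy sketch of the softmax step, with made-up network outputs for the six labels shown above:

```python
import numpy as np

def softmax(y):
    # P(j) = exp(y_j) / sum_k exp(y_k); subtract max(y) for numerical stability
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

labels = ["cat", "dog", "car", "truck", "cow", "bicycle"]
y = np.array([2.0, 4.5, 0.3, -1.0, 0.8, -0.5])   # made-up network outputs
print(dict(zip(labels, softmax(y).round(3))))
```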


Example: how to classify with a network

Step 2: Minimize the cross-entropy loss between the predicted distribution and a one-hot target distribution.

[bar plots: the predicted distribution q(x) and the one-hot target distribution p(x) over the labels cat, dog, car, truck, cow, bicycle]

• The cross-entropy loss is the KL divergence between the predicted and target distributions:

loss = \sum_x p(x) \log \frac{p(x)}{q(x)}
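A small sketch of the loss computation, assuming a made-up predicted distribution and the true label “dog”; for a one-hot target p, the KL divergence above and the cross-entropy -\sum_x p(x) \log q(x) coincide.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # -sum_x p(x) log q(x); equals KL(p || q) when p is one-hot
    return -np.sum(p * np.log(q + eps))

p = np.array([0, 1, 0, 0, 0, 0])                  # one-hot target: "dog"
q = np.array([0.1, 0.7, 0.05, 0.05, 0.05, 0.05])  # made-up predicted distribution
print(cross_entropy(p, q))                        # ~0.357
```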
Gradient descent with back-propagation.

• Calculate the partial derivative of the loss with respect to each parameter,

\frac{\partial \, \mathrm{loss}}{\partial w_i}

and minimize the objective function via gradient descent.

• For weights buried inside the network, employ a clever factorization of the chain rule, i.e. back-propagation.

Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences
P Werbos (1974)

Learning Internal Representations by Error Propagation
D Rumelhart, G Hinton, R Williams (1986)
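A minimal sketch of one gradient-descent step for the toy ReLU unit above, using a squared-error loss; all values, and the choice of loss, are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, 0.4])    # weights
b, target, lr = 0.2, 1.0, 0.1    # bias, target output, learning rate

z = np.dot(w, x) + b
y = relu(z)
loss = 0.5 * (y - target) ** 2

# chain rule: d loss / d w_i = (y - target) * relu'(z) * x_i
grad_w = (y - target) * (z > 0) * x
grad_b = (y - target) * (z > 0)

w -= lr * grad_w   # gradient-descent update
b -= lr * grad_b
```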
Optimization is highly non-convex.

[figure: schematic loss surface as a function of two weights]

Note that deep networks operate in O(1M)-dimensional weight spaces.

playground.tensorflow.org
E. coli of image recognition

[figure: a handwritten “4” fed into a machine learning system (e.g. a neural network)]

http://yann.lecun.com/exdb/mnist/

Gradient-based learning applied to document recognition
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998)
Multi-layer perceptron on MNIST.

[diagram: a handwritten “4” (a P = 28 pixel zip-code image) feeds into a fully connected layer (N = 100 hidden units), followed by a logistic classifier (M = 10 classes)]

# weights in the fully connected layer = N x P^2 = 78,400
# weights in the logistic classifier = N x M = 1,000

• Note that the number of weights grows as the square of the number of pixels.

• Consider an iPhone camera image with P ≈ 2000: P^2 is already 4 million, so each hidden unit alone would need 4 million weights.
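A quick sanity check of those parameter counts (the P ≈ 2000 figure for a phone camera is the slide's assumption):

```python
P, N, M = 28, 100, 10      # image side, hidden units, classes

print(N * P**2)            # 78,400 weights in the fully connected layer
print(N * M)               # 1,000 weights in the logistic classifier

P_phone = 2000             # assumed phone-camera image side
print(P_phone**2)          # 4,000,000 weights per hidden unit at that resolution
```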
Natural image statistics obey invariances.


translation
cropping
dilation
contrast
rotation
scale
brightness

Statistics of natural images: Scaling in the woods


D Ruderman and W Bialek (1994)
Natural image statistics and neural representation
E Simoncelli and B Olshausen (2001)
Translation invariance —> convolutions

• Models of natural image statistics begin with a convolutional filter bank.
interlude for convolutions
original filter (3 x 3) identity

0 0 0
0 1 0
0 0 0

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (5 x 5) blur

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (5 x 5) sharpen

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (3 x 3) vertical edge detector

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (3 x 3) all edge detector

https://docs.gimp.org/en/plug-in-convmatrix.html
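As a hedged aside, the identity kernel above (and any other 3 x 3 kernel, such as the simple vertical-edge kernel chosen here for illustration, not taken from the slides) can be applied to a grayscale image with an off-the-shelf 2-D convolution:

```python
import numpy as np
from scipy.signal import convolve2d

identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]], dtype=float)

vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)   # illustrative edge kernel

image = np.random.rand(64, 64)                        # stand-in for a real grayscale image

same = convolve2d(image, identity, mode="same")       # returns the image unchanged
edges = convolve2d(image, vertical_edge, mode="same") # responds to vertical edges
```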
interlude for convolutions
Convolutional neural network on MNIST.

[diagram: a handwritten “4” (P = 28) feeds into a convolutional layer (N = 100 filters of size F = 5), followed by a logistic classifier (M = 10 classes) applied over the K spatial positions of the feature maps]

# weights in the convolutional layer = N x F^2 = 2,500
# weights in the logistic classifier = N x M x K = 1,000 x K

• Note that the number of model parameters is largely independent of image size.
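A short sketch contrasting the two parameter counts: the convolutional layer depends on the filter size F, not on the image size P (K, the number of spatial positions seen by the classifier, is set to 1 here purely for illustration).

```python
P, N, M, F, K = 28, 100, 10, 5, 1

print(N * P**2)     # 78,400: fully connected layer, grows with image size
print(N * F**2)     # 2,500: convolutional layer, independent of image size
print(N * M * K)    # 1,000: logistic classifier when K = 1
```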
Generalizing convolutions in depth.

[figure: example input activations, filter bank and output activations for a grayscale image (input depth 1) and an RGB image (input depth 3)]

[figure: an edge-detector filter bank and a convolutional network, showing how the size of the filter bank sets the output depth]

• Input and output depth are arbitrary parameters and need not be equal.
• Convolutional neural networks operate with depths up to 1024.
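A sketch of a convolution layer with arbitrary input and output depths; the sizes are made up, and the explicit loops are for clarity rather than speed.

```python
import numpy as np
from scipy.signal import correlate2d

H, W, C_in, C_out, F = 28, 28, 3, 8, 5            # e.g. an RGB input (depth 3) and 8 output channels

x = np.random.rand(H, W, C_in)                    # input activations
filters = np.random.rand(C_out, F, F, C_in)       # the filter bank

out = np.zeros((H, W, C_out))
for o in range(C_out):
    for c in range(C_in):
        # each output channel sums filter responses over all input channels
        out[:, :, o] += correlate2d(x[:, :, c], filters[o, :, :, c], mode="same")

print(out.shape)   # (28, 28, 8): the filter bank sets the output depth
```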
The first convolutional neural network.

“4”
logistic classifier (M=10)

fully connected (N=30)

convolutional (N=12)

convolutional (N=12)

Backpropagation applied to handwritten zip code recognition


Y LeCun et al (1989)
Convolutional neural networks, revisited

ImageNet Classification with Deep Convolutional Neural Networks
A Krizhevsky, I Sutskever, G Hinton (2012)

• Similar architecture to the original CNN, but deeper and larger (70K —> 60M parameters).

• More nonlinearities and regularization.

Backpropagation applied to handwritten zip code recognition
Y LeCun et al (1990)
Steady progress in network architectures.

year  network                         place  top-5 error
2012  Supervision                     1st    16.4%
2013  Clarifai                        1st    11.5%
2014  VGG                             2nd    7.3%
2014  GoogLeNet / Inception           1st    6.6%
2014  Andrej Karpathy (human)         n/a    5.1%
2015  Batch Normalization Inception   n/a    4.8%
2015  Inception v3                    2nd    3.6%
2015  ResNet                          1st    3.6%
2016  Inception-ResNet                n/a    3.1%
Advances in network architectures

Animation by Dan Mané


Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
C Szegedy, S Ioffe, V Vanhoucke (2016)

Deep Residual Learning for Image Recognition


K He, X Zhang, S Ren, J Sun (2015)

Rethinking the Inception Architecture for Computer Vision


C Szegedy, V Vanhoucke, S Ioffe, J Shlens, Z Wojna (2015)

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)

What I learned from competing against a ConvNet on ImageNet


A Karpathy (2014)

Very Deep Convolutional Networks for Large-scale Image Recognition


Karen Simonyan and Andrew Zisserman (2015)

Going Deeper with Convolutions


C Szegedy et al (2014)

Visualizing and Understanding Convolutional Networks


M Zeiler and R Fergus (2013)

ImageNet Classification with Deep Convolutional Neural Networks



A Krizhevsky, I Sutskever, G Hinton (2012)

Scalable Multiclass Object Categorization with Fisher Based Features


N. Gunji et al (2012)

Compressed Fisher vectors for Large Scale Visual Recognition


F Perronnin, J Sanchez (2011)
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Covariate shifts are problematic in machine learning

• Traditional machine learning must contend with covariate shift between data sets.

• Covariate shifts must be mitigated through domain adaptation.

blog.bigml.com
Covariate shifts occur between network layers.

[figure: the distribution of activations feeding into layer i drifts between time = 1 and time = N during training]

• Traditional machine learning must contend with covariate shift between data sets.

• Covariate shifts must be mitigated through domain adaptation.
Covariate shifts occur between network layers.

[figure: the 15%, 50% and 85% percentiles of a logistic unit’s activation drift over time during MNIST training]

• Covariate shifts occur across layers in a deep network.

• Performing domain adaptation or whitening is impractical in an online setting.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
Previous methods for addressing covariate shifts

• Adagrad

• whitening input data

• building invariances through normalization

• regularizing the network (e.g. dropout, maxout)

I Goodfellow et al (2013)
N Srivastava et al (2014)
Mitigate covariate shift via batch normalization.

1. Normalize the activations \{x_i\} within a mini-batch:

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}

2. Learn the mean and variance (\gamma, \beta) of each layer as parameters:

\hat{y}_i = \gamma \hat{x}_i + \beta

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
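A minimal sketch of the normalization above for one mini-batch of activations (training-mode statistics only; in practice gamma and beta are learned by gradient descent, and running averages are kept for inference).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-unit mean over the mini-batch
    var = x.var(axis=0)                    # per-unit variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.randn(32, 100) * 3.0 + 5.0   # made-up mini-batch: 32 examples, 100 units
y = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # ~0 and ~1 per unit
```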
Batch normalization stabilizes training.

• The canonical module of a perceptron is updated from

y = f(\sum_i w_i x_i + b)

to

y = f(\mathrm{BatchNorm}(\sum_i w_i x_i))

• Activations are more stable over training.

[figure: the 15%, 50% and 85% percentiles of hidden layer activations on MNIST over training]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
Batch normalization speeds up training enormously.

• CNNs train faster with fewer data samples (15x).

• Employ faster learning rates and less regularization.

[figure: precision @ 1 as a function of the number of training mini-batches]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Switching to other types of gradients

\frac{\partial}{\partial w_i}: for training a network, we focused on how the loss changes with respect to the parameters.

\frac{\partial}{\partial \mathrm{image}}: the rest of this talk instead focuses on how an activation or the loss depends on the image.

An important distinction:
• the former provides an update that “lives” in weight space
• the latter provides an update that “lives” in image space
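To make the distinction concrete, here is a sketch using the toy ReLU unit from earlier: the same chain rule yields a gradient with respect to the weights (weight space) and a gradient with respect to the input (image space); all numbers and the squared-error loss are made-up assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # the "image"
w = np.array([0.8, 0.1, 0.4])    # the weights
b, target = 0.2, 1.0

z = np.dot(w, x) + b
y = relu(z)

grad_w = (y - target) * (z > 0) * x   # lives in weight space: used to update the model
grad_x = (y - target) * (z > 0) * w   # lives in image space: used to update the image
```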
Gradient propagation to find responsible pixels

• Which pixels elicit large activation values within an image?

• Examine activations at middle layers (e.g. layers 3 and 5) of a trained network.
Gradient propagation to find responsible pixels

[figures: image patches that most strongly activate units in layer 3 and in layer 5 of a trained network]

Visualizing and Understanding Convolutional Networks
M Zeiler and R Fergus (2013)
Gradient propagation for distorting images.

• What happens if we distort the original image to amplify the label using the gradient signal?

[figure: an image from http://mscoco.org passed through Inception-v3, which predicts “dog”]
Gradient propagation for distorting images.

• What happens if we distort the original image to amplify the label using the gradient signal?

[figure: the distorted image is still recognized as “dog”]

… But what if we used the wrong image?

Inceptionism: Going Deeper into Neural Networks
A. Mordvintsev, C. Olah and M. Tyka (2015)
Gradient propagation for distorting images.

• Apply the gradient distortion, feed the distorted image back into the network, and iterate.

[figure: iterating the “dog” amplification produces dream-like imagery]

Inceptionism: Going Deeper into Neural Networks
A. Mordvintsev, C. Olah and M. Tyka (2015)
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
A Neural Algorithm of Artistic Style

[figures: style-transfer examples]

A Neural Algorithm of Artistic Style
L. Gatys, A. Ecker, M. Bethge (2015)
https://github.com/kaishengtai/neuralart
Gradient propagation for breaking things.

\frac{\partial \, \mathrm{loss}}{\partial \, \mathrm{image}}: which pixels are sensitive to the label

-\frac{\partial \, \mathrm{loss}}{\partial \, \mathrm{image}}: how to change pixels to decrease the probability of the label

[figure: an image passed through Inception-v3 and labeled “dog”]

Intriguing properties of neural networks
C Szegedy et al (2014)

Explaining and Harnessing Adversarial Examples
I Goodfellow, J Shlens and C Szegedy (2015)
Gradient propagation for breaking things.

• Constrained optimization to find an adversarial adjustment to an image (L1 norm).

• Robust across trained networks, network architectures and other machine learning systems.

Intriguing properties of neural networks
C Szegedy et al (2014)

Explaining and Harnessing Adversarial Examples
I Goodfellow, J Shlens and C Szegedy (2015)
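As a hedged illustration of the idea in Goodfellow et al. (2015), the fast gradient sign method perturbs the image by a small step along the sign of the gradient of the loss with respect to the image. The sketch below applies it to the toy ReLU unit used earlier; epsilon, the squared-error loss, and all values are made-up assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # the "image"
w = np.array([0.8, 0.1, 0.4])
b, target, eps = 0.2, 1.0, 0.1   # eps bounds the per-pixel change (max norm)

z = np.dot(w, x) + b
y = relu(z)

grad_x = (y - target) * (z > 0) * w        # d loss / d image
x_adv = x + eps * np.sign(grad_x)          # small step that increases the loss

print(relu(np.dot(w, x) + b))              # 1.68 (original output)
print(relu(np.dot(w, x_adv) + b))          # 1.81 (output pushed further from the target)
```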
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Quick Start Guide

1. Purchase a desktop with a fast GPU.

2. Download an open-source library for deep learning.

3. Download a pre-trained model for a similar vision task.

4. Retrain (fine-tune) the network for your particular data set.

Online resources:
http://www.tensorflow.org
http://cs231n.github.io/convolutional-networks/
Google Brain Residency Program

One year immersion program in deep learning research


● First class started six weeks ago; planning for next year’s class is underway

Learn to conduct deep learning research with experts on our team


● Fixed one-year employment with salary, benefits, ...
● Goal after one year is to have conducted several research projects
● Interesting problems, TensorFlow, and access to computational resources

g.co/brainresidency
Google Brain Residency Program

Who should apply?


● people with BSc, MSc or PhD, ideally in CS, mathematics or statistics
● completed coursework in calculus, linear algebra, and probability, or equiv.
● programming experience
● motivated, hard working, and have a strong interest in deep learning

g.co/brainresidency
