
Deep Learning and Vision

Jon Shlens
Google Research

28 April 2017
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
The hubris of artificial intelligence

http://dspace.mit.edu/handle/1721.1/6125
‘Simple’ problems proved most difficult.

• For decades we tried to write down every possible rule for everyday tasks —> impossible.

• Everyday tasks we consider blindingly obvious have been exceedingly difficult for computers.

cat?
Machine learning applied everywhere.

• The last decade has shown that if we teach computers to perform a task, they can perform far better than hand-coded rules.

machine translation
speech recognition
face recognition
time series analysis
molecular activity prediction
image recognition
road hazard detection
object detection
optical character recognition
motor planning
motor activity planning
syntax parsing
language understanding
…
The computer vision competition:

Large-scale academic competition focused on predicting 1000 object classes (~1.2M images).

classes
• electric ray
• barracuda
• coho salmon
• tench
• goldfish
• sawfish
• smalltooth sawfish
• guitarfish
• stingray
• roughtail stingray
• ...

Imagenet: A large-scale hierarchical image database


J Deng et al (2009)
History of techniques in ImageNet Challenge

ImageNet 2010
• Locality constrained linear coding + SVM (NEC & UIUC)
• Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT features + LI2C (Nanyang Technological Institute)
• SIFT features + k-Nearest Neighbors (Laboratoire d'Informatique de Grenoble)
• Color features + canonical correlation analysis (National Institute of Informatics, Tokyo)

ImageNet 2011
• Compressed Fisher kernel + SVM (Xerox Research Center Europe)
• SIFT bag-of-words + VQ + SVM (University of Amsterdam & University of Trento)
• SIFT + ? (ISI Lab, Tokyo University)

ImageNet 2012
• Deep convolutional neural network (University of Toronto)
• Discriminatively trained DPMs (University of Oxford)
• Fisher-based SIFT features + SVM (ISI Lab, Tokyo University)
Examples of artificial vision in action

• Good fine-grain classification (e.g. distinguishing hibiscus from dahlia).

• Good generalization (two very different images both recognized as “meal”).

• Sensible errors (e.g. confusing a snake and a dog).

** Trained a model for whole image recognition using the Inception-v3 architecture.


Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Deep convolutional neural networks

ImageNet Classification with Deep Convolutional Neural Networks
A Krizhevsky, I Sutskever, G Hinton (2012)

• Multi-layer perceptrons trained with back-propagation are ideas known since the 1980s.

Backpropagation applied to handwritten zip code recognition
Y LeCun et al (1990)
Convolutional neural networks, revisited.

ImageNet Classification with Deep Convolutional Neural Networks
A Krizhevsky, I Sutskever, G Hinton (2012)

• Winning network contained 60M parameters.

• Achieving scale in compute and data is critical:

• large academic data sets

• SIMD hardware (e.g. GPUs, SSE instruction sets)
What is deep learning?

“Deep learning” = artificial neural networks

Loosely based on (what little) we know about the brain

• Hierarchical composition of simple mathematical functions

“cat”

Untangling invariant object recognition
J DiCarlo and D Cox (2007)
A toy model of a neuron: “perceptron”

Simplify the neuron to a sum over weighted inputs and a nonlinear activation function:

y = f(\sum_i w_i x_i + b)

• no spikes

• no recurrence or feedback *

• no dynamics or state *

• no biophysics

f(z) = \max(0, z)

The perceptron: a probabilistic model for information storage and organization in the brain.
F Rosenblatt (1958)
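To make the toy model concrete, here is a minimal NumPy sketch of a single perceptron with a ReLU activation; the inputs, weights and bias are made-up values, not from the slides.

```python
import numpy as np

def relu(z):
    # f(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    # y = f(sum_i w_i * x_i + b)
    return relu(np.dot(w, x) + b)

# toy example with made-up numbers
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, 0.4])    # weights
b = 0.2                          # bias
print(perceptron(x, w, b))       # 1.68
```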
Employing a network for a task.

• A network is a hierarchical composition of nonlinear functions: y = f(f(\cdots)).

• The output of the network is a real-valued vector, with one node per label (e.g. cat, dog, car, truck, cow, bicycle).
Example: how to classify with a network

Step 1: Convert the network output to a probability distribution with the softmax function:

P(j) = \frac{\exp(y_j)}{\sum_k \exp(y_k)}

[bar plots: the raw network outputs y and the resulting probabilities P(j) over the labels cat, dog, car, truck, cow, bicycle]
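A minimal NumPy sketch of the softmax step, with made-up network outputs for the six labels shown above:

```python
import numpy as np

def softmax(y):
    # P(j) = exp(y_j) / sum_k exp(y_k); subtract max(y) for numerical stability
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

labels = ["cat", "dog", "car", "truck", "cow", "bicycle"]
y = np.array([2.0, 4.5, 0.3, -1.0, 0.8, -0.5])   # made-up network outputs
print(dict(zip(labels, softmax(y).round(3))))
```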


Example: how to classify with a network

Step 2: Minimize the cross-entropy loss between the predicted distribution and a one-hot target distribution.

[bar plots: the predicted distribution q(x) and the one-hot target distribution p(x) over the labels cat, dog, car, truck, cow, bicycle]

• The cross-entropy loss is the KL divergence between the predicted and target distributions:

loss = \sum_x p(x) \log \frac{p(x)}{q(x)}
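A small sketch of the loss computation, assuming a made-up predicted distribution and the true label “dog”; for a one-hot target p, the KL divergence above and the cross-entropy -\sum_x p(x) \log q(x) coincide.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # -sum_x p(x) log q(x); equals KL(p || q) when p is one-hot
    return -np.sum(p * np.log(q + eps))

p = np.array([0, 1, 0, 0, 0, 0])                  # one-hot target: "dog"
q = np.array([0.1, 0.7, 0.05, 0.05, 0.05, 0.05])  # made-up predicted distribution
print(cross_entropy(p, q))                        # ~0.357
```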
Gradient descent with back-propagation.

• Calculate the partial derivative of the loss with respect to each parameter,

\frac{\partial \, \mathrm{loss}}{\partial w_i}

and minimize the objective function via gradient descent.

• For weights buried inside the network, employ a clever factorization of the chain rule, i.e. back-propagation.

Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences
P Werbos (1974)

Learning Internal Representations by Error Propagation
D Rumelhart, G Hinton, R Williams (1986)
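A minimal sketch of one gradient-descent step for the toy ReLU unit above, using a squared-error loss; all values, and the choice of loss, are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, 0.4])    # weights
b, target, lr = 0.2, 1.0, 0.1    # bias, target output, learning rate

z = np.dot(w, x) + b
y = relu(z)
loss = 0.5 * (y - target) ** 2

# chain rule: d loss / d w_i = (y - target) * relu'(z) * x_i
grad_w = (y - target) * (z > 0) * x
grad_b = (y - target) * (z > 0)

w -= lr * grad_w   # gradient-descent update
b -= lr * grad_b
```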
Optimization is highly non-convex.

[figure: schematic loss surface as a function of two weights]

Note that deep networks operate in O(1M)-dimensional weight spaces.

playground.tensorflow.org
E. coli of image recognition

[figure: a handwritten “4” fed into a machine learning system (e.g. a neural network)]

http://yann.lecun.com/exdb/mnist/

Gradient-based learning applied to document recognition
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998)
Multi-layer perceptron on MNIST.

[diagram: a handwritten “4” (a P = 28 pixel zip-code image) feeds into a fully connected layer (N = 100 hidden units), followed by a logistic classifier (M = 10 classes)]

# weights in the fully connected layer = N x P^2 = 78,400
# weights in the logistic classifier = N x M = 1,000

• Note that the number of weights grows as the square of the number of pixels.

• Consider an iPhone camera image with P ≈ 2000: P^2 is already 4 million, so each hidden unit alone would need 4 million weights.
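A quick sanity check of those parameter counts (the P ≈ 2000 figure for a phone camera is the slide's assumption):

```python
P, N, M = 28, 100, 10      # image side, hidden units, classes

print(N * P**2)            # 78,400 weights in the fully connected layer
print(N * M)               # 1,000 weights in the logistic classifier

P_phone = 2000             # assumed phone-camera image side
print(P_phone**2)          # 4,000,000 weights per hidden unit at that resolution
```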
Natural image statistics obey invariances.


translation
cropping
dilation
contrast
rotation
scale
brightness

Statistics of natural images: Scaling in the woods


D Ruderman and W Bialek (1994)
Natural image statistics and neural representation
E Simoncelli and B Olshausen (2001)
Translation invariance —> convolutions

• Models of natural image statistics begin with a convolutional filter bank.
interlude for convolutions
original filter (3 x 3) identity

0 0 0
0 1 0
0 0 0

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (5 x 5) blur

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (5 x 5) sharpen

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (3 x 3) vertical edge detector

https://docs.gimp.org/en/plug-in-convmatrix.html
original filter (3 x 3) all edge detector

https://docs.gimp.org/en/plug-in-convmatrix.html
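As a hedged aside, the identity kernel above (and any other 3 x 3 kernel, such as the simple vertical-edge kernel chosen here for illustration, not taken from the slides) can be applied to a grayscale image with an off-the-shelf 2-D convolution:

```python
import numpy as np
from scipy.signal import convolve2d

identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]], dtype=float)

vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)   # illustrative edge kernel

image = np.random.rand(64, 64)                        # stand-in for a real grayscale image

same = convolve2d(image, identity, mode="same")       # returns the image unchanged
edges = convolve2d(image, vertical_edge, mode="same") # responds to vertical edges
```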
interlude for convolutions
Convolutional neural network on MNIST.

[diagram: a handwritten “4” (P = 28) feeds into a convolutional layer (N = 100 filters of size F = 5), followed by a logistic classifier (M = 10 classes) applied over the K spatial positions of the feature maps]

# weights in the convolutional layer = N x F^2 = 2,500
# weights in the logistic classifier = N x M x K = 1,000 x K

• Note that the number of model parameters is largely independent of image size.
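A short sketch contrasting the two parameter counts: the convolutional layer depends on the filter size F, not on the image size P (K, the number of spatial positions seen by the classifier, is set to 1 here purely for illustration).

```python
P, N, M, F, K = 28, 100, 10, 5, 1

print(N * P**2)     # 78,400: fully connected layer, grows with image size
print(N * F**2)     # 2,500: convolutional layer, independent of image size
print(N * M * K)    # 1,000: logistic classifier when K = 1
```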
Generalizing convolutions in depth.

[figure: example input activations, filter bank and output activations for a grayscale image (input depth 1) and an RGB image (input depth 3)]

[figure: an edge-detector filter bank and a convolutional network, showing how the size of the filter bank sets the output depth]

• Input and output depth are arbitrary parameters and need not be equal.
• Convolutional neural networks operate with depths up to 1024.
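A sketch of a convolution layer with arbitrary input and output depths; the sizes are made up, and the explicit loops are for clarity rather than speed.

```python
import numpy as np
from scipy.signal import correlate2d

H, W, C_in, C_out, F = 28, 28, 3, 8, 5            # e.g. an RGB input (depth 3) and 8 output channels

x = np.random.rand(H, W, C_in)                    # input activations
filters = np.random.rand(C_out, F, F, C_in)       # the filter bank

out = np.zeros((H, W, C_out))
for o in range(C_out):
    for c in range(C_in):
        # each output channel sums filter responses over all input channels
        out[:, :, o] += correlate2d(x[:, :, c], filters[o, :, :, c], mode="same")

print(out.shape)   # (28, 28, 8): the filter bank sets the output depth
```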
The first convolutional neural network.

“4”
logistic classifier (M=10)

fully connected (N=30)

convolutional (N=12)

convolutional (N=12)

Backpropagation applied to handwritten zip code recognition


Y LeCun et al (1989)
Convolutional neural networks, revisited

ImageNet Classification with Deep Convolutional Neural Networks
A Krizhevsky, I Sutskever, G Hinton (2012)

• Similar architecture to the original CNN, but deeper and larger (70K —> 60M parameters).

• More nonlinearities and regularization.

Backpropagation applied to handwritten zip code recognition
Y LeCun et al (1990)
Steady progress in network architectures.

year  network                         place  top-5 error
2012  Supervision                     1st    16.4%
2013  Clarifai                        1st    11.5%
2014  VGG                             2nd    7.3%
2014  GoogLeNet / Inception           1st    6.6%
2014  Andrej Karpathy (human)         n/a    5.1%
2015  Batch Normalization Inception   n/a    4.8%
2015  Inception v3                    2nd    3.6%
2015  ResNet                          1st    3.6%
2016  Inception-ResNet                n/a    3.1%
Advances in network architectures

Animation by Dan Mané


Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
C Szegedy, S Ioffe, V Vanhoucke (2016)

Deep Residual Learning for Image Recognition


K He, X Zhang, S Ren, J Sun (2015)

Rethinking the Inception Architecture for Computer Vision


C Szegedy, V Vanhoucke, S Ioffe, J Shlens, Z Wojna (2015)

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)

What I learned from competing against a ConvNet on ImageNet


A Karpathy (2014)

Very Deep Convolutional Networks for Large-scale Image Recognition


Karen Simonyan and Andrew Zisserman (2015)

Going Deeper with Convolutions


C Szegedy et al (2014)

Visualizing and Understanding Convolutional Networks


M Zeiler and R Fergus (2013)

ImageNet Classification with Deep Convolutional Neural Networks



A Krizhevsky, I Sutskever, G Hinton (2012)

Scalable Multiclass Object Categorization with Fisher Based Features


N. Gunji et al (2012)

Compressed Fisher vectors for Large Scale Visual Recognition


F Perronnin, J Sanchez (2011)
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Covariate shifts are problematic in machine learning

• Traditional machine learning must contend with covariate shift between data sets.

• Covariate shifts must be mitigated through domain adaptation.

blog.bigml.com
Covariate shifts occur between network layers.

[figure: the distribution of activations feeding into layer i drifts between time = 1 and time = N during training]

• Traditional machine learning must contend with covariate shift between data sets.

• Covariate shifts must be mitigated through domain adaptation.
Covariate shifts occur between network layers.

[figure: the 15%, 50% and 85% percentiles of a logistic unit’s activation drift over time during MNIST training]

• Covariate shifts occur across layers in a deep network.

• Performing domain adaptation or whitening is impractical in an online setting.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
Previous methods for addressing covariate shifts

• Adagrad

• whitening input data

• building invariances through normalization

• regularizing the network (e.g. dropout, maxout)

I Goodfellow et al (2013)
N Srivastava et al (2014)
Mitigate covariate shift via batch normalization.

1. Normalize the activations \{x_i\} within a mini-batch:

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}

2. Learn the mean and variance (\gamma, \beta) of each layer as parameters:

\hat{y}_i = \gamma \hat{x}_i + \beta

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
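A minimal sketch of the normalization above for one mini-batch of activations (training-mode statistics only; in practice gamma and beta are learned by gradient descent, and running averages are kept for inference).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-unit mean over the mini-batch
    var = x.var(axis=0)                    # per-unit variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.randn(32, 100) * 3.0 + 5.0   # made-up mini-batch: 32 examples, 100 units
y = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])   # ~0 and ~1 per unit
```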
Batch normalization stabilizes training.

• The canonical module of a perceptron is updated from

y = f(\sum_i w_i x_i + b)

to

y = f(\mathrm{BatchNorm}(\sum_i w_i x_i))

• Activations are more stable over training.

[figure: the 15%, 50% and 85% percentiles of hidden layer activations on MNIST over training]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
Batch normalization speeds up training enormously.

• CNNs train faster with fewer data samples (15x).

• Employ faster learning rates and less regularization.

[figure: precision @ 1 as a function of the number of training mini-batches]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S Ioffe and C Szegedy (2015)
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Switching to other types of gradients

\frac{\partial}{\partial w_i}: for training a network, we focused on how the loss changes with respect to the parameters.

\frac{\partial}{\partial \mathrm{image}}: the rest of this talk instead focuses on how an activation or the loss depends on the image.

An important distinction:
• the former provides an update that “lives” in weight space
• the latter provides an update that “lives” in image space
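To make the distinction concrete, here is a sketch using the toy ReLU unit from earlier: the same chain rule yields a gradient with respect to the weights (weight space) and a gradient with respect to the input (image space); all numbers and the squared-error loss are made-up assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # the "image"
w = np.array([0.8, 0.1, 0.4])    # the weights
b, target = 0.2, 1.0

z = np.dot(w, x) + b
y = relu(z)

grad_w = (y - target) * (z > 0) * x   # lives in weight space: used to update the model
grad_x = (y - target) * (z > 0) * w   # lives in image space: used to update the image
```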
Gradient propagation to find responsible pixels

• Which pixels elicit large activation values within an image?

• Examine activations at middle layers (e.g. layers 3 and 5) of a trained network.
Gradient propagation to find responsible pixels

[figures: image patches that most strongly activate units in layer 3 and in layer 5 of a trained network]

Visualizing and Understanding Convolutional Networks
M Zeiler and R Fergus (2013)
Gradient propagation for distorting images.

• What happens if we distort the original image to amplify the label using the gradient signal?

[figure: an image from http://mscoco.org passed through Inception-v3, which predicts “dog”]
Gradient propagation for distorting images.

• What happens if we distort the original image to amplify the label using the gradient signal?

[figure: the distorted image is still recognized as “dog”]

… But what if we used the wrong image?

Inceptionism: Going Deeper into Neural Networks
A. Mordvintsev, C. Olah and M. Tyka (2015)
Gradient propagation for distorting images.

• Apply the gradient distortion, feed the distorted image back into the network, and iterate.

[figure: iterating the “dog” amplification produces dream-like imagery]

Inceptionism: Going Deeper into Neural Networks
A. Mordvintsev, C. Olah and M. Tyka (2015)
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
A Neural Algorithm of Artistic Style

[figures: style-transfer examples]

A Neural Algorithm of Artistic Style
L. Gatys, A. Ecker, M. Bethge (2015)
https://github.com/kaishengtai/neuralart
Gradient propagation for breaking things.

\frac{\partial \, \mathrm{loss}}{\partial \, \mathrm{image}}: which pixels are sensitive to the label

-\frac{\partial \, \mathrm{loss}}{\partial \, \mathrm{image}}: how to change pixels to decrease the probability of the label

[figure: an image passed through Inception-v3 and labeled “dog”]

Intriguing properties of neural networks
C Szegedy et al (2014)

Explaining and Harnessing Adversarial Examples
I Goodfellow, J Shlens and C Szegedy (2015)
Gradient propagation for breaking things.

• Constrained optimization to find an adversarial adjustment to an image (L1 norm).

• Robust across trained networks, network architectures and other machine learning systems.

Intriguing properties of neural networks
C Szegedy et al (2014)

Explaining and Harnessing Adversarial Examples
I Goodfellow, J Shlens and C Szegedy (2015)
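As a hedged illustration of the idea in Goodfellow et al. (2015), the fast gradient sign method perturbs the image by a small step along the sign of the gradient of the loss with respect to the image. The sketch below applies it to the toy ReLU unit used earlier; epsilon, the squared-error loss, and all values are made-up assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # the "image"
w = np.array([0.8, 0.1, 0.4])
b, target, eps = 0.2, 1.0, 0.1   # eps bounds the per-pixel change (max norm)

z = np.dot(w, x) + b
y = relu(z)

grad_x = (y - target) * (z > 0) * w        # d loss / d image
x_adv = x + eps * np.sign(grad_x)          # small step that increases the loss

print(relu(np.dot(w, x) + b))              # 1.68 (original output)
print(relu(np.dot(w, x_adv) + b))          # 1.81 (output pushed further from the target)
```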
Agenda

1. A brief history and motivation

2. Deep learning for vision

• What is deep learning?

• Convolutions and neural networks

3. Advances in neural networks

• Nonlinearities: example of batch normalization

• Understanding: example of gradient propagation

4. Conclusions
Quick Start Guide

1. Purchase a desktop with a fast GPU.

2. Download an open-source library for deep learning.

3. Download a pre-trained model for a similar vision task.

4. Retrain (fine-tune) the network for your particular data set.

Online resources:
http://www.tensorflow.org
http://cs231n.github.io/convolutional-networks/
Google Brain Residency Program

One year immersion program in deep learning research


● First class started six weeks ago; planning for next year’s class is underway

Learn to conduct deep learning research with experts on our team


● Fixed one-year employment with salary, benefits, ...
● Goal after one year is to have conducted several research projects
● Interesting problems, TensorFlow, and access to computational resources

g.co/brainresidency
Google Brain Residency Program

Who should apply?


● people with BSc, MSc or PhD, ideally in CS, mathematics or statistics
● completed coursework in calculus, linear algebra, and probability, or equiv.
● programming experience
● motivated, hard working, and have a strong interest in deep learning

g.co/brainresidency
