
Deep Learning

Kairit Sirts
Lecture in TUT, 19.12.2016
Outline

• What can be done with deep learning?


• Deep learning demystified
• How can you get started with deep learning?

2
Why deep learning?
(Chart comparing deep learning, gradient boosting, random forest and linear models; see the source below)

3
http://www.infoworld.com/article/3003315/big-data/deep-learning-a-brief-guide-for-practical-problem-solvers.html
What can be done with deep learning?
Handwritten digit recognition

MNIST benchmark dataset


The best reported error rate is 0.21%

5
Street view number recognition

• Obtained from house numbers in


Google Street View images
• Best error rate is 1.69%

6
Image classification

7
Image classification
10 object classes
6000 labeled instances per class
Best accuracy so far: 96.53%

8
Image classification

9
Image classification

20 superclasses
100 fine-grained classes
600 labeled images per class
Best classification accuracy: 75.72%

10
Detecting doodles

https://quickdraw.withgoogle.com
There are other simple and fun AI
experiments launched by Google
https://aiexperiments.withgoogle.com

11
Image captioning

12
Image captioning – not so great results

13
Automatic colorization of images

14
http://richzhang.github.io/colorization/resources/images/teaser3.jpg
Automatic colorization of images - failed

15
DeepDream

https://deepdreamgenerator.com
16
DeepDream

17
DeepDream

18
DeepDream

19
Word embeddings

20
http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png
Word embeddings
(Clusters in the embedding space: months, weekdays, numbers)

21
Word embeddings

• W(man) − W(woman) ≈ W(king) − W(queen)

• W(walking) − W(walked) ≈ W(swimming) − W(swam)
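These regularities can be explored with, for example, gensim and a pretrained word2vec model (a minimal sketch; the vector file name below is a placeholder for whichever pretrained embeddings are available):

```python
from gensim.models import KeyedVectors

# Load pretrained word vectors (the file name is a placeholder).
vectors = KeyedVectors.load_word2vec_format('pretrained-vectors.bin', binary=True)

# king - man + woman should land near "queen"
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
```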
22
Automatic text generation – pseudo Shakespeare

23
http://karpathy.github.io/2015/05/21/rnn-effectiveness
Machine translation

• Google Translate app

24
Learning to play Atari Arcade games

25
https://www.youtube.com/watch?v=cjpEIotvwFY
AlphaGo

26
https://www.youtube.com/watch?v=PQCrX1sQSzY
Other tasks tackled with deep neural networks

• Speech recognition
• Various tasks in robotics
• Log analysis/risk detection
• Recommendation systems
• Motion detection from videos
• Business and Economics analytics
• Etc …

27
Deep learning demystified
How does deep learning work?
• Biological neuron • Artificial neuron

http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7
29
• Biological neural network • Artificial neural network

30
https://www.eeweb.com/blog/rob_riemen/deep-machine-learning-and-the-google-brain http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7
What happens inside a neuron?

Net input: z = x_1·w_1 + x_2·w_2 + … + x_n·w_n = Σ_{i=1..n} x_i·w_i

Output: h = f(z)
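The same computation in NumPy (a minimal sketch; the input and weight values are made up for illustration):

```python
import numpy as np

def neuron(x, w, f):
    z = np.dot(x, w)   # net input: z = x_1*w_1 + ... + x_n*w_n
    return f(z)        # output: h = f(z)

x = np.array([0.5, -1.0, 2.0])   # illustrative inputs
w = np.array([0.1, 0.4, -0.2])   # illustrative weights
print(neuron(x, w, lambda z: 1.0 / (1.0 + np.exp(-z))))  # sigmoid activation
```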

31
Activation function

Threshold: f(z) = 1 if z ≥ th, 0 if z < th
Sigmoid: f(z) = 1 / (1 + e^(−z))
Tanh: f(z) = (e^z − e^(−z)) / (e^z + e^(−z))
ReLU: f(z) = max(0, z)
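Written out directly in NumPy, the four activations look like this (a sketch, not a numerically hardened implementation):

```python
import numpy as np

def threshold(z, th=0.0):
    return np.where(z >= th, 1.0, 0.0)        # step function

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # squashes to (0, 1)

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)                 # rectified linear unit
```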

32
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/neural_networks.html
Single neuron logic gates

• Threshold activation function

33
https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/
XOR gate
• Cannot be done with a single neuron
• A hidden layer is necessary

x_1  x_2 | OR                       | NOT AND (NAND)                 | AND (of OR and NAND outputs) = XOR
0    0   | 𝕀(0·1 + 0·1 > 0.5) = 0  | 𝕀(0·(−1) + 0·(−1) > −1.5) = 1 | 𝕀(0·1 + 1·1 > 1.5) = 0
0    1   | 𝕀(0·1 + 1·1 > 0.5) = 1  | 𝕀(0·(−1) + 1·(−1) > −1.5) = 1 | 𝕀(1·1 + 1·1 > 1.5) = 1
1    0   | 𝕀(1·1 + 0·1 > 0.5) = 1  | 𝕀(1·(−1) + 0·(−1) > −1.5) = 1 | 𝕀(1·1 + 1·1 > 1.5) = 1
1    1   | 𝕀(1·1 + 1·1 > 0.5) = 1  | 𝕀(1·(−1) + 1·(−1) > −1.5) = 0 | 𝕀(1·1 + 0·1 > 1.5) = 0
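The same construction in Python, using the weights and thresholds from the table (a minimal sketch):

```python
def step(z, th):
    return 1 if z > th else 0   # indicator I(z > th)

def xor(x1, x2):
    or_out   = step(x1 * 1 + x2 * 1, 0.5)        # OR neuron
    nand_out = step(x1 * -1 + x2 * -1, -1.5)     # NOT AND neuron
    return step(or_out * 1 + nand_out * 1, 1.5)  # AND neuron on the two hidden outputs

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))   # prints 0, 1, 1, 0
```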
34
https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/
How to assign weights?

8×9 + 9×9 + 9×9 + 9×4 = 270 weights

35
http://neuralnetworksanddeeplearning.com/
Backpropagation

• Standard and efficient method for training neural networks


• The general idea:
• Compute the error with a forward pass
• Propagate the error back to change the weights such that the error would become smaller

ERROR → ERROR′, where ERROR′ < ERROR

36
Diversion to calculus - derivative

• y′ = f′(x)
• Derivative is the slope of the tangent
line
• It is the rate of change when going in
the direction of steepest ascent

37
Derivatives

• When f′(x) = 0, the function has a local or global maximum or minimum, or a saddle point
• When f′(x) > 0, the function is increasing
• When f′(x) < 0, the function is decreasing

38
Gradients
• Generalization of derivatives to
multivariate functions
• The gradient is a vector pointing in the direction of steepest ascent
• ∇f(x, y) = (∂f/∂x, ∂f/∂y)
• ∂f/∂x and ∂f/∂y are partial derivatives – take the derivative wrt one variable while
treating all others as constants
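One way to build intuition is to estimate the partial derivatives numerically with finite differences (an illustrative sketch with a made-up function):

```python
def numerical_gradient(f, x, y, eps=1e-6):
    dfdx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)  # partial derivative wrt x
    dfdy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)  # partial derivative wrt y
    return dfdx, dfdy

f = lambda x, y: x**2 + 3 * x * y        # hypothetical function
print(numerical_gradient(f, 1.0, 2.0))   # approximately (8, 3), matching (2x + 3y, 3x)
```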

39
Gradients and backpropagation

• Backpropagation is used to compute the gradients with respect to all parameters in a


neural network.
• The gradients are then used in a general method of gradient descent for minimizing
functions.
• We want to minimize the cost function that measures the error made by the neural
network.
• In order to do that, we need to move in the direction of steepest descent, i.e. in the direction of the
negative gradient.

40
Gradient descent
• An iterative algorithm
• Start with initial parameter values θ^0
• Update parameters iteratively until
convergence:
θ^(t+1) := θ^t − α·∇f(θ^t)
• 𝛼 - learning rate, controls the step size
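In code the whole algorithm is just a short loop (a sketch on a hypothetical one-dimensional function):

```python
def gradient_descent(grad_f, theta0, alpha=0.1, n_iter=100):
    theta = theta0
    for _ in range(n_iter):
        theta = theta - alpha * grad_f(theta)   # theta_(t+1) := theta_t - alpha * grad f(theta_t)
    return theta

# Hypothetical example: minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3), theta0=0.0))   # converges towards 3
```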

41
Deep learning demystified
How does backpropagation work?
Backpropagation explained

• Example from:
https://mattmazur.com/2015/03/17/

• 2 inputs
• 1 hidden layer with 2 neurons
• Bias terms in both the hidden and
output layer
• 2 outputs

43
Initial configuration

• Training values

• Initial weights: w_1, …, w_8

• Initial biases: b_1, b_2

44
Forward pass – first hidden unit

45
Forward pass – first hidden unit

46
Forward pass – second hidden unit

47
Forward pass – first output unit

48
Forward pass – second output unit

49
Forward pass – error of the first output

50
Forward pass – output error

51
Forward pass – output error

52
Backwards pass
• Consider w_5
• How much does a change in w_5 affect the
total error?
• Apply the chain rule:

53
Chain rule
• Formula for computing the derivative of the composition of two or more functions
• F(x) ≡ f(g(x)) ≡ (f ∘ g)(x) – composition of functions f and g
• F′(x) = f′(g(x))·g′(x)

• F(x) = e^(3x),  g(x) = 3x,  f(g(x)) = e^(g(x)) = e^(3x)

• F′(x) = f′(g(x))·g′(x) = (e^(g(x)))′·g′(x) = e^(g(x))·(3x)′ = e^(3x)·3 = 3e^(3x)
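A quick finite-difference check of this result (an illustrative sketch at a made-up point x = 0.7):

```python
import math

F       = lambda x: math.exp(3 * x)       # F(x) = e^(3x)
F_prime = lambda x: 3 * math.exp(3 * x)   # chain rule result: 3e^(3x)

x, eps = 0.7, 1e-6
numeric = (F(x + eps) - F(x - eps)) / (2 * eps)   # numerical derivative
print(numeric, F_prime(x))                        # the two values agree closely
```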

54
Backwards pass
• Consider w_5
• How much does a change in w_5 affect the
total error?
• Apply the chain rule:

55
How much does error change wrt the output?

56
How much does output change wrt its net input?

57
Derivative of the sigmoid function

f(z) = 1 / (1 + e^(−z))

f′(z) = f(z)·(1 − f(z))
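The identity is easy to verify numerically (a small sketch at an arbitrary point z = 0.5):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, eps = 0.5, 1e-6
analytic = sigmoid(z) * (1 - sigmoid(z))                      # f'(z) = f(z)(1 - f(z))
numeric  = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite-difference check
print(analytic, numeric)   # both roughly 0.235
```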

58
How much does output change wrt its net input?

59
How much does the net input change wrt w_5?

60
Putting it all together

61
This is known as the delta rule

• The delta rule is the gradient descent rule for updating the weights of the inputs to
neurons in a single-layer neural network
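For an output-layer weight with a sigmoid activation and squared-error cost, the delta rule boils down to a few lines (a sketch; the numeric values are illustrative placeholders, not those of the worked example):

```python
def output_weight_gradient(out_o, target_o, out_h):
    # dE/dw = dE/dout * dout/dnet * dnet/dw
    #       = (out_o - target_o) * out_o * (1 - out_o) * out_h
    delta_o = (out_o - target_o) * out_o * (1 - out_o)
    return delta_o * out_h

grad = output_weight_gradient(out_o=0.75, target_o=0.01, out_h=0.59)
w_new = 0.40 - 0.5 * grad   # gradient descent step with learning rate alpha = 0.5
```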

62
Apply delta rule to outer layer weights

63
Update the weights with gradient descent
• Set the learning rate α = 0.5 and apply θ^(t+1) := θ^t − α·∇f(θ^t)

64
Backpropagation to hidden layer

• Continue backwards pass to


calculate new values for w_1, w_2, w_3
and w_4

65
BP through hidden layer

• out_h1 affects both o_1 and o_2, so the derivative needs to take both into account:

66
BP through hidden layer
• Consider one of those:

• The first term can be calculated using values
computed before:

• The second term is just w_5

67
BP through hidden layer
• Plug the values in:

• Compute the same value for o_2:

• Compute the total:

68
BP through hidden layer
• Next we need ∂out_h1/∂net_h1 and ∂net_h1/∂w for each
weight w

• Compute the partial derivative wrt a weight

69
BP through hidden layer
• Putting it together

• We can now update w_1

70
BP through hidden layer
• Compute the partial derivatives in the same
way for w_2, w_3 and w_4
• Update w_2, w_3 and w_4

71
After first update with backpropagation

72
Did the error decrease?

• Old error was: 0.298371109


• Improvement: 0.007343335

• After 10000 updates the error will be approximately 0.000035085
• The generated outputs will be 0.015912196 for the 0.01 target and
0.984065734 for the 0.99 target
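A compact NumPy sketch of the whole worked example (the inputs and initial weights follow the setup of the linked tutorial; biases are left fixed for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x      = np.array([0.05, 0.10])               # inputs
target = np.array([0.01, 0.99])               # target outputs
W1 = np.array([[0.15, 0.25], [0.20, 0.30]])   # input -> hidden weights (w1..w4)
W2 = np.array([[0.40, 0.50], [0.45, 0.55]])   # hidden -> output weights (w5..w8)
b1, b2, alpha = 0.35, 0.60, 0.5

for step in range(10000):
    # forward pass
    out_h = sigmoid(x @ W1 + b1)
    out_o = sigmoid(out_h @ W2 + b2)
    error = 0.5 * np.sum((target - out_o) ** 2)

    # backward pass (chain rule, as derived on the previous slides)
    delta_o = (out_o - target) * out_o * (1 - out_o)   # output-layer deltas
    delta_h = (W2 @ delta_o) * out_h * (1 - out_h)     # hidden-layer deltas

    # gradient descent updates
    W2 -= alpha * np.outer(out_h, delta_o)
    W1 -= alpha * np.outer(x, delta_h)

print(error, out_o)   # the error shrinks towards zero and the outputs approach the targets
```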

73
In conclusion
• Neural networks consist of artificial neurons organized into layers and connected
to each other with learnable weights.
• Backpropagation with gradient descent is the standard method for training neural
networks.
• Backpropagation can be used to compute the gradients of a neural network,
regardless of the depth of the network.
• Of course, there are other important tricks and tips, but this is the basis for
understanding neural networks and deep learning.

74
Common neural network architectures
Feed-forward network

• Simplest type of neural network


• Connections between units do not
form cycles
• Information always moves in one
direction
• It never goes backwards

76
https://upload.wikimedia.org/wikipedia/en/5/54/Feed_forward_neural_net.gif
Recurrent neural network

• Connections between units form cycles


• They possess internal memory – they “remember” the past inputs
• Suitable for modeling sequential/temporal data, such as text and
language data

77
Convolutional neural networks

• Convolutional layers have neurons


arranged in 3 dimensions
• Especially suitable for processing
image data

78
http://parse.ele.tue.nl/education/cluster2
Autoencoders
• Output layer attempts to reconstruct
the input
• Used for unsupervised feature learning
• The hidden layer typically has fewer
neurons, thus compressing the data
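A minimal Keras sketch of this idea (the layer sizes are illustrative, e.g. for 784-dimensional flattened images):

```python
from keras.models import Sequential
from keras.layers import Dense

autoencoder = Sequential()
autoencoder.add(Dense(32, activation='relu', input_dim=784))   # small hidden layer = compressed code
autoencoder.add(Dense(784, activation='sigmoid'))              # output layer reconstructs the input
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X, X, ...)   # note: the input itself is also the training target
```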

79
Getting started with neural networks
Courses and tutorials
• https://www.coursera.org/learn/machine-learning
• Introductory course on machine learning, provides necessary background

• https://www.coursera.org/learn/neural-networks
• Course on neural networks – assumes knowledge about machine learning

• http://ufldl.stanford.edu/tutorial/
• Tutorial on deep learning but covers also some simpler machine learning

• http://cs231n.stanford.edu/
• Course on convolutional neural networks

• https://www.udacity.com/course/deep-learning--ud730
• Course on deep learning

• There are many others … just google …

81
Books

• http://www.deeplearningbook.org/
• Deep Learning: A Practitioner's Approach – not released yet
• Fundamentals of Deep Learning – not released yet

• See more from:


• http://machinelearningmastery.com/deep-learning-books/

82
Low level libraries
• Theano - http://deeplearning.net/software/theano/
• Tensorflow - https://www.tensorflow.org/get_started/
• Python-based
• Automatic differentiation
• Can use CUDA for computing on the GPU

• Torch – http://torch.ch/
• Based on Lua
• Modular pieces that are easy to combine
• Lots of pretrained models

• See more: https://deeplearning4j.org/compare-dl4j-torch7-pylearn

83
Higher level libraries
• Keras - https://keras.io/
• Runs on top of Theano and TensorFlow
• Python-based
• Modular
• Supports both convolutional and recurrent networks
• Supports arbitrary connectivity
• Runs on both CPU and GPU

84
Keras – example code
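A few lines are enough to define, compile and train a small feed-forward network in Keras (an illustrative sketch; the layer sizes, data and training settings are made up):

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))          # hidden layer
model.add(Dense(10, activation='softmax'))                      # output layer for 10 classes
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, batch_size=32)                    # X_train, y_train are your data
```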

85
What else?
• Take the Machine Learning course in spring semester
• Use neural networks for your thesis work
• Potential supervisors in UT:
• Kairit Sirts (problems involving natural language)
• Mark Fishel (machine translation)
• Raul Vicente (computational neuroscience)
• Ilya Kuzovkin (computational neuroscience)

• Potential supervisors in TUT


• Juhan Ernits
• Tanel Alumäe (speech data)
• There are possibly others

86
In conclusion - Deep learning
• Can be used to solve very complex problems
• Based on artificial neural networks with many hidden layers
• Each artificial neuron is a simple computational unit
• Neural networks are trained with the gradient descent algorithm
• The backpropagation algorithm is used to compute the gradients with respect to the
tunable parameters
• There are many tutorials and online courses about deep learning
• There are various software libraries that make it relatively easy to get started with deep
learning
87
