DLVU Lecture 01


Lecture 1: Introduction

Jakub M. Tomczak
Deep learning

COURSE ORGANIZATION

DO NOT HESITATE TO CONTACT US!

If you have any questions, please contact us:

TAs:
• Dimitrios Alivanistos
• Daniel Daza
• Emile van Krieken
• Anna Kuzina
• Stefan Schouten
• Shuai Wang
• Roderick van der Weerdt
• Taraneh Younesian

Lecturers:
• Peter Bloem
• Michael Cochez
• Jakub Tomczak

LECTURES AND PRACTICAL SESSIONS

During this course we will have:


• 14 lectures (incl. 2 invited lectures)
• 7 days with practical sessions (5 time slots per day)
• 1 meeting for the final exam

LECTURES AND PRACTICAL SESSIONS

Content of the course:


1. From logistic regression to fully-connected networks
2. Convolutional nets, recurrent neural nets
3. Generative modeling: GANs, VAEs, autoregressive models, flows
4. Graph convolutions, self-attention, transformers
5. Reinforcement learning

LECTURES AND PRACTICAL SESSIONS

Assignments:
1. MLP (Nov 14)
2. Autograd/backpropagation (Nov 28)
3. One of the two (Dec 8):
a. CNNs
b. RNNs
4. One of the three (Dec 19):
a. VAEs & GANs
b. Graph convolutions
c. Reinforcement learning

LECTURES AND PRACTICAL SESSIONS

Assignments:
1. Use Python 3 ONLY!
2. Assignments 1 and 2 are implemented INDIVIDUALLY.
3. Assignments 3 and 4 are implemented IN GROUPS OF 3.
4. All methods must be implemented by you unless it’s specified otherwise.
5. In the assignments we will use mainly Numpy and PyTorch.
6. THERE IS NO RESIT FOR ASSIGNMENTS.
7. Late submissions: first, contact Jakub; second, 2 points are deducted for each day after the deadline.

GRADING

Assignments
• Max. 40 pts (10 pts per assignment)
• Partial grade from the assignments: round(achieved points / 4)

Exam
• Max. 40 pts (40 questions)
• Partial grade: round(achieved points / 4)

Final grade
• At least 5.5 from the assignments and 5.5 from the exam.
• Final grade:
round(0.5 * partial grade from assignments + 0.5 * partial grade from the exam)

INTRODUCTION TO ARTIFICIAL INTELLIGENCE

HISTORICAL PERSPECTIVE

1940s-1950s
• Cellular automata (von Neumann, Ulam)
• Information theory (Shannon)
• 1950: Turing test (Turing)
• 1956: The term "AI" (McCarthy)
• 1957: Perceptron (Rosenblatt)
• 1966: ELIZA (Weizenbaum)

1960s and 1970s
• Evolutionary computation (Fogel, Holland, Rechenberg, Schwefel)
• Expert systems (e.g., Quinlan, Michalski)

1980s
• 1982: ConvNet (Fukushima)
• 1985: Bayesian networks (Pearl)
• 1986: Backpropagation (Rumelhart et al.)

1990s
• 1995: SVMs (Cortes & Vapnik)
• 1997: Deep Blue defeats Kasparov
• 1997: LSTM (Hochreiter & Schmidhuber)

2000-: Deep learning era

WHY IS AI SUCCESSFUL?

• Accessible hardware
• Powerful hardware
• Intuitive programming languages
• Specialized packages
COMPONENTS OF AI SYSTEMS

Knowledge representation
How to represent & process data?

Knowledge acquisition (learning objective & algorithms)


How to extract knowledge?

Learning problems
What kind of problems can we formulate?


LEARNING AS OPTIMIZATION

For given data 𝒟, find the best data representation from a given class of representations that minimizes a given learning objective (loss):

min_{x ∈ 𝕏} f(x; 𝒟)   s.t.   x ∈ 𝕏 ⊆ 𝕐

Optimization algorithm = learning algorithm.

LEARNING TASKS

Supervised learning
• We distinguish inputs and outputs.
• We are interested in prediction (e.g., classification).
• We minimize a prediction error.

Unsupervised learning
• No distinction among variables.
• We look for a data structure.
• We minimize a reconstruction error, compression rate, …

Reinforcement learning
• An agent interacts with an environment.
• We want to learn a policy.
• Each action is rewarded.
• We maximize a total reward.


DEEP LEARNING

APPLICATION AREAS

Computer Vision
Information Retrieval
Speech Recognition
Natural Language Processing
Recommendation Systems
Drug Discovery
Robotics


EXAMPLES: HANDWRITING GENERATOR

A. Graves, "Generating Sequences With Recurrent Neural Networks"

EXAMPLES: IMAGE GENERATION

Gatopoulos & Tomczak, "Self-supervised Variational Auto-Encoders"

EXAMPLES: GENERATING IMAGE DESCRIPTIONS

Karpathy & Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"
THREE CORE COMPONENTS OF DEEP LEARNING

AUTOMATIC FEATURE EXTRACTION / REPRESENTATION LEARNING

• A representation = a set of (abstract) features.
• Deep learning automatically extracts features.
• A hidden layer = an abstraction level.
• Good-quality features should be:
  • Informative (e.g., discriminative);
  • Robust to small perturbations;
  • Invariant/equivariant to transformations.
• Representation learning = optimization algorithm.

PROBABILISTIC LEARNING

TOSS A COIN… 🎶

• c ∈ {0,1} - a random variable (the result of tossing a coin)
• x = p(c = 1) - probability of observing heads
• 1 − x = p(c = 0) - probability of observing tails
• p(c | x) = x^{c} (1 − x)^{1−c} - Bernoulli distribution

Quick check:
p(c = 0 | x) = x^{0} (1 − x)^{1−0} = 1 − x
p(c = 1 | x) = x^{1} (1 − x)^{1−1} = x

• 𝒟 = {c_1, c_2, …, c_N} - iid observations (data)

EXAMPLE: 𝒟 = {0,0,1,1,0,1,1}

• The likelihood function: p(𝒟 | x) = ∏_{n=1}^{N} p(c_n | x)

TOSS A COIN… 🎶

The optimization problem:

Find such x ∈ [0,1] that minimizes the negative log-likelihood function:

min_{x ∈ [0,1]} − log p(𝒟 | x)

Remarks:
1) Why negative? Because: max f(x) = min {−f(x)}.
2) Why logarithm? Because: log ∏ = ∑ log and the optimum is the same.

TOSS A COIN… 🎶

log p(𝒟 | x) = log ∏_{n=1}^{N} p(c_n | x)                          (the log-likelihood)

             = ∑_{n=1}^{N} log p(c_n | x)                           (log ∏ = ∑ log)

             = ∑_{n=1}^{N} log [ x^{c_n} (1 − x)^{1−c_n} ]          (Bernoulli distribution)

             = ∑_{n=1}^{N} ( c_n log x + (1 − c_n) log(1 − x) )     (log ab = log a + log b, log a^b = b log a)

TOSS A COIN… 🎶

(Finding x⋆) Calculate the derivative wrt x and set it to 0
(d/dx f(x) = 0 gives an optimum, and d/dx log x = 1/x):

d/dx ∑_{n=1}^{N} ( c_n log x + (1 − c_n) log(1 − x) ) = 0

∑_{n=1}^{N} ( c_n / x − (1 − c_n) / (1 − x) ) = 0

∑_{n=1}^{N} ( c_n (1 − x) − (1 − c_n) x ) = 0

∑_{n=1}^{N} c_n − x ∑_{n=1}^{N} c_n − Nx + x ∑_{n=1}^{N} c_n = 0   ⇒   x⋆ = (1/N) ∑_{n=1}^{N} c_n

EXAMPLE: 𝒟 = {0,0,1,1,0,1,1} gives x⋆ = 4/7.
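As a quick sanity check, here is a small Python sketch of the coin-toss MLE (NumPy assumed; variable names are illustrative, not course code) that recovers x⋆ both in closed form and by numerically minimizing the negative log-likelihood:

```python
import numpy as np

# A small numerical sanity check of the coin-toss MLE above, using the
# example dataset from the slides.

D_data = np.array([0, 0, 1, 1, 0, 1, 1])   # the example dataset, N = 7

# Closed-form maximum-likelihood estimate: x* = (1/N) * sum(c_n)
x_star = D_data.mean()                     # 4/7 ~ 0.571

def neg_log_likelihood(x, data):
    # -log p(D | x) for the Bernoulli model
    return -np.sum(data * np.log(x) + (1 - data) * np.log(1 - x))

# Simple grid search over x in (0, 1) as a numerical check
grid = np.linspace(0.001, 0.999, 999)
x_numeric = grid[np.argmin([neg_log_likelihood(x, D_data) for x in grid])]

print(x_star, x_numeric)                   # both close to 4/7
```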

PROBABILISTIC LEARNING (LIKELIHOOD-BASED)

1) Determine p(y | x).                                   e.g., Bernoulli, Gaussian, …
2) Determine p(𝒟 | x).                                   e.g., iid or sequential
3) Check constraints.                                    e.g., only values in [0, 1]
4) Find the best solution by minimizing −log p(𝒟 | x).   e.g., using gradient descent

LOGISTIC REGRESSION

SPAM DETECTION

• Example: Spam detection
  ‣ x ∈ {0,1}^D - whether the dth word occurs in an e-mail (x_d = 1) or not
  ‣ y ∈ {0,1} - whether an e-mail is spam (y = 1) or not
  ‣ Goal: provide the probability of spam OR classify messages.

[Figure: a detector assigns spam probabilities to example e-mails, illustrated with the words "hello", "fine", "mom", "call".]

LOGISTIC REGRESSION

• x ∈ ℝ^D, y ∈ {0,1}, θ ∈ ℝ^D
• We model y by using the Bernoulli distribution:

p(y | x, θ) = Bern(y | sigm(θ⊤x))

where:
linear dependency: θ⊤x = ∑_{d=1}^{D} θ_d x_d
sigmoid function: sigm(s) = 1 / (1 + exp(−s))

Sigmoid can model probabilities!

LOGISTIC REGRESSION

• Properties of the sigmoid function:
  ‣ sigm(s) ∈ (0,1)
  ‣ d/ds sigm(s) = sigm(s) (1 − sigm(s))
  ‣ sigm(−s) = 1 − sigm(s)
• In our model we have:
  p(y = 1 | x, θ) = sigm(θ⊤x)
  p(y = 0 | x, θ) = 1 − sigm(θ⊤x)

LOGISTIC REGRESSION

• p(y | x, θ) = Bern(y | sigm(θ⊤x)), where θ⊤x = ∑_{d=1}^{D} θ_d x_d

[Diagram: inputs x_1, …, x_D are weighted by θ_1, …, θ_D, summed, and passed through a sigmoid to produce p(y = 1 | x, θ).]
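A minimal sketch of this model in NumPy (shapes and names are illustrative assumptions, not course code):

```python
import numpy as np

# Forward pass of the logistic regression model in the diagram above.

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

D = 5                              # number of input features
rng = np.random.default_rng(0)
theta = rng.normal(size=D)         # weights, theta in R^D
x = rng.normal(size=D)             # one input, x in R^D

p_y1 = sigm(theta @ x)             # p(y = 1 | x, theta) = sigm(theta^T x)
p_y0 = 1.0 - p_y1                  # using sigm(-s) = 1 - sigm(s)
print(p_y1, p_y0)
```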

LOGISTIC REGRESSION: GRADIENT-DESCENT

∇_θ [−log p(𝒟_y | 𝒟_x, θ)] = ∇_θ [−∑_i log p(y_i | x_i, θ)]

= ∇_θ [−∑_i ( y_i log sigm(θ⊤x_i) + (1 − y_i) log sigm(−θ⊤x_i) )]

= −∑_i ( y_i (1 / sigm(θ⊤x_i)) sigm(θ⊤x_i) sigm(−θ⊤x_i) x_i − (1 − y_i) (1 / sigm(−θ⊤x_i)) sigm(θ⊤x_i) sigm(−θ⊤x_i) x_i )

= −∑_i ( y_i sigm(−θ⊤x_i) x_i − (1 − y_i) sigm(θ⊤x_i) x_i )

= −∑_i ( y_i (sigm(−θ⊤x_i) + sigm(θ⊤x_i)) x_i − sigm(θ⊤x_i) x_i )

= −∑_i ( y_i x_i − sigm(θ⊤x_i) x_i )

= ∑_i ( sigm(θ⊤x_i) − y_i ) x_i

LOGISTIC REGRESSION: GRADIENT-DESCENT

• The update rule:

θ := θ − α ∑_{i=1}^{N} ( sigm(θ⊤x_i) − y_i ) x_i

• What if N is large?
  ‣ Use mini-batches (M ≪ N)! (Stochastic Gradient Descent)

θ := θ − α ∑_{j=1}^{M} ( sigm(θ⊤x_j) − y_j ) x_j

or even a single example:

θ := θ − α ( sigm(θ⊤x_j) − y_j ) x_j
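A minimal SGD sketch for logistic regression following the update rule above (the synthetic data, seed, and hyperparameters are illustrative assumptions; the gradient is averaged over the mini-batch here, a common variant of the summed update on the slide):

```python
import numpy as np

# Mini-batch SGD for logistic regression on synthetic Bernoulli data.

rng = np.random.default_rng(0)

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

N, D, M = 1000, 5, 32              # dataset size, input dim, mini-batch size
alpha = 0.1                        # learning rate

X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = (rng.uniform(size=N) < sigm(X @ theta_true)).astype(float)  # Bernoulli labels

theta = np.zeros(D)
for step in range(2000):
    idx = rng.integers(0, N, size=M)          # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (sigm(Xb @ theta) - yb)     # sum_j (sigm(theta^T x_j) - y_j) x_j
    theta -= alpha * grad / M                 # averaged for a stable step size

print(theta)        # should roughly recover theta_true
print(theta_true)
```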

LOGISTIC REGRESSION: SGD VS. GD

[Figure: side-by-side comparison of the optimization paths of Stochastic Gradient Descent and Gradient Descent.]
FULLY-CONNECTED NEURAL NETWORKS

WHAT IF WE STACK MULTIPLE LOGISTIC REGRESSORS?

[Diagram: the single logistic regressor again — inputs x_1, …, x_D, weights θ_1, …, θ_D, a sigmoid, and output p(y = 1 | x, θ).]
STACKING LOGISTIC REGRESSORS

[Diagram: inputs x_1, …, x_D feed M sigmoid units h_1, …, h_M through W ∈ ℝ^{D×M}; the hidden units feed a final sigmoid through θ ∈ ℝ^{M×1}, producing p(y = 1 | x, θ).]

weights: W ∈ ℝ^{D×M}, θ ∈ ℝ^{M×1}
FULLY CONNECTED NEURAL NETWORKS (FCN)

• Stacking logistic regressors models the probability as follows:

p(y = 1 | x, W, θ) = sigm(θ⊤ sigm(Wx)), with hidden activations h = sigm(Wx)

• Notice that we still use the log-likelihood function as our objective!
• We refer to h as a hidden layer, and h_m is called a neuron.
• We can stack even more:

p(y = 1 | x, {W_i}, θ) = sigm(θ⊤ sigm(⋯ W_2 sigm(W_1 x)))

FCN: COMPONENTS

• A linear layer: a = Wx (or with explicit bias: a = Wx + b)
• with a non-linearity: h = f(Wx)
• Typical non-linearities:
  ‣ sigmoid: sigm(x) = (1 + exp(−x))^{−1}
  ‣ tanh: tanh(x) = 2 sigm(2x) − 1
  ‣ ReLU: relu(x) = max{0, x}
  ‣ softmax: softmax(x)_i = exp(x_i) / ∑_j exp(x_j)
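A small NumPy sketch of these components (shapes are illustrative); it also verifies the tanh identity above numerically:

```python
import numpy as np

# Linear layer with bias, followed by a non-linearity.

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    return np.maximum(0.0, s)

rng = np.random.default_rng(0)
D, M = 10, 16
x = rng.normal(size=D)
W = rng.normal(size=(M, D))        # linear layer weights
b = np.zeros(M)                    # explicit bias

a = W @ x + b                      # linear layer: a = Wx + b
h = relu(a)                        # with a non-linearity: h = f(a)
t = 2.0 * sigm(2.0 * a) - 1.0      # tanh via the identity tanh(x) = 2*sigm(2x) - 1
assert np.allclose(t, np.tanh(a))  # matches NumPy's tanh
```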

LINEAR LAYERS

• A linear layer: a = Wx
• Initialization of W ∈ ℝ^{D×M}:
  • Gaussian: W ∼ 𝒩(0, σ²)
  • Xavier (for tanh): W ∼ 𝒩(0, 1/D)
  • He (for ReLU): W ∼ 𝒩(0, 2/D)
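Illustrative NumPy versions of these schemes (σ is an assumed hyperparameter; note that NumPy's normal sampler takes a standard deviation, hence the square roots of the target variances):

```python
import numpy as np

# Gaussian, Xavier, and He initialization for W in R^{D x M}.

rng = np.random.default_rng(0)
D, M, sigma = 100, 50, 0.01

W_gauss = rng.normal(0.0, sigma, size=(D, M))              # N(0, sigma^2)
W_xavier = rng.normal(0.0, np.sqrt(1.0 / D), size=(D, M))  # N(0, 1/D), for tanh
W_he = rng.normal(0.0, np.sqrt(2.0 / D), size=(D, M))      # N(0, 2/D), for ReLU

print(W_he.var())   # empirically close to 2/D = 0.02
```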

ACTIVATION FUNCTIONS

[Figure: plots of sigmoid, tanh, and ReLU, together with their gradients ∇f(x).]
HOW TO DEAL WITH MULTIPLE CLASSES?

• So far, we talked about 2 classes (sigmoid for modeling probabilities).
• How to deal with K classes?
• Softmax function:

p(y = i | x, θ) = exp(θ_i⊤x) / ∑_{k=1}^{K} exp(θ_k⊤x)
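A sketch of the K-class softmax classifier in NumPy (sizes are illustrative; subtracting the maximum logit is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

# One weight vector theta_k per class; softmax turns the K scores
# theta_k^T x into a probability distribution over classes.

D, K = 5, 4
rng = np.random.default_rng(0)
Theta = rng.normal(size=(K, D))    # rows are theta_1, ..., theta_K
x = rng.normal(size=D)

logits = Theta @ x                 # theta_k^T x for each class k
logits -= logits.max()             # stability: softmax is shift-invariant
p = np.exp(logits) / np.exp(logits).sum()

print(p)          # p[i] = p(y = i | x, Theta)
print(p.sum())    # 1.0
```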

LEARNING: BACKPROPAGATION

• We introduced neural networks as stacked logistic regressors.
• How to learn an FCN? Can we use SGD? YES!
• Let us consider one hidden layer:

p(y = 1 | x, W, θ) = sigm(θ⊤ f(Wx))

• We can use the chain rule to calculate gradients wrt all weights:

dℓ/du = (dℓ/df_1) (df_1/df_2) (df_2/du)

where ℓ is the negative log-likelihood.

Full derivation: homework 🤓 (or wait till the next lecture!)
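A minimal manual-backpropagation sketch for this one-hidden-layer model with f = sigm (all names are illustrative assumptions; the finite-difference check at the end is only a sanity test of the chain-rule gradients):

```python
import numpy as np

# Forward and backward pass through p(y=1|x,W,theta) = sigm(theta^T sigm(Wx)).

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
D, M = 4, 3
x = rng.normal(size=D)
y = 1.0
W = rng.normal(size=(M, D))
theta = rng.normal(size=M)

# Forward pass
h = sigm(W @ x)                        # hidden layer h = f(Wx)
p = sigm(theta @ h)                    # p(y = 1 | x, W, theta)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # negative log-likelihood

# Backward pass (chain rule, outermost factor first)
d_logit = p - y                        # d loss / d(theta^T h) for sigmoid + NLL
grad_theta = d_logit * h               # d loss / d theta
d_h = d_logit * theta                  # d loss / d h
d_pre = d_h * h * (1 - h)              # through sigm'(s) = sigm(s)(1 - sigm(s))
grad_W = np.outer(d_pre, x)            # d loss / d W

# Finite-difference check of one entry of grad_W
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
p_pert = sigm(theta @ sigm(W_pert @ x))
loss_pert = -(y * np.log(p_pert) + (1 - y) * np.log(1 - p_pert))
print(grad_W[0, 0], (loss_pert - loss) / eps)   # the two numbers should match
```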

VANISHING GRADIENT PROBLEM

• The sigmoid function is a good choice to model probabilities; however, it is a bad choice as a non-linearity for hidden layers.
• The problem arises while calculating gradients:

d/ds sigm(s) = sigm(s) (1 − sigm(s))

If sigm(s) ≈ 1, then:

d/ds sigm(s) ≈ 1 · (1 − 1) = 0

For instance:

dℓ/du = (dℓ/df_1) · 0 · (df_2/du) = 0 😵
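A quick numerical illustration (with made-up inputs) of how the sigmoid derivative vanishes for saturated inputs:

```python
import numpy as np

# For saturated inputs the sigmoid derivative is ~0, so any chain-rule
# product containing it collapses to ~0.

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def d_sigm(s):
    return sigm(s) * (1.0 - sigm(s))

for s in [0.0, 2.0, 10.0, 30.0]:
    print(f"s = {s:5.1f}   sigm(s) = {sigm(s):.6f}   d/ds sigm(s) = {d_sigm(s):.2e}")
# At s = 30, sigm(s) ~ 1 and the derivative is ~1e-13: a near-zero factor
# anywhere in dl/du = (dl/df1)(df1/df2)(df2/du) makes the whole product vanish.
```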

Thank you!
EXTRA READING

• Bishop, "Pattern Recognition and Machine Learning"
• Murphy, "Machine Learning: A Probabilistic Perspective"
• Goodfellow, Bengio, Courville, "Deep Learning"
• Kruse et al., "Computational Intelligence: A Methodological Introduction", Springer
