DLVU Lecture 01


Lecture 1: Introduction

Jakub M. Tomczak
Deep learning

COURSE ORGANIZATION

DO NOT HESITATE TO CONTACT US!

If you have any questions, please contact us:

TAs:
• Dimitrios Alivanistos
• Daniel Daza
• Emile van Krieken
• Anna Kuzina
• Stefan Schouten
• Shuai Wang
• Roderick van der Weerdt
• Taraneh Younesian

Lecturers:
• Peter Bloem
• Michael Cochez
• Jakub Tomczak

LECTURES AND PRACTICAL SESSIONS

During this course we will have:


• 14 lectures (incl. 2 invited lectures)
• 7 days with practical sessions (5 time slots per day)
• 1 meeting for the final exam

LECTURES AND PRACTICAL SESSIONS

Content of the course:


1. From logistic regression to fully-connected networks
2. Convolutional nets, recurrent neural nets
3. Generative modeling: GANs, VAEs, autoregressive models, flows
4. Graph convolutions, self-attention, transformers
5. Reinforcement learning

LECTURES AND PRACTICAL SESSIONS

Assignments:
1. MLP (Nov 14)
2. Autograd/backpropagation (Nov 28)
3. One of the two (Dec 8):
a. CNNs
b. RNNs
4. One of the three (Dec 19):
a. VAEs & GANs
b. Graph convolutions
c. Reinforcement learning

LECTURES AND PRACTICAL SESSIONS

Assignments:
1. Use Python 3 ONLY!
2. Assignments 1 and 2 are implemented INDIVIDUALLY.
3. Assignments 3 and 4 are implemented IN GROUPS OF 3.
4. All methods must be implemented by you unless it’s specified otherwise.
5. In the assignments we will use mainly Numpy and PyTorch.
6. THERE IS NO RESIT FOR ASSIGNMENTS.
7. Late submissions: first, contact Jakub; second, 2 points are deducted for each day after the deadline.

GRADING

Assignments
• Max. 40 pts (10 pts per assignment)
• Partial grade from the assignments: round(achieved points / 4)

Exam
• Max. 40 pts (40 questions)
• Partial grade: round(achieved points / 4)

Final grade
• At least 5.5 from the assignments and 5.5 from the exam.
• Final grade:
round(0.5 * partial grade from assignments + 0.5 * partial grade from the exam)

INTRODUCTION TO ARTIFICIAL INTELLIGENCE

HISTORICAL PERSPECTIVE

1940s-1950s
• Cellular automata (von Neumann, Ulam)
• Information theory (Shannon)
• 1950: Turing test (Turing)
• 1956: The term "AI" (McCarthy)
• 1957: Perceptron (Rosenblatt)
• 1966: ELIZA (Weizenbaum)

1960s and 1970s
• Evolutionary computation (Fogel, Holland, Rechenberg, Schwefel)
• Expert systems (e.g., Quinlan, Michalski)

1980s
• 1982: ConvNet (Fukushima)
• 1985: Bayesian networks (Pearl)
• 1986: Backpropagation (Rumelhart et al.)

1990s
• 1995: SVMs (Cortes & Vapnik)
• 1997: Deep Blue defeats Kasparov
• 1997: LSTM (Hochreiter & Schmidhuber)

2000-: Deep learning era

WHY IS AI SUCCESSFUL?

• Accessible hardware
• Powerful hardware
• Intuitive programming languages
• Specialized packages
COMPONENTS OF AI SYSTEMS

Knowledge representation
How to represent & process data?

Knowledge acquisition (learning objective & algorithms)


How to extract knowledge?

Learning problems
What kind of problems can we formulate?


LEARNING AS OPTIMIZATION

For given data 𝒟, find the best data representation from a given class of representations that minimizes a given learning objective (loss):

min_{x ∈ 𝕏} f(x; 𝒟)   s.t.   x ∈ 𝕏 ⊆ 𝕐

Optimization algorithm = learning algorithm.

LEARNING TASKS

Supervised learning
• We distinguish inputs and outputs.
• We are interested in prediction (e.g., classification).
• We minimize a prediction error.

Unsupervised learning
• No distinction among variables.
• We look for a data structure.
• We minimize a reconstruction error, compression rate, …

Reinforcement learning
• An agent interacts with an environment.
• We want to learn a policy.
• Each action is rewarded.
• We maximize a total reward.


DEEP LEARNING

APPLICATION AREAS

Computer Vision
Information Retrieval
Speech Recognition
Natural Language Processing
Recommendation Systems
Drug Discovery
Robotics


EXAMPLES: HANDWRITING GENERATOR

A. Graves, "Generating Sequences With Recurrent Neural Networks"

EXAMPLES: IMAGE GENERATION

Gatopoulos & Tomczak, "Self-supervised Variational Auto-Encoders"

EXAMPLES: GENERATING IMAGE DESCRIPTIONS

Karpathy & Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"
THREE CORE COMPONENTS OF DEEP LEARNING

AUTOMATIC FEATURE EXTRACTION / REPRESENTATION LEARNING

• A representation = a set of (abstract) features.
• Deep learning automatically extracts features.
• A hidden layer = an abstraction level.
• Good-quality features should be:
  • Informative (e.g., discriminative);
  • Robust to small perturbations;
  • Invariant/equivariant to transformations.
• Representation learning = optimization algorithm.

PROBABILISTIC LEARNING

TOSS A COIN… 🎶

• c ∈ {0,1} - a random variable (the result of tossing a coin)
• x = p(c = 1) - probability of observing heads
• 1 − x = p(c = 0) - probability of observing tails
• p(c | x) = x^{c} (1 − x)^{1−c} - Bernoulli distribution

Quick check:
p(c = 0 | x) = x^{0} (1 − x)^{1−0} = 1 − x
p(c = 1 | x) = x^{1} (1 − x)^{1−1} = x

• 𝒟 = {c_1, c_2, …, c_N} - iid observations (data)

EXAMPLE: 𝒟 = {0,0,1,1,0,1,1}

• The likelihood function: p(𝒟 | x) = ∏_{n=1}^{N} p(c_n | x)

TOSS A COIN… 🎶

The optimization problem:

Find such x ∈ [0,1] that minimizes the negative log-likelihood function:

min_{x ∈ [0,1]} − log p(𝒟 | x)

Remarks:
1) Why negative? Because: max f(x) = min {−f(x)}.
2) Why logarithm? Because: log ∏ = ∑ log and the optimum is the same.

TOSS A COIN… 🎶

log p(𝒟 | x) = log ∏_{n=1}^{N} p(c_n | x)                          (the log-likelihood)

             = ∑_{n=1}^{N} log p(c_n | x)                           (log ∏ = ∑ log)

             = ∑_{n=1}^{N} log [ x^{c_n} (1 − x)^{1−c_n} ]          (Bernoulli distribution)

             = ∑_{n=1}^{N} ( c_n log x + (1 − c_n) log(1 − x) )     (log ab = log a + log b, log a^b = b log a)

TOSS A COIN… 🎶

(Finding x⋆) Calculate the derivative wrt x and set it to 0
(d/dx f(x) = 0 gives an optimum, and d/dx log x = 1/x):

d/dx ∑_{n=1}^{N} ( c_n log x + (1 − c_n) log(1 − x) ) = 0

∑_{n=1}^{N} ( c_n / x − (1 − c_n) / (1 − x) ) = 0

∑_{n=1}^{N} ( c_n (1 − x) − (1 − c_n) x ) = 0

∑_{n=1}^{N} c_n − x ∑_{n=1}^{N} c_n − Nx + x ∑_{n=1}^{N} c_n = 0   ⇒   x⋆ = (1/N) ∑_{n=1}^{N} c_n

EXAMPLE: 𝒟 = {0,0,1,1,0,1,1} gives x⋆ = 4/7.
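As a quick sanity check, here is a small Python sketch of the coin-toss MLE (NumPy assumed; variable names are illustrative, not course code) that recovers x⋆ both in closed form and by numerically minimizing the negative log-likelihood:

```python
import numpy as np

# A small numerical sanity check of the coin-toss MLE above, using the
# example dataset from the slides.

D_data = np.array([0, 0, 1, 1, 0, 1, 1])   # the example dataset, N = 7

# Closed-form maximum-likelihood estimate: x* = (1/N) * sum(c_n)
x_star = D_data.mean()                     # 4/7 ~ 0.571

def neg_log_likelihood(x, data):
    # -log p(D | x) for the Bernoulli model
    return -np.sum(data * np.log(x) + (1 - data) * np.log(1 - x))

# Simple grid search over x in (0, 1) as a numerical check
grid = np.linspace(0.001, 0.999, 999)
x_numeric = grid[np.argmin([neg_log_likelihood(x, D_data) for x in grid])]

print(x_star, x_numeric)                   # both close to 4/7
```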

PROBABILISTIC LEARNING (LIKELIHOOD-BASED)

1) Determine p(y | x).                                   e.g., Bernoulli, Gaussian, …
2) Determine p(𝒟 | x).                                   e.g., iid or sequential
3) Check constraints.                                    e.g., only values in [0, 1]
4) Find the best solution by minimizing −log p(𝒟 | x).   e.g., using gradient descent

LOGISTIC REGRESSION

SPAM DETECTION

• Example: Spam detection
  ‣ x ∈ {0,1}^D - whether the dth word occurs in an e-mail (x_d = 1) or not
  ‣ y ∈ {0,1} - whether an e-mail is spam (y = 1) or not
  ‣ Goal: provide the probability of spam OR classify messages.

[Figure: a detector assigns spam probabilities to example e-mails, illustrated with the words "hello", "fine", "mom", "call".]

LOGISTIC REGRESSION

• x ∈ ℝ^D, y ∈ {0,1}, θ ∈ ℝ^D
• We model y by using the Bernoulli distribution:

p(y | x, θ) = Bern(y | sigm(θ⊤x))

where:
linear dependency: θ⊤x = ∑_{d=1}^{D} θ_d x_d
sigmoid function: sigm(s) = 1 / (1 + exp(−s))

Sigmoid can model probabilities!

LOGISTIC REGRESSION

• Properties of the sigmoid function:
  ‣ sigm(s) ∈ (0,1)
  ‣ d/ds sigm(s) = sigm(s) (1 − sigm(s))
  ‣ sigm(−s) = 1 − sigm(s)
• In our model we have:
  p(y = 1 | x, θ) = sigm(θ⊤x)
  p(y = 0 | x, θ) = 1 − sigm(θ⊤x)

LOGISTIC REGRESSION

• p(y | x, θ) = Bern(y | sigm(θ⊤x)), where θ⊤x = ∑_{d=1}^{D} θ_d x_d

[Diagram: inputs x_1, …, x_D are weighted by θ_1, …, θ_D, summed, and passed through a sigmoid to produce p(y = 1 | x, θ).]
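A minimal sketch of this model in NumPy (shapes and names are illustrative assumptions, not course code):

```python
import numpy as np

# Forward pass of the logistic regression model in the diagram above.

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

D = 5                              # number of input features
rng = np.random.default_rng(0)
theta = rng.normal(size=D)         # weights, theta in R^D
x = rng.normal(size=D)             # one input, x in R^D

p_y1 = sigm(theta @ x)             # p(y = 1 | x, theta) = sigm(theta^T x)
p_y0 = 1.0 - p_y1                  # using sigm(-s) = 1 - sigm(s)
print(p_y1, p_y0)
```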

LOGISTIC REGRESSION: GRADIENT-DESCENT

∇_θ [−log p(𝒟_y | 𝒟_x, θ)] = ∇_θ [−∑_i log p(y_i | x_i, θ)]

= ∇_θ [−∑_i ( y_i log sigm(θ⊤x_i) + (1 − y_i) log sigm(−θ⊤x_i) )]

= −∑_i ( y_i (1 / sigm(θ⊤x_i)) sigm(θ⊤x_i) sigm(−θ⊤x_i) x_i − (1 − y_i) (1 / sigm(−θ⊤x_i)) sigm(θ⊤x_i) sigm(−θ⊤x_i) x_i )

= −∑_i ( y_i sigm(−θ⊤x_i) x_i − (1 − y_i) sigm(θ⊤x_i) x_i )

= −∑_i ( y_i (sigm(−θ⊤x_i) + sigm(θ⊤x_i)) x_i − sigm(θ⊤x_i) x_i )

= −∑_i ( y_i x_i − sigm(θ⊤x_i) x_i )

= ∑_i ( sigm(θ⊤x_i) − y_i ) x_i

LOGISTIC REGRESSION: GRADIENT-DESCENT

• The update rule:

θ := θ − α ∑_{i=1}^{N} ( sigm(θ⊤x_i) − y_i ) x_i

• What if N is large?
  ‣ Use mini-batches (M ≪ N)! (Stochastic Gradient Descent)

θ := θ − α ∑_{j=1}^{M} ( sigm(θ⊤x_j) − y_j ) x_j

or even a single example:

θ := θ − α ( sigm(θ⊤x_j) − y_j ) x_j
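A minimal SGD sketch for logistic regression following the update rule above (the synthetic data, seed, and hyperparameters are illustrative assumptions; the gradient is averaged over the mini-batch here, a common variant of the summed update on the slide):

```python
import numpy as np

# Mini-batch SGD for logistic regression on synthetic Bernoulli data.

rng = np.random.default_rng(0)

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

N, D, M = 1000, 5, 32              # dataset size, input dim, mini-batch size
alpha = 0.1                        # learning rate

X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = (rng.uniform(size=N) < sigm(X @ theta_true)).astype(float)  # Bernoulli labels

theta = np.zeros(D)
for step in range(2000):
    idx = rng.integers(0, N, size=M)          # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (sigm(Xb @ theta) - yb)     # sum_j (sigm(theta^T x_j) - y_j) x_j
    theta -= alpha * grad / M                 # averaged for a stable step size

print(theta)        # should roughly recover theta_true
print(theta_true)
```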

LOGISTIC REGRESSION: SGD VS. GD

[Figure: side-by-side comparison of the optimization paths of Stochastic Gradient Descent and Gradient Descent.]
FULLY-CONNECTED NEURAL NETWORKS

WHAT IF WE STACK MULTIPLE LOGISTIC REGRESSORS?

[Diagram: the single logistic regressor again — inputs x_1, …, x_D, weights θ_1, …, θ_D, a sigmoid, and output p(y = 1 | x, θ).]
STACKING LOGISTIC REGRESSORS

[Diagram: inputs x_1, …, x_D feed M sigmoid units h_1, …, h_M through W ∈ ℝ^{D×M}; the hidden units feed a final sigmoid through θ ∈ ℝ^{M×1}, producing p(y = 1 | x, θ).]

weights: W ∈ ℝ^{D×M}, θ ∈ ℝ^{M×1}
FULLY CONNECTED NEURAL NETWORKS (FCN)

• Stacking logistic regressors models the probability as follows:

p(y = 1 | x, W, θ) = sigm(θ⊤ sigm(Wx)), with hidden activations h = sigm(Wx)

• Notice that we still use the log-likelihood function as our objective!
• We refer to h as a hidden layer, and h_m is called a neuron.
• We can stack even more:

p(y = 1 | x, {W_i}, θ) = sigm(θ⊤ sigm(⋯ W_2 sigm(W_1 x)))

FCN: COMPONENTS

• A linear layer: a = Wx (or with explicit bias: a = Wx + b)
• with a non-linearity: h = f(Wx)
• Typical non-linearities:
  ‣ sigmoid: sigm(x) = (1 + exp(−x))^{−1}
  ‣ tanh: tanh(x) = 2 sigm(2x) − 1
  ‣ ReLU: relu(x) = max{0, x}
  ‣ softmax: softmax(x)_i = exp(x_i) / ∑_j exp(x_j)
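A small NumPy sketch of these components (shapes are illustrative); it also verifies the tanh identity above numerically:

```python
import numpy as np

# Linear layer with bias, followed by a non-linearity.

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    return np.maximum(0.0, s)

rng = np.random.default_rng(0)
D, M = 10, 16
x = rng.normal(size=D)
W = rng.normal(size=(M, D))        # linear layer weights
b = np.zeros(M)                    # explicit bias

a = W @ x + b                      # linear layer: a = Wx + b
h = relu(a)                        # with a non-linearity: h = f(a)
t = 2.0 * sigm(2.0 * a) - 1.0      # tanh via the identity tanh(x) = 2*sigm(2x) - 1
assert np.allclose(t, np.tanh(a))  # matches NumPy's tanh
```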

LINEAR LAYERS

• A linear layer: a = Wx
• Initialization of W ∈ ℝ^{D×M}:
  • Gaussian: W ∼ 𝒩(0, σ²)
  • Xavier (for tanh): W ∼ 𝒩(0, 1/D)
  • He (for ReLU): W ∼ 𝒩(0, 2/D)
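Illustrative NumPy versions of these schemes (σ is an assumed hyperparameter; note that NumPy's normal sampler takes a standard deviation, hence the square roots of the target variances):

```python
import numpy as np

# Gaussian, Xavier, and He initialization for W in R^{D x M}.

rng = np.random.default_rng(0)
D, M, sigma = 100, 50, 0.01

W_gauss = rng.normal(0.0, sigma, size=(D, M))              # N(0, sigma^2)
W_xavier = rng.normal(0.0, np.sqrt(1.0 / D), size=(D, M))  # N(0, 1/D), for tanh
W_he = rng.normal(0.0, np.sqrt(2.0 / D), size=(D, M))      # N(0, 2/D), for ReLU

print(W_he.var())   # empirically close to 2/D = 0.02
```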

ACTIVATION FUNCTIONS

[Figure: plots of sigmoid, tanh, and ReLU, together with their gradients ∇f(x).]
HOW TO DEAL WITH MULTIPLE CLASSES?

• So far, we talked about 2 classes (sigmoid for modeling probabilities).
• How to deal with K classes?
• Softmax function:

p(y = i | x, θ) = exp(θ_i⊤x) / ∑_{k=1}^{K} exp(θ_k⊤x)
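A sketch of the K-class softmax classifier in NumPy (sizes are illustrative; subtracting the maximum logit is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

# One weight vector theta_k per class; softmax turns the K scores
# theta_k^T x into a probability distribution over classes.

D, K = 5, 4
rng = np.random.default_rng(0)
Theta = rng.normal(size=(K, D))    # rows are theta_1, ..., theta_K
x = rng.normal(size=D)

logits = Theta @ x                 # theta_k^T x for each class k
logits -= logits.max()             # stability: softmax is shift-invariant
p = np.exp(logits) / np.exp(logits).sum()

print(p)          # p[i] = p(y = i | x, Theta)
print(p.sum())    # 1.0
```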

LEARNING: BACKPROPAGATION

• We introduced neural networks as stacked logistic regressors.
• How to learn an FCN? Can we use SGD? YES!
• Let us consider one hidden layer:

p(y = 1 | x, W, θ) = sigm(θ⊤ f(Wx))

• We can use the chain rule to calculate gradients wrt all weights:

dℓ/du = (dℓ/df_1) (df_1/df_2) (df_2/du)

where ℓ is the negative log-likelihood.

Full derivation: homework 🤓 (or wait till the next lecture!)
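A minimal manual-backpropagation sketch for this one-hidden-layer model with f = sigm (all names are illustrative assumptions; the finite-difference check at the end is only a sanity test of the chain-rule gradients):

```python
import numpy as np

# Forward and backward pass through p(y=1|x,W,theta) = sigm(theta^T sigm(Wx)).

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(1)
D, M = 4, 3
x = rng.normal(size=D)
y = 1.0
W = rng.normal(size=(M, D))
theta = rng.normal(size=M)

# Forward pass
h = sigm(W @ x)                        # hidden layer h = f(Wx)
p = sigm(theta @ h)                    # p(y = 1 | x, W, theta)
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # negative log-likelihood

# Backward pass (chain rule, outermost factor first)
d_logit = p - y                        # d loss / d(theta^T h) for sigmoid + NLL
grad_theta = d_logit * h               # d loss / d theta
d_h = d_logit * theta                  # d loss / d h
d_pre = d_h * h * (1 - h)              # through sigm'(s) = sigm(s)(1 - sigm(s))
grad_W = np.outer(d_pre, x)            # d loss / d W

# Finite-difference check of one entry of grad_W
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
p_pert = sigm(theta @ sigm(W_pert @ x))
loss_pert = -(y * np.log(p_pert) + (1 - y) * np.log(1 - p_pert))
print(grad_W[0, 0], (loss_pert - loss) / eps)   # the two numbers should match
```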

VANISHING GRADIENT PROBLEM

• The sigmoid function is a good choice to model probabilities; however, it is a bad choice as a non-linearity for hidden layers.
• The problem arises while calculating gradients:

d/ds sigm(s) = sigm(s) (1 − sigm(s))

If sigm(s) ≈ 1, then:

d/ds sigm(s) ≈ 1 · (1 − 1) = 0

For instance:

dℓ/du = (dℓ/df_1) · 0 · (df_2/du) = 0 😵
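A quick numerical illustration (with made-up inputs) of how the sigmoid derivative vanishes for saturated inputs:

```python
import numpy as np

# For saturated inputs the sigmoid derivative is ~0, so any chain-rule
# product containing it collapses to ~0.

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def d_sigm(s):
    return sigm(s) * (1.0 - sigm(s))

for s in [0.0, 2.0, 10.0, 30.0]:
    print(f"s = {s:5.1f}   sigm(s) = {sigm(s):.6f}   d/ds sigm(s) = {d_sigm(s):.2e}")
# At s = 30, sigm(s) ~ 1 and the derivative is ~1e-13: a near-zero factor
# anywhere in dl/du = (dl/df1)(df1/df2)(df2/du) makes the whole product vanish.
```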

Thank you!
EXTRA READING

• Bishop, "Pattern Recognition and Machine Learning"
• Murphy, "Machine Learning: A Probabilistic Perspective"
• Goodfellow, Bengio, Courville, "Deep Learning"
• Kruse et al., "Computational Intelligence: A Methodological Introduction", Springer
