DLVU Lecture 01
Jakub M. Tomczak
Deep learning
COURSE ORGANIZATION
DO NOT HESITATE TO CONTACT US!
• Stefan Schouten
• Shuai Wang
• Roderick van der Weerdt
• Taraneh Younesian
• Michael Cochez
• Jakub Tomczak
Assignments:
1. MLP (Nov 14)
2. Autograd/backpropagation (Nov 28)
3. One of the two (Dec 8):
a. CNNs
b. RNNs
4. One of the three (Dec 19):
a. VAEs & GANs
b. Graph convolutions
c. Reinforcement learning
Assignments:
1. Use Python 3 ONLY!
2. Assignments 1 and 2 are implemented INDIVIDUALLY.
3. Assignments 3 and 4 are implemented IN GROUPS OF 3.
4. All methods must be implemented by you unless it’s specified otherwise.
5. In the assignments we will use mainly Numpy and PyTorch.
6. THERE IS NO RESIT FOR ASSIGNMENTS.
7. Late submissions: first, contact Jakub; second, −2 pts for each day after the deadline.
GRADING
Assignments
• Max. 40 pts (10 pts per assignment)
• Partial grade from the assignments: round(achieved points / 4)
Exam
• Max. 40 pts (40 questions)
• Partial grade: round(achieved points / 4)
Final grade
• At least 5.5 from the assignments and 5.5 from the exam.
• Final grade:
round(0.5 * partial grade from assignments + 0.5 * partial grade from the exam)
HISTORICAL PERSPECTIVE
WHY IS AI SUCCESSFUL?
COMPONENTS OF AI SYSTEMS
Knowledge representation
How to represent & process data?
Learning problems
What kind of problems can we formulate?
LEARNING AS OPTIMIZATION
For given data 𝒟, find the best data representation x from a given class of representations 𝕏 that minimizes a given learning objective (loss):
min_{x} f(x; 𝒟)
s.t. x ∈ 𝕏 ⊆ 𝕐
LEARNING TASKS
LEARNING TASKS
Supervised Learning
• We distinguish inputs and outputs.
• We are interested in prediction.
• We minimize a prediction error.
LEARNING TASKS
Supervised Learning: Classification
• We distinguish inputs and outputs.
LEARNING TASKS
Unsupervised Learning
• No distinction among variables.
• We look for a data structure.
• We minimize a reconstruction error, compression rate, …
LEARNING TASKS
Reinforcement Learning
• An agent interacts with an environment.
• We want to learn a policy.
• Each action is rewarded.
• We maximize a total reward.
DEEP LEARNING
APPLICATION AREAS
Computer Vision
Information Retrieval
Speech Recognition
Natural Language Processing
Recommendation Systems
Drug Discovery
Robotics
…
A. Graves, “Generating Sequences With Recurrent Neural Networks“
EXAMPLES: IMAGE GENERATION
Gatopoulos & Tomczak, “Self-supervised Variational Auto-Encoders“
EXAMPLES: GENERATING IMAGE DESCRIPTIONS
Karpathy & Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"
THREE CORE COMPONENTS OF DEEP LEARNING
AUTOMATIC FEATURE EXTRACTION / REPRESENTATION LEARNING
PROBABILISTIC LEARNING
TOSS A COIN… 🎶
Model a single toss c ∈ {0,1} with p(c | x) = x^c (1 − x)^{1−c} (Bernoulli, with x the probability of heads).
Quick check:
p(c = 0 | x) = x^0 (1 − x)^{1−0} = 1 − x
p(c = 1 | x) = x^1 (1 − x)^{1−1} = x
TOSS A COIN… 🎶
EXAMPLE: 𝒟 = {0,0,1,1,0,1,1}
TOSS A COIN… 🎶
p(𝒟 | x) = ∏_{n=1}^{N} p(c_n | x)
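The product above can be evaluated numerically as a quick companion to the formula (a NumPy sketch; the dataset is the example from these slides, and the function name `likelihood` is ours):

```python
import numpy as np

# Example dataset from the slides: D = {0,0,1,1,0,1,1}
data = np.array([0, 0, 1, 1, 0, 1, 1])

def likelihood(x, c):
    """p(D | x) = prod_n x^{c_n} (1 - x)^{1 - c_n} (Bernoulli factors)."""
    return float(np.prod(x**c * (1.0 - x)**(1 - c)))

# A fair coin (x = 0.5) assigns probability 0.5 to every toss:
print(likelihood(0.5, data))  # 0.5**7 = 0.0078125
```

Note that the likelihood at x = 4/7 (the sample mean) is already larger than at x = 0.5, foreshadowing the maximum-likelihood result derived later.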
TOSS A COIN… 🎶
min_{x∈[0,1]} − log p(𝒟 | x)
Remarks:
1) Why negative? Because: max f(x) = min {−f(x)}.
2) Why logarithm? Because: log ∏ = ∑ log and the optimum is the same.
TOSS A COIN… 🎶
log p(𝒟 | x) = log ∏_{n=1}^{N} p(c_n | x)    ← the log-likelihood
 = ∑_{n=1}^{N} log p(c_n | x)    ← log ∏ = ∑ log
 = ∑_{n=1}^{N} log x^{c_n} (1 − x)^{1−c_n}    ← Bernoulli distribution
 = ∑_{n=1}^{N} ( c_n log x + (1 − c_n) log(1 − x) )    ← log a^b = b log a, log(ab) = log a + log b
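The log-of-product to sum-of-logs step can be sanity-checked numerically (a small NumPy sketch; the value x = 0.3 is an arbitrary choice of ours):

```python
import numpy as np

data = np.array([0, 0, 1, 1, 0, 1, 1])
x = 0.3  # an arbitrary coin parameter in (0, 1)

# Left-hand side: log of the product of Bernoulli probabilities
log_prod = np.log(np.prod(x**data * (1 - x)**(1 - data)))

# Right-hand side: the final sum from the derivation
sum_logs = np.sum(data * np.log(x) + (1 - data) * np.log(1 - x))

print(np.isclose(log_prod, sum_logs))  # True
```

In practice the sum form is the one implemented: the raw product underflows quickly for large N, while the sum of logs stays numerically stable.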
TOSS A COIN… 🎶
Set the derivative to zero (d/dx f(x) = 0 gives the optimum, and d/dx log x = 1/x):
d/dx ∑_{n=1}^{N} ( c_n log x + (1 − c_n) log(1 − x) ) = 0
∑_{n=1}^{N} ( c_n / x − (1 − c_n) / (1 − x) ) = 0
∑_{n=1}^{N} ( c_n (1 − x) − (1 − c_n) x ) = 0
∑_{n=1}^{N} c_n − x ∑_{n=1}^{N} c_n − N x + x ∑_{n=1}^{N} c_n = 0  ⇒  x = (1/N) ∑_{n=1}^{N} c_n
EXAMPLE: 𝒟 = {0,0,1,1,0,1,1}  ⇒  x⋆ = 4/7
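The closed-form result (the sample mean) can be cross-checked in NumPy (a sketch; the grid search is only our way of confirming the derivation, not part of the slides):

```python
import numpy as np

data = np.array([0, 0, 1, 1, 0, 1, 1])

# Closed-form maximum-likelihood estimate: the sample mean
x_star = data.mean()
print(x_star)  # 4/7 ≈ 0.5714...

# Cross-check: the negative log-likelihood is minimized at x_star
def nll(x):
    return -np.sum(data * np.log(x) + (1 - data) * np.log(1 - x))

grid = np.linspace(0.01, 0.99, 981)  # step 0.001 over (0, 1)
x_grid = grid[np.argmin([nll(x) for x in grid])]
assert abs(x_grid - x_star) < 1e-2
```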
PROBABILISTIC LEARNING (LIKELIHOOD-BASED)
LOGISTIC REGRESSION
SPAM DETECTION
[Figure: a detector assigning scores between 0 and 0.5 to the words "hello", "fine", "mom", "call"]
LOGISTIC REGRESSION
• x ∈ ℝ^D, y ∈ {0,1}, θ ∈ ℝ^D
where:
 linear dependency: θ⊤x = ∑_{d=1}^{D} θ_d x_d
 sigmoid function: sigm(s) = 1 / (1 + exp(−s))
LOGISTIC REGRESSION
‣ sigm(−s) = 1 − sigm(s)
• In our model we have:
p(y = 1 | x, θ) = sigm(θ ⊤ x)
p(y = 0 | x, θ) = 1 − sigm(θ ⊤ x)
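The model above can be sketched in a few lines of NumPy (the function names `sigm` and `predict_proba` are ours; the example vectors are made up):

```python
import numpy as np

def sigm(s):
    """Sigmoid: 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def predict_proba(theta, x):
    """p(y = 1 | x, theta) = sigm(theta^T x)."""
    return sigm(theta @ x)

theta = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, 1.0])
p1 = predict_proba(theta, x)
# The identity sigm(-s) = 1 - sigm(s) gives p(y = 0 | x, theta):
p0 = sigm(-(theta @ x))
print(np.isclose(p0 + p1, 1.0))  # True
```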
LOGISTIC REGRESSION
where: θ⊤x = ∑_{d=1}^{D} θ_d x_d
[Diagram: inputs x_1, …, x_D are weighted by θ_1, …, θ_D, summed, and passed through sigm to produce p(y = 1 | x, θ)]
∇_θ − log p(y | x, θ) = ∇_θ − ∑_i log p(y_i | x_i, θ)
 = ∇_θ − ∑_i ( y_i log sigm(θ⊤x_i) + (1 − y_i) log sigm(−θ⊤x_i) )
 = − ∑_i ( y_i (1 / sigm(θ⊤x_i)) sigm(θ⊤x_i) sigm(−θ⊤x_i) x_i − (1 − y_i) (1 / sigm(−θ⊤x_i)) sigm(θ⊤x_i) sigm(−θ⊤x_i) x_i )
 = − ∑_i ( y_i sigm(−θ⊤x_i) x_i − (1 − y_i) sigm(θ⊤x_i) x_i )
 = − ∑_i ( y_i (sigm(−θ⊤x_i) + sigm(θ⊤x_i)) x_i − sigm(θ⊤x_i) x_i )
 = − ∑_i ( y_i x_i − sigm(θ⊤x_i) x_i )
 = ∑_i ( sigm(θ⊤x_i) − y_i ) x_i
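The final expression can be verified against finite differences (a NumPy sketch with made-up data; `grad_nll` is our name for the derived gradient):

```python
import numpy as np

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_nll(theta, X, y):
    """Gradient of -log p(y | x, theta): sum_i (sigm(theta^T x_i) - y_i) x_i."""
    return X.T @ (sigm(X @ theta) - y)

# Finite-difference check on small random data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

def nll(theta):
    s = X @ theta
    return -np.sum(y * np.log(sigm(s)) + (1 - y) * np.log(sigm(-s)))

eps = 1e-6
numeric = np.array([
    (nll(theta + eps * e) - nll(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad_nll(theta, X, y), numeric, atol=1e-5))  # True
```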
θ := θ − α ∑_{i=1}^{N} ( sigm(θ⊤x_i) − y_i ) x_i
• What if N is large?
‣ Use mini-batches (M ≪ N)! (Stochastic Gradient Descent)
θ := θ − α ∑_{j=1}^{M} ( sigm(θ⊤x_j) − y_j ) x_j
‣ or even a single example (M = 1).
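The update rule above can be sketched as a minimal mini-batch SGD loop in NumPy (the toy dataset, learning rate, batch size, and epoch count are illustrative choices of ours):

```python
import numpy as np

def sigm(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)

# Toy linearly separable data (made up for illustration)
N, D = 200, 2
X = rng.normal(size=(N, D))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = np.zeros(D)
alpha, M = 0.1, 16  # learning rate and mini-batch size

for epoch in range(50):
    perm = rng.permutation(N)  # reshuffle each epoch
    for start in range(0, N, M):
        idx = perm[start:start + M]
        Xb, yb = X[idx], y[idx]
        # SGD update: theta := theta - alpha * sum_j (sigm(theta^T x_j) - y_j) x_j
        theta -= alpha * Xb.T @ (sigm(Xb @ theta) - yb)

acc = np.mean((sigm(X @ theta) > 0.5) == y.astype(bool))
print(acc)  # close to 1.0 on this separable problem
```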
FULLY-CONNECTED NEURAL NETWORKS
WHAT IF WE STACK MULTIPLE LOGISTIC REGRESSORS?
[Diagram: a single logistic regressor: inputs x_1, …, x_D with weights θ_1, …, θ_D, a sum, and a sigm unit producing p(y = 1 | x, θ)]
STACKING LOGISTIC REGRESSORS
[Diagram: inputs x_1, …, x_D feed M sigmoid units through weights W ∈ ℝ^{D×M}, producing hidden features h_1, …, h_M; these feed a final logistic regressor with weights θ ∈ ℝ^{M×1}, producing p(y = 1 | x, θ)]
FCN: COMPONENTS
LINEAR LAYERS
• A linear layer: a = Wx
• Initialization of W ∈ ℝ^{D×M}:
 • Gaussian: W ∼ 𝒩(0, σ²)
ACTIVATION FUNCTIONS
[Plots: sigmoid, tanh and ReLU, together with their gradients ∇f(x)]
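A NumPy sketch of a linear layer with Gaussian initialization followed by the three activations (with W ∈ ℝ^{D×M} as on the slide, we compute a = x⊤W so that a has M entries; the sizes and σ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 4, 3   # input and output sizes (illustrative)
sigma = 0.1

# Gaussian initialization: each entry W_ij ~ N(0, sigma^2)
W = rng.normal(0.0, sigma, size=(D, M))

x = rng.normal(size=D)
a = x @ W     # linear layer: M pre-activations

# The three activation functions from the slide:
sigmoid = 1.0 / (1.0 + np.exp(-a))
tanh = np.tanh(a)
relu = np.maximum(0.0, a)
print(a.shape, relu.min() >= 0.0)  # (3,) True
```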
HOW TO DEAL WITH MULTIPLE CLASSES?
LEARNING: BACKPROPAGATION
• We can use the chain rule to calculate gradients wrt all weights:
dℓ/du = (dℓ/df_1)(df_1/df_2)(df_2/du)
where ℓ is the negative log-likelihood.
Full derivation: Homework 🤓 Or wait till the next lecture!
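The chain rule can be checked on a tiny made-up composition u → f2 → f1 → ℓ (the concrete functions below are our illustrative choices, not from the lecture):

```python
import numpy as np

# Composed functions u -> f2 -> f1 -> loss:
u = 0.5
f2 = u**2            # f2(u) = u^2,      df2/du = 2u
f1 = np.exp(f2)      # f1(f2) = exp(f2), df1/df2 = exp(f2)
loss = f1**2         # l(f1) = f1^2,     dl/df1 = 2 f1

# Chain rule: dl/du = (dl/df1)(df1/df2)(df2/du)
analytic = (2 * f1) * np.exp(f2) * (2 * u)

# Finite-difference check of the same derivative
def forward(u):
    return np.exp(u**2)**2

eps = 1e-6
numeric = (forward(u + eps) - forward(u - eps)) / (2 * eps)
print(np.isclose(analytic, numeric))  # True
```

Backpropagation is exactly this bookkeeping applied systematically, layer by layer, to every weight in the network.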
Thank you!
EXTRA READING