
Deep Learning

Lecture 1 – Introduction

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen
Thies, Elgharib, Tewari, Theobalt and Niessner: Neural Voice Puppetry: Audio-driven Facial Reenactment. ECCV, 2020. 2
Flipped Classroom

https://uni-tuebingen.de/de/175884

3
Agenda

1.1 Organization

1.2 History of Deep Learning

1.3 Machine Learning Basics

4
1.1
Organization
Team

Prof. Dr.-Ing. Andreas Geiger
Bozidar Antic
Haoyu He

Lectures offered by our research group:


I Deep Learning (recommended for 1st semester MSc)
I Computer Vision (recommended for 2nd semester MSc)
I Self-Driving Cars (recommended for 3rd semester MSc)
6
Contents
Goal: Students gain an understanding of the theoretical and practical concepts of
deep neural networks, including optimization, inference, architectures and
applications. After this course, students should be able to develop and train deep
neural networks, reproduce research results and conduct original research.

I History of deep learning
I Linear/logistic regression
I Multi-layer perceptrons
I Backpropagation
I Loss and Activation Functions
I Optimization and Regularization
I Convolutional Neural Networks
I Sequence Models
I Natural Language Processing
I Graph Neural Networks
I Autoencoders
I Generative Adversarial Networks
7
Teaching Philosophy
What is Teaching?

I Knowledge transfer (bidirectional)


I Understanding
I Learning to learn
I Critical thinking
I Engaging students
I Fostering discussion
I Applying knowledge in practice
I Having some fun along the way ...

9
Flipped Classroom

Regular Classroom:
I Lecturer “reads” lecture in class
I Students digest content at home
I Little “quality time”, no fun

Flipped Classroom:
I Students watch lectures at home
I Content discussed during lecture
I Maximize interaction time
10
Flipped Classroom

https://uni-tuebingen.de/de/175884 11
Flipped Classroom
More time for:
I Interaction and discussion
I Answering questions on ILIAS
I Improving the materials
I Understanding the learning progress
I Developing new formats
(e.g., Interactive sessions, quizzes . . . )
I Implementing new tools
(e.g., Lecture quiz server, . . . )

13
Flipped Classroom

I High-quality lecture videos
I Available anytime, anywhere
I Slides available while watching
I Can be watched multiple times
I Adjustable speed, closed captions
I Questions ⇒ Forum & live sessions
14
Organization
Organization
I Lectures provided in advance via YouTube
I Students watch lecture before joining live session (mark time for this in calendar!)
I Students are encouraged to ask questions via the ILIAS forum and lecture quiz
I Live sessions (Wednesdays: 14:15-16:00)
I Lecture Q&A, Exercise introduction and Q&A, live exercise solving
I Will not be hybrid and not recorded (unless overflow)
I No tutorials by student assistants (this is a Master course!), instead we provide:
I Plenary feedback via Q&A sessions, individual help via weekly helpdesk (please use it!)
I Fast (<24h) feedback via ILIAS forum (please use it!)
I Exam
I Written (date and time available on course website)
I A bonus can be obtained upon participation in the quizzes
I Course Website with YouTube, Slides/Exercises, ILIAS, Quiz and Zoom links:
https://uni-tuebingen.de/de/175884
I Register to ILIAS and Lecture Quiz Server today to participate in this course! 16
Exercises
I Exercises are offered every 2 weeks, 6 assignments in total
I We will use EDF, a Python-only DL framework, and PyTorch
I Completion is voluntary and not a requirement to participate in the exam
I But: Empirically, performance in the exercises and the exam is highly correlated!

17
Exercises

I Problems handed out, introduced and discussed in the live sessions


I Can be conducted in groups of up to 2 students
I We offer a helpdesk every week via zoom to answer your questions
I In the past, we graded exercises, but students preferred receiving solutions instead
⇒ We will not grade the exercises; we will hand out and discuss the solutions instead
I But how do I stay motivated to watch the lectures and do the exercises?

18
AVG Lecture Quiz Server
I We provide quiz questions to students
during lectures 2-12 and exercises 1-6
I Students may gain up to 5 bonus points
for the exam (out of 50 exam points)
I Students collect AI-generated Pokémon!
I Answers may not be shared
I Participation is voluntary
I Opening & Deadline: Tuesdays, 3pm
I Register today ⇒ participation link
I Use student email, validate information
I Please report bugs directly to me
https://uni-tuebingen.de/de/175884 21
AVG Lecture Quiz Server: Student Questions

We appreciate your feedback as well:


I Questions stimulate thinking and provide valuable feedback to us
I Students are encouraged to ask a valid question regarding the lecture (1 point)
I Questions will be discussed during the interactive session ⇒ only ask if you join
I Also dare to ask “easy” questions, many others will be wondering as well
I Our goal is to build a relationship of trust and openness in this class!
23
AVG Lecture Quiz Server: Student Questions

[Figure: number of student questions per lecture, split into high-level and technical questions]

I Dare to ask any kind of question


I All questions are valuable: technical and high-level
I But do not only ask high-level questions
I Dare to ask “easy” and “technical” questions – they are usually not as easy as they seem
I Your questions will not negatively influence your exam grade
24
Live Sessions
The “Hybrid Lecture Paradox”
I Hybrid lectures are the future!
Students can join from anywhere!
I However, as it turns out:
I Students only join online
I But they prefer to join live
I ???
I Result: Severely reduced interaction
(bad audio, no faces, no spontaneity)
I Conclusion:
Live sessions will be live. Period.

25
Live Sessions

1. Lecture Q&A (45 min)


I Discussion of hard quiz questions
I Discussion of student questions
(join to hear your answer!)
I Discussion of live questions

2. Exercise Q&A (45 min)


I Exercise introduction
I Exercise Q&A and live solving
(start solving before joining!)
I Exercise discussion

26
Helpdesk
I We offer a weekly zoom helpdesk where
our TAs will provide individual support
I Ask any question about the exercise
I Share your screen to show a problem
I Start working on your exercise early!
I Let’s do a quick time poll:
I Mon, 10:00-12:00 I Fri, 10:00-12:00

I Mon, 14:00-16:00 I Fri, 14:00-16:00

I Mon, 17:00-19:00 I Fri, 16:00-18:00

I Mon, 19:00-21:00 I Fri, 18:00-20:00

27
What will the exam look like?
I Written main and make-up exam
I You may choose freely (but no 3rd exam)
I Registration via Quiz Server
I Only pen and ruler allowed (no notes)
I Duration: 90 minutes (can be solved in 60)
I 5 tasks, 10 points each; 25 points are sufficient to pass
I Bonus added only to passed exams
I Tasks cover both lectures and exercises
I Mix of knowledge, calculation, multiple choice
I Old exams available on ILIAS

More information: https://uni-tuebingen.de/de/175884 28


Bonus Points
I 11 lecture quizzes (10 points per quiz)
I 6 exercise quizzes (10 points per quiz)
I 170 quiz points in total
I Formula: bonus points = quiz points / 34, rounded to the nearest integer
I ≥ 17 quiz points ⇒ 1 bonus point
I ≥ 51 quiz points ⇒ 2 bonus points
I ≥ 85 quiz points ⇒ 3 bonus points
I ≥ 119 quiz points ⇒ 4 bonus points
I ≥ 153 quiz points ⇒ 5 bonus points
I 5 bonus points yield up to 2 grade steps
I Bonus added only to passed exams

29
Work Ethics
This lecture has 6 ECTS, corresponding to a total workload of ∼180 hours (MHB)

[Figure: weekly workload in hours over the 14 weeks of the semester, split into lecture videos, self-study, live sessions, exercises and quizzes, for three strategies: the diligent student (“4 wins” strategy), the relaxed student, and the stressed student]
30
Course Materials and Prerequisites
Course Materials
Books:
I Goodfellow, Bengio, Courville: Deep Learning
http://www.deeplearningbook.org

I Bishop: Pattern Recognition and Machine Learning


http://www.springer.com/gp/book/9780387310732

I Zhang, Lipton, Li, Smola: Dive into Deep Learning


http://d2l.ai

I Deisenroth, Faisal, Ong: Mathematics for Machine Learning


https://mml-book.github.io

I Petersen, Pedersen: The Matrix Cookbook


http://cs.toronto.edu/~bonner/courses/2018s/csc338/matrix_cookbook.pdf

I Unofficial lecture notes written by students in winter 2020/21


https://uni-tuebingen.de/de/175884
32
Course Materials
Courses:
I McAllester (TTI-C): Fundamentals of Deep Learning
http://mcallester.github.io/ttic-31230/Fall2020/

I Kolter, Chen (CMU): Deep Learning Systems


https://dlsyscourse.org/lectures/

I Leal-Taixe, Niessner (TUM): Introduction to Deep Learning


http://niessner.github.io/I2DL/

I Grosse (UoT): Intro to Neural Networks and Machine Learning


http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/

I Li (Stanford): Convolutional Neural Networks for Visual Recognition


http://cs231n.stanford.edu/

I Abbeel, Chen, Ho, Srinivas (Berkeley): Deep Unsupervised Learning


https://sites.google.com/view/berkeley-cs294-158-sp20/home
33
Course Materials
Tutorials:
I The Python Tutorial
https://docs.python.org/3/tutorial/

I NumPy Quickstart
https://numpy.org/devdocs/user/quickstart.html

I PyTorch Tutorial
https://pytorch.org/tutorials/

Frameworks / IDEs:
I Visual Studio Code
https://code.visualstudio.com/

I Google Colab
https://colab.research.google.com
34
Prerequisites

Math:
I Linear algebra, probability and information theory. If unsure, have a look at:
Goodfellow et al.: Deep Learning (Book), Chapters 1-4
Luxburg: Mathematics for Machine Learning (Lecture)
Deisenroth et al.: Mathematics for Machine Learning (Book)

Computer Science:
I Variables, functions, loops, classes, algorithms

Python:
I https://docs.python.org/3/tutorial/

35
Prerequisites

Linear Algebra:
I Vectors: x, y ∈ R^n
I Matrices: A, B ∈ R^{m×n}
I Operations: A^T, A^{-1}, Tr(A), det(A), A + B, AB, Ax, x^T y
I Norms: ‖x‖_1, ‖x‖_2, ‖x‖_∞, ‖A‖_F
I SVD: A = UDV^T

36
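For reference, here is a minimal NumPy sketch of the operations listed above (array contents and shapes are chosen arbitrarily for illustration):

```python
import numpy as np

# Vectors and matrices (shapes chosen arbitrarily for illustration)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.random.randn(3, 3)
B = np.random.randn(3, 3)

A_T = A.T                             # transpose A^T
A_inv = np.linalg.inv(A)              # inverse A^{-1} (assumes A is invertible)
trace = np.trace(A)                   # Tr(A)
det = np.linalg.det(A)                # det(A)
C = A + B                             # matrix sum
D = A @ B                             # matrix product AB
Ax = A @ x                            # matrix-vector product
inner = x @ y                         # inner product x^T y

norm_1 = np.linalg.norm(x, 1)         # ‖x‖_1
norm_2 = np.linalg.norm(x, 2)         # ‖x‖_2
norm_inf = np.linalg.norm(x, np.inf)  # ‖x‖_∞
norm_F = np.linalg.norm(A, 'fro')     # ‖A‖_F

U, d, Vt = np.linalg.svd(A)           # SVD: A = U D V^T (d holds the singular values)
```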
Prerequisites

Probability and Information Theory:
I Probability distributions: P(X = x)
I Marginal/conditional: p(x) = ∫ p(x, y) dy,  p(x, y) = p(x|y) p(y)
I Bayes rule: p(x|y) = p(y|x) p(x) / p(y)
I Conditional independence: x ⊥⊥ y | z ⇔ p(x, y|z) = p(x|z) p(y|z)
I Expectation: E_{x∼p}[f(x)] = ∫ p(x) f(x) dx
I Variance: Var(f(x)) = E[(f(x) − E[f(x)])^2]
I Distributions: Bernoulli, Categorical, Gaussian, Laplace
I Entropy: H(x), KL divergence: D_KL(p ‖ q)

37
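A small, self-contained NumPy example of some of these quantities for discrete distributions (all probability values are made up for illustration):

```python
import numpy as np

# Discrete distribution p(x) over three states and a function f(x)
p = np.array([0.2, 0.5, 0.3])
f = np.array([1.0, 4.0, 9.0])

expectation = np.sum(p * f)                    # E_{x~p}[f(x)]
variance = np.sum(p * (f - expectation) ** 2)  # Var(f(x)) = E[(f(x) - E[f(x)])^2]

# Bayes rule with scalar probabilities: p(x|y) = p(y|x) p(x) / p(y)
p_x, p_y_given_x, p_y = 0.01, 0.9, 0.05
p_x_given_y = p_y_given_x * p_x / p_y          # = 0.18

# Entropy and KL divergence between two discrete distributions
q = np.array([0.3, 0.4, 0.3])
entropy = -np.sum(p * np.log(p))               # H(x), natural log as in the lecture
kl = np.sum(p * np.log(p / q))                 # D_KL(p ‖ q) >= 0

print(expectation, variance, p_x_given_y, entropy, kl)
```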
Thank You!
Looking forward to our discussions
1.2
History of Deep Learning
A Brief History of Deep Learning
Three waves of development:
I 1940-1970: “Cybernetics” (Golden Age)
I Simple computational models of biological learning, simple learning rules
I 1980-2000: “Connectionism” (Dark Age)
I Intelligent behavior through large number of simple units, Backpropagation
I 2006-now: “Deep Learning” (Revolution Age)
I Deeper networks, larger datasets, more computation, state-of-the-art in many areas

[Timeline 1950–2020: Cybernetics, Connectionism, Deep Learning]
40
A Brief History of Deep Learning
1943: McCulloch and Pitts
I Early model for neural activation
I Linear threshold neuron (binary):

f_w(x) = +1 if w^T x ≥ 0, and −1 otherwise

I More powerful than AND/OR gates


I But no procedure to learn weights


McCulloch and Pitts: A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 1943. 41
A Brief History of Deep Learning
1958-1962: Rosenblatt’s Perceptron
I First algorithm and implementation
to train single linear threshold neuron
I Optimization of the perceptron criterion, where M denotes the set of misclassified samples (see the sketch after this slide):

L(w) = − ∑_{n∈M} w^T x_n y_n

I Novikoff proved convergence




Rosenblatt: The perceptron - a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. 42
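As an illustration of the criterion above, the following is a minimal sketch of the perceptron learning rule on a synthetic, linearly separable toy dataset (the data, learning rate, and epoch count are arbitrary; this is not Rosenblatt's original implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data with labels y in {-1, +1}
N = 100
x = rng.uniform(-1, 1, size=(N, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)

# Augment inputs with a constant 1 so the bias is part of w
X = np.hstack([np.ones((N, 1)), x])

# Perceptron learning rule: for a misclassified sample (y_n * w^T x_n <= 0),
# update w <- w + eta * x_n * y_n (stochastic gradient descent on the criterion)
w = np.zeros(3)
eta = 1.0
for _ in range(100):                  # epochs
    errors = 0
    for x_n, y_n in zip(X, y):
        if y_n * (w @ x_n) <= 0:      # misclassified
            w += eta * x_n * y_n
            errors += 1
    if errors == 0:                   # converged: all samples correctly classified
        break

print(w, errors)
```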
A Brief History of Deep Learning
1958-1962: Rosenblatt’s Perceptron
I First algorithm and implementation
to train single linear threshold neuron
I Overhyped: Rosenblatt claimed that
the perceptron would lead to computers
that walk, talk, see, write, reproduce
and are conscious of their existence


Rosenblatt: The perceptron - a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. 42
A Brief History of Deep Learning
1969: Minsky and Papert publish book
I Several discouraging results
I Showed that single-layer perceptrons
cannot solve some very simple
problems (XOR problem, counting)
I Symbolic AI research dominates 70s



Minsky and Papert: Perceptrons: An introduction to computational geometry. MIT Press, 1969. 43
A Brief History of Deep Learning
1979: Fukushima’s Neocognitron
I Inspired by Hubel and Wiesel
experiments in the 1950s
I Study of visual cortex in cats
I Found that cells are sensitive to
orientation of edges but insensitive
to their position (simple vs. complex)
I H&W received the Nobel Prize in 1981



Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position. IECE (in Japanese), 1979. 44
A Brief History of Deep Learning
1979: Fukushima’s Neocognitron
I Multi-layer processing
to create intelligent behavior
I Simple (S) and complex (C) cells
implement convolution and pooling
I Reinforcement based learning
I Inspiration for modern ConvNets



Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position. IECE (in Japanese), 1979. 44
A Brief History of Deep Learning
1986: Backpropagation Algorithm
I Efficient calculation of gradients in a
deep network wrt. network weights
I Enables application of gradient
based learning to deep networks
I Known since 1961, but
first empirical success in 1986

I Remains main workhorse today
Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986. 45
A Brief History of Deep Learning
1997: Long Short-Term Memory
I In 1991, Hochreiter demonstrated the
problem of vanishing/exploding
gradients in his Diploma Thesis
I Led to the development of long short-term memory for sequence modeling
I Uses feedback and forget/keep gate

Hochreiter, Schmidhuber: Long short-term memory. Neural Computation, 1997. 46
A Brief History of Deep Learning
1997: Long Short-Term Memory
I In 1991, Hochreiter demonstrated the
problem of vanishing/exploding
gradients in his Diploma Thesis
I Led to the development of long short-term memory for sequence modeling
I Uses feedback and forget/keep gate
I Revolutionized NLP (e.g. at Google)
many years later (2015)

Hochreiter, Schmidhuber: Long short-term memory. Neural Computation, 1997. 46
A Brief History of Deep Learning
1998: Convolutional Neural Networks
I Similar to Neocognitron, but trained
end-to-end using backpropagation
I Implements spatial invariance via
convolutions and max-pooling
I Weight sharing reduces parameters
I Tanh/Softmax activations
I Good results on MNIST

I But did not scale up (yet)
LeCun, Bottou, Bengio, Haffner: Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 47
A Brief History of Deep Learning
2009-2012: ImageNet and AlexNet
ImageNet
I Recognition benchmark (ILSVRC)
I 10 million annotated images
I 1000 categories
AlexNet
I First neural network to win ILSVRC via GPU training, deep models, data
Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012. 48
A Brief History of Deep Learning
2009-2012: ImageNet and AlexNet
ImageNet
I Recognition benchmark (ILSVRC)
I 10 million annotated images
I 1000 categories
AlexNet
I First neural network to win ILSVRC via GPU training, deep models, data
I Sparked deep learning revolution
Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012. 48
A Brief History of Deep Learning
2012-now: Golden Age of Datasets
I KITTI, Cityscapes: Self-driving
I PASCAL, MS COCO: Recognition
I ShapeNet, ScanNet: 3D DL
I GLUE: Language understanding
I Visual Genome: Vision/Language
I VisualQA: Question Answering
I MITOS: Breast cancer

Geiger, Lenz and Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR, 2012. 49
A Brief History of Deep Learning
2012-now: Synthetic Data
I Annotating real data is expensive
I Led to surge of synthetic datasets
I Creating 3D assets is also costly

Dosovitskiy et al.: FlowNet: Learning Optical Flow with Convolutional Networks. ICCV, 2015. 50
A Brief History of Deep Learning
2012-now: Synthetic Data
I Annotating real data is expensive
I Led to surge of synthetic datasets
I Creating 3D assets is also costly
I But even very simple 3D datasets
proved tremendously useful for
pre-training (e.g., in optical flow)

Dosovitskiy et al.: FlowNet: Learning Optical Flow with Convolutional Networks. ICCV, 2015. 50
A Brief History of Deep Learning
2014: Generalization
I Empirical demonstration that deep
representations generalize well
despite large number of parameters
I Pre-train CNN on large amounts of
data on generic task (e.g., ImageNet)
I Fine-tune (re-train) only last layers on
few data of a new task

I State-of-the-art performance
Razavian, Azizpour, Sullivan, Carlsson: CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. CVPR Workshops, 2014. 51
A Brief History of Deep Learning
2014: Visualization
I Goal: provide insights into what the
network (black box) has learned
I Visualized image regions that most
strongly activate various neurons at
different layers of the network
I Found that higher levels capture
more abstract semantic information

Zeiler and Fergus: Visualizing and Understanding Convolutional Networks. ECCV, 2014. 52
A Brief History of Deep Learning
2014: Adversarial Examples
I Accurate image classifiers can be
fooled by imperceptible changes
I Adversarial example:

x + argmin_{Δx} { ‖Δx‖_2 : f(x + Δx) ≠ f(x) }

I All images classified as “ostrich”


Szegedy et al.: Intriguing properties of neural networks. ICLR, 2014. 53
A Brief History of Deep Learning
2014: Domination of Deep Learning
I Machine translation (Seq2Seq)



Sutskever, Vinyals, Quoc: Sequence to Sequence Learning with Neural Networks. NIPS, 2014. 54
A Brief History of Deep Learning
2014: Domination of Deep Learning
I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images



Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 54
A Brief History of Deep Learning
2014: Domination of Deep Learning
I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images



Zhang, Goodfellow, Metaxas, Odena: Self-Attention Generative Adversarial Networks. ICML, 2019. 54
A Brief History of Deep Learning
2014: Domination of Deep Learning
I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images
I Graph Neural Networks (GNNs)
revolutionize the prediction of
molecular properties



Duvenaud et al.: Convolutional Networks on Graphs for Learning Molecular Fingerprints. NIPS 2015. 54
A Brief History of Deep Learning
2014: Domination of Deep Learning
I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images
I Graph Neural Networks (GNNs)
revolutionize the prediction of
molecular properties
I Dramatic gains in vision and speech
(Moore’s Law of AI)



Duvenaud et al.: Convolutional Networks on Graphs for Learning Molecular Fingerprints. NIPS 2015. 54
A Brief History of Deep Learning
2015: Deep Reinforcement Learning
I Learning a policy (state→action)
through random exploration and
reward signals (e.g., game score)
I No other supervision
I Success on many Atari games
I But some games remain hard

Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 55
A Brief History of Deep Learning
2016: WaveNet
I Deep generative model
of raw audio waveforms
I Generates speech which
mimics human voice
I Generates music

Oord et al.: WaveNet: A Generative Model for Raw Audio. Arxiv, 2016. 56
A Brief History of Deep Learning
2016: Style Transfer
I Manipulate photograph to adopt
style of another image (painting)
I Uses deep network pre-trained on
ImageNet for disentangling
content from style
I It is fun! Try yourself:
https://deepart.io/

Gatys, Ecker and Bethge: Image Style Transfer Using Convolutional Neural Networks. CVPR, 2016. 57
A Brief History of Deep Learning
2016: AlphaGo defeats Lee Sedol
I Developed by DeepMind
I Combines deep learning with
Monte Carlo tree search
I First computer program to
defeat professional player
I AlphaZero (2017) learns via self-play
and masters multiple games

Silver et al.: Mastering the game of Go without human knowledge. Nature, 2017. 58
A Brief History of Deep Learning
2017: Mask R-CNN
I Deep neural network for joint object
detection and instance segmentation
I Outputs “structured object”, not only
a single number (class label)
I State-of-the-art on MS-COCO

He, Gkioxari, Dollár and Ross Girshick: Mask R-CNN. ICCV, 2017. 59
A Brief History of Deep Learning
2017-2018: Transformers and BERT
I Transformers: Attention replaces
recurrence and convolutions

Vaswani et al.: Attention is All you Need. NIPS 2017. 60
A Brief History of Deep Learning
2017-2018: Transformers and BERT
I Transformers: Attention replaces
recurrence and convolutions
I BERT: Pre-training of language
models on unlabeled text

Devlin, Chang, Lee and Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Arxiv, 2018. 60
A Brief History of Deep Learning
2017-2018: Transformers and BERT
I Transformers: Attention replaces
recurrence and convolutions
I BERT: Pre-training of language
models on unlabeled text
I GLUE: Superhuman performance on
some language understanding tasks
(paraphrase, question answering, ..)

I But: Computers still fail in dialogue
Wang et al.: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR, 2019. 60
A Brief History of Deep Learning
2018: Turing Award
In 2018, the “Nobel Prize of computing”
was awarded to:
I Yoshua Bengio
I Geoffrey Hinton
I Yann LeCun

61
A Brief History of Deep Learning
2016-2020: 3D Deep Learning
I First models to successfully output
3D representations
I Voxels, point clouds, meshes,
implicit representations
I Prediction of 3D models
even from a single image
I Geometry, materials, light, motion

Niemeyer, Mescheder, Oechsle, Geiger: Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision. CVPR, 2020. 62
A Brief History of Deep Learning
2020: GPT-3
I Language model by OpenAI
I 175 Billion parameters
I Text-in / text-out interface
I Many use cases: coding, poetry,
blogging, news articles, chatbots
I Controversial discussions
I Licensed exclusively to Microsoft
on September 22, 2020

Brown et al.: Language Models are Few-Shot Learners. Arxiv, 2020. 63
A Brief History of Deep Learning
Current Challenges
I Un-/Self-Supervised Learning
I Interactive learning
I Accuracy (e.g., self-driving)
I Robustness and generalization
I Inductive biases
I Understanding and mathematics
I Memory and compute
I Ethics and legal questions
I Does “Moore’s Law of AI” continue?
64
1.3
Machine Learning Basics
Goodfellow et al.: Deep Learning, Chapter 5
http://www.deeplearningbook.org/contents/ml.html
Learning Problems
I Supervised learning
I Learn model parameters using a dataset of data-label pairs {(x_i, y_i)}_{i=1}^N
I Examples: Classification, regression, structured prediction
I Unsupervised learning
I Learn model parameters using a dataset without labels {x_i}_{i=1}^N
I Examples: Clustering, dimensionality reduction, generative models
I Self-supervised learning
I Learn model parameters using a dataset of data-data pairs {(x_i, x_i′)}_{i=1}^N
I Examples: Self-supervised stereo/flow, contrastive learning
I Reinforcement learning
I Learn model parameters using active exploration from sparse rewards
I Examples: Deep Q-learning, policy gradients, actor-critic

67
Supervised Learning
Classification, Regression, Structured Prediction
Classification / Regression:

f : X → N    or    f : X → R

I Inputs x ∈ X can be any kind of objects


I images, text, audio, sequence of amino acids, . . .
I Output y ∈ N/y ∈ R is a discrete or real number
I classification, regression, density estimation, . . .

Structured Output Learning:


f : X → Y
I Inputs x ∈ X can be any kind of objects
I Outputs y ∈ Y are complex (structured) objects
I images, text, parse trees, folds of a protein, computer programs, . . .
69
Supervised Learning

Input Model Output

70
Supervised Learning

Input Model Output

I Learning: Estimate parameters w from training data {(x_i, y_i)}_{i=1}^N

70
Supervised Learning

Input Model Output

I Learning: Estimate parameters w from training data {(x_i, y_i)}_{i=1}^N
I Inference: Make novel predictions: y = fw (x)

70
Classification

Input Model Output

"Beach"

I Mapping: f_w : R^{W×H} → {“Beach”, “No Beach”}

70
Regression

Input Model Output

143,52 €

I Mapping: f_w : R^N → R

70
Structured Prediction

Input Model Output

"Das Pferd
frisst keinen
Gurkensalat."

I Mapping: f_w : R^N → {1, . . . , C}^M

70
Structured Prediction

Input Model Output

Can
Monkey

I Mapping: f_w : R^{W×H} → {1, . . . , C}^{W×H}

70
Structured Prediction

Input Model Output

I Mapping: f_w : R^{W×H×N} → {0, 1}^{M^3}
I Suppose: 32^3 voxels, binary variable per voxel (occupied/free)
I Question: How many different reconstructions? 2^{32^3} = 2^{32768}
I Comparison: Number of atoms in the universe? ∼ 2^{273}
70
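A quick sanity check of this counting argument in plain Python (the 32^3 grid size and the 2^273 atom estimate are taken from the slide):

```python
# Number of binary occupancy configurations of a 32^3 voxel grid
num_voxels = 32 ** 3                      # 32768 voxels
num_reconstructions = 2 ** num_voxels     # 2^(32^3) = 2^32768

atoms_in_universe = 2 ** 273              # rough bound used on the slide

print(num_voxels)                                # 32768
print(len(str(num_reconstructions)))             # roughly 9865 decimal digits
print(num_reconstructions > atoms_in_universe)   # True
```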
Linear Regression
Linear Regression
Let X denote a dataset of size N and let (xi , yi ) ∈ X denote its elements (yi ∈ R).
Goal: Predict y for a previously unseen input x. The input x may be multidimensional.
[Figure: ground-truth function and noisy observations (green) for x ∈ [−1, 1]]
72
Linear Regression
The error function E(w) measures the displacement along the y dimension between
the data points (green) and the model f (x, w) (red) specified by the parameters w.

f(x, w) = w^T x

E(w) = ∑_{i=1}^N (f(x_i, w) − y_i)^2
     = ∑_{i=1}^N (x_i^T w − y_i)^2
     = ‖Xw − y‖_2^2

[Figure: linear fit (red) to the noisy observations (green); errors measured along the y dimension]

Here: x = [1, x]^T ⇒ f(x, w) = w_0 + w_1 x


Linear Regression
The gradient of the error function with respect to the parameters w is given by:

∇_w E(w) = ∇_w ‖Xw − y‖_2^2
         = ∇_w (Xw − y)^T (Xw − y)
         = ∇_w (w^T X^T X w − 2 w^T X^T y + y^T y)
         = 2 X^T X w − 2 X^T y

As E(w) is quadratic and convex in w, its minimizer (wrt. w) is given in closed form:

∇_w E(w) = 0  ⇒  w = (X^T X)^{-1} X^T y

The matrix (X^T X)^{-1} X^T is also called the Moore-Penrose inverse or pseudoinverse.
74
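A minimal NumPy sketch of this closed-form solution for the line-fitting example (the synthetic data follows the running sin(2πx) example; the noise level and sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, as in the running example: y = sin(2*pi*x) + noise
N = 10
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

# Feature matrix for a line fit: x_i = [1, x_i]^T  =>  f(x, w) = w0 + w1 * x
X = np.stack([np.ones_like(x), x], axis=1)

# Closed-form least squares solution w = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable: np.linalg.lstsq solves the same least squares problem
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_closed, w_lstsq)  # both give [w0, w1]
```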
Example: Line Fitting
Line Fitting
Linear least squares fit of the model f(x, w) = w_0 + w_1 x (red) to the data points (green). Errors are also shown in red. Right: error function E(w) wrt. parameter w_1.
76
Example: Polynomial Curve Fitting
Polynomial Curve Fitting

Let us choose a polynomial of order M to model dataset X :

f(x, w) = ∑_{j=0}^M w_j x^j = w^T x    with features x = (1, x^1, x^2, . . . , x^M)^T

Tasks:
I Training: Estimate w from dataset X
I Inference: Predict y for novel x given estimated w
Note:
I Features can be anything, including multi-dimensional inputs (e.g., images, audio),
radial basis functions, sine/cosine functions, etc. In this example: monomials.

78
Polynomial Curve Fitting

Let us choose a polynomial of order M to model the dataset X :

f(x, w) = ∑_{j=0}^M w_j x^j = w^T x    with features x = (1, x^1, x^2, . . . , x^M)^T

How can we estimate w from X ?

79
Polynomial Curve Fitting
The error function from above is quadratic in w but not in x:
 2
N N  2 N M
wj xji − yi 
X X X X
E(w) = (f (xi , w) − yi )2 = w> xi − yi = 
i=1 i=1 i=1 j=0

It can be rewritten in the matrix-vector notation (i.e., as linear regression problem)

E(w) = kXw − yk22

with feature matrix X, observation vector y and weight vector w:


 . . .. ..   .   
.. .. . . .. w0
 . 
. 
   
2 M
X =  1 xi xi . . . xi y= w=
 yi  . 
  
 
.. .. .. .. ..
. . . . . wM
80
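A short sketch of polynomial curve fitting as linear regression in NumPy, building the feature matrix from monomial features (the degree, noise level, and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the running example: y = sin(2*pi*x) + noise
N = 10
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

M = 3  # polynomial degree (model capacity)

# Feature matrix with rows (1, x_i, x_i^2, ..., x_i^M)
X = np.vander(x, M + 1, increasing=True)

# Training: least squares fit, minimizing ||Xw - y||_2^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Inference: predict y for novel inputs
x_new = np.linspace(0.0, 1.0, 100)
y_pred = np.vander(x_new, M + 1, increasing=True) @ w
```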
Polynomial Curve Fitting Results
Polynomial Curve Fitting
[Figure: polynomial fits (red) for M = 0 (left) and M = 1 (right), with training data (green) and test set]

Plots of polynomials of various degrees M (red) fitted to the data (green). We observe
underfitting (M = 0/1) and overfitting (M = 9). This is a model selection problem.
82
Polynomial Curve Fitting
[Figure: polynomial fits (red) for M = 3 (left) and M = 9 (right), with training data (green) and test set]

Plots of polynomials of various degrees M (red) fitted to the data (green). We observe
underfitting (M = 0/1) and overfitting (M = 9). This is a model selection problem.
82
Capacity, Overfitting and Underfitting
Goal:
I Perform well on new, previously unseen inputs (test set, blue), not only on the training set (green)
I This is called generalization and separates ML from optimization
I Assumption: training and test data are drawn independently and identically (i.i.d.) from a distribution p_data(x, y)
I Here: p_data(x) = U(0, 1), p_data(y|x) = N(sin(2πx), σ)

[Figure: training set (green) and test set (blue) samples of the ground-truth function]

83
Capacity, Overfitting and Underfitting
Terminology:
I Capacity: Complexity of functions which can be represented by model f
I Underfitting: Model too simple, does not achieve low error on training set
I Overfitting: Training error small, but test error (= generalization error) large
[Figure: polynomial fits for M = 1 (capacity too low), M = 3 (capacity about right), and M = 9 (capacity too high)]


84
Capacity, Overfitting and Underfitting
Example: Generalization error for various polynomial degrees M
I Model selection: Select model with the smallest generalization error
[Figure: training error and generalization error (log scale) as a function of the polynomial degree M]
Capacity, Overfitting and Underfitting
General Approach: Split dataset into training, validation and test set
I Choose hyperparameters (e.g., degree of polynomial, learning rate in neural net, ..)
using validation set. Important: Evaluate once on test set (typically not available).
[Figure: example split of the dataset into 60% training, 20% validation, and 20% test]

I When dataset is small, use (k-fold) cross validation instead of fixed split.
86
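A sketch of this model selection procedure, assuming the polynomial regression setup from the previous slides (split sizes, candidate degrees, and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset from the running example
N = 50
x = rng.uniform(0.0, 1.0, size=N)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)

# 60% / 20% / 20% split into training, validation, and test set
idx = rng.permutation(N)
train, val, test = idx[:30], idx[30:40], idx[40:]

def fit(x, y, M):
    X = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(x, y, w, M):
    X = np.vander(x, M + 1, increasing=True)
    return np.mean((X @ w - y) ** 2)

# Model selection: pick the degree M with the smallest validation error
degrees = range(10)
val_errors = []
for M in degrees:
    w = fit(x[train], y[train], M)
    val_errors.append(mse(x[val], y[val], w, M))
best_M = int(np.argmin(val_errors))

# Evaluate once on the test set with the selected degree
w_best = fit(x[train], y[train], best_M)
print(best_M, mse(x[test], y[test], w_best, best_M))
```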
Ridge Regression
Ridge Regression
Polynomial Curve Model:
f(x, w) = ∑_{j=0}^M w_j x^j = w^T x    with features x = (1, x^1, x^2, . . . , x^M)^T

Ridge Regression:

E(w) = ∑_{i=1}^N (f(x_i, w) − y_i)^2 + λ ∑_{j=0}^M w_j^2
     = ‖Xw − y‖_2^2 + λ ‖w‖_2^2

I Idea: Discourage large parameters by adding a regularization term with strength λ
I Closed form solution: w = (X^T X + λI)^{-1} X^T y (see the sketch below)
88
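A minimal sketch of this closed-form ridge solution (the data, degree, and regularization strengths are illustrative):

```python
import numpy as np

def ridge_fit(x, y, M, lam):
    """Closed-form ridge regression with monomial features (illustrative sketch)."""
    X = np.vander(x, M + 1, increasing=True)   # rows (1, x_i, ..., x_i^M)
    A = X.T @ X + lam * np.eye(M + 1)          # X^T X + lambda * I
    return np.linalg.solve(A, X.T @ y)         # w = (X^T X + lambda I)^{-1} X^T y

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

w_weak = ridge_fit(x, y, M=9, lam=1e-8)    # weak regularization: large weights, overfits
w_strong = ridge_fit(x, y, M=9, lam=1e3)   # strong regularization: small weights, underfits
print(np.abs(w_weak).max(), np.abs(w_strong).max())
```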
Ridge Regression

Plots of a polynomial of degree M = 9 fitted to 10 data points using ridge regression.
Left: weak regularization (λ = 10^{-8}). Right: strong regularization (λ = 10^3).
89
Ridge Regression
[Figure: model weights (left) and training/generalization error (right) as a function of the regularization weight λ]

Left: With low regularization, parameters can become very large (ill-conditioning).
Right: Select model with the smallest generalization error on the validation set.
90
Estimators, Bias and Variance
Estimators, Bias and Variance

Point Estimator:
I A point estimator g(·) is a function that maps a dataset X to model parameters ŵ:

ŵ = g(X )

I Example: Estimator of the ridge regression model: ŵ = (X^T X + λI)^{-1} X^T y


I We use the hat notation to denote that ŵ is an estimate
I A good estimator is a function that returns a parameter set close to the true one
I The data X = {(xi , yi )} is drawn from a random process (xi , yi ) ∼ pdata (·)
I Thus, any function of the data is random and ŵ is a random variable.

92
Estimators, Bias and Variance
Properties of Point Estimators:

Bias:
Bias(ŵ) = E(ŵ) − w
I Expectation over datasets X
I ŵ is unbiased ⇔ Bias(ŵ) = 0
I A good estimator has little bias

Variance:
Var(ŵ) = E(ŵ^2) − E(ŵ)^2
I Variance over datasets X
I √Var(ŵ) is called “standard error”
I A good estimator has low variance

Bias-Variance Dilemma:
I Statistical learning theory tells us that we can’t have both ⇒ there is a trade-off

93
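A Monte Carlo sketch that estimates the bias and variance of the ridge estimator by repeatedly sampling datasets from an assumed data-generating process (the true parameters, noise level, and λ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# True linear model y = w0 + w1 * x (true parameters chosen for illustration)
w_true = np.array([0.5, -1.0])

def sample_dataset(n=20, sigma=0.3):
    x = rng.uniform(-1.0, 1.0, size=n)
    X = np.stack([np.ones_like(x), x], axis=1)
    y = X @ w_true + sigma * rng.standard_normal(n)
    return X, y

def ridge_estimator(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Monte Carlo estimate of bias and variance over many sampled datasets
lam = 1.0
estimates = np.array([ridge_estimator(*sample_dataset(), lam) for _ in range(5000)])

bias = estimates.mean(axis=0) - w_true   # Bias(w_hat) = E[w_hat] - w
variance = estimates.var(axis=0)         # Var(w_hat), per parameter
print(bias, variance)  # regularization shrinks towards 0: nonzero bias, reduced variance
```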
Estimators, Bias and Variance

Ridge regression with weak (λ = 10^{-8}) and strong (λ = 10) regularization.
Green: true model. Black: model with mean parameters w̄ = E(ŵ).
94
Estimators, Bias and Variance

I There is a bias-variance tradeoff: E[(ŵ − w)^2] = Bias(ŵ)^2 + Var(ŵ)


I Or not? In deep neural networks the test error decreases with network width!
https://www.bradyneal.com/bias-variance-tradeoff-textbooks-update
Neal et al.: A Modern Take on the Bias-Variance Tradeoff in Neural Networks. ICML Workshops, 2019. 95
Maximum Likelihood Estimation
Maximum Likelihood Estimation
I We now reinterpret our results by taking a probabilistic viewpoint
I Let X = {(x_i, y_i)}_{i=1}^N be a dataset with samples drawn i.i.d. from p_data
I Let the model pmodel (y|x, w) be a parametric family of probability distributions
I The conditional maximum likelihood estimator for w is given by

ŵ_ML = argmax_w p_model(y|X, w)
     = argmax_w ∏_{i=1}^N p_model(y_i|x_i, w)        (i.i.d.)
     = argmax_w ∑_{i=1}^N log p_model(y_i|x_i, w)     (log-likelihood)
Note: In this lecture we consider log to be the natural logarithm. 97
Maximum Likelihood Estimation
Example: Assuming p_model(y|x, w) = N(y | w^T x, σ), we obtain

ŵ_ML = argmax_w ∑_{i=1}^N log p_model(y_i|x_i, w)
     = argmax_w ∑_{i=1}^N log [ (1/√(2πσ^2)) exp(−(w^T x_i − y_i)^2 / (2σ^2)) ]
     = argmax_w [ −∑_{i=1}^N (1/2) log(2πσ^2) − ∑_{i=1}^N (1/(2σ^2)) (w^T x_i − y_i)^2 ]
     = argmax_w −∑_{i=1}^N (w^T x_i − y_i)^2
     = argmin_w ‖Xw − y‖_2^2
98
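A small numerical check of this equivalence (a sketch with arbitrary synthetic data and a fixed σ): minimizing the Gaussian negative log-likelihood numerically recovers the least squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Arbitrary linear-regression data
N = 100
X = np.stack([np.ones(N), rng.uniform(-1, 1, N)], axis=1)
y = X @ np.array([0.5, -1.0]) + 0.3 * rng.standard_normal(N)
sigma = 0.3

def gaussian_nll(w):
    """Negative log-likelihood of y under N(y | w^T x, sigma)."""
    r = X @ w - y
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + np.sum(r**2) / (2 * sigma**2)

# Maximum likelihood estimate via numerical minimization of the NLL
w_ml = minimize(gaussian_nll, x0=np.zeros(2)).x

# Least squares estimate via np.linalg.lstsq
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_ml, w_ls)  # the two estimates agree up to numerical tolerance
```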
Maximum Likelihood Estimation

We see that choosing p_model(y|x, w) to be Gaussian causes maximum likelihood to
yield exactly the same least squares estimator derived before:

ŵ = argmin_w ‖Xw − y‖_2^2

Variations:
I If we were to choose p_model(y|x, w) as a Laplace distribution, we would obtain an
estimator that minimizes the ℓ1 norm: ŵ = argmin_w ‖Xw − y‖_1
I Assuming a Gaussian distribution over the parameters w and performing a
maximum a-posteriori (MAP) estimation yields ridge regression:

argmax_w p(w|y, x) = argmax_w p(y|x, w) p(w)
99
Maximum Likelihood Estimation

We see that choosing p_model(y|x, w) to be Gaussian causes maximum likelihood to
yield exactly the same least squares estimator derived before:

ŵ = argmin_w ‖Xw − y‖_2^2

Remarks:
I Consistency: As the number of training samples approaches infinity N → ∞,
the maximum likelihood (ML) estimate converges to the true parameters
I Efficiency: The ML estimate converges most quickly as N increases
I These theoretical considerations make ML estimators appealing

99
