
ML: Lecture 1

Introduction to Machine Learning

Panagiotis Papapetrou, PhD


Professor, Stockholm University
Course logistics
• Responsible teacher:
– Panagiotis Papapetrou - panagiotis@dsv.su.se

• Course Assistant:
– Zed Lee - zed.lee@dsv.su.se

• Schedule:
– 12 Lectures
– 5 Lab sessions
– Written Exam: Mar 14

• Guest lecturers:
– Ioanna Miliou - ioanna.miliou@dsv.su.se
– Korbinian Randl - korbinian.randl@dsv.su.se
– Alejandro Kuratomi - alejandro.kuratomi@dsv.su.se
– Sindri Magnusson - sindri.magnusson@dsv.su.se

• Office hours: by appointment only


Syllabus
Jan 15 Introduction to machine learning
Jan 17 Regression analysis
Jan 18 Laboratory session I: numpy and linear regression
Jan 22 Machine learning pipelines
Jan 25 Laboratory session II: ML pipelines and grid search
Jan 29 Training neural networks
Jan 31* Laboratory session III: training NNs and tensorflow basics
Feb 6 Convolutional neural networks
Feb 7 Recurrent neural networks and autoencoders
Feb 12 Time series classification
Feb 15 Laboratory session IV: CNNs and RNNs
Feb 19 Deep time series classification

*to be moved to Jan 31 from Feb 1.


Syllabus
Feb 20 Attention mechanisms
Feb 22 Laboratory session V: Time series classification
Feb 26 Transformers
Mar 4 Explainable machine learning
Mar 6 Reinforcement learning
Mar 14 Examination
Apr 29 Re-take Examination
Course workload

• Assignments 3hp
– Three programming assignments (python and
tensorflow)

• Written Exam 4.5hp


Assignments
• To be done individually
• Will involve python/tensorflow programming
• Five lab sessions
• Tutoring sessions will be offered
• Grading scheme: Pass – Fail
• To pass: you should pass each assignment with full
points
• You have two attempts per assignment + a re-take!
Exam
• Two parts:
– Part A: multiple-choice questions
– Part B: free-text questions

• It will examine both your knowledge of several concepts discussed in the
lectures, as well as your critical ability to apply what you have learned
• Will include some coding tasks
• To pass you need to obtain at least 60% of the total points
• Grading scheme: A – F
Learning Objectives
• Describe, implement and apply machine learning methods
and techniques for data exploration and data analysis
• Reason about the selected algorithms and solutions for
machine learning
• Implement and apply machine learning methods to large
and complex amounts of data
• Describe, implement, and apply ML solutions for a given
problem, as well as evaluate ML results
• Become fluent in python and tensorflow
Textbooks
• Elements of Statistical Learning
Edition: 2nd Publisher: Springer
ISBN: 0387848576

• Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems
Edition: 2nd Publisher: O'Reilly
ISBN: 978-1492032649

• Python Machine Learning
Edition: 3rd
ISBN: 978-1-78995-575-0

Research papers (pointers will be provided)


Above all
• The goal of the course is to learn and enjoy

• The basic principle is to ask questions when you don’t understand

• Say when things are unclear; not everything can be clear from the beginning

• Participate in the class as much as possible


Today
• What is machine learning?

• What are the different types of machine learning methods?

• What are examples of machine learning tasks in connection to the methods?

• What is the bias-variance trade-off?

• How do we evaluate machine learning models?


Machine Learning
“Learning is any process by which a system improves
performance from experience.”
- Herbert Simon

“Learning is the study of algorithms that improve
their performance with experience given a task.”
- Tom Michael Mitchell
Real-world examples of Machine Learning

*https://www.aiiottalk.com/machine-learning-examples-in-real-life/
Types of Machine Learning

• Supervised learning
• Unsupervised learning
• Semi-supervised learning
• Active learning
• Reinforcement learning
• Federated learning
• …
Supervised Learning
• Setup: a model is trained using labeled data, where the input is
paired with the correct output labels, with the objective of
learning to predict the output label, given an input data example.
• Learning: by analyzing a training set of labeled examples and
adjusting the model parameters in order to minimize the error
between the model’s predictions and the correct output labels.

Drawbacks:
• Labelled data is required
• Susceptible to bias
• Generalization issues
• May require feature engineering
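
A minimal sketch of this setup with scikit-learn (the library used in the labs); the dataset and the choice of model here are illustrative assumptions:

```python
# Minimal supervised-learning sketch: labeled examples in, a predictive model out.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)          # inputs X paired with correct output labels y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)            # parameters are adjusted to minimize the error
model.fit(X_train, y_train)                          # learn from the labeled training examples
print("accuracy on unseen examples:", model.score(X_test, y_test))
```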
Unsupervised Learning
• Setup: a model is trained using unlabeled data, with the objective
of finding underlying hidden structures and patterns in the data.
• Learning: by analyzing a training set of unlabeled examples and
adjusting the model parameters in order to maximize a learning
objective.

Drawbacks:
• Lack of ground truth
• Difficult to evaluate
• Curse of dimensionality
• Computationally intensive
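
A minimal unsupervised sketch, assuming k-means clustering on toy data as the example task:

```python
# Minimal unsupervised-learning sketch: find hidden structure (clusters) in unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels are ignored: data is "unlabeled"

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)                                    # parameters adjusted to minimize within-cluster variance
print(kmeans.cluster_centers_)                   # the discovered structure: three cluster centers
```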
Semi-supervised Learning
• Setup: a model is trained using both labelled and unlabeled data,
with the objective of learning to predict the output label given an
input data example.
• Learning: by iteratively training the model using the most
confident predictions, and augmenting the training dataset.

Drawbacks:
• Sensitive to data quality
• Difficult to evaluate
• Difficult to interpret
• Computationally intensive
Semi-supervised learning: self-learning

• Pick a small amount of labeled data and use this dataset to train a base model.
• Then apply the process known as pseudo-labeling: take the partially
trained model and use it to make predictions for the unlabeled data.
• Take the most confident predictions made by the model and add them to
the labeled dataset, creating a new, combined input to train an improved model.
• Repeat the process for several iterations (a code sketch follows below).
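
A minimal sketch of the self-training loop described above; the toy data, base model, and confidence threshold are assumptions:

```python
# Sketch of self-training / pseudo-labeling: retrain on the most confident predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                              # only a small amount of labeled data to start from

model = LogisticRegression(max_iter=1000)
for _ in range(5):                               # repeat for several iterations
    model.fit(X[labeled], y[labeled])            # (re-)train on the current labeled pool
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) > 0.95         # keep only the most confident predictions
    if not confident.any():
        break
    idx = np.where(~labeled)[0][confident]
    y[idx] = proba[confident].argmax(axis=1)     # pseudo-label them...
    labeled[idx] = True                          # ...and add them to the labeled dataset
```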
Semi-supervised learning: co-learning
• Train two individual classifiers
based on two views of data.
• Views: different sets of features
but same instances.
• The bigger pool of unlabeled
data receives pseudo-labels
from both classifiers.
• Classifiers co-train one another
using the pseudo-labels with
the highest confidence level.
• The predictions of the two
classifiers are combined.
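
A minimal co-training sketch, assuming the two views are simply two halves of the feature set; the models and confidence threshold are illustrative:

```python
# Sketch of co-training: two classifiers on two views exchange confident pseudo-labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
views = [X[:, :10], X[:, 10:]]                   # two views: different features, same instances
labeled = np.zeros(len(y), dtype=bool)
labeled[:100] = True

clfs = [GaussianNB(), GaussianNB()]
for _ in range(5):
    for v, clf in enumerate(clfs):
        clf.fit(views[v][labeled], y[labeled])   # train each classifier on its own view
    pool = np.where(~labeled)[0]                 # the bigger pool of unlabeled data
    if len(pool) == 0:
        break
    for v, clf in enumerate(clfs):               # each classifier pseudo-labels its confident calls
        proba = clf.predict_proba(views[v][pool])
        confident = proba.max(axis=1) > 0.99
        y[pool[confident]] = proba[confident].argmax(axis=1)
        labeled[pool[confident]] = True

# combine the two classifiers' predictions, e.g. by averaging their probabilities
avg = sum(clf.predict_proba(views[v]) for v, clf in enumerate(clfs)) / 2
y_pred = avg.argmax(axis=1)
```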
Active Learning
• Setup: a model is trained using both labelled and unlabeled data,
with the objective of learning to predict the output label given an
input data example.
• Learning: by iteratively training the model using a human expert
annotator for the least confident predictions.

Drawbacks:
• Sensitive to data quality
• Dependence on human
annotator
• Computationally intensive
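
A minimal sketch of pool-based active learning with uncertainty sampling; the human expert annotator is simulated here by the held-back true labels, and the budget and threshold are assumptions:

```python
# Sketch of active learning: query the expert for the least confident predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:20] = True                               # small initial labeled set

model = LogisticRegression(max_iter=1000)
for _ in range(10):                               # annotation budget: 10 rounds of 5 queries
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X)
    uncertainty = 1 - proba.max(axis=1)           # least confident predictions score highest
    uncertainty[labeled] = -1                     # never re-query already labeled points
    query = np.argsort(uncertainty)[-5:]          # ask the "expert" for the 5 most uncertain examples
    labeled[query] = True                         # their labels (y[query]) become available
```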
Types of Machine Learning…
• Reinforcement learning: an agent learns to interpret its
environment and optimize a given objective by interacting with
the environment through actions that are either rewarded or
penalized.
• Federated learning: training data is locally stored in interconnected
nodes and cannot be shared. A central model is learned using the
distributed local data by sharing only model parameters and not the data
points themselves.
Examples of supervised learning
• Predict whether a patient, hospitalized due to
heart attack, will have a second heart attack
• Identify the strongest predictors for heart attack
• Prediction based on many data modalities:
demographics, diet, clinical measurements, clinical
history

Examples of supervised learning
• Predict the price of a stock in 6 months from now, based on
– the performance of the company (e.g., sales, future outlook)
– other economic metrics (e.g., market value, Price-to-Earnings ratio)
– other exogenous variables (e.g., social media)
Examples of supervised learning
• Identify the handwritten numbers in a digitized image
• Identify animals in an image
Examples of reinforcement learning
Define the measures that need to be taken by the public health authorities
under a pandemic situation
Diagram: the RL system receives input, produces output “actions”, and is
trained through “rewards” / “penalties”.
Examples of federated learning
Given data from distributed hospitals learn a central model that can propose
the optimal patient treatment without sharing any data

Distributed data sources feeding a global model: medical images, electronic
patient records, smart phone data, sensor readings, chemical compound data,
onboard maintenance tracking.

Example: chemical compound data

Name                Solubility  Frac. of rotatable bonds  Geom. diam.  Log P  Zagreb group index  Topol. diam.  No. C atoms  No. heavy bonds
methylpentane       good        0.40                      3.46         2.44   19                  4             6            5
methylcyclohexene   good        0                         3.00         2.51   31                  4             7            7
nonene              med.        0.75                      6.93         3.53   28                  8             9            8
hexadiene           good        0.60                      4.36         2.14   16                  5             6            5
butadiene           good        0.33                      2.65         1.36   8                   3             4            3
naphthalene         good        0                         3.61         2.84   57                  5             10           11
acenaphthylene      good        0                         3.58         3.32   83                  5             12           14
pyrene              poor        0                         5.00         4.58   117                 7             16           19
dimethylanthracene  poor        0                         5.29         4.61   108                 7             16           18
hexahydropyrene     med.        0                         5.00         3.82   117                 7             16           19
triphenylene        poor        0                         5.00         5.15   126                 7             18           21
benzo(e)pyrene      poor        0                         5.29         5.64   152                 7             20           24

The classical learning setup
Input: attributes (or variables, predictors, independent variables);
categorical, discrete, ordinal, or numerical; denoted as X

A machine learning function f maps the input X to the output

Output: responses (or dependent variables); denoted as Y

Classification is the task of learning a target function f that maps an
input (attribute set) X to a class label Y

Regression is the task of learning a target function f that maps an input
(attribute set) X to a real value Y in a given range of real values
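
A minimal illustration of the two tasks; the datasets and models are illustrative assumptions:

```python
# Classification: f maps X to a class label.   Regression: f maps X to a real value.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LinearRegression, LogisticRegression

X_cls, y_cls = load_iris(return_X_y=True)          # y is a class label (0, 1, 2)
clf = LogisticRegression(max_iter=1000).fit(X_cls, y_cls)
print(clf.predict(X_cls[:3]))                      # predicted class labels

X_reg, y_reg = load_diabetes(return_X_y=True)      # y is a real-valued response
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))                      # predicted real values
```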
How good is f ?

Figure: underfitting vs. appropriate fitting vs. overfitting

Underfitting: an ML model is unable to capture the relationship between the input and
output variables accurately, generating a high error rate on both training and test sets
Overfitting: an ML model is unable to generalize, hence the relationship it learns between
the input and output variables in the training set does not reflect the true relationship in the
test set, generating a low error rate on the training set but a high error rate on the test set

What determines a good fit?


How good is f ?
the U-shape effect

Plot: as model complexity grows, the training error keeps decreasing while the
test error first decreases and then increases again (a U-shape).

Underfitting: an ML model is unable to capture the relationship between the input and
output variables accurately, generating a high error rate on both training and test sets
Overfitting: an ML model is unable to generalize, hence the relationship it learns between
the input and output variables in the training set does not reflect the true relationship in the
test set, generating a low error rate on the training set but a high error rate on the test set

What determines a good fit?


The bias-variance decomposition
• Assume the data complies with the following statistical model:

      Y = f(X) + ε

• The relationship between X and Y is non-deterministic (hence the noise term ε)

• But we assume:  E[ε] = 0  and  Var(ε) = σ²ε

• We also assume that f is fixed and unknown, and we let f̂ denote the
obtained predictions

• The metric to optimize is the MSE (mean squared error):

      MSE = E[(Y − f̂(X))²]

The bias-variance decomposition
• Bias: the error induced by erroneous assumptions made by the learning algorithm.

• High bias can cause an algorithm to miss the relevant relations between
features and target outputs (underfitting).

• Bias: over different realizations of the training set, how far f̂(X) is,
on average, from the true f(X):

      Bias[f̂(X)] = E[f̂(X)] − f(X)
The bias-variance decomposition
• Variance: the error from sensitivity to small fluctuations in the training set.

• High variance may cause an algorithm to model random noise or erroneous
relations in the training dataset (overfitting).

• Variance: over different realizations of the training set, how much does
f̂(X) deviate from its expectation:

      Var[f̂(X)] = E[(f̂(X) − E[f̂(X)])²]

• Bias–variance decomposition: a way of analyzing the expected
generalization error of a learning algorithm with respect to a particular
problem, as a sum of three terms:

      expected error = bias² + variance + irreducible error

irreducible error: resulting from noise in the problem itself.
Proof (sketch)

Recall:  Y = f(X) + ε,  E[ε] = 0,  Var(ε) = σ²ε,  and ε is independent of f̂

Steps used in the decomposition:
• square expansion
• linearity of expectation
• properties of the error term ε and its independence from f̂(X)
• f(X) is a constant, hence its expectation is also constant (equal to f(X))
• add and subtract E[f̂(X)] and continue the decomposition
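
A compact version of the derivation, following the steps named above (square expansion, linearity of expectation, E[ε] = 0, independence of ε from f̂, and f(X) being constant under the expectation):

```latex
\begin{align*}
\mathbb{E}\!\left[(Y - \hat{f}(X))^2\right]
  &= \mathbb{E}\!\left[(f(X) + \varepsilon - \hat{f}(X))^2\right] \\
  &= \mathbb{E}\!\left[(f(X) - \hat{f}(X))^2\right]
     + 2\,\mathbb{E}\!\left[\varepsilon\,(f(X) - \hat{f}(X))\right]
     + \mathbb{E}\!\left[\varepsilon^2\right]
     & \text{(square expansion, linearity)} \\
  &= \mathbb{E}\!\left[(f(X) - \hat{f}(X))^2\right] + \sigma_\varepsilon^2
     & \text{($\mathbb{E}[\varepsilon]=0$, $\varepsilon$ independent of $\hat{f}$)} \\
  &= \left(f(X) - \mathbb{E}[\hat{f}(X)]\right)^2
     + \mathbb{E}\!\left[\left(\hat{f}(X) - \mathbb{E}[\hat{f}(X)]\right)^2\right]
     + \sigma_\varepsilon^2
     & \text{(add/subtract $\mathbb{E}[\hat{f}(X)]$; cross term vanishes)} \\
  &= \mathrm{Bias}\!\left[\hat{f}(X)\right]^2 + \mathrm{Var}\!\left[\hat{f}(X)\right] + \sigma_\varepsilon^2
\end{align*}
```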
Model selection & assessment

Ideal scenario: plenty of data, so divide it into three compartments:
(1) training, (2) validation, (3) test

• training → model fitting
• validation → model selection
• test → assessment of the generalization error of the final model


Hold out & subsampling
• Holdout (or using a validation set)
– reserve 50% for training, 25% for validation, and 25% for testing
• Random subsampling
– repeated holdout
• Stratified subsampling
– preserve the class balance in the samples

• Shortcoming:
– the estimate depends highly on which data points end up in the training
set and which end up in the test set
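
A minimal sketch of holdout and stratified subsampling with scikit-learn, using the split ratios from the slide; the dataset is an illustrative assumption:

```python
# Sketch of holdout (50/25/25) and stratified random subsampling (repeated holdout).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Holdout: 50% training, 25% validation, 25% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.50, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

# Stratified random subsampling: repeated holdout that preserves the class balance
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    pass  # train on X[train_idx], evaluate on X[test_idx], then average the 10 scores
```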
k-fold Cross Validation
• Partition the N data samples into k disjoint subsets
– train on k-1 partitions
– test on the remaining one

      D:  | 1 | 2 | 3 | … | k-1 | k |

• Each subset is given a chance to be in the test set

• Performance is averaged over the k iterations:

      Err = (1/k) · Σᵢ₌₁…ₖ Errᵢ

• If k = N, this is leave-one-out Cross Validation (LOOCV)

What is good about LOOCV?
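
A minimal sketch of k-fold cross-validation and LOOCV with scikit-learn; the dataset and estimator are illustrative assumptions:

```python
# Sketch of k-fold CV and LOOCV (the LOOCV run fits one model per sample, so it is slow).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# k-fold: train on k-1 partitions, test on the remaining one, average over the k iterations
scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold estimate:", scores.mean())

# LOOCV is the special case k = N: N model fits, one held-out sample each time
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV estimate:", loo_scores.mean())
```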


LOOCV vs k-fold
• LOOCV:
o the test error estimate is approximately unbiased for the true (expected)
prediction error, since we train on “almost” the whole training set
o training on almost identical training sets produces models that are
highly correlated, hence the error estimate will have high variance
if different data samples are used
o the computational burden will be very high: the training algorithm
has to be rerun from scratch N times
• k-fold:
o less correlated models
o the computational time is lower
o the bias of the procedure increases as the training set will contain
fewer examples
o the lower the k the higher the bias!
Learning Curve
• Learning curve: a plot that shows
changes in learning performance (y-
axis) over time in terms of
experience (x-axis).
• Learning curves can be defined to
depict model performance on the (1)
the training set or (2) validation set.
• They can be used to diagnose an
underfit, overfit, or well-fit model.
• They can be used to diagnose whether the train or validation sets are not
representative of the problem domain.

Train learning curve: how well the model is learning.
Validation learning curve: how well the model is generalizing.
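
A minimal sketch of computing the two learning curves with scikit-learn's learning_curve; the estimator and training-set sizes are assumptions:

```python
# Sketch: train and validation learning curves as a function of training-set size.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} training examples: train={tr:.3f}  validation={va:.3f}")
```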
Learning Curve: size of the training set
Two observations:
• performance improves as we move to
100 examples
• increasing to 200 brings a very small
benefit

What training set size works for 5-fold CV?

What if the training set = 200 examples?
• 5-fold: training sets of size 160
• Same performance as for 200 examples; hence no bias

What if the training set = 50 examples?
• 5-fold: training sets of size 40
• The error will be overestimated; hence bias
Learning Curve: underfitting

Either the training loss remains flat regardless of training, or the training
loss continues to decrease until the end of training.
Learning Curve: overfitting

The training loss continues to decrease with experience while the validation
loss is stable, or the validation loss decreases to a point and then begins
increasing again.
Learning Curve: unrepresentative datasets

The training set is not representative enough (too few examples relative to
the validation set), or the validation set is not representative enough (too
few examples relative to the training set).
Comparing Learning Curves
• Indication of high bias
As # of training examples increases:
o training error increases too
o model is not fitting the data with
an appropriate curve

Bias leads to comparable errors in cross-validation and training.

Adding more data does not help!

What should one do to reduce bias?
• increase the number of features
• add more complex features
Comparing Learning Curves
• Indication of high variance
Training error grows slowly but
well within the desired range

High variance leads to overfitting and hence high test error.

With more examples, the model is forced to learn more generalized properties.

Adding more examples reduces the test error and the gap between the curves
closes down.

What should one do to reduce variance?
• get more data
• reduce the number of features
Nested Cross Validation (5x2 CV)
Inner loop: optimize parameters (2 splits)
Outer loop: evaluate with the optimal parameters (5 splits)

Which model do we finally deploy?
Cross-validation tests a model-building procedure and not a single model instance.
We deploy the feature set and parameters obtained by applying the winning
procedure to the whole dataset.
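
A minimal nested (5x2) cross-validation sketch with scikit-learn; the estimator and parameter grid are illustrative assumptions:

```python
# Sketch of nested CV: the inner loop selects parameters, the outer loop
# estimates the generalization error of the whole model-building procedure.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=2)   # inner loop: 2 splits
outer_scores = cross_val_score(inner, X, y, cv=5)                   # outer loop: 5 splits
print("estimated error of the procedure:", 1 - outer_scores.mean())

# What we deploy: the winning procedure applied to the whole dataset
final_model = inner.fit(X, y).best_estimator_
```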
Bootstrapping
• Create B replications (of the same size) of a training set of size N
by selecting examples uniformly with replacement

• f̂*ᵇ(xᵢ): the predicted value at xᵢ of the model built on the b-th
replication (i.e., bootstrap sample)

• Train on each replication b
• Test on the whole training set

Any problem with that?
Leave-one-out bootstrap
• Train on each replication b

• Test each sample i only on the replications it does not appear in, i.e.,
using the set C⁻ⁱ of bootstrap samples that do not contain i

• Problems?
– the average number of distinct observations in each bootstrap sample is
about 0.632 N
– upward bias on the validation error
– behaves roughly like two-fold Cross Validation
.632 bootstrap

• Main idea:
– extension of the leave-one-out bootstrap
– intuition: pulls the leave-one-out bootstrap estimate down towards the
training error rate, and hence reduces its upward bias

      Err(.632) = 0.368 · err̄ + 0.632 · Err(1)

where err̄ is the training error (on the whole training set) and Err(1) is
the leave-one-out bootstrap error.
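
A minimal sketch of the leave-one-out bootstrap and the .632 estimate; the number of replications B, the dataset, and the model are illustrative choices:

```python
# Sketch of bootstrap error estimates: out-of-bag (leave-one-out bootstrap) and .632.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
N, B, rng = len(y), 50, np.random.default_rng(0)

errors = [[] for _ in range(N)]                     # per-sample out-of-bag errors
for _ in range(B):
    idx = rng.integers(0, N, size=N)                # sample N points uniformly with replacement
    oob = np.setdiff1d(np.arange(N), idx)           # samples not in this replication
    model = LogisticRegression(max_iter=5000).fit(X[idx], y[idx])
    pred = model.predict(X[oob])
    for i, p in zip(oob, pred):
        errors[i].append(p != y[i])                 # test i only on replications it is not in

err_loo_boot = np.mean([np.mean(e) for e in errors if e])                 # leave-one-out bootstrap error
err_train = 1 - LogisticRegression(max_iter=5000).fit(X, y).score(X, y)   # training error on whole set
err_632 = 0.368 * err_train + 0.632 * err_loo_boot                        # the .632 estimate
print(err_632)
```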
Coming up next…
Jan 15 Introduction to machine learning
Jan 17 Regression analysis
Jan 18 Laboratory session I: numpy and linear regression
Jan 22 Machine learning pipelines
Jan 25 Laboratory session II: ML pipelines and grid search
Jan 29 Training neural networks
Jan 31* Laboratory session III: training NNs and tensorflow basics
Feb 6 Convolutional neural networks
Feb 7 Recurrent neural networks and autoencoders
Feb 12 Time series classification
Feb 15 Laboratory session IV: CNNs and RNNs
Feb 19 Deep time series classification
TODOs
• Reading:
– Main course book (ESL): Chapters 1, 2, and 7

• Prepare for this week’s lab:
– Start reading HML: Chapter 4
