Unit 1b - Fundamentals of Machine Learning

Fundamentals of machine learning
Chapter 4 from Chollet book
1
Four branches of Machine Learning
 Supervised learning
 Unsupervised learning
 Self-supervised learning
 Reinforcement learning
2
Types of ML – Supervised
 In supervised learning, the training set we
feed to the algorithm includes the desired
solutions, called labels.
 Generally, almost all applications of deep
learning that are in the spotlight these days
belong in this category, such as optical
character recognition, speech recognition,
image classification, and language
translation.
3
Types of ML – Supervised
 Classification
 Regression
 Sequence generation: Given a picture, predict a
caption describing it.
 Syntax tree prediction: Given a sentence,
predict its decomposition into a syntax tree.
 Object detection: Given a picture, draw a
bounding box around certain objects in it.
 Image segmentation: Given a picture, draw a
pixel-level mask on a specific object.
4
Types of ML – Unsupervised
 Finding interesting transformations of the input
data without the help of any targets, for the
purposes of data visualization, data compression,
or data denoising, or to better understand the
correlations present in the data at hand.
 Bread and butter of data analytics.
 As a preprocessing step before supervised
learning.
 Dimensionality reduction and clustering are well-
known categories of unsupervised learning.
5
Types of ML – Self-supervised
 Supervised learning without human-annotated labels: there are no humans in the loop, and the labels are generated from the input data itself.
 Autoencoders, where the generated targets
are the input, unmodified.
 Trying to predict the next frame in a video,
given past frames, or the next word in a text,
given previous words (temporally supervised
learning - supervision comes from future input
data).
 Not a standard definition…
6
Types of ML – Reinforcement learning
 An agent receives information about its environment and learns to choose actions that will maximize some reward.
 AlphaGo Zero
 Many real-world applications in self-driving
cars, robotics, resource management,
education, and so on.
 The future may be RL’s to dominate.
7
4.2 Evaluating ML models
 In ML, the goal is to achieve models that
generalize - that perform well on never-
before-seen data.
 Evaluating ML models involves measuring
generalization.
 Ways to measure generalization
 Simple hold-out validation
 K-fold validation
 Iterated K-fold validation with shuffling
8
Evaluating ML models
 Evaluating a model always boils down to
splitting the available data into three sets:
training, validation, and test.
 We train on the training data and evaluate (and tune) our model on the validation data.
 After a few rounds of tuning, we retrain the model on the training and validation data combined.
 Once our model is ready for prime time, we
test it one final time on the test data.
9
 Developing a model always involves tuning its
configuration: for example, choosing the number
of layers or the size of the layers.
 We do this tuning by using the performance of
the model on the validation data.
 Tuning is a form of learning: a search for a good configuration in some parameter space.
 It can result in overfitting to the validation set, even though our model is never directly trained on it. This is called an information leak.
10
 We care about performance on completely new
data, not the validation data, so we need to use a
completely different, never-before-seen dataset
to evaluate the model: the test dataset.
 Our model shouldn’t have had access to any
information about the test set, even indirectly.
 Splitting data into training, validation, and test
sets becomes tricky when little data is available.
 Simple hold-out validation,
 K-fold validation, and
 Iterated K-fold validation with shuffling.
11
SIMPLE HOLD-OUT VALIDATION
12
data[num_validation_samples:]
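Only a fragment of the slide's listing survives above. As a minimal sketch of simple hold-out validation, the following shuffles the data, sets aside the first num_validation_samples items for validation, and trains on the rest. The MeanModel class and the synthetic data are hypothetical stand-ins for a real model and dataset; they are not part of the original slides.

```python
import random

def hold_out_split(data, num_validation_samples, seed=0):
    """Shuffle a copy of the data, then carve off the first samples for validation."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    validation_data = shuffled[:num_validation_samples]
    training_data = shuffled[num_validation_samples:]
    return training_data, validation_data

class MeanModel:
    """Hypothetical stand-in for a real model: always predicts the mean training target."""
    def fit(self, targets):
        self.mean = sum(targets) / len(targets)
        return self

    def evaluate(self, targets):
        # Mean squared error of the constant prediction.
        return sum((t - self.mean) ** 2 for t in targets) / len(targets)

data = [float(i % 10) for i in range(100)]  # synthetic targets, for illustration only
training_data, validation_data = hold_out_split(data, num_validation_samples=20)
validation_score = MeanModel().fit(training_data).evaluate(validation_data)
```

After tuning against validation_score, the usual next step is to retrain on all non-test data and evaluate once on a separate test set.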
13
 If little data is available, then our validation
and test sets may contain too few samples
to be statistically representative of the data
at hand.
 Different random shuffling rounds of the data
before splitting end up yielding very different
measures of model performance.
 K-fold validation and iterated K-fold
validation are two ways to address this.
14
K-FOLD VALIDATION
 We split our data into K partitions of equal size.
 For each partition i, we train a model on the remaining
K – 1 partitions, and evaluate it on partition i.
 The final score is the average of those K scores.
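The three steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the book's listing: train_and_score is an assumed callback that trains a fresh model and returns its validation score, and mean_baseline is a hypothetical stand-in for a real model.

```python
def k_fold_scores(data, k, train_and_score):
    """Return one validation score per fold."""
    fold_size = len(data) // k  # leftover samples (if any) are only ever used for training
    scores = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        validation_data = data[start:stop]            # partition i
        training_data = data[:start] + data[stop:]    # the remaining K - 1 partitions
        scores.append(train_and_score(training_data, validation_data))
    return scores

def mean_baseline(training_data, validation_data):
    """Hypothetical 'model': MSE of predicting the training mean on the validation fold."""
    mean = sum(training_data) / len(training_data)
    return sum((x - mean) ** 2 for x in validation_data) / len(validation_data)

data = [float(i % 5) for i in range(20)]  # synthetic data, for illustration only
scores = k_fold_scores(data, k=4, train_and_score=mean_baseline)
final_score = sum(scores) / len(scores)   # average of the K scores
```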
15
ITERATED K-FOLD VALIDATION WITH SHUFFLING
 Particularly useful when we have relatively little data
available and we need to evaluate our model as
precisely as possible.
 It has been found to be extremely helpful in Kaggle competitions.
 It consists of applying K-fold validation multiple times,
shuffling the data every time before splitting.
 The final score is the average of the scores obtained
in training and evaluating P × K models (where P is
the number of iterations we use).
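A minimal sketch of the procedure, assuming a train_and_score callback that trains a fresh model and returns its validation score (mean_baseline below is a hypothetical stand-in). It returns the averaged score together with the number of models trained, which should equal P × K.

```python
import random

def iterated_k_fold(data, k, p, train_and_score, seed=0):
    """Run K-fold validation p times, reshuffling the data before each run."""
    rng = random.Random(seed)
    scores = []
    for _ in range(p):
        shuffled = list(data)
        rng.shuffle(shuffled)  # a new partitioning on every iteration
        fold_size = len(shuffled) // k
        for i in range(k):
            start, stop = i * fold_size, (i + 1) * fold_size
            validation_data = shuffled[start:stop]
            training_data = shuffled[:start] + shuffled[stop:]
            scores.append(train_and_score(training_data, validation_data))
    return sum(scores) / len(scores), len(scores)

def mean_baseline(training_data, validation_data):
    """Hypothetical 'model': MSE of predicting the training mean on the validation fold."""
    mean = sum(training_data) / len(training_data)
    return sum((x - mean) ** 2 for x in validation_data) / len(validation_data)

data = [float(i % 7) for i in range(60)]  # synthetic data, for illustration only
final_score, num_models = iterated_k_fold(data, k=3, p=5, train_and_score=mean_baseline)
```

The cost is the obvious one: P × K training runs instead of one, which is why this is mostly used when data is scarce and each run is cheap.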
16
Things to keep in mind
 Data representativeness - we want both our
training set and test set to be representative of
the data at hand.
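 The arrow of time - If we’re trying to predict the future given the past (e.g., stock movements), we should not shuffle our data before splitting, and the validation and test data should be more recent than the training data.
 Redundancy in our data - If some data points appear twice (fairly common with real-world data), then shuffling and splitting can place copies of the same point in both the training and validation sets, so we would effectively be validating on part of the training data.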
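The last two points can be illustrated with a short sketch, assuming samples are stored as (timestamp, value) pairs already sorted by time. The function names and the price data are hypothetical, chosen only for this example.

```python
def drop_duplicates(samples):
    """Keep the first occurrence of each sample, preserving the original order."""
    seen = set()
    unique = []
    for sample in samples:
        if sample not in seen:
            seen.add(sample)
            unique.append(sample)
    return unique

def chronological_split(samples, num_validation_samples):
    """Split time-ordered samples without shuffling: oldest for training, newest for validation."""
    split_at = len(samples) - num_validation_samples
    return samples[:split_at], samples[split_at:]

# Made-up (day, price) pairs; day 2 appears twice.
prices = [(1, 10.0), (2, 10.5), (2, 10.5), (3, 9.8), (4, 10.1), (5, 10.4)]
unique_prices = drop_duplicates(prices)
training_data, validation_data = chronological_split(unique_prices, num_validation_samples=2)
```

Removing duplicates before splitting guarantees the two sets are disjoint, and splitting by position keeps every validation sample later in time than every training sample.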
17
4.4 Overfitting and underfitting
 After just a few epochs, all three models of chapter 3
began to overfit.
 Learning how to deal with overfitting is essential to
mastering ML.
 At the beginning of training, the loss on both training and test data is high; the model is said to be underfit:
  The network hasn’t yet modeled all relevant patterns.
 After some iterations, test loss stops following the downward movement of training loss: overfitting starts.
  The model starts modelling misleading or irrelevant patterns found in the training data.
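One simple way to locate the point where overfitting starts is to find the epoch at which the held-out loss is lowest. The sketch below uses made-up loss values purely for illustration; best_epoch is a hypothetical helper, not a library function.

```python
def best_epoch(held_out_losses):
    """Return the 1-based epoch at which the held-out loss is lowest."""
    best_index = min(range(len(held_out_losses)), key=held_out_losses.__getitem__)
    return best_index + 1

# Made-up per-epoch losses: training loss keeps falling, held-out loss turns upward.
train_losses = [0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13]
held_out_losses = [0.62, 0.50, 0.44, 0.42, 0.45, 0.49, 0.55]

epoch = best_epoch(held_out_losses)
# Beyond this epoch the two curves diverge, which is the overfitting regime.
gap_after = [h - t for t, h in zip(train_losses[epoch:], held_out_losses[epoch:])]
```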
18
Overfitting and underfitting
Canonical overfitting behaviour

19
 To prevent a model from learning misleading or
irrelevant patterns found in the training data, the
best solution is to get more training data.
 A model trained on more data generalizes better.
 When that isn’t possible, the next-best solution is
to modulate the quantity of information that our
model is allowed to store or to add constraints on
what information it’s allowed to store.
 With fewer patterns allowed, the optimization process
will focus on the most prominent patterns - a better
chance of generalizing well. Called regularization.
20
Reducing the network’s size
 The simplest way to prevent overfitting is to
reduce the size of the model: the number of
learnable parameters (memorization capacity).
  A model with 500,000 binary parameters could easily be made to learn the class of every digit in the MNIST training set: 10 binary parameters for each of the 50,000 digits would be enough.
  Such a model would probably be useless on unseen data.
 A network with limited memorization resources is
forced to learn features that have predictive
power regarding the targets.
21
Reducing the network’s size
 At the same time, we should use models that
have enough parameters that they don’t underfit -
our model shouldn’t be starved for memorization
resources.
 There is a compromise to be found between too much
capacity and not enough capacity.
 Unfortunately, there is no magical formula.
 Start with relatively few layers and parameters,
and keep increasing the size until we see
diminishing returns with regard to validation loss.
22
The smaller network starts overfitting later than the original network, and its performance degrades more slowly once it starts overfitting.
IMDB movie-review classification

 Original model: 3 layers of 16, 16, 1 units
 Smaller model: 3 layers of 4, 4, 1 units
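To make the difference in capacity concrete, the sketch below counts the learnable parameters of the two architectures. It assumes fully connected layers with biases and a 10,000-dimensional input (the input_shape used in the deck's Keras listings); dense_param_count is a hypothetical helper.

```python
def dense_param_count(input_dim, layer_units):
    """Total weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units  # weight matrix plus one bias per unit
        input_dim = units                   # this layer's output feeds the next layer
    return total

original_params = dense_param_count(10000, [16, 16, 1])
smaller_params = dense_param_count(10000, [4, 4, 1])
print(original_params, smaller_params)  # 160305 40029
```

Almost all of the parameters sit in the first layer, so shrinking it from 16 to 4 units cuts capacity by roughly a factor of four.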
23
Adding weight regularization
 A common way to mitigate overfitting is to put
constraints on the complexity of a network by
forcing its weights to take only small values.
 This is called weight regularization, and it’s done
by adding to the loss function of the network a
cost associated with having large weights.
 Regularization comes in two flavours:
 L1 regularization
 L2 regularization
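In L1 regularization the added cost is proportional to the absolute values of the weights; in L2 regularization (also called weight decay) it is proportional to their squares. A minimal sketch of how each penalty changes the loss, using made-up weights and a made-up base loss:

```python
def l1_penalty(weights, factor):
    """Cost proportional to the sum of absolute weight values."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """Cost proportional to the sum of squared weight values."""
    return factor * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 1.5]  # made-up weights, for illustration only
base_loss = 0.30                 # made-up unregularized loss

loss_with_l1 = base_loss + l1_penalty(weights, factor=0.001)
loss_with_l2 = base_loss + l2_penalty(weights, factor=0.001)
```

Because the L2 penalty squares each weight, the single large weight (-2.0) contributes most of the cost, which is what pushes the optimizer toward small values.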
24
The model with L2
regularization has
become much more
resistant to overfitting
than the reference
model, even though both
models have the same
number of parameters.
from keras import models, layers, regularizers

# Each regularized layer adds 0.001 * sum(w ** 2) over its weight matrix to the loss.
model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
25
Adding dropout
 Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Geoff Hinton and his students at the University of Toronto.
 Dropout, applied to a layer, consists of randomly
dropping out (setting to zero) a number of output
features of the layer during training.
 The dropout rate is the fraction of the features
that are zeroed out; it’s usually set between 0.2
and 0.5.
26
Adding dropout
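 Either
  At training time, we zero out at random a fraction of the values in the matrix; that fraction is the dropout rate.
  At test time, nothing is dropped; instead, we scale the output down by the fraction of values that was kept, (1 - rate), to compensate for more units being active than during training.
 Or
  At training time, we zero out at random a fraction of the values in the matrix.
  Then, we scale up the surviving values by dividing them by (1 - rate).
  At test time, we do not do anything.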
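The second variant (often called inverted dropout) can be sketched in plain Python as follows. This is an illustrative sketch on a flat list of activations, not a framework implementation; the function names are hypothetical.

```python
import random

def dropout_train(activations, rate, rng):
    """Zero each value with probability `rate`; scale the survivors by 1 / (1 - rate)."""
    keep_fraction = 1.0 - rate
    return [a / keep_fraction if rng.random() >= rate else 0.0 for a in activations]

def dropout_test(activations):
    """At test time the activations pass through unchanged."""
    return list(activations)

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout_train(activations, rate=0.5, rng=rng)

# The rescaling keeps the average activation close to its original value.
mean_at_training = sum(dropped) / len(dropped)
mean_at_test = sum(dropout_test(activations)) / len(activations)
```

Scaling during training rather than at test time leaves the inference path untouched, which is why this variant is the one usually implemented in practice.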
28
This is a clear improvement
over the reference model - it
also seems to be working
much better than L2
regularization, since the lowest
validation loss reached has
improved.
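from keras import models, layers

# Each Dropout(0.5) layer zeroes half of the preceding layer's outputs at random, during training only.
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))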
29
 To recap, these are the most common
ways to prevent overfitting in neural
networks:
 Get more training data.
 Reduce the capacity of the network.
 Add weight regularization.
 Add dropout.
  Units have less chance of “collusion” (co-adaptation).
  Each unit “trains harder” to capture a feature, since other units may be dropped out during training.
30
Chapter summary
 The purpose of a machine learning model is to
generalize: to perform accurately on never-
before-seen inputs.
 Many model evaluation methods.
 Holdout validation, K-fold cross-validation, etc.
 The fundamental problem in machine learning is
the tension between optimization and
generalization.
 First work on optimization; tuning hyperparameters.
 Then work on generalization; model regularization.
31
