
Machine Learning:

How do computers learn?

Chloé-Agathe Azencott
CBIO, Mines ParisTech & Institut Curie
http://cazencott.info
What is learning?

Learning: acquiring a skill through experience and practice
– skill = algorithm / model
– experience = data

Machine learning: using data to build an algorithm / a model
Artificial intelligence

Reproducing, with machines, behaviours that we perceive as intelligent

Involves much more than machine learning!

Perception, reasoning, language, motion, etc.

[Diagram: deep learning ⊂ machine learning ⊂ artificial intelligence]
1. Examples of
machine learning problems
1. Supervised machine learning
Making predictions

Data + Labels → ML → Predictor
1. Supervised machine learning
Problem 1: binary classification
Example: Identification of metastases in lymph node biopsies

[Figure: biopsy slides labelled "cancer" vs. "healthy"]

Babak Ehteshami Bejnordi et al. (2017), Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer, JAMA.
1. Supervised machine learning
Problem 2: regression
Example: solubility of a molecule in ethanol
[Figure: Acetaminophen, 25 mg/mL; Aspirin, 80 mg/mL]

Chloé-Agathe Azencott et al. (2007), One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical and biological properties, Journal of Chemical Information and Modeling.
2. Unsupervised learning
Data exploration:
Better understand your data

Data → ML → Data!
2. Unsupervised learning
Problem 1: Clustering
Group similar samples together

Data → ML → clusters
2. Unsupervised learning
Problem 1: Clustering
Example: disease subtype identification

Hege G. Russnes (2017), Breast Cancer Molecular Stratification: From Intrinsic Subtypes to Integrative Clusters, The American Journal of Pathology.
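
For illustration only (the cited study uses integrative clustering of molecular data, not this toy pipeline), a minimal k-means sketch with scikit-learn on synthetic two-group data:

```python
# A minimal clustering sketch: k-means groups similar samples together.
# Synthetic data; in the subtype example, rows would be patients and
# columns molecular features. scikit-learn is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic "subtypes": points around (0, 0) and around (5, 5).
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # one cluster label per sample
```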
2. Unsupervised learning
Problem 2: Dimensionality reduction
Represent your data with fewer features

[Diagram: data matrix X (n samples × m features) → ML → representation with fewer features]
2. Unsupervised learning
Problem 2: Dimensionality reduction
Example: Project SNP data onto 2 dimensions

J. Novembre et al. (2008), Genes mirror geography within Europe, Nature.
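
In the spirit of that example, a minimal PCA sketch (scikit-learn; the genotype matrix below is a random stand-in, not real SNP data):

```python
# Dimensionality reduction sketch: project a (samples x features) matrix
# onto its first 2 principal components, as Novembre et al. did with SNPs.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 individuals x 5000 SNPs, coded 0/1/2 (random stand-in data).
X = rng.integers(0, 3, size=(200, 5000)).astype(float)

X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (200, 2): each individual now has 2 coordinates
```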
2. How to train a supervised
machine learning model
Learning a supervised learning model
1. Training data D
n samples x1, x2, …, xn and their labels y1, y2, …, yn
2. Hypothesis space H
the shape of the model f = what kind of model we can learn
3. Loss function L
L(y, f(x)) = the error made by predicting f(x) instead of y

Empirical risk minimization:
Find the model f in the hypothesis space H that minimizes the loss L on
average over the training data D
4. Optimization procedure
How to solve the empirical risk minimization problem
– Sometimes exact or as accurate as we want
– Sometimes the solution is unique, sometimes not
– Sometimes need to use heuristics
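
In symbols, with the notation just introduced (training data D of size n, hypothesis space H, loss L), empirical risk minimization reads:

```latex
% Pick the model in H with the lowest average loss over the training data.
\hat{f} \;=\; \underset{f \in \mathcal{H}}{\arg\min}\;
        \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)
```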
Linear models

ML algorithms for learning linear models:
linear/logistic regression, support vector machines

● Hypothesis space: linear models (a weighted sum of the features)
● Optimization procedure: usually fast, easy, accurate
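
A minimal sketch of fitting one such model, a logistic regression, on synthetic data (scikit-learn is my choice of library, not something the slides prescribe):

```python
# Logistic regression: a linear model whose weighted sum of the features
# is turned into a class probability. Data and weights are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                               # 100 samples, 4 features
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(int)   # labels from a linear rule

model = LogisticRegression().fit(X, y)
print(model.coef_)           # the learned weights of the weighted sum
print(model.predict(X[:3]))  # predictions for the first 3 samples
```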
Non-linear models
Idea 1: Create new features

● Map (x1, x2, …, xp) to φ(x1, x2, …, xp), then use a linear approach
● Hypothesis space: linear models of φ(x)
● Optimization procedure: same as before

Example: quadratic regression
φ(x1, x2, …, xp) = (x1, x2, …, xp, x1², x1x2, …, xp²)
a linear model in the new space = a quadratic model in the original features


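A minimal sketch of this idea: build the quadratic feature map φ explicitly with scikit-learn's PolynomialFeatures (an illustrative implementation choice), then fit an ordinary linear model:

```python
# Idea 1 in code: map x to quadratic features, fit a linear model in phi(x).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + X[:, 0] ** 2 - X[:, 0] * X[:, 1]       # a quadratic ground truth

phi = PolynomialFeatures(degree=2, include_bias=False)
X_phi = phi.fit_transform(X)                     # (x1, x2, x1^2, x1*x2, x2^2)

model = LinearRegression().fit(X_phi, y)         # linear in phi(x), quadratic in x
print(phi.get_feature_names_out())
print(np.round(model.coef_, 2))
```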
Idea 2: Use kernels

Kernel: a (non-linear) similarity between samples

Many kernels exist for biological objects that aren’t vectors:
protein sequences, SNPs, molecular graphs, etc.

Kernel trick: replace dot products (= linear similarities) with kernels,
at no added computational cost

Example: kernel support vector machines
● Hypothesis space: linear models of φ(x1, x2, …, xp),
  where φ is such that k(x, x’) = ⟨φ(x), φ(x’)⟩
● Optimization procedure: still easy and accurate

Also used for non-linear tests of independence:
– Hilbert-Schmidt Independence Criterion (HSIC)
– Sequence Kernel Association Tests (SKAT) in GWAS

see L. Ralaivola’s talk
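
For illustration, a kernel SVM on data with a circular class boundary; the RBF kernel and the library are my assumptions, the slides only name kernel SVMs in general:

```python
# Kernel SVM sketch: the RBF kernel replaces the dot product, so the
# decision boundary is non-linear in the original features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # circular class boundary

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)        # k(x, x') = exp(-gamma * ||x - x'||^2)
print(clf.score(X, y))                              # training accuracy
```

For non-vector objects (sequences, graphs), one can instead pass a precomputed kernel matrix with SVC(kernel="precomputed").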
Idea 3: Use non-linear parametric models

Artificial neural networks

● Hypothesis space: very flexible
  – given by the architecture of the model
  – non-linear, with a high number of parameters
  – each unit computes a non-linear function of a linear combination of its inputs
● Optimization procedure: no guarantee of finding the solution

see L. Ralaivola’s talk
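
To make "a non-linear function of a linear combination of the inputs" concrete, a toy forward pass in NumPy; the weights are random here, and training would adjust them to minimize the loss:

```python
# One hidden layer: each unit applies a non-linearity (tanh) to a
# linear combination of its inputs. Architecture is illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                  # one sample, 3 features

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)    # hidden layer, 4 units
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)    # output layer

hidden = np.tanh(W1 @ x + b1)   # non-linear function of a linear combination
output = W2 @ hidden + b2       # final linear combination
print(output)
```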
Idea 4: Use a tree-based hypothesis space

Decision trees

● Hypothesis space:
  – models look like: if (x1 > 0.3) and [(x3 = 1) and … or (x4 < 2.9) …] then label = y
  – categorical and quantitative features
  [Figure: example tree: Color? splits into Grey (then Horn? Yes/No) and Yellow (then Stripes? Yes/No)]
● Optimization procedure:
  – heuristic! No guarantee
● Often perform poorly on their own
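
A minimal decision-tree sketch on synthetic data (scikit-learn, an illustrative choice); export_text prints the learned if/then rules in the same shape as above:

```python
# Fit a shallow decision tree and print its if/then structure.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = ((X[:, 0] > 0.3) & (X[:, 1] < 0.5)).astype(int)   # a rule-like target

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # the learned rules
```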
Idea 4: Use a tree-based hypothesis space

Random forests

● Hypothesis space:
  – combination of many trees
  – (ensemble learning)
● Optimization procedure:
  – learn each tree independently from the others
  – combine their predictions by vote
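
A minimal random-forest sketch (scikit-learn, illustrative): many trees, each trained on a random resample of the data, combined by vote:

```python
# Random forest: an ensemble of trees whose predictions are aggregated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)    # a non-linear target

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(len(forest.estimators_))             # 100 individual trees
print(forest.predict(X[:3]))               # majority vote over the trees
```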
3. How to avoid overfitting
(≈ learning by heart)
Overfitting & generalization

The true challenge of machine learning:
learning a model that works on new data

Overfitting: when the model is specific to the training data but
doesn’t generalize to new data

Particularly likely to happen with few samples and very many features
(hello, genomics!)
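
A small demonstration of that warning (pure noise, many more features than samples; scikit-learn, illustrative): an essentially unregularized model fits the training labels perfectly yet predicts new data at chance level:

```python
# Overfitting with p >> n: 30 samples, 1000 noise features, random labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(30, 1000)), rng.integers(0, 2, 30)
X_test,  y_test  = rng.normal(size=(30, 1000)), rng.integers(0, 2, 30)

# Large C makes the model effectively unregularized.
clf = LogisticRegression(C=1e6, max_iter=5000).fit(X_train, y_train)
print(clf.score(X_train, y_train))   # ~1.0: "learned by heart"
print(clf.score(X_test, y_test))     # ~0.5: chance on new data
```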
Regularization

Regularized empirical risk minimization:
Find the model f in the hypothesis space H that minimizes the loss L on average
over the training data D, under some constraints

The constraints are meant to keep your model simple:
– Weight decay / ridge: prevents coefficients from growing too large
– Sparsity: sets some coefficients to zero (removes the corresponding features)
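
A minimal sketch contrasting the two constraints (Ridge and Lasso in scikit-learn; the implementation choice is mine):

```python
# Ridge shrinks all coefficients; Lasso sets many exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)   # 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(ridge.coef_[:5], 2))                       # small but non-zero coefficients
print(int(np.sum(lasso.coef_ == 0)), "coefficients set to zero")   # sparsity
```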
4. Evaluating & choosing a
supervised ML model
Set aside a final test set
[Diagram: full data set split into a train set and a test set]

You are not allowed to touch the test set during training
– Not when deciding which ML algorithm to use (model selection)
– Nor when fitting the model
– Nor when pre-processing the features (feature engineering, feature selection)
– Nor when removing outliers
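
A minimal sketch of the split (train_test_split from scikit-learn; the 80/20 ratio is an arbitrary illustration):

```python
# Set the test set aside first; everything else uses the train set only.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit, select features, tune, remove outliers on (X_train, y_train) only;
# touch (X_test, y_test) exactly once, for the final evaluation.
```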
Set aside a final test set

You need to choose an evaluation criterion
– Classification: accuracy, balanced accuracy, precision, recall, etc.
– Regression: RMSE, R², etc.
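
For illustration, the scikit-learn names for these criteria, on toy predictions:

```python
# Classification and regression criteria from the list above.
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score,
                             mean_squared_error, r2_score)

y_true, y_pred = np.array([0, 1, 1, 0, 1]), np.array([0, 1, 0, 0, 1])
print(accuracy_score(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))

y_true_r, y_pred_r = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])
print(np.sqrt(mean_squared_error(y_true_r, y_pred_r)))   # RMSE
print(r2_score(y_true_r, y_pred_r))                      # R²
```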
Conclusion
ML = statistics + computing

How it works under the hood
– Pick a hypothesis class (modeling)
– Minimize a loss function (optimization)
– Regularize to avoid overfitting (modeling again)

How it works as a user
– Represent your data as input vectors (or choose kernels)
(often 80% of the work)
– Decide on a few ML algorithms to try out
– Evaluate performance in an unbiased way on a left-out test set
Thanks!
