
Machine Learning:

How do computers learn?

Chloé-Agathe Azencott
CBIO, Mines ParisTech & Institut Curie
http://cazencott.info
What is learning?

Learning: acquiring a skill through experience and practice
– skill = algorithm / model
– experience = data

Machine learning: using data to build an algorithm / a model
Artificial intelligence

Reproducing, with machines, behaviours that we perceive as intelligent

Involves much more than machine learning!

Perception, reasoning, language, motion, etc.

[Diagram: deep learning ⊂ machine learning ⊂ artificial intelligence]
1. Examples of
machine learning problems
1. Supervised machine learning
Making predictions

Data + Labels → ML → Predictor
1. Supervised machine learning
Problem 1: binary classification
Example: Identification of metastases in lymph node biopsies

[Figure: biopsy slides labelled "cancer" vs. "healthy"]

Babak Ehteshami Bejnordi et al. (2017), Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer, JAMA.
1. Supervised machine learning
Problem 2: regression
Example: solubility of a molecule in ethanol
[Figure: Acetaminophen, 25 mg/mL; Aspirin, 80 mg/mL]

Chloé-Agathe Azencott et al. (2007), One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical and biological properties, Journal of Chemical Information and Modeling.
2. Unsupervised learning
Data exploration:
Better understand your data

Data → ML → Data!
2. Unsupervised learning
Problem 1: Clustering
Group similar samples together

Data → ML → clusters
2. Unsupervised learning
Problem 1: Clustering
Example: disease subtype identification

Hege G. Russnes (2017), Breast Cancer Molecular Stratification: From Intrinsic Subtypes to Integrative Clusters, The American Journal of Pathology.
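
For illustration only (the cited study uses integrative clustering of molecular data, not this toy pipeline), a minimal k-means sketch with scikit-learn on synthetic two-group data:

```python
# A minimal clustering sketch: k-means groups similar samples together.
# Synthetic data; in the subtype example, rows would be patients and
# columns molecular features. scikit-learn is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic "subtypes": points around (0, 0) and around (5, 5).
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # one cluster label per sample
```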
2. Unsupervised learning
Problem 2: Dimensionality reduction
Represent your data with fewer features

[Diagram: data matrix X (n samples × m features) → ML → representation with fewer features]
2. Unsupervised learning
Problem 2: Dimensionality reduction
Example: Project SNP data onto 2 dimensions

J. Novembre et al. (2008), Genes mirror geography within Europe, Nature.
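
In the spirit of that example, a minimal PCA sketch (scikit-learn; the genotype matrix below is a random stand-in, not real SNP data):

```python
# Dimensionality reduction sketch: project a (samples x features) matrix
# onto its first 2 principal components, as Novembre et al. did with SNPs.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 individuals x 5000 SNPs, coded 0/1/2 (random stand-in data).
X = rng.integers(0, 3, size=(200, 5000)).astype(float)

X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (200, 2): each individual now has 2 coordinates
```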
2. How to train a supervised
machine learning model
Learning a supervised learning model
1. Training data D
n samples x1, x2, …, xn and their labels y1, y2, …, yn
2. Hypothesis space H
the shape of the model f = what kind of model we can learn
3. Loss function L
L(y, f(x)) = the error made by predicting f(x) instead of y

Empirical risk minimization:
Find the model f in the hypothesis space H that minimizes the loss L on
average over the training data D
4. Optimization procedure
How to solve the empirical risk minimization problem
– Sometimes exact or as accurate as we want
– Sometimes the solution is unique, sometimes not
– Sometimes need to use heuristics
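
In symbols, with the notation just introduced (training data D of size n, hypothesis space H, loss L), empirical risk minimization reads:

```latex
% Pick the model in H with the lowest average loss over the training data.
\hat{f} \;=\; \underset{f \in \mathcal{H}}{\arg\min}\;
        \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)
```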
Linear models

ML algorithms for learning linear models:
linear/logistic regression, support vector machines

● Hypothesis space: linear models (a weighted sum of the features)
● Optimization procedure: usually fast, easy, accurate
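
A minimal sketch of fitting one such model, a logistic regression, on synthetic data (scikit-learn is my choice of library, not something the slides prescribe):

```python
# Logistic regression: a linear model whose weighted sum of the features
# is turned into a class probability. Data and weights are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                               # 100 samples, 4 features
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(int)   # labels from a linear rule

model = LogisticRegression().fit(X, y)
print(model.coef_)           # the learned weights of the weighted sum
print(model.predict(X[:3]))  # predictions for the first 3 samples
```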
Non-linear models
Idea 1: Create new features

● Map (x1, x2, …, xp) to φ(x1, x2, …, xp), then use a linear approach
● Hypothesis space: linear models of φ(x)
● Optimization procedure: same as before

Example: quadratic regression
φ(x1, x2, …, xp) = (x1, x2, …, xp, x1², x1x2, …, xp²)
a linear model in the new space = a quadratic model in the original features


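A minimal sketch of this idea: build the quadratic feature map φ explicitly with scikit-learn's PolynomialFeatures (an illustrative implementation choice), then fit an ordinary linear model:

```python
# Idea 1 in code: map x to quadratic features, fit a linear model in phi(x).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + X[:, 0] ** 2 - X[:, 0] * X[:, 1]       # a quadratic ground truth

phi = PolynomialFeatures(degree=2, include_bias=False)
X_phi = phi.fit_transform(X)                     # (x1, x2, x1^2, x1*x2, x2^2)

model = LinearRegression().fit(X_phi, y)         # linear in phi(x), quadratic in x
print(phi.get_feature_names_out())
print(np.round(model.coef_, 2))
```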
Idea 2: Use kernels

Kernel: a (non-linear) similarity between samples

Many kernels exist for biological objects that aren’t vectors:
protein sequences, SNPs, molecular graphs, etc.

Kernel trick: replace dot products (= linear similarities) with kernels,
at no added computational cost

Example: kernel support vector machines
● Hypothesis space: linear models of φ(x1, x2, …, xp),
  where φ is such that k(x, x’) = ⟨φ(x), φ(x’)⟩
● Optimization procedure: still easy and accurate

Also used for non-linear tests of independence:
– Hilbert-Schmidt Independence Criterion (HSIC)
– Sequence Kernel Association Tests (SKAT) in GWAS

see L. Ralaivola’s talk
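
For illustration, a kernel SVM on data with a circular class boundary; the RBF kernel and the library are my assumptions, the slides only name kernel SVMs in general:

```python
# Kernel SVM sketch: the RBF kernel replaces the dot product, so the
# decision boundary is non-linear in the original features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # circular class boundary

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)        # k(x, x') = exp(-gamma * ||x - x'||^2)
print(clf.score(X, y))                              # training accuracy
```

For non-vector objects (sequences, graphs), one can instead pass a precomputed kernel matrix with SVC(kernel="precomputed").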
Idea 3: Use non-linear parametric models

Artificial neural networks

● Hypothesis space: very flexible
  – given by the architecture of the model
  – non-linear, with a high number of parameters
  – each unit computes a non-linear function of a linear combination of its inputs
● Optimization procedure: no guarantee of finding the solution

see L. Ralaivola’s talk
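
To make "a non-linear function of a linear combination of the inputs" concrete, a toy forward pass in NumPy; the weights are random here, and training would adjust them to minimize the loss:

```python
# One hidden layer: each unit applies a non-linearity (tanh) to a
# linear combination of its inputs. Architecture is illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                  # one sample, 3 features

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)    # hidden layer, 4 units
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)    # output layer

hidden = np.tanh(W1 @ x + b1)   # non-linear function of a linear combination
output = W2 @ hidden + b2       # final linear combination
print(output)
```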
Idea 4: Use a tree-based hypothesis space

Decision trees

● Hypothesis space:
  – models look like: if (x1 > 0.3) and [(x3 = 1) and … or (x4 < 2.9) …] then label = y
  – categorical and quantitative features
  [Figure: example tree: Color? splits into Grey (then Horn? Yes/No) and Yellow (then Stripes? Yes/No)]
● Optimization procedure:
  – heuristic! No guarantee
● Often perform poorly on their own
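
A minimal decision-tree sketch on synthetic data (scikit-learn, an illustrative choice); export_text prints the learned if/then rules in the same shape as above:

```python
# Fit a shallow decision tree and print its if/then structure.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = ((X[:, 0] > 0.3) & (X[:, 1] < 0.5)).astype(int)   # a rule-like target

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # the learned rules
```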
Idea 4: Use a tree-based hypothesis space

Random forests

● Hypothesis space:
  – combination of many trees
  – (ensemble learning)
● Optimization procedure:
  – learn each tree independently from the others
  – combine their predictions by vote
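
A minimal random-forest sketch (scikit-learn, illustrative): many trees, each trained on a random resample of the data, combined by vote:

```python
# Random forest: an ensemble of trees whose predictions are aggregated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)    # a non-linear target

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(len(forest.estimators_))             # 100 individual trees
print(forest.predict(X[:3]))               # majority vote over the trees
```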
3. How to avoid overfitting
(≈ learning by heart)
Overfitting & generalization

The true challenge of machine learning:
learning a model that works on new data

Overfitting: when the model is specific to the training data but
doesn’t generalize to new data

Particularly likely to happen with few samples and very many features
(hello, genomics!)
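
A small demonstration of that warning (pure noise, many more features than samples; scikit-learn, illustrative): an essentially unregularized model fits the training labels perfectly yet predicts new data at chance level:

```python
# Overfitting with p >> n: 30 samples, 1000 noise features, random labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(30, 1000)), rng.integers(0, 2, 30)
X_test,  y_test  = rng.normal(size=(30, 1000)), rng.integers(0, 2, 30)

# Large C makes the model effectively unregularized.
clf = LogisticRegression(C=1e6, max_iter=5000).fit(X_train, y_train)
print(clf.score(X_train, y_train))   # ~1.0: "learned by heart"
print(clf.score(X_test, y_test))     # ~0.5: chance on new data
```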
Regularization

Regularized empirical risk minimization:
Find the model f in the hypothesis space H that minimizes the loss L on average
over the training data D, under some constraints

The constraints are meant to keep your model simple:
– Weight decay / ridge: prevents coefficients from growing too large
– Sparsity: sets some coefficients to zero (removes the corresponding features)
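
A minimal sketch contrasting the two constraints (Ridge and Lasso in scikit-learn; the implementation choice is mine):

```python
# Ridge shrinks all coefficients; Lasso sets many exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)   # 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(ridge.coef_[:5], 2))                       # small but non-zero coefficients
print(int(np.sum(lasso.coef_ == 0)), "coefficients set to zero")   # sparsity
```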
4. Evaluating & choosing a
supervised ML model
Set aside a final test set
[Diagram: full data set split into a train set and a test set]

You are not allowed to touch the test set during training
– Not when deciding which ML algorithm to use (model selection)
– Nor when fitting the model
– Nor when pre-processing the features (feature engineering, feature selection)
– Nor when removing outliers
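
A minimal sketch of the split (train_test_split from scikit-learn; the 80/20 ratio is an arbitrary illustration):

```python
# Set the test set aside first; everything else uses the train set only.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit, select features, tune, remove outliers on (X_train, y_train) only;
# touch (X_test, y_test) exactly once, for the final evaluation.
```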
Set aside a final test set

You need to choose an evaluation criterion
– Classification: accuracy, balanced accuracy, precision, recall, etc.
– Regression: RMSE, R², etc.
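
For illustration, the scikit-learn names for these criteria, on toy predictions:

```python
# Classification and regression criteria from the list above.
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score,
                             mean_squared_error, r2_score)

y_true, y_pred = np.array([0, 1, 1, 0, 1]), np.array([0, 1, 0, 0, 1])
print(accuracy_score(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))

y_true_r, y_pred_r = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])
print(np.sqrt(mean_squared_error(y_true_r, y_pred_r)))   # RMSE
print(r2_score(y_true_r, y_pred_r))                      # R²
```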
Conclusion
ML = statistics + computing

How it works under the hood
– Pick a hypothesis class (modeling)
– Minimize a loss function (optimization)
– Regularize to avoid overfitting (modeling again)

How it works as a user
– Represent your data as input vectors (or choose kernels)
(often 80% of the work)
– Decide on a few ML algorithms to try out
– Evaluate performance in an unbiased way on a left-out test set
Thanks!
