
Machine learning

Definition
• Machine learning can be broadly defined as
computational methods using experience to
improve performance or to make accurate
predictions.
Data?
• Experience refers to the past information available
to the learner, which typically takes the form of
electronic data collected and made available for
analysis.
• These data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment.
• In all cases, the quality and size of the data are crucial to the success of the predictions made by the learner.
Sample complexity
• Sample complexity evaluates the sample size required for an algorithm to learn a family of concepts.
• Theoretical learning guarantees for an
algorithm depend on the complexity of the
concept classes considered and the size of the
training sample.
Applications and problems
• Learning algorithms have been successfully deployed in a
variety of applications, including
• Text or document classification, e.g., spam detection;
• Natural language processing, e.g., morphological analysis,
part-of-speech tagging, statistical parsing, named-entity
recognition;
• Speech recognition, speech synthesis, speaker verification;
• Optical character recognition (OCR);
• Computational biology applications, e.g., protein function or
structured prediction
Contd…
• Computer vision tasks, e.g., image recognition, face
detection;
• Fraud detection (credit card, telephone) and network
intrusion;
• Games, e.g., chess, backgammon;
• Unassisted vehicle control (robots, navigation);
• Medical diagnosis;
• Recommendation systems, search engines,
information extraction systems.
Classification
• Assign a category to each item. For example,
document classification may assign each document a category such as politics, business, sports, or weather, while image classification may assign each image a category such as landscape, portrait, or animal.
• The number of categories in such tasks is often
relatively small, but can be large in some difficult
tasks and even unbounded as in OCR, text
classification, or speech recognition.
Regression
• Predict a real value for each item. Examples of
regression include prediction of stock values or
variations of economic variables.
• In this problem, the penalty for an incorrect
prediction depends on the magnitude of the
difference between the true and predicted
values, in contrast with the classification
problem, where there is typically no notion of
closeness between various categories.
Ranking
• Order items according to some criterion. Web
search, e.g., returning web pages relevant to a
search query, is the canonical ranking
example.
• Many other similar ranking problems arise in
the context of the design of information
extraction or natural language processing
systems.
Clustering
• Partition items into homogeneous regions.
Clustering is often performed to analyze very
large data sets.
• For example, in the context of social network
analysis, clustering algorithms attempt to
identify “communities” within large groups of
people.
Dimensionality reduction or
manifold learning
• Transform an initial representation of items
into a lower-dimensional representation of
these items while preserving some properties
of the initial representation.
• A common example involves preprocessing
digital images in computer vision tasks.
Objective of ML systems
• The main practical objectives of machine
learning consist of generating accurate
predictions for unseen items and of designing
efficient and robust algorithms to produce
these predictions, even for large-scale
problems.
Learning scenarios
• Supervised learning
• The learner receives a set of labeled examples
as training data and makes predictions for all
unseen points.
• This is the most common scenario associated
with classification, regression, and ranking
problems.
Unsupervised learning
• The learner exclusively receives unlabeled
training data, and makes predictions for all
unseen points.
• Since in general no labeled example is available
in that setting, it can be difficult to
quantitatively evaluate the performance of a
learner.
• Clustering and dimensionality reduction are examples of unsupervised learning problems.
Semi-supervised learning
• The learner receives a training sample consisting of
both labeled and unlabeled data, and makes
predictions for all unseen points. Semi-supervised
learning is common in settings where unlabeled
data is easily accessible but labels are expensive to
obtain.
• Various types of problems arising in applications,
including classification, regression, or ranking tasks,
can be framed as instances of semi-supervised
learning.
Transductive inference:
• As in the semi-supervised scenario, the learner receives a
labeled training sample along with a set of unlabeled test
points. However, the objective of transductive inference
is to predict labels only for these particular test points.
• Transductive inference appears to be an easier task and
matches the scenario encountered in a variety of modern
applications. However, as in the semi-supervised setting,
the assumptions under which a better performance can
be achieved in this setting are research questions that
have not been fully resolved.
On-line learning:
• In contrast with the previous scenarios, the on-line scenario involves multiple rounds in which training and testing phases are intermixed.
• At each round, the learner receives an
unlabeled training point, makes a prediction,
receives the true label, and incurs a loss.
• The objective in the on-line setting is to
minimize the cumulative loss over all rounds.
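• As a toy illustration of this protocol (not part of the original slides), the sketch below runs a simple perceptron-style learner through the rounds just described; the synthetic data stream, the labeling rule, and the 0-1 loss are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)                               # weights of a linear predictor
cumulative_loss = 0.0

for t in range(1000):
    x = rng.normal(size=5)                    # round t: an unlabeled point arrives
    prediction = np.sign(w @ x)               # the learner makes a prediction
    true_label = np.sign(x[0])                # the true label is then revealed (toy rule)
    loss = float(prediction != true_label)    # 0-1 loss incurred this round
    cumulative_loss += loss
    if loss:                                  # perceptron-style update after a mistake
        w += true_label * x

print("cumulative loss over all rounds:", cumulative_loss)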
Reinforcement learning
• The training and testing phases are also
intermixed in reinforcement learning. To
collect information, the learner actively
interacts with the environment and in some
cases affects the environment, and receives an
immediate reward for each action.
• The objective of the learner is to maximize its reward over a course of actions and interactions with the environment.
Ctd…
• However, no long-term reward feedback is
provided by the environment, and the learner
is faced with the exploration versus exploitation dilemma, since it must choose between exploring unknown actions to gain more information and exploiting the information already collected.
Active learning
• The learner adaptively or interactively collects
training examples, typically by querying an oracle
to request labels for new points.
• The goal in active learning is to achieve a
performance comparable to the standard
supervised learning scenario, but with fewer
labeled examples.
• Active learning is often used in applications
where labels are expensive to obtain, for
example computational biology applications.
What is generalization in machine
learning?
• Generalization refers to a model's ability to adapt properly to new, previously unseen data drawn from the same distribution as the one used to create the model. A common way to assess generalization is to divide a data set into a training set and a test set.
What Is Model Selection
• Model selection is the process of selecting
one final machine learning model from among a
collection of candidate machine learning models
for a training dataset.
• Model selection is a process that can be applied
both across different types of models (e.g. logistic
regression, SVM, KNN, etc.) and across models of
the same type configured with different model
hyperparameters (e.g. different kernels in an SVM).
Model selection
• For example, we may have a dataset for which we are
interested in developing a classification or regression
predictive model. We do not know beforehand which model will perform best on this problem, since this cannot be known in advance. Therefore, we fit and evaluate a suite of different models on the problem.
• Model selection is the process of choosing one of the
models as the final model that addresses the problem.
• Model selection is different from model assessment.
Model assessment
• Once a model is chosen, it can be evaluated in
order to communicate how well it is expected
to perform in general; this is model
assessment.
Data selection
• Training datasets
• Validation datasets
• Test datasets
Validation datasets
• A validation dataset is a sample of data held back
from training your model that is used to give an estimate of model skill while tuning the model's hyperparameters.
• The validation dataset is different from the test
dataset that is also held back from the training of
the model, but is instead used to give an unbiased
estimate of the skill of the final tuned model when
comparing or selecting between final models.
Ctd…
• The evaluation of a model's skill on the training dataset would result in a biased score. Therefore, the model is evaluated on the held-out sample to give an unbiased estimate of model skill. This is typically called a train-test split approach to algorithm evaluation.
Gareth James, et al., page 176, An Introduction to Statistical Learning: with Applications in R, 2013.
• Suppose that we would like to estimate the test error
associated with fitting a particular statistical learning method
on a set of observations. The validation set approach […] is a
very simple strategy for this task. It involves randomly dividing
the available set of observations into two parts, a training set
and a validation set or hold-out set. The model is fit on the
training set, and the fitted model is used to predict the
responses for the observations in the validation set. The
resulting validation set error rate — typically assessed using
MSE in the case of a quantitative response—provides an
estimate of the test error rate.
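• For reference (a standard definition, not part of the quoted passage), with true responses y_i, predictions ŷ_i, and n validation observations, the validation-set MSE is

  \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2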
Test set
•  The dataset used to evaluate the final model
performance is called the “test set”. The
importance of keeping the test set completely
separate is reiterated by Russell and Norvig in
their seminal AI textbook. They refer to using
information from the test set in any way as
“peeking”. They suggest locking the test set
away completely until all model tuning is
complete.
Stuart Russell and Peter Norvig, page 709, 
Artificial Intelligence: A Modern Approach, 2009
(3rd edition)
• Peeking is a consequence of using test-set
performance to both choose a hypothesis and
evaluate it. The way to avoid this is to really hold
the test set out—lock it away until you are
completely done with learning and simply wish to
obtain an independent evaluation of the final
hypothesis. (And then, if you don’t like the results …
you have to obtain, and lock away, a completely
new test set if you want to go back and find a better
hypothesis.)
Validation set
• Importantly, Russell and Norvig comment that
the training dataset used to fit the model can
be further split into a training set and a
validation set, and that it is this subset of the
training dataset, called the validation set, that
can be used to get an early estimate of the
skill of the model.
• Training set: A set of examples used for
learning, that is to fit the parameters of the
classifier.
• Validation set: A set of examples used to tune
the parameters of a classifier, for example to
choose the number of hidden units in a neural
network.
• Test set: A set of examples used only to assess
the performance of a fully-specified classifier.
Python code
# split data
data = ...
train, validation, test = split(data)

# tune model hyper parameters
parameters = ...
for params in parameters:
    model = fit(train, params)
    skill = evaluate(model, validation)

# evaluate final model for comparison with other models
model = fit(train)
skill = evaluate(model, test)
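• As a concrete, runnable version of this pseudocode, the following sketch uses scikit-learn (an assumption; the slides do not name a library), a synthetic dataset, and an SVM whose kernel plays the role of the hyperparameter being tuned.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# split data into train, validation, and test sets (60/20/20)
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# tune a hyperparameter (the SVM kernel) on the validation set
best_skill, best_kernel = -1.0, None
for kernel in ["linear", "rbf", "poly"]:
    model = SVC(kernel=kernel).fit(X_train, y_train)
    skill = accuracy_score(y_val, model.predict(X_val))
    if skill > best_skill:
        best_skill, best_kernel = skill, kernel

# refit with the chosen hyperparameter and evaluate once on the test set
final_model = SVC(kernel=best_kernel).fit(X_train, y_train)
print(best_kernel, accuracy_score(y_test, final_model.predict(X_test)))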
Validation Dataset Is Not Enough?
• There are other ways of calculating an unbiased (or, in the case of the validation dataset, progressively more biased) estimate of model skill on unseen data.
• One popular example is to use k-fold cross-validation to tune model hyperparameters instead of a separate validation dataset (a sketch follows the list below).
– A test set is a single evaluation of the model and has limited ability to characterize the uncertainty in the results.
– Proportionally large test sets divide the data in a way that increases bias in the performance estimates.
– With small sample sizes:
  – The model may need every possible data point to adequately determine model values.
  – The uncertainty of the test set can be considerable, to the point where different test sets may produce very different results.
– Resampling methods can produce reasonable predictions of how well the model will perform on future samples.
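• A minimal sketch of the k-fold alternative, assuming scikit-learn and an illustrative SVM hyperparameter grid (the dataset and parameter values are placeholders, not from the slides): GridSearchCV tunes hyperparameters by cross-validation on the training data, so no separate validation set is held back.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training data replaces a held-back validation set
search = GridSearchCV(SVC(), param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)   # tuning result from cross-validation
print(search.score(X_test, y_test))              # single final evaluation on the test set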
Techniques to approximate the
ideal case of model selection
• Probabilistic Measures: Choose a model via
in-sample error and complexity.
• Resampling Methods: Choose a model via
estimated out-of-sample error.
Probabilistic measures
• Probabilistic measures involve analytically
scoring a candidate model using both its
performance on the training dataset and the
complexity of the model.
Ctd…
• It is known that training error is optimistically
biased, and therefore is not a good basis for
choosing a model.
• The performance can be penalized based on
how optimistic the training error is believed to
be. This is typically achieved using algorithm-
specific methods, often linear, that penalize the
score based on the complexity of the model.
Probabilistic model selection
measures
• Four commonly used probabilistic model selection
measures include:
• Akaike Information Criterion (AIC).
• Bayesian Information Criterion (BIC).
• Minimum Description Length (MDL).
• Structural Risk Minimization (SRM).
Probabilistic measures are appropriate when using simpler linear models like linear regression or logistic regression, where the calculation of the model complexity penalty (e.g. in-sample bias) is known and tractable.
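• As a reference point (standard textbook definitions, not taken from these slides), the first two criteria can be written in terms of the model's maximized likelihood \hat{L}, its number of free parameters k, and the sample size n; lower values indicate a better trade-off between fit and complexity:

  \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}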
Resampling methods
• Resampling methods seek to estimate the performance of a
model (or more precisely, the model development process) on
out-of-sample data.
• This is achieved by splitting the training dataset into sub-train and test sets, fitting a model on the sub-train set, and evaluating it on the test set. This process may then be repeated multiple times and the mean performance across the trials is reported.
• It is a type of Monte Carlo estimate of model performance on out-of-sample data, although the trials are not strictly independent: depending on the resampling method chosen, the same data may appear multiple times in different training or test datasets.
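• A small sketch of this idea, assuming scikit-learn and an illustrative logistic regression model: repeated random train/test splits are drawn, a model is fit and scored on each, and the mean (and spread) of the scores is reported as the performance estimate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=300, random_state=0)
splitter = ShuffleSplit(n_splits=30, test_size=0.3, random_state=0)

scores = []
for train_idx, test_idx in splitter.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# mean and spread over the repeated trials approximate out-of-sample performance
print(np.mean(scores), np.std(scores))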
Resampling model selection
methods
• Random train/test splits.
• Cross-Validation (k-fold, LOOCV, etc.).
• Bootstrap
Cross-validation
• Probably the simplest and most widely used method
for estimating prediction error is cross-validation.
• An example is the widely used k-fold cross-validation
that splits the training dataset into k folds where each
example appears in a test set only once.
• Another is the leave one out (LOOCV) where the test
set is comprised of a single sample and each sample
is given an opportunity to be the test set, requiring N
(the number of samples in the training set) models to
be constructed and evaluated.
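• Each procedure is a one-liner in scikit-learn (an assumed library; the dataset and model here are placeholders): k-fold cross-validation fits k models, while LOOCV fits N models, one per held-out sample.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold: each example appears in a test fold exactly once (here k = 5)
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# LOOCV: N models, each evaluated on a single held-out sample
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())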
Attribute types
Types
Data objects
Relational data types
