
INTRODUCTION TO MACHINE LEARNING

UNIT # 8

FALL 2023 Sajjad Haider


ACKNOWLEDGEMENT

• The slides and code in this lecture are primarily taken from:
  • Machine Learning with PyTorch and Scikit-Learn by Raschka et al.
  • Principles of Data Mining by Bramer
  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Géron



TODAY’S AGENDA

• Recap of the previous lecture
• Holdout Method
• Testing vs Validation Set
• K-Fold Cross Validation
• Python Implementation



HOLDOUT METHOD (TRAIN-TEST SPLIT)

• The holdout method is what we have alluded to so far in our discussions about accuracy.
• In this method, the given data are randomly partitioned into two independent sets: a training set and a test set (see the sketch below).
• The training set is used to derive the model, whose accuracy is estimated with the test set.
• The estimate is pessimistic because only a portion of the initial data is used to derive the model.
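A minimal sketch of the holdout split in scikit-learn follows. The Iris dataset, the decision tree classifier, and the 70/30 split ratio are illustrative assumptions, not specified on this slide.

# Holdout sketch (assumed setup, not from the slides): split the data once,
# train on the training set, and estimate accuracy on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Randomly partition the data into two independent sets; stratify keeps the
# class proportions similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)                            # derive the model from the training set
print("Test accuracy:", clf.score(X_test, y_test))   # estimate accuracy on the test set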



TRAIN-TEST SPLIT

• The error rate on new cases is called the generalization error (or out-of-sample error); by evaluating your model on the test set, you get an estimate of this error.
• This value tells you how well your model will perform on instances it has never seen before.
• If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data (see the sketch below).
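A small sketch of this comparison follows, under the same kind of holdout split as above; the 1-nearest-neighbor classifier is an assumption chosen only because it memorizes the training set and makes the training/test gap easy to see.

# Sketch (illustrative setup): contrast training accuracy with test accuracy
# to detect overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# A 1-nearest-neighbor model memorizes the training set, so its training
# error is (near) zero while its generalization error can be higher.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print(f"Training accuracy: {knn.score(X_train, y_train):.3f}")  # ~1.0 by construction
print(f"Test accuracy:     {knn.score(X_test, y_test):.3f}")    # estimate of generalization
# A large gap between the two numbers indicates overfitting.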



TRAIN-TEST SPLIT PERCENTAGE

• It is common to use 70-80% of the data for training and hold out the remaining 20-30% for testing.
• However, this depends on the dataset size: if it contains 10 million instances, then holding out even 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.



LIMITATION OF TRAIN-TEST SPLIT IN PARAMETER TUNING

• However, in typical machine learning applications, we are also interested in tuning and comparing different parameter settings to further improve the performance for making predictions on unseen data.
• This process is called model selection, with the name referring to a given classification problem for which we want to select the optimal values of tuning parameters (also called hyperparameters).
• However, if we reuse the same test dataset over and over again during model selection, it will become part of our training data and thus the model will be more likely to overfit.
• Thus, using the test dataset for model selection is not a good machine learning practice.



TESTING AND VALIDATION DATASET

• A better way of using the holdout method for model selection is to separate the data into three parts: a training dataset, a validation dataset, and a test dataset (see the sketch below).
• The training dataset is used to fit the different models, and the performance on the validation dataset is then used for model selection.
• The advantage of having a test dataset that the model hasn’t seen before during the training and model selection steps is that we can obtain a less biased estimate of its ability to generalize to new data.
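One common way to produce the three-way split is to call train_test_split twice, as in the sketch below; the dataset and the roughly 60/20/20 ratio are assumptions for illustration.

# Sketch: carve off the test set first, then split the remainder into
# training and validation sets (roughly 60% / 20% / 20% here).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))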



TESTING AND VALIDATION DATASET (CONT’D)

• The figure illustrates the concept of holdout cross-validation, where we use a validation dataset to repeatedly evaluate the performance of the model after training using different hyperparameter values.



TESTING AND VALIDATION DATASET (CONT’D)

• More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set) and select the model that performs best on the validation set (see the sketch below).
• After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model.
• Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.
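The following sketch spells out this workflow; it assumes the X_train/X_val/X_test split from the previous sketch, and the max_depth grid and decision tree are illustrative choices only.

# Sketch of holdout validation: select hyperparameters on the validation set,
# retrain on training + validation, and evaluate once on the test set.
# (Assumes X_train, X_val, X_test, y_train, y_val, y_test from the previous sketch.)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

best_depth, best_val_acc = None, -np.inf
for depth in [1, 2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_train, y_train)              # fit on the reduced training set
    val_acc = model.score(X_val, y_val)      # model selection on the validation set
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Retrain the best model on the full training set (training + validation)...
X_full = np.vstack([X_train, X_val])
y_full = np.concatenate([y_train, y_val])
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=1)
final_model.fit(X_full, y_full)

# ...and only then touch the test set to estimate the generalization error.
print("Test accuracy:", final_model.score(X_test, y_test))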



A DISADVANTAGE OF THE HOLDOUT METHOD

• A disadvantage of the holdout method is that the performance estimate may be very sensitive to how we partition the training dataset into the training and validation subsets; the estimate will vary depending on which examples of the data end up in each subset.
• This can be addressed by using the cross-validation method.



CROSS VALIDATION

• Cross-validation helps to get a more accurate performance evaluation.
• K-fold cross-validation:
  • Divide the labeled dataset into k subsets (folds).
  • Repeat k times: take subset i as the test data and the rest of the subsets as the training data; train the classifier and assess the testing error on subset i.
  • Average the testing errors from the k iterations.
• Leave-one-out cross-validation:
  • Cross-validation with k = n, where n is the number of labeled examples.
K-FOLD CROSS VALIDATION

• Cross-validation helps to get a more accurate performance evaluation.
• In k-fold cross-validation, we randomly split the training dataset into k folds without replacement.
• k – 1 folds are used for model training, and the remaining fold is used for performance evaluation.
• This procedure is repeated k times.
K-FOLD CROSS VALIDATION

• We then calculate the average performance of the models based on the different, independent test folds to obtain a performance estimate that is less sensitive to the sub-partitioning of the training data compared to the holdout method.
• Typically, we use k-fold cross-validation for model tuning.
• Once we have found satisfactory hyperparameter values, we retrain the model on the complete training dataset and obtain a performance estimate using the independent test dataset.
• The rationale behind fitting a model to the whole training dataset after k-fold cross-validation is that, first, we are typically interested in a single, final model (versus k individual models), and second, providing more training examples to a learning algorithm usually results in a more accurate and robust model.
• A good standard value for k in k-fold cross-validation is 10, as empirical evidence shows (see the sketch below).
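Below is a sketch of k-fold cross-validation written out explicitly with scikit-learn's StratifiedKFold, using k = 10. The dataset, the holdout split, and the logistic regression model are assumptions for illustration, not taken from the slides.

# Sketch: 10-fold cross-validation "by hand" on the training portion of the data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_train, y_train), start=1):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[train_idx], y_train[train_idx])          # fit on k - 1 folds
    score = clf.score(X_train[val_idx], y_train[val_idx])    # evaluate on the held-out fold
    scores.append(score)
    print(f"Fold {fold:2d}: accuracy = {score:.3f}")

print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")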


LOOCV

• A special case of k-fold cross-validation is the leave-one-out cross-validation (LOOCV) method.
• In LOOCV, we set the number of folds equal to the number of training examples (k = n) so that only one training example is used for testing during each iteration, which is a recommended approach for working with very small datasets.
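A minimal LOOCV sketch using scikit-learn's LeaveOneOut splitter follows; the dataset and classifier are placeholders, not prescribed by the slide.

# Sketch: leave-one-out cross-validation (k = n), mainly useful for very small datasets.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()                      # one fold per training example
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=loo)

print("Number of folds:", len(scores))   # equals the number of examples in X
print("LOOCV accuracy: ", np.mean(scores))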



CROSS-VALIDATION IN SKLEARN

• An extremely useful feature of the cross_val_score approach is that we can distribute the evaluation of the different folds across multiple central processing units (CPUs) on our machine (see the sketch below).
• If we set the n_jobs parameter to 1, only one CPU will be used to evaluate the performances.
• However, by setting n_jobs=2, we could distribute the 10 rounds of cross-validation to two CPUs (if available on our machine), and by setting n_jobs=-1, we can use all available CPUs on our machine to do the computation in parallel.
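A short sketch of cross_val_score with parallel fold evaluation follows; the estimator and dataset are again placeholders, while cv=10 and n_jobs=-1 mirror the description above.

# Sketch: 10-fold cross-validation with cross_val_score, evaluating folds in parallel.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# n_jobs=-1 uses all available CPU cores; n_jobs=1 would use a single core.
scores = cross_val_score(estimator=clf, X=X, y=y, cv=10, n_jobs=-1)
print("CV accuracy scores:", scores)
print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")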



PIPELINE FUNCTION IN SKLEARN

• The make_pipeline function takes as input an arbitrary number of scikit-learn transformers (objects that support the fit and transform methods), followed by a scikit-learn estimator that implements the fit and predict methods.
• We can think of a scikit-learn Pipeline as a meta-estimator, or wrapper, around those individual transformers and estimators. If we call the fit method of Pipeline, the data will be passed down a series of transformers via fit and transform calls on these intermediate steps until it arrives at the estimator object (the final element in the pipeline).
• The estimator will then be fitted to the transformed training data.
• We should note that there is no limit to the number of intermediate steps in a pipeline; however, if we want to use the pipeline for prediction tasks, the last pipeline element has to be an estimator (see the sketch below).
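The sketch below chains a StandardScaler (a transformer) with a LogisticRegression (an estimator) via make_pipeline; the particular steps and dataset are illustrative assumptions, not prescribed by the slides.

# Sketch: a two-step pipeline (scaler -> classifier) built with make_pipeline.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# fit() runs fit/transform on the intermediate StandardScaler step and then
# fits the final LogisticRegression estimator on the transformed data.
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))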



PIPELINE FUNCTION IN SKLEARN (CONT’D)

• Similar to calling fit on a pipeline, pipelines also implement a predict method if the last step in the pipeline is an estimator.
• If we feed a dataset to the predict call of a Pipeline object instance, the data will pass through the intermediate steps via transform calls.
• In the final step, the estimator object will then return a prediction on the transformed data.
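Continuing the pipeline sketch above (the pipe and X_test objects are the hypothetical ones defined there), predict behaves as just described: the intermediate scaler transforms the new data before the final estimator returns the predictions.

# Sketch: predict on the fitted pipeline from the previous example.
# (Assumes `pipe` and `X_test` from the make_pipeline sketch above.)
y_pred = pipe.predict(X_test)   # StandardScaler.transform runs first, then LogisticRegression.predict
print(y_pred[:10])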



TESTING AND VALIDATION DATASET (CONT’D)

• The most important rule to remember is that both the validation set and the test set must be as representative as possible of the data you expect to use in production.
• Consider, for example, a model trained on pictures downloaded from the web but meant to be used on pictures taken with a mobile app: after training your model, if you observe that its performance on the validation set is disappointing, you will not know whether this is because your model has overfit the training set, or whether it is just due to the mismatch between the web pictures and the mobile app pictures.

