Several Model Validation Techniques in Python - by Terence Shin - Towards Data Science
I wanted to write this article because I think a lot of people tend to overlook the validation and
testing stage of machine learning. Similar to experimental design, it’s important that you spend
enough time and use the right technique(s) to validate your ML models. Model validation goes
far beyond train_test_split(), which you’ll soon find out if you keep reading!
There are several different methods that you can use to validate your ML models, which we’ll
dive into below:
1. Gradio
I came across Gradio about a month ago, and I have been a big advocate for it, and rightfully so.
It is extremely useful for a number of reasons including the ability to validate and test your
model with your own inputs.
I find Gradio incredibly useful when validating my models for the following reasons:
1. It allows me to get feedback from domain users and domain experts (who may be non-coders).
2. It takes three lines of code to implement, and it can be easily distributed via a public link.
This type of “validation” is something that I always do on top of the following validation
techniques…
2. Train/Validate/Test Split
This method is the most commonly used in model validation. Here, the dataset for the model is
split into training, validation, and the test sample. All these sets are defined below:
Training set: The dataset on which a model trains. All the learning happens on this set of data.
Validation set: The dataset used to tune the model(s) trained on the training set. This is also where the final model is chosen, to be evaluated on the test set.
Test set: The generalizability of a model is tested against the test set. It is the final stage of evaluation, as it signals whether or not the model is ready for real-life application.
The goal of this method is to check the behavior of the model on new, unseen data. The split percentages mainly depend on your project and the amount of data you have.
The following python code implements this method. The training, validation, and test set will
be 60%, 20%, and 20% of the total dataset respectively:
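A minimal sketch of such a split using scikit-learn's train_test_split (the synthetic make_blobs data stands in for your own dataset): splitting off the test set first and then carving the remainder gives the 60/20/20 proportions.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# synthetic data standing in for your own dataset
X, y = make_blobs(n_samples=100, random_state=1)

# first split off the 20% test set...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ...then carve the remaining 80% into 60% train / 20% validation
# (0.25 of the remaining 80% equals 20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```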
This method does not apply to every situation and it has its pros and cons.
Pros:
It is very simple to implement.
The execution is relatively quick in comparison to other methods.
Cons:
For small datasets, this method can decrease accuracy, since there may not be enough data points in each set.
For the evaluation metrics to be accurate, the split must be random; otherwise, they become misleading.
3. K-Fold Cross-Validation
K-fold cross-validation addresses the main shortcomings of the train/test split. With K-fold cross-validation, the dataset is split into K folds (sections), and each fold is used as the test set at some point.
For example, imagine having a 4-fold cross-validation — with four folds, the model is tested four times, where each fold serves as the test set and the remaining folds form the training set. The model's final evaluation is then simply the average of the four test results. The image below gives a clear demonstration of the process.
Image created by Author
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# X, y: your feature matrix and labels
kf = KFold(n_splits=5)
model = LogisticRegression(solver='liblinear')
acc_score = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)
    acc_score.append(accuracy_score(y_test, pred_values))
avg_acc_score = sum(acc_score) / 5
Pros:
The evaluation metrics generated by this method are far more reliable, since every observation is used for both training and testing.
Cons:
It requires more computational power, since the model must be trained and evaluated K times.
The time required is correspondingly greater.
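The whole K-fold loop can also be sketched in one call with scikit-learn's cross_val_score helper (the synthetic make_blobs data is a stand-in for your own dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic data standing in for your own dataset
X, y = make_blobs(n_samples=100, random_state=1)

# cross_val_score trains and scores the model on each of the 5 folds
scores = cross_val_score(LogisticRegression(solver='liblinear'), X, y, cv=5)
print('Mean accuracy: %.3f' % scores.mean())
```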
4. Leave-one-out Cross-Validation
Leave-one-out is a special case of K-fold cross-validation in which the training set contains all the instances of the dataset minus one data point, and the test set is the single observation left out. Suppose we have a dataset of M instances: the training set will contain M-1 of them and the test set will contain one.
This explains the name of the approach. In LOOCV, K = M, so one model is created and evaluated for each instance in the dataset. Because every instance is used in the process, this removes the need for random sampling of the data.
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut

# create dataset
X, y = make_blobs(n_samples=100, random_state=1)
# create the leave-one-out cross-validator
cv = LeaveOneOut()
# enumerate splits
y_true, y_pred = list(), list()
for train_ix, test_ix in cv.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # fit model
    model = RandomForestClassifier(random_state=1)
    model.fit(X_train, y_train)
    # evaluate model
    yhat = model.predict(X_test)
    # store
    y_true.append(y_test[0])
    y_pred.append(yhat[0])
# calculate accuracy
acc = accuracy_score(y_true, y_pred)
print('Accuracy: %.3f' % acc)
Pros:
It produces a nearly unbiased estimate of model performance, since almost the entire dataset is used for training in every iteration.
Cons:
It is the most computationally expensive version of K-fold cross-validation, as the model has to be fitted M times.
The error estimate from each single test instance varies a lot, which can result in higher variability (high variance) of the overall evaluation.
5. Stratified K-Fold Cross-Validation
Let us take a simple example of a classification problem where our machine learning model identifies a cat or a dog in an image. If we have a dataset where 70% of the pictures are cats and the other 30% are dogs, then in stratified K-fold we maintain the 70/30 class ratio in each fold.
This technique is ideal when we have a smaller dataset and have to maintain the class ratio as well. Sometimes the data is over- or undersampled to match the required criteria.
The Python implementation works as follows. Sklearn provides us with the StratifiedKFold class.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
# each of the 2 folds preserves the 50/50 class ratio of y
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Pros:
It works well for a dataset with few training examples and imbalanced data.
Cons:
It is not an ideal approach for regression problems.
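To see the stratification itself in action, here is a sketch with a synthetic 70/30 class mix (echoing the cat/dog example above; the labels and fold count are illustrative, not from the original article):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# synthetic labels: 70 samples of class 0 ("cat"), 30 of class 1 ("dog")
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5)
counts = []
for _, test_index in skf.split(X, y):
    fold = y[test_index]
    # every 20-sample test fold keeps the 70/30 mix: 14 "cats", 6 "dogs"
    counts.append(((fold == 0).sum(), (fold == 1).sum()))
print(counts)
```

With a plain (unstratified) KFold on the same sorted labels, some folds would contain only one class, which is exactly what stratification prevents.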