
Published in Towards Data Science


Terence Shin

Mar 18, 2021 · 7 min read


Several Model Validation Techniques in Python


A comprehensive guide to four popular cross validation techniques

Man photo created by karlyukav — www.freepik.com

I wanted to write this article because I think a lot of people tend to overlook the validation and
testing stage of machine learning. Similar to experimental design, it’s important that you spend
enough time and use the right technique(s) to validate your ML models. Model validation goes
far beyond train_test_split(), which you’ll soon find out if you keep reading!

But first, what is model validation?


Model validation is a method of checking how close a model’s predictions are to reality. In other
words, model validation means calculating the accuracy (or whichever evaluation metric you use)
of the model that you’re training.

There are several different methods that you can use to validate your ML models, which we’ll
dive into below:

1. Model Validation with Gradio


While this isn’t necessarily a technique, I would consider this a bonus because it can be used as
an additional validation step for almost any ML model that you create.

I came across Gradio about a month ago and have been a big advocate for it ever since, and rightfully so.
It is extremely useful for a number of reasons, including the ability to validate and test your
model with your own inputs.

I find Gradio incredibly useful when validating my models for the following reasons:

1. It allows me to interactively test different inputs into the model.

2. It allows me to get feedback from domain users and domain experts (who may be non-coders).

3. It takes about three lines of code to implement, and the interface can be easily distributed via a public link (see the sketch after this list).
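
To give a sense of what that looks like, here is a minimal sketch that wraps a trained classifier in a Gradio interface. The iris dataset, the logistic regression model, and the predict_species helper are illustrative assumptions on my part rather than anything prescribed by Gradio or by this article:

import gradio as gr
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a simple placeholder model to expose through the interface
# (the iris dataset and logistic regression are illustrative assumptions)
data = load_iris()
model = LogisticRegression(max_iter=1000).fit(data.data, data.target)

def predict_species(sepal_length, sepal_width, petal_length, petal_width):
    # Predict the class for a single, manually entered observation
    pred = model.predict([[sepal_length, sepal_width, petal_length, petal_width]])[0]
    return str(data.target_names[pred])

demo = gr.Interface(
    fn=predict_species,
    inputs=["number", "number", "number", "number"],
    outputs="text",
)

# share=True generates a temporary public link you can send to non-coders
demo.launch(share=True)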

This type of “validation” is something that I always do on top of the following validation
techniques…

2. Train/Validate/Test Split
This is the most commonly used method in model validation. Here, the dataset is
split into training, validation, and test sets. All three sets are defined below:
Training set: The dataset on which a model trains. All the learning happens on this set of data.

Validation set: This dataset is used to tune the model(s) trained on the training set. This is
also the stage at which a final model is chosen to be tested using the test set.

Test set: The generalizability of a model is tested against the test set. It is the final stage of
evaluation, as it signals whether the model is ready for real-life application or not.

The goal of this method is to check how the model behaves on new data. The dataset is
split into percentages that mainly depend on your project and the resources you have
available.

The image below gives a clear demonstration of this method.

Image created by author

The following Python code implements this method. The training, validation, and test sets will
be 60%, 20%, and 20% of the total dataset, respectively:

from sklearn.model_selection import train_test_split

# First, hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Then split off 25% of the remaining 80% (i.e. 20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
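
Once the three sets exist, the typical workflow is to fit on the training set, compare or tune candidate models on the validation set, and evaluate the final model exactly once on the test set. The sketch below continues from the split above; the logistic regression is purely an illustrative placeholder, not part of the method itself:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a candidate model on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare candidates / tune hyperparameters using the validation set
val_acc = accuracy_score(y_val, model.predict(X_val))
print('Validation accuracy: {:.3f}'.format(val_acc))

# Evaluate the chosen model exactly once on the held-out test set
test_acc = accuracy_score(y_test, model.predict(X_test))
print('Test accuracy: {:.3f}'.format(test_acc))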

This method does not apply to every situation and it has its pros and cons.

Pros:
It is very simple to implement.
The execution is relatively quick in comparison to other methods.

Cons:
For models with small datasets, this method can reduce accuracy if there are not
enough data points in each set.

For the evaluation metrics to be reliable, the split must be random; otherwise the results can be misleading.

It can cause models to overfit the validation set.

3. K-Fold Cross-Validation
K-fold cross-validation addresses many of the problems of the train/test split. With k-fold cross-validation,
the dataset is split into k folds (sections), and each fold is used as the test set at some
point.

For example, imagine having a 4-fold cross-validation set — with four folds, the model will be
tested four times, where each fold is used as the test set and the other folds are used as the
training set. Then, the model’s final evaluation is simply the average of all k tests. The image
below gives a clear demonstration of the process.
Image created by Author

Here is how you can implement it in Python with k = 5 folds:

from sklearn.datasets import load_iris
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the iris dataset as a DataFrame and separate the features from the target
data = load_iris(as_frame=True)
df = data.frame
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

k = 5
kf = KFold(n_splits=k)
model = LogisticRegression(solver='liblinear')

acc_score = []

# Train and evaluate the model once per fold
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)

    acc = accuracy_score(y_test, pred_values)
    acc_score.append(acc)

# The final evaluation is the average accuracy across the k folds
avg_acc_score = sum(acc_score) / k

print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
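
As a side note, scikit-learn’s cross_val_score can run an equivalent loop in a single call. The sketch below reuses the kf splitter and model defined above and should produce the same kind of per-fold scores:

from sklearn.model_selection import cross_val_score

# Fit and score the model on each of the k folds defined by kf
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print('accuracy of each fold - {}'.format(list(scores)))
print('Avg accuracy : {}'.format(scores.mean()))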

This method has the following pros and cons:

Pros:
The evaluation metrics generated by this method are a lot more realistic.

The overfitting problem is solved to a great extent.

This results in reduced bias.

Cons:
It takes more computational power, as the model has to be trained and evaluated k times.

Similarly, the time required is greater as well.

4. Leave-one-out Cross-Validation
Leave-one-out is a special case of k-fold cross-validation in which the training set contains every
instance of the dataset except one, and the test set contains the single observation that was left out.
If we have a dataset of M instances, the training set will have M-1 instances and the test set will have
one.

This explains the name of the approach. In LOOCV, k = M, so one model is created and
evaluated for each instance in the dataset. Since every instance is used in the process, this removes
the need for random sampling of the data.

The Python implementation is as follows:

from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# create dataset
X, y = make_blobs(n_samples=100, random_state=1)

# create loocv procedure
cv = LeaveOneOut()

# enumerate splits
y_true, y_pred = list(), list()
for train_ix, test_ix in cv.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # fit model
    model = RandomForestClassifier(random_state=1)
    model.fit(X_train, y_train)
    # evaluate model
    yhat = model.predict(X_test)
    # store
    y_true.append(y_test[0])
    y_pred.append(yhat[0])

# calculate accuracy
acc = accuracy_score(y_true, y_pred)
print('Accuracy: %.3f' % acc)
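
The same procedure can also be written more compactly by passing LeaveOneOut() as the cv argument of cross_val_score; the following is a sketch equivalent to the loop above:

from sklearn.model_selection import cross_val_score

# One random forest is fitted and scored per instance (100 fits for this dataset);
# with accuracy as the metric, the mean of the per-instance scores equals the value above
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y,
                         cv=LeaveOneOut(), scoring='accuracy')
print('Accuracy: %.3f' % scores.mean())
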
Pros:

Leave-one-out cross-validation produces a nearly unbiased estimate of model performance, since almost all of the data is used for training in every iteration.

We do not have to divide the data into random samples.

It is well suited to smaller datasets.

Cons:

It is the most computationally expensive version of k-fold cross-validation, as the model has
to be fitted M times.

It is not suitable for larger datasets.

Because each test set contains only a single instance, the per-iteration error (e.g. mean squared
error) can vary widely, which results in a high-variance overall estimate.

5. Stratified K-Fold Cross-Validation


The stratified k-fold method is an extension of simple k-fold cross-validation that is mainly
used for classification problems. Unlike in standard k-fold cross-validation, the splits here are not
purely random. Stratification ensures that each fold is representative of all strata of the data;
specifically, it aims to ensure that each class is proportionally represented in every test fold.

Let us take a simple example for a classification problem where our machine learning model
identifies a cat or a dog from the image. If we have a dataset where 70% of pictures are of cats
and the other 30% are dogs, in the stratified k-Fold, we will maintain the 70/30 ratio for each
fold.

This technique is ideal when we have smaller datasets and we have to maintain the class ratio as
well. Sometimes, the data is over or undersampled to match the required criteria.

The Python implementation works as follows. Scikit-learn provides us with the StratifiedKFold
class:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)

print(skf)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
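
To see stratification at work on imbalanced data like the 70/30 cat-vs-dog example, you can count the classes in each test fold. The snippet below uses a synthetic dataset generated with make_classification purely as a stand-in for such a dataset:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced dataset: roughly 70% class 0 and 30% class 1
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=1)

skf = StratifiedKFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    # Each test fold preserves the roughly 70/30 class ratio of the full dataset
    print('Fold {}: test class counts = {}'.format(fold, np.bincount(y[test_index])))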

The pros and cons of this method are as follows:

Pros:
It works well for a dataset with few training examples and imbalanced data.

The class ratio is preserved.

Cons:
It is not an ideal approach for regression problems.

Like other cross-validation methods, it can become computationally expensive on larger datasets.

Thanks for Reading!


In this article, we have seen different model validation techniques, each serving a different
purpose and best suited to different scenarios. Before using any of these validation
techniques, always take into account your computational resources, time constraints, and the type of
problem you are trying to solve.

As always, I wish you the best in your learning endeavors! :)

Not sure what to read next? I’ve picked another article for you:

All Machine Learning Algorithms You Should Know in 2021


Intuitive explanations of the most popular machine learning models
towardsdatascience.com
and another one!

A Complete 52 Week Curriculum to Become a Data Scientist in 2021


Learn something every week for 52 weeks!
towardsdatascience.com

Terence Shin
If you enjoyed this, follow me on Medium for more

Interested in collaborating? Let’s connect on LinkedIn

Sign up for my email list here!
