
Unit - 3

Hyper Parameter Tuning, Versioning


Hyper Parameter Tuning
Hyperparameter tuning is the process of selecting the optimal values for a machine
learning model’s hyperparameters. Hyperparameters are settings that control the learning
process of the model, such as the learning rate, the number of neurons in a neural network,
or the kernel size in a support vector machine. The goal of hyperparameter tuning is to find
the values that lead to the best performance on a given task.
In the context of machine learning, hyperparameters are configuration variables that
are set before the training process of a model begins. They control the learning process
itself, rather than being learned from the data. Hyperparameters are often used to tune the
performance of a model, and they can have a significant impact on the model’s accuracy,
generalization, and other metrics.
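For instance, in a minimal scikit-learn sketch (the estimator, dataset, and values below are illustrative assumptions, not from this text), the kernel and regularization strength of a support vector machine are hyperparameters chosen before training, while the support vectors and coefficients are parameters learned from the data:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Hyperparameters: set before training, not learned from the data.
model = SVC(kernel="rbf", C=1.0, gamma=0.1)

# Parameters (support vectors, dual coefficients, intercept) are learned here.
model.fit(X, y)
print(model.support_vectors_.shape, model.intercept_)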

Versioning
Models and code may be frequently updated to account for drift or for
experimentation. Systems must ensure that the same versions of models and code are deployed
or that they differ in deliberate ways. For debugging, developers must often identify what
specific version of the models and code has made the specific decision and might want to
retrieve or recreate that specific version. When building systems with machine-learning
components, responsible engineers usually aim to version data, ML pipeline code, models,
non-ML code, and possibly infrastructure configurations.
On terminology: Revisions refer to versions of an artifact over time, where one revision
succeeds another, traditionally identified through increasing numbers or a sequence of
commits in a version control system. Variants refer to versions of an artifact that exist in
parallel, for example, models for different countries or two models deployed in an A/B test.
Traditionally, variants are stored in branches or different files in a repository. Version is the
general term that refers to both revisions and variants. Releases are select versions that are
often given a special name and are chosen for deployment. Here, we care about all forms of
versioning.
Versioning Data-Science Code

 Versioning of code is standard practice for software developers, who have grown accustomed to committing their work incrementally, usually with meaningful commit messages, to a version control system like Git. Also, operators now commonly version containers and infrastructure configurations (the “infrastructure as code” strategy discussed in chapter Planning for Operations). The version control system tracks every single change and who has submitted it, and it enables developers to identify and retrieve any earlier revision.

 Data scientists using computational notebooks tend to be less rigorous about versioning their work. Their exploratory workflow does not align well with traditional version control practices of committing cohesive incremental steps, since there often are no obvious milestones and much code is not intended to be permanent. For example, data scientists in our bank might experiment with many different ideas in a notebook when developing a fraud detection model before committing to a specific approach. In addition, notebook environments usually store notebooks in an internal format that makes identifying and showing changes difficult in traditional version control systems, though many tools now address this, including nbdime, ReviewNB, jupyterlab-git, and most hosted notebook-style data-science platforms.
 While versioning of experimental code can be useful as backup and for tracking ideas, versioning usually becomes important once models move into production. This is also often a time when data-science code is migrated from notebooks to well-maintained pipelines (see chapter ML Pipeline Quality), for example, when we decide to first deploy our fraud detection model as part of an A/B experiment. At this point, the code should be considered as any other production code, and standard version control practices should be used.
 When versioning pipeline code, it is important to also track versions of involved frameworks or libraries to ensure reproducible executions. To version library dependencies, the common strategies are to (1) use a package manager and declare dependencies with pinned versions (e.g., requirements.txt) or (2) version the copied code of all dependencies together with the pipeline. Optionally, it is possible to package all learning code and dependencies into versioned virtual execution environments, like Docker containers, to ensure that environment changes are also tracked. Building the code or containers in a continuous integration environment ensures that all necessary dependencies are declared.
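As a small illustration of strategy (1), a requirements.txt with pinned versions might look like the following (the package names and version numbers are hypothetical examples, not taken from this text):

numpy==1.26.2
pandas==2.1.4
scikit-learn==1.3.2
dvc==3.30.1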

Reproducing Study

Tuning Hyperparameters with Reproducible Experiments

When we are starting to build a new machine learning model and deciding on the model architecture, there are a number of issues that arise. We have to monitor the code changes we make, note any differences in the data we've used for training, and keep up with hyperparameter value updates.

Being able to track all of these changes is important so that we can reproduce our experiments without wondering which changes gave us the best model. We can go back to any point in the experimentation process to see which changes gave the best results.

Below is an example of hyperparameter tuning with reproducibility using DVC.

Background on Hyperparameters

Hyperparameters are the values that define your model. This includes things like the number of layers in a neural network or the learning rate for gradient descent. These values are different from model parameters because they cannot be obtained by training the model; instead, they are used to create the model we train with.

Optimizing these values means running training steps for different kinds of models to see how accurate the results are. We can get the best model by iterating through different hyperparameter values and seeing how they affect our accuracy. That's why we do hyperparameter tuning. There are a couple of common methods that we'll do some code examples with: grid search and random search.
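Before the DVC-based examples below, here is a minimal scikit-learn sketch of the two methods (the estimator, dataset, and parameter ranges are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {"n_estimators": [250, 300, 350], "min_samples_split": [8, 16, 32]}

# Grid search: tries every combination in param_grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3).fit(X, y)

# Random search: samples a fixed number of combinations from the same ranges.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=5, cv=3, random_state=0).fit(X, y)
print(grid.best_params_, rand.best_params_)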

Tuning with DVC

Let's start by talking about DVC a bit because we'll be using it to add reproducibility to our
tuning process. This is the tool we'll be using to track changes in our data, code, and
hyperparameters.

With DVC, we can add some automation to the tuning process and be able to find and restore
any really good models that emerge.

A few things DVC makes easier to do:

 Letting us make changes without worrying about finding them later
 Onboarding other engineers to a project
 Sharing experiments with other engineers on different machines

For hyperparameter tuning, this means we can play with their values without losing track of which changes made the best model and also have other engineers take a look. We'll do an example of this with grid search in DVC first.

Working with a DVC project


First, create and activate a virtual environment with a command similar to this.

$ python -m venv .venv

After we have cloned the repo, install all of the dependencies with this command.
$ pip install -r requirements.txt
We should be able to open our terminal and run an experiment with the following command.

$ dvc exp run

This will trigger the training process to run, and it will record the ROC-AUC of our model.
We can check out the results of our experiment with the following command.
$ dvc exp show --no-timestamp --include-params train.n_est,train.min_split

We're adding a few options here to make the table view clearer. We aren't showing
timestamps and we're only looking at two hyperparameter values. We can run dvc exp
show without the options to see the entire table. This will produce a table similar to this.

Hyperparameter tuning with grid search

Now that we have seen how to run an experiment, we're going to write a small script to automate grid search for us using DVC. Using grid search in hyperparameter tuning means we have an exhaustive list of hyperparameter values we want to cycle through. Grid search will cover every combination of those hyperparameter values.

We'll do this by creating queues. A queue is how DVC allows us to create experiments that
won't be run until later. That way we can cycle through multiple hyperparameters quickly
instead of manually updating a config file with new hyperparameter values for each
experiment run. The command syntax for creating queues looks like this:

$ dvc exp run --queue --set-param train.min_split=8

In the example queue above, we're updating the train.min_split value that's inside of
the params.yaml file. This file holds all of the hyperparameter values and is where DVC
looks to determine if any values have changed. With the command above, we're
automatically updating that value in the params.yaml using a queued experiment.
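For reference, the params.yaml for this project might look something like this (a sketch with illustrative starting values; only the two hyperparameters used in this example are shown):

train:
  n_est: 250
  min_split: 8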

Now we can make the script. We can add a new file to the src directory called grid_search.py.
Inside of the file, add the following code.

import itertools
import subprocess

# Automated grid search experiments
n_est_values = [250, 300, 350, 400, 450, 500]
min_split_values = [8, 16, 32, 64, 128, 256]

# Iterate over all combinations of hyperparameter values.
for n_est, min_split in itertools.product(n_est_values, min_split_values):
    # Execute "dvc exp run --queue --set-param train.n_est=<n_est> --set-param train.min_split=<min_split>".
    subprocess.run(["dvc", "exp", "run", "--queue",
                    "--set-param", f"train.n_est={n_est}",
                    "--set-param", f"train.min_split={min_split}"])

This is a simple grid search. We have two hyperparameters we want to tune: n_est and min_split. So we have lists with a few values in them to mimic the exhaustive search a grid search can handle. Then we loop through the values and create queued experiments for them using subprocess.

We can run this script now and generate our queue with this command.

$ python src/grid_search.py

We'll see some output in the terminal telling us that our experiments have been queued. Then we can run them all with the following command.

$ dvc exp run --run-all

This will run every experiment that has been queued. Once all of those have run, we can take a look at the metrics for each experiment.

$ dvc exp show --include-params=train.min_split,train.n_est --no-timestamp

Our table should look similar to this when we run the command above. We've included the --include-params and --no-timestamp options to give us a table that's easier to read.


Now we can see how the precision changed with each hyperparameter value update. This is a quick implementation of grid search in DVC. We could read the hyperparameter values from a different file or data source, or make this tuning script as fancy as we like. The main thing we need is for the dvc exp run --queue --set-param <param> command to execute when we add new values.

Random search

Another commonly used method for tuning hyperparameters is random search. This takes
random values for hyperparameters and builds the model with them. It usually takes less time
than an exhaustive grid search and it can perform better if run for a similar amount of time as
a grid search.

We're going to add an example of random search in a new file called random_search.py, similar to the file we created for grid search. This will add queued experiments with the randomly selected hyperparameter values. Add the following code to random_search.py.
import subprocess
import random

# Automated random search experiments
num_exps = 10
random.seed(0)

for _ in range(num_exps):
    params = {
        "rand_n_est_value": random.randint(250, 500),
        "rand_min_split_value": random.choice([8, 16, 32, 64, 128, 256])
    }
    subprocess.run(["dvc", "exp", "run", "--queue",
                    "--set-param", f"train.n_est={params['rand_n_est_value']}",
                    "--set-param", f"train.min_split={params['rand_min_split_value']}"])

This search could be far more complex with Bayesian optimization to handle the
hyperparameter value selections, but we're keeping it super simple by choosing random
numbers to focus on reproducibility. This will generate ten experiments with random values
for each hyperparameter.

We can run these new experiments with dvc exp run --run-all and then take a look at the
results with dvc exp show --include-params=train.min_split,train.n_est --no-timestamp. Our
table should look something like this.
This shows the difference in the randomly selected values and the values from grid search.
You might find a better value with random search because it jumps around a range of values
which might hit the optimum faster than it would with a grid search.

Conclusion

With the comparison between grid search and random search, we can see how reproducibility helps us find the best model for our project. We're able to see all of the hyperparameter changes and code changes that created each model.

This gives us the ability to fine-tune the model, because we can go to any experiment and resume training with different values, code, or data.
Machine Learning Metrics

Evaluating our machine learning algorithm is an essential part of any project. Our model may give us satisfying results when evaluated using one metric, say accuracy_score, but may give poor results when evaluated against other metrics such as logarithmic_loss or any other such metric. Most of the time we use classification accuracy to measure the performance of our model; however, it is not enough to truly judge our model. Different types of evaluation metrics are:

Classification Accuracy
Logarithmic Loss
Confusion Matrix
Area under Curve
F1 Score
Mean Absolute Error
Mean Squared Error

Classification Accuracy

Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples. It works well only if there are an equal number of samples belonging to each class.
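In symbols, this standard definition can be written as:

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of input samples}} \]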

For example, consider that there are 98% samples of class A and 2% samples of class B in our training set. Then our model can easily get 98% training accuracy by simply predicting every training sample as belonging to class A.

When the same model is tested on a test set with 60% samples of class A and 40% samples of class B, the test accuracy drops to 60%. Classification accuracy looks great in such cases but gives us a false sense of achieving high accuracy.

The real problem arises when the cost of misclassifying the minority class samples is very high. If we deal with a rare but fatal disease, the cost of failing to diagnose the disease of a sick person is much higher than the cost of sending a healthy person for more tests.

Logarithmic Loss

Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well for multi-class classification. When working with Log Loss, the classifier must assign a probability to each class for all the samples. Suppose there are N samples belonging to M classes; then the Log Loss is calculated as below:

\[ \text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \]

where,

y_ij indicates whether sample i belongs to class j or not

p_ij indicates the probability of sample i belonging to class j

Log Loss has no upper bound, and it exists on the range [0, ∞). A Log Loss nearer to 0 indicates higher accuracy, whereas a Log Loss further from 0 indicates lower accuracy. In general, minimising Log Loss gives greater accuracy for the classifier.
Confusion Matrix

Confusion Matrix, as the name suggests, gives us a matrix as output and describes the complete performance of the model. Let’s assume we have a binary classification problem. We have some samples belonging to two classes: YES or NO. Also, we have our own classifier which predicts a class for a given input sample. On testing our model on 165 samples, we get the following result.
Confusion Matrix

There are 4 important terms:

 True Positives: The cases in which we predicted YES and the actual output was also YES.
 True Negatives: The cases in which we predicted NO and the actual output was NO.
 False Positives: The cases in which we predicted YES and the actual output was NO.
 False Negatives: The cases in which we predicted NO and the actual output was YES.

Accuracy for the matrix can be calculated by dividing the sum of the values lying across the “main diagonal” (the correct predictions) by the total number of samples, i.e.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Confusion Matrix forms the basis for the other types of metrics.
Area Under Curve

Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for binary classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before defining AUC, let us understand a few basic terms:

 True Positive Rate (Sensitivity): True Positive Rate is defined as TP / (FN + TP). It corresponds to the proportion of positive data points that are correctly considered as positive, with respect to all positive data points.

 True Negative Rate (Specificity): True Negative Rate is defined as TN / (FP + TN). It corresponds to the proportion of negative data points that are correctly considered as negative, with respect to all negative data points.

 False Positive Rate: False Positive Rate is defined as FP / (FP + TN). It corresponds to the proportion of negative data points that are mistakenly considered as positive, with respect to all negative data points.

False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are both computed at varying threshold values such as (0.00, 0.02, 0.04, …, 1.00) and a graph is drawn. AUC is the area under the curve obtained by plotting True Positive Rate against False Positive Rate at these different thresholds in [0, 1].

As is evident, AUC has a range of [0, 1]. The greater the value, the better the performance of our model.

F1 Score
F1 Score is used to measure a test’s accuracy. F1 Score is the harmonic mean of precision and recall. The range for F1 Score is [0, 1]. It tells us how precise our classifier is (how many instances it classifies correctly), as well as how robust it is (whether it misses a significant number of instances). High precision but low recall gives an extremely accurate classifier, but one that misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of our model. Mathematically, it can be expressed as:

\[ F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

F1 Score tries to find the balance between precision and recall.

 Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier.

 Recall: It is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
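In terms of the confusion-matrix counts defined earlier, these are:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} \]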

Mean Absolute Error

Mean Absolute Error is the average of the absolute differences between the original values and the predicted values. It gives us a measure of how far the predictions were from the actual output. However, it doesn’t give us any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data. Mathematically, it is represented as:

\[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \]

where y_i is the original value and ŷ_i is the predicted value for sample i, out of N samples.

Mean Squared Error

Mean Squared Error (MSE) is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the squares of the differences between the original values and the predicted values. The advantage of MSE is that it is easier to compute the gradient, whereas Mean Absolute Error requires complicated linear programming tools to compute the gradient. As we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller errors, hence the model can now focus more on the larger errors.
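Analogously to MAE above, the standard definition (with the same notation) is:

\[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \]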
Machine Learning Model Versioning
It is crucial to understand version control to appreciate model versioning.

Version control
It is the process of tracking and managing modifications in software code or ML systems and
it is an essential part of maintaining a detailed record of changes to a system, enabling data
science teams to revert to previous (favourable) versions and collaborate effectively. Model
versioning, on the other hand, is a specific type of version control focused on tracking changes
made to the ML model in a machine learning system. By versioning the model, teams can
maintain a complete history of changes made to the model, enabling them to reproduce results,
debug issues, and collaborate effectively. In addition, model versioning can track datasets,
metrics, hyperparameters, algorithms, and artifacts to ensure transparency and accuracy in the
ML development process.

Importance of version control in ML development

Due to the iterative nature of ML model development lifecycles, continuous modifications to various components of the model, data, or code are a common occurrence. To track and manage these changes, machine learning model versioning plays a crucial role in creating simple, iterative, and retrievable records of these modifications. Here are some key benefits:

Collaboration: If you’re a solo researcher, this might not be important. When you work with
a team and your project is complex, it becomes very difficult to collaborate without a version
control system.

Versioning: While making changes, the model can break. With a version control system, you
get a changelog which will be helpful when your model breaks and you can revert your
changes to get back to a stable version.

Reproducibility: By taking snapshots of the entire machine learning pipeline, you make it possible to reproduce the same output again, even with the trained weights, which saves the time of retraining and testing.

Dependency tracking: Tracking different versions of the datasets (training, evaluation, and development) as well as the model hyperparameters and parameters. By using version control, you can test more than one model on different branches or repositories, tune the model parameters and hyperparameters, and monitor the accuracy of each change.

Model Updates: Model development is not done in one step; it works in cycles. With the help of version control, you can control which version is released while continuing the development for the next release.

Overview of version control systems

A Version Control System (VCS) is a software tool that enables developers to track and manage changes to source code, data, or models. By programmatically versioning files and projects, these tools help data scientists reduce the burden of manual versioning and enable team collaboration. Also, these tools reduce the likelihood of a single point of failure compared to manual versioning, where all changes can be lost if your disk gets damaged.

There are three primary types of version control systems:

 Local Version Control Systems (LVCS)

 Centralized Version Control Systems (CVCS)

 Distributed Version Control Systems (DVCS)

Local Version Control System (LVCS):

It involves creating a database of changes to files and directories in the project, allowing you
to revert to previous versions of the project in case of any issues. The system stores a complete
copy of the project in the local database, which allows the team to work offline without the
need for a network connection. It can lead to a single point of failure since it has inadequate backup capabilities.
A diagram illustrating a local version control system

Centralized Version Control System (CVCS):

It stores the code in a central repository, and collaborators work on a local copy of the code.
Changes are made to the local copy, and then the changes are committed to the central
repository. Examples of CVCS include Subversion (SVN), and Perforce.

A diagram illustrating a centralized version control system


Distributed Version Control System (DVCS):

It also stores the code in a central repository, but each collaborator has a local copy of the
entire repository, enabling them to work offline and commit changes to the local repository.
Changes can then be merged into the central repository as needed. Examples of DVCSs include Git, Mercurial, and Bazaar. Git is a version control system, while GitHub is a cloud-based hosting service that helps you manage Git repositories.

A diagram illustrating a distributed version control system.

Machine Learning Model versioning with Git

Git is the most popular version control system used by developers and data scientists alike: it has a very reliable workflow, it is massively supported by most third-party platforms like GitHub and GitLab, and distributed version control systems have seen immense adoption by the vibrant development community. Here is an example of using Git for model versioning on your local machine. We can also connect to a GitHub account to interact with the Git repository. This is a general example showing the idea behind model versioning with Git.
## initialize a new Git repository in your project directory
git init

## create a new PyTorch model and save it to a file
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = self.pool(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)
        return x

model = MyModel()
torch.save(model.state_dict(), 'model.pth')

## add the model file to the Git repository and commit the changes
git add model.pth
git commit -m "Initial version of PyTorch model"

## train the model on some data and save the new version
# Load some data and train the model
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for i, (images, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# Save the new version of the model
torch.save(model.state_dict(), 'model_v2.pth')

## Add the new model version to the Git repository and commit the changes
git add model_v2.pth
git commit -m "Trained model on some data"

What needs to be versioned?

Apart from versioning the code and data in an ML development lifecycle, the model and the environment should also be versioned for reproducibility and model optimization.

Models:

 Model architecture or algorithm, hyperparameters (batch size, learning rate, epochs, etc.), and
weights can be versioned.
 Model evaluation metrics and results for each version, including test accuracy and other
relevant performance indicators, should be documented. This enhances the model’s
explainability and performance throughout experimentation.

Training and Deployment environments:

 The configurations used for training and deploying models should be versioned. These
configurations include dependencies like libraries and packages to ensure training and
deployment environment consistency.

 The deployment scripts used to deploy the model can be versioned to enable the
reproducibility of the deployment process.

 The dependencies required for the deployment environment, such as the operating system,
runtime libraries, and software packages, can be versioned to ensure the consistency of the
deployment environment.

Types of Model Versions

In order to ensure consistency and clarity in model versioning, different types of version
numbers are used to indicate the scope of the changes made to a model. These types of version
numbers are typically broken down into

 Major version

 Minor version

 Patch version

Major Version: A major version indicates a significant change that could impact the performance or functionality of your model. Typically, a major version update involves a significant change in your model’s architecture, algorithms, or training data. A major version can also introduce new features or capabilities. Major versions are typically denoted by incrementing the first digit in the version number.

Minor Version: A minor version indicates a smaller change that typically does not
significantly affect the model’s performance or functionality. For example, a minor version
update could involve a bug fix, a small optimization, or a new feature that does not
fundamentally alter the model’s behaviour. Minor versions are typically denoted by
incrementing the second digit in the version number.

Patch Version: A patch version indicates a small change or bug fix that is made to a specific
version of the model. Patch versions are typically denoted by incrementing the third digit in
the version number.

Components of version numbering

Versioning schemes

Choosing a versioning scheme is an essential step toward efficient collaboration in ML development. Failure to do so could result in confusion and time wastage, especially for larger projects. Here are some versioning schemes to consider when developing machine learning models:

 Semantic Versioning:
Semantic versioning is a widely used versioning scheme that uses a three-part version number
consisting of major, minor, and patch versions. The version number is typically written in the
format “major.minor.patch”. This scheme is often used for software libraries, frameworks, and
APIs.

 Calendar Versioning:
Calendar versioning is a versioning scheme that uses the date of release as the version number.
For example, a model released on January 1st, 2022 would have a version number of
2022.01.01. This scheme is often used for data science projects where the focus is on tracking
changes over time.

Components of calendar versioning

 Sequential Versioning:
Sequential versioning is a versioning scheme that uses a simple sequential numbering system
to track versions. Each new version is assigned the next available number in the sequence
(e.g., 1, 2, 3, etc.). This scheme is often used for small projects or individual models.

 Git Commit Hash:


This scheme involves versioning based on the Git commit hash, which is a unique identifier
for every commit made to a code repository. It allows for precise tracking of changes made to
the model and its associated code. For example, ‘4dfc13a’, ‘8c2fb85’, etc for each commit
made to the repository. It is commonly used for collaborative development on machine
learning projects, where multiple developers may be making changes to the codebase at the
same time.

Git commit hash


Challenges and Considerations

While model versioning is an essential part of machine learning development, it comes with
its own set of challenges. As models become more complex and teams grow, it becomes
increasingly important to manage the versioning process effectively. Here are a few
considerations that should be made depending on the size of your project:

 Data management and versioning

 Scalability and resource requirements

 Integrating with existing workflows and systems

Data management and versioning:

One major challenge in model versioning is data management and versioning. Keeping track
of changes in data used to train models is crucial as it can have a significant impact on model
performance. However, managing large datasets and tracking changes to them can be
challenging, especially when dealing with distributed datasets across multiple machines. It is
essential to establish proper protocols and workflows for data versioning to ensure data
consistency, reliability, and easy tracking of changes.

Scalability and resource requirements:

Another challenge in model versioning is scalability and resource requirements. As models grow in complexity, their versioning requirements can become increasingly complex, and storing multiple versions of models can quickly become resource-intensive. It is crucial to consider the scalability and resource requirements of model versioning systems when implementing them to ensure they can handle increasing model complexity and volume.

Integrating with existing workflows and systems:

Integrating model versioning with existing workflows and systems can also pose a challenge.
Different teams may have different tools and workflows, and integrating model versioning
into these can require significant effort. It is essential to choose a model versioning system that
can integrate seamlessly with existing tools and workflows to minimize disruption and ensure
smooth collaboration across teams.
Machine Learning Model Versioning with DVC

DVC lets us connect with storage providers like AWS S3, Microsoft Azure Blob
Storage, Google Drive, Google Cloud Storage, HDFS, etc., to store ML models and
datasets.
ML Experiment Management

It provides easy navigation and automatic metric tracking for experiments.

Deployment and Collaboration

DVC introduces pipelines that help in the easy bundling of ML models, data, and code into production, remote machines, or a colleague’s computer.

DVC can be installed from the PyPI repository using the following command line:

pip install dvc

Depending on the type of remote storage that will be used, we have to install optional dependencies: [s3], [gdrive], [gs], [azure], [ssh], [hdfs], [webdav], [oss]. Use [all] to include them all. Here, we will be using Google Drive as remote storage, so we run

pip install dvc[gdrive]

to install the gdrive dependencies.

Getting Started

We will see how to use DVC for tracking data and ML models with Google Drive as remote storage. Imagine a Git repository which contains the following structure:

Git-Repository (Sample Structure)

code
data
models
utils
Gdrive Remote Configuration

Now, we need to configure the Google Drive remote storage. Go to your Google Drive and create a folder called dvc_storage in it. Open the folder dvc_storage. Get the folder-id of the dvc_storage folder from the URL:
https://drive.google.com/drive/folders/folder-id
Now, use the following command to use the dvc_storage folder created in Google Drive as remote storage:
dvc remote add myremote gdrive://folder-id
Now, we need to commit the changes to the Git repository by using the commands:
git add -A
git commit -m "configure dvc remote storage"
To push the data to remote storage, we use the following command:
dvc push
Then, we push the changes to Git using the command:
git push
To pull data from DVC remote storage, we can use the following command:
dvc pull
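Note that dvc push only uploads files that DVC is already tracking. A minimal sketch of tracking the data and models directories from the sample structure above (assuming the repository has not yet been initialized for DVC):

dvc init
dvc add data models
git add data.dvc models.dvc .gitignore
git commit -m "track data and models with dvc"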

DVC Pipelines

We can make use of DVC pipelines to reproduce the workflows in our repository. The main advantage of this is that we can go back to a particular point in time and run the pipeline to reproduce the same result that we had achieved at that previous time. There are different stages in the DVC pipeline, like prepare, train, and evaluate, with each of them performing different tasks. The DVC pipeline is nothing but a DAG (Directed Acyclic Graph). In this DAG, there are nodes and edges, with nodes representing the stages and edges representing the direct dependencies. The pipeline is defined in a YAML file (dvc.yaml). A simple dvc.yaml file is as follows:


stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - src/cleanup.sh
      - data/raw
    outs:
      - data/clean.csv
  train:
    cmd: python src/model.py data/clean.csv
    deps:
      - src/model.py
      - data/clean.csv
    outs:
      - data/predict.dat
  evaluate:
    cmd: python src/evaluate.py data/predict.dat
    deps:
      - src/evaluate.py
      - data/predict.dat

Use the prepare stage to run the data cleaning and pre-processing steps. Use
the train stage to train the machine learning model using the data from the prepare stage.
The evaluate stage uses the trained model and predictions to provide different plots and
metrics.
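Once dvc.yaml is committed, the whole pipeline can be reproduced end to end with a single command; DVC re-runs only the stages whose dependencies have changed:

dvc repro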

Model Management Framework

A machine learning model management framework helps data scientists and engineers more efficiently manage the end-to-end machine learning lifecycle. The framework provides a centralized repository for storing, sharing, and tracking machine learning models and metadata. The repository can be used to store both code and model artifacts, and it provides a web interface for accessing models and training results. The framework also includes tools for automated model deployments, monitoring, and versioning. These tools help data scientists track the performance of their models in production and quickly roll back changes if necessary.

Why this framework is needed

A machine learning model management framework helps organizations more effectively manage the lifecycle of their models. This framework is needed because current approaches to model management are often ad hoc and manual, which can lead to inefficiencies and errors. The framework provides a more structured and automated approach that can help organizations better manage their models from development to deployment and beyond.

How the framework works

The framework is designed to work with any machine learning model, allowing users to easily manage and deploy models. The framework includes a set of tools that allow users to train, test, and deploy models. It also includes a set of APIs that allow developers to easily integrate the framework into their applications.

Benefits of using the framework

The machine learning model management framework offers several benefits for users,
including the ability to:

 Easily track machine learning models throughout their entire lifecycle, from training
to deployment
 Efficiently manage large numbers of models
 Automate key tasks such as model retraining and performance monitoring
 Share models and collaborate with other users in a secure and controlled manner.

How to get started with the framework

The Machine Learning model management framework is a toolkit that helps you manage
your machine learning models throughout their lifecycle, from development to production.
The framework includes tools for training, tuning, and deploying machine learning models.

The goal of the Machine Learning model management framework is to make it easy to
manage machine learning models at scale. The framework is designed to work with any
machine learning platform and any type of machine learning model.

To get started with the Machine Learning model management framework, you’ll need to
install the following:
– Python 3.5 or higher
– TensorFlow 1.12 or higher
– The latest version of the Machine Learning model management framework package

Studio ml setup

Studio is a model management framework written in Python to help simplify and expedite your model building experience. It was developed to minimize the overhead involved with scheduling, running, monitoring and managing artifacts of your machine learning experiments. Setting up a machine learning (ML) studio involves creating an environment where you can develop, train, and deploy machine learning models efficiently. Here's a step-by-step guide to setting up an ML studio:

1. Choose Your Development Environment:

 Decide whether you want to work in a cloud-based environment or on your local machine.
 Cloud-based options include platforms like Google Colab, Kaggle Kernels, or AWS SageMaker.
 For a local setup, you can use Jupyter Notebook, JupyterLab, or IDEs like PyCharm or Visual Studio Code.

2. Install Python and Required Libraries:

 Install Python, preferably using a distribution like Anaconda, which comes with
popular data science libraries pre-installed.
 Install essential libraries such as NumPy, pandas, scikit-learn, TensorFlow, PyTorch,
or any other libraries you plan to use for ML development.

3. Set Up Version Control:

 Use version control systems like Git to manage your ML projects and collaborate with
others.
 Create a Git repository for your projects and initialize it with a README file.
4. Data Management:

 Organize your data effectively by setting up directories for raw data, processed data,
and datasets used in specific projects.
 Use data versioning tools like DVC (Data Version Control) to track changes to your
data and ensure reproducibility.

5. Notebook Management:

 Use Jupyter Notebook or JupyterLab to create interactive notebooks for experimentation and analysis.
 Organize your notebooks into directories and use descriptive filenames to keep track of experiments.

6. Experiment Tracking:

 Use experiment tracking tools like MLflow or Neptune to log parameters, metrics, and
artifacts from your experiments.
 These tools help you keep track of experiments, compare results, and reproduce
experiments later.

7. Model Training and Deployment:

 Set up environments for model training and deployment.


 Experiment with different algorithms, hyperparameters, and architectures to train
models.
 Deploy trained models to production environments using frameworks like TensorFlow
Serving, TorchServe, or cloud platforms like AWS, Azure, or Google Cloud.

8. Monitoring and Maintenance:

 Implement monitoring solutions to track model performance, data drift, and model
drift in production.
 Continuously monitor and update models to ensure they remain accurate and reliable
over time.
9. Documentation and Communication:

 Document your workflows, experiments, and findings to share with collaborators and
stakeholders.
 Use tools like Markdown, Jupyter Notebooks, or wiki pages to create documentation.

10. Security and Compliance:

 Implement security best practices to protect sensitive data and models.


 Ensure compliance with regulations such as GDPR, HIPAA, or industry-specific
standards.

11. Continuous Integration and Deployment (CI/CD):

 Set up CI/CD pipelines to automate testing, building, and deployment of ML models.


 Integrate testing frameworks to ensure code quality and model performance before
deployment.

By following these steps, we can create a robust and efficient ML studio environment to
develop, train, and deploy machine learning models effectively. Adjust the setup according to
your specific requirements and preferences.
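As a concrete illustration of the experiment-tracking step above (step 6), a minimal MLflow sketch might look like the following (the parameter names, metric value, and artifact path are hypothetical examples):

import mlflow

with mlflow.start_run():
    # Log the hyperparameters used for this experiment.
    mlflow.log_param("n_estimators", 300)
    mlflow.log_param("min_samples_split", 8)

    # ... train and evaluate the model here ...

    # Log the resulting metric and any output files.
    mlflow.log_metric("roc_auc", 0.91)
    mlflow.log_artifact("model.pth")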

Machine Learning Model Creation


Creating a machine learning model involves several steps, from gathering and preprocessing
data to training and evaluating the model's performance. Here's a high-level overview of the
machine learning model creation process:
Problem Definition: Clearly define the problem you want to solve with machine learning.
Identify the type of problem (e.g., classification, regression, clustering) and the target
variable you want to predict or infer.
1. Data Collection: Gather relevant data from various sources, such as databases, APIs, or
files. Ensure that the data is representative, diverse, and sufficient for training a robust model.
2. Data Preprocessing:
• Data Cleaning: Handle missing values, outliers, and errors in the data. This may
involve imputation, removal, or correction of data points.
• Feature Engineering: Create new features or transform existing features to improve
the model's predictive power. This includes tasks like encoding categorical variables, scaling
numerical features, and extracting relevant information from raw data.
• Data Splitting: Split the data into training, validation, and test sets to evaluate the
model's performance and prevent overfitting.
3. Model Selection: Choose an appropriate machine learning algorithm or model architecture
based on the problem type, data characteristics, and performance requirements. Consider
factors such as interpretability, scalability, and computational complexity.
4. Model Training: Train the selected model using the training data. During training, the
model learns patterns and relationships in the data to make predictions or classifications. This
process involves adjusting model parameters to minimize a chosen loss function.
5. Model Evaluation:
• Validation: Assess the model's performance on the validation set to tune
hyperparameters and prevent overfitting.
• Test: Evaluate the final model on the test set to estimate its generalization
performance. Measure metrics such as accuracy, precision, recall, F1-score, or mean squared
error, depending on the problem type.
6. Model Optimization: Fine-tune the model's hyperparameters and architecture to improve
performance. This may involve techniques like grid search, random search, or Bayesian
optimization.
7. Model Interpretation:
• Understand the model's decision-making process and interpret its predictions or
classifications.
• Visualize important features, coefficients, or decision boundaries to gain insights into
the model's behaviour and identify factors driving predictions.
8. Deployment:
• Deploy the trained model into production to make predictions on new, unseen data.
• Implement the necessary infrastructure, APIs, and integration pipelines for accessing
and using the model in real-world applications.
9. Monitoring and Maintenance: Continuously monitor the deployed model's performance,
drift, and reliability in production. Update the model as needed to adapt to changing data
distributions or requirements.
10. Documentation and Communication: Document the entire model creation process,
including data sources, preprocessing steps, model architecture, training procedures, and
evaluation results. Communicate findings, insights, and limitations to stakeholders
effectively.
By following these steps, you can create and deploy machine learning models that effectively
address real-world problems and deliver actionable insights.
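To make the workflow above concrete, here is a minimal end-to-end sketch in scikit-learn covering data splitting, training, and evaluation (the dataset and model choice are illustrative assumptions, not prescribed by this text):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Data collection and splitting (steps 1 and 2).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection and training (steps 3 and 4).
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Model evaluation (step 5).
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
print("f1:", f1_score(y_test, predictions))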

Machine Learning Model in Production

The goal of building a machine learning model is to solve a problem, and a machine
learning model can only do so when it is in production and actively in use by consumers. As
such, model deployment is as important as model building. Data scientists excel at creating
models that represent and predict real-world data, but effectively deploying machine learning
models is more of an art than science. Deployment requires skills more commonly found in
software engineering and DevOps.

From model to production

Many teams embark on machine learning projects without a production plan, an approach that often leads to serious problems when it's time to deploy. It is both expensive and time-consuming to create models, and you should not invest in an ML project if you have no plan to put it in production, except of course when doing pure research. The three key areas your team needs to consider before embarking on any ML project are:

1. Data storage and retrieval

2. Frameworks and tooling

3. Feedback and iteration

Data storage and retrieval

A machine learning model is of no use to anyone if it doesn’t have any data associated with
it. You’ll likely have training, evaluation, testing, and even prediction data sets. You need to
answer questions like:

 How is your training data stored?

 How large is your data?


 How will you retrieve the data for training?

 How will you retrieve data for prediction?


These questions are important as they will guide you on what frameworks or tools to use, how to approach your problem, and how to design your ML model. Data can be stored on-premises, in cloud storage, or in a hybrid of the two. It makes sense to store your data where the model training will occur and the results will be served: on-premises model training and serving will be best suited for on-premises data, especially if the data is large, while data stored in cloud storage systems like GCS, AWS S3, or Azure storage should be matched with cloud ML training and serving.

The size of your data also matters a lot. If your dataset is large, then you need more
computing power for preprocessing steps as well as model optimization phases. This means
you either have to plan for more compute if you’re operating locally, or set up auto-scaling in
a cloud environment from the start. Remember, either of these can get expensive if you
haven’t thought through your data needs, so pre-plan to make sure your budget can support
the model through both training and production.

Even if you have your training data stored together with the model to be trained, you still
need to consider how that data will be retrieved and processed. Here the question of batch vs.
real-time data retrieval comes to mind, and this has to be considered before designing the ML
system. Batch data retrieval means that data is retrieved in chunks from a storage system
while real-time data retrieval means that data is retrieved as soon as it is available.

Along with training data retrieval, you will also need to think about prediction data retrieval.
Your prediction data is rarely as neatly packaged as the training data, so you need to consider
a few more issues related to how your model will receive data at inference time:

 Are you getting inference data from webpages?

 Are you receiving prediction requests from APIs?

 Are you making batch or real-time predictions?


If you’re getting data from webpages, the question then is what type of data? Data from users in webpages could be structured data (CSVs, JSON) or unstructured data (images, videos, sound), and the inference engine should be robust enough to retrieve and process the data and to make predictions. Inference data from web pages may be very sensitive for users, and as such, you must take into consideration things like privacy and ethics. Another issue here has to do with data quality. Data used for inference will often be very different from training data, especially when it is coming directly from end-users rather than APIs. Therefore you must provide the necessary infrastructure to fully automate the detection of changes as well as the processing of this new data.

As with retrieval, you need to consider whether inference is done in batches or in real-time.
These two scenarios require different approaches, as the technology/skill involved may be
different. For batch inference, you might want to save a prediction request to a central store
and then make inferences after a designated period, while in real-time, prediction is
performed as soon as the inference request is made. Knowing this will enable you to
effectively plan when and how to schedule compute resources, as well as what tools to use.

Frameworks and tooling

Your model isn’t going to train, run, and deploy itself. For that, you need frameworks and
tooling, software and hardware that help you effectively deploy ML models. These can be
frameworks like Tensorflow, Pytorch, and Scikit-Learn for training models, programming
languages like Python, Java, and Go, and even cloud environments like AWS, GCP, and
Azure.

After examining and preparing your use of data, the next line of thinking should consider
what combination of frameworks and tools to use.

The choice of framework is very important, as it can decide the continuity, maintenance, and
use of a model. In this step, you must answer the following questions:

 What is the best tool for the task at hand?

 Are the choices of tools open-source or closed?

 How many platforms/targets support the tool?


To help determine the best tool for the task, you should research and compare findings for
different tools that perform the same job. For instance, you can compare these tools based on
criteria like:
Efficiency: How efficient is the framework or tool in production? A framework or tool is efficient if it optimally uses resources like memory, CPU, or time. It is important to consider the efficiency of the frameworks or tools you intend to use because they have a direct effect on project performance, reliability, and stability.

Popularity: How popular is the tool in the developer community? Popularity often means it works well, is actively in use, and has a lot of support. It is also worth mentioning that there may be newer tools that are less popular but more efficient than popular ones, especially for closed-source, proprietary tools. You’ll need to weigh that when picking a proprietary tool to use. Generally, in open source projects, you’d lean toward popular and more mature tools for reasons I’ll discuss below.

Support: How is support for the framework or tool? Does it have a vibrant community
behind it if it is open-sourced, or does it have good support for closed-source tools? How fast
can you find tips, tricks, tutorials, and other use cases in actual projects? Does it run on
Windows, Linux, or Mac OS? Is it easy to customize or implement in this target
environment? These questions are important as there can be many tools available to research
and experiment on a project, but few tools that adequately support your model while in
production.

Feedback and iteration

ML projects are never static. This is part of engineering and design that must be considered
from the start. Here you should answer questions like:

 How do we get feedback from a model in production?

 How do you set up continuous delivery?


Getting feedback from a model in production is very important. Actively tracking and
monitoring model state can warn you in cases of model performance depreciation/decay, bias
creep, or even data skew and drift. This will ensure that such problems are quickly addressed
before the end-user notices.
Deploying Model Process

Deploying a machine learning model involves the process of making your trained model
available for use in real-world applications. Here's a step-by-step overview of the deployment
process:

1. Preprocessing and Feature Engineering: Ensure that the preprocessing steps and feature
engineering techniques used during training are implemented in the deployment pipeline.
This includes scaling, normalization, encoding categorical variables, handling missing values,
etc.

2. Choosing a Deployment Environment: Select a suitable environment for deploying your model. This could be cloud-based platforms like AWS, Azure, Google Cloud, or on-premises servers.

3. Containerization (Optional): Containerization tools like Docker can be used to package your model along with its dependencies and runtime environment. This ensures consistency across different environments and simplifies deployment.

4. Model Integration: Integrate your trained model into the deployment environment. This
may involve loading the model weights, architecture, and any other necessary files.

5. Building APIs: Expose your model's functionality through APIs (Application Programming Interfaces). This allows other software components or systems to interact with your model for making predictions. Popular frameworks for building APIs include Flask, Django, FastAPI, etc. (a sketch of such an API follows this list).

6. Scalability and Performance: Ensure that your deployment setup can handle the expected
load and maintain performance requirements. This may involve load testing and optimizing
the deployment infrastructure.

7. Monitoring and Logging: Implement monitoring and logging mechanisms to track the
performance of your deployed model in real-time. This helps in identifying issues,
monitoring model drift, and gathering insights for further improvement.

8. Security Considerations: Implement security measures to protect your deployed model from attacks and unauthorized access. This includes encryption, authentication, and access control mechanisms.
9. Continuous Integration/Continuous Deployment (CI/CD): Set up automated pipelines
for deploying updates or new versions of your model. This ensures smooth and efficient
deployment without manual intervention.

10. Feedback Loop: Establish a feedback loop to collect user feedback and performance
metrics from the deployed model. This feedback can be used to iteratively improve the model
over time.
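As referenced in step 5 above, here is a minimal sketch of API-based deployment with Flask (the saved model file "model.joblib", the feature layout, and the port are hypothetical assumptions, not from this text):

import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")  # load the trained model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload like {"features": [[0.1, 0.2, ...], ...]}.
    payload = request.get_json()
    features = np.array(payload["features"])
    predictions = model.predict(features)
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)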

By following these steps, you can effectively deploy your machine learning model and make
it available for use in real-world applications.

Deploying Machine Learning Model in Production

API-Based Deployment
Description: Deploy the model as an API service, allowing other applications to send input and receive predictions.
Implementation: Use web frameworks like Flask, Django, or FastAPI.
Advantages: Decouples the model from application logic, supports real-time predictions, enables integration with various platforms, and facilitates scalability and concurrent access.

Containerization
Description: Package the model and dependencies into a container image for consistency and portability.
Implementation: Use Docker to create and deploy container images.
Advantages: Ensures consistency across environments, simplifies deployment and dependency management, supports microservices architecture, and enables scalability and portability.

Serverless Deployment
Description: Deploy the model as a serverless function, which automatically scales based on demand.
Implementation: Utilize serverless platforms like AWS Lambda, Azure Functions, or Google Cloud Functions.
Advantages: Eliminates server management overhead, supports event-driven architectures, auto-scaling, and pay-per-use billing, and reduces operational costs.

Embedded Deployment
Description: Integrate the model directly into embedded systems or edge devices for local inference.
Implementation: Optimize the model for resource-constrained devices.
Advantages: Enables offline inference, reduces latency by avoiding network communication, enhances privacy and security by keeping data local, and supports use cases with limited or unreliable internet connectivity.

Model Serving Frameworks
Description: Use specialized model serving frameworks for deploying and managing models in production.
Implementation: Utilize frameworks like TensorFlow Serving or TorchServe.
Advantages: Streamlines the deployment process, abstracts away infrastructure management, provides built-in support for common deployment tasks, and offers scalability, reliability, and monitoring capabilities.

Edge Deployment
Description: Deploy the model to edge computing devices or servers closer to the data source or users.
Implementation: Utilize edge computing platforms or edge AI frameworks.
Advantages: Reduces latency for real-time applications, improves privacy and security by keeping data local, conserves network bandwidth, and enables offline operation in disconnected environments.
