
Bachelor Degree Project

Image Classification with Machine Learning as a Service
- A comparison between Azure, SageMaker, and Vertex AI

Author: Gustav Berg


Supervisor: Alisa Lincke
Semester: VT 2022
Subject: Computer Science
Abstract
Machine learning is a growing area of artificial intelligence that is widely used
in our world today. Training machine learning models requires knowledge and
computing power. Machine Learning as a Service (MLaaS) tries to solve these
issues. By storing the datasets and using virtual computing instances in the
cloud, one can create machine learning models without writing a single line of
code. When selecting an MLaaS platform to use, the natural question of which
one to use arises. This thesis conducts controlled experiments to compare the
image classification capabilities of Microsoft Azure ML, Amazon Web
Services SageMaker, and Google Cloud Platform Vertex AI. The prediction
accuracy, training time, and cost will be measured with three different datasets.
Some subjective comments about the user experience while conducting these
experiments will also be provided. The results of these experiments will be
used to make recommendations as to which MLaaS platform to use depending
on which metric is most important. This thesis found that Microsoft Azure ML
performed best in terms of prediction accuracy and training cost across all
datasets. Amazon Web Services SageMaker had the shortest training time but
performed the worst in terms of accuracy and had trouble with two of the three
datasets. Google Cloud Platform Vertex AI achieved the second-best
prediction accuracy but was by far the most expensive platform, as it had the
longest training time. It did, however, provide the smoothest user experience.
Overall, Azure ML would be the platform of choice for image classification
tasks after weighing together the results of the experiments as well as the
subjective user experience.

Keywords: machine learning, image classification, MLaaS, supervised
learning, artificial intelligence
Preface
I would like to thank my supervisor Alisa Lincke for her support and advice
throughout this process. I would also like to give a huge thanks to Andreas and
Stefan at Knowit Luleå for coming up with the idea behind this project and for
allowing me to do this thesis in collaboration with them.
Contents
1 Introduction ________________________________________________ 6
1.1 Background ___________________________________________ 6
1.2 Related work __________________________________________ 7
1.3 Problem formulation ____________________________________ 7
1.4 Motivation ____________________________________________ 8
1.5 Result ________________________________________________ 8
1.6 Scope/Limitation _______________________________________ 9
1.7 Target group __________________________________________ 9
1.8 Outline _______________________________________________ 9

2 Method __________________________________________________ 10
2.1 Research Project ______________________________________ 10
2.2 Research method ______________________________________ 10
2.2.1 Controlled experiments ______________________________ 10
2.2.2 Ease-of-use evaluation matrix _________________________ 12
2.3 Reliability and Validity _________________________________ 12
2.4 Ethical Considerations__________________________________ 13

3 Theoretical Background _____________________________________ 14


3.1 Machine Learning as a Service (MLaaS) ___________________ 14
3.2 Image classification ____________________________________ 14
3.3 Data augmentation_____________________________________ 14
3.4 Amazon Web Services SageMaker [17] ____________________ 15
3.5 Google Cloud Platform Vertex AI [22]_____________________ 16
3.6 Microsoft Azure ML [24] _______________________________ 17
3.7 Comparison criteria ____________________________________ 18
3.7.1 Accuracy _________________________________________ 18
3.7.2 Time to train ______________________________________ 18
3.7.3 Cost _____________________________________________ 18
3.7.4 Ease-of-use _______________________________________ 18

4 Research project – Implementation_____________________________ 19


4.1 Datasets _____________________________________________ 19
4.2 Amazon SageMaker ___________________________________ 19
4.3 Google Cloud Vertex AI ________________________________ 22
4.4 Microsoft Azure ML ___________________________________ 23

5 Results ___________________________________________________ 25
5.1 Classification accuracy _________________________________ 25
5.2 Training time _________________________________________ 27
5.3 Cost_________________________________________________ 29
5.4 Ease-of-use ___________________________________________ 30

6 Analysis__________________________________________________ 31
6.1 Answer to RQ 1 _______________________________________ 31
6.2 Answer to RQ 2 _______________________________________ 31
6.3 Answer to RQ 3 _______________________________________ 32
6.4 Ease-of-use experience __________________________________ 32
6.4.1 Vertex AI ________________________________________ 32
6.4.2 Azure ML ________________________________________ 32
6.4.3 SageMaker _______________________________________ 33

7 Discussion _______________________________________________ 34

8 Conclusions and Future Work ________________________________ 36

References ___________________________________________________ 37
1 Introduction
This work is a 15 HEC bachelor thesis in computer science, comparing three
different Machine Learning as a Service (MLaaS) platforms in image
classification tasks.

I am doing this thesis in cooperation with the IT consultant company Knowit,
Luleå office. The topic was their idea, as they are looking to gain more
knowledge about MLaaS platforms in order to offer more services to their
customers.

1.1 Background
Artificial Intelligence is an area within Computer Science and is the ability of
a computer or machine to perform tasks that would normally require the
intelligence of a human. Machine Learning is one of the fastest-growing areas
within Computer Science [1] and is considered a very important technology
[2]. It is a part of artificial intelligence that uses data and algorithms to make
predictions or decisions. The goal of machine learning is to make computers
“learn” how to do something without being specifically programmed to do so
[2].

There are three main types of machine learning: supervised learning,
unsupervised learning, and reinforcement learning. Supervised learning is when
the training set contains the label or attribute that the model will estimate or
predict. Unsupervised learning lacks this attribute or label, and reinforcement
learning means that the results can lead to actions that can change the
environment.

Machine learning can require a lot of computational resources to process and


analyze the data and deep knowledge of machine learning algorithms [3]. One
solution to these problems is to use a Machine Learning as a Service (MLaaS)
platform. An MLaaS platform can provide several services, including
computer vision, speech recognition, image- and video analysis, as well as
scalable computational resources. Using an MLaaS platform enables the data
scientists or data engineers to focus on the data analysis instead of ML
algorithm implementations and establishing the computing resources [1] [4].

Image classification usually requires many, and sometimes large, images to be
processed, which makes it very resource-demanding. Image processing,
training models, and storing the data can be expensive for small companies,
individual developers, or researchers. MLaaS platforms provide an opportunity
to overcome these problems [2].
1.2 Related work
Few publications focusing on MLaaS have appeared since 2019. A systematic
literature review conducted in 2020 found only 30 contributions on the MLaaS
topic, and only five of those focused on performance [5].

An experimental study conducted in 2017 compared the relation between
complexity and performance on six of the most popular MLaaS platforms.
This study focused on binary classification and used 119 different datasets.
They found that, if used correctly, MLaaS can achieve results comparable to
fine-tuned, task-specific ML classifiers, but that "ready to use" systems
provided worse results. That study is similar to what this study aims to do;
the main difference is that this study focuses on image classification and
covers only three MLaaS platforms: Microsoft Azure ML, Amazon Web
Services SageMaker, and Google Cloud Platform Vertex AI.

Another study conducted in 2021 [6] evaluated the performance of four MLaaS
platforms with a focus on multi-class classification. They assessed the four
platforms according to cost, training time, and overall accuracy. They found
that Microsoft Azure is a viable option for small-scale applications with
smaller budgets, but Google ML had better accuracy and lower training time
at a greater cost. That study is very similar to this study in terms of evaluation.
The researchers also mention in their paper that they will conduct a similar
study focusing on image classification in the future.

Gokay Saldamli et al. argued that choosing the right MLaaS platform is critical.
Their study conducted an in-depth comparison of the Natural Language
Processing APIs provided by Google, AWS, IBM, and Azure. Their
assessment focused primarily on the following four factors: accuracy, time,
cost, and ease-of-use, and they also point out the need for more research in the
area [7]. In this work, the same factors were selected, with the only difference
being that this thesis will conduct an in-depth comparison of deep learning
models for image classification.

1.3 Problem formulation


In recent years, there has been increasing interest in providing MLaaS
platforms for data scientists and data engineers. Several powerful MLaaS
platforms have emerged, and more will continue to appear in the near future.
The main challenge many customers (data scientists, researchers, data
engineers, small IT companies, etc.) face is selecting an MLaaS platform
according to their needs (budget, data type and size, security issues, data
analysis purpose, and others) [5] [8] [9]. To the best of our knowledge, no
previous study has compared the image classification capabilities of several
MLaaS platforms in terms of accuracy, time, cost, and ease-of-use. This work
aims to compare MLaaS platforms in these terms to simplify the selection
process by providing recommendations for future users of MLaaS.
This leads to the three main questions that this thesis will be trying to answer:

RQ1: Which MLaaS performs best in terms of classification accuracy?


RQ1 aims to show which platform to use if classification accuracy is of the
most importance.

RQ2: Which MLaaS performs best in terms of training time and cost?
RQ2 aims to show which platform to use if training time or cost is of the most
importance.

RQ3: Do the different image datasets affect MLaaSs’ respective performance?


RQ3 aims to show which platform to use if versatility with different datasets
is of the most importance.

Complexity and ease of use will also be noted and discussed as that can be a
relevant factor as well, even though the user experience is subjective to this
author’s opinions.

1.4 Motivation
There are over 10 different MLaaS platforms available [8] and it can be hard
to choose which one to use. Prediction accuracy is the most used standard
criterion for comparing different image classification models [9]. Training
time and cost can also be essential selection criteria for businesses and the
research sector. With my contribution, I hope to determine the performance
differences between three of the most popular MLaaS platforms, as well as
help Knowit gain knowledge of the offerings of these platforms.

Three MLaaS platforms were selected since Google, Amazon, and Microsoft
are all in the top 5 of the largest companies in the world by market cap [10].
All three are considered members of the "big four" [7].

1.5 Result
There were 19 controlled experiments performed and the results showed that
Microsoft Azure ML had the highest prediction accuracy followed by Google
Cloud Vertex AI and then Amazon Web Services SageMaker. SageMaker did
perform best in terms of training time, but Azure was not far behind and was
slightly cheaper to train with. Vertex AI was the most expensive option by far
and the models had the longest time to train. These results cannot fully answer
the research questions, since the results may vary depending on the specific
use case: the type of images, number of classes, image size, number of
images, etc.

There was a noticeable difference in user experience and ease-of-use. Vertex
AI required no coding and was ready to use directly after registration. Azure
did require some coding with their Python SDK but was also ready to use after
registration. SageMaker required a support request to get permission to use a
computing instance with image classification support, and had a graphical user
interface with poor instructions.

1.6 Scope/Limitation
This thesis aims to compare three well-known MLaaS platforms: Amazon Web
Services SageMaker, Microsoft Azure ML, and Google Cloud Vertex AI. This
work will not consider other MLaaS platforms available on the market. This
thesis will also only look at image classification. Therefore, no other forms of
machine learning tasks will be covered. There will only be three image datasets
used, which vary in image size (32x32, 227x227, and 204x153 to 4368x2912)
and image type (objects such as planes, cars, and dogs, weather, and concrete).
No multilabel datasets will be used. There will also be no manual
hyperparameter tuning of the models. If automatic hyperparameter tuning is
not supported, the standard values will be used.

1.7 Target group


The main target group for this thesis is small software companies that are
developing AI-based applications and researchers in Data Science to provide
them with recommendations for which MLaaS to use for their applications and
research projects.

1.8 Outline
The rest of the thesis will be organized as follows:
Chapter 2 describes the research project, the selected research method, and the
reliability and validity of the results. It also describes the ethical
considerations. Chapter 3 will provide some theoretical background,
describing the MLaaS concept as well as the selected platforms. Chapter 4 will
describe the implementations that were made for each of the platforms. Chapter
5 will showcase the results with charts. Chapter 6 will analyze the results and
provide the answers to research questions. Chapter 7 will discuss the results
and put them into relation to the related work. Finally, chapter 8 will draw
conclusions from the result, and suggestions for future work will be made.
2 Method
This chapter will give a background of the research project and explain the
method selected to answer the research questions stated in chapter one.

2.1 Research Project


The IT-consultant company Knowit [13] initiated this project. They were
looking for more insight into MLaaS platforms as they are looking to expand
their service offerings. We selected the controlled experiment approach to test
the difference between Microsoft Azure ML, Amazon Web Services
SageMaker, and Google Cloud Vertex AI in terms of prediction accuracy,
training time, and cost. The results are provided by the services themselves,
since the training process measures and collects this information. I will also
comment on the user experience in terms of ease-of-use after conducting these
experiments. While these are my subjective thoughts, they may serve as
guidance or recommendation for users searching for the most suitable
platform.

2.2 Research method


To answer the research questions, several controlled experiments were
conducted with a quantitative data collection method. This method is beneficial
for comparing machine learning techniques due to its ability to test different
parameters, criteria, or aspects [11]. Thus, controlled experiments are the best-
suited method to answer the research questions.

A controlled experiment is when a system is systematically tested in an


environment that is under control. Such an experiment needs to have defined
dependent and independent variables. The dependent variable is the metric of
interest and the factors that are used as input(s) to affect the dependent variable
are the independent variable(s). In this case, the dependent variables will be
prediction accuracy, training time, and cost, while the independent variables
will be platform, algorithm, and dataset.

2.2.1 Controlled experiments


A total of 19 experiments were conducted with the configurations presented
in Table 2.2.1.1, summarized here per platform:

Azure ML (experiments 1-12)
- Algorithms: MobileNetV2, ViT, SEResNeXt, and ResNet18, each trained on
  cifar10 (multiclass), concrete (binary), and weather (multiclass)
- Data augmentation: resize, crop, horizontal flip, color jitter, and
  normalization
- Hyperparameter tuning: automatic
- Computing instance: 6-core CPU, 56GB RAM, 380GB HDD, and NVIDIA
  Tesla K80 GPU

SageMaker (experiments 13-16)
- Algorithm: the SageMaker built-in image classification algorithm, trained
  both from scratch and pre-trained, on cifar10 (multiclass) and concrete
  (binary)
- Data augmentation: crop, color, and transform
- Hyperparameter tuning: none
- Computing instance: 4 CPU cores, 61GiB RAM, and 1 NVIDIA K80 GPU

Vertex AI (experiments 17-19)
- Algorithm: AutoML, trained on cifar10 (multiclass), concrete (binary), and
  weather (multiclass)
- Data augmentation: no information available
- Hyperparameter tuning: automatic
- Computing instance: no information available

Table 2.2.1.1: Experiment configurations for the controlled experiments conducted during
this study.
2.2.2 Ease-of-use evaluation matrix
To make the ease-of-use evaluation more systematic, the following evaluation
matrix was constructed, presented in Table 2.2.2.1. All criteria are weighted
the same and the criteria were selected from Knowit’s suggestions.

            Steps to train   Documentation   Data preparation   Score
SageMaker
Azure ML
Vertex AI

Table 2.2.2.1: The evaluation matrix for the ease-of-use experience.

Every category will be awarded a score from 1 to 5, where 5 is the best
score. Steps to train takes into consideration how time-consuming it is to
train a model; a high score means that it is easy to start and configure
the training job. The documentation category indicates how helpful the
documentation is; a high score means that it was detailed and helpful. The
data preparation score indicates how much data preparation is needed; a
high score means that little manual effort was required. The scores will be
awarded based on our subjective experience.
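Since all criteria carry equal weight, the final score for a platform is simply the mean of its per-criterion scores. The following sketch illustrates the computation with made-up numbers, not the scores awarded in this thesis:

```python
# Hypothetical ease-of-use scores (1-5 per criterion). With equal weighting,
# the final score is the plain average of the three criteria.
criteria = ["steps_to_train", "documentation", "data_preparation"]

def ease_of_use_score(scores):
    """Average the per-criterion scores, since all criteria are weighted equally."""
    return sum(scores[c] for c in criteria) / len(criteria)

# Example with illustrative numbers only:
example = {"steps_to_train": 4, "documentation": 3, "data_preparation": 5}
print(ease_of_use_score(example))  # 4.0
```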

2.3 Reliability and Validity


The datasets used in this experiment are open source and free for anyone to
use. The algorithms and platforms used are open to anyone to use as well.
Detailed step-by-step instructions are included in Chapter 4. The results should
be easy to recreate with the provided information. Thus, the results presented
in this work are reliable.

The validity of these results should also be high: even though the platforms
measure and report their own statistics, the results are easily checked through
manual validation. However, just because the platforms performed a certain
way in these experiments does not mean that this will hold for all use cases.
Different image sizes, numbers of images, numbers of classes, and other
factors make each use case unique. Another thing to take into consideration is
that the models trained on Azure ML and AWS SageMaker had the datasets
split into the platforms' default 80/20 training/validation split, while the
Vertex AI models were trained with the default 80/10/10
training/validation/test split, since Vertex AI does not provide any validation
accuracy data. This could affect the potential conclusions drawn from the
results. Since the ease-of-use is judged by this author's personal experience,
those results are subjective, and a person with a different technical
background may have a very different user experience with the MLaaS
platforms.
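The two default splits mentioned above (80/20 on Azure ML and SageMaker, 80/10/10 on Vertex AI) can be sketched as a simple shuffle-and-partition; the platforms' own split logic is internal, so this is only an illustration of the proportions:

```python
import random

def split_dataset(items, fractions, seed=0):
    """Shuffle and split a list of samples according to the given fractions.
    fractions must sum to 1.0, e.g. (0.8, 0.2) or (0.8, 0.1, 0.1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    splits, start = [], 0
    for frac in fractions[:-1]:
        end = start + round(len(items) * frac)
        splits.append(items[start:end])
        start = end
    splits.append(items[start:])  # remainder goes to the last split
    return splits

data = list(range(1000))
train, val = split_dataset(data, (0.8, 0.2))       # Azure ML / SageMaker default
tr, va, te = split_dataset(data, (0.8, 0.1, 0.1))  # Vertex AI default
print(len(train), len(val), len(tr), len(va), len(te))  # 800 200 800 100 100
```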
2.4 Ethical Considerations
This research project does not include any confidential information,
interviews, or any other form of sensitive data. The datasets used in the
experiments are open source. Therefore, I consider there not to be any
substantial ethical considerations for this thesis.
3 Theoretical Background
In this chapter, some theoretical background will be provided on the MLaaS
concept in general, algorithms, and some specific information about the
selected platforms, with a focus on their image classification capabilities.

3.1 Machine Learning as a Service (MLaaS)


Machine Learning as a Service uses the power of cloud computing to give its
clients a solution to the machine learning infrastructure problem [7]. There are
many different vendors out there that offer a variety of different machine
learning tools that range from computer vision, natural language processing,
and predictive analysis to deep learning and face recognition [11] [8]. Apart
from being useful for data storage, training, and validating a machine learning
model, MLaaS can also be used to deploy the model as a web service endpoint
for production use.

3.2 Image classification


Image classification is a supervised machine learning task where a model is
trained with labeled images to learn how to classify unseen and unlabeled
images. Image classification is performed with deep learning algorithms called
neural networks. The neural network analyzes features in the images to predict
which class they belong to.

3.3 Data augmentation


Data augmentation in deep learning is an action performed to increase the
number of images in a dataset by creating copies of the original data and then
applying transformations such as cropping, rotation, color adjustments, etc.
Research suggests that data augmentation can increase classification
performance [12].
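The idea of augmentation — creating transformed copies of the original images — can be illustrated on a tiny "image" represented as a list of pixel rows. Real pipelines use libraries such as PIL or torchvision; this pure-Python sketch only shows the concept of two of the transformations mentioned above:

```python
def horizontal_flip(image):
    """Mirror every row left-to-right."""
    return [list(reversed(row)) for row in image]

def crop(image, top, left, height, width):
    """Cut out a height x width region starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# The augmented dataset contains the original plus its transformed copies.
augmented = [image, horizontal_flip(image), crop(image, 0, 0, 2, 2)]
print(horizontal_flip(image)[0])  # [3, 2, 1]
print(crop(image, 0, 0, 2, 2))    # [[1, 2], [4, 5]]
```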
3.4 Amazon Web Services SageMaker [17]

Figure 3.4.1: Screenshot of the SageMaker console.

SageMaker was introduced by AWS in November of 2017 [13] and is AWS’s


cloud-based machine learning platform. Apart from the previously mentioned
use-cases, SageMaker also supports the deployment of machine learning
models on edge devices [14]. SageMaker is configured to work with popular
deep-learning frameworks including Apache MXNet, PyTorch, and
TensorFlow [14].

For image classification training jobs, SageMaker has one built-in algorithm
to choose from, ResNet. The algorithm can be trained from scratch or using
the “transfer learning” mode which uses pre-trained weights and fine-tunes the
network with the user data [15]. There is also an “automatic model tuning”
mode which finds the best hyperparameter combination for the selected
dataset, within a given parameter range [16].

SageMaker also supports the training of algorithms purchased via the AWS
marketplace or your own algorithms created with the supported frameworks.

The training jobs can be started via the graphical user interface or via the
SageMaker Python SDK. The training jobs can be conducted on the user’s
local machine or via a computing instance.
3.5 Google Cloud Platform Vertex AI [22]

Figure 3.5.1: Screenshot of the Vertex AI dashboard.

Vertex AI was launched on the 18th of May 2021, with Google claiming that
the platform would require nearly 80% fewer lines of code to train machine
learning models compared to its competitors [17]. Vertex AI offers both AutoML and
custom training. AutoML requires no data science or coding knowledge and
has a faster training time compared to custom training. It uses a pre-trained
algorithm and fine-tunes it with the user’s dataset. Custom training requires a
TensorFlow or scikit-learn model and requires some data science knowledge.

When training in the AutoML mode, Vertex AI uses automatic hyperparameter
tuning to try to optimize model performance.
3.6 Microsoft Azure ML [24]

Figure 3.6.1: Screenshot of the Azure Machine Learning Studio.

Azure Machine Learning became public in 2014 [18]. Azure shares a lot of
functionality with the previous platforms but for image classification, it has
five different algorithms to choose from:

• ViT
• SEResNeXt
• ResNet
• ResNeSt
• MobileNetV2

Via the AutoML for computer vision models functionality, Azure can select
the best algorithm for your dataset as well as tune the hyperparameters to
maximize the prediction accuracy [19]. Azure supports frameworks including
PyTorch, TensorFlow, and scikit-learn.

To improve the model performance Azure AutoML uses data augmentation


before feeding the input to the model. The actions performed on the dataset
include crop, resize, horizontal flip, color jitter, and normalization.

AutoML hyperparameter tuning can be performed in three different ways:
random sampling, grid sampling, and Bayesian sampling. Random sampling
uses random hyperparameter values, grid sampling does a sweep over all
possible combinations of values, and Bayesian sampling is based on the
Bayesian optimization algorithm.
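The difference between random and grid sampling can be sketched over a hypothetical search space (the actual Azure AutoML search spaces differ; Bayesian sampling is omitted here as it requires an optimization loop):

```python
import itertools
import random

# Illustrative hyperparameter space, not Azure's actual defaults.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
}

def grid_sampling(space):
    """Enumerate every combination of hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]

def random_sampling(space, n, seed=0):
    """Draw n independent random combinations from the space."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]

print(len(grid_sampling(space)))  # 9 (3 x 3 combinations)
candidates = random_sampling(space, 5)
```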

3.7 Comparison criteria


The comparison criteria for these experiments are accuracy, time to train, cost,
and ease of use.

3.7.1 Accuracy
Prediction accuracy is the most used standard criterion for comparing different
image classification models [9] and is calculated by dividing the number of
correct predictions by the total number of predictions. Figure 3.7.1 shows the
mathematical equation for accuracy.

Accuracy = True predictions / (True predictions + False predictions)

Figure 3.7.1: The mathematical equation for accuracy.
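The formula above amounts to the fraction of predictions that match the true labels, as this small sketch shows:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels:
    true predictions / (true + false predictions)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Example: 3 of 4 predictions are correct.
print(accuracy(["cat", "dog", "cat", "dog"],
               ["cat", "dog", "dog", "dog"]))  # 0.75
```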

3.7.2 Time to train


Time to train will be measured in minutes and is the total time taken from
starting the training until the training cycle is complete, and the model is
validated and ready to use.

3.7.3 Cost
The cost will be the total amount of money (in US dollars currency) spent on
training and validating the model, including any other fees related to the
training itself, such as the storage of images.

3.7.4 Ease-of-use
The ease-of-use criteria will be highly subjective, but several different aspects
will be taken into consideration such as the number of steps to start and
complete training, how helpful the documentation is, and a subjective opinion
regarding how much effort was required to perform the experiments.
4 Research project – Implementation
4.1 Datasets
Three different datasets were selected for this research project. The aim was to
have different sizes, numbers of classes, and types of images. They are all
commonly used datasets for image classification. Table 4.1.1 presents them in
more detail.

1. CIFAR-10: 10 classes of everyday objects such as cars, planes, and dogs.
   Image resolution 32x32; 60,000 images. Link to dataset: [23]
2. Concrete Crack: images of concrete that are either cracked or not (for
   binary classification). Image resolution 227x227; 40,000 images. Link to
   dataset: [24]
3. Multi-class Weather Images: 4 classes of weather images in various sizes.
   Image resolution 204x153 to 4368x2912; 1,125 images. Link to dataset: [25]

Table 4.1.1: Dataset information.

4.2 Amazon SageMaker


Before the experiment could be conducted, I had to change the region to “us-
east-2” since image classification requires a certain type of computing instance
that was not available in the Swedish region. The selected instance was
“ml.p2.xlarge” which was the cheapest instance approved for image
classification training jobs. This instance has four CPU cores, 61GiB RAM,
and 1 NVIDIA K80 GPU. I had to initiate a support request to get access to
such an instance which was a process that took about a week. After the
approval I took the following steps:

Step Description
Load the data The first step is to upload the images to S3 storage.
SageMaker requires the data to be split into training
and validation sets before uploading, and the two parts
must be stored in separate folders. The images should
be uploaded via the AWS CLI [23], since the S3
graphical user interface froze when trying to upload
more than 1,000 images.

Data preparation There are three different dataset format types to
choose from, and detailed instructions on how to create
the datasets accordingly can be found in the
SageMaker developer guide [24]. After the dataset file
is created in the desired format, it should also be
uploaded to S3 storage. I chose the Augmented
Manifest Image format, used a Python script to create
the JSONL file, and then uploaded the files to S3
storage.
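A script like the one described might look as follows. The "source-ref" key and a label attribute per line follow the SageMaker augmented manifest convention, but the bucket name, file paths, and the label attribute name here are illustrative placeholders, not the exact values used in this thesis:

```python
import json

def build_augmented_manifest(image_keys, labels, bucket, out_path,
                             label_attribute="class"):
    """Write one JSON object per line: an S3 image reference plus its label.
    Field names follow the SageMaker augmented manifest convention; the
    bucket and attribute names here are placeholders."""
    with open(out_path, "w") as f:
        for key, label in zip(image_keys, labels):
            line = {"source-ref": f"s3://{bucket}/{key}",
                    label_attribute: label}
            f.write(json.dumps(line) + "\n")

build_augmented_manifest(
    ["train/img_0001.jpg", "train/img_0002.jpg"],
    ["0", "1"],
    "my-training-bucket",  # placeholder bucket name
    "train.manifest",
)
```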

Algorithm selection SageMaker offers one built-in algorithm, so there was
no choice to be made: the ResNet residual network.
The experiment was run once with the pre-trained
version of this algorithm and once trained from
scratch.

Train the model I used the graphical user interface to schedule a
training job. The SageMaker built-in image
classification algorithm was selected, along with the
previously mentioned instance type. All the standard
hyperparameters were used except for
"augmentation_type", which was set to crop, color,
and transform, since Azure ML also performed image
augmentation on the datasets. The concrete dataset had
to be downscaled to 175x175px due to a memory error,
despite the computing instance having 61GiB of
RAM. The number of layers in the neural network was
set to 18, since ResNet18 was used in the Azure
experiments. The input data configuration was set up
according to the documentation for the selected
format. The CIFAR-10 dataset was also trained at an
upscaled 150x150 resolution to see if that increased
the accuracy, but it did not.

This step had to be re-done a few times, since
SageMaker gave some very unclear error messages: it
ran out of memory for the concrete dataset, which
resulted in the downscaling of the image size, and it
failed in the middle of training with the weather
dataset no matter which of the dataset format types
was used.

Validation SageMaker validates the model automatically as a part
of the training job. After the training job was
completed, the accuracy metrics were noted, and
training time and cost were calculated.

4.3 Google Cloud Vertex AI


Vertex AI required no computing instance or quota-limit set-up before starting
to train a model.

Step Description
Load the data The images were uploaded with the gcloud CLI to
Google Cloud Storage.
Data preparation The dataset JSONL file was created according to the
Google Cloud Vertex AI documentation instructions
[25]. I created a Python script to reach the desired
format. By default, Vertex AI splits the dataset into an
80/10/10 train/validation/test split, but a custom split via
the graphical user interface is possible.
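The script itself is not included in the thesis; a minimal sketch of what it might look like, following the single-label image classification JSONL schema in the Vertex AI documentation (bucket path and label are placeholders):

```python
import json
import random

def make_vertex_jsonl(items, out_path, seed=0):
    """items: iterable of (gcs_uri, label) pairs.

    Writes one JSON object per line in Vertex AI's single-label image
    classification format, assigning each image to an 80/10/10
    train/validation/test split via the ml_use resource label."""
    rng = random.Random(seed)
    with open(out_path, "w") as f:
        for gcs_uri, label in items:
            r = rng.random()
            ml_use = "training" if r < 0.8 else "validation" if r < 0.9 else "test"
            f.write(json.dumps({
                "imageGcsUri": gcs_uri,
                "classificationAnnotation": {"displayName": label},
                "dataItemResourceLabels": {
                    "aiplatform.googleapis.com/ml_use": ml_use,
                },
            }) + "\n")

make_vertex_jsonl([("gs://my-bucket/weather/rain_001.jpg", "rain")], "dataset.jsonl")
```

Leaving out the ml_use label makes Vertex AI assign the default 80/10/10 split itself, as described above.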

Algorithm selection Vertex AI AutoML was selected since that was the
only built-in algorithm option.

Train the model The next step is to go to the training tab and create a
training job. The selected training method was
“AutoML” and then “Train a new model”. The data
split under the advanced options was kept at the
standard randomly assigned split of 80/10/10
training/validation/test. After this step, the maximum
node hour budget was set to 24 hours, and the training
was started. After the training finished, the accuracy
metrics were noted, and training time and cost were
calculated. There is no available information regarding
the technical specifications of the computing instance.

Validation Vertex AI automatically validates the model after
training. After the training job was completed, the
accuracy metrics were noted, and training time and
cost were calculated.
4.4 Microsoft Azure ML
Step Description
Load the data The first step was to upload the images to Azure Blob
Storage via the Azure CLI. This kept upload times to
a few minutes compared to 7-12 hours via the
graphical user interface. Azure ML can split the
dataset into training and validation automatically
with an 80/20 split or using a custom split via the
experiment configuration.

Data preparation With a Python script, I created the JSONL file
according to Azure’s documentation. After the
JSONL file was uploaded, I created the dataset with
the Azure Python SDK [29].
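The script is not reproduced in the thesis; a sketch of the idea, using the multi-class image classification JSONL format from Azure's documentation at the time (the AmlDatastore path and label are placeholders):

```python
import json

def make_azure_jsonl(items, out_path,
                     datastore_prefix="AmlDatastore://workspaceblobstore/weather"):
    """items: iterable of (filename, label) pairs.

    Writes one {"image_url": ..., "label": ...} object per line, the format
    Azure AutoML expects for multi-class image classification."""
    with open(out_path, "w") as f:
        for filename, label in items:
            f.write(json.dumps({
                "image_url": f"{datastore_prefix}/{filename}",
                "label": label,
            }) + "\n")

make_azure_jsonl([("rain_001.jpg", "rain")], "training_annotations.jsonl")
```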

Algorithm selection Azure offers 5 different algorithms to choose from,
and the initial idea was to train them all once with
each dataset. However, the ResNeSt algorithm
resulted in an “out-of-memory error”, so those
results are excluded from this thesis. This resulted in 4
models per dataset.

Train the model The selected compute instance was the cheapest
available instance that was approved for image
classification, the NC6 instance with 6 cores, 56GB
RAM, 380GB HDD, and an NVIDIA Tesla K80
GPU. Grid hyperparameter sampling was used for all
the experiments in order to maximize the accuracy.
The Python SDK has to be used in order to start an
AutoML computer vision task and the training job
was configured with the settings shown in Figure
4.4.1.

Validation After the training job was completed, the accuracy
metrics were noted, and training time and cost were
calculated.
Figure 4.4.1: Code snippet used to train a SEResNeXt model with the concrete dataset.
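The snippet in Figure 4.4.1 is an image and is not reproduced here; a sketch of what such a configuration looks like, following the AzureML Python SDK v1 pattern for AutoML image tasks, where the workspace, compute name, and the previously registered training_dataset/validation_dataset are placeholders:

```python
from azureml.core import Experiment, Workspace
from azureml.train.automl import AutoMLImageConfig
from azureml.train.hyperdrive import GridParameterSampling, choice
from azureml.automl.core.shared.constants import ImageTask

ws = Workspace.from_config()  # placeholder workspace

image_config = AutoMLImageConfig(
    task=ImageTask.IMAGE_CLASSIFICATION,
    compute_target="gpu-nc6",                # the NC6 instance described above
    training_data=training_dataset,          # registered from the JSONL file
    validation_data=validation_dataset,
    hyperparameter_sampling=GridParameterSampling(
        {"model_name": choice("seresnext")}  # SEResNeXt, per the caption
    ),
    iterations=1,
)

run = Experiment(ws, "concrete-seresnext").submit(image_config)
```

This requires the azureml-train-automl package and an Azure ML workspace, so it is a configuration sketch rather than a standalone script.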
5 Results
As a result of this thesis, the following recommendations are made. If
classification accuracy or cost is the priority, and the user has some coding
knowledge, Azure ML is the recommended choice. If the user has no coding
knowledge and can tolerate a higher cost, Vertex AI is the recommendation.
SageMaker had marginally faster training time than Azure, but the lack of
performance in classification accuracy makes this platform hard to recommend
ahead of the others. The data used to come to these conclusions will be
presented in the following sub-chapters.

5.1 Classification Accuracy

The classification accuracy per model with the CIFAR-10 dataset is presented
in Figure 5.1.1.

Figure 5.1.1: Classification accuracy measured in percentage for the CIFAR-10 dataset.

The classification accuracy per model with the concrete dataset is presented
in Figure 5.1.2.
Figure 5.1.2: Classification accuracy measured in percentage for the Concrete dataset.

The classification accuracy per model with the weather dataset is presented in
Figure 5.1.3.

Figure 5.1.3: Classification accuracy measured in percentage for the Weather dataset.
The results show that the models trained on the Azure platforms achieved the
highest classification accuracy for the validation data. Azure has the highest-
performing model for each dataset. Vertex AI performs on par with Azure with
both the concrete and weather dataset, but there is a 7.2-10.7 percentage point
difference with the CIFAR-10 dataset. SageMaker is on par with the
competitors with the concrete dataset but performs very poorly with the
CIFAR-10 dataset. SageMaker was not able to handle the weather dataset,
which is why those results are missing.

5.2 Training time

The time required to train each model with the CIFAR-10 dataset is presented
in Figure 5.2.1.

Figure 5.2.1: Training time measured in minutes for the CIFAR-10 dataset.

The time to train each model with the concrete dataset is presented in Figure
5.2.2.
Figure 5.2.2: Training time measured in minutes for the Concrete dataset.

The time required to train each model with the weather dataset is presented
in Figure 5.2.3.

Figure 5.2.3: Training time measured in minutes for the Weather dataset.
In terms of training time, Azure’s models are on both ends of the spectrum. It
has 2 of the 4 fastest models, as well as the 2 slowest models to train.
SageMaker has the 2 lowest training times and Vertex AI places in the middle.

5.3 Cost
The cost required to train each model with the CIFAR-10 dataset is presented
in Figure 5.3.1.

Figure 5.3.1: Training cost measured in US$ for the CIFAR-10 dataset.

The cost required to train each model with the concrete dataset is presented in
Figure 5.3.2.

Figure 5.3.2: Training cost measured in US$ for the Concrete dataset.
The cost required to train each model with the weather dataset is presented
in Figure 5.3.3.

Figure 5.3.3: Training cost measured in US$ for the Weather dataset.

Azure has the cheapest models to train but the gap to SageMaker is not large.
Vertex AI is the outlier with a huge cost compared to its competitors.

5.4 Ease-of-use
Each platform’s score is presented in Table 5.4.1.

            Steps to train   Documentation   Data preparation   Score
SageMaker         1                2                 2             5
Azure ML          4                4                 4            12
Vertex AI         5                5                 5            15

Table 5.4.1: The evaluation matrix for the ease-of-use experience.

Vertex AI required the least steps to train the model as well as no coding or
machine learning competence. Azure was the second easiest platform to use,
with detailed guides and very little coding required, and no machine learning
skills needed. SageMaker was the most complicated platform, requiring the
most steps as well as the most manual data processing.
6 Analysis
6.1 Answer to RQ 1
In terms of accuracy, the results are very similar for the concrete dataset, seen
in Figure 5.1.2, with only a 0.396 percentage point difference between the most
accurate and the least accurate model. Looking at the CIFAR-10 dataset in Figure 5.1.1,
there are some big differences. The four different Azure models perform best
with a noticeable margin down to Vertex AI, and both SageMaker models
perform poorly at just above 10%. The SageMaker platform was unable
to process the weather dataset in Figure 5.1.3 at all, but it is very close between
the Azure models and the Vertex AI model. All Azure models perform slightly
better than the Vertex AI one though. Overall, Azure is the most accurate
platform in this experiment, followed by Vertex AI AutoML before
SageMaker. However, these results could vary depending on the use case, type
of images, resolution, etc.

Since SageMaker and Azure have the same algorithm available, those results
become extra interesting. Since the algorithm is the same, the difference in
accuracy should be the hyperparameters and the different data augmentation.
These results show that Azure is able to provide a higher classification
accuracy with these automatic tools. However, it is unclear if a high technical
knowledge within the SageMaker platform is able to close this gap.

6.2 Answer to RQ 2
The difference in training time between Azure’s different algorithms is quite
large. For the CIFAR-10 dataset, the fastest Azure model trained in about
a tenth of the time of the slowest model. The fastest Azure model was almost
as fast as the SageMaker models which were the fastest. The Vertex AI model
is somewhere between the faster models and the slowest ones. The pattern
repeats itself for the concrete dataset but for the weather dataset, Azure has a
clear advantage in terms of training time. As mentioned previously, SageMaker
could not handle this dataset.

With SageMaker and Azure, the cost is pretty much directly linked to the
training time. The SageMaker instance that trained the models ran at a cost of
1.125$ per hour and the Azure instance at 0.9$ per hour. Both of these
platforms have options ranging from these numbers up to 30-40$ an hour.
However, SageMaker charged a storage fee as well which is included in these
diagrams. Vertex AI only has one instance option which costs 3.465$ per hour
for image classification. For these experiments, Azure had the cheapest options
followed closely by SageMaker. Vertex AI was the most expensive to train
on by far, because it scales the training job across many instances at once. All
three training jobs ended automatically at the 24-hour time limit, which was
the cause of the huge cost in relation to the actual training time. I do think that
these results are a bit more generalizable than the accuracy results. The
automatic hyperparameter tuning of Vertex AI and Azure ML makes the
training time increase. There is also a clear difference in training time
depending on which algorithm is selected, as demonstrated by the Azure
experiments. We can also conclude that image size is a big factor. Despite
having 50% more images than the concrete dataset, the CIFAR-10 dataset is
very similar in training time.
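The cost arithmetic behind these figures is straightforward; a small sketch using the hourly rates quoted above (storage fees, which SageMaker adds, are excluded):

```python
HOURLY_RATE_USD = {        # rates quoted in this chapter
    "SageMaker": 1.125,
    "Azure ML": 0.9,
    "Vertex AI": 3.465,    # per node hour for image classification
}

def training_cost(platform: str, minutes: float) -> float:
    """Training cost in US$ for a given training time."""
    return round(HOURLY_RATE_USD[platform] * minutes / 60, 2)

# A Vertex AI job that consumes the full 24-node-hour budget:
print(training_cost("Vertex AI", 24 * 60))  # 83.16
```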

6.3 Answer to RQ 3
From the results presented in figures 5.1.1 to 5.3.3, it is quite clear that Azure
was able to handle different datasets the best. Azure performed the best in
terms of classification accuracy and cost, while very close in training time
compared to the fastest training platform, SageMaker. While being fast to train,
SageMaker had a remarkably low accuracy with the CIFAR-10 dataset
compared to its rivals, as well as not being able to handle the weather dataset
at all. Vertex AI was on par with Azure in terms of classification accuracy with
the concrete and weather datasets but did not quite perform as well with the
CIFAR-10 dataset.

6.4 Ease-of-use experience
6.4.1 Vertex AI
Vertex AI required the least amount of time to conduct the experiments.
Everything could be done via a graphical user interface and the instructions
were clear. The platform was ready to use directly after sign-up which was not
the case with SageMaker. In total, I spent about 2 days reading the
documentation and preparing and conducting the experiments. There were no
issues configuring the training job or creating the datasets. Vertex AI analyzed
the images while processing the dataset, removing invalid images. The best
user experience by far.

6.4.2 Azure ML
Azure required some code via their Python SDK, but the steps were few and
clearly described. In total, I spent about 2 weeks with this platform. That
included reading the documentation, contacting their support as well as
performing the experiments. The main issue I had with this platform was
related to the computing instance. Azure requires a certain GPU-enabled
computing instance, but it was very unclear which of the instances were
approved. This resulted in a lot of trial-and-error as well as a support case
which did not help at all since the technical support requires a paid
subscription. The training job configuration required some trial-and-error as
well since there was no single complete guide provided, I had to take different
pieces from different guides. But overall, it was a decent experience that I think
was worth the larger selection of algorithms and services.

6.4.3 SageMaker
The experience with SageMaker was the worst by far. It started with the lack
of detailed documentation so the only way of moving forward was with trial-
and-error, but to even get started I had to contact the support to get permission
to use a GPU accelerated computing instance. This process took about a week.
Once I received approval, I then quickly found out that the instance the support
team gave me access to was not approved for image classification training jobs.
The support team could not tell me which instances were approved for image
classification since the technical support is a paid service, so I had to look at
external sources to get this information. I then had to submit another support
request to get access to the new computing instance, delaying my experiment
by another 4 to 5 business days.

Due to the lack of documentation, the training job process required a lot of
learning-by-doing. The error messages were generic and provided no hint as to
how to solve the very irritating issues. SageMaker has no dataset tool, so the
images specified in the training configuration are not imported and
processed before training, as they are on the other 2 platforms. This resulted in
the creation of training jobs that failed after 4-5 minutes due to image input
configurations. No validity check was done at the creation of the experiments.

For some unclear reason, SageMaker was unable to handle the weather dataset,
despite configuring the data in 3 different ways, including the .rec format. The
training jobs exited about 5-15 minutes in, stating “failed to read image data”.

After this long process of figuring out the correct configuration, a training job
with the concrete dataset was up and running. After about 30-45 minutes, the
training job exited due to a memory error. This was very surprising considering
that the GPU in the SageMaker instance was the same as the Azure instance,
with 12GB VRAM and the CPU had 61GB RAM. After a few more test runs,
the training job finally was able to complete at a resolution of 175x175.

In general, the graphical user interface of the platform is a bit slower, and the
training job metrics were few and not very easily accessible. In total, the whole
process with SageMaker took about 6 weeks.
7 Discussion
Since there are, to the best of my knowledge, no other studies conducted on
MLaaS with a focus on image classification, there is nothing that I can compare
these results with. It is important to state that these findings are not very
generalizable, and results can vary greatly depending on what type of data is
used.

In general, there should be a trade-off between accuracy and training time/cost.
These 3 MLaaS platforms offer a time limit for each training job, so cost and
time can be limited by the user’s preference. From the results of these
experiments, there is also a correlation between training time and accuracy:
SageMaker was the least accurate platform but also the fastest to train.

From a researcher’s or a data scientist’s point of view, the accuracy, time, and
cost results should not be considered very generalizable, but depending on their
technical knowledge, the ease-of-use conclusion could be valuable. Since
Vertex AI requires no code at all, that would be a great MLaaS to use for
someone with no machine learning and coding experience. Azure requires a
few lines of code but has a bigger selection of algorithms and services. The
platform that would require the highest technical knowledge would be
SageMaker since there is no automatic hyperparameter tuning and the data
preparation has to be done manually before uploading it to the cloud storage.

As a result of this thesis, the following recommendations are made. If
classification accuracy or cost is the priority, and the user has some coding
knowledge, Azure ML is the recommended choice. If the user has no coding
knowledge and can tolerate a higher cost, Vertex AI is the recommendation.
SageMaker had marginally faster training time than Azure, but the lack of
performance in classification accuracy makes this platform hard to recommend
ahead of the others.

Another thing to consider is the complexity and possibilities the platforms
enable for the user. Azure has by far the largest selection of algorithms and
services, making it better suited for a wider target group. If I were to
recommend only one platform, it would be Azure. With its top performance,
large selection, and a fast learning curve, it’s the most versatile and capable
choice out of the 3 platforms tested in this project.

Knowit found the results of this thesis very useful. They have just launched a
new focus area within artificial intelligence and machine learning. From the
results, they found that Azure ML is the most beneficial MLaaS platform out
of the 3. They are also using the Azure platform within other focus areas today,
which is a further benefit if they were to choose Azure ML as their MLaaS.
Even though the ease-of-use criteria are very subjective, Knowit found this
aspect particularly helpful. Good documentation, flexible pricing, and a good
user experience are important for them. Since Knowit is a company with high
technical competence, the flexibility of Azure’s many algorithms outweighs
the no-code advantage of Vertex AI for them.
8 Conclusions and Future Work
This thesis has provided some guidance in terms of the performance and cost
of these three MLaaS providers. While the performance in terms of
classification accuracy may vary greatly due to the type of datasets that are
used, the experiments show a clear advantage for Azure ML compared to both
SageMaker and Vertex AI in terms of versatility and performance. This could
be of help for a researcher or a software developer when choosing which
MLaaS platform to use. Even if the results are not very general, they can
provide some guidance as a starting point.

To draw more accurate conclusions, the scope could have been even narrower
in terms of the datasets selected. If only one category of images would have
been used, more precise conclusions could have been drawn regarding how
image size, number of classes, or number of images in the dataset affects the
result on the different platforms.

For future work, there could be more focus on a specific task of image
classification. A more specific focus on only binary classification, multi-label
classification, or a specific category of images would be interesting. There
could also be research done on additional MLaaS platforms.
References
[1] M. Ribeiro, K. Grolinger and M. Capretz, “MLaaS: Machine Learning
as a Service,” in Proceedings of the International Conference on
Machine Learning and Applications, Miami, 2015.
[2] L. Erran Li, E. Chen, J. Hermann, P. Zhang and L. Wang, “Scaling
Machine Learning as a Service,” in Proceedings of The 3rd
International Conference on Predictive Applications and APIs, Boston,
2017.
[3] H. Assem, L. Xu, T. Sandra Buda and D. O'sullivan, “Machine learning
as a service for enabling Internet of Things and People,” Personal and
Ubiquitous Computing, vol. 20, no. 6, pp. 899-914, 2016.
[4] R. Philipp, A. Mladenow, C. Strauss and A. Voelz, “REVEALING
CHALLENGES WITHIN THE APPLICATION OF MACHINE
LEARNING SERVICES – A DELPHI STUDY,” Journal of Data
Intelligence, vol. 2, no. 1, 2021.
[5] R. Philipp, A. Mladenow, C. Strauss and A. Völz, “Machine Learning
as a Service – Challenges in Research and Applications,” in
Proceedings of the 22nd International Conference on Information
Integration and Web-based Applications and Services, Chiang Mai,
2020.
[6] N. Noshiri, M. Khorramfar and T. Halabi, “Machine Learning-as-a-
Service Performance Evaluation on Multi-class Datasets,” in
Proceedings of the 5th IEEE International Conference on Smart
Internet of Things (SmartIoT), 2021.
[7] G. Saldamli, N. A. Doshi, V. Gadapa, J. J. Parikh, M. M. Patel and L.
Ertaul, “Analysis of Machine Learning as a Service,” in Proceedings of
The 17th International Conference on Grid, Cloud, & Cluster
Computing, GCC'21, Las Vegas, 2021.
[8] S. Xie, Y. Xue, Y. Zhu and Z. Wang, “Cost Effective MLaaS
Federation: A Combinatorial Reinforcement Learning Approach,”
arXiv preprint arXiv:2202.13971, 2022.
[9] H. C. Tanuwidjaja, R. Choi, S. Baek and K. Kim, “Privacy-Preserving
Deep Learning on Machine Learning as a Service – a Comprehensive
Survey,” IEEE Access, vol. 8, pp. 167425-167447, 2020.
[10] P. Orza, “levity.ai,” Levity, 28 10 2021. [Online]. Available:
https://levity.ai/blog/mlaas-platforms-comparative-guide. [Accessed 26
4 2022].
[11] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen and Y. Gao, “Is
Robustness the Cost of Accuracy? – A Comprehensive Study on the
Robustness of 18 Deep Image Classification Models,” in Proceedings
of the European Conference on Computer Vision (ECCV), Munich,
2018.
[12] “https://companiesmarketcap.com/,” [Online]. Available:
https://companiesmarketcap.com/. [Accessed 10 05 2022].
[13] “knowit,” [Online]. Available: https://www.knowit.eu/.
[14] R. Marée, P. Geurts, G. Visimberga, J. Piater and L. Wehenkel, “A
comparison of generic machine learning algorithms for image
classification,” in International Conference on Innovative Techniques
and Applications of Artificial Intelligence, Springer, London, 2003.
[15] “CoreIT,” [Online]. Available: https://coreitx.com/blogs/what-is-
machine-learning-as-a-service-mlaas. [Accessed 10 05 2022].
[16] L. Taylor and G. Nitschke, “Improving Deep Learning with Generic
Data Augmentation,” in Proceedings of IEEE Symposium Series on
Computational Intelligence (SSCI), Bengaluru, 2018.
[17] “Amazon Web Services,” [Online]. Available:
https://aws.amazon.com/.
[18] R. Miller, “TechCrunch,” 29 11 2017. [Online]. Available:
https://techcrunch.com/2017/11/29/aws-releases-sagemaker-to-make-it-
easier-to-build-and-deploy-machine-learning-models/. [Accessed 10 5
2022].
[19] Amazon Web Services, “aws.amazon.com,” [Online]. Available:
https://aws.amazon.com/sagemaker/features/?nc=sn&loc=2. [Accessed
10 5 2022].
[20] Amazon Web Services, [Online]. Available:
https://docs.aws.amazon.com/sagemaker/latest/dg/IC-
HowItWorks.html. [Accessed 10 5 2022].
[21] Amazon Web Services, [Online]. Available:
https://docs.aws.amazon.com/sagemaker/latest/dg/IC-tuning.html.
[Accessed 10 5 2022].
[22] “Google Cloud Platform,” [Online]. Available:
https://console.cloud.google.com/.
[23] Google Cloud, [Online]. Available:
https://cloud.google.com/blog/products/ai-machine-learning/google-
cloud-launches-vertex-ai-unified-platform-for-mlops. [Accessed 10 5
2022].
[24] “Microsoft Azure,” [Online]. Available: https://portal.azure.com/.
[25] Microsoft, 16 6 2014. [Online]. Available:
https://blogs.microsoft.com/blog/2014/06/16/microsoft-azure-machine-
learning-combines-power-of-comprehensive-machine-learning-with-
benefits-of-cloud/. [Accessed 10 5 2022].
[26] Microsoft Azure, [Online]. Available: https://docs.microsoft.com/en-
us/azure/machine-learning/how-to-auto-train-image-models. [Accessed
10 05 2022].
[27] A. Krizhevsky, “Learning multiple layers of features from tiny images,”
2009.
[28] Ç. Özgenel and A. Gönenç Sorguç, “Performance Comparison of
Pretrained Convolutional Neural Networks on Crack Detection in
Buildings,” in ISARC 2018, Berlin, 2018.
[29] G. Ajayi, “Multi-class Weather Dataset for Image Classification,”
Mendeley Data, 2018.
[30] Amazon Web Services, [Online]. Available:
https://docs.aws.amazon.com/cli/latest/reference/s3/. [Accessed 10 5
2022].
[31] Amazon Web Services, [Online]. Available:
https://docs.aws.amazon.com/sagemaker/latest/dg/image-
classification.html. [Accessed 10 5 2022].
[32] Google Cloud, [Online]. Available: https://cloud.google.com/vertex-
ai/docs/datasets/prepare-image. [Accessed 10 5 2022].
[33] Microsoft Azure, [Online]. Available: https://docs.microsoft.com/en-
us/azure/machine-learning/how-to-prepare-datasets-for-automl-images.
[Accessed 10 5 2022].
[34] Y. Yao, Z. Xiao, B. Wang, B. Viswanath, H. Zheng and B. Zhao,
“Complexity vs. Performance: Empirical Analysis of Machine Learning
as a Service,” in Proceedings of IMC ’17, ACM, New York, 2017.
[35] MXNet Apache, [Online]. Available:
https://mxnet.apache.org/versions/1.9.0/api/faq/recordio.html.
[Accessed 10 5 2022].
