
Optimal Parallel Computing Solution for Machine Learning

Recommender Systems

Introduction
At BlueEvent, speedy event recommendation and matching is the primary
purpose of our flagship software products. We require a reliable computational hardware system
that lets us prototype a machine learning (ML) powered recommender model and supports the
rapid release and integration of that model into our future software shipments. However, the cost
of such hardware, particularly the parallel computing hardware needed to train ML models, is
daunting for our small private company. We therefore require a graphics processing unit (GPU)
solution, a popular form of parallel computing hardware, that offers the best performance per
dollar.

Several GPU options are available on the marketplace. The first option is to invest in
NVidia GPUs to build an on-premise system; this is the most popular and most potent hardware
option for parallel computing and ML, with broad support and adoption on the market, but the
initial pricing is steep. The alternative is renting on-demand GPU time from cloud-hosting
platforms such as Google Cloud, a versatile and initially affordable option, but with downsides
in performance and long-term cost. Our main goal is to build a system that quickly and reliably
prototypes new ML recommender models; thus, stability and performance come first, and
pricing comes second in our evaluation.

Commercial Recommender Systems - An Overview


Recommender systems are among the most widespread applications of machine learning
in the 21st century and a must-have for any matching platform; notable large-scale examples
include YouTube’s video and Amazon’s product recommender systems. Modern recommender
systems are built on neural network architectures and accelerated with GPUs because of GPUs’
unmatched performance in parallel computing (Caulfield, 2020). There are three main
categories of recommender systems: content-based, collaborative filtering, and hybrid systems
(Liao, 2018). The main mechanisms behind these types of systems are described in Figure 1.
Figure 1. An overview of recommender systems
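As a minimal illustration of the collaborative-filtering category named above, the sketch below computes item-item cosine similarity over a toy user-item rating matrix, in the spirit of the item-based KNN approach Liao (2018) describes. The rating values and function names are illustrative assumptions, not data or code from the cited sources.

```python
from math import sqrt

# Toy user-item rating matrix (rows = users, columns = event items).
# Hypothetical ratings on a 1-5 scale; 0 means "not rated".
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
]

def cosine_similarity(a, b):
    """Cosine similarity between two item rating vectors (0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def item_column(matrix, j):
    """Extract the ratings all users gave to item j."""
    return [row[j] for row in matrix]

def most_similar_item(matrix, j):
    """Index of the item whose rating pattern is closest to item j's."""
    n_items = len(matrix[0])
    candidates = [k for k in range(n_items) if k != j]
    return max(candidates,
               key=lambda k: cosine_similarity(item_column(matrix, j),
                                               item_column(matrix, k)))

print(most_similar_item(ratings, 0))  # item most similar to item 0
```

A production system would replace this brute-force scan with an approximate nearest-neighbor index and run the similarity computation on the GPU, which is precisely the workload the hardware comparison below targets.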

On-premise GPU-powered Workstation (NVidia GPUs)


A popular machine learning platform for building recommender systems is TensorFlow,
which is GPU-accelerated and natively supports the parallel computing architecture found in
NVidia hardware. Our goal is to find a reliable GPU that supports the necessary parallel
instruction operations during training, since the training algorithms analyze millions of data
entries (Liao, 2018).

Advantages

● Native Support for Machine Learning Libraries: NVidia GPUs are unmatched in their
support for standard ML libraries, including the NVidia CUDA Deep Neural Network
library; powerful alternatives exist but with significantly weaker library support
(Ramesh, 2018). Having a widely supported platform allows for more flexibility when
the company needs to adapt its recommender systems to a new environment.
● Long-term Support: NVidia GPUs are covered by a three-year warranty, ensuring that
the company can make full use of each unit’s lifecycle while handling model training
over an extended period.
● Higher Control and Security over Data: On-premise servers give the company complete
control over data processing, meaning the data is less susceptible to attacks that are
otherwise prevalent on cloud computing platforms (“Cloud vs.”, 2018). Because
BlueEvent is bound to handle the personal information of several million customers and
thousands of events, this promise of privacy can be a critical factor in the success of the
company’s recommender systems and applications.
● High Performance: on-premise GPU setups have been shown to deliver unparalleled
performance, up to 6x faster than online on-demand/cloud GPUs (Boesen, 2017). This
allows for more rapid prototyping of the company’s recommender systems.

Disadvantages

● Expensive Up-front Investment: According to Figure 2, among 4-GPU setups capable of
running machine learning models and tasks, performance-per-dollar is highest for a set
of RTX 3080 cards with 10GB of VRAM (Dettmers, 2020). However, a set of four of
these GPUs costs upwards of $2,796; this is a steep price considering that BlueEvent
needs to build medium-size servers and may expand operations in the future.
● Operational Responsibilities: Having powerful on-premise computation means that the
company must handle various system setup and maintenance tasks itself. For example,
unscheduled reboots or other issues can terminate all running processes and delay the
schedule of a training task (Boesen, 2017).

Figure 2. Normalized 4-GPU deep learning performance-per-dollar relative to RTX 3080


Cloud-based On-demand GPU computation
Cloud-based computational services can be an excellent alternative to on-premise servers,
especially if the high up-front investment for on-premise servers does not fit within
BlueEvent’s initial budget. Nevertheless, cloud computing can be limited in terms of
performance and, surprisingly, in long-term pricing.

The requirements for running computational tasks on the cloud are vastly different from
running the same tasks on-premise with NVidia GPUs. The data is loaded onto an online server,
the company then chooses from a variety of supported GPUs to run on, and the GPU workloads
are handled by the cloud-hosting platform (“Cloud GPUs”, n.d.). A high-speed, stable Internet
connection is required to ensure data stream stability, and the cost is determined by GPU time,
i.e., the time the computation tasks consume on the upstream servers.

Advantages

● Operational Simplicity: As opposed to the extended set of responsibilities demanded by
on-premise GPU servers, online computation frees the company from performing
maintenance manually. Online GPU servers such as Amazon’s receive automatic,
regularly scheduled updates, which lets BlueEvent plan the computation schedule and
manage prototyping time effectively (Boesen, 2017).
● Small Up-front Cost: The company pays per hour of GPU computation, which amounts
to paying only on demand. For example, a 4-GPU NVidia Tesla P100 setup costs only
$5.84/hour (Figure 3) (“GPUs Pricing”, n.d.). On-demand processing means the company
can exploit the low initial cost to push out recommender system prototypes and
evaluations early on, without making large investments to get a new server running.
Figure 3. GPU pricing by hour on Google Cloud Platform

Disadvantages

● Reliance on Internet Connection: Cloud computing requires the data to be uploaded to
the server before processing, which means the company must ensure that the connection
to the server is fast and stable. This is a compromise on BlueEvent’s side, since any
unplanned interruption of Internet service will delay prototyping.
● Low Customizability: To reduce operational complications, cloud-based servers are not
nearly as customizable as on-premise servers. This means the company cannot fit custom
recommender systems or build an exclusive environment for a prototype.
● Lower Data Security: Transferring data over the Internet for processing carries
data-leakage risks, as demonstrated by the AWS voter information leak in 2016
(“Cloud vs.”, 2018). This is a liability for BlueEvent, since the company handles massive
flows of user information in its software products.
● Expensive Long-term: Even though the start-up cost of cloud-based servers is small, the
cumulative cost of running models on them eventually surpasses that of on-premise
NVidia GPU servers (Cheng, 2017). This can prove troublesome if the company needs to
maintain trained recommender models in the long run: after around 100 days of training
on a cloud server, the cost exceeds that of training on-premise (Boesen, 2017).
● Lower Performance compared to On-premise Servers: As mentioned before, on-premise
servers outperform cloud-based computing, which is a significant factor in light of the
company’s requirement for rapid recommender system prototyping. On average, the same
ML task runs faster on an on-premise server than on a cloud-based server (Figure 4)
(Cheng, 2017).

Figure 4. Training time for each platform, on-premise and two clouds (left to right)
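The roughly 100-day break-even figure cited above can be sanity-checked with the prices quoted in this report: about $2,796 up front for a four-card RTX 3080 set versus $5.84 per hour for a four-GPU Tesla P100 cloud instance. The assumed daily training load below is a hypothetical, and the calculation ignores on-premise power and maintenance costs, so it is a rough sketch rather than a precise comparison.

```python
# GPU prices are taken from this report; the utilization figure is an assumption.
ON_PREM_UPFRONT = 2796.0   # USD, set of four RTX 3080 cards (Dettmers, 2020)
CLOUD_RATE = 5.84          # USD per hour, four Tesla P100 GPUs ("GPUs Pricing", n.d.)
HOURS_PER_DAY = 5.0        # assumed average daily training load (hypothetical)

# Cloud cost equals the on-premise purchase price after this many GPU-hours.
break_even_hours = ON_PREM_UPFRONT / CLOUD_RATE
break_even_days = break_even_hours / HOURS_PER_DAY

print(f"Break-even after {break_even_hours:.0f} GPU-hours "
      f"(~{break_even_days:.0f} days at {HOURS_PER_DAY:g} h/day)")
```

At roughly five training hours per day, the break-even point lands near 96 days, which is consistent with the ~100-day figure Boesen (2017) reports; heavier daily utilization would shift the break-even point even earlier in the on-premise option's favor.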

Conclusion
Both an on-premise server solution and a cloud-based computation solution meet
different requirements for training the recommender systems BlueEvent needs. An on-premise
server is a costly up-front investment that is powerful enough to rapidly roll out prototypes of
the recommender system, which can speed up software development for event-matching
functionalities. The focus on on-premise performance requires more maintenance, but it gives
the company more control over the data security and model customizability aspects of the
development process. Alternatively, renting a cloud-based server for computation frees the
company from resource-consuming maintenance responsibilities while offering a more
affordable way to develop early recommender prototypes. The trade-off is lower performance
and less control over the data, model, and security, making this option ideal if BlueEvent would
like to explore early recommender systems with less initial commitment and a more
experimentation-oriented approach.
Works Cited

Boesen, M. R. (2017, May 8). GPU servers for machine learning startups: Cloud vs
On-premise? Medium.
https://medium.com/@thereibel/gpu-servers-for-machine-learning-startups-cloud-vs-on-premise-9a9dedfcadc9

Caulfield, B. (2020, May 14). What’s a Recommender System? NVIDIA Corporation.
https://blogs.nvidia.com/blog/2020/05/14/whats-a-recommender-system/

Cheng, C. H. (2017, September 5). On-premise (DIY) vs Cloud GPU. Towards Data Science.
https://towardsdatascience.com/on-premise-diy-vs-cloud-gpu-d5280320d53d

Dettmers, T. (2020, September 7). Which GPU(s) to Get for Deep Learning: My Experience and
Advice for Using GPUs in Deep Learning. Tim Dettmers.
http://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/#The_Most_Important_GPU_Specs_for_Deep_Learning_Processing_Speed

Exxact Corporation. (2018, November 13). Cloud vs. On-Premises: Which is Really Better for
Deep Learning?
https://blog.exxactcorp.com/cloud-vs-on-premises-which-is-really-better-for-deep-learning/

Google. (n.d.). Cloud GPUs. https://cloud.google.com/gpu

Google. (n.d.). GPUs Pricing. https://cloud.google.com/compute/gpus-pricing

Liao, K. (2018, November 11). Prototyping a Recommender System Step by Step Part 1: KNN
Item-Based Collaborative Filtering. Towards Data Science.
https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea

Ramesh, P. (2018, August 29). NVIDIA leads the AI hardware race. But which of its GPUs
should you use for deep learning? Packt Publishing Ltd.
https://hub.packtpub.com/nvidia-leads-the-ai-hardware-race-but-which-of-its-gpus-should-you-use-for-deep-learning/
