2023 State of AI Infrastructure Survey


eBook | Jan 2023

The 2023 State of AI Infrastructure Survey

All rights reserved to Run:ai. No part of this content may be used without express permission of Run:ai. www.run.ai

Table of Contents

Introduction and Key Findings
Survey Report Findings
Challenges for AI Development
AI/ML Stack Architecture
Tools Used to Optimize GPU Allocation Between Users
Plan to Grow GPU Capacity in the Next 12 Months
Aspects of AI Infrastructure Planned for Implementation (within 6-12 months)
Measurement of AI/ML Infrastructure Success
Tools Used to Monitor GPU Cluster Utilization
On-Demand Access to GPU Compute
Plans to Move AI Applications and Infrastructure to the Cloud
Percentage of AI/ML Models Deployed in Production
Main Impediments to Model Deployment
Demographics
About Run:ai


Introduction and Key Findings

Introduction
The artificial intelligence (AI) industry has grown rapidly in recent years, as has the need for more advanced and scalable infrastructure to support its development and deployment. The global AI Infrastructure Market was valued at $23.50 billion in 2021 and is expected to reach $422.55 billion by 2029, at a forecasted CAGR of 43.50% between 2022 and 2029.

One of the main drivers of progress in the AI infrastructure market has been increasing awareness among enterprises of how AI can enhance their operational efficiency, attract new business and grow new revenue streams, while reducing costs through the automation of process flows. Other drivers include the adoption of smart manufacturing processes using AI, blockchain and IoT technologies, increased investment by GPU/CPU manufacturers in the development of compute-intensive chips, and the rising popularity of chatbots, such as OpenAI's recently launched ChatGPT.

The current hype around AI has given rise to a renewed focus on getting it into the enterprise, and organizations are increasingly eager to start using and developing AI applications themselves. But with an abundance of new AI infrastructure tools flooding the rapidly evolving industry, it is akin to a sort of technological "Wild West", with no real best practices for enterprises to follow as they get AI into production. As they begin to invest more heavily in AI, there's a lot riding on how they decide to build their infrastructure and service their practitioners.

This is the second 'State of AI Infrastructure' survey we have run. With all the new activity and new companies in the AI space, we're keen to see what's changed. We're particularly interested in new insights into how organizations are approaching the build of their AI infrastructure: why they are building it, how they are building it, what main challenges they face, and how the abundance of different tools has affected getting AI into production. We hope that the insights from this survey will be helpful to those who both build and use AI infrastructure.

Methodology
To get more insight into the current state of AI infrastructure, we commissioned a survey of 450 Data, Engineering, AI and ML (Machine Learning) professionals from a variety of industries. The survey was administered online by Global Surveyz Research, an independent global research firm. It is based on responses from a mix of Data Scientists, Researchers, Heads of AI, Heads of Deep Learning, IT Directors, VPs of IT, Systems Architects, ML Platform Engineers and MLOps professionals, from companies across the US and Western EU ranging in size from under 200 to over 10,000 employees. The respondents were recruited through a global B2B research panel and invited via email to complete the survey, with all responses collected during the second half of 2022. The average amount of time spent on the survey was five minutes and fifty seconds. The answers to the majority of the non-numerical questions were randomized in order to prevent order bias.


Key Findings

1. Data has been overtaken by Infrastructure and Compute as the main challenges for AI development

A whopping 88% of survey respondents admitted to having AI development challenges (Figure 1), which is telling in itself. But it's also interesting to note that Data, which was ranked by 61% of respondents in last year's survey as their main challenge in AI development, was overtaken this year by infrastructure (i.e., the different platforms and tools that comprise "the stack") and compute (i.e., getting access to GPU resources, not having to wait for resources, etc.) – chosen by 54% and 43% of respondents respectively as their main challenges. This year, Data ranked as the third biggest challenge in AI development (41%). The fact that infrastructure and compute-related challenges are now the top concern for companies reinforces the importance of building the right foundation, for the right stack, to get the most out of their compute.

2. The more GPUs, the bigger the reliance on multiple third-party tools

As organizations scale and require more GPUs, it becomes more complex to build the right AI infrastructure to get the right amount of compute to all of the different workloads, tools, and end users. 80% of companies are now using third-party tools, and the more GPUs they require, the bigger their reliance on multiple third-party platforms – rising from 29% in companies with fewer than 50 GPUs to 50% in companies with more than 100 GPUs (Figure 2). What would make more sense is a more open, middleware approach, where organizations can use different tools that run on the same infrastructure, so that they are not locked into one end-to-end platform.



3. On-demand access to GPU compute is still very low, with 89% of companies facing resource allocation issues regularly

Only 28% of the respondents have on-demand access to GPU compute (Figure 8). When asked how GPUs are assigned when not available on-demand, 51% indicated they are using a ticketing system (Figure 9), suggesting that on-demand access is still lacking. So, it's no wonder that 89% of respondents face allocation issues regularly (Figure 10) – even though some of them (58%) claim to have somewhat automatic access – with 40% facing those GPU/compute resource allocation issues weekly.

4. In 88% of companies, more than half of AI/ML models never make it to production

While most companies are planning to grow their GPU capacity or other AI infrastructure in the coming year, for 88% of them (compared with 77% in last year's survey), more than half their AI/ML models don't make it to production. On average, only 37% of AI/ML models are deployed in production environments (Figure 13). The main impediments to actually deploying (Figure 14) include scalability (47%), performance (46%), technical (45%), and resources (42%). The fact that they were all mentioned as a "main impediment" by such a substantial portion of the respondents shows that there isn't just one glaring impediment to model deployment, but rather a multi-faceted one.

5. 91% of companies are planning to grow their GPU capacity or other AI infrastructure in the next 12 months

The vast majority (91%) of companies are planning to grow their GPU capacity or other AI infrastructure by an average of 23% in the next 12 months (Figure 4), despite the uncertainty of the current economic climate. Organizations won't invest in AI unless they can actually get value out of it, so this result is a resounding testament to the fact that most companies see huge potential and value in continued investment in AI.


Survey Report Findings

Challenges for AI Development


When asked what their company's main challenges are around AI development, 88% of respondents admitted to having AI development challenges. The top challenges are infrastructure-related challenges (54%), compute-related challenges (43%), and data-related challenges (41%).

It's interesting to note that infrastructure and compute have overtaken Data as the biggest challenges. This reinforces the importance of building the right foundation, for the right stack, to get the most out of your compute.

Infrastructure-related challenges: 54%
Compute-related challenges: 43%
Data-related challenges: 41%
Expense of doing AI: 34%
Training-related challenges: 34%
Defining business goals: 18%
We have no AI development challenges: 12%

Figure 1: Challenges for AI development


AI/ML Stack Architecture


When asked how their AI/ML infrastructure is architected, 11% of respondents said it is all built in-house, while 47% have a mix of in-house and third-party platforms. We also saw that the use of multiple third-party platforms grows with the number of GPUs (29% for those with fewer than 50 GPUs, and up to 50% for those with 100+ GPUs). This confirms that the practice of taking AI into production and streamlining it (MLOps) isn't a one-size-fits-all process.

Organizations are using a mix of different tools to build their own best-of-breed platforms to support their needs (and those of their users).

The fact that the state of AI infrastructure appears to be somewhat chaotic, with an abundance of tools and no real best practices, is also testament to the growing need among organizations for multiple platforms to meet their various AI development needs, giving rise to new technologies, and new types of users and applications. But this could also overwhelm infrastructure resources, so the more GPUs companies have, the more urgent the need for a unified compute layer that supports all these different tools, making sure that the volume of resources and the way they are accessed are aligned to specific end users and the different tools they need.

All respondents:
All built in-house: 11%
Mix of in-house and 3rd-party platforms: 47%
Multiple 3rd-party platforms: 33%
Only one 3rd-party platform: 9%

Figure 2: AI/ML stack architecture (broken down by GPU farm size, the use of multiple 3rd-party platforms rises from 29% at < 50 GPUs to 30% at 51-100 GPUs and 50% at 100+ GPUs).


Tools Used to Optimize GPU Allocation Between Users

According to the survey results, basically everyone (99% of respondents) is using tools to optimize GPU allocation between users, with open-source tools being the most popular choice (36%), followed by home-grown tools (23%).

Interestingly, the top two choices are also the most challenging for organizations when taking AI into production, because they are both very brittle, indicating there is room for more than half (59%) of companies to move to a more professional way of optimizing GPU allocation between users.

The fact that 73% are using open-source tools, home-grown tools or Excel sheets also shows that organizations are clearly facing a lot of issues allocating GPU resources, so it appears that despite plenty of options, there is still no clear or definitive way to optimize.

With the majority of respondents still using tools that are not enterprise-grade, these tools will require attention, especially as their organizations scale and need to take more AI into production.

Open-source tools: 36%
Home-grown: 23%
HPC tools like Slurm or LSF: 18%
Excel spreadsheets / Whiteboard: 14%
Run:ai: 7%
Something else: 3%
Don't use tools to manage GPU allocation: 0.8%
(99% using tools)

Figure 3: Tools used to optimize GPU allocation between users


Plan to Grow GPU Capacity in the Next 12 Months

It's interesting to see that the vast majority of companies (91%) are planning to grow their GPU capacity or other AI infrastructure by an average of 23% in the next 12 months – despite the uncertainty surrounding the current economic climate.

Organizations won't invest in AI unless they can actually get value out of it, so this result shows they're still definitely seeing value in continued investment in AI.

91% plan to increase (weighted average: increase by 23%)
No increase: 9%
1-25% increase: 50%
26-50% increase: 39%
51%+ increase: 3%

Figure 4: Plan to grow GPU capacity in the next 12 months
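As a rough check on where a weighted average like "23%" comes from, here is a minimal sketch of the arithmetic (our own back-of-the-envelope calculation, not from the survey report; the bucket midpoints are assumptions we chose, not values the report states):

```python
# Minimal sketch (illustrative only): estimating the weighted-average planned
# GPU capacity increase from the response buckets in Figure 4, assuming a
# rough midpoint for each bucket.
buckets = [
    # (share of respondents, assumed bucket midpoint in %, label)
    (0.09, 0.0,  "No increase"),
    (0.50, 13.0, "1-25% increase"),
    (0.39, 38.0, "26-50% increase"),
    (0.03, 60.0, "51%+ increase"),
]
weighted_avg = sum(share * midpoint for share, midpoint, _ in buckets)
print(f"Estimated average planned increase: {weighted_avg:.0f}%")  # prints ~23%
```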


Aspects of AI Infrastructure Planned for Implementation (within 6-12 months)

When asked what aspects of AI infrastructure they are looking to implement in the next 6-12 months, respondents indicated that the top aspects relate to challenges in production, including monitoring, observability and explainability (50%), model deployment and serving (44%), and orchestration and pipelines (34%), so it appears that their answers are focused more on AI models in production and less on model development.

Even the aspect least indicated for planned implementation was mentioned by 19% of respondents, which is still a respectable number. With all of the options provided indicated as important, it shows that there's "a lot going on": there are no one or two glaringly obvious priorities, but rather multiple aspects of AI infrastructure that companies need to focus on to get their AI into production faster. Many of these aspects involve MLOps, which is interesting, because each aspect requires different tools, and therefore a solid computing platform to plug all of these different tools into.

Monitoring, observability and explainability: 50%
Model deployment and serving: 44%
Orchestration and pipelines: 34%
Data versioning and lineage: 33%
Feature stores: 29%
Distributed training: 27%
Synthetic data: 19%

Figure 5: Aspects of AI infrastructure planned for implementation (within 6-12 months)


Measurement of AI/ML Infrastructure Success

When asked how they are measuring the success of their AI/ML infrastructure, 28% of respondents said they are monitoring their reach to new customers, 21% are measuring their increase in revenue, and 17% measure success by how much time they are saving.

The top measures of success indicated by respondents represent benefits that are both internal and external (both to the companies and their end users), and clearly demonstrate that they are investing in AI because they want to create value, which is especially important during an uncertain economic climate.

Reaching new customers: 28%
Increasing our revenue: 21%
Saving time: 17%
Saving money: 13%
Better serving our existing customers: 11%
Improving our products and services: 9%
Inspiring the creation of new products and services: 2%

Figure 6: Measurement of AI/ML infrastructure success

Tools Used to Monitor GPU Cluster Utilization

The top tools used to monitor GPU cluster utilization are NVIDIA-SMI (86%), GCP-GPU-utilization-metrics (53%), and NGPUtop (52%).

It's very hard to visualize problems with utilization, and therefore very important for companies to get insight into how they are utilizing their GPUs. But with most respondents using NVIDIA-SMI, GCP-GPU-utilization-metrics and NGPUtop to monitor GPU cluster utilization, it appears it's just as difficult to find the right tool to better understand GPU utilization, because they all show metrics from one platform, or even one host, and don't provide a broad overview (or 'the optimal view') of an organization's GPU utilization across its entire infrastructure. They only provide a 'current snapshot' of a small subset of the infrastructure.

NVIDIA-SMI: 86%
GCP-GPU-utilization-metrics: 53%
NGPUtop: 52%
Nvtop: 49%
Run:ai's rntop: 35%

Figure 7: Tools used to monitor GPU cluster utilization
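To make the "single-host snapshot" limitation concrete, here is a minimal sketch (our own illustration, not from the survey) of what cluster-wide monitoring built on top of nvidia-smi tends to look like: each host has to be polled separately and the results stitched together by hand. The hostnames are hypothetical.

```python
# Minimal sketch (illustrative only): aggregating per-host nvidia-smi output
# into a crude cluster-wide utilization view. nvidia-smi only reports on the
# GPUs of the machine it runs on, so every host must be polled and the
# numbers merged manually.
import subprocess

HOSTS = ["gpu-node-01", "gpu-node-02"]  # hypothetical hostnames

def gpu_utilization(host: str) -> list[int]:
    """Return per-GPU utilization (%) for one host via ssh + nvidia-smi."""
    out = subprocess.run(
        ["ssh", host, "nvidia-smi",
         "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    all_gpus = [u for host in HOSTS for u in gpu_utilization(host)]
    if all_gpus:
        avg = sum(all_gpus) / len(all_gpus)
        print(f"{len(all_gpus)} GPUs polled, average utilization {avg:.0f}%")
```

Even a sketch like this only yields a point-in-time snapshot; a continuous, organization-wide view requires collecting and storing these metrics centrally, which is exactly the gap respondents describe.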


On-Demand Access to GPU Compute

When asked about the availability of on-demand access to GPU compute, only 28% of the respondents said they have on-demand access (Figure 8).

When asked how GPUs are assigned when not available on-demand, 51% of the respondents indicated they are using a ticketing system (Figure 9), suggesting that there's still no good on-demand access. It's no wonder, therefore, that 89% of respondents face allocation issues regularly (Figure 10), even though some of them (58%) claim to have somewhat automatic access.

Only 11% said they rarely run into GPU allocation issues; 13% have allocation issues daily, and 40% face allocation issues weekly.

Automatically accessed: 28%
Somewhat automatically accessed: 58%
No automatic access: 14%

Figure 8: Availability of on-demand access to GPU compute

89% face allocation issues regularly

Assigned by ticketing system: 51%
Assigned statically to specific users/jobs: 31%
Assigned by manual request: 18%

Figure 9: GPU assignment without on-demand access

Daily: 13%
Weekly: 40%
Bi-weekly: 24%
Monthly: 13%
Rarely: 11%

Figure 10: Frequency of GPU/compute resource allocation issues


Plans to Move AI Applications and Infrastructure to the Cloud

51% of the respondents already have their applications and infrastructure on the cloud, and 33% said they were planning to move to the cloud by the end of 2022.

When diving deeper to see how those already on the cloud differ based on their level of automatic access to their GPUs, we saw that of the 51% of companies already on the cloud, the highest level of adoption (78%) is among companies without automatic access.

All respondents indicated that they are either already on the cloud or planning to move to the cloud this year or next, but according to these results, the cloud doesn't necessarily solve their GPU access problem, which is something that those who haven't moved to the cloud yet should be aware of.

Already on cloud: 51%
This year: 33%
Next year: 16%
In 5 years: 0%
No plans: 0%

Figure 11: Plans to move AI applications and infrastructure to the cloud

"Already on cloud" by access to GPU:
Automatic access: 47%
Somewhat automatic access: 46%
No automatic access: 78%

Figure 12: "Already on cloud" by access to GPU


Percentage of AI/ML Models Deployed in Production

On average, 37% of AI/ML models are deployed in production (weighted average).

88% of respondents (compared with 77% in last year's survey) are deploying less than 50% of their models, indicating that it's even harder now to get things into production, not just because of infrastructure considerations but because of business and organizational ones as well.

It's also why companies are investing so heavily in a variety of different tools (as opposed to just one) to get more AI into production, but as these results show, there's still a lot of room for improvement.

Figure 13: Percentage of AI/ML models deployed in production (distribution across the ranges <10%, 10-24%, 25-39%, 40-49%, 50-74%, 75-90% and 90%+; 88% deploying < 50%; weighted average: 37%)


Main Impediments to Model Deployment

The main impediments to actually deploying models include scalability (47%), performance (46%), technical (45%), and resources (42%). The fact that they were all mentioned as a "main impediment" by a fairly substantial portion of the respondents shows once again that there isn't just one glaring impediment to model deployment, but rather a multi-faceted one.

Scalability: 47%
Performance: 46%
Technical: 45%
Resources: 42%
Approval to deploy: 23%
Privacy / legal issues: 8%

Figure 14: Main impediments to model deployment


Demographics
Country, Department, Role, Job Seniority

Figure 15: Country (United States, United Kingdom, France, Germany)

Figure 16: Department (R&D / Engineering, IT, DevOps, Data / Data Science)

Figure 17: Role (IT / Infrastructure Architect, Data Scientist / Researcher, MLOps / DevOps, Head of an AI / Deep Learning Initiative, Platforms Engineer)

Figure 18: Job seniority (C-Suite, CTO, VP / Head, Director, Manager, Team Member, Analyst)



Company Size, GPU Farm Size

Figure 19: Company size (number of employees: <200, 201-500, 501-1K, 1,001-5K, 5,001-10K, >10K)

Figure 20: GPU farm size (<10 GPUs, 11-50, 51-100, 100+)


About Run:ai
Run:ai's Atlas Platform brings cloud-like simplicity to AI resource management, providing researchers with on-demand access to pooled resources for any AI workload. An innovative cloud-native operating system – which includes a workload-aware scheduler and an abstraction layer – helps IT simplify AI implementation, increase team productivity, and gain full utilization of expensive GPUs. Using Run:ai, companies streamline development, management, and scaling of AI applications across any infrastructure, including on-premises, edge and cloud.

For more information, please visit us at https://www.run.ai/