

WHITEPAPER

LLMOps Explained
How to build an effective, performant LLMOps tech stack

Authors: Anupam Datta, Shayak Sen, TruEra, and Arijit Bandyopadhyay, Intel Corporation

Table of Contents

The Rise of Foundation Models and the LLMOps Stack
The Emerging LLMOps Tech Stack
LLM Observability with TruEra
  Evaluation
  Tracking
  Monitoring
The Road Ahead for GenAI




The Rise of Foundation Models and the LLMOps Stack
Foundation models such as GPT-4, PaLM, LLaMA, DALL-E, and Falcon are creating a paradigm shift in how AI-powered applications are built and deployed. A foundation model is "any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks."¹ One popular downstream model is ChatGPT, with which users can interact in a conversational way. Built on the GPT series of foundation models, ChatGPT saw tremendous uptake, with its user base growing to 100 million users in just two months after launch.

The growing set of applications built on these models, especially Large Language Models (LLMs), has enormous potential to transform the world. Two classes of applications are particularly taking off: (1) Retrieval-augmented Generation (RAG), where LLMs are augmented with long-term memory to make them less prone to hallucinations (e.g., chatbots that support question-answering against a knowledge base of documents); and (2) Agents, where LLMs are augmented with capabilities to leverage a suite of external tools to plan and execute actions that can impact the real world (e.g., a personal assistant that can plan a travel itinerary and make reservations).

With the proliferation of new LLM-based apps has come an intense need for ways to productionize, scale, and manage these applications. As a result, a new LLMOps tech stack is emerging to enable effective LLMs and LLM apps to be built, deployed, and maintained (see Figure 1). We believe that this space will see an expansion of LLMs – proprietary and open source – that developers and enterprises will build on and adapt to their use cases, with careful consideration of their quality and governance. This whitepaper presents a perspective on the key components of this tech stack, spanning the storage, compute, and observability layers. In doing so, we highlight the key areas of focus for TruEra and Intel: LLM Observability and hardware/software capabilities that enable performant and efficient LLM apps.

LLMOps consists of the processes and tools for the operational and performance management of LLMs in production.

¹ Foundation model definition from Bommasani et al. 2021.




The Emerging LLMOps Tech Stack

We describe below the key components of the emerging LLMOps tech stack and the tasks that they support to enable effective LLMs and LLM apps.² These tasks are described roughly in the order of the steps of a typical workflow for building, deploying, and maintaining LLMs and LLM apps.

[Figure 1 diagram: three stacked layers.
- Observability layer: Testing, Debugging, Monitoring
- Compute layer: Training (LLM Fine-tuning), Experimentation (LLM Prompt Engineering), Model Serving (LLM FM APIs)
- Storage layer: Model Repository, Feature Store, LLM Vector DB, Data Lake / Warehouse]

Figure 1: The LLMOps technology stack has three major layers, including storage, compute, and observability. All three layers are critical to creating and maintaining successful LLM applications.

² For each component, we mention some representative vendors. This list is not exhaustive.





Typical Workflow for Building, Deploying, and Maintaining LLMs and LLM Apps

1. Foundation Model Training: OpenAI, PaLM (Google), Llama (Meta), Anthropic, Cohere
2. Data Preparation: Snorkel AI, LlamaIndex, LangChain
3. Vector Database Index Construction: Pinecone, Weaviate, Chroma, Milvus, Elastic Search, Rockset
4. Model Fine-Tuning: OpenAI, Google, AWS SageMaker JumpStart, Cohere
5. App Creation: LlamaIndex, LangChain, Haystack
6. Prompt Engineering, Tracking, Collaboration: TruLens, TruEra, W&B Prompts
7. Evaluation and Debugging: TruLens, TruEra, OpenAI Evals
8. Deployment and Inference: AWS, GCP, Azure, OpenAI, Anthropic, Together.ai, Cohere, Databricks, Snowflake
9. App Hosting: Vercel, Streamlit, Anyscale, Steamship, Modal
10. Monitoring: TruEra, Honeyhive

Figure 2: A representative workflow for building an LLM application, with some representative vendors providing technical support at each step.





Foundation model training

Generative pre-training: An LLM is trained on vast amounts of data to predict the next word after a sequence of text. This makes generative language models remarkably good at producing human-like text. Over the past few years, LLMs have grown in scale from the 1-2 billion parameter BERT models to the tens to hundreds of billions of parameters of the GPT series from OpenAI, PaLM from Google, Llama from Meta, and models from HuggingFace, Databricks, Anthropic, Cohere, and more. The training process is accelerated using advanced hardware and software stacks that enable massive parallelization, e.g., from Intel and NVIDIA. In particular, the Intel Habana Gaudi processors offer significant performance, scaling, and developer ease-of-use benefits for training while keeping cost and power consumption low.

Supervised learning, RLHF, RLAIF: Next, language models are trained on specific examples of human-provided prompts and responses in order to guide their behaviors. One common stage is Reinforcement Learning from Human Feedback (RLHF). This step, used for example in chatbots like ChatGPT, involves training a reward model that quantifies how good different responses to a prompt are, and then using reinforcement learning to train the generative model against that reward model (see this article for more details). RLHF has helped produce more helpful and harmless models while reducing model size by orders of magnitude.

Recognizing that the RLHF step could be made programmatic by using feedback from AI instead of humans to create the reward model based on certain principles, we are also seeing Reinforcement Learning with AI Feedback (RLAIF) gain adoption (e.g., see Constitutional AI from Anthropic, which helps improve the alignment and harmlessness of LLMs). Currently, tooling for these steps is created by LLM providers and companies such as ScaleAI.
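To make the reward-modeling step concrete, the following is a minimal sketch, in PyTorch, of the pairwise preference loss commonly used to train RLHF reward models. It illustrates the general technique, not any provider's actual implementation; the tensor values are toy assumptions.

    import torch
    import torch.nn.functional as F

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry-style pairwise loss: push the reward of the
        # human-preferred response above that of the rejected response.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy usage: scalar rewards for three preference pairs, as a reward
    # model might score them; the loss shrinks as chosen > rejected.
    r_chosen = torch.tensor([1.2, 0.4, 0.9])
    r_rejected = torch.tensor([0.3, 0.6, -0.1])
    loss = preference_loss(r_chosen, r_rejected)

The trained reward model then serves as the optimization target for the reinforcement-learning step that tunes the generative model.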





Data preparation
There are several flavors of data preparation tasks with associated tools:

LLM creators like OpenAI, Google, Meta, etc. prepare data for generative pre-training – often very large volumes of unlabeled data. Recently, we are also seeing more carefully curated data sets used to train LLMs that are orders of magnitude smaller than state-of-the-art (SOTA) models but competitive in certain tasks (e.g., see this paper from Microsoft Research and a related blog post from Intel).

LLMs are fine-tuned, often on private data held by enterprises and small businesses. These data sets could, for example, include a set of prompts and responses to fine-tune a chatbot for a domain-specific use case, such as customer service for ecommerce. Tools from companies like Snorkel AI are useful to prepare data for fine-tuning.

Some LLM applications, in particular retrieval-augmented generation (RAG) apps, involve augmenting LLMs with a knowledge base of documents that serves as a source of truth and can be queried. One example of this paradigm is Morgan Stanley's use of OpenAI models to create a chatbot for their financial advisers. Creating this knowledge base involves data preparation, with which tools like LlamaIndex and LangChain can help.
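As a concrete illustration of the knowledge-base preparation step, here is a minimal, library-free sketch of splitting a document into overlapping chunks before embedding. Tools like LlamaIndex and LangChain offer production-ready versions of this; the chunk and overlap sizes below are illustrative assumptions.

    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        # Overlapping chunks keep facts that straddle a chunk boundary
        # intact in at least one chunk.
        step = chunk_size - overlap
        return [text[i:i + chunk_size]
                for i in range(0, max(len(text) - overlap, 1), step)]

    # Toy usage: each chunk would next be embedded and stored in a vector DB.
    document = "Refunds are accepted within 30 days of purchase. " * 40  # stand-in text
    chunks = chunk_text(document)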





Vector DB index construction: RAGs, such as the Morgan Stanley wealth management chatbot, require the knowledge base of documents to be split up, converted into embeddings, and stored in a vector database, which is indexed to support querying. Vector databases, such as Pinecone, Weaviate, Chroma, Milvus, ElasticSearch, and Rockset, are thus seeing rapid adoption and becoming a key part of the LLMOps tech stack.
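To illustrate the embed-and-query pattern that these databases implement, here is a toy in-memory sketch using cosine similarity over NumPy vectors. The embed function is a hypothetical stand-in for a real embedding model; production vector databases add persistence and approximate-nearest-neighbor indexing on top of this idea.

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Hypothetical stand-in for a real embedding model: a fixed
        # random unit vector per text (stable within one process).
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(384)
        return v / np.linalg.norm(v)

    class ToyVectorIndex:
        def __init__(self) -> None:
            self.texts: list[str] = []
            self.vectors: list[np.ndarray] = []

        def add(self, text: str) -> None:
            self.texts.append(text)
            self.vectors.append(embed(text))

        def query(self, question: str, k: int = 3) -> list[str]:
            # Cosine similarity reduces to a dot product for unit vectors.
            scores = np.stack(self.vectors) @ embed(question)
            return [self.texts[i] for i in np.argsort(scores)[::-1][:k]]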
Model fine-tuning: LLMs are fine-tuned, often on private data held by enterprises and small businesses. These data sets could, for example, include a set of prompts and responses to fine-tune a chatbot for a domain-specific use case, such as customer service for ecommerce. LLM providers, such as OpenAI and Google, are increasingly making fine-tuning APIs available for their SOTA models. Fine-tuning APIs are also available for open-source LLMs hosted on services such as Amazon SageMaker JumpStart.

App Creation: LLMs are often connected to other tools such as vector databases, search indices, or other APIs. RAGs and Agents, mentioned above, are two popular classes of LLM applications. Building by chaining has emerged as a popular paradigm, with tools like LlamaIndex, LangChain, and Haystack seeing widespread developer adoption.

Prompt engineering, tracking, collaboration: The app developer creates prompts tailored for a specific use case. This process often involves experimentation: the developer creates a prompt, observes the results, and then iterates on the prompts to improve the effectiveness of the app. Tools such as TruLens and W&B Prompts help developers with this process by tracking prompts, responses, and intermediate results of apps and enabling this information to be shared across developer teams.

TruLens
TruLens is an open-source software tool that helps to objectively measure the quality and effectiveness of your LLM-based applications using feedback functions. Feedback functions help to programmatically evaluate the quality of inputs, outputs, and intermediate results, to expedite and scale up experiment evaluation. Learn more: TruLens.org
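A minimal sketch of the record-keeping behind such tools, assuming each prompt iteration is appended to a shared JSONL log; the field names are illustrative, not the schema of TruLens or W&B Prompts.

    import json
    import time

    def log_prompt_run(log_path: str, prompt_version: str, prompt: str,
                       response: str, scores: dict) -> None:
        # Append one experiment record so prompt iterations can be
        # compared and shared across the team.
        record = {
            "ts": time.time(),
            "prompt_version": prompt_version,
            "prompt": prompt,
            "response": response,
            "scores": scores,  # e.g., relevance or sentiment evaluations
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")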





Evaluation and debugging: Systematic evaluation and debugging of LLMs and LLM apps based on RAGs and Agents is absolutely essential before they are moved into production. A first step in this direction is to use human evaluations and benchmark datasets to evaluate LLMs (e.g., see the HELM paper from Stanford). While useful, these methods do not scale. Recent work has shown the power of programmatic evaluation methods for LLMs and LLM apps, e.g., see OpenAI Evals and TruLens. Evaluations help ensure that LLM apps are honest, helpful, and harmless. This includes evaluations to guard against hallucinations (in particular, groundedness checks), as well as checks for toxicity, bias, relevance, and a number of other considerations. A sketch of the programmatic-evaluation idea follows below.

Model deployment and inference: LLMs are deployed by, and made available via APIs from, the major cloud providers, including AWS, GCP, and Azure, as well as hosted by LLM providers such as OpenAI, Anthropic, Together.ai, and Cohere. Platform companies such as Databricks/MosaicML and Snowflake also offer model deployment services. Strong price-performance ROI can be achieved with the latest generation Intel Xeon CPUs (see the HuggingFace/Intel page, Intel Extension for Transformers, Intel Neural Compressor, and Q8-Chat LLM). The Intel Habana Gaudi processors offer additional performance, scaling, and developer ease-of-use benefits for low-latency inference, at higher cost than the Xeon CPUs (see, in particular, this article on the inference performance of the Llama 2 7B and Llama 2 13B models on a single Habana Gaudi2 device).
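Here is a minimal sketch of the programmatic-evaluation idea referenced above: a feedback function maps an (input, output) pair to a score in [0, 1], and a harness flags examples that fall below a threshold. The lexical-overlap scorer is a deliberately crude stand-in for an LLM-graded check such as those in OpenAI Evals or TruLens.

    from typing import Callable

    Feedback = Callable[[str, str], float]  # (input, output) -> score in [0, 1]

    def lexical_relevance(question: str, answer: str) -> float:
        # Crude stand-in for an LLM-graded relevance check: the fraction
        # of question words that reappear in the answer.
        q, a = set(question.lower().split()), set(answer.lower().split())
        return len(q & a) / max(len(q), 1)

    def evaluate(examples: list[tuple[str, str]],
                 feedbacks: dict[str, Feedback],
                 threshold: float = 0.5) -> list[str]:
        # Run every feedback function over every example; collect failures.
        failures = []
        for question, answer in examples:
            for name, feedback in feedbacks.items():
                score = feedback(question, answer)
                if score < threshold:
                    failures.append(f"{name}={score:.2f} on {question!r}")
        return failures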

Intel Disruptor Initiative
The Intel Disruptor Initiative helps to drive innovation for AI and data-centric use cases. Participants are companies that are pushing the limits of innovation. Companies in the ecosystem include TruEra, Databricks (MosaicML), Domino, Snowflake, DataRobot, Weaviate, ActiveLoop, Hugging Face, Roboflow, and Rockset. Learn more: intel.com/disruptor


App hosting: App hosting services, such as Vercel, are increasingly being used to deploy LLM apps faster.

Monitoring: Deployed LLMs and LLM apps need to be monitored for quality metrics (in line with the evaluation metrics described above) as well as cost, latency, and more. Feedback functions enable monitoring these metrics at scale in TruEra's AI Observability product. Several other companies, such as HoneyHive, also offer monitoring services.


[Product screenshot: the TruEra AI Observability dashboard monitoring feedback functions for an LLM app, e.g., Average Relevance and Average Groundedness for two app versions (Chain-1 and Chain-7) charted over time, alongside an alert feed listing feedback functions firing (relevance, prompt sentiment, model performance) with an alert history view.]
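A minimal sketch of the alerting logic behind such dashboards, assuming per-request feedback scores are already being computed; the metric name, threshold, and window size are illustrative, not TruEra's actual configuration.

    from collections import deque

    class MetricMonitor:
        # Track a rolling window of feedback scores and raise an alert
        # when the rolling average falls below a threshold.
        def __init__(self, name: str, threshold: float, window: int = 100) -> None:
            self.name, self.threshold = name, threshold
            self.scores: deque[float] = deque(maxlen=window)

        def record(self, score: float) -> None:
            self.scores.append(score)
            average = sum(self.scores) / len(self.scores)
            if average < self.threshold:
                print(f"ALERT: rolling average {self.name} = {average:.2f} "
                      f"is below threshold {self.threshold}")

    # Toy usage: feed in per-request relevance scores as they arrive.
    relevance_monitor = MetricMonitor("relevance", threshold=0.7)
    for score in (0.9, 0.8, 0.4, 0.3):
        relevance_monitor.record(score)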




LLM Observability with TruEra

TruEra has a unique, comprehensive approach to LLM Observability. TruEra spans the full app lifecycle, from evaluation and experiment tracking during development of LLM apps, to monitoring apps in production.

TruEra AI Observability
TruEra provides AI Observability software for traditional and generative Artificial Intelligence (AI) applications. It helps data science teams prevent AI failures and drive high-performing models by providing a single, cloud-first, comprehensive solution that combines monitoring, debugging, testing, explainability, and responsible AI capabilities. Learn more: TruEra.com.

[Figure 3 diagram: Evaluate (reliable, comprehensive, extensible evals to ensure the 3H – honest, helpful, harmless) → Track (experiment tracking to select the best app config) → Monitor (highly scalable, cost effective, low latency), connected by a feedback loop for debugging, with broad app support and easy integrations with the emerging tech stack.]

Figure 3: TruEra AI Observability spans the full lifecycle: from evaluation and experiment tracking during development to monitoring apps in production. This ensures both speed and effectiveness.





Evaluation

Observability begins when you start prototyping an app. Evaluation during development is standard practice for machine learning practitioners: it provides a useful step for developers to iterate and make improvements aimed at achieving the app's goals. However, for LLMs, this evaluation step is particularly challenging: there are very few standard metrics and tests that provide a view of the quality of the models and the apps that they power. TruEra's feedback functions provide a useful, extensible framework for evaluating LLM apps. Feedback function evaluations include out-of-the-box and custom tests (specialized to the user's data) to ensure that apps are honest, helpful, and harmless.

For example, in order to guard against hallucinations in Retrieval Augmented Generation (RAG) apps, the RAG Triad provides tests to check that the retrieved context is relevant to the query that was asked, that the final response is grounded in the retrieved context, and that the final response is safe and relevant to the query that was asked (see this blog post and this video for more details).

[Figure 4 diagram: Query, Context, and Response connected by three checks – Answer Relevance ("Is the answer relevant to the query?"), Context Relevance ("Is the context relevant to the query?"), and Groundedness ("Is the response supported by the context?").]

Figure 4: The RAG Triad tests check that retrieved context is relevant to the query that was asked, that the final response is grounded in the retrieved context, and that the final response is safe and relevant.
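To make the RAG Triad concrete, here is a hedged sketch of how the three checks of Figure 4 compose. The judge helper is hypothetical and stubbed here (it is not TruLens's actual API); in practice it would ask an LLM to grade the payload and parse a 0-to-1 score.

    def judge(instruction: str, payload: str) -> float:
        # Hypothetical scorer: send `instruction` and `payload` to an LLM
        # and parse a score in [0, 1]. Stubbed; wire up to any LLM API.
        raise NotImplementedError

    def rag_triad(query: str, context: str, response: str) -> dict[str, float]:
        # The three RAG Triad checks from Figure 4.
        return {
            "context_relevance": judge(
                "Is the context relevant to the query? Score 0-1.",
                f"QUERY: {query}\nCONTEXT: {context}"),
            "groundedness": judge(
                "Is the response supported by the context? Score 0-1.",
                f"CONTEXT: {context}\nRESPONSE: {response}"),
            "answer_relevance": judge(
                "Is the answer relevant to the query? Score 0-1.",
                f"QUERY: {query}\nRESPONSE: {response}"),
        }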





Tracking

Evaluating apps is effective when coupled with the ability to quickly iterate on ideas and test their impact. With TruEra, evaluations and tests of different app versions can be tracked on a leaderboard where users can clearly understand the tradeoffs between the cost, latency, and performance of their LLM app versions.

For example, this kind of experiment tracking functionality is useful in selecting the best configuration when setting up an LLM and vector database for a RAG app. Choices that developers have to make include chunk size, index distance metric, amount of context retrieved, retrieval algorithm, LLM, and hyperparameters. As these choices are varied, one can use TruEra to track their impact on app metrics, such as the RAG Triad, and pick the best configuration for the app; see this blog post for details and a sample notebook with a workflow for evaluating and tracking Pinecone vector database choices with TruLens. A sketch of such a configuration sweep appears below.

Monitoring

Finally, when apps are deployed into the real world, monitoring on an ongoing basis provides a view of any emerging failure modes. It is hard to anticipate how an app will be used in the wild, so ongoing evaluation provides clear guardrails and early warning systems for understanding when things go wrong. TruEra provides the evaluations, alerting, and feedback loop that help developers debug and improve their applications. One key challenge addressed by TruEra LLM monitoring is scaling up evaluations while keeping cost and latency low; see this blog post for more details.

[Chart: Average Relevance over time for two app versions, Chain-1 and Chain-7.]
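Here is a minimal sketch of the configuration sweep described under Tracking, assuming a hypothetical build_and_eval function that constructs a RAG app for one configuration and returns its average metrics; the search space and metric names are illustrative.

    import itertools

    # Hypothetical search space over RAG configuration choices.
    search_space = {
        "chunk_size": [256, 512, 1024],
        "distance_metric": ["cosine", "dot_product"],
        "top_k": [3, 5],
    }

    def build_and_eval(config: dict) -> dict:
        # Hypothetical: build the RAG app with `config`, run it over an
        # eval set, and return average metrics. Dummy values stand in for
        # real RAG Triad, cost, and latency measurements.
        return {"groundedness": 0.0, "cost_usd": 0.0, "latency_s": 0.0}

    # Evaluate every configuration and rank by a metric of interest.
    leaderboard = []
    for values in itertools.product(*search_space.values()):
        config = dict(zip(search_space.keys(), values))
        leaderboard.append((config, build_and_eval(config)))
    leaderboard.sort(key=lambda item: item[1]["groundedness"], reverse=True)
    best_config = leaderboard[0][0]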




The Road Ahead for GenAI

The tech stack for LLMs has taken shape rapidly over the past year and is gaining adoption and delivering value at an amazingly fast pace. Over the course of the next year or two, we anticipate several significant advances in GenAI technologies to take root.

- Large Multi-modal Models (LMMs) that combine multiple modalities, e.g., text and image, represent a significant opportunity. Recent examples include Flamingo, GPT-4V, and LLaVA.
- Frameworks for building agent-based applications will mature and see widespread adoption in consumer-facing personal assistants, business apps for search, and more.
- Technical advances in bringing down the cost of model training and inference will continue, in particular, advances addressing bottlenecks in the dominant transformer architecture (e.g., see this paper) and hardware advances (e.g., Habana).
- Observability will become an ingrained part of the tech stack, ensuring honest, helpful, and harmless GenAI models and applications.

Where to get more information

Intel Whitepaper: Accelerate Artificial Intelligence (AI) Workloads with Intel Advanced Matrix Extensions (Intel AMX)
Intel Web Pages: Intel Xeon Scalable Processors; Intel Habana Gaudi; Intel Developer Cloud
TruEra Whitepaper: Full Lifecycle AI Observability
TruEra Web Page: LLM Observability with TruEra

© 2023 All Rights Reserved TruEra | www.truera.com
