LLMOps Explained
How to build an effective, performant LLMOps tech stack

Authors: Anupam Datta and Shayak Sen, TruEra; Arijit Bandyopadhyay, Intel Corporation
WHITEPAPER
Table of Contents
The Rise of Foundation Models and the LLMOps Stack
The Emerging LLMOps Tech Stack
LLM Observability with TruEra
    Evaluation
    Tracking
    Monitoring
The Road Ahead for GenAI
The Emerging LLMOps Tech Stack
[Figure 2 shows the typical workflow for building, deploying, and maintaining LLMs and LLM apps, running on underlying compute and storage layers: 1. Foundation Model Training; 2. Data Preparation; 3. Vector Database Index Construction; 4. Model Fine-Tuning; 5. App Creation; 6. Prompt Engineering, Tracking, Collaboration; 7. Evaluation and Debugging; 8. Deployment and Inference; 9. App Hosting; 10. Monitoring.]
Figure 2: A representative workflow for building an LLM application, with some representative vendors providing technical support at each step.
Foundation model training: Generative pre-training on very large text corpora has enabled language models to be really good at producing human-like text. Over the past few years, LLMs have grown in scale from the 1-2 billion parameter BERT models to 10s to 100s of billions of parameters with the GPT series from OpenAI, PaLM from Google, Llama from Meta, and models from HuggingFace, Databricks, Anthropic, and others. A further training step is Reinforcement Learning via Human Feedback (RLHF). This step, used for example in chatbots like ChatGPT, involves training a reward model, which quantifies how good different responses to a prompt are, and then uses reinforcement learning to train the generative model with that reward model (see this article for more details). RLHF has helped produce more helpful and harmless models while reducing model size by orders of magnitude.
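To make the reward-modeling step concrete, the following toy sketch (not from this whitepaper) shows the pairwise preference loss popularized by InstructGPT: the reward model learns to score a human-preferred ("chosen") response above a rejected one. The tiny bag-of-tokens model, vocabulary size, and dummy batch are illustrative stand-ins for a real transformer with a scalar reward head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Illustrative stand-in for a transformer encoder with a scalar reward head."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # bag-of-tokens "encoder"
        self.head = nn.Linear(dim, 1)                  # scalar reward per sequence

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        return self.head(self.embed(token_ids)).squeeze(-1)

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch of token ids for (prompt + chosen) and (prompt + rejected) pairs.
chosen = torch.randint(0, 1000, (8, 32))
rejected = torch.randint(0, 1000, (8, 32))

# Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
optimizer.step()
```

The trained reward model then serves as the scoring signal for a reinforcement learning step (commonly PPO) that updates the generative model.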
Data preparation
There are several flavors of data preparation tasks with associated tools:
LLM creators like OpenAI, Google, and Meta prepare data for generative pre-training, often very large volumes of unlabeled data. Recently, we are also seeing more carefully curated, smaller data sets used to train LLMs that are orders of magnitude smaller than state-of-the-art (SOTA) models but competitive in certain tasks (e.g., see this paper from Microsoft Research and a related blog post from Intel).

LLMs are fine-tuned, often on private data held by enterprises and small businesses. These data sets could, for example, include a set of prompts and responses to fine-tune a chatbot for a domain-specific use case, such as customer service for ecommerce. Tools from companies like Snorkel AI are useful to prepare data for fine-tuning.

LLM applications, in particular retrieval augmented generation (RAG) apps, involve augmenting LLMs with a knowledge base of documents that serves as a source of truth and can be queried. One example of this paradigm is Morgan Stanley's use of OpenAI models to create a chatbot for their financial advisers. Creating this knowledge base involves data preparation, with which tools like LlamaIndex and LangChain can help, as sketched below.
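As a minimal sketch of this knowledge-base preparation step, assuming the llama_index package (circa-2023 import paths, which differ in later releases), an OpenAI API key in the environment, and a placeholder data/ folder of documents:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # parse the source documents
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, and index them

query_engine = index.as_query_engine()                 # retrieval + answer synthesis
print(query_engine.query("What is our refund policy?"))  # illustrative query
```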
Vector DB index construction: RAGs, such as the Morgan Stanley wealth management chatbot, require the knowledge base of documents to be split up, converted into embeddings, and stored in a vector database, which is indexed to support querying. Vector databases, such as Pinecone, Weaviate, Chroma, Milvus, Elasticsearch, and Rockset, are thus seeing rapid adoption and becoming a key part of the LLMOps tech stack.
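A minimal sketch of the split/embed/store loop using Chroma, one of the vector databases named above (chromadb package); the naive fixed-size chunking, sample documents, and collection name are illustrative, and a production deployment would use a persistent client and an explicit embedding model:

```python
import chromadb

client = chromadb.Client()  # in-memory instance for illustration
collection = client.create_collection("knowledge_base")

docs = ["Our refund window is 30 days from delivery.", "Support is available 24/7 via chat."]
chunks = [d[i:i + 200] for d in docs for i in range(0, len(d), 200)]  # naive chunking

# Chroma embeds with its default embedding function when none is supplied.
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# Querying embeds the question and returns the nearest chunks from the index.
results = collection.query(query_texts=["How long do refunds take?"], n_results=2)
print(results["documents"])
```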
Model fine-tuning: LLMs are fine-tuned, often on private data held by enterprises and small businesses. These data sets could, for example, include a set of prompts and responses to fine-tune a chatbot for a domain-specific use case, such as customer service for ecommerce. LLM providers, such as OpenAI and Google, are increasingly making fine-tuning APIs available for their SOTA models. Fine-tuning APIs are also available for open source LLMs hosted on services such as Amazon SageMaker JumpStart.

App Creation: LLMs are often connected to other tools such as vector databases, search indices, or other APIs. RAGs and Agents, mentioned above, are two popular classes of LLM applications. Building by chaining has emerged as a popular paradigm, with tools like LlamaIndex, LangChain, and Haystack seeing widespread developer adoption.

Prompt-engineering, tracking, collaboration: The app developer creates prompts tailored for a specific use case. This process often involves experimentation: the developer creates a prompt, observes the results, and then iterates on the prompts to improve the effectiveness of the app. Tools such as TruLens and W&B Prompts help developers with this process by tracking prompts, responses, and intermediate results of apps, and by enabling this information to be shared across developer teams; a minimal tracking sketch follows the TruLens note below.

TruLens
TruLens is an open-source software tool that helps to objectively measure the quality and effectiveness of your LLM-based applications using feedback functions. Feedback functions help to programmatically evaluate the quality of inputs, outputs, and intermediate results, to expedite and scale up experiment evaluation. Learn more: TruLens.org
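Here is the tracking sketch promised above, using the open-source trulens_eval package (circa-2023 APIs, which vary across versions); it assumes a LangChain app object named chain already exists, and the app_id is illustrative:

```python
from trulens_eval import Tru, TruChain

tru = Tru()  # local workspace that stores records and evaluation results

# Wrap the app so each call logs prompts, responses, and intermediate steps.
recorder = TruChain(chain, app_id="support-bot-v1")
with recorder as recording:
    chain("How do I reset my password?")  # call style depends on your LangChain version

tru.run_dashboard()  # browse tracked records in the local dashboard
```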
Evaluation and debugging: Systematic evaluation and debugging of LLMs and LLM apps based on RAGs and Agents is absolutely essential before they are moved into production. A first step in this direction is to use human evaluations and benchmark datasets to evaluate LLMs (e.g., see the HELM paper from Stanford). While useful, these methods do not scale. Recent work has shown the power of programmatic evaluation methods for LLMs and LLM apps, e.g., see OpenAI Evals and TruLens. Evaluations help ensure that LLM apps are honest, helpful, and harmless. This includes evaluations to guard against hallucinations (in particular, groundedness), as well as toxicity, bias, relevance, and a number of other considerations.

Model deployment and inference: LLMs are deployed by, and made available via APIs from, the major cloud providers, including AWS, GCP, and Azure, as well as hosted by LLM providers, such as OpenAI, Anthropic, Together.ai, and Cohere. Platform companies, such as Databricks/MosaicML and Snowflake, also offer model deployment services. Strong price-performance ROI can be achieved with the latest-generation Intel Xeon CPUs (see the HuggingFace/Intel page, Intel Extension for Transformers, Intel Neural Compressor, and Q8-Chat LLM). The Intel Habana Gaudi processors offer additional performance, scaling, and developer ease-of-use benefits for inference at low latency with higher costs than the Xeon CPUs (see, in particular, this article for the inference performance of the Llama 2 7B and Llama 2 13B models, respectively, on a single Habana Gaudi2 device).
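To illustrate the CPU inference path, here is a hedged sketch of bfloat16 generation on Xeon with Intel Extension for PyTorch; the model name, prompt, and generation settings are placeholders, exact APIs vary across ipex and transformers releases, and AMX acceleration requires a 4th-generation Xeon or later:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).eval()

model = ipex.optimize(model, dtype=torch.bfloat16)  # operator fusion, AMX-friendly kernels

inputs = tokenizer("LLMOps is", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```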
[Figure 3: TruEra LLM monitoring dashboard, showing alerts firing on model performance and time-series panels tracking metrics such as Average Relevance, Prompt Sentiment, and Average Groundedness for app versions such as Chain-7.]
LLM Observability with TruEra
Evaluation
Observability begins when you start prototyping an app. Evaluation during development is standard practice for machine learning practitioners. Evaluation provides a useful step for developers to iterate and make improvements aimed at achieving the app's goals. However, for LLMs, this evaluation step is particularly challenging: there are very few standard metrics and tests that provide a view of the quality of the models and the apps that they power. TruEra's feedback functions provide a useful, extensible framework for evaluating LLM apps. Feedback function evaluations include out-of-the-box and custom tests (specialized to the user's data) to ensure that apps are honest, helpful, and harmless.

For example, in order to guard against hallucinations in Retrieval Augmented Generation (RAG) apps, the RAG Triad provides tests to check that the retrieved context is relevant to the query that was asked, that the final response is grounded in the retrieved context, and that the final response is safe and relevant to the query that was asked (see this blog post and this video for more details).

[Figure 4 diagram: Query, Response, and Context, connected by the triad of checks, including Groundedness: is the response supported by the context?]
Figure 4: The RAG Triad tests check that the retrieved context is relevant to the query that was asked, that the final response is grounded in the retrieved context, and that the final response is safe and relevant.
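The RAG Triad can be expressed directly as TruLens feedback functions. The sketch below follows the circa-2023 trulens_eval quickstarts (APIs vary across versions), and the selector path for the retrieved context depends on how the app is instrumented, so the one shown is illustrative:

```python
import numpy as np
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI as OpenAIProvider

provider = OpenAIProvider()
# Illustrative selector: the record location where the app logs retrieved context.
context = Select.RecordCalls.retriever.get_context.rets

# 1. Context relevance: is each retrieved chunk relevant to the query?
f_context_relevance = Feedback(provider.qs_relevance).on_input().on(context).aggregate(np.mean)

# 2. Groundedness: is the final response supported by the retrieved context?
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(context.collect())
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# 3. Answer relevance: is the final response relevant to the original query?
f_answer_relevance = Feedback(provider.relevance).on_input_output()
```

Each feedback function can then be attached to a recorder (such as TruChain) so that every app call is scored automatically.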
Tracking
Evaluating apps is effective when coupled with the ability to quickly iterate on ideas to test out their impact. With TruEra, evaluations and tests on different app versions can be tracked on a leaderboard where users can clearly understand the tradeoffs between cost, latency, and performance of their LLM app versions.

For example, this kind of experiment tracking functionality is useful in selecting the best configuration while setting up an LLM and vector database with a RAG app. Examples of choices that developers have to make include chunk size, index distance metric, amount of context retrieved, retrieval algorithm choice, LLM choice, and hyperparameters. As these choices are varied, one can use TruEra to track their impact on app metrics, such as the RAG Triad, and pick the best configuration for the app; see this blog post for details and a sample notebook with a workflow for evaluating and tracking Pinecone vector database choices with TruLens. A minimal leaderboard sketch follows the Monitoring section below.

[Figure: TruEra leaderboard comparing Average Relevance across app versions Chain-1 and Chain-7.]

Monitoring
Finally, when apps are deployed into the real world, monitoring on an ongoing basis provides a view of any emerging failure modes. It's hard to anticipate how an app will be used in the wild, so ongoing evaluation provides clear guardrails and early warning systems to understand when things go wrong. TruEra helps to provide the evaluations, alerting, and feedback loop that help developers debug and improve their applications. One key challenge addressed by TruEra LLM monitoring is scaling up evaluations while keeping cost and latency low; see this blog post for more details.
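And the leaderboard sketch referenced above, again assuming trulens_eval with app versions previously recorded under illustrative app ids:

```python
from trulens_eval import Tru

tru = Tru()
# Aggregated feedback scores, latency, and cost for each tracked app version.
print(tru.get_leaderboard(app_ids=["Chain-1", "Chain-7"]))
```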
The Road Ahead for GenAI
The tech stack for LLMs has taken shape rapidly over the past year and is gaining adoption and delivering value at an amazingly fast pace. Over the course of the next year or two, we anticipate several significant advances in GenAI technologies to take root.

Large Multi-modal Models (LMMs) that combine multiple modalities, e.g., text and image, represent a significant opportunity. Recent examples include Flamingo, GPT-4V, and LLaVA.

Frameworks for building agent-based applications will mature and see widespread adoption in consumer-facing personal assistants and business apps.

Technical advances in bringing down the cost of model training and inference will continue, in particular, addressing bottlenecks with the dominant transformer architecture (e.g., see this paper) and hardware advances (e.g., Habana).

Observability will become an ingrained part of the tech stack to ensure honest, helpful, and harmless GenAI models and applications.

Where to get more information
Intel Whitepaper: Accelerate Artificial Intelligence (AI) Workloads with Intel Advanced Matrix Extensions (Intel AMX)
Intel Web Pages: Intel Xeon Scalable Processors; Intel Habana Gaudi
TruEra Whitepaper: Full Lifecycle AI Observability
TruEra Web Page: LLM Observability with TruEra