LlamaIndex Talk (W&B Fully Connected 2024)


Beyond RAG: Building Advanced Context-Augmented LLM Applications
Jerry Liu, LlamaIndex co-founder/CEO
LlamaIndex: Context Augmentation for your LLM app
RAG

[Diagram: Data Parsing & Ingestion (Data Parsing + Ingestion → Data Index) feeds Data Querying (Retrieval → LLM + Prompts → Response)]
Naive RAG

[Diagram: naive RAG pipeline: PyPDF parsing → sentence splitting (chunk size 256) → data index → dense retrieval (top-k = 5) → LLM response with a simple QA prompt]
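For reference, a minimal sketch of that naive pipeline in LlamaIndex (module paths assume the v0.10-style llama_index.core layout and a default OpenAI model; adjust for your installed version):

```python
# Naive RAG sketch: PyPDF parsing, sentence splitting at chunk size 256,
# dense retrieval with top-k = 5, and the default simple QA prompt.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Data parsing + ingestion (the directory reader uses a PDF reader for .pdf files)
documents = SimpleDirectoryReader("./data").load_data()

# Chunking + indexing
splitter = SentenceSplitter(chunk_size=256)
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])

# Retrieval + response synthesis
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What are the main risk factors for Tesla?"))
```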
Naive RAG is Limited
RAG Prototypes are Limited
Naive RAG approaches tend to work well for simple questions over a simple,
small set of documents.
● “What are the main risk factors for Tesla?” (over Tesla 2021 10K)
● “What did the author do during his time at YC?” (Paul Graham essay)
Pain Points
There are certain questions we want to ask where naive RAG will fail.

Examples:
● Summarization Questions: “Give me a summary of the entire <company> 10K annual report”
● Comparison Questions: “Compare the open-source contributions of candidate A and candidate B”
● Structured Analytics + Semantic Search: “Tell me about the risk factors of the highest-performing rideshare company in the US”
● General Multi-part Questions: “Tell me about the pro-X arguments in article A, and tell me about the pro-Y arguments in article B, make a table based on our internal style guide, then generate your own conclusion based on these facts.”
Can we do more?
In the naive setting, RAG is boring.

🚫 It’s just a glorified search system

🚫 There are many questions/tasks that naive RAG can’t answer.

💡 Can we go beyond simple search/QA to building a general context-augmented research assistant?
Beyond RAG: Adding Layers of Agentic Reasoning
From RAG to Agents

[Diagram: Query → RAG → Response]

● Single-shot
● No query understanding/planning
● No tool use
● No reflection, error correction
● No memory (stateless)
From RAG to Agents
✅ Multi-turn
✅ Query / task planning layer
✅ Tool interface for external environment
✅ Reflection
✅ Memory for personalization

[Diagram: Query → Agent → Tools (including RAG) → Response]
From Simple to Advanced Agents

[Diagram: spectrum from agent ingredients to full agents]
● Agent Ingredients (simple): Routing, One-Shot Query Planning, Conversation Memory, Tool Use
● Full Agents (advanced): ReAct, Dynamic Planning + Execution

Simple → Advanced: Lower Cost → Higher Cost, Lower Latency → Higher Latency
Routing
The simplest form of agentic reasoning: given a user query and a set of choices, output the subset of choices to route the query to.
Routing
Use Case: Joint QA and Summarization

Guide
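A minimal sketch of this joint QA/summarization use case with LlamaIndex’s RouterQueryEngine (the tool descriptions are illustrative, and `documents` is assumed to be loaded already):

```python
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.tools import QueryEngineTool

# Two query engines over the same documents: one for summarization,
# one for pointed QA.
summary_tool = QueryEngineTool.from_defaults(
    query_engine=SummaryIndex.from_documents(documents).as_query_engine(),
    description="Useful for summarization questions over the documents.",
)
qa_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(documents).as_query_engine(),
    description="Useful for specific questions over the documents.",
)

# An LLM selector reads the tool descriptions and routes each query.
router = RouterQueryEngine.from_defaults(query_engine_tools=[summary_tool, qa_tool])
print(router.query("Give me a summary of the entire report"))
```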
Conversation Memory
In addition to the current query, take the conversation history into account as input to your RAG pipeline.
Conversation Memory
How to account for conversation history in a RAG pipeline?
● Condense question
● Condense question + context
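A minimal sketch of the condense-question approach, assuming an existing `index` (LlamaIndex ships both a `condense_question` and a `condense_plus_context` chat mode):

```python
# Condense-question chat: each turn, the LLM rewrites the new user message
# plus the chat history into a standalone question, which is then sent
# through the RAG pipeline.
chat_engine = index.as_chat_engine(chat_mode="condense_question")

print(chat_engine.chat("Compare revenue growth of Uber and Lyft in 2021"))
# The follow-up depends on history; the condensed question makes it standalone.
print(chat_engine.chat("What about 2020?"))
```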
Query Planning
Break down a query into parallelizable sub-queries. Each sub-query can be executed against any set of RAG pipelines.

Example: “Compare revenue growth of Uber and Lyft in 2021” decomposes into “Describe revenue growth of Uber in 2021” and “Describe revenue growth of Lyft in 2021”; each sub-query runs top-2 retrieval against the Uber 10-K and Lyft 10-K indexes, respectively.

[Diagram: query → sub-queries → top-2 chunks from the Uber 10-K (chunks 4, 8) and the Lyft 10-K (chunks 4, 8)]

Query Planning Guide
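A minimal sketch with LlamaIndex’s SubQuestionQueryEngine, assuming `uber_index` and `lyft_index` are already built over the respective 10-K filings:

```python
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

tools = [
    QueryEngineTool.from_defaults(
        query_engine=uber_index.as_query_engine(similarity_top_k=2),
        name="uber_10k",
        description="Provides information about Uber financials for 2021.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=lyft_index.as_query_engine(similarity_top_k=2),
        name="lyft_10k",
        description="Provides information about Lyft financials for 2021.",
    ),
]

# The engine decomposes the question into sub-queries, runs each against the
# matching tool, then synthesizes a final comparison.
engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
print(engine.query("Compare revenue growth of Uber and Lyft in 2021"))
```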


Tool Use
Use an LLM to call an API. Infer the parameters of that API.
Tool Use
In normal RAG you just pass the query through. But what if you used the LLM to infer all the parameters for the API interface?

This is a key capability in many QA use cases (auto-retrieval, text-to-SQL, and more).
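A minimal sketch of LLM-inferred tool parameters using LlamaIndex’s FunctionTool with a function-calling LLM; the `search_filings` API is a made-up placeholder, and `predict_and_call` availability may vary by version:

```python
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

# A hypothetical API endpoint wrapped as a tool. The LLM decides to call it
# and infers the values of its parameters from the user query.
def search_filings(company: str, year: int, topic: str) -> str:
    """Search SEC filings for a given company, year, and topic."""
    return f"(stub) results for {company} {year}: {topic}"

tool = FunctionTool.from_defaults(fn=search_filings)

llm = OpenAI(model="gpt-4o")
# The LLM fills in the tool's arguments, then the tool is executed.
response = llm.predict_and_call([tool], "What were Tesla's 2021 risk factors?")
print(response)
```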
Let’s put them together
● All of these are agent ingredients
● Let’s put them together for a full agent system
○ Query planning
○ Memory
○ Tool Use
● Let’s add additional components
○ Reflection
○ Controllability
○ Observability
Core Components of a Full Agent

Minimum necessary ingredients:
● Query planning
● Memory
● Tool Use
ReAct: Reasoning + Acting with LLMs

Source: https://react-lm.github.io/
ReAct: Reasoning + Acting with LLMs

Query Planning: Generate next step given previous steps (chain-of-thought prompt).

Tool Use: Sequential tool calling.

Memory: Maintain simple buffer.
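A minimal sketch of a ReAct agent over RAG tools in LlamaIndex, reusing the `tools` list from the query-planning sketch above:

```python
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

# The agent loops: chain-of-thought reasoning step -> tool call -> observation,
# keeping the conversation in a simple chat memory buffer.
agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o"), verbose=True)
print(agent.chat("Compare revenue growth of Uber and Lyft in 2021"))
```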
ReAct: Reasoning + Acting with LLMs

ReAct + RAG Guide


Can we make this even better?
● Stop being so short-sighted - plan ahead at each step
● Parallelize execution where we can
LLMCompiler (Kim et al. 2023)

An agent compiler for parallel multi-function planning + execution.
LLMCompiler

Query Planning: Generate a DAG of steps. Replan if steps don’t reach desired state.

Tool Use: Parallel function calling.

Memory: Maintain simple buffer.

LLMCompiler Agent
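The core win is that plan steps with no dependency edge between them can execute concurrently. A framework-free sketch of that idea using asyncio (the tool function is a placeholder, not the LLMCompiler API):

```python
import asyncio

# Two plan steps with no dependency edge run in parallel; the final step
# consumes both results (a small DAG joined at the end).
async def search_10k(company: str, year: int) -> str:
    await asyncio.sleep(1.0)  # stand-in for a retrieval / API call
    return f"{company} revenue growth figures for {year}"

async def main() -> None:
    uber, lyft = await asyncio.gather(
        search_10k("Uber", 2021),
        search_10k("Lyft", 2021),
    )
    # The join step would be another LLM call comparing the two results.
    print("compare:", uber, "vs", lyft)

asyncio.run(main())
```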
Tree-based Planning
● Tree of Thoughts (Yao et al. 2023)
● Reasoning via Planning (Hao et al. 2023)
● Language Agent Tree Search (Zhou et al. 2023)
Tree-based Planning
Query planning in the face of uncertainty: instead of planning out a fixed sequence of steps, sample a few different states.

Run Monte Carlo Tree Search (MCTS) to balance exploration vs. exploitation.
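For intuition, MCTS typically balances the two with a UCT-style score when selecting which child state to expand next; a minimal sketch of the selection rule:

```python
import math

def uct_score(child_value_sum: float, child_visits: int,
              parent_visits: int, c: float = 1.0) -> float:
    """UCT: average value (exploitation) plus a bonus for rarely
    visited children (exploration)."""
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore
```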
Self-Reflection

Use feedback to improve agent execution and reduce errors.

Human feedback
🤖 LLM feedback

Use few-shot examples instead of retraining the model!

Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
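A framework-free sketch of the loop in the Reflexion spirit, assuming a generic `llm` callable and an `evaluate` judge (human or LLM): generate, collect verbal feedback, and retry with the feedback in context rather than retraining.

```python
def run_with_reflection(llm, task: str, evaluate, max_tries: int = 3) -> str:
    """Generate an answer, critique it, and retry with the critique in
    context -- verbal feedback instead of gradient updates."""
    answer, feedback = "", ""
    for _ in range(max_tries):
        answer = llm(f"Task: {task}\nPast feedback: {feedback}\nAnswer:")
        ok, critique = evaluate(answer)  # human or LLM judge
        if ok:
            return answer
        feedback += f"\n- {critique}"
    return answer
```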
Additional Requirements
● Observability: see the full trace of the agent
○ Observability Guide
● Control: Be able to guide the intermediate steps of an agent step-by-step
○ Lower-Level Agent API
● Customizability: Define your own agentic logic around any set of tools.
○ Custom Agent Guide
○ Custom Agent with Query Pipeline Guide
● Multi-agents: Define multi-agent interactions!
○ Synchronously: Define an explicit flow between agents
○ Asynchronously: Treat each agent as a microservice; agents communicate with each other.
■ Upcoming in LlamaIndex!
○ Current Frameworks: Autogen, CrewAI
LlamaIndex + W&B
Tracing and observability are essential developer tools for RAG/agent development.

We have first-class integrations with Weights & Biases.

Guide
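A minimal sketch using LlamaIndex’s one-click observability hook for the W&B integration (assumes the llama-index-callbacks-wandb package is installed; callback integrations move between versions, so check the current docs):

```python
from llama_index.core import set_global_handler

# Route all LlamaIndex traces (retrieval, LLM calls, agent steps) to W&B.
set_global_handler("wandb", run_args={"project": "llamaindex-rag"})

# Any subsequent query/agent run is traced automatically, e.g.:
# response = query_engine.query("What are the main risk factors for Tesla?")
```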
