Professional Documents
Culture Documents
Retrieval-Augmented Generation For Large Language Models A Survey
Retrieval-Augmented Generation For Large Language Models A Survey
Yunfan Gao 1 , Yun Xiong 2 , Xinyu Gao 2 , Kangxiang Jia 2 , Jinliu Pan 2 , Yuxi Bi 3 , Yi
Dai1 , Jiawei Sun1 , Qianyu Guo4 , Meng Wang 3 and Haofen Wang 1,3 ∗
1
Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University
2
Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
3
College of Design and Innovation, Tongji University
4
School of Computer Science, Fudan University
trajectory, propelling LLMs into the forefront. The com- vide a comprehensive overview and analysis of existing RAG
munity’s focal point shifted towards harnessing the capabil- technologies and offer conclusions and prospects for future
ities of LLMs to attain heightened controllability and ad- development methods. This survey intends to furnish readers
dress evolving requirements. Consequently, the lion’s share and practitioners with a thorough and systematic comprehen-
of RAG endeavors concentrated on inference, with a minor- sion of large models and RAG, elucidate the progression and
ity dedicated to fine-tuning processes. As LLM capabili- key technologies of retrieval augmentation, clarify the merits
ties continued to advance, especially with the introduction of and limitations of various technologies along with their suit-
GPT-4, the landscape of RAG technology underwent a sig- able contexts, and forecast potential future developments.
nificant transformation. The emphasis evolved into a hybrid Our contributions are as follows:
approach, combining the strengths of RAG and fine-tuning, • We present a thorough and systematic review of the
alongside a dedicated minority continuing the focus on opti- state-of-the-art RAG, delineating its evolution through
mizing pre-training methodologies. paradigms including naive RAG, advanced RAG, and
Despite the rapid growth of RAG research, there has been modular RAG. This review contextualizes the broader
a lack of systematic consolidation and abstraction in the field, scope of RAG research within the landscape of LLMs.
which poses challenges in understanding the comprehensive • We identify and discuss the central technologies integral
landscape of RAG advancements. This survey aims to out- to the RAG process, specifically focusing on the aspects
line the entire RAG process and encompass the current and of “Retrieval”, “Generator” and “Augmentation”, and
future directions of RAG research, by providing a thorough delve into their synergies, elucidating how these com-
examination of retrieval augmentation in LLMs. ponents intricately collaborate to form a cohesive and
Therefore, this paper aims to comprehensively summarize effective RAG framework.
and organize the technical principles, developmental history, • We construct a thorough evaluation framework for RAG,
content, and, in particular, the relevant methods and applica- outlining the evaluation objectives and metrics. Our
tions after the emergence of LLMs, as well as the evaluation comparative analysis clarifies the strengths and weak-
methods and application scenarios of RAG. It seeks to pro- nesses of RAG compared to fine-tuning from various
perspectives. Additionally, we anticipate future direc- pertinent information via search algorithms. This information
tions for RAG, emphasizing potential enhancements to is then woven into the LLM’s prompts, providing additional
tackle current challenges, expansions into multi-modal context for the generation process. RAG’s key advantage lies
settings, and the development of its ecosystem. in its obviation of the need for retraining of LLMs for task-
The paper unfolds as follows: Section 2 and 3 define RAG specific applications. Developers can instead append an ex-
and detail its developmental process. Section 4 through 6 ex- ternal knowledge repository, enriching the input and thereby
plore core components—Retrieval, “Generation” and “Aug- refining the model’s output precision. RAG has become one
mentation”—highlighting diverse embedded technologies. of the most popular architectures in LLMs’ systems, due to
Section 7 focuses on RAG’s evaluation system. Section 8 its high practicality and low barrier to entry, with many con-
compare RAG with other LLM optimization methods and versational products being built almost entirely on RAG.
suggest potential directions for its evolution. The paper con- The RAG workflow comprises three key steps. First, the
cludes in Section 9. corpus is partitioned into discrete chunks, upon which vec-
tor indices are constructed utilizing an encoder model. Sec-
ond, RAG identifies and retrieves chunks based on their vec-
2 Definition tor similarity to the query and indexed chunks. Finally, the
The definition of RAG can be summarized from its workflow. model synthesizes a response conditioned on the contextual
Figure 2 depicts a typical RAG application workflow. In this information gleaned from the retrieved chunks. These steps
scenario, a user inquires ChatGPT about a recent high-profile form the fundamental framework of the RAG process, under-
event (i.e., the abrupt dismissal and reinstatement of Ope- pinning its information retrieval and context-aware genera-
nAI’s CEO) which generated considerable public discourse. tion capabilities. Next, we will provide an introduction to the
ChatGPT as the most renowned and widely utilized LLM, RAG research framework.
constrained by its pretraining data, lacks knowledge of re-
cent events. RAG addresses this gap by retrieving up-to-date 3 RAG Framework
document excerpts from external knowledge bases. In this in- The RAG research paradigm is continuously evolving, and
stance, it procures a selection of news articles pertinent to the this section primarily delineates its progression. We cate-
inquiry. These articles, alongside the initial question, are then gorize it into three types: Naive RAG, Advanced RAG, and
amalgamated into an enriched prompt that enables ChatGPT Modular RAG. While RAG were cost-effective and surpassed
to synthesize an informed response. This example illustrates the performance of the native LLM, they also exhibited sev-
the RAG process, demonstrating its capability to enhance the eral limitations. The development of Advanced RAG and
model’s responses with real-time information retrieval. Modular RAG was a response to these specific shortcomings
Technologically, RAG has been enriched through various in Naive RAG.
innovative approaches addressing pivotal questions such as
“what to retrieve” “when to retrieve” and “how to use the 3.1 Naive RAG
retrieved information”. For “what to retrieve” research has The Naive RAG research paradigm represents the earliest
progressed from simple token [Khandelwal et al., 2019] and methodology, which gained prominence shortly after the
entity retrieval [Nishikawa et al., 2022] to more complex widespread adoption of ChatGPT. The Naive RAG follows a
structures like chunks [Ram et al., 2023] and knowledge traditional process that includes indexing, retrieval, and gen-
graph [Kang et al., 2023], with studies focusing on the eration. It is also characterized as a “Retrieve-Read” frame-
granularity of retrieval and the level of data structur- work [Ma et al., 2023a].
ing. Coarse granularity brings more information but
Indexing
with lower precision. Retrieving structured text provides
The indexing process is a crucial initial step in data prepara-
more information while sacrificing efficiency. The ques-
tion that occurs offline and involves several stages. It begins
tion of “when to retrieve” has led to strategies ranging
with data indexing, where original data is cleansed and ex-
from single [Wang et al., 2023e, Shi et al., 2023] to adap-
tracted, and various file formats such as PDF, HTML, Word,
tive [Jiang et al., 2023b, Huang et al., 2023] and multiple
and Markdown are converted into standardized plain text. In
retrieval [Izacard et al., 2022] methods. High frequency of
order to fit within the context limitations of language models,
retrieval brings more information and lower efficiency. As
this text is then segmented into smaller, more manageable
for ”how to use” the retrieved data, integration techniques
chunks in a process known as chunking. These chunks are
have been developed across various levels of the model
subsequently transformed into vector representations through
architecture, including the input [Khattab et al., 2022],
an embedding model, chosen for its balance between infer-
intermediate [Borgeaud et al., 2022], and output lay-
ence efficiency and model size. This facilitates similarity
ers [Liang et al., 2023]. Although the “intermediate” and
comparisons during the retrieval phase. Finally, an index is
“output layers” are more effective, there are problems with
created to store these text chunks and their vector embed-
the need for training and low efficiency.
dings as key-value pairs, which allows for efficient and scal-
RAG is a paradigm that enhances LLMs by integrating ex-
able search capabilities.
ternal knowledge bases. It employs a synergistic approach,
combining information retrieval mechanisms and In-Context Retrieval
Learning (ICL) to bolster the LLM’s performance. In this Upon receipt of a user query, the system employs the same en-
framework, a query initiated by a user prompts the retrieval of coding model utilized during the indexing phase to transcode
Figure 2: A representative instance of the RAG process applied to question answering
the input into a vector representation. It then proceeds to hensive responses. Outdated information further compounds
compute the similarity scores between the query vector and the problem, potentially yielding inaccurate retrieval results.
the vectorized chunks within the indexed corpus. The system Response generation quality presents hallucination chal-
prioritizes and retrieves the top K chunks that demonstrate lenge, where the model generates answers not grounded in
the greatest similarity to the query. These chunks are subse- the provided context, as well as issues of irrelevant context
quently used as the expanded contextual basis for addressing and potential toxicity or bias in the model’s output.
the user’s request. The augmentation process presents its own challenges in
effectively integrating context from retrieved passages with
Generation
the current generation task, potentially leading to disjointed
The posed query and selected documents are synthesized into or incoherent output. Redundancy and repetition are also
a coherent prompt to which a large language model is tasked concerns, especially when multiple retrieved passages con-
with formulating a response. The model’s approach to an- tain similar information, resulting in repetitive content in the
swering may vary depending on task-specific criteria, allow- generated response.
ing it to either draw upon its inherent parametric knowledge Discerning the importance and relevance of multiple re-
or restrict its responses to the information contained within trieved passages to the generation task is another challenge,
the provided documents. In cases of ongoing dialogues, requiring the proper balance of each passage’s value. Addi-
any existing conversational history can be integrated into the tionally, reconciling differences in writing styles and tones to
prompt, enabling the model to engage in multi-turn dialogue ensure consistency in the output is crucial.
interactions effectively. Lastly, there’s a risk of generation models overly depend-
Drawbacks in Naive RAG ing on augmented information, potentially resulting in out-
Naive RAG faces significant challenges in three key areas: puts that merely reiterate the retrieved content without pro-
“Retrieval,” “Generation,” and “Augmentation”. viding new value or synthesized information.
Retrieval quality poses diverse challenges, including low
precision, leading to misaligned retrieved chunks and po- 3.2 Advanced RAG
tential issues like hallucination or mid-air drop. Low recall Advanced RAG has been developed with targeted enhance-
also occurs, resulting in the failure to retrieve all relevant ments to address the shortcomings of Naive RAG. In terms
chunks, thereby hindering the LLMs’ ability to craft compre- of retrieval quality, Advanced RAG implements pre-retrieval
and post-retrieval strategies. To address the indexing chal- of LLMs like GPT, is a sophisticated dynamic embedding
lenges experienced by Naive RAG, Advanced RAG has re- model that captures contextual understanding. However, it
fined its indexing approach using techniques such as slid- may not exhibit the same sensitivity to context as the latest
ing window, fine-grained segmentation, and metadata. It has full-size language models like GPT-4.
also introduced various methods to optimize the retrieval pro-
Post-Retrieval Process
cess [ILIN, 2023].
After retrieving valuable context from the database, it is es-
Pre-Retrieval Process sential to merge it with the query as an input into LLMs while
Optimizing Data Indexing.The goal of optimizing data index- addressing challenges posed by context window limits. Sim-
ing is to enhance the quality of the content being indexed. ply presenting all relevant documents to the LLM at once may
This involves five primary strategies: enhancing data gran- exceed the context window limit, introduce noise, and hinder
ularity, optimizing index structures, adding metadata, align- the focus on crucial information. Additional processing of the
ment optimization, and mixed retrieval. retrieved content is necessary to address these issues.
Enhancing data granularity aims to elevate text standard- Re-Ranking. Re-ranking the retrieved information to re-
ization, consistency, factual accuracy, and rich context to im- locate the most relevant content to the edges of the prompt
prove the RAG system’s performance. This includes remov- is a key strategy. This concept has been implemented
ing irrelevant information, dispelling ambiguity in entities in frameworks such as LlamaIndex4 , LangChain5 , and
and terms, confirming factual accuracy, maintaining context, HayStack [Blagojevi, 2023]. For example, Diversity Ranker6
and updating outdated documents. prioritizes reordering based on document diversity, while
Optimizing index structures involves adjusting the size of LostInTheMiddleRanker alternates placing the best docu-
chunks to capture relevant context, querying across multiple ment at the beginning and end of the context window. Ad-
index paths, and incorporating information from the graph ditionally, approaches like cohereAI rerank [Cohere, 2023],
structure to capture relevant context by leveraging relation- bge-rerank7 , and LongLLMLingua [Jiang et al., 2023a] re-
ships between nodes in a graph data index. calculate the semantic similarity between relevant text and the
Adding metadata information involves integrating refer- query, addressing the challenge of interpreting vector-based
enced metadata, such as dates and purposes, into chunks for simulated searches for semantic similarity.
filtering purposes, and incorporating metadata like chapters Prompt Compression. Research indicates that noise in re-
and subsections of references to improve retrieval efficiency. trieved documents adversely affects RAG performance. In
Alignment optimization addresses alignment issues and post-processing, the emphasis lies in compressing irrelevant
disparities between documents by introducing “hypothetical context, highlighting pivotal paragraphs, and reducing the
questions” [Li et al., 2023d] into documents to rectify align- overall context length. Approaches such as Selective Context
ment issues and differences. and LLMLingua [Litman et al., 2020, Anderson et al., 2022]
utilize small language models to calculate prompt mu-
Retrieval tual information or perplexity, estimating element impor-
During the retrieval stage, the primary focus is on identifying tance. Recomp [Xu et al., 2023a] addresses this by train-
the appropriate context by calculating the similarity between ing compressors at different granularities, while Long
the query and chunks. The embedding model is central to Context [Xu et al., 2023b] and “Walking in the Memory
this process. In the advanced RAG, there is potential for op- Maze” [Chen et al., 2023a] design summarization techniques
timization of the embedding models. to enhance LLM’s key information perception, particularly in
Fine-tuning Embedding. Fine-tuning embedding models dealing with extensive contexts.
significantly impact the relevance of retrieved content in RAG
systems. This process involves customizing embedding mod- 3.3 Modular RAG
els to enhance retrieval relevance in domain-specific contexts, The modular RAG structure diverges from the tradi-
especially for professional domains dealing with evolving or tional Naive RAG framework, providing greater versatil-
rare terms. The BGE embedding model [BAAI, 2023], such ity and flexibility. It integrates various methods to en-
as BGE-large-EN developed by BAAI2 , is an example of a hance functional modules, such as incorporating a search
high-performance embedding model that can be fine-tuned module for similarity retrieval and applying a fine-tuning
to optimize retrieval relevance. Training data for fine-tuning approach in the retriever [Lin et al., 2023]. Restructured
can be generated using language models like GPT-3.5-turbo RAG modules [Yu et al., 2022] and iterative methodologies
to formulate questions grounded on document chunks, which like [Shao et al., 2023] have been developed to address spe-
are then used as fine-tuning pairs. cific issues. The modular RAG paradigm is increasingly be-
Dynamic Embedding adapts to the context in which words coming the norm in the RAG domain, allowing for either a
are used, unlike static embedding, which uses a single vec- serialized pipeline or an end-to-end training approach across
tor for each word [Karpukhin et al., 2020]. For example, multiple modules. The comparison of three RAG paradigms
in transformer models like BERT, the same word can have
4
varied embeddings depending on surrounding words. Ope- https://www.llamaindex.ai
5
nAI’s embeddings-ada-02 model3 , built upon the principles https://www.langchain.com/
6
https://haystack.deepset.ai/blog/
2
https://huggingface.co/BAAI/bge-large-en enhancing-rag-pipelines-in-haystack
3 7
https://platform.openai.com/docs/guides/embeddings https://huggingface.co/BAAI/bge-reranker-large
Figure 3: Comparison between the three paradigms of RAG
is depicted in Figure 3. However, Modular RAG is not stan- tiple, diverse perspectives using an LLM. This approach not
dalone. Advanced RAG is a specialized form of modular only captures the explicit information users seek but also un-
RAG, and further, Naive RAG itself is a special case of Ad- covers deeper, transformative knowledge. The fusion pro-
vanced RAG. The relationship among the three paradigms is cess involves parallel vector searches of both original and
one of inheritance and development. expanded queries, intelligent re-ranking to optimize results,
and pairing the best outcomes with new queries. This sophis-
New Modules ticated method ensures search results that align closely with
Search Module. In contrast to the similarity retrieval in both the explicit and implicit intentions of the user, leading to
Naive/Advanced RAG, the Search Module is tailored to spe- more insightful and relevant information discovery.
cific scenarios and incorporates direct searches on additional Routing. The RAG system’s retrieval process utilizes di-
corpora. This integration is achieved using code generated verse sources, differing in domain, language, and format,
by the LLM, query languages such as SQL or Cypher, and which can be either alternated or merged based on the sit-
other custom tools. The data sources for these searches can uation [Li et al., 2023b]. Query routing decides the subse-
include search engines, text data, tabular data, and knowledge quent action to a user’s query, with options ranging from
graphs [Wang et al., 2023d]. summarization, searching specific databases, or merging dif-
Memory Module. This module harnesses the memory ca- ferent pathways into a single response. The query router also
pabilities of the LLM to guide retrieval. The approach in- chooses the appropriate data store for the query, which may
volves identifying memories most similar to the current input. include various sources like vector stores, graph databases, or
Selfmem [Cheng et al., 2023b] utilizes a retrieval-enhanced relational databases, or a hierarchy of indices—for instance, a
generator to create an unbounded memory pool iteratively, summary index and a document block vector index for multi-
combining the “original question” and “dual question”. By document storage. The query router’s decision-making is pre-
employing a retrieval-enhanced generative model that uses its defined and executed via LLMs calls, which direct the query
own outputs to improve itself, the text becomes more aligned to the chosen index.
with the data distribution during the reasoning process. Con- Predict . It addresses the common issues of redundancy
sequently, the model’s own outputs are utilized instead of the and noise in retrieved content. Instead of directly retrieving
training data [Wang et al., 2022a]. from a data source, this module utilizes the LLM to generate
Fusion. RAG-Fusion [Raudaschl, 2023]enhances tradi- the necessary context [Yu et al., 2022]. The content produced
tional search systems by addressing their limitations through by the LLM is more likely to contain pertinent information
a multi-query approach that expands user queries into mul- compared to that obtained through direct retrieval.
Task Adapter. This module focuses on adapting RAG to a Optimizing the RAG Pipeline
variety of downstream tasks. UPRISE automates the retrieval The optimization of the retrieval process aims to enhance the
of prompts for zero-shot task inputs from a pre-constructed efficiency and quality of information in RAG systems. Cur-
data pool, thereby enhancing universality across tasks and rent research focuses on integrating diverse search technolo-
models [Cheng et al., 2023a]. Meanwhile, PROMPTAGA- gies, refining retrieval steps, incorporating cognitive back-
TOR [Dai et al., 2022] utilizes LLM as a few-shot query gen- tracking, implementing versatile query strategies, and lever-
erator and, based on the generated data, creates task-specific aging embedding similarity. These efforts collectively strive
retrievers. By leveraging the generalization capability of to achieve a balance between retrieval efficiency and the
LLMs, it enables the development of task-specific end-to-end depth of contextual information in RAG systems.
retrievers with minimal examples. Hybrid Search Exploration. The RAG system optimizes its
performance by intelligently integrating various techniques,
New Patterns including keyword-based search, semantic search, and vec-
tor search. This approach leverages the unique strengths of
The organizational structure of Modular RAG is highly adapt- each method to accommodate diverse query types and infor-
able, allowing for the substitution or rearrangement of mod- mation needs, ensuring consistent retrieval of highly relevant
ules within the RAG process to suit specific problem contexts. and context-rich information. The use of hybrid search serves
Naive RAG and Advanced RAG can both be considered as as a robust supplement to retrieval strategies, thereby enhanc-
being composed of some fixed modules. As illustrated in the ing the overall efficacy of the RAG pipeline.
figure 3, Naive RAG primarily consists of the “Retrieve” and Recursive Retrieval and Query Engine. Recursive retrieval
“Read” modules. A typical pattern of Advanced RAG builds involves acquiring smaller chunks during the initial retrieval
upon the foundation of Naive RAG by adding “Rewrite” and phase to capture key semantic meanings. Subsequently, larger
“Rerank” modules. However, on the whole, modular RAG chunks containing more contextual information are provided
enjoys greater diversity and flexibility. to the LLM in later stages of the process. This two-step re-
Current research primarily explores two organizational trieval method helps to strike a balance between efficiency
paradigms. The first involves adding or replacing modules, and the delivery of contextually rich responses.
while the second focuses on adjusting the organizational flow StepBack-prompt approach encourages the LLM to move
between modules. This flexibility enables tailoring the RAG away from specific instances and engage in reasoning around
process to effectively address a wide array of tasks. broader concepts and principles [Zheng et al., 2023]. Experi-
Adding or Replacing Modules.The strategy of introducing mental results demonstrate a significant performance increase
or substituting modules involves maintaining the core struc- in various challenging, inference-based tasks when backward
ture of the Retrieval-Read process while integrating addi- prompts are used, highlighting their natural adaptability to the
tional modules to enhance specific functionalities. The RRR RAG process. These retrieval-enhancing steps can be applied
model [Ma et al., 2023a] introduces the Rewrite-Retrieve- both in generating responses to backward prompts and in the
Read process, utilizing the LLM performance as a reinforce- final question-answering process.
ment learning incentive for a rewriting module. This enables Sub-Queries. Depending on the scenario, various query
the rewriter to fine-tune retrieval queries, thereby improving strategies can be employed, such as using query engines
the downstream task performance of the reader. provided by frameworks like LlamaIndex, leveraging tree
queries, utilizing vector queries, or executing simple sequen-
Similarly, modules can be selectively swapped in method- tial querying of chunks.
ologies like Generate-Read [Yu et al., 2022], where the Hypothetical Document Embeddings. HyDE operates on
LLM’s generation module takes the place of the retrieval the belief that the answers generated might be closer in the
module. The Recite-Read approach [Sun et al., 2022] trans- embedding space than a direct query. Using the LLM, HyDE
forms external retrieval into retrieval from model weights, creates a hypothetical document (answer) in response to a
requiring the LLM to initially memorize task-specific infor- query, embeds this document, and uses the resulting em-
mation and subsequently produce output capable of handling bedding to retrieve real documents similar to the hypotheti-
knowledge-intensive natural language processing tasks. cal one. Instead of seeking embedding similarity based on
Adjusting the Flow between Modules. zheIn the realm the query, this approach focuses on the embedding similar-
of module flow adjustment, there is a focus on enhancing ity from one answer to another [Gao et al., 2022]. However,
the interaction between language models and retrieval mod- it might not consistently produce desirable outcomes, espe-
els. DSP [Khattab et al., 2022] introduces the Demonstrate- cially when the language model is unfamiliar with the subject
Search-Predict framework, treating the context learning sys- matter, potentially leading to more instances with errors.
tem as an explicit program rather than a final task prompt,
leading to more effective handling of knowledge-intensive
tasks. The ITER-RETGEN [Shao et al., 2023] approach uti-
4 Retrieval
lizes generated content to guide retrieval, iteratively im- In the context of RAG, it is crucial to efficiently retrieve rel-
plementing “retrieval-enhanced generation” and “generation- evant documents from the data source. However, creating a
enhanced retrieval” within a Retrieve-Read-Retrieve-Read proficient retriever presents significant challenges. This sec-
flow. This method demonstrates an innovative way of using tionelves into three fundamental questions: 1) How can we
one module’s output to improve the functionality of another. achieve accurate semantic representations? 2) What methods
can align the semantic spaces of queries and documents? 3) The combination of these diverse methods has led to no-
How can the retriever’s output be aligned with the preferences table advancements, resulting in enhanced retrieval outcomes
of the Large Language Model? and improved performance for RAG.
retrieval-based strategies. The REALM model adopts a struc- Furthermore, COG [Lan et al., 2022] introduces a novel
tured, interpretable method for knowledge embedding, fram- text generation methodology that emulates copying text frag-
ing pre-training, and fine-tuning as a retrieve-then-predict ments from pre-existing collections. Utilizing efficient vector
workflow within the masked language model (MLM) frame- search tools, COG computes and indexes contextually mean-
work [Arora et al., 2023] . ingful representations of text fragments, demonstrating supe-
RETRO [Borgeaud et al., 2022] leverages retrieval aug- rior performance in domains such as question-answering and
mentation for large-scale pre-training from scratch, achieving domain adaptation when compared to RETRO.
a reduction in model parameters while surpassing standard The advent of scaling laws has catalyzed the growth of
GPT models in terms of perplexity. RETRO distinguishes it- model parameters, propelling autoregressive models into the
self with an additional encoder designed to process features mainstream. Researchers are expanding the RAG approach to
of entities retrieved from an external knowledge base, build- pretrained larger models, with RETRO++ exemplifying this
ing on the foundational structure of GPT models. trend by scaling up the model parameters while preserving or
Atlas[Izacard et al., 2022] also incorporates a retrieval enhancing performance [Wang et al., 2023b].
mechanism into the T5 architecture [Raffel et al., 2020] in Empirical evidence underscores marked improvements in
both the pre-training and fine-tuning stages. It uses a pre- text generation quality, factual accuracy, reduced toxicity,
trained T5 to initialize the encoder-decoder language model and downstream task proficiency, especially in knowledge-
and a pre-trained Contriever for the dense retriever, improv- intensive applications like open-domain QA. These results
ing its efficiency for complex language modeling tasks. imply that integrating retrieval mechanisms into the pre-
training of autoregressive language models constitutes a bilities and avoid overfitting that may arise from training them
promising avenue, marrying sophisticated retrieval tech- separately. However, joint fine-tuning also leads to increased
niques with expansive language models to yield more precise resource consumption. RA-DIT [Lin et al., 2023] presents
and efficient language generation. a lightweight, dual-instruction tuning framework that can
The benefits of augmented pre-training include a robust effectively add retrieval capabilities to any LLMs. The
foundational model that outperforms standard GPT models retrieval-enhanced directive fine-tuning updates the LLM,
in perplexity, text generation quality, and task-specific per- guiding it to make more efficient use of the information re-
formance, all while utilizing fewer parameters. This method trieved and to disregard distracting content.
is particularly adept at handling knowledge-intensive tasks Despite its advantages, fine-tuning has limitations, includ-
and facilitates the development of domain-specific models ing the need for specialized datasets for RAG fine-tuning
through training on specialized corpora. and the requirement for significant computational resources.
Nonetheless, this approach faces challenges such as the However, this stage allows for customizing models to specific
necessity for extensive pre-training datasets and resources, needs and data formats, potentially reducing resource usage
as well as diminished update frequencies with increasing compared to the pre-training phase while still being able to
model sizes. Despite these hurdles, the approach offers fine-tune the model’s output style.
significant advantages in model resilience. Once trained, In summary, the fine-tuning stage is essential for the adap-
retrieval-enhanced models can operate independently of ex- tation of RAG models to specific tasks, enabling the refine-
ternal libraries, enhancing generation speed and operational ment of both retrievers and generators. This stage enhances
efficiency. The potential gains identified render this method- the model’s versatility and adaptability to various tasks, de-
ology a compelling subject for ongoing investigation and in- spite the challenges presented by resource and dataset re-
novation in artificial intelligence and machine learning. quirements. The strategic fine-tuning of RAG models is
therefore a critical component in the development of efficient
Fine-tuning Stage and effective retrieval-augmented systems.
RAG and Fine-tuning are powerful tools for enhancing
LLMs, and combining the two can meet the needs of more Inference Stage
specific scenarios. On one hand, fine-tuning allows for the The inference stage in RAG models is crucial, as it in-
retrieval of documents with a unique style, achieving bet- volves extensive integration with LLMs. Traditional RAG
ter semantic expression and aligning the differences between approaches, also known as Naive RAG, involve incorporating
queries and documents. This ensures that the output of the retrieval content at this stage to guide the generation process.
retriever is more aptly suited to the scenario at hand. On To overcome the limitations of Naive RAG, advanced tech-
the other hand, fine-tuning can fulfill the generation needs of niques introduce more contextually rich information dur-
making stylized and targeted adjustments. Furthermore, fine- ing inference. The DSP framework [Khattab et al., 2022]
tuning can also be used to align the retriever and generator for utilizes a sophisticated exchange of natural language text
improved model synergy. between fronzen LMs and retrieval models (RMs), en-
The main goal of fine-tuning the retriever is to improve riching the context and thereby improving generation out-
the quality of semantic representations, achieved by directly comes. The PKG [Luo et al., 2023] method equips LLMs
fine-tuning the Embedding model using a corpus [Liu, 2023]. with a knowledge-guided module that allows for the retrieval
By aligning the retriever’s capabilities with the prefer- of pertinent information without modifying the LMs’ pa-
ences of the LLMs through feedback signals, both can rameters, enabling more complex task execution. CREA-
be better coordinated [Yu et al., 2023b, Izacard et al., 2022, ICL [Li et al., 2023b] employs a synchronous retrieval of
Yang et al., 2023b, Shi et al., 2023]. Fine-tuning the retriever cross-lingual knowledge to enhance context, while RE-
for specific downstream tasks can lead to improved adapt- CITE [Sun et al., 2022] generates context by sampling para-
ability [cite]. The introduction of task-agnostic fine-tuning graphs directly from LLMs.
aims to enhance the retriever’s versatility in multi-task sce- Further refinement of the RAG process during infer-
narios [Cheng et al., 2023a]. ence is seen in approaches that cater to tasks necessi-
Fine-tuning generator can result in outputs that are tating multi-step reasoning. ITRG [Feng et al., 2023] it-
more stylized and customized. On one hand, it allows eratively retrieves information to identify the correct rea-
for specialized adaptation to different input data formats. soning paths, thereby improving task adaptability. ITER-
For example, fine-tuning LLMs to fit the structure of RETGEN [Shao et al., 2023] follows an iterative strat-
knowledge graphs [Kang et al., 2023], the structure of text egy, merging retrieval and generation in a cyclical pro-
pairs [Kang et al., 2023, Cheng et al., 2023b], and other spe- cess that alternates between “retrieval-enhanced generation”
cific structures [Li et al., 2023d]. On the other hand, by con- and “generation-enhanced retrieval”. For non-knowledge-
structing directive datasets, one can demand LLMs to gen- intensive (NKI) tasks, PGRA [Guo et al., 2023] proposes a
erate specific formats content. For instance, in adaptive or two-stage framework, starting with a task-agnostic retriever
iterative retrieval scenarios, LLMs are fine-tuned to generate followed by a prompt-guided reranker to select and priori-
content that will help determine the timing for the next step tize evidence. In contrast, IRCOT [Trivedi et al., 2022] com-
of action [Jiang et al., 2023b, Asai et al., 2023]. bines RAG with Chain of Thought (CoT) methodologies, al-
By synergistically fine-tuning both the retriever and the ternating CoT-guided retrievals with retrieval-informed CoT
generator, we can enhance the model’s generalization capa- processes, significantly boosting GPT-3’s performance across
various question-answering tasks. edGPT [Wang et al., 2023d] generates KB search queries and
In essence, these inference-stage enhancements provide stores knowledge in a personalized base, enhancing the RAG
lightweight, cost-effective alternatives that leverage the ca- model’s knowledge richness and contextuality.
pabilities of pre-trained models without necessitating further
LLMs-Generated Content in RAG
training. The principal advantage is maintaining static LLM
parameters while supplying contextually relevant information Addressing the limitations of external auxiliary information
to meet specific task demands. Nevertheless, this approach is in RAG, some research has focused on exploiting LLMs’ in-
not without limitations, as it requires meticulous data pro- ternal knowledge. SKR [Wang et al., 2023e] classifies ques-
cessing and optimization, and is bound by the foundational tions as known or unknown, applying retrieval enhancement
model’s intrinsic capabilities. To address diverse task require- selectively. GenRead [Yu et al., 2022] replaces the retriever
ments effectively, this method is often paired with procedural with an LLM generator, finding that LLM-generated con-
optimization techniques such as step-wise reasoning, iterative texts often contain more accurate answers due to better align-
retrieval, and adaptive retrieval strategies. ment with the pre-training objectives of causal language mod-
eling. Selfmem [Cheng et al., 2023b] iteratively creates an
6.2 Augmentation Source unbounded memory pool with a retrieval-enhanced genera-
The effectiveness of RAG models is heavily impacted by the tor, using a memory selector to choose outputs that serve as
selection of data sources for augmentation. Different levels of dual problems to the original question, thus self-enhancing
knowledge and dimensions require distinct processing tech- the generative model.
niques. They are categorized as unstructured data, structured These methodologies underscore the breadth of innovative
data, and content generated by LLMs. The technology tree data source utilization in RAG, striving to improve model per-
of representative RAG research with different augmentation formance and task effectiveness.
aspects is depicted in Figure 5. The leaves, colored in three 6.3 Augmentation Process
different shades, represent enhancements using various types
of data: unstructured data, structured data, and content gener- In the domain of RAG, the standard practice often involves
ated by LLMs. The diagram clearly shows that initially, aug- a singular retrieval step followed by generation, which can
mentation was mainly achieved through unstructured data, lead to inefficiencies. A notable issue, termed the “lost
such as pure text. This approach later expanded to include in the middle” phenomenon, arises when a single retrieval
the use of structured data (e.g. knowledge graph) for further yields redundant content that may dilute or contradict es-
improvement. More recently, there has been a growing trend sential information, thereby degrading the generation qual-
in research that utilizes content generated by the LLMs them- ity [Liu et al., 2023a]. Furthermore, such singular retrieval is
selves for retrieval and augmentation purposes. typically insufficient for complex problems demanding multi-
step reasoning, as it provides a limited scope of informa-
Augmented with Unstructured Data tion [Yoran et al., 2023].
Unstructured text, is gathered from corpora, such as prompt As illustrated in Figure 5, to circumvent these challenges,
data for fine-tuning large models [Cheng et al., 2023a] and contemporary research has proposed methods for refining the
cross-lingual data [Li et al., 2023b]. Retrieval units vary from retrieval process: iterative retrieval, recursive retrieval and
tokens (e.g., kNN-LM [Khandelwal et al., 2019]) to phrases adaptive retrieval. Iterative retrieval allows the model to en-
(e.g., NPM, COG [Lee et al., 2020, Lan et al., 2022]) and gage in multiple retrieval cycles, enhancing the depth and
document paragraphs, with finer granularities offering pre- relevance of the information obtained. Recursive retrieval
cision at the cost of increased retrieval complexity. process where the results of one retrieval operation are used
FLARE [Jiang et al., 2023b] introduces an active re- as the input for the subsequent retrieval. It helps to delve
trieval approach, triggered by the LM’s generation of low- deeper into relevant information, particularly when dealing
probability words. It creates a temporary sentence for doc- with complex or multi-step queries. Recursive retrieval is of-
ument retrieval, then regenerates the sentence with the re- ten used in scenarios where a gradual approach is needed to
trieved context to predict subsequent sentences. RETRO uses converge on a final answer, such as in academic research, le-
the previous chunk to retrieve the nearest neighbor at the gal case analysis, or certain types of data mining tasks. Adap-
chunk level, combined with the previous chunk’s context, it tive retrieval, on the other hand, offers a dynamic adjustment
guides the generation of the next chunk. To preserve causal- mechanism, tailoring the retrieval process to the specific de-
ity, the generation of the next block Ci only utilizes the near- mands of varying tasks and contexts.
est neighbor of the previous block N (Ci−1 ) and not N (Ci ).
Iterative Retrieval
Augmented with Structured Data Iterative retrieval in RAG models is a process where doc-
Structured data, such as knowledge graphs (KGs), pro- uments are repeatedly collected based on the initial query
vide high-quality context and mitigate model hallucina- and the text generated thus far, providing a more compre-
tions. RET-LLMs [Modarressi et al., 2023] constructs a hensive knowledge base for LLMs [Borgeaud et al., 2022,
knowledge graph memory from past dialogues for future ref- Arora et al., 2023]. This approach has been shown to en-
erence. SUGRE [Kang et al., 2023] employs Graph Neu- hance the robustness of subsequent answer generation by of-
ral Networks (GNNs) to encode relevant KG subgraphs, fering additional contextual references through multiple re-
ensuring consistency between retrieved facts and gener- trieval iterations. However, it may suffer from semantic dis-
ated text through multi-modal contrastive learning. Knowl- continuity and the accumulation of irrelevant information, as
Figure 5: Technology tree of representative RAG research with different augmentation aspects
it typically relies on a sequence of n tokens to demarcate the The process involves iteratively refining search queries based
boundaries between generated text and retrieved documents. on the results obtained from previous searches. Recursive
To address specific data scenarios, recursive retrieval and Retrieval aims to enhance the search experience by gradu-
multi-hop retrieval techniques are utilized. Recursive re- ally converging on the most pertinent information through a
trieval involves a structured index to process and retrieve feedback loop. IRCoT [Trivedi et al., 2022] uses chain-of-
data in a hierarchical manner, which may include summa- thought to guide the retrieval process and refines the CoT
rizing sections of a document or lengthy PDF before per- with the obtained retrieval results. ToC [Kim et al., 2023]
forming a retrieval based on this summary. Subsequently, a creates a clarification tree that systematically optimizes the
secondary retrieval within the document refines the search, ambiguous parts in the Query. It can be particularly useful in
embodying the recursive nature of the process. In contrast, complex search scenarios where the user’s needs are not en-
multi-hop retrieval is designed to delve deeper into graph- tirely clear from the outset or where the information sought
structured data sources, extracting interconnected informa- is highly specialized or nuanced. The recursive nature of the
tion [Li et al., 2023c]. process allows for continuous learning and adaptation to the
Additionally, some methodologies integrate the steps of re- user’s requirements, often resulting in improved satisfaction
trieval and generation. ITER-RETGEN [Shao et al., 2023] with the search outcomes.
employs a synergistic approach that leverages “retrieval-
enhanced generation” alongside “generation-enhanced re- Adaptive Retrieval
trieval” for tasks that necessitate the reproduction of specific Adaptive retrieval methods, exemplified by Flare and Self-
information. The model harnesses the content required to ad- RAG [Jiang et al., 2023b, Asai et al., 2023], refine the RAG
dress the input task as a contextual basis for retrieving per- framework by enabling LLMs to actively determine the op-
tinent knowledge, which in turn facilitates the generation of timal moments and content for retrieval, thus enhancing the
improved responses in subsequent iterations. efficiency and relevance of the information sourced.
These methods are part of a broader trend wherein
Recursive Retrieval LLMs employ active judgment in their operations, as
Recursive Retrieval is often used in information retrieval and seen in model agents like AutoGPT, Toolformer, and
NLP to improve the depth and relevance of search results. Graph-Toolformer [Yang et al., 2023c, Schick et al., 2023,
Zhang, 2023]. Graph-Toolformer, for instance, divides its re- involving RAG and FT can necessitate multiple iterations to
trieval process into distinct steps where LLMs proactively use achieve satisfactory results.
retrievers, apply Self-Ask techniques, and employ few-shot
prompts to initiate search queries. This proactive stance al- 7 RAG Evaluation
lows LLMs to decide when to search for necessary informa-
tion, akin to how an agent utilizes tools. The rapid advancement and growing adoption of RAG in the
field of Natural Language Processing (NLP) have propelled
WebGPT [Nakano et al., 2021] integrates a reinforcement
the evaluation of RAG models to the forefront of research in
learning framework to train the GPT-3 model in au-
the LLMs community. The primary objective of this evalua-
tonomously using a search engine during text generation.
tion is to comprehend and optimize the performance of RAG
It navigates this process using special tokens that facili-
models across diverse application scenarios.
tate actions such as search engine queries, browsing results,
Historically, RAG models assessments have centered
and citing references, thereby expanding GPT-3’s capabilities
on their execution in specific downstream tasks. These
through the use of external search engines.
evaluations employ established metrics suitable to the tasks
Flare automates timing retrieval by monitoring the confi- at hand. For instance, question answering evaluations
dence of the generation process, as indicated by the probabil- might rely on EM and F1 scores [Wang et al., 2023a,
ity of generated terms [Jiang et al., 2023b]. When the prob- Shi et al., 2023, Feng et al., 2023, Ma et al., 2023a], whereas
ability falls below a certain threshold would activates the re- fact-checking tasks often hinge on accuracy as the pri-
trieval system to collect relevant information, thus optimizing mary metric [Lewis et al., 2020, Izacard et al., 2022,
the retrieval cycle. Shao et al., 2023]. Tools like RALLE, designed for the auto-
Self-RAG [Asai et al., 2023] introduces “reflection to- matic evaluation of RAG applications, similarly base their as-
kens” that allow the model to introspect its outputs. These sessments on these task-specific metrics [Hoshi et al., 2023].
tokens come in two varieties: “retrieve” and “critic”. The Despite this, there is a notable paucity of research dedicated
model autonomously decides when to activate retrieval, or to evaluating the distinct characteristics of RAG models, with
alternatively, a predefined threshold may trigger the pro- only a handful of related studies.
cess. During retrieval, the generator conducts a fragment- The following section shifts the focus from task-specific
level beam search across multiple paragraphs to derive the evaluation methods and metrics to provide a synthesis of the
most coherent sequence. Critic scores are used to update the existing literature based on their unique attributes. This ex-
subdivision scores, with the flexibility to adjust these weights ploration covers the objectives of RAG evaluation, the aspects
during inference, tailoring the model’s behavior. Self-RAG’s along which these models are assessed, and the benchmarks
design obviates the need for additional classifiers or reliance and tools available for such evaluations. The aim is to offer a
on Natural Language Inference (NLI) models, thus stream- comprehensive overview of RAG model evaluation, outlining
lining the decision-making process for when to engage re- the methodologies that specifically address the unique aspects
trieval mechanisms and improving the model’s autonomous of these advanced generative systems.
judgment capabilities in generating accurate responses.
LLM optimization has received significant attention due to 7.1 Evaluation Targets
its increasing prevalence. Techniques such as prompt engi- The assessment of RAG models mainly revolves around two
neering, Fine-Tuning (FT), and RAG each have distinct char- key components: the retrieval and generation modules. This
acteristics, visually represented in Figure 6. While prompt division ensures a thorough evaluation of both the quality of
engineering leverages a model’s inherent capabilities, opti- context provided and the quality of content produced.
mizing LLMs often requires the application of both RAG and
FT methods. The choice between RAG and FT should be Retrieval Quality
based on the specific requirements of the scenario and the in- Evaluating the retrieval quality is crucial for determining the
herent properties of each approach. A detailed comparison of effectiveness of the context sourced by the retriever com-
RAG and FT is presented in Table 1. ponent. Standard metrics from the domains of search en-
gines, recommendation systems, and information retrieval
6.4 RAG vs Fine-Tuning systems are employed to measure the performance of the
RAG retrieval module. Metrics such as Hit Rate, MRR, and
RAG is like giving a model a textbook for tailored informa- NDCG are commonly utilized for this purpose [Liu, 2023,
tion retrieval, perfect for specific queries. On the other hand, Nguyen, 2023].
FT is like a student internalizing knowledge over time, bet-
ter for replicating specific structures, styles, or formats. FT Generation Quality
can improve model performance and efficiency by reinforc- The assessment of generation quality centers on the gener-
ing base model knowledge, adjusting outputs, and teaching ator’s capacity to synthesize coherent and relevant answers
complex instructions. However, it is not as good for integrat- from the retrieved context. This evaluation can be catego-
ing new knowledge or rapidly iterating new use cases. rized based on the content’s objectives: unlabeled and la-
The two methods, RAG and FT, are not mutually exclusive beled content. For unlabeled content, the evaluation encom-
and can be complementary, augmenting a model’s capabil- passes the faithfulness, relevance, and non-harmfulness of the
ities at different levels. In some cases, their combined use generated answers. In contrast, for labeled content, the fo-
may yield optimal performance. The optimization process cus is on the accuracy of the information produced by the
Table 1: Comparison between RAG and Fine-Tuning
Focuses on information retrieval and inte- Allows adjustments of LLM behavior, writ-
Model Customization grating external knowledge but may not fully ing style, or specific domain knowledge
customize model behavior or writing style. based on specific tones or terms.
Responses can be traced back to specific data Similar to a black box, it is not always clear
Interpretability sources, providing higher interpretability and why the model reacts a certain way, resulting
traceability. in relatively lower interpretability.
Involves data retrieval, which may lead to LLM after fine-tuning can respond without
Latency Requirements
higher latency. retrieval, resulting in lower latency.
model [Liu, 2023]. Additionally, both retrieval and genera- evaluate the efficiency of the RAG model from differ-
tion quality assessments can be conducted through manual ent perspectives in the process of information retrieval
or automatic evaluation methods [Liu, 2023, Lan et al., 2022, and generation [Es et al., 2023, Saad-Falcon et al., 2023,
Leng et al., 2023]. Jarvis and Allard, 2023]. The quality scores—context rele-
vance, answer faithfulness, and answer relevance—assess the
7.2 Evaluation Aspects RAG model’s efficiency from various angles throughout the
Contemporary evaluation practices of RAG models empha- information retrieval and generation process [Es et al., 2023,
size three primary quality scores and four essential abilities, Saad-Falcon et al., 2023, Jarvis and Allard, 2023].
which collectively inform the evaluation of the two principal Context Relevance evaluates the precision and specificity
targets of the RAG model: retrieval and generation. of the retrieved context, ensuring relevance and minimizing
Quality Scores processing costs associated with extraneous content.
Quality scores include context relevance, answer faith- Answer Faithfulness ensures that the generated answers re-
fulness, and answer relevance. These quality scores main true to the retrieved context, maintaining consistency
Figure 6: RAG compared with other model optimization methods
Accuracy ✓ ✓ ✓ ✓ ✓ ✓ ✓
EM ✓
Recall ✓
Precision ✓ ✓
R-Rate ✓
Cosine Similarity ✓
Hit Rate ✓
MRR ✓
NDCG ✓
† represents a benchmark, and ‡ represents a tool. * denotes customized quantitative metrics, which deviate from traditional
metrics. Readers are encouraged to consult pertinent literature for the specific quantification formulas associated with these
metrics, as required.
9 Conclusion ples to interpret and process diverse data forms such as im-
The summary of this paper, as depicted in Figure 7, high- ages, videos, and code. This expansion underscores RAG’s
lights RAG’s significant advancement in enhancing the ca- significant practical implications for AI deployment, attract-
pabilities of LLMs through the integration of parameter- ing interest from both academic and industrial sectors. The
ized knowledge from language models with extensive non- growing ecosystem of RAG is underscored by an increase in
parameterized data from external knowledge bases. Our sur- RAG-centric AI applications and the ongoing development
vey illustrates the evolution of RAG technologies and their of supportive tools. However, as RAG’s application land-
impact on knowledge-intensive tasks. Our analysis delin- scape expands, there is an imperative need to refine evaluation
eates three developmental paradigms within the RAG frame- methodologies to keep pace with its evolution. Ensuring that
work: Naive, Advanced, and Modular RAG, each marking performance assessments remain accurate and representative
a progressive enhancement over its predecessors. The Ad- is crucial for capturing the full extent of RAG’s contributions
vanced RAG paradigm extends beyond the Naive approach to the AI research and development community.
by incorporating sophisticated architectural elements, includ-
ing query rewriting, chunk reranking, and prompt summariza- References
tion. These innovations have led to a more nuanced and mod-
ular architecture that enhances both the performance and the [Alon et al., 2022] Uri Alon, Frank Xu, Junxian He, Sudipta
interpretability of LLMs. RAG’s technical integration with Sengupta, Dan Roth, and Graham Neubig. Neuro-
other AI methodologies, such as fine-tuning and reinforce- symbolic language modeling with automaton-augmented
ment learning, has further expanded its capabilities. In con- retrieval. In International Conference on Machine Learn-
tent retrieval, a hybrid methodology that leverages both struc- ing, pages 468–485. PMLR, 2022.
tured and unstructured data sources is emerging as a trend,
providing a more enriched retrieval process. Cutting-edge re- [Anderson et al., 2022] Nathan Anderson, Caleb Wilson,
search within the RAG framework is exploring novel con- and Stephen D. Richardson. Lingua: Addressing scenar-
cepts such as self-retrieval from LLMs and the dynamic tim- ios for live interpretation and automatic dubbing. In Jan-
ing of information retrieval. ice Campbell, Stephen Larocca, Jay Marciano, Konstantin
Despite the strides made in RAG technology, research op- Savenkov, and Alex Yanishevsky, editors, Proceedings of
portunities abound in improving its robustness and its abil- the 15th Biennial Conference of the Association for Ma-
ity to manage extended contexts. RAG’s application scope is chine Translation in the Americas (Volume 2: Users and
also widening into multimodal domains, adapting its princi- Providers Track and Government Track), pages 202–209,
Orlando, USA, September 2022. Association for Machine [Cheng et al., 2022] Xin Cheng, Shen Gao, Lemao Liu,
Translation in the Americas. Dongyan Zhao, and Rui Yan. Neural machine transla-
[Arora et al., 2023] Daman Arora, Anush Kini, Sayak Ray tion with contrastive translation memories. arXiv preprint
Chowdhury, Nagarajan Natarajan, Gaurav Sinha, and arXiv:2212.03140, 2022.
Amit Sharma. Gar-meets-rag paradigm for zero-shot infor- [Cheng et al., 2023a] Daixuan Cheng, Shaohan Huang,
mation retrieval. arXiv preprint arXiv:2310.20158, 2023. Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao
[Asai et al., 2023] Akari Asai, Zeqiu Wu, Yizhong Wang, Sun, Furu Wei, Denvy Deng, and Qi Zhang. Uprise: Uni-
Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning versal prompt retrieval for improving zero-shot evaluation.
to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2303.08518, 2023.
arXiv preprint arXiv:2310.11511, 2023. [Cheng et al., 2023b] Xin Cheng, Di Luo, Xiuying Chen,
[BAAI, 2023] BAAI. Flagembedding. https://github.com/ Lemao Liu, Dongyan Zhao, and Rui Yan. Lift yourself
FlagOpen/FlagEmbedding, 2023. up: Retrieval-augmented text generation with self mem-
ory. arXiv preprint arXiv:2305.02437, 2023.
[Baek et al., 2023] Jinheon Baek, Soyeong Jeong, Minki
Kang, Jong C Park, and Sung Ju Hwang. Knowledge- [Cohere, 2023] Cohere. Say goodbye to irrelevant search
augmented language model verification. arXiv preprint results: Cohere rerank is here. https://txt.cohere.com/
arXiv:2310.12836, 2023. rerank/, 2023.
[Berchansky et al., 2023] Moshe Berchansky, Peter Izsak, [Dai et al., 2022] Zhuyun Dai, Vincent Y Zhao, Ji Ma,
Avi Caciularu, Ido Dagan, and Moshe Wasserblat. Opti- Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin
mizing retrieval-augmented reader models via token elim- Guu, Keith B Hall, and Ming-Wei Chang. Promptagator:
ination. arXiv preprint arXiv:2310.13682, 2023. Few-shot dense retrieval from 8 examples. arXiv preprint
[Blagojevi, 2023] Vladimir Blagojevi. arXiv:2209.11755, 2022.
Enhancing rag
pipelines in haystack: Introducing diversityranker and [Es et al., 2023] Shahul Es, Jithin James, Luis Espinosa-
lostinthemiddleranker. https://towardsdatascience.com/ Anke, and Steven Schockaert. Ragas: Automated eval-
enhancing-rag-pipelines-in-haystack-45f14e2bc9f5, uation of retrieval augmented generation. arXiv preprint
2023. arXiv:2309.15217, 2023.
[Borgeaud et al., 2022] Sebastian Borgeaud, Arthur Men- [Feng et al., 2023] Zhangyin Feng, Xiaocheng Feng, Dezhi
sch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Zhao, Maojin Yang, and Bing Qin. Retrieval-generation
Millican, George Bm Van Den Driessche, Jean-Baptiste synergy augmented large language models. arXiv preprint
Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving arXiv:2310.05149, 2023.
language models by retrieving from trillions of tokens. [Gao et al., 2022] Luyu Gao, Xueguang Ma, Jimmy Lin, and
In International conference on machine learning, pages Jamie Callan. Precise zero-shot dense retrieval without
2206–2240. PMLR, 2022. relevance labels. arXiv preprint arXiv:2212.10496, 2022.
[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ry- [Glass et al., 2021] Michael Glass, Gaetano Rossiello,
der, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- Md Faisal Mahbub Chowdhury, and Alfio Gliozzo.
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Robust retrieval augmented generation for zero-shot slot
Amanda Askell, et al. Language models are few-shot filling. arXiv preprint arXiv:2108.13934, 2021.
learners. Advances in neural information processing sys-
tems, 33:1877–1901, 2020. [Google, 2023] Google. Gemini: A family of highly capable
multimodal models. https://goo.gle/GeminiPaper, 2023.
[Cai et al., 2021] Deng Cai, Yan Wang, Huayang Li, Wai
Lam, and Lemao Liu. Neural machine translation [Guo et al., 2023] Zhicheng Guo, Sijie Cheng, Yile Wang,
with monolingual translation memory. arXiv preprint Peng Li, and Yang Liu. Prompt-guided retrieval augmen-
arXiv:2105.11269, 2021. tation for non-knowledge-intensive tasks. arXiv preprint
[Chan et al., 2023] David M Chan, Shalini Ghosh, Ariya arXiv:2305.17653, 2023.
Rastrow, and Björn Hoffmeister. Using external off- [Hendrycks et al., 2020] Dan Hendrycks, Collin Burns,
policy speech-to-text mappings in contextual end-to- Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song,
end automated speech recognition. arXiv preprint and Jacob Steinhardt. Measuring massive multitask lan-
arXiv:2301.02736, 2023. guage understanding. arXiv preprint arXiv:2009.03300,
[Chen et al., 2023a] Howard Chen, Ramakanth Pasunuru, 2020.
Jason Weston, and Asli Celikyilmaz. Walking down the [Hoshi et al., 2023] Yasuto Hoshi, Daisuke Miyashita,
memory maze: Beyond context limit through interactive Youyang Ng, Kento Tatsuno, Yasuhiro Morioka, Osamu
reading. arXiv preprint arXiv:2310.05029, 2023. Torii, and Jun Deguchi. Ralle: A framework for devel-
[Chen et al., 2023b] Jiawei Chen, Hongyu Lin, Xianpei oping and evaluating retrieval-augmented large language
Han, and Le Sun. Benchmarking large language mod- models. arXiv preprint arXiv:2308.10633, 2023.
els in retrieval-augmented generation. arXiv preprint [Huang et al., 2023] Jie Huang, Wei Ping, Peng Xu, Mo-
arXiv:2309.01431, 2023. hammad Shoeybi, Kevin Chen-Chuan Chang, and Bryan
Catanzaro. Raven: In-context learning with retrieval aug- [Kim et al., 2023] Gangwoo Kim, Sungdong Kim, Byeong-
mented encoder-decoder language models. arXiv preprint guk Jeon, Joonsuk Park, and Jaewoo Kang. Tree
arXiv:2308.07922, 2023. of clarifications: Answering ambiguous questions with
[ILIN, 2023] IVAN ILIN. Advanced rag techniques: retrieval-augmented large language models. arXiv preprint
an illustrated overview. https://pub.towardsai.net/ arXiv:2310.14696, 2023.
[Lan et al., 2022] Tian Lan, Deng Cai, Yan Wang, Heyan
advanced-rag-techniques-an-illustrated-overview-04d193d8fec6,
2023. Huang, and Xian-Ling Mao. Copy is all you need. In
[Izacard et al., 2022] Gautier Izacard, Patrick Lewis, Maria The Eleventh International Conference on Learning Rep-
Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, resentations, 2022.
Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, [Lee et al., 2020] Jinhyuk Lee, Mujeen Sung, Jaewoo Kang,
and Edouard Grave. Few-shot learning with re- and Danqi Chen. Learning dense representations of
trieval augmented language models. arXiv preprint phrases at scale. arXiv preprint arXiv:2012.12624, 2020.
arXiv:2208.03299, 2022. [Leng et al., 2023] Quinn Leng, Kasey Uhlenhuth, and
[Jarvis and Allard, 2023] Colin Jarvis and John Al- Alkis Polyzotis. Best practices for llm evaluation
lard. A survey of techniques for maximizing of rag applications. https://www.databricks.com/blog/
llm performance. https://community.openai.com/ LLM-auto-eval-best-practices-RAG, 2023.
t/openai-dev-day-2023-breakout-sessions/505213# [Lewis et al., 2020] Patrick Lewis, Ethan Perez, Aleksan-
a-survey-of-techniques-for-maximizing-llm-performance-2,
dra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
2023.
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim
[Jiang et al., 2023a] Huiqiang Jiang, Qianhui Wu, Chin-Yew Rocktäschel, et al. Retrieval-augmented generation for
Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing knowledge-intensive nlp tasks. Advances in Neural Infor-
prompts for accelerated inference of large language mod- mation Processing Systems, 33:9459–9474, 2020.
els. arXiv preprint arXiv:2310.05736, 2023.
[Li and Li, 2023] Xianming Li and Jing Li. Angle-optimized
[Jiang et al., 2023b] Zhengbao Jiang, Frank F Xu, Luyu text embeddings. arXiv preprint arXiv:2309.12871, 2023.
Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming
Yang, Jamie Callan, and Graham Neubig. Active retrieval [Li et al., 2023a] Junnan Li, Dongxu Li, Silvio Savarese, and
augmented generation. arXiv preprint arXiv:2305.06983, Steven Hoi. Blip-2: Bootstrapping language-image pre-
2023. training with frozen image encoders and large language
models. arXiv preprint arXiv:2301.12597, 2023.
[Kandpal et al., 2023] Nikhil Kandpal, Haikang Deng,
Adam Roberts, Eric Wallace, and Colin Raffel. Large [Li et al., 2023b] Xiaoqian Li, Ercong Nie, and Sheng
language models struggle to learn long-tail knowledge. Liang. From classification to generation: Insights into
In International Conference on Machine Learning, pages crosslingual retrieval augmented icl. arXiv preprint
15696–15707. PMLR, 2023. arXiv:2311.06595, 2023.
[Kang et al., 2023] Minki Kang, Jin Myung Kwak, Jinheon [Li et al., 2023c] Xingxuan Li, Ruochen Zhao, Yew Ken
Baek, and Sung Ju Hwang. Knowledge graph-augmented Chia, Bosheng Ding, Lidong Bing, Shafiq Joty, and Sou-
language models for knowledge-grounded dialogue gener- janya Poria. Chain of knowledge: A framework for
ation. arXiv preprint arXiv:2305.18846, 2023. grounding large language models with structured knowl-
[Kaplan et al., 2020] Jared Kaplan, Sam McCandlish, Tom edge bases. arXiv preprint arXiv:2305.13269, 2023.
Henighan, Tom B Brown, Benjamin Chess, Rewon Child, [Li et al., 2023d] Xinze Li, Zhenghao Liu, Chenyan Xiong,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Shi Yu, Yu Gu, Zhiyuan Liu, and Ge Yu. Structure-aware
Scaling laws for neural language models. arXiv preprint language model pretraining improves dense retrieval on
arXiv:2001.08361, 2020. structured data. arXiv preprint arXiv:2305.19912, 2023.
[Karpukhin et al., 2020] Vladimir Karpukhin, Barlas Oğuz, [Liang et al., 2023] Han Liang, Wenqian Zhang, Wenxuan
Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based
Danqi Chen, and Wen-tau Yih. Dense passage retrieval multi-human motion generation under complex interac-
for open-domain question answering. arXiv preprint tions. arXiv preprint arXiv:2304.05684, 2023.
arXiv:2004.04906, 2020. [Lin et al., 2023] Xi Victoria Lin, Xilun Chen, Mingda
[Khandelwal et al., 2019] Urvashi Khandelwal, Omer Levy, Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Ro-
Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Gen- driguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al.
eralization through memorization: Nearest neighbor lan- Ra-dit: Retrieval-augmented dual instruction tuning. arXiv
guage models. arXiv preprint arXiv:1911.00172, 2019. preprint arXiv:2310.01352, 2023.
[Khattab et al., 2022] Omar Khattab, Keshav Santhanam, [Litman et al., 2020] Ron Litman, Oron Anschel, Shahar
Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, Tsiper, Roee Litman, Shai Mazor, and R Manmatha. Scat-
and Matei Zaharia. Demonstrate-search-predict: Compos- ter: selective context attentional scene text recognizer. In
ing retrieval and language models for knowledge-intensive proceedings of the IEEE/CVF conference on computer vi-
nlp. arXiv preprint arXiv:2212.14024, 2022. sion and pattern recognition, pages 11962–11972, 2020.
[Liu et al., 2023a] Nelson F Liu, Kevin Lin, John Hewitt, [Raffel et al., 2020] Colin Raffel, Noam Shazeer, Adam
Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Roberts, Katherine Lee, Sharan Narang, Michael Matena,
and Percy Liang. Lost in the middle: How language mod- Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits
els use long contexts. arXiv preprint arXiv:2307.03172, of transfer learning with a unified text-to-text transformer.
2023. The Journal of Machine Learning Research, 21(1):5485–
[Liu et al., 2023b] Yi Liu, Lianzhe Huang, Shicheng Li, 5551, 2020.
Sishuo Chen, Hao Zhou, Fandong Meng, Jie Zhou, and [Ram et al., 2023] Ori Ram, Yoav Levine, Itay Dalmedigos,
Xu Sun. Recall: A benchmark for llms robustness Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and
against external counterfactual knowledge. arXiv preprint Yoav Shoham. In-context retrieval-augmented language
arXiv:2311.08147, 2023. models. arXiv preprint arXiv:2302.00083, 2023.
[Liu, 2023] Jerry Liu. Building production-ready rag [Raudaschl, 2023] Adrian H. Raudaschl. Forget rag, the
applications. https://www.ai.engineer/summit/schedule/ future is rag-fusion. https://towardsdatascience.com/
building-production-ready-rag-applications, 2023. forget-rag-the-future-is-rag-fusion-1147298d8ad1, 2023.
[Luo et al., 2023] Ziyang Luo, Can Xu, Pu Zhao, Xiubo [Saad-Falcon et al., 2023] Jon Saad-Falcon, Omar Khattab,
Geng, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Christopher Potts, and Matei Zaharia. Ares: An automated
Jiang. Augmented large language models with paramet- evaluation framework for retrieval-augmented generation
ric knowledge guiding. arXiv preprint arXiv:2305.04757, systems. arXiv preprint arXiv:2311.09476, 2023.
2023. [Schick et al., 2023] Timo Schick, Jane Dwivedi-Yu,
Roberto Dessı̀, Roberta Raileanu, Maria Lomeli, Luke
[Ma et al., 2023a] Xinbei Ma, Yeyun Gong, Pengcheng
Zettlemoyer, Nicola Cancedda, and Thomas Scialom.
He, Hai Zhao, and Nan Duan. Query rewriting for Toolformer: Language models can teach themselves to
retrieval-augmented large language models. arXiv preprint use tools. arXiv preprint arXiv:2302.04761, 2023.
arXiv:2305.14283, 2023.
[Sciavolino et al., 2021] Christopher Sciavolino, Zexuan
[Ma et al., 2023b] Yubo Ma, Yixin Cao, YongChing Hong, Zhong, Jinhyuk Lee, and Danqi Chen. Simple entity-
and Aixin Sun. Large language model is not a good few- centric questions challenge dense retrievers. arXiv
shot information extractor, but a good reranker for hard preprint arXiv:2109.08535, 2021.
samples! ArXiv, abs/2303.08559, 2023.
[Shao et al., 2023] Zhihong Shao, Yeyun Gong, Yelong
[Modarressi et al., 2023] Ali Modarressi, Ayyoob Imani, Shen, Minlie Huang, Nan Duan, and Weizhu Chen. En-
Mohsen Fayyaz, and Hinrich Schütze. Ret-llm: Towards hancing retrieval-augmented large language models with
a general read-write memory for large language models. iterative retrieval-generation synergy. arXiv preprint
arXiv preprint arXiv:2305.14322, 2023. arXiv:2305.15294, 2023.
[Nakano et al., 2021] Reiichiro Nakano, Jacob Hilton, [Shi et al., 2023] Weijia Shi, Sewon Min, Michihiro Ya-
Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, sunaga, Minjoon Seo, Rich James, Mike Lewis, Luke
Christopher Hesse, Shantanu Jain, Vineet Kosaraju, Zettlemoyer, and Wen-tau Yih. Replug: Retrieval-
William Saunders, et al. Webgpt: Browser-assisted augmented black-box language models. arXiv preprint
question-answering with human feedback. arXiv preprint arXiv:2301.12652, 2023.
arXiv:2112.09332, 2021. [Srivastava et al., 2022] Aarohi Srivastava, Abhinav Ras-
[Nashid et al., 2023] Noor Nashid, Mifta Sintaha, and Ali togi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid,
Mesbah. Retrieval-based prompt selection for code-related Adam Fisch, Adam R Brown, Adam Santoro, Aditya
few-shot learning. In 2023 IEEE/ACM 45th International Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation
Conference on Software Engineering (ICSE), pages 2450– game: Quantifying and extrapolating the capabilities of
2462, 2023. language models. arXiv preprint arXiv:2206.04615, 2022.
[Nguyen, 2023] Isabelle Nguyen. Evaluating rag part i: How [Sun et al., 2022] Zhiqing Sun, Xuezhi Wang, Yi Tay, Yim-
to evaluate document retrieval. https://www.deepset.ai/ ing Yang, and Denny Zhou. Recitation-augmented lan-
blog/rag-evaluation-retrieval, 2023. guage models. arXiv preprint arXiv:2210.01296, 2022.
[Nishikawa et al., 2022] Sosuke Nishikawa, Ryokan Ri, [Touvron et al., 2023] Hugo Touvron, Louis Martin, Kevin
Ikuya Yamada, Yoshimasa Tsuruoka, and Isao Echizen. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Ease: Entity-aware contrastive learning of sentence em- Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava,
bedding. arXiv preprint arXiv:2205.04260, 2022. Shruti Bhosale, et al. Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288,
[OpenAI, 2023] OpenAI. Gpt-4 technical report. https://cdn. 2023.
openai.com/papers/gpt-4.pdf, 2023. [Trivedi et al., 2022] Harsh Trivedi, Niranjan Balasubrama-
[Packer et al., 2023] Charles Packer, Vivian Fang, Shishir G nian, Tushar Khot, and Ashish Sabharwal. Inter-
Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonza- leaving retrieval with chain-of-thought reasoning for
lez. Memgpt: Towards llms as operating systems. arXiv knowledge-intensive multi-step questions. arXiv preprint
preprint arXiv:2310.08560, 2023. arXiv:2212.10509, 2022.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki [Xiao et al., 2023] Guangxuan Xiao, Yuandong Tian, Beidi
Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Chen, Song Han, and Mike Lewis. Efficient stream-
Łukasz Kaiser, and Illia Polosukhin. Attention is all you ing language models with attention sinks. arXiv preprint
need. Advances in neural information processing systems, arXiv:2309.17453, 2023.
30, 2017. [Xu et al., 2023a] Fangyuan Xu, Weijia Shi, and Eunsol
[VoyageAI, 2023] VoyageAI. Voyage’s embedding models. Choi. Recomp: Improving retrieval-augmented lms with
https://docs.voyageai.com/embeddings/, 2023. compression and selective augmentation. arXiv preprint
[Wang et al., 2019] Alex Wang, Yada Pruksachatkun, Nikita arXiv:2310.04408, 2023.
Nangia, Amanpreet Singh, Julian Michael, Felix Hill, [Xu et al., 2023b] Peng Xu, Wei Ping, Xianchao Wu,
Omer Levy, and Samuel Bowman. Superglue: A stick- Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Sub-
ier benchmark for general-purpose language understand- ramanian, Evelina Bakhturina, Mohammad Shoeybi, and
ing systems. Advances in neural information processing Bryan Catanzaro. Retrieval meets long context large lan-
systems, 32, 2019. guage models. arXiv preprint arXiv:2310.03025, 2023.
[Wang et al., 2022a] Shuohang Wang, Yichong Xu, Yuwei [Xu et al., 2023c] Peng Xu, Wei Ping, Xianchao Wu,
Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Sub-
and Michael Zeng. Training data is more valuable than you ramanian, Evelina Bakhturina, Mohammad Shoeybi, and
think: A simple and effective method by retrieving from Bryan Catanzaro. Retrieval meets long context large lan-
training data. arXiv preprint arXiv:2203.08773, 2022. guage models. arXiv preprint arXiv:2310.03025, 2023.
[Wang et al., 2022b] Shuohang Wang, Yichong Xu, Yuwei [Yang et al., 2023a] Antoine Yang, Arsha Nagrani,
Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset,
and Michael Zeng. Training data is more valuable than Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq:
you think: A simple and effective method by retriev- Large-scale pretraining of a visual language model for
ing from training data. In Smaranda Muresan, Preslav dense video captioning. In Proceedings of the IEEE/CVF
Nakov, and Aline Villavicencio, editors, Proceedings of Conference on Computer Vision and Pattern Recognition,
the 60th Annual Meeting of the Association for Computa- pages 10714–10726, 2023.
tional Linguistics (Volume 1: Long Papers), pages 3170–
3179, Dublin, Ireland, May 2022. Association for Compu- [Yang et al., 2023b] Haoyan Yang, Zhitao Li, Yong Zhang,
tational Linguistics. Jianzong Wang, Ning Cheng, Ming Li, and Jing Xiao.
Prca: Fitting black-box large language models for retrieval
[Wang et al., 2023a] Boxin Wang, Wei Ping, Lawrence question answering via pluggable reward-driven contex-
McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, and Bryan tual adapter. arXiv preprint arXiv:2310.18347, 2023.
Catanzaro. Instructretro: Instruction tuning post retrieval-
augmented pretraining. arXiv preprint arXiv:2310.07713, [Yang et al., 2023c] Hui Yang, Sifu Yue, and Yunzhong He.
2023. Auto-gpt for online decision making: Benchmarks and ad-
ditional opinions. arXiv preprint arXiv:2306.02224, 2023.
[Wang et al., 2023b] Boxin Wang, Wei Ping, Peng Xu,
Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, [Yasunaga et al., 2022] Michihiro Yasunaga, Armen Agha-
Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, janyan, Weijia Shi, Rich James, Jure Leskovec, Percy
et al. Shall we pretrain autoregressive language models Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau
with retrieval? a comprehensive study. arXiv preprint Yih. Retrieval-augmented multimodal language modeling.
arXiv:2304.06762, 2023. arXiv preprint arXiv:2211.12561, 2022.
[Wang et al., 2023c] Liang Wang, Nan Yang, and Furu Wei. [Ye et al., 2020] Deming Ye, Yankai Lin, Jiaju Du, Zheng-
Query2doc: Query expansion with large language models. hao Liu, Peng Li, Maosong Sun, and Zhiyuan Liu. Coref-
arXiv preprint arXiv:2303.07678, 2023. erential reasoning learning for language representation.
[Wang et al., 2023d] Xintao Wang, Qianwen Yang, Yongting arXiv preprint arXiv:2004.06870, 2020.
Qiu, Jiaqing Liang, Qianyu He, Zhouhong Gu, Yanghua [Yoran et al., 2023] Ori Yoran, Tomer Wolfson, Ori Ram,
Xiao, and Wei Wang. Knowledgpt: Enhancing large lan- and Jonathan Berant. Making retrieval-augmented lan-
guage models with retrieval and storage access on knowl- guage models robust to irrelevant context. arXiv preprint
edge bases. arXiv preprint arXiv:2308.11761, 2023. arXiv:2310.01558, 2023.
[Wang et al., 2023e] Yile Wang, Peng Li, Maosong Sun, [Yu et al., 2022] Wenhao Yu, Dan Iter, Shuohang Wang, Yi-
and Yang Liu. Self-knowledge guided retrieval aug- chong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu,
mentation for large language models. arXiv preprint Michael Zeng, and Meng Jiang. Generate rather than re-
arXiv:2310.05002, 2023. trieve: Large language models are strong context genera-
[Xia et al., 2019] Mengzhou Xia, Guoping Huang, Lemao tors. arXiv preprint arXiv:2209.10063, 2022.
Liu, and Shuming Shi. Graph based translation mem- [Yu et al., 2023a] Wenhao Yu, Hongming Zhang, Xiaoman
ory for neural machine translation. In Proceedings of Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-
the AAAI conference on artificial intelligence, volume 33, of-note: Enhancing robustness in retrieval-augmented lan-
pages 7297–7304, 2019. guage models. arXiv preprint arXiv:2311.09210, 2023.
[Yu et al., 2023b] Zichun Yu, Chenyan Xiong, Shi Yu, and
Zhiyuan Liu. Augmentation-adapted retriever improves
generalization of language models as generic plug-in.
arXiv preprint arXiv:2305.17331, 2023.
[Zhang et al., 2019] Zhengyan Zhang, Xu Han, Zhiyuan Liu,
Xin Jiang, Maosong Sun, and Qun Liu. Ernie: Enhanced
language representation with informative entities. arXiv
preprint arXiv:1905.07129, 2019.
[Zhang et al., 2023a] Peitian Zhang, Shitao Xiao, Zheng
Liu, Zhicheng Dou, and Jian-Yun Nie. Retrieve any-
thing to augment large language models. arXiv preprint
arXiv:2310.07554, 2023.
[Zhang et al., 2023b] Yue Zhang, Yafu Li, Leyang Cui, Deng
Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao,
Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean:
A survey on hallucination in large language models. arXiv
preprint arXiv:2309.01219, 2023.
[Zhang, 2023] Jiawei Zhang. Graph-toolformer: To em-
power llms with graph reasoning ability via prompt aug-
mented by chatgpt. arXiv preprint arXiv:2304.11116,
2023.
[Zhao et al., 2022] Jinming Zhao, Gholamreza Haffar, and
Ehsan Shareghi. Generating synthetic speech from
spokenvocab for speech translation. arXiv preprint
arXiv:2210.08174, 2022.
[Zheng et al., 2023] Huaixiu Steven Zheng, Swaroop
Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H Chi,
Quoc V Le, and Denny Zhou. Take a step back: Evoking
reasoning via abstraction in large language models. arXiv
preprint arXiv:2310.06117, 2023.
[Zhu et al., 2022] Wanrong Zhu, An Yan, Yujie Lu, Wenda
Xu, Xin Eric Wang, Miguel Eckstein, and William Yang
Wang. Visualize before you write: Imagination-
guided open-ended text generation. arXiv preprint
arXiv:2210.03765, 2022.
[Zhuang et al., 2023] Shengyao Zhuang, Bing Liu, Bevan
Koopman, and Guido Zuccon. Open-source large
language models are strong zero-shot query likeli-
hood models for document ranking. arXiv preprint
arXiv:2310.13243, 2023.